
Fix bug with jobs not being marked as completed. #501

Merged
merged 8 commits into kubeflow:master from jlewi:fix_pod
Mar 26, 2018

Conversation

jlewi
Contributor

@jlewi jlewi commented Mar 23, 2018

Fix several bugs with the job controller.

  • A bug was introduced with getting the replica status in #344 (Create Pod instead of Job), which
    switched to creating pods directly.

  • Our presubmits/postsubmits were failing but this went unnoticed because
    the git status check was improperly reported as succeeded.

  • One bug is that we try to get the pod status by name, but the name we use
    doesn't include the random salt appended to the pod name.

  • The code in question is a legacy of when we were using job controllers and
    we first got the status of the job controller. We incorrectly changed that
    code to get the pod. The correct thing is to just list pods by label; we
    already do that in the code below, so we just need to delete some code (a
    minimal sketch of the label-based listing follows this list).

  • A second bug is a problem with deleting resources.

    • Once a job is marked for deletion we shouldn't create any more resources. Creating them would
      block deletion, because we use foreground deletion, so the TFJob won't be deleted until all child
      resources have been deleted.

    • When the job is deleted, the DeletionTimestamp will be set on the object, and we can use that
      inside the syncFunction.

    • But reconcile needs to update the TFJob stored inside TrainingJob so that we pick up changes
      to TFJob made external to the operator; e.g. by the user issuing a delete request.

  • A third problem is resetting the rate limiter on the work queue.

    • I think we want to reset the rate limiter by calling Forget after every successful processing
      of a work item. That way, if we receive another event, we can process it immediately.
  • Increase the timeout we wait for the job to finish.

  • Use logrus to log fields that provide useful metadata, such as the job a log entry belongs to.

  • Fix E2E tests timing out; job appears to remain in running state even though job is done. #500
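A minimal sketch of the label-based listing described in the bullets above, assuming illustrative label keys (runtime_id, job_type) and a recent client-go; this is not the operator's actual code:

```go
package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podPhasesForReplica lists the pods belonging to one replica set of a TFJob
// via a label selector. Looking a pod up by its base name would fail because
// the created pod names carry a random salt suffix.
func podPhasesForReplica(ctx context.Context, client kubernetes.Interface, namespace, runtimeID, replicaType string) ([]string, error) {
	// Assumed label keys; the real controller sets its own labels on the pods it creates.
	selector := fmt.Sprintf("runtime_id=%s,job_type=%s", runtimeID, replicaType)
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return nil, err
	}
	phases := make([]string, 0, len(pods.Items))
	for _, pod := range pods.Items {
		phases = append(phases, string(pod.Status.Phase))
	}
	return phases, nil
}
```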


This change is Reviewable

@coveralls

coveralls commented Mar 23, 2018

Coverage Status

Coverage decreased (-0.09%) to 45.323% when pulling 5f5d859 on jlewi:fix_pod into eec56b5 on kubeflow:master.

@jlewi
Contributor Author

jlewi commented Mar 23, 2018

/retest

@jlewi
Contributor Author

jlewi commented Mar 24, 2018

GPU test is passing but not simple TFJob.

@gaocegege
Member

INFO|2018-03-24T01:32:54|py/tf_job_client.py|96| Job simple-tfjob in namespace default; uid=346f5fcd-2f03-11e8-9b8b-42010a8e008e; phase=Done, state=Succeeded,
INFO|2018-03-24T01:32:54|py/tf_job_client.py|66| Deleted job default.simple-tfjob
INFO|2018-03-24T01:32:54|py/tf_job_client.py|63| Deleting job default.simple-tfjob

The job has been deleted, but we can still get it via crd_api.get_namespaced_custom_object; that's why it times out.

@gaocegege
Member

@jlewi I think we could merge this PR, and then I could file a new PR to fix the simple TFJob test.

@gaocegege
Member

I think it is caused by https://github.com/kubeflow/tf-operator/blob/0759f7ae5142ed2e78a6971e9703fdc86b7307cd/py/tf_job_client.py#L64:28

We do not check the response of the delete request.
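The test client in question is Python (py/tf_job_client.py), but the general fix pattern can be sketched in Go: check the delete call's error, and since we use foreground deletion, poll until the object is actually gone instead of assuming the delete completed immediately. The group/version/resource and timeouts below are illustrative assumptions.

```go
package example

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/dynamic"
)

// deleteTFJobAndWait deletes a TFJob custom resource and waits until the API
// server reports it as gone (NotFound).
func deleteTFJobAndWait(ctx context.Context, client dynamic.Interface, namespace, name string) error {
	gvr := schema.GroupVersionResource{Group: "kubeflow.org", Version: "v1alpha1", Resource: "tfjobs"} // assumed GVR
	if err := client.Resource(gvr).Namespace(namespace).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
		return err // check the delete response instead of ignoring it
	}
	// With foreground deletion the TFJob stays visible until its children are
	// deleted, so poll until Get returns NotFound.
	return wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		_, err := client.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil
		}
		return false, err
	})
}
```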

@jlewi
Contributor Author

jlewi commented Mar 24, 2018

Interesting. It seems strange that it's only happening now and not with GPU jobs.

* A bug was introduced with getting the replica status in kubeflow#344 which
switched to creating pods directly.

* Our presubmits/postsubmits were failing but this went unnoticed because
the git status check was improperly reported as succeeded.

* The bug is because we try to get the pod status by name but the name
doesn't include the random salt in the pod name.

* The code in question is a legacy of when we were using job controllers and
we first got the status of the job controller. We incorrectly changed that
code to get the pod. The correct thing is to just list pods by label; we
already do that in the code below so we just need to delete some code.

* Don't create any resources if the DeletionTimestamp is set (a minimal check
  is sketched after this commit message).
  Creating resources at this point would end up blocking deletion of the object
  because the controller would create resources while we are trying to delete
  them.

* Use logrus in controller.go, trainer.go, and replicas.go to log
with fields providing information about the job and replica.
  This makes it easy to filter logs for a particular job.

* Use logrus to log the name of the job in a field.
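Regarding the "don't create any resources if the DeletionTimestamp is set" point above, a minimal sketch of that check, using a stand-in helper rather than the operator's actual types:

```go
package example

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// shouldCreateResources is a stand-in helper, not the operator's real code.
// Once the user deletes the TFJob the API server sets DeletionTimestamp; with
// foreground deletion the object remains visible until all children are gone,
// so creating new pods or services at that point would keep blocking deletion.
func shouldCreateResources(meta *metav1.ObjectMeta) bool {
	return meta.DeletionTimestamp == nil
}
```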
@jlewi
Contributor Author

jlewi commented Mar 25, 2018

/unassign @zjj2wry
/assign @gaocegege

Checking the DeletionTimestamp doesn't appear to be sufficient.
Use the Phase to determine whether we should create resources.
@jlewi
Contributor Author

jlewi commented Mar 25, 2018

@gaocegege Any idea what the lint failure means?

The command "go build -o tf-operator github.com/kubeflow/tf-operator/cmd/tf-operator" exited with 0.
56.67s$ gometalinter --config=linter_config.json ./pkg/...
pkg/trainer/replicas.go:1::warning: file is not goimported (goimports)

@jlewi
Contributor Author

jlewi commented Mar 25, 2018

/retest

@gaocegege
Member

It seems that you already fixed the lint errors in Travis.

@jlewi
Contributor Author

jlewi commented Mar 25, 2018

For the most recent E2E failure:

The test submits multiple TFJobs with the same name.
The first job completes.
We then resubmit.
Pods are created and this job enters the running state.
The TF operator stops calling reconcile for the new job at 21:00:08 (shortly after the pods are created).

  • This could explain why the job is stuck in the running state.
  • The logs for the master pod indicate it ran to completion successfully.

The logs indicate that between 21:00:12 and 21:07 (when the cluster is deleted because of the test-runner timeout) reconcile is never called for the TFJob.

  • My expectation, though, is that the informer should be calling reconcile every 30 seconds. Why isn't that happening?
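For context, a hedged sketch of the setup being assumed in this comment: an informer with a 30-second resync period whose update handler re-enqueues the job key, so reconcile should run at least once per resync. The names below (listWatch, queue) are placeholders, not the operator's code.

```go
package example

import (
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// newTFJobInformer wires an informer with a 30s resync period to a work queue,
// so every TFJob is re-enqueued for reconcile on each resync even when nothing
// about the object has changed.
func newTFJobInformer(listWatch cache.ListerWatcher, queue workqueue.RateLimitingInterface) cache.SharedIndexInformer {
	informer := cache.NewSharedIndexInformer(
		listWatch,                    // ListerWatcher for TFJob objects
		&unstructured.Unstructured{}, // TFJob is a CRD, so unstructured objects are used here
		30*time.Second,               // resync period that should periodically re-trigger reconcile
		cache.Indexers{},
	)
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key) // re-enqueue the job for reconcile
			}
		},
	})
	return informer
}
```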

@jlewi
Contributor Author

jlewi commented Mar 25, 2018

My conjecture is that the problem is the ratelimiting work queue.

We use the defaults, which use exponential backoff with a max delay of 1000 seconds per item.

So every 30 seconds the informer queues an update event, but these events are then rate limited and processed with a delay of up to 1000 seconds.

The log also indicates that we call Forget on the work queue only once, and that's for the GPU job.
{"filename":"controller/controller.go:195","job":"default/gpu-tfjob","msg":"WorkQueue forgetting key default/gpu-tfjob","level":"info"}

I think this is a bug: Forget resets the rate limiter for an item, so we should be calling it after every successful sync so that subsequent events aren't rate limited.
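A minimal sketch of the proposed fix, following the standard client-go work-loop pattern; Controller, queue, and syncHandler below are placeholders rather than the operator's actual types:

```go
package example

import "k8s.io/client-go/util/workqueue"

// Controller is a minimal stand-in for the real controller.
type Controller struct {
	queue       workqueue.RateLimitingInterface
	syncHandler func(key string) error
}

func (c *Controller) processNextWorkItem() bool {
	key, quit := c.queue.Get()
	if quit {
		return false
	}
	defer c.queue.Done(key)

	if err := c.syncHandler(key.(string)); err != nil {
		// Failure: let the default rate limiter (exponential backoff, capped
		// at 1000s per item) decide when this key is retried.
		c.queue.AddRateLimited(key)
		return true
	}
	// Success: reset the rate limiter for this key so the next event, e.g.
	// the informer's periodic resync, is processed immediately.
	c.queue.Forget(key)
	return true
}
```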

* Reset the rate limiter after every successful sync.

* Otherwise the ratelimiter will end up delaying processing subsequent
  events which isn't what we want.

* Reconcile needs to update the TFJob stored in TrainingJob. This ensures
  TrainingJob has an up to date representation of the job.

* Otherwise changes made to the spec won't be available to TrainingJob. For
  example, if the job is deleted by the user, the deletion timestamp will
  be set. But if we don't update the TFJob stored in TrainingJob this
  change won't be propagated.
@jlewi
Contributor Author

jlewi commented Mar 25, 2018

Looks like the test passed. Let's make sure it's not a fluke.

@jlewi
Contributor Author

jlewi commented Mar 25, 2018

/test all

@jlewi
Contributor Author

jlewi commented Mar 25, 2018

@gaocegege PTAL

@gaocegege
Member

/lgtm

I think the test-and-forget logic was copied from the job controller, and I think it is not suitable for TFJob.

/cc @ScorpioCPH

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gaocegege
Member

Thanks for the fix!

@k8s-ci-robot k8s-ci-robot merged commit b72f47e into kubeflow:master Mar 26, 2018
@jlewi jlewi mentioned this pull request Mar 27, 2018
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this pull request Jul 9, 2018
* Fix bug with jobs not being marked as completed.

* A bug was introduced with getting the replica status in kubeflow#344 which
switched to creating pods directly.

* Our presubmits/postsubmits were failing but this went unnoticed because
the git status check was improperly reported as succeeded.

* The bug is because we try to get the pod status by name but the name
doesn't include the random salt in the pod name.

* The code in question is a legacy of when we were using job controllers and
we first got the status of the job controller. We incorrectly changed that
code to get the pod. The correct thing is to just list pods by label; we
already do that in the code below so we just need to delete some code.

* Don't create any resources if the DeletionTimestamp is set.
  Creating resources at this point would end up blocking deletion of the object
  because the controller would create resources while we are trying to delete
  them.

* Use logrus in controller.go, trainer.go, and replicas.go to log
  with fields providing information about the job and replica.
  This makes it easy to filter logs for a particular job.

* Use logrus to log the name of the job in a field.

* Checking the DeletionTimestamp doesn't appear to be sufficient.

Use the Phase to determine whether we should create resources.

* Run gofmt.

* * Reset the rate limiter after every successful sync.
* Otherwise the ratelimiter will end up delaying processing subsequent
  events which isn't what we want.

* Run goimports to fix lint issues.

* * Reconcile needs to update the TFJob stored in TrainingJob. This ensures
  TrainingJob has an up to date representation of the job.

* Otherwise changes made to the spec won't be available to TrainingJob. For
  example, if the job is deleted by the user, the deletion timestamp will
  be set. But if we don't update the TFJob stored in TrainingJob this
  change won't be propagated.

* * TrainingJob.update should log the value of the job not the pointer.

* Add more comments to the code.