Add Job awaiter #633

Merged: 2 commits merged into master from lblackstone/job-await on Sep 30, 2019

Conversation

@lblackstone (Member) commented on Jul 11, 2019

Fixes #449.

TODO:

  • Report errors
  • Add tests
  • Await deletion

Here's how things look at this point:

Failed Job:

[0/2] Waiting for Job "foo" to start
[1/2] Waiting for Job "foo" to succeed
warning: [Pod foo-vk6wg]: containers with unready status: [pi]

warning: [Pod foo-vk6wg]: containers with unready status: [pi] -- [ContainerCannotRun] OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"perly\": executable file not found in $PATH": unknown

error: Job has reached the specified backoff limit -- [BackoffLimitExceeded] Job has reached the specified backoff limit
 
error: Plan apply failed: 4 errors occurred:
	* resource foo was successfully created, but the Kubernetes API server reported that it failed to fully initialize or become live: Resource 'foo' was created but failed to initialize
	* [Pod foo-vk6wg]: containers with unready status: [pi]
	* [Pod foo-vk6wg]: containers with unready status: [pi] -- [ContainerCannotRun] OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"perly\": executable file not found in $PATH": unknown
	* Job has reached the specified backoff limit -- [BackoffLimitExceeded] Job has reached the specified backoff limit

Successful Job:

[0/2] Waiting for Job "foo" to start
[1/2] Waiting for Job "foo" to succeed
warning: [Pod foo-wsq8r]: containers with unready status: [pi]
Job ready

@nesl247 commented on Sep 19, 2019

Looking forward to this. Should really simplify our code base. Any idea when this is targeted to be completed?

@lblackstone (Member, Author) replied:

> Any idea when this is targeted to be completed?

I'd guess that it will be in a dev build within the next week.

@nesl247 commented on Sep 19, 2019

Awesome, so glad to hear that. This is a HUGE feature for us, and pretty much anyone using Pulumi to deploy applications that require things like DB migrations.


@lblackstone force-pushed the lblackstone/job-await branch 2 times, most recently from 377ce21 to 9759d2f on September 20, 2019
@lblackstone marked this pull request as ready for review on September 20, 2019
Resolved review threads: pkg/await/batch_job.go; pkg/await/states/job.go (3, outdated)

Contributor:

Really digging the test scenarios here 🙂. I know we do similar, extended coverage for all other resources with fixtures, but am not sure to what extent. How is the parity in this space across all resources?

@lblackstone (Member, Author):

This is similar to the way we're testing Pod, and is the general direction I'm taking the test coverage. Existing tests for other resources are basically black box tests, and generally make it harder to reason about correctness.

There's room for another layer of tests to make sure the awaiter channels/timeouts are wired up properly, but I think that's a lot less critical and error prone than the state checking logic I'm testing here.
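
For readers who haven't seen this fixture style, here is a hedged sketch of what such a table-driven state test can look like. Only the k8s.io/api types and condition names below are real; the jobIsDone checker and the fixture helper are illustrative stand-ins, not the actual code in pkg/await/states/job.go.

package states

import (
	"testing"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobIsDone is a stand-in for the state checker under test: the awaiter is
// finished with a Job once it reports a Complete or Failed condition.
func jobIsDone(job *batchv1.Job) bool {
	for _, cond := range job.Status.Conditions {
		if cond.Status == corev1.ConditionTrue &&
			(cond.Type == batchv1.JobComplete || cond.Type == batchv1.JobFailed) {
			return true
		}
	}
	return false
}

// jobWithCondition builds a Job fixture carrying a single status condition.
func jobWithCondition(condType batchv1.JobConditionType, msg string) *batchv1.Job {
	return &batchv1.Job{Status: batchv1.JobStatus{
		Conditions: []batchv1.JobCondition{{Type: condType, Status: corev1.ConditionTrue, Message: msg}},
	}}
}

func TestJobState(t *testing.T) {
	tests := []struct {
		name     string
		job      *batchv1.Job
		wantDone bool
	}{
		{"created, not yet started", &batchv1.Job{}, false},
		{"completed successfully", jobWithCondition(batchv1.JobComplete, ""), true},
		{"backoff limit exceeded", jobWithCondition(batchv1.JobFailed, "Job has reached the specified backoff limit"), true},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := jobIsDone(tt.job); got != tt.wantDone {
				t.Errorf("jobIsDone() = %v, want %v", got, tt.wantDone)
			}
		})
	}
}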

@metral (Contributor) left a comment:

Overall LGTM - a couple of logical changes would be nice to see to reduce complexity.

@metral (Contributor) left a comment:

LGTM

cc @hausdorff - PTAL and review.

@hausdorff (Contributor) left a comment:

Adding await semantics to Job simply enriches the existing functionality, so I think this is the right move.

But before we click the merge button, I want to make sure we're all on the same page about the (very weird) implications of using Job in Pulumi. All of these were true before, but we never had this conversation, so let me state some things I believe to be true about Job; if any of them are wrong, please correct me:

  • If you run pulumi up with a Job, it will stick around until you delete it. So subsequent runs of pulumi up will not cause the job to re-run.
  • Users should be very cautious of including Job in Pulumi programs! Unlike other resource types, Job is intended to run once (e.g., for a DB schema migration), so when and how it runs really matters. Once you add a Job to your Pulumi project, ordering suddenly matters a lot—so if you run a fresh pulumi up and your Job does not run exactly when it is supposed to, it could fail the whole deployment.
  • We make no attempt to be smart about automated cleanup from the TTL controller. So, if a user sets .spec.ttlSecondsAfterFinished and the Job gets cleaned up, another run of pulumi up after the TTL will re-deploy the Job.

Like I said, I think all of this is fine, especially since we support it all already, but I just want us to go in with eyes open.

// A Job is a construct that allows users to run a workload as a Pod that terminates with a
// success or failure.
//
// A Job is considered "ready" if the following conditions are true:
Contributor:

We say that these are the conditions required to determine a job is "ready", but it sounds below like we're describing jobs that have completed?

@lblackstone (Member, Author):

Yes, I meant "ready" in the sense that we're done waiting on the resource. For Job, that would mean it is complete.
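
To make that reading concrete, here is a minimal sketch of a completion check phrased against the unstructured objects a Kubernetes watch delivers, which is how these awaiters see resources. The field paths follow the batch/v1 Job schema; the function name and structure are illustrative assumptions, not the code added in this PR.

package await

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// jobIsComplete reports whether the Job carries a Complete condition with
// status True, i.e., whether we are done waiting on it. A Failed condition
// (e.g., BackoffLimitExceeded) would instead be surfaced as an error.
func jobIsComplete(obj *unstructured.Unstructured) bool {
	conditions, found, err := unstructured.NestedSlice(obj.Object, "status", "conditions")
	if err != nil || !found {
		return false
	}
	for _, c := range conditions {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if cond["type"] == "Complete" && cond["status"] == "True" {
			return true
		}
	}
	return false
}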

@lukehoban (Member) commented:

> If you run pulumi up with a Job, it will stick around until you delete it. So subsequent runs of pulumi up will not cause the job to re-run.

Yes - that is expected and desired - unless you do something to force it to replace.

> Users should be very cautious of including Job in Pulumi programs! Unlike other resource types, Job is intended to run once (e.g., for a DB schema migration), so when and how it runs really matters. Once you add a Job to your Pulumi project, ordering suddenly matters a lot—so if you run a fresh pulumi up and your Job does not run exactly when it is supposed to, it could fail the whole deployment.

I think this behaves exactly as you want for scenarios where it is useful - as long as you can force it to replace when the thing that should trigger it to re-run changes.
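
To illustrate the "force it to replace" pattern, here is a hedged sketch of a Pulumi program that derives part of the Job's name from a hash of whatever should trigger a re-run, so that a change forces Pulumi to replace (and therefore re-run) the Job. The SDK import paths are those of a later Pulumi Kubernetes Go SDK, and the migrations file, image, and resource names are hypothetical.

package main

import (
	"crypto/sha256"
	"fmt"
	"os"

	batchv1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/batch/v1"
	corev1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/core/v1"
	metav1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/meta/v1"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// Hash whatever should trigger a re-run (here, a hypothetical
		// migrations file). When the contents change, the Job's name changes,
		// so Pulumi replaces, and therefore re-runs, the Job.
		migrations, err := os.ReadFile("migrations.sql")
		if err != nil {
			return err
		}
		suffix := fmt.Sprintf("%x", sha256.Sum256(migrations))[:8]

		_, err = batchv1.NewJob(ctx, "db-migrate", &batchv1.JobArgs{
			Metadata: &metav1.ObjectMetaArgs{
				Name: pulumi.String("db-migrate-" + suffix),
			},
			Spec: &batchv1.JobSpecArgs{
				BackoffLimit: pulumi.Int(2),
				Template: &corev1.PodTemplateSpecArgs{
					Spec: &corev1.PodSpecArgs{
						RestartPolicy: pulumi.String("Never"),
						Containers: corev1.ContainerArray{
							corev1.ContainerArgs{
								Name:  pulumi.String("migrate"),
								Image: pulumi.String("example/db-migrate:latest"),
							},
						},
					},
				},
			},
		})
		return err
	})
}

Since a Job's pod template is effectively immutable once created, changing the name (and thus replacing the resource) is the natural way to get a fresh run out of it.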

@hausdorff (Contributor) left a comment:

This is simple enough we can probably merge now. I'm fully confident we'll find more bugs, but I think we're well within our risk tolerance here.

I left a couple of comments. The biggest thing missing is that error reporting is not great in interactive mode. We can follow up on that, though.

Resolved review threads: pkg/await/batch_job.go; pkg/await/states/job.go
@metral merged commit 5e311f8 into master on Sep 30, 2019
@pulumi-bot deleted the lblackstone/job-await branch on September 30, 2019 23:38

Successfully merging this pull request may close the linked issue: await job completion (#449).

5 participants