Allow disabling Savepoint during updates #184

Merged · 9 commits · Mar 17, 2020

Conversation

maghamravi (Contributor)

The PR:

  1. Adds support for a new spec field, SavepointDisabled, that governs whether Savepoints are taken during an update (see the manifest sketch below).
  2. Renames the state/phase FlinkApplicationSavepointing to FlinkApplicationCancelling.
  3. The recovery workflow (if and when Cancel fails) remains the same.
  4. No changes to the Delete workflow. Users will have to set deleteMode: ForceCancel if they prefer that no Savepoints be taken during Delete/Cancel.
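For illustration, a FlinkApplication manifest using the new field might look like the sketch below (apiVersion, kind, savepointDisabled, and deleteMode follow this PR and the existing CRD; the other values are placeholders):

apiVersion: flink.k8s.io/v1beta1
kind: FlinkApplication
metadata:
  name: my-flink-app                        # placeholder name
spec:
  image: example.com/my-flink-app:latest    # placeholder image
  jarName: my-flink-app.jar                 # placeholder jar
  parallelism: 4
  savepointDisabled: true    # new field: skip the savepoint before cancelling on updates
  deleteMode: ForceCancel    # existing field: also skip savepoints on delete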

maghamravi reopened this Mar 6, 2020
maghamravi (Author)

/ptal (👀) @mwylde @glaksh100 @anandswaminathan @kumare3 @tweise

ClusterStarting -- Create fails --> DeployFailed

Savepointing --> SubmittingJob
Savepointing -- Savepoint fails --> Recovering
Cancelling --> SubmittingJob
Contributor

@maghamravi Not a fan of this approach.

Ideally we would change the existing state machine as little as possible. If just a cancel is needed without savepointing, my recommendation would be to introduce a new state, say "Cancelling", that is reached from Savepointing when the savepoint is disabled in the spec. I also don't believe "Recovering" should be associated with Cancelling.

Contributor Author

Honestly, I think we clubbed "Cancel" and "Savepoint" (two independent actions) into one state called "Savepointing", which IMO should be "Cancelling".

Maybe it's just me, but when I see the state "Savepointing", I am under the impression that the operator is triggering a "Savepoint" on the job (https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html#jobs-jobid-savepoints) and not cancelling it (with the cancel option).

If the consensus is to have "Cancel" called from Savepointing when the savepoint is disabled, I'm happy to commit to that, though I disagree!
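For reference, Flink's REST API does separate the two: POST /jobs/:jobid/savepoints triggers a savepoint, and the job is only cancelled afterwards if the cancel-job flag is set in the request body. A sketch of the request (the target directory is a placeholder):

POST /jobs/:jobid/savepoints
{
  "target-directory": "s3://my-bucket/savepoints",
  "cancel-job": true
}

With "cancel-job": false (or omitted), the job keeps running after the savepoint completes.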

Contributor

Agreed. There are two things here:

  1. A better name for the Savepointing state.
  2. Separating the intention of this PR into a separate phase/state.

I am merely recommending (2) now. We can always revisit (1) and update the state names later.

Contributor Author

@anandswaminathan Made the necessary changes!

maghamravi (Author)

Updated the PR with the following:

  1. Created a new state, "Cancelling".
  2. From the "ClusterStarting" state, based on the "SavepointDisabled" field in the spec, the state machine transitions to either the "Cancelling" or the "Savepointing" state (see the sketch below).
  3. Though cancelling a job (without a savepoint) rarely fails, I have it go through the max retries. If we are still unlucky, we move to the "Submitting" phase, as by then the currently running job (if any) is already dysfunctional.
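A minimal sketch of that branch in the ClusterStarting handler, using the phase constants and helper quoted elsewhere in this review (the surrounding readiness checks are omitted):

// Once the new cluster is up, pick the next phase based on the spec.
if application.Spec.SavepointDisabled {
    // Skip the savepoint and cancel the old job directly.
    s.updateApplicationPhase(application, v1beta1.FlinkApplicationCancelling)
} else {
    s.updateApplicationPhase(application, v1beta1.FlinkApplicationSavepointing)
}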

anandswaminathan (Contributor) commented Mar 10, 2020

cc @tweise

@glaksh100 This change is small, but you might need to add a check for your Blue-Green update.

@maghamravi This looks good.

Your explanation for (3) is not always true. The ForceCancel call can fail if the Jobmanager is restarting, or for several other reasons. We should never assume success and make progress during a failure. Always roll back, and fail the deployment on failure.

Please add to:

  1. https://github.com/lyft/flinkk8soperator/blob/master/docs/crd.md
  2. The validation here: https://github.com/lyft/flinkk8soperator/blob/master/deploy/crd.yaml#L23 (see the sketch below)
  3. An integration test for this change, as you will be the first one using this.
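For (2), the new field would presumably need just a boolean entry in the OpenAPI validation schema. A sketch, assuming the surrounding layout of deploy/crd.yaml:

validation:
  openAPIV3Schema:
    properties:
      spec:
        properties:
          savepointDisabled:
            type: boolean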

fmt.Sprintf("Could not cancel existing job: %s", reason))
application.Status.RetryCount = 0
application.Status.JobStatus.JobID = ""
s.updateApplicationPhase(application, v1beta1.FlinkApplicationSubmittingJob)
Contributor

This should be FlinkApplicationRollingBackJob.

We can't move ahead and submit the job in the second cluster if the first cluster is still running the job.

Contributor Author

Good point! Definitely better to err on the side of caution.

Contributor Author

done!
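For reference, the corrected failure branch differs from the snippet quoted above only in the target phase (a sketch; event logging and retry checks are elided):

// After exhausting retries on ForceCancel, roll back rather than
// submitting a new job while the old one may still be running.
application.Status.RetryCount = 0
application.Status.JobStatus.JobID = ""
s.updateApplicationPhase(application, v1beta1.FlinkApplicationRollingBackJob)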

@@ -81,6 +81,132 @@ func TestHandleStartingClusterStarting(t *testing.T) {
assert.Nil(t, err)
}

func TestHandleNewOrCreateWithSavepointDisabled(t *testing.T) {
Contributor

This is great, but can you also add an integration test? Given that this is a completely different flow, it would be good to fully test out both the successful and failure paths.

Contributor Author

done!

return statusChanged, nil
}

err := s.flinkController.ForceCancel(ctx, application, application.Status.DeployHash)
Contributor

I think there's a potential case here where we get an error back from this call (e.g., a timeout), but the job does end up cancelled. We can handle that by first checking whether the job is running, and if not, moving on to SubmittingJob.

Contributor Author

In my testing, I saw two kinds of errors:

  1. Status code 404 when the job is already in Cancelled status.
  2. Status code 400 for a bad request, e.g., when a job with that ID doesn't exist.

It definitely doesn't hurt to make that extra call to check whether the job is running or not.

Contributor

Yeah, in general the pattern for operators is: (1) query the current state of the world; (2) make calls to update the world to the desired state. On each iteration of the reconciliation loop you don't really know what the state is until you query it (even the status might be out of date or missing updates).
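Applied here, that pattern might look like the sketch below. getJobStatus, isRunning, and statusUnchanged are hypothetical stand-ins (only statusChanged, updateApplicationPhase, and ForceCancel appear in the snippets quoted in this review):

// Query the current state of the world before acting on it.
job, err := getJobStatus(ctx, application) // hypothetical status query
if err != nil {
    return statusUnchanged, err
}
if job == nil || !job.isRunning() {
    // The earlier ForceCancel may have succeeded despite returning an
    // error (e.g., a timeout), so treat the job as cancelled and move on.
    s.updateApplicationPhase(application, v1beta1.FlinkApplicationSubmittingJob)
    return statusChanged, nil
}
// Otherwise, (re)issue the cancellation.
err = s.flinkController.ForceCancel(ctx, application, application.Status.DeployHash)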

@@ -220,6 +221,7 @@ const (
FlinkApplicationSubmittingJob FlinkApplicationPhase = "SubmittingJob"
FlinkApplicationRunning FlinkApplicationPhase = "Running"
FlinkApplicationSavepointing FlinkApplicationPhase = "Savepointing"
FlinkApplicationCancelling FlinkApplicationPhase = "Cancelling"
Contributor

Can you also update state_machine.md with the details of this new state?
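Presumably the new entries would mirror the diagram lines quoted earlier in this review; a sketch (the exact failure-path wording depends on the final doc):

ClusterStarting --> Cancelling
ClusterStarting --> Savepointing
Cancelling --> SubmittingJob
Cancelling -- ForceCancel fails --> RollingBackJob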

Contributor Author

done!

maghamravi (Author)

@anandswaminathan @mwylde Addressed all review comments!

anandswaminathan (Contributor)

+1 LGTM
cc @mwylde @glaksh100

mwylde previously approved these changes Mar 16, 2020
docs/crd.md: review comment marked outdated and resolved
maghamravi requested a review from mwylde March 17, 2020 16:13