Gate statemachine progress on taskmanager availability #144

mwylde · 2019-12-12T23:37:05Z

Currently, the only condition for moving from ClusterStarting to Savepointing (at which point the job goes down) is that all JM/TM pods are up according to the deployment. However, various issues with the TM process or configuration can prevent the TMs from actually registering with the JobManager and becoming available to run tasks. This can lead to extended downtime and can require manual intervention to fix. I've also added a check on the SubmittingJob -> Running transition that the tasks are actually running, which gives us a chance to automatically roll back if the job never successfully starts.
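As a rough illustration of the gating idea described above, here is a minimal sketch, not the code in this PR. The `clusterOverview` type, its fields, and `canLeaveClusterStarting` are hypothetical names invented for this example; the real check lives in the state machine and the Flink controller.

```go
package main

// clusterOverview is a hypothetical summary of what the JobManager reports.
// The field names are illustrative and are not the operator's actual types.
type clusterOverview struct {
	RegisteredTaskManagers int32
	AvailableTaskSlots     int32
}

// canLeaveClusterStarting reports whether the state machine may move on to
// Savepointing. Pods being up is necessary but not sufficient: the
// TaskManagers must also have registered and offer enough free slots.
func canLeaveClusterStarting(podsReady bool, o clusterOverview, wantTaskManagers, wantSlots int32) bool {
	if !podsReady {
		return false
	}
	if o.RegisteredTaskManagers < wantTaskManagers {
		// Pods are running, but some TMs never registered with the JobManager.
		return false
	}
	return o.AvailableTaskSlots >= wantSlots
}
```

With this shape, a cluster whose pods are all up but whose TMs failed to register stays in ClusterStarting, so the old job is not taken down prematurely.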

This PR also adds some more visibility into task-level status, so that users can tell if the job is really running (in Flink, a job can be in the Running state even if none of its tasks are running). I've added two new fields to the JobStatus (TotalTasks and RunningTasks) and updated the JobHealth logic to take into account whether tasks are actually running.
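To make the health change concrete, here is a simplified sketch of health logic that takes task counts into account. It is an assumption-laden illustration, not the operator's implementation: the `health` type, its two values, and the `jobHealth` function are hypothetical, and the real logic considers more inputs than shown here.

```go
package main

// health is a hypothetical two-valued health indicator used only for this
// example.
type health string

const (
	green health = "Green"
	red   health = "Red"
)

// jobHealth returns green only when Flink reports the job as RUNNING and
// every task is running. A job in the RUNNING state with idle or missing
// tasks is reported as red.
func jobHealth(jobState string, runningTasks, totalTasks int32) health {
	if jobState != "RUNNING" {
		return red
	}
	if totalTasks == 0 || runningTasks < totalTasks {
		// The job-level state alone is not enough to call the job healthy.
		return red
	}
	return green
}
```

The same comparison of RunningTasks against TotalTasks is the kind of condition that could gate the SubmittingJob -> Running transition.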

anandswaminathan · 2019-12-12T23:44:44Z

pkg/controller/flinkapplication/flink_state_machine.go

-	if !ready {
+
+	// ignore the error, we just care whether it's ready or not
+	serviceReady, _ := s.flinkController.IsServiceReady(ctx, application, flink.HashForApplication(application))


Since you do this here, you no longer need the IsServiceReady check in the SubmittingJob state, right?

Yeah, that check is probably not necessary any more

Removed the check in SubmittingJob

anandswaminathan · 2019-12-13T22:47:40Z

pkg/apis/app/v1beta1/types.go

@@ -155,6 +155,9 @@ type FlinkJobStatus struct {
 	RestorePath              string       `json:"restorePath,omitempty"`
 	RestoreTime              *metav1.Time `json:"restoreTime,omitempty"`
 	LastFailingTime          *metav1.Time `json:"lastFailingTime,omitempty"`
+
+	RunningTasks int32 `json:"runningTasks,omitempty"`
+	TotalTasks   int32 `json:"totalTasks,omitempty"`


This is what golangci-lint --fix produces. I think it only aligns stuff if there is no line break (see other examples in this file for similar formatting).

mwylde requested review from anandswaminathan, glaksh100 and kumare3 as code owners December 12, 2019 23:37

anandswaminathan reviewed Dec 12, 2019


Micah Wylde added 6 commits December 13, 2019 13:38

Job health should not be green when tasks are not running

4b6add0

Do not move out of ClusterStarting until taskslots are available

f6d396a

Gate SubmittingJob -> Running transition on tasks actually running

64d2e96

Remove service ready check for job submission

cafe9db

Increase maxErrDuration for tests

7ed66b3

Fix test

8f07929

mwylde force-pushed the micah_better_job_health branch from 129eb39 to 8f07929 Compare December 13, 2019 21:38

anandswaminathan reviewed Dec 13, 2019


anandswaminathan approved these changes Dec 13, 2019


mwylde merged commit a42556a into master Dec 13, 2019

mwylde deleted the micah_better_job_health branch December 13, 2019 22:51
