Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gate statemachine progress on taskmanager availability #144

Merged
merged 6 commits into from
Dec 13, 2019

Conversation

mwylde
Copy link
Contributor

@mwylde mwylde commented Dec 12, 2019

Currently, the only condition for moving from ClusterStarting to Savepointing (at which point the job goes down) is that all JM/TM pods are up according to the deployment. However, various issues with the TM process or configuration can prevent them from actually registering with the JobManager and becoming available to run tasks. This can lead to extended downtime and can require manual intervention to fix. I've also added a check from the SubmittingJob -> Running transition that the tasks are actually running, which gives us a chance to automatically roll back if the job never successfully starts.

This PR also adds some more visibility into task-level status, so that users can tell if the job is really running (in Flink, a job can be in the Running state even if none of its tasks are running). I've added two new fields to the JobStatus (TotalTasks and RunningTasks) and updated the JobHealth logic to take into account whether tasks are actually running.

if !ready {

// ignore the error, we just care whether it's ready or not
serviceReady, _ := s.flinkController.IsServiceReady(ctx, application, flink.HashForApplication(application))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you do this here, you do not need to check IsServiceReady check in SubmittingJob state right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that check is probably not necessary any more

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the check in SubmittingJob

@@ -155,6 +155,9 @@ type FlinkJobStatus struct {
RestorePath string `json:"restorePath,omitempty"`
RestoreTime *metav1.Time `json:"restoreTime,omitempty"`
LastFailingTime *metav1.Time `json:"lastFailingTime,omitempty"`

RunningTasks int32 `json:"runningTasks,omitempty"`
TotalTasks int32 `json:"totalTasks,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gofmt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what golangci-lint --fix produces. I think it only aligns stuff if there is no line break (see other examples in this file for similar formatting).

@mwylde mwylde merged commit a42556a into master Dec 13, 2019
@mwylde mwylde deleted the micah_better_job_health branch December 13, 2019 22:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants