-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gate statemachine progress on taskmanager availability #144
Conversation
if !ready { | ||
|
||
// ignore the error, we just care whether it's ready or not | ||
serviceReady, _ := s.flinkController.IsServiceReady(ctx, application, flink.HashForApplication(application)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you do this here, you do not need to check IsServiceReady
check in SubmittingJob state right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that check is probably not necessary any more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the check in SubmittingJob
129eb39
to
8f07929
Compare
@@ -155,6 +155,9 @@ type FlinkJobStatus struct { | |||
RestorePath string `json:"restorePath,omitempty"` | |||
RestoreTime *metav1.Time `json:"restoreTime,omitempty"` | |||
LastFailingTime *metav1.Time `json:"lastFailingTime,omitempty"` | |||
|
|||
RunningTasks int32 `json:"runningTasks,omitempty"` | |||
TotalTasks int32 `json:"totalTasks,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gofmt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what golangci-lint --fix
produces. I think it only aligns stuff if there is no line break (see other examples in this file for similar formatting).
Currently, the only condition for moving from ClusterStarting to Savepointing (at which point the job goes down) is that all JM/TM pods are up according to the deployment. However, various issues with the TM process or configuration can prevent them from actually registering with the JobManager and becoming available to run tasks. This can lead to extended downtime and can require manual intervention to fix. I've also added a check from the SubmittingJob -> Running transition that the tasks are actually running, which gives us a chance to automatically roll back if the job never successfully starts.
This PR also adds some more visibility into task-level status, so that users can tell if the job is really running (in Flink, a job can be in the Running state even if none of its tasks are running). I've added two new fields to the JobStatus (TotalTasks and RunningTasks) and updated the JobHealth logic to take into account whether tasks are actually running.