New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1916554: failing if rsync client pods are stuck #958
Conversation
|
valid bug 1916554 |
|
I was out of the office today. My gut feeling is that this bug has manifested in many forms, we need to create a generic solution and get some unit tests in. I will take a detailed look tomorrow morning |
| case !containerStatus.Ready && containerStatus.State.Waiting != nil && containerStatus.State.Waiting.Reason == ContainerCreating: | ||
| // if pod is not in running state after 10 mins of its creation, assume the pod is in failed state | ||
| if time.Now().Sub(pod.CreationTimestamp.Time) > 10*time.Minute { | ||
| pvProgress.Status.PodPhase = kapi.PodFailed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really want this?
Ideally, we should raise a warning right? that says that a pod is in creating the state since 10 mins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to rasing warning
fb974ed
to
95f430f
Compare
| pvProgress.Status.PodPhase = kapi.PodPending | ||
| pvProgress.Status.LogMessage = fmt.Sprintf("Pod %s/%s is stuck in ContainerCreating for more than 10 mins", pod.Name, pod.Namespace) | ||
| pvProgress.Status.ContainerElapsedTime = &metav1.Duration{Duration: time.Now().Sub(pod.CreationTimestamp.Time).Round(time.Second)} | ||
| //pvProgress.Status.SetCondition(migapi.Condition{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason we are keeping this commented out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to delete it
| @@ -452,6 +452,15 @@ func (t *Task) Run() error { | |||
| return liberr.Wrap(err) | |||
| } | |||
| } else { | |||
| //if pending { | |||
| // t.Owner.Status.SetCondition(migapi.Condition{ | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this commented out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed removing
| LastObservedProgressPercent: dvmp.Status.LastObservedProgressPercent, | ||
| LastObservedTransferRate: dvmp.Status.LastObservedTransferRate, | ||
| }) | ||
| pendingMessage += dvmp.Status.LogMessage + " " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is grabbing the logs from the status field and setting it as a warning message in the status condition? is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Since there could be multiple pods (each for one dvmp), that might be in the pending state for DVM, to me aggregating logMessage from each such dvmp and raising one warning in DVM makes more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, my only concern is wouldn't this mean if I have multiple pods that are pending it will look like one long log message but will really be multiple pods logs?
Could we separate the logs with a line indicating its a new pod? Or am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch! starting new log from new line. There is only one logMessage per pod for this situation, so starting each from new line would make it less confusing.
| @@ -1166,11 +1166,11 @@ func getMD5Hash(s string) string { | |||
| return hex.EncodeToString(hash[:]) | |||
| } | |||
|
|
|||
| func (t *Task) haveRsyncClientPodsCompletedOrFailed() (bool, bool, error) { | |||
| func (t *Task) haveRsyncClientPodsCompletedOrFailed() (bool, bool, bool, error) { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider renaming function. This is now returning 3 booleans so name doesn't make much sense any longer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, updated the name
95f430f
to
62714a6
Compare
62714a6
to
343a37f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| @@ -276,6 +284,13 @@ func (r *ReconcileDirectVolumeMigrationProgress) reportContainerStatus(pvProgres | |||
| exitCode := containerStatus.LastTerminationState.Terminated.ExitCode | |||
| pvProgress.Status.ExitCode = &exitCode | |||
| pvProgress.Status.ContainerElapsedTime = &metav1.Duration{Duration: containerStatus.LastTerminationState.Terminated.FinishedAt.Sub(containerStatus.LastTerminationState.Terminated.StartedAt.Time).Round(time.Second)} | |||
| case !containerStatus.Ready && pod.Status.Phase == kapi.PodPending && containerStatus.State.Waiting != nil && containerStatus.State.Waiting.Reason == ContainerCreating: | |||
| // if pod is not in running state after 10 mins of its creation, raise a | |||
| if time.Now().Sub(pod.CreationTimestamp.Time) > TimeLimit { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: do we want time.Now().UTC().Sub(pod.CreationTimestamp.Time.UTC()) instead of time.Now().Sub(pod.CreationTimestamp.Time) just to be sure the timezone is common?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
| Status: migapi.True, | ||
| Reason: "PodStuckInContainerCreating", | ||
| Category: migapi.Warn, | ||
| Message: pendingMessage, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is some amount of redundancy here?
Reason: PodStuckInContainerCreating message: fmt.Sprintf("Pod %s/%s is stuck in ContainerCreating for more than 10 mins\n", pod.Namespace, pod.Name)
I expect just the list of name of the pods in the message or something like 9/15 pods are stuck, something more general, less repetitive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a message with Name/namespace of each pod that is in the pending state is more user friendly I think
7543d48
to
0d98d4e
Compare
| LastObservedProgressPercent: dvmp.Status.LastObservedProgressPercent, | ||
| LastObservedTransferRate: dvmp.Status.LastObservedTransferRate, | ||
| }) | ||
| pendingPods = append(pendingPods, strings.Split(dvmp.Status.LogMessage, " ")[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failing to understand why this is not as simple as obj.Name and object.Namespace? What am I missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, missed that obj has these two pieces of info.
| // if pod is not in running state after 10 mins of its creation, raise a | ||
| if time.Now().UTC().Sub(pod.CreationTimestamp.Time.UTC()) > TimeLimit { | ||
| pvProgress.Status.PodPhase = kapi.PodPending | ||
| pvProgress.Status.LogMessage = fmt.Sprintf("Pod %s/%s is stuck in ContainerCreating for more than 10 mins", pod.Namespace, pod.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with the above do we really need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't, but there is no harm in keeping this information in dvmp as well.
0d98d4e
to
04ea2b4
Compare
| if err != nil { | ||
| return liberr.Wrap(err) | ||
| } | ||
| if completed { | ||
| if completed && !pending { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need pending here? if the pods are in a pending state doesn't that automatically mean they are not completed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yip You are on the right track. removed the additional variable since it is not neeed
ae4abc5
to
d0b0aee
Compare
…ner for more than 10 mins adding support to warn user if client pods are in pending state using a variable to define time limit adding warning condition in dvm removing comments and renaming function adding new line and following naming conventions scrapping repeatetive stmts removing not needed variables making the message more generalized
d0b0aee
to
9f01aa4
Compare
…ner for more than 10 mins (migtools#958) adding support to warn user if client pods are in pending state using a variable to define time limit adding warning condition in dvm removing comments and renaming function adding new line and following naming conventions scrapping repeatetive stmts removing not needed variables making the message more generalized (cherry picked from commit c1162b5)
…ner for more than 10 mins (#958) (#972) adding support to warn user if client pods are in pending state using a variable to define time limit adding warning condition in dvm removing comments and renaming function adding new line and following naming conventions scrapping repeatetive stmts removing not needed variables making the message more generalized (cherry picked from commit c1162b5)
Failing the migration if Rsync client pods are stuck in "container creating" phase for more than 10 mins.