
Bug 1916554: failing if rsync client pods are stuck #958

Merged
merged 1 commit into from Feb 24, 2021

Conversation

JaydipGabani
Contributor

Failing the migration if Rsync client pods are stuck in the "ContainerCreating" phase for more than 10 minutes.

@github-actions

valid bug 1916554

@alaypatel07
Contributor

alaypatel07 commented Feb 16, 2021

I was out of the office today. My gut feeling is that this bug has manifested in many forms; we need to create a generic solution and get some unit tests in. I will take a detailed look tomorrow morning.

case !containerStatus.Ready && containerStatus.State.Waiting != nil && containerStatus.State.Waiting.Reason == ContainerCreating:
	// if pod is not in running state after 10 mins of its creation, assume the pod is in failed state
	if time.Now().Sub(pod.CreationTimestamp.Time) > 10*time.Minute {
		pvProgress.Status.PodPhase = kapi.PodFailed
Contributor

Do we really want this?

Ideally, we should raise a warning instead, right? One that says the pod has been stuck in the creating state for 10 minutes?

Contributor Author

+1 to raising a warning

pvProgress.Status.PodPhase = kapi.PodPending
pvProgress.Status.LogMessage = fmt.Sprintf("Pod %s/%s is stuck in ContainerCreating for more than 10 mins", pod.Name, pod.Namespace)
pvProgress.Status.ContainerElapsedTime = &metav1.Duration{Duration: time.Now().Sub(pod.CreationTimestamp.Time).Round(time.Second)}
//pvProgress.Status.SetCondition(migapi.Condition{
Contributor

Any reason we are keeping this commented out?

Contributor Author

Forgot to delete it

@@ -452,6 +452,15 @@ func (t *Task) Run() error {
return liberr.Wrap(err)
}
} else {
//if pending {
// t.Owner.Status.SetCondition(migapi.Condition{
Contributor

Why is this commented out?

Contributor Author

Not needed, removing it.

LastObservedProgressPercent: dvmp.Status.LastObservedProgressPercent,
LastObservedTransferRate: dvmp.Status.LastObservedTransferRate,
})
pendingMessage += dvmp.Status.LogMessage + " "
Contributor

So this is grabbing the logs from the status field and setting them as a warning message in the status condition, is that right?

Contributor Author

Yes! Since there could be multiple pods (one per DVMP) in the pending state for a DVM, aggregating the logMessage from each such DVMP and raising one warning in the DVM makes more sense to me.

Contributor

Okay, my only concern is: wouldn't this mean that if I have multiple pods pending, it will look like one long log message but will really be logs for multiple pods?

Could we separate the logs with a line indicating it's a new pod? Or am I missing something?

Contributor Author

Good catch! Starting each new log on a new line. There is only one logMessage per pod for this situation, so starting each on a new line makes it less confusing.
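The aggregation agreed on above can be sketched as a trivial helper (hypothetical name; the real code builds the message inline while iterating over the DVMPs):

```go
package main

import (
	"fmt"
	"strings"
)

// aggregatePendingMessages joins one LogMessage per pending DVMP, one per
// line, so logs for different pods don't run together into one long string.
func aggregatePendingMessages(messages []string) string {
	return strings.Join(messages, "\n")
}

func main() {
	msgs := []string{
		"Pod ns1/rsync-a is stuck in ContainerCreating for more than 10 mins",
		"Pod ns2/rsync-b is stuck in ContainerCreating for more than 10 mins",
	}
	fmt.Println(aggregatePendingMessages(msgs))
}
```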

@@ -1166,11 +1166,11 @@ func getMD5Hash(s string) string {
return hex.EncodeToString(hash[:])
}

func (t *Task) haveRsyncClientPodsCompletedOrFailed() (bool, bool, error) {
func (t *Task) haveRsyncClientPodsCompletedOrFailed() (bool, bool, bool, error) {
Contributor

Consider renaming the function. It now returns three booleans, so the name doesn't make much sense any longer.

Contributor Author

+1, updated the name
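One way to keep a three-boolean signature readable, as the rename exchange suggests, is named return values. This is a sketch with hypothetical names and types; the real function inspects the Rsync client pods tracked by the task:

```go
package main

import "fmt"

// podSummary is a stand-in for the per-pod state the controller inspects.
type podSummary struct {
	completed bool
	failed    bool
	pending   bool
}

// analyzeRsyncClientPods uses named results so call sites and the signature
// stay self-describing even with three booleans.
func analyzeRsyncClientPods(pods []podSummary) (completed, failed, pending bool) {
	completed = true
	for _, p := range pods {
		if !p.completed {
			completed = false
		}
		if p.failed {
			failed = true
		}
		if p.pending {
			pending = true
		}
	}
	return completed, failed, pending
}

func main() {
	pods := []podSummary{{completed: true}, {pending: true}}
	fmt.Println(analyzeRsyncClientPods(pods)) // false false true
}
```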

Contributor

@dymurray dymurray left a comment

LGTM

@@ -276,6 +284,13 @@ func (r *ReconcileDirectVolumeMigrationProgress) reportContainerStatus(pvProgres
exitCode := containerStatus.LastTerminationState.Terminated.ExitCode
pvProgress.Status.ExitCode = &exitCode
pvProgress.Status.ContainerElapsedTime = &metav1.Duration{Duration: containerStatus.LastTerminationState.Terminated.FinishedAt.Sub(containerStatus.LastTerminationState.Terminated.StartedAt.Time).Round(time.Second)}
case !containerStatus.Ready && pod.Status.Phase == kapi.PodPending && containerStatus.State.Waiting != nil && containerStatus.State.Waiting.Reason == ContainerCreating:
// if pod is not in running state after 10 mins of its creation, raise a warning
if time.Now().Sub(pod.CreationTimestamp.Time) > TimeLimit {
Contributor

NIT: do we want time.Now().UTC().Sub(pod.CreationTimestamp.Time.UTC()) instead of time.Now().Sub(pod.CreationTimestamp.Time), just to be sure the timezone is common?

Contributor Author

+1

Status: migapi.True,
Reason: "PodStuckInContainerCreating",
Category: migapi.Warn,
Message: pendingMessage,
Contributor

I wonder if there is some amount of redundancy here?

Reason: PodStuckInContainerCreating, message: fmt.Sprintf("Pod %s/%s is stuck in ContainerCreating for more than 10 mins\n", pod.Namespace, pod.Name)

I would expect just a list of the names of the stuck pods in the message, or something more general and less repetitive, like "9/15 pods are stuck".

Contributor Author

A message with the name/namespace of each pod that is in the pending state is more user-friendly, I think.
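The "9/15 pods are stuck" style message suggested above can be sketched like this (hypothetical helper; pod names are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// stuckPodsMessage builds a single, less repetitive warning: a count plus
// one namespace/name reference per stuck pod, instead of repeating the full
// "is stuck in ContainerCreating" sentence for every pod.
func stuckPodsMessage(pods []string, total int) string {
	return fmt.Sprintf("%d/%d Rsync client pods are stuck in ContainerCreating for more than 10 mins: %s",
		len(pods), total, strings.Join(pods, ", "))
}

func main() {
	fmt.Println(stuckPodsMessage([]string{"ns1/rsync-a", "ns2/rsync-b"}, 15))
}
```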

LastObservedProgressPercent: dvmp.Status.LastObservedProgressPercent,
LastObservedTransferRate: dvmp.Status.LastObservedTransferRate,
})
pendingPods = append(pendingPods, strings.Split(dvmp.Status.LogMessage, " ")[1])
Contributor

Failing to understand why this is not as simple as obj.Name and obj.Namespace? What am I missing?

Contributor Author

Ahh, missed that obj has these two pieces of info.
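The point of the review comment above is that the pod reference should come from the object's own fields, not from re-parsing Status.LogMessage with strings.Split. A sketch with a minimal stand-in type for the DVMP object:

```go
package main

import "fmt"

// dvmp is a minimal stand-in for the DVMP object in the review; the real
// object carries Name and Namespace directly, so there is no need to split
// them back out of the log message text.
type dvmp struct {
	Name      string
	Namespace string
}

// pendingPodRef formats the conventional namespace/name reference.
func pendingPodRef(obj dvmp) string {
	return fmt.Sprintf("%s/%s", obj.Namespace, obj.Name)
}

func main() {
	fmt.Println(pendingPodRef(dvmp{Name: "rsync-a", Namespace: "ns1"})) // ns1/rsync-a
}
```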

// if pod is not in running state after 10 mins of its creation, raise a warning
if time.Now().UTC().Sub(pod.CreationTimestamp.Time.UTC()) > TimeLimit {
pvProgress.Status.PodPhase = kapi.PodPending
pvProgress.Status.LogMessage = fmt.Sprintf("Pod %s/%s is stuck in ContainerCreating for more than 10 mins", pod.Namespace, pod.Name)
Contributor

with the above do we really need this?

Contributor Author

We don't, but there is no harm in keeping this information in the DVMP as well.

if err != nil {
return liberr.Wrap(err)
}
if completed {
if completed && !pending {
Contributor

Why do you need pending here? If the pods are in a pending state, doesn't that automatically mean they are not completed?

Contributor Author

Yep, you are on the right track. Removed the additional variable since it is not needed.

…ner for more than 10 mins

adding support to warn user if client pods are in pending state

using a variable to define time limit

adding warning condition in dvm

removing comments and renaming function

adding new line and following naming conventions

scrapping repetitive stmts

removing not needed variables

making the message more generalized
@JaydipGabani JaydipGabani merged commit c1162b5 into migtools:master Feb 24, 2021
JaydipGabani added a commit to JaydipGabani/mig-controller that referenced this pull request Feb 24, 2021
…ner for more than 10 mins (migtools#958)


(cherry picked from commit c1162b5)
JaydipGabani added a commit that referenced this pull request Feb 24, 2021
…ner for more than 10 mins (#958) (#972)


(cherry picked from commit c1162b5)