Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: Solve backoff tests flakiness #75952

Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
38 changes: 33 additions & 5 deletions test/e2e/common/pods.go
Original file line number Diff line number Diff line change
Expand Up @@ -109,6 +109,8 @@ func startPodAndGetBackOffs(podClient *framework.PodClient, pod *v1.Pod, sleepAm

func getRestartDelay(podClient *framework.PodClient, podName string, containerName string) (time.Duration, error) {
beginTime := time.Now()
var previousRestartCount int32 = -1
var previousFinishedAt time.Time
for time.Since(beginTime) < (2 * maxBackOffTolerance) { // may just miss the 1st MaxContainerBackOff delay
time.Sleep(time.Second)
pod, err := podClient.Get(podName, metav1.GetOptions{})
Expand All @@ -119,11 +121,37 @@ func getRestartDelay(podClient *framework.PodClient, podName string, containerNa
continue
}

if status.State.Waiting == nil && status.State.Terminated != nil && status.LastTerminationState.Terminated != nil && status.State.Terminated.StartedAt.Time.After(beginTime) {
startedAt := status.State.Terminated.StartedAt.Time
finishedAt := status.LastTerminationState.Terminated.FinishedAt.Time
framework.Logf("getRestartDelay: restartCount = %d, finishedAt=%s restartedAt=%s (%s)", status.RestartCount, finishedAt, startedAt, startedAt.Sub(finishedAt))
return startedAt.Sub(finishedAt), nil
// the only case this happens is if this is the first time the Pod is running and there is no "Last State".
if status.LastTerminationState.Terminated == nil {
framework.Logf("Container's last state is not \"Terminated\".")
continue
}

if previousRestartCount == -1 {
if status.State.Running != nil {
// container is still Running, there is no "FinishedAt" time.
continue
} else if status.State.Terminated != nil {
previousFinishedAt = status.State.Terminated.FinishedAt.Time
} else {
previousFinishedAt = status.LastTerminationState.Terminated.FinishedAt.Time
}
previousRestartCount = status.RestartCount
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't you just use status.RestartCount below and push a portion of this logic below? This seems unnecessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this function basically measures the amount of time that passes between the Nth-1 run and the Nth run. For that, need 4 bits of information:

  • RestartCount for the Nth-1 run and RestartCount for the Nth run. We need that information in order to detect when the RestartCount is incremented, and thus, we know that the Pod restarted.
  • the moment in which the Nth-1 run ended. This block of code is getting that exact information (+ the Nth-1 RestartCount)
  • the moment in which the Nth run started. This information we can get once the RestartCount incremented, as you can see below.

Indeed, we can get the last 2 pieces of information when the RestartCount increments and the Pod's status is either Running or Terminated, but once the state transition Terminated -> Waiting occurs, it overwrites the LastTerminationState, needed information which is then lost.

I know that the code is basically duplicated, but in this manner we get all the information required without any risk of losing any. The loss of information was basically the source of the flakiness.

}

// when the RestartCount is changed, the Containers will be in one of the following states:
//Running, Terminated, Waiting (it already is waiting for the backoff period to expire, and the last state details have been stored into status.LastTerminationState).
if status.RestartCount > previousRestartCount {
var startedAt time.Time
if status.State.Running != nil {
startedAt = status.State.Running.StartedAt.Time
} else if status.State.Terminated != nil {
startedAt = status.State.Terminated.StartedAt.Time
} else {
startedAt = status.LastTerminationState.Terminated.StartedAt.Time
}
framework.Logf("getRestartDelay: restartCount = %d, finishedAt=%s restartedAt=%s (%s)", status.RestartCount, previousFinishedAt, startedAt, startedAt.Sub(previousFinishedAt))
return startedAt.Sub(previousFinishedAt), nil
}
}
return 0, fmt.Errorf("timeout getting pod restart delay")
Expand Down