tests: Solve backoff tests flakiness #75952
The container status is not constant, and can change over time in the following order:
- Running: When kubelet reports the Pod as running. This state can be missed if the container finishes its command before kubelet gets to report it.
- Terminated: After the container finishes its command, it enters the Terminated state, in which it remains for a short period of time before kubelet tries to restart it.
- Waiting: When kubelet has to wait for the backoff period to expire before actually restarting the container.
Handling each of these states when calculating the backoff period between container restarts will make the tests more reliable.
} else {
	previousFinishedAt = status.LastTerminationState.Terminated.FinishedAt.Time
}
previousRestartCount = status.RestartCount
Why can't you just use status.RestartCount below and push a portion of this logic below? This seems unnecessary.
So, this function basically measures the amount of time that passes between the (N-1)th run and the Nth run. For that, we need 4 pieces of information:
- RestartCount for the (N-1)th run and RestartCount for the Nth run. We need these in order to detect when the RestartCount is incremented, which tells us that the container restarted.
- the moment at which the (N-1)th run ended. This block of code gets exactly that information (plus the (N-1)th RestartCount).
- the moment at which the Nth run started. We can get this once the RestartCount has been incremented, as you can see below.
Indeed, we can get the last 2 pieces of information when the RestartCount increments and the container's state is either Running or Terminated. But once the Terminated -> Waiting transition occurs, it overwrites LastTerminationState, and the information we need is lost.
I know that the code is basically duplicated, but this way we get all the required information without any risk of losing it. That loss of information was basically the source of the flakiness.
/lgtm
I assigned @yujuhong to assign someone.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: bclau, yujuhong
What type of PR is this?
/kind flake
/sig testing
/area conformance
What this PR does / why we need it:
The container status is not constant, and can change over time in the
following order:
- Running: When kubelet reports the Pod as running. This state can be missed if
the container finishes its command before kubelet gets to report it.
- Terminated: After the container finishes its command, it enters the Terminated
state, in which it remains for a short period of time before kubelet tries
to restart it.
- Waiting: When kubelet has to wait for the backoff period to expire before actually
restarting the container.
Handling each of these states when calculating the backoff period between
container restarts will make the tests more reliable.
Which issue(s) this PR fixes:
Related #71949
Special notes for your reviewer:
Does this PR introduce a user-facing change?: