tests: Solve backoff tests flakiness #75952
Conversation
k8s-ci-robot added the release-note-none, kind/flake, size/M, sig/testing, area/conformance, cncf-cla: yes, needs-priority, and area/test labels on Apr 1, 2019
k8s-ci-robot requested review from dashpole and pmorie on Apr 1, 2019
brahmaroutu added this to In Progress in cncf-k8s-conformance-wg on Apr 1, 2019
timothysc self-assigned this on Apr 2, 2019
timothysc added the priority/important-soon label on Apr 9, 2019
k8s-ci-robot removed the needs-priority label on Apr 9, 2019
timothysc added this to the v1.15 milestone on Apr 9, 2019
timothysc reviewed on Apr 12, 2019
} else {
    previousFinishedAt = status.LastTerminationState.Terminated.FinishedAt.Time
}
previousRestartCount = status.RestartCount
timothysc (Member) commented Apr 12, 2019
Why can't you just use status.RestartCount below and push a portion of this logic below? This seems unnecessary.
bclau (Author, Contributor) commented Apr 12, 2019
So, this function basically measures the amount of time that passes between the (N-1)th run and the Nth run. For that, we need four pieces of information:
- the RestartCount for the (N-1)th run and the RestartCount for the Nth run. We need these in order to detect when the RestartCount is incremented, and thus know that the Pod restarted.
- the moment at which the (N-1)th run ended. This block of code gets exactly that information (plus the (N-1)th RestartCount).
- the moment at which the Nth run started. We can get this information once the RestartCount has incremented, as you can see below.
Indeed, we could get the last two pieces of information when the RestartCount increments and the Pod's status is either Running or Terminated, but once the Terminated -> Waiting state transition occurs, it overwrites the LastTerminationState, and that needed information is then lost.
I know that the code is basically duplicated, but this way we get all the required information without any risk of losing it. That loss of information was the source of the flakiness.
k8s-ci-robot added the lgtm label on Apr 12, 2019
timothysc moved this from In Progress to Needs Approval in cncf-k8s-conformance-wg on Apr 12, 2019
@kubernetes/sig-node-pr-reviews can someone from sig-node please review? @sjenning @yujuhong
k8s-ci-robot added the sig/node label on Apr 12, 2019
bgrant0607 assigned yujuhong on Apr 15, 2019
I assigned @yujuhong to assign someone.
bgrant0607 moved this from Needs Approval to In Review in cncf-k8s-conformance-wg on Apr 15, 2019
timothysc moved this from In Review to Needs Approval in cncf-k8s-conformance-wg on Apr 23, 2019
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bclau, yujuhong. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
k8s-ci-robot added the approved label on May 3, 2019
fejta-bot commented May 4, 2019:
/retest
Review the full test history for this PR. Silence the bot with an
bclau commented Apr 1, 2019
What type of PR is this?
/kind flake
/sig testing
/area conformance
What this PR does / why we need it:
The container status is not constant, and can change over time in the following order:
- the container finishes its command faster than kubelet can get to report this state;
- the container then enters a waiting state, in which it will remain for a short period of time before kubelet tries to restart it;
- kubelet then finishes restarting the container.
Treating and handling each of these states when calculating the backoff period between container restarts makes the tests more reliable.
Which issue(s) this PR fixes:
Related #71949
Special notes for your reviewer:
Does this PR introduce a user-facing change?: