tests: Solve backoff tests flakiness #75952

claudiubelu · 2019-04-01T09:50:55Z

What type of PR is this?

/kind flake

/sig testing

/area conformance

What this PR does / why we need it:

A container's status is not constant; it changes over time in the following order:

- Running: kubelet reports the Pod as running. This state can be missed if the container finishes its command before kubelet gets to report it.
- Terminated: after the container finishes its command, it enters the Terminated state, where it remains for a short period before kubelet tries to restart it.
- Waiting: kubelet has to wait for the backoff period to expire before actually restarting the container.

Handling each of these states explicitly when calculating the backoff period between container restarts makes the tests more reliable.
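As a rough illustration of the three states listed above, the following sketch classifies a single status sample. The types are hypothetical, simplified stand-ins rather than the real Kubernetes API types, and `describeState` is a name invented for this example; it only assumes that at most one of the three state members is set at a time.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the Kubernetes container state
// types; only the "which state is set" aspect is modeled here.
type (
	StateRunning    struct{}
	StateTerminated struct{}
	StateWaiting    struct{}
)

// ContainerState is assumed to have at most one non-nil member.
type ContainerState struct {
	Running    *StateRunning
	Terminated *StateTerminated
	Waiting    *StateWaiting
}

// describeState names the state a single status sample is in.
func describeState(s ContainerState) string {
	switch {
	case s.Running != nil:
		return "Running"
	case s.Terminated != nil:
		return "Terminated"
	case s.Waiting != nil:
		return "Waiting"
	default:
		return "Unknown"
	}
}

func main() {
	// A poller that samples slowly may never observe the first entry.
	samples := []ContainerState{
		{Running: &StateRunning{}},
		{Terminated: &StateTerminated{}},
		{Waiting: &StateWaiting{}},
	}
	for _, s := range samples {
		fmt.Println(describeState(s))
	}
}
```

Because any of these states may be the first one a test observes, code that measures restart timing cannot assume it will see Running between two restarts.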

Which issue(s) this PR fixes:

Related #71949

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

timothysc · 2019-04-12T17:17:54Z

test/e2e/common/pods.go

+			} else {
+				previousFinishedAt = status.LastTerminationState.Terminated.FinishedAt.Time
+			}
+			previousRestartCount = status.RestartCount


Why can't you just use status.RestartCount below and push a portion of this logic below? This seems unnecessary.

claudiubelu · 2019-04-12 (edited)

This function measures the amount of time that passes between the (N-1)th run and the Nth run. For that, we need 4 pieces of information:

- The RestartCount for the (N-1)th run and the RestartCount for the Nth run. We need these to detect when the RestartCount is incremented, which tells us that the Pod restarted.
- The moment at which the (N-1)th run ended. This block of code gets exactly that information (plus the (N-1)th RestartCount).
- The moment at which the Nth run started. We can get this once the RestartCount has been incremented, as you can see below.

Indeed, we could get the last 2 pieces of information when the RestartCount increments and the Pod's status is either Running or Terminated, but once the Terminated -> Waiting transition occurs, LastTerminationState is overwritten and the information we need is lost.

I know the code is basically duplicated, but this way we get all the required information without any risk of losing it. That loss of information was the source of the flakiness.
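The bookkeeping described above can be sketched roughly as follows. This is not the code from the PR: the types are hypothetical stand-ins that only borrow the field names mentioned in this thread (`State`, `LastTerminationState`, `RestartCount`, `FinishedAt`), and `restartDelay` works on a pre-recorded slice of samples instead of polling a live Pod.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified stand-ins for the Kubernetes status types.
type Running struct{ StartedAt time.Time }
type Terminated struct{ StartedAt, FinishedAt time.Time }
type Waiting struct{}

type ContainerState struct {
	Running    *Running
	Terminated *Terminated
	Waiting    *Waiting
}

type ContainerStatus struct {
	State                ContainerState
	LastTerminationState ContainerState
	RestartCount         int32
}

// restartDelay walks status samples in order and returns the gap between
// the end of one run and the start of the next, if both were observed.
func restartDelay(samples []ContainerStatus) (time.Duration, bool) {
	var prevCount int32 = -1
	var prevFinishedAt time.Time
	for _, s := range samples {
		if prevCount == -1 {
			// Record the end of run N-1 as soon as any state exposes it.
			switch {
			case s.State.Terminated != nil:
				prevFinishedAt = s.State.Terminated.FinishedAt
			case s.State.Waiting != nil && s.LastTerminationState.Terminated != nil:
				prevFinishedAt = s.LastTerminationState.Terminated.FinishedAt
			default:
				continue // still running: no finish time yet
			}
			prevCount = s.RestartCount
			continue
		}
		if s.RestartCount > prevCount {
			// Run N has started; read its start time from whichever state holds it.
			switch {
			case s.State.Running != nil:
				return s.State.Running.StartedAt.Sub(prevFinishedAt), true
			case s.State.Terminated != nil:
				return s.State.Terminated.StartedAt.Sub(prevFinishedAt), true
			case s.LastTerminationState.Terminated != nil:
				return s.LastTerminationState.Terminated.StartedAt.Sub(prevFinishedAt), true
			}
		}
	}
	return 0, false
}

// exampleSamples: run N-1 ends at t0; run N is seen Running 40s later.
func exampleSamples() []ContainerStatus {
	t0 := time.Date(2019, 4, 1, 0, 0, 0, 0, time.UTC)
	last := ContainerState{Terminated: &Terminated{FinishedAt: t0}}
	return []ContainerStatus{
		{State: ContainerState{Terminated: &Terminated{FinishedAt: t0}}, RestartCount: 3},
		{State: ContainerState{Waiting: &Waiting{}}, LastTerminationState: last, RestartCount: 3},
		{State: ContainerState{Running: &Running{StartedAt: t0.Add(40 * time.Second)}}, RestartCount: 4},
	}
}

// missedRunningSamples: the poller only ever sees the Waiting state.
func missedRunningSamples() []ContainerStatus {
	t0 := time.Date(2019, 4, 1, 0, 0, 0, 0, time.UTC)
	prev := ContainerState{Terminated: &Terminated{FinishedAt: t0}}
	next := ContainerState{Terminated: &Terminated{
		StartedAt: t0.Add(40 * time.Second), FinishedAt: t0.Add(41 * time.Second)}}
	return []ContainerStatus{
		{State: ContainerState{Waiting: &Waiting{}}, LastTerminationState: prev, RestartCount: 3},
		{State: ContainerState{Waiting: &Waiting{}}, LastTerminationState: next, RestartCount: 4},
	}
}

func main() {
	fmt.Println(restartDelay(exampleSamples()))       // 40s true
	fmt.Println(restartDelay(missedRunningSamples())) // 40s true
}
```

The second sample set shows why the finish time of the earlier run is captured up front: by the time the incremented RestartCount is seen, the last termination state already describes the newer run, so the earlier finish time can no longer be read from the status.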

timothysc

/lgtm

smarterclayton · 2019-04-12T19:13:41Z

@kubernetes/sig-node-pr-reviews can someone from sig-node please review? @sjenning @yujuhong

bgrant0607 · 2019-04-15T23:28:22Z

I assigned @yujuhong to assign someone.

yujuhong · 2019-05-03T21:08:21Z

/lgtm
/approve

k8s-ci-robot · 2019-05-03T21:08:47Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bclau, yujuhong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~test/e2e/common/OWNERS~~ [yujuhong]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

fejta-bot · 2019-05-04T02:30:39Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

k8s-ci-robot requested review from dashpole and pmorie April 1, 2019 09:52

brahmaroutu added this to In Progress in conformance-definition Apr 1, 2019

timothysc self-assigned this Apr 2, 2019

timothysc added the priority/important-soon label Apr 9, 2019

k8s-ci-robot removed the needs-priority label Apr 9, 2019

timothysc added this to the v1.15 milestone Apr 9, 2019

timothysc reviewed Apr 12, 2019

timothysc approved these changes Apr 12, 2019

k8s-ci-robot added the lgtm label Apr 12, 2019

timothysc moved this from In Progress to Needs Approval in conformance-definition Apr 12, 2019

k8s-ci-robot added the sig/node label Apr 12, 2019

bgrant0607 assigned yujuhong Apr 15, 2019

bgrant0607 moved this from Needs Approval to In Review in conformance-definition Apr 15, 2019

timothysc moved this from In Review to Needs Approval in conformance-definition Apr 23, 2019

k8s-ci-robot added the approved label May 3, 2019

k8s-ci-robot merged commit 22cf3ca into kubernetes:master May 4, 2019

conformance-definition automation moved this from Needs Approval to Done May 4, 2019

claudiubelu deleted the tests/max-backoff-tests-flakiness branch May 6, 2019 12:43

claudiubelu restored the tests/max-backoff-tests-flakiness branch June 10, 2019 14:37

tpepper mentioned this pull request Jun 20, 2019

Automated cherry pick of #75952: tests: Solve backoff tests flakiness #78858 (Merged)

Cherry pick of #75952 on release-1.14.
