tests: Solve backoff tests flakiness #75952

Merged


@bclau
Contributor

commented Apr 1, 2019

What type of PR is this?

/kind flake

/sig testing

/area conformance

What this PR does / why we need it:

A container's status is not constant; it changes over time in the
following order:

  • Running: The kubelet reports the Pod as running. This state can be missed if
    the container finishes its command before the kubelet gets to report it.
  • Terminated: After the container finishes its command, it enters the Terminated
    state and remains there for a short period of time before the kubelet tries
    to restart it.
  • Waiting: The kubelet has to wait for the backoff period to expire before actually
    restarting the container.

Handling each of these states when calculating the backoff period between
container restarts makes the tests more reliable.

Which issue(s) this PR fixes:

Related #71949

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE
} else {
previousFinishedAt = status.LastTerminationState.Terminated.FinishedAt.Time
}
previousRestartCount = status.RestartCount

@timothysc

timothysc Apr 12, 2019

Member

Why can't you just use status.RestartCount below and push a portion of this logic below? This seems unnecessary.

@bclau

bclau Apr 12, 2019

Author Contributor

So, this function measures the amount of time that passes between the (N-1)th run and the Nth run. For that, we need four pieces of information:

  • The RestartCount for the (N-1)th run and the RestartCount for the Nth run. We need both in order to detect when the RestartCount is incremented, which tells us that the Pod restarted.
  • The moment at which the (N-1)th run ended. This block of code captures exactly that (plus the (N-1)th RestartCount).
  • The moment at which the Nth run started. We can get this once the RestartCount has incremented, as you can see below.

Indeed, we can get the last two pieces of information when the RestartCount increments and the Pod's status is either Running or Terminated, but once the Terminated -> Waiting transition occurs, it overwrites the LastTerminationState, and the information we need is lost.

I know that the code is essentially duplicated, but this way we get all the required information without any risk of losing it. That loss of information was the source of the flakiness.

@timothysc
Member

left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Apr 12, 2019

@timothysc timothysc moved this from In Progress to Needs Approval in cncf-k8s-conformance-wg Apr 12, 2019

@smarterclayton

Contributor

commented Apr 12, 2019

@kubernetes/sig-node-pr-reviews can someone from sig-node please review? @sjenning @yujuhong

@bgrant0607

Member

commented Apr 15, 2019

I assigned @yujuhong to assign someone.

@bgrant0607 bgrant0607 moved this from Needs Approval to In Review in cncf-k8s-conformance-wg Apr 15, 2019

@timothysc timothysc moved this from In Review to Needs Approval in cncf-k8s-conformance-wg Apr 23, 2019

@yujuhong

Member

commented May 3, 2019

/lgtm
/approve

@k8s-ci-robot

Contributor

commented May 3, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bclau, yujuhong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot


commented May 4, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 22cf3ca into kubernetes:master May 4, 2019

20 checks passed

cla/linuxfoundation bclau authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Context retired. Status moved to "pull-kubernetes-dependencies".
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped.
tide In merge pool.

cncf-k8s-conformance-wg automation moved this from Needs Approval to Done May 4, 2019
