Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: Solve backoff tests flakiness #75952

Conversation

claudiubelu
Copy link
Contributor

What type of PR is this?

/kind flake

/sig testing

/area conformance

What this PR does / why we need it:

The container status is not constant, and can change over time in the
following order:

  • Running: When kubelet reports the Pod as running. This state is missable if
    the container finishes its command faster than kubelet getting to report this
    state.
  • Terminated: After the Container finished its command, it will enter the Terminated
    state, in which will remain for a short period of time, before kubelet will try
    to restart it.
  • Waiting: When kubelet has to wait for the backoff period to expire before actually
    restarting the container.

Treating and handling each of these states when calculating the backoff period between
container restarts will make the tests more reliable.

Which issue(s) this PR fixes:

Related #71949

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

The container status is not constant, and can change over time in the
following order:

- Running: When kubelet reports the Pod as running. This state is missable if
  the container finishes its command faster than kubelet getting to report this
  state.
- Terminated: After the Container finished its command, it will enter the Terminated
  state, in which will remain for a short period of time, before kubelet will try
  to restart it.
- Waiting: When kubelet has to wait for the backoff period to expire before actually
  restarting the container.

Treating and handling each of these states when calculating the backoff period between
container restarts will make the tests more reliable.
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. sig/testing Categorizes an issue or PR as relevant to SIG Testing. area/conformance Issues or PRs related to kubernetes conformance tests cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test labels Apr 1, 2019
@brahmaroutu brahmaroutu added this to In Progress in conformance-definition Apr 1, 2019
@timothysc timothysc self-assigned this Apr 2, 2019
@timothysc timothysc added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 9, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Apr 9, 2019
@timothysc timothysc added this to the v1.15 milestone Apr 9, 2019
} else {
previousFinishedAt = status.LastTerminationState.Terminated.FinishedAt.Time
}
previousRestartCount = status.RestartCount
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't you just use status.RestartCount below and push a portion of this logic below? This seems unnecessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this function basically measures the amount of time that passes between the Nth-1 run and the Nth run. For that, need 4 bits of information:

  • RestartCount for the Nth-1 run and RestartCount for the Nth run. We need that information in order to detect when the RestartCount is incremented, and thus, we know that the Pod restarted.
  • the moment in which the Nth-1 run ended. This block of code is getting that exact information (+ the Nth-1 RestartCount)
  • the moment in which the Nth run started. This information we can get once the RestartCount incremented, as you can see below.

Indeed, we can get the last 2 pieces of information when the RestartCount increments and the Pod's status is either Running or Terminated, but once the state transition Terminated -> Waiting occurs, it overwrites the LastTerminationState, needed information which is then lost.

I know that the code is basically duplicated, but in this manner we get all the information required without any risk of losing any. The loss of information was basically the source of the flakiness.

Copy link
Member

@timothysc timothysc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 12, 2019
@timothysc timothysc moved this from In Progress to Needs Approval in conformance-definition Apr 12, 2019
@smarterclayton
Copy link
Contributor

@kubernetes/sig-node-pr-reviews can someone from sig-node please review? @sjenning @yujuhong

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Apr 12, 2019
@bgrant0607
Copy link
Member

I assigned @yujuhong to assign someone.

@bgrant0607 bgrant0607 moved this from Needs Approval to In Review in conformance-definition Apr 15, 2019
@timothysc timothysc moved this from In Review to Needs Approval in conformance-definition Apr 23, 2019
@yujuhong
Copy link
Contributor

yujuhong commented May 3, 2019

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bclau, yujuhong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 3, 2019
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 22cf3ca into kubernetes:master May 4, 2019
conformance-definition automation moved this from Needs Approval to Done May 4, 2019
@claudiubelu claudiubelu deleted the tests/max-backoff-tests-flakiness branch May 6, 2019 12:43
@claudiubelu claudiubelu restored the tests/max-backoff-tests-flakiness branch June 10, 2019 14:37
k8s-ci-robot added a commit that referenced this pull request Sep 13, 2019
…upstream-release-1.14

Automated cherry pick of #75952: tests: Solve backoff tests flakiness Cherry pick of #75952 on release-1.14. #75952: tests: Solve backoff tests flakiness
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/conformance Issues or PRs related to kubernetes conformance tests area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.

None yet

7 participants