Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert #104308 to bring back LockContention tests #104334

Merged

Conversation

ipochi
Copy link
Contributor

@ipochi ipochi commented Aug 12, 2021

This reverts commit 9d09c9d

This E2E test was reverted because the test was failing continuously.
More on the issue here #104307

What type of PR is this?

/kind feature
/sig node

What this PR does / why we need it:

This PR re-reverts and brings back the LockContention test, with
the addition of [Serial] tag to the test.

Which issue(s) this PR fixes:

This E2E test was reverted because the test was failing continuously.
More on the issue here #104307

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Aug 12, 2021
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 12, 2021
@SergeyKanzhelev
Copy link
Member

/approve cancel

it is failing on serial tab. Can you please check why?

@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 12, 2021
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to PRs Waiting on Author in SIG Node CI/Test Board Aug 12, 2021
@ehashman ehashman added this to Triage in SIG Node PR Triage Aug 12, 2021
@ipochi
Copy link
Contributor Author

ipochi commented Aug 13, 2021

/approve cancel

it is failing on serial tab. Can you please check why?

@SergeyKanzhelev

Lock contention flags are not added to the the tests which are failing. Created a PR to update the job.
kubernetes/test-infra#23243

Question: If I want to re-run the failing test but this time I want the test to use the job yaml from the PR and not the master branch. How can I do that ?

/cc @ehashman

@endocrimes
Copy link
Member

@ipochi Unfortunately the easiest way to do that is by running them from your machine, which requires having a Google Cloud account configured, with a default application profile and project that won't interfere with any production environments you care about (gcloud auth application-default login)

and then e.g:

make test-e2e-node REMOTE=true FOCUS="LockContention" SKIP="" PARALLELISM=1 IMAGE_CONFIG_FILE="path/to/image-config.yaml" ZONE=us-central1-a

@endocrimes
Copy link
Member

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 7, 2021
@endocrimes endocrimes moved this from Triage to Waiting on Author in SIG Node PR Triage Sep 7, 2021
SIG Node CI/Test Board automation moved this from PRs Waiting on Author to PRs - Needs Reviewer Sep 9, 2021

const contentionLockFile = "/var/run/kubelet.lock"

var _ = SIGDescribe("Lock contention [Slow] [Disruptive] [Serial] [NodeFeature:LockContention]", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var _ = SIGDescribe("Lock contention [Slow] [Disruptive] [Serial] [NodeFeature:LockContention]", func() {
var _ = SIGDescribe("Lock contention [Slow] [Disruptive] [Feature:LockContention]", func() {

We'll need to update the test-infra manifest. This should avoid this getting picked up by the suites that launch the kubelet without the right command line flags.

@SergeyKanzhelev do we want to consider marking this [NodeSpecialFeature] and skip that in the regular feature suite?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might have to stay serial? - (given that it messes with kubelet health)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ehashman I've made it NodeSpecialFeature:Contention, I think that works better than Feature:LockContention.

What do you think ?

@SergeyKanzhelev
Copy link
Member

/assign

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Nov 5, 2021
@ipochi ipochi force-pushed the imran/re-revert-lock-contention-tests branch from 7b3eb73 to a6a558e Compare December 2, 2021 06:18
@ipochi ipochi force-pushed the imran/re-revert-lock-contention-tests branch 2 times, most recently from 435d353 to 61c4732 Compare January 4, 2022 02:03
…608-imran/e2e-lock-contention"

This reverts commit 9d09c9d

This E2E test was reverted becuase the test was failing continously.
More on the issue here kubernetes#104307

This commit re-reverts and brings back the LockContention test, with
the addition of [Serial] tag to the test.
@ipochi
Copy link
Contributor Author

ipochi commented Jan 7, 2022

@SergeyKanzhelev friendly ping :)

var _ = SIGDescribe("Lock contention [Slow] [Disruptive] [Serial] [NodeFeature:LockContention]", func() {

ginkgo.It("Kubelet should stop when the test acquires the lock on lock file and restart once the lock is released", func() {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it may be useful to check the kubelet was started with the proper flag and skip if not. Or at lease add a comment here explaining that this is needed for test to run

Copy link
Member

@SergeyKanzhelev SergeyKanzhelev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, beyond file rename and adding a comment or skipper

This commit is to be squashed and merged with the first commit.

Signed-off-by: Imran Pochi <imran@kinvolk.io>
@ipochi
Copy link
Contributor Author

ipochi commented Jan 20, 2022

@SergeyKanzhelev Thank you for reviewing the PR. I think I accidentally forced pushed an older version of the commit which resulted in the name changed (_linux) not appearing and couple of other review comments which I addressed such as the setting the permalinks.

I've ensured this time around by pushing in the latest commit. Please review once again (I've created a secondary commit containing the changes). Sorry for the mess.

@ipochi
Copy link
Contributor Author

ipochi commented Jan 20, 2022

@SergeyKanzhelev @ehashman

The job for the lock contention test was removed in the test-infra repo.

What needs to be done on that repo, to get the job back ?

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Jan 20, 2022

@ipochi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-node-kubelet-serial-containerd e45fdc7a74a81a4c933e01dae00e26530f9d6787 link /test pull-kubernetes-node-kubelet-serial-containerd
pull-kubernetes-node-kubelet-serial e45fdc7a74a81a4c933e01dae00e26530f9d6787 link /test pull-kubernetes-node-kubelet-serial

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ipochi
Copy link
Contributor Author

ipochi commented Jan 20, 2022

/retest-required

@ipochi
Copy link
Contributor Author

ipochi commented Feb 10, 2022

@SergeyKanzhelev @ehashman

The job for the lock contention test was removed in the test-infra repo.

What needs to be done on that repo, to get the job back ?

@SergeyKanzhelev @ehashman friendly ping :)

@endocrimes
Copy link
Member

@ipochi We'll need a new PR in test-infra that adds a new version of the job. I'd basically copy the containerd-serial job with a new name and updated filters.

ipochi added a commit to ipochi/test-infra that referenced this pull request Feb 16, 2022
to the test suite.

This test is created due to the need of `--lock-file` and
`--exit-on-lock-contention` be moved as flags and into the Kubelet
configuration file rather than be dropped.

Adds a new tab in TestGrid named `kubelet-gce-e2e-lock-contention`
running tests focused on `NodeSpecialFeature:LockContention`.

Corresponding e2e test is at kubernetes/kubernetes#104334

Signed-off-by: Imran Pochi <imranpochi@microsoft.com>
@ipochi
Copy link
Contributor Author

ipochi commented Feb 16, 2022

@ipochi We'll need a new PR in test-infra that adds a new version of the job. I'd basically copy the containerd-serial job with a new name and updated filters.

kubernetes/test-infra#23243

Copy link
Member

@ehashman ehashman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

this looks fine, we can revert if it's not

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 17, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ehashman, ipochi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 17, 2022
@k8s-ci-robot k8s-ci-robot merged commit 23ccbaf into kubernetes:master Feb 18, 2022
SIG Node CI/Test Board automation moved this from PRs Waiting on Author to Done Feb 18, 2022
SIG Node PR Triage automation moved this from Waiting on Author to Done Feb 18, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 18, 2022
ipochi added a commit to ipochi/kubernetes that referenced this pull request May 4, 2022
This commit moves the `--exit-on-lock-contention` and `--lock-file`
kubelet flags to Kubelet Configuration.

This PR is built on the following PRs:

Corresponding E2E test PRs : kubernetes#103608, kubernetes#104334, kubernetes#108563

Corresponding Job in test-infra :
https://github.com/kubernetes/test-infra/blob/e684255cc8701ef97b6832e3daadb6841c00cc65/config/jobs/kubernetes/sig-node/containerd.yaml#L1315-#L1343

Signed-off-by: Imran Pochi <imran@kinvolk.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Development

Successfully merging this pull request may close these issues.

None yet

5 participants