Mark restart_test as flaky #106359

Merged


mmiranda96
Contributor

What type of PR is this?

/kind cleanup
/kind flake

What this PR does / why we need it:

The restart test has been flaking for a while now (see testgrid). This PR marks it as flaky, which removes it from the node-kubelet-serial suite.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 11, 2021
@k8s-ci-robot k8s-ci-robot added area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 11, 2021
@mmiranda96
Contributor Author

/priority backlog
/triage accepted
/assign @ehashman

@k8s-ci-robot k8s-ci-robot added priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 11, 2021
Member

@ehashman ehashman left a comment

/lgtm
/approve

@ehashman ehashman added this to PRs - Needs Reviewer in SIG Node CI/Test Board Nov 11, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 11, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ehashman, mmiranda96

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 11, 2021
@aojea
Member

aojea commented Nov 11, 2021

The problem is that the pods are still running and do not seem to be killed correctly:

Logging kubelet events for node n1-standard-2-ubuntu-gke-2004-1-20-v20210401-7ccd444b
Nov 11 15:22:26.859: INFO: 
Logging pods the kubelet thinks is on node n1-standard-2-ubuntu-gke-2004-1-20-v20210401-7ccd444b
Nov 11 15:22:26.874: INFO: test-6675f3d0-e7a2-4fb7-a2f2-2dcafbc1009c started at 2021-11-11 15:08:46 +0000 UTC (0+1 container statuses recorded)
Nov 11 15:22:26.874: INFO: 	Container test-6675f3d0-e7a2-4fb7-a2f2-2dcafbc1009c ready: true, restart count 0
Nov 11 15:22:26.874: INFO: test-6deea3bb-5e34-4896-9f52-ec0bc05df329 started at 2021-11-11 15:08:47 +0000 UTC (0+1 container statuses recorded)
Nov 11 15:22:26.874: INFO: 	Container test-6deea3bb-5e34-4896-9f52-ec0bc05df329 ready: true, restart count 0
Nov 11 15:22:26.874: INFO: test-9ca584b2-1f70-49c5-9aee-2cb340baf08d started at 2021-11-11 15:08:42 +0000 UTC (0+1 container statuses recorded)
Nov 11 15:22:26.874: INFO: 	Container test-9ca584b2-1f70-49c5-9aee-2cb340baf08d ready: true, restart count 0
Nov 11 15:22:26.874: INFO: test-3d47afff-ab8e-4a6e-8fe3-e26760591d58 started at 2021-11-11 15:08:44 +0000 UTC (0+1 container statuses recorded)
Nov 11 15:22:26.874: INFO: 	Container test-3d47afff-ab8e-4a6e-8fe3-e26760591d58 ready: true, restart count 0
Nov 11 15:22:26.874: INFO: test-7b333322-ddf5-4e2b-b3a4-0a815bbac8d3 started at 2021-11-11 15:08:47 +0000 UTC (0+1 container statuses recorded)
Nov 11 15:22:26.874: INFO: 	Container test-7b333322-ddf5-4e2b-b3a4-0a815bbac8d3 ready: true, restart count 0
Nov 11 15:22:26.874: INFO: test-67207b21-e815-4963-8c67-098a8a9a5216 started at 2021-11-11 15:08:44 +0000 UTC (0+1 container statuses recorded)
Nov 11 15:22:26.874: INFO: 	Container test-67207b21-e815-4963-8c67-098a8a9a5216 ready: true, restart count 0
Nov 11 15:22:26.874: INFO: test-6216e797-abfe-4733-a8a9-4a1946a95590 started at 2021-11-11 15:08:38 +0000 UTC (0+1 container statuses recorded)
Nov 11 15:22:26.874: INFO: 	Container test-6216e797-abfe-4733-a8a9-4a1946a95590 ready: true, restart count 0
Nov 11 15:22:26.874: INFO: test-1bb361bf-714d-4a5a-a778-103fb188f9ea started at 2021-11-11 15:08:46 +0000 UTC (0+1 container statuses recorded)
Nov 11 15:22:26.874: INFO: 	Container test-1bb361bf-714d-4a5a-a778-103fb188f9ea ready: true, restart count 0
Nov 11 15:22:26.874: INFO: test-fd262496-b1f1-428d-8603-49fe1a596a71 started at 2021-11-11 15:08:45 +0000 UTC (0+1 container statuses recorded)
Nov 11 15:22:26.874: INFO: 	Container test-fd262496-b1f1-428d-8603-49fe1a596a71 ready: true, restart count 0
Nov 11 15:22:26.874: INFO: test-1da3f521-ec3a-4f82-90d2-03a7b74e994c started at 2021-11-11 15:08:45 +0000 UTC (0+1 container statuses recorded)
Nov 11 15:22:26.874: INFO: 	Container test-1da3f521-ec3a-4f82-90d2-03a7b74e994c ready: true, restart count 0
Nov 11 15:22:26.874: INFO: test-86328ca8-6e3a-4c5b-9f58-213f0efcb402 started at 2021-11-11 15:08:46 +0000 UTC (0+1 container statuses recorded)
Nov 11 15:22:26.874: INFO: 	Container test-86328ca8-6e3a-4c5b-9f58-213f0efcb402 ready: true, restart count 0
Nov 11 15:22:26.874: INFO: test-f934a33d-fa47-4b54-9d2b-ffed4ec006f2 started at 2021-11-11 15:08:44 +0000 UTC (0+1 container statuses recorded)
Nov 11 15:22:26.874: INFO: 	Container test-f934a33d-fa47-4b54-9d2b-ffed4ec006f2 ready: true, restart count 0
Nov 11 15:22:26.874: INFO: test-6ae8f193-dc57-4aab-b7e3-5d03ac416a55 started at 2021-11-11 15:08:44 +0000 UTC (0+1 container statuses recorded)
Nov 11 15:22:26.874: INFO: 	Container test-6ae8f193-dc57-4aab-b7e3-5d03ac416a55 ready: true, restart count 0
Nov 11 15:22:26.874: INFO: test-309f7c4d-fca6-4992-bb9e-aa88caefe438 started at 2021-11-11 15:08:45 +0000 UTC (0+1 container statuses recorded)
Nov 11 15:22:26.874: INFO: 	Container test-309f7c4d-fca6-4992-bb9e-aa88caefe438 ready: true, restart count 0
Nov 11 15:22:26.874: INFO: test-990de4e9-9e28-471f-a92f-623b5a045c0d started at 2021-11-11 15:08:45 +0000 UTC (0+1 container statuses recorded)
Nov 11 15:22:26.874: INFO: 	Container test-990de4e9-9e28-471f-a92f-623b5a045c0d ready: true, restart count 0
Nov 11 15:22:26.902: INFO: 
Latency metrics for node n1-standard-2-ubuntu-gke-2004-1-20-v20210401-7ccd444b
Nov 11 15:22:26.902: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready

Is this a known issue? Could this hide something important?

@aojea
Member

aojea commented Nov 11, 2021

Actually, the test is passing:

STEP: Killing container runtime iteration 5
STEP: Checking currently Running/Ready pods
Nov 11 07:42:56.543: INFO: Running pod count 100
STEP: Confirm no containers have terminated
STEP: Container runtime restart test passed with 100 pods
Nov 11 07:42:56.564: INFO: Waiting for pod test-0d2b8016-f793-4bd9-a2af-3a0c93ae8223 to disappear

🤔

@aojea
Member

aojea commented Nov 11, 2021

@ehashman @mmiranda96 do you mind if we hold this and I try to see if I can solve the flake?

@mmiranda96
Contributor Author

We can try to fix this, SG.
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 11, 2021
@ehashman
Member

@aojea we need to mark this as flaky to get it out of our serial lane and get that lane green for the 1.23 release. Separately, we should fix the test.

@ehashman
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 11, 2021
@ehashman
Member

(let me double check that there is an issue for this, I will assign you @aojea)

@ehashman
Member

I think it is only an issue with dockershim + Ubuntu, which is why we haven't prioritized it previously!

@mmiranda96
Contributor Author

Docker + Ubuntu is getting skipped (source).

@aojea
Member

aojea commented Nov 11, 2021

The test fails here; it takes more than 10 minutes to delete the pods:

gomega.Expect(e2epod.WaitForPodToDisappear(f.ClientSet, f.Namespace.Name, pod.ObjectMeta.Name, labels.Everything(),

Checking this occurrence: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial/1458684226500038656

It seems that the kubelet restarts when the test finishes; the kubelet logs start after the test failure:
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial/1458684226500038656/artifacts/n1-standard-2-ubuntu-gke-2004-1-20-v20210401-d5f27617/kubelet.log

Is there some way to get the previous kubelet logs?

@ehashman
Member

Are they not written to the same file? If not, we might need to change the test to append to the file rather than overwrite it.

@k8s-ci-robot k8s-ci-robot merged commit bbc3a9a into kubernetes:master Nov 11, 2021
SIG Node CI/Test Board automation moved this from PRs - Needs Reviewer to Done Nov 11, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Nov 11, 2021
@mmiranda96 mmiranda96 deleted the fix/mark-restart-test-flaky branch November 12, 2021 18:58
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.