
Flaky test: [sig-network] Services should be able to preserve UDP traffic when server pod cycles for a NodePort service #91236

Closed
jprzychodzen opened this issue May 19, 2020 · 82 comments · Fixed by #91962 or #98305
Labels: kind/bug, kind/flake, priority/important-soon, sig/network, sig/node, triage/accepted

Comments

@jprzychodzen
Contributor

Which jobs are flaking:
ci-kubernetes-e2e-gce-scale-correctness

Which test(s) are flaking:
[sig-network] Services should be able to preserve UDP traffic when server pod cycles for a NodePort service

Testgrid link:
https://k8s-testgrid.appspot.com/sig-release-master-informing#gce-master-scale-correctness

Reason for failure:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/service.go:1243
May 18 12:22:32.868: pod communication failed
Unexpected error:
    <*errors.errorString | 0xc003faab60>: {
        s: "continuous echo was not able to communicate with initial server pod",
    }
    continuous echo was not able to communicate with initial server pod
occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/service.go:1302

/sig network
/assign @thockin

@jprzychodzen added the kind/flake label on May 19, 2020
@k8s-ci-robot added the sig/network label on May 19, 2020
@athenabot

/triage unresolved

Comment /remove-triage unresolved when the issue is assessed and confirmed.

🤖 I am a bot run by vllry. 👩‍🔬

@k8s-ci-robot added the triage/unresolved label on May 19, 2020
@haircommander
Contributor

CRI-O CI is also hitting this failure. We have been unable to reproduce it locally; our CI runs on AWS.

@aojea
Member

aojea commented May 19, 2020

hmm, I only see 1 failure out of 4, and it seems that the connectivity didn't work at all 🤔

May 18 12:22:16.516: INFO: The status of Pod server-1 is Running (Ready = true)
STEP: waiting up to 3m0s for service clusterip-test in namespace services-9934 to expose endpoints map[server-1:[80]]
May 18 12:22:16.585: INFO: successfully validated that service clusterip-test in namespace services-9934 exposes endpoints map[server-1:[80]] (68.878243ms elapsed)
STEP: Sending UDP traffic to NodePort service clusterip-test on node with publicIP 34.75.29.74
STEP: creating a second pod for the service clusterip-test
May 18 12:22:16.656: INFO: The status of Pod server-2 is Pending, waiting for it to be Running (with Ready = true)
May 18 12:22:17.620: INFO: Continuous echo ("udp://34.75.29.74:30565"): read udp 10.12.4.84:36135->34.75.29.74:30565: read: connection refused - 1 errors seen so far
May 18 12:22:18.655: INFO: Continuous echo ("udp://34.75.29.74:30565"): read udp 10.12.4.84:36135->34.75.29.74:30565: read: connection refused - 2 errors seen so far
May 18 12:22:18.691: INFO: The status of Pod server-2 is Pending, waiting for it to be Running (with Ready = true)
May 18 12:22:19.690: INFO: Continuous echo ("udp://34.75.29.74:30565"): read udp 10.12.4.84:36135->34.75.29.74:30565: read: connection refused - 3 errors seen so far
May 18 12:22:20.691: INFO: The status of Pod server-2 is Pending, waiting for it to be Running (with Ready = true)
May 18 12:22:20.724: INFO: Continuous echo ("udp://34.75.29.74:30565"): read udp 10.12.4.84:36135->34.75.29.74:30565: read: connection refused - 4 errors seen so far
May 18 12:22:21.759: INFO: Continuous echo ("udp://34.75.29.74:30565"): read udp 10.12.4.84:36135->34.75.29.74:30565: read: connection refused - 5 errors seen so far
May 18 12:22:22.691: INFO: The status of Pod server-2 is Pending, waiting for it to be Running (with Ready = true)
May 18 12:22:22.796: INFO: Continuous echo ("udp://34.75.29.74:30565"): read udp 10.12.4.84:36135->34.75.29.74:30565: read: connection refused - 6 errors seen so far
May 18 12:22:23.831: INFO: Continuous echo ("udp://34.75.29.74:30565"): read udp 10.12.4.84:36135->34.75.29.74:30565: read: connection refused - 7 errors seen so far
May 18 12:22:24.692: INFO: The status of Pod server-2 is Running (Ready = true)
May 18 12:22:24.692: INFO: Cleaning up server-1 pod
STEP: Sending UDP traffic to NodePort service clusterip-test on node with publicIP 34.75.29.74
May 18 12:22:24.867: INFO: Continuous echo ("udp://34.75.29.74:30565"): read udp 10.12.4.84:36135->34.75.29.74:30565: read: connection refused - 8 errors seen so far
May 18 12:22:28.867: INFO: Continuous echo ("udp://34.75.29.74:30565"): read udp 10.12.4.84:36135->34.75.29.74:30565: i/o timeout - 9 errors seen so far
May 18 12:22:32.868: INFO: Continuous echo ("udp://34.75.29.74:30565"): read udp 10.12.4.84:36135->34.75.29.74:30565: i/o timeout - 10 errors seen so far
May 18 12:22:32.868: INFO: Continuous echo waiting for signal to continue
May 18 12:22:32.868: FAIL: pod communication failed
Unexpected error:
    <*errors.errorString | 0xc003faab60>: {
        s: "continuous echo was not able to communicate with initial server pod",
    }
    continuous echo was not able to communicate with initial server pod
occurred
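
For context, a minimal sketch of what the "continuous echo" probe in the log above amounts to; the payload, helper name, and loop bounds are illustrative assumptions rather than the actual e2e test code:

package main

import (
	"fmt"
	"net"
	"time"
)

// echoOnce sends one UDP datagram to the NodePort and waits briefly for the
// echoed reply; both "connection refused" and "i/o timeout" surface from the
// read, which is what the "Continuous echo" lines above are counting.
func echoOnce(target string) error {
	conn, err := net.Dial("udp", target)
	if err != nil {
		return err
	}
	defer conn.Close()
	if _, err := conn.Write([]byte("echo hostname\n")); err != nil {
		return err
	}
	if err := conn.SetReadDeadline(time.Now().Add(3 * time.Second)); err != nil {
		return err
	}
	buf := make([]byte, 1024)
	_, err = conn.Read(buf)
	return err
}

func main() {
	target := "34.75.29.74:30565" // node public IP and NodePort from the log above
	errors := 0
	for try := 0; try < 10; try++ {
		if err := echoOnce(target); err != nil {
			errors++
			fmt.Printf("Continuous echo (%q): %v - %d errors seen so far\n", "udp://"+target, err, errors)
		}
		time.Sleep(time.Second)
	}
}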

@athenabot

@thockin
If this issue has been triaged, please comment /remove-triage unresolved.

If you aren't able to handle this issue, consider unassigning yourself and/or adding the help-wanted label.

🤖 I am a bot run by vllry. 👩‍🔬

@robscott
Member

/assign @jayunit100 @JacobTanenbaum

@k8s-ci-robot
Contributor

@robscott: GitHub didn't allow me to assign the following users: JacobTanenbaum.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @jayunit100 @JacobTanenbaum

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@thockin removed their assignment on May 29, 2020
@saschagrunert
Member

/cc

Saw the test flaking recently:
https://storage.googleapis.com/k8s-gubernator/triage/index.html?text=continuous%20echo%20was%20not%20able%20to%20communicate%20with%20initial%20server%20pod

Wondering if this is just a timeout issue. Maybe we can increase the timeout and/or switch to a TCP based test:

read udp 10.60.36.127:51154->34.83.245.240:30791: i/o timeout - 10 errors seen so far

@aojea
Member

aojea commented May 31, 2020

Maybe we can increase the timeout and/or switch to a TCP based test:

the test should use UDP, and UDP is unreliable by definition ... however, it is curious that most of the failures happen because it is not able to communicate with the first pod https://storage.googleapis.com/k8s-gubernator/triage/index.html?text=continuous%20echo%20was%20not%20able%20to%20communicate%20with%20initial%20server%20pod#e76fa297acbadf09dac1

        s: "continuous echo was not able to communicate with initial server pod",

I think that is something that should be investigated. I've analyzed some occurrences and it works fine for the second pod; however, the UDP losses should be totally random 🤔

@jayunit100
Member

Hi everyone, yeah so.. collecting some data on this...

  • using a two node kind cluster, locally, with the antrea plugin, it passes
  • using a 4 node AWS cluster with the antrea CNI plugin, it failed
  • using a two node kind cluster locally with calico, it passes

So I guess the best reproducer is to run these tests, for now, in the cloud... Now that I have a reproducer for this, maybe I can live debug it some time next week with one of y'all :)

@aojea
Member

aojea commented Jun 6, 2020

  • using a 4 node AWS cluster with the antrea CNI plugin, it failed

bear in mind that this test is connecting to a NodePort on the node IP. The test is functionally correct; it may be flaky due to the unreliable nature of UDP or due to pod scheduling, but if it is failing constantly it has to be environmental.

@jkaniuk
Contributor

jkaniuk commented Jun 8, 2020

@aojea
Member

aojea commented Jun 17, 2020

/reopen
I'm afraid this is going to be challenging 😄
https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=conntrack

The difference now is that we are testing NodePort and ClusterIP

  • If both fail, we can blame the test for sure,
  • If only NodePort fails, we need to check the job, whether it is GCE or KIND
  • If only ClusterIP fails, ... 🤔

@k8s-ci-robot reopened this on Jun 17, 2020
@k8s-ci-robot
Contributor

@aojea: Reopened this issue.

In response to this:

/reopen
I'm afraid this is going to be challenging 😄
https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=conntrack

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wojtek-t added the priority/important-soon label on Jun 19, 2020
@wojtek-t
Member

We're hitting this really frequently in 5k-node tests.

So I started looking a little bit into this.
I don't have a full understanding of what exactly is happening, but I already see a couple of things that may be causing issues:

  • when creating the first pod, the test waits for the corresponding endpoint to be created; but that doesn't mean that kube-proxy was actually programmed correctly at that time (in general it isn't); given that iptables programming is rate-limited, it may easily happen that the actual programming happens 10s+ later

That explains some delays, but it doesn't explain e.g. this run: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-proto/1273659998622191616 where iptables were programmed at 17:02:14 but the test didn't get a response until "Thu Jun 18 17:02:24 UTC 2020 Try: 36"
If someone more familiar with networking stuff could look into it, it would be great.
@kubernetes/sig-network-bugs @bowei @thockin - can you help triaging ^^

  • for deletions it's even worse, because the test doesn't even wait for the endpoints to be updated (i.e. Endpoints can still contain the old pod when the test starts pinging to look for the second one); a rough sketch of such a wait is included below

So I claim that the timeouts are visibly too low in this test, but I suspect that's not everything.

I will send out a first PR that tries to improve it in a couple of minutes.
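
For illustration only, a minimal sketch of the kind of endpoints wait described above, using client-go; the function name, polling intervals, and error handling are assumptions and this is not the actual change made in the linked PRs. Note that even this does not prove kube-proxy has reprogrammed iptables, which is the other gap mentioned above.

package e2esketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForOnlyEndpoint polls the service's Endpoints object until it lists
// only the expected backend pod IP, before the test starts sending traffic.
func waitForOnlyEndpoint(ctx context.Context, c kubernetes.Interface, ns, svc, podIP string) error {
	return wait.PollImmediate(2*time.Second, 3*time.Minute, func() (bool, error) {
		ep, err := c.CoreV1().Endpoints(ns).Get(ctx, svc, metav1.GetOptions{})
		if err != nil {
			return false, nil // keep polling on transient errors
		}
		if len(ep.Subsets) == 0 {
			return false, nil // no ready endpoints yet
		}
		for _, subset := range ep.Subsets {
			for _, addr := range subset.Addresses {
				if addr.IP != podIP {
					return false, nil // an old pod is still listed
				}
			}
		}
		return true, nil
	})
}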

@k8s-ci-robot added the kind/bug label on Jun 19, 2020
@aojea
Member

aojea commented Oct 15, 2020

It seems that the packet never gets DNATed on the receiving host, so it does not reach the pod on the node.
The same test using ClusterIP instead of NodePort is more reliable.
This is turning out to be a tough one.
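
For anyone debugging this on a node, a rough sketch of one way to confirm it: list the conntrack entries for the UDP NodePort and check whether the reply tuple was rewritten to the pod IP. The port value is just the example from an earlier log in this thread, and this assumes the conntrack tool is installed on the node:

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Example NodePort taken from an earlier log in this issue; adjust as needed.
	nodePort := "30565"
	// List conntrack entries for the UDP NodePort. An [UNREPLIED] entry whose
	// reply tuple still points back at the node IP instead of the pod IP means
	// the packet was never DNATed to the backend.
	out, err := exec.Command("conntrack", "-L", "-p", "udp", "--dport", nodePort).CombinedOutput()
	if err != nil {
		fmt.Println("conntrack failed:", err)
	}
	fmt.Print(string(out))
}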

@thockin added the needs-triage label and removed the triage/unresolved label on Oct 15, 2020
@aojea
Member

aojea commented Oct 19, 2020

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Oct 19, 2020
@thejoycekung
Contributor

Yesterday's flake - https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce/1318719364404350976

I don't think I'm seeing the log messages from the latest logging PR you made, @aojea? Or are the prow logs not logging at a high enough level?

@aojea
Member

aojea commented Oct 21, 2020

Logs are there, please bear with me:
The error on the test is:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/conntrack.go:119
Oct 21 01:23:11.916: Failed to connect to backend 1

Click on the job failure and you'll see a larger output where you can see the server pod name:

STEP: creating a backend pod pod-server-1 for the service svc-udp
STEP: waiting up to 3m0s for service svc-udp in namespace conntrack-5226 to expose endpoints map[pod-server-1:[80]]
Oct 21 01:22:11.665: INFO: successfully validated that service svc-udp in namespace conntrack-5226 exposes endpoints map[pod-server-1:[80]]

and get the node where the pod is running:

Oct 21 01:23:14.846: INFO: pod-server-1  bootstrap-e2e-minion-group-fltx  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-10-21 01:22:07 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2020-10-21 01:22:09 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2020-10-21 01:22:09 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-10-21 01:22:07 +0000 UTC  }]

and the client-pod name and its node:

Oct 21 01:23:14.804: INFO: At 2020-10-21 01:22:01 +0000 UTC - event for pod-client: {kubelet bootstrap-e2e-minion-group-dbk4} Started: Started container pod-client

and the flow we are looking for (note the [UNREPLIED] state, meaning conntrack never saw a reply from the backend):

udp      17 28 src=10.64.1.58 dst=10.138.0.3 sport=12345 dport=32462 [UNREPLIED] src=10.138.0.3 dst=10.64.1.58 sport=32462 dport=12345 mark=0 use=1

with the node name we can get the kube-proxy logs there, clicking on artifacts:
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce/1318719364404350976/artifacts/bootstrap-e2e-minion-group-fltx/kube-proxy.log

and you can see the "new" conntrack log entries (conntrack.go:66] Clearing conntrack entries ...):

I1021 01:22:01.849095       1 proxier.go:1698] Opened local port "nodePort for conntrack-5226/svc-udp:udp" (:32462/udp)
I1021 01:22:01.849130       1 conntrack.go:66] Clearing conntrack entries [-D -p udp --dport 32462]

and in the client:

wow, there is a kernel bug there @wojtek-t @BenTheElder, can it be related? 🤔

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce/1318719364404350976/artifacts/bootstrap-e2e-minion-group-dbk4/kern.log

Oct 21 01:20:56.731721 bootstrap-e2e-minion-group-dbk4 kernel: nfsd: last server has exited, flushing export cache
Oct 21 01:20:57.039321 bootstrap-e2e-minion-group-dbk4 kernel: BUG: Dentry 00000000244348c3{i=76,n=3}  still in use (1) [unmount of nfsd nfsd]
Oct 21 01:20:57.039521 bootstrap-e2e-minion-group-dbk4 kernel: ------------[ cut here ]------------
Oct 21 01:20:57.039570 bootstrap-e2e-minion-group-dbk4 kernel: WARNING: CPU: 0 PID: 91101 at fs/dcache.c:1595 umount_check+0x69/0x70
Oct 21 01:20:57.039606 bootstrap-e2e-minion-group-dbk4 kernel: Modules linked in: nfsd fuse dummy xt_recent nfsv3 nfs_acl auth_rpcgss nfsv4 dns_resolver nfs fscache lockd grace sunrpc xt_statistic veth xt_nat ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_mark iptable_raw xt_REDIRECT xt_MASQUERADE xt_addrtype iptable_nat nf_nat br_netfilter ip6table_filter ip6_tables aesni_intel glue_helper crypto_simd cryptd loadpin_trigger(O)
Oct 21 01:20:57.039650 bootstrap-e2e-minion-group-dbk4 kernel: CPU: 0 PID: 91101 Comm: umount Tainted: G        W  O      5.4.49+ #1
Oct 21 01:20:57.039679 bootstrap-e2e-minion-group-dbk4 kernel: Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Oct 21 01:20:57.039715 bootstrap-e2e-minion-group-dbk4 kernel: RIP: 0010:umount_check+0x69/0x70
Oct 21 01:20:57.039749 bootstrap-e2e-minion-group-dbk4 kernel: Code: 96 80 00 00 00 49 8b 42 28 4c 8b 08 49 81 c2 88 04 00 00 48 c7 c7 da d6 f9 a5 48 89 f1 31 c0 41 52 e8 fc 9c e2 ff 48 83 c4 08 <0f> 0b eb b7 00 00 90 0f 1f 44 00 00 55 48 89 e5 41 56 53 bf ff ff
Oct 21 01:20:57.039782 bootstrap-e2e-minion-group-dbk4 kernel: RSP: 0018:ffffa00041e07d80 EFLAGS: 00010292
Oct 21 01:20:57.039819 bootstrap-e2e-minion-group-dbk4 kernel: RAX: 000000000000004f RBX: ffff90cc29fb1360 RCX: 8977096f5098cc00
Oct 21 01:20:57.039850 bootstrap-e2e-minion-group-dbk4 kernel: RDX: ffff90cd98225f38 RSI: ffff90cd982161a8 RDI: ffff90cd982161a8
Oct 21 01:20:57.039893 bootstrap-e2e-minion-group-dbk4 kernel: RBP: ffffa00041e07d80 R08: 0000000000000000 R09: 000001b970706a70
Oct 21 01:20:57.039934 bootstrap-e2e-minion-group-dbk4 kernel: R10: 00000000000009b3 R11: ffffffffa55e8410 R12: ffff90cc29fb13b8
Oct 21 01:20:57.039968 bootstrap-e2e-minion-group-dbk4 kernel: R13: ffff90cc23cbd690 R14: ffff90cc181ec0d8 R15: ffff90cc181ec190
Oct 21 01:20:57.040003 bootstrap-e2e-minion-group-dbk4 kernel: FS:  00007f667d9fa880(0000) GS:ffff90cd98200000(0000) knlGS:0000000000000000
Oct 21 01:20:57.040036 bootstrap-e2e-minion-group-dbk4 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 21 01:20:57.040071 bootstrap-e2e-minion-group-dbk4 kernel: CR2: 00007f3ab93dd140 CR3: 00000001aef8c004 CR4: 00000000003606b0
Oct 21 01:20:57.040107 bootstrap-e2e-minion-group-dbk4 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 21 01:20:57.040139 bootstrap-e2e-minion-group-dbk4 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct 21 01:20:57.040167 bootstrap-e2e-minion-group-dbk4 kernel: Call Trace:
Oct 21 01:20:57.040196 bootstrap-e2e-minion-group-dbk4 kernel:  d_walk+0x15f/0x270
Oct 21 01:20:57.040227 bootstrap-e2e-minion-group-dbk4 kernel:  ? __d_free+0x20/0x20
Oct 21 01:20:57.040257 bootstrap-e2e-minion-group-dbk4 kernel:  do_one_tree+0x26/0x70
Oct 21 01:20:57.040309 bootstrap-e2e-minion-group-dbk4 kernel:  shrink_dcache_for_umount+0x2d/0x80
Oct 21 01:20:57.040360 bootstrap-e2e-minion-group-dbk4 kernel:  generic_shutdown_super+0x22/0x120
Oct 21 01:20:57.040399 bootstrap-e2e-minion-group-dbk4 kernel:  kill_litter_super+0x2e/0x40
Oct 21 01:20:57.040428 bootstrap-e2e-minion-group-dbk4 kernel:  nfsd_umount+0x16/0x30 [nfsd]
Oct 21 01:20:57.040461 bootstrap-e2e-minion-group-dbk4 kernel:  deactivate_locked_super+0x44/0x90
Oct 21 01:20:57.040492 bootstrap-e2e-minion-group-dbk4 kernel:  cleanup_mnt+0x11a/0x170
Oct 21 01:20:57.040517 bootstrap-e2e-minion-group-dbk4 kernel:  task_work_run+0x9f/0xc0
Oct 21 01:20:57.040546 bootstrap-e2e-minion-group-dbk4 kernel:  exit_to_usermode_loop+0xca/0xe0
Oct 21 01:20:57.040576 bootstrap-e2e-minion-group-dbk4 kernel:  syscall_return_slowpath+0x7d/0xb0
Oct 21 01:20:57.040609 bootstrap-e2e-minion-group-dbk4 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 21 01:20:57.040636 bootstrap-e2e-minion-group-dbk4 kernel: RIP: 0033:0x7f667cc68c47
Oct 21 01:20:57.040751 bootstrap-e2e-minion-group-dbk4 kernel: Code: 72 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 19 72 2c 00 f7 d8 64 89 01 48
Oct 21 01:20:57.040783 bootstrap-e2e-minion-group-dbk4 kernel: RSP: 002b:00007ffe22d5c188 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
Oct 21 01:20:57.040821 bootstrap-e2e-minion-group-dbk4 kernel: RAX: 0000000000000000 RBX: 000055a4bc92c060 RCX: 00007f667cc68c47
Oct 21 01:20:57.040852 bootstrap-e2e-minion-group-dbk4 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055a4bc92c2b0
Oct 21 01:20:57.040880 bootstrap-e2e-minion-group-dbk4 kernel: RBP: 000055a4bc92c2b0 R08: 000055a4bc931410 R09: 0000000000000000
Oct 21 01:20:57.040915 bootstrap-e2e-minion-group-dbk4 kernel: R10: 00007ffe22d5bbe0 R11: 0000000000000246 R12: 00007f667d7e3d58
Oct 21 01:20:57.040950 bootstrap-e2e-minion-group-dbk4 kernel: R13: 0000000000000000 R14: 000055a4bc92c160 R15: 000055a4bc92c060
Oct 21 01:20:57.040983 bootstrap-e2e-minion-group-dbk4 kernel: ---[ end trace 98c01d5b887efe82 ]---
Oct 21 01:20:57.041015 bootstrap-e2e-minion-group-dbk4 kernel: BUG: Dentry 0000000025b755b3{i=72,n=2}  still in use (1) [unmount of nfsd nfsd]
Oct 21 01:20:57.337667 bootstrap-e2e-minion-group-dbk4 kernel: ------------[ cut here ]------------
Oct 21 01:20:57.347179 bootstrap-e2e-minion-group-dbk4 kernel: WARNING: CPU: 0 PID: 91101 at fs/dcache.c:1595 umount_check+0x69/0x70
Oct 21 01:20:57.347272 bootstrap-e2e-minion-group-dbk4 kernel: Modules linked in: nfsd fuse dummy xt_recent nfsv3 nfs_acl auth_rpcgss nfsv4 dns_resolver nfs fscache lockd grace sunrpc xt_statistic veth xt_nat ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_mark iptable_raw xt_REDIRECT xt_MASQUERADE xt_addrtype iptable_nat nf_nat br_netfilter ip6table_filter ip6_tables aesni_intel glue_helper crypto_simd cryptd loadpin_trigger(O)
Oct 21 01:25:12.685469 bootstrap-e2e-minion-group-dbk4 kernel: ------------[ cut here ]------------
Oct 21 01:25:12.685512 bootstrap-e2e-minion-group-dbk4 kernel: WARNING: CPU: 1 PID: 129938 at fs/dcache.c:1595 umount_check+0x69/0x70
Oct 21 01:25:12.685540 bootstrap-e2e-minion-group-dbk4 kernel: Modules linked in: xt_multiport nfsd fuse dummy xt_recent nfsv3 nfs_acl auth_rpcgss nfsv4 dns_resolver nfs fscache lockd grace sunrpc xt_statistic veth xt_nat ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_mark iptable_raw xt_REDIRECT xt_MASQUERADE xt_addrtype iptable_nat nf_nat br_netfilter ip6table_filter ip6_tables aesni_intel glue_helper crypto_simd cryptd loadpin_trigger(O)
Oct 21 01:25:12.685571 bootstrap-e2e-minion-group-dbk4 kernel: CPU: 1 PID: 129938 Comm: umount Tainted: G        W  O      5.4.49+ #1
Oct 21 01:25:12.685606 bootstrap-e2e-minion-group-dbk4 kernel: Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Oct 21 01:25:12.685640 bootstrap-e2e-minion-group-dbk4 kernel: RIP: 0010:umount_check+0x69/0x70
Oct 21 01:25:12.685665 bootstrap-e2e-minion-group-dbk4 kernel: Code: 96 80 00 00 00 49 8b 42 28 4c 8b 08 49 81 c2 88 04 00 00 48 c7 c7 da d6 f9 a5 48 89 f1 31 c0 41 52 e8 fc 9c e2 ff 48 83 c4 08 <0f> 0b eb b7 00 00 90 0f 1f 44 00 00 55 48 89 e5 41 56 53 bf ff ff
Oct 21 01:25:12.685689 bootstrap-e2e-minion-group-dbk4 kernel: RSP: 0018:ffffa000407dfd80 EFLAGS: 00010292
Oct 21 01:25:12.685714 bootstrap-e2e-minion-group-dbk4 kernel: RAX: 000000000000004f RBX: ffff90cbe63e0000 RCX: 271b879b03ccc500
Oct 21 01:25:12.685919 bootstrap-e2e-minion-group-dbk4 kernel: RDX: ffff90cd98325f38 RSI: ffff90cd983161a8 RDI: ffff90cd983161a8
Oct 21 01:25:12.685958 bootstrap-e2e-minion-group-dbk4 kernel: RBP: ffffa000407dfd80 R08: 0000000000000001 R09: 0000023c6327c8c0
Oct 21 01:25:12.685985 bootstrap-e2e-minion-group-dbk4 kernel: R10: 0000000000000aff R11: ffffffffa55e8410 R12: ffff90cbe63e0058
Oct 21 01:25:12.686010 bootstrap-e2e-minion-group-dbk4 kernel: R13: ffff90cbe63e0c78 R14: ffff90cbe6230e58 R15: ffff90cbe6230f10
Oct 21 01:25:12.686032 bootstrap-e2e-minion-group-dbk4 kernel: FS:  00007f1e7f8b4880(0000) GS:ffff90cd98300000(0000) knlGS:0000000000000000
Oct 21 01:25:12.686058 bootstrap-e2e-minion-group-dbk4 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 21 01:25:12.686084 bootstrap-e2e-minion-group-dbk4 kernel: CR2: 00007f9107e68140 CR3: 00000001ac63e005 CR4: 00000000003606a0
Oct 21 01:25:12.686110 bootstrap-e2e-minion-group-dbk4 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 21 01:25:12.686132 bootstrap-e2e-minion-group-dbk4 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct 21 01:25:12.686168 bootstrap-e2e-minion-group-dbk4 kernel: Call Trace:
Oct 21 01:25:12.686190 bootstrap-e2e-minion-group-dbk4 kernel:  d_walk+0x15f/0x270
Oct 21 01:25:12.686216 bootstrap-e2e-minion-group-dbk4 kernel:  ? __d_free+0x20/0x20
Oct 21 01:25:12.686242 bootstrap-e2e-minion-group-dbk4 kernel:  do_one_tree+0x26/0x70
Oct 21 01:25:12.686264 bootstrap-e2e-minion-group-dbk4 kernel:  shrink_dcache_for_umount+0x2d/0x80
Oct 21 01:25:12.686316 bootstrap-e2e-minion-group-dbk4 kernel:  generic_shutdown_super+0x22/0x120
Oct 21 01:25:12.686346 bootstrap-e2e-minion-group-dbk4 kernel:  kill_litter_super+0x2e/0x40
Oct 21 01:25:12.686374 bootstrap-e2e-minion-group-dbk4 kernel:  nfsd_umount+0x16/0x30 [nfsd]
Oct 21 01:25:12.686406 bootstrap-e2e-minion-group-dbk4 kernel:  deactivate_locked_super+0x44/0x90
Oct 21 01:25:12.686437 bootstrap-e2e-minion-group-dbk4 kernel:  cleanup_mnt+0x11a/0x170
Oct 21 01:25:12.686462 bootstrap-e2e-minion-group-dbk4 kernel:  task_work_run+0x9f/0xc0
Oct 21 01:25:12.686485 bootstrap-e2e-minion-group-dbk4 kernel:  exit_to_usermode_loop+0xca/0xe0
Oct 21 01:25:12.686510 bootstrap-e2e-minion-group-dbk4 kernel:  syscall_return_slowpath+0x7d/0xb0
Oct 21 01:25:12.686542 bootstrap-e2e-minion-group-dbk4 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 21 01:25:12.686565 bootstrap-e2e-minion-group-dbk4 kernel: RIP: 0033:0x7f1e7eb22c47
Oct 21 01:25:12.686586 bootstrap-e2e-minion-group-dbk4 kernel: Code: 72 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 19 72 2c 00 f7 d8 64 89 01 48
Oct 21 01:25:12.686616 bootstrap-e2e-minion-group-dbk4 kernel: RSP: 002b:00007ffd243ee4d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
Oct 21 01:25:12.686646 bootstrap-e2e-minion-group-dbk4 kernel: RAX: 0000000000000000 RBX: 0000557dbdd7f060 RCX: 00007f1e7eb22c47
Oct 21 01:25:12.686676 bootstrap-e2e-minion-group-dbk4 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557dbdd7f2b0
Oct 21 01:25:12.686699 bootstrap-e2e-minion-group-dbk4 kernel: RBP: 0000557dbdd7f2b0 R08: 0000557dbdd84410 R09: 0000000000000000
Oct 21 01:25:12.686723 bootstrap-e2e-minion-group-dbk4 kernel: R10: 00007ffd243edf60 R11: 0000000000000246 R12: 00007f1e7f69dd58
Oct 21 01:25:12.686755 bootstrap-e2e-minion-group-dbk4 kernel: R13: 0000000000000000 R14: 0000557dbdd7f160 R15: 0000557dbdd7f060
Oct 21 01:25:12.686784 bootstrap-e2e-minion-group-dbk4 kernel: ---[ end trace 98c01d5b887efe84 ]---
Oct 21 01:25:12.686812 bootstrap-e2e-minion-group-dbk4 kernel: BUG: Dentry 0000000076c6e187{i=7c,n=2}  still in use (1) [unmount of nfsd nfsd]
Oct 21 01:25:12.984532 bootstrap-e2e-minion-group-dbk4 kernel: ------------[ cut here ]------------

@wojtek-t
Member

wohoo - some progress :)
I'm not an expert here, but I think it can be? Adding @kubernetes/sig-node-bugs here - they may be able to confirm/reject the hypothesis.

@k8s-ci-robot added the sig/node label on Oct 21, 2020
@aojea
Member

aojea commented Nov 4, 2020

This kernel panic may be a coincidence; maybe this is the reason: #96174

@thejoycekung
Contributor

@tosi3k
Member

tosi3k commented Jan 22, 2021
