Flaky test: [sig-network] Services should be able to preserve UDP traffic when server pod cycles for a NodePort service #91236
/triage unresolved 🤖 I am a bot run by vllry. 👩🔬 |
CRI-O CI is also hitting this failure. We have been unable to reproduce it locally; our CI runs on AWS. |
hmm, I only see 1 failure out of 4, and it seems that the connectivity didn't work at all 🤔
|
@thockin If you aren't able to handle this issue, consider unassigning yourself and/or adding the … 🤖 I am a bot run by vllry. 👩🔬 |
/assign @jayunit100 @JacobTanenbaum |
@robscott: GitHub didn't allow me to assign the following users: JacobTanenbaum. Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cc Saw the test flaking recently. Wondering if this is just a timeout issue; maybe we can increase the timeout and/or switch to a TCP-based test:
|
the test should use UDP, and UDP is unreliable by definition ... however it is curious that most of the failures happen because the client is not able to communicate with the first pod: https://storage.googleapis.com/k8s-gubernator/triage/index.html?text=continuous%20echo%20was%20not%20able%20to%20communicate%20with%20initial%20server%20pod#e76fa297acbadf09dac1
I think that is something that should be investigated. I've analyzed some occurrences and it works fine for the second pod; however, UDP losses should be totally random 🤔 |
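(Not from the thread, just to make the "losses should be random" point concrete: a minimal probe sketch, where the node IP, the NodePort, and the agnhost `hostname` command are all assumptions rather than the e2e test's actual code.)

```bash
# Sketch: count which backend answers each UDP probe. NODE_IP/NODE_PORT are
# placeholders; assumes the server is agnhost netexec, whose UDP handler
# replies to "hostname" with the serving pod's name.
NODE_IP=10.0.0.1
NODE_PORT=30080
for i in $(seq 1 60); do
  # -u: UDP, -w 1: give up after one second; a few empty replies are plain loss
  echo "hostname" | nc -u -w 1 "$NODE_IP" "$NODE_PORT"
  sleep 1
done | sort | uniq -c
# Random loss: both pod names appear with similar counts.
# The flake pattern here: the first pod's name never appears at all.
```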
Hi everyone, yeah so... collecting some data on this...
So I guess the best reproducer is to run these tests, for now, in the cloud... Now that I have a reproducer for this, maybe we can live debug it some time next week with one of y'all :) |
bear in mind that this test is connecting to a NodePort on the node IP. The test is functionally correct; it may be flaky due to the unreliable nature of UDP or due to pod scheduling, but if it is failing constantly it has to be environmental. |
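(To spell out what "a NodePort on the node IP" means concretely, a quick sketch with a hypothetical service name:)

```bash
# Hypothetical service name "udp-svc"; these just resolve the two values the
# test dials: the allocated NodePort and a node's primary address.
kubectl get svc udp-svc -o jsonpath='{.spec.ports[0].nodePort}'
kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}'
```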
Second failure after two weeks of green 5k runs: |
/reopen The difference now is that we are testing NodePort and ClusterIP
|
@aojea: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
We're hitting this really frequently in 5k-node tests, so I started looking a little bit into it.
That explains some delays, but it doesn't explain e.g. this run: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-proto/1273659998622191616 where iptables were programmed at …
So I claim that the timeouts in this test are visibly too low, but I suspect that's not everything. I will send out a first PR that tries to improve it in a couple of minutes. |
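(A sketch of the "timeouts too low" point: before treating failed probes as a product bug, wait generously for kube-proxy to actually program the rule. `KUBE-NODEPORTS` is kube-proxy's iptables-mode chain; the port and the 120s deadline are assumptions.)

```bash
NODE_PORT=30080                 # placeholder
deadline=$((SECONDS + 120))     # generous deadline instead of a tight one
# Poll until kube-proxy has programmed a rule mentioning the NodePort; only
# after that should failed probes count against the test.
until iptables -t nat -S KUBE-NODEPORTS 2>/dev/null | grep -qw "$NODE_PORT"; do
  if [ "$SECONDS" -ge "$deadline" ]; then
    echo "NodePort rule never programmed" >&2
    exit 1
  fi
  sleep 2
done
echo "rule present after ~${SECONDS}s"
```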
It seems that the packet never gets DNATed on the receiving host, so it does not reach the pod on the node. |
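(One way to check that on the receiving node; a sketch with a placeholder port, using `conntrack` from conntrack-tools:)

```bash
NODE_PORT=30080   # placeholder
# Is there a DNAT rule for this NodePort in the nat table at all?
iptables -t nat -S | grep -w "$NODE_PORT"
# Does conntrack show the UDP flow rewritten to a pod IP? If the entry's
# reply side still carries the node IP instead of a pod IP, DNAT never happened.
conntrack -L -p udp --dport "$NODE_PORT"
```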
/triage accepted |
Yesterday's flake - https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce/1318719364404350976 I don't think I'm seeing the log messages from the latest logging PR you made, @aojea? Or are the prow logs not logging at a high enough level? |
Logs are there, please bear with me:
click on the job failure and you'll see a larger output where you can see the server pod name:
and get the node where the pod is running:
and the client pod name and its node:
and the flow we are looking for:
with the node name we can get the kube-proxy logs there, clicking on "Artifacts": and you can see the "new" conntrack log entries
and in the client: wow, there is a kernel bug there @wojtek-t @BenTheElder, can it be related? 🤔
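(For context on why those conntrack entries matter, a sketch, not the exact kube-proxy invocation: a stale UDP conntrack entry created against the old pod keeps steering datagrams to a dead backend until it is flushed.)

```bash
NODE_PORT=30080   # placeholder
# An entry still pointing at the deleted pod's IP means new datagrams are
# being steered to a dead backend.
conntrack -L -p udp --dport "$NODE_PORT"
# Delete those entries so the next datagram gets re-DNATed to the new
# endpoint; kube-proxy performs an equivalent cleanup when UDP endpoints change.
conntrack -D -p udp --dport "$NODE_PORT"
```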
|
wohoo - some progress :) |
This kernel panic may be a coincidence; maybe this is the reason: #96174 |
Spotted another flake today: |
Another flake in the 5k job: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1351862443671818240. |
Which jobs are flaking:
ci-kubernetes-e2e-gce-scale-correctness
Which test(s) are flaking:
[sig-network] Services should be able to preserve UDP traffic when server pod cycles for a NodePort service
Testgrid link:
https://k8s-testgrid.appspot.com/sig-release-master-informing#gce-master-scale-correctness
Reason for failure:
/sig network
/assign @thockin