
[WIP][DO NOT REVIEW] debug ipv6 jobs flakiness #85727

Closed
wants to merge 5 commits into from

Conversation

aojea
Member

@aojea aojea commented Nov 28, 2019

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespace from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Contributor

@aojea: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 28, 2019
@aojea
Member Author

aojea commented Nov 28, 2019

/test pull-kubernetes-e2e-kind-ipv6

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 28, 2019
@aojea
Member Author

aojea commented Nov 28, 2019

One occurrence of the lock error:
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/85727/pull-kubernetes-e2e-kind-ipv6/1200055488255889408/artifacts/logs/kind-worker/containers/kube-proxy-dqm9h_kube-system_kube-proxy-05f7dc90bcab173c17450e719f70981f922ff5d2435ef1c11dd6ea5e3871d72c.log

2019-11-28T14:25:28.895549562Z stderr F I1128 14:25:28.894170       1 iptables.go:433] running iptables: ip6tables [-w 5 -N KUBE-SERVICES -t filter]
2019-11-28T14:25:28.92091182Z stderr F I1128 14:25:28.919878       1 iptables.go:433] running iptables: ip6tables [-w 5 -C FORWARD -t filter -m conntrack --ctstate NEW -m comment --comment kubernetes service portals -j KUBE-SERVICES]
2019-11-28T14:25:29.416469776Z stderr F I1128 14:25:29.416314       1 config.go:167] Calling handler.OnEndpointsUpdate
2019-11-28T14:25:29.891298918Z stderr F I1128 14:25:29.891036       1 config.go:167] Calling handler.OnEndpointsUpdate
2019-11-28T14:25:31.482877247Z stderr F I1128 14:25:31.482713       1 config.go:167] Calling handler.OnEndpointsUpdate
2019-11-28T14:25:31.933823087Z stderr F I1128 14:25:31.933663       1 config.go:167] Calling handler.OnEndpointsUpdate
2019-11-28T14:25:33.566210669Z stderr F I1128 14:25:33.565765       1 config.go:368] Calling handler.OnServiceDelete
2019-11-28T14:25:33.596028909Z stderr F I1128 14:25:33.590488       1 config.go:186] Calling handler.OnEndpointsDelete
2019-11-28T14:25:33.596109646Z stderr F I1128 14:25:33.590543       1 endpoints.go:376] Setting endpoints for "svc-latency-7239/latency-svc-22nhg:" to [[fd00:10:244:0:1::24]:80]
2019-11-28T14:25:33.596174131Z stderr F I1128 14:25:33.590583       1 config.go:167] Calling handler.OnEndpointsUpdate
2019-11-28T14:25:33.62145753Z stderr F I1128 14:25:33.620587       1 config.go:368] Calling handler.OnServiceDelete
2019-11-28T14:25:33.628043945Z stderr F I1128 14:25:33.627845       1 config.go:186] Calling handler.OnEndpointsDelete
2019-11-28T14:25:33.628081874Z stderr F I1128 14:25:33.627884       1 endpoints.go:376] Setting endpoints for "svc-latency-7239/latency-svc-27fkr:" to [[fd00:10:244:0:1::24]:80]
2019-11-28T14:25:33.678079801Z stderr F I1128 14:25:33.677804       1 config.go:186] Calling handler.OnEndpointsDelete
2019-11-28T14:25:33.678132379Z stderr F I1128 14:25:33.677853       1 endpoints.go:376] Setting endpoints for "svc-latency-7239/latency-svc-2955r:" to [[fd00:10:244:0:1::24]:80]
2019-11-28T14:25:33.678143006Z stderr F I1128 14:25:33.677804       1 config.go:368] Calling handler.OnServiceDelete
2019-11-28T14:25:33.7032062Z stderr F I1128 14:25:33.702742       1 config.go:368] Calling handler.OnServiceDelete
2019-11-28T14:25:33.725827208Z stderr F I1128 14:25:33.723816       1 config.go:186] Calling handler.OnEndpointsDelete
2019-11-28T14:25:33.725871283Z stderr F I1128 14:25:33.723864       1 endpoints.go:376] Setting endpoints for "svc-latency-7239/latency-svc-2b5mj:" to [[fd00:10:244:0:1::24]:80]
2019-11-28T14:25:33.758071667Z stderr F I1128 14:25:33.748959       1 config.go:368] Calling handler.OnServiceDelete
2019-11-28T14:25:33.758112006Z stderr F I1128 14:25:33.755893       1 config.go:186] Calling handler.OnEndpointsDelete
2019-11-28T14:25:33.758120546Z stderr F I1128 14:25:33.755967       1 endpoints.go:376] Setting endpoints for "svc-latency-7239/latency-svc-2bz7d:" to [[fd00:10:244:0:1::24]:80]
2019-11-28T14:25:33.79957753Z stderr F I1128 14:25:33.799361       1 config.go:368] Calling handler.OnServiceDelete
2019-11-28T14:25:33.819839508Z stderr F I1128 14:25:33.819457       1 config.go:186] Calling handler.OnEndpointsDelete
2019-11-28T14:25:33.81987256Z stderr F I1128 14:25:33.819503       1 endpoints.go:376] Setting endpoints for "svc-latency-7239/latency-svc-2ddl4:" to [[fd00:10:244:0:1::24]:80]
2019-11-28T14:25:33.859934462Z stderr F I1128 14:25:33.851983       1 config.go:368] Calling handler.OnServiceDelete
2019-11-28T14:25:33.87294907Z stderr F I1128 14:25:33.871635       1 config.go:186] Calling handler.OnEndpointsDelete
2019-11-28T14:25:33.872991151Z stderr F I1128 14:25:33.871685       1 endpoints.go:376] Setting endpoints for "svc-latency-7239/latency-svc-2jqfs:" to [[fd00:10:244:0:1::24]:80]
2019-11-28T14:25:33.919451262Z stderr F I1128 14:25:33.919197       1 config.go:368] Calling handler.OnServiceDelete
2019-11-28T14:25:33.945424161Z stderr F I1128 14:25:33.938860       1 config.go:186] Calling handler.OnEndpointsDelete
2019-11-28T14:25:33.945455837Z stderr F I1128 14:25:33.938902       1 endpoints.go:376] Setting endpoints for "svc-latency-7239/latency-svc-2kgrn:" to [[fd00:10:244:0:1::24]:80]
2019-11-28T14:25:33.991270737Z stderr F I1128 14:25:33.991009       1 config.go:368] Calling handler.OnServiceDelete
2019-11-28T14:25:33.997073285Z stderr F I1128 14:25:33.994060       1 config.go:167] Calling handler.OnEndpointsUpdate
2019-11-28T14:25:34.004050918Z stderr F E1128 14:25:34.002437       1 proxier.go:800] Failed to ensure that filter chain FORWARD jumps to KUBE-SERVICES: error checking rule: exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
2019-11-28T14:25:34.004088792Z stderr F I1128 14:25:34.002479       1 proxier.go:784] Sync failed; retrying in 30s

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 28, 2019
@aojea
Member Author

aojea commented Nov 28, 2019

/test pull-kubernetes-e2e-kind-ipv6

1 similar comment
@aojea
Member Author

aojea commented Nov 29, 2019

/test pull-kubernetes-e2e-kind-ipv6

@aojea
Member Author

aojea commented Nov 29, 2019

/test pull-kubernetes-e2e-kind-ipv6

1 similar comment
@aojea
Member Author

aojea commented Nov 29, 2019

/test pull-kubernetes-e2e-kind-ipv6

@aojea
Member Author

aojea commented Nov 29, 2019

/test pull-kubernetes-e2e-kind-ipv6

1 similar comment
@aojea
Member Author

aojea commented Nov 29, 2019

/test pull-kubernetes-e2e-kind-ipv6

@aojea
Member Author

aojea commented Nov 29, 2019

/test pull-kubernetes-conformance-kind-ipv6

@aojea
Member Author

aojea commented Nov 29, 2019

/test pull-kubernetes-e2e-kind-ipv6

/test pull-kubernetes-conformance-kind-ipv6

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 30, 2019
@aojea
Member Author

aojea commented Nov 30, 2019

/test pull-kubernetes-e2e-kind-ipv6

/test pull-kubernetes-conformance-kind-ipv6

iptables has two options to modify its behaviour when trying to
acquire the lock:

--wait -w [seconds]         maximum wait to acquire the xtables lock
                            before giving up
--wait-interval -W [usecs]  interval to wait between attempts to
                            acquire the xtables lock;
                            the default is 1 second

Kubernetes uses -w 5, which means iptables waits up to 5 seconds to
acquire the lock. If kube-proxy cannot acquire it, the sync fails and
is retried in 30 seconds, which is a significant penalty for
latency-sensitive applications.
We can be a bit more aggressive and try to acquire the lock every
100 msec: with that interval, 50 consecutive attempts have to fail
within the 5-second window before iptables gives up.
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 8, 2019
@aojea
Member Author

aojea commented Dec 8, 2019

/test pull-kubernetes-e2e-kind-ipv6
/test pull-kubernetes-conformance-kind-ipv6

@aojea
Member Author

aojea commented Dec 8, 2019

/test pull-kubernetes-e2e-kind-ipv6
/test pull-kubernetes-conformance-kind-ipv6

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aojea
To complete the pull request process, please assign danwinship, feiskyer, random-liu
You can assign the PR to them by writing /assign @danwinship @feiskyer @random-liu in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

@aojea: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-bazel-test 022eacf link /test pull-kubernetes-bazel-test
pull-kubernetes-e2e-gce 022eacf link /test pull-kubernetes-e2e-gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@aojea
Member Author

aojea commented Dec 8, 2019

/test pull-kubernetes-e2e-kind-ipv6
/test pull-kubernetes-conformance-kind-ipv6

@aojea
Member Author

aojea commented Dec 8, 2019

/test pull-kubernetes-e2e-kind-ipv6
/test pull-kubernetes-conformance-kind-ipv6

No lock errors in any of the kube-proxy logs during the last 4 jobs. Let's see if we have a winner now; the lock errors seem to have been caused by the contention created with containerd/cri#1352 and containernetworking/plugins#418

Solved with this workaround: 022eacf

@aojea
Member Author

aojea commented Dec 9, 2019

/test pull-kubernetes-e2e-kind-ipv6
/test pull-kubernetes-conformance-kind-ipv6

4 similar comments
@aojea
Member Author

aojea commented Dec 9, 2019

/test pull-kubernetes-e2e-kind-ipv6
/test pull-kubernetes-conformance-kind-ipv6

@aojea
Member Author

aojea commented Dec 9, 2019

/test pull-kubernetes-e2e-kind-ipv6
/test pull-kubernetes-conformance-kind-ipv6

@aojea
Member Author

aojea commented Dec 9, 2019

/test pull-kubernetes-e2e-kind-ipv6
/test pull-kubernetes-conformance-kind-ipv6

@aojea
Member Author

aojea commented Dec 10, 2019

/test pull-kubernetes-e2e-kind-ipv6
/test pull-kubernetes-conformance-kind-ipv6

@aojea
Member Author

aojea commented Dec 15, 2019

/test pull-kubernetes-e2e-kind-ipv6
/test pull-kubernetes-conformance-kind-ipv6

@liggitt liggitt added the area/deflake Issues or PRs related to deflaking kubernetes tests label Dec 16, 2019
@aojea
Member Author

aojea commented Dec 16, 2019

/test pull-kubernetes-e2e-kind-ipv6
/test pull-kubernetes-conformance-kind-ipv6

@aojea
Member Author

aojea commented Jan 16, 2020

/close
let's see if the portmap patch fixes the problem, hard to tell with current CI status, but will keep monitoring this

@k8s-ci-robot
Contributor

@aojea: Closed this PR.

In response to this:

/close
let's see if the portmap patch fixes the problem, hard to tell with current CI status, but will keep monitoring this


@k8s-ci-robot
Contributor

@aojea: PR needs rebase.


@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 16, 2020
@aojea aojea deleted the flakiness_ipv6 branch January 16, 2020 18:01
Labels
area/deflake Issues or PRs related to deflaking kubernetes tests area/kubeadm area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.