
[Failing test] [sig-node] regular resource usage tracking resource tracking for 100 pods per node #75039

Open · mariantalla opened this issue on Mar 6, 2019 · 18 comments

mariantalla (Contributor) commented Mar 6, 2019

Which jobs are failing:

Which test(s) are failing:
[k8s.io] [sig-node] Kubelet [Serial] [Slow] [k8s.io] [sig-node] regular resource usage tracking resource tracking for 100 pods per node

Since when has it been failing:
2019-03-01

Reason for failure:
Not all expected pods come up:

```
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/node/kubelet_perf.go:263
Expected error:
    <*errors.errorString | 0xc00103a5e0>: {
        s: "Only 296 pods started out of 300",
    }
    Only 296 pods started out of 300
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/node/kubelet_perf.go:79
```

(The exact number of started pods, 296 here, varies from run to run.)
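
For anyone trying to reproduce this outside the CI job, here is a minimal sketch of focusing the e2e suite on just this spec; the focus regex and flags below are assumptions, not the job's exact configuration:

```sh
# Assumes a test cluster is already up (e.g. via kubetest). Focus ginkgo on
# this single [Serial] [Slow] spec; the focus regex is an assumption, adjust
# it to match the spec name exactly.
./hack/ginkgo-e2e.sh \
  --ginkgo.focus='regular resource usage tracking resource tracking for 100 pods per node' \
  --ginkgo.skip='\[Flaky\]' \
  --report-dir=/tmp/artifacts
```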

Anything else we need to know:
Might be related: #74917

/kind failing-test
/sig node
/priority critical-urgent
(for now, because it's a consistent failure in a release-blocking test job)
/milestone v1.14

cc @smourapina @alejandroEsc @kacole2 @mortent

mariantalla added this to New (no response yet) in 1.15 CI Signal on Mar 6, 2019

mariantalla (Contributor, Author) commented Mar 7, 2019

@dchen1107 @derekwaynecarr Could you help with triaging this? It has been failing for 5 days and it's a release-blocking test.

soggiest (Contributor) commented Mar 8, 2019

Hello! We are in code freeze for 1.14. Since this test is release-blocking, can we get some more eyes on this issue?

@dchen1107 @derekwaynecarr

ibrasho (Member) commented Mar 9, 2019

@derekwaynecarr is not available currently (hope the kid gets better soon ❤️).

Can anyone else in @kubernetes/sig-node-test-failures help with this? (@Random-Liu, can you shed more light on what change could have caused this?)

dchen1107 (Member) commented Mar 12, 2019

@dashpole Can you please triage the failing test? Thanks!

mariantalla (Contributor, Author) commented Mar 12, 2019

The 0 pods per node test is also flaking in the upgrade dashboards (not as heavily; see testgrid), due to memory usage exceeding the expected limits.

Issue: #75298 (cc'ed you to that too @dashpole , hope that's ok)

dashpole (Contributor) commented Mar 12, 2019

```
I0312 17:17:54.144] Mar 12 17:17:54.098: INFO: At 2019-03-12 17:10:17 +0000 UTC - event for resource300-ad3d2e9b-44e9-11e9-9a99-c2a9ffc1bdd9-5lbnl: {default-scheduler } FailedScheduling: 0/4 nodes are available: 1 node(s) were unschedulable, 3 Insufficient pods.
```

It looks like we must have added an additional pod to each node at some point... I'll dig and try to find which pod that is.
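
As a side note, "Insufficient pods" means the nodes' allocatable pod count (set by kubelet --max-pods, 110 by default) is already used up. A quick way to confirm, assuming kubectl access to the test cluster (a sketch, not taken from the job itself):

```sh
# Allocatable pod capacity per node (driven by kubelet --max-pods, default 110)
kubectl get nodes -o custom-columns='NAME:.metadata.name,PODS:.status.allocatable.pods'

# Pods already scheduled on a given node; NODE is a placeholder for one of the
# node names printed by the command above
kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$NODE"
```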

dashpole (Contributor) commented Mar 12, 2019

It looks like these tests are being run in parallel? I see a bunch of pods from other tests on the node:

```
Logging pods the kubelet thinks is on node bootstrap-e2e-minion-group-m6dd
kube-dns-autoscaler-97df449df-v6qzs
kube-proxy-bootstrap-e2e-minion-group-m6dd-c2a9ffc1bdd9-5ntx4
res-cons-upgrade-qwqcd
ds1-m696l
res-cons-upgrade-ctrl-tn5tx
service-test-2drlj
metadata-proxy-v0.1-jfzjq
ss-1
echoheaders-https-twrs9
metrics-server-v0.3.1-58d65f8d6-g9hp4
foo-s9wdz
kubernetes-dashboard-66bb48f98c-sdmtf
fluentd-gcp-v3.2.0-snrxn
```

For example, res-cons-upgrade is from the HPA test.
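
A sketch for tracing a stray pod back to whatever created it, assuming kubectl access; add -n <namespace> if the pod lives outside the default namespace:

```sh
# Show which controller owns the pod; the pod name is copied from the
# kubelet listing above.
kubectl get pod res-cons-upgrade-qwqcd \
  -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}/{.name}{"\n"}{end}'

# Labels are usually enough to identify the originating e2e test as well
kubectl get pod res-cons-upgrade-qwqcd --show-labels
```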

dashpole (Contributor) commented Mar 12, 2019

Either way, this is not release blocking, as the test is only failing because of improper test setup. Tests marked [Serial] must be run serially, and often rely on all node resources (including Pods) being available.

dashpole (Contributor) commented Mar 12, 2019

/priority important-soon

cc @krzyzacy @justinsb, who have worked on this test recently and may have context. This coincides with the Go 1.12 update (kubernetes/test-infra#11583). It looks like the test suite is incorrectly running tests in parallel, but I'm not sure how the two are related.

krzyzacy (Member) commented Mar 12, 2019

/cc @amwat
poking

krzyzacy (Member) commented Mar 12, 2019

I don't think it's running in ~~serial~~ parallel 🤦‍♂️ (thanks @justinsb)?

```
Running: ./hack/ginkgo-e2e.sh --ginkgo.focus=\[Slow\]|\[Serial\]|\[Disruptive\] --ginkgo.skip=\[Flaky\]|\[Feature:.+\] --kubectl-path=../../../../kubernetes_skew/cluster/kubectl.sh --minStartupPods=8 --report-dir=/workspace/_artifacts --disable-log-dump=true

I0312 08:32:49.944] I0312 08:32:49.944137   35037 e2e.go:224] Starting e2e run "6b453d84-44a1-11e9-9a99-c2a9ffc1bdd9" on Ginkgo node 1
I0312 08:32:50.090] Running Suite: Kubernetes e2e suite
```

As opposed to the parallel suite, where you can see logs like:

```
W0312 22:26:34.679] 2019/03/12 22:26:34 process.go:153: Running: ./hack/ginkgo-e2e.sh --ginkgo.skip=\[Slow\]|\[Serial\]|\[Disruptive\]|\[Flaky\]|\[Feature:.+\] --minStartupPods=8 --report-dir=/workspace/_artifacts --disable-log-dump=true --cluster-ip-range=10.64.0.0/14

...

I0312 22:26:45.193] Running Suite: Kubernetes e2e suite
I0312 22:26:45.193] ===================================
I0312 22:26:45.193] Random Seed: 1552429595 - Will randomize all specs
I0312 22:26:45.193] Will run 3582 specs
I0312 22:26:45.193]
I0312 22:26:45.194] Running in parallel across 30 nodes
```

I'm more worried that the actual UpgradeTest itself is panicking; anything running afterwards will not be a valid signal.
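
A quick way to double-check this against a downloaded build-log.txt (a sketch; both strings come straight from the log excerpts above):

```sh
# A serial run starts a single ginkgo node; a parallel run additionally prints
# "Running in parallel across N nodes".
grep -E 'Starting e2e run|Running in parallel across' build-log.txt
```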

amwat commented Mar 12, 2019

I think it might be related to the go1.12 update; there are other failures in the upgrade jobs that are legitimate failures: #74890

justinsb (Member) commented Mar 13, 2019

@krzyzacy I think you meant "I don't think it is running in parallel"?

That was the conclusion I came to also - that there was only one ginkgo runner visible.

Both jobs are showing the scary "fatal error: sync: inconsistent mutex state" panic, which #74890 is a best-guess effort to solve. It really is just trying to gather more data, though; it's not really a fix...

mariantalla (Contributor, Author) commented Mar 13, 2019

/remove-priority critical-urgent

fyi #75305 (investigation/speculative workaround for golang 1.12 related issues) is currently merging. Will keep an eye on changes 👀

mariantalla (Contributor, Author) commented Mar 13, 2019

@dashpole - as we can't rely on this test for CI signal for now, are there other tests that cover the same use cases? e.g. is node-kubelet-benchmark similar?

In other words, if someone asks whether node resource usage has deteriorated with the 1.14 release, where can we look to confidently answer that question?

dashpole (Contributor) commented Mar 13, 2019

Yeah, the node-kubelet-benchmark is the test suite we rely on for resource tracking signal.

spiffxp (Member) commented Mar 16, 2019