
gce-serial tests are panicking due to NPE in autoscaling code #47127

Closed
smarterclayton opened this issue Jun 7, 2017 · 16 comments · Fixed by #47157
Labels
kind/failing-test · sig/autoscaling · sig/testing
Milestone
v1.7
Comments

@smarterclayton (Contributor)

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1173

I0607 01:09:38.349] •! Panic [32.767 seconds]
I0607 01:09:38.349] [k8s.io] Initializers
I0607 01:09:38.349] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:617
I0607 01:09:38.349]   should dynamically register and apply initializers to pods [Serial] [It]
I0607 01:09:38.349]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/extension/initializers.go:206
I0607 01:09:38.349]
I0607 01:09:38.350]   Test Panicked
I0607 01:09:38.350]   runtime error: invalid memory address or nil pointer dereference
I0607 01:09:38.350]   /usr/local/go/src/runtime/asm_amd64.s:514
I0607 01:09:38.350]
I0607 01:09:38.350]   Full Stack Trace
I0607 01:09:38.350]   	/usr/local/go/src/runtime/panic.go:489 +0x2cf
I0607 01:09:38.350]   k8s.io/kubernetes/test/e2e/common.(*ResourceConsumer).sendConsumeCPURequest.func1(0x2ef4ce0, 0xc421f65140, 0xc42114fd3f)
I0607 01:09:38.351]   	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/autoscaling_utils.go:231 +0x5f
@k8s-github-robot

@smarterclayton There are no sig labels on this issue. Please add a sig label by:
(1) mentioning a sig: @kubernetes/sig-<team-name>-misc
(2) specifying the label manually: /sig <label>

Note: method (1) will trigger a notification to the team. You can find the team list here and label list here

@k8s-github-robot added the needs-sig label Jun 7, 2017
@smarterclayton (Contributor, Author)

#46833 might be a culprit, but maybe not

@smarterclayton added sig/autoscaling, sig/testing, and kind/failing-test labels Jun 7, 2017
@smarterclayton added this to the v1.7 milestone Jun 7, 2017
@k8s-github-robot removed the needs-sig label Jun 7, 2017
@smarterclayton (Contributor, Author)

This could be related to how the test initializes the framework or cleans it up.

@smarterclayton (Contributor, Author)

Another failure in the same run, so it's not just this one test:

I0606 21:26:15.687] •! Panic [756.712 seconds]
I0606 21:26:15.688] [k8s.io] Network Partition [Disruptive] [Slow]
I0606 21:26:15.688] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:617
I0606 21:26:15.688]   [k8s.io] [ReplicationController]
I0606 21:26:15.688]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:617
I0606 21:26:15.688]     should eagerly create replacement pod during network partition when termination grace is non-zero [It]
I0606 21:26:15.688]     /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network_partition.go:331
I0606 21:26:15.688]
I0606 21:26:15.688]     Test Panicked
I0606 21:26:15.688]     runtime error: invalid memory address or nil pointer dereference
I0606 21:26:15.688]     /usr/local/go/src/runtime/asm_amd64.s:514
I0606 21:26:15.689]
I0606 21:26:15.689]     Full Stack Trace
I0606 21:26:15.689]     	/usr/local/go/src/runtime/panic.go:489 +0x2cf
I0606 21:26:15.689]     k8s.io/kubernetes/test/e2e/common.(*ResourceConsumer).sendConsumeCPURequest.func1(0x2ef4ce0, 0xc420eae2a0, 0xc4217c5d3f)
I0606 21:26:15.689]     	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/autoscaling_utils.go:231 +0x5f
I0606 21:26:15.689]     k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.WaitFor(0xc420d2cb20, 0xc421b27200, 0xc420eae240, 0xc421abe780, 0x365130f)
I0606 21:26:15.689]     	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:275 +0x70
I0606 21:26:15.689]     k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.pollInternal(0xc420d2cb20, 0xc421b27200, 0x0, 0x0)
I0606 21:26:15.689]     	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:191 +0x99
I0606 21:26:15.690]     k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.pollImmediateInternal(0xc420d2cb20, 0xc421b27200, 0xc420d2cb20, 0xc421b27200)
I0606 21:26:15.690]     	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:216 +0x76
I0606 21:26:15.690]     k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.PollImmediate(0x37e11d600, 0x1bf08eb000, 0xc421b27200, 0x5272fe0, 0xc420d6b980)
I0606 21:26:15.690]     	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:205 +0x4d
I0606 21:26:15.690]     k8s.io/kubernetes/test/e2e/common.(*ResourceConsumer).sendConsumeCPURequest(0xc4215f0ab0, 0xfa)
I0606 21:26:15.690]     	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/autoscaling_utils.go:244 +0x1d6
I0606 21:26:15.690]     k8s.io/kubernetes/test/e2e/common.(*ResourceConsumer).makeConsumeCPURequests(0xc4215f0ab0)
I0606 21:26:15.690]     	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/autoscaling_utils.go:175 +0x328
I0606 21:26:15.690]     created by k8s.io/kubernetes/test/e2e/common.newResourceConsumer
I0606 21:26:15.690]     	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/autoscaling_utils.go:137 +0x399
I0606 21:26:15.691]     
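An aside on reading this trace: `time.Duration` arguments show up in Go stack traces as raw nanosecond counts, so the `wait.PollImmediate(0x37e11d600, 0x1bf08eb000, ...)` frame means an interval of 15s and a timeout of 120s. A minimal sketch of that call shape follows; the condition body here is illustrative, not the actual request code in autoscaling_utils.go:

```go
// Illustrative only: the same wait.PollImmediate call shape seen in the
// trace. 0x37e11d600 ns = 15 * time.Second, 0x1bf08eb000 ns = 120 * time.Second.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	err := wait.PollImmediate(15*time.Second, 120*time.Second, func() (bool, error) {
		// Stand-in for the "consume CPU" HTTP request; return true to stop
		// polling, false to retry until the 120s timeout expires.
		fmt.Println("sending request")
		return true, nil
	})
	fmt.Println("poll finished:", err)
}
```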

@wojtek-t (Member) commented Jun 7, 2017

@smarterclayton (Contributor, Author)

I'm not positive, but the stack trace might be misleading because of inlining optimizations. `rc.framework == nil` was my first guess, but that would fail earlier in the method.

@smarterclayton (Contributor, Author)

The framework's AfterEach runs namespace deletion, which may mean that `.Namespace` is nil by the time the request is sent (if inlining really is hiding the true frame).
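To make that hypothesis concrete, here is a minimal sketch of the suspected failure mode. The `Framework`, `Namespace`, and `ResourceConsumer` types below are simplified stand-ins, not the real e2e framework code:

```go
// Minimal sketch: a sender dereferencing a namespace that cleanup has
// already cleared. These types are stand-ins, not the real test types.
package main

import "fmt"

type Namespace struct{ Name string }
type Framework struct{ Namespace *Namespace }
type ResourceConsumer struct{ framework *Framework }

func (rc *ResourceConsumer) sendConsumeCPURequest(millicores int) {
	// If AfterEach has already deleted the namespace and cleared the field,
	// this dereference panics with "invalid memory address or nil pointer
	// dereference", matching the runtime error in the logs above.
	ns := rc.framework.Namespace.Name
	fmt.Printf("ConsumeCPU(%d) -> namespace %q\n", millicores, ns)
}

func main() {
	rc := &ResourceConsumer{framework: &Framework{Namespace: &Namespace{Name: "e2e-tests"}}}
	rc.sendConsumeCPURequest(250) // fine while the namespace exists

	rc.framework.Namespace = nil  // simulates cleanup racing ahead of the sender
	rc.sendConsumeCPURequest(250) // panics deliberately
}
```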

@wasylkowski

I am looking into this right now.

@wasylkowski

It's unrelated. I reverted #46833 from the current master HEAD locally (via `git revert -m 1 07e4cca7b3`), ran `make`, and then ran `go run hack/e2e.go -v -test --test_args='--ginkgo.focus=Initializers\sshould\sdynamically\sregister\sand\sapply\sinitializers\sto\spods\s\[Serial\]$'`; it did not help. The test still fails with the same message "Expected <*v1.Initializers | 0x0>: nil not to be nil" (location: /usr/local/google/home/wasylkowski/.gvm/pkgsets/go1.8.3/global/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/extension/initializers.go:171).

@wasylkowski

Note that I executed one of the initializers e2e tests (not autoscaling), because it is one of the tests reported as panicked by https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1173.

@wasylkowski

Looking a bit further, I found that commit 772ab8e added the test above in an already-failing state.

@wasylkowski

I have just realized that yesterday I discovered a race condition in the ResourceConsumer implementation, and #46833 might actually make it worse. Let me quickly turn the fix into a PR.

@wasylkowski

Let me try to recap what I have found out so far:

  • The test referred to at the top of this issue (e2e/extension/initializers.go) fails "out of the box" and probably deserves a separate issue to track it.
  • The other failing tests with a similar failure pattern (NPE) fail because they try to send a "consume CPU" request to a pod that no longer exists (unrelated to initializers.go above, as far as I can tell).
  • PR #46833 ("Fixed ResourceConsumer.CleanUp to properly clean up non-replication-controller resources and pods") might cause that effect, and probably does, because it now actually deletes the pods that were supposed to be deleted.
  • However, the root cause is a race condition inside ResourceConsumer that sometimes lets "consume CPU" requests be sent after CleanUp has already run; the fix is prepared in PR #47157 ("Removed a race condition from ResourceConsumer"). See the sketch below.
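A hedged sketch of the race described above, and one conventional way to close it: stop the sender goroutine and wait for it to exit before deleting anything. The `stopCh`/`WaitGroup` shape is a standard Go pattern; whether #47157 uses exactly these names is an assumption, not a claim about the actual patch.

```go
// Sketch of a cleanup-safe request loop. Without the stop channel, the
// background goroutine can keep sending requests after CleanUp has
// deleted the pods it talks to -- the race described in this issue.
package main

import (
	"fmt"
	"sync"
	"time"
)

type ResourceConsumer struct {
	stopCh chan struct{}
	wg     sync.WaitGroup
}

func NewResourceConsumer() *ResourceConsumer {
	rc := &ResourceConsumer{stopCh: make(chan struct{})}
	rc.wg.Add(1)
	go rc.makeConsumeCPURequests()
	return rc
}

func (rc *ResourceConsumer) makeConsumeCPURequests() {
	defer rc.wg.Done()
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-rc.stopCh:
			return // stop before CleanUp deletes the pods we talk to
		case <-ticker.C:
			fmt.Println("ConsumeCPU request") // stand-in for the real HTTP call
		}
	}
}

// CleanUp first stops the request goroutine and waits for it to exit,
// so no request can be sent to a pod that has already been deleted.
func (rc *ResourceConsumer) CleanUp() {
	close(rc.stopCh)
	rc.wg.Wait()
	fmt.Println("pods deleted") // stand-in for deleting the RC/pods
}

func main() {
	rc := NewResourceConsumer()
	time.Sleep(300 * time.Millisecond)
	rc.CleanUp()
}
```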

@smarterclayton (Contributor, Author) commented Jun 8, 2017 via email

@smarterclayton (Contributor, Author) commented Jun 8, 2017 via email

@wasylkowski

Clayton, can you assign this to me? I cannot do this myself.

dims pushed a commit to dims/kubernetes that referenced this issue Jun 8, 2017

Automatic merge from submit-queue (batch tested with PRs 47065, 47157, 47143)

Removed a race condition from ResourceConsumer

**What this PR does / why we need it**: Without this PR there is a race condition in ResourceConsumer that sometimes results in communication to pods that might not exist anymore.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes kubernetes#47127

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```