
gce-serial tests are panicking due to NPE in autoscaling code #47127

Closed
smarterclayton opened this issue Jun 7, 2017 · 16 comments · Fixed by #47157
Labels
kind/failing-test · sig/autoscaling · sig/testing
Milestone
v1.7
Comments

@smarterclayton (Contributor)

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1173

I0607 01:09:38.349] •! Panic [32.767 seconds]
I0607 01:09:38.349] [k8s.io] Initializers
I0607 01:09:38.349] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:617
I0607 01:09:38.349]   should dynamically register and apply initializers to pods [Serial] [It]
I0607 01:09:38.349]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/extension/initializers.go:206
I0607 01:09:38.349]
I0607 01:09:38.350]   Test Panicked
I0607 01:09:38.350]   runtime error: invalid memory address or nil pointer dereference
I0607 01:09:38.350]   /usr/local/go/src/runtime/asm_amd64.s:514
I0607 01:09:38.350]
I0607 01:09:38.350]   Full Stack Trace
I0607 01:09:38.350]   	/usr/local/go/src/runtime/panic.go:489 +0x2cf
I0607 01:09:38.350]   k8s.io/kubernetes/test/e2e/common.(*ResourceConsumer).sendConsumeCPURequest.func1(0x2ef4ce0, 0xc421f65140, 0xc42114fd3f)
I0607 01:09:38.351]   	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/autoscaling_utils.go:231 +0x5f
@k8s-github-robot

@smarterclayton There are no sig labels on this issue. Please add a sig label by:
(1) mentioning a sig: @kubernetes/sig-<team-name>-misc
(2) specifying the label manually: /sig <label>

Note: method (1) will trigger a notification to the team. You can find the team list here and label list here

@k8s-github-robot added the needs-sig label Jun 7, 2017
@smarterclayton (Contributor, Author)

#46833 might be a culprit, but maybe not

@smarterclayton added sig/autoscaling, sig/testing, and kind/failing-test labels Jun 7, 2017
@smarterclayton added this to the v1.7 milestone Jun 7, 2017
@k8s-github-robot removed the needs-sig label Jun 7, 2017
@smarterclayton (Contributor, Author)

This could be related to how the test initializes the framework or cleans it up.

@smarterclayton (Contributor, Author)

Another failure in the same run, so it's not just this one test:

I0606 21:26:15.687] •! Panic [756.712 seconds]
I0606 21:26:15.688] [k8s.io] Network Partition [Disruptive] [Slow]
I0606 21:26:15.688] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:617
I0606 21:26:15.688]   [k8s.io] [ReplicationController]
I0606 21:26:15.688]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:617
I0606 21:26:15.688]     should eagerly create replacement pod during network partition when termination grace is non-zero [It]
I0606 21:26:15.688]     /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network_partition.go:331
I0606 21:26:15.688]
I0606 21:26:15.688]     Test Panicked
I0606 21:26:15.688]     runtime error: invalid memory address or nil pointer dereference
I0606 21:26:15.688]     /usr/local/go/src/runtime/asm_amd64.s:514
I0606 21:26:15.689]
I0606 21:26:15.689]     Full Stack Trace
I0606 21:26:15.689]     	/usr/local/go/src/runtime/panic.go:489 +0x2cf
I0606 21:26:15.689]     k8s.io/kubernetes/test/e2e/common.(*ResourceConsumer).sendConsumeCPURequest.func1(0x2ef4ce0, 0xc420eae2a0, 0xc4217c5d3f)
I0606 21:26:15.689]     	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/autoscaling_utils.go:231 +0x5f
I0606 21:26:15.689]     k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.WaitFor(0xc420d2cb20, 0xc421b27200, 0xc420eae240, 0xc421abe780, 0x365130f)
I0606 21:26:15.689]     	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:275 +0x70
I0606 21:26:15.689]     k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.pollInternal(0xc420d2cb20, 0xc421b27200, 0x0, 0x0)
I0606 21:26:15.689]     	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:191 +0x99
I0606 21:26:15.690]     k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.pollImmediateInternal(0xc420d2cb20, 0xc421b27200, 0xc420d2cb20, 0xc421b27200)
I0606 21:26:15.690]     	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:216 +0x76
I0606 21:26:15.690]     k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.PollImmediate(0x37e11d600, 0x1bf08eb000, 0xc421b27200, 0x5272fe0, 0xc420d6b980)
I0606 21:26:15.690]     	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:205 +0x4d
I0606 21:26:15.690]     k8s.io/kubernetes/test/e2e/common.(*ResourceConsumer).sendConsumeCPURequest(0xc4215f0ab0, 0xfa)
I0606 21:26:15.690]     	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/autoscaling_utils.go:244 +0x1d6
I0606 21:26:15.690]     k8s.io/kubernetes/test/e2e/common.(*ResourceConsumer).makeConsumeCPURequests(0xc4215f0ab0)
I0606 21:26:15.690]     	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/autoscaling_utils.go:175 +0x328
I0606 21:26:15.690]     created by k8s.io/kubernetes/test/e2e/common.newResourceConsumer
I0606 21:26:15.690]     	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/autoscaling_utils.go:137 +0x399
I0606 21:26:15.691]     
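An aside on reading this trace: `time.Duration` arguments show up in Go stack traces as raw nanosecond counts, so the `wait.PollImmediate(0x37e11d600, 0x1bf08eb000, ...)` frame means an interval of 15s and a timeout of 120s. A minimal sketch of that call shape follows; the condition body here is illustrative, not the actual request code in autoscaling_utils.go:

```go
// Illustrative only: the same wait.PollImmediate call shape seen in the
// trace. 0x37e11d600 ns = 15 * time.Second, 0x1bf08eb000 ns = 120 * time.Second.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	err := wait.PollImmediate(15*time.Second, 120*time.Second, func() (bool, error) {
		// Stand-in for the "consume CPU" HTTP request; return true to stop
		// polling, false to retry until the 120s timeout expires.
		fmt.Println("sending request")
		return true, nil
	})
	fmt.Println("poll finished:", err)
}
```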

@wojtek-t (Member) commented Jun 7, 2017

@smarterclayton (Contributor, Author)

I'm not positive, but the stack trace might be misleading because of inlining optimizations. `rc.framework == nil` was my first guess, but that would fail earlier in the method.

@smarterclayton (Contributor, Author)

The framework's AfterEach runs namespace deletion, which may mean that `.Namespace` is nil by the time the request is sent (if inlining really is hiding the true frame).
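To make that hypothesis concrete, here is a minimal sketch of the suspected failure mode. The `Framework`, `Namespace`, and `ResourceConsumer` types below are simplified stand-ins, not the real e2e framework code:

```go
// Minimal sketch: a sender dereferencing a namespace that cleanup has
// already cleared. These types are stand-ins, not the real test types.
package main

import "fmt"

type Namespace struct{ Name string }
type Framework struct{ Namespace *Namespace }
type ResourceConsumer struct{ framework *Framework }

func (rc *ResourceConsumer) sendConsumeCPURequest(millicores int) {
	// If AfterEach has already deleted the namespace and cleared the field,
	// this dereference panics with "invalid memory address or nil pointer
	// dereference", matching the runtime error in the logs above.
	ns := rc.framework.Namespace.Name
	fmt.Printf("ConsumeCPU(%d) -> namespace %q\n", millicores, ns)
}

func main() {
	rc := &ResourceConsumer{framework: &Framework{Namespace: &Namespace{Name: "e2e-tests"}}}
	rc.sendConsumeCPURequest(250) // fine while the namespace exists

	rc.framework.Namespace = nil  // simulates cleanup racing ahead of the sender
	rc.sendConsumeCPURequest(250) // panics deliberately
}
```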

@wasylkowski

I am looking into this right now.

@wasylkowski

It's unrelated. I reverted #46833 from the current master HEAD locally (via `git revert -m 1 07e4cca7b3`), ran `make`, and then ran `go run hack/e2e.go -v -test --test_args='--ginkgo.focus=Initializers\sshould\sdynamically\sregister\sand\sapply\sinitializers\sto\spods\s\[Serial\]$'`; it did not help. The test still fails with the same message "Expected <*v1.Initializers | 0x0>: nil not to be nil" (location: /usr/local/google/home/wasylkowski/.gvm/pkgsets/go1.8.3/global/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/extension/initializers.go:171).

@wasylkowski

Note that I executed one of the initializers e2e tests (not autoscaling), because it is one of the tests reported as panicked by https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1173.

@wasylkowski

Looking a bit further, I found that commit 772ab8e added the test above in an already-failing state.

@wasylkowski

I have just realized that yesterday I discovered a race condition in the ResourceConsumer implementation, and #46833 might actually make it worse. Let me quickly turn the fix into a PR.

@wasylkowski

Let me try to recap what I have found out so far:

  • The test referred to at the top of this issue (e2e/extension/initializers.go) fails "out of the box" and probably deserves a separate issue to track it.
  • The other failing tests with a similar failure pattern (NPE) fail because they try to send a "consume CPU" request to a pod that no longer exists (unrelated to initializers.go above, as far as I can tell).
  • PR #46833 ("Fixed ResourceConsumer.CleanUp to properly clean up non-replication-controller resources and pods") might cause that effect, and probably does, because it now actually deletes the pods that were supposed to be deleted.
  • However, the root cause is a race condition inside ResourceConsumer that sometimes lets "consume CPU" requests be sent after CleanUp has already run; the fix is prepared in PR #47157 ("Removed a race condition from ResourceConsumer"). See the sketch below.
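A hedged sketch of the race described above, and one conventional way to close it: stop the sender goroutine and wait for it to exit before deleting anything. The `stopCh`/`WaitGroup` shape is a standard Go pattern; whether #47157 uses exactly these names is an assumption, not a claim about the actual patch.

```go
// Sketch of a cleanup-safe request loop. Without the stop channel, the
// background goroutine can keep sending requests after CleanUp has
// deleted the pods it talks to -- the race described in this issue.
package main

import (
	"fmt"
	"sync"
	"time"
)

type ResourceConsumer struct {
	stopCh chan struct{}
	wg     sync.WaitGroup
}

func NewResourceConsumer() *ResourceConsumer {
	rc := &ResourceConsumer{stopCh: make(chan struct{})}
	rc.wg.Add(1)
	go rc.makeConsumeCPURequests()
	return rc
}

func (rc *ResourceConsumer) makeConsumeCPURequests() {
	defer rc.wg.Done()
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-rc.stopCh:
			return // stop before CleanUp deletes the pods we talk to
		case <-ticker.C:
			fmt.Println("ConsumeCPU request") // stand-in for the real HTTP call
		}
	}
}

// CleanUp first stops the request goroutine and waits for it to exit,
// so no request can be sent to a pod that has already been deleted.
func (rc *ResourceConsumer) CleanUp() {
	close(rc.stopCh)
	rc.wg.Wait()
	fmt.Println("pods deleted") // stand-in for deleting the RC/pods
}

func main() {
	rc := NewResourceConsumer()
	time.Sleep(300 * time.Millisecond)
	rc.CleanUp()
}
```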

@smarterclayton (Contributor, Author) commented Jun 8, 2017 via email

@smarterclayton (Contributor, Author) commented Jun 8, 2017 via email

@wasylkowski

Clayton, can you assign this to me? I cannot do this myself.

dims pushed a commit to dims/kubernetes that referenced this issue Jun 8, 2017

Automatic merge from submit-queue (batch tested with PRs 47065, 47157, 47143)

Removed a race condition from ResourceConsumer

**What this PR does / why we need it**: Without this PR there is a race condition in ResourceConsumer that sometimes results in communication to pods that might not exist anymore.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes kubernetes#47127

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```