[k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite} #31981
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-gce-scale/1385/ Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/5522/ Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/5529/ Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}
[FLAKE-PING] @wojtek-t This flaky-test issue would love to have more attention.
Hmm - something strange is happening in those kubemark-500 failures. In the first failure, it seems that controller-manager wasn't even aware of the decrease in the number of replicas:
apiserver:
controller-manager:
So it seems that controller-manager didn't even receive the update from 250 to 211 replicas; it only received the update to 146 replicas. On the other hand, in this failure: test:
apiserver:
I've never seen anything like that in the past, so it seems like a recent regression. But I don't see anything suspicious being merged in the last few days. @kubernetes/sig-api-machinery @lavalamp
#31981 (comment) is addressed by #32081. The other 3 failures are still under investigation.
Sorry - I missed something in the logs for the second run. So actually all three runs (except the one that is being fixed by #32081) are instances of exactly the same problem:
What is suspicious to me:
So my hypothesis is that it's related to RC, but this is currently just a hypothesis.
…ntroller Automatic merge from submit-queue: NodeController listing nodes from apiserver cache. Ref #31981. This is addressing this particular failure: #31981 (comment)
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/5600/ Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}
The last failure is exactly the same as the other three, so we continue debugging the same issue.
from apiserver:
from controller:
So at the time when the scale from 5 to 3 happened, the ReplicationController received some update. However, the log we have isn't exactly what we need, so there are two possibilities:
#32100 is supposed to fix the log so that we can debug it further.
Automatic merge from submit-queue: Fix debugging log in RC. Ref #31981
For now, we are waiting for the next failure (now with the correct logs) to narrow down the reason for the failure.
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/5610/ Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}
OK - so we seem to have enough logs to start real debugging now.
There weren't any other PUT events on our replication controller around that time. From the above logs, it's pretty clear that we correctly deliver the watch event that changes the size of the RC from 5 to 4, but then some new event comes in that scales the RC back from 4 to 5. So in my opinion, the problem is either in RC itself or somewhere in the controller framework. @lavalamp @bprashanth FYI ^^
Note that the difference between processing those two events is 0.15ms, so they are basically processed one right after the other. My guess is that there are two events on the underlying DeltaFIFO and they are just processed one after the other. The other event might potentially be a resync, but I still don't see how it could be an "incorrect" sync event.
Looking into the RC logs from around that time, there is a huge amount of log lines like this:
[where the number of replicas didn't change in the spec] Thus my very strong hypothesis is that there was a Resync at that time, which probably means that we have some race between Resync and Update... @lavalamp ^^
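To make the suspected race concrete, here is a minimal, self-contained Go sketch (not the actual client-go DeltaFIFO; the queue and store below are hypothetical simplifications) showing how a periodic Resync that reads from the informer's known-object store can enqueue a stale Sync delta behind an already-queued Update, so the consumer briefly observes the old replica count again:

```go
package main

import "fmt"

// delta mimics a DeltaFIFO entry: an event type plus the object state it carried.
type delta struct {
	kind     string // "Update" or "Sync"
	replicas int
}

func main() {
	// knownStore stands in for the informer's known-object store; it is only
	// updated when a queued delta is processed, not when it is enqueued.
	knownStore := 5
	var queue []delta

	// 1. A watch Update arrives: the RC was scaled from 5 to 4 replicas.
	queue = append(queue, delta{"Update", 4})

	// 2. Before that Update is processed, a periodic Resync runs. It reads the
	//    *store* (still 5) and enqueues a Sync delta carrying the stale state.
	queue = append(queue, delta{"Sync", knownStore})

	// 3. The consumer drains the queue: it sees 5 -> 4 (Update), then 4 -> 5 (Sync),
	//    which is exactly the "scale back up" observed in the controller-manager logs.
	for _, d := range queue {
		fmt.Printf("%s: %d -> %d replicas\n", d.kind, knownStore, d.replicas)
		knownStore = d.replicas
	}
}
```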
OK - so I think I understand what is happening here.
I think the solution is to not append a "Sync" event if there is already one queued in DeltaFIFO for that object. @lavalamp - FYI
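A minimal sketch of what such a guard could look like, reusing the hypothetical queue shape from the previous snippet (this is not the real DeltaFIFO code; `fakeFIFO`, `enqueueSync`, and `default/my-rc` are illustrative names only):

```go
package main

import "fmt"

type delta struct {
	kind     string
	replicas int
}

// fakeFIFO is a hypothetical stand-in for DeltaFIFO, keyed by object key.
type fakeFIFO struct {
	items map[string][]delta
}

// enqueueSync illustrates the proposed guard: skip the Sync if the object
// already has queued deltas, since the Sync would carry the store's stale state.
func (f *fakeFIFO) enqueueSync(key string, replicasFromStore int) {
	if len(f.items[key]) > 0 {
		return
	}
	f.items[key] = append(f.items[key], delta{"Sync", replicasFromStore})
}

func main() {
	f := &fakeFIFO{items: map[string][]delta{}}
	// A watch Update (scale 5 -> 4) is already queued for this RC.
	f.items["default/my-rc"] = append(f.items["default/my-rc"], delta{"Update", 4})
	// The Resync now skips the stale Sync instead of re-enqueuing replicas=5.
	f.enqueueSync("default/my-rc", 5)
	fmt.Println(f.items["default/my-rc"]) // only the Update remains queued
}
```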
OK - I have reproduced it in a unit test, so I'm pretty sure this is the problem. @kubernetes/sig-api-machinery - FYI
A PR fixing it is coming soon.
The fix is out for review.
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-gce-scale/1393/ Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}
Automatic merge from submit-queue: Reduce replication_controller log spam. Decrease verbosity and reword 'Observed updated replication controller ...' now that the issue it was added for has been fixed. This was originally added to debug #31981, and it was fixed back in September 2016. cc @gmarek @wojtek-t @Kargakis @eparis @smarterclayton
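For readers unfamiliar with the pattern, the change described above is essentially "gate a spammy debug line behind a higher verbosity level." The sketch below illustrates that idea with the standard library only; it is not the actual replication_controller diff, and `vlog`, the levels, and `default/my-rc` are assumptions for illustration:

```go
package main

import (
	"flag"
	"log"
)

// verbosity is a stand-in for a glog/klog-style -v flag.
var verbosity = flag.Int("v", 0, "log verbosity level")

// vlog only emits the message when the configured verbosity is high enough,
// which is how a debug-only line can be kept out of default controller logs.
func vlog(level int, format string, args ...interface{}) {
	if *verbosity >= level {
		log.Printf(format, args...)
	}
}

func main() {
	flag.Parse()
	oldReplicas, newReplicas := 5, 4
	// With the default -v=0 this line is silent; pass -v=4 to see it.
	vlog(4, "Observed updated replication controller %s. Desired pod count change: %d->%d",
		"default/my-rc", oldReplicas, newReplicas)
}
```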
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/5489/
Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}
Previous issues for this test: #26544 #26938 #27595 #30146 #30469 #31374 #31427 #31433 #31589