Failing Test: persistent-volume-upgrade [sig-storage] #68899
Looks like some permissions issue after the upgrade. |
@davidz627 has triaged this: the failures happen because the pod is scheduled on the master, and getting logs through the method the test uses is disabled on master nodes. This makes me think it is similar to #68881 (review), and that we have non-release-blocking issues where test cases are not quite robust enough in the face of node taints. |
It is interesting that in some places this test always fails, in some it always succeeds, and in one it even flakes, though with a high success rate: https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-new-master-upgrade-cluster |
Adding on to @tpepper's comment: The Pod Spec in question:
Potentially relevant scheduler logs.
|
The fix mentioned in #68881 is only handling DaemonSets. I don't think it will fix this issue. |
Do we need to get sig-scheduling involved to see why the master is showing up as a choice for this pod? |
/sig scheduling |
It seems that this pvc-tester- pod is scheduled onto the master in every run, which causes the test to fail consistently.
|
@janetkuo Whereas in this one it doesn't seem to fail at all!: https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gke-gci-new-gci-master-upgrade-cluster-parallel I talked to @krzyzacy about why the behavior might be different across different test suites but we were unable to come up with anything concrete. |
@davidz627 Is the master node in your test environment tainted properly? |
When we were doing unrelated development with DaemonSets, we noticed a similar issue where our DaemonSet pod would get scheduled onto the master in a GCE environment brought up with hack/cluster-up.sh; the masters brought up by that script were not tainted |
In what release was the master-tainting logic added? |
Not sure if this is a source of truth for the cluster, but the test artifacts here on a successful run show that the master node doesn't have a taint, and on a failed run there is ALSO no taint. But I'm not really sure where this |
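For reference, a properly tainted master would show something like this in its node object (a minimal sketch; the node name is illustrative, and the taint key is the master taint used in this era):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: e2e-test-master          # illustrative node name
spec:
  taints:
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
```

If spec.taints is empty on the master, nothing prevents the scheduler from placing the pvc-tester pod there.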
i.e. #68492 |
Don't think #68492 will help. It looks specific to GPU labeling |
Also, ^^ is GKE-specific, and the GKE docs say taints are not inherited by a node pool from the cluster, so this shouldn't be something coming from that direction. |
Some log snippets around the area of interest:
|
/priority critical-urgent |
I think the fix is obvious: we need to taint the master node in the GCE master setup in cluster/. What I still don't understand is how we got this far without anything else failing, and why only this one test? |
Do taints persist across master upgrades, or do they need to be re-applied? |
cc @dashpole for what we can do on the node side. |
@kubernetes/sig-node-bugs help please |
@bsalamat regarding your example "a pod that tolerates certain node conditions might not be scheduled if the scheduler checked both node taints and node condition", could this be an API compatibility issue? Will a pod that tolerates a node condition in 1.11 fail to schedule in 1.12 because the scheduler no longer looks at node conditions and only node taints? |
@msau42 Pods were not able to tolerate node conditions by setting some API fields. We had some annotations, such as pod critical annotation and other hard-coded logic in DS controller, etc. that would have the same effect of tolerating conditions, but they were not in our API. |
Thanks for the explanation. It looks like the critical pod annotation was alpha, so does not have a backwards compatibility requirement, and we've fixed all the known users of it. |
xref: #63897 - cc @mikedanese @liggitt Should this be handled by the node controller? I.e. if the Unschedulable bool is set on the node, add the unschedulable taint? |
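For illustration, a minimal sketch of what that translation looks like on the node object (node name is illustrative; the taint is the one added when the Unschedulable field is set):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: cordoned-node            # illustrative node name
spec:
  unschedulable: true            # e.g. set by kubectl cordon or an upgrade script
  taints:
  - key: node.kubernetes.io/unschedulable
    effect: NoSchedule           # the taint the controller is expected to add
```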
I don't believe the kubelet has the ability to update its taints anymore. Assuming this is the case, this seems like a version skew problem between master components. I.e., the 1.11 controller manager isn't compatible with the 1.13 scheduler because it doesn't add taints that the scheduler requires. This means that if the scheduler is upgraded before the controller manager (or before it has the chance to perform updates), we can run into issues. I think making the scheduler respect conditions in 1.13, and thus making it compatible with the 1.11 controller manager, seems like the best remedy to this. Regardless of this issue, user pods would need to include both the critical annotation as well as the toleration in order to maintain the same behavior across releases that transition from conditions to taints. |
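A sketch of what that would mean for a user pod, carrying both the (alpha) critical-pod annotation and explicit tolerations for the condition taints (pod name, image, and the exact toleration list are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: user-critical-pod                            # illustrative name
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""   # honored by releases that check conditions
spec:
  tolerations:                                       # honored by releases that check taints
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
  containers:
  - name: app
    image: example.com/app:1.0                       # illustrative image
```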
Oh, I see that the node controller is already doing that, but there's a race condition (#68899 (comment)), sorry I missed that earlier. |
Yes, translation of conditions and unschedulable field to taints is done by the controller |
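Concretely, a node whose Ready condition is False ends up with the matching taint added by the controller, roughly like this (a minimal sketch; node name is illustrative):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-1               # illustrative node name
spec:
  taints:
  - key: node.kubernetes.io/not-ready
    effect: NoSchedule         # added by the controller for the Ready=False condition
status:
  conditions:
  - type: Ready
    status: "False"
```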
I wonder if we can change our upgrade flow to ensure that the controller manager is upgraded and becomes stable (adds the taints) before the scheduler is upgraded. |
The kubelet has never required the ability to update taints.
Was that a typo, supposed to be 1.12? Have we ever supported two version skew between those components?
Under what circumstances do we upgrade to 1.13 scheduler while a 1.11 controller manager is still running?
Same question about two version skew between these two components.
That's more concerning… is that something all user pods would be exposed to and required to change? |
+1 to @dashpole's suggestion in #68899 (comment) to unblock the release |
What about an HA master upgrade? The scheduler and controller-manager each have independent leader elections, so during an HA rolling upgrade, the current leaders for the scheduler and controller manager could be on different releases in either direction |
I wouldn't expect an upgrade from 1.12 to 1.13 to begin until the upgrade from 1.11 to 1.12 was complete/successful |
Thanks for correcting me. It wasn't a typo, but just a lack of knowledge about how multi-version upgrades are performed. In that case we need to support 1 version skew.
Yeah, it should be 1.12, not 1.13. But generally, the newer scheduler should still be compatible with the state left by the previous version's controller manager, even if it isn't running anymore.
Actually, I agree. @bsalamat have we officially deprecated the critical pod annotation? We should ideally add the correct tolerations to pods with the critical pod annotation to ensure backwards compatibility. |
Bobby commented here that the system pods have already been corrected. I also noted that the critical pod annotation is alpha. |
I mean doing an HA rolling upgrade from 1.11 to 1.12. Let's say we have 3 masters, A, B, C, all on 1.11.
|
That feature (critical pods) was an alpha feature and we have already replaced it. The dilemma is that I am not so sure if critical pods were the only kind of pods with a "hack" for tolerating node conditions. Let me dig a bit more. |
One version of skew, absolutely. @dashpole's comment made me concerned two versions of skew support were being claimed/proposed |
I looked at the scheduler code and the DaemonSet controller code to see whether adding "CheckNodeCondition" could cause issues in the presence of TaintNodesByCondition, and I didn't find anything particularly concerning. So I created #68953. Let's see if it breaks tests. |
That will break scheduling of DaemonSets by the scheduler, and the kubelet's --register-with-taints does not help in the upgrade case. |
I suspected that it would conflict with pods which now have tolerations and the scheduler should ignore node conditions for, but I thought it would be limited to critical pods. Thanks for pointing it out. I closed the PR. |
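For context on the pods that now have tolerations: the DaemonSet controller adds taint tolerations to its pods automatically, roughly the following set (a sketch; the exact list depends on the release):

```yaml
# Approximate tolerations added by the DaemonSet controller to its pods.
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/disk-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/memory-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/unschedulable
  operator: Exists
  effect: NoSchedule
```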
Unschedulable is translated to a taint by the controller & kubelet; #63955 taints the node with unschedulable to avoid a race condition between the controller and the scheduler. It depends on
Let me think about it; I'll propose one today. |
The race that is important to resolve at creation time is that of a new node object appearing untainted. That race does not apply in an upgrade scenario, where the node object already exists, and can already be observed and acted upon. Relying on the condition-to-taint controller in upgrade scenarios is reasonable. |
The race after kubelet startup is well known, but we did not handle the startup case well. I think what we're doing is to deprecate |
Failing Job
sig-release-master-upgrade#gce-new-master-upgrade-master
Gubernator Logs
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-master/1778
Error
Triage
This test is a persistent failure in this one test bucket only, and aside from one flake it appears to consistently pass everywhere else in sig-release-master-upgrade and in sig-release-1.12-all.
/kind bug
/priority failing-test
/sig storage
@kubernetes/sig-storage-bugs
/assign @msau42
/cc @saad-ali @childsb
/milestone v1.12