MachineConfigs can be garbage collected while a node is still booting #301
This problem is actually fairly new (as of basically since Friday) because of the near-simultaneous addition of
/kind bug
As a data point, the
Sorry, I was wrong. It's likely the
To reduce churn if MCs are being created rapidly - both on general principle, and also to reduce our exposure to the current bug that a booting node may fail to find a GC'd MachineConfig: openshift#301
Hmm, wonder if it'd work to add another owner reference from the MCD on the MC before handling it. Let me look into that.
The same way we pass the specific desiredConfig we "locked on" when updating, also similarly pass the currentConfig. This is a minor optimization so we avoid querying the cluster again for information we already have, but it does also make the code more resilient to humans meddling with node annotations. Related: openshift#301
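Roughly, the shape of that change is something like the following sketch; the type, function, and config names here are illustrative stand-ins, not the actual MCD code:

```go
package main

import "fmt"

// MachineConfig is a stub standing in for the real MachineConfig type.
type MachineConfig struct {
	Name string
}

// triggerUpdate receives both the config the node is currently on and the one
// it is moving to, so neither has to be re-fetched from the cluster, and
// someone editing the node's annotations mid-update can't change which pair
// of configs this update operates on.
func triggerUpdate(currentConfig, desiredConfig *MachineConfig) error {
	fmt.Printf("updating node from %s to %s\n", currentConfig.Name, desiredConfig.Name)
	return nil
}

func main() {
	cur := &MachineConfig{Name: "rendered-worker-aaa"} // hypothetical names
	des := &MachineConfig{Name: "rendered-worker-bbb"}
	if err := triggerUpdate(cur, des); err != nil {
		panic(err)
	}
}
```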
So, just to do a brain dump on this. I'm still trying to reproduce this locally. It doesn't trigger at least when I manually create MCs in rapid succession (this is with #303 reverted).

One thing that caught my eye though is that there is no code today that actually deletes MachineConfigs, right? The render controller does try to remove the pool ownerReference from previously generated MachineConfigs (that actually doesn't work right now; working on a patch to fix that), but even so, deleting the ownerRef doesn't automatically delete the object, it just orphans them (which is actually an issue; we need to figure out garbage collection).
Something in that CI run was pruning MCs though.
Doesn't Kube itself do the GC? https://blog.openshift.com/garbage-collection-custom-resources-available-kubernetes-1-8/ We should indeed figure this issue out better, but it's not in the blocking path right now I'd say.
Agreed
AFAICT, deleting a parent will delete children that have ownerRefs to the parent. But deleting the ownerRef directly from the child doesn't actually delete the child itself, it just orphans it. I tested this out by hand (e.g. with something like the sketch below) and verified that Kube didn't delete the object afterwards.
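A minimal sketch of that kind of test using the dynamic client (the kubeconfig path and MachineConfig name below are placeholders): strip the ownerReferences from a MachineConfig, then fetch it again to confirm it is still there.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path for this sketch.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// MachineConfigs are cluster-scoped custom resources in this group/version.
	gvr := schema.GroupVersionResource{
		Group:    "machineconfiguration.openshift.io",
		Version:  "v1",
		Resource: "machineconfigs",
	}
	name := "rendered-worker-example" // placeholder MC name

	// Remove the entire ownerReferences list with a JSON Patch...
	patch := []byte(`[{"op": "remove", "path": "/metadata/ownerReferences"}]`)
	if _, err := client.Resource(gvr).Patch(context.TODO(), name,
		types.JSONPatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}

	// ...then confirm the object still exists: dropping the ownerRef orphans
	// the child, it does not delete it.
	obj, err := client.Resource(gvr).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("still present after removing ownerReferences: %s\n", obj.GetName())
}
```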
My worry is that not knowing exactly what's deleting MCs means we don't actually know how bad this is.
(BTW I wrote this wasn't in the critpath but then I hit #273 (comment) )
But...when does Kube GC happen? Could it have been delayed? I keep circling back to the fact that something is definitely deleting MCs today.
Yeah agreed. Will try to get more visibility on this today. BTW, was #273 (comment) in a CI cluster, or local? Wondering if there's somehow a higher likelihood of hitting this in CI for whatever reason.
I think we have this in CI but we're not noticing it. If it's happening we need to fix it. Ref: openshift#301
Yeah, this is still a blocking issue for #273. In a recent CI run all my masters went degraded due to failing to fetch the MC. And for some reason I seem to have lost my workers entirely (
In current master (i.e. after 0c36d1e landed):
So yeah, that code was clearly broken.
Should 0c36d1e be reverted?
Offhand, here's what I'm thinking right now. Today AIUI, the render controller tries a policy of "keep only latest MC for a pool". I think a pool should have something like this:
Then the render controller only queues for GC any MC which is not in that set. OK, now how do we maintain that set? I think a first clear cut at this would be "all MCs that are either currentConfig or desiredConfig on a node" (so the node controller writes this).

However...that still leaves open the special case of the MCS providing a config to boot a node, and having the config be GC'd while it's still booting.

My short term vote is: let's do the really simple thing and only GC MCs that are older than 1 hour. This will fix the "MCs going away during cluster bringup" problem. However, it would also mean that during an update that takes more than an hour, if the MCD somehow gets restarted on a node, it'll go degraded.
In other words, there would always be a desiredConfig and currentConfig available, but anything that didn't match those would be under a GC of 1 hour?
Hm, that's combining my suggestions; I was thinking of it more as "add the 1 hour thing" now, and then later "remove the 1 hour hack, add more precise GC". But...until we also figure out a design for the MCS special case, we might as well leave the 1 hour thing in.
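To make the proposal above concrete, here is a rough sketch of what such a GC filter could look like, assuming the simplified in-memory shapes below (the real controller would read MachineConfigs and node annotations from the API; names and types here are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified representations of the objects involved.
type machineConfig struct {
	Name    string
	Created time.Time
}

type node struct {
	CurrentConfig string // value of the currentConfig annotation
	DesiredConfig string // value of the desiredConfig annotation
}

const minAge = time.Hour // the "don't GC anything younger than 1 hour" hack

// gcCandidates returns MachineConfigs that would be safe to queue for GC:
// anything that is neither a currentConfig nor a desiredConfig on any node,
// and that is older than minAge (so a config served to a still-booting node
// isn't pruned out from under it during cluster bringup).
func gcCandidates(mcs []machineConfig, nodes []node, now time.Time) []machineConfig {
	keep := map[string]bool{}
	for _, n := range nodes {
		keep[n.CurrentConfig] = true
		keep[n.DesiredConfig] = true
	}
	var out []machineConfig
	for _, mc := range mcs {
		if keep[mc.Name] {
			continue
		}
		if now.Sub(mc.Created) < minAge {
			continue
		}
		out = append(out, mc)
	}
	return out
}

func main() {
	now := time.Now()
	mcs := []machineConfig{
		{Name: "rendered-worker-old", Created: now.Add(-3 * time.Hour)},
		{Name: "rendered-worker-new", Created: now.Add(-5 * time.Minute)},
		{Name: "rendered-worker-current", Created: now.Add(-2 * time.Hour)},
	}
	nodes := []node{{CurrentConfig: "rendered-worker-current", DesiredConfig: "rendered-worker-new"}}
	for _, mc := range gcCandidates(mcs, nodes, now) {
		fmt.Println("would GC:", mc.Name) // only rendered-worker-old
	}
}
```

Dropping the 1 hour hack later in favor of more precise GC would just mean removing the minAge check.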
We weren't actually deleting the ownerReference because CRDs don't yet support strategic merges[1], which have the fancy `$patch: delete` syntax. We were also specifying the wrong PatchType argument. For now, just drop down to the vanilla JSONPatch[2] to do this. [1] https://kubernetes.io/docs/tasks/run-application/update-api-object-kubectl-patch [2] https://tools.ietf.org/html/rfc6902 Related: openshift#301
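In code, the switch described in that commit message boils down to issuing a plain JSON Patch instead of a strategic merge; a sketch, not the controller's actual code (the /0 index and function shape are assumptions):

```go
package gc

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// removePoolOwnerRef drops a MachineConfig's ownerReference via an RFC 6902
// JSON Patch; strategic merge patches (and their `$patch: delete` directive)
// aren't supported on CRD-backed resources. The /0 index assumes the pool's
// ownerReference is the first (and only) entry in the list.
func removePoolOwnerRef(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, name string) error {
	patch := []byte(`[{"op": "remove", "path": "/metadata/ownerReferences/0"}]`)
	_, err := client.Resource(gvr).Patch(ctx, name, types.JSONPatchType, patch, metav1.PatchOptions{})
	return err
}
```

The earlier strategic-merge attempt was simply not taking effect on the CRD, which is why the ownerReferences never actually went away.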
Notes so far from a meeting on this.

We don't think (but this needs to be verified) that deleting an owner ref will GC the object. And yeah, just playing with this briefly, that seems to be the case. And clearly the patching code wasn't working.

We think that the MCs going away may have something more to do with some sort of race condition on cluster bringup. Abhinav suggested the operator should pause rolling out any configs until the master pool has stabilized. I'm wondering if maybe the race is something like us rebooting the master before the other ones have come online, and etcd didn't get to finish committing the rendered MC?
OK I added a quick hack here: 836f5e0

Also I noticed that nothing is setting

So ideas for better waits in the operator; we could add a new

Hmm...what would be a good signal for "cluster is ready for MCO to start running"? Maybe when all the workers specified in the install config are online?
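As one possible shape for that last idea, here is a minimal sketch of such a wait, polling node readiness with client-go (the label selector, count, and 10-second interval are placeholders, not what the operator actually does):

```go
package operator

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForNodesReady blocks until at least `want` nodes matching the label
// selector report Ready, polling every 10 seconds until the context is
// cancelled. Transient list errors are ignored and retried.
func waitForNodesReady(ctx context.Context, client kubernetes.Interface, selector string, want int) error {
	return wait.PollUntilContextCancel(ctx, 10*time.Second, true, func(ctx context.Context) (bool, error) {
		nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return false, nil // transient error: keep polling
		}
		ready := 0
		for _, n := range nodes.Items {
			for _, cond := range n.Status.Conditions {
				if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
					ready++
					break
				}
			}
		}
		return ready >= want, nil
	})
}
```

Keying off the MCD status instead, as suggested later in the thread, would just change the condition being polled.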
And based on the discussion so far we shouldn't merge #318, right? Because it would make this worse (assuming the patch works, and I believe it does).

I forget, from the meeting did we have any ideas on what a good GC policy for the MCs would be? Did we land anywhere different from #301 (comment)?
That sounds reasonable. If we wanted to be extra explicit about flow, we could also keep MCDs from being scheduled on the workers until MCs are generated and available via the MCS. Just thinking out loud.
I thought that we had said that the 1 hour part didn't really work?
It's a bit the inverse of that, right? We want to wait for

EDIT: To clarify though, I like this idea of keying off the MCD status.
It definitely works. :) Though I'd agree we shouldn't merge it until we have a better understanding of what's going on.
Some work on "operator pause" here #329 - WIP, not really tested (since I'd have to make a custom release payload to do that sanely); going to hope CI runs it and see.
OK I think I'm finally understanding this. No MCs are being GC'd. The problem is that until the osimage work, we were relying on the fact that the MC generated by the bootstrap was exactly the same as the MC initially generated in the cluster. So we either need to render the osimageurl during bootstrap (probably best) or also render the base template MC without the osimageurl (which would seem fine). If I'm correct about this and one of those fixes works, this issue would then turn into the other problem of when we GC MCs.
👍 yes! The original architecture passed to us was that the MC from bootstrap would end up being the same. This can't ALWAYS be counted on, since it's possible someone may use an old MC when installing a new cluster, but in those cases the extra reboot and pivot to the updated MC would be expected and fine. I'm good with either or both of the above.
Closing in favor of #354
Splitting this out of #273 (comment)
#273 (comment)
Here are relevant mcc logs from a cluster spun up by that PR:
You can see that this caused very fast churn in the machineconfigs, and the previous ones were GC'd. But there were secondary masters that were still booting and expecting to be able to find that MC.
This is a tricky problem - we need to have a way to avoid pruning "in flight" MCs passed from the MCC to Ignition.