
MachineConfigs can be garbage collected while a node is still booting #301

Closed · cgwalters opened this issue Jan 14, 2019 · 30 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

Comments

@cgwalters (Member) commented Jan 14, 2019

Splitting this out of #273 (comment)

Here are relevant mcc logs from a cluster spun up by that PR:

I0114 22:26:10.482441       1 render_controller.go:437] Generated machineconfig master-46c05bfb9cb3d4e05608277bb2cb0a5d from 1 configs: [{MachineConfig  00-master  machineconfiguration.openshift.io/v1  }]
I0114 22:26:11.589272       1 render_controller.go:437] Generated machineconfig master-7734c782bad1ead0f8ef5b6affcaf35c from 2 configs: [{MachineConfig  00-master  machineconfiguration.openshift.io/v1  } {MachineConfig  00-master-osimageurl  machineconfiguration.openshift.io/v1  }]
I0114 22:26:12.484212       1 render_controller.go:437] Generated machineconfig master-383ca913310d861fee0be89e6f1d0127 from 3 configs: [{MachineConfig  00-master  machineconfiguration.openshift.io/v1  } {MachineConfig  00-master-osimageurl  machineconfiguration.openshift.io/v1  } {MachineConfig  00-master-ssh  machineconfiguration.openshift.io/v1  }]
I0114 22:26:14.483810       1 render_controller.go:437] Generated machineconfig master-a206c9459a44d859587164a68bb484f2 from 4 configs: [{MachineConfig  00-master  machineconfiguration.openshift.io/v1  } {MachineConfig  00-master-osimageurl  machineconfiguration.openshift.io/v1  } {MachineConfig  00-master-ssh  machineconfiguration.openshift.io/v1  } {MachineConfig  01-master-kubelet  machineconfiguration.openshift.io/v1  }]

You can see that this caused very fast churn in the machineconfigs, and the previous ones were GC'd. But there were secondary masters that were still booting and expecting to be able to find those MCs.
This is a tricky problem: we need a way to avoid pruning "in flight" MCs passed from the MCC to Ignition.
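
(For reference, the set of "in flight" rendered configs is in principle derivable from the MCO node annotations. A rough, untested sketch of computing it, assuming the machineconfiguration.openshift.io/currentConfig and desiredConfig annotation keys; illustration only, not code from the repo:)

package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// Annotation keys written by the node controller (assumed here for illustration).
const (
	currentConfigAnnotation = "machineconfiguration.openshift.io/currentConfig"
	desiredConfigAnnotation = "machineconfiguration.openshift.io/desiredConfig"
)

// inFlightConfigs returns the names of rendered MachineConfigs that some node
// still references as its current or desired config; these must survive pruning.
func inFlightConfigs(nodes []corev1.Node) map[string]bool {
	inUse := map[string]bool{}
	for _, node := range nodes {
		for _, key := range []string{currentConfigAnnotation, desiredConfigAnnotation} {
			if name := node.Annotations[key]; name != "" {
				inUse[name] = true
			}
		}
	}
	return inUse
}

Note that this still wouldn't cover a machine that is booting and hasn't been annotated yet, which is exactly the MCS/Ignition gap described above.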

@cgwalters (Member Author) commented Jan 14, 2019

This problem is actually fairly new (basically since Friday) because of the near-simultaneous addition of the -ssh and -kubelet MCs, which causes MC regeneration churn immediately on cluster bringup, and I'm about to add another.

@ashcrow (Member) commented Jan 14, 2019

/kind bug

@openshift-ci-robot added the kind/bug label on Jan 14, 2019
@kikisdeliveryservice (Contributor) commented Jan 14, 2019

As a data point, the -ssh mc was added last Monday, so any idea why you saw the issues begin on Friday? I also haven't seen any MC race symptoms running on AWS.

@cgwalters (Member Author)

As a data point, the -ssh mc was added last Monday, so any idea why you saw the issues begin on Friday?

Sorry, I was wrong. It's likely the -kubelet one on top of the -ssh one, but both are pretty recent.

cgwalters added a commit to cgwalters/machine-config-operator that referenced this issue Jan 15, 2019
To reduce churn if MCs are being created rapidly - both on general
principle, and also to reduce our exposure to the current bug
that a booting node may fail to find a GC'd MachineConfig:
openshift#301
@jlebon (Member) commented Jan 15, 2019

Hmm, wonder if it'd work to add another owner reference from the MCD on the MC before handling it. Let me look into that.
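
Roughly what I have in mind, as an untested sketch (assuming we'd point the extra ownerReference at the node object and apply it as a JSON patch; the actual client call is omitted):

package sketch

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// addNodeOwnerRefPatch builds an RFC 6902 JSON patch appending an
// ownerReference for the given node to a MachineConfig's metadata. It assumes
// the ownerReferences array already exists on the object; the result would be
// applied with types.JSONPatchType via the MachineConfig client.
func addNodeOwnerRefPatch(node *corev1.Node) ([]byte, error) {
	ref := metav1.OwnerReference{
		APIVersion: "v1",
		Kind:       "Node",
		Name:       node.Name,
		UID:        node.UID,
	}
	ops := []map[string]interface{}{
		{"op": "add", "path": "/metadata/ownerReferences/-", "value": ref},
	}
	return json.Marshal(ops)
}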

jlebon added a commit to jlebon/machine-config-operator that referenced this issue Jan 15, 2019
The same way we pass the specific desiredConfig we "locked on" when
updating, also similarly pass the currentConfig. This is a minor
optimization so we avoid querying the cluster again for information we
already have, but it does also make the code more resilient to humans
meddling with node annotations.

Related: openshift#301
@jlebon (Member) commented Jan 16, 2019

So, just to do a brain dump on this: I'm still trying to reproduce this locally. It doesn't trigger, at least, when I manually create MCs in rapid succession (this is with #303 reverted). One thing that caught my eye, though, is that there is no code today that actually deletes MachineConfigs, right?

The render controller does try to remove the pool ownerReference from previously generated MachineConfigs (that actually doesn't work right now; working on a patch to fix that), but even so, deleting the ownerRef doesn't automatically delete the objects; it just orphans them (which is actually an issue; we need to figure out garbage collection).

@cgwalters (Member Author)

The render controller does try to remove the pool ownerReference from previously generated MachineConfigs (that actually doesn't work right now; working on a patch to fix that),

Something in that CI run was pruning MCs though.

it just orphans them (which is actually an issue; we need to figure out garbage collection).

Doesn't Kube itself do the GC https://blog.openshift.com/garbage-collection-custom-resources-available-kubernetes-1-8/ ?

We should indeed figure this issue out better but it's not in the blocking path right now I'd say.

@ashcrow (Member) commented Jan 16, 2019

We should indeed figure this issue out better but it's not in the blocking path right now I'd say.

Agreed

@jlebon (Member) commented Jan 16, 2019

AFAICT, deleting a parent will delete children that have ownerRefs to the parent. But deleting the ownerRef directly from the child doesn't actually delete the child itself; it just orphans it. I tested this out with e.g.:

oc patch machineconfig worker-fab428ea272fbe07901d4b94b7db4d0d --type=json -p '[{"op": "remove", "path": "/metadata/ownerReferences/0"}]'

and verifying that Kube didn't delete it afterwards.

We should indeed figure this issue out better but it's not in the blocking path right now I'd say.

My worry is that not knowing exactly what's deleting MCs means we don't actually know how bad this is.

@cgwalters (Member Author)

(BTW, I wrote above that this wasn't in the critical path, but then I hit #273 (comment).)

and verifying that Kube didn't delete it afterwards.

But...when does Kube GC happen? Could it have been delayed?

I keep circling back to the fact that something is definitely deleting MCs today.

@jlebon (Member) commented Jan 17, 2019

I keep circling back to the fact that something is definitely deleting MCs today.

Yeah agreed. Will try to get more visibility on this today.

BTW, was #273 (comment) in a CI cluster, or local? Wondering if there's somehow a higher likelihood of hitting this in CI for whatever reason.

@cgwalters (Member Author) commented Jan 17, 2019

BTW, was #273 (comment) in a CI cluster, or local?

That was in CI. However, that comment was before #303 which I think has been a strong mitigation. But, we don't have much data here - #319 is attempting to gather some.

@cgwalters changed the title from "Fast cycling of MCs (and GC issues)" to "MachineConfigs can be garbage collected while a node is still booting" on Jan 17, 2019
cgwalters added a commit to cgwalters/machine-config-operator that referenced this issue Jan 17, 2019
I think we have this in CI but we're not noticing it.
If it's happening we need to fix it.

Ref: openshift#301
@cgwalters (Member Author)

Yeah, this is still a blocking issue for #273

In a recent CI run all my masters went degraded due to failing to fetch the MC. And for some reason I seem to have lost my workers entirely (NotReady) and the DS pods are NodeLost.

@cgwalters (Member Author)

In current master (i.e. after 0c36d1e landed):

W0118 13:53:05.784041       1 render_controller.go:488] Failed to delete ownerReference from 00-worker-ssh: json: cannot unmarshal object into Go value of type jsonpatch.Patch
W0118 13:53:05.790057       1 render_controller.go:488] Failed to delete ownerReference from 01-master-kubelet: json: cannot unmarshal object into Go value of type jsonpatch.Patch
W0118 13:53:05.856325       1 render_controller.go:488] Failed to delete ownerReference from 00-master-ssh: json: cannot unmarshal object into Go value of type jsonpatch.Patch
W0118 13:53:05.948509       1 render_controller.go:488] Failed to delete ownerReference from 00-worker: json: cannot unmarshal object into Go value of type jsonpatch.Patch
W0118 13:53:05.956110       1 render_controller.go:488] Failed to delete ownerReference from 00-worker-ssh: json: cannot unmarshal object into Go value of type jsonpatch.Patch
W0118 13:53:06.159607       1 render_controller.go:488] Failed to delete ownerReference from 01-master-kubelet: json: cannot unmarshal object into Go value of type jsonpatch.Patch
W0118 13:53:06.254004       1 render_controller.go:488] Failed to delete ownerReference from worker-dc91f58f81073a943eda478f6e6c9c24: json: cannot unmarshal object into Go value of type jsonpatch.Patch
W0118 13:53:06.254523       1 render_controller.go:488] Failed to delete ownerReference from 00-master-ssh: json: cannot unmarshal object into Go value of type jsonpatch.Patch
W0118 13:53:06.562560       1 render_controller.go:488] Failed to delete ownerReference from 00-worker: json: cannot unmarshal object into Go value of type jsonpatch.Patch
W0118 13:53:06.648050       1 render_controller.go:488] Failed to delete ownerReference from 00-master: json: cannot unmarshal object into Go value of type jsonpatch.Patch
W0118 13:53:06.754139       1 render_controller.go:488] Failed to delete ownerReference from worker-dc91f58f81073a943eda478f6e6c9c24: json: cannot unmarshal object into Go value of type jsonpatch.Patch
W0118 13:53:06.756698       1 render_controller.go:488] Failed to delete ownerReference from master-305cb5e168750be92391b570a00bf078: json: cannot unmarshal object into Go value of type jsonpatch.Patch
W0118 13:53:06.854664       1 render_controller.go:488] Failed to delete ownerReference from master-e9dbd67aa91af61df98032b1cbc4c3e0: json: cannot unmarshal object into Go value of type jsonpatch.Patch
W0118 13:53:06.962235       1 render_controller.go:488] Failed to delete ownerReference from 00-master: json: cannot unmarshal object into Go value of type jsonpatch.Patch

So yeah, that code was clearly broken.

@ashcrow (Member) commented Jan 18, 2019

Should 0c36d1e be reverted?

@cgwalters (Member Author)

Should 0c36d1e be reverted?

Er, sorry: when I said "that code was broken" I meant the code as it was before 0c36d1e. The new code is revealing errors that existed before.
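
For reference, a plain RFC 6902 patch to drop an ownerReference would look something like this (my sketch, untested; the index handling and client call are assumptions):

package sketch

import (
	"encoding/json"
	"fmt"
)

// removeOwnerRefPatch builds a vanilla RFC 6902 JSON patch that removes the
// ownerReference at index i from an object's metadata. The body has to be an
// array and has to be sent with types.JSONPatchType; sending a non-array
// payload as a JSON patch is what yields "cannot unmarshal object into Go
// value of type jsonpatch.Patch".
func removeOwnerRefPatch(i int) ([]byte, error) {
	ops := []map[string]string{
		{"op": "remove", "path": fmt.Sprintf("/metadata/ownerReferences/%d", i)},
	}
	return json.Marshal(ops)
}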

@cgwalters (Member Author) commented Jan 18, 2019

Offhand, here's what I'm thinking right now. Today, AIUI, the render controller tries a policy of "keep only the latest MC for a pool".

I think a pool should have something like this:

diff --git a/pkg/apis/machineconfiguration.openshift.io/v1/types.go b/pkg/apis/machineconfiguration.openshift.io/v1/types.go
index aa13fdf..cbfba9d 100644
--- a/pkg/apis/machineconfiguration.openshift.io/v1/types.go
+++ b/pkg/apis/machineconfiguration.openshift.io/v1/types.go
@@ -280,6 +280,8 @@ type MachineConfigPoolStatus struct {
 	// The current MachineConfig object for the machine pool.
 	Configuration MachineConfigPoolStatusConfiguration `json:"configuration"`
 
+	ActiveConfigurations []MachineConfig `json:"activeConfigurations"`
+
 	// Total number of machines in the machine pool.
 	MachineCount int32 `json:"machineCount"`

Then the render controller only queues for GC any MC which is not in that set.

OK, now how do we maintain that set? I think a first cut at this would be "all MCs that are either currentConfig or desiredConfig on some node" (so the node controller would write this)?

However...that still leaves open the special case of the MCS providing a config to boot a node, and having the config be GC'd while it's still booting.

My short-term vote is: let's do the really simple thing and only GC MCs that are older than 1 hour. This will fix the "MCs going away during cluster bringup" problem. However, it would also mean that during an update that takes more than an hour, if the MCD somehow gets restarted on a node, that node will go degraded.
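
To make that concrete, here's a rough sketch of the GC filter (untested; the in-use set would come from the node annotations, and the 1 hour cutoff is the hack described above):

package sketch

import (
	"time"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

// gcCandidates returns rendered MachineConfigs that would be safe to prune
// under the proposed policy: skip anything still referenced by a node as
// currentConfig or desiredConfig, and skip anything younger than minAge to
// cover nodes that are still booting from a config served by the MCS.
func gcCandidates(rendered []mcfgv1.MachineConfig, inUse map[string]bool, minAge time.Duration, now time.Time) []mcfgv1.MachineConfig {
	var candidates []mcfgv1.MachineConfig
	for _, mc := range rendered {
		if inUse[mc.Name] || now.Sub(mc.CreationTimestamp.Time) < minAge {
			continue
		}
		candidates = append(candidates, mc)
	}
	return candidates
}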

@ashcrow (Member) commented Jan 18, 2019

In other words, there would always be a desiredConfig and currentConfig available, but anything that didn't match those would be subject to GC after 1 hour?

@cgwalters (Member Author) commented Jan 18, 2019

In other words, there would always be a desiredConfig and currentConfig available, but anything that didn't match those would be subject to GC after 1 hour?

Hm, that's combining my suggestions; I was thinking of it more as "add the 1 hour thing" now, and then later "remove the 1 hour hack, add more precise GC". But until we also figure out a design for the MCS special case, maybe we might as well leave the 1 hour thing in.

jlebon added a commit to jlebon/machine-config-operator that referenced this issue Jan 18, 2019
We weren't actually deleting the ownerReference because CRDs don't yet
support strategic merges[1], which have the fancy `$patch: delete`
syntax. We were also specifying the wrong PatchType argument.

For now, just drop down to the vanilla JSONPatch[2] to do this.

[1] https://kubernetes.io/docs/tasks/run-application/update-api-object-kubectl-patch
[2] https://tools.ietf.org/html/rfc6902

Related: openshift#301
@cgwalters (Member Author)

Notes so far from a meeting on this.

We don't think (but this needs to be verified) that deleting an owner ref should GC the object. And yeah, just playing with this briefly, that seems to be the case.

And clearly the patching code wasn't working.

We think that the MCs going away may have something more to do with some sort of race condition on cluster bringup. Abhinav suggested the operator should pause rolling out any configs until the master pool has stabilized.

I'm wondering if maybe the race is something like us rebooting the master before the other ones have come online, and etcd didn't get to finish committing the rendered MC?

@cgwalters (Member Author)

OK I added a quick hack here: 836f5e0

Also I noticed that nothing is setting Paused; is it intended for humans? I guess it is.

So, ideas for better waits in the operator: we could add a new OperatorPaused field that we own, have it be equivalent to Paused, set it initially to true, and then change it to false after...

Hmm...what would be a good signal for "cluster is ready for MCO to start running"? Maybe when all the workers specified in the install config are online?
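
(For the record, flipping a pool's spec.paused can be done with a plain merge patch; rough sketch only, client call omitted:)

package sketch

import "encoding/json"

// pausePatch builds a merge-patch body toggling a MachineConfigPool's
// spec.paused field; it would be applied with types.MergePatchType.
func pausePatch(paused bool) ([]byte, error) {
	return json.Marshal(map[string]interface{}{
		"spec": map[string]bool{"paused": paused},
	})
}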

@cgwalters (Member Author)

And based on the discussion so far we shouldn't merge #318, right? Because it would make this worse (assuming the patch works, and I believe it does).

I forget: from the meeting, did we have any ideas on what a good GC policy for the MCs would be? Did we land anywhere different from #301 (comment)?

@ashcrow (Member) commented Jan 18, 2019

Hmm...what would be a good signal for "cluster is ready for MCO to start running"? Maybe when all the workers specified in the install config are online?

That sounds reasonable. If we wanted to be extra explicit about flow, we could also keep MCDs from being scheduled on the workers until MCs are generated and available via the MCS. Just thinking out loud.

@kikisdeliveryservice (Contributor)

I thought that we had said that the 1 hour part didn't really work?

@cgwalters (Member Author) commented Jan 18, 2019

MCDs from being scheduled on the workers until MCs are generated and available via the MCS.

It's a bit the inverse of that, right? We want to wait for ds/machine-config-daemon to be fully rolled out (and hence for the MCD on each node to have fetched its config) and then unpause the pools, no?

EDIT: To clarify though, I like this idea of keying off the MCD status.
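
Rough sketch of that signal (untested; assumes the DaemonSet status fields are a good enough proxy for "every MCD has fetched its config"):

package sketch

import appsv1 "k8s.io/api/apps/v1"

// mcdRolledOut reports whether the machine-config-daemon DaemonSet is fully
// scheduled, updated, and available on every targeted node, i.e. the proposed
// condition for unpausing the pools.
func mcdRolledOut(ds *appsv1.DaemonSet) bool {
	want := ds.Status.DesiredNumberScheduled
	return want > 0 &&
		ds.Status.UpdatedNumberScheduled == want &&
		ds.Status.NumberAvailable == want
}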

@jlebon (Member) commented Jan 18, 2019

And based on the discussion so far we shouldn't merge #318, right? Because it would make this worse (assuming the patch works, and I believe it does).

It definitely works. :)

Though I'd agree we shouldn't merge it until we have a better understanding of what's going on.

@cgwalters (Member Author)

Some work on "operator pause" is here: #329. It's WIP and not really tested (since I'd have to make a custom release payload to do that sanely); going to hope CI runs it and see.

@cgwalters (Member Author)

OK, I think I'm finally understanding this. No MCs are being GC'd. The problem is that, until the osimage work, we were relying on the MC generated by the bootstrap being exactly the same as the MC initially generated in the cluster. So we either need to render the osimageurl during bootstrap (probably best) or also render the base template MC without the osimageurl (which would also seem fine).

If I'm correct about this and one of those fixes works, this issue would then turn into the other problem of when we GC MCs.

@ashcrow (Member) commented Jan 21, 2019

OK, I think I'm finally understanding this. No MCs are being GC'd. The problem is that, until the osimage work, we were relying on the MC generated by the bootstrap being exactly the same as the MC initially generated in the cluster. So we either need to render the osimageurl during bootstrap (probably best) or also render the base template MC without the osimageurl (which would also seem fine).

👍 yes! The original architecture passed to us was that the MC from bootstrap would end up being the same. This can't ALWAYS be counted on, since it's possible someone may use an old MC when installing a new cluster, but in those cases the extra reboot and pivot to the updated MC would be expected and fine.

I'm good with either or both of the above.

cgwalters added a commit to cgwalters/machine-config-operator that referenced this issue Jan 21, 2019
I think we have this in CI but we're not noticing it.
If it's happening we need to fix it.

Ref: openshift#301
@cgwalters (Member Author)

Closing in favor of #354, since there wasn't actually any GC: #301 (comment)

cgwalters added a commit to cgwalters/machine-config-operator that referenced this issue Feb 4, 2019
I think we have this in CI but we're not noticing it.
If it's happening we need to fix it.

Ref: openshift#301
osherdp pushed a commit to osherdp/machine-config-operator that referenced this issue Apr 13, 2021
Bug 1853070: update base image version in dockerfiles