MachineConfigs can be garbage collected while a node is still booting #301
This problem is actually fairly new (as of basically since Friday) because of the near-simultaneous addition of
/kind bug
As a data point, the
Sorry, I was wrong. It's likely the
To reduce churn if MCs are being created rapidly - both on general principle, and also to reduce our exposure to the current bug that a booting node may fail to find a GC'd MachineConfig: openshift#301
Hmm, wonder if it'd work to add another owner reference from the MCD on the MC before handling it. Let me look into that.
The same way we pass the specific desiredConfig we "locked on" when updating, also similarly pass the currentConfig. This is a minor optimization so we avoid querying the cluster again for information we already have, but it does also make the code more resilient to humans meddling with node annotations. Related: openshift#301
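Roughly, the shape of that change is something like the following sketch; the type, function, and config names here are illustrative stand-ins, not the actual MCD code:

```go
package main

import "fmt"

// MachineConfig is a stub standing in for the real MachineConfig type.
type MachineConfig struct {
	Name string
}

// triggerUpdate receives both the config the node is currently on and the one
// it is moving to, so neither has to be re-fetched from the cluster, and
// someone editing the node's annotations mid-update can't change which pair
// of configs this update operates on.
func triggerUpdate(currentConfig, desiredConfig *MachineConfig) error {
	fmt.Printf("updating node from %s to %s\n", currentConfig.Name, desiredConfig.Name)
	return nil
}

func main() {
	cur := &MachineConfig{Name: "rendered-worker-aaa"} // hypothetical names
	des := &MachineConfig{Name: "rendered-worker-bbb"}
	if err := triggerUpdate(cur, des); err != nil {
		panic(err)
	}
}
```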
So, just to do a brain dump on this. I'm still trying to reproduce this locally. It doesn't trigger at least when I manually create MCs in rapid succession (this is with #303 reverted).

One thing that caught my eye though is that there is no code today that actually deletes MachineConfigs, right? The render controller does try to remove the pool ownerReference from previously generated MachineConfigs (that actually doesn't work right now; working on a patch to fix that), but even so, deleting the ownerRef doesn't automatically delete the object, it just orphans them (which is actually an issue; we need to figure out garbage collection).
Something in that CI run was pruning MCs though.
Doesn't Kube itself do the GC? https://blog.openshift.com/garbage-collection-custom-resources-available-kubernetes-1-8/ We should indeed figure this issue out better, but it's not in the blocking path right now I'd say.
Agreed
AFAICT, deleting a parent will delete children that have ownerRefs to the parent. But deleting the ownerRef directly from the child doesn't actually delete the child itself, it just orphans it. I tested this out by hand (e.g. with something like the sketch below) and verified that Kube didn't delete the object afterwards.
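A minimal sketch of that kind of test using the dynamic client (the kubeconfig path and MachineConfig name below are placeholders): strip the ownerReferences from a MachineConfig, then fetch it again to confirm it is still there.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path for this sketch.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// MachineConfigs are cluster-scoped custom resources in this group/version.
	gvr := schema.GroupVersionResource{
		Group:    "machineconfiguration.openshift.io",
		Version:  "v1",
		Resource: "machineconfigs",
	}
	name := "rendered-worker-example" // placeholder MC name

	// Remove the entire ownerReferences list with a JSON Patch...
	patch := []byte(`[{"op": "remove", "path": "/metadata/ownerReferences"}]`)
	if _, err := client.Resource(gvr).Patch(context.TODO(), name,
		types.JSONPatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}

	// ...then confirm the object still exists: dropping the ownerRef orphans
	// the child, it does not delete it.
	obj, err := client.Resource(gvr).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("still present after removing ownerReferences: %s\n", obj.GetName())
}
```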
My worry is that not knowing exactly what's deleting MCs means we don't actually know how bad this is.
(BTW I wrote this wasn't in the critpath but then I hit #273 (comment) )
But...when does Kube GC happen? Could it have been delayed? I keep circling back to the fact that something is definitely deleting MCs today.
Yeah agreed. Will try to get more visibility on this today. BTW, was #273 (comment) in a CI cluster, or local? Wondering if there's somehow a higher likelihood of hitting this in CI for whatever reason.
I think we have this in CI but we're not noticing it. If it's happening we need to fix it. Ref: openshift#301
Yeah, this is still a blocking issue for #273. In a recent CI run all my masters went degraded due to failing to fetch the MC. And for some reason I seem to have lost my workers entirely (
In current master (i.e. after 0c36d1e landed):
So yeah, that code was clearly broken.
Should 0c36d1e be reverted?
Offhand, here's what I'm thinking right now. Today AIUI, the render controller tries a policy of "keep only latest MC for a pool". I think a pool should have something like this:
Then the render controller only queues for GC any MC which is not in that set. OK, now how do we maintain that set? I think a first clear cut at this would be "all MCs that are either currentConfig or desiredConfig on a node" (so the node controller writes this).

However...that still leaves open the special case of the MCS providing a config to boot a node, and having the config be GC'd while it's still booting.

My short term vote is: let's do the really simple thing and only GC MCs that are older than 1 hour. This will fix the "MCs going away during cluster bringup" problem. However, it would also mean that during an update that takes more than an hour, if the MCD somehow gets restarted on a node, it'll go degraded.
In other words, there would always be a desiredConfig and currentConfig available, but anything that didn't match those would be under a GC of 1 hour?
Hm, that's combining my suggestions; I was thinking of it more as "add the 1 hour thing" now, and then later "remove the 1 hour hack, add more precise GC". But...until we also figure out a design for the MCS special case, we might as well leave the 1 hour thing in.
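To make the proposal above concrete, here is a rough sketch of what such a GC filter could look like, assuming the simplified in-memory shapes below (the real controller would read MachineConfigs and node annotations from the API; names and types here are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified representations of the objects involved.
type machineConfig struct {
	Name    string
	Created time.Time
}

type node struct {
	CurrentConfig string // value of the currentConfig annotation
	DesiredConfig string // value of the desiredConfig annotation
}

const minAge = time.Hour // the "don't GC anything younger than 1 hour" hack

// gcCandidates returns MachineConfigs that would be safe to queue for GC:
// anything that is neither a currentConfig nor a desiredConfig on any node,
// and that is older than minAge (so a config served to a still-booting node
// isn't pruned out from under it during cluster bringup).
func gcCandidates(mcs []machineConfig, nodes []node, now time.Time) []machineConfig {
	keep := map[string]bool{}
	for _, n := range nodes {
		keep[n.CurrentConfig] = true
		keep[n.DesiredConfig] = true
	}
	var out []machineConfig
	for _, mc := range mcs {
		if keep[mc.Name] {
			continue
		}
		if now.Sub(mc.Created) < minAge {
			continue
		}
		out = append(out, mc)
	}
	return out
}

func main() {
	now := time.Now()
	mcs := []machineConfig{
		{Name: "rendered-worker-old", Created: now.Add(-3 * time.Hour)},
		{Name: "rendered-worker-new", Created: now.Add(-5 * time.Minute)},
		{Name: "rendered-worker-current", Created: now.Add(-2 * time.Hour)},
	}
	nodes := []node{{CurrentConfig: "rendered-worker-current", DesiredConfig: "rendered-worker-new"}}
	for _, mc := range gcCandidates(mcs, nodes, now) {
		fmt.Println("would GC:", mc.Name) // only rendered-worker-old
	}
}
```

Dropping the 1 hour hack later in favor of more precise GC would just mean removing the minAge check.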
We weren't actually deleting the ownerReference because CRDs don't yet support strategic merges[1], which have the fancy `$patch: delete` syntax. We were also specifying the wrong PatchType argument. For now, just drop down to the vanilla JSONPatch[2] to do this. [1] https://kubernetes.io/docs/tasks/run-application/update-api-object-kubectl-patch [2] https://tools.ietf.org/html/rfc6902 Related: openshift#301
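In code, the switch described in that commit message boils down to issuing a plain JSON Patch instead of a strategic merge; a sketch, not the controller's actual code (the /0 index and function shape are assumptions):

```go
package gc

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// removePoolOwnerRef drops a MachineConfig's ownerReference via an RFC 6902
// JSON Patch; strategic merge patches (and their `$patch: delete` directive)
// aren't supported on CRD-backed resources. The /0 index assumes the pool's
// ownerReference is the first (and only) entry in the list.
func removePoolOwnerRef(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, name string) error {
	patch := []byte(`[{"op": "remove", "path": "/metadata/ownerReferences/0"}]`)
	_, err := client.Resource(gvr).Patch(ctx, name, types.JSONPatchType, patch, metav1.PatchOptions{})
	return err
}
```

The earlier strategic-merge attempt was simply not taking effect on the CRD, which is why the ownerReferences never actually went away.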
Notes so far from a meeting on this.

We don't think (but this needs to be verified) that deleting an owner ref will GC the object. And yeah, just playing with this briefly, that seems to be the case. And clearly the patching code wasn't working.

We think that the MCs going away may have something more to do with some sort of race condition on cluster bringup. Abhinav suggested the operator should pause rolling out any configs until the master pool has stabilized. I'm wondering if maybe the race is something like us rebooting the master before the other ones have come online, and etcd didn't get to finish committing the rendered MC?
OK I added a quick hack here: 836f5e0

Also I noticed that nothing is setting

So ideas for better waits in the operator; we could add a new

Hmm...what would be a good signal for "cluster is ready for MCO to start running"? Maybe when all the workers specified in the install config are online?
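As one possible shape for that last idea, here is a minimal sketch of such a wait, polling node readiness with client-go (the label selector, count, and 10-second interval are placeholders, not what the operator actually does):

```go
package operator

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForNodesReady blocks until at least `want` nodes matching the label
// selector report Ready, polling every 10 seconds until the context is
// cancelled. Transient list errors are ignored and retried.
func waitForNodesReady(ctx context.Context, client kubernetes.Interface, selector string, want int) error {
	return wait.PollUntilContextCancel(ctx, 10*time.Second, true, func(ctx context.Context) (bool, error) {
		nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return false, nil // transient error: keep polling
		}
		ready := 0
		for _, n := range nodes.Items {
			for _, cond := range n.Status.Conditions {
				if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
					ready++
					break
				}
			}
		}
		return ready >= want, nil
	})
}
```

Keying off the MCD status instead, as suggested later in the thread, would just change the condition being polled.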
And based on the discussion so far we shouldn't merge #318, right? Because it would make this worse (assuming the patch works, and I believe it does).

I forget, from the meeting did we have any ideas on what a good GC policy for the MCs would be? Did we land anywhere different from #301 (comment)?
That sounds reasonable. If we wanted to be extra explicit about flow, we could also keep MCDs from being scheduled on the workers until MCs are generated and available via the MCS. Just thinking out loud.
I thought that we had said that the 1 hour part didn't really work?
It's a bit the inverse of that, right? We want to wait for

EDIT: To clarify though, I like this idea of keying off the MCD status.
It definitely works. :) Though I'd agree we shouldn't merge it until we have a better understanding of what's going on.
Some work on "operator pause" here #329 - WIP, not really tested (since I'd have to make a custom release payload to do that sanely); going to hope CI runs it and see.
OK I think I'm finally understanding this. No MCs are being GC'd. The problem is that until the osimage work, we were relying on the fact that the MC generated by the bootstrap was exactly the same as the MC initially generated in the cluster. So we either need to render the osimageurl during bootstrap (probably best) or also render the base template MC without the osimageurl (which would seem fine). If I'm correct about this and one of those fixes works, this issue would then turn into the other problem of when we GC MCs.
👍 yes! The original architecture passed to us was that the MC from bootstrap would end up being the same. This can't ALWAYS be counted on, since it's possible someone may use an old MC when installing a new cluster, but in those cases the extra reboot and pivot to the updated MC would be expected and fine. I'm good with either or both of the above.
Closing in favor of #354
Splitting this out of #273 (comment)
#273 (comment)
Here are relevant mcc logs from a cluster spun up by that PR:
You can see that this caused very fast churn in the machineconfigs, and the previous ones were GC'd. But there were secondary masters that were still booting and expecting to be able to find that MC.
This is a tricky problem - we need to have a way to avoid pruning "in flight" MCs passed from the MCC to Ignition.