
switch MCP watching to full EventHandler implementation #269

Merged: 1 commit into openshift:main from improve-MCP-watching on Mar 2, 2023

Conversation

pmores (Contributor) commented Feb 28, 2023

The main motivation behind this is to give ourselves much more transparency and control over when precisely we reconcile on an MCP change.

The EnqueueRequestsFromMapFunc() based mechanism used so far invokes the registered function (mapKataConfigToRequests()) on any and all MCP changes. Combined with the mapKataConfigToRequests() implementation, which filed a reconcile.Request indiscriminately any time it was invoked, this meant we were triggering unnecessarily many reconciliations.

With this change, we only reconcile on relevant MCP changes ("worker" and "kata-oc"), unlike before, when we reconciled on any MCP change, however unrelated the changed MCP might have been to this controller. In addition, we only reconcile on MCP update, since reconciling on MCP creation or deletion doesn't seem useful to this controller. We also guard against spurious MCP changes that are irrelevant to us (e.g. timestamps) by making sure that we only reconcile when values actually used by the controller change (machine counts and the Updating/Updated conditions).
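
For illustration, a minimal sketch of the shape such a handler takes, assuming the controller-runtime v0.14 EventHandler signatures current at the time of this PR (v0.15 later added a context parameter to these methods). The type name matches the McpEventHandler mentioned later in this thread, but the method bodies here are simplified approximations, and the KataConfig request name is a placeholder, not the PR's actual code:

    package controllers

    import (
        mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/util/workqueue"
        "sigs.k8s.io/controller-runtime/pkg/event"
        "sigs.k8s.io/controller-runtime/pkg/reconcile"
    )

    // McpEventHandler implements handler.EventHandler, giving each event type
    // its own code path instead of funnelling everything through one map function.
    type McpEventHandler struct{}

    // Update enqueues a reconcile.Request only for relevant MCPs and only when
    // values the controller actually uses have changed.
    func (h *McpEventHandler) Update(e event.UpdateEvent, q workqueue.RateLimitingInterface) {
        mcpOld, okOld := e.ObjectOld.(*mcfgv1.MachineConfigPool)
        mcpNew, okNew := e.ObjectNew.(*mcfgv1.MachineConfigPool)
        if !okOld || !okNew {
            return
        }
        // Only the "worker" and "kata-oc" pools are relevant to this controller.
        if mcpNew.Name != "worker" && mcpNew.Name != "kata-oc" {
            return
        }
        countsChanged := mcpOld.Status.MachineCount != mcpNew.Status.MachineCount ||
            mcpOld.Status.ReadyMachineCount != mcpNew.Status.ReadyMachineCount ||
            mcpOld.Status.UpdatedMachineCount != mcpNew.Status.UpdatedMachineCount
        conditionsChanged := mcfgv1.IsMachineConfigPoolConditionTrue(mcpOld.Status.Conditions, mcfgv1.MachineConfigPoolUpdating) !=
            mcfgv1.IsMachineConfigPoolConditionTrue(mcpNew.Status.Conditions, mcfgv1.MachineConfigPoolUpdating) ||
            mcfgv1.IsMachineConfigPoolConditionTrue(mcpOld.Status.Conditions, mcfgv1.MachineConfigPoolUpdated) !=
                mcfgv1.IsMachineConfigPoolConditionTrue(mcpNew.Status.Conditions, mcfgv1.MachineConfigPoolUpdated)
        // Guard against spurious changes (e.g. timestamps): reconcile only when
        // machine counts or the Updating/Updated conditions actually changed.
        if !countsChanged && !conditionsChanged {
            return
        }
        // The KataConfig name here is a placeholder, not taken from the PR.
        q.Add(reconcile.Request{NamespacedName: types.NamespacedName{Name: "example-kataconfig"}})
    }

    // MCP creation and deletion aren't useful reconcile triggers for this
    // controller, so these methods enqueue nothing.
    func (h *McpEventHandler) Create(e event.CreateEvent, q workqueue.RateLimitingInterface)   {}
    func (h *McpEventHandler) Delete(e event.DeleteEvent, q workqueue.RateLimitingInterface)   {}
    func (h *McpEventHandler) Generic(e event.GenericEvent, q workqueue.RateLimitingInterface) {}

Compared to EnqueueRequestsFromMapFunc(), this makes the decision to enqueue explicit and local to each event type, which is what provides the transparency and control described above.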


// Excerpt (under review) from logMcpChange(), comparing the Updating/Updated conditions:
for _, condType := range []mcfgv1.MachineConfigPoolConditionType{"Updating", "Updated"} {
    condOld := mcfgv1.GetMachineConfigPoolCondition(statusOld, condType)
    condNew := mcfgv1.GetMachineConfigPoolCondition(statusNew, condType)
    condStatusOld := "<missing>"
Contributor:

Is the <missing> status string coming from the MCP code?

pmores (Contributor, Author):

Nope, it's just a string I picked because a) it's not a valid condition value and b) it looks good (i.e. is easy to understand) in the log.

It's meant to indicate that the condition (Updating or Updated) is literally missing from the MCP's condition array, which it legally can be.

Contributor:

It's probably better to add the above as a comment in the code, to help us when we look at the code again after a few months :-)

pmores (Contributor, Author) commented Mar 1, 2023:

I'm not against it, but I can't think of a wording that would say anything that isn't in the code already. I mean, the whole function is called logMcpChange(), so it seems expected that it just composes log messages...

Contributor:

@pmores I should have worded it better. My suggestion was specific to the <missing> status string.
Maybe you can declare a constant like const missingStatusString .. and use it. When we look back at this code after a few months, we'll know what <missing> is for and won't get confused about why we're using <missing> rather than an empty string.
Btw, just to be clear, I'm fine with the code as-is. It was just a suggestion based on past experience.
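
For illustration, a minimal sketch of what the suggestion might look like (the constant name comes from the comment above; the placement and wording of the doc comment are illustrative, not from the PR):

    // missingStatusString is logged for a condition (Updating or Updated) that
    // is absent from the MCP's condition array entirely, which is legal. It is
    // deliberately not a valid condition value and not an empty string, so its
    // meaning in the log is unambiguous.
    const missingStatusString = "<missing>"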

pmores (Contributor, Author):

Oh I see, I'm sorry, I misunderstood. Will do!

pmores (Contributor, Author):

Fixed by the last force push.

pmores force-pushed the improve-MCP-watching branch 2 times, most recently from 3e3853e to 9e02f32, on March 1, 2023 at 11:05

pmores (Contributor, Author) commented Mar 1, 2023:

Sorry about the force-push noise; the code to be reviewed isn't changing, though.

bpradipt (Contributor) left a comment:

/lgtm
Thanks @pmores

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on Mar 2, 2023
jensfr (Contributor) left a comment:

@pmores looks good to me, thank you! My only concern would be that we get a lot of new info log entries.

(The force-pushed commit's message is identical to the PR description above.)

Signed-off-by: Pavel Mores <pmores@redhat.com>
openshift-ci bot removed the lgtm label on Mar 2, 2023
openshift-ci bot commented Mar 2, 2023:

New changes are detected. LGTM label has been removed.

pmores (Contributor, Author) commented Mar 2, 2023:

@jensfr Oh, I put a lot of effort into ensuring that wouldn't be the case. First of all, we never log anything about MCPs other than "kata-oc" and "worker". For those, we get a single line for each of MCP creation, deletion and generic event (I've never seen one of the latter yet). These are useful as they create a context that makes the rest of the logging easier to understand.

So any substantial logging happens only on a "worker"/"kata-oc" update. However, even an update logs only changes that are relevant to the OSC controller (basically changes in the machine counts and the Updating/Updated conditions). (That's essentially the only purpose of factoring logging out of McpEventHandler.Update() into logMcpChange(): to ensure that we never hit the log until we are very sure it's appropriate.)
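
For reference, here's a hedged sketch of the kind of condition comparison logMcpChange() does, based on the excerpt quoted earlier in this thread; the function signature and exact log wording are approximations, not the verbatim code, and the "MCP name" key seen in the example log below presumably comes from the logger's context (e.g. log.WithValues):

    import (
        "github.com/go-logr/logr"
        mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
    )

    // logMcpChange logs only the status differences this controller cares about.
    func logMcpChange(log logr.Logger, statusOld, statusNew mcfgv1.MachineConfigPoolStatus) {
        for _, condType := range []mcfgv1.MachineConfigPoolConditionType{"Updating", "Updated"} {
            condOld := mcfgv1.GetMachineConfigPoolCondition(statusOld, condType)
            condNew := mcfgv1.GetMachineConfigPoolCondition(statusNew, condType)
            // "<missing>" marks a condition that is absent from the MCP's
            // condition array entirely, which is legal.
            condStatusOld, condStatusNew := "<missing>", "<missing>"
            if condOld != nil {
                condStatusOld = string(condOld.Status)
            }
            if condNew != nil {
                condStatusNew = string(condNew.Status)
            }
            if condStatusOld != condStatusNew {
                log.Info("mcp.status.conditions[] changed", "type", condType, "old", condStatusOld, "new", condStatusNew)
            }
        }
    }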

To give you an idea of how this looks in practice, here's an example log. It captures what happens when a worker is labelled to match kataConfigPoolSelector on a 2-worker cluster where the other worker is already a member of "kata-oc":

2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        MCP updated     {"MCP name": "worker"}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        MachineCount changed    {"MCP name": "worker", "old": 1, "new": 0}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        ReadyMachineCount changed       {"MCP name": "worker", "old": 1, "new": 0}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        UpdatedMachineCount changed     {"MCP name": "worker", "old": 1, "new": 0}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        DegradedMachineCount    {"MCP name": "worker", "#": 0}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        MCP updated     {"MCP name": "kata-oc"}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        MachineCount changed    {"MCP name": "kata-oc", "old": 1, "new": 2}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        ReadyMachineCount       {"MCP name": "kata-oc", "#": 1}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        UpdatedMachineCount     {"MCP name": "kata-oc", "#": 1}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        DegradedMachineCount    {"MCP name": "kata-oc", "#": 0}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        mcp.status.conditions[] changed {"MCP name": "kata-oc", "type": "Updating", "old": "False", "new": "True"}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        mcp.status.conditions[] changed {"MCP name": "kata-oc", "type": "Updated", "old": "True", "new": "False"}
2023-02-28T17:04:37Z    INFO    controllers.KataConfig.McpUpdate        MCP updated     {"MCP name": "kata-oc"}
2023-02-28T17:04:37Z    INFO    controllers.KataConfig.McpUpdate        MachineCount    {"MCP name": "kata-oc", "#": 2}
2023-02-28T17:04:37Z    INFO    controllers.KataConfig.McpUpdate        ReadyMachineCount changed       {"MCP name": "kata-oc", "old": 1, "new": 2}
2023-02-28T17:04:37Z    INFO    controllers.KataConfig.McpUpdate        UpdatedMachineCount changed     {"MCP name": "kata-oc", "old": 1, "new": 2}
2023-02-28T17:04:37Z    INFO    controllers.KataConfig.McpUpdate        DegradedMachineCount    {"MCP name": "kata-oc", "#": 0}
2023-02-28T17:04:37Z    INFO    controllers.KataConfig.McpUpdate        mcp.status.conditions[] changed {"MCP name": "kata-oc", "type": "Updating", "old": "True", "new": "False"}
2023-02-28T17:04:37Z    INFO    controllers.KataConfig.McpUpdate        mcp.status.conditions[] changed {"MCP name": "kata-oc", "type": "Updated", "old": "False", "new": "True"}

The first 5 lines show us that the labelled worker was drained from "worker" immediately, and assure us that "worker" was left in good shape.

The next 7 lines tell us that the worker joined "kata-oc" as it should (MachineCount going from 1 to 2) but is not ready yet (the Ready and Updated machine counts stay at 1), and that the MCP went from Updated to Updating.

The final 7 lines (three and a half minutes later) tell us that the new worker became Ready in "kata-oc" and the MCP settled (Updating -> Updated).

I found this logging extremely useful for understanding what's going on.

pmores merged commit fa64b2d into openshift:main on Mar 2, 2023