
switch MCP watching to full EventHandler implementation #269

Merged: 1 commit into openshift:main from improve-MCP-watching on Mar 2, 2023

Conversation

pmores (Contributor) commented Feb 28, 2023

The main motivation behind this is to give ourselves much more transparency and control over when precisely we reconcile on an MCP change.

The EnqueueRequestsFromMapFunc() based mechanism used so far invokes the registered function (mapKataConfigToRequests()) on any and all MCP changes. Combined with the mapKataConfigToRequests() implementation, which filed a reconcile.Request indiscriminately any time it was invoked, this meant we were triggering unnecessarily many reconciliations.

With this change, we only reconcile on relevant MCP changes ("worker" and "kata-oc"), unlike before, when we reconciled on any MCP change, however unrelated the changed MCP might have been to this controller. In addition, we only reconcile on MCP update, since reconciling on MCP creation or deletion doesn't seem useful to this controller. We also guard against spurious MCP changes that are irrelevant to us (e.g. timestamps) by making sure that we only reconcile when values actually used by the controller change (machine counts and the Updating/Updated conditions).
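
For illustration, a minimal sketch of the shape such a handler takes, assuming the controller-runtime v0.14 EventHandler signatures current at the time of this PR (v0.15 later added a context parameter to these methods). The type name matches the McpEventHandler mentioned later in this thread, but the method bodies here are simplified approximations, and the KataConfig request name is a placeholder, not the PR's actual code:

    package controllers

    import (
        mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/util/workqueue"
        "sigs.k8s.io/controller-runtime/pkg/event"
        "sigs.k8s.io/controller-runtime/pkg/reconcile"
    )

    // McpEventHandler implements handler.EventHandler, giving each event type
    // its own code path instead of funnelling everything through one map function.
    type McpEventHandler struct{}

    // Update enqueues a reconcile.Request only for relevant MCPs and only when
    // values the controller actually uses have changed.
    func (h *McpEventHandler) Update(e event.UpdateEvent, q workqueue.RateLimitingInterface) {
        mcpOld, okOld := e.ObjectOld.(*mcfgv1.MachineConfigPool)
        mcpNew, okNew := e.ObjectNew.(*mcfgv1.MachineConfigPool)
        if !okOld || !okNew {
            return
        }
        // Only the "worker" and "kata-oc" pools are relevant to this controller.
        if mcpNew.Name != "worker" && mcpNew.Name != "kata-oc" {
            return
        }
        countsChanged := mcpOld.Status.MachineCount != mcpNew.Status.MachineCount ||
            mcpOld.Status.ReadyMachineCount != mcpNew.Status.ReadyMachineCount ||
            mcpOld.Status.UpdatedMachineCount != mcpNew.Status.UpdatedMachineCount
        conditionsChanged := mcfgv1.IsMachineConfigPoolConditionTrue(mcpOld.Status.Conditions, mcfgv1.MachineConfigPoolUpdating) !=
            mcfgv1.IsMachineConfigPoolConditionTrue(mcpNew.Status.Conditions, mcfgv1.MachineConfigPoolUpdating) ||
            mcfgv1.IsMachineConfigPoolConditionTrue(mcpOld.Status.Conditions, mcfgv1.MachineConfigPoolUpdated) !=
                mcfgv1.IsMachineConfigPoolConditionTrue(mcpNew.Status.Conditions, mcfgv1.MachineConfigPoolUpdated)
        // Guard against spurious changes (e.g. timestamps): reconcile only when
        // machine counts or the Updating/Updated conditions actually changed.
        if !countsChanged && !conditionsChanged {
            return
        }
        // The KataConfig name here is a placeholder, not taken from the PR.
        q.Add(reconcile.Request{NamespacedName: types.NamespacedName{Name: "example-kataconfig"}})
    }

    // MCP creation and deletion aren't useful reconcile triggers for this
    // controller, so these methods enqueue nothing.
    func (h *McpEventHandler) Create(e event.CreateEvent, q workqueue.RateLimitingInterface)   {}
    func (h *McpEventHandler) Delete(e event.DeleteEvent, q workqueue.RateLimitingInterface)   {}
    func (h *McpEventHandler) Generic(e event.GenericEvent, q workqueue.RateLimitingInterface) {}

Compared to EnqueueRequestsFromMapFunc(), this makes the decision to enqueue explicit and local to each event type, which is what provides the transparency and control described above.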


// Excerpt (under review) from logMcpChange(), comparing the Updating/Updated conditions:
for _, condType := range []mcfgv1.MachineConfigPoolConditionType{"Updating", "Updated"} {
    condOld := mcfgv1.GetMachineConfigPoolCondition(statusOld, condType)
    condNew := mcfgv1.GetMachineConfigPoolCondition(statusNew, condType)
    condStatusOld := "<missing>"
Contributor:

Is the <missing> status string coming from the MCP code?

pmores (Contributor, Author):

Nope, it's just a string I picked because a) it's not a valid condition value and b) it looks good (i.e. is easy to understand) in the log.

It's meant to indicate that the condition (Updating or Updated) is literally missing from the MCP's condition array, which it legally can be.

Contributor:

It's probably better to add the above as a comment in the code, to help us when we look at the code again after a few months :-)

pmores (Contributor, Author) commented Mar 1, 2023:

I'm not against it, but I can't think of a wording that would say anything that isn't in the code already. I mean, the whole function is called logMcpChange(), so it seems expected that it just composes log messages...

Contributor:

@pmores I should have worded it better. My suggestion was specific to the <missing> status string.
Maybe you can declare a constant like const missingStatusString .. and use it. When we look back at this code after a few months, we'll know what <missing> is for and won't get confused about why we're using <missing> rather than an empty string.
Btw, just to be clear, I'm fine with the code as-is. It was just a suggestion based on past experience.
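
For illustration, a minimal sketch of what the suggestion might look like (the constant name comes from the comment above; the placement and wording of the doc comment are illustrative, not from the PR):

    // missingStatusString is logged for a condition (Updating or Updated) that
    // is absent from the MCP's condition array entirely, which is legal. It is
    // deliberately not a valid condition value and not an empty string, so its
    // meaning in the log is unambiguous.
    const missingStatusString = "<missing>"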

pmores (Contributor, Author):

Oh I see, I'm sorry, I misunderstood. Will do!

pmores (Contributor, Author):

Fixed by the last force push.

pmores force-pushed the improve-MCP-watching branch 2 times, most recently from 3e3853e to 9e02f32, on March 1, 2023 at 11:05

pmores (Contributor, Author) commented Mar 1, 2023:

Sorry about the force-push noise; the code to be reviewed isn't changing, though.

bpradipt (Contributor) left a comment:

/lgtm
Thanks @pmores

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on Mar 2, 2023
jensfr (Contributor) left a comment:

@pmores looks good to me, thank you! My only concern would be that we get a lot of new info log entries.

(The force-pushed commit's message is identical to the PR description above.)

Signed-off-by: Pavel Mores <pmores@redhat.com>
openshift-ci bot removed the lgtm label on Mar 2, 2023
openshift-ci bot commented Mar 2, 2023:

New changes are detected. LGTM label has been removed.

pmores (Contributor, Author) commented Mar 2, 2023:

@jensfr Oh, I put a lot of effort into ensuring that wouldn't be the case. First of all, we never log anything about MCPs other than "kata-oc" and "worker". For those, we get a single line for each of MCP creation, deletion and generic event (I've never seen one of the latter yet). These are useful as they create a context that makes the rest of the logging easier to understand.

So any substantial logging happens only on a "worker"/"kata-oc" update. However, even an update logs only changes that are relevant to the OSC controller (basically changes in the machine counts and the Updating/Updated conditions). (That's essentially the only purpose of factoring logging out of McpEventHandler.Update() into logMcpChange(): to ensure that we never hit the log until we are very sure it's appropriate.)
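
For reference, here's a hedged sketch of the kind of condition comparison logMcpChange() does, based on the excerpt quoted earlier in this thread; the function signature and exact log wording are approximations, not the verbatim code, and the "MCP name" key seen in the example log below presumably comes from the logger's context (e.g. log.WithValues):

    import (
        "github.com/go-logr/logr"
        mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
    )

    // logMcpChange logs only the status differences this controller cares about.
    func logMcpChange(log logr.Logger, statusOld, statusNew mcfgv1.MachineConfigPoolStatus) {
        for _, condType := range []mcfgv1.MachineConfigPoolConditionType{"Updating", "Updated"} {
            condOld := mcfgv1.GetMachineConfigPoolCondition(statusOld, condType)
            condNew := mcfgv1.GetMachineConfigPoolCondition(statusNew, condType)
            // "<missing>" marks a condition that is absent from the MCP's
            // condition array entirely, which is legal.
            condStatusOld, condStatusNew := "<missing>", "<missing>"
            if condOld != nil {
                condStatusOld = string(condOld.Status)
            }
            if condNew != nil {
                condStatusNew = string(condNew.Status)
            }
            if condStatusOld != condStatusNew {
                log.Info("mcp.status.conditions[] changed", "type", condType, "old", condStatusOld, "new", condStatusNew)
            }
        }
    }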

To give you an idea of how this looks in practice, here's an example log. It captures what happens when a worker is labelled to match kataConfigPoolSelector on a 2-worker cluster where the other worker is already a member of "kata-oc":

2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        MCP updated     {"MCP name": "worker"}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        MachineCount changed    {"MCP name": "worker", "old": 1, "new": 0}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        ReadyMachineCount changed       {"MCP name": "worker", "old": 1, "new": 0}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        UpdatedMachineCount changed     {"MCP name": "worker", "old": 1, "new": 0}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        DegradedMachineCount    {"MCP name": "worker", "#": 0}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        MCP updated     {"MCP name": "kata-oc"}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        MachineCount changed    {"MCP name": "kata-oc", "old": 1, "new": 2}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        ReadyMachineCount       {"MCP name": "kata-oc", "#": 1}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        UpdatedMachineCount     {"MCP name": "kata-oc", "#": 1}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        DegradedMachineCount    {"MCP name": "kata-oc", "#": 0}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        mcp.status.conditions[] changed {"MCP name": "kata-oc", "type": "Updating", "old": "False", "new": "True"}
2023-02-28T17:01:08Z    INFO    controllers.KataConfig.McpUpdate        mcp.status.conditions[] changed {"MCP name": "kata-oc", "type": "Updated", "old": "True", "new": "False"}
2023-02-28T17:04:37Z    INFO    controllers.KataConfig.McpUpdate        MCP updated     {"MCP name": "kata-oc"}
2023-02-28T17:04:37Z    INFO    controllers.KataConfig.McpUpdate        MachineCount    {"MCP name": "kata-oc", "#": 2}
2023-02-28T17:04:37Z    INFO    controllers.KataConfig.McpUpdate        ReadyMachineCount changed       {"MCP name": "kata-oc", "old": 1, "new": 2}
2023-02-28T17:04:37Z    INFO    controllers.KataConfig.McpUpdate        UpdatedMachineCount changed     {"MCP name": "kata-oc", "old": 1, "new": 2}
2023-02-28T17:04:37Z    INFO    controllers.KataConfig.McpUpdate        DegradedMachineCount    {"MCP name": "kata-oc", "#": 0}
2023-02-28T17:04:37Z    INFO    controllers.KataConfig.McpUpdate        mcp.status.conditions[] changed {"MCP name": "kata-oc", "type": "Updating", "old": "True", "new": "False"}
2023-02-28T17:04:37Z    INFO    controllers.KataConfig.McpUpdate        mcp.status.conditions[] changed {"MCP name": "kata-oc", "type": "Updated", "old": "False", "new": "True"}

The first 5 lines show us that the labelled worker was drained from "worker" immediately, and assure us that "worker" was left in good shape.

The next 7 lines tell us that the worker joined "kata-oc" as it should (MachineCount going from 1 to 2) but is not ready yet (the Ready and Updated machine counts stay at 1), and that the MCP went from Updated to Updating.

The final 7 lines (three and a half minutes later) tell us that the new worker became Ready in "kata-oc" and the MCP settled (Updating -> Updated).

I found this logging extremely useful for understanding what's going on.

pmores merged commit fa64b2d into openshift:main on Mar 2, 2023