jgallagher
Contributor

The change to enable blueprint execution will only affect newly-deployed systems that go through RSS after this change lands. Existing systems will need to have execution enabled via omdb.

The changes to the default chicken switches will:

  • enable the planner
  • disable zone additions if a mupdate override is present

on any systems that have not had reconfigurator chicken switches explicitly set (this is probably all deployed systems we care about; even the dogfood rack has not had any of these set explicitly).
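
A minimal sketch of the "defaults only apply when nothing was set explicitly" rule, assuming the effective value is the newest explicitly-set entry and falls back to the compiled-in default otherwise. The `ChickenSwitches` struct and `effective_switches` function are hypothetical stand-ins; the field names only mirror the `omdb` column headers, not the real Omicron types.

```rust
/// Hypothetical stand-in for the reconfigurator chicken switches.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ChickenSwitches {
    pub planner_enabled: bool,
    pub add_zones_with_mupdate_override: bool,
}

impl Default for ChickenSwitches {
    /// Defaults as described in this PR: planner on, zone additions off
    /// while a mupdate override is present.
    fn default() -> Self {
        Self { planner_enabled: true, add_zones_with_mupdate_override: false }
    }
}

/// Returns the switches in effect: the most recent explicitly-set entry if
/// one exists, otherwise the defaults. `history` is assumed to be ordered
/// oldest to newest.
pub fn effective_switches(history: &[ChickenSwitches]) -> ChickenSwitches {
    history.last().copied().unwrap_or_default()
}
```

Under this model, a system with an empty history picks up the new defaults as soon as it runs this code, while a system with any explicit entry keeps whatever was set last.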

Fixes #8960. Fixes #8961.

I'll run this through a racklette before merging.

@jgallagher
Contributor Author

Initial spot checks on a racklette look reasonable. We have exactly one blueprint and it's enabled:

root@oxz_switch0:~# omdb reconfigurator history
VERSN TIME                     BLUEPRINT
    1 2025-09-03T20:28:49.119Z 6792ed0d-b224-47e2-ab69-c67252f0c9b6  enabled: initial blueprint from rack setup

and execution is happening:

root@oxz_switch0:~# omdb nexus background-tasks show blueprint_executor
task: "blueprint_executor"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 4, triggered by a periodic timer firing
    started at 2025-09-03T20:29:58.877Z (13s ago) and ran for 1694ms
    target blueprint: 6792ed0d-b224-47e2-ab69-c67252f0c9b6
    execution:        enabled
    status:           completed (15 steps)
    error:            (none)

I don't think (?) we have an easy way to see "the current chicken switches" if they've never been set, but we can see that we haven't set any:

root@oxz_switch0:~# omdb reconfigurator chicken-switches-history
VERSION PLANNER_ENABLED ADD_ZONES_WITH_MUPDATE_OVERRIDE TIME_MODIFIED

The planner is running (and doing nothing as expected):

root@oxz_switch0:~# omdb nexus background-tasks show blueprint_planner
task: "blueprint_planner"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 8, triggered by a dependent task completing
    started at 2025-09-03T20:30:11.603Z (3s ago) and ran for 198ms
    plan unchanged from parent 6792ed0d-b224-47e2-ab69-c67252f0c9b6

and we can get the current value of the inner switches by asking for the entire reconfigurator state:

root@oxz_switch0:~# omdb reconfigurator export state.json
assembling reconfigurator state ... done
wrote state.json
root@oxz_switch0:~# jq .planning_input.policy.chicken_switches < state.json
{
  "add_zones_with_mupdate_override": false
}

Collaborator

@smklein smklein left a comment

I would have a few more questions (e.g., maybe checking that a planner run would be a no-op) if this changed existing systems - but for new deployments, seems good to me.

Seems like we should move existing systems to "enabled" pretty soon too, to avoid a dichotomy in our deployments.

@jgallagher
Contributor Author

> I would have a few more questions (e.g., maybe checking that a planner run would be a no-op) if this changed existing systems - but for new deployments, seems good to me.

Ehh, it does change planning enabled for existing systems (under certain circumstances I noted above, which I think in practice are true for every existing system). Do you want to ask those extra questions? 😅

> Seems like we should move existing systems to "enabled" pretty soon too, to avoid a dichotomy in our deployments.

I'm not entirely sure what this means - the next release ships update, so we have to move all of them to "enabled" as a part of that release anyway, right?

@jgallagher
Contributor Author

> > I would have a few more questions (e.g., maybe checking that a planner run would be a no-op) if this changed existing systems - but for new deployments, seems good to me.
>
> Ehh, it does change planning enabled for existing systems (under certain circumstances I noted above, which I think in practice are true for every existing system). Do you want to ask those extra questions? 😅

Oh, actually, I guess even though this does enable the planner on existing systems, the fact that it doesn't enable execution means if the planner actually does produce something unexpected, we won't try to execute it. We can add details about this to the update runbook for the release, I think.

@smklein
Collaborator

smklein commented Sep 3, 2025

Yeah, I was keying on:

> Existing systems will need to have execution enabled via omdb.

Basically, I was wondering "do we want to verify, before we let execution start running by itself, that it won't create/enact a plan that is immediately unreasonable"? And related: What should an operator be looking at, before flipping the execution bit to "enabled"?

My overwhelming assumption is "this should be fine", but figured we could at least check the temperature before fully jumping in.

@jgallagher
Contributor Author

This broke... a lot of tests. I think (?) the right move is to just disable this automation for tests, so I made several nontrivial changes in e00edaf:

  • The big rack handoff transaction that happens at RSS now inserts an initial ReconfiguratorChickenSwitches value
  • I added a ReconfiguratorAutomationConfig type that has production() and test() methods, and updated all the call sites of rack handoff to use the correct one
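
For readers skimming the thread, the shape of that type might look roughly like the sketch below. Only the type name and the `production()` / `test()` constructors come from the description above; the two fields, `HandoffRecord`, and `rack_handoff` are assumptions made up for illustration and are not the code in e00edaf.

```rust
/// Sketch of a config type with `production()` and `test()` constructors.
/// The fields are assumed, not taken from the real definition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ReconfiguratorAutomationConfig {
    pub blueprint_execution_enabled: bool,
    pub planner_enabled: bool,
}

impl ReconfiguratorAutomationConfig {
    /// Automation fully on, for real deployments.
    pub const fn production() -> Self {
        Self { blueprint_execution_enabled: true, planner_enabled: true }
    }

    /// Automation fully off, so tests drive planning and execution by hand.
    pub const fn test() -> Self {
        Self { blueprint_execution_enabled: false, planner_enabled: false }
    }
}

/// Hypothetical summary of what rack handoff writes to the database.
#[derive(Debug, PartialEq, Eq)]
pub struct HandoffRecord {
    pub target_blueprint_enabled: bool,
    pub initial_planner_enabled: bool,
}

/// Hypothetical call site: the caller picks the config, and handoff records
/// both the initial blueprint's enabled bit and the initial switches.
pub fn rack_handoff(config: ReconfiguratorAutomationConfig) -> HandoffRecord {
    HandoffRecord {
        target_blueprint_enabled: config.blueprint_execution_enabled,
        initial_planner_enabled: config.planner_enabled,
    }
}
```

Named constructors keep each call site self-describing: a reader sees `production()` or `test()` rather than a pair of bare booleans.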

I'm converting this to draft for now and will re-request review after CI passes and I get another round on a racklette to make sure this didn't break anything surprising.

@jgallagher jgallagher marked this pull request as draft September 4, 2025 15:40
jgallagher added a commit that referenced this pull request Sep 5, 2025
The chicken switch loading task exposes a watch channel that downstream
consumers can use to read the current value. It initializes that channel
to the `::default()` value of chicken switches and then reads the most
recent value from the db when it's activated, but that introduces a race
where consumers can see an incorrect value: if there is a chicken switch
value in the db, they can see the `::default()` _before_ the first
activation, allowing them to ignore the actual value in the db in favor
of whatever the default set is.

This PR closes the race window by setting the initial watch channel
value to `NotYetLoaded`, and then replacing that during the first
activation.

This fell out of trying to fix up tests on #8980, but is a legit bug and
easily separable from that work. So here it is!
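
The fix described in that commit message can be sketched as follows. This is a std-only model, assuming a shared cell in place of the real watch channel; `LoadState`, `SwitchesWatch`, and `planner_should_run` are hypothetical names, and only `NotYetLoaded` comes from the commit message.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical, trimmed-down chicken switches.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ChickenSwitches {
    pub planner_enabled: bool,
}

/// What consumers observe. Before the first activation they see
/// `NotYetLoaded` instead of a default that might contradict the db.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LoadState {
    NotYetLoaded,
    Loaded(ChickenSwitches),
}

/// Stand-in for the watch channel exposed by the loader task.
#[derive(Clone)]
pub struct SwitchesWatch {
    inner: Arc<RwLock<LoadState>>,
}

impl SwitchesWatch {
    /// The channel starts out as `NotYetLoaded`, never as a default value.
    pub fn new() -> Self {
        Self { inner: Arc::new(RwLock::new(LoadState::NotYetLoaded)) }
    }

    /// Snapshot of the current state, as a consumer would read it.
    pub fn current(&self) -> LoadState {
        *self.inner.read().unwrap()
    }

    /// Models an activation of the loader: publish the value it settled on.
    pub fn activate(&self, loaded: ChickenSwitches) {
        *self.inner.write().unwrap() = LoadState::Loaded(loaded);
    }
}

/// A consumer that refuses to act until a real value has been loaded.
pub fn planner_should_run(watch: &SwitchesWatch) -> bool {
    matches!(watch.current(), LoadState::Loaded(s) if s.planner_enabled)
}
```

With this shape a consumer has to handle the "not loaded yet" case explicitly, so it cannot mistake the pre-activation placeholder for an operator's setting.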
@jgallagher
Contributor Author

Alright, tests are passing, so I'm marking this as ready for review. The major changes from the original version of this PR are related to disabling both blueprint execution and planning when running under #[nexus_test]. I don't love the way I did either of these:

  • the rack_initialize() method now takes a blueprint_execution_enabled: bool that determines how the initial blueprint is inserted in the db
  • I added an optional set of "initial chicken switches" to Nexus's config; this is only provided by the test suite, and if present we insert this before turning on the background task system

but they're both functional enough. Feedback very welcome.
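
To make the second bullet concrete, here is a rough model of the startup ordering, assuming the optional config value is written to the database before any background task can run. `NexusConfig`, `Db`, `StartupStep`, and `start_nexus` are hypothetical stand-ins; only the field name `initial_reconfigurator_chicken_switches` is taken from this PR.

```rust
/// Hypothetical, trimmed-down chicken switches.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ChickenSwitches {
    pub planner_enabled: bool,
}

/// Hypothetical slice of the Nexus config. The field is optional and
/// defaults to `None`, which is what production configs would leave it as.
#[derive(Debug, Default)]
pub struct NexusConfig {
    pub initial_reconfigurator_chicken_switches: Option<ChickenSwitches>,
}

/// Hypothetical in-memory stand-in for the chicken switches table.
#[derive(Debug, Default)]
pub struct Db {
    pub chicken_switches_history: Vec<ChickenSwitches>,
}

/// Startup steps, recorded in the order they happen.
#[derive(Debug, PartialEq, Eq)]
pub enum StartupStep {
    InsertedInitialSwitches,
    StartedBackgroundTasks,
}

/// Inserts the configured switches (if any) first, then starts background
/// tasks, so the planner never observes a window without the test value.
pub fn start_nexus(config: &NexusConfig, db: &mut Db) -> Vec<StartupStep> {
    let mut steps = Vec::new();
    if let Some(switches) = config.initial_reconfigurator_chicken_switches {
        db.chicken_switches_history.push(switches);
        steps.push(StartupStep::InsertedInitialSwitches);
    }
    steps.push(StartupStep::StartedBackgroundTasks);
    steps
}
```

The ordering is the point of the sketch: if the insert happened after the background tasks started, a test could see one planner activation with the production defaults.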

I'll run this on a racklette again and make sure none of my "disable automation in tests" things broke the prod path, but I think this is ready for review again.

@jgallagher jgallagher marked this pull request as ready for review September 5, 2025 18:11
}
}).collect(),
).await.map_err(|e| {
err.set(RackInitError::BlueprintTargetSet(e)).unwrap();
Collaborator

Ah, thanks for patching this. Clearly I missed it.

Comment on lines +862 to +864
/// We use this hook to disable reconfigurator automation in the test suite
#[serde(default)]
pub initial_reconfigurator_chicken_switches:
Collaborator

Discussed this a little in chat, but wanted to make it public:

The usage of this planner_enabled option for tests is 100% reasonable, but not at all a "chicken switch":

  • We aren't leaving it as a toggle-able option to be conservative, and for enabling later
  • ... it's "just a config option", which we want to have set in some circumstances, and unset in others.

It probably behooves us to rename ReconfiguratorChickenSwitches to ReconfiguratorConfig if the list of settings is going to be long-lasting (which is fine! It's just confusing to call it a "chicken switch").

Collaborator

Follow-up: Obviously, doesn't need to block this PR.

@jgallagher
Contributor Author

> Initial spot checks on a racklette look reasonable.

I repeated all of these spot checks on the latest commit with all the "disable for tests" work, and they're all identical - execution and planning are still on by default for newly-deployed prod systems.

@jgallagher jgallagher merged commit 1fb71d1 into main Sep 5, 2025
17 checks passed
@jgallagher jgallagher deleted the john/automate-reconfigurator branch September 5, 2025 18:57

Development

Successfully merging this pull request may close these issues.

  • enable blueprint execution by default
  • enable blueprint planner by default

2 participants