[reconfigurator] Enable execution and planning by default #8980

jgallagher · 2025-09-03T19:50:27Z

The change to enable blueprint execution will only affect newly-deployed systems that go through RSS after this change lands. Existing systems will need to have execution enabled via omdb.

The changes to the default chicken switches will:

- enable the planner
- disable zone additions if a mupdate override is present

on any systems that have not had reconfigurator chicken switches explicitly set (this is probably all deployed systems we care about; even the dogfood rack has not had any of these set explicitly).
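
To make the "not explicitly set" rule concrete, here is a minimal sketch of how an effective value could be resolved. The field names are taken from the omdb column headers shown later in this thread; the `effective_switches` helper and the slice-as-history representation are assumptions for illustration, not the actual Nexus code.

```rust
/// Sketch of the two switches discussed in this PR. The field names mirror
/// the omdb columns; the real type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ChickenSwitches {
    pub planner_enabled: bool,
    pub add_zones_with_mupdate_override: bool,
}

impl Default for ChickenSwitches {
    /// The new defaults described above: planner on, zone additions under a
    /// mupdate override off.
    fn default() -> Self {
        Self { planner_enabled: true, add_zones_with_mupdate_override: false }
    }
}

/// Hypothetical helper: the most recent explicitly set value wins; a system
/// with an empty history falls back to the defaults.
pub fn effective_switches(history: &[ChickenSwitches]) -> ChickenSwitches {
    history.last().copied().unwrap_or_default()
}
```

Under this model, a system whose history is empty picks up the new defaults as soon as it runs the new code, while a system with any explicit entry keeps what was set.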

Fixes #8960. Fixes #8961.

I'll run this through a racklette before merging.

jgallagher · 2025-09-03T20:36:56Z

Initial spot checks on a racklette look reasonable. We have exactly one blueprint and it's enabled:

root@oxz_switch0:~# omdb reconfigurator history
VERSN TIME                     BLUEPRINT
    1 2025-09-03T20:28:49.119Z 6792ed0d-b224-47e2-ab69-c67252f0c9b6  enabled: initial blueprint from rack setup

and execution is happening:

root@oxz_switch0:~# omdb nexus background-tasks show blueprint_executor
task: "blueprint_executor"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 4, triggered by a periodic timer firing
    started at 2025-09-03T20:29:58.877Z (13s ago) and ran for 1694ms
    target blueprint: 6792ed0d-b224-47e2-ab69-c67252f0c9b6
    execution:        enabled
    status:           completed (15 steps)
    error:            (none)

I don't think (?) we have an easy way to see "the current chicken switches" if they've never been set, but we can see that we haven't set any:

root@oxz_switch0:~# omdb reconfigurator chicken-switches-history
VERSION PLANNER_ENABLED ADD_ZONES_WITH_MUPDATE_OVERRIDE TIME_MODIFIED

The planner is running (and doing nothing as expected):

root@oxz_switch0:~# omdb nexus background-tasks show blueprint_planner
task: "blueprint_planner"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 8, triggered by a dependent task completing
    started at 2025-09-03T20:30:11.603Z (3s ago) and ran for 198ms
    plan unchanged from parent 6792ed0d-b224-47e2-ab69-c67252f0c9b6

and we can get the current value of the inner switches by asking for the entire reconfigurator state:

root@oxz_switch0:~# omdb reconfigurator export state.json
assembling reconfigurator state ... done
wrote state.json
root@oxz_switch0:~# jq .planning_input.policy.chicken_switches < state.json
{
  "add_zones_with_mupdate_override": false
}

smklein

I would have a few more questions (e.g., maybe checking that a planner run would be a no-op) if this changed existing systems - but for new deployments, seems good to me.

Seems like we should move existing systems to "enabled" pretty soon too, to avoid a dichotomy in our deployments.

jgallagher · 2025-09-03T20:59:09Z

> I would have a few more questions (e.g., maybe checking that a planner run would be a no-op) if this changed existing systems - but for new deployments, seems good to me.

Ehh, it does change planning enabled for existing systems (under certain circumstances I noted above, which I think in practice are true for every existing system). Do you want to ask those extra questions? 😅

> Seems like we should move existing systems to "enabled" pretty soon too, to avoid a dichotomy in our deployments.

I'm not entirely sure what this means - the next release ships update, so we have to move all of them to "enabled" as a part of that release anyway, right?

jgallagher · 2025-09-03T21:02:08Z

> > I would have a few more questions (e.g., maybe checking that a planner run would be a no-op) if this changed existing systems - but for new deployments, seems good to me.
>
> Ehh, it does change planning enabled for existing systems (under certain circumstances I noted above, which I think in practice are true for every existing system). Do you want to ask those extra questions? 😅

Oh, actually, I guess even though this does enable the planner on existing systems, the fact that it doesn't enable execution means if the planner actually does produce something unexpected, we won't try to execute it. We can add details about this to the update runbook for the release, I think.

smklein · 2025-09-03T21:04:41Z

Yeah, I was keying on:

> Existing systems will need to have execution enabled via omdb.

Basically, I was wondering "do we want to verify, before we let execution start running by itself, that it won't create/enact a plan that is immediately unreasonable"? And related: What should an operator be looking at, before flipping the execution bit to "enabled"?

My overwhelming assumption is "this should be fine", but figured we could at least check the temperature before fully jumping in.

jgallagher · 2025-09-04T15:39:27Z

This broke... a lot of tests. I think (?) the right move is to just disable this automation for tests, so I made several nontrivial changes in e00edaf:

- The big rack handoff transaction that happens at RSS now inserts an initial ReconfiguratorChickenSwitches value
- I added a ReconfiguratorAutomationConfig type that has production() and test() methods, and updated all the call sites of rack handoff to use the correct one
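
For readers who have not looked at e00edaf, a type with `production()` and `test()` constructors might look roughly like the sketch below. Only the type name and the two method names come from this comment; the fields are assumptions for illustration and may not match the real definition.

```rust
/// Hypothetical sketch: the two fields are assumed, not taken from the commit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ReconfiguratorAutomationConfig {
    /// Whether the initial blueprint is inserted with execution enabled.
    pub blueprint_execution_enabled: bool,
    /// Whether the planner is allowed to run on its own.
    pub planner_enabled: bool,
}

impl ReconfiguratorAutomationConfig {
    /// Settings for real rack handoff: automation on by default.
    pub const fn production() -> Self {
        Self { blueprint_execution_enabled: true, planner_enabled: true }
    }

    /// Settings for the test suite: automation off, so tests are not
    /// perturbed by background planning or execution.
    pub const fn test() -> Self {
        Self { blueprint_execution_enabled: false, planner_enabled: false }
    }
}
```

Named constructors like these keep each call site explicit about which mode it wants, instead of passing a pair of bare booleans.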

I'm converting this to draft for now and will re-request review after CI passes and I get another round on a racklette to make sure this didn't break anything surprising.

The chicken switch loading task exposes a watch channel that downstream consumers can use to read the current value. It initializes that channel to the `::default()` value of chicken switches and then reads the most recent value from the db when it's activated, but that introduces a race where consumers can see an incorrect value: if there is a chicken switch value in the db, they can see the `::default()` _before_ the first activation, allowing them to ignore the actual value in the db in favor of whatever the default set is. This PR closes the race window by setting the initial watch channel value to `NotYetLoaded`, and then replacing that during the first activation. This fell out of trying to fix up tests on #8980, but is a legit bug and easily separable from that work. So here it is!
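
The shape of that fix can be sketched with the standard library alone. In the sketch below an `Arc<RwLock<..>>` stands in for the watch channel, and `Loader`, `LoadState`, and `planner_may_run` are hypothetical names; only `NotYetLoaded` comes from the description above. How real consumers react to `NotYetLoaded` is also an assumption here (this sketch treats it as "do nothing yet").

```rust
use std::sync::{Arc, RwLock};

/// Simplified stand-in for the switches; the default enables the planner.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ChickenSwitches {
    pub planner_enabled: bool,
}

impl Default for ChickenSwitches {
    fn default() -> Self {
        Self { planner_enabled: true }
    }
}

/// What consumers observe: either nothing has been read from the db yet, or
/// the value resolved by the most recent activation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LoadState {
    NotYetLoaded,
    Loaded(ChickenSwitches),
}

/// Hypothetical loader; the shared cell plays the role of the watch channel.
pub struct Loader {
    current: Arc<RwLock<LoadState>>,
}

impl Loader {
    /// Start in `NotYetLoaded` rather than publishing `::default()`.
    pub fn new() -> Self {
        Self { current: Arc::new(RwLock::new(LoadState::NotYetLoaded)) }
    }

    /// Hand a consumer a handle to the shared state.
    pub fn subscribe(&self) -> Arc<RwLock<LoadState>> {
        Arc::clone(&self.current)
    }

    /// One activation: publish the db value, or the default if the db has none.
    pub fn activate(&self, from_db: Option<ChickenSwitches>) {
        let value = from_db.unwrap_or_default();
        *self.current.write().unwrap() = LoadState::Loaded(value);
    }
}

/// Assumed consumer policy: never act on switches that have not been loaded.
pub fn planner_may_run(state: LoadState) -> bool {
    match state {
        LoadState::NotYetLoaded => false,
        LoadState::Loaded(switches) => switches.planner_enabled,
    }
}
```

With the old behavior, a consumer reading before the first activation would have seen `planner_enabled: true` even if the db said `false`; with the explicit `NotYetLoaded` state it can tell that no value is available yet.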

jgallagher · 2025-09-05T18:09:26Z

Alright, tests are passing, so I'm marking this as ready for review. The major changes from the original version of this PR are related to disabling both blueprint execution and planning when running under #[nexus_test]. I don't love the way I did either of these:

- the rack_initialize() method now takes a blueprint_execution_enabled: bool that determines how the initial blueprint is inserted in the db
- I added an optional set of "initial chicken switches" to Nexus's config; this is only provided by the test suite, and if present we insert this before turning on the background task system

but they're both functional enough. Feedback very welcome.

I'll run this on a racklette again and make sure none of my "disable automation in tests" things broke the prod path, but I think this is ready for review again.

smklein · 2025-09-05T18:37:30Z

nexus/db-queries/src/db/datastore/rack.rs

                                }
                            }).collect(),
                    ).await.map_err(|e| {
-                        err.set(RackInitError::BlueprintTargetSet(e)).unwrap();


Ah, thanks for patching this. Clearly I missed it.

smklein · 2025-09-05T18:43:47Z

nexus-config/src/nexus_config.rs

+    /// We use this hook to disable reconfigurator automation in the test suite
+    #[serde(default)]
+    pub initial_reconfigurator_chicken_switches:


Discussed this a little in chat, but wanted to make it public:

The usage of this planner_enabled option for tests is 100% reasonable, but not at all a "chicken switch":

We aren't leaving it as a toggle-able option to be conservative, and for enabling later

... it's "just a config option", which we want to have set in some circumstances, and unset in others.

It probably behooves us to rename ReconfiguratorChickenSwitches to ReconfiguratorConfig, if the list of settings is going to be long-lasting (which is fine! It's just confusing to call it a "chicken switch").

Follow-up: Obviously, doesn't need to block this PR.

jgallagher · 2025-09-05T18:57:08Z

> Initial spot checks on a racklette look reasonable.

I repeated all of these spot checks on the latest commit with all the "disable for tests" work, and they're all identical - execution and planning are still on by default for newly-deployed prod systems.

jgallagher requested review from davepacheco and smklein September 3, 2025 19:50

smklein approved these changes Sep 3, 2025

disable reconfigurator automation in tests

e00edaf

jgallagher marked this pull request as draft September 4, 2025 15:40

jgallagher added 5 commits September 4, 2025 13:24

expectorate

bb7b550

really disable planning in tests

9ab400e

expectorate

8c1d38c

fix and expand chicken switches task tests

51cc8af

clippy

996fb4d

jgallagher mentioned this pull request Sep 4, 2025

[reconfigurator] Fix race in chicken switch loader #8998

Merged

hakari

f52625b

Merge remote-tracking branch 'origin/main' into john/automate-reconfi…

0b1f6fb

…gurator

jgallagher marked this pull request as ready for review September 5, 2025 18:11

jgallagher requested review from andrewjstone and smklein September 5, 2025 18:11

smklein approved these changes Sep 5, 2025

jgallagher merged commit 1fb71d1 into main Sep 5, 2025
17 checks passed

jgallagher deleted the john/automate-reconfigurator branch September 5, 2025 18:57

jgallagher mentioned this pull request Sep 5, 2025

Rename ReconfiguratorChickenSwitches* to ReconfiguratorConfig* #9005

Merged

jgallagher added a commit that referenced this pull request Sep 9, 2025

Rename ReconfiguratorChickenSwitches* to ReconfiguratorConfig* (#…

e44ad55

…9005) Followup from #8980 (comment).
