Skip to content

Conversation

davepacheco
Copy link
Collaborator

@davepacheco davepacheco commented Sep 22, 2025

This is not ready yet.

  • decide if we want this to block on Complete target-release test simulating an entire update #9059 rather than have the existing callers just use the current blueprint's nexus generation (as this PR currently does)
  • figure out what other tests to add
    • explicit tests for blueprint_internal_dns_config/blueprint_external_dns_config for these cases (not doing this)
    • reconfigurator-cli tests (using blueprint-diff-dns)
  • update the handoff live test to check both internal and external DNS before and after the handoff

Fixes #8505.

@davepacheco davepacheco changed the base branch from main to john/target-release-test-nexus-handoff September 24, 2025 00:22
@davepacheco davepacheco changed the title WIP: internal/external DNS should only contain records for active Nexus instances internal/external DNS should only contain records for active Nexus instances Sep 24, 2025
Copy link
Collaborator Author

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ready for review, but I'm keeping it draft because it depends on #9059.

There was some discussion of using the reconfigurator-cli simulated state around the active Nexus generation when computing the contents of internal/external DNS. This would be appropriate if we had reconfigurator-cli commands that show DNS information for a blueprint based on the current simulated state (like blueprint-show might). But we don't have such a command today. (blueprint-show in reconfigurator-cli doesn't show DNS information.) The only commands that compute DNS contents are load-example (which is doing it for a freshly-created example system), blueprint-diff (which is diff'ing historical blueprints and we can't assume the simulated state applies to them), and blueprint-diff-dns (where you give it the specific DNS contents to diff against, so this too could be a historical blueprint, not a current one). So in the end, I chose to always use the blueprint's own generation. There's a big comment on blueprint_active_nexus_generation() explaining why.

Comment on lines 7896 to 7901
DNS zone: "control-plane.oxide.internal" (unchanged)
unchanged names: 51 (records: 65)

external DNS:
* DNS zone: "oxide.example":
* name: example-silo.sys (records: 3 -> 6)
- A 192.0.2.2
- A 192.0.2.3
- A 192.0.2.4
+ A 192.0.2.2
+ A 192.0.2.7
+ A 192.0.2.3
+ A 192.0.2.6
+ A 192.0.2.4
+ A 192.0.2.5
unchanged names: 4 (records: 6)
DNS zone: "oxide.example" (unchanged)
unchanged names: 5 (records: 9)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change reflects the bug fixed by this PR. Before, we were updating DNS to include both old and new zones as soon as new ones were created. Now, we don't change DNS at this point in the process.

Comment on lines 7971 to 8002
* DNS zone: "control-plane.oxide.internal":
- name: 0c71b3b2-6ceb-4e8f-b020-b08675e83038.host (records: 1)
- AAAA fd00:1122:3344:101::22
- name: 3eeb8d49-eb1a-43f8-bb64-c2338421c2c6.host (records: 1)
- AAAA fd00:1122:3344:103::22
- name: 466a9f29-62bf-4e63-924a-b9efdb86afec.host (records: 1)
- AAAA fd00:1122:3344:102::22
+ name: 90dbd6f3-9bcb-4a62-ad85-d61f5b3a36ad.host (records: 1)
+ AAAA fd00:1122:3344:103::2b
+ name: 9ae90740-7fdb-4073-ae43-048f2fca3d69.host (records: 1)
+ AAAA fd00:1122:3344:102::2c
* name: _nexus._tcp (records: 3 -> 3)
- SRV port 12221 0c71b3b2-6ceb-4e8f-b020-b08675e83038.host.control-plane.oxide.internal
- SRV port 12221 3eeb8d49-eb1a-43f8-bb64-c2338421c2c6.host.control-plane.oxide.internal
- SRV port 12221 466a9f29-62bf-4e63-924a-b9efdb86afec.host.control-plane.oxide.internal
+ SRV port 12221 90dbd6f3-9bcb-4a62-ad85-d61f5b3a36ad.host.control-plane.oxide.internal
+ SRV port 12221 9ae90740-7fdb-4073-ae43-048f2fca3d69.host.control-plane.oxide.internal
+ SRV port 12221 d516d61b-fd96-46ad-a743-78eec814ee90.host.control-plane.oxide.internal
+ name: d516d61b-fd96-46ad-a743-78eec814ee90.host (records: 1)
+ AAAA fd00:1122:3344:101::2b
unchanged names: 47 (records: 59)

external DNS:
DNS zone: "oxide.example" (unchanged)
unchanged names: 5 (records: 12)
* DNS zone: "oxide.example":
* name: example-silo.sys (records: 3 -> 3)
- A 192.0.2.2
- A 192.0.2.3
- A 192.0.2.4
+ A 192.0.2.7
+ A 192.0.2.6
+ A 192.0.2.5
unchanged names: 4 (records: 6)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, this change reflects the bug fixed by this PR. Before, we had already updated DNS (wrongly) at this point in the process. This is the correct time to do it (we've just done the handoff).

@davepacheco
Copy link
Collaborator Author

I wound up backing out the change that showed the DNS changes across the handoff itself. That was using blueprint-diff-dns before I decided it probably doesn't make sense for that command to look at the current simulated state after all.

Base automatically changed from john/target-release-test-nexus-handoff to main September 24, 2025 14:13
@davepacheco davepacheco marked this pull request as ready for review September 24, 2025 15:43
- A 192.0.2.5
+ A 192.0.2.7
+ A 192.0.2.6
+ A 192.0.2.5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, I didn't notice this before (and this is not relevant to this PR's changes), but it looks like our DNS diffing code could use some tweaks? This was previously saying we removed 192.0.2.{4,5,6,7} and added 192.0.2.{5,6,7}; shouldn't it have just reported that we were removing 4 and keeping 5,6,7?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be an improvement, yeah.

"Deploy DNS records",
async move |cx| {
let Some(nexus_id) = nexus_id else {
return StepSkipped::new((), "not running as Nexus").into();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this impact reconfigurator-exec-unsafe in some way that needs documenting or warnings from it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does mean this step will no longer be done by that tool. That should be visible in the output (that this step was skipped). There are other steps it already doesn't do.

@davepacheco davepacheco merged commit cf57f89 into main Sep 24, 2025
17 checks passed
@davepacheco davepacheco deleted the dap/handoff-dns branch September 24, 2025 17:47
leftwo pushed a commit that referenced this pull request Sep 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

schema update: manage internal/external DNS
3 participants