The issue was found on rack3 where VMs were unable to get outbound connectivity consistently (oxidecomputer/customer-support#385). The NAT synchronization RPW appeared to be working on both switches but one of them turned out to be stuck in some way. The troubleshooting notes are duplicated below:
From what we can tell the last complete synchronization between switch0 and nexus on BRM42220088 was at 11:58:05.960Z. We can confirm this via dendrite logs:
{"msg":"starting ipv4 nat reconciliation","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.953117225Z","hostname":"oxz_switch","pid":2685,"task":"ipv4_nat","unit":"workflow_server"}
{"msg":"fetching nat generation","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.953145038Z","hostname":"oxz_switch","pid":2685}
{"msg":"we are currently at ipv4 nat generation: 97491","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.953154536Z","hostname":"oxz_switch","pid":2685,"task":"ipv4_nat","unit":"workflow_server"}
{"msg":"checking Nexus for updates","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.953163894Z","hostname":"oxz_switch","pid":2685,"task":"ipv4_nat","unit":"workflow_server"}
{"msg":"client request","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.95317251Z","hostname":"oxz_switch","pid":2685,"component":"NexusClient","unit":"workflow_server","body":"None","uri":"http://_nexus._tcp.control-plane.oxide.internal:12221/nat/ipv4/changeset/97491?limit=100","method":"GET"}
{"msg":"lookup_sockets_v6_raw srv","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.953186016Z","hostname":"oxz_switch","pid":2685,"unit":"workflow_server","dns_name":"_nexus._tcp.control-plane.oxide.internal"}
{"msg":"client response","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.960520925Z","hostname":"oxz_switch","pid":2685,"component":"NexusClient","unit":"workflow_server","result":"Ok(Response { url: \"http://_nexus._tcp.control-plane.oxide.internal:12221/nat/ipv4/changeset/97491?limit=100\", status: 200, headers: {\"content-type\": \"application/json\", \"x-request-id\": \"713ab46c-d961-44e9-bd5f-a1ee7cb0a4b3\", \"content-length\": \"2\", \"date\": \"Sat, 03 May 2025 11:58:05 GMT\"} })"}
{"msg":"request successful","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.960553846Z","hostname":"oxz_switch","pid":2685,"task":"ipv4_nat","unit":"workflow_server","response":"[]"}
{"msg":"no further updates found","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.96056415Z","hostname":"oxz_switch","pid":2685,"task":"ipv4_nat","unit":"workflow_server"}
And the corresponding Nexus logs:
11:58:05.955Z DEBG 1cfdb5b6-e568-436a-a85f-7fecf1b8eef2 (dropshot_internal): roles
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:10e::3]:12221
method = GET
remote_addr = [fd00:1122:3344:111::2]:36390
req_id = 713ab46c-d961-44e9-bd5f-a1ee7cb0a4b3
roles = RoleSet { roles: {} }
uri = /nat/ipv4/changeset/97491?limit=100
11:58:05.955Z DEBG 1cfdb5b6-e568-436a-a85f-7fecf1b8eef2 (dropshot_internal): authorize result
action = Query
actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:10e::3]:12221
method = GET
remote_addr = [fd00:1122:3344:111::2]:36390
req_id = 713ab46c-d961-44e9-bd5f-a1ee7cb0a4b3
resource = Database
result = Ok(())
uri = /nat/ipv4/changeset/97491?limit=100
11:58:05.960Z INFO 1cfdb5b6-e568-436a-a85f-7fecf1b8eef2 (dropshot_internal): request completed
file = /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.16.0/src/server.rs:867
latency_us = 5037
local_addr = [fd00:1122:3344:10e::3]:12221
method = GET
remote_addr = [fd00:1122:3344:111::2]:36390
req_id = 713ab46c-d961-44e9-bd5f-a1ee7cb0a4b3
response_code = 200
uri = /nat/ipv4/changeset/97491?limit=100
Based on the logs of the next request from dendrite:
{"msg":"starting ipv4 nat reconciliation","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:36.001800357Z","hostname":"oxz_switch","pid":2685,"task":"ipv4_nat","unit":"workflow_server"}
{"msg":"fetching nat generation","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:36.00181215Z","hostname":"oxz_switch","pid":2685}
{"msg":"we are currently at ipv4 nat generation: 97491","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:36.001820075Z","hostname":"oxz_switch","pid":2685,"task":"ipv4_nat","unit":"workflow_server"}
{"msg":"checking Nexus for updates","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:36.001828451Z","hostname":"oxz_switch","pid":2685,"task":"ipv4_nat","unit":"workflow_server"}
{"msg":"client request","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:36.001836346Z","hostname":"oxz_switch","pid":2685,"component":"NexusClient","unit":"workflow_server","body":"None","uri":"http://_nexus._tcp.control-plane.oxide.internal:12221/nat/ipv4/changeset/97491?limit=100","method":"GET"}
We never complete this synchronization. On the nexus side we do not see any corresponding request from the logs. This seems to indicate that the request was never received.
We also checked that /rpw/nat/ipv4/trigger is still working on both switch0 and switch1 which both take a write lock on the Arc<RwLock<Instant>> timer. This further indicates that we are not stuck waiting on a lock for it in the NAT synchronization.
There are some proposals in the linked ticket for tracing and tooling enhancements to provide more debug information about the failure.
The issue was found on rack3 where VMs were unable to get outbound connectivity consistently (oxidecomputer/customer-support#385). The NAT synchronization RPW appeared to be working on both switches but one of them turned out to be stuck in some way. The troubleshooting notes are duplicated below:
From what we can tell the last complete synchronization between
switch0andnexusonBRM42220088was at 11:58:05.960Z. We can confirm this via dendrite logs:And the corresponding Nexus logs:
Based on the logs of the next request from dendrite:
We never complete this synchronization. On the nexus side we do not see any corresponding request from the logs. This seems to indicate that the request was never received.
We also checked that
/rpw/nat/ipv4/triggeris still working on bothswitch0andswitch1which both take a write lock on theArc<RwLock<Instant>>timer. This further indicates that we are not stuck waiting on a lock for it in the NAT synchronization.There are some proposals in the linked ticket for tracing and tooling enhancements to provide more debug information about the failure.