Skip to content

Need to detect stuck NAT synchronization #47

@askfongjojo

Description

@askfongjojo

The issue was found on rack3 where VMs were unable to get outbound connectivity consistently (oxidecomputer/customer-support#385). The NAT synchronization RPW appeared to be working on both switches but one of them turned out to be stuck in some way. The troubleshooting notes are duplicated below:


From what we can tell the last complete synchronization between switch0 and nexus on BRM42220088 was at 11:58:05.960Z. We can confirm this via dendrite logs:

{"msg":"starting ipv4 nat reconciliation","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.953117225Z","hostname":"oxz_switch","pid":2685,"task":"ipv4_nat","unit":"workflow_server"}
{"msg":"fetching nat generation","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.953145038Z","hostname":"oxz_switch","pid":2685}
{"msg":"we are currently at ipv4 nat generation: 97491","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.953154536Z","hostname":"oxz_switch","pid":2685,"task":"ipv4_nat","unit":"workflow_server"}
{"msg":"checking Nexus for updates","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.953163894Z","hostname":"oxz_switch","pid":2685,"task":"ipv4_nat","unit":"workflow_server"}
{"msg":"client request","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.95317251Z","hostname":"oxz_switch","pid":2685,"component":"NexusClient","unit":"workflow_server","body":"None","uri":"http://_nexus._tcp.control-plane.oxide.internal:12221/nat/ipv4/changeset/97491?limit=100","method":"GET"}
{"msg":"lookup_sockets_v6_raw srv","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.953186016Z","hostname":"oxz_switch","pid":2685,"unit":"workflow_server","dns_name":"_nexus._tcp.control-plane.oxide.internal"}
{"msg":"client response","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.960520925Z","hostname":"oxz_switch","pid":2685,"component":"NexusClient","unit":"workflow_server","result":"Ok(Response { url: \"http://_nexus._tcp.control-plane.oxide.internal:12221/nat/ipv4/changeset/97491?limit=100\", status: 200, headers: {\"content-type\": \"application/json\", \"x-request-id\": \"713ab46c-d961-44e9-bd5f-a1ee7cb0a4b3\", \"content-length\": \"2\", \"date\": \"Sat, 03 May 2025 11:58:05 GMT\"} })"}
{"msg":"request successful","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.960553846Z","hostname":"oxz_switch","pid":2685,"task":"ipv4_nat","unit":"workflow_server","response":"[]"}
{"msg":"no further updates found","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:05.96056415Z","hostname":"oxz_switch","pid":2685,"task":"ipv4_nat","unit":"workflow_server"}

And the corresponding Nexus logs:

11:58:05.955Z DEBG 1cfdb5b6-e568-436a-a85f-7fecf1b8eef2 (dropshot_internal): roles
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:10e::3]:12221
    method = GET
    remote_addr = [fd00:1122:3344:111::2]:36390
    req_id = 713ab46c-d961-44e9-bd5f-a1ee7cb0a4b3
    roles = RoleSet { roles: {} }
    uri = /nat/ipv4/changeset/97491?limit=100
11:58:05.955Z DEBG 1cfdb5b6-e568-436a-a85f-7fecf1b8eef2 (dropshot_internal): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:10e::3]:12221
    method = GET
    remote_addr = [fd00:1122:3344:111::2]:36390
    req_id = 713ab46c-d961-44e9-bd5f-a1ee7cb0a4b3
    resource = Database
    result = Ok(())
    uri = /nat/ipv4/changeset/97491?limit=100
11:58:05.960Z INFO 1cfdb5b6-e568-436a-a85f-7fecf1b8eef2 (dropshot_internal): request completed
    file = /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.16.0/src/server.rs:867
    latency_us = 5037
    local_addr = [fd00:1122:3344:10e::3]:12221
    method = GET
    remote_addr = [fd00:1122:3344:111::2]:36390
    req_id = 713ab46c-d961-44e9-bd5f-a1ee7cb0a4b3
    response_code = 200
    uri = /nat/ipv4/changeset/97491?limit=100

Based on the logs of the next request from dendrite:

{"msg":"starting ipv4 nat reconciliation","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:36.001800357Z","hostname":"oxz_switch","pid":2685,"task":"ipv4_nat","unit":"workflow_server"}
{"msg":"fetching nat generation","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:36.00181215Z","hostname":"oxz_switch","pid":2685}
{"msg":"we are currently at ipv4 nat generation: 97491","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:36.001820075Z","hostname":"oxz_switch","pid":2685,"task":"ipv4_nat","unit":"workflow_server"}
{"msg":"checking Nexus for updates","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:36.001828451Z","hostname":"oxz_switch","pid":2685,"task":"ipv4_nat","unit":"workflow_server"}
{"msg":"client request","v":0,"name":"dpd","level":20,"time":"2025-05-03T11:58:36.001836346Z","hostname":"oxz_switch","pid":2685,"component":"NexusClient","unit":"workflow_server","body":"None","uri":"http://_nexus._tcp.control-plane.oxide.internal:12221/nat/ipv4/changeset/97491?limit=100","method":"GET"}

We never complete this synchronization. On the nexus side we do not see any corresponding request from the logs. This seems to indicate that the request was never received.

We also checked that /rpw/nat/ipv4/trigger is still working on both switch0 and switch1 which both take a write lock on the Arc<RwLock<Instant>> timer. This further indicates that we are not stuck waiting on a lock for it in the NAT synchronization.


There are some proposals in the linked ticket for tracing and tooling enhancements to provide more debug information about the failure.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions