Add VolumeInfo #1928
Add a new VolumeInfo enum that will be returned and that provides a richer tree of information matching the shape of the Volume. The intended consumer of this is the control plane in a few areas:

- when performing region replacement or region snapshot replacement, the control plane needs to know when to consider the live repair or reconciliation successful, ultimately proceeding to cleaning up the temporary resources and continuing with another replacement. The upstairs has always been the source of this answer, but the plan is to move from using activation as this signal to using this introduced enum
- when activating with only 2 out of 3 downstairs available, the control plane needs to know the difference between an unhealthy volume and one that activated early with 2 out of 3 but eventually had all 3 mirrors online
- in the future, the control plane could query for health when performing updates or sled reboots, pausing until impacted Volumes become healthy again

The VolumeInfo enum already existed, so this PR renames that to VolumeExtentInfo. Eventually we should probably combine the two.
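For orientation, here is a sketch of roughly what the returned tree could look like, reverse-engineered from the example output below rather than copied from this PR's code; the type and field names (`SubVolumeInfo`, `UpstairsInfo`, `TargetInfo`) are assumptions and the real definitions may differ:

```rust
use serde::{Deserialize, Serialize};
use uuid::Uuid;

// Hypothetical shapes inferred from the JSON examples in this description.
#[derive(Debug, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum VolumeInfo {
    // Serializes as {"volume": {"sub_volumes": [...], "read_only_parent": ...}}
    Volume {
        sub_volumes: Vec<SubVolumeInfo>,
        read_only_parent: Option<Box<VolumeInfo>>,
    },
}

#[derive(Debug, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum SubVolumeInfo {
    // Serializes as {"upstairs": {...}}
    Upstairs(UpstairsInfo),
}

#[derive(Debug, Serialize, Deserialize)]
pub struct UpstairsInfo {
    pub state: String, // "active", "go_active", ... (modeled loosely here)
    pub block_size: u64,
    pub upstairs_id: Uuid,
    pub session_id: Uuid,
    pub generation: u64,
    pub read_only: bool,
    pub encrypted: bool,
    pub reconcile_in_progress: bool,
    pub live_repair_in_progress: bool,
    pub targets: Vec<TargetInfo>,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct TargetInfo {
    pub region_id: Uuid,
    pub target_addr: String,
    pub repair_addr: Option<String>, // null while a downstairs is unreachable
    pub state: serde_json::Value,    // tagged state object, sketched later
}
```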
Example output from a healthy volume:

```json
"info": {
"volume": {
"sub_volumes": [
{
"upstairs": {
"state": "active",
"block_size": 512,
"upstairs_id": "08166582-285a-46a4-98cc-6f5eb966733b",
"session_id": "db05a92f-f220-4611-aa4b-3d448f885045",
"generation": 1776970947,
"read_only": false,
"encrypted": true,
"reconcile_in_progress": false,
"live_repair_in_progress": false,
"targets": [
{
"region_id": "b3092ac3-115a-4480-8d30-3c20fca254c2",
"target_addr": "127.0.0.1:44101",
"repair_addr": "[::]:48101",
"state": {
"type": "active"
}
},
{
"region_id": "73c9cd9f-8c8e-4cea-96ff-cae21dcc53dc",
"target_addr": "127.0.0.1:44102",
"repair_addr": "[::]:48102",
"state": {
"type": "active"
}
},
{
"region_id": "9063af2f-7331-41ba-8bc8-db5d308a1228",
"target_addr": "127.0.0.1:44103",
"repair_addr": "[::]:48103",
"state": {
"type": "active"
}
}
]
}
}
],
"read_only_parent": null
}
}
```
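Everything a consumer needs for a "healthy" verdict can be read straight off this tree. Below is a minimal sketch of such a check, walking the raw JSON rather than typed structs; this is an illustration of the idea, not the control plane's actual logic:

```rust
use serde_json::Value;

/// Treat a volume as healthy when every sub-volume's upstairs is active,
/// no reconcile or live repair is in flight, and every target is active.
/// Field names follow the example output in this PR description.
fn volume_is_healthy(info: &Value) -> bool {
    let Some(volume) = info.get("volume") else { return false };
    let Some(sub_volumes) = volume.get("sub_volumes").and_then(Value::as_array)
    else {
        return false;
    };

    sub_volumes.iter().all(|sv| {
        let Some(up) = sv.get("upstairs") else { return false };
        let active = up.get("state").and_then(Value::as_str) == Some("active");
        let quiet = up.get("reconcile_in_progress").and_then(Value::as_bool)
            == Some(false)
            && up.get("live_repair_in_progress").and_then(Value::as_bool)
                == Some(false);
        let targets_ok = up
            .get("targets")
            .and_then(Value::as_array)
            .is_some_and(|ts| {
                ts.iter().all(|t| {
                    t.get("state")
                        .and_then(|s| s.get("type"))
                        .and_then(Value::as_str)
                        == Some("active")
                })
            });
        active && quiet && targets_ok
    })
}
```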
Example where I took the first downstairs offline:

```json
"info": {
"volume": {
"sub_volumes": [
{
"upstairs": {
"state": "active",
"block_size": 512,
"upstairs_id": "08166582-285a-46a4-98cc-6f5eb966733b",
"session_id": "db05a92f-f220-4611-aa4b-3d448f885045",
"generation": 1776970947,
"read_only": false,
"encrypted": true,
"reconcile_in_progress": false,
"live_repair_in_progress": false,
"targets": [
{
"region_id": "b3092ac3-115a-4480-8d30-3c20fca254c2",
"target_addr": "127.0.0.1:44101",
"repair_addr": null,
"state": {
"type": "connecting",
"state": "negotiating",
"mode": "faulted"
}
},
{
"region_id": "73c9cd9f-8c8e-4cea-96ff-cae21dcc53dc",
"target_addr": "127.0.0.1:44102",
"repair_addr": "[::]:48102",
"state": {
"type": "active"
}
},
{
"region_id": "9063af2f-7331-41ba-8bc8-db5d308a1228",
"target_addr": "127.0.0.1:44103",
"repair_addr": "[::]:48103",
"state": {
"type": "active"
}
}
]
}
}
],
"read_only_parent": null
}
}
```

then brought it back up:

```json
"info": {
"volume": {
"sub_volumes": [
{
"upstairs": {
"state": "active",
"block_size": 512,
"upstairs_id": "08166582-285a-46a4-98cc-6f5eb966733b",
"session_id": "db05a92f-f220-4611-aa4b-3d448f885045",
"generation": 1776970947,
"read_only": false,
"encrypted": true,
"reconcile_in_progress": false,
"live_repair_in_progress": true,
"targets": [
{
"region_id": "b3092ac3-115a-4480-8d30-3c20fca254c2",
"target_addr": "127.0.0.1:44101",
"repair_addr": "[::]:48101",
"state": {
"type": "live_repair"
}
},
{
"region_id": "73c9cd9f-8c8e-4cea-96ff-cae21dcc53dc",
"target_addr": "127.0.0.1:44102",
"repair_addr": "[::]:48102",
"state": {
"type": "active"
}
},
{
"region_id": "9063af2f-7331-41ba-8bc8-db5d308a1228",
"target_addr": "127.0.0.1:44103",
"repair_addr": "[::]:48103",
"state": {
"type": "active"
}
}
]
}
}
],
"read_only_parent": null
}
}
```
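This live repair state is exactly the signal the control plane wants for region replacement: instead of keying off activation, it can poll until the repair finishes and the replaced target reports active again. A hedged sketch of such a loop, where the `probe` closure stands in for fetching and evaluating the info tree (for example, with a `volume_is_healthy` check like the one sketched above):

```rust
use std::{thread, time::Duration};

/// Poll a health probe until it reports true, with a crude timeout. A real
/// control plane would drive this from its own saga/retry machinery rather
/// than sleeping on a thread.
fn wait_until_healthy(mut probe: impl FnMut() -> bool) -> Result<(), &'static str> {
    for _ in 0..120 {
        if probe() {
            return Ok(());
        }
        thread::sleep(Duration::from_secs(5));
    }
    Err("volume did not become healthy within the timeout")
}
```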
And here's a reconcile:

```json
"info": {
"volume": {
"sub_volumes": [
{
"upstairs": {
"state": "go_active",
"block_size": 512,
"upstairs_id": "08166582-285a-46a4-98cc-6f5eb966733b",
"session_id": "9acfdd86-a830-42bd-8103-e4d46799de26",
"generation": 1776974377,
"read_only": false,
"encrypted": true,
"reconcile_in_progress": true,
"live_repair_in_progress": false,
"targets": [
{
"region_id": "b3092ac3-115a-4480-8d30-3c20fca254c2",
"target_addr": "127.0.0.1:44101",
"repair_addr": "[::]:48101",
"state": {
"type": "connecting",
"state": "reconcile",
"mode": "new"
}
},
{
"region_id": "73c9cd9f-8c8e-4cea-96ff-cae21dcc53dc",
"target_addr": "127.0.0.1:44102",
"repair_addr": "[::]:48102",
"state": {
"type": "connecting",
"state": "reconcile",
"mode": "new"
}
},
{
"region_id": "9063af2f-7331-41ba-8bc8-db5d308a1228",
"target_addr": "127.0.0.1:44103",
"repair_addr": "[::]:48103",
"state": {
"type": "connecting",
"state": "reconcile",
"mode": "new"
}
}
]
}
}
],
"read_only_parent": null
}
}
```
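Note that the per-target state is a tagged object: "active" stands alone, while "connecting" carries the negotiation state and connection mode. Read off these examples rather than the real definitions, that shape suggests an internally tagged enum along these lines:

```rust
use serde::Deserialize;

// Hypothetical reconstruction of the per-downstairs state from the example
// output; the real enum in crucible likely has more variants and structure.
#[derive(Debug, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
enum DownstairsState {
    Active,
    LiveRepair,
    Connecting { state: String, mode: String },
    // other variants (faulted, offline, ...) elided
}

fn main() {
    let s: DownstairsState = serde_json::from_str(
        r#"{"type": "connecting", "state": "reconcile", "mode": "new"}"#,
    )
    .unwrap();
    println!("{s:?}"); // Connecting { state: "reconcile", mode: "new" }
}
```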
leftwo left a comment:

Minor question and a request for a multiple sub-volume test.
```rust
        );
        assert_eq!(*state, DownstairsInfoStatus::Active);
    }
}
```
We should add a test here (or somewhere) that verifies a Volume with two sub-volumes returns the proper info for each level.
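For illustration, the requested test might look something like this, written against the hypothetical types sketched in the description above rather than crucible's actual test harness:

```rust
use uuid::Uuid;

// Hypothetical builder for a quiescent upstairs; not a real crucible helper.
fn fake_upstairs_info() -> UpstairsInfo {
    UpstairsInfo {
        state: "active".to_string(),
        block_size: 512,
        upstairs_id: Uuid::new_v4(),
        session_id: Uuid::new_v4(),
        generation: 1,
        read_only: false,
        encrypted: true,
        reconcile_in_progress: false,
        live_repair_in_progress: false,
        targets: vec![],
    }
}

#[test]
fn volume_info_reports_both_sub_volumes() {
    let info = VolumeInfo::Volume {
        sub_volumes: vec![
            SubVolumeInfo::Upstairs(fake_upstairs_info()),
            SubVolumeInfo::Upstairs(fake_upstairs_info()),
        ],
        read_only_parent: None,
    };

    // Each level of the tree should be independently inspectable.
    let VolumeInfo::Volume { sub_volumes, read_only_parent } = info;
    assert!(read_only_parent.is_none());
    assert_eq!(sub_volumes.len(), 2);
    for SubVolumeInfo::Upstairs(up) in &sub_volumes {
        assert_eq!(up.block_size, 512);
        assert!(!up.reconcile_in_progress);
        assert!(!up.live_repair_in_progress);
    }
}
```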
| "email": "api@oxide.computer" | ||
| }, | ||
| "version": "1.0.0" | ||
| "version": "2.0.0" |
If we have a live update running, with a new pantry and an old nexus, will the old nexus still be able to query the new pantry?
Yep, the old nexus will send requests with the api-version header set to 1.0.0, and the pantry will handle this by converting the 2.0.0 response to 1.0.0 for those requests.
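In other words, the compatibility story is an on-the-wire down-conversion. A hedged illustration of the idea, where `InfoV1`/`InfoV2` are hypothetical stand-ins for the 1.0.0 and 2.0.0 response shapes rather than the pantry's actual types:

```rust
use serde::Serialize;
use serde_json::Value;

#[derive(Serialize)]
struct InfoV2 {
    state: String,
    live_repair_in_progress: bool, // new in 2.0.0
}

#[derive(Serialize)]
struct InfoV1 {
    state: String,
}

// Drop the fields the old client never knew about.
fn downconvert(v2: &InfoV2) -> InfoV1 {
    InfoV1 { state: v2.state.clone() }
}

// Pick the wire shape based on the client's advertised api-version header.
fn render_info(requested_version: &str, v2: InfoV2) -> Value {
    if requested_version == "1.0.0" {
        serde_json::to_value(downconvert(&v2)).unwrap()
    } else {
        serde_json::to_value(v2).unwrap()
    }
}
```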