Description
In the last two Friday demos we tried a disk creation that appeared to work, but Nexus's state said "creating" for way longer than expected (minutes). The reason appeared to be exponential backoff in the saga, with each timeout triggering an attempt to create the disk again (using a request that would succeed if the disk were already created):
Lines 1228 to 1271 in fd3dab1:

```rust
let region_request = CreateRegion {
    block_size: region.block_size().to_bytes(),
    extent_count: region.extent_count().try_into().unwrap(),
    extent_size: region.blocks_per_extent().try_into().unwrap(),
    // TODO: Can we avoid casting from UUID to string?
    // NOTE: This'll require updating the crucible agent client.
    id: RegionId(region.id().to_string()),
    encrypted: region.encrypted(),
    cert_pem: None,
    key_pem: None,
    root_pem: None,
};

let create_region = || async {
    let region = client
        .region_create(&region_request)
        .await
        .map_err(|e| BackoffError::Permanent(e.into()))?;
    match region.state {
        RegionState::Requested => Err(BackoffError::transient(anyhow!(
            "Region creation in progress"
        ))),
        RegionState::Created => Ok(region),
        _ => Err(BackoffError::Permanent(anyhow!(
            "Failed to create region, unexpected state: {:?}",
            region.state
        ))),
    }
};

let log_create_failure = |_, delay| {
    warn!(
        log,
        "Region requested, not yet created. Retrying in {:?}", delay
    );
};

let region = backoff::retry_notify(
    backoff::internal_service_policy(),
    create_region,
    log_create_failure,
)
.await
.map_err(|e| Error::internal_error(&e.to_string()))?;
```
I don't think it makes sense to use exponential backoff for this case, because there's no indication that the remote side is overloaded -- it just isn't finished yet. The `internal_service_policy()` used at line 1266 above is intended for background connections or requests to internal services where we want to retry indefinitely -- definitely nothing latency-sensitive. We could create a policy that's much more aggressive (e.g., polling once per second -- see the sketch after the list below), but really, this isn't retrying a failure, it's just waiting for something to happen. There are other ways to do this:
- For instance creation, we have the sled send us a notification when the instance state changes. I like this pattern because it also works for the case where the state changes unexpectedly (e.g., a bhyve crash), though I think it creates a different problem if the notification is received by a different Nexus than the one operating the saga.
- Josh suggested a long poll -- instead of having the request complete immediately with a status meaning "it's not ready yet", the server could hold the request open for a bounded amount of time (say, 30 seconds). If the operation completes within that window, the server responds immediately saying so; if not, it responds with "not ready yet" and the client immediately retries. (A rough sketch of the server side follows.)
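Here's a minimal sketch of what the server side of that long poll could look like, assuming a `tokio::sync::watch` channel that tracks the region's state and a hypothetical `wait_for_region` handler body -- none of these names are the crucible agent's actual API, and the local `RegionState` enum just mirrors the one used above:

```rust
use std::time::Duration;
use tokio::sync::watch;

// Stand-in for the agent's region state enum, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum RegionState {
    Requested,
    Created,
    Failed,
}

// Hypothetical long-poll handler body: respond as soon as the state leaves
// Requested, but never hold the request open longer than ~30 seconds.
async fn wait_for_region(
    mut state_rx: watch::Receiver<RegionState>,
) -> RegionState {
    let deadline = tokio::time::sleep(Duration::from_secs(30));
    tokio::pin!(deadline);
    loop {
        let state = *state_rx.borrow();
        if state != RegionState::Requested {
            return state;
        }
        tokio::select! {
            // Deadline hit: report "still Requested" and let the client retry.
            _ = &mut deadline => return *state_rx.borrow(),
            changed = state_rx.changed() => {
                if changed.is_err() {
                    // The sender side went away; report the last known state.
                    return *state_rx.borrow();
                }
            }
        }
    }
}
```

The client's side stays trivial: call the endpoint in a loop and treat a "Requested" response as "ask again", so a freshly created region is observed within one round trip rather than after a multi-minute backoff delay.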
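For comparison, the "just poll on a fixed interval" option mentioned before the list would look roughly like the sketch below. It reuses `client`, `region_request`, `RegionState`, `Error`, and `log` from the snippet above and assumes it runs in the same async saga function; the one-second interval is illustrative, and a real version would still want an overall deadline:

```rust
// Sketch only: fixed-interval polling in place of exponential backoff.
use std::time::Duration;

let region = loop {
    // Re-issue the (idempotent) create request and check the resulting state.
    let region = client
        .region_create(&region_request)
        .await
        .map_err(|e| Error::internal_error(&e.to_string()))?;
    match region.state {
        RegionState::Created => break region,
        RegionState::Requested => {
            warn!(log, "Region requested, not yet created; polling again in 1s");
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
        state => {
            return Err(Error::internal_error(&format!(
                "Failed to create region, unexpected state: {:?}",
                state
            )))
        }
    }
};
```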