Context
We're working through #8503. Part of that issue requires that old Nexuses be able to communicate with each other, and coordinate that they have quiesced, before handing off to the new Nexuses.
In #8740, an endpoint has been added to the internal API: quiesce_state. This allows us to query "has this Nexus node shut down?"
This issue
This new endpoint queries the state of sagas and the datastore. It assumes you have a fully-constructed (in the Rust sense of object construction) DataStore object that can be queried.
However, the existing construction of a DataStore relies on the schema being up-to-date:
omicron/nexus/db-queries/src/db/datastore/mod.rs
Lines 224 to 264 in 68a8c4b
```rust
pub async fn new_with_timeout(
    log: &Logger,
    pool: Arc<Pool>,
    config: Option<&AllSchemaVersions>,
    try_for: Option<std::time::Duration>,
) -> Result<Self, String> {
    use nexus_db_model::SCHEMA_VERSION as EXPECTED_VERSION;
    let datastore =
        Self::new_unchecked(log.new(o!("component" => "datastore")), pool);
    let start = std::time::Instant::now();
    // Keep looping until we find that the schema matches our expectation.
    retry_notify(
        retry_policy_internal_service(),
        || async {
            if let Some(try_for) = try_for {
                if std::time::Instant::now() > start + try_for {
                    return Err(BackoffError::permanent(()));
                }
            }
            match datastore
                .ensure_schema(&log, EXPECTED_VERSION, config)
                .await
            {
                Ok(()) => return Ok(()),
                Err(e) => {
                    warn!(log, "Failed to ensure schema version"; "error" => #%e);
                }
            };
            return Err(BackoffError::transient(()));
        },
        |_, _| {},
    )
    .await
    .map_err(|_| "Failed to read valid DB schema".to_string())?;
    Ok(datastore)
}
```
This is a problem if an "old Nexus" is booting after quiescing has started:
- It won't be able to respond to quiesce requests, because it won't have a DataStore
- It won't be able to construct a DataStore, because quiescing has started
This is a bit of a "dependency deadlock".
Proposal to Fix
If Nexus could serve HTTP requests before the DataStore has validated that the schema version is up-to-date, this problem would be resolved. As an example, the DataStore could be constructed before schema validation completes, but reject claims until it does (this seems reasonable, and matches the pattern of quiesce).
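To make the proposal concrete, here is a minimal sketch of a handle that can be constructed before schema validation but refuses claims until validation succeeds. Everything in it is hypothetical and for illustration only: `EarlyDataStore`, `Connection`, `ClaimError`, and the method names are not part of omicron, and the real pool and schema check are replaced by a single shared flag.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a pooled database connection.
#[derive(Debug)]
pub struct Connection;

/// Hypothetical error returned when a claim is attempted too early.
#[derive(Debug, PartialEq, Eq)]
pub enum ClaimError {
    SchemaNotValidated,
}

/// Hypothetical handle that exists before the schema check has passed.
/// Clones share the same validation flag.
#[derive(Clone)]
pub struct EarlyDataStore {
    schema_ok: Arc<AtomicBool>,
}

impl EarlyDataStore {
    /// Construct immediately, without waiting on schema validation.
    pub fn new_unvalidated() -> Self {
        Self { schema_ok: Arc::new(AtomicBool::new(false)) }
    }

    /// Called (for example, by a background retry loop) once the schema
    /// version has been confirmed to match.
    pub fn mark_schema_validated(&self) {
        self.schema_ok.store(true, Ordering::Release);
    }

    /// Whether the schema check has passed yet.
    pub fn is_schema_validated(&self) -> bool {
        self.schema_ok.load(Ordering::Acquire)
    }

    /// Reject claims until the schema has been validated.
    pub fn claim(&self) -> Result<Connection, ClaimError> {
        if self.is_schema_validated() {
            Ok(Connection)
        } else {
            Err(ClaimError::SchemaNotValidated)
        }
    }
}
```

With a shape like this, the HTTP server could start as soon as `new_unvalidated()` returns, while a background task keeps retrying the schema check and calls `mark_schema_validated()` on a clone when it passes. Whether quiesce-related queries should bypass the claim check, or report "no DataStore yet" as a state of their own, is left open here.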