To quiesce, Nexus must be able to use internal API before upgrading to latest schema #8750

@smklein

Context

We're working through #8503. Part of that work requires that old Nexuses be able to communicate with each other and coordinate that they have quiesced before handing off to the new Nexuses.

In #8740, an endpoint has been added to the internal API: quiesce_state. This allows us to query "has this Nexus node shut down?"

This issue

This new endpoint queries the state of sagas and the datastore. It assumes you have a fully-constructed (in the Rust sense of object construction) DataStore object that can be queried.

However, the existing construction of a DataStore relies on the schema being up-to-date:

    pub async fn new_with_timeout(
        log: &Logger,
        pool: Arc<Pool>,
        config: Option<&AllSchemaVersions>,
        try_for: Option<std::time::Duration>,
    ) -> Result<Self, String> {
        use nexus_db_model::SCHEMA_VERSION as EXPECTED_VERSION;
        let datastore =
            Self::new_unchecked(log.new(o!("component" => "datastore")), pool);
        let start = std::time::Instant::now();
        // Keep looping until we find that the schema matches our expectation.
        retry_notify(
            retry_policy_internal_service(),
            || async {
                if let Some(try_for) = try_for {
                    if std::time::Instant::now() > start + try_for {
                        return Err(BackoffError::permanent(()));
                    }
                }
                match datastore
                    .ensure_schema(&log, EXPECTED_VERSION, config)
                    .await
                {
                    Ok(()) => return Ok(()),
                    Err(e) => {
                        warn!(log, "Failed to ensure schema version"; "error" => #%e);
                    }
                };
                return Err(BackoffError::transient(()));
            },
            |_, _| {},
        )
        .await
        .map_err(|_| "Failed to read valid DB schema".to_string())?;
        Ok(datastore)
    }

This is a problem if an "old Nexus" is booting after quiescing has started:

  • It won't be able to respond to quiesce requests, because it won't have a DataStore
  • It won't be able to construct a DataStore, because quiescing has started

This is a bit of a "dependency deadlock".

Proposal to Fix

If Nexus could serve HTTP requests before the DataStore has validated that the schema version is up to date, this problem would be resolved. For example, the DataStore could be constructed before schema validation completes, but reject claims until it does (this seems reasonable, and matches the pattern of quiesce).
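
To make the idea concrete, here is a minimal, standard-library-only sketch of a datastore that can be constructed before schema validation but refuses claims until validation has succeeded. Everything in it is hypothetical and for illustration only: `GatedDataStore`, `ClaimError`, `Connection`, `mark_schema_validated`, and `quiesce_probe` are invented names, not part of the real `DataStore` or the internal API, and the real claim path is async and backed by a connection pool.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical error returned when a claim is refused.
#[derive(Debug, PartialEq)]
pub enum ClaimError {
    /// The schema version has not been validated yet.
    SchemaNotValidated,
}

/// Hypothetical stand-in for a claimed database connection.
#[derive(Debug)]
pub struct Connection;

/// Hypothetical, simplified stand-in for `DataStore`: construction always
/// succeeds, and schema validation is tracked separately.
pub struct GatedDataStore {
    schema_validated: AtomicBool,
}

impl GatedDataStore {
    /// Construct the object without checking the schema (assumption: this
    /// plays the role that `new_unchecked` plays in the snippet above).
    pub fn new_unvalidated() -> Self {
        Self { schema_validated: AtomicBool::new(false) }
    }

    /// Called once the schema check (for example, a background retry loop)
    /// has succeeded.
    pub fn mark_schema_validated(&self) {
        self.schema_validated.store(true, Ordering::Release);
    }

    /// Hand out a connection only after the schema has been validated.
    pub fn claim(&self) -> Result<Connection, ClaimError> {
        if self.schema_validated.load(Ordering::Acquire) {
            Ok(Connection)
        } else {
            Err(ClaimError::SchemaNotValidated)
        }
    }
}

/// Hypothetical handler helper: it can always produce an answer, even while
/// the datastore is still refusing claims.
pub fn quiesce_probe(datastore: &GatedDataStore) -> &'static str {
    match datastore.claim() {
        Ok(_connection) => "datastore ready",
        Err(ClaimError::SchemaNotValidated) => "datastore not accepting claims",
    }
}
```

With this shape, the HTTP server could start as soon as the object exists, and a booting "old Nexus" would be able to answer quiesce requests without ever passing schema validation. How the real endpoint should report that state is left open here.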
