@david-crespo david-crespo commented Aug 22, 2025

Closes #8869

  • Add update_status endpoint at /v1/system/update/status that includes:
    • Current target release (same as existing target_release_view)
    • time_last_progress, representing the time_made_target of the latest bp_target — meant to indicate the last time the update system Did Something
    • components_by_release_version, a map where the keys are to_string'd TufRepoVersions and the values are counts of components on that version
  • Remove target_release_view endpoint, which is fully redundant with update status
    • Add a line to the doc comment on target_release_update that you can use update status to check the current target release
  • Rework the TargetRelease struct that was previously being returned from target_release_view and is now part of the update status response
    • Remove generation from TargetRelease because it is not meaningful to the end user
    • Other changes that are easier to explain in the inline comments

Listing blockers or problems preventing further update progress is mentioned as a goal of #8869, but they are currently stored as JSON blobs in the DB for debug purposes. They will soon be stored in a more regimented way that should make it easy to stick a list of blockers in the response body. If this PR gets merged without that, I'll make a dedicated issue for it.

Example status response

{
  "target_release": {
    "time_requested": "2025-09-24T22:37:43.266338Z",
    "version": "2.0.0"
  },
  "components_by_release_version": {
    "install dataset": 7,
    "unknown": 15
  },
  "time_last_blueprint": "2025-09-24T22:37:41.556717Z",
  "paused": false
}
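For reference, a map shaped like components_by_release_version in the example above could be built roughly as follows. This is a minimal std-only sketch, not the code in this PR: the enum is a simplified stand-in for TufRepoVersion (its Version arm holds a plain String rather than a semver type), and count_by_version is a hypothetical helper name.

```rust
use std::collections::BTreeMap;
use std::fmt;

/// Simplified stand-in for the real `TufRepoVersion` (assumption: the
/// `Version` arm holds a plain string here instead of a semver type).
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum TufRepoVersion {
    Unknown,
    InstallDataset,
    Version(String),
    Error(String),
}

impl fmt::Display for TufRepoVersion {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TufRepoVersion::Unknown => write!(f, "unknown"),
            TufRepoVersion::InstallDataset => write!(f, "install dataset"),
            TufRepoVersion::Version(v) => write!(f, "{}", v),
            TufRepoVersion::Error(s) => write!(f, "{}", s),
        }
    }
}

/// Hypothetical helper: count components per version, keyed by the
/// version's `to_string()` form.
pub fn count_by_version<I>(versions: I) -> BTreeMap<String, usize>
where
    I: IntoIterator<Item = TufRepoVersion>,
{
    let mut counts = BTreeMap::new();
    for version in versions {
        *counts.entry(version.to_string()).or_insert(0) += 1;
    }
    counts
}
```

Because the keys are plain strings, two distinct Error messages end up as two separate keys in this sketch.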

Base automatically changed from iliana/not-experimental to main August 26, 2025 18:24

let counts = status.components_by_release_version;
assert_eq!(counts.get("install dataset").unwrap(), &7);
assert_eq!(counts.get("unknown").unwrap(), &15);
Contributor Author

I will probably want to change the Display impl to make these snake case. I could just write my own helper that does that and leave the Display impl alone, but it's probably bad to have two ways of doing this.

impl Display for TufRepoVersion {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            TufRepoVersion::Unknown => write!(f, "unknown"),
            TufRepoVersion::InstallDataset => {
                write!(f, "install dataset")
            }
            TufRepoVersion::Version(version) => {
                write!(f, "{}", version)
            }
            TufRepoVersion::Error(s) => {
                write!(f, "{}", s)
            }
        }
    }
}
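
To make the "own helper" alternative concrete, here is a rough sketch of what a separate snake_case key function might look like. The name api_key is hypothetical and the enum is a simplified stand-in; this is the option the comment above argues against, shown only for comparison.

```rust
/// Simplified stand-in for the real `TufRepoVersion` (assumption: the
/// `Version` arm holds a plain string here).
pub enum TufRepoVersion {
    Unknown,
    InstallDataset,
    Version(String),
    Error(String),
}

/// Hypothetical helper: a snake_case key for API responses that leaves
/// the human-readable `Display` impl untouched.
pub fn api_key(version: &TufRepoVersion) -> String {
    match version {
        TufRepoVersion::Unknown => "unknown".to_string(),
        TufRepoVersion::InstallDataset => "install_dataset".to_string(),
        TufRepoVersion::Version(v) => v.clone(),
        TufRepoVersion::Error(s) => s.clone(),
    }
}
```

The downside is visible in the sketch: the match arms duplicate the Display impl almost line for line, so the two can drift apart.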

/// rack should eventually correspond to the release described here.
///
/// Will only be null if a target release has never been set.
pub target_release: Option<TargetRelease>,
Contributor Author

I've collapsed the TargetReleaseSource enum with its Unspecified and SystemVersion arms into Option<TargetRelease>, where TargetRelease is always a system version. This is based on my understanding that it's basically never going to be Unspecified after the first few weeks of update being released. By handling that with target_release: null, we ensure future values will be nice ones like target_release: { version: "2.0.0", time_requested: "..." } that don't distract the user with the possibility of other shapes. In the previous TargetReleaseSource enum, this was the distinction between { "type": "unspecified" } vs. { "type": "system_version", ... }. The option approach lets us make the unspecified arm nearly invisible.
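
A rough sketch of the collapse described above, under simplifying assumptions: timestamps and versions are plain Strings rather than DateTime<Utc> and a semver type, the fields of the SystemVersion arm are guessed, and to_view is a hypothetical function name rather than anything in this PR.

```rust
/// Old shape (simplified): the release source was an enum with two arms.
/// The field on `SystemVersion` is an assumption for illustration.
pub enum TargetReleaseSource {
    Unspecified,
    SystemVersion { version: String },
}

/// New shape (simplified): a target release always has a system version.
#[derive(Debug, PartialEq, Eq)]
pub struct TargetRelease {
    pub time_requested: String,
    pub version: String,
}

/// Hypothetical conversion: the `Unspecified` arm becomes `None`, which
/// serializes as `target_release: null` in the response.
pub fn to_view(
    time_requested: &str,
    source: TargetReleaseSource,
) -> Option<TargetRelease> {
    match source {
        TargetReleaseSource::Unspecified => None,
        TargetReleaseSource::SystemVersion { version } => Some(TargetRelease {
            time_requested: time_requested.to_string(),
            version,
        }),
    }
}
```

Note that time_requested is dropped along with the Unspecified arm, so a client only ever sees a request time when a release has actually been set.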

/// The source of the target release.
pub release_source: TargetReleaseSource,
/// The specified release of the rack's system software
pub version: Version,
Contributor Author

@david-crespo david-crespo Sep 24, 2025

I found the word "source" here very confusing. Now there is no source. There either is no target release or there is one, and when there is one, it always has a system version.

I also thought it was a little weird that you could have a target release like TargetRelease { time_requested: "...", release_source: Unspecified }, namely unspecified but still with a non-null time_requested. Based on the migration adding the unspecified value, time_requested is just whenever the migration ran, which to me is not very meaningful to the user. In this PR, time_requested is still inside TargetRelease, but since TargetRelease is now optional, you only have time_requested in the response when you have a release set.

-- System software is by default from the `install` dataset.
INSERT INTO omicron.public.target_release (
    generation,
    time_requested,
    release_source,
    tuf_repo_id
) VALUES (
    1,
    NOW(),
    'unspecified',
    NULL
) ON CONFLICT DO NOTHING;

#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize, JsonSchema)]
pub struct TargetRelease {
/// The target-release generation number.
pub generation: i64,
Contributor Author

This has been removed. I don't think it's meaningful to the user.

@david-crespo david-crespo marked this pull request as ready for review September 24, 2025 22:57
.chain(rot_version)
.chain(std::iter::once(bootloader_version))
.chain(host_version)
});
Contributor Author

It's fine to do this by hand here, but "make a flat list of all the versions in the system" feels like an operation that should have a canonical implementation living somewhere else.
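
As an illustration of what such a canonical helper might look like, here is a std-only sketch. SledVersions, its fields, and all_versions are hypothetical names; which versions are optional is a guess based on the .chain(...) calls in the snippet above, not on the real inventory types.

```rust
/// Hypothetical per-sled summary; field names and optionality are
/// illustrative assumptions, not the real inventory types.
pub struct SledVersions {
    pub sp_version: Option<String>,
    pub rot_version: Option<String>,
    pub bootloader_version: String,
    pub host_version: Option<String>,
    pub zone_versions: Vec<String>,
}

/// Flatten every component version on every sled into a single list.
/// `Option` implements `IntoIterator`, so missing versions are skipped.
pub fn all_versions(sleds: &[SledVersions]) -> Vec<String> {
    sleds
        .iter()
        .flat_map(|sled| {
            sled.zone_versions
                .iter()
                .cloned()
                .chain(sled.sp_version.clone())
                .chain(sled.rot_version.clone())
                .chain(std::iter::once(sled.bootloader_version.clone()))
                .chain(sled.host_version.clone())
        })
        .collect()
}
```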

Collaborator

@davepacheco davepacheco left a comment

I have not had a chance to look carefully at the changes to nexus/src/app/update.rs or the updates.rs integration test. The API changes look fine to me. We'll need to coordinate an update to the CLI as people are testing this.

/// configuration) was made a target by the update system. In other words,
/// it's the last time the system decided on a next step to take.
pub time_last_progress: DateTime<Utc>,
}
Collaborator

At some point I think we talked about a field here that reflects "the system is not doing anything because of a recent MUPdate".

Internally, this can be determined if the current target release generation number (not the version, the generation number of the target release object) is less than or equal to the target blueprint's target_release_minimum_generation. If that's the case, that means a MUPdate has happened some time after the last time the operator set the target release. In this case, the system stops doing anything update-related until the operator sets the target release again. We'll want to identify and call out this condition in the console.
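
The condition described above reduces to a single comparison. A minimal sketch, assuming both generation numbers are available as plain u64 values; is_paused is a hypothetical name and the real types are generation wrappers, not bare integers.

```rust
/// Returns true when the update system is idle because a MUPdate happened
/// after the operator last set the target release.
///
/// Assumption: both values are plain integers here; the real code uses
/// generation wrapper types.
pub fn is_paused(
    target_release_generation: u64,
    target_release_minimum_generation: u64,
) -> bool {
    target_release_generation <= target_release_minimum_generation
}
```

Setting the target release again bumps its generation past the blueprint's minimum, which is what clears the condition.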

Contributor Author

Added this as paused in 3ca18a7

/// Count of components running each release version
pub components_by_release_version: BTreeMap<String, usize>,

/// Time of last meaningful change to update status
Collaborator

Suggested change
- /// Time of last meaningful change to update status
+ /// Time of most recent update-related activity (internal to the system)

I'm definitely open to other things. I don't love "last meaningful change to update status" because technically this is the time when the system most recently decided to do something new (not even necessarily update-related, and it definitely hasn't done it yet). The real definition is too implementation-specific though, hence my vague "update-related activity"

Contributor Author

Went for "Time of most recent update planning activity" and time_last_blueprint in 3ca18a7. As much as it feels bad to expose an internal term, I would feel worse calling it something else (like last_update_planning_step or last_update_step) and literally meaning a blueprint anyway. I'd rather just say blueprint.

tuf_repo::table
.select(tuf_repo::system_version)
.filter(tuf_repo::id.eq(tuf_repo_id.into_untyped_uuid()))
.first_async::<SemverVersion>(&*conn)
Collaborator

How does this handle the case where there was no version at all? I feel like this should probably be a 500 error. It shouldn't happen because we don't let you delete TUF repos at all, and when we do, you won't be able to delete one that's being used.

Contributor Author

Do you mean when there is no row found in the table with this ID? Every row has a version. I tested it with a nonexistent UUID and got a 500 with unexpected database error: Record not found, so I think it's already doing what you want.
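
The behavior described here (a missing row surfacing as an internal error rather than as an empty result) can be sketched without a database. Everything below is a stand-in: the HashMap plays the role of the tuf_repo table, and LookupError and system_version_for_repo are hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical error type; the `Internal` arm stands in for whatever the
/// API layer maps to a 500 response.
#[derive(Debug, PartialEq, Eq)]
pub enum LookupError {
    Internal(String),
}

/// Look up the system version for a repo ID. A missing row is treated as
/// an internal error, since a referenced repo should always exist.
pub fn system_version_for_repo(
    repos: &HashMap<u64, String>,
    tuf_repo_id: u64,
) -> Result<String, LookupError> {
    repos.get(&tuf_repo_id).cloned().ok_or_else(|| {
        LookupError::Internal(
            "unexpected database error: Record not found".to_string(),
        )
    })
}
```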

Comment on lines 187 to 189
let time_last_progress = self
.datastore()
.blueprint_target_get_current(opctx)
Collaborator

We have the current target blueprint sitting in a watch channel from the blueprint_loader task. It'd be nice to use that here.

Contributor Author

That's interesting. As far as I can tell this is the first case of an external API endpoint pulling data from a watch channel. Is it possible that it could be out of date with respect to the other stuff we're pulling straight from the DB? I have a commit on the way that seems to work fine.

Contributor Author

Attempted in 321374d. It seems fine? It always makes me nervous to do something we don't do anywhere else.

Comment on lines +233 to +237
self.datastore()
.target_release_get_generation(opctx, Generation(prev_gen))
.await
.internal_context("fetching previous target release")?
.and_then(|r| r.tuf_repo_id)
Collaborator

This feels a little sketchy? I'm not sure.

Example: after a MUPdate, you might set the target release to the same value it was before. Then the previous row will not be the previous release.

Today, the PlanningInput has this information. Maybe PlanningInputFromDb has a better way to do this?

Contributor Author

@david-crespo david-crespo Oct 2, 2025

Unfortunately, that is where I copied this logic from. The comment suggests that #8056 helps but I am not sure it does.

let tuf_repo = TufRepoPolicy {
    target_release_generation: target_release.generation.0,
    description: target_release_desc,
};
// NOTE: We currently assume that only two generations are in play: the
// target release generation and its previous one. This depends on us
// not setting a new target release in the middle of an update: see
// https://github.com/oxidecomputer/omicron/issues/8056.
//
// We may need to revisit this decision in the future. See that issue
// for some discussion.
let old_repo = if let Some(prev) = target_release.generation.prev() {
    let prev_release = datastore
        .target_release_get_generation(opctx, Generation(prev))
        .await
        .internal_context("fetching previous target release")?;
    let description = if let Some(prev_release) = prev_release {
        if let Some(repo_id) = prev_release.tuf_repo_id {
            TargetReleaseDescription::TufRepo(
                datastore
                    .tuf_repo_get_by_id(opctx, repo_id.into())
                    .await
                    .internal_context(
                        "fetching previous target release repo",
                    )?
                    .into_external(),
            )
        } else {
            TargetReleaseDescription::Initial
        }
    } else {
        TargetReleaseDescription::Initial
    };
    TufRepoPolicy { target_release_generation: prev, description }
} else {
    TufRepoPolicy::initial()
};

I certainly see your point, though, and it is a problem. In the internal API logic I'm borrowing, it seems like we will only turn up component versions that match either the old or the new one passed in, because we're matching against artifact hashes. In the MUPdate case where current and prev release are the same, components that are on the MUPdate's version will all show up as TufRepoVersion::Unknown.

fn zone_image_source_to_version(
    old: &TargetReleaseDescription,
    new: &TargetReleaseDescription,
    source: &OmicronZoneImageSource,
    res: &ConfigReconcilerInventoryResult,
) -> TufRepoVersion {
    if let ConfigReconcilerInventoryResult::Err { message } = res {
        return TufRepoVersion::Error(message.clone());
    }
    let &OmicronZoneImageSource::Artifact { hash } = source else {
        return TufRepoVersion::InstallDataset;
    };
    if let Some(new) = new.tuf_repo() {
        if new.artifacts.iter().any(|meta| meta.hash == hash) {
            return TufRepoVersion::Version(
                new.repo.system_version.clone(),
            );
        }
    }
    if let Some(old) = old.tuf_repo() {
        if old.artifacts.iter().any(|meta| meta.hash == hash) {
            return TufRepoVersion::Version(
                old.repo.system_version.clone(),
            );
        }
    }
    TufRepoVersion::Unknown
}
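
To make the failure mode concrete, here is a stripped-down model of the same lookup order (new repo, then old repo, else unknown). The types are simplified assumptions: hashes are u64 instead of artifact hashes, Repo stands in for a TUF repo description, and hash_to_version is a hypothetical name.

```rust
/// Simplified result type; only the arms needed for the illustration.
#[derive(Debug, PartialEq, Eq)]
pub enum TufRepoVersion {
    Unknown,
    Version(String),
}

/// Simplified repo: a system version plus the hashes of its artifacts
/// (assumption: `u64` stands in for the real artifact hash type).
pub struct Repo {
    pub system_version: String,
    pub artifact_hashes: Vec<u64>,
}

/// Same lookup order as the function above: check the new repo, then the
/// old repo, and fall back to `Unknown` if neither contains the hash.
pub fn hash_to_version(
    old: Option<&Repo>,
    new: Option<&Repo>,
    hash: u64,
) -> TufRepoVersion {
    for repo in [new, old].into_iter().flatten() {
        if repo.artifact_hashes.contains(&hash) {
            return TufRepoVersion::Version(repo.system_version.clone());
        }
    }
    TufRepoVersion::Unknown
}
```

If old and new both point at the same repo, any hash from a third release (for example the one installed by the MUPdate) falls through to Unknown, which is the case described above.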

Contributor Author

@david-crespo david-crespo Oct 2, 2025

@jgallagher says to use this instead, and sort by generation number to figure out what are old and new releases (not sure I actually need old and new anyway, I just need all the versions on hand).

https://github.com/oxidecomputer/omicron/blame/4d5bdc6d348b27761348d763c4085f060bcefc18/nexus/db-queries/src/db/datastore/target_release.rs#L227-L302
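
A rough sketch of the "sort by generation number and collect all the versions on hand" idea. This is not the linked datastore function; TargetReleaseRow and versions_newest_first are hypothetical, and versions are plain Strings for brevity.

```rust
/// Hypothetical row: a target release generation and the system version
/// of its repo, if one was set (`None` models the initial/unspecified row).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct TargetReleaseRow {
    pub generation: u64,
    pub system_version: Option<String>,
}

/// Return the distinct system versions on hand, newest generation first.
pub fn versions_newest_first(mut rows: Vec<TargetReleaseRow>) -> Vec<String> {
    // Highest generation number first.
    rows.sort_by(|a, b| b.generation.cmp(&a.generation));
    let mut versions = Vec::new();
    for row in rows {
        if let Some(version) = row.system_version {
            // Keep only the first (newest) occurrence of each version.
            if !versions.contains(&version) {
                versions.push(version);
            }
        }
    }
    versions
}
```

With a list like this, a version that was re-set after a MUPdate appears once, at its newest position, rather than being confused with "the previous generation".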

Development

Successfully merging this pull request may close these issues.

upgrade status in external API
4 participants