
Conversation

Collaborator

@davepacheco davepacheco commented Sep 26, 2025

This new background task just sets time_pruned on TUF repos according to a simple policy. With #9109, this will trigger these repos' artifacts to be deleted.
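For readers skimming the thread, the general shape of the policy (pieced together from the review discussion below) is roughly: keep the recent target releases and the recently uploaded repos, and mark at most one other repo pruned per activation. A minimal sketch, with all names and types invented for illustration rather than taken from the actual task:

// Illustrative only: the real task works against the database, keeps
// configurable minimums (see the config discussion below), and records the
// other eligible repos for later activations.
struct TufRepo {
    id: u64,
    is_recent_target_release: bool,
    is_recently_uploaded: bool,
    time_pruned: Option<std::time::SystemTime>,
}

/// Pick at most one repo to mark pruned in this activation.
fn choose_repo_to_prune(repos: &[TufRepo]) -> Option<u64> {
    repos
        .iter()
        // Already-pruned repos are out of scope.
        .filter(|r| r.time_pruned.is_none())
        // Recent target releases and newly uploaded repos are kept.
        .filter(|r| !r.is_recent_target_release && !r.is_recently_uploaded)
        .map(|r| r.id)
        .next()
}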

Base automatically changed from iliana/tuf-prune-field to main September 27, 2025 05:19

@davepacheco
Collaborator Author

@jgallagher points out that mupdate resolution already only works with the latest target release, which eliminates the problem I was worried about.

@davepacheco davepacheco reopened this Sep 27, 2025
iliana added a commit that referenced this pull request Sep 30, 2025
Co-authored-by: David Pacheco <dap@oxidecomputer.com>
@davepacheco davepacheco marked this pull request as ready for review September 30, 2025 04:19
@davepacheco davepacheco added this to the 17 milestone Sep 30, 2025
@davepacheco davepacheco self-assigned this Sep 30, 2025
Comment on lines 80 to 81
tuf_repo_pruner.nkeep_extra_target_releases = 1
tuf_repo_pruner.nkeep_extra_newly_uploaded = 1
Contributor

Is this the config we ultimately want to have? (I remember discussing setting these each to 3 or the like but I could be misremembering.)

Collaborator Author

There's a hardcoded minimum defined in constants in the background task. These are in addition to that. I did it this way so that you couldn't configure it in a way that can't work, but maybe it's more confusing?

Contributor

I like that you can't configure an unsafe value; maybe worth commenting these lines that "extra" here means "in addition to some minimums that Nexus always keeps"?

Contributor

Oh these are in addition to. Yeah, a comment would be good. (While reading through I had somehow thought the maximum of the hardcoded minimum and the configured amount would be used.)

Collaborator Author

Added a comment.
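To make the relationship concrete, here's a minimal sketch (constant and field names are assumptions, not the actual code) of how the effective keep counts might combine the hardcoded minimums with the configured extras:

// Hypothetical sketch: "extra" values from the config are added to hardcoded
// floors, so the config can never drop below the minimum Nexus must keep.

/// Hardcoded minimum: the current target release is always kept.
const MIN_KEEP_TARGET_RELEASES: u8 = 1;
/// Hardcoded minimum: the most recently uploaded repo is always kept.
const MIN_KEEP_NEWLY_UPLOADED: u8 = 1;

struct TufRepoPrunerConfig {
    nkeep_extra_target_releases: u8,
    nkeep_extra_newly_uploaded: u8,
}

/// Returns (target releases to keep, newly uploaded repos to keep).
fn effective_keep_counts(config: &TufRepoPrunerConfig) -> (u8, u8) {
    (
        MIN_KEEP_TARGET_RELEASES + config.nkeep_extra_target_releases,
        MIN_KEEP_NEWLY_UPLOADED + config.nkeep_extra_newly_uploaded,
    )
}

With this shape, setting either extra to 0 still keeps the hardcoded minimum.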

match serde_json::from_value::<TufRepoPrunerStatus>(details.clone()) {
Err(error) => eprintln!(
"warning: failed to interpret task details: {:?}: {:?}",
error, details
Contributor

InlineErrorChain for the error? Although I don't know that we could have nested causes here, so maybe doesn't matter.

Collaborator Author

Added. (This code is copied/pasted a bunch here and I did not change the others.)
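For reference, a sketch of what the suggested change might look like, assuming the InlineErrorChain helper from the slog-error-chain crate (which renders an error and its sources on a single line); the status type and surrounding function here are placeholders:

use serde::Deserialize;
use slog_error_chain::InlineErrorChain;

#[derive(Deserialize)]
struct TufRepoPrunerStatus {} // fields elided for this sketch

fn print_task_details(details: &serde_json::Value) {
    match serde_json::from_value::<TufRepoPrunerStatus>(details.clone()) {
        Err(error) => eprintln!(
            // "{}" rather than "{:?}" so the chained Display output is used
            "warning: failed to interpret task details: {}: {:?}",
            InlineErrorChain::new(&error),
            details,
        ),
        Ok(_status) => {
            // normal rendering of the status elided for this sketch
        }
    }
}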

// That's still small enough to do in one go. If we're wrong and can't
// find enough distinct releases, we'll return an error. (This seems
// extremely unlikely.)
let limit = 4 * u16::from(count);
Contributor

Tiny nit - I think if we make this i64::from(count) you wouldn't need to convert it again on line 163?

Collaborator Author

As I understand it, this means changing the use below at L289 to use a fallible conversion (either i64 to usize or the other way around). I wanted to avoid that.
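A small sketch of the conversion tradeoff being described (variable names invented): with a u16 limit, both later conversions are infallible, while an i64 limit forces a fallible conversion back to usize:

fn main() {
    let count: u8 = 4;

    // With a u16 limit, both widenings below are infallible:
    let limit = 4 * u16::from(count);
    let _sql_limit = i64::from(limit); // u16 -> i64, e.g. for a query LIMIT
    let _capacity = usize::from(limit); // u16 -> usize, e.g. for an in-memory check

    // If the limit were computed as i64 instead, going back to usize would
    // require a fallible conversion, which is what's being avoided above:
    let limit_i64 = 4 * i64::from(count);
    let _cap = usize::try_from(limit_i64).expect("non-negative limit fits in usize");
}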


// Now insert a TUF repo and try again. That alone shouldn't change
// anything.
let repo1id = insert_test_tuf_repo(opctx, datastore, 1).await;
Contributor

Ahh you have some conflicts with main on this file; sorry. E.g., I added an insert_tuf_repo helper that should probably be removed in favor of your insert_test_tuf_repo?

Collaborator Author

I've resolved the conflicts. I'd prefer to not commonize the helpers in this PR, mainly because of urgency. (Also, the helper I wrote is a little more streamlined -- it just returns the id. You're using the returned system_version in yours. I didn't want to work through the right middle ground and update all the callers today.)

let conn = self.pool_connection_authorized(opctx).await?;
let error = OptionalError::new();
self.transaction_retry_wrapper("tuf_repo_mark_pruned")
.transaction(&conn, |txn| {
Contributor

I think (?) this could be written as a CTE without too much pain. Is that worth doing and/or filing an issue for? I still don't have a good sense for weighing "we'd prefer fewer interactive transactions" against doing more work to avoid them.

Collaborator Author

I don't want to do that now because of urgency but also because this code path seems very far from any meaningful contention (or impact of contention). I would lump this with many other interactive transactions we have, although likely less severe than most. The closest existing issue is probably #973.

Comment on lines +229 to +230
repo_prune,
other_repos_eligible_to_prune,
Contributor

Why prune only one instead of all of these? We'll prune the rest of other_repos... the next N times we activate, right?

Collaborator Author

That's correct.

The true reason I did it this way is that:

  • The way this is structured right now, if you pruned more than one with the same TUF generation number and RecentTargetReleases, all subsequent ones would fail because the first one would have invalidated the assumptions that get checked (the pruning bumps the TUF generation).
  • Fixing this would require something like:
    • taking a whole other lap immediately (re-fetching the generation and maybe the recent target releases)
    • communicating to the caller somehow that the generation bumped in a predictable way (i.e., updating their generation number for them)
    • providing an API to prune multiple repos in one go, which is appealing, but then you can provide an arbitrarily large set and it could also result in a table scan (I'm not positive whether that's a problem if Cockroach is only doing it because it's fastest, not because it's the only way).

All of these seemed more trouble than they were worth, especially since in practice I think it would be extremely rare to be able to prune more than one. (I think that'd mean you uploaded two TUF repos within 5 minutes and already had one that wasn't the target release, or you set the target release rapidly, which is practically impossible, etc.)
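A minimal sketch of the failure mode in the first bullet above, using invented in-memory types rather than the real datastore and transaction machinery: each prune is decided against a generation snapshot and bumps the generation when it succeeds, so a second prune decided from the same snapshot fails its generation check.

// Invented types for illustration only.
struct TufState {
    generation: u64,
    pruned: Vec<&'static str>,
}

#[derive(Debug)]
struct GenerationMismatch;

fn mark_pruned(
    state: &mut TufState,
    repo: &'static str,
    generation_when_decided: u64,
) -> Result<(), GenerationMismatch> {
    // The prune is only valid against the state it was decided from.
    if state.generation != generation_when_decided {
        return Err(GenerationMismatch);
    }
    state.pruned.push(repo);
    // Pruning bumps the generation, invalidating other decisions made
    // from the same snapshot.
    state.generation += 1;
    Ok(())
}

fn main() {
    let mut state = TufState { generation: 7, pruned: Vec::new() };
    let snapshot = state.generation;

    // The first prune decided at generation 7 succeeds...
    assert!(mark_pruned(&mut state, "repo-a", snapshot).is_ok());
    // ...but a second prune decided from the same snapshot fails, which is
    // why the task prunes at most one repo per activation.
    assert!(mark_pruned(&mut state, "repo-b", snapshot).is_err());
}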

let datastore = cptestctx.server.server_context().nexus.datastore();
let opctx = OpContext::for_tests(logctx.log.new(o!()), datastore.clone());

// Wait for one activation of the task to avoid racing with it.
Contributor

Does this actually avoid racing, or could the task immediately fire again (and then prune between 645 and 648) if the timer goes off in the meantime?

Collaborator Author

I don't think it's totally guaranteed, but it seems extremely unlikely since the timer is 5 minutes and we will have already run it once.

Collaborator Author

@davepacheco davepacheco left a comment

Thanks for the comments!

@davepacheco davepacheco enabled auto-merge (squash) September 30, 2025 21:31
@davepacheco davepacheco merged commit 65b3308 into main Sep 30, 2025
17 checks passed
@davepacheco davepacheco deleted the dap/prune-repo-task branch September 30, 2025 22:01
@iliana iliana mentioned this pull request Oct 2, 2025