add background task to prune TUF repos #9107
Conversation
@jgallagher points out that mupdate resolution already only works with the latest target release, which eliminates the problem I was worried about.
Co-authored-by: David Pacheco <dap@oxidecomputer.com>
tuf_repo_pruner.nkeep_extra_target_releases = 1
tuf_repo_pruner.nkeep_extra_newly_uploaded = 1
Is this the config we ultimately want to have? (I remember discussing setting these each to 3 or the like but I could be misremembering.)
There's a hardcoded minimum defined in constants in the background task. These are in addition to that. I did it this way so that you couldn't configure it in a way that can't work, but maybe it's more confusing?
I like that you can't configure an unsafe value; maybe worth commenting these lines that "extra" here means "in addition to some minimums that Nexus always keeps"?
Oh, these are in addition to that. Yeah, a comment would be good. (While reading through I had somehow thought the maximum of the hardcoded minimum and the configured amount would be used.)
Added a comment.
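To make the "extra" semantics concrete, here is a minimal sketch assuming a hardcoded floor as described above; the constant and struct names are illustrative, not the actual Nexus definitions.

```rust
// Illustrative only: these names are assumptions, not the real Nexus code.
// The point is that the configured values are added on top of a hardcoded
// floor, so no configuration can drop below the minimums Nexus always keeps.
const MIN_TARGET_RELEASES_TO_KEEP: u8 = 1; // hypothetical hardcoded minimum
const MIN_NEWLY_UPLOADED_TO_KEEP: u8 = 1; // hypothetical hardcoded minimum

struct TufRepoPrunerConfig {
    nkeep_extra_target_releases: u8,
    nkeep_extra_newly_uploaded: u8,
}

fn nkeep_target_releases(config: &TufRepoPrunerConfig) -> u8 {
    // "extra" means "in addition to the minimum Nexus always keeps".
    MIN_TARGET_RELEASES_TO_KEEP + config.nkeep_extra_target_releases
}

fn nkeep_newly_uploaded(config: &TufRepoPrunerConfig) -> u8 {
    MIN_NEWLY_UPLOADED_TO_KEEP + config.nkeep_extra_newly_uploaded
}
```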
dev-tools/omdb/src/bin/omdb/nexus.rs (Outdated)
match serde_json::from_value::<TufRepoPrunerStatus>(details.clone()) {
    Err(error) => eprintln!(
        "warning: failed to interpret task details: {:?}: {:?}",
        error, details
InlineErrorChain for the error? Although I don't know that we could have nested causes here, so maybe it doesn't matter.
Added. (This code is copied/pasted a bunch here and I did not change the others.)
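For illustration, a sketch of the suggested change, assuming slog-error-chain's InlineErrorChain (which renders the full error cause chain via Display); the standalone function is just a vehicle for the formatting call, not the actual omdb code.

```rust
use slog_error_chain::InlineErrorChain;

// Hypothetical standalone illustration: print the error's full cause chain
// inline (`{}` on InlineErrorChain) instead of the Debug form of the error.
fn warn_bad_details(error: &serde_json::Error, details: &serde_json::Value) {
    eprintln!(
        "warning: failed to interpret task details: {}: {:?}",
        InlineErrorChain::new(error),
        details,
    );
}
```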
// That's still small enough to do in one go. If we're wrong and can't
// find enough distinct releases, we'll return an error. (This seems
// extremely unlikely.)
let limit = 4 * u16::from(count);
Tiny nit - I think if we make this i64::from(count), you wouldn't need to convert it again on line 163?
As I understand it, this means changing the use below at L289 to use a fallible conversion (either i64 to usize or the other way around). I wanted to avoid that.
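A small sketch of the conversion trade-off being discussed: keeping the limit as a u16 lets both later uses convert infallibly, whereas an i64 limit would need a fallible cast back to usize. The u8 count and the two uses below are assumptions for illustration.

```rust
// Illustrative only: `count` as u8 and the two downstream uses are assumptions.
fn conversions(count: u8) {
    let limit = 4 * u16::from(count);

    // A SQL LIMIT wants an i64: infallible from u16.
    let sql_limit: i64 = i64::from(limit);

    // Comparing against a collection length wants a usize: also infallible,
    // because usize is at least 16 bits on supported platforms.
    let max_len: usize = usize::from(limit);

    let _ = (sql_limit, max_len);
}
```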
// Now insert a TUF repo and try again. That alone shouldn't change
// anything.
let repo1id = insert_test_tuf_repo(opctx, datastore, 1).await;
Ahh you have some conflicts with main on this file; sorry. E.g., I added an insert_tuf_repo helper that should probably be removed in favor of your insert_test_tuf_repo?
I've resolved the conflicts. I'd prefer to not commonize the helpers in this PR, mainly because of urgency. (Also, the helper I wrote is a little more streamlined -- it just returns the id. You're using the returned system_version in yours. I didn't want to work through the right middle ground and update all the callers today.)
let conn = self.pool_connection_authorized(opctx).await?;
let error = OptionalError::new();
self.transaction_retry_wrapper("tuf_repo_mark_pruned")
    .transaction(&conn, |txn| {
I think (?) this could be written as a CTE without too much pain. Is that worth doing and/or filing an issue for? I still don't have a good sense for weighing "we'd prefer fewer interactive transactions" against doing more work to avoid them.
I don't want to do that now because of urgency but also because this code path seems very far from any meaningful contention (or impact of contention). I would lump this with many other interactive transactions we have, although likely less severe than most. The closest existing issue is probably #973.
repo_prune,
other_repos_eligible_to_prune,
Why prune only one instead of all of these? We'll prune the rest of other_repos... the next N times we activate, right?
That's correct.
The true reason I did it this way is that:
- The way this is structured right now, if you pruned more than one with the same TUF generation number and RecentTargetReleases, all subsequent ones would fail because the first one would have invalidated the assumptions that get checked (the pruning bumps the TUF generation).
- Fixing this would require something like:
  - take a whole other lap immediately (re-fetch the generation and maybe recent target releases)
  - communicate to the caller somehow that the generation bumped in a predictable way (i.e., update their generation number for them)
  - providing an API to prune multiple in one go, which is appealing, but then you can provide an arbitrarily large set and it could also result in a table scan (I'm not positive if it's a problem if Cockroach is only doing it because it's fastest, not because it's the only way).
All of these seemed more trouble than they were worth, especially since in practice I think it would be extremely rare to be able to prune more than one. (I think that'd mean that you uploaded two TUF repos within 5 minutes and already had one that wasn't the target release; or you set the target release rapidly (which is practically impossible); etc.)
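A rough sketch of the resulting shape, with hypothetical type and function names (not the actual Nexus/datastore API): at most one repo is pruned per activation, and the prune is conditioned on the TUF generation that the candidate set was computed against.

```rust
// Hypothetical sketch only. It shows why only one repo is pruned per
// activation: marking a repo pruned bumps the TUF generation, so the
// remaining candidates (computed against the old generation) could not be
// pruned under the same precondition anyway.
use uuid::Uuid;

struct PruneDecision {
    observed_generation: u64,
    repo_prune: Option<Uuid>,
    other_repos_eligible_to_prune: Vec<Uuid>,
}

/// Stand-in for the datastore call: the real one checks that the current TUF
/// generation still equals `observed_generation` before setting `time_pruned`,
/// and bumps the generation on success.
fn tuf_repo_mark_pruned(observed_generation: u64, repo: Uuid) -> Result<(), String> {
    let _ = (observed_generation, repo);
    Ok(())
}

fn prune_step(decision: &PruneDecision) -> Result<(), String> {
    if let Some(repo) = decision.repo_prune {
        // Prune exactly one repo per activation.
        tuf_repo_mark_pruned(decision.observed_generation, repo)?;
    }
    // The rest are only reported in the task status; later activations will
    // get to them with a freshly observed generation.
    let _ = &decision.other_repos_eligible_to_prune;
    Ok(())
}
```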
let datastore = cptestctx.server.server_context().nexus.datastore();
let opctx = OpContext::for_tests(logctx.log.new(o!()), datastore.clone());

// Wait for one activation of the task to avoid racing with it.
Does this actually avoid racing, or could the task immediately fire again (and then prune between 645 and 648) if the timer goes off in the meantime?
I don't think it's totally guaranteed, but it seems extremely unlikely since the timer is 5 minutes and we will have already run it once.
Thanks for the comments!
This new background task just sets time_pruned on TUF repos according to a simple policy. With #9109, this will trigger these repos' artifacts to be deleted.
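As a rough illustration of what such a policy could look like (struct and field names are assumptions, not the real model types): keep repos that are recent target releases or among the most recently uploaded, and mark everything else pruned.

```rust
use chrono::{DateTime, Utc};
use uuid::Uuid;

// Illustrative sketch of the pruning policy, not the actual Nexus types.
struct Repo {
    id: Uuid,
    time_created: DateTime<Utc>,
    is_recent_target_release: bool,
    time_pruned: Option<DateTime<Utc>>,
}

/// Return the ids of repos to prune: everything that is neither a recent
/// target release nor among the `nkeep_newly_uploaded` most recently
/// uploaded repos. The task would then set `time_pruned` on each of these.
fn repos_to_prune(repos: &[Repo], nkeep_newly_uploaded: usize) -> Vec<Uuid> {
    let mut candidates: Vec<&Repo> = repos
        .iter()
        .filter(|r| r.time_pruned.is_none() && !r.is_recent_target_release)
        .collect();
    // Newest first; keep the first `nkeep_newly_uploaded`, prune the rest.
    candidates.sort_by_key(|r| std::cmp::Reverse(r.time_created));
    candidates
        .into_iter()
        .skip(nkeep_newly_uploaded)
        .map(|r| r.id)
        .collect()
}
```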