
Conversation

Collaborator

@davepacheco davepacheco commented Sep 26, 2025

This new background task just sets time_pruned on TUF repos according to a simple policy. With #9109, this will trigger these repos' artifacts to be deleted.
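For readers skimming the thread, the general shape of the policy (pieced together from the review discussion below) is roughly: keep the recent target releases and the recently uploaded repos, and mark at most one other repo pruned per activation. A minimal sketch, with all names and types invented for illustration rather than taken from the actual task:

// Illustrative only: the real task works against the database, keeps
// configurable minimums (see the config discussion below), and records the
// other eligible repos for later activations.
struct TufRepo {
    id: u64,
    is_recent_target_release: bool,
    is_recently_uploaded: bool,
    time_pruned: Option<std::time::SystemTime>,
}

/// Pick at most one repo to mark pruned in this activation.
fn choose_repo_to_prune(repos: &[TufRepo]) -> Option<u64> {
    repos
        .iter()
        // Already-pruned repos are out of scope.
        .filter(|r| r.time_pruned.is_none())
        // Recent target releases and newly uploaded repos are kept.
        .filter(|r| !r.is_recent_target_release && !r.is_recently_uploaded)
        .map(|r| r.id)
        .next()
}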

Base automatically changed from iliana/tuf-prune-field to main September 27, 2025 05:19

@davepacheco
Collaborator Author

@jgallagher points out that mupdate resolution already only works with the latest target release, which eliminates the problem I was worried about.

@davepacheco davepacheco reopened this Sep 27, 2025
iliana added a commit that referenced this pull request Sep 30, 2025
Co-authored-by: David Pacheco <dap@oxidecomputer.com>
@davepacheco davepacheco marked this pull request as ready for review September 30, 2025 04:19
@davepacheco davepacheco added this to the 17 milestone Sep 30, 2025
@davepacheco davepacheco self-assigned this Sep 30, 2025
Comment on lines 80 to 81
tuf_repo_pruner.nkeep_extra_target_releases = 1
tuf_repo_pruner.nkeep_extra_newly_uploaded = 1
Contributor

Is this the config we ultimately want to have? (I remember discussing setting these each to 3 or the like but I could be misremembering.)

Collaborator Author

There's a hardcoded minimum defined in constants in the background task. These are in addition to that. I did it this way so that you couldn't configure it in a way that can't work, but maybe it's more confusing?

Contributor

I like that you can't configure an unsafe value; maybe worth commenting these lines that "extra" here means "in addition to some minimums that Nexus always keeps"?

Contributor

Oh these are in addition to. Yeah, a comment would be good. (While reading through I had somehow thought the maximum of the hardcoded minimum and the configured amount would be used.)

Collaborator Author

Added a comment.
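To make the relationship concrete, here's a minimal sketch (constant and field names are assumptions, not the actual code) of how the effective keep counts might combine the hardcoded minimums with the configured extras:

// Hypothetical sketch: "extra" values from the config are added to hardcoded
// floors, so the config can never drop below the minimum Nexus must keep.

/// Hardcoded minimum: the current target release is always kept.
const MIN_KEEP_TARGET_RELEASES: u8 = 1;
/// Hardcoded minimum: the most recently uploaded repo is always kept.
const MIN_KEEP_NEWLY_UPLOADED: u8 = 1;

struct TufRepoPrunerConfig {
    nkeep_extra_target_releases: u8,
    nkeep_extra_newly_uploaded: u8,
}

/// Returns (target releases to keep, newly uploaded repos to keep).
fn effective_keep_counts(config: &TufRepoPrunerConfig) -> (u8, u8) {
    (
        MIN_KEEP_TARGET_RELEASES + config.nkeep_extra_target_releases,
        MIN_KEEP_NEWLY_UPLOADED + config.nkeep_extra_newly_uploaded,
    )
}

With this shape, setting either extra to 0 still keeps the hardcoded minimum.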

match serde_json::from_value::<TufRepoPrunerStatus>(details.clone()) {
Err(error) => eprintln!(
"warning: failed to interpret task details: {:?}: {:?}",
error, details
Contributor

InlineErrorChain for the error? Although I don't know that we could have nested causes here, so maybe doesn't matter.

Collaborator Author

Added. (This code is copied/pasted a bunch here and I did not change the others.)
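For reference, a sketch of what the suggested change might look like, assuming the InlineErrorChain helper from the slog-error-chain crate (which renders an error and its sources on a single line); the status type and surrounding function here are placeholders:

use serde::Deserialize;
use slog_error_chain::InlineErrorChain;

#[derive(Deserialize)]
struct TufRepoPrunerStatus {} // fields elided for this sketch

fn print_task_details(details: &serde_json::Value) {
    match serde_json::from_value::<TufRepoPrunerStatus>(details.clone()) {
        Err(error) => eprintln!(
            // "{}" rather than "{:?}" so the chained Display output is used
            "warning: failed to interpret task details: {}: {:?}",
            InlineErrorChain::new(&error),
            details,
        ),
        Ok(_status) => {
            // normal rendering of the status elided for this sketch
        }
    }
}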

// That's still small enough to do in one go. If we're wrong and can't
// find enough distinct releases, we'll return an error. (This seems
// extremely unlikely.)
let limit = 4 * u16::from(count);
Contributor

Tiny nit - I think if we make this i64::from(count) you wouldn't need to convert it again on line 163?

Collaborator Author

As I understand it, this means changing the use below at L289 to use a fallible conversion (either i64 to usize or the other way around). I wanted to avoid that.
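A small sketch of the conversion tradeoff being described (variable names invented): with a u16 limit, both later conversions are infallible, while an i64 limit forces a fallible conversion back to usize:

fn main() {
    let count: u8 = 4;

    // With a u16 limit, both widenings below are infallible:
    let limit = 4 * u16::from(count);
    let _sql_limit = i64::from(limit); // u16 -> i64, e.g. for a query LIMIT
    let _capacity = usize::from(limit); // u16 -> usize, e.g. for an in-memory check

    // If the limit were computed as i64 instead, going back to usize would
    // require a fallible conversion, which is what's being avoided above:
    let limit_i64 = 4 * i64::from(count);
    let _cap = usize::try_from(limit_i64).expect("non-negative limit fits in usize");
}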


// Now insert a TUF repo and try again. That alone shouldn't change
// anything.
let repo1id = insert_test_tuf_repo(opctx, datastore, 1).await;
Contributor

Ahh you have some conflicts with main on this file; sorry. E.g., I added an insert_tuf_repo helper that should probably be removed in favor of your insert_test_tuf_repo?

Collaborator Author

I've resolved the conflicts. I'd prefer to not commonize the helpers in this PR, mainly because of urgency. (Also, the helper I wrote is a little more streamlined -- it just returns the id. You're using the returned system_version in yours. I didn't want to work through the right middle ground and update all the callers today.)

let conn = self.pool_connection_authorized(opctx).await?;
let error = OptionalError::new();
self.transaction_retry_wrapper("tuf_repo_mark_pruned")
.transaction(&conn, |txn| {
Contributor

I think (?) this could be written as a CTE without too much pain. Is that worth doing and/or filing an issue for? I still don't have a good sense for weighing "we'd prefer fewer interactive transactions" against doing more work to avoid them.

Collaborator Author

I don't want to do that now because of urgency but also because this code path seems very far from any meaningful contention (or impact of contention). I would lump this with many other interactive transactions we have, although likely less severe than most. The closest existing issue is probably #973.

Comment on lines +229 to +230
repo_prune,
other_repos_eligible_to_prune,
Contributor

Why prune only one instead of all of these? We'll prune the rest of other_repos... the next N times we activate, right?

Collaborator Author

That's correct.

The true reason I did it this way is that:

  • The way this is structured right now, if you pruned more than one with the same TUF generation number and RecentTargetReleases, all subsequent ones would fail because the first one would have invalidated the assumptions that get checked (the pruning bumps the TUF generation).
  • Fixing this would require something like:
    • taking a whole other lap immediately (re-fetching the generation and maybe the recent target releases)
    • communicating to the caller somehow that the generation bumped in a predictable way (i.e., updating their generation number for them)
    • providing an API to prune multiple repos in one go, which is appealing, but then you can provide an arbitrarily large set and it could also result in a table scan (I'm not positive whether that's a problem if Cockroach is only doing it because it's fastest, not because it's the only way).

All of these seemed more trouble than they were worth, especially since in practice I think it would be extremely rare to be able to prune more than one. (I think that'd mean you uploaded two TUF repos within 5 minutes and already had one that wasn't the target release, or you set the target release rapidly, which is practically impossible, etc.)
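A minimal sketch of the failure mode in the first bullet above, using invented in-memory types rather than the real datastore and transaction machinery: each prune is decided against a generation snapshot and bumps the generation when it succeeds, so a second prune decided from the same snapshot fails its generation check.

// Invented types for illustration only.
struct TufState {
    generation: u64,
    pruned: Vec<&'static str>,
}

#[derive(Debug)]
struct GenerationMismatch;

fn mark_pruned(
    state: &mut TufState,
    repo: &'static str,
    generation_when_decided: u64,
) -> Result<(), GenerationMismatch> {
    // The prune is only valid against the state it was decided from.
    if state.generation != generation_when_decided {
        return Err(GenerationMismatch);
    }
    state.pruned.push(repo);
    // Pruning bumps the generation, invalidating other decisions made
    // from the same snapshot.
    state.generation += 1;
    Ok(())
}

fn main() {
    let mut state = TufState { generation: 7, pruned: Vec::new() };
    let snapshot = state.generation;

    // The first prune decided at generation 7 succeeds...
    assert!(mark_pruned(&mut state, "repo-a", snapshot).is_ok());
    // ...but a second prune decided from the same snapshot fails, which is
    // why the task prunes at most one repo per activation.
    assert!(mark_pruned(&mut state, "repo-b", snapshot).is_err());
}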

let datastore = cptestctx.server.server_context().nexus.datastore();
let opctx = OpContext::for_tests(logctx.log.new(o!()), datastore.clone());

// Wait for one activation of the task to avoid racing with it.
Contributor

Does this actually avoid racing, or could the task immediately fire again (and then prune between 645 and 648) if the timer goes off in the meantime?

Collaborator Author

I don't think it's totally guaranteed, but it seems extremely unlikely since the timer is 5 minutes and we will have already run it once.

Collaborator Author

@davepacheco davepacheco left a comment

Thanks for the comments!

@davepacheco davepacheco enabled auto-merge (squash) September 30, 2025 21:31
@davepacheco davepacheco merged commit 65b3308 into main Sep 30, 2025
17 checks passed
@davepacheco davepacheco deleted the dap/prune-repo-task branch September 30, 2025 22:01
@iliana iliana mentioned this pull request Oct 2, 2025