Optimize re-sync content by jobselko · Pull Request #7754 · pulp/pulpcore

jobselko · 2026-05-29T16:56:15Z

📜 Checklist

Commits are cleanly separated with meaningful messages (simple features and bug fixes should be squashed to one commit)
A changelog entry or entries has been added for any significant changes
Follows the Pulp policy on AI Usage
(For new features) - User documentation and test coverage has been added

dralley · 2026-06-01T04:33:54Z

+            cache_hits_by_type = defaultdict(lambda: Q(pk__in=[]))
+
            for d_content in batch:
                if d_content.content._state.adding:


Theoretically, content could be passed through already saved, in which case I think we're probably not touching it. That might be an existing bug, though not a particularly serious one.

I am not sure I understand this. Why do we need to call touch when content already exists? Is it because of orphan cleanup?

Yes, it basically resets the orphan cleanup protection timer.
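
For readers following the thread, here is a minimal sketch of what "resetting the orphan cleanup protection timer" means. It is not pulpcore code: `ContentRow`, `touch`, `orphan_cleanup`, and the field name `timestamp_of_interest` are stand-ins assumed for illustration, and the 24-hour protection window is an arbitrary value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ContentRow:
    """Hypothetical stand-in for a content row; not pulpcore's model."""
    pk: int
    timestamp_of_interest: datetime
    in_repository: bool = False


def touch(rows, now):
    """Reset the protection timer on every given row."""
    for row in rows:
        row.timestamp_of_interest = now


def orphan_cleanup(rows, now, protection_time):
    """Return the rows that survive: referenced rows plus recently touched orphans."""
    cutoff = now - protection_time
    return [r for r in rows if r.in_repository or r.timestamp_of_interest > cutoff]


start = datetime(2026, 6, 1, 12, 0)
protection = timedelta(hours=24)  # assumed value, for illustration only
old = start - timedelta(days=3)

# Two orphans, both last touched three days ago.
rows = [ContentRow(1, old), ContentRow(2, old)]

# A sync sees pk=1 again and touches it; pk=2 is not seen.
touch([rows[0]], now=start)

survivors = orphan_cleanup(rows, now=start + timedelta(hours=1), protection_time=protection)
surviving_pks = [r.pk for r in survivors]
print(surviving_pks)
```

In this sketch only the touched row survives a cleanup that runs an hour later, which is why already-saved content that skips the touch could, in principle, be removed while a sync still intends to use it.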

dralley · 2026-06-01T04:43:18Z

This is really an existing issue, but we're executing the content_q query (w/ natural keys) twice (once to touch, and once for the model swap), and also executing a touch query twice (once with cache hits, once with existing packages that were not in the latest version cache).

Is it possible to collect PKs within the loop, combine them with the cache hit PKs, and have one touch block below this loop?

Or (maybe question for @mdellweg), does that touch really need to happen before the swap for a timing-related reason?

Updated. I kept touch before swap for now
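
A rough sketch of the "collect PKs, then touch once" idea from this thread. Model types are plain strings here and `touch` only records its calls; both are assumptions made so the example runs on its own, not the code in the PR.

```python
from collections import defaultdict

# Hypothetical inputs: (model type, pk) pairs from the two sources discussed above.
cache_hits = [("FileContent", 1), ("FileContent", 2), ("RpmPackage", 10)]
db_results = [("FileContent", 2), ("FileContent", 3)]

touch_calls = []


def touch(model_type, pks):
    """Record one touch query; a real implementation would issue an UPDATE."""
    touch_calls.append((model_type, sorted(pks)))


# Collect PKs from both sources into one set per model type.
pks_by_type = defaultdict(set)
for model_type, pk in cache_hits:
    pks_by_type[model_type].add(pk)
for model_type, pk in db_results:
    pks_by_type[model_type].add(pk)

# One touch per model type, issued after the loop instead of once per source.
for model_type in sorted(pks_by_type):
    touch(model_type, pks_by_type[model_type])

print(touch_calls)
```

The set union also removes duplicates, so a PK found both in the cache and by the natural-key query is touched once.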

dralley · 2026-06-01T04:44:42Z

Looks good! A few things to look at, maybe we can make this even more efficient

jobselko · 2026-06-02T16:52:24Z

Measured benefit using https://fixtures.pulpproject.org/file-perf/ (20k files) with immediate policy sync:

Without patch:
Sync time: 508 seconds
Re-sync time: 569 seconds

With patch:
Sync time: 513 seconds
Re-sync time: 68 seconds

The biggest impact comes from the QueryExistingArtifacts fix.

Syncing with on-demand policy is already fast and is not significantly affected by this change (Sync time: 92 seconds, Re-sync time: 42 seconds).
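
For reference, the immediate-policy re-sync numbers above work out as follows. This is plain arithmetic on the reported timings, nothing more.

```python
# Timings in seconds, copied from the measurements above (immediate policy).
resync_without_patch = 569
resync_with_patch = 68

speedup = resync_without_patch / resync_with_patch
reduction_pct = 100 * (resync_without_patch - resync_with_patch) / resync_without_patch

print(f"speedup: {speedup:.1f}x, reduction: {reduction_pct:.0f}%")
```

The initial sync is essentially unchanged (508 vs. 513 seconds), so the gain is specific to re-sync.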

jobselko · 2026-06-02T16:56:02Z

@dralley I will squash the commits before merging, but I am keeping them separate for easier review now.

dralley · 2026-06-03T01:52:02Z

@jobselko Keep the QueryExistingArtifacts commit separate and split it off into an independent PR. I think we might end up backporting that one, because it's very self-contained and has a high impact. Satellite might want it.

The rest can be squashed and stay in this PR.

Great work! This is a huge performance improvement 🚀 👍

dralley · 2026-06-03T03:28:04Z

+                        )
+
+            for model_type, results in db_results_by_type.items():
+                for result in results:


A little worried that the additional loops here would outweigh the reduced # of queries

I can rework it, but that would result in "swap before touch". Alternatively, I can revert the last commit and keep the two queries. Which option would you prefer?

@mdellweg I feel like it was either you or Dennis that dealt with the touch issues last. Do you know if touch needs to happen before the existing model swap for correctness reasons?

dralley · 2026-06-03T15:09:04Z

@jobselko Can you squash commits 1, 2, and 4, and address the merge conflict?

I think it's just the `wip - reduce redundant queries` commit that needs more attention before merge. We could also make that a separate PR, to get everything else merged today for the release?

jobselko · 2026-06-03T15:28:34Z

Discussed with @dralley that we are not going to merge this today.

Assisted By: Claude Opus 4.6

gerrod3 · 2026-06-04T14:31:36Z

-                    model_type.objects.filter(content_q).iterator()
+                    model_type.objects.filter(content_q).defer(*deferred).iterator()
                ):
+                    db_results_by_type[model_type].append(result)


Suggested change

-                    db_results_by_type[model_type].append(result)
+                    db_results_by_type[model_type].append(result)
+                    for d_content in d_content_by_nat_key[result.natural_key()]:
+                        d_content.content = result

Move this here and delete the added loops on lines 107-108.

But this would cause "swap before touch" which we wanted to avoid.

We need to verify that it's OK to swap before touching. It MIGHT be ok, but IIRC the touch stuff was particularly subtle and the order it was done in might matter.
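
To make the ordering question concrete, here is a small sketch of the restructuring suggested above, where the swap happens inside the result loop and the touch runs afterwards. `Content` and `DeclarativeContent` are simplified stand-ins assumed for illustration. The sketch only shows the resulting order of operations; it does not settle whether that order is safe.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Content:
    """Hypothetical content unit; `pk` is None until it is saved."""
    name: str
    pk: Optional[int] = None

    def natural_key(self):
        return (self.name,)


@dataclass
class DeclarativeContent:
    """Hypothetical wrapper that carries a content unit through the pipeline."""
    content: Content


batch = [
    DeclarativeContent(Content("a")),
    DeclarativeContent(Content("b")),
    DeclarativeContent(Content("a")),
]
existing = [Content("a", pk=1)]  # what the natural-key query would return

# Index the batch by natural key so each result can find its units.
d_content_by_nat_key = defaultdict(list)
for d_content in batch:
    d_content_by_nat_key[d_content.content.natural_key()].append(d_content)

events = []
touched = set()
for result in existing:
    touched.add(result.pk)  # collected for the later touch
    for d_content in d_content_by_nat_key[result.natural_key()]:
        d_content.content = result  # swap the unsaved unit for the saved one
        events.append(("swap", result.pk))
events.append(("touch", sorted(touched)))

print(events)
```

Both units with natural key `("a",)` end up pointing at the saved row before the touch is recorded, which is the "swap before touch" order under discussion.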

gerrod3 · 2026-06-04T14:31:54Z

+            all_types = set(cache_hits_by_type.keys()) | set(db_results_by_type.keys())
+            for model_type in all_types:
+                pks = cache_hits_by_type.get(model_type, set())
+                pks = pks | {r.pk for r in db_results_by_type.get(model_type, [])}


Suggested change

-                pks = pks | {r.pk for r in db_results_by_type.get(model_type, [])}
+                pks |= {r.pk for r in db_results_by_type.get(model_type, [])}
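
One behavioral difference worth keeping in mind with this change: `|=` on a set updates it in place, while `pks = pks | ...` builds a new set. The sketch below assumes `cache_hits_by_type` maps model types to sets, as in the diff above, and uses a string key as a stand-in. Whether the difference matters depends on whether `cache_hits_by_type` is read again later, which this excerpt does not show.

```python
cache_hits_by_type = {"FileContent": {1, 2}}  # string key is a stand-in for a model class
db_pks = {2, 3}

# Rebinding: builds a new set, so the cached set is left alone.
pks = cache_hits_by_type.get("FileContent", set())
pks = pks | db_pks
after_rebind = set(cache_hits_by_type["FileContent"])

# In place: `pks` is the cached set itself, so the cache grows too.
pks = cache_hits_by_type.get("FileContent", set())
pks |= db_pks
after_inplace = set(cache_hits_by_type["FileContent"])

print(after_rebind, after_inplace)
```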

gerrod3 · 2026-06-04T14:38:48Z

+    `deferred_fields` - a mapping of content model class to a list of field names
+    to exclude from queries via Django's `defer()`.


Is `defer()` the right way? Models can have a lot of fields, but only a few are ever used during a sync. Wouldn't `only()` be the more practical option?

See this discussion (hidden because it was marked resolved) #7754 (comment)

TL;DR yes, .only() could theoretically be more efficient, but it could also easily cause a regression without plugin intervention. You have to rigorously keep track of which fields are being used.

OTOH .defer() would not cause a regression, and since it's really only a small number of fields on a small number of models that are a problem, it is less invasive.
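
The trade-off can be sketched without Django by modeling `only()` as an allowlist and `defer()` as a denylist. In Django, reading a field that was not loaded triggers an extra query for that instance, which is the regression risk mentioned above. All field names and helper functions here are hypothetical.

```python
# Hypothetical set of fields on a content model.
ALL_FIELDS = {"pk", "name", "sha256", "big_json_blob", "changelog"}


def loaded_with_only(*wanted):
    """Allowlist: load just the named fields (plus the primary key)."""
    return {"pk", *wanted}


def loaded_with_defer(*skipped):
    """Denylist: load everything except the named fields."""
    return ALL_FIELDS - set(skipped)


def lazy_loads(loaded, accessed):
    """Fields that would each need an extra per-row query when accessed."""
    return sorted(set(accessed) - loaded)


# Core picks the field lists; later a plugin starts reading `sha256` during sync.
accessed_by_plugin = ["name", "sha256"]

only_misses = lazy_loads(loaded_with_only("name"), accessed_by_plugin)
defer_misses = lazy_loads(
    loaded_with_defer("big_json_blob", "changelog"), accessed_by_plugin
)

print(only_misses, defer_misses)
```

With the allowlist, the newly used field silently becomes one query per row; with the denylist, nothing changes unless the plugin reads one of the few deferred fields.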

jobselko self-assigned this May 29, 2026

github-actions Bot added no-changelog no-issue labels May 29, 2026

dralley reviewed Jun 1, 2026

Comment thread pulpcore/plugin/stages/content_stages.py

dralley reviewed Jun 1, 2026

Comment thread pulpcore/plugin/stages/content_stages.py Outdated

dralley reviewed Jun 1, 2026

github-actions Bot added the multi-commit label Jun 1, 2026

dralley mentioned this pull request Jun 1, 2026

Sync optimization: do existing content check in first stage pulp/pulp_rpm#4471

Merged

4 tasks

jobselko marked this pull request as ready for review June 2, 2026 16:54

jobselko force-pushed the optimize_replication branch 2 times, most recently from a05fba5 to c4c68af Compare June 2, 2026 17:20

dralley reviewed Jun 3, 2026

Comment thread pulpcore/plugin/stages/content_stages.py Outdated

dralley reviewed Jun 3, 2026

Comment thread pulpcore/plugin/stages/content_stages.py Outdated

jobselko force-pushed the optimize_replication branch from c4c68af to c54ccc1 Compare June 3, 2026 13:03

jobselko mentioned this pull request Jun 3, 2026

[PULP-1668] Optimize QueryExistingArtifacts queries #7760

Merged

4 tasks

jobselko force-pushed the optimize_replication branch from 3aa4e1a to 9f40b0d Compare June 3, 2026 15:34

jobselko added 4 commits June 4, 2026 13:42

Optimize re-sync content

1f5e3b3

Assisted By: Claude Opus 4.6

wip - fix review findings

57a04c8

wip - reduce redundant queries

22fa096

Rework and expose deferred_fields

37ce3a4

jobselko force-pushed the optimize_replication branch from 9f40b0d to 37ce3a4 Compare June 4, 2026 11:43

gerrod3 reviewed Jun 4, 2026
