
Optimize re-sync content #7754

Open
jobselko wants to merge 4 commits into pulp:main from jobselko:optimize_replication

@jobselko
Member

📜 Checklist

  • Commits are cleanly separated with meaningful messages (simple features and bug fixes should be squashed to one commit)
  • A changelog entry or entries has been added for any significant changes
  • Follows the Pulp policy on AI Usage
  • (For new features) - User documentation and test coverage has been added

See: Pull Request Walkthrough

Comment thread pulpcore/plugin/stages/content_stages.py
Comment thread pulpcore/plugin/stages/content_stages.py Outdated
cache_hits_by_type = defaultdict(lambda: Q(pk__in=[]))

for d_content in batch:
    if d_content.content._state.adding:
Contributor

dralley commented Jun 1, 2026

Theoretically content could be passed through already saved, in which case I think we're probably not touching it. That might be an existing bug, though not a particularly serious one.

Member Author

I am not sure I understand this. Why do we need to call touch when content already exists? Is it because of orphan cleanup?

Contributor

Yes, it basically resets the orphan cleanup protection timer.
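
To illustrate the idea, here is a minimal sketch assuming a per-content timestamp and a fixed protection window. The names `ContentRecord`, `touch`, `is_orphan_cleanup_candidate`, and `PROTECTION_WINDOW` are hypothetical and are not pulpcore's actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical protection window; a real deployment would configure this.
PROTECTION_WINDOW = timedelta(hours=24)


@dataclass
class ContentRecord:
    """Toy stand-in for a content unit (not the pulpcore model)."""
    pk: int
    last_touched: datetime
    in_repository: bool = False


def touch(record: ContentRecord, now: datetime) -> None:
    """Reset the protection timer by recording that the content was just seen."""
    record.last_touched = now


def is_orphan_cleanup_candidate(record: ContentRecord, now: datetime) -> bool:
    """Orphaned content is eligible only once its protection window has elapsed."""
    if record.in_repository:
        return False
    return now - record.last_touched >= PROTECTION_WINDOW


now = datetime(2026, 6, 1, 12, 0)
stale = ContentRecord(pk=1, last_touched=now - timedelta(days=3))

eligible_before_touch = is_orphan_cleanup_candidate(stale, now)
touch(stale, now)
eligible_after_touch = is_orphan_cleanup_candidate(stale, now)
```

In this toy model, content that a sync re-encounters but does not touch stays eligible for cleanup, which is the gap the comment above points at.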

Contributor

dralley commented Jun 1, 2026

This is really an existing issue, but we're executing the content_q query (w/ natural keys) twice (once to touch, and once for the model swap), and also executing a touch query twice (once with cache hits, once with existing packages that were not in the latest version cache).

Is it possible to collect PKs within the loop, combine them with the cache hit PKs, and have one touch block below this loop?

Or (maybe question for @mdellweg), does that touch really need to happen before the swap for a timing-related reason?
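
A rough sketch of the "collect PKs, then touch once" idea, assuming PKs are grouped per model type. The `touch` function, the string model types, and the sample data are hypothetical stand-ins, not the stage's real code.

```python
from collections import defaultdict

touch_calls = []  # records each simulated touch query


def touch(model_type: str, pks: set) -> None:
    """Hypothetical stand-in for one bulk touch query per model type."""
    touch_calls.append((model_type, frozenset(pks)))


# PKs already known from the latest-version cache (assumed input).
cache_hits_by_type = {"file": {1, 2}}

# Rows found by the natural-key query for content that missed the cache.
db_results = [("file", 3), ("file", 4), ("rpm", 10)]

# Merge both sources into one PK set per model type.
pks_by_type = defaultdict(set)
for model_type, pks in cache_hits_by_type.items():
    pks_by_type[model_type] |= pks
for model_type, pk in db_results:
    pks_by_type[model_type].add(pk)

# One touch per model type, issued after the loop.
for model_type, pks in pks_by_type.items():
    touch(model_type, pks)
```

With this shape the number of touch queries depends on the number of model types in the batch, not on how many sources the PKs came from.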

Member Author

Updated. I kept touch before swap for now.

@dralley
Contributor

dralley commented Jun 1, 2026

Looks good! There are a few things to look at; maybe we can make this even more efficient.

@jobselko
Member Author

jobselko commented Jun 2, 2026

Measured benefit using https://fixtures.pulpproject.org/file-perf/ (20k files) with immediate policy sync:

Without patch:
Sync time: 508 seconds
Re-sync time: 569 seconds

With patch:
Sync time: 513 seconds
Re-sync time: 68 seconds

The biggest impact comes from the QueryExistingArtifacts fix.

Syncing with on-demand policy is already fast and is not significantly affected by this change (Sync time: 92 seconds, Re-sync time: 42 seconds).
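
For the immediate-policy re-sync numbers above, the implied ratios can be worked out as follows. This is plain arithmetic on the reported timings and involves no additional measurements.

```python
# Re-sync timings reported above, in seconds (immediate policy, 20k files).
resync_without_patch = 569
resync_with_patch = 68

speedup = resync_without_patch / resync_with_patch
time_saved_fraction = 1 - resync_with_patch / resync_without_patch

print(f"speedup: {speedup:.1f}x, time saved: {time_saved_fraction:.0%}")
# prints "speedup: 8.4x, time saved: 88%"
```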

@jobselko marked this pull request as ready for review on June 2, 2026 16:54
@jobselko
Member Author

jobselko commented Jun 2, 2026

@dralley I will squash the commits before merging, but I am keeping them separate for easier review now.

@jobselko force-pushed the optimize_replication branch 2 times, most recently from a05fba5 to c4c68af on June 2, 2026 17:20
@dralley
Contributor

dralley commented Jun 3, 2026

@jobselko Keep the QueryExistingArtifacts commit separate and split it off into an independent PR. I think we might end up backporting that one, because it's very self-contained and has a high impact. Satellite might want it.

The rest can be squashed and stay in this PR.

Great work! This is a huge performance improvement 🚀 👍

)

for model_type, results in db_results_by_type.items():
    for result in results:
Contributor

I'm a little worried that the cost of the additional loops here would outweigh the benefit of the reduced number of queries.

Member Author

I can rework it, but that would result in "swap before touch". Alternatively, I can revert the last commit and keep the two queries. Which option would you prefer?

Contributor

@mdellweg I feel like it was either you or Dennis that dealt with the touch issues last. Do you know if touch needs to happen before the existing model swap for correctness reasons?

Comment thread pulpcore/plugin/stages/content_stages.py Outdated
Comment thread pulpcore/plugin/stages/content_stages.py Outdated
@dralley
Contributor

dralley commented Jun 3, 2026

@jobselko Can you squash commits 1, 2, and 4, and address the merge conflict?

I think it's just the "wip - reduce redundant queries" commit that needs more attention before merge. We could make that a separate PR also, to get everything else merged today for the release?

@jobselko
Member Author

jobselko commented Jun 3, 2026

Discussed with @dralley that we are not going to merge this today.

@jobselko force-pushed the optimize_replication branch from 3aa4e1a to 9f40b0d on June 3, 2026 15:34
@jobselko force-pushed the optimize_replication branch from 9f40b0d to 37ce3a4 on June 4, 2026 11:43
-     model_type.objects.filter(content_q).iterator()
+     model_type.objects.filter(content_q).defer(*deferred).iterator()
  ):
      db_results_by_type[model_type].append(result)
Contributor

Suggested change
-     db_results_by_type[model_type].append(result)
+     db_results_by_type[model_type].append(result)
+     for d_content in d_content_by_nat_key[result.natural_key()]:
+         d_content.content = result

Move this here and delete the added loops on lines 107-108.
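
A toy version of the suggested in-loop swap, assuming wrappers are indexed by natural key ahead of time. `Content` and `DeclarativeContent` here are simplified hypothetical classes, and `db_results` stands in for the rows returned by the natural-key query.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Content:
    """Toy content unit; `pk` is None until it has been saved."""
    name: str
    pk: Optional[int] = None

    def natural_key(self) -> tuple:
        return (self.name,)


@dataclass
class DeclarativeContent:
    """Toy wrapper exposing a `content` attribute, as in the snippet above."""
    content: Content


batch = [
    DeclarativeContent(Content("a")),
    DeclarativeContent(Content("b")),
    DeclarativeContent(Content("a")),
]

# Index unsaved content by natural key so each DB row finds its wrappers directly.
d_content_by_nat_key = defaultdict(list)
for d_content in batch:
    d_content_by_nat_key[d_content.content.natural_key()].append(d_content)

# Stand-in for iterating the natural-key query results.
db_results = [Content("a", pk=7)]

for result in db_results:
    for d_content in d_content_by_nat_key[result.natural_key()]:
        d_content.content = result  # swap the unsaved object for the saved one
```

Doing the swap in the same pass as the query avoids a second loop over the results, at the cost of swapping before any later touch step runs.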

Member Author

But this would cause "swap before touch" which we wanted to avoid.

Contributor

We need to verify that it's OK to swap before touching. It MIGHT be ok, but IIRC the touch stuff was particularly subtle and the order it was done in might matter.

all_types = set(cache_hits_by_type.keys()) | set(db_results_by_type.keys())
for model_type in all_types:
    pks = cache_hits_by_type.get(model_type, set())
    pks = pks | {r.pk for r in db_results_by_type.get(model_type, [])}
Contributor

Suggested change
-     pks = pks | {r.pk for r in db_results_by_type.get(model_type, [])}
+     pks |= {r.pk for r in db_results_by_type.get(model_type, [])}

Comment on lines +39 to +40
`deferred_fields` - a mapping of content model class to a list of field names
to exclude from queries via Django's `defer()`.
Contributor

Is `defer()` the right approach? Models can have a lot of fields, but only a few are ever used during a sync. Wouldn't `only()` be the more practical option?

Contributor

dralley commented Jun 4, 2026

See this discussion (hidden because it was marked resolved) #7754 (comment)

TL;DR: yes, `.only()` could theoretically be more efficient, but it could also easily cause a regression without plugin intervention. You have to rigorously keep track of which fields are being used.

OTOH .defer() would not cause a regression, and since it's really only a small number of fields on a small number of models that are a problem, it is less invasive.
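
The trade-off described above can be sketched with a toy model in plain Python. This is not Django code: `LazyRow` and `fetch` are hypothetical stand-ins that mimic how reading a field that was not loaded costs one extra query per row, and the field names are made up.

```python
ALL_FIELDS = {"name", "checksum", "changelog"}


class LazyRow:
    """Toy model instance that counts lazy loads of fields it did not fetch."""

    def __init__(self, stored: dict, loaded_fields: set, stats: dict):
        self._stored = stored
        self._loaded = {f: stored[f] for f in loaded_fields}
        self._stats = stats

    def get(self, field: str):
        if field not in self._loaded:
            # Stands in for the extra query issued when a non-loaded field is read.
            self._stats["extra_queries"] += 1
            self._loaded[field] = self._stored[field]
        return self._loaded[field]


def fetch(rows, loaded_fields, stats):
    """Simulate one query that loads only `loaded_fields` for every row."""
    return [LazyRow(r, loaded_fields, stats) for r in rows]


rows = [
    {"name": f"pkg{i}", "checksum": f"sha{i}", "changelog": "x" * 1000}
    for i in range(3)
]

# Denylist, like defer("changelog"): everything except the heavy field is loaded.
defer_stats = {"extra_queries": 0}
for row in fetch(rows, ALL_FIELDS - {"changelog"}, defer_stats):
    row.get("name"), row.get("checksum")

# Allowlist, like only("name"): it misses a field the sync code also reads.
only_stats = {"extra_queries": 0}
for row in fetch(rows, {"name"}, only_stats):
    row.get("name"), row.get("checksum")
```

In this sketch the denylist never triggers a lazy load for the fields the loop reads, while the incomplete allowlist pays one extra query per row, which is the kind of silent regression the comment above warns about.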


3 participants