Optimize re-sync content#7754
Conversation
| cache_hits_by_type = defaultdict(lambda: Q(pk__in=[])) | ||
|
|
||
| for d_content in batch: | ||
| if d_content.content._state.adding: |
There was a problem hiding this comment.
Theoretically content could be passed through already saved, in which case I think we're probably not touch ing it. That might be an existing bug, though not a particularly serious one.
There was a problem hiding this comment.
I am not sure I understand this. Why do we need to call touch when content already exists? Is it because of orphan clean up?
There was a problem hiding this comment.
Yes, it basically resets the orphan cleanup protection timer.
There was a problem hiding this comment.
This is really an existing issue, but we're executing the content_q query (w/ natural keys) twice (once to touch, and once for the model swap), and also executing a touch query twice (once with cache hits, once with existing packages that were not in the latest version cache).
Is it possible to collect PKs within the loop, combine them with the cache hit PKs, and have one touch block below this loop?
Or (maybe question for @mdellweg), does that touch really need to happen before the swap for a timing-related reason?
There was a problem hiding this comment.
Updated. I kept touch before swap for now
|
Looks good! A few things to look at, maybe we can make this even more efficient |
|
Measured benefit using https://fixtures.pulpproject.org/file-perf/ (20k files) with immediate policy sync: Without patch: With patch: The biggest impact comes from the Syncing with on-demand policy is already fast and is not significantly affected by this change (Sync time: 92 seconds, Re-sync time: 42 seconds). |
|
@dralley I will squash the commits before merging, but I am keeping them separate for easier review now. |
a05fba5 to
c4c68af
Compare
|
@jobselko Keep the The rest can be squashed and stay in this PR. Great work! This is a huge performance improvement 🚀 👍 |
| ) | ||
|
|
||
| for model_type, results in db_results_by_type.items(): | ||
| for result in results: |
There was a problem hiding this comment.
A little worried that the additional loops here would outweigh the reduced # of queries
There was a problem hiding this comment.
I can rework it, but that would result in "swap before touch". Alternatively, I can revert the last commit and keep the two queries. Which option would you prefer?
There was a problem hiding this comment.
@mdellweg I feel like it was either you or Dennis that dealt with the touch issues last, do you know if touch need to happen before the existing model swap for correctness reasons?
c4c68af to
c54ccc1
Compare
|
@jobselko Can you squash commits 1, 2, and 4, and address the merge conflict? I think it's just the |
|
Discussed with @dralley that we are not going to merge this today. |
3aa4e1a to
9f40b0d
Compare
Assisted By: Claude Opus 4.6
9f40b0d to
37ce3a4
Compare
| model_type.objects.filter(content_q).iterator() | ||
| model_type.objects.filter(content_q).defer(*deferred).iterator() | ||
| ): | ||
| db_results_by_type[model_type].append(result) |
There was a problem hiding this comment.
| db_results_by_type[model_type].append(result) | |
| db_results_by_type[model_type].append(result) | |
| for d_content in d_content_by_nat_key[result.natural_key()]: | |
| d_content.content = result |
Move this here and delete the added loops on line 107-108.
There was a problem hiding this comment.
But this would cause "swap before touch" which we wanted to avoid.
There was a problem hiding this comment.
We need to verify that it's OK to swap before touching. It MIGHT be ok, but IIRC the touch stuff was particularly subtle and the order it was done in might matter.
| all_types = set(cache_hits_by_type.keys()) | set(db_results_by_type.keys()) | ||
| for model_type in all_types: | ||
| pks = cache_hits_by_type.get(model_type, set()) | ||
| pks = pks | {r.pk for r in db_results_by_type.get(model_type, [])} |
There was a problem hiding this comment.
| pks = pks | {r.pk for r in db_results_by_type.get(model_type, [])} | |
| pks |= {r.pk for r in db_results_by_type.get(model_type, [])} |
| `deferred_fields` - a mapping of content model class to a list of field names | ||
| to exclude from queries via Django's `defer()`. |
There was a problem hiding this comment.
Is deferred the right way? Models can have a lot of fields, but only a few are ever used during a sync. Wouldn't only be the more practical option?
There was a problem hiding this comment.
See this discussion (hidden because it was marked resolved) #7754 (comment)
TL;DR yes .only() could be theoretically more efficient, but it could also easily cause a regression without plugin intervention. You have to keep track of which fields are being used rigorously.
OTOH .defer() would not cause a regression, and since it's really only a small number of fields on a small number of models that are a problem, it is less invasive.
📜 Checklist
See: Pull Request Walkthrough