Mitigate a disk consumption issue during sync#7113

Merged
ggainey merged 1 commit into pulp:main from dralley:artifact-disk-reqs on Nov 21, 2025
Conversation

@dralley (Contributor) commented Nov 19, 2025

The ArtifactSaver stage is acting as a bottleneck due to batching, and as a result artifacts downloaded by the ArtifactDownloader stage aren't being flushed out quickly enough. Using a default batch size of 500 is too much for some stages.

closes #7064
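To illustrate the bottleneck, here is a minimal sketch, not pulpcore's actual implementation, of how a batching stage holds items back until `minsize` of them accumulate. With a large `minsize`, downloaded items sit in the batch instead of flowing on to the next stage, which is exactly the disk-buildup pattern described above:

```python
import asyncio


async def batches(queue, minsize=500):
    """Accumulate items from `queue` into batches of at least `minsize`.

    Simplified sketch (not pulpcore's real Stage.batches): items sit in
    the batch until `minsize` is reached or the queue is exhausted, so a
    large `minsize` delays how quickly downstream stages see them.
    """
    batch = []
    while True:
        item = await queue.get()
        if item is None:  # sentinel: upstream stage finished
            if batch:
                yield batch
            break
        batch.append(item)
        if len(batch) >= minsize:
            yield batch
            batch = []


async def demo():
    q = asyncio.Queue()
    for i in range(7):
        await q.put(i)
    await q.put(None)
    # With minsize=3, items are released in three batches instead of one.
    return [b async for b in batches(q, minsize=3)]


result = asyncio.run(demo())
print(result)  # [[0, 1, 2], [3, 4, 5], [6]]
```

With `minsize=500`, all seven items would be held until the sentinel arrives; lowering `minsize` is what lets the saver stage drain the downloader's output sooner.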

Comment thread on pulpcore/plugin/stages/api.py (outdated):

    yield content

-   async def batches(self, minsize=500):
+   async def batches(self, minsize=settings.MAX_CONCURRENT_CONTENT):
@dralley (Contributor, Author) Nov 19, 2025

I think this probably makes sense, but I have some additional thoughts... I wonder whether it's sufficient on its own.

  • Maybe the ArtifactSaver stage should use an independently configurable batch size, such that it can be made lower without compromising the performance of the rest of the pipeline?
  • Does it actually make sense to use settings.MAX_CONCURRENT_CONTENT here? The description of it (see "Allow batch size of artifacts processed during sync to be configured" #7037) makes it sound very much like it would. But currently that value is only used within the ArtifactDownloader stage, and maybe the meaning is distinct enough to warrant keeping them independent anyway.
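The first bullet could be sketched roughly like this. `ARTIFACT_SAVER_BATCH_SIZE` and the fallback logic are hypothetical, invented for illustration, and not an existing pulpcore setting:

```python
from types import SimpleNamespace

# Hypothetical sketch: give the ArtifactSaver stage its own batch size
# setting that falls back to the shared concurrency value when unset.
# ARTIFACT_SAVER_BATCH_SIZE is an invented name, not a real pulpcore setting.
settings = SimpleNamespace(MAX_CONCURRENT_CONTENT=25, ARTIFACT_SAVER_BATCH_SIZE=None)


def saver_batch_size():
    # The dedicated value wins when set; otherwise inherit the
    # pipeline-wide concurrency setting.
    return settings.ARTIFACT_SAVER_BATCH_SIZE or settings.MAX_CONCURRENT_CONTENT


default_size = saver_batch_size()        # 25, inherited from MAX_CONCURRENT_CONTENT
settings.ARTIFACT_SAVER_BATCH_SIZE = 10
tuned_size = saver_batch_size()          # 10, lowered independently
```

The trade-off discussed below is whether this extra knob is worth the complexity versus reusing the single shared setting.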

Contributor:

So GenericDownloader's default is 200 and this iterator's is 500 - but I suspect that both of those numbers are in the category of "well, this seems to be a reasonable number". I don't think the possible gain in fine-grained optimization is worth the complexity of having them be separate. FWIW, I'd go with this implementation, unless/until there's some serious proof it's a bad idea.

@dralley (Contributor, Author) Nov 19, 2025

The main question on my side: we might want the "how many artifacts do we save at a time" value (i.e. the batch size) to be significantly below 200. Then again, maybe there's a decent reason to keep the other value around 200, or at least a good reason not to drop it as low as we might want the batch size before the ArtifactSaver stage to be.

But otherwise I agree with you, I'd love for it to be this simple.

@dralley dralley marked this pull request as ready for review November 20, 2025 21:07
@dralley (Contributor, Author) commented Nov 20, 2025

I tested making the ArtifactSaver stage completely unbatched; it reduced sync performance by about 10%. That's a bit high, so I went with reducing the batch size to MAX_CONCURRENT_CONTENT, and then reduced that value as well. I think it makes sense not to batch up more than the ArtifactDownloader stage will even allow to be downloaded at once.

@balasankarc Was there a particular reason for choosing the number 200? Is there any good reason for keeping these independent (e.g. reducing the batching even more while keeping the concurrency higher), or otherwise not to reduce the MAX_CONCURRENT_CONTENT default to 25?

@ggainey (Contributor) commented Nov 20, 2025

> @balasankarc Was there a particular reason for choosing the number 200? Is there any good reason for keeping these independent (e.g. reducing the batching even more while keeping the concurrency higher), or otherwise not to reduce the MAX_CONCURRENT_CONTENT default to 25?

Pretty sure it was 200 because that was the original hard-coded value: https://github.com/pulp/pulpcore/pull/7037/files#diff-28560cb4dba596314e3577becbd1974a015d47c08f8f17b38e6030a701999169L142 - and it's been that way basically Forever.

@ggainey (Contributor) commented Nov 20, 2025

I agree with having just the one tuning parameter. 25 is probably kinder to our upstream-remotes than 200 was as well. But if downloading defaults to 25-concurrent where it used to be 200 concurrent, will that be noticeable to users that aren't disk/memory constrained? Or at least, "noticeable enough" to leave the default at 200 and let the users who actually are tight on disk/memory change it?

@dralley (Contributor, Author) commented Nov 21, 2025

@ggainey Remotes already had tighter default restrictions on concurrent connections/downloads than this limit, so I doubt it would make any real difference. The batch size is the only place where there's any practical amount of additional overhead.
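The point about remote-side limits can be illustrated with a generic concurrency cap: a per-remote semaphore bounds how many downloads run at once regardless of how large the pipeline's batch size is. Everything here is illustrative; it is not pulpcore's downloader API:

```python
import asyncio


async def download_all(urls, max_concurrent=5):
    """Run downloads with at most `max_concurrent` in flight at once.

    Illustrative only: the semaphore is the effective throttle, so a
    larger pipeline batch size would not increase real concurrency here.
    """
    sem = asyncio.Semaphore(max_concurrent)

    async def fetch(url):
        async with sem:
            await asyncio.sleep(0)  # stand-in for the real network I/O
            return url

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(fetch(u) for u in urls))


result = asyncio.run(download_all([f"u{i}" for i in range(8)], max_concurrent=3))
print(result)  # ['u0', 'u1', 'u2', ..., 'u7']
```

This is why lowering the shared default is unlikely to slow downloads: the remote's own connection limit, like the semaphore above, was already the binding constraint.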

ggainey merged commit d5136c3 into pulp:main on Nov 21, 2025
13 checks passed
@patchback (bot) commented Nov 21, 2025

Backport to 3.93: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.93/d5136c36ecd47c99fccff1cd41ac02438e58366c/pr-7113

Backported as #7115

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.


Linked issue: [PULP-980] Syncing of repositories with big packages can cause disk storage to be exhausted