
Test multithreaded file hashing #6693

Merged
bmbouter merged 1 commit into pulp:main from dralley:thread-file-hashing
Jun 24, 2025

Conversation

@dralley
Contributor

@dralley dralley commented Jun 17, 2025

Note that since the hashing is done by C code inside of hashlib, which releases the GIL, threading actually works here despite the usual GIL rules.
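That GIL-release behavior can be demonstrated with a minimal sketch, hashing one buffer with several algorithms in parallel threads. This is a hypothetical illustration, not the PR's code; `multi_digest` and `ALGORITHMS` are made-up names (Pulp's real digest list lives in `models.Artifact.DIGEST_FIELDS`):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical digest set for illustration.
ALGORITHMS = ("md5", "sha1", "sha256", "sha512")

def multi_digest(data: bytes) -> dict:
    """Compute several digests of the same buffer in parallel threads."""
    def digest(name):
        # hashlib's update() is C code that releases the GIL for large
        # buffers, so these calls can genuinely run in parallel.
        h = hashlib.new(name)
        h.update(data)
        return h.hexdigest()

    with ThreadPoolExecutor(max_workers=len(ALGORITHMS)) as executor:
        return dict(zip(ALGORITHMS, executor.map(digest, ALGORITHMS)))
```

With a CPU-bound pure-Python workload this layout would gain nothing; it only helps because the digest computation happens outside the GIL.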

@dkliban
Member

dkliban commented Jun 19, 2025

I used the pyinstrument task profiler while uploading a 145MB RPM. Before this patch was applied, the task took 1.17s; after the patch, the runtime was 0.99s. I am attaching an archive that contains the output of pyinstrument.
pyinstrument.zip

bmbouter
bmbouter previously approved these changes Jun 19, 2025
Member

@bmbouter bmbouter left a comment


I'm +1 on merging this.

@dralley dralley marked this pull request as ready for review June 19, 2025 15:47
@dralley dralley marked this pull request as draft June 19, 2025 15:52
Comment thread pulpcore/app/files.py
with ThreadPoolExecutor(max_workers=6) as executor:
while data := file.read(1048576):
for hasher in models.Artifact.DIGEST_FIELDS:
executor.submit(instance.hashers[hasher].update, data)
Contributor Author

@dralley dralley Jun 19, 2025


Theoretically it's possible for data to build up in memory if we submit tasks faster than they complete, since the refcounts on the chunks wouldn't drop to zero immediately. I'm not sure that's much of a problem in practice, but I'm calling it out.

Contributor Author

@dralley dralley Jun 19, 2025


I could do here what I've done below and add a concurrent.futures.wait() to wrap everything up before the loop resets. That probably trades off some throughput, but it's no worse than what we're doing currently, so 🤷
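The bounded variant being described could be sketched like this. It is a hypothetical illustration, not the PR's code; `hash_file` and `CHUNK_SIZE` are made-up names, though the chunk size matches the 1048576-byte read in the quoted snippet:

```python
from concurrent.futures import ThreadPoolExecutor, wait

CHUNK_SIZE = 1048576  # 1 MiB, matching the read size in the snippet above

def hash_file(file, hashers, executor):
    """Feed each chunk to every hasher in parallel, but wait for all
    updates to finish before reading the next chunk, so at most one
    chunk is ever in flight."""
    while data := file.read(CHUNK_SIZE):
        futures = [executor.submit(h.update, data) for h in hashers]
        # Block until every hasher has consumed this chunk. This bounds
        # memory to a single chunk, at some cost to pipeline throughput.
        wait(futures)
```

The wait() also sidesteps any risk of two pending update() calls racing on the same hasher, since each chunk is fully processed before the next is submitted.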

Since the hashing is done by C code inside of hashlib which
releases the GIL, threading actually does work here despite the normal
GIL rules.

This adds multithreading to uploads and downloads of files.
Comment thread pulpcore/download/base.py
values are header content. None when not using the HttpDownloader or subclass.
"""

THREADPOOL = ThreadPoolExecutor()
Contributor Author


Shared between all downloaders. I figured that was better than trying to set up a separate threadpool for every downloader.

@dralley
Contributor Author

dralley commented Jun 19, 2025

Added multithreading to the downloader digest calculation, which should improve download performance, not just uploads, and in particular on-demand downloads (since hashing happens on the fly there).

@dralley dralley requested a review from bmbouter June 19, 2025 16:58
@dralley
Contributor Author

dralley commented Jun 19, 2025

Another 25% of the runtime could probably be shaved off by

Just writing that for posterity. May or may not be worth it

@dralley dralley marked this pull request as ready for review June 20, 2025 16:26
Member

@bmbouter bmbouter left a comment


Still looking good!

@bmbouter bmbouter merged commit 4fee569 into pulp:main Jun 24, 2025
13 checks passed
@dralley dralley deleted the thread-file-hashing branch June 24, 2025 17:20

3 participants