The export process spends a significant amount of time on hashing #4447
Comments
@mdellweg @ggainey do either of you know why, in the case of a chunked export, we still compute a global checksum (https://github.com/dralley/pulpcore/blob/main/pulpcore/app/tasks/export.py#L502)? As @dralley points out, check-summing is apparently expensive to perform.
It appears that we reconstruct the whole file from its constituent parts here. So the chunk checksums serve as an early warning that something is wrong, and the total file checksum ensures that the reconstruction occurred correctly. We also delete chunks after adding them to the reconstructed file, to avoid requiring twice as much disk space, so you can't easily "go back".

We can at least parallelize the checksumming of the chunks easily, so I'm going to do that. But I'm not certain we can fully get rid of the global checksum without replacing it somehow. Since this is not security sensitive, we could use a faster checksum such as crc32 or crc64, especially if we keep using sha256 on the chunks.
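The parallel chunk checksumming described above could be sketched roughly like this (a minimal illustration, not pulpcore code; the helper names are assumptions). Note that `hashlib` releases the GIL while hashing large buffers, so a thread pool is enough to get real parallelism here:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def sha256_of_file(path):
    """Stream one file through sha256 without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(block)
    return path, digest.hexdigest()


def hash_chunks_in_parallel(chunk_paths):
    """Checksum each chunk concurrently; returns {path: hexdigest}."""
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(sha256_of_file, chunk_paths))
```

This only parallelizes the per-chunk hashes; the sequential global hash over the reconstructed file is untouched.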
Is there any way we can concatenate the chunks on the fly without changing anything on disk (think: import from a read-only filesystem)? Like running cat and piping the result into django-import?
I don't think so; I don't expect there's any way to work with multiple files as if they were one in that manner. Also, we don't necessarily stream in one direction: the data is treated as a tarfile, which has a header, and I believe we work with the tarfile in a way that seeks back to reference that header.
crc32 is much faster to calculate when hardware acceleration for sha256 isn't present. Closes pulp#4447
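For reference, a streaming CRC-32 can be computed with the standard-library `zlib` module; this is an illustrative sketch of the technique, not the patch itself:

```python
import zlib


def crc32_of_file(path, block_size=1024 * 1024):
    """Stream a file through zlib.crc32; returns an 8-char hex string."""
    crc = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            # Pass the running value back in to continue the checksum.
            crc = zlib.crc32(block, crc)
    return format(crc & 0xFFFFFFFF, "08x")
```

Unlike sha256, CRC-32 offers no collision resistance, which is acceptable here because the global checksum only guards against accidental corruption, not tampering.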
Describe the bug
The export process hashes chunks, but then also creates a "global hash" of all of the files processed in-order. This means that multiple sha256 checksums are being calculated in parallel, across a very large amount of data (potentially terabytes).
At least on small exports, this adds up to a significant portion of the total export time.
What we could do instead is checksum the chunks once, and then use a checksum-of-checksums for the global checksum, as discussed here.
#4435 (comment)
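A checksum-of-checksums could look roughly like this (illustrative sketch; `global_checksum` is a hypothetical helper, not pulpcore API). It still detects chunk corruption and reordering, though unlike the current global hash it does not verify the reconstructed file byte-for-byte:

```python
import hashlib


def global_checksum(chunk_digests):
    """Combine per-chunk sha256 hex digests into one global digest.

    Each chunk is hashed only once; the global value is derived from the
    digests themselves, so the data is never read a second time.
    """
    outer = hashlib.sha256()
    for hexdigest in chunk_digests:
        outer.update(bytes.fromhex(hexdigest))
    return outer.hexdigest()
```

Because the digests are fed in order, swapping two chunks changes the global value, so ordering mistakes during reconstruction would still be caught.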
To Reproduce
Perform an export with profiling enabled, observe the profile, see that ~1/3rd of it is spent performing sha256 checksums. We can reduce that by half.
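One way to capture such a profile with only the standard library (illustrative; the function passed in is a stand-in for the actual export entry point, which is not named here):

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, **kwargs):
    """Run fn under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn(*args, **kwargs)
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
    return buf.getvalue()
```

In a hashing-heavy workload, the sha256 `update` calls show up near the top of the cumulative-time listing.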