
The export process spends a significant amount of time on hashing #4447

Closed

dralley opened this issue Sep 18, 2023 · 5 comments

Comments

@dralley
Contributor

dralley commented Sep 18, 2023

Describe the bug

The export process hashes each chunk, but it also creates a "global hash" of all of the files, processed in order. This means that multiple sha256 checksums are being calculated in parallel across a very large amount of data (potentially terabytes).

At least on small exports, this adds up to a significant portion of the total export time.

What we could do instead is checksum the chunks once, and then use a checksum-of-checksums for the global checksum, as discussed here.

#4435 (comment)
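For illustration, a minimal sketch of the "checksum of checksums" idea (hypothetical helper functions, not pulpcore's actual export code): each chunk is read and hashed exactly once, and the global digest is derived from the ordered list of per-chunk digests rather than from a second pass over the data.

```python
import hashlib

def chunk_digest(path, block_size=1024 * 1024):
    """sha256 of a single chunk, reading the file once."""
    hasher = hashlib.sha256()
    with open(path, "rb") as fp:
        for block in iter(lambda: fp.read(block_size), b""):
            hasher.update(block)
    return hasher.hexdigest()

def global_digest(chunk_digests):
    """'Checksum of checksums': hash the ordered per-chunk digests."""
    hasher = hashlib.sha256()
    for digest in chunk_digests:
        hasher.update(digest.encode("ascii"))
    return hasher.hexdigest()
```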

To Reproduce

Perform an export with profiling enabled and observe the profile: roughly 1/3 of the time is spent performing sha256 checksums. We can reduce that by half.

@ipanova
Member

ipanova commented Sep 18, 2023

@mdellweg @ggainey do either of you know why, in the case of a chunked export, we still provide a global checksum (https://github.com/dralley/pulpcore/blob/main/pulpcore/app/tasks/export.py#L502)? As @dralley points out, the checksumming is apparently expensive to perform.
To me it feels redundant to provide a global checksum for a chunked export, as long as we provide, in the JSON file, a mapping between each chunk and its checksum, which can be verified on the destination host after transfer (https://github.com/dralley/pulpcore/blob/main/pulpcore/app/tasks/export.py#L476).

@dralley
Contributor Author

dralley commented Sep 18, 2023

It appears that we reconstruct the whole file from its constituent parts here.

So the chunk checksums serve to provide an early warning that something is wrong, and the total file checksum serves to ensure that the reconstruction occurred correctly. We also delete chunks after adding them to the reconstructed file, to avoid requiring twice as much disk space, so you can't "go back" easily.

We can at least parallelize the checksumming of the chunks easily, so I'm going to do that. But I'm not certain we can fully get rid of the global checksum without replacing it with something.

Since this is not security-sensitive, though, we could use a faster checksum such as crc32 or crc64 for the global check, especially if we keep using sha256 on the chunks.
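A rough sketch of how the per-chunk hashing could be parallelized (assumed function names, not the actual export task code). Each chunk can be hashed independently, so a process pool is enough:

```python
import hashlib
from concurrent.futures import ProcessPoolExecutor

def _sha256_of_file(path, block_size=1024 * 1024):
    # Hash one chunk file incrementally so memory stays bounded.
    hasher = hashlib.sha256()
    with open(path, "rb") as fp:
        for block in iter(lambda: fp.read(block_size), b""):
            hasher.update(block)
    return path, hasher.hexdigest()

def parallel_chunk_checksums(chunk_paths, workers=4):
    """Hash chunks concurrently; returns {chunk_path: sha256 hex digest}."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(_sha256_of_file, chunk_paths))
```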

@mdellweg
Member

We also delete chunks after adding them to the reconstructed file, to avoid requiring twice as much disk space, so you can't "go back" easily.

Is there any way we can concatenate the chunks on the fly without changing anything on disk (think: importing from a read-only filesystem)? Like running cat and piping the result into django-import?
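To make the idea concrete, a sketch of what such on-the-fly concatenation might look like (hypothetical class; as the next comment notes, this only supports forward, sequential reads, which turns out to be the sticking point):

```python
class ChunkStream:
    """Read-only, forward-only view over several chunk files as one stream."""

    def __init__(self, chunk_paths):
        self._paths = list(chunk_paths)
        self._index = 0
        self._fp = open(self._paths[0], "rb") if self._paths else None

    def _advance(self):
        # Move to the next chunk; returns False when no chunks remain.
        if self._fp is None or self._index + 1 >= len(self._paths):
            return False
        self._fp.close()
        self._index += 1
        self._fp = open(self._paths[self._index], "rb")
        return True

    def read(self, size=-1):
        if self._fp is None:
            return b""
        if size is None or size < 0:
            # Drain everything that remains, across all chunks.
            parts = [self._fp.read()]
            while self._advance():
                parts.append(self._fp.read())
            return b"".join(parts)
        data = self._fp.read(size)
        # If the current chunk is exhausted, keep reading from the next one.
        while len(data) < size and self._advance():
            data += self._fp.read(size - len(data))
        return data

    def close(self):
        if self._fp is not None:
            self._fp.close()
            self._fp = None
```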

@dralley
Contributor Author

dralley commented Sep 20, 2023

I don't think so; I don't expect there's any way to work with multiple files as if they were one like that. Also, we don't necessarily stream in one direction: the data is treated as a tarfile, which has a header, and I believe the way we work with the tarfile means it goes back to reference the header.

dralley changed the title from "When chunking is used, the export process hashes files multiple times" to "The export process spends a significant amount of time on hashing" on Sep 25, 2023

dralley added a commit to dralley/pulpcore that referenced this issue Oct 17, 2023
crc32 is much faster to calculate when hardware acceleration for sha256
isn't present

closes pulp#4447
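For reference, a minimal sketch of a crc32-based check along the lines the commit message describes (illustrative function, not the code from the commit): crc32 is far cheaper to compute in software than sha256 when no hardware acceleration is available, and it is adequate for the global check because it is not security-sensitive.

```python
import zlib

def crc32_of_file(path, block_size=1024 * 1024):
    """CRC32 of a file as an 8-character hex string, computed incrementally."""
    crc = 0
    with open(path, "rb") as fp:
        for block in iter(lambda: fp.read(block_size), b""):
            crc = zlib.crc32(block, crc)
    return format(crc & 0xFFFFFFFF, "08x")
```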