
The export process spends a significant amount of time on hashing #4447

Closed

dralley opened this issue Sep 18, 2023 · 5 comments

Comments

@dralley
Contributor

dralley commented Sep 18, 2023

Describe the bug

The export process hashes each chunk, but it also creates a "global hash" of all of the files, processed in order. This means that multiple sha256 checksums are being calculated in parallel across a very large amount of data (potentially terabytes).

At least on small exports, this adds up to a significant portion of the total export time.

What we could do instead is checksum the chunks once, and then use a checksum-of-checksums for the global checksum, as discussed here.

#4435 (comment)
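For illustration, a minimal sketch of the "checksum of checksums" idea (hypothetical helper functions, not pulpcore's actual export code): each chunk is read and hashed exactly once, and the global digest is derived from the ordered list of per-chunk digests rather than from a second pass over the data.

```python
import hashlib

def chunk_digest(path, block_size=1024 * 1024):
    """sha256 of a single chunk, reading the file once."""
    hasher = hashlib.sha256()
    with open(path, "rb") as fp:
        for block in iter(lambda: fp.read(block_size), b""):
            hasher.update(block)
    return hasher.hexdigest()

def global_digest(chunk_digests):
    """'Checksum of checksums': hash the ordered per-chunk digests."""
    hasher = hashlib.sha256()
    for digest in chunk_digests:
        hasher.update(digest.encode("ascii"))
    return hasher.hexdigest()
```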

To Reproduce

Perform an export with profiling enabled and observe the profile: roughly 1/3 of the time is spent performing sha256 checksums. We can reduce that by half.

@ipanova
Member

ipanova commented Sep 18, 2023

@mdellweg @ggainey do either of you know why, in the case of a chunked export, we still provide a global checksum (https://github.com/dralley/pulpcore/blob/main/pulpcore/app/tasks/export.py#L502)? As @dralley points out, the checksumming is apparently expensive to perform.
To me it feels redundant to provide a global checksum for a chunked export, as long as we provide, in the JSON file, a mapping between each chunk and its checksum, which can be verified on the destination host after transfer (https://github.com/dralley/pulpcore/blob/main/pulpcore/app/tasks/export.py#L476).

@dralley
Contributor Author

dralley commented Sep 18, 2023

It appears that we reconstruct the whole file from its constituent parts here.

So the chunk checksums serve to provide an early warning that something is wrong, and the total file checksum serves to ensure that the reconstruction occurred correctly. We also delete chunks after adding them to the reconstructed file, to avoid requiring twice as much disk space, so you can't "go back" easily.

We can at least parallelize the checksumming of the chunks easily, so I'm going to do that. But I'm not certain we can fully get rid of the global checksum without replacing it with something.

Since this is not security-sensitive, though, we could use a faster checksum such as crc32 or crc64 for the global check, especially if we keep using sha256 on the chunks.
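A rough sketch of how the per-chunk hashing could be parallelized (assumed function names, not the actual export task code). Each chunk can be hashed independently, so a process pool is enough:

```python
import hashlib
from concurrent.futures import ProcessPoolExecutor

def _sha256_of_file(path, block_size=1024 * 1024):
    # Hash one chunk file incrementally so memory stays bounded.
    hasher = hashlib.sha256()
    with open(path, "rb") as fp:
        for block in iter(lambda: fp.read(block_size), b""):
            hasher.update(block)
    return path, hasher.hexdigest()

def parallel_chunk_checksums(chunk_paths, workers=4):
    """Hash chunks concurrently; returns {chunk_path: sha256 hex digest}."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(_sha256_of_file, chunk_paths))
```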

@mdellweg
Member

We also delete chunks after adding them to the reconstructed file, to avoid requiring twice as much disk space, so you can't "go back" easily.

Is there any way we can concatenate the chunks on the fly without changing anything on disk (think: importing from a read-only filesystem)? Like running cat and piping the result into django-import?
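To make the idea concrete, a sketch of what such on-the-fly concatenation might look like (hypothetical class; as the next comment notes, this only supports forward, sequential reads, which turns out to be the sticking point):

```python
class ChunkStream:
    """Read-only, forward-only view over several chunk files as one stream."""

    def __init__(self, chunk_paths):
        self._paths = list(chunk_paths)
        self._index = 0
        self._fp = open(self._paths[0], "rb") if self._paths else None

    def _advance(self):
        # Move to the next chunk; returns False when no chunks remain.
        if self._fp is None or self._index + 1 >= len(self._paths):
            return False
        self._fp.close()
        self._index += 1
        self._fp = open(self._paths[self._index], "rb")
        return True

    def read(self, size=-1):
        if self._fp is None:
            return b""
        if size is None or size < 0:
            # Drain everything that remains, across all chunks.
            parts = [self._fp.read()]
            while self._advance():
                parts.append(self._fp.read())
            return b"".join(parts)
        data = self._fp.read(size)
        # If the current chunk is exhausted, keep reading from the next one.
        while len(data) < size and self._advance():
            data += self._fp.read(size - len(data))
        return data

    def close(self):
        if self._fp is not None:
            self._fp.close()
            self._fp = None
```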

@dralley
Contributor Author

dralley commented Sep 20, 2023

I don't think so; I don't expect there's any way to work with multiple files as if they were one like that. Also, we don't necessarily stream in one direction: the data is treated as a tarfile, which has a header, and I believe the way we work with the tarfile means it goes back to reference the header.

dralley changed the title from "When chunking is used, the export process hashes files multiple times" to "The export process spends a significant amount of time on hashing" on Sep 25, 2023

dralley added a commit to dralley/pulpcore that referenced this issue Oct 17, 2023
crc32 is much faster to calculate when hardware acceleration for sha256
isn't present

closes pulp#4447
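For reference, a minimal sketch of a crc32-based check along the lines the commit message describes (illustrative function, not the code from the commit): crc32 is far cheaper to compute in software than sha256 when no hardware acceleration is available, and it is adequate for the global check because it is not security-sensitive.

```python
import zlib

def crc32_of_file(path, block_size=1024 * 1024):
    """CRC32 of a file as an 8-character hex string, computed incrementally."""
    crc = 0
    with open(path, "rb") as fp:
        for block in iter(lambda: fp.read(block_size), b""):
            crc = zlib.crc32(block, crc)
    return format(crc & 0xFFFFFFFF, "08x")
```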