Parallelize computing md5 for large files #2073
@Witiko Great suggestion! For large files the parallelization is quite straightforward: we just need to modify `file_md5`.
Thank you, @efiop. I will take a look. Is there any other parallelized code in dvc at the moment, so that I can keep things consistent? Rather than parallelize the internals of `file_md5`, we could also run it on several files in parallel.
Thanks 🙂 We actually hardly use any parallelization these days 🙁 The only example I can think of is push/pull/fetch/status from dvc/remote/RemoteLOCAL, which use ThreadPools. But it looks like in this particular case using processes would be more suitable because of the GIL.
So yes, the approach would depend on the particular issue that you are trying to solve. If we are talking about a small number of large standalone files, then parallelizing `file_md5` to tackle parts of the same file should bring the most benefit. If we are talking about a large number of small files, then, indeed, we'll need to run `file_md5` on each file in parallel.
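For concreteness, a minimal sketch of the per-file approach, assuming dvc's existing `file_md5` helper (shown further down in this thread) and using processes rather than threads to sidestep the GIL; the wrapper name and default job count are illustrative:

```python
from concurrent.futures import ProcessPoolExecutor

from dvc.utils import file_md5  # existing dvc helper

def files_md5(fnames, jobs=4):
    # Hash each file in its own worker process; MD5 is CPU-bound,
    # so processes avoid contention on the GIL.
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        return dict(zip(fnames, pool.map(file_md5, fnames)))
```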
There is. We are actually doing it already in https://github.com/iterative/dvc/blob/master/dvc/utils/__init__.py#L67 by reading chunks of the file and then gradually updating the md5 until we get a final answer for the whole file.
That's different, however, because you process the chunks linearly, i.e. there is an internal state that depends on all the previously read chunks. A thread that starts reading in the middle of a file has not seen the previous chunks, and therefore it starts with an incorrect state. In the end, each thread will produce its own MD5, and there is no obvious way to combine these into the MD5 of the entire file. Another thought: since dvc is not expected to produce a large CPU load, the parallelization should perhaps be an opt-in feature. We can do this by adding an option similar to Git's
But you can get the total size of a file and split it into chunks (not literally, but by passing a proper offset to each reader). As to parallelization, we can use
It's best to discuss this with a code example at hand. Currently, dvc does the following for the entire input file:

```python
import hashlib

LOCAL_CHUNK_SIZE = 1024 * 1024  # read size (value illustrative; defined in dvc/utils)

def file_md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as fobj:
        while True:
            chunk = fobj.read(LOCAL_CHUNK_SIZE)
            if not chunk:
                break
            hash_md5.update(chunk)
    return (hash_md5.hexdigest(), hash_md5.digest())
```

This produces a single MD5 hash.
You can seek to an offset and compute the MD5 hash for just a part of the input file:

```python
def file_part_md5(fname, offset, size):
    hash_md5 = hashlib.md5()
    read = 0  # bytes read so far
    with open(fname, "rb") as fobj:
        fobj.seek(offset)
        while read < size:
            # Do not read past the end of this part.
            chunk = fobj.read(min(LOCAL_CHUNK_SIZE, size - read))
            if not chunk:
                break
            read += len(chunk)
            hash_md5.update(chunk)
    return (hash_md5.hexdigest(), hash_md5.digest())
```

This produces several MD5 hashes, one per part.
How would you combine these MD5 hashes to obtain one MD5 hash for the entire file?
There is also the
I am not sure if using all CPUs is the best default. It is true that we want to compute the checksums as fast as possible. However, users will not expect dvc to max out their CPUs. In my experience, using more than one CPU is almost always opt-in for command-line tools, not a default. Take this real-life example: I am using dvc on a shared server with 64 CPU cores. Dvc is a versioning tool, and I don't expect it to do anything computationally intensive. Therefore, I do not use
You are absolutely right @Witiko, there is no way to do that, since MD5 calculation is a linear process. Sorry, I had totally forgotten about that, and didn't even remember that S3 has the same problem with multipart uploads; they use an effective MD5 as the ETag for the resulting object. Indeed, we need to parallelize by files.
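For illustration, this is roughly how the S3-style scheme works: the ETag of a multipart upload is the MD5 of the concatenated binary MD5 digests of the parts, suffixed with the part count. A sketch; this convention is observed behaviour rather than a documented guarantee:

```python
import hashlib

def s3_style_multipart_etag(parts):
    # parts: list of bytes objects, one per uploaded part.
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    # The result is not the MD5 of the whole object, just a
    # deterministic identifier derived from the part digests.
    return "{}-{}".format(hashlib.md5(digests).hexdigest(), len(parts))
```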
Actually, thinking about it, we should probably set it in
But if a command-line tool offers an option for that, then you are most likely to always use it, no? 🙂 E.g. things like
That is a really great point! I'm just trying to think if your scenario is the one that should set
Btw, as @shcheklein rightfully noted, we could possibly use that
Btw @Witiko, which problem is more relevant for you? A small number of large files, or a big number of small files?
That is a good point, although this makes the naming a little unintuitive (why does
That is true, but having to specify
With the number of cores in modern consumer CPUs, I think we can go as high as 4, which should almost max out the SSD I/O. Note that some of the data will be cached in RAM, and the user may also be using a fast NVMe disk, so there is still some utility in explicitly asking for more than 4 threads.
I think that having some sane value of
That is an interesting suggestion. Where does @shcheklein discuss this?
The most relevant problem for me is computing hashes for a large number of large files, because that's what takes the longest. Parallelizing over files speeds up the hash computation for a large number of large files, but it offers only a small speed-up for a large number of small files or a small number of large files. This actually points to a bigger problem with the cache, local remotes, and HTTP remotes: you may not be able to pull from a local or HTTP remote if it contains a file that is too large for your filesystem (such as the 4 GB limit of FAT32). We could solve that by chunking the files that we store in the cache, but that also means that we could no longer hardlink and symlink files to the cache. Moreover, it would also make the format of the cache, local remotes, and HTTP remotes backwards-incompatible, so this would be quite a significant change to the fundamental design of dvc! If we started chunking the files that we store in the cache, and we also started packing small files together, then parallelizing over files would speed up the hash computation for any number of files (both large and small). However, as I said above, this would be a major change.
> but

Touché 🙂
4 by default seems like a little too much in the general case. Maybe we could use NCPU and decide based on that, so that (except in the 1-CPU case) we don't take more than 50% of the CPUs by default? This does seem like implicit magic that I'm not a big fan of, but it also seems a suitable way to try to provide the best experience by default.
Or even
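A minimal sketch of the half-of-NCPU heuristic discussed above, assuming Python's `os.cpu_count`; the function name is illustrative:

```python
import os

def default_checksum_jobs():
    # Use at most half of the available CPUs by default,
    # but always at least one worker.
    ncpu = os.cpu_count() or 1
    return max(1, ncpu // 2)
```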
In private messages 🙂 Maybe he'll join us here soon. The suggested approach doesn't mean that we need to actually split that cache file into separate physical chunks, but rather that we would do it virtually when computing the checksum. As long as we keep the chunk size constant, we should be able to provide the same level of by-checksum deduplication that we have right now.
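A sketch of what such a virtual-chunk checksum could look like: the file is read in fixed-size chunks at computed offsets (nothing is physically split), the chunks are hashed in parallel, and the per-chunk digests are folded into one identifier. The names, the chunk size, and the md5-of-md5s combining scheme are all illustrative assumptions, not dvc's actual format:

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

CHUNK_SIZE = 64 * 1024 * 1024  # keeping this constant keeps chunk boundaries stable

def _chunk_md5(path, offset, size):
    # Hash one virtual chunk: seek to its offset and read `size` bytes.
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        fobj.seek(offset)
        remaining = size
        while remaining > 0:
            data = fobj.read(min(1024 * 1024, remaining))
            if not data:
                break
            remaining -= len(data)
            md5.update(data)
    return md5.digest()

def virtual_chunk_md5(path, jobs=2):
    size = os.path.getsize(path)
    offsets = list(range(0, size, CHUNK_SIZE))
    sizes = [min(CHUNK_SIZE, size - off) for off in offsets]
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        digests = pool.map(_chunk_md5, [path] * len(offsets), offsets, sizes)
    # Fold the per-chunk digests into a single identifier (md5-of-md5s),
    # analogous in spirit to S3's multipart ETags.
    combined = hashlib.md5()
    for digest in digests:
        combined.update(digest)
    return combined.hexdigest()
```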
Let's start by solving your problem; that is the most straightforward thing for now. We can look into speeding up checksum computation for one big file later.
As @shcheklein noted, the FAT example might be stretching it a little, but maybe we are just not aware of someone actually using it with dvc these days. Unless you are dealing with truly giant files, you should be fine. And if you are dealing with truly giant files, you are probably splitting them yourself or using something like hdfs, which dvc is also able to work with through external dependencies/outputs. And we won't be able to use
We have been thinking about splitting files on a smaller scale as per #829 too. One of the problems with that is that you would be doubling your storage, since you'd need to reconstruct the files back in the cache and then link them to the workspace. The other way would be to not use links at all and reconstruct files directly in the workspace, but then again you won't be able to use reflink/hardlink/symlink and will be tied to

Btw, @Witiko, how about we schedule a video call someday? We would all really love to meet you and talk about your use cases. 🙂 How about 5 pm (Prague time) next Wednesday?
That is true, and I like your
This is only true for hardlinks and symlinks. You can reflink just a chunk of a file (see
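On Linux, per-range cloning is exposed via the `FICLONERANGE` ioctl from `<linux/fs.h>`; a minimal Python sketch, with the caveats that it is Linux-only, needs a reflink-capable filesystem such as Btrfs or XFS, and generally requires block-aligned offsets:

```python
import fcntl
import struct

FICLONERANGE = 0x4020940D  # _IOW(0x94, 13, struct file_clone_range)

def reflink_range(src_fd, dest_fd, src_offset, length, dest_offset):
    # struct file_clone_range { __s64 src_fd; __u64 src_offset;
    #                           __u64 src_length; __u64 dest_offset; };
    arg = struct.pack("qQQQ", src_fd, src_offset, length, dest_offset)
    fcntl.ioctl(dest_fd, FICLONERANGE, arg)
```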
I agree that this is perhaps not a concern for us.
I agree. I just thought that the idea of chunking the cache is worth discussing, but it is actually a separate issue that is independent of this one. Thanks to our detailed discussion, I think I have a good idea about how we'd like to approach fixing the current issue. 🙂
@efiop: Contributing to open-source projects is a way for me to unwind, and I consider video calls a major stress factor. 🙂 My use case can be summarized in one sentence: I am producing large language models and I need to store and publish them together with the training and evaluation code. If you'd like to discuss further, I can join you and the others on Discord.
Agreed :)
Oh wow, I didn't know about that! Yeah, a quick google search doesn't seem to find any evidence that osx supports it, but maybe it just needs more careful research. So it looks like if we were to go that way, it would probably be a Linux-only feature. Also, remotes would become incompatible without doubling your local or remote storage to keep the joined files alongside the chunks 🙁
Didn't mean to discourage 🙂 The discussion turned out amazing!
Sure, no worries, chat is good too 🙂 Please feel free to join our community on Discord, we would really appreciate having you there 🙂
I am sorry about the delay in solving this issue; I was quite busy for the past two weeks.
@Witiko No worries 🙂
@efiop: It will take a couple of days to complete; there are a lot of callers (and callers of callers) of
This was turning out to be a pretty extensive rewrite. Anything that called |
@Witiko (from the original issue description): After executing `dvc run`, I have been staring at the `Computing md5 for a large file` message for the past five hours. However, only a single CPU and about 20% of the SSD's maximum read speed are utilized. Therefore, a 5× speed-up can be achieved just by computing the MD5 in parallel.

We may have to get rid of the per-file progress bars, but I think that a single overall progress bar would be more informative anyway, especially with hundreds of files being added to the cache. What do you think?