Dramatic slowdown with increasing number of files #2203
Comments
Hi @filipgrano! Could you please elaborate on your scenario? We might be able to suggest a workaround for now.
The simplest scenario I can come up with where this would be a serious issue: we have quite elaborate setups with multi-stage parallel data pipelines producing thousands of files, with the input files divided between workers, but I think the above highlights the issue.
I also hit the same performance issue and degradation with dvc checkout of a single file when there is a large number of files. I will post benchmarks shortly.
As promised, benchmarks for checking out a single file from large file sets are below. The cache contains all and only the files in the repo. I deleted a single file and checked it out again. This is an even worse case: checkout of a single file takes as long as adding all the files recursively.
DVC version 0.50.1: 1 file from 451 files / 14s = 0.071 files/s
DVC version 0.35.7: 1 file from 451 files / 2s = 0.500 files/s
I meant the scenario where you need to add that number of files recursively, instead of as the whole dir. If you are using
Got it. So maybe it would be possible to put that parallelization into a script that you would run with dvc, so it would count it as a whole output?
So the culprit for now is the way we collect the DAG. Almost every time a dvc command runs, it collects all the dvc files in your repo in order to build a DAG. While doing that, we also verify that you don't have any cycles in your DAG, among other checks. 0.35.7 is pretty old, and we've added some additional checks since then. We'll need to investigate the issue more carefully. As to checking out one particular dvc file or a specific set of those, we might optimize it for now by not computing the whole DAG, since we already know the specific targets to check out.
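To make the cost concrete, here is a minimal sketch (an illustration with hypothetical names such as `build_graph` and `has_cycle`, not DVC's actual code) of the kind of validation DAG collection implies: every stage file has to be scanned for duplicated outputs, and the whole graph has to be traversed to rule out cycles.

```python
# Illustration only, not DVC's implementation: stages are (name, deps, outs)
# tuples. Building the graph must touch every stage, even if the command
# only targets one of them.
from collections import defaultdict

def build_graph(stages):
    """Map each output to its producing stage, then link producers to consumers."""
    producer = {}
    for name, _, outs in stages:
        for out in outs:
            if out in producer:
                raise ValueError("duplicated output: %s" % out)
            producer[out] = name
    graph = defaultdict(set)
    for name, deps, _ in stages:
        for dep in deps:
            if dep in producer:
                graph[producer[dep]].add(name)  # edge: producer -> consumer
    return graph

def has_cycle(graph):
    """Three-color depth-first search: reaching a GRAY node again means a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY or (color[nxt] == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(graph))
```

Even with known checkout targets, this scan is repo-wide, which is why skipping the full DAG computation for explicit targets was suggested.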
Maybe we should discuss introducing some feature that will allow the user to extract a particular output?
and for a particular file from a data directory:
It has to be noted that introducing such a command would raise some questions about how to treat that dependency. What I have in mind is that if we allow the user to download just one image from the whole dataset under the same path, we could probably approach that by treating such a pull in a similar way as we already do. We would need to do it in 2 steps:
@filipgrano out of curiosity:
#!/bin/bash
set -evux
rm -rf perf
mkdir perf
pushd perf
git init
dvc init
mkdir data
for i in $(seq 1 5000);
do
echo $i > data/$i.txt
done

Running the performance test:
So, it's actually a crazy number of calls to PathInfo.
@shcheklein That rather indicates that we spend a lot of time in PathInfo. I'll run a similar test for 0.35.7 so we can try to compare.
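As a rough stand-in (assuming PathInfo is pathlib-like; the helper names below are hypothetical), compare a component-wise containment check with a plain string-prefix one; per-call cost differences get magnified when the check runs millions of times.

```python
# Stand-in for PathInfo.isin-style checks; POSIX-style paths assumed.
from pathlib import PurePosixPath

def isin_parts(path: PurePosixPath, parent: PurePosixPath) -> bool:
    """Component-wise containment: compares tuples of path parts."""
    parts, pparts = path.parts, parent.parts
    return len(parts) > len(pparts) and parts[:len(pparts)] == pparts

def isin_str(path: str, parent: str) -> bool:
    """Cheaper string-prefix version of the same predicate."""
    return path.startswith(parent + "/")
```

The string version is a single prefix comparison; the component version has to materialize and slice part tuples on every call.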
DVC version: master:
That would work if we ran on a single machine or with shared storage. Our workers run in Kubernetes on separate machines and use dvc with a local cache and remote storage to synchronize data.
Yes.
Might be a silly question, but why does the DAG need to be built when simply adding files? How can they be part of any dependencies if they are freshly added?
We need to make sure that there are no cycles or duplicated outputs. E.g. stage A depends on an output of stage B while stage B depends on an output of stage A (a cycle), or two stages both declare the same file as an output.
Got a 3x-4x improvement with this simple trick:
@Suor can you confirm that it makes sense? The next step is to reduce (or cache?) the number of isin calls in the first place, if possible.
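The trick itself isn't quoted in the thread; one plausible shape for this kind of quick win (an assumption, not the actual patch) is memoizing the containment check, since graph collection tends to test the same (path, parent) pairs repeatedly.

```python
# Hypothetical memoization sketch, not the actual DVC change: repeated
# containment checks over the same string pair become dictionary lookups.
from functools import lru_cache
from pathlib import PurePosixPath

@lru_cache(maxsize=None)
def isin_cached(path: str, parent: str) -> bool:
    p, q = PurePosixPath(path).parts, PurePosixPath(parent).parts
    return len(p) > len(q) and p[:len(q)] == q
```

`isin_cached.cache_info()` exposes hit counts, which indicates how much repeated work is being avoided.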
@filipgrano it looks promising ^^. We can send an update as soon as I confirm that there are no issues.
@shcheklein I would rather fall back to comparing strings or go over _cparts.
I wonder if we should support some partial pulling though. Because:
@Suor parents takes only 16s for 49995000 calls, i.e. 16s out of 500s of execution. I would say it's not a problem at all; the 49995000 calls themselves are the problem. I hope all these checks inside the graph can be rewritten to be linear (using some hashing).
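For reference, 49995000 is exactly 9999 * 10000 / 2, i.e. pairwise comparisons over roughly 10000 paths. A linear-ish rewrite with hashing could look like this sketch (an illustration with hypothetical names, not DVC's code): hash all outputs into a set, then walk only each path's own ancestors, giving O(n * depth) instead of O(n^2) isin calls.

```python
# Illustration: detect overlapping outputs in O(n * path_depth) using a hash
# set, instead of calling isin for every pair of outputs.
from pathlib import PurePosixPath

def find_overlapping_outputs(outputs):
    """Return (child, ancestor) pairs where one output lies inside another."""
    out_set = set(outputs)
    overlaps = []
    for out in outputs:
        for anc in PurePosixPath(out).parents:
            if str(anc) in out_set:
                overlaps.append((out, str(anc)))
    return overlaps
```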
@Suor of course, I'm not against using _cparts if it's easy enough. |
If everything is done right, PR #2230 gives ~20x improvement:
Most of the time goes to the
Should help with iterative#2203
@filipgrano please check the latest release, 0.51.1. It should be available to install via pip and give you a ~20x performance improvement. Please let us know if that works for you.
@filipgrano closing this since it should be fixed in the latest 0.51+ version.
Ubuntu 18.04, Python 3.6.8, DVC versions 0.35.7 and 0.50.1 installed with pip3. Fresh repos and a clear cache for each test.
Adding files recursively gets a lot slower as the number of files grows. I noticed this when trying to add a directory with 15207 files: I waited an hour and stopped it before it finished. Tested below with subsets of the files to show the behaviour. This is especially noticeable on the newest version of dvc, while 0.35.7 performs significantly better.
Unfortunately, adding whole directories (without recursive adds) is not an option for us, because we need to split the data set between workers before pulling. We're stuck on an older version of dvc because of this issue.
DVC version 0.50.1
Tested with subsets of files to show the behaviour:
451 files / 16s = 28.2 files/s
1242 files / 87s = 14.3 files/s
2474 files / 327s = 7.5 files/s
4188 files / 927s = 4.5 files/s
Adding the whole directory with 4188 files (using the non-recursive option):
4188 files / 12s = 349 files/s
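A quick check (numbers taken from the report above) shows the 0.50.1 recursive-add timings are consistent with quadratic total cost: time divided by the square of the file count stays roughly constant, while time per file grows about 6x.

```python
# (file_count, seconds) for 0.50.1 recursive adds, from the report above.
data = [(451, 16), (1242, 87), (2474, 327), (4188, 927)]

# Per-file cost grows ~6x across the runs, so total time is not linear in n.
per_file = [t / n for n, t in data]

# t / n^2 stays within ~1.5x across the runs, consistent with O(n^2) total time.
quad_coeff = [t / n ** 2 for n, t in data]
```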
DVC version 0.35.7
451 files / 4s = 113 files/s
1242 files / 9s = 138 files/s
2474 files / 20s = 124 files/s
4188 files / 49s = 85 files/s
15204 files / 445s = 34 files/s
Adding the whole directory with 4188 files (using the non-recursive option):
4188 files / 8s = 524 files/s
Adding the whole directory with 15204 files (using the non-recursive option):
15204 files / 26s = 585 files/s