Speed up dvc status for large projects #3280

Closed
Ykid opened this issue Feb 4, 2020 · 21 comments · Fixed by #3286 or #3323
Labels: awaiting response, bug, p1-important, performance, research

Comments

@Ykid

Ykid commented Feb 4, 2020

Please provide information about your setup
DVC version(i.e. dvc --version), Platform and method of installation (pip, homebrew, pkg Mac, exe (Windows), DEB(Linux), RPM(Linux))

dvc version

DVC version: 0.82.8
Python version: 3.7.6
Platform: Linux-5.0.0-37-generic-x86_64-with-debian-buster-sid
Binary: False
Package: pip
Cache: reflink - not supported, hardlink - not supported, symlink - supported
Filesystem type (cache directory): ('ext4', '/dev/sda1')
Filesystem type (workspace): ('ext4', '/dev/nvme0n1p2')

I expect dvc status to return in a shorter time; currently it takes 30s.

a few stats

du -shL my-project # 107G
dvc status -v

log.txt

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Feb 4, 2020
@pared
Contributor

pared commented Feb 4, 2020

I think we should try to reproduce such a big repo and see what takes so much time. I suspect db access.

@shcheklein
Member

@pared @Ykid could we run it with a profiler please:

python -m cProfile -o status.prof -m dvc status -v
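Once that command has written status.prof, the standard-library pstats module can list the most expensive calls. A minimal sketch; the helper name top_calls and the default limit of 20 are illustrative choices, not part of DVC:

```python
import io
import pstats


def top_calls(prof_path, limit=20):
    """Return a text report of the `limit` costliest calls by cumulative time."""
    buf = io.StringIO()
    stats = pstats.Stats(prof_path, stream=buf)
    # strip_dirs() shortens file paths; "cumulative" sorts by time spent in a
    # function including everything it calls.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buf.getvalue()


# Usage, once status.prof exists:
#     print(top_calls("status.prof"))
```

Sorting by cumulative time tends to point at the high-level operation that dominates the run (for example clones or checkouts) rather than at small leaf functions.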

@Suor
Contributor

Suor commented Feb 4, 2020

Reading through the Discord dialog and looking at the log, I don't think this is about the repo being big. This is probably about the many imported files, which create many clones on dvc status.

@Ykid
Author

Ykid commented Feb 5, 2020

@Suor grep --include=\*.{dvc,} -rn my-data-dir -e ".git" | wc -l gives me 64, so it is more or less that. I have a dvc data registry and import some directories and files from there into the project. So does it sound better if I do dvc get followed by dvc add? Those imported files are not going to be updated very frequently after all.

The number of files versioned by dvc is around 9k, from find -L my-data-dir -type f ! -name "*.dvc" | wc -l. dvc pull shows 8.31k (the number changes in each line of the dvc pull output, but the 8.31k one is the most time-consuming).

May I know if there's any way to improve it?

@Suor
Contributor

Suor commented Feb 5, 2020

So does it sound better if I do dvc get followed by dvc add

This is obviously an issue on our end. I will think how this may be sped up.

@Suor Suor self-assigned this Feb 5, 2020
@Suor Suor added p1-important Important, aka current backlog of things to do performance improvement over resource / time consuming tasks labels Feb 5, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Feb 5, 2020
@Suor
Contributor

Suor commented Feb 5, 2020

@Ykid Can you say how many different sources you use in those 64 import stages? A source is a (url, rev) pair in deps.repo.

@Ykid
Author

Ykid commented Feb 6, 2020

@Suor There is one url, as it is our data registry. I followed the data registries docs to set it up. Here are the counts per pair:

  • 60: (repo, revA)
  • 4: (repo, revB)
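A tally like the one above can be produced with a few lines of Python. This is only a sketch: it assumes the .dvc files have already been parsed into dicts and that each import dependency carries a repo section with url and rev keys, as described earlier in the thread. The function name count_sources and the example URL are hypothetical.

```python
from collections import Counter


def count_sources(stages):
    """Count import dependencies per (url, rev) pair.

    `stages` is an iterable of already-parsed .dvc files (plain dicts).
    """
    counts = Counter()
    for stage in stages:
        for dep in stage.get("deps") or []:
            repo = dep.get("repo")
            if repo:  # only import stages carry a `repo` section
                counts[(repo.get("url"), repo.get("rev"))] += 1
    return counts


# Hypothetical data mirroring the counts reported above.
URL = "https://example.com/registry.git"  # placeholder, not the real registry
stages = (
    [{"deps": [{"path": "a", "repo": {"url": URL, "rev": "revA"}}]}] * 60
    + [{"deps": [{"path": "b", "repo": {"url": URL, "rev": "revB"}}]}] * 4
)
sources = count_sources(stages)
print(len(sources), "distinct sources across", sum(sources.values()), "import stages")
```

With only two distinct sources behind 64 import stages, most of the per-stage work is in principle shareable.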

@Suor
Contributor

Suor commented Feb 6, 2020

@Ykid thanks.

@Suor
Contributor

Suor commented Feb 6, 2020

A note from Discord: the git repo is big, both its history and the checked-out files:

84K ./tools
1.8M  ./.dvc
988M  ./notebooks
20K ./my-project.egg-info
16K ./configs
870M  ./.git
16M ./my-project
92K ./dockerfiles
28K ./tests
6.3M  ./data
24K ./pipeline
1.9G  .

The last change only caches single instance of repo, not all of them, which saves us from needing a new git clone, but still makes a copy of the whole repo (sans dvc cache) each time. The reason is that we do git pull each time, i.e. we modify the directory, so caching cannot be used for it. Things are further complicated by summon publishing, which modifies the directory returned by external_repo() and therefore also requires a separate copy.

I am trying to untangle it now.

@Ykid
Author

Ykid commented Feb 7, 2020

The last change only caches single instance of repo, not all of them

May I know what this means ?

Suor added a commit to Suor/dvc that referenced this issue Feb 8, 2020
So we have 3 things cached now separately:
- clean clones, to not ask for creds repeatedly
- cache dirs, also shared between erepos with same origin
- checked out clones if they are read only, addressed by (url, hexsha)

Several additions to `Git` along the way:
- Git.is_sha() static method
- .pull() and .push() work with multiple returned records correctly
- .get_rev() and .resolve_rev() work faster
- .resolve_rev() looks for remote branches
- .has_rev()

Fixes iterative#3280.
@efiop efiop added this to To do in DVC Sprint 28 Jan - 11 Feb 2020 via automation Feb 11, 2020
@skshetry skshetry moved this from To do to In progress in DVC Sprint 28 Jan - 11 Feb 2020 Feb 11, 2020
@Suor
Contributor

Suor commented Feb 11, 2020

May I know what this means ?

Sorry for slow response.

This means that when you have many imports from the same repo, dvc makes a fresh copy of its clone for each of them, even though the clone itself is only done once. This is not an issue generally, but since you have a huge git repo (all those notebooks don't really play nice), it takes time.
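The idea of reusing read-only checkouts addressed by (url, hexsha) can be sketched as a small cache. This is an illustration under stated assumptions, not DVC's actual code: CheckoutCache, fake_checkout, the example URL and the paths are all hypothetical, and the expensive copy-and-checkout step is replaced by a stub that only records calls.

```python
class CheckoutCache:
    """Reuse read-only checkouts, keyed by (url, hexsha)."""

    def __init__(self, make_checkout):
        # `make_checkout(url, hexsha)` stands for the expensive step:
        # copying the clone and checking out the revision.
        self._make_checkout = make_checkout
        self._paths = {}

    def get(self, url, hexsha):
        key = (url, hexsha)
        if key not in self._paths:
            self._paths[key] = self._make_checkout(url, hexsha)
        return self._paths[key]


calls = []


def fake_checkout(url, hexsha):
    """Stub for the expensive step; records the call, creates nothing."""
    calls.append((url, hexsha))
    return f"/tmp/checkouts/{hexsha}"  # placeholder path


cache = CheckoutCache(fake_checkout)
URL = "https://example.com/registry.git"  # placeholder
for hexsha in ["a" * 40] * 60 + ["b" * 40] * 4:  # 64 imports, 2 revisions
    cache.get(URL, hexsha)
print(len(calls))  # number of expensive checkouts actually performed
```

Keying on the resolved hexsha rather than a branch name matters here: a branch can move, while a commit hash always names the same tree, so a read-only checkout for it never goes stale.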

Anyway, this should be fixed after #3286 lands. You can try it right now with:

pip install git+https://github.com/Suor/dvc.git@erepo-ro

If you do, can you please tell us how well it works for you?

@efiop efiop moved this from In progress to Review in progress in DVC Sprint 28 Jan - 11 Feb 2020 Feb 11, 2020
DVC Sprint 28 Jan - 11 Feb 2020 automation moved this from Review in progress to Done Feb 11, 2020
efiop pushed a commit that referenced this issue Feb 11, 2020
* erepo: cache all read only external repos by hexsha

So we have 3 things cached now separately:
- clean clones, to not ask for creds repeatedly
- cache dirs, also shared between erepos with same origin
- checked out clones if they are read only, addressed by (url, hexsha)

Several additions to `Git` along the way:
- Git.is_sha() static method
- .pull() and .push() work with multiple returned records correctly
- .get_rev() and .resolve_rev() work faster
- .resolve_rev() looks for remote branches
- .has_rev()

Fixes #3280.

* git: improve .resolve_rev()

It follows `git checkout` logic now - if name can be unambiguously
resolved across known remotes then it's done.
@Ykid
Author

Ykid commented Feb 13, 2020

@Suor

DVC version: 0.82.9+f73900
Python version: 3.7.4
Platform: Linux-5.0.0-37-generic-x86_64-with-debian-buster-sid
Binary: False
Package: None
Cache: reflink - not supported, hardlink - not supported, symlink - supported

ERROR: failed to obtain data status - 'Git' object has no attribute 'is_known'

Traceback (most recent call last):
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/command/status.py", line 50, in run
    with_deps=self.args.with_deps,
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/repo/__init__.py", line 31, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/repo/status.py", line 133, in status
    return _local_status(self, targets, with_deps=with_deps)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/repo/status.py", line 36, in _local_status
    return _joint_status(stages)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/repo/status.py", line 25, in _joint_status
    status.update(stage.status(check_updates=True))
  File "/home/user/miniconda3/lib/python3.7/site-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/stage.py", line 161, in rwlocked
    return call()
  File "/home/user/miniconda3/lib/python3.7/site-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/stage.py", line 1015, in status
    deps_status = self._status(self.deps)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/stage.py", line 1004, in _status
    ret.update(entry.status())
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/dependency/repo.py", line 59, in status
    current_checksum = self._get_checksum(locked=True)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/dependency/repo.py", line 50, in _get_checksum
    with self._make_repo(locked=locked) as repo:
  File "/home/user/miniconda3/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/external_repo.py", line 27, in external_repo
    path = _cached_clone(url, rev, for_write=for_write)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/external_repo.py", line 173, in _cached_clone
    clone_path = _clone_default_branch(url, rev)
  File "/home/user/miniconda3/lib/python3.7/site-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/user/miniconda3/lib/python3.7/site-packages/funcy/flow.py", line 244, in wrap_with
    return call()
  File "/home/user/miniconda3/lib/python3.7/site-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/external_repo.py", line 205, in _clone_default_branch
    if not Git.is_sha(rev) or not git.is_known(rev):
AttributeError: 'Git' object has no attribute 'is_known'

@efiop efiop reopened this Feb 13, 2020
@efiop efiop added bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP. labels Feb 13, 2020
@efiop
Member

efiop commented Feb 13, 2020

Reopening and escalating the priority. @Suor Please take a look ASAP, we need to release a new version with the fix ASAP as well.

@casperdcl
Contributor

casperdcl commented Feb 13, 2020

Looks like is_known should've been is_tracked. Surely we should have some sort of test that would have caught this bug.
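One way a misspelled method name can surface in unit tests, even when the Git wrapper is mocked, is to build the mock with spec= so that only attributes of the real class are allowed. A minimal sketch; the Git class below is a hypothetical stand-in (has_rev is taken from the commit message above), not DVC's real class:

```python
from unittest import mock


class Git:
    """Hypothetical stand-in for the real Git wrapper."""

    def has_rev(self, rev):
        # The real method would ask git whether `rev` is known locally.
        raise NotImplementedError


def calling_missing_method_fails():
    """Return True if a spec'd mock rejects a method that Git lacks."""
    git = mock.Mock(spec=Git)
    git.has_rev("abc123")  # allowed: has_rev exists on Git
    try:
        git.is_known("abc123")  # not defined on Git
    except AttributeError:
        return True
    return False


print(calling_missing_method_fails())
```

A plain mock.Mock() would accept any attribute name silently, which is how this kind of typo can slip past tests that mock the object instead of exercising the real code path.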

@efiop efiop added this to To do in DVC Sprint 11 Feb - 25 Feb 2020 via automation Feb 13, 2020
Suor added a commit to Suor/dvc that referenced this issue Feb 14, 2020
@Suor
Contributor

Suor commented Feb 14, 2020

Handled in #3323.

@skshetry skshetry moved this from To do to In progress in DVC Sprint 11 Feb - 25 Feb 2020 Feb 14, 2020
DVC Sprint 11 Feb - 25 Feb 2020 automation moved this from In progress to Done Feb 14, 2020
efiop pushed a commit that referenced this issue Feb 14, 2020
@efiop
Member

efiop commented Feb 15, 2020

@Ykid 0.85.0 is out on pip and conda, please upgrade, give it a try and let us know if it fixed the issue for you. Thanks for the feedback! 🙂

@Ykid
Author

Ykid commented Feb 15, 2020

The bug related to git is fixed, but there seems to be not much of a performance improvement :(

DVC version: 0.85.0
Python version: 3.7.6
Platform: Linux-5.0.0-37-generic-x86_64-with-debian-buster-sid
Binary: False
Package: pip
Cache: reflink - not supported, hardlink - not supported, symlink - supported
Filesystem type (cache directory): ('ext4', '/dev/sda1')
Filesystem type (workspace): ('ext4', '/dev/nvme0n1p2')
time dvc status

real	0m30.635s
user	0m21.884s
sys	0m6.815s

@efiop
Member

efiop commented Feb 15, 2020

@Suor Please take a look.

@efiop efiop reopened this Feb 15, 2020
DVC Sprint 11 Feb - 25 Feb 2020 automation moved this from Done to In progress Feb 15, 2020
@efiop efiop removed the p0-critical Critical issue. Needs to be fixed ASAP. label Feb 15, 2020
@Suor
Contributor

Suor commented Feb 25, 2020

So the optimization works for me: no unneeded clones or copies are made. git checkout takes significantly more time than expected though. I need to investigate @Ykid's situation more before jumping to advanced optimizations.

@Suor
Contributor

Suor commented Feb 25, 2020

@Ykid I made a branch with more erepo logging. Can you try it to see what is actually happening on your side and how much time it takes?

pip install git+https://github.com/Suor/dvc.git@erepo-log
dvc status -v
# And paste output here

@efiop efiop added this to To do in DVC 25 Feb - 10 March 2020 via automation Feb 25, 2020
@Suor Suor added the awaiting response we are waiting for your reply, please respond! :) label Feb 25, 2020
@Suor Suor moved this from To do to In progress in DVC 25 Feb - 10 March 2020 Feb 28, 2020
@efiop
Member

efiop commented Mar 10, 2020

Closing due to inactivity.

@efiop efiop closed this as completed Mar 10, 2020
DVC 25 Feb - 10 March 2020 automation moved this from In progress to Done Mar 10, 2020
7 participants