DVC diff: doesn't respond for an average dataset (~20k) #6173

Closed
MarcelNasser opened this issue Jun 15, 2021 · 10 comments
Labels: bug, p1-important, performance, ui

Comments

@MarcelNasser

MarcelNasser commented Jun 15, 2021

Bug Report

DVC diff: doesn't respond for an average dataset (~20k)

Description

When I run dvc diff on my dataset, the command just freezes.

Reproduce

  1. dvc init
  2. dvc get-url https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
  3. tar -xvf imagenette2-160.tgz
  4. dvc add imagenette2-160
  5. git commit -a -m "add imagenette dataset"
  6. dvc commit
  7. dvc diff
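
The steps above can also be scripted. This is only a minimal sketch: it assumes dvc, git, and tar are on PATH and that the current directory is already a git repository. `run_steps` and its `runner` parameter are illustrative names, not part of DVC.

```python
import subprocess

# Commands copied from the numbered steps above, in the same order.
REPRO_STEPS = [
    ["dvc", "init"],
    ["dvc", "get-url", "https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz"],
    ["tar", "-xvf", "imagenette2-160.tgz"],
    ["dvc", "add", "imagenette2-160"],
    ["git", "commit", "-a", "-m", "add imagenette dataset"],
    ["dvc", "commit"],
    ["dvc", "diff"],
]


def run_steps(steps=REPRO_STEPS, runner=subprocess.run):
    """Run each step in order; check=True stops at the first failing command."""
    for cmd in steps:
        print("$", " ".join(cmd))
        runner(cmd, check=True)


# Uncomment to actually execute the reproduction (downloads the dataset):
# run_steps()
```

The last step is the one that appears to hang.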

Expected

A list of changes between DVC commits.

Environment information

Output of dvc doctor:

$ dvc doctor

DVC version: 2.3.0 (snap)

Platform: Python 3.6.9 on Linux-5.8.0-55-generic-x86_64-with-Ubuntu-20.04-focal
Supports: azure, gdrive, gs, http, https, s3, ssh
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git

@skshetry
Member

@MarcelNasser, could you please try running dvc diff -v? I have noticed that it does not really print any status, so it might look frozen when it's just being slow (possibly related to #5746).

@skshetry
Member

DVC does not show any useful information even with -vv. That's really bad. If the issue is it being slow, it's probably due to #5746 and might just seem to be frozen. Anyway, could you please try running dvc diff -vv, and when it seems frozen, use Ctrl + C multiple times till it exits? Please post the log and the traceback that you get.

@MarcelNasser
Author

MarcelNasser commented Jun 15, 2021

@skshetry here is the output of dvc diff -vv followed by Ctrl+C:

bug.LOG

@karajan1001
Contributor

karajan1001 commented Jun 17, 2021

Reproduced on my computer. With dvc diff -vv, it continuously prints:

2021-06-17 17:12:37,902 TRACE: Assuming '/Users/gao/Code/test/test_6173/.dvc/cache/c5/261f96ce891b0da90173fb43f3950d.dir' is unchanged since it is read-only
2021-06-17 17:12:37,974 TRACE: Assuming '/Users/gao/Code/test/test_6173/.dvc/cache/c5/261f96ce891b0da90173fb43f3950d.dir' is unchanged since it is read-only
2021-06-17 17:12:38,046 TRACE: Assuming '/Users/gao/Code/test/test_6173/.dvc/cache/c5/261f96ce891b0da90173fb43f3950d.dir' is unchanged since it is read-only
2021-06-17 17:12:38,126 TRACE: Assuming '/Users/gao/Code/test/test_6173/.dvc/cache/c5/261f96ce891b0da90173fb43f3950d.dir' is unchanged since it is read-only
2021-06-17 17:12:38,198 TRACE: Assuming '/Users/gao/Code/test/test_6173/.dvc/cache/c5/261f96ce891b0da90173fb43f3950d.dir' is unchanged since it is read-only
2021-06-17 17:12:38,269 TRACE: Assuming '/Users/gao/Code/test/test_6173/.dvc/cache/c5/261f96ce891b0da90173fb43f3950d.dir' is unchanged since it is read-only
2021-06-17 17:12:38,349 TRACE: Assuming '/Users/gao/Code/test/test_6173/.dvc/cache/c5/261f96ce891b0da90173fb43f3950d.dir' is unchanged since it is read-only
2021-06-17 17:12:38,422 TRACE: Assuming '/Users/gao/Code/test/test_6173/.dvc/cache/c5/261f96ce891b0da90173fb43f3950d.dir' is unchanged since it is read-only
2021-06-17 17:12:38,494 TRACE: Assuming '/Users/gao/Code/test/test_6173/.dvc/cache/c5/261f96ce891b0da90173fb43f3950d.dir' is unchanged since it is read-only
2021-06-17 17:12:38,572 TRACE: Assuming '/Users/gao/Code/test/test_6173/.dvc/cache/c5/261f96ce891b0da90173fb43f3950d.dir' is unchanged since it is read-only
2021-06-17 17:12:38,643 TRACE: Assuming '/Users/gao/Code/test/test_6173/.dvc/cache/c5/261f96ce891b0da90173fb43f3950d.dir' is unchanged since it is read-only
2021-06-17 17:12:38,714 TRACE: Assuming '/Users/gao/Code/test/test_6173/.dvc/cache/c5/261f96ce891b0da90173fb43f3950d.dir' is unchanged since it is read-only
2021-06-17 17:12:38,792 TRACE: Assuming '/Users/gao/Code/test/test_6173/.dvc/cache/c5/261f96ce891b0da90173fb43f3950d.dir' is unchanged since it is read-only

I looked into it: the time is spent in the repo_fs.walk method, and it seems that neither dvc add nor dvc status calls it.
So it might be a different problem from #5746. It is even slower than dvc add on my computer.

@karajan1001 karajan1001 added bug Did we break something? p1-important Important, aka current backlog of things to do performance improvement over resource / time consuming tasks labels Jun 17, 2021
@skshetry skshetry added this to the UI improvements milestone Jun 21, 2021
@skshetry skshetry added the ui user interface / interaction label Jun 21, 2021
@skshetry skshetry self-assigned this Jun 21, 2021
@skshetry
Member

skshetry commented Jun 21, 2021

Assigning myself for the UI part that will be worked on in the coming days. More than the performance, diff not providing any information at all while it's running is unacceptable.

@karajan1001
Contributor

The total time cost from the profile result:
image

About half of the cost is in from_list and from_dict:
image
The other half is in pygtrie:
image

More details are in the attached file.
dump.prof.zip
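
For anyone who wants to collect a similar profile, here is a minimal sketch using Python's built-in cProfile and pstats. `slow_walk` is a hypothetical stand-in for the slow code path (it is not DVC code), and `profile_call` is an illustrative helper name.

```python
import cProfile
import io
import pstats


def slow_walk(n):
    """Hypothetical stand-in for the slow code path; not DVC code."""
    return sum(len(str(i)) for i in range(n))


def profile_call(func, *args, out_path=None, top=10):
    """Profile func(*args); return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    if out_path is not None:
        # Writes a binary stats file that pstats or other viewers can load.
        profiler.dump_stats(out_path)
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


result, report = profile_call(slow_walk, 100_000)
print(report)
```

Passing `out_path="dump.prof"` would save the raw stats to a file for later inspection.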

@shcheklein
Member

I hit the same issue with https://github.com/iterative/get-started-experiments . It has roughly 70K files. dvc diff doesn't respond at all, and there is nothing in the logs. I gave up waiting for it to complete (1h+).

We depend on dvc diff in VS Code (cc @mattseddon ), and this repo is supposed to be simple :), but we hit issues (cc @iesahin )

@efiop
Member

efiop commented Jul 9, 2021

Looks like we've lost .dir caching and are now loading it from scratch every time for each file in a dir, which creates these problems. We've introduced some changes in recent days that might fix that. Taking a closer look...
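
To illustrate why losing that cache hurts, here is a minimal sketch. It is not DVC's actual code: `FAKE_STORE`, `load_uncached`, `load_cached`, and `lookup_all` are all hypothetical names. Without memoization the directory listing is rebuilt once per file lookup; with `functools.lru_cache` it is built once per directory key.

```python
import functools

N_FILES = 500

# Hypothetical in-memory stand-in for one .dir listing.
FAKE_STORE = {
    "example-dir": [
        {"relpath": f"img_{i}.jpg", "md5": f"{i:032x}"} for i in range(N_FILES)
    ]
}

# Counts how many times each loader actually rebuilds the listing.
LOAD_COUNT = {"uncached": 0, "cached": 0}


def load_uncached(dir_key):
    """Rebuild the path -> hash mapping on every call."""
    LOAD_COUNT["uncached"] += 1
    return {e["relpath"]: e["md5"] for e in FAKE_STORE[dir_key]}


@functools.lru_cache(maxsize=None)
def load_cached(dir_key):
    """Same work, but memoized per directory key."""
    LOAD_COUNT["cached"] += 1
    return {e["relpath"]: e["md5"] for e in FAKE_STORE[dir_key]}


def lookup_all(loader, dir_key, paths):
    """Look up every path, asking the loader for the listing each time."""
    return [loader(dir_key)[p] for p in paths]


paths = [f"img_{i}.jpg" for i in range(N_FILES)]
lookup_all(load_uncached, "example-dir", paths)
lookup_all(load_cached, "example-dir", paths)
print(LOAD_COUNT)  # {'uncached': 500, 'cached': 1}
```

The uncached variant does work proportional to the square of the file count, which is the kind of growth that would turn a ~20k-file diff into an apparent hang.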

@efiop
Member

efiop commented Jul 9, 2021

Yeah, caching it reduces the time to around 1m total for dvc-bench (which is still a lot, but at least not hours). There is some obj-related code that we are now ready to improve (specifically get_dir_cache that is no longer needed, really) there, so I'll try to prepare a little bit more wholesome PR than just a quick hack. Will get back to this tomorrow morning.

@efiop efiop added this to To do in DVC 10 Aug - 24 Aug 2021 via automation Aug 10, 2021
@efiop efiop self-assigned this Aug 10, 2021
@efiop efiop moved this from To do to In progress in DVC 10 Aug - 24 Aug 2021 Aug 17, 2021
@efiop efiop added this to To do in DVC 24 Aug - 07 Sep 2021 via automation Aug 24, 2021
@efiop efiop moved this from In progress to Done in DVC 10 Aug - 24 Aug 2021 Aug 24, 2021
@efiop efiop moved this from To do to In progress in DVC 24 Aug - 07 Sep 2021 Aug 24, 2021
@skshetry skshetry added this to To do in DVC 07 Sep - 21 Sep 2021 via automation Sep 7, 2021
@skshetry skshetry moved this from In progress to Done in DVC 24 Aug - 07 Sep 2021 Sep 7, 2021
@efiop
Member

efiop commented Sep 13, 2021

Closing. It is now around 5sec for get-started-experiments, which is still a lot, but that will be addressed after #6594

For the record: we have some benchmarks in https://docs.iterative.ai/dvc-bench/

@efiop efiop closed this as completed Sep 13, 2021
DVC 07 Sep - 21 Sep 2021 automation moved this from To do to Done Sep 13, 2021

5 participants