Conversation

Contributor

@isidentical isidentical commented Apr 6, 2021

dvc update --to-remote now uploads only the missing cache files instead of syncing everything again. As a prerequisite, this patch also merges the transfer() logic into stage() and introduces the stage(upload=...) option for transferring to the remote cache.
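
To illustrate the idea of uploading only what the remote lacks, here is a minimal sketch. It is not DVC's implementation: `missing_hashes`, `upload_missing`, the `entries` mapping, and the `upload` callback are all hypothetical names chosen for this example, and the remote is modeled as a plain set of hashes.

```python
def missing_hashes(wanted, remote_has):
    """Return the hashes from `wanted` that the remote lacks, in order."""
    present = set(remote_has)
    return [h for h in wanted if h not in present]


def upload_missing(entries, remote, upload):
    """Upload only the objects whose hashes are absent from `remote`.

    entries: dict mapping hash -> local path (hypothetical layout)
    remote:  set of hashes already stored remotely (updated in place)
    upload:  callable(path, hash) that performs the actual transfer
    """
    uploaded = []
    for h in missing_hashes(entries, remote):
        upload(entries[h], h)
        remote.add(h)
        uploaded.append(h)
    return uploaded
```

Under these assumptions, a second run over the same entries transfers nothing, which is the behavior the description above is after.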

Fixes #4992, part of #5768

@efiop efiop self-requested a review April 6, 2021 11:23
Contributor

efiop commented Apr 8, 2021

@isidentical Cool optimization! 🔥 We've talked about getting rid of transfer() in favor of stage() before. I think it needs to be done as a prerequisite for this optimization, since transfer() itself is getting more complicated and the update case is almost indistinguishable from a normal stage(). Because of this additional similarity, I think it is worth doing in this PR.

@isidentical isidentical force-pushed the update-to-remote-opt branch from 49f2229 to 5f9b078 Compare April 12, 2021 11:30
@isidentical isidentical force-pushed the update-to-remote-opt branch from e2d1692 to 420404e Compare April 12, 2021 17:26
@isidentical isidentical marked this pull request as ready for review April 13, 2021 12:51
@efiop efiop requested a review from pmrowla April 13, 2021 13:26
@isidentical isidentical changed the title [WIP] update: only update the missing hashes with --to-remote update: only update the missing hashes with --to-remote Apr 14, 2021
@isidentical isidentical requested a review from efiop April 14, 2021 08:48
@isidentical
Contributor Author

Benchmark results: master, this branch. The import-url --to-remote runtime drops from 3.50 min to 2.10 min, and the update --to-remote runtime drops from about 9 min to about 3 min.

@isidentical isidentical force-pushed the update-to-remote-opt branch 5 times, most recently from 91beae2 to db3e79b Compare April 16, 2021 08:25
@isidentical isidentical reopened this Apr 16, 2021
@isidentical isidentical force-pushed the update-to-remote-opt branch 2 times, most recently from 4705d25 to 64145cd Compare April 17, 2021 08:51
Comment on lines +16 to +23
Contributor

Seems like you could use executor.map here, or am I missing something?

Contributor Author

We need to manually unpack some attributes of entry (entry.path_info, entry.fs, etc.) and pass **kwargs to the calls. In theory we could do something like executor.map(lambda entry: odb.add(...), ...), but what would be the advantage of map? It would just block the progress bar if there is a big file in the middle, which might not be the best UI.
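
The submit-based approach described in this comment can be sketched roughly as follows. This is an illustration only: `Entry`, `add`, and `add_all` are hypothetical stand-ins (the real `odb.add` signature is not shown in this thread), and `on_done` plays the role of a progress-bar update. Because `as_completed` yields futures as they finish, one slow item does not delay progress updates for the others.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass


@dataclass
class Entry:
    # Hypothetical stand-in for the entry objects mentioned above.
    path_info: str
    fs: str


def add(path_info, fs, **kwargs):
    # Hypothetical stand-in for odb.add(...); returns a label for the object.
    return f"{fs}://{path_info}"


def add_all(entries, on_done, jobs=4, **kwargs):
    """Submit one task per entry and report progress in completion order."""
    results = []
    with ThreadPoolExecutor(max_workers=jobs) as executor:
        futures = [
            executor.submit(add, entry.path_info, entry.fs, **kwargs)
            for entry in entries
        ]
        for future in as_completed(futures):
            results.append(future.result())  # re-raises worker exceptions
            on_done()  # e.g. advance a progress bar by one
    return results
```

Note that the results arrive in completion order here, not in the order of `entries`.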

Contributor

It seemed nicer and shorter. For example, see how we wrap the worker in tqdm in dvc/objects/stage.py and then use it in map, so it doesn't block the pbar.
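
The pattern suggested here, wrapping the worker so that progress is reported from inside each task, might look like the sketch below. It is not the code from dvc/objects/stage.py and does not use tqdm; `Progress`, `wrap_with_progress`, and `process_all` are hypothetical names, with a lock-protected counter standing in for the progress bar. Since the update happens inside the worker, progress advances as each task finishes even though `executor.map` yields results in input order.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Progress:
    """Thread-safe counter standing in for a progress bar (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.done = 0

    def update(self, n=1):
        with self._lock:
            self.done += n


def wrap_with_progress(func, progress):
    """Return a worker that reports progress as soon as its own task ends."""

    def wrapped(*args, **kwargs):
        result = func(*args, **kwargs)
        progress.update()
        return result

    return wrapped


def process_all(items, worker, jobs=4):
    progress = Progress()
    with ThreadPoolExecutor(max_workers=jobs) as executor:
        # map() preserves input order in its results; the progress update
        # is decoupled from that ordering because it runs in the worker.
        results = list(executor.map(wrap_with_progress(worker, progress), items))
    return results, progress.done
```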

@isidentical isidentical force-pushed the update-to-remote-opt branch from 64145cd to df3f0e5 Compare April 19, 2021 08:56
Contributor

efiop commented Apr 19, 2021

@isidentical Btw, very interesting failures on Windows. It looks like the same object (e.g. two files with the same content) is being saved in parallel, which results in that move exception. We simply need to ignore it (likely somewhere in objects/db/base.py), since this is expected behavior when someone (e.g. another user) is saving to the same cache dir in parallel.

EDIT: that will fix #4992
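
A minimal sketch of tolerating a move onto an existing object, assuming a content-addressed cache where an existing destination already holds identical data. `move_to_cache` is a hypothetical helper, not the function in objects/db/base.py. It relies on the documented behavior of `os.rename`: on Windows it raises `FileExistsError` when the destination exists, while on POSIX it silently replaces the destination file.

```python
import os


def move_to_cache(src, dst):
    """Move src to dst, treating an already-present dst as success.

    Returns True if this call placed the file, False if another writer
    had already put the same object there (only detectable on platforms
    where os.rename refuses to overwrite, such as Windows).
    """
    try:
        os.rename(src, dst)
    except FileExistsError:
        # Someone else saved the same object first. Its content is
        # assumed identical, so just drop our temporary copy.
        os.unlink(src)
        return False
    return True
```

Either way, after the call the destination exists and the source is gone, so callers saving the same object in parallel do not need to distinguish the two outcomes.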

@isidentical isidentical force-pushed the update-to-remote-opt branch 2 times, most recently from 41d5990 to 1a2cf91 Compare April 19, 2021 09:45
@isidentical isidentical force-pushed the update-to-remote-opt branch from 1a2cf91 to adf14ef Compare April 19, 2021 12:56
@isidentical isidentical force-pushed the update-to-remote-opt branch from adf14ef to 66228ac Compare April 19, 2021 13:18
Co-authored-by: Ruslan Kuprieiev <kupruser@gmail.com>
Contributor

@efiop efiop left a comment

Looks great! 🔥 As we've discussed above, there are a few things left, like using verify in odb.add and putting the stage() & save() logic together in one method, but let's deal with those on top.

@efiop efiop merged commit 68897aa into iterative:master Apr 19, 2021
@efiop efiop added the optimize Optimizes DVC label Apr 29, 2021
Labels

optimize Optimizes DVC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cache: check and gracefully handle move to existing file

2 participants