Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud versioning: fix push/fetch performance #8766

Merged
merged 2 commits into from Jan 10, 2023

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Jan 5, 2023

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Will close #8359

@pmrowla pmrowla added performance improvement over resource / time consuming tasks A: cloud-versioning Related to cloud-versioned remotes labels Jan 5, 2023
@pmrowla pmrowla self-assigned this Jan 5, 2023
@pmrowla
Copy link
Contributor Author

pmrowla commented Jan 5, 2023

@dberenbaum this PR also adds the change where we will not check whether existing version_id's are still valid for version_aware push. For the typical use case we can make this assumption, but I'm still not really convinced we should be doing it since it is not always safe (users may be auto-expiring old object versions or they may have specific use cases like #5269)

It is straightforward to remove that optimization in the future though if we get user feedback, so we can probably merge this for now.

@codecov
Copy link

codecov bot commented Jan 5, 2023

Codecov Report

Base: 93.54% // Head: 93.55% // Increases project coverage by +0.00% πŸŽ‰

Coverage data is based on head (3232756) compared to base (6ac8810).
Patch coverage: 0.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8766   +/-   ##
=======================================
  Coverage   93.54%   93.55%           
=======================================
  Files         457      457           
  Lines       36255    36253    -2     
  Branches     5262     5261    -1     
=======================================
  Hits        33915    33915           
+ Misses       1835     1833    -2     
  Partials      505      505           
Impacted Files Coverage Ξ”
dvc/repo/fetch.py 71.01% <0.00%> (ΓΈ)
dvc/repo/push.py 52.38% <0.00%> (ΓΈ)
dvc/repo/worktree.py 9.92% <0.00%> (+0.14%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

β˜” View full report at Codecov.
πŸ“’ Do you have feedback about the report comment? Let us know in this issue.

@dberenbaum
Copy link
Contributor

dberenbaum commented Jan 5, 2023

@pmrowla I agree that assuming the version exists might not be a safe assumption. Do you have other ideas for how to speed up checking status for version_aware remotes?

Edit: maybe related to #7268. If we had some option to force verification, it might be fine to default to assuming the version exists.

@pmrowla
Copy link
Contributor Author

pmrowla commented Jan 6, 2023

Do you have other ideas for how to speed up checking status for version_aware remotes?

The main problem here is that for regular DVC we are able to speed up the status check with:

  • for directories we push .dir files last, and assume that if a .dir file exists on the remote, the rest of the dir contents still exists
  • use the remote size estimation (based on listing a single hash prefix) to check whether its faster to list the entire bucket or do individual exists calls

But we can't use either of these optimizations with cloud versioning. Assuming that if we have a version ID we don't need to push is similar to the .dir optimization, but the difference there would be that we at least check one file in the remote (the .dir file), and for regular DVC remotes the expectation is that DVC is the only thing modifying the entire bucket (which is not necessarily the case with cloud versioning).

Since we can't actually do any size estimation for versioned remotes, I suppose we could try running both the sequential exists calls in parallel with the list_object_versions, and then wait for whichever one finishes first. It's not a very elegant solution, and it will use a lot more API calls than necessary (due to always running at least some of the exists checks), but it would give the desired result.

@dberenbaum
Copy link
Contributor

Can we leave out the status optimizations to not hold this up? Let's discuss status optimization separately.

@pmrowla pmrowla force-pushed the cloud-versioning-perf branch 2 times, most recently from 0a83c61 to c1ac943 Compare January 9, 2023 04:39
@pmrowla
Copy link
Contributor Author

pmrowla commented Jan 9, 2023

@dberenbaum I removed the status optimization in this PR, it should be ready to merge now

@dberenbaum
Copy link
Contributor

Thanks @pmrowla! Great to see this resolved so fast.

@pmrowla pmrowla merged commit c94d285 into iterative:main Jan 10, 2023
@pmrowla pmrowla deleted the cloud-versioning-perf branch January 10, 2023 07:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: cloud-versioning Related to cloud-versioned remotes performance improvement over resource / time consuming tasks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cloud versioning: slow
2 participants