
cloud versioning: version_aware status slow #8774

Closed
dberenbaum opened this issue Jan 6, 2023 · 6 comments · Fixed by #8842
Assignees
Labels
A: cloud-versioning Related to cloud-versioned remotes p1-important Important, aka current backlog of things to do

Comments

@dberenbaum
Contributor

dberenbaum commented Jan 6, 2023

After iterative/dvc-data#246, the only blocker should be pushing incremental changes to a version_aware remote (not an issue with worktree). For example, pushing when nothing has changed:

$ time dvc push -r cache
Everything is up to date.
dvc push -r cache  1.74s user 0.16s system 83% cpu 2.290 total
$ time dvc push -r worktree
Everything is up to date.
dvc push -r worktree  1.95s user 0.14s system 75% cpu 2.788 total
$ time dvc push -r versioned
Everything is up to date.
dvc push -r versioned  5.23s user 0.38s system 50% cpu 11.095 total

The version_aware remote hangs with a progress bar stuck at 0% for most of the runtime. I'm guessing DVC is looking up each file by version and this takes a long time. Is it needed? I think we discussed that it might not be necessary to validate that the data exists in the cloud if there is a version_id in the .dvc file. That might be a strong assumption to make, but checking status feels painfully slow compared to other remote types.

Originally posted by @dberenbaum in #8359 (comment)

@dberenbaum
Contributor Author

Do you have other ideas for how to speed up checking status for version_aware remotes?

The main problem here is that for regular DVC we are able to speed up the status check with:

  • for directories we push .dir files last, and assume that if a .dir file exists on the remote, the rest of the dir contents still exist
  • use the remote size estimation (based on listing a single hash prefix) to check whether it's faster to list the entire bucket or do individual exists calls

But we can't use either of these optimizations with cloud versioning. Assuming that if we have a version ID we don't need to push is similar to the .dir optimization, but the difference there would be that we at least check one file in the remote (the .dir file), and for regular DVC remotes the expectation is that DVC is the only thing modifying the entire bucket (which is not necessarily the case with cloud versioning).

Since we can't actually do any size estimation for versioned remotes, I suppose we could try running the sequential exists calls in parallel with list_object_versions, and then wait for whichever one finishes first. It's not a very elegant solution, and it will use a lot more API calls than necessary (since at least some of the exists checks always run), but it would give the desired result.

Originally posted by @pmrowla in #8766 (comment)
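
A rough sketch of that racing idea, assuming hypothetical coroutines check_exists_calls() and list_all_object_versions() (these names are illustrative, not dvc-data APIs):

import asyncio

async def race_status_check(check_exists_calls, list_all_object_versions):
    # Start both strategies and keep whichever result arrives first.
    exists_task = asyncio.create_task(check_exists_calls())
    listing_task = asyncio.create_task(list_all_object_versions())
    done, pending = await asyncio.wait(
        {exists_task, listing_task}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # drop the slower strategy; its API calls are wasted
    return done.pop().result()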

@dberenbaum
Contributor Author

  • use the remote size estimation (based on listing a single hash prefix)

Can you explain more how the remote size estimation works? I didn't get it from this comment.

to check whether it's faster to list the entire bucket or do individual exists calls

Which one do we do for version_aware remotes?

  • for directories we push .dir files last, and assume that if a .dir file exists on the remote, the rest of the dir contents still exist

I guess this is where most of the time difference comes. What do you think about checking the first file entry in each .dvc file? And do you think it would be less of an issue if we implemented #7268?

@pmrowla
Contributor

pmrowla commented Jan 9, 2023

  • use the remote size estimation (based on listing a single hash prefix)

Can you explain more how the remote size estimation works? I didn't get it from this comment.

So depending on the # of objects whose status we have to check and the total number of objects in the bucket, sometimes it is faster to query them individually with exists, and sometimes it is faster to just get the entire listing of objects in the remote (which returns ~1000 results at a time per API call depending on the cloud, but requires iterating over the entire remote's worth of list API calls).

Essentially, for a relatively large # of objects to query, and a relatively small total remote size, listing is faster. For a relatively small # of objects to query and relatively large total remote size, it is faster to use individual exists calls. The problem is that we don't know the total remote size beforehand (there is no way to get the actual size other than doing the entire listing and then counting the total results).

For regular DVC remotes, we can get a rough estimate of the remote size, since we know the structure of the remote (content-addressable storage keyed by MD5 hashes). MD5 is evenly distributed, so objects in the remote are also evenly distributed. This means we can do a listing of a subset of the remote by object key prefix, and use the # of returned results for that subset to estimate the size of the entire remote. Basically, if we know that the 00/ subdirectory contains 1 object, we can estimate that the entire remote contains roughly 256 objects (since there are 256 possible 2-hex-digit prefixes in total). For clouds like S3, we use longer prefixes as well (like 00/0 instead of just 00/) to reduce the size of the subset listing.

Once we have an estimated size, we can make a fairly good guess as to whether it will be faster for us to finish listing the rest of the remote, or just switch to using individual exists calls.
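
A minimal sketch of that estimation, assuming an fsspec-style filesystem with a find() method; the helper names and thresholds are illustrative, not the actual dvc-objects code:

def estimate_remote_size(fs, remote_path, prefix="00/0"):
    # MD5 hashes are uniformly distributed, so the object count under one
    # short hex prefix approximates a fixed fraction of the whole remote.
    sampled = fs.find(f"{remote_path}/{prefix}")
    total_prefixes = 16 ** len(prefix.replace("/", ""))  # 4096 for "00/0"
    return len(sampled) * total_prefixes

def should_list_whole_remote(num_objects_to_check, estimated_remote_size,
                             objects_per_list_call=1000):
    # Compare the rough API cost of paging through the full listing against
    # one exists call per object that needs checking.
    remaining_list_calls = estimated_remote_size / objects_per_list_call
    return remaining_list_calls < num_objects_to_check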

For cloud versioned remotes, there is no way for us to estimate the size of the remote, since we do not know anything about the # of objects within a given directory, and we do not know anything about the # of object versions for each object within a given directory.

to check whether it's faster to list the entire bucket or do individual exists calls

Which one do we do for version_aware remotes?

For version_aware remotes we always use exists instead of listing the entire bucket's worth of object versions.
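
For context, such a version-aware exists check on S3 boils down to a per-object HEAD request with the stored version ID; a minimal boto3 sketch (bucket and key names are illustrative, and not necessarily the code path DVC uses, which goes through s3fs):

import boto3
from botocore.exceptions import ClientError

def version_exists(bucket, key, version_id):
    # HEAD the specific object version; a 404 means that version is gone.
    s3 = boto3.client("s3")
    try:
        s3.head_object(Bucket=bucket, Key=key, VersionId=version_id)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] in ("404", "NoSuchVersion"):
            return False
        raise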

  • for directories we push .dir files last, and assume that if a .dir file exists on the remote, the rest of the dir contents still exist

I guess this is where most of the time difference comes. What do you think about checking the first file entry in each .dvc file? And do you think it would be less of an issue if we implemented #7268?

One of the selling points for cloud versioned remotes is that we leave things human-readable and in their original structure, so that users can integrate the remote into existing workflows that may add or remove things in the bucket without using DVC. Because of that, we can't make the same kind of assumptions that let us do things like the .dir optimization.

The .dir optimization for regular remotes only works because we can make a relatively safe assumption that DVC is the only thing making modifications to the bucket (whether it's from dvc push or dvc gc -c), and DVC is specifically written to ensure that .dir files are only present in the remote if the entire directory contents are also present in the remote.
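
To illustrate the assumption that optimization relies on, a simplified sketch (the remote layout helpers are illustrative, not the real dvc-data index code):

def dir_is_pushed(fs, remote_path, dir_hash, entry_hashes):
    # .dir files are uploaded last, so their presence implies every entry in
    # the directory was already pushed by a previous `dvc push`.
    if fs.exists(f"{remote_path}/{dir_hash[:2]}/{dir_hash[2:]}"):
        return True
    # Otherwise fall back to checking each entry individually.
    return all(
        fs.exists(f"{remote_path}/{h[:2]}/{h[2:]}") for h in entry_hashes
    )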

Adding the force push workaround would help, but that's still really only a workaround, and it only helps when the user actually notices that some files were previously not pushed as expected.

@dberenbaum added the "p1-important" and "A: cloud-versioning" labels Jan 9, 2023
@dberenbaum
Contributor Author

Seems like dql has a similar problem, and AFAIK they plan to ask users when they want to actually reindex what's in the cloud, so it seems consistent for now to assume everything exists until forced to check again.

One problem I found when you included that assumption in #8766 is that when testing with both a version_aware and worktree remote for the same project, pushing to one of the remotes would make the other look like everything was up to date since there was a version_id for each object. This comes back to related issues like #8356 and #8653 (comment), so it seems again like we need to record individual locations/remotes in the .dvc file.

@pmrowla self-assigned this Jan 11, 2023
@pmrowla
Contributor

pmrowla commented Jan 18, 2023

@dberenbaum there are WIP PRs for this that you can try, from a venv with dvc main:

pip install --no-deps git+https://github.com/iterative/dvc-objects.git@refs/pull/186/head#egg=dvc-objects
pip install --no-deps git+https://github.com/iterative/dvc-data.git@refs/pull/268/head#egg=dvc-data

@dberenbaum
Contributor Author

Thanks @pmrowla. I'm getting mixed results testing it: not much difference with the 2800-file dataset in s3://dave-sandbox-versioning/coco-small-test, but a substantial improvement with the 7000-file dataset in s3://dave-sandbox-versioning/coco-small-test. Since the bigger datasets are obviously more important, it seems like this should be good enough for now.

More importantly, those changes seem to be breaking the status check with worktree remotes. If I push to a worktree remote where there are no changes, I get:

$ time dvc push -v -r versioned cats-dogs
2023-01-18 16:56:23,191 DEBUG: Checking if stage 'cats-dogs' is in 'dvc.yaml'
2023-01-18 16:56:23,244 DEBUG: indexing latest worktree for 'dave-sandbox-versioning/test/versioned'
2023-01-18 16:56:24,835 DEBUG: Pushing worktree changes to 'dave-sandbox-versioning/test/versioned'
2023-01-18 16:56:24,875 ERROR: unexpected error - []
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dave/Code/dvc/dvc/cli/__init__.py", line 184, in main
    ret = cmd.do_run()
  File "/Users/dave/Code/dvc/dvc/cli/command.py", line 26, in do_run
    return self.run()
  File "/Users/dave/Code/dvc/dvc/commands/data_sync.py", line 59, in run
    processed_files_count = self.repo.push(
  File "/Users/dave/Code/dvc/dvc/repo/__init__.py", line 66, in wrapper
    return f(repo, *args, **kwargs)
  File "/Users/dave/Code/dvc/dvc/repo/push.py", line 50, in push
    pushed += _push_worktree(
  File "/Users/dave/Code/dvc/dvc/repo/push.py", line 119, in _push_worktree
    return push_worktree(repo, remote, targets=targets, jobs=jobs, **kwargs)
  File "/Users/dave/Code/dvc/dvc/repo/worktree.py", line 131, in push_worktree
    pushed = checkout(
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/site-packages/dvc_data/index/checkout.py", line 33, in checkout
    fs.remove(
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 462, in rm
    self.fs.rm(path, recursive=recursive, **kwargs)
  File "/Users/dave/Code/filesystem_spec/fsspec/asyn.py", line 113, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/Users/dave/Code/filesystem_spec/fsspec/asyn.py", line 98, in sync
    raise return_result
  File "/Users/dave/Code/filesystem_spec/fsspec/asyn.py", line 53, in _runner
    result[0] = await coro
  File "/Users/dave/Code/s3fs/s3fs/core.py", line 1836, in _rm
    paths = await self._expand_path(path, recursive=recursive)
  File "/Users/dave/Code/filesystem_spec/fsspec/asyn.py", line 775, in _expand_path
    raise FileNotFoundError(path)
FileNotFoundError: []
------------------------------------------------------------
2023-01-18 16:56:25,565 DEBUG: Removing '/private/tmp/.FMvftB2E4PK2krxeXtrBXY.tmp'
2023-01-18 16:56:25,565 DEBUG: Removing '/private/tmp/.FMvftB2E4PK2krxeXtrBXY.tmp'
2023-01-18 16:56:25,566 DEBUG: Removing '/private/tmp/.FMvftB2E4PK2krxeXtrBXY.tmp'
2023-01-18 16:56:25,566 DEBUG: Removing '/private/tmp/repo/.dvc/cache/.BaJXjH3u6dmDrHxELW2Qjg.tmp'
2023-01-18 16:56:25,567 DEBUG: Version info for developers:
DVC version: 2.41.2.dev35+g3676f35b7
---------------------------------
Platform: Python 3.10.2 on macOS-13.1-arm64-arm-64bit
Subprojects:
        dvc_data = 0.33.1.dev1+g0ed3d9f
        dvc_objects = 0.17.1.dev3+g5bfa5c2
        dvc_render = 0.0.17
        dvc_task = 0.1.11
        dvclive = 1.3.2
        scmrepo = 0.1.6
Supports:
        azure (adlfs = 2022.9.1, knack = 0.9.0, azure-identity = 1.7.1),
        gdrive (pydrive2 = 1.15.0),
        gs (gcsfs = 2022.11.0),
        hdfs (fsspec = 2022.11.0+0.gacad158.dirty, pyarrow = 7.0.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        oss (ossfs = 2021.8.0),
        s3 (s3fs = 2022.11.0+6.g804057f, boto3 = 1.24.59),
        ssh (sshfs = 2022.6.0),
        webdav (webdav4 = 0.9.4),
        webdavs (webdav4 = 0.9.4),
        webhdfs (fsspec = 2022.11.0+0.gacad158.dirty)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2023-01-18 16:56:25,607 DEBUG: Analytics is disabled.
dvc push -v -r versioned cats-dogs  4.23s user 0.39s system 105% cpu 4.390 total
