
cloud versioning: pushing to a worktree remote hangs #8836

Closed
dberenbaum opened this issue Jan 18, 2023 · 19 comments · Fixed by #8842

dberenbaum commented Jan 18, 2023

Bug Report

Description

When I push to a worktree remote, I often get stuck in this state for minutes (I have not left it hanging long enough to see if it eventually completes):

Screenshot 2023-01-18 at 8 38 21 AM

Reproduce

I'm not yet sure how to create a minimal example that reproduces it, but it happens often when testing. Here's the steps I have taken to reproduce it:

  1. Run dvc get git@github.com:iterative/dataset-registry.git use-cases/cats-dogs to get some data.
  2. Run dvc add cats-dogs to track the data.
  3. Set up a worktree remote in a DVC repo.
  4. Run dvc push to that worktree remote.

And here's a video of it:

Screen.Recording.2023-01-18.at.9.01.20.AM.mov

Output of dvc doctor:

$ dvc doctor
DVC version: 2.41.2.dev36+g5f558d003
---------------------------------
Platform: Python 3.10.2 on macOS-13.1-arm64-arm-64bit
Subprojects:
        dvc_data = 0.33.0
        dvc_objects = 0.17.0
        dvc_render = 0.0.17
        dvc_task = 0.1.11
        dvclive = 1.3.2
        scmrepo = 0.1.6
Supports:
        azure (adlfs = 2022.9.1, knack = 0.9.0, azure-identity = 1.7.1),
        gdrive (pydrive2 = 1.15.0),
        gs (gcsfs = 2022.11.0),
        hdfs (fsspec = 2022.11.0+0.gacad158.dirty, pyarrow = 7.0.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        oss (ossfs = 2021.8.0),
        s3 (s3fs = 2022.11.0+6.g804057f, boto3 = 1.24.59),
        ssh (sshfs = 2022.6.0),
        webdav (webdav4 = 0.9.4),
        webdavs (webdav4 = 0.9.4),
        webhdfs (fsspec = 2022.11.0+0.gacad158.dirty)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
@dberenbaum added the p1-important, A: data-sync, and A: cloud-versioning labels Jan 18, 2023
daavoo commented Jan 19, 2023

@dberenbaum could you maybe share the profile from dvc push --viztracer --viztracer-depth 8?
It should still generate a profile when manually interrupted.

I have been testing with a larger number of files and, for me, it's taking quite some time building the data index (old_index = build_data_index(view, remote.path, remote.fs)) when the folder added as the worktree remote has a significant number of files.

@pmrowla reopened this Jan 19, 2023
pmrowla commented Jan 19, 2023

It's slow because we have to walk the worktree remote so we can do the diff to see what needs to be pushed and what needs to be deleted in the remote. We probably need to display some kind of message while it is building the index, but we can't use a regular progressbar because we don't know how many files there will be in the remote.
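Something as simple as a counter without a total could work; a minimal sketch, assuming plain tqdm and an fsspec-style filesystem (the helper name is made up, and this is not DVC's actual progress machinery):

```python
from tqdm import tqdm

def list_remote_files(fs, prefix):
    """Walk a remote prefix while showing only a running count of files found."""
    files = []
    # total=None makes tqdm show an iteration counter instead of a percentage bar
    with tqdm(desc=f"Listing {prefix}", total=None, unit="files") as pbar:
        for root, _dirs, names in fs.walk(prefix):  # generator, like os.walk
            for name in names:
                files.append(f"{root}/{name}")
                pbar.update(1)
    return files
```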

dberenbaum commented Jan 19, 2023

Compare the video above to this one, which pushes a dataset ~3x bigger:

Screen.Recording.2023-01-19.at.2.46.36.PM.mov

In the first video, I've checked that the "hanging" happens for ~5 minutes, while it lasts almost no time in the second video.

The only difference I can tell is that there are more previous versions of objects in the first video. Why should that matter here if we are only checking the current versions?

pmrowla commented Jan 20, 2023

It's because in S3 you have to list all object versions if you want any versioning information at all.

dberenbaum commented Jan 20, 2023

🤔 Why isn't it a problem for version_aware remotes? Don't we need version info there also?

And do we need the versions when pushing to a worktree remote? If we just want to check the current version, can we use etags?

Edit: There also might be workarounds like https://stackoverflow.com/a/72281706

@dberenbaum

@daavoo For some reason I'm getting an error trying to view this, but here's the JSON file:

viztracer.zip

pmrowla commented Jan 20, 2023

> 🤔 Why isn't it a problem for version_aware remotes? Don't we need version info there also?

For version_aware we only need to check versions that we know about (because we think we pushed them already). For worktree we have to also check for files that exist in the remote but not in our repo (and then delete them). For version_aware we can just ignore files we don't know about at all.

The version_aware check will be faster in cases where using individual exists calls completes faster than listing the entire remote's worth of existing objects. In worktree we can't use exists since that doesn't help us with deleting files that are in the remote but not in the DVC workspace.
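For the record, the two strategies differ roughly like this (a simplified sketch, not the actual dvc-data logic; fs is an fsspec-style filesystem and entry.remote_path is a hypothetical attribute):

```python
def version_aware_push_plan(fs, local_entries):
    """Only check objects we think we already pushed; ignore anything else in the remote."""
    to_push = []
    for entry in local_entries:
        # simplified: the real check queries the specific recorded version
        if not fs.exists(entry.remote_path):
            to_push.append(entry)
    return to_push


def worktree_push_plan(fs, remote_prefix, local_entries):
    """List the whole remote so we can compute both what to push and what to delete."""
    remote_files = set(fs.find(remote_prefix))  # a full listing is unavoidable here
    local_files = {entry.remote_path for entry in local_entries}
    to_push = local_files - remote_files
    to_delete = remote_files - local_files  # per-file exists() could never find these
    return to_push, to_delete
```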

daavoo commented Jan 20, 2023

> @daavoo For some reason I'm getting an error trying to view this, but here's the JSON file:

I think it's because it's too big. On my computer, Firefox breaks but Chrome manages to render it.

daavoo commented Jan 20, 2023

> @daavoo For some reason I'm getting an error trying to view this, but here's the JSON file:
>
> viztracer.zip

Wow, I have never seen this 😅 The profile is completely blank inside checkout for 200s before reaching the sum:

Grabacion.de.pantalla.2023-01-20.a.las.15.00.40.mov

Anyhow, it looks like this was run before the changes made in #8842:

Captura de Pantalla 2023-01-20 a las 15 04 12

Could you rerun with the latest version?

pmrowla commented Jan 20, 2023

> Edit: There also might be workarounds like https://stackoverflow.com/a/72281706

head-object only works for individual files, and it's what we already do for querying an individual object.

> If we just want to check the current version, can we use etags?

This could work, but will probably require some adjustment in dvc-data because we will essentially need to mix non-versioned filesystem calls with versioned ones.

The issue is that when listing a prefix (which is the only way we can find files we don't know about), you can either do list_objects_v2, which does not return any version information at all, or list_object_versions, which returns all object versions. The current fsspec s3fs implementation always uses list_object_versions when doing any directory listing, since it is the only way for s3fs to make sure it gets version information for everything.

So to get the "etags only" listing from S3 we need a non-versioned s3fs instance (to use list_objects_v2).

Also, this is only a problem on S3. The azure and gcs APIs are designed properly so that you can list only the latest versions of a prefix and get a response that includes version IDs for those latest versions. (Using the etag method won't improve performance vs using IDs on azure or gcs)
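For reference, the difference at the S3 API level looks roughly like this (a boto3 sketch for illustration only; the bucket, prefix, and key are placeholders):

```python
import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-bucket", "test/worktree/"  # placeholders

# list_objects_v2: latest objects only, with ETags but no VersionId
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["ETag"])

# list_object_versions: every version of every object under the prefix,
# which is why it slows down as the version history grows
for page in s3.get_paginator("list_object_versions").paginate(Bucket=bucket, Prefix=prefix):
    for version in page.get("Versions", []):
        print(version["Key"], version["VersionId"], version["IsLatest"])

# head_object does return a VersionId, but only for one key at a time
info = s3.head_object(Bucket=bucket, Key=prefix + "some/file")  # placeholder key
print(info["ETag"], info.get("VersionId"))
```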

dberenbaum commented Jan 20, 2023

> For version_aware we only need to check versions that we know about (because we think we pushed them already). For worktree we have to also check for files that exist in the remote but not in our repo (and then delete them). For version_aware we can just ignore files we don't know about at all.

Don't we only need to get the current versions for that? Why do we need the version IDs of those files to know what to delete?

Edit: addressed above by @pmrowla

I've narrowed down the problem a little -- it only happens on the initial push/when there is no existing cloud metadata.

> Could you rerun with the latest version?

On the latest version, dvc push seems to wrongly think everything is up to date:

$ dvc push -r worktree cats-dogs
Everything is up to date.

$ cat cats-dogs.dvc
outs:
- md5: 22e3f61e52c0ba45334d973244efc155.dir
  size: 64128504
  nfiles: 2800
  path: cats-dogs

$ dvc config -l
core.analytics=false
remote.worktree.url=s3://dave-sandbox-versioning/test/worktree
remote.worktree.worktree=true

$ aws s3 ls --recursive s3://dave-sandbox-versioning/test/worktree/
# returns nothing

daavoo commented Jan 20, 2023

> On the latest version, dvc push seems to wrongly think everything is up to date:

#8852

@dberenbaum

#8842 seems to have fixed the "hanging" issue and made worktree and version_aware remotes perform similarly.

However, it also slowed down the overall operation, spending a lot of time on "Updating meta for new files":

Screen.Recording.2023-01-20.at.10.41.40.AM.mov

I'm assuming it's related to listing all object versions, but I'm not clear why it's so much worse in the new version.

Here are results that I think you can reproduce more or less by:

  1. Enable access to AWS sandbox.
  2. git clone https://github.com/dberenbaum/cloud-versioning-test
  3. dvc pull
  4. dvc push -r versioned or dvc push -r worktree

Before #8842:

Screenshot 2023-01-20 at 10 35 53 AM

After #8842:

Screenshot 2023-01-20 at 10 37 00 AM

@skshetry

@dberenbaum, can you try running viztracer with --log_async please? See https://viztracer.readthedocs.io/en/latest/concurrency.html#asyncio.
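For reference, the CLI flag should correspond to the log_async option in viztracer's Python API; a rough sketch of collecting an equivalent trace programmatically (main() is just a placeholder for whatever is being profiled):

```python
from viztracer import VizTracer

def main():
    ...  # placeholder for the operation being profiled, e.g. the push

# log_async records each asyncio task on its own track in the trace
with VizTracer(log_async=True, output_file="viztracer.json"):
    main()
```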

@skshetry

Anyway, it seems like it's calling fs.info() after the checkout calls for all the files to update metadata, which is what is taking time.

@dberenbaum

> @dberenbaum, can you try running viztracer with --log_async please? See https://viztracer.readthedocs.io/en/latest/concurrency.html#asyncio.

viztracer.zip

pmrowla commented Jan 20, 2023

> However, it also slowed down the overall operation, spending a lot of time on "Updating meta for new files":

@dberenbaum is your actual real-world time performance worse than it was before?

The "updating meta" behavior is the same as it was before, the only difference is that gets a separate progressbar now. Previously it was included in the overall push and showed up as an individual upload progressbar sitting at 100% before the "total push" bar would actually increment for a completed file upload

edit: actually I see the issue, the info calls were previously batched and are done sequentially now, will send a PR with fixes
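Roughly the difference in question, as an illustrative sketch only (not the actual dvc-data change; fs.info is an fsspec-style call and the helper names are made up):

```python
from concurrent.futures import ThreadPoolExecutor

def update_meta_sequential(fs, paths):
    # one blocking round-trip per file: this is what makes "Updating meta" crawl
    return {path: fs.info(path) for path in paths}

def update_meta_batched(fs, paths, jobs=16):
    # keep many info() round-trips in flight at once
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return dict(zip(paths, pool.map(fs.info, paths)))
```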

pmrowla commented Jan 20, 2023

The updating meta issue should be fixed with dvc-data==0.35.1 (#8857)

@dberenbaum

Thanks everyone for your help solving this so quickly! Looks good now, closing.
