gs: support directories as external dependencies/outputs #2814

Closed
willypicard opened this issue Nov 19, 2019 · 6 comments · Fixed by #2853

Labels
awaiting response: we are waiting for your reply, please respond! :)
feature request: Requesting a new feature
good first issue
help wanted
p2-medium: Medium priority, should be done, but less important

Comments

@willypicard

I have a similar issue to #2678 but for GS.

I have a bucket with the following structure:

my_bucket
       ├── data
       │     ├── img1.png
       │     ├── img2.png
       │     ├── ...
       └── cache

I then created a clean project:

$ git init
$ dvc init
$ dvc remote add gscache gs://my_bucket/cache
$ dvc config cache.gs gscache
$ dvc add gs://my_bucket/data

The output is as follows:

100%|██████████|Add                                                                                                                            1/1 [00:00<00:00,  1.21file/s]
ERROR: output 'gs://my_bucket/data' does not exist

Adding a single file works (dvc add gs://my_bucket/data/img1.png).

A more verbose version:

$ dvc add gs://my_bucket/data -v 
DEBUG: PRAGMA user_version;
DEBUG: fetched: [(3,)]
DEBUG: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
DEBUG: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
DEBUG: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
DEBUG: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
DEBUG: PRAGMA user_version = 3;
100%|██████████|Add                                                                                                                            1/1 [00:01<00:00,  1.63s/file]
DEBUG: SELECT count from state_info WHERE rowid=?
DEBUG: fetched: [(0,)]
DEBUG: UPDATE state_info SET count = ? WHERE rowid = ?
ERROR: output 'gs://my_bucket/data' does not exist
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/command/add.py", line 25, in run
    fname=self.args.file,
  File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/__init__.py", line 35, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/add.py", line 53, in add
    stage.save()
  File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/stage.py", line 716, in save
    out.save()
  File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/output/base.py", line 219, in save
    raise self.DoesNotExistError(self)
dvc.output.base.OutputDoesNotExistError: output 'gs://my_bucket/data' does not exist
------------------------------------------------------------

dvc --version reports 0.68.1. I am on Ubuntu with Python 3.7.5, and I installed DVC using conda.

@efiop
Contributor

efiop commented Nov 19, 2019

Hi @willypicard !

#2678 is about a specific bug in our directory support for S3. The issue you are reporting is related to #1654, as we don't currently support GS directories as external outputs or dependencies. Could you elaborate on your scenario, so we can better understand whether support for GS directories is what you really need? 🙂

@efiop added the "feature request" label on Nov 19, 2019
@triage-new-issues (bot) removed the "triage" label on Nov 19, 2019
@efiop changed the title from "Unable to dvc add more than 1 file at a time in gs bucket" to "gs: support directories as external dependencies/outputs" on Nov 19, 2019
@efiop added the "awaiting response" label on Nov 19, 2019
@willypicard
Author

I am using Kubeflow to preprocess datasets, and I would like to use DVC to manage my datasets and models. I have large datasets stored on GCS, and I would like them to be versioned and kept on GCS. So it would be convenient to provide the directory containing the dataset instead of each file in it (some of my datasets contain hundreds of thousands of files).

@efiop
Contributor

efiop commented Nov 19, 2019

How big are those datasets? Just checking whether you are aware that you can mount that bucket through s3fuse and work with it as you would with local files. 🙂

@willypicard
Author

I have a dataset that is 250 GB, so rather large.
S3fuse might be an option. However, it would seem "natural" to be able to use dvc add gs://my_bucket/mydataset as we can locally. And it would be a great feature for cloud-based tools such as kubeflow.

@willypicard
Author

Obviously, it is also possible to use gsutil ls -r gs://my_bucket/data to retrieve the list of files and run dvc add on each of them, but that is far from elegant.
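For reference, the per-file workaround described here could look roughly like the sketch below. It is only an illustration: the helper names object_urls and dvc_add_commands are hypothetical, and it assumes that the recursive listing prints one object URL per line, with blank separator lines and "gs://.../subdir/:" headers for nested directories. It does not call gsutil or dvc itself; it only turns listing text into the commands one would run.

```python
from typing import List


def object_urls(ls_output: str) -> List[str]:
    """Extract object URLs from recursive listing text (assumed format, see above)."""
    urls = []
    for line in ls_output.splitlines():
        line = line.strip()
        # Skip blank separators, "dir/:" section headers and "dir/" placeholders.
        if not line or line.endswith(":") or line.endswith("/"):
            continue
        urls.append(line)
    return urls


def dvc_add_commands(urls: List[str]) -> List[List[str]]:
    """Build one `dvc add <url>` argument list per object URL."""
    return [["dvc", "add", url] for url in urls]
```

The listing text could be captured with subprocess.run(["gsutil", "ls", "-r", "gs://my_bucket/data"], capture_output=True, text=True, check=True).stdout, and each command executed with subprocess.run(cmd, check=True). With hundreds of thousands of files this means one dvc invocation per object, which is exactly why directory support is the more natural solution.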

@efiop
Contributor

efiop commented Nov 19, 2019

@willypicard Got it. Makes sense, let's implement it. 🤝 Unfortunately, we don't have enough space in the current sprint, so if you would be willing to give it a shot, we'll try to help with everything we can. It is really not complex, as we already have all the generalized logic in place, so the only things that one would need to implement are:

Make RemoteGS.exists() support directories
Implement RemoteGS.walk_files()
Implement RemoteGS.isdir()

One could look at s3.py from https://github.com/iterative/dvc/pull/2619/files as an example. 🙂 Let us know what you think. Thanks for the feedback!
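To make the three items above more concrete, here is a minimal sketch of the prefix logic they imply. It is not DVC's code and does not use the RemoteGS class: object storage has no real directories, so a "directory" is treated as any key prefix ending in a slash. The functions below are hypothetical stand-ins that operate on a plain list of object keys; in a real implementation the keys would presumably come from a prefix-filtered bucket listing (for example list_blobs(prefix=...) in the google-cloud-storage client).

```python
from typing import Iterator, List


def _dir_prefix(path: str) -> str:
    """Normalize to a single trailing slash so 'data' does not match 'database/x'."""
    return path.rstrip("/") + "/"


def isdir(keys: List[str], path: str) -> bool:
    """A path is a directory if at least one key lives under its prefix."""
    prefix = _dir_prefix(path)
    return any(key.startswith(prefix) for key in keys)


def exists(keys: List[str], path: str) -> bool:
    """A path exists if it is an object itself or a directory-like prefix."""
    return path in keys or isdir(keys, path)


def walk_files(keys: List[str], path: str) -> Iterator[str]:
    """Yield every object under the prefix, skipping 'dir/' placeholder keys."""
    prefix = _dir_prefix(path)
    for key in keys:
        if key.startswith(prefix) and not key.endswith("/"):
            yield key
```

Normalizing the trailing slash is the important design choice: without it, a check for "data" would also match unrelated keys such as "database/x.bin".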
