DVC list not working with remote google cloud storage #2309

Closed
tupui opened this issue Mar 9, 2021 · 16 comments · Fixed by #2317 or #2306

Comments

@tupui

tupui commented Mar 9, 2021

There are two things here:

  1. I expected to be able to just run dvc list and have it pick up the default remote.
  2. dvc list gs://... is not working (cloning is failing)

I can list the content by using dvc status --remote so I would like to be able to do the same with listing.

@skshetry skshetry transferred this issue from iterative/dvc.org Mar 9, 2021
@skshetry
Member

skshetry commented Mar 9, 2021

@tupui, dvc list is a virtual ls of your repo; it does not work with remotes at all. It provides a single virtual listing of your DVC-tracked and Git-tracked files.

Eg:

$ dvc list <path/to/your/repo>
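
To make the distinction concrete, here is a minimal Python sketch of a pre-check one could run before calling dvc list. The helper name `looks_like_repo_url` and the set of storage schemes are assumptions made for illustration; this is not DVC's own logic, only a heuristic that separates object-storage URLs such as gs://... from locations that could plausibly be a Git/DVC repo.

```python
from urllib.parse import urlparse

# Hypothetical set of URL schemes that point at object storage rather than
# at a repository. This set is an assumption for illustration, not DVC's.
STORAGE_SCHEMES = {"gs", "s3"}


def looks_like_repo_url(url: str) -> bool:
    """Return False for object-storage URLs, True for anything else."""
    scheme = urlparse(url).scheme.lower()
    return scheme not in STORAGE_SCHEMES


# Example inputs: a bucket path (hypothetical), the sample registry repo
# mentioned in this thread, and the current directory.
for candidate in (
    "gs://my-bucket/data",
    "https://github.com/iterative/dataset-registry",
    ".",
):
    kind = "repo-like" if looks_like_repo_url(candidate) else "storage remote"
    print(f"{candidate}: {kind}")
```

A check like this only classifies the argument; whether dvc list succeeds still depends on the location actually being a Git or DVC repository.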

@skshetry skshetry closed this as completed Mar 9, 2021
@skshetry skshetry reopened this Mar 9, 2021
@tupui
Author

tupui commented Mar 9, 2021

Ok but this is not clear from reading this: https://dvc.org/doc/use-cases/data-registries#listing-data

@efiop
Contributor

efiop commented Mar 9, 2021

@tupui That paragraph is part of a use case, which doesn't have to be thorough. What about https://dvc.org/doc/command-reference/list ? If you have any suggestions, please let us know.

@efiop efiop transferred this issue from iterative/dvc Mar 9, 2021
@tupui
Author

tupui commented Mar 9, 2021

@tupui That paragraph is part of a use case, which doesn't have to be thorough. What about https://dvc.org/doc/command-reference/list ? If you have any suggestions, please let us know.

Same from here: https://dvc.org/doc/command-reference/list#example-list-all-files-and-directories-in-a-data-registry

It is saying data-registry and for me my GCS is a data-registry. In any case, I find this behaviour to be not consistent.

@tupui
Author

tupui commented Mar 9, 2021

This is confusing for me:

The optional path argument is used to specify a directory to list within the source repository at url (including paths inside tracked directories). It's similar to providing a path to list to commands such as ls or aws s3 ls.

From this, I feel that I can have a list of the files at GCS too.

@jorgeorpinel
Contributor

jorgeorpinel commented Mar 16, 2021

Hi @tupui

not clear from reading this: https://dvc.org/doc/use-cases/data-registries#listing-data

That reads "To explore the contents of a DVC repo ... use the dvc list command", can you help us understand what part made you think list accepts arbitrary remote locations (e.g. gs://...) as arguments? Thanks

Same from here: https://dvc.org/doc/command-reference/list#example-list-all-files-and-directories-in-a-data-registry
It is saying data-registry and for me my GCS is a data-registry.

It opens with "Let's imagine a DVC repo used as a data registry" so it actually specifies what we mean by "data registry", which is a major DVC pattern we have documented (as a Use Case). And the URL is a Github repo (https://github.com/iterative/dataset-registry).

This is confusing

... within the source repository at url

Again it states that url is a source "repository".

But if you have a specific suggestion on how to improve the texts that confuse you, maybe that would be easier to address.

I find this behavior to be not consistent.

Can you elaborate on why please? Maybe you're thinking of a sort of dvc list-url command to match get-url and import-url? Thanks

@tupui
Author

tupui commented Mar 16, 2021

can you help us understand what part made you think list accepts arbitrary remote locations (e.g. gs://...) as arguments? Thanks

Well, why should this not be the case?

But if you have a specific suggestion on how to improve the texts that confuse you, maybe that would be easier to address.

Can you elaborate on why please? Maybe you're thinking of a sort of dvc list-url command to match get-url and import-url? Thanks

The confusion comes from the fact that you have an API for git related things which you call DVC repo and some other API for the data registry. The term DVC repo is confusing as for me it meant the data. Maybe just calling it Git repo would have been better. But now that I understand the difference better, it's hard to tell. Another way to address the issue would be to clearly have git in the command arguments, so that we know we are handling Git things and not data-related things.

@jorgeorpinel
Contributor

jorgeorpinel commented Mar 16, 2021

Why should this not be the case?

This answer doesn't help us understand why you expect that. dvc list simply isn't meant for this. But feel free to open a feature request for dvc list-url at https://github.com/iterative/dvc/issues/new/choose.

I find this behavior to be not consistent...
you have an API for git related things which you call DVC repo and some other API for the data registry
Another way to address the issue would be to clearly have git in the command arguments

Not sure about that. Sounds like this is a core discussion (cc @efiop sending back for now).

The term DVC repo is confusing as for me it meant the data.
Maybe just calling it Git repo

Thanks for the feedback, will consider the suggestion in #2306 (or a following update). We're also working on a basic concepts section (#550) which should hopefully help with that too. For now there are already pervasive tooltips around the docs e.g.

[image: screenshot of a docs tooltip]

Thanks

@jorgeorpinel jorgeorpinel transferred this issue from iterative/dvc.org Mar 16, 2021
@shcheklein
Member

@jorgeorpinel hey, what are the action points for this ticket in the DVC core? :)

@tupui
Author

tupui commented Mar 17, 2021

For the tooltips, I think the term should at least be made uniform: either DVC repo or DVC project in both cases. Also, for me this is not a DVC repo, as it is a repo with my code and with data. When you write DVC repo or project, I expect only DVC-related things to fall under this definition. Hence running dvc list would list DVC things.

you have an API for git related things which you call DVC repo and some other API for the data registry

On this I'm not sure we can compare the internal Python Repo API (not documented but available for use) with the dvc list/import/get commands. Not sure I got that right though.

Here I am not talking about the Python API sorry. I was referring to the CLI.

IMO, there are two ways to address this: either keep the ability to get info about Git things and about DVC-only things separately, or remove this distinction for the user.

@jorgeorpinel
Contributor

jorgeorpinel commented Mar 17, 2021

what are the action points for this ticket in the DVC core?

It looks like a discussion on the command behavior to me @shcheklein.

Title: dvc list not working with GC
Question: Should DVC list arbitrary remote contents, either with list or with list-url, for example?

For now the one doc suggestion I was able to find so far is already included in /pull/2306 (for now — that one may get split). @tupui's last comment does focus on docs since we guided the conversation in that direction, so I guess now it's a mix.

@shcheklein
Member

To my mind this ticket was about dvc list. There is no intention to change its behavior. And to be honest, I doubt we can list a remote in any meaningful way. So it feels that this is primarily about making the docs extra clear about this (if needed).

@jorgeorpinel
Contributor

jorgeorpinel commented Mar 17, 2021

OK, closing the core discussion / feature-request then. Moving to docs again to continue that part of the conversation.

p.s. The core points below don't have to be answered yet:

  • I find this behavior to be not consistent...
  • you have an API for git related things which you call DVC repo and some other API for the data registry
  • Another way to address the issue would be to clearly have git in the command arguments
  • Either keep the ability to get info about Git things and about DVC-only things separately, or remove this distinction for the user.

I'll focus on the docs side for now.

@jorgeorpinel jorgeorpinel transferred this issue from iterative/dvc Mar 17, 2021
@jorgeorpinel
Contributor

jorgeorpinel commented Mar 17, 2021

@tupui I'm trying to summarize but I still have some questions for you. Thanks in advance:

you have an API for git related things which you call DVC repo and some other API for the data registry

What do you mean by "an API for data registry"? Is there a link to docs where you found that? A data registry is a pattern (see https://dvc.org/doc/use-cases/data-registries) which can be implemented on DVC (or not).

👉 Maybe this is the main confusion (below) 👈

We do have a sample DVC repo that implements a data reg (https://github.com/iterative/dataset-registry), which is used in an example of the list reference (et al.). But the example specifies "Let's imagine a DVC repo used as a data registry".

(tooltips) should at least be made uniform: either DVC repo or DVC project in both cases

Indeed 👍 but it's tricky because we need both terms for different contexts. We are reviewing all the concepts/tooltips in #550 though, and thinking about this.

for me this is not a DVC repo as it is a repo with my code ... I expect to only have DVC related things under this definition... dvc list would list DVC things

I think I understand what you mean now 🙂 but going back to #2309 (comment), that's not what list does. The cmd ref opens with "List repository contents, including files ... tracked by DVC and by Git." (otherwise you'd have to git ls-remote also and combine the lists somehow). It also specifies url is a DVC or Git repo.

So thus far I can't find any other updates needed to the docs from this. If you have specific suggestions, please share them here.
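
As a rough illustration of the "virtual listing" idea described in this comment (one list that combines Git-tracked and DVC-tracked entries), here is a minimal Python sketch. The function name `virtual_listing` and the sample file names are hypothetical; this only shows the merge step on in-memory lists and is not how DVC implements dvc list.

```python
def virtual_listing(git_tracked, dvc_tracked):
    """Merge two collections of paths into one sorted listing.

    Returns (path, is_dvc_tracked) pairs so that the origin of each
    entry stays visible in the combined view.
    """
    dvc_set = set(dvc_tracked)
    merged = sorted(set(git_tracked) | dvc_set)
    return [(path, path in dvc_set) for path in merged]


# Hypothetical repo contents: Git tracks the code and the .dvc pointer
# file, while the data file itself is tracked by DVC.
git_files = ["README.md", "train.py", "data.xml.dvc"]
dvc_files = ["data.xml"]

for path, is_dvc in virtual_listing(git_files, dvc_files):
    marker = "  (DVC-tracked)" if is_dvc else ""
    print(f"{path}{marker}")
```

Keeping the origin flag next to each path is one way to present a single list without hiding which entries live in Git and which are data tracked by DVC.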

@jorgeorpinel
Contributor

UPDATE: I decided to update the list reference to make it more explicit. See e1a4435.
