-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
Hi,
I've been excited to explore DVC as I think it can offer a lot for some of the things I'm working on. I've had a couple of issues with dvc get and dvc get-url that I think are probably bugs. I've put them together to avoid creating lots of issues; apologies if you'd rather they were separate, I can move them into different threads.
I am testing as of v1.1.1.
dvc get does not recreate directories correctly
After v1.0 dvc get pointed to a directory does not result in correctly placed files. All of the files are placed in the root, rathe than subdirectories. I've only tested this with a git repository that is not a DVC repository. It worked in v0.94, but does not work correctly in v1.0 or v1.1. dvc import works correctly.
The following example gets this directory:
$ dvc get https://github.com/explosion/projects nel-emerson
$ tree nel-emerson/
nel-emerson/
├── README.md
├── __init__.py
├── el_recipe.py
├── el_tutorial.py
├── emerson_annotated_text.jsonl
├── emerson_input_text.txt
├── entities.csv
├── nel_schema.png
├── notebook_video.ipynb
└── requirements.txt
0 directories, 10 files
And here's dvc import, which works correctly:
$ dvc import https://github.com/explosion/projects nel-emerson
$ tree nel-emerson/
nel-emerson/
├── README.md
├── __init__.py
├── input
│ └── entities.csv
├── prodigy
│ ├── emerson_annotated_text.jsonl
│ └── emerson_input_text.txt
├── requirements.txt
└── scripts
├── el_recipe.py
├── el_tutorial.py
├── nel_schema.png
└── notebook_video.ipynb
3 directories, 10 files
dvc get-url fails on URLs with weak ETAGs
I understand why import-url needs the strong ETag, but I don't think it makes sense for get-url, so I suspect this is a bug. If so it's a regression after v1.0, as it didn't happen in v0.94.
$ dvc get-url https://raw.githubusercontent.com/explosion/projects/master/nel-emerson/requirements.txt
ERROR: failed to get 'https://raw.githubusercontent.com/explosion/projects/master/nel-emerson/requirements.txt' - Weak ETags are not supported. (Etag: 'W/"24c3b7f31292b04ac8fa9938c6e14baed94a089944652e8abf7dbb9459ae5b56"', URL: 'https://raw.githubusercontent.com/explosion/projects/master/nel-emerson/requirements.txt')
dvc get-url does not follow redirects
This is maybe more of a feature request than a bug report, although if it's expected behaviour, I think at least the error message needs to be different. I also think it's an important feature, as it's pretty common to have an https URL sitting in front of an AWS bucket or other file storage, as direct storage URLs don't allow much flexibility for moving data.
There's also a specific, very common scenario where this comes up: binary data such as trained models are often attached to Github releases as release artefacts. These artefacts are addressed as URLs that redirect to S3. For instance:
$ dvc get-url https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz
ERROR: failed to get 'https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz' - dependency 'https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz' does not exist
While wget works fine:
wget https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz
--2020-06-29 19:29:31-- https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz
Resolving github.com (github.com)... 140.82.118.4
Connecting to github.com (github.com)|140.82.118.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
...
Saving to: ‘ro_core_news_sm-2.3.1.tar.gz