
import-url/update: add --no-download flag #8024

Merged — 3 commits into iterative:main, Aug 23, 2022

Conversation

dtrifiro
Contributor

@dtrifiro dtrifiro commented Jul 18, 2022

Add a --no-download flag to dvc import-url/dvc import/dvc update to only create/update .dvc files without downloading the associated data.

Closes #7918

@dtrifiro dtrifiro changed the title Import-url no download import-url/update: add --no-download/--check-only flags Jul 18, 2022
dvc/commands/update.py (outdated review thread, resolved)
@dtrifiro dtrifiro changed the title import-url/update: add --no-download/--check-only flags import-url/update: add --no-download flag Jul 18, 2022
@dberenbaum
Contributor

@jorgeorpinel I would appreciate your feedback on the flag naming here whenever you can 🙏

@dtrifiro

This comment was marked as outdated.

@dberenbaum
Contributor

Nice @dtrifiro! I think the expected usage here is still a little unclear, so I'll try to clarify (cc @dmpetrov):

  1. dvc import-url --no-download foo should generate everything in the corresponding foo.dvc so that if I later run dvc pull, I can get foo without changing foo.dvc. Why this PR doesn't fully address it: a) it generates the deps info but not the outs, so actually downloading foo still changes foo.dvc; and b) there's no way to pull foo without having it saved to the DVC cache/remote (instead, it should act like dvc import where it can be pulled directly from source).
  2. There should be some way to check if the source has changed without touching foo.dvc. This could be in dvc update with a flag like --dry-run/--no-exec (see dvc import[-url] --no-exec) or in status with some flag (should dvc status --with-deps do this already? cc @skshetry). Why this PR doesn't fully address it: it still updates the deps in foo.dvc.

@dtrifiro dtrifiro marked this pull request as ready for review July 18, 2022 17:00
@dtrifiro dtrifiro requested a review from a team as a code owner July 18, 2022 17:00
@dtrifiro dtrifiro requested a review from pared July 18, 2022 17:00
Contributor

@jorgeorpinel jorgeorpinel left a comment


I'm suggesting the more general term "hash value" since it's what we use in https://dvc.org/doc/user-guide/project-structure/dvc-files#dependency-entries (includes md5, etag, and checksum fields), but I suppose "checksum" is the most accurate concept, so up to you (feel free to edit the suggestions).

dvc/commands/imp_url.py (outdated review thread, resolved)
dvc/commands/update.py (outdated review thread, resolved)
@dtrifiro
Contributor Author

@dberenbaum

  1. dvc import-url --no-download foo should generate everything in the corresponding foo.dvc so that if I later run dvc pull, I can get foo without changing foo.dvc. Why this PR doesn't fully address it: a) it generates the deps info but not the outs, so actually downloading foo still changes foo.dvc; and b) there's no way to pull foo without having it saved to the DVC cache/remote (instead, it should act like dvc import where it can be pulled directly from source).

Not sure I agree about generating outputs here. In order to actually write the out's hash (md5), we'd have to download (stream) the data, defeating (imo) the purpose of having a --no-download flag.
It is a bit strange to have this "intermediate" .dvc file containing deps info but not outs info, and this shows in status: the data is shown as missing in the cache (which I'd expect), but the checksum is also shown as changed (which feels odd at first, but makes sense, since the data is missing). Maybe this can be dealt with by adding a warning while waiting to implement a proper solution in data status? (cc @skshetry @efiop)

  1. There should be some way to check if the source has changed without touching foo.dvc. [...]

I could add a --dry-run (or --check-only?) flag in update, although I'm wondering whether it would be more appropriate to have this in data status as well?

@dberenbaum
Contributor

Not sure I agree about generating outputs here. In order to actually write the out's hash (md5), we'd have to download (stream) the data, defeating (imo) the purpose of having a --no-download flag.

How are we getting the checksum for the deps section without streaming it?

I see your point, but avoiding streaming the file is not a hard requirement here (that's what --no-exec is for). The only hard requirements are to capture the checksums for versioning, not save to disk, and be able to pull later (maybe --no-local is a better name than --no-download). Open to more discussion on this, but wanted to make clear the requirements.

Some other thoughts:

  • It's like import-url --to-remote, except we shouldn't need to push to the remote if we have a static dataset (or in the future, if we record the cloud version ID).
  • Does it make sense to think about import in addition to import-url?

I could add --dry-run (or --check-only?) flag in update, although I'm wondering whether it would more appropriate to have this in data status as well?

Good point, data status seems like an appropriate place for this. @skshetry What do you think (it shouldn't be part of the initial PR)?

@dtrifiro
Contributor Author

How are we getting the checksum for the deps section without streaming it?

We're not: depending on the remote, we use various types of metadata (md5, etag, checksum, ...) to check whether the remote file has changed and needs to be updated.
Running dvc update after the initial import-url --no-download actually downloads the file, computes the md5 and updates the .dvc file with the computed md5.

Some clouds do offer the possibility of getting the md5 checksum (directly or through the etag), but this is not always available/consistent (e.g. on S3 it depends on whether the bucket is encrypted, whether the file has been uploaded with multipart upload).
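To make the S3 caveat above concrete, here is a minimal sketch (standalone illustration, not DVC code; the helper names are made up) of why an S3 ETag often cannot stand in for the object's md5: for multipart uploads, the ETag is the md5 of the concatenated per-part md5 digests with a `-<nparts>` suffix, not the md5 of the data itself.

```python
import hashlib


def plain_md5(data: bytes) -> str:
    # md5 of the whole object -- the value DVC would record in the outs section
    return hashlib.md5(data).hexdigest()


def multipart_etag(data: bytes, part_size: int) -> str:
    # S3 multipart-upload ETag: md5 of the concatenated per-part md5 digests,
    # suffixed with the number of parts -- NOT the md5 of the object
    parts = [data[i : i + part_size] for i in range(0, len(data), part_size)]
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"


data = b"hello world"
print(plain_md5(data))           # 5eb63bbbe01eeed093cb22bb8f5acdc3
print(multipart_etag(data, 4))   # different hex string, ending in "-3"
```

So even when a cloud does expose an ETag, it only matches the md5 in the single-part, unencrypted case, which is why the deps metadata alone cannot fill in the outs hash.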

The only hard requirements are to capture the checksums for versioning, not save to disk, and be able to pull later.

This works already, although one has to use dvc update, as dvc pull will not currently work since the md5 hash is not saved in the outs. Would this work?

When running dvc status after dvc import-url --no-download ..., we get something like this:

WARNING: stage: 'data.dvc' is frozen. Its dependencies are not going to be shown in the status output.
testfiles3.dvc:
	changed outs:
		deleted:            data
	changed checksum

Streaming the file to compute the md5 would get rid of changed checksum and make it possible to use dvc pull.
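The streaming idea above can be sketched as follows (a hypothetical helper, not DVC's actual implementation): hash the bytes as they pass through, never persisting them, so the resulting md5 could populate the outs section without a real download to the workspace.

```python
import hashlib
import io


def md5_of_stream(fobj, chunk_size: int = 1 << 20) -> str:
    # Hash the data as it streams by, without writing it to disk;
    # the result is what would go into the outs section of the .dvc file.
    h = hashlib.md5()
    for chunk in iter(lambda: fobj.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()


# In practice fobj would be the remote file's response body;
# io.BytesIO stands in for it here.
print(md5_of_stream(io.BytesIO(b"example data")))
```

Note this still pays the full transfer cost, which is the trade-off debated above.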

@dberenbaum
Contributor

Thanks @dtrifiro!

The only hard requirements are to capture the checksums for versioning, not save to disk, and be able to pull later.

This works already, although one has to use dvc update as dvc pull will not currently work since the md5 hash is not saved in the outs. Would this work?

It seems like update isn't right here since I don't want to get the latest version but the frozen version I imported.

If we start saving the cloud version ID, a typical workflow would be:

  1. Use Studio to add a model via import-url --no-download.
  2. Overwrite that model with a second version on the cloud.
  3. Get a copy of the first version that was added via import-url --no-download.

Is there a way to accomplish 3? Or, since we don't have cloud version ID support yet, is there a way to try to get the file and fail if it doesn't match the checksum?

When running dvc status after dvc import-url --no-download ..., we get something like this:

WARNING: stage: 'data.dvc' is frozen. Its dependencies are not going to be shown in the status output.
testfiles3.dvc:
	changed outs:
		deleted:            data
	changed checksum

Streaming the file to compute the md5, would get rid of changed checksum and make it possible to use dvc pull.

We at least need a way to show when the source contents change, which is what's missing now.

@dtrifiro dtrifiro marked this pull request as draft July 27, 2022 13:02
@dberenbaum
Contributor

Discussed with @dtrifiro that there are several options.

  1. Use dvc repro and fail if etags don't match. However, if a user does dvc pull, they will get an error that the file can't be found, which could be unexpected.
  2. Use dvc pull to do the same but without relying on md5 lookup.
  3. Add a new flag to import-url like --download/--finish to complete the import, but this introduces a completely different syntax.

We can show how 1 works to confirm the workflow makes sense, and then prototype 2 since it seems users will expect this to work in dvc pull.

This touches on aspects of the cloud versioning discussion:

  1. Needing cloud version IDs to authoritatively track the source data.
  2. Needing to pull files without a cache entry.

cc @efiop @dmpetrov

@dtrifiro
Contributor Author

dtrifiro commented Aug 3, 2022

I'm more inclined to get 1 and 2 (repro/pull) working. repro is easy, although I'm hoping to find a non-hacky way to also get it to work with pull.

As @dberenbaum mentioned, getting pull to work is closely related to the cloud versioning scenario: in both cases we'd need some other logic to retrieve files when we don't have them in an ObjectDB cache.

@dmpetrov
Member

dmpetrov commented Aug 5, 2022

we'd have to download (stream) the data, defeating (imo) the purpose of having a --no-download flag.

Yes, not spending time on the download is the core requirement here. I see 3 options:

  1. Avoid using md5s for such imports. Fully rely on etags.
  2. dvc pull modifies the .dvc file.
  3. dvc pull does not modify the .dvc file but temporarily saves it somewhere.

(1) is preferred. (2) & (3) are ok options.

Running dvc update after the initial import-url --no-download actually downloads the file, computes the md5 and updates the .dvc file with the computed md5.

Yes dvc import-url = dvc import-url --no-download & dvc update

Also, we need to be careful with dvc status and dvc exp run as Dave mentioned.

@dberenbaum
Contributor

Yes dvc import-url = dvc import-url --no-download & dvc update

@dmpetrov I think we want dvc import-url = dvc import-url --no-download & dvc pull? In other words, we don't want to update the etag if it has changed. Instead, we want to fail in that case, right?

@dtrifiro dtrifiro force-pushed the import-url-no-download branch 3 times, most recently from 3e35588 to 8145d6f Compare August 8, 2022 16:47
@dtrifiro dtrifiro marked this pull request as ready for review August 13, 2022 20:38
@efiop efiop self-requested a review August 17, 2022 11:12
dvc/commands/data_sync.py (outdated review thread, resolved)
dvc/stage/imports.py (outdated review thread, resolved)
@efiop
Member

efiop commented Aug 23, 2022

@pmrowla Could you please take a look as well? Since you are working on #8164 which is related.

add --no-download flag to dvc import-url/dvc update to only create/update .dvc
files without downloading the associated data.

Created .dvc files can be fetched using `dvc pull` or can be
updated using `dvc update --no-download`.
@pmrowla
Contributor

pmrowla commented Aug 23, 2022

What is the expected behavior of dvc commit when a repo has partial imports? Currently it will fail because there is no output file in the workspace:

$ cat file.dvc
md5: 750c9e509a218cbbcde864ce15515c0d
frozen: true
deps:
- etag: '0x8DA79D118EC7839'
  size: 2
  path: remote://azure/dir/file
outs:
- path: file

$ dvc commit
outputs ['file'] of stage: 'file.dvc' changed. Are you sure you want to commit it? [y/n] y
ERROR: failed to commit - output 'file' does not exist

It seems to me that committing a partial import should just be a no-op and silently pass?

@pmrowla
Contributor

pmrowla commented Aug 23, 2022

Also, this is semi-related to #8164, but IMO using dvc pull here to download and complete/un-partial the imports gets confusing, because subsequent dvc pulls do not download the import from its original source location. So dvc pull only works as a substitute for dvc-update-with-etag-verification that one time, and for all subsequent downloads the user has to use dvc update anyway.

DVC will try to pull it from a regular remote (and dvc pull will outright fail if your repo does not have a remote configured).

Basically, given this scenario, I don't think it's obvious to the user why the 2nd and 3rd dvc pulls fail but the final dvc update succeeds:

$ dvc import-url "https://raw.githubusercontent.com/iterative/dvc/main/README.rst" --no-download
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'
...

# pull to complete the import (succeeds, downloads from original https://... source URL)
$ dvc pull
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'
1 file fetched

# remove the output and cache
$ rm -rf README.rst .dvc/cache

# pull with no remote configured (fails due to no remote configured)
$ dvc pull
ERROR: failed to pull data from the cloud - config file error: no remote specified. Create a default remote with
    dvc remote add -d <remote name> <remote url>

# add a dummy remote
$ dvc remote add -d empty-remote ../empty
Setting 'empty-remote' as a default remote.

# pull with remote configured (fails because it tries to download cached object from dummy remote and not original import URL)
$ dvc pull
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: README.rst, md5: ff649e3dae3038e81076e6e37dc7f57f
1 file failed
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
/Users/pmrowla/git/scratch/tmp/README.rst
Is your cache up to date?
<https://error.dvc.org/missing-files>

# use update (succeeds, downloads latest version from original https://... source URL)
$ dvc update README.rst.dvc
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'

@dberenbaum
Contributor

What is the expected behavior of dvc commit when a repo has partial imports? Currently it will fail because there is no output file in the workspace

$ cat file.dvc
md5: 750c9e509a218cbbcde864ce15515c0d
frozen: true
deps:
- etag: '0x8DA79D118EC7839'
  size: 2
  path: remote://azure/dir/file
outs:
- path: file

$ dvc commit
outputs ['file'] of stage: 'file.dvc' changed. Are you sure you want to commit it? [y/n] y
ERROR: failed to commit - output 'file' does not exist

It seems to me that committing a partial import should just be a no-op and silently pass?

What if there was something in the workspace?

@pmrowla
Contributor

pmrowla commented Aug 23, 2022

What if there was something in the workspace?

It will give you the "output has changed, are you sure you want to commit" prompt, and if you commit, we overwrite the outs section of the import with whatever is in the workspace. This is the same behavior you get if you try to commit existing imports, though (whether or not you are using --no-download/partial imports).

@efiop
Member

efiop commented Aug 23, 2022

For the record: discussed with @dberenbaum and @pmrowla that the commit part should be fixed in this PR, but the pull-related logic is more about import-url in general and we can get back to it in a followup.

@dberenbaum
Contributor

@efiop Is commit logic even needed for this PR? Since commit fails now, it shouldn't introduce any breaking changes if we later start making commit a noop.

@efiop
Member

efiop commented Aug 23, 2022

@dberenbaum Good point, for some reason I thought this wasn't broken before. Yeah, let's merge and move on then.

Thanks for a great discussion folks, the initial implementation seemed deceptively simple, but you did a great job here figuring it out properly. 🙏

@efiop efiop merged commit 26008f1 into iterative:main Aug 23, 2022
Comment on lines +38 to +39
if deps == stage.deps:
stage.outs = outs
Member


I am not sure what this is doing here.

Contributor Author


A few lines above we clear the stage's outs (delete hash_info, meta, and obj from the out).
This is important because when using no_download == True, old outputs could be retained (even if deps are changed).
If the deps did not change, we can restore the previous outs since they did not change either.

Member


But the .clear() mutates the output itself, right? Aren't stage.outs[0] and outs[0] the same instance here?
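The aliasing concern raised here can be shown with a toy model (the `Out` class below is a hypothetical stand-in, not DVC's actual class): a shallow copy of the outs list still shares the same `Out` instances, so clearing an output through one list is visible through the other, and "restoring" the saved list would restore already-cleared objects.

```python
class Out:
    # Toy stand-in for a DVC output entry
    def __init__(self, hash_info):
        self.hash_info = hash_info

    def clear(self):
        # Wipe the recorded hash, as the real code wipes hash_info/meta/obj
        self.hash_info = None


outs = [Out("md5: abc123")]
stage_outs = list(outs)   # shallow copy: a new list, but the same Out objects

stage_outs[0].clear()     # mutates the single shared instance

print(outs[0].hash_info)  # None -- the "saved" outs were cleared too
```

A deep copy (or saving the hash values themselves) would be needed for the saved list to actually preserve the old state.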

stage.save_deps()
stage.deps[0].download(stage.outs[0], jobs=jobs)
if check_changed and not old_hash_info == stage.deps[0].hash_info:
Member


For the record (unrelated to this PR, which does it correctly for now): in the near future this should really be a meta check rather than a hash_info check, since etags are not really hashes.

Comment on lines +471 to +473
partial_imports = chain.from_iterable(
self.index.partial_imports(targets, recursive=recursive)
for _ in self.brancher(
Contributor


One other thing to note is that this change makes us call brancher more than once on fetch - first to collect regular/cached used objects and then a second time to collect partial imports. This is expensive if we are fetching multiple revs (i.e. the user uses --all-commits).

Ideally we want to walk the repo once per revision, and collect everything we need in that one pass.

failed = 0
for stage in repo.partial_imports(targets, **kwargs):
try:
stage.run()
Contributor


Using run()/sync_import() here means that this is always a pull and not really a fetch. The import will be downloaded into the workspace and then saved/committed to cache, but the output will remain "checked out" in the workspace, even if the user ran fetch.

In this workflow I would not expect file to be present in my workspace (it should only be in .dvc/cache, to be checked out later):

$ dvc import-url "azure://test-container/dir/file" --no-download
Importing 'azure://test-container/dir/file' -> 'file'
$ ls
file.dvc  tags
$ dvc fetch
Importing 'azure://test-container/dir/file' -> 'file'
1 file fetched
$ ls
file  file.dvc  tags

Labels
A: data-sync Related to dvc get/fetch/import/pull/push

Successfully merging this pull request may close these issues.

Import-url no download
8 participants