
import-url/update: add --no-download flag #8024

Merged — 3 commits into iterative:main, Aug 23, 2022

Conversation

dtrifiro
Contributor

@dtrifiro dtrifiro commented Jul 18, 2022

Add a --no-download flag to dvc import-url/dvc import/dvc update to only create/update .dvc files without downloading the associated data.

Closes #7918

@dtrifiro dtrifiro changed the title Import-url no download import-url/update: add --no-download/--check-only flags Jul 18, 2022
dvc/commands/update.py (outdated review thread, resolved)
@dtrifiro dtrifiro changed the title import-url/update: add --no-download/--check-only flags import-url/update: add --no-download flag Jul 18, 2022
@dberenbaum
Contributor

@jorgeorpinel I would appreciate your feedback on the flag naming here whenever you can 🙏

@dtrifiro

This comment was marked as outdated.

@dberenbaum
Contributor

Nice @dtrifiro! I think the expected usage here is still a little unclear, so I'll try to clarify (cc @dmpetrov):

  1. dvc import-url --no-download foo should generate everything in the corresponding foo.dvc so that if I later run dvc pull, I can get foo without changing foo.dvc. Why this PR doesn't fully address it: a) it generates the deps info but not the outs, so actually downloading foo still changes foo.dvc; and b) there's no way to pull foo without having it saved to the DVC cache/remote (instead, it should act like dvc import where it can be pulled directly from source).
  2. There should be some way to check if the source has changed without touching foo.dvc. This could be in dvc update with a flag like --dry-run/--no-exec (see dvc import[-url] --no-exec) or in status with some flag (should dvc status --with-deps do this already? cc @skshetry). Why this PR doesn't fully address it: it still updates the deps in foo.dvc.

@dtrifiro dtrifiro marked this pull request as ready for review July 18, 2022 17:00
@dtrifiro dtrifiro requested a review from a team as a code owner July 18, 2022 17:00
@dtrifiro dtrifiro requested a review from pared July 18, 2022 17:00
Contributor

@jorgeorpinel jorgeorpinel left a comment


I'm suggesting the more general term "hash value" since it's what we use in https://dvc.org/doc/user-guide/project-structure/dvc-files#dependency-entries (includes md5, etag, and checksum fields), but I suppose "checksum" is the most accurate concept, so up to you (feel free to edit the suggestions).

dvc/commands/imp_url.py (outdated review thread, resolved)
dvc/commands/update.py (outdated review thread, resolved)
@dtrifiro
Contributor Author

@dberenbaum

  1. dvc import-url --no-download foo should generate everything in the corresponding foo.dvc so that if I later run dvc pull, I can get foo without changing foo.dvc. Why this PR doesn't fully address it: a) it generates the deps info but not the outs, so actually downloading foo still changes foo.dvc; and b) there's no way to pull foo without having it saved to the DVC cache/remote (instead, it should act like dvc import where it can be pulled directly from source).

Not sure I agree about generating outputs here. In order to actually write the out's hash (md5), we'd have to download (stream) the data, defeating (imo) the purpose of having a --no-download flag.
It is a bit strange to have this "intermediate" .dvc file containing deps info but not outs info, and this shows in status: the data is shown as missing in the cache (which I'd expect), but the checksum is also shown as changed (which feels odd at first, but makes sense, since the data is missing). Maybe this can be dealt with by adding a warning while waiting to implement a proper solution in data status? (cc @skshetry @efiop)

  1. There should be some way to check if the source has changed without touching foo.dvc. [...]

I could add a --dry-run (or --check-only?) flag in update, although I'm wondering whether it would be more appropriate to have this in data status as well?

@dberenbaum
Contributor

Not sure I agree about generating outputs here. In order to actually write the out's hash (md5), we'd have to download (stream) the data, defeating (imo) the purpose of having a --no-download flag.

How are we getting the checksum for the deps section without streaming it?

I see your point, but avoiding streaming the file is not a hard requirement here (that's what --no-exec is for). The only hard requirements are to capture the checksums for versioning, not save to disk, and be able to pull later (maybe --no-local is a better name than --no-download). Open to more discussion on this, but wanted to make clear the requirements.

Some other thoughts:

  • It's like import-url --to-remote, except we shouldn't need to push to the remote if we have a static dataset (or in the future, if we record the cloud version ID).
  • Does it make sense to think about import in addition to import-url?

I could add --dry-run (or --check-only?) flag in update, although I'm wondering whether it would more appropriate to have this in data status as well?

Good point, data status seems like an appropriate place for this. @skshetry What do you think (it shouldn't be part of the initial PR)?

@dtrifiro
Contributor Author

How are we getting the checksum for the deps section without streaming it?

We're not: depending on the remote, we use various types of metadata (md5, etag, checksum, ...) to check whether the remote file has changed and needs to be updated.
Running dvc update after the initial import-url --no-download actually downloads the file, computes the md5 and updates the .dvc file with the computed md5.

Some clouds do offer the possibility of getting the md5 checksum (directly or through the etag), but this is not always available/consistent (e.g. on S3 it depends on whether the bucket is encrypted, whether the file has been uploaded with multipart upload).
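To make the S3 caveat above concrete, here is a minimal sketch (standalone illustration, not DVC code; the helper names are made up) of why an S3 ETag often cannot stand in for the object's md5: for multipart uploads, the ETag is the md5 of the concatenated per-part md5 digests with a `-<nparts>` suffix, not the md5 of the data itself.

```python
import hashlib


def plain_md5(data: bytes) -> str:
    # md5 of the whole object -- the value DVC would record in the outs section
    return hashlib.md5(data).hexdigest()


def multipart_etag(data: bytes, part_size: int) -> str:
    # S3 multipart-upload ETag: md5 of the concatenated per-part md5 digests,
    # suffixed with the number of parts -- NOT the md5 of the object
    parts = [data[i : i + part_size] for i in range(0, len(data), part_size)]
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"


data = b"hello world"
print(plain_md5(data))           # 5eb63bbbe01eeed093cb22bb8f5acdc3
print(multipart_etag(data, 4))   # different hex string, ending in "-3"
```

So even when a cloud does expose an ETag, it only matches the md5 in the single-part, unencrypted case, which is why the deps metadata alone cannot fill in the outs hash.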

The only hard requirements are to capture the checksums for versioning, not save to disk, and be able to pull later.

This works already, although one has to use dvc update, as dvc pull will not currently work since the md5 hash is not saved in the outs. Would this work?

When running dvc status after dvc import-url --no-download ..., we get something like this:

WARNING: stage: 'data.dvc' is frozen. Its dependencies are not going to be shown in the status output.
testfiles3.dvc:
	changed outs:
		deleted:            data
	changed checksum

Streaming the file to compute the md5 would get rid of changed checksum and make it possible to use dvc pull.
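The streaming idea above can be sketched as follows (a hypothetical helper, not DVC's actual implementation): hash the bytes as they pass through, never persisting them, so the resulting md5 could populate the outs section without a real download to the workspace.

```python
import hashlib
import io


def md5_of_stream(fobj, chunk_size: int = 1 << 20) -> str:
    # Hash the data as it streams by, without writing it to disk;
    # the result is what would go into the outs section of the .dvc file.
    h = hashlib.md5()
    for chunk in iter(lambda: fobj.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()


# In practice fobj would be the remote file's response body;
# io.BytesIO stands in for it here.
print(md5_of_stream(io.BytesIO(b"example data")))
```

Note this still pays the full transfer cost, which is the trade-off debated above.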

@dberenbaum
Contributor

Thanks @dtrifiro!

The only hard requirements are to capture the checksums for versioning, not save to disk, and be able to pull later.

This works already, although one has to use dvc update as dvc pull will not currently work since the md5 hash is not saved in the outs. Would this work?

It seems like update isn't right here since I don't want to get the latest version but the frozen version I imported.

If we start saving the cloud version ID, a typical workflow would be:

  1. Use Studio to add a model via import-url --no-download.
  2. Overwrite that model with a second version on the cloud.
  3. Get a copy of the first version that was added via import-url --no-download.

Is there a way to accomplish 3? Or, since we don't have cloud version ID support yet, is there a way to try to get the file and fail if it doesn't match the checksum?

When running dvc status after dvc import-url --no-download ..., we get something like this:

WARNING: stage: 'data.dvc' is frozen. Its dependencies are not going to be shown in the status output.
testfiles3.dvc:
	changed outs:
		deleted:            data
	changed checksum

Streaming the file to compute the md5, would get rid of changed checksum and make it possible to use dvc pull.

We at least need a way to show when the source contents change, which is what's missing now.

@dtrifiro dtrifiro marked this pull request as draft July 27, 2022 13:02
@dberenbaum
Contributor

Discussed with @dtrifiro that there are several options.

  1. Use dvc repro and fail if etags don't match. However, if a user does dvc pull, they will get an error that the file can't be found, which could be unexpected.
  2. Use dvc pull to do the same but without relying on md5 lookup.
  3. Add a new flag to import-url like --download/--finish to complete the import, but this introduces a completely different syntax.

We can show how 1 works to confirm the workflow makes sense, and then prototype 2 since it seems users will expect this to work in dvc pull.

This touches on aspects of the cloud versioning discussion:

  1. Needing cloud version IDs to authoritatively track the source data.
  2. Needing to pull files without a cache entry.

cc @efiop @dmpetrov

@dtrifiro
Contributor Author

dtrifiro commented Aug 3, 2022

I'm more inclined to get 1 and 2 (repro/pull) working. repro is easy, although I'm hoping to find a non-hacky way to also get it to work with pull.

As @dberenbaum mentioned, getting pull to work is closely related to the cloud versioning scenario: in both cases we'd need some other logic to retrieve files when we don't have them in an ObjectDB cache.

@dmpetrov
Member

dmpetrov commented Aug 5, 2022

we'd have to download (stream) the data, defeating (imo) the purpose of having a --no-download flag.

Yes, not spending time on the download is the core requirement here. I see 3 options:

  1. Avoid using md5s for such imports. Fully rely on etags.
  2. dvc pull modifies the .dvc file.
  3. dvc pull does not modify the .dvc file but temporarily saves it somewhere.

(1) is preferred. (2) & (3) are ok options.

Running dvc update after the initial import-url --no-download actually downloads the file, computes the md5 and updates the .dvc file with the computed md5.

Yes dvc import-url = dvc import-url --no-download & dvc update

Also, we need to be careful with dvc status and dvc exp run as Dave mentioned.

@dberenbaum
Contributor

Yes dvc import-url = dvc import-url --no-download & dvc update

@dmpetrov I think we want dvc import-url = dvc import-url --no-download & dvc pull? In other words, we don't want to update the etag if it has changed. Instead, we want to fail in that case, right?

@dtrifiro dtrifiro force-pushed the import-url-no-download branch 3 times, most recently from 3e35588 to 8145d6f Compare August 8, 2022 16:47
@dtrifiro dtrifiro marked this pull request as ready for review August 13, 2022 20:38
@efiop efiop self-requested a review August 17, 2022 11:12
dvc/commands/data_sync.py (outdated review thread, resolved)
dvc/stage/imports.py (outdated review thread, resolved)
@efiop
Member

efiop commented Aug 23, 2022

@pmrowla Could you please take a look as well? Since you are working on #8164 which is related.

add --no-download flag to dvc import-url/dvc update to only create/update .dvc
files without downloading the associated data.

Created .dvc files can be fetched using `dvc pull` or can be
updated using `dvc update --no-download`.
@pmrowla
Contributor

pmrowla commented Aug 23, 2022

What is the expected behavior of dvc commit when a repo has partial imports? Currently it will fail because there is no output file in the workspace:

$ cat file.dvc
md5: 750c9e509a218cbbcde864ce15515c0d
frozen: true
deps:
- etag: '0x8DA79D118EC7839'
  size: 2
  path: remote://azure/dir/file
outs:
- path: file

$ dvc commit
outputs ['file'] of stage: 'file.dvc' changed. Are you sure you want to commit it? [y/n] y
ERROR: failed to commit - output 'file' does not exist

It seems to me that committing a partial import should just be a no-op and silently pass?

@pmrowla
Contributor

pmrowla commented Aug 23, 2022

Also, this is semi-related to #8164, but IMO using dvc pull here to download and complete/un-partial the imports gets confusing, because subsequent dvc pulls do not download the import from its original source location. So dvc pull only works as a substitute for dvc-update-with-etag-verification that one time, and for all subsequent downloads the user has to use dvc update anyway.

DVC will try to pull it from a regular remote (and dvc pull will outright fail if your repo does not have a remote configured).

Basically, given this scenario, I don't think it's obvious to the user why the 2nd and 3rd dvc pulls fail but the final dvc update succeeds:

$ dvc import-url "https://raw.githubusercontent.com/iterative/dvc/main/README.rst" --no-download
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'
...

# pull to complete the import (succeeds, downloads from original https://... source URL)
$ dvc pull
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'
1 file fetched

# remove the output and cache
$ rm -rf README.rst .dvc/cache

# pull with no remote configured (fails due to no remote configured)
$ dvc pull
ERROR: failed to pull data from the cloud - config file error: no remote specified. Create a default remote with
    dvc remote add -d <remote name> <remote url>

# add a dummy remote
$ dvc remote add -d empty-remote ../empty
Setting 'empty-remote' as a default remote.

# pull with remote configured (fails because it tries to download cached object from dummy remote and not original import URL)
$ dvc pull
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: README.rst, md5: ff649e3dae3038e81076e6e37dc7f57f
1 file failed
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
/Users/pmrowla/git/scratch/tmp/README.rst
Is your cache up to date?
<https://error.dvc.org/missing-files>

# use update (succeeds, downloads latest version from original https://... source URL)
$ dvc update README.rst.dvc
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'

@dberenbaum
Contributor

What is the expected behavior of dvc commit when a repo has partial imports? Currently it will fail because there is no output file in the workspace

$ cat file.dvc
md5: 750c9e509a218cbbcde864ce15515c0d
frozen: true
deps:
- etag: '0x8DA79D118EC7839'
  size: 2
  path: remote://azure/dir/file
outs:
- path: file

$ dvc commit
outputs ['file'] of stage: 'file.dvc' changed. Are you sure you want to commit it? [y/n] y
ERROR: failed to commit - output 'file' does not exist

It seems to me that committing a partial import should just be a no-op and silently pass?

What if there was something in the workspace?

@pmrowla
Contributor

pmrowla commented Aug 23, 2022

What if there was something in the workspace?

It will give you the "output has changed, are you sure you want to commit" prompt, and if you commit, we overwrite the outs section of the import with whatever is in the workspace. This is the same behavior you get if you try to commit existing imports, though (whether or not you are using --no-download/partial imports).

@efiop
Member

efiop commented Aug 23, 2022

For the record: discussed with @dberenbaum and @pmrowla that the commit part should be fixed in this PR, but the pull-related logic is more about import-url in general and we can get back to it in a followup.

@dberenbaum
Contributor

@efiop Is commit logic even needed for this PR? Since commit fails now, it shouldn't introduce any breaking changes if we later start making commit a noop.

@efiop
Member

efiop commented Aug 23, 2022

@dberenbaum Good point, for some reason I thought this wasn't broken before. Yeah, let's merge and move on then.

Thanks for a great discussion folks, the initial implementation seemed deceptively simple, but you did a great job here figuring it out properly. 🙏

@efiop efiop merged commit 26008f1 into iterative:main Aug 23, 2022
Comment on lines +38 to +39
if deps == stage.deps:
stage.outs = outs
Member


I am not sure what this is doing here.

Contributor Author


A few lines above we clear the stage's outs (delete hash_info, meta, and obj from the out).
This is important because when using no_download == True, old outputs could be retained (even if deps are changed).
If the deps did not change, we can restore the previous outs since they did not change either.

Member


But the .clear() mutates the output itself, right? Aren't stage.outs[0] and outs[0] the same instance here?
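The aliasing concern raised here can be shown with a toy model (the `Out` class below is a hypothetical stand-in, not DVC's actual class): a shallow copy of the outs list still shares the same `Out` instances, so clearing an output through one list is visible through the other, and "restoring" the saved list would restore already-cleared objects.

```python
class Out:
    # Toy stand-in for a DVC output entry
    def __init__(self, hash_info):
        self.hash_info = hash_info

    def clear(self):
        # Wipe the recorded hash, as the real code wipes hash_info/meta/obj
        self.hash_info = None


outs = [Out("md5: abc123")]
stage_outs = list(outs)   # shallow copy: a new list, but the same Out objects

stage_outs[0].clear()     # mutates the single shared instance

print(outs[0].hash_info)  # None -- the "saved" outs were cleared too
```

A deep copy (or saving the hash values themselves) would be needed for the saved list to actually preserve the old state.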

stage.save_deps()
stage.deps[0].download(stage.outs[0], jobs=jobs)
if check_changed and not old_hash_info == stage.deps[0].hash_info:
Member


For the record (unrelated to this PR, which does it correctly for now): in the near future this should really be a meta check rather than a hash_info check, since etags are not really hashes.

Comment on lines +471 to +473
partial_imports = chain.from_iterable(
self.index.partial_imports(targets, recursive=recursive)
for _ in self.brancher(
Contributor


One other thing to note is that this change makes us call brancher more than once on fetch - first to collect regular/cached used objects and then a second time to collect partial imports. This is expensive if we are fetching multiple revs (i.e. the user uses --all-commits).

Ideally we want to walk the repo once per revision, and collect everything we need in that one pass.

failed = 0
for stage in repo.partial_imports(targets, **kwargs):
try:
stage.run()
Contributor


Using run()/sync_import() here means that this is always a pull and not really a fetch. The import will be downloaded into the workspace and then saved/committed to cache, but the output will remain "checked out" in the workspace, even if the user ran fetch.

In this workflow I would not expect file to be present in my workspace (it should only be in .dvc/cache, to be checked out later):

$ dvc import-url "azure://test-container/dir/file" --no-download
Importing 'azure://test-container/dir/file' -> 'file'
$ ls
file.dvc  tags
$ dvc fetch
Importing 'azure://test-container/dir/file' -> 'file'
1 file fetched
$ ls
file  file.dvc  tags

Labels
A: data-sync Related to dvc get/fetch/import/pull/push

Successfully merging this pull request may close these issues.

Import-url no download
8 participants