Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update: how to update specific rev import? Clarify docs. #735

Closed
jorgeorpinel opened this issue Oct 22, 2019 · 21 comments · Fixed by #679
Closed

update: how to update specific rev import? Clarify docs. #735

jorgeorpinel opened this issue Oct 22, 2019 · 21 comments · Fixed by #679
Labels
A: docs Area: user documentation (gatsby-theme-iterative)

Comments

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Oct 22, 2019

UPDATE: See #735 (comment) and under for concepts that should be better explained in docs.


$ dvc version
DVC version: 0.63.4
Python version: 3.7.4
Platform: Darwin-19.0.0-x86_64-i386-64bit
Binary: False
Cache: reflink - True, hardlink - True, symlink - True
Filesystem type (cache directory): ('apfs', '/dev/disk1s5')
Filesystem type (workspace): ('apfs', '/dev/disk1s5')

$ mkdir test && cd test
$ dvc init --no-scm
...
$ dvc import --rev cats-dogs-v1 \
             git@github.com:iterative/dataset-registry.git \
             use-cases/cats-dogs
Importing 'use-cases/cats-dogs (git@github.com:iterative/dataset-registry.git)' -> 'cats-dogs'
Output 'cats-dogs' didn't change. Skipping saving.                                                                                                           
Saving information to 'cats-dogs.dvc'.
$ ls
total 8
drwxr-xr-x  3 usr  staff    96B Oct 22 14:05 cats-dogs/
-rw-r--r--  1 usr  staff   340B Oct 22 14:05 cats-dogs.dvc
$ tree cats-dogs --filelimit 3      
cats-dogs
└── data
    ├── train
    │   ├── cats [500 entries exceeds filelimit, not opening dir]
    │   └── dogs [500 entries exceeds filelimit, not opening dir]
...
$ dvc update cats-dogs.dvc
Saving information to 'cats-dogs.dvc'.
$ tree cats-dogs --filelimit 3
cats-dogs
└── data
    ├── train
    │   ├── cats [500 entries exceeds filelimit, not opening dir]
    │   └── dogs [500 entries exceeds filelimit, not opening dir]
...

Notice how the last command still counts 500 images each in cats-dogs/data/train cats/ and dogs/, however the latest version of cats-dogs.dvc in https://github.com/iterative/dataset-registry/tree/master/use-cases should track 1000 files in each.

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Oct 22, 2019

To confirm that all the files are on the remote, I go to my local copy of the dataset-registry repo and run:

$ git remote -v             
origin	git@github.com:iterative/dataset-registry.git (fetch)
origin	git@github.com:iterative/dataset-registry.git (push)
$ git rev-parse --short HEAD
99d1cdb
$ tree use-cases/cats-dogs --filelimit 3
use-cases/cats-dogs
└── data
    ├── train
    │   ├── cats [1000 entries exceeds filelimit, not opening dir]
    │   └── dogs [1000 entries exceeds filelimit, not opening dir]
...
$ dvc push                              
Everything is up to date.                                                                                                                                    

Note that the HEAD Git commit is iterative/dataset-registry@99d1cdb, same as the tip of master in that repo rn:

image

https://github.com/iterative/dataset-registry/commits/master

@efiop efiop self-assigned this Oct 22, 2019
@efiop
Copy link
Contributor

efiop commented Oct 22, 2019

Can reproduce. Investigating.

@efiop

This comment has been minimized.

@efiop
Copy link
Contributor

efiop commented Oct 22, 2019

Ah, alright, I've missed this: cats-dogs-v1 is a tag that doesn't go anywhere. So update for that tag is not possible, as there is no newer version for it, it is not a branch. So this is not a bug.

@efiop efiop closed this as completed Oct 22, 2019
@efiop
Copy link
Contributor

efiop commented Oct 22, 2019

@jorgeorpinel to update to cats-dogs-v2, you need to re-import it. There is no option for dvc update to tell it specific rev that you want to update to yet.

@efiop
Copy link
Contributor

efiop commented Oct 22, 2019

@jorgeorpinel We can introduce that option though 🙂 Something like dvc update cats-dogs.dvc --rev cats-dogs-v2. What do you think?

@efiop efiop reopened this Oct 22, 2019
@shcheklein
Copy link
Member

@efiop what would happen if I don't specify initially a --rev option when I do import and then I run dvc update. Will it update to the latest master?

Btw, what are the differences between running dvc update and dvc import again?

@efiop
Copy link
Contributor

efiop commented Oct 22, 2019

@efiop what would happen if I don't specify initially a --rev option when I do import and then I run dvc update. Will it update to the latest master?

@shcheklein Yes, it will import from default branch and if it moves then dvc update will update it to the new state.

Btw, what are the differences between running dvc update and dvc import again?

Update moves within the specified rev (the one specified during import) and so if it is a specific sha then it won't ever update it. If it is a specific tag that didn't move it also won't update it. Basically there is no difference between dvc update and dvc import --rev cats-dogs-v1 here, because first import before dvc update had --rev cats-dogs-v1, so dvc update can't update here.

@jorgeorpinel
Copy link
Contributor Author

Ah, alright, I've missed this: cats-dogs-v1 is a tag that doesn't go anywhere. So update for that tag is not possible...

Ah, thanks for the info! Interesting indeed. It makes sense...

to update to cats-dogs-v2, you need to re-import it...

Re-import it how? Using the commit hash instead of the tag name? That will detect the commit's branch when updating? (I want to import an older version and then update it to the latest.)

We can introduce that option though 🙂 Something like dvc update cats-dogs.dvc --rev cats-dogs-v2.

Yes, I was going to suggest that also.

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Oct 22, 2019

Update moves within the specified rev (the one specified during import) and so if it is a specific sha then it won't ever update it.

Hm. Sounds like I can't do what I wanted then (import an older version and then update it to the latest), at least not with dvc update. I have to run dvc import again, knowing which rev contains the updates.

What happens if I (re-)import with dvc import cats-dogs.dvc --rev master, will that become a regular import that can be updated later with dvc update?

Besides or instead of adding --rev to dvc update, I would say that if possible, updating should detect the branch where the imported rev exists. If there's more than one, use the one with the latest commit. What do you guys think?

@shcheklein
Copy link
Member

Hm. Sounds like I can't do what I wanted then (import an older version and then update it to the latest), at least not with dvc update

You can (and probably should). As an illustration, it won't be runnable. I think it will a regular workflow in certain case to import just a branch and then use dvc update from time to time to move to the latest commit in it.

@jorgeorpinel
Copy link
Contributor Author

As an illustration, it won't be runnable.

What do you mean?

in certain case to import just a branch and then use dvc update...

Even if you specify --rev <branch> I think what Ruslan said implies those imports will also never update. Let's wait for his feedback though.

@shcheklein
Copy link
Member

What do you mean?

I mean that user won't be able to to run dcv import without specifying --rev to get the initial version. But, we should use dvc import without --rev to illustrate the regular workflow. In the real world your project evolves along with the data registry and you just use the most recent version w/o specifying --rev.

Ruslan said implies those imports will also never update. Let's wait for his feedback though.

no, I hope that when you use dvc import artifact (w/o rev) dvc update moves it ti the most recent version in that default branch.

@dashohoxha
Copy link
Contributor

dashohoxha commented Oct 23, 2019

Even if you specify --rev <branch> I think what Ruslan said implies those imports will also never update.

@jorgeorpinel To understand the behavior of dvc update with respect to dvc import --rev <tag> and dvc import --rev <branch>, it might be useful to think about the difference between a <tag> and a <branch> in Git. A <tag> is a label that always points to the same commit (revision), unless explicitly moved. A <branch> is a floating label that always points at the top commit of a certain branch.
If you also think of a dvc update as a dvc get followed by a dvc commit, this should make clear its behavior.

I would say that if possible, updating should detect the branch where the imported rev exists. If there's more than one, use the one with the latest commit. What do you guys think?

If you understand properly how dvc update works, I think that this would be not only unnecessary, but also wrong and unexpected. When I import a <tag> or a <commit>, I would expect it not to change, unless I do a re-import with a different <tag> or <commit>.

For the same reason I think that adding a --rev option to dvc update is not a good idea. An explicit re-import would be better in my opinion.

@jorgeorpinel jorgeorpinel changed the title update: not updating update: how to update specific rev import? Oct 24, 2019
@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Oct 24, 2019

Makes sense about moving labelsreferences. I knew that, but was not understanding Ruslan's comments correctly.

But I don't agree that it would be wrong or unexpected that update detects the branch when imported with specific SHA or maybe even unmoved tag. When you run dvc update, and you know the repo has changes, you probably expect to get the changes back from update.

However I don't have a strong opinion either way. I don't think either approach is "wrong"; It's just a question of product design.

Of course its easier to not change/complicate it if there's no compelling reason. We could just should probably improve the docs to explain this behavior clearly.

@jorgeorpinel jorgeorpinel transferred this issue from iterative/dvc Oct 24, 2019
@jorgeorpinel jorgeorpinel added command-reference A: docs Area: user documentation (gatsby-theme-iterative) labels Oct 24, 2019
@shcheklein
Copy link
Member

@jorgeorpinel agreed on updating docs! (as usual, haha :))

I would clarify though that dvc update updates the DVC-file and data when branch name is specified as a revision.

It does not update only when you explicitly told it to use an exact commit. It should be the same as regular package managers behave if you specify them a link to an exact commit on Github. Their update logic will keep them as is. And will be updating only if you use a branch name as a rev.

@shcheklein shcheklein changed the title update: how to update specific rev import? update: how to update specific rev import - clarify docs? Oct 24, 2019
@jorgeorpinel jorgeorpinel changed the title update: how to update specific rev import - clarify docs? update: how to update specific rev import? Clarify docs Oct 24, 2019
@jorgeorpinel jorgeorpinel changed the title update: how to update specific rev import? Clarify docs update: how to update specific rev import? Clarify docs. Oct 24, 2019
@jorgeorpinel
Copy link
Contributor Author

Agreed. I like the package manager analogy. Maybe we should've called these features dvc pkg install/update after all haha. 😋

jorgeorpinel added a commit that referenced this issue Oct 30, 2019
… but also for

use-case: add expandable sections to new data registry case
per #679 (comment)
and other misc. copy edits.

Also standardizes term "external" (repo) vs. "source" data/project in this context
and introduces the term "revision fixing".
@jorgeorpinel
Copy link
Contributor Author

Adding details on all this to the docs in 131af1e!

Should we also open a ticket to give special output from the command itself when it detects a fixed rev field in the DVC-file, and this rev hasn't changed? (An alert suggesting it may be necessary to re-import instead of updating if the rev is a tag or SHA commit)

@shcheklein
Copy link
Member

It can be just a status message (not WARNING or something like that - since there is nothing wrong) that this file is up to date since exact commit is specified and there is nothing to update.

@jorgeorpinel
Copy link
Contributor Author

Opened iterative/dvc/issues/2696.

@shcheklein
Copy link
Member

@jorgeorpinel 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: docs Area: user documentation (gatsby-theme-iterative)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants