how to return back to an older version of data #599

Closed
rahulguptajss opened this issue Apr 1, 2018 · 18 comments

Comments

@rahulguptajss

rahulguptajss commented Apr 1, 2018

Hi,

If I change a data file (let's say my training set) and then run dvc repro, how do I revert to an older version of that data file, i.e. my older training set?

Thanks.

@efiop
Contributor

efiop commented Apr 1, 2018

Hi @rahulguptajss !

Versions of data files are determined by the corresponding dvc files, which store their md5 checksums. For example, if you've added your training set with dvc add train.tsv, then train.tsv.dvc holds the version info for train.tsv. So the versions of your data files are fully determined by the versions of the corresponding dvc files (which are tracked by git). If you committed your changes before changing your training set, you can simply roll back train.tsv.dvc to that version and call dvc checkout.

Example:

echo old > train.tsv
dvc add train.tsv
git add train.tsv.dvc
git add .gitignore
git commit -m 'Add old version'
# Now let's replace the old data with new.
# Note that we don't modify the existing data file, but fully replace it.
# This is important because we currently use hardlinks for data files, so modifying an existing
# data file also modifies the cache file, corrupting it and causing dvc to remove it automatically.
dvc remove train.tsv.dvc
echo new > train.tsv
dvc add train.tsv
git add train.tsv.dvc
git commit -m New
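# Now let's go back to the old version of that file.
# Use git checkout rather than git reset here: with a path, git reset only updates the index
# and leaves the train.tsv.dvc in the working tree unchanged.
git checkout <Old commit> -- train.tsv.dvc
dvc checkout # Now your train.tsv is restored to 'old'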

Edit: fixed typos.
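
The hardlink caveat in the comments above can be reproduced without DVC. The following is a minimal sketch using only coreutils; `cache/obj` is a made-up stand-in for a cache entry, not DVC's actual cache layout. It shows that overwriting a hardlinked file in place changes the cached content, while removing the file first and creating a fresh one leaves the cached content alone.

```shell
# Illustration only: "cache/obj" stands in for a DVC cache entry; all paths are made up.
set -e
work=$(mktemp -d)
cd "$work"
mkdir cache

# Case 1: overwrite the hardlinked workspace file in place.
echo old > cache/obj
ln cache/obj train.tsv      # workspace file shares an inode with the cache entry
echo new > train.tsv        # truncates and rewrites the shared inode
in_place=$(cat cache/obj)   # the cached 'old' content has been overwritten

# Case 2: remove the workspace file first, then create a fresh one.
rm train.tsv cache/obj
echo old > cache/obj
ln cache/obj train.tsv
rm train.tsv                # drops only the workspace link
echo new > train.tsv        # new inode, cache entry untouched
replaced=$(cat cache/obj)

echo "in place: cache holds '$in_place'"
echo "replaced: cache holds '$replaced'"
```

In the first case the stand-in cache entry ends up holding the new content, which is the corruption described above; in the second it still holds the old content.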

@rahulguptajss
Author

thank you for the detailed explanation.

@efiop
Contributor

efiop commented Apr 2, 2018

@rahulguptajss FYI: there was a typo in my last comment. There should be 'git add train.tsv.dvc' instead of 'git add train.tsv'. Kudos to @dmpetrov for noticing.

@drorata

drorata commented Aug 28, 2018

@efiop Thanks for the details! Very helpful, and it might be worth adding to the docs. Could you also elaborate on how to handle data versioning when the data is stored in the cloud (S3, for example)?

@efiop
Contributor

efiop commented Aug 28, 2018

Hi @drorata !

Very helpful, and it might be worth adding to the docs.

Thank you for the feedback! Created iterative/dvc.org#78 to track the progress on it.

Could you also elaborate on how to handle data versioning when the data is stored in the cloud (S3, for example)?

If you are talking about the external output scenario, then it is no different from what I've described above. If you are talking about a cache stored on S3 (e.g. https://dvc.org/doc/use-cases/share-data-and-model-files ), then it stores all the versions that you've pushed to it from your local workspace, so all the versioning happens locally and, once again, it is no different from my example above 🙂
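
For the remote-cache case, the flow might look like the sketch below. `restore_version` is a hypothetical helper (not part of DVC), and the remote name and bucket in the comments are made up; `dvc remote add`, `dvc push` and `dvc pull` are existing DVC commands, but check their behavior against the DVC version you use.

```shell
# Hypothetical helper: restore FILE as of git revision REV, fetching the data
# from the default DVC remote if it is no longer in the local cache.
restore_version() {
    rev=$1
    file=$2
    git checkout "$rev" -- "$file.dvc" &&   # roll back the git-tracked .dvc file
    dvc pull "$file.dvc"                    # download if missing locally, then check out the data
}

# One-time setup and usage (remote name and bucket are assumptions):
#   dvc remote add -d myremote s3://my-bucket/dvc-cache
#   dvc push
#   restore_version <Old commit> train.tsv
```

The versioning itself still happens through git and the .dvc file; the remote only supplies cached data that is missing locally.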

Thanks,
Ruslan

@drorata

drorata commented Sep 5, 2018

The more I think about it, the more I understand that this is a super important and central use case with a tricky step: dvc remove train.tsv. Correct me if I'm wrong, but it seems this is not reflected in the documentation clearly and loudly enough. Don't you think it should be added to https://dvc.org/doc/use-cases/data-and-model-files-versioning? Or maybe to another use case?

@efiop
Contributor

efiop commented Sep 5, 2018

Good point! I agree, it should be in https://dvc.org/doc/use-cases/data-and-model-files-versioning. I will add it ASAP.

Thanks,
Ruslan

@drorata

drorata commented Sep 10, 2018

@efiop Above, shouldn't dvc remove train.tsv actually be dvc remove train.tsv.dvc?

@efiop
Contributor

efiop commented Sep 10, 2018

@drorata You are absolutely right! Fixed. Thank you!

@drorata

drorata commented Sep 10, 2018

I am a little confused. If I run dvc remove foo.csv.dvc, then foo.csv is removed from my working directory. This is OK if I need to replace it. What is the right flow if I want to change/update foo.csv?

What I just tried was to simply edit the file in the working directory; dvc noticed the change, and I could run dvc add foo.csv, which updated the corresponding foo.csv.dvc. Am I missing something?

@efiop
Contributor

efiop commented Sep 10, 2018

I am a little confused. If I run dvc remove foo.csv.dvc, then foo.csv is removed from my working directory. This is OK if I need to replace it. What is the right flow if I want to change/update foo.csv?

If you want to simply modify the file, then in general this:

dvc remove train.tsv.dvc
echo new > train.tsv
dvc add train.tsv

should be replaced by this:

cp train.tsv train.tsv.tmp
dvc remove train.tsv.dvc
mv train.tsv.tmp train.tsv
echo new >> train.tsv
dvc add train.tsv

This is a general flow that is needed for hardlink/symlink cache types in order to avoid corrupting the cache for the previous version of the file.

What I just tried was to simply edit the file in the working directory; dvc noticed the change, and I could run dvc add foo.csv, which updated the corresponding foo.csv.dvc.

Unless your workspace supports reflinks (if you are on a recent Mac, chances are you are using reflinks) or you've manually specified cache.type copy, you have corrupted the cache for the previous version of the file 🙁 That is why, in general, it should be done as in the listing above. We are currently working on protecting hardlinks and symlinks with read-only permissions to avoid this problem.

Thanks,
Ruslan

@drorata

drorata commented Sep 10, 2018

  • How can I tell what links were used?
  • How can I tell whether the cache is corrupted?
  • Lastly, this is a huge pitfall IMHO. I hope you'll find a safer workaround. And in the meanwhile this has to be documented in BOLD LETTERS.

Thanks for your detailed responses.

@efiop
Contributor

efiop commented Sep 10, 2018

How can I tell what links were used?

Try dvc add something -v or dvc checkout -v and it will tell which one it uses. You can specify which one exactly you want to use using dvc config cache.type reflink/hardlink/symlink/copy.
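
Independently of DVC's verbose output, the filesystem itself gives a rough hint. The sketch below uses a hypothetical `link_kind` helper (not a DVC command) and assumes GNU `stat`. It cannot tell a reflink from a plain copy, and a link count above 1 only shows that the file is hardlinked somewhere, not necessarily to the DVC cache.

```shell
# Hypothetical helper, not part of DVC: classify how a workspace file is linked.
link_kind() {
    if [ -L "$1" ]; then
        echo symlink
    elif [ "$(stat -c %h "$1")" -gt 1 ]; then   # GNU stat; on macOS/BSD use: stat -f %l
        echo hardlink
    else
        echo "copy or reflink"                  # indistinguishable by link count
    fi
}

# Small demonstration on throwaway files.
demo=$(mktemp -d)
echo data > "$demo/cached"
ln "$demo/cached" "$demo/hard.tsv"
ln -s "$demo/cached" "$demo/sym.tsv"
cp "$demo/cached" "$demo/copy.tsv"

link_kind "$demo/hard.tsv"
link_kind "$demo/sym.tsv"
link_kind "$demo/copy.tsv"
```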

How can I tell whether the cache is corrupted?

If it was corrupted, dvc will print a warning and remove the corrupted cache.
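
Conceptually, "corrupted" here means the cached content no longer matches the md5 checksum recorded for it. The following sketch illustrates that idea on a throwaway file; `matches_md5` is a hypothetical helper, GNU `md5sum` is assumed, and this is not how DVC itself performs the check.

```shell
# Hypothetical helper: succeeds if FILE still hashes to EXPECTED_MD5.
# Assumes GNU md5sum (on macOS, `md5 -q FILE` is the rough equivalent).
matches_md5() {
    actual=$(md5sum "$1" | cut -d ' ' -f 1)
    [ "$actual" = "$2" ]
}

demo=$(mktemp -d)
printf 'old\n' > "$demo/entry"
recorded=$(md5sum "$demo/entry" | cut -d ' ' -f 1)   # stands in for the checksum a .dvc file records

matches_md5 "$demo/entry" "$recorded" && echo "intact"
printf 'new\n' >> "$demo/entry"                       # simulate an in-place edit through a hardlink
matches_md5 "$demo/entry" "$recorded" || echo "corrupted"
```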

Lastly, this is a huge pitfall IMHO. I hope you'll find a safer workaround. And in the meanwhile this has to be documented in BOLD LETTERS.

I agree, and we are working on it. We track the progress in #799. There is a v1.0 coming pretty soon, in which we are trying to improve dvc based on all the feedback we've received so far. We will be sure to draw attention to this point in the docs. Thank you for the feedback!

@shcheklein
Member

@drorata thank you, we've added a couple of safety notes here (expand DVC internals at the bottom) and here (check the cache type section). Let me know if you have other places in mind where it makes sense to warn our users.

We've also created an issue to better describe the flow, thanks @efiop .

@bayethiernodiop

I am a little confused. If I run dvc remove foo.csv.dvc, then foo.csv is removed from my working directory. This is OK if I need to replace it. What is the right flow if I want to change/update foo.csv?

If you want to simply modify the file, then in general this:

dvc remove train.tsv.dvc
echo new > train.tsv
dvc add train.tsv

should be replaced by this:

cp train.tsv train.tsv.tmp
dvc remove train.tsv.dvc
mv train.tsv.tmp train.tsv
echo new >> train.tsv
dvc add train.tsv

This is a general flow that is needed for hardlink/symlink cache types in order to avoid corrupting the cache for the previous version of the file.

What I just tried was to simply edit the file in the working directory; dvc noticed the change, and I could run dvc add foo.csv, which updated the corresponding foo.csv.dvc.

Unless your workspace supports reflinks (if you are on a recent Mac, chances are you are using reflinks) or you've manually specified cache.type copy, you have corrupted the cache for the previous version of the file. That is why, in general, it should be done as in the listing above. We are currently working on protecting hardlinks and symlinks with read-only permissions to avoid this problem.

Thanks,
Ruslan

For the modification, shouldn't we use dvc unprotect?

@efiop
Contributor

efiop commented Mar 21, 2019

@bayethiernodiop Yes, but it was added fairly recently and this post is from a year ago 🙂 We have the process described at https://dvc.org/doc/user-guide/how-to/update-tracked-files
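
A rough sketch of the unprotect-based flow, for readers who land here first: `append_to_tracked_file` is a hypothetical wrapper written for this illustration, while `dvc unprotect` and `dvc add` are the DVC commands mentioned in this thread. Treat the linked documentation as the authoritative description.

```shell
# Hypothetical wrapper around the unprotect -> edit -> add flow.
append_to_tracked_file() {
    file=$1
    line=$2
    dvc unprotect "$file" &&           # detach the workspace file from the cache so editing is safe
    printf '%s\n' "$line" >> "$file" &&
    dvc add "$file"                    # record the new version in "$file.dvc"
}

# Usage (not executed here):
#   append_to_tracked_file train.tsv new
#   git add train.tsv.dvc
#   git commit -m 'Update training set'
```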

@safijari

Link is broken.

@efiop
Contributor

efiop commented May 11, 2020

@safijari Fixed https://dvc.org/doc/user-guide/updating-tracked-files . Thanks for the heads up! 🙂

6 participants