Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc diff or some dataset management tooling #770

Open
villasv opened this issue Jun 13, 2018 · 36 comments
Open

dvc diff or some dataset management tooling #770

villasv opened this issue Jun 13, 2018 · 36 comments
Labels
enhancement Enhances DVC feature request Requesting a new feature p2-medium Medium priority, should be done, but less important

Comments

@villasv
Copy link
Contributor

villasv commented Jun 13, 2018

dvc plays really well with git, but one thing that I still miss in a data version control system that I really value in source version control systems is the tooling to inspect patches. Because data should be a deterministic reproducible output from the source code, almost all important changes are found in the code history. But frequently I also want to inspect what changed in the data.

Concrete scenario: let's say that I have an input csv and a transformation script that outputs a sanitized version of that input file. Then, I make a very small change in the sanitizing strategy, and run the command again. I can see that the output file changed hash, so I know that my code indeed changed behavior. But 99.999% of the outputs data points stayed the same, just a very minor portion of the file changed.

How can I inspect the data points that changed? Or the files in a very large output directory that changed? I can't git diff those files anymore because they're ignored. I believe that this is achievable with dvc.

@efiop
Copy link
Member

efiop commented Jun 13, 2018

Hi @villasv !

Great idea! We will be sure to get to implementing it after 0.9.8.

Thanks,
Ruslan

@efiop efiop added the enhancement Enhances DVC label Jun 13, 2018
@efiop efiop added this to the 0.9.9 milestone Jun 13, 2018
@dmpetrov
Copy link
Member

Thank you @villasv. The idea is great!

But we should understand that in many cases dvc works in Gb size scale where diff has no meaning. The command might be abbused and many users can come up with a conclusion that dvc diff is a buggy\slow command.

Let's check file size before dvc diff and exit with an error if the size if exceeds the limit. The limit should be defined in dvc config file. Default value .... How do you think about 100Mb?

How do you guys think about introducing the limit?

@efiop
Copy link
Member

efiop commented Jun 13, 2018

@dmpetrov I thought about introducing a limit, but there is actually no good reason to do that, since we can simply leave it for user to decide whether he wants to kill dvc diff if it takes too long or if he wants to wait.

@dmpetrov
Copy link
Member

btw... there are no diffs in HDFS. It is one more thing to consider.

just throwing out some idea - hide the FS specific and SCM\Git specific commands under a special command like dvc whatever diff

@villasv
Copy link
Contributor Author

villasv commented Jun 13, 2018

I think that performance is a valid concern, and probably some aggressive warnings should be issued beforehand since dvc knows the file sizes instead of simply forbidding a certain file size. Perhaps some people will still want to diff even if it takes hours and outputs Gbs of data as well.

One extra care is that the previous revision from the git POV might not have the output file in the local cache, in which case it's necessary to either do a dvc pull or in the worst case a retroactive dvc repro.

@villasv
Copy link
Contributor Author

villasv commented Jun 14, 2018

Relevant new development: https://youtu.be/fw6P6VFPo24

@efiop
Copy link
Member

efiop commented Jun 14, 2018

@villasv I agree with you. File size limitation is definitely not a way to go. We might add a warning, but considering other tools(e.g. plain diff and git diff) don't do that, I think that we are pretty safe not printing anything as well. If the operation takes too long user could just CTRL+C it as usual :)

Missing local cache is also a great point and we will be sure to account for that(e.g. maybe something like proposed --fetch for dvc metrics).

Thanks for the link!

@villasv
Copy link
Contributor Author

villasv commented Oct 18, 2018

Just a heads up. I investigated the library I mentioned (tdda) but it wasn't of much help. I made a small script that achieves what I want for CSV and JSONLines. I'm pretty sure if this was incorporated in DVC it would be a bit cleaner, because I woudn't need to invoke so many subprocesses, but I decided to implement it standalone instead of inside a PR just for a Proof of Concept, and maybe we don't really want to put something so file-type-specific and formatting-opinionated into DVC (yet).

Perhaps a dvc plugin? The first of its kind?

My goal was to inspect each line separately and aggregate additions (pprint the whole item), deletions (print the id of the deleted item) and editions (pprint a jsondiff).

Examples using my own project:

$ python code/reftest.py diff --data-file data/orgs.txt --index-path wikidata.@id
Comparing data/orgs.txt with ./.dvc/cache/e0/6bd8b3a5945f043687c7a4256b0291
Removed:
http://www.wikidata.org/entity/Q0
http://www.wikidata.org/entity/Q1
---
Edited:
http://www.wikidata.org/entity/Q414163 {update: {'wikidata': {update: {'i18n': {update: {'en': {update: {'label': 'Academy '
                                                                           'of '
                                                                           'Sciences '
                                                                           'and '
                                                                           'Literature '
                                                                           'Mainz'}}}}}}}}
http://www.wikidata.org/entity/Q503473 {update: {'wikidata': {update: {'i18n': {update: {'en': {update: {'names': {insert: [(2,
                                                                                      'UNIGE')]}}}}}}}}}
http://www.wikidata.org/entity/Q797585 {update: {'wikidata': {update: {'i18n': {update: {'en': {update: {'names': ['Babasaheb '
                                                                            'Ambedkar '
                                                                            'University',
                                                                            'Bhimrao '
                                                                            'Ambedkar '
                                                                            'University']}}}}}}}}
http://www.wikidata.org/entity/Q1619487 {update: {'wikidata': {update: {'i18n': {update: {'en': {update: {'label': 'Willy-Brandt-school'}}}}}}}}
---
Added:
{'wikidata': {'@id': 'http://www.wikidata.org/entity/Q28704642',
              'i18n': {'en': {'label': 'École Française Internationale de Kiev',
                              'names': ['EFIK',
                                        'École française internationale de '
                                        'Kiev']}}}}
{'wikidata': {'@id': 'http://www.wikidata.org/entity/Q57420477',
              'i18n': {'en': {'label': 'Kerch Polytechnic College',
                              'names': []}}}}
{'wikidata': {'@id': 'http://www.wikidata.org/entity/Q57428768',
              'i18n': {'en': {'label': None, 'names': []}}}}
---
Total: R(2) / E(1) / A(1)
$ python code/reftest.py diff --data-file data/orgs.csv --index-col 0
Comparing data/orgs.csv with ./.dvc/cache/99/769e4afd29269400056d9b17cc39a8
Removed:
---
Edited:
Q414163 {delete: [9], insert: [(9, 'Academy of Sciences and Literature Mainz')]}
Q1619487 {delete: [9], insert: [(9, 'Willy-Brandt-school')]}
Q1698809 {delete: [9], insert: [(9, 'Johanneum Breslau')]}
Q3438435 {delete: [9], insert: [(9, 'San Marcos University')]}
---
Added:
['Q28704642',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'École Française Internationale de Kiev',
 '',
 '',
 '',
 '',
 '']
['Q57420477',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Kerch Polytechnic College',
 '',
 '',
 '',
 '',
 '']
['Q57428768', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
---
Total: R(0) / E(4) / A(3)

The script:

import csv
import difflib
import fire
import json
import jsondiff
import pprint
import yaml
import subprocess
import sys


class ReftestManager:
    def __init__(self, data_file):
        self._new = data_file
        self._old = data_file + '.dvc'

        # now we translate _old to its true cache path

        self._dvc_root = subprocess.Popen(
            ['dvc', 'root'],
            stdout=subprocess.PIPE,
        ).communicate()[0].rstrip().decode('utf-8')

        dvc_file = subprocess.Popen(
            ['git', 'show', f'HEAD:{self._old}'],
            stdout=subprocess.PIPE,
        ).communicate()[0].rstrip().decode('utf-8')
        dvc_spec = yaml.load(dvc_file)

        cache_md5 = next(
            out['md5']
            for out in dvc_spec['outs']
            if out['path'].split('/')[-1] == data_file.split('/')[-1]
        )

        self._old = f'{self._dvc_root}/.dvc/cache/{cache_md5[:2]}/{cache_md5[2:]}'

        print(f"Comparing {self._new} with {self._old}")

    def diff(self, index_col=None, index_path=None):
        if not ((index_col is not None) ^ (index_path is not None)):
            raise ValueError("Inform either index col or index path")

        with open(self._new) as fnew:
            new_lines = fnew.readlines()
        with open(self._old) as fold:
            old_lines = fold.readlines()

        diff = [
            line for line in
            difflib.unified_diff(
                old_lines, new_lines,
                fromfile=self._old, tofile=self._new, n=0,
            )
            if not line.startswith('---')
                and not line.startswith('+++')
                and not line.startswith('@@')
        ]

        rmved = [line[1:] for line in diff if line[0] == '-']
        added = [line[1:] for line in diff if line[0] == '+']

        if index_col is not None:
            rmved = {
                line[index_col]: line
                for line in csv.reader(rmved)
            }
            added = {
                line[index_col]: line
                for line in csv.reader(added)
            }

        if index_path is not None:
            def follow_path(json, key_chain):
                for key in key_chain:
                    json = json[key]
                return json
            key_chain = index_path.split('.')
            rmved = {
                follow_path(json.loads(line), key_chain): json.loads(line)
                for line in rmved
            }
            added = {
                follow_path(json.loads(line), key_chain): json.loads(line)
                for line in added
            }

        edits = {k:(v,added[k]) for k,v in rmved.items() if k in added}
        rmved = {k:v for k,v in rmved.items() if k not in edits }
        added = {k:v for k,v in added.items() if k not in edits }

        print("Removed:")
        for k in rmved:
            print(k)
        print("---")

        print("Edited:")
        for k,(old,new) in edits.items():
            print(k, pprint.pformat(jsondiff.diff(old, new, syntax='explicit')))
        print("---")

        print("Added:")
        for k in added:
            pprint.pprint(added[k])
        print("---")

        print(f"Total: R({len(rmved)}) / E({len(edits)}) / A({len(added)})")


if __name__ == "__main__":
    fire.Fire(ReftestManager)

@villasv
Copy link
Contributor Author

villasv commented Oct 18, 2018

Unfortunately, because I invoke dvc as a subproccess, sometimes I get the output of dvc root while checking for updates...:

FileNotFoundError: [Errno 2] No such file or directory: 'Checking for updates...\n./.dvc/cache/e0/6bd8b3a5945f043687c7a4256b0291'

:-) but that's really minor

@efiop
Copy link
Member

efiop commented Oct 18, 2018

Hi @villasv !

This looks amazing! I see no reason to make it a plugin, since it looks very suiting to be a part of core dvc functionality. Would you like to file a PR with dvc diff? Please feel free to ping us if you need any help.

Thanks,
Ruslan

@efiop
Copy link
Member

efiop commented Oct 18, 2018

@villasv Btw, we have introduced a community chat recently at https://dvc.org/chat and we would be very honored to have you there 🙂

@efiop
Copy link
Member

efiop commented Apr 7, 2019

Big thanks to @django-kz for contributing dvc diff command in #1778 ! 🎉 🚀 Currently it shows a difference in a number of files and their sizes between git revisions. We could add actual data diff functionality on top of it 🙂 dvc diff is going to be released in 0.35.1 today, please feel free to give it a try!

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented May 24, 2019

Hi, I just discovered this issue mentioned on Discord. Here's an idea: Not ideal but we could provide a short guide in the docs for now on how to use git checkout, dvc checkout, copy the tracked text file(s) in question (from 2 different versions) to /tmp/a/ and /tmp/b/ , and run diff on them manually.

@efiop
Copy link
Member

efiop commented May 24, 2019

That would be amazing @jorgeorpinel ! Great idea!

@shcheklein
Copy link
Member

Don't like to put this hack into the official docs to be honest. Especially if we are thinking about implementing this in the future. I would document this workaround in this ticket and may be put a link from the doc.

@villasv
Copy link
Contributor Author

villasv commented May 24, 2019

Or a trick I’ve seen around: a question and semi-official-but-ok-to-become-dated answer on Stack Overflow

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jul 18, 2019

OK here's a quick example on how to do it based on https://github.com/iterative/example-get-started (assumes the project has been cloned and user moved into the local repo). Also, we'll focus on files model.pkl and auc.metric:

EDIT: The behavior of dvc diff has changed slightly over time, but the general idea should still be valid.

$ dvc pull -aT
Preparing to download data from 'https://remote.dvc.org/get-started'
...
$ dvc diff HEAD^
dvc diff from 30a96ce to 6c73875
...
diff for 'model.pkl'
-model.pkl with md5 a66489653d1b6a8ba989799367b32c43
+model.pkl with md5 3863d0e317dee0a55c4e59d2ec0eef33
...
diff for 'auc.metric'
-auc.metric with md5 f58e5ccd66bf1195b53f458e7f619ab8
...
$ mkdir /tmp/a
$ cp model.pkl /tmp/a/model.pkl
$ cp auc.metric /tmp/a/auc.metric
$ git checkout HEAD^
...
$ dvc checkout
[##############################] 100% Checkout finished!
$ mkdir /tmp/b
$ cp model.pkl /tmp/b/model.pkl
$ cp auc.metric /tmp/b/auc.metric
$ diff /tmp/a/model.pkl /tmp/b/model.pkl 
Binary files /tmp/a/model.pkl and /tmp/b/model.pkl differ

Finally, use diff once all the changed files of interest are ready to be compared:

$ diff /tmp/a/auc.metric /tmp/b/auc.metric
1c1
< 0.602818
---
> 0.588426

UPDATE: Link to this comment added to https://dvc.org/doc/command-reference/diff in iterative/dvc.org@9ce552a

  • If someone asks the question on SO, I'll be glad to copy this example as answer.

@efiop efiop added p4-not-important feature request Requesting a new feature labels Jul 23, 2019
@efiop efiop added p3-nice-to-have It should be done this or next sprint and removed p4 labels Sep 25, 2019
@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Dec 10, 2019

  • Remember to update the dvc diff cmd ref (possibly removing the note that links to this issue) and tam completion scripts (if needed) when/if this is addressed.

@nik123
Copy link
Contributor

nik123 commented Dec 24, 2019

Hi, I have read the docs and this thread but I still have problems with dvc diff command. Probably I'm missing something and I would be glad if someone will correct me. I have a problematic use case which I describe below.

I have a directory with several binary files under control of dvc. Whenever I execute dvc diff I get statistics about changed files. For example:

$dvc diff -t data/myfiles 2a19a244a218088c8f1313f5c528d9cf878bb7af
diff for 'data/myfiles'
-data/myfiles with md5 51cf02d89b91e204a0876563823f2c90.dir
+data/myfiles with md5 3e5c51dbc302efbd9e9c821483a19e8f.dir

116 files untouched, 1 file modified, 1 file added, 0 files deleted, size was increased by 12.3 kB

The output is not really helpful because it doesn't provide any information what exactly has changed in the directory. Of course it shows me that some files have changed but I knew it anyway. Since I work with binary files I don't think I need line-by-line diff but rather list of changed binary files. The possible output for dvc diff may look like this:

$dvc diff -t data/myfiles 2a19a244a218088c8f1313f5c528d9cf878bb7af
diff for 'data/myfiles'
-data/myfiles with md5 51cf02d89b91e204a0876563823f2c90.dir
+data/myfiles with md5 3e5c51dbc302efbd9e9c821483a19e8f.dir

116 files untouched, 1 file modified, 1 file added, 0 files deleted, size was increased by 12.3 kB

Modified:
-data/myfiles/117.jpg

Added:
-data/myfiles/118.jpg

P.S.:
Of course I can checkout different versions of data/myfiles, save them to temporary directories and then calculate hash sums for each pair of files in those temporary directories but it does seems like too much work for such a simple task. I really hope there is a better way.

@dmpetrov
Copy link
Member

dmpetrov commented Dec 24, 2019

Thank you @nik123 ! I've just created a detailed issue about the dvc diff output #2982. Please vote for it! And please add comments if I missed anything.

@jorgeorpinel
Copy link
Contributor

The output is not really helpful because it doesn't provide any information what exactly has changed in the directory.

@nik123 If the directory was dvc added as a whole, currently DVC doesn't examine what's inside so you only get those general stats. We are working on providing this kind of granularity for all of our commands though 🙂 Please see Dmitry's #2982 (comment) like he mentioned.

@nik123
Copy link
Contributor

nik123 commented Dec 25, 2019

@nik123 , @dmpetrov - thank you for your response! The issue #2982 is exactly what I need

@ammarasmro
Copy link

Big thanks to @django-kz for contributing dvc diff command in #1778 ! 🎉 🚀 Currently it shows a difference in a number of files and their sizes between git revisions. We could add actual data diff functionality on top of it 🙂 dvc diff is going to be released in 0.35.1 today, please feel free to give it a try!
@efiop

Copied from Discord, qna channel:

I think what the comment there meant to me, and what my use case needs, is a data diff. Let's say I changed the dataset, then committed the change. How do I know what changed? Right now dvc diff only shows what files changed. It would've been so beneficial if it shows something like git diff with a line by line diff

@jorgeorpinel
Copy link
Contributor

Hi @ammarasmro

It would've been so beneficial if it shows something like git diff with a line by line diff

Right, this is discussed above. There are probably thousands of data formats so this is not easy as with regular diff which only works on plain text files. Do you have a more specific use case? E.g. a certain data format or a set of formats? Thanks

@ammarasmro
Copy link

So the particular use case that raised this issue was a text dataset. The process we, the ML team, have used, is that after the data team gets data in any format, they process it and we get it as CSV, TSV, TXT, JSON, .md files. So alot of our experiments at least start with a text data file. We also use other formats like pickle but I'm guessing diffing would be more complicated than text.

@stefanocoretta
Copy link

Would be great to have line-by-line comparison at least for "text-based" files (TXT, CSV, JSON, MD, YAML, TOML, XLM, ...). By the way, great job with dvc, it's making my life easier!

@jorgeorpinel
Copy link
Contributor

@ammarasmro and @stefanocoretta for now you can refer to the #770 (comment) above for a procedure to do that. It's not clear that we want such a feature for really large files, even when they're plaintext. But yes, maybe!

@efiop efiop removed this from the Queue milestone May 3, 2021
@alex000kim
Copy link

a request from a high-priority client:

  • They need the ability to diff text files (CSV, JSON, YAML, etc) tracked by DVC
  • The suggested workaround (below) is a little too verbose for them
dvc get --rev commit_sha -o myfile.csv.temp . myfile.csv
diff myfile.csv.temp myfile.csv

@jorgeorpinel
Copy link
Contributor

Should it fall within DVC's feature set? Maybe we can find better diff tools to recommend instead.

@alex000kim
Copy link

The request was mostly about an easy way to compare any text files versioned by DVC and not so much about showing data format-specific diff.
I presume many people would expect dvc diff to work similarly to git diff, but for files versioned by DVC.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Sep 22, 2022

people would expect dvc diff to work similarly to git diff, but for files versioned by DVC

Maybe new users who are familiar with Git. But would DS/ ML Engs expect a printed line-to-line comparison of huge dataset versions?

Maybe dvc diff --targets can auto-detect small text files (or/and have another flag and/or config param for this) and print a diff instead of/ in addition to a summary. Cc @dberenbaum

@dberenbaum
Copy link
Contributor

It makes sense to me as a feature request. Probably not as the default, but if people specifically want to see the diff of the contents of the files, I see no reason we shouldn't let them. It's more an issue of warning the users it could be expensive and/or unhelpful, and I'm not sure when we could prioritize this since there are pretty straightforward workarounds.

@jorgeorpinel
Copy link
Contributor

In terms of implementation, maybe some sort of hidden DVC experiments (custom Git refs) can be used and then literally git diff the file versions internally.

@pmrowla
Copy link
Contributor

pmrowla commented Sep 28, 2022

We shouldn't be staging things in git that are supposed to be gitignored, even in hidden refs.

I think what we want here is just something equivalent to git's .gitattributes + difftool support: https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes

So you would tell DVC what external diff tool you want it to use for what file extensions. And then based on file extension, we would just automatically call the correct tool depending on what file the user is trying to diff. This also lets you extend support for diffing beyond text files. So you could do things like pass image formats into gui image diffing tools, and arbitrary binary files into binary diff tools, etc

@dberenbaum
Copy link
Contributor

@alex000kim If these are only single-entry .dvc files, they can get text diffs with git diff using the external diff approach above.

Create dvcfile2text.py at the root of the repo that looks like:

import ruamel.yaml as yaml
import sys

from dvc.repo import Repo

dvcfile = sys.argv[1]

with open(dvcfile) as f:
    content = yaml.safe_load(f)
md5 = content["outs"][0]["md5"]

repo = Repo()
path = repo.odb.local.oid_to_path(md5)
with open(path) as f:
    print(f.read())

The add to .git/config:

[diff "dvcfile"]
    textconv = python dvcfile2text.py
    cachetextconv = true

And finally add to .gitattributes:

*.dvc diff=dvcfile

This is hacky and won't work for directories, but a dev could probably make it robust and working with directories without too much effort.


We could consider providing external diff drivers for git that diff .dvc and dvc.lock files, starting with showing output similar to what dvc diff does today (except it would be shown inside git diff). Not something that's likely to be a high priority right now, though.

@alex000kim
Copy link

Great, thanks for the example, however hacky :)

Not something that's likely to be a high priority right now, though.

Understood.

@dberenbaum dberenbaum added p2-medium Medium priority, should be done, but less important and removed p3-nice-to-have It should be done this or next sprint labels Jul 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhances DVC feature request Requesting a new feature p2-medium Medium priority, should be done, but less important
Projects
None yet
Development

No branches or pull requests