Generate a diff between two tabular datasets expressed in CSV files.

csvdiff

Overview

Generate a diff between two CSV files on the command-line.

csvdiff compares the semantic contents of two CSV files, ignoring details such as row and column ordering, so that you see only what has actually changed. This is useful when you're comparing the output of an automated system from one day to the next.

It's also useful for maintaining patches to third-party data. Diffs generated by csvdiff are a subset of JSON and can be stored and applied using the matching csvpatch command. If upstream data changes, you can fetch the new version and re-apply your changes to it easily.

Installing

First, you'll need Python and pip. Then run:

pip install csvdiff

Examples

For example, suppose we have a.csv:

id,name,amount
1,bob,20
2,eva,63
3,sarah,7
4,jeff,19
6,fred,10

After some changes and corrections to the data, we now have b.csv:

id,name,amount
1,bob,23       <--- changed
3,sarah,7
4,jeff,19
5,mira,81      <--- added
6,fred,13      <--- changed

Now we can ask for a summary of differences:

$ csvdiff --style=summary id a.csv b.csv
1 rows removed (20.0%)
1 rows added (20.0%)
2 rows changed (40.0%)
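
The counts above come from matching rows by the index column rather than by position. The following is a minimal sketch of that idea using only the standard library. It is an illustration, not csvdiff's actual implementation, and the helper names `index_rows` and `summarize` are hypothetical:

```python
import csv
import io

A_CSV = """id,name,amount
1,bob,20
2,eva,63
3,sarah,7
4,jeff,19
6,fred,10
"""

B_CSV = """id,name,amount
1,bob,23
3,sarah,7
4,jeff,19
5,mira,81
6,fred,13
"""


def index_rows(text, key):
    """Map each row's key value to its row dict, so row order stops mattering."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def summarize(a_text, b_text, key):
    """Hypothetical helper: count removed, added and changed rows by key."""
    a = index_rows(a_text, key)
    b = index_rows(b_text, key)
    removed = a.keys() - b.keys()
    added = b.keys() - a.keys()
    # Dict equality ignores column order, so only real value changes count.
    changed = {k for k in a.keys() & b.keys() if a[k] != b[k]}
    return {"removed": len(removed), "added": len(added), "changed": len(changed)}


summary = summarize(A_CSV, B_CSV, "id")
print(summary)
```

With five rows in a.csv, one removed row and two changed rows correspond to the 20.0% and 40.0% figures in the summary.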

Or look at the full diff, pretty-printed for readability:

$ csvdiff --style=pretty --output=diff.json id a.csv b.csv
$ cat diff.json
{
  "_index": [
    "id"
  ],
  "added": [
    {
      "amount": "81",
      "id": "5",
      "name": "mira"
    }
  ],
  "changed": [
    {
      "fields": {
        "amount": {
          "from": "20",
          "to": "23"
        }
      },
      "key": [
        "1"
      ]
    },
    {
      "fields": {
        "amount": {
          "from": "10",
          "to": "13"
        }
      },
      "key": [
        "6"
      ]
    }
  ],
  "removed": [
    {
      "amount": "63",
      "id": "2",
      "name": "eva"
    }
  ]
}
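
Because the diff is plain JSON, it can be inspected with any JSON library. As a small sketch, the snippet below loads the diff shown above with Python's `json` module and lists each changed field. The `describe` helper is hypothetical, and the diff is embedded as a string so the example runs without diff.json on disk:

```python
import json

DIFF_JSON = """
{
  "_index": ["id"],
  "added": [{"amount": "81", "id": "5", "name": "mira"}],
  "changed": [
    {"fields": {"amount": {"from": "20", "to": "23"}}, "key": ["1"]},
    {"fields": {"amount": {"from": "10", "to": "13"}}, "key": ["6"]}
  ],
  "removed": [{"amount": "63", "id": "2", "name": "eva"}]
}
"""


def describe(diff):
    """Hypothetical helper: one line of text per changed field."""
    lines = []
    for change in diff["changed"]:
        key = ",".join(change["key"])
        for field, delta in sorted(change["fields"].items()):
            lines.append(f"{key}: {field} {delta['from']} -> {delta['to']}")
    return lines


diff = json.loads(DIFF_JSON)
lines = describe(diff)
for line in lines:
    print(line)
```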

If you want to exclude columns from the comparison, specify a comma-separated list of column names to ignore. For example:

$ csvdiff --style=summary --ignore-columns=amount id a.csv b.csv
1 rows removed (20.0%)
1 rows added (20.0%)
0 rows changed (0.0%)

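You can also compare numeric fields only up to a given number of significant figures; use a negative value to compare at orders of magnitude. In the following example, c.csv (not shown) differs from a.csv only by small changes to numeric values in two rows: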

$ csvdiff --style=summary id a.csv c.csv
0 rows removed (0.0%)
0 rows added (0.0%)
2 rows changed (40.0%)
$ csvdiff --style=summary id --significance=-1 a.csv c.csv
files are identical
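
To give a feel for why a negative setting hides small numeric changes, here is a rough sketch. It interprets the setting as a number of decimal places passed to Python's `round`, which is an assumption made for illustration; csvdiff's own comparison rule may differ, and `same_at_significance` is a hypothetical name:

```python
def same_at_significance(a, b, significance):
    """Assumed rule: two fields match if they are equal after rounding.

    `significance` is passed to round() as the number of decimal places,
    so -1 rounds to the nearest ten. This is not necessarily what csvdiff does.
    """
    try:
        x, y = float(a), float(b)
    except ValueError:
        # Non-numeric fields fall back to exact string comparison.
        return a == b
    return round(x, significance) == round(y, significance)


print(same_at_significance("20", "23", 0))
print(same_at_significance("20", "23", -1))
```

Under this assumed rule, 20 and 23 differ at zero decimal places but both round to 20 at the nearest ten.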

Diffs generated this way contain all the data that's changed, and can be reapplied later if the original data changes. For example, suppose more data gets added to a.csv, giving us a-plus.csv:

id,name,amount
1,bob,20
2,eva,63
3,sarah,7
4,jeff,19
6,fred,10
8,henry,9

We can reapply our changes with the csvpatch command:

$ csvpatch --input=diff.json --output=b-plus.csv a-plus.csv
$ cat b-plus.csv
id,name,amount
1,bob,23
3,sarah,7
4,jeff,19
5,mira,81
6,fred,13
8,henry,9
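
Conceptually, applying a patch means dropping the removed rows, inserting the added rows, and overwriting the changed fields, all matched by the index key. The sketch below does this on in-memory records with plain Python. It is not csvpatch's implementation: the `apply_patch` helper is hypothetical, it silently skips removed rows that are already missing, and it sorts the output by key as strings, all of which are assumptions:

```python
def apply_patch(diff, records):
    """Hypothetical helper: apply a csvdiff-style diff dict to a list of row dicts."""
    index = diff["_index"]

    def key_of(record):
        return tuple(record[column] for column in index)

    by_key = {key_of(record): dict(record) for record in records}
    for record in diff["removed"]:
        by_key.pop(key_of(record), None)  # assumption: ignore already-missing rows
    for record in diff["added"]:
        by_key[key_of(record)] = dict(record)
    for change in diff["changed"]:
        row = by_key[tuple(change["key"])]
        for field, delta in change["fields"].items():
            row[field] = delta["to"]
    # Assumption: order output by key, compared as strings.
    return sorted(by_key.values(), key=key_of)


diff = {
    "_index": ["id"],
    "added": [{"amount": "81", "id": "5", "name": "mira"}],
    "changed": [
        {"fields": {"amount": {"from": "20", "to": "23"}}, "key": ["1"]},
        {"fields": {"amount": {"from": "10", "to": "13"}}, "key": ["6"]},
    ],
    "removed": [{"amount": "63", "id": "2", "name": "eva"}],
}

a_plus = [
    {"id": "1", "name": "bob", "amount": "20"},
    {"id": "2", "name": "eva", "amount": "63"},
    {"id": "3", "name": "sarah", "amount": "7"},
    {"id": "4", "name": "jeff", "amount": "19"},
    {"id": "6", "name": "fred", "amount": "10"},
    {"id": "8", "name": "henry", "amount": "9"},
]

b_plus = apply_patch(diff, a_plus)
for row in b_plus:
    print(row["id"], row["name"], row["amount"])
```

The row for henry, which the diff never mentions, passes through untouched, which is what makes the patch reusable against fresh upstream data.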

This can be useful if you're using csvdiff to transform data that's outside your control. In this case, you maintain the patch file and simply reapply it when the upstream data provider gives you a fresh file.

For more usage options, run csvdiff --help or csvpatch --help.

API

The main entry points are the diff_files and diff_records functions:

import csvdiff

patch = csvdiff.diff_files('a.csv', 'b.csv', ['id'])

# just show the changed rows
print(patch['changed'])

Using diff_records instead:

import csvdiff

records_a = [{'id': 1, 'name': 'Alice'},
             {'id': 2, 'name': 'Bob'}]
records_b = [{'id': 1, 'name': 'Alice'},
             {'id': 2, 'name': 'Jeff'}]

patch = csvdiff.diff_records(records_a, records_b, ['id'])
print(patch['changed'])
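
If your records start life as CSV text rather than Python literals, one way to build them is with the standard library's `csv.DictReader`. This is a sketch under a few assumptions: `load_records` is a hypothetical helper, all values come out as strings (unlike the integer ids above), and the csvdiff call is skipped when the package is not installed:

```python
import csv
import io

try:
    import csvdiff  # optional: the loading step below does not need it
except ImportError:
    csvdiff = None

A_CSV = "id,name\n1,Alice\n2,Bob\n"
B_CSV = "id,name\n1,Alice\n2,Jeff\n"


def load_records(text):
    """Hypothetical helper: parse CSV text into a list of plain dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


records_a = load_records(A_CSV)
records_b = load_records(B_CSV)

if csvdiff is not None:
    # Same call as in the example above, now fed from parsed CSV text.
    patch = csvdiff.diff_records(records_a, records_b, ['id'])
    print(patch['changed'])
```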

See the matching patch_file and patch_records functions for working with patches.

License

BSD license
