- Author: Ben Du
- Date: 2020-07-11 21:25:43
- Title: Compare Two Dataframes Using Datacompy
- Slug: compare-two-dataframes-using-datacompy
- Category: Computer Science
- Tags: Computer Science, DataCompy, data, comparison, compare, big data, Spark, Python, DataFrame, pandas
- Modified: 2020-08-11 21:25:43


In [3]:
!pip3 install datacompy

Collecting datacompy
  Downloading https://files.pythonhosted.org/packages/73/23/d51f2fe4e41fec2820db32b428683810251e6abefcb003f74a397c8b9221/datacompy-0.7.1-py3-none-any.whl
Installing collected packages: datacompy
Successfully installed datacompy-0.7.1


## Comparing Two pandas DataFrames

In [4]:
from io import StringIO
import pandas as pd
import datacompy

data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001238,1.05,Lucille Bluth,,2017-01-01
10000001238,1.05,Loose Seal Bluth,,2017-01-01
"""

data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))

In [5]:
df1

|   | acct_id | dollar_amt | name | float_fld | date_fld |
|---|---|---|---|---|---|
| 0 | 10000001234 | 123.45 | George Maharis | 14530.1555 | 2017-01-01 |
| 1 | 10000001235 | 0.45 | Michael Bluth | 1.0 | 2017-01-01 |
| 2 | 10000001236 | 1345.0 | George Bluth | NaN | 2017-01-01 |
| 3 | 10000001237 | 123456.0 | Bob Loblaw | 345.12 | 2017-01-01 |
| 4 | 10000001238 | 1.05 | Lucille Bluth | NaN | 2017-01-01 |
| 5 | 10000001238 | 1.05 | Loose Seal Bluth | NaN | 2017-01-01 |


In [6]:
df2

|   | acct_id | dollar_amt | name | float_fld |
|---|---|---|---|---|
| 0 | 10000001234 | 123.4 | George Michael Bluth | 14530.155 |
| 1 | 10000001235 | 0.45 | Michael Bluth | NaN |
| 2 | 10000001236 | 1345.0 | George Bluth | 1.0 |
| 3 | 10000001237 | 123456.0 | Robert Loblaw | 345.12 |
| 4 | 10000001238 | 1.05 | Loose Seal Bluth | 111.0 |
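
Before running a comparison, it can be useful to check by hand which columns the two frames share and whether the join key is unique. The following is a minimal pandas-only sketch, not part of datacompy. The frames `left` and `right` and their values are made up for illustration and only mirror some of the column names used above.

```python
import pandas as pd

# Hypothetical miniature frames (illustrative values, not the data above).
left = pd.DataFrame(
    {
        "acct_id": [1, 2, 3, 3],
        "dollar_amt": [1.0, 2.0, 3.0, 3.0],
        "date_fld": ["2017-01-01"] * 4,
    }
)
right = pd.DataFrame({"acct_id": [1, 2, 3], "dollar_amt": [1.0, 2.5, 3.0]})

# Columns shared by both frames, and columns present in only one of them.
common_cols = [c for c in left.columns if c in right.columns]
left_only_cols = [c for c in left.columns if c not in right.columns]
right_only_cols = [c for c in right.columns if c not in left.columns]

# A join key that repeats cannot pair rows one-to-one without extra care.
left_has_dupes = bool(left["acct_id"].duplicated().any())
right_has_dupes = bool(right["acct_id"].duplicated().any())

print("common columns:", common_cols)
print("only in left:", left_only_cols)
print("only in right:", right_only_cols)
print("duplicate keys in left/right:", left_has_dupes, right_has_dupes)
```

A repeated key is worth knowing about up front, because the way duplicates get paired affects which rows are reported as matched.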


In [9]:
compare = datacompy.Compare(
    df1,
    df2,
    join_columns='acct_id',  # You can also specify a list of columns
    abs_tol=0.0001,
    rel_tol=0,
    df1_name='original',
    df2_name='new'
)
print(compare.report())

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0  original        5     6
1       new        4     5

Column Summary
--------------

Number of columns in common: 4
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0

Row Summary
-----------

Matched on: acct_id
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in original but not in new: 1
Number of rows in new but not in original: 0

Number of rows with some compared columns unequal: 5
Number of rows with all compared columns equal: 0

Column Comparison
-----------------

Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 8

Columns with Unequal Values or Types
------------------------------------

       Column original dtype new dtype  # Unequal  Max Dif
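
To make the row summary and the tolerance settings in the report less opaque, here is a rough pandas/NumPy sketch of one way such a comparison could be done by hand. It is an illustration under assumptions, not datacompy's actual implementation: duplicate keys are paired in file order with `groupby().cumcount()`, numeric columns are checked with `numpy.isclose` (treating two NaNs as equal), and the helper column `_dup` is a made-up name.

```python
from io import StringIO

import numpy as np
import pandas as pd

data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001238,1.05,Lucille Bluth,,2017-01-01
10000001238,1.05,Loose Seal Bluth,,2017-01-01
"""
data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""
original = pd.read_csv(StringIO(data1))
new = pd.read_csv(StringIO(data2))

key = "acct_id"
# Assumption: number repeated keys in file order so duplicates pair up 1:1.
original["_dup"] = original.groupby(key).cumcount()
new["_dup"] = new.groupby(key).cumcount()

merged = original.merge(
    new,
    on=[key, "_dup"],
    how="outer",
    suffixes=("_original", "_new"),
    indicator=True,  # adds a "_merge" column: both / left_only / right_only
)
both = merged[merged["_merge"] == "both"]
n_both = len(both)
n_original_only = int((merged["_merge"] == "left_only").sum())
n_new_only = int((merged["_merge"] == "right_only").sum())

abs_tol, rel_tol = 0.0001, 0
unequal = {}
for col in ["dollar_amt", "float_fld"]:
    # Values are close when |a - b| <= abs_tol + rel_tol * |b|.
    close = np.isclose(
        both[f"{col}_original"].to_numpy(),
        both[f"{col}_new"].to_numpy(),
        rtol=rel_tol,
        atol=abs_tol,
        equal_nan=True,
    )
    unequal[col] = int((~close).sum())
# String columns are compared for exact equality.
unequal["name"] = int((both["name_original"] != both["name_new"]).sum())

print("rows in common:", n_both)
print("rows only in original:", n_original_only)
print("rows only in new:", n_new_only)
print("unequal values per column:", unequal)
print("total unequal values:", sum(unequal.values()))
```

On this data the sketch finds 5 rows in common, 1 row only in the original, and 8 unequal values in total, which agrees with the counts in the report. Pairing the duplicated `acct_id` rows in a different order would change the count for `name`.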

## Comparing Two Spark DataFrames

In [None]:
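import datacompy
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("Data_Comparison")
    .enableHiveSupport()
    .getOrCreate()
)
# Drop the columns that should be left out of the comparison.
prod = spark.read.parquet("/path/to/prod").drop("part_id", "local_rank")
dev = spark.read.parquet("/path/to/dev")
comparison = datacompy.SparkCompare(
    spark,
    prod,  # base DataFrame
    dev,  # DataFrame compared against the base
    join_columns=["site_id", "item_id"],
    cache_intermediates=True,  # cache intermediate DataFrames for reuse
    match_rates=True,  # include per-column match rates in the report
)
comparison.report()  # prints the report to stdout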

## References

https://capitalone.github.io/datacompy/

https://github.com/capitalone/datacompy

https://capitalone.github.io/datacompy/pandas_usage.html