The following is from [this article](https://medium.com/towards-data-science/3-easy-ways-to-compare-two-pandas-dataframes-b2a18169c876) on Medium.

DataFrame comparison is most commonly used in data validation, data change detection, testing, and debugging. So, it is important to know how to compare two datasets quickly and easily.

# 0. Data Preparation

In [1]:
import pandas as pd

In [2]:
df = pd.DataFrame(
    {
        "device_id": ["D475", "D175", "D200", "D375", "M475", "M400", "M250", "A150"],
        "device_temperature": [35.4, 45.2, 59.3, 49.3, 32.2, 35.7, 36.8, 34.9],
        "device_status": [
            "Inactive",
            "Active",
            "Active",
            "Active",
            "Active",
            "Inactive",
            "Active",
            "Active",
        ],
    }
)

df1 = pd.DataFrame(
    {
        "device_id": ["D475", "D175", "D200", "D375", "M475", "M400", "M250", "A150"],
        "device_temperature": [39.4, 45.2, 29.3, 49.3, 32.2, 35.7, 36.8, 24.9],
        "device_status": [
            "Active",
            "Active",
            "Inactive",
            "Active",
            "Active",
            "Inactive",
            "Active",
            "Inactive",
        ],
    }
)

In [3]:
df

,device_id,device_temperature,device_status
0,D475,35.4,Inactive
1,D175,45.2,Active
2,D200,59.3,Active
3,D375,49.3,Active
4,M475,32.2,Active
5,M400,35.7,Inactive
6,M250,36.8,Active
7,A150,34.9,Active


In [4]:
df1

,device_id,device_temperature,device_status
0,D475,39.4,Active
1,D175,45.2,Active
2,D200,29.3,Inactive
3,D375,49.3,Active
4,M475,32.2,Active
5,M400,35.7,Inactive
6,M250,36.8,Active
7,A150,24.9,Inactive


# 1. Compare Pandas DataFrames using `equals()`

Pandas offers a method called `pandas.DataFrame.equals()`, which compares two objects and checks whether they are identical.

Two DataFrames are identical only when they have the **same shape** and the **same data** at the **same position**. In other words, `equals()` checks both the content and the structure of the DataFrames. Corresponding columns must also have the same dtype, and NaNs in the same location are treated as equal.

Because `equals()` is available on both DataFrame and Series objects, you can use it to compare entire DataFrames in one go or to compare a single DataFrame column, i.e. a Series.

In [5]:
# Case 1: Compare two Series having the same shape, same data at the same position
series1 = pd.Series([1, 2, 3, 4])
series2 = pd.Series([1, 2, 3, 4])
series1.equals(series2)

True

In [6]:
# Case 2: Compare two Series having different shapes
series1 = pd.Series([1, 2, 3, 4, 5, 6])
series2 = pd.Series([1, 2, 4, 4])
series1.equals(series2)

False

In [7]:
# Case 3: Compare two Series having the same shape but different order of the data
series1 = pd.Series([1, 2, 3, 4])
series2 = pd.Series([2, 1, 3, 4])
series1.equals(series2)

False

In [8]:
# Case 4: Compare two Series having the same shape but different data
series1 = pd.Series([1, 2, 3, 4])
series2 = pd.Series([1, 2, 4, 4])
series1.equals(series2)

False
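Beyond the four cases above, two behaviours of `equals()` can be surprising: it is sensitive to dtype, and it treats NaNs in the same position as equal. The following is a minimal sketch with made-up Series (the variable names are illustrative, not from the article), contrasted with an element-wise `==` check:

```python
import numpy as np
import pandas as pd

# Same values, different dtypes (int64 vs float64)
ints = pd.Series([1, 2, 3])
floats = pd.Series([1.0, 2.0, 3.0])
print(ints.equals(floats))      # False: the dtypes differ
print((ints == floats).all())   # True: element-wise values match

# NaN in the same position in both Series
with_nan_a = pd.Series([1.0, np.nan])
with_nan_b = pd.Series([1.0, np.nan])
print(with_nan_a.equals(with_nan_b))      # True: equals() treats aligned NaNs as equal
print((with_nan_a == with_nan_b).all())   # False: NaN == NaN is False element-wise
```

So if `equals()` returns `False` for data that looks identical, checking the dtypes is a reasonable first step.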

In [9]:
df.equals(df1)

False

As expected, it returns `False`, because the DataFrames `df` and `df1` differ in some values.

However, to find out exactly which columns contain these differences, you can use the `equals()` method on individual columns, like this:

In [10]:
df["device_id"].equals(df1["device_id"])

True

In [11]:
df["device_temperature"].equals(df1["device_temperature"])

False

Alternatively, when your DataFrames have many columns, you can write a simple `for` loop like the one below to find out which columns differ between the two DataFrames.

In [12]:
print("List of the columns having different values in the DataFrames df1 and df \n")
for column in df.columns:
    # Print only the columns whose values differ between df and df1
    if not df[column].equals(df1[column]):
        print(column)

List of the columns having different values in the DataFrames df1 and df 

device_temperature
device_status
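The loop above can also be wrapped in a small reusable helper. This is only a sketch: the function name `differing_columns` and the two tiny example frames are hypothetical, and the helper assumes you only care about columns present in both DataFrames.

```python
import pandas as pd


def differing_columns(left, right):
    """Return the names of shared columns whose values differ according to equals()."""
    shared = [col for col in left.columns if col in right.columns]
    return [col for col in shared if not left[col].equals(right[col])]


# Hypothetical mini-frames, not the article's data
left = pd.DataFrame(
    {"device_id": ["D475", "D175"], "device_temperature": [35.4, 45.2]}
)
right = pd.DataFrame(
    {"device_id": ["D475", "D175"], "device_temperature": [39.4, 45.2]}
)

print(differing_columns(left, right))  # ['device_temperature']
```

Returning a list instead of printing makes the result easier to reuse, for example in a test assertion.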


This approach tells you which columns hold different values in the two DataFrames.

However, it **does not tell you the exact location**, i.e. the rows where the change has occurred. And that’s where the next method comes in.

# 2. Compare Pandas DataFrames using `concat()`

You probably already know that the `concat()` function in pandas combines two or more DataFrames along a specified axis.

For example, when you concatenate the DataFrames `df` and `df1` along the default axis, i.e. stacking one DataFrame below the other, you get all the rows from both DataFrames, as shown here:

In [13]:
df2 = pd.concat([df, df1])
df2

,device_id,device_temperature,device_status
0,D475,35.4,Inactive
1,D175,45.2,Active
2,D200,59.3,Active
3,D375,49.3,Active
4,M475,32.2,Active
5,M400,35.7,Inactive
6,M250,36.8,Active
7,A150,34.9,Active
0,D475,39.4,Active
1,D175,45.2,Active
2,D200,29.3,Inactive
3,D375,49.3,Active
4,M475,32.2,Active
5,M400,35.7,Inactive
6,M250,36.8,Active
7,A150,24.9,Inactive


Now, your task is simply to drop the rows that are common to both DataFrames, i.e. duplicated. You can do that with another pandas DataFrame method, `drop_duplicates()`, as shown below.

In [14]:
df3 = df2.drop_duplicates(keep=False)
df3

,device_id,device_temperature,device_status
0,D475,35.4,Inactive
2,D200,59.3,Active
7,A150,34.9,Active
0,D475,39.4,Active
2,D200,29.3,Inactive
7,A150,24.9,Inactive


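The parameter `keep=False` drops every copy of a duplicated row, so only the rows that differ between the two DataFrames remain. With the default `keep="first"`, one copy of each common row would be retained, so don’t forget to pass it.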
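To see what `keep=False` changes, the sketch below runs `drop_duplicates()` both ways on a small, made-up stacked frame (the data and variable names are illustrative only):

```python
import pandas as pd

# Hypothetical result of stacking two 2-row frames; rows 1 and 3 are identical
stacked = pd.DataFrame(
    {
        "device_id": ["D475", "D175", "D475", "D175"],
        "device_temperature": [35.4, 45.2, 39.4, 45.2],
    },
    index=[0, 1, 0, 1],
)

kept_first = stacked.drop_duplicates()            # default keep="first"
kept_none = stacked.drop_duplicates(keep=False)   # drop every copy of a duplicate

print(len(kept_first))  # 3: one copy of the common D175 row survives
print(len(kept_none))   # 2: only the two differing D475 rows remain
```

Note that `drop_duplicates()` compares column values only, so rows with different index labels but identical values still count as duplicates.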

Although this method keeps the index from both the DataFrames, it is unclear which rows belong to which DataFrame.
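One possible way to reduce this ambiguity while staying with `concat()` is to label each input through its `keys` argument, which adds an outer index level. The following is a sketch under that assumption; the frames `old` and `new` and the level names `source` and `row` are hypothetical, not part of the article:

```python
import pandas as pd

# Hypothetical mini-frames, not the article's data
old = pd.DataFrame(
    {"device_id": ["D475", "D175"], "device_temperature": [35.4, 45.2]}
)
new = pd.DataFrame(
    {"device_id": ["D475", "D175"], "device_temperature": [39.4, 45.2]}
)

# keys= tags every row with the frame it came from (outer index level)
stacked = pd.concat([old, new], keys=["old", "new"], names=["source", "row"])

# Duplicates are detected on column values only, so the extra index level
# does not prevent the common rows from being dropped
changed = stacked.drop_duplicates(keep=False)
print(changed)
```

Each remaining row now carries a `source` label, although the old and new values still sit on separate rows rather than side by side.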

The next method avoids this confusion altogether.

# 3. Compare Pandas DataFrames using `compare()`

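The `DataFrame.compare()` method, available since pandas 1.1.0, aligns two identically labeled DataFrames and returns only the values that differ. The result uses MultiIndex columns to show the values from both DataFrames side by side, which helps you see the columns and positions where the values have changed.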

In [15]:
df4 = df.compare(df1)
df4

,device_temperature,device_temperature,device_status,device_status
,self,other,self,other
0,35.4,39.4,Inactive,Active
2,59.3,29.3,Active,Inactive
7,34.9,24.9,Active,Inactive


In the output of `compare()`, `self` refers to the DataFrame the method is called on, `df`, and `other` refers to the DataFrame passed as the argument, `df1`.

So, this function directly gives you the row indexes and column names, i.e. the exact positions where the values have changed.
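In the example above, every changed row differs in both columns. When a row differs in only one column, `compare()` fills the unchanged cells with NaN by default. The sketch below, using two made-up frames (`old` and `new` are hypothetical names), shows that behaviour along with the `keep_equal` and `align_axis` parameters:

```python
import pandas as pd

# Hypothetical mini-frames, not the article's data
old = pd.DataFrame(
    {
        "device_temperature": [35.4, 45.2, 59.3],
        "device_status": ["Inactive", "Active", "Active"],
    }
)
new = pd.DataFrame(
    {
        "device_temperature": [39.4, 45.2, 59.3],
        "device_status": ["Active", "Active", "Inactive"],
    }
)

# Default: only differing rows and columns; unchanged cells appear as NaN
diff = old.compare(new)

# keep_equal=True shows the actual values of unchanged cells instead of NaN
diff_with_equal = old.compare(new, keep_equal=True)

# align_axis=0 stacks "self" and "other" as rows instead of columns
diff_stacked = old.compare(new, align_axis=0)

print(diff)
print(diff_with_equal)
print(diff_stacked)
```

Row 2 changes only in `device_status`, so its `device_temperature` cells are NaN in `diff` and 59.3 in `diff_with_equal`.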