# **Guided Lab 343.4.8 - How to Calculate the Difference Between DataFrame or Series Elements Across Rows/Columns**

## **Learning Objective:**

In this lab, you will learn how to use the Pandas **diff()** function to calculate the difference between rows and between columns.

By the end of this lab, learners will be able to use the **diff()** function to find the difference between elements of a DataFrame or Series.




## **Introduction:**
The Pandas diff() function returns the difference between rows or columns of your DataFrame. The 'periods' parameter sets how many rows/columns apart the two subtracted values are.

- The Pandas diff() function will difference your rows or columns. This means calculating the change in your row(s)/column(s) over a set number of periods. Put simply, diff() subtracts an earlier cell value from the current cell value: an earlier row in the same column by default, or an earlier column in the same row.

- diff() is very helpful when calculating rates of change. For example, if you have temperature readings per day, calculating the difference will tell you how the temperatures have changed Day-Over-Day.

- You can also think of this as a discrete version of taking the derivative (rate of change) of the data. This is also helpful when working with time series data and calculating Week-Over-Week changes.
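
As a minimal sketch of the idea in the bullets above, the snippet below differences a small Series of daily temperatures and compares the result with a manual subtraction using `shift()`. The temperature values and the variable names (`temps`, `day_over_day`, `manual`) are made up for illustration.

```python
import pandas as pd

# Hypothetical daily temperature readings (illustrative values only).
temps = pd.Series([70, 72, 68, 75], index=['Mon', 'Tues', 'Wed', 'Thurs'])

# diff() subtracts the previous value from each value.
day_over_day = temps.diff()

# The same calculation written out by hand: current value minus the
# value one position earlier.
manual = temps - temps.shift(1)

print(day_over_day)
# Mon      NaN
# Tues     2.0
# Wed     -4.0
# Thurs    7.0
# dtype: float64
```

The first entry is NaN because Monday has no earlier reading to subtract.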



## Syntax: `DataFrame.diff(periods=1, axis=0)`

**Diff Parameters:**

**periods** (Default=1): The number of periods to difference by. An easier way to think about this is, ‘how many rows back is the cell you subtract from each cell?’ With periods=1, each cell is differenced with the neighboring cell directly above it.

**axis** (Default=0): We usually talk about differencing down the rows (axis=0), but pandas also allows you to difference across columns (axis=1).
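
Two details of these parameters are easy to miss: `periods` also accepts negative values (each cell is then differenced with a later row instead of an earlier one), and `axis` accepts the labels `'index'` and `'columns'` as well as 0 and 1. The sketch below shows both on a tiny made-up frame; the name `small` and its values are assumptions for illustration only.

```python
import pandas as pd

# Tiny illustrative frame (values are made up for this sketch).
small = pd.DataFrame({'a': [1.0, 4.0, 9.0], 'b': [2.0, 3.0, 5.0]})

# Negative periods: each row minus the row BELOW it, so the last row is NaN.
forward = small.diff(periods=-1)

# axis='columns' is the same as axis=1: each column minus the column to
# its left, so the first column is NaN.
across = small.diff(axis='columns')

print(forward)
print(across)
```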

In this lab, we will start with three types of examples, then apply diff() to a single column and filter on the result:

- Default differencing

- Two Period Differencing

- Column Differencing


First, let's create our DataFrame

In [1]:
import numpy as np
import pandas as pd

In [11]:
np.random.seed(seed=42)
df = pd.DataFrame(
    data=np.random.normal(loc=70, scale=10, size=(7, 3)),
    columns=['San Francisco', 'San Diego', 'Los Angeles'],
    index=['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'],
)
df = df.round()
df

| | San Francisco | San Diego | Los Angeles |
|---|---|---|---|
| Mon | 75.0 | 69.0 | 76.0 |
| Tues | 85.0 | 68.0 | 68.0 |
| Wed | 86.0 | 78.0 | 65.0 |
| Thurs | 75.0 | 65.0 | 65.0 |
| Fri | 72.0 | 51.0 | 53.0 |
| Sat | 64.0 | 60.0 | 73.0 |
| Sun | 61.0 | 56.0 | 85.0 |


## **Example 1: Default differencing**
By default, Pandas will difference by 1 row. Let's see how this looks for our cities.

**Notice:** The first row of your resulting diff DataFrame will generally be **NaN**. This is because there is no earlier observation to difference it with. If you had periods=2, the first 2 rows would be NaN.

In [12]:
df.diff()

| | San Francisco | San Diego | Los Angeles |
|---|---|---|---|
| Mon | NaN | NaN | NaN |
| Tues | 10.0 | -1.0 | -8.0 |
| Wed | 1.0 | 10.0 | -3.0 |
| Thurs | -11.0 | -13.0 | 0.0 |
| Fri | -3.0 | -14.0 | -12.0 |
| Sat | -8.0 | 9.0 | 20.0 |
| Sun | -3.0 | -4.0 | 12.0 |
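
The leading NaN row often needs handling before further analysis. The sketch below shows two common options, `dropna()` and `fillna(0)`. It rebuilds the DataFrame from the rounded values shown earlier so that it runs on its own; the names `changes`, `changes_dropped`, and `changes_filled` are illustrative. Whether filling with zero is appropriate depends on your analysis, so treat that option as an assumption.

```python
import pandas as pd

# Rounded temperatures copied from the DataFrame created earlier in this lab.
df = pd.DataFrame(
    {'San Francisco': [75.0, 85.0, 86.0, 75.0, 72.0, 64.0, 61.0],
     'San Diego': [69.0, 68.0, 78.0, 65.0, 51.0, 60.0, 56.0],
     'Los Angeles': [76.0, 68.0, 65.0, 65.0, 53.0, 73.0, 85.0]},
    index=['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'],
)

changes = df.diff()

# Option 1: drop the row that has no previous observation.
changes_dropped = changes.dropna()

# Option 2: treat the missing first change as zero (an assumption that
# may not suit every analysis).
changes_filled = changes.fillna(0)

print(changes_dropped.shape)  # (6, 3)
```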


## **Example 2: Two Period Differencing**
Say instead of differencing your data by 1 period, you wanted to do it by 2 periods. To do this, set periods=2.

In [13]:
df.diff(periods=2)

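| | San Francisco | San Diego | Los Angeles |
|---|---|---|---|
| Mon | NaN | NaN | NaN |
| Tues | NaN | NaN | NaN |
| Wed | 11.0 | 9.0 | -11.0 |
| Thurs | -10.0 | -3.0 | -3.0 |
| Fri | -14.0 | -27.0 | -12.0 |
| Sat | -11.0 | -5.0 | 8.0 |
| Sun | -11.0 | 5.0 | 32.0 |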
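
A common point of confusion is that `diff(periods=2)` is not the same as calling `diff()` twice. The first compares each value with the value two rows earlier; the second takes the difference of the differences. The sketch below contrasts the two on the San Francisco column, with values copied from the DataFrame created earlier. The names `sf`, `two_period`, and `second_order` are illustrative.

```python
import pandas as pd

# San Francisco temperatures copied from the DataFrame created earlier.
sf = pd.Series(
    [75.0, 85.0, 86.0, 75.0, 72.0, 64.0, 61.0],
    index=['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'],
    name='San Francisco',
)

# Each value minus the value two rows earlier.
two_period = sf.diff(periods=2)

# Difference of the one-period differences (a second-order difference).
second_order = sf.diff().diff()

print(two_period.tolist())    # [nan, nan, 11.0, -10.0, -14.0, -11.0, -11.0]
print(second_order.tolist())  # [nan, nan, -9.0, -12.0, 8.0, -5.0, 5.0]
```

Both results start with two NaNs, but the remaining values differ.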


## **Example 3: Column Differencing**
Did you know you can also do column differencing? This would be helpful if your columns represent dates or other items you'd like to compare.

To do this, set axis=1.

In [14]:
df.diff(periods=1, axis=1)

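| | San Francisco | San Diego | Los Angeles |
|---|---|---|---|
| Mon | NaN | -6.0 | 7.0 |
| Tues | NaN | -17.0 | 0.0 |
| Wed | NaN | -8.0 | -13.0 |
| Thurs | NaN | -10.0 | 0.0 |
| Fri | NaN | -21.0 | 2.0 |
| Sat | NaN | -4.0 | 13.0 |
| Sun | NaN | -5.0 | 29.0 |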
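
With periods=1 and axis=1, each column is compared only with the column immediately to its left. To compare a specific pair of columns, subtracting them directly is often simpler. The sketch below rebuilds the DataFrame from the rounded values shown earlier and shows both cases; the names `sd_minus_sf` and `la_minus_sf` are illustrative.

```python
import pandas as pd

# Rounded temperatures copied from the DataFrame created earlier in this lab.
df = pd.DataFrame(
    {'San Francisco': [75.0, 85.0, 86.0, 75.0, 72.0, 64.0, 61.0],
     'San Diego': [69.0, 68.0, 78.0, 65.0, 51.0, 60.0, 56.0],
     'Los Angeles': [76.0, 68.0, 65.0, 65.0, 53.0, 73.0, 85.0]},
    index=['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'],
)

col_diff = df.diff(axis=1)

# Neighboring columns: same values as the 'San Diego' column of col_diff.
sd_minus_sf = df['San Diego'] - df['San Francisco']

# Non-adjacent columns are not paired by diff(periods=1, axis=1), so
# subtract them directly.
la_minus_sf = df['Los Angeles'] - df['San Francisco']

print(la_minus_sf.tolist())  # [1.0, -17.0, -21.0, -10.0, -19.0, 9.0, 24.0]
```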


## **Example 4: Find the Difference Between Each Row and the Previous Row**

Suppose we have the following pandas DataFrame:

In [15]:
#create DataFrame
df_2 = pd.DataFrame({'period': [1, 2, 3, 4, 5, 6, 7, 8],
                   'sales': [12, 14, 15, 15, 18, 20, 19, 24],
                   'returns': [2, 2, 3, 3, 5, 4, 4, 6]})

#view DataFrame
df_2


| | period | sales | returns |
|---|---|---|---|
| 0 | 1 | 12 | 2 |
| 1 | 2 | 14 | 2 |
| 2 | 3 | 15 | 3 |
| 3 | 4 | 15 | 3 |
| 4 | 5 | 18 | 5 |
| 5 | 6 | 20 | 4 |
| 6 | 7 | 19 | 4 |
| 7 | 8 | 24 | 6 |


The following code shows how to find the difference between the sales value in each row and the sales value in the previous row:

In [16]:
#add new column to represent sales differences between each row
df_2['sales_diff'] = df_2['sales'].diff()

#view DataFrame
df_2

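| | period | sales | returns | sales_diff |
|---|---|---|---|---|
| 0 | 1 | 12 | 2 | NaN |
| 1 | 2 | 14 | 2 | 2.0 |
| 2 | 3 | 15 | 3 | 1.0 |
| 3 | 4 | 15 | 3 | 0.0 |
| 4 | 5 | 18 | 5 | 3.0 |
| 5 | 6 | 20 | 4 | 2.0 |
| 6 | 7 | 19 | 4 | -1.0 |
| 7 | 8 | 24 | 6 | 5.0 |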


Note that we can also find the difference between the current row and a row several rows earlier. For example, the following code shows how to find the difference between each current row and the row that occurred three rows earlier:

In [17]:
#add new column to represent sales differences between current row and 3 rows earlier
df_2['sales_diff'] = df_2['sales'].diff(periods=3)

#view DataFrame
df_2

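| | period | sales | returns | sales_diff |
|---|---|---|---|---|
| 0 | 1 | 12 | 2 | NaN |
| 1 | 2 | 14 | 2 | NaN |
| 2 | 3 | 15 | 3 | NaN |
| 3 | 4 | 15 | 3 | 3.0 |
| 4 | 5 | 18 | 5 | 4.0 |
| 5 | 6 | 20 | 4 | 5.0 |
| 6 | 7 | 19 | 4 | 4.0 |
| 7 | 8 | 24 | 6 | 6.0 |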
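
The code above reuses the `sales_diff` column, so the three-row difference replaces the one-row difference. If you want to keep both, one option is to store them under separate names. The sketch below uses `assign()` for this; the column names `sales_diff_1` and `sales_diff_3` and the variable `result` are hypothetical, chosen only for this example.

```python
import pandas as pd

df_2 = pd.DataFrame({'period': [1, 2, 3, 4, 5, 6, 7, 8],
                     'sales': [12, 14, 15, 15, 18, 20, 19, 24],
                     'returns': [2, 2, 3, 3, 5, 4, 4, 6]})

# assign() returns a new DataFrame and leaves df_2 unchanged.
# Column names here are hypothetical, chosen for this sketch.
result = df_2.assign(
    sales_diff_1=df_2['sales'].diff(),
    sales_diff_3=df_2['sales'].diff(periods=3),
)

print(result)
```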


## **Example 5: Find Difference Based on Condition**

We can also filter the DataFrame to show rows where the difference between the current row and the previous row is less than or greater than some value.

For example, the following code returns only the rows where the sales value in the current row is less than the sales value in the previous row:

In [19]:
#create DataFrame
df_3 = pd.DataFrame({'period': [1, 2, 3, 4, 5, 6, 7, 8],
                   'sales': [12, 14, 15, 13, 18, 20, 19, 24],
                   'returns': [2, 2, 3, 3, 5, 4, 4, 6]})

#find difference between each current row and the previous row
df_3['sales_diff'] = df_3['sales'].diff()

#filter for rows where difference is less than zero
df_3 = df_3[df_3['sales_diff'] < 0]

#view DataFrame
df_3


| | period | sales | returns | sales_diff |
|---|---|---|---|---|
| 3 | 4 | 13 | 3 | -2.0 |
| 6 | 7 | 19 | 4 | -1.0 |
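
The same pattern works for a "greater than some value" condition. The sketch below keeps the original DataFrame intact and filters with a boolean mask. The cutoff of 2 and the names `threshold` and `big_increases` are assumptions made for illustration.

```python
import pandas as pd

df_3 = pd.DataFrame({'period': [1, 2, 3, 4, 5, 6, 7, 8],
                     'sales': [12, 14, 15, 13, 18, 20, 19, 24],
                     'returns': [2, 2, 3, 3, 5, 4, 4, 6]})

# Difference between each row's sales and the previous row's sales.
sales_diff = df_3['sales'].diff()

# Hypothetical threshold: keep rows where sales rose by more than 2.
threshold = 2
big_increases = df_3[sales_diff > threshold]

print(big_increases)
#    period  sales  returns
# 4       5     18        5
# 7       8     24        6
```

The first row is excluded automatically, because comparing NaN with any number evaluates to False.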
