When you first dive into your dataset, you may be surprised to find that some data simply is not there at all.

In [1]:
import pandas as pd

Let's review some methods that Pandas provides for dealing with missing data.

First, let me create a df with temperature measurements.

In [2]:
temps = pd.DataFrame({"sequence": [1, 2, 3, 4, 5],
                      "measurement type": ['actual', 'actual', 'actual', None, 'estimated'],
                      "temperature_f": [67.24, 84.56, 91.61, None, 49.64]
                      })
temps

Unnamed: 0,sequence,measurement type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,,
4,5,estimated,49.64


I intentionally added some 'None' values to the df.

Note the two missing values in sequence number four.

First things first: a quick way to identify all missing values in your df is to call 'isna'.

In [3]:
temps.isna()

Unnamed: 0,sequence,measurement type,temperature_f
0,False,False,False
1,False,False,False
2,False,False,False
3,False,True,True
4,False,False,False
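
On a bigger df you will not want to eyeball a grid of True and False. Here is a minimal sketch of how I would summarize the mask instead, rebuilding the same toy df so the cell runs on its own. The names `null_counts` and `rows_with_nulls` are just mine for illustration.

```python
import pandas as pd

# Same toy data as the temps df above
temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement type": ["actual", "actual", "actual", None, "estimated"],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# True counts as 1, so summing the boolean mask counts nulls per column
null_counts = temps.isna().sum()
print(null_counts)

# Boolean indexing keeps only the rows that have at least one null
rows_with_nulls = temps[temps.isna().any(axis=1)]
print(rows_with_nulls)
```

Here that gives one null in each of 'measurement type' and 'temperature_f', and a single offending row (sequence four).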


Generally, the default parameters in Pandas functions are built to handle null values.
For example, sums treat nulls as zero, and means ignore null values by default.
Let me show you an example using a cumulative sum down my df.

In [5]:
temps['temperature_f'].cumsum() #'cum' yes, I know LOL

0     67.24
1    151.80
2    243.41
3       NaN
4    293.05
Name: temperature_f, dtype: float64

By default, the cumulative sum skips nulls: the null row itself stays NaN, but the running total picks up again afterwards.

Now, if I set 'skipna' equal to false, the cumulative sum will return null for every result from the first null onward.

In [6]:
temps['temperature_f'].cumsum(skipna=False)

0     67.24
1    151.80
2    243.41
3       NaN
4       NaN
Name: temperature_f, dtype: float64
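
The same 'skipna' switch exists on plain sums and means. As a quick sketch of the defaults I mentioned earlier, using the temperature column on its own (the variable names are mine, not part of Pandas):

```python
import pandas as pd

# The temperature column from the temps df, with its one missing value
temperature_f = pd.Series([67.24, 84.56, 91.61, None, 49.64], name="temperature_f")

# Defaults: the null is skipped, so the sum acts as if it were zero
# and the mean is taken over the four non-null values only
total_skip = temperature_f.sum()    # about 293.05
mean_skip = temperature_f.mean()    # about 73.26

# With skipna=False a single null turns the whole result into NaN
total_strict = temperature_f.sum(skipna=False)
mean_strict = temperature_f.mean(skipna=False)

print(total_skip, mean_skip, total_strict, mean_strict)
```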

One case where you will need to be mindful of how Pandas treats nulls is when aggregating your data using 'groupby'.

The default behavior is to exclude any records with no value in any of the dimensions you are grouping by.
Here is an example.

In [7]:
temps.groupby(by=['measurement type']).max()

Unnamed: 0_level_0,sequence,temperature_f
measurement type,Unnamed: 1_level_1,Unnamed: 2_level_1
actual,3,91.61
estimated,5,49.64


Notice the entry with no measurement type was not included.

To prevent the group by from dropping nulls, pass 'dropna' equal to false.

In [8]:
temps.groupby(by=['measurement type'], dropna=False).max()

Unnamed: 0_level_0,sequence,temperature_f
measurement type,Unnamed: 1_level_1,Unnamed: 2_level_1
actual,3,91.61
estimated,5,49.64
,4,
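
A quick sanity check I like here is comparing group sizes with and without 'dropna', to see how many rows the default would silently leave out. This is only a sketch on the same toy data, and `counts_default` and `counts_keep` are names I made up.

```python
import pandas as pd

temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement type": ["actual", "actual", "actual", None, "estimated"],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# Default: rows whose grouping key is null are left out of every group
counts_default = temps.groupby("measurement type").size()

# dropna=False keeps them in a group of their own, keyed by NaN
counts_keep = temps.groupby("measurement type", dropna=False).size()

# The difference is the number of rows the default grouping ignored
missing_rows = counts_keep.sum() - counts_default.sum()
print(counts_default, counts_keep, missing_rows, sep="\n")
```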


Great!

Now let me review some methods to treat these nulls.

The most straightforward method is to simply drop records with nulls using 'dropna'.

This method is the simplest, but you should consider it carefully, because every dropped row also takes its valid values with it.

By default, 'dropna' drops any row that contains a null value in any column.

In [9]:
temps.dropna()

Unnamed: 0,sequence,measurement type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
4,5,estimated,49.64


Now, if you only want to drop rows with nulls in certain columns, you can use the 'subset' parameter.
A less common approach is to drop any columns that contain null values, which you can do by passing 'axis' equal to one in 'dropna'.

In [10]:
temps.dropna(axis=1)

Unnamed: 0,sequence
0,1
1,2
2,3
3,4
4,5
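
I did not run the 'subset' option mentioned above, so here is a rough sketch of it. In my temps df the only incomplete row is null in both columns, which would hide the difference, so I am adding a hypothetical sixth reading (made up purely for illustration) that has a temperature but no measurement type.

```python
import pandas as pd

# Toy data from above plus one made-up row (sequence 6) for illustration
readings = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5, 6],
    "measurement type": ["actual", "actual", "actual", None, "estimated", None],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64, 55.0],
})

# Only look at temperature_f when deciding which rows to drop
by_temp = readings.dropna(subset=["temperature_f"])

# Default behaviour for comparison: a null in any column drops the row
all_cols = readings.dropna()

print(by_temp)
print(all_cols)
```

With 'subset', sequence six survives because its temperature is present; the plain 'dropna' throws it away along with sequence four.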


Another method is to actually fill null values using 'fillna'.

To see this in action, I will fill the nulls with zeros.

In [11]:
temps.fillna(0)

Unnamed: 0,sequence,measurement type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,0,0.0
4,5,estimated,49.64


At first glance, this could be problematic.

Imagine if I want to calculate the mean for the temperature column. It would be heavily biased by the zero I just introduced.
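
To put a number on that bias, here is a small sketch comparing the two means on the same temperature values (variable names are my own):

```python
import pandas as pd

temperature_f = pd.Series([67.24, 84.56, 91.61, None, 49.64], name="temperature_f")

# Mean over the four real readings (the null is ignored by default)
mean_ignoring_nulls = temperature_f.mean()        # about 73.26

# Mean after filling with zero: same total, but divided by five
mean_with_zero = temperature_f.fillna(0).mean()   # about 58.61

print(mean_ignoring_nulls, mean_with_zero)
```

That is a drop of almost 15 degrees caused by a reading that never existed.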

Another, more nuanced approach is to use the 'pad' method, also known as forward fill.

This will carry over values from the prior row.

In [12]:
temps.fillna(method='pad')

Unnamed: 0,sequence,measurement type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,actual,91.61
4,5,estimated,49.64
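
One caveat: as far as I know, recent Pandas releases (2.1 onward, if I remember correctly) warn that the 'method' argument of 'fillna' is deprecated, so check this against your installed version. The sketch below uses 'ffill', which does the same forward fill on its own.

```python
import pandas as pd

temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement type": ["actual", "actual", "actual", None, "estimated"],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# Forward fill: each null takes the last non-null value above it
filled = temps.ffill()
print(filled)

# limit caps how many consecutive nulls get filled from one value,
# which helps avoid stretching a single reading across a long gap
filled_limited = temps.ffill(limit=1)
```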


Now, this method poses its own issues, largely because I've simply created data out of thin air.

Given the drop from 91 degrees to 50 degrees that you can see, you might expect sequence four to fall somewhere in the middle.

This brings me to the final method I will show you, called 'interpolate'.

While 'interpolate' allows for several different methods, the default approach will create a straight-line estimate for the missing temperature value.

In [13]:
temps.interpolate()

Unnamed: 0,sequence,measurement type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,,70.625
4,5,estimated,49.64


There it is!

Now the estimate lies halfway between the two values.
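
To double-check where 70.625 comes from, here is a small sketch that interpolates just the numeric column and compares it with the midpoint worked out by hand. I am sticking to the temperature column because, as far as I can tell, newer Pandas versions complain when 'interpolate' is called on a df that still holds text columns; treat that as an assumption to verify on your version.

```python
import pandas as pd

temperature_f = pd.Series([67.24, 84.56, 91.61, None, 49.64], name="temperature_f")

# Default method is linear: rows are treated as evenly spaced
estimate = temperature_f.interpolate()

# One missing row between two known values lands on their midpoint
manual = (91.61 + 49.64) / 2

print(estimate[3], manual)
```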

So, one final piece of advice before you get too far along analyzing your data.

BE F*K*N SURE TO CHECK FOR NULL VALUES, and put these methods to use!