## Handling Missing Data with Pandas

In this notebook, we will explore how to handle missing data using the Pandas library in Python. Missing data is a common issue in data analysis, and Pandas provides several methods to deal with it effectively. We will cover several techniques for dealing with missing values during data cleaning and preparation. This notebook focuses on `null` values, but we will also touch on other representations of missing data, such as `NaN` (Not a Number) and `None`.

In [83]:
import numpy as np
import pandas as pd

### Pandas utility functions

Similarly to `numpy`, `pandas` also has a few utility functions to identify and detect null values:

In [84]:
pd.isnull(np.nan)

True

In [85]:
pd.isnull(None)

True

In [86]:
pd.isna(np.nan)

True

In [87]:
pd.isna(None)

True

Their negated counterparts also exist:

In [88]:
pd.notnull(None)

False

In [89]:
pd.notnull(np.nan)

False

In [90]:
pd.notna(np.nan)

False

In [91]:
pd.notnull(3)

True
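
The results above suggest that `isnull`/`isna` report the same thing, and that `notnull`/`notna` are their negation. A minimal sketch that checks this over a handful of values; the list `candidates` is an illustrative assumption and not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Illustrative values (an assumption for this sketch): three missing markers
# followed by three ordinary values.
candidates = [np.nan, None, pd.NaT, 0, "", 3.5]

for value in candidates:
    missing = pd.isna(value)
    # isnull mirrors isna, and notna/notnull are the logical negation.
    assert pd.isnull(value) == missing
    assert pd.notna(value) == (not missing)
    assert pd.notnull(value) == (not missing)

flags = [bool(pd.isna(value)) for value in candidates]
print(flags)  # [True, True, True, False, False, False]
```

Note that `0` and the empty string are ordinary values here; only `NaN`, `None`, and `NaT` are reported as missing.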

These functions also work with `Series` and `DataFrames`:

In [92]:
pd.isnull(pd.Series([1, np.nan, 7]))

0    False
1     True
2    False
dtype: bool

In [93]:
pd.notnull(pd.Series([1, np.nan, 7]))

0     True
1    False
2     True
dtype: bool

In [94]:
pd.isnull(
    pd.DataFrame(
        {
            "Column A": [1, np.nan, 7],
            "Column B": [np.nan, 2, 3],
            "Column C": [np.nan, 2, np.nan],
        }
    )
)

   Column A  Column B  Column C
0     False      True      True
1      True     False     False
2     False     False      True


---

### Pandas Operations with Missing Values

Pandas manages missing values more gracefully than `numpy`. `NaN`s no longer behave like "viruses": aggregations such as `count`, `sum`, and `mean` skip them by default:

In [95]:
pd.Series([1, 2, np.nan]).count()

2

Notice that with `numpy` a single `nan` value is enough to make the whole result `nan`, while `pandas` returns the result of the operation over the remaining values. This is very useful when working with large datasets, because we can run calculations without handling the missing values first.

In [96]:
pd.Series([1, 2, np.nan]).sum()

3.0

In [97]:
pd.Series([2, 2, np.nan]).mean()

2.0
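
To make the contrast concrete, here is a small sketch that runs the same reduction through NumPy and through pandas. The variable names are illustrative; `skipna` is the pandas parameter that controls whether `NaN` is skipped:

```python
import numpy as np
import pandas as pd

values = [1, 2, np.nan]

# A plain NumPy reduction propagates the NaN.
numpy_total = np.array(values).sum()

# The pandas reduction skips NaN by default (skipna=True).
pandas_total = pd.Series(values).sum()

# Opting out of the skipping restores the NumPy-like behavior.
strict_total = pd.Series(values).sum(skipna=False)

print(numpy_total, pandas_total, strict_total)  # nan 3.0 nan
```

Passing `skipna=False` can be useful when a missing value should invalidate the result rather than be silently ignored.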

---

### Filtering missing data

As we saw with `numpy`, we can combine boolean selection with `pd.isnull` or `pd.notnull` to filter out `nan` and `null` values:

In [98]:
s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

In [99]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

In [100]:
pd.notnull(s)

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [101]:
pd.isnull(s)

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [102]:
pd.notnull(s).sum()

4

In [103]:
pd.isnull(s).sum()

2

In [104]:
# s[s.notnull()]
s[pd.notnull(s)]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

Both `notnull` and `isnull` are also methods of `Series` and `DataFrame`, so we can call them directly on the object:

In [105]:
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [106]:
s.notnull()

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [107]:
s[s.notnull()]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64
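
The same boolean-mask idea carries over to a `DataFrame` when only one column matters. A minimal sketch, using a hypothetical `people` frame invented for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
people = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cleo"],
        "age": [31, np.nan, 27],
    }
)

# Keep only the rows whose "age" is present.
with_age = people[people["age"].notnull()]

print(with_age["name"].tolist())  # ['Ana', 'Cleo']
```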

---

### Dropping null values

Boolean selection + `notnull()` seems a little bit verbose and repetitive. And as we said before: any repetitive task will probably have a better, more DRY way. In this case, we can use the `dropna` method:

In [108]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

In [109]:
s.dropna()

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

---

### Dropping null values on DataFrames

You saw how simple it is to drop `na`s with a `Series`. But with `DataFrames`, there will be a few more things to consider, because you can't drop single values. You can only drop entire columns or rows. Let's start with a sample DataFrame:

In [124]:
df = pd.DataFrame(
    {
        "Column A": [1, np.nan, 30, np.nan],
        "Column B": [2, 8, 31, np.nan],
        "Column C": [np.nan, 9, 32, 100],
        "Column D": [5, 8, 34, 110],
    }
)

In [122]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [125]:
df.shape

(4, 4)

In [126]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Column A  2 non-null      float64
 1   Column B  3 non-null      float64
 2   Column C  3 non-null      float64
 3   Column D  4 non-null      int64  
dtypes: float64(3), int64(1)
memory usage: 260.0 bytes


In [127]:
df.isnull()

   Column A  Column B  Column C  Column D
0     False     False      True     False
1      True     False     False     False
2     False     False     False     False
3      True      True     False     False


In [128]:
df.isnull().sum()

Column A    2
Column B    1
Column C    1
Column D    0
dtype: int64

The default `dropna` behavior will drop all the rows in which any null value is present:

In [129]:
df.dropna()

   Column A  Column B  Column C  Column D
2      30.0      31.0      32.0        34


In this case we're dropping `rows`. Rows containing null values are dropped from the DF. You can also use the axis parameter to drop columns containing null values:

In [137]:
df.dropna(axis=1)  # axis="columns" also works

   Column D
0         5
1         8
2        34
3       110


In [None]:
df.dropna(axis="rows") # axis=0 also works

   Column A  Column B  Column C  Column D
2      30.0      31.0      32.0        34


In both cases, any row or column that contains at least one null value is dropped, which can be too extreme depending on the case. You can control this behavior with the `how` parameter, which can be either `'any'` or `'all'`:

In [138]:
df2 = pd.DataFrame(
    {
        "Column A": [1, np.nan, 30],
        "Column B": [2, np.nan, 31],
        "Column C": [np.nan, np.nan, 100],
    }
)

In [139]:
df2

   Column A  Column B  Column C
0       1.0       2.0       NaN
1       NaN       NaN       NaN
2      30.0      31.0     100.0


In [None]:
df2.dropna(how="all") # drops a row only if all of its values are null
# df2.dropna(how="any") # would drop a row if any of its values is null

   Column A  Column B  Column C
0       1.0       2.0       NaN
2      30.0      31.0     100.0


In [142]:
df.dropna(how="any") # default is how="any"

   Column A  Column B  Column C  Column D
2      30.0      31.0      32.0        34
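
One way to read the two `how` modes is as boolean masks over the rows. The sketch below rebuilds the values of `df2` under the illustrative name `frame` and reproduces both modes by hand; it is meant as an explanation of the semantics, not as a description of how pandas implements `dropna`:

```python
import numpy as np
import pandas as pd

# Same values as df2 above, under an illustrative name.
frame = pd.DataFrame(
    {
        "Column A": [1, np.nan, 30],
        "Column B": [2, np.nan, 31],
        "Column C": [np.nan, np.nan, 100],
    }
)

# how="all": keep rows that have at least one non-null value.
keep_all = frame[frame.notnull().any(axis=1)]

# how="any": keep rows in which every value is non-null.
keep_any = frame[frame.notnull().all(axis=1)]

print(keep_all.index.tolist())  # [0, 2]
print(keep_any.index.tolist())  # [2]
```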


You can also use the `thresh` parameter to indicate a threshold (a minimum number) of non-null values for the row/column to be kept:

In [143]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [150]:
df.dropna(thresh=3) # means that at least 3 non-null values are required to keep the row or column

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34


In [153]:
df.dropna(thresh=3, axis="columns")

   Column B  Column C  Column D
0       2.0       NaN         5
1       8.0       9.0         8
2      31.0      32.0        34
3       NaN     100.0       110
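
The `thresh` rule can be checked by counting non-null values per row. A minimal sketch, rebuilding the sample data under the illustrative name `frame`:

```python
import numpy as np
import pandas as pd

# Same values as the sample DataFrame above, under an illustrative name.
frame = pd.DataFrame(
    {
        "Column A": [1, np.nan, 30, np.nan],
        "Column B": [2, 8, 31, np.nan],
        "Column C": [np.nan, 9, 32, 100],
        "Column D": [5, 8, 34, 110],
    }
)

# Non-null values per row; count(axis=1) ignores NaN.
non_null_per_row = frame.count(axis=1)
print(non_null_per_row.tolist())  # [3, 3, 4, 2]

# Keeping rows with at least 3 non-null values mirrors dropna(thresh=3).
kept = frame[non_null_per_row >= 3]
print(kept.index.tolist())  # [0, 1, 2]
```

Only the last row has fewer than three non-null values, which is why it is the one that `thresh=3` removes.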


---

### Filling null values

Sometimes, instead of dropping the null values, we might need to replace them with some other value. This depends heavily on your context and the dataset you're working with. Sometimes a `nan` can be replaced with a `0`, sometimes with the `mean` of the sample, and other times with the closest value. Again, it depends on the context. We'll show you the different methods and mechanisms, and you can then apply them to your own problem.

In [154]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

**Filling nulls with an arbitrary value**

In [157]:
s.fillna(0) # replaces all null values with 0

0    1.0
1    2.0
2    3.0
3    0.0
4    0.0
5    4.0
dtype: float64

In [158]:
s.fillna(s.mean())

0    1.0
1    2.0
2    3.0
3    2.5
4    2.5
5    4.0
dtype: float64

In [159]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64
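
As the cell above shows, `s` still contains its `NaN`s after the `fillna` calls: the method returns a new object instead of changing the original. A short sketch of keeping the result, using the illustrative names `readings` and `filled`:

```python
import numpy as np
import pandas as pd

readings = pd.Series([1, 2, 3, np.nan, np.nan, 4])  # same values as `s`

# fillna returns a new Series and leaves the original untouched,
# so the result has to be assigned to a name to be kept.
filled = readings.fillna(readings.mean())

print(readings.isnull().sum())  # 2
print(filled.isnull().sum())    # 0
```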

**Filling nulls with contiguous (close) values**

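Null values can also be filled with neighboring values: a forward fill propagates the last valid value forward, and a backward fill uses the next valid value. Older code passes `method="ffill"` or `method="bfill"` to `fillna`; recent pandas releases (2.1 and later) deprecate that argument in favor of the dedicated `ffill()` and `bfill()` methods: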

In [None]:
s.ffill()  # forward fill; replaces the deprecated s.fillna(method="ffill")

0    1.0
1    2.0
2    3.0
3    3.0
4    3.0
5    4.0
dtype: float64

In [161]:
s.bfill()  # backward fill; replaces the deprecated s.fillna(method="bfill")

0    1.0
1    2.0
2    3.0
3    4.0
4    4.0
5    4.0
dtype: float64

This can still leave `null` values at the extremes of the Series/DataFrame:

In [165]:
pd.Series([np.nan, 3, np.nan, 9]).ffill()

0    NaN
1    3.0
2    3.0
3    9.0
dtype: float64

In [166]:
pd.Series([1, np.nan, 3, np.nan, np.nan]).bfill()

0    1.0
1    3.0
2    3.0
3    NaN
4    NaN
dtype: float64
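
One possible way to cover the extremes is to chain a forward fill with a backward fill. This is a sketch rather than a general recommendation, and the name `gappy` is illustrative:

```python
import numpy as np
import pandas as pd

gappy = pd.Series([np.nan, 3, np.nan, 9])

# Forward fill first; the leading NaN has no previous value and survives.
forward = gappy.ffill()

# A backward fill afterwards covers whatever is left at the start.
both = gappy.ffill().bfill()

print(forward.tolist())  # [nan, 3.0, 3.0, 9.0]
print(both.tolist())     # [3.0, 3.0, 3.0, 9.0]
```

Whether copying the nearest value is meaningful depends on the data; for ordered measurements it is often reasonable, for unordered records it may not be.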

---

### Filling null values on DataFrames

The `fillna` method also works on `DataFrames`, and it works similarly. The main differences are that you can specify the `axis` (as usual, rows or columns) along which values are filled (especially relevant for forward and backward fills) and that you have more control over the values passed:

In [167]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [169]:
df.fillna({"Column A": 0, "Column B": 99, "Column C": df["Column C"].mean()}) # will replace null values with the specified values for each column

   Column A  Column B  Column C  Column D
0       1.0       2.0      47.0         5
1       0.0       8.0       9.0         8
2      30.0      31.0      32.0        34
3       0.0      99.0     100.0       110
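
Besides a dictionary, `fillna` also accepts a `Series` indexed by column name, which makes it easy to fill every column with its own mean. A minimal sketch, rebuilding the sample data under the illustrative name `frame`:

```python
import numpy as np
import pandas as pd

# Same values as the sample DataFrame above, under an illustrative name.
frame = pd.DataFrame(
    {
        "Column A": [1, np.nan, 30, np.nan],
        "Column B": [2, 8, 31, np.nan],
        "Column C": [np.nan, 9, 32, 100],
        "Column D": [5, 8, 34, 110],
    }
)

# One mean per column, computed while skipping NaN.
column_means = frame.mean()

# Each null is replaced by the mean of the column it sits in.
filled = frame.fillna(column_means)

print(column_means["Column A"])          # 15.5
print(int(filled.isnull().sum().sum()))  # 0
```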


In [172]:
df.ffill(axis=0)  # fill down each column; replaces the deprecated df.fillna(method="ffill", axis=0)

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       1.0       8.0       9.0         8
2      30.0      31.0      32.0        34
3      30.0      31.0     100.0       110


In [174]:
df.ffill(axis=1)  # fill across each row; replaces the deprecated df.fillna(method="ffill", axis=1)

   Column A  Column B  Column C  Column D
0       1.0       2.0       2.0       5.0
1       NaN       8.0       9.0       8.0
2      30.0      31.0      32.0      34.0
3       NaN       NaN     100.0     110.0


---

### Checking if there are NAs

The question is: Does this `Series` or `DataFrame` contain any missing value? The answer should be yes or no: `True` or `False`. How can you verify it?

**Example 1: Checking the length**

If there are missing values, `s.dropna()` will have fewer elements than `s`:

In [176]:
s.dropna().count() # counts non-null values in the series after dropping nulls

4

In [179]:
missing_values = len(s.dropna()) != len(s) # check if there are any NaN values in the series
missing_values

True

There's also a `count` method that excludes `nan`s from its result:

In [181]:
len(s) # total number of values in the series

6

In [183]:
s.count() # counts non-null values in the series

4

So we could just do:

In [184]:
missing_values = s.count() != len(s)
missing_values

True

**More Pythonic solution `any`**

The methods `any` and `all` check whether a `Series` contains at least one `True` value or only `True` values, respectively. They work in the same way as Python's built-in `any` and `all`:

In [186]:
pd.Series([True, False, False]).any() # checks if any value is True in the series

True

In [188]:
pd.Series([True, False, False]).all() # checks if all values are True in the series

False

The `isnull()` method returns a boolean `Series` with `True` values wherever there is a `nan`:

In [191]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

In [189]:
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

So we can just use the `any` method with the boolean array returned:

In [192]:
pd.Series([1, np.nan]).isnull().any()

True

In [193]:
pd.Series([1, 2]).isnull().any()

False

In [194]:
s.isnull().any()

True

A stricter version checks only the underlying `values` array of the Series:

In [195]:
s.isnull().values

array([False, False, False,  True,  True, False])

In [196]:
s.isnull().values.any()

True
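
The same question can be asked of a whole `DataFrame`. In the sketch below, `frame` is an illustrative copy of the sample data; `isnull().any()` answers per column, and `.values.any()` collapses everything to a single boolean. For a `Series`, the `hasnans` attribute gives the answer directly:

```python
import numpy as np
import pandas as pd

# Same values as the sample DataFrame above, under an illustrative name.
frame = pd.DataFrame(
    {
        "Column A": [1, np.nan, 30, np.nan],
        "Column B": [2, 8, 31, np.nan],
        "Column C": [np.nan, 9, 32, 100],
        "Column D": [5, 8, 34, 110],
    }
)

# One boolean per column: does this column contain any null?
per_column = frame.isnull().any()
print(per_column.tolist())  # [True, True, True, False]

# A single boolean for the whole DataFrame.
any_missing = bool(frame.isnull().values.any())
print(any_missing)  # True

# For a Series, hasnans reports the same information.
series_has_nans = pd.Series([1, np.nan]).hasnans
print(series_has_nans)  # True
```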