---
# Missing Data in Pandas


---
## 1. Missing Data - Definition:
Data comes in many shapes and forms, and missing data is one of the most common issues we encounter in real-world data sets. This is why Pandas aims to be flexible with regard to handling it. But before we learn about the ways to handle missing data, let's take a look at some definitions:

1. __NaN (Not a Number)__ - a special floating-point value that represents an undefined or unrepresentable __numerical__ result
2. __None (NoneType)__ - Python's built-in null object, which represents an __empty__ or absent value

While __NaN__ is the default missing value marker, we need to be able to detect and handle missing values in data of different types - float, integer, boolean, etc. This is why Pandas also treats __None__ as flagging __missing, not available, or NA__ values!
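
As a small illustration of the two markers, the sketch below builds two throwaway Series (the names `numeric` and `labels` are made up for this example) and checks how Pandas reports missing values in each. It also shows why a dedicated check such as `.isna()` is needed: NaN does not compare equal to itself.

```python
import numpy as np
import pandas as pd

# A numeric Series: None is converted to NaN and the dtype becomes float64
numeric = pd.Series([1, None, 3])

# An object Series (dtype requested explicitly): None is stored as-is
labels = pd.Series(["a", None, "c"], dtype=object)

# .isna() flags both markers as missing
numeric_missing = numeric.isna().tolist()
labels_missing = labels.isna().tolist()

# NaN never equals itself, so `== np.nan` cannot be used to find it
nan_equals_itself = (np.nan == np.nan)

print(numeric.dtype, numeric_missing)
print(labels.dtype, labels_missing)
print(nan_equals_itself)
```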

---
## 2. Handling Missing Data:

### What are the causes of missing data?

There are many reasons why data could be missing, and not all missing data is created equal.

- The data simply does not exist e.g. not all cars will have an engine size (in the case of electric cars).
- The data collection mechanism could have failed e.g. a temperamental physical sensor.
- Attrition e.g. sensors could be retired or stop working, people might drop out of a study.
- Non-response e.g. people might be uncomfortable giving personal information away in a form.
- Data collection issues / human error in recording.

### Missing data does not have a specific distribution in your dataset.

- Data can be missing at random e.g. due to random noise.
- Data could be missing for a certain category or time window e.g. no temperature measures during the days of a storm.
- Missing data could correlate with a certain property e.g. presence of a birth certificate and age (the older you are, the more likely you are to have lost your certificate, or a lack of processes at the time meant you never had one to begin with).
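
One way to see whether missing values cluster in a category is to compute the share of missing entries per group. The sketch below uses a small made-up `weather` table (both the column names and the values are hypothetical) in which the temperature sensor failed on stormy days.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: the sensor failed on two of the stormy days
weather = pd.DataFrame({
    "condition": ["clear", "clear", "storm", "storm", "clear", "storm"],
    "temperature": [21.0, 19.5, np.nan, np.nan, 22.1, 15.0],
})

# Share of missing temperature values within each weather condition
missing_share = weather["temperature"].isna().groupby(weather["condition"]).mean()
print(missing_share)
```

A share that differs sharply between groups suggests the data is not missing at random.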


### What are the consequences of missing data, and how can we handle it?

Missing data could lead to inaccurate predictions of an ML model, unrepresentative conclusions drawn from your analytics, miscalculated statistics or any number of unintended or unforeseen outcomes. Understanding the cause and shape of the missing data helps you better accommodate the limitations of your data.
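
As a minimal sketch of how a statistic can be miscalculated, the example below computes the mean of a made-up `readings` array in three ways. Plain NumPy propagates NaN, while NumPy's NaN-aware function and Pandas (which skips NaN by default) both ignore the missing entry.

```python
import numpy as np
import pandas as pd

# Made-up readings with one missing value
readings = np.array([10.0, 20.0, np.nan, 40.0])

# NumPy propagates NaN: one missing value makes the whole mean NaN
numpy_mean = np.mean(readings)

# NumPy's NaN-aware variant ignores the missing value
numpy_nanmean = np.nanmean(readings)

# Pandas skips NaN by default (skipna=True), averaging the 3 valid values
pandas_mean = pd.Series(readings).mean()

print(numpy_mean, numpy_nanmean, pandas_mean)
```

Neither behaviour is "right" in general: silently skipping values changes the sample a statistic is computed on, which is worth keeping in mind when interpreting it.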

__How to handle missing data?__ - in this unit we will explore two of the most commonly used methods to handle missing data:
- Dropping Values
- Imputation (filling in with a value)

In [1]:
import pandas as pd
import numpy as np

In [2]:
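def make_df3():
    """Build an 8x3 DataFrame of the integers 0-23 with three values set to NaN."""
    data = np.arange(24).reshape(-1, 3)
    df = pd.DataFrame(data, columns=["col1", "col2", "col3"])
    # Blank out one value in col2 and two values in col1
    df.iloc[0, 1] = np.nan
    df.iloc[4, 0] = np.nan
    df.iloc[6, 0] = np.nan
    return df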

In [3]:
df = make_df3()
df

Unnamed: 0,col1,col2,col3
0,0.0,,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,,13.0,14
5,15.0,16.0,17
6,,19.0,20
7,21.0,22.0,23
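
Before choosing a handling strategy, it can help to count how many values are missing and where. The sketch below rebuilds a frame with the same layout as the one above under the hypothetical name `df_demo`; as an assumption for this example, the data is created as floats from the start so that assigning NaN needs no dtype change.

```python
import numpy as np
import pandas as pd

# Same layout as the example frame, but built as floats (an assumption of this sketch)
data = np.arange(24, dtype=float).reshape(-1, 3)
df_demo = pd.DataFrame(data, columns=["col1", "col2", "col3"])
df_demo.iloc[0, 1] = np.nan
df_demo.iloc[4, 0] = np.nan
df_demo.iloc[6, 0] = np.nan

# Number of missing values in each column and in each row
missing_per_column = df_demo.isna().sum()
missing_per_row = df_demo.isna().sum(axis=1)

# Total number of missing values in the whole frame
total_missing = int(df_demo.isna().sum().sum())

print(missing_per_column)
print(total_missing)
```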


---
## 3. Dropping:

__Dropping missing values__ is the 'easiest way' to handle missing data (second only to simply leaving it in the data set). Dropping refers to removing any **rows** or **columns** that contain a missing value.

However, the ease of this method comes at the risk of losing important data points. Therefore dropping must be conducted carefully and very selectively!
    
To drop values, we use the Pandas `.dropna()` method:
- `df_name.dropna(axis = 1)` - drops every column that contains at least one missing value
- `df_name.dropna(axis = 0)` - drops every row that contains at least one missing value

In [6]:
# df1 will contain only non-null columns of df
df1 = df.dropna(axis=1)
display(df1)

# df2 will contain only non-null rows of df
df2 = df.dropna(axis=0)
display(df2)

Unnamed: 0,col3
0,2
1,5
2,8
3,11
4,14
5,17
6,20
7,23


Unnamed: 0,col1,col2,col3
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
5,15.0,16.0,17
7,21.0,22.0,23
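
Dropping can be made more selective with the `subset`, `thresh` and `how` parameters of `.dropna()`. The sketch below is one possible illustration on a frame with the same layout as above, rebuilt under the hypothetical name `df_demo` (as floats, an assumption of this example); the threshold of 7 is chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd

# Same layout as the example frame, built as floats for this sketch
data = np.arange(24, dtype=float).reshape(-1, 3)
df_demo = pd.DataFrame(data, columns=["col1", "col2", "col3"])
df_demo.iloc[0, 1] = np.nan
df_demo.iloc[4, 0] = np.nan
df_demo.iloc[6, 0] = np.nan

# Drop only the rows where col2 is missing; NaNs in col1 are tolerated
rows_with_col2 = df_demo.dropna(subset=["col2"])

# Keep only the columns that have at least 7 non-missing values
mostly_complete_cols = df_demo.dropna(axis=1, thresh=7)

# Drop a row only if every value in it is missing (no such row here)
no_empty_rows = df_demo.dropna(how="all")

print(rows_with_col2.shape, list(mostly_complete_cols.columns), no_empty_rows.shape)
```

Compared with a plain `dropna()`, these options keep more of the data by removing only the rows or columns that matter for the analysis at hand.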


---
## 4. Imputation (filling in values):

__Imputation__ is another tool we can use to handle missing data. It simply refers to filling in the blanks of a data set with values.

There are multiple approaches to Data Imputation:
- Manual imputation with a single static value.
- Imputation with multiple static values.
- Automatic imputation using __forward fill__ and __backward fill__ approach.

All of these approaches leverage the `.fillna()` DataFrame method:

In [7]:
# Fill all blanks with a single value
df.fillna(9000)

Unnamed: 0,col1,col2,col3
0,0.0,9000.0,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,9000.0,13.0,14
5,15.0,16.0,17
6,9000.0,19.0,20
7,21.0,22.0,23


In [9]:
# fill in a column's blanks with a single value using a mask
mask = df["col1"].isnull() # this returns a mask where True corresponds to locations of NaNs
df.loc[mask, "col1"] = 9000 # note we are using explicit indexing
df

Unnamed: 0,col1,col2,col3
0,0.0,,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,9000.0,13.0,14
5,15.0,16.0,17
6,9000.0,19.0,20
7,21.0,22.0,23


In [10]:
df = make_df3()
df.fillna({"col1":6666, "col2":3333}) # Different columns will have different values

Unnamed: 0,col1,col2,col3
0,0.0,3333.0,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,6666.0,13.0,14
5,15.0,16.0,17
6,6666.0,19.0,20
7,21.0,22.0,23
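
The per-column values do not have to be hard-coded. As a variation that goes slightly beyond the examples above, `.fillna()` also accepts a Series indexed by column name, so each column's blanks can be filled with a statistic of that column. The sketch below uses the column means on a frame rebuilt under the hypothetical name `df_demo`; whether the mean is a sensible fill value depends on the data.

```python
import numpy as np
import pandas as pd

# Same layout as the example frame, built as floats for this sketch
data = np.arange(24, dtype=float).reshape(-1, 3)
df_demo = pd.DataFrame(data, columns=["col1", "col2", "col3"])
df_demo.iloc[0, 1] = np.nan
df_demo.iloc[4, 0] = np.nan
df_demo.iloc[6, 0] = np.nan

# Per-column means, computed over the non-missing values only
column_means = df_demo.mean()

# Each column's blanks are filled with that column's own mean
filled = df_demo.fillna(column_means)

print(column_means)
print(filled)
```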


---
### 4.1 Forward Fill:
__Forward Fill__ is a fill method that can be passed to the __.fillna()__ method.

__Forward Fill__ propagates the last valid observation forward to the missing data point. Essentially, this means that each missing value is replaced with the nearest valid value above it in the same column. If the missing value is the first element in a column, it will remain __NaN__.

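Syntax:
- `df_name.fillna(method = 'ffill')`
- `df_name.ffill()` - the equivalent dedicated method; Pandas 2.1 and later deprecate the `method` keyword of `.fillna()` in favour of it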

In [13]:
df = make_df3()
print("Before:")
display(df)
print("--------")
print("After:")
df.ffill()  # same result as df.fillna(method="ffill"), which newer Pandas versions deprecate

Before:


Unnamed: 0,col1,col2,col3
0,0.0,,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,,13.0,14
5,15.0,16.0,17
6,,19.0,20
7,21.0,22.0,23


--------
After:


Unnamed: 0,col1,col2,col3
0,0.0,,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,9.0,13.0,14
5,15.0,16.0,17
6,15.0,19.0,20
7,21.0,22.0,23
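
The sketch below illustrates two details of forward fill on a small made-up Series `s`: a leading blank has nothing above it to copy and so stays NaN, and the optional `limit` parameter caps how many consecutive blanks are filled.

```python
import numpy as np
import pandas as pd

# Made-up Series with a leading blank and a run of two blanks
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Plain forward fill: the leading NaN remains, the run is filled with 1.0
filled = s.ffill()

# With limit=1 only the first blank of each run is filled
limited = s.ffill(limit=1)

print(filled.tolist())
print(limited.tolist())
```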


---
### 4.2 Backward Fill:
__Backward Fill__ is a fill method that can be passed to the __.fillna()__ method.

__Backward Fill__ works in the opposite way to Forward Fill - it fills in each missing data field with the nearest valid value beneath it in the same column. If the missing value is the last element in a column, it will remain __NaN__.

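Syntax:
- `df_name.fillna(method = 'bfill')`
- `df_name.bfill()` - the equivalent dedicated method; Pandas 2.1 and later deprecate the `method` keyword of `.fillna()` in favour of it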

In [14]:
df = make_df3()
print("Before:")
display(df)
print("--------")
print("After:")
df.bfill()  # same result as df.fillna(method="bfill"), which newer Pandas versions deprecate

Before:


Unnamed: 0,col1,col2,col3
0,0.0,,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,,13.0,14
5,15.0,16.0,17
6,,19.0,20
7,21.0,22.0,23


--------
After:


Unnamed: 0,col1,col2,col3
0,0.0,4.0,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,15.0,13.0,14
5,15.0,16.0,17
6,21.0,19.0,20
7,21.0,22.0,23
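
Because backward fill leaves a trailing blank untouched and forward fill leaves a leading one, the two are sometimes chained to cover both ends. The sketch below shows this on a made-up Series `s`; whether chaining is appropriate depends on what the values represent.

```python
import numpy as np
import pandas as pd

# Made-up Series with blanks at both ends and one in the middle
s = pd.Series([np.nan, 2.0, np.nan, 5.0, np.nan])

# Backward fill alone: the trailing NaN has nothing beneath it and remains
backward = s.bfill()

# Forward fill first, then backward fill to catch the leading blank
both_ends = s.ffill().bfill()

print(backward.tolist())
print(both_ends.tolist())
```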


---
## 5. Interpolation:
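__Interpolation__ is a method of fitting a simple function to a given data set, which can then be used to derive data points in between the given ones. There are many interpolation methods, but we will consider the simplest one - __linear interpolation__.

A __Linear Interpolation__ takes the closest valid values on either side of a missing data field and 'fills in the blank' with a point on the straight line between them. For a single missing value, this is the mid-point (or average) of the two; by default, Pandas treats the rows as equally spaced. A missing value with no valid value before it (such as a leading NaN) is left unfilled by default.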

Syntax:
- `df_name.interpolate()`

In [15]:
df = make_df3()
print("Before:")
display(df)
print("--------")
print("After:")
df.interpolate()

Before:


Unnamed: 0,col1,col2,col3
0,0.0,,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,,13.0,14
5,15.0,16.0,17
6,,19.0,20
7,21.0,22.0,23


--------
After:


Unnamed: 0,col1,col2,col3
0,0.0,,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,12.0,13.0,14
5,15.0,16.0,17
6,18.0,19.0,20
7,21.0,22.0,23


In [17]:
# When there are multiple adjacent blanks, interpolate will take this into account and assign equidistant values
pd.Series([1, np.nan, np.nan, 4]).interpolate()

0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64
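
To make the arithmetic behind this result explicit, the sketch below reproduces it with NumPy's `np.interp`, using the row positions as the x-axis. This is only a hand-rolled cross-check under the assumption of equally spaced rows, not a description of how Pandas implements `.interpolate()` internally.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Row positions act as x-coordinates; only the non-missing rows are "known"
positions = np.arange(len(s))
known = s.notna().to_numpy()

# Straight line between known points: y = y0 + (x - x0) * (y1 - y0) / (x1 - x0)
manual = np.interp(positions, positions[known], s.to_numpy()[known])

# Pandas' default (linear) interpolation for comparison
pandas_result = s.interpolate().to_numpy()

print(manual)
print(pandas_result)
```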

---
## 6. Summary:
- Understanding the causes of, and handling, missing data is important as it can lead to calculation errors, as well as negatively impact data analyses or the outcomes of predictive models.
- The two main methods to handle missing data are __dropping values__ and __imputation (filling in blanks)__.
- When handling missing data, always pick the method most appropriate to your data set in order to minimise data loss and the impact on data quality.

---
## 7. Concept Check:

1. Suppose we have a DataFrame `df=pd.DataFrame({'col1':[1,2,np.nan, 4,5], 'col2':[6,7,8,9,10], 'col3':[np.nan, 12,13, np.nan,15]})`. Without running any code, determine:
- the shape of the output produced by `df.dropna(axis=1)`
- the shape of the output produced by `df.dropna(axis=0)`
2. Using the DataFrame from question 1 and without running any code, determine:
- the value of `df.loc[0,'col3']` after applying imputation via forward fill `.fillna(method = 'ffill')`
- the value of `df.loc[2,'col1']` after applying imputation via backward fill `.fillna(method = 'bfill')`

In [21]:
df = pd.DataFrame({'col1':[1,2,np.nan, 4,5], 'col2':[6,7,8,9,10], 'col3':[np.nan, 12,13, np.nan,15]})
display(df)
# 1.
display(df.dropna(axis=1)) # shape: (5,1)
display(df.dropna(axis=0)) # shape: (2,3)
# 2.
# df.ffill() / df.bfill() give the same results as fillna(method=...), which newer Pandas versions deprecate
display(df.ffill()) # pred: NaN
display(df.bfill()) # pred: 4.0

Unnamed: 0,col1,col2,col3
0,1.0,6,
1,2.0,7,12.0
2,,8,13.0
3,4.0,9,
4,5.0,10,15.0


Unnamed: 0,col2
0,6
1,7
2,8
3,9
4,10


Unnamed: 0,col1,col2,col3
1,2.0,7,12.0
4,5.0,10,15.0


Unnamed: 0,col1,col2,col3
0,1.0,6,
1,2.0,7,12.0
2,2.0,8,13.0
3,4.0,9,13.0
4,5.0,10,15.0


Unnamed: 0,col1,col2,col3
0,1.0,6,12.0
1,2.0,7,12.0
2,4.0,8,13.0
3,4.0,9,15.0
4,5.0,10,15.0
