---
# Missing Data in Pandas


---
## 1. Missing Data - Definition:
Data comes in many shapes and forms, and missing data is one of the most common issues we encounter in real-world data sets. This is why Pandas aims to be flexible in how it handles missing data. Before we look at the ways to handle it, let's review some definitions:

1. __NaN (Not a Number)__ - a special floating-point value that marks __numerically invalid__ or undefined results
2. __None (NoneType)__ - a built-in Python object, which represents __nonexistent__ or __empty__ values

While __NaN__ is the default missing value marker, we need to be able to detect and handle missing values in data of different types - float, integer, boolean, etc. This is why Pandas also treats __None__ as flagging __missing, not available, or NA__ values.
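As a minimal sketch of how both markers are detected, assuming a recent Pandas and NumPy installation (the variable names `numbers` and `labels` are illustrative only):

```python
import numpy as np
import pandas as pd

# Both markers are reported as missing by pd.isna()
print(pd.isna(np.nan))   # True
print(pd.isna(None))     # True

# In a numeric Series, None is converted to NaN and the dtype becomes float64
numbers = pd.Series([1, None, 3])
print(numbers.dtype)             # float64
print(numbers.isna().tolist())   # [False, True, False]

# In an object Series, None is kept as-is but is still flagged as missing
labels = pd.Series(['a', None, 'c'], dtype=object)
print(labels.isna().tolist())    # [False, True, False]
```

Note that `.isna()` and `.isnull()` are aliases, so either name can be used.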

---
## 2. Handling Missing Data:

__Why is it important to handle missing data?__ - there are many reasons to treat Missing Data as an 'Enemy' of any data set or analysis.

1. Missing data can indicate:
    - data collection errors
    - calculation errors
    - incomplete data collection/implementation
    

2. Missing data can have:
    - an adverse impact on the quality of __Predictive Models__ (e.g. Machine Learning Models)
    - negative consequences to businesses and sectors from a regulatory standpoint
    
__How to handle missing data?__ - in this unit we will explore two of the most commonly used methods to handle missing data:
- Dropping Values
- Imputation (filling in with a value)

In [None]:
# Imports
import pandas as pd
import numpy as np
import datetime as dt

In [None]:
# Write a function that constructs a DataFrame from a reshaped NumPy array
# Note that our function assigns NaN values to certain DataFrame cells
def make_df3():
    # Use float data so that NaN can be stored without changing a column's dtype
    data = np.arange(24, dtype=float).reshape(-1, 3)
    df   = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
    df.iloc[0,1] = np.nan
    df.iloc[4,0] = np.nan
    df.iloc[6,0] = np.nan
    return df

In [None]:
# Create a demo DataFrame
df = make_df3()
df
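Before choosing a handling method, it can help to see where the blanks are. The sketch below rebuilds the demo data under the hypothetical name `demo` so that it runs on its own, then counts missing values per column:

```python
import numpy as np
import pandas as pd

# Rebuild the demo frame so this cell is self-contained (float data, so NaN fits)
demo = pd.DataFrame(np.arange(24, dtype=float).reshape(-1, 3),
                    columns=['col1', 'col2', 'col3'])
demo.iloc[0, 1] = np.nan
demo.iloc[4, 0] = np.nan
demo.iloc[6, 0] = np.nan

# Number of missing values per column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Index labels of the rows that contain at least one missing value
rows_with_missing = demo[demo.isna().any(axis=1)]
print(rows_with_missing.index.tolist())
```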

---
## 3. Dropping:
__Dropping Missing Values__ is the second-easiest way to handle missing data, after simply leaving it in the data set! Dropping refers to 'getting rid of' any rows or columns of a DataFrame that contain a missing value.

However, the ease of this method comes at the expense of a high risk of losing important data points - often this is critical from a business point of view! Therefore dropping must be conducted carefully and very selectively!

To drop values, we use the Pandas `.dropna()` method:
- `df_name.dropna(axis = 1)` - drops all columns that contain at least one missing value
- `df_name.dropna(axis = 0)` - drops all rows that contain at least one missing value

In [None]:
# Create two DataFrames using different axis values:

# df1 will contain only non-null columns of df
df1 = df.dropna(axis = 1)
display(df1)

# df2 will contain only non-null rows of df
df2 = df.dropna(axis = 0)
display(df2)
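Dropping can be made more selective with the `subset`, `how` and `thresh` parameters of `.dropna()`. The following is a sketch only; the frame name `demo` and the threshold of 7 are illustrative choices, not recommendations:

```python
import numpy as np
import pandas as pd

# Rebuild the demo frame so this cell is self-contained
demo = pd.DataFrame(np.arange(24, dtype=float).reshape(-1, 3),
                    columns=['col1', 'col2', 'col3'])
demo.iloc[0, 1] = np.nan
demo.iloc[4, 0] = np.nan
demo.iloc[6, 0] = np.nan

# Drop only the rows where 'col2' is missing; blanks in other columns are ignored
by_subset = demo.dropna(subset=['col2'])
print(by_subset.shape)            # (7, 3)

# Drop only the rows in which every value is missing (none in this frame)
all_missing = demo.dropna(how='all')
print(all_missing.shape)          # (8, 3)

# Keep only the columns that have at least 7 non-missing values
by_thresh = demo.dropna(axis=1, thresh=7)
print(by_thresh.columns.tolist()) # ['col2', 'col3']
```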

---
## 4. Imputation (filling in values):
__Imputation__ is the second and most widely used method when it comes to handling missing data. __Impute__ means to __assign (ascribe)__ to something, and in that sense, __Imputation__ simply means to fill in the blanks of a data set. 

There are multiple approaches to Data Imputation:
- Manual Imputation with a single static value
- Imputation with multiple static values
- Automatic Imputation using the __forward fill__ and __backward fill__ approaches

The static-value approaches leverage the `.fillna()` method:

In [None]:
# filling in all blanks with a single value
df.fillna(9000)

In [None]:
# filling in a column's blanks with a single value using a mask
# note - here we use the .isnull() method and explicit indexing (.loc)
# unlike .fillna(), this assignment modifies df in place
mask = df['col1'].isnull()
df.loc[mask, 'col1'] = 9000
display(df)

In [None]:
# filling in columns' blanks with different values
df = make_df3()
df.fillna({'col1':6666, 'col2': 33333})
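One detail worth checking is that `.fillna()` returns a new DataFrame and leaves the original untouched, and that a dictionary only fills the columns it names. A small sketch, with `demo` and `filled` as illustrative names:

```python
import numpy as np
import pandas as pd

# Rebuild the demo frame so this cell is self-contained
demo = pd.DataFrame(np.arange(24, dtype=float).reshape(-1, 3),
                    columns=['col1', 'col2', 'col3'])
demo.iloc[0, 1] = np.nan
demo.iloc[4, 0] = np.nan
demo.iloc[6, 0] = np.nan

# Fill only 'col1'; 'col2' is not in the dictionary, so its blank stays
filled = demo.fillna({'col1': 6666})

print(filled['col1'].isna().sum())  # 0 - blanks in col1 were filled
print(filled['col2'].isna().sum())  # 1 - col2 was left alone
print(demo['col1'].isna().sum())    # 2 - the original frame is unchanged
```

To keep the result, assign it back to a variable, for example `demo = demo.fillna(...)`.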

---
### 4.1 Forward Fill:
__Forward Fill__ is a fill strategy available through the `.ffill()` method. Older code passes it to the __.fillna()__ function as `method = 'ffill'`, but that argument has been deprecated since Pandas 2.1.

__Forward Fill__ will propagate the last valid observation forward to the missing data point. Essentially, this means that each missing value is replaced with the nearest valid value above it in its column. If there is no valid value above it (for example, when the missing value is the first element in a column), it will remain __NaN__.

Syntax: 
- `df_name.ffill()`

In [None]:
# Forward fill
df = make_df3()
print('Before:')
display(df)
print('---------------------')
print('After:')
display(df.ffill())
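The edge case of a leading blank can be checked on a small Series. This sketch uses the `.ffill()` method, which performs the same forward fill; the values are made up for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Each blank takes the nearest valid value above it;
# the first element has nothing above it, so it stays NaN
forward = s.ffill()
print(forward.tolist())  # [nan, 1.0, 1.0, 1.0, 4.0]
```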

---
### 4.2 Backward Fill:
__Backward Fill__ is a fill strategy available through the `.bfill()` method. Older code passes it to the __.fillna()__ function as `method = 'bfill'`, but that argument has been deprecated since Pandas 2.1.

__Backward Fill__ works in the opposite way to Forward Fill - it fills in each missing value with the nearest valid value beneath it in its column. If there is no valid value beneath it (for example, when the missing value is the last element in a column), it will remain __NaN__.

Syntax: 
- `df_name.bfill()`

In [None]:
# Backward fill
df = make_df3()
print('Before:')
display(df)
print('---------------------')
print('After:')
display(df.bfill())
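A trailing blank behaves the mirror-image way. The sketch below uses the `.bfill()` method and also shows one possible way to cover both ends by chaining a forward fill afterwards; whether that chaining is appropriate depends on the data set:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 5.0, np.nan])

# Each blank takes the nearest valid value beneath it;
# the last element has nothing beneath it, so it stays NaN
backward = s.bfill()
print(backward.tolist())  # [2.0, 2.0, 5.0, 5.0, nan]

# Chaining a forward fill afterwards fills the remaining trailing blank
both = s.bfill().ffill()
print(both.tolist())      # [2.0, 2.0, 5.0, 5.0, 5.0]
```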

---
## 5. Interpolation:
__Interpolation__ is a method of fitting a simple function to a given data set, which can then be used to estimate data points in between the given ones. There are many interpolation methods, but we will consider the simplest one - __linear interpolation__. 

A __Linear Interpolation__ fills each missing value by drawing a straight line between the nearest valid values above and below it. For a single missing value, this is the mid-point (or average) of the two. By default, Pandas treats the rows as equally spaced and leaves missing values at the start of a column as __NaN__.

Syntax:
- `df_name.interpolate()`

In [None]:
# Interpolation
df = make_df3()
print('Before:')
display(df)
print('---------------------')
print('After:')
display(df.interpolate())

In [None]:
# when there are multiple adjacent blanks, interpolate() takes this into account and assigns equally spaced values
pd.Series([1,np.nan,np.nan,4]).interpolate()
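The arithmetic behind linear interpolation can be verified on a few small Series. This is a sketch with made-up values, using the default settings of `.interpolate()`:

```python
import numpy as np
import pandas as pd

# A single gap is filled with the mid-point of its two neighbours: (10 + 20) / 2
single_gap = pd.Series([10.0, np.nan, 20.0]).interpolate()
print(single_gap.tolist())   # [10.0, 15.0, 20.0]

# Two adjacent gaps are filled with equally spaced steps between 1 and 4
double_gap = pd.Series([1.0, np.nan, np.nan, 4.0]).interpolate()
print(double_gap.tolist())   # [1.0, 2.0, 3.0, 4.0]

# A leading blank has no valid value before it, so it is left as NaN
leading_gap = pd.Series([np.nan, 2.0, np.nan, 4.0]).interpolate()
print(leading_gap.tolist())  # [nan, 2.0, 3.0, 4.0]
```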

---
## 6. Summary:
- Handling missing data is important, as missing values can indicate incomplete data sets or calculation errors, and can negatively impact data analyses
- The two main methods to handle missing data are __Dropping Values__ and __Imputation (filling in blanks)__
- When handling missing data, always pick the method best suited to your data set in order to minimise critical data loss

---
## 7. Concept Check:

1. Suppose we have a DataFrame `df=pd.DataFrame({'col1':[1,2,np.nan, 4,5], 'col2':[6,7,8,9,10], 'col3':[np.nan, 12,13, np.nan,15]})`. Without running any code, determine:
- the shape of the output produced by `df.dropna(axis=1)`
- the shape of the output produced by `df.dropna(axis=0)`
2. Using the DataFrame from question 1 and without running any code, determine:
- the value of `df.loc[0,'col3']` after applying imputation via forward fill `.ffill()`
- the value of `df.loc[2,'col1']` after applying imputation via backward fill `.bfill()`