In [3]:
import numpy as np
import pandas as pd

# Content
- [Types of Missing Values](#missing-types)
- [Simple Techniques to Handle Missing Values with Pandas](#handle-missing-value)
- [Reference](#reference)

## <a name="missing-types"></a>Types of Missing Values
There are three types of missing values:
1. __Missing Completely at Random (MCAR)__: the missingness occurs completely at random; it is related neither to the missing value itself nor to any other variable.
2. __Missing at Random (MAR)__: the missingness does not depend on the value of the variable that is missing, but it may depend on the other variables. E.g. if the probability that _x<sub>1i</sub>_ is missing does not depend on the value of _X<sub>1</sub>_ after controlling for all the other variables, then it is MAR: P(x<sub>missing</sub>|X, Y) = P(x<sub>missing</sub>|Y), where _X_ is the variable with missing values and _Y_ stands for the other variables. E.g. whether someone fails to fill in a depression survey has nothing to do with their level of depression, but is related to their gender.
3. __Missing Not at Random (MNAR)__: a.k.a. nonignorable nonresponse. The value of the missing variable is related to the reason it is missing. E.g. someone fails to fill in a depression survey because of their level of depression.
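The three mechanisms can be made concrete with a small simulation. This is only a sketch under assumed numbers: the columns `is_male` and `score`, the sample size, and all missingness probabilities are hypothetical and chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey: gender is always observed, the depression score may be missing.
sim = pd.DataFrame({
    'is_male': rng.integers(0, 2, size=n).astype(bool),
    'score': rng.normal(loc=50, scale=10, size=n),
})

# MCAR: every score has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance of being missing depends only on the observed gender.
mar_mask = rng.random(n) < np.where(sim['is_male'], 0.4, 0.1)

# MNAR: the chance of being missing depends on the (unobserved) score itself.
mnar_mask = rng.random(n) < np.where(sim['score'] > 60, 0.6, 0.1)

# Mean of the scores that remain observed under each mechanism.
observed_means = pd.Series({
    'full data': sim['score'].mean(),
    'MCAR': sim.loc[~mcar_mask, 'score'].mean(),
    'MAR': sim.loc[~mar_mask, 'score'].mean(),
    'MNAR': sim.loc[~mnar_mask, 'score'].mean(),
})
print(observed_means)
```

In this setup the MNAR mean is pulled down, because high scores go missing more often. The MAR mean stays close to the full-data mean only because the simulated score is independent of gender; if it were not, the observed mean would need to be adjusted for gender.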

## <a name='handle-missing-value'></a>Simple Techniques to Handle Missing Values with Pandas
For __MCAR__, it is okay to remove missing data.
For __MAR__ and __MNAR__, it is better to impute missing data.

- [Imputation](#imputation)
    - General
        - categorical/discrete
            - make missing values as new category
            - multiple imputation
            - logistic regression
        - continuous
            - mean, median, mode
            - multiple imputation
            - linear regression
   
    - Time-series
        - no trend & no seasonality: mean, median, mode, random sample imputation
        - trend but no seasonality: linear interpolation
        - trend and seasonality: seasonal adjustment + interpolation
- [Removing](#remove)
    - Listwise/Casewise deletion: remove row (remove an observation). NOT a preferred method, as the data rarely satisfy the strong MCAR assumption; deleting rows with missing values may then lead to bias.
    - Pairwise deletion. weighted
    - Delete columns
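To illustrate the difference between listwise and pairwise deletion, the sketch below uses the same toy values as this notebook's `df` (the name `toy` is just for this example). It relies on `DataFrame.corr`, which excludes missing values pair by pair.

```python
import numpy as np
import pandas as pd

# Same toy values as the notebook's df.
toy = pd.DataFrame({'A': [1, np.nan, 3, 4, np.nan],
                    'B': [6, np.nan, 8, 9, 10],
                    'C': [2, np.nan, np.nan, 8, 10],
                    'D': [11, 12, 13, 14, 15]})

# Listwise deletion: only rows without any missing value survive.
listwise = toy.dropna()
print(len(listwise))  # 2

# Pairwise deletion: each correlation uses every row where both columns are present.
pairwise_corr = toy.corr()

# Number of rows that are available for each pair of columns.
present = toy.notna().astype(int)
pairwise_n = present.T @ present
print(pairwise_n)
```

Listwise deletion keeps only 2 of the 5 rows, while the pair (B, D) can still use 4 rows. The price is that each entry of `pairwise_corr` may be based on a different subset of observations.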

In [28]:
# create a dataframe with missing values
df = pd.DataFrame({'A': [1, np.nan, 3, 4, np.nan],
                   'B': [6, np.nan, 8, 9, 10],
                   'C': [2, np.nan, np.nan, 8, 10],
                   'D': [11,12,13,14,15]
                  })
df

Unnamed: 0,A,B,C,D
0,1.0,6.0,2.0,11
1,,,,12
2,3.0,8.0,,13
3,4.0,9.0,8.0,14
4,,10.0,10.0,15


In [29]:
# detect missing values
# create a mask of boolean values indicating if the corresponding value is missing or not

# return True if the value is missing
df.isna()

Unnamed: 0,A,B,C,D
0,False,False,False,False
1,True,True,True,False
2,False,False,True,False
3,False,False,False,False
4,True,False,False,False


In [30]:
# return True if the value is NOT missing
df.notna()

Unnamed: 0,A,B,C,D
0,True,True,True,True
1,False,False,False,True
2,True,True,False,True
3,True,True,True,True
4,False,True,True,True


In [31]:
# isnull() is an alias of isna()
df.isnull()

Unnamed: 0,A,B,C,D
0,False,False,False,False
1,True,True,True,False
2,False,False,True,False
3,False,False,False,False
4,True,False,False,False
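
The boolean masks above are most useful when aggregated. A minimal sketch (it rebuilds the notebook's `df` so that the block runs on its own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3, 4, np.nan],
                   'B': [6, np.nan, 8, 9, 10],
                   'C': [2, np.nan, np.nan, 8, 10],
                   'D': [11, 12, 13, 14, 15]})

# Number of missing values per column (True counts as 1).
missing_per_column = df.isna().sum()

# Fraction of missing values per column.
missing_share = df.isna().mean()

# Number of rows that contain at least one missing value.
rows_with_any_missing = int(df.isna().any(axis=1).sum())

print(missing_per_column)
print(missing_share)
print(rows_with_any_missing)
```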


In [32]:
# Note: in Python and NumPy, NaN never compares equal to anything (not even itself), whereas None == None is True.
# pandas treats None like np.nan.
# In pandas, a scalar equality comparison against None/np.nan therefore does NOT tell you
# which values are missing; use isna()/notna() instead.

print(None == None)
print(np.nan == np.nan)

True
False
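
The same pitfall shows up element-wise in pandas. A short sketch with a hypothetical Series `s`:

```python
import numpy as np
import pandas as pd

# None is converted to NaN when the Series is built from floats.
s = pd.Series([1.0, np.nan, None])

# Element-wise equality against np.nan is False everywhere, even for missing entries.
eq_nan = (s == np.nan)
print(eq_nan.tolist())    # [False, False, False]

# isna() is the reliable way to find the missing entries.
is_missing = s.isna()
print(is_missing.tolist())  # [False, True, True]
```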


### <a name='imputation'></a>Imputation
- regression-based imputation
    - underestimates the variance, because the imputed values lie exactly on the fitted line and lack the random error term (random noise drawn from a suitable distribution can be added to compensate)
    - has to assume a linear relationship between the variables, which may not always exist
    - distorts the sample distribution
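
A minimal sketch of regression-based imputation, with and without added noise. The data, the linear model `y = 2x + 1`, and the 30% missing rate are all hypothetical; `np.polyfit` stands in for whatever regression routine would be used in practice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical data: y depends linearly on x, and about 30% of y is missing.
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=n)
data = pd.DataFrame({'x': x, 'y': y})
data.loc[rng.random(n) < 0.3, 'y'] = np.nan

observed = data.dropna()
missing = data['y'].isna()

# Fit y = slope * x + intercept on the complete rows only.
slope, intercept = np.polyfit(observed['x'], observed['y'], deg=1)
predicted = (slope * data.loc[missing, 'x'] + intercept).to_numpy()

# Deterministic imputation: fill with the fitted value.
deterministic = data['y'].copy()
deterministic.loc[missing] = predicted

# Stochastic imputation: add noise with the residual standard deviation.
residual_sd = (observed['y'] - (slope * observed['x'] + intercept)).std()
noise = rng.normal(scale=residual_sd, size=int(missing.sum()))
stochastic = data['y'].copy()
stochastic.loc[missing] = predicted + noise

print(deterministic.std(), stochastic.std())
```

The deterministic version places every imputed point exactly on the fitted line, which is why it understates the spread of `y`; the stochastic version restores some of that spread at the cost of adding randomness to the result.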

Below are some simple operations to fill missing data using `pandas`. 
See more: [`Cleaning / filling missing data`](https://pandas.pydata.org/pandas-docs/stable/missing_data.html#cleaning-filling-missing-data)

`SimpleImputer` from `sklearn.impute` can also be used (it replaced the older `Imputer` class in `sklearn`).
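
For the scikit-learn route, a short sketch using `SimpleImputer` from `sklearn.impute`, the current class for this job (the older `Imputer` has been removed from recent releases). The median strategy is an arbitrary choice for illustration, and the block rebuilds the notebook's `df` so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'A': [1, np.nan, 3, 4, np.nan],
                   'B': [6, np.nan, 8, 9, 10],
                   'C': [2, np.nan, np.nan, 8, 10],
                   'D': [11, 12, 13, 14, 15]})

# Learn the per-column median and use it to fill the gaps.
imputer = SimpleImputer(strategy='median')

# fit_transform returns a NumPy array, so wrap it back into a DataFrame.
imputed = pd.DataFrame(imputer.fit_transform(df),
                       columns=df.columns, index=df.index)
print(imputed)
```

Unlike `df.fillna(df.median())`, a fitted imputer remembers the statistics it learned, so the same values can be applied to new data with `imputer.transform`.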

In [33]:
# fill with a value, e.g. 0
df.fillna(0)

Unnamed: 0,A,B,C,D
0,1.0,6.0,2.0,11
1,0.0,0.0,0.0,12
2,3.0,8.0,0.0,13
3,4.0,9.0,8.0,14
4,0.0,10.0,10.0,15


In [34]:
# forward fill: use the preceding valid observation to fill each gap
# (df.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas)
df.ffill()

Unnamed: 0,A,B,C,D
0,1.0,6.0,2.0,11
1,1.0,6.0,2.0,12
2,3.0,8.0,2.0,13
3,4.0,9.0,8.0,14
4,4.0,10.0,10.0,15


In [35]:
# backward fill: use the next valid observation to fill each gap
# (a trailing NaN, as in column A, has no next value and stays missing)
df.bfill()

Unnamed: 0,A,B,C,D
0,1.0,6.0,2.0,11
1,3.0,8.0,8.0,12
2,3.0,8.0,8.0,13
3,4.0,9.0,8.0,14
4,,10.0,10.0,15


In [36]:
# fill with the average of the nearest valid neighbours on each side
# (a value stays NaN when one of the two neighbours does not exist)
(df.ffill() + df.bfill()) / 2

Unnamed: 0,A,B,C,D
0,1.0,6.0,2.0,11
1,2.0,7.0,5.0,12
2,3.0,8.0,5.0,13
3,4.0,9.0,8.0,14
4,,10.0,10.0,15


In [37]:
# use mean to fill
df.fillna(df.mean())

Unnamed: 0,A,B,C,D
0,1.0,6.0,2.0,11
1,2.666667,8.25,6.666667,12
2,3.0,8.0,6.666667,13
3,4.0,9.0,8.0,14
4,2.666667,10.0,10.0,15


In [38]:
# use a dictionary to fill
values = {'A': df.A.min(), 'B': df.B.max(), 'C': df.C.median()}
df.fillna(value = values)

Unnamed: 0,A,B,C,D
0,1.0,6.0,2.0,11
1,1.0,10.0,8.0,12
2,3.0,8.0,8.0,13
3,4.0,9.0,8.0,14
4,1.0,10.0,10.0,15


In [41]:
# interpolation with pandas
# default: linear
df.interpolate()

Unnamed: 0,A,B,C,D
0,1.0,6.0,2.0,11
1,2.0,7.0,4.0,12
2,3.0,8.0,6.0,13
3,4.0,9.0,8.0,14
4,4.0,10.0,10.0,15


In [48]:
# nearest: copy the closest valid value (this method requires SciPy)
df.interpolate(method='nearest')

Unnamed: 0,A,B,C,D
0,1.0,6.0,2.0,11
1,1.0,6.0,2.0,12
2,3.0,8.0,8.0,13
3,4.0,9.0,8.0,14
4,,10.0,10.0,15


In [50]:
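# nearest with limit_direction='both': the trailing NaN in A is still not filled,
# because 'nearest' only interpolates between valid values
df.interpolate(method='nearest', limit_direction='both')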

Unnamed: 0,A,B,C,D
0,1.0,6.0,2.0,11
1,1.0,6.0,2.0,12
2,3.0,8.0,8.0,13
3,4.0,9.0,8.0,14
4,,10.0,10.0,15
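
For time series with irregular timestamps, `interpolate(method='time')` takes the actual gaps between observations into account. A small sketch with hypothetical dates and values:

```python
import numpy as np
import pandas as pd

# Hypothetical readings: the missing point is 1 day after the first reading
# and 2 days before the last one.
ts = pd.Series([1.0, np.nan, 4.0],
               index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-04']))

# 'linear' ignores the index and treats the points as equally spaced.
by_position = ts.interpolate(method='linear')

# 'time' weights by the elapsed time between the timestamps.
by_time = ts.interpolate(method='time')

print(by_position.iloc[1], by_time.iloc[1])
```

Here the position-based result is the midpoint 2.5, while the time-based result is about 2.0, because the gap lies one third of the way between the two valid readings.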


### <a name='remove'></a>Removing

In [51]:
# drop columns with missing values
df.dropna(axis=1)

Unnamed: 0,D
0,11
1,12
2,13
3,14
4,15


In [52]:
# drop rows with missing values
df.dropna(axis=0)

Unnamed: 0,A,B,C,D
0,1.0,6.0,2.0,11
3,4.0,9.0,8.0,14
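
`dropna` also has less aggressive options than dropping every row with any gap. A sketch of three of them (it rebuilds the notebook's `df` so that the block runs on its own; the threshold 3 and the column `B` are arbitrary choices):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3, 4, np.nan],
                   'B': [6, np.nan, 8, 9, 10],
                   'C': [2, np.nan, np.nan, 8, 10],
                   'D': [11, 12, 13, 14, 15]})

# Drop a row only if ALL of its values are missing (none qualifies here).
drop_if_all = df.dropna(how='all')

# Keep rows with at least 3 non-missing values.
at_least_three = df.dropna(thresh=3)

# Only look at column B when deciding which rows to drop.
b_present = df.dropna(subset=['B'])

print(drop_if_all.index.tolist())
print(at_least_three.index.tolist())
print(b_present.index.tolist())
```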


##  <a name="reference"></a>Reference
- [Missing data](https://en.wikipedia.org/wiki/Missing_data#Types_of_missing_data)
- [How to Handle Missing Data](https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4)
- [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/missing_data.html)