# Data Preprocessing - Missing Data

In [10]:
import numpy as np
import pandas as pd
from io import StringIO
from sklearn.impute import SimpleImputer

## Create a mock CSV with missing data

In [2]:
csv_data = \
'''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

df = pd.read_csv(StringIO(csv_data))
print(df)

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN


## Analyzing missing data

In [4]:
# Count the missing values in each column
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64

## Handling missing data by removing

In [8]:
print(df.dropna(axis=0)) # Drops rows with missing values
print()

print(df.dropna(axis=1)) # Drops columns with missing values
print()

print(df.dropna(how='all')) # Drops rows where all columns are NaN
print()

print(df.dropna(thresh=4)) # Drops rows that have fewer than 4 non-NaN values
print()

print(df.dropna(subset=['C'])) # Drops rows only where NaN appears in the listed columns (here: 'C')

     A    B    C    D
0  1.0  2.0  3.0  4.0

      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN

     A    B    C    D
0  1.0  2.0  3.0  4.0

      A     B     C    D
0   1.0   2.0   3.0  4.0
2  10.0  11.0  12.0  NaN


Dropping missing values is the easiest way to handle them, but it has downsides. Removing columns can discard features that carry valuable information, and removing too many rows can severely degrade the training data.
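To make the cost of dropping concrete, the sketch below measures how much of the mock dataset survives each strategy. It rebuilds the same three-row DataFrame used above; the variable names `rows_kept`, `cols_kept`, and `missing_share` are illustrative and not part of the original notebook.

```python
import pandas as pd
from io import StringIO

# Same mock CSV as above: one NaN in column C, one NaN in column D
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

# Fraction of rows that survive dropping every row containing a NaN
rows_kept = len(df.dropna(axis=0)) / len(df)

# Fraction of columns that survive dropping every column containing a NaN
cols_kept = df.dropna(axis=1).shape[1] / df.shape[1]

# Share of missing values per column (mean of the boolean mask)
missing_share = df.isnull().mean()

print(f"rows kept: {rows_kept:.0%}")
print(f"columns kept: {cols_kept:.0%}")
print(missing_share)
```

On this toy data only one of three rows and two of four columns survive, even though just two cells are missing. A check like this before calling `dropna` can help decide whether imputation is the better option.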

## Imputing missing data

### Mean Imputation

In [11]:
# Replace missing values with the mean of the column
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
# sklearn accepts DataFrames, but its API is more mature for NumPy arrays, so we pass df.values
imr = imr.fit(df.values)

imputed_data = imr.transform(df.values)
print(imputed_data)

[[ 1.   2.   3.   4. ]
 [ 5.   6.   7.5  8. ]
 [10.  11.  12.   6. ]]
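Mean imputation with `SimpleImputer` returns a plain NumPy array, so the column names are lost. A minimal sketch, assuming we want the result back as a labeled DataFrame, is to wrap the array using the original columns and index. The name `imputed_df` is illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from io import StringIO
from sklearn.impute import SimpleImputer

# Same mock CSV as above
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

imr = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform learns the column means and fills the NaNs in one step
imputed_array = imr.fit_transform(df.values)

# Restore the labels that were dropped when we passed a raw array
imputed_df = pd.DataFrame(imputed_array, columns=df.columns, index=df.index)
print(imputed_df)
```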


### Other forms of imputation

In [15]:
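# Replace missing values with the most frequent value of the column
# (if several values tie, the smallest one is used)
imr = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
print(imputed_data)
print()

# Replace missing values with the median of the column
imr = SimpleImputer(missing_values=np.nan, strategy='median')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
print(imputed_data)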

[[ 1.  2.  3.  4.]
 [ 5.  6.  3.  8.]
 [10. 11. 12.  4.]]

[[ 1.   2.   3.   4. ]
 [ 5.   6.   7.5  8. ]
 [10.  11.  12.   6. ]]


### Imputation Using Pandas

In [16]:
df.fillna(df.mean()) # Fills missing values with the mean of the column

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   7.5  8.0
2  10.0  11.0  12.0  6.0


`SimpleImputer` follows the sklearn transformer API: `fit` learns its parameters (here, the per-column statistics) from the training data, and `transform` applies them. The fitted imputer can then transform both the training data and any future data we want to predict on, provided it has the same number of columns.

Why use the same transformer on different datasets?

- Consistency: applying the same fitted transformer ensures that all data passed to the model is preprocessed in the same way as the training data.
- Efficiency: the transformer is fitted once and then reused, which can save time on larger datasets.
- Transfer of knowledge: the information learned from one dataset can be applied to another dataset.
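The points above can be sketched as follows: the imputer is fitted once on the training data and then reused on unseen data. `X_new` is a hypothetical extra row invented for illustration; it is not part of the original notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Training data: the same values as the mock CSV above
X_train = np.array([[1.0, 2.0, 3.0, 4.0],
                    [5.0, 6.0, np.nan, 8.0],
                    [10.0, 11.0, 12.0, np.nan]])

# Hypothetical new sample with the same four columns and its own gaps
X_new = np.array([[np.nan, 7.0, np.nan, 9.0]])

imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr.fit(X_train)  # learn the column means from the training data only

# The learned per-column means are stored on the fitted imputer
print(imr.statistics_)

# Gaps in the new data are filled with the training means,
# not with statistics computed from X_new itself
X_new_imputed = imr.transform(X_new)
print(X_new_imputed)
```

Calling `fit` again on `X_new` would replace the learned means with statistics from a single row, so the new data would no longer be preprocessed in the same way as the data the model was trained on.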