## Dealing with Missing Data

Most computational tools either cannot handle missing values (represented here as `NaN`) or produce unpredictable results if we simply ignore them. It is therefore crucial to deal with missing values before proceeding with further analysis.
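
As a small illustration of why ignoring missing values is risky, the sketch below shows a single `NaN` propagating through an ordinary NumPy aggregation. The array `values` is made up for this example and is not part of the dataset used in this notebook.

```python
import numpy as np

# Hypothetical data: one missing entry among three values.
values = np.array([1.0, np.nan, 3.0])

# NaN propagates through ordinary aggregations.
plain_mean = values.mean()

# NaN-aware variants skip the missing entry instead.
nan_aware_mean = np.nanmean(values)

print(plain_mean)      # nan
print(nan_aware_mean)  # 2.0
```

The first result is `nan` even though two of the three values are valid, which is why missing entries are usually handled explicitly before any analysis.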

In [14]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer 

In [8]:
data = {
    'A': [1.0, 5.0, 10.0],
    'B': [2.0, 6.0, 11.0],
    'C': [3.0, np.nan, 12.0],
    'D': [4.0, 8.0, np.nan],
}

df = pd.DataFrame(data)

df.head()

,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [10]:
# Look for missing values and count column-wise
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64

In [11]:
# The simplest way to deal with missing values is to drop every row (training example) that contains one
df.dropna(axis=0)

,A,B,C,D
0,1.0,2.0,3.0,4.0


In [13]:
# Alternatively, drop every column (feature) that contains at least one missing value
df.dropna(axis=1)

,A,B
0,1.0,2.0
1,5.0,6.0
2,10.0,11.0
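
Dropping every row or column that contains any `NaN` is often more aggressive than necessary. As a hedged sketch, the block below rebuilds the same DataFrame and uses the `how`, `thresh`, and `subset` parameters of `dropna` to be more selective. The result variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1.0, 5.0, 10.0],
    'B': [2.0, 6.0, 11.0],
    'C': [3.0, np.nan, 12.0],
    'D': [4.0, 8.0, np.nan],
})

# Drop a row only if every value in it is NaN (no such row exists here).
all_nan_dropped = df.dropna(how='all')

# Keep only rows that have at least 4 non-NaN values.
min_four_values = df.dropna(thresh=4)

# Drop a row only when the NaN appears in column 'C'.
c_must_be_present = df.dropna(subset=['C'])

print(all_nan_dropped.shape)    # (3, 4)
print(min_four_values.shape)    # (1, 4)
print(c_must_be_present.shape)  # (2, 4)
```

Which option is appropriate depends on how many values are missing and on which features the downstream analysis relies on.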


> Although removing missing data seems convenient, it comes with certain
> disadvantages. For example, we may end up removing too many samples, which can make a reliable
> analysis impossible. Or, if we remove too many feature columns, we run the risk of losing valuable
> information that our classifier needs to discriminate between classes.

## Imputing Missing Values

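When dropping rows or columns would discard too much data, we can instead use different interpolation techniques to estimate the missing values from the other training examples in our dataset. One of the most common techniques is mean imputation, where each missing value is replaced with the mean of the observed values in its feature column. In the DataFrame above, for example, the missing entry in column C becomes (3.0 + 12.0) / 2 = 7.5, and the one in column D becomes (4.0 + 8.0) / 2 = 6.0.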

In [15]:
# Replace null values with the mean
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df)
imputed_data = imr.transform(df)

imputed_data


array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])

In [16]:
# Alternative: mean imputation using pandas directly
df.fillna(df.mean())

,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.5,8.0
2,10.0,11.0,12.0,6.0
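
The mean is only one possible fill value. As a minimal sketch, assuming the same DataFrame as above, the block below uses two other `strategy` options that `SimpleImputer` accepts: `'median'` and `'most_frequent'`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'A': [1.0, 5.0, 10.0],
    'B': [2.0, 6.0, 11.0],
    'C': [3.0, np.nan, 12.0],
    'D': [4.0, 8.0, np.nan],
})

# Replace each NaN with the median of the observed values in its column.
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
median_filled = median_imputer.fit_transform(df)

# Replace each NaN with the most frequent observed value in its column.
# When several values are tied, scikit-learn returns the smallest one.
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
mode_filled = mode_imputer.fit_transform(df)

print(median_filled)
print(mode_filled)
```

With only two observed values per incomplete column, the median coincides with the mean here. The median is generally less sensitive to outliers, while `'most_frequent'` is mainly useful for categorical features.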
