## Data imputation methods

Real-world data often contains missing values. If a dataset is missing too many values, we may simply not use it. However, if only a few of the values are missing, we can perform data imputation to substitute each missing entry with some other value.
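Before deciding between discarding and imputing, it can help to measure how much of the data is actually missing. The following is a minimal sketch using NumPy; the `sample` array is a made-up example, and what counts as "too many" missing values is left to the reader.

```python
import numpy as np

# Hypothetical 3x3 example; np.nan marks a missing entry.
sample = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, np.nan],
    [7.0, 8.0, 9.0],
])

# Boolean mask that is True wherever a value is missing.
missing = np.isnan(sample)

# Fraction of missing entries overall (2 of 9 here).
overall_fraction = missing.mean()
# Fraction of missing entries in each column (0, 1/3, 1/3 here).
per_column_fraction = missing.mean(axis=0)

print(overall_fraction)
print(per_column_fraction)
```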

There are many different methods for data imputation. In scikit-learn, the SimpleImputer transformer supports four of them.

The four methods are:
- Using the mean value
- Using the median value
- Using the most frequent value
- Filling in missing values with a constant

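By default, SimpleImputer replaces each missing value with the mean of its column. To choose a different imputation method, pass the strategy keyword argument when initializing the SimpleImputer object. Every statistic is computed per column from the non-missing values; for the most frequent value, ties are broken by taking the smallest value.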

```python
from sklearn.impute import SimpleImputer
import numpy as np

# np.nan marks the missing entries.
data = [
    [1., 2., np.nan, 2.],
    [5., np.nan, 1., 2.],
    [4., np.nan, 3., np.nan],
    [5., 6., 8., 1.],
    [np.nan, 7., np.nan, 0.]
]

# Default strategy: fill with the column mean.
imp_mean = SimpleImputer()
transformed = imp_mean.fit_transform(data)
print('{}\n'.format(repr(transformed)))

# Fill with the column median.
imp_median = SimpleImputer(strategy='median')
transformed = imp_median.fit_transform(data)
print('{}\n'.format(repr(transformed)))

# Fill with the most frequent value in the column.
imp_frequent = SimpleImputer(strategy='most_frequent')
transformed = imp_frequent.fit_transform(data)
print('{}\n'.format(repr(transformed)))

# Fill with a fixed constant, given by fill_value.
imp_constant = SimpleImputer(strategy='constant',
                             fill_value=-1)
transformed = imp_constant.fit_transform(data)
print('{}\n'.format(repr(transformed)))
```

```
array([[1.  , 2.  , 4.  , 2.  ],
       [5.  , 5.  , 1.  , 2.  ],
       [4.  , 5.  , 3.  , 1.25],
       [5.  , 6.  , 8.  , 1.  ],
       [3.75, 7.  , 4.  , 0.  ]])

array([[1. , 2. , 3. , 2. ],
       [5. , 6. , 1. , 2. ],
       [4. , 6. , 3. , 1.5],
       [5. , 6. , 8. , 1. ],
       [4.5, 7. , 3. , 0. ]])

array([[1., 2., 1., 2.],
       [5., 2., 1., 2.],
       [4., 2., 3., 2.],
       [5., 6., 8., 1.],
       [5., 7., 1., 0.]])

array([[ 1.,  2., -1.,  2.],
       [ 5., -1.,  1.,  2.],
       [ 4., -1.,  3., -1.],
       [ 5.,  6.,  8.,  1.],
       [-1.,  7., -1.,  0.]])
```
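To see which value each strategy substitutes in each column, one option is to inspect the fitted imputer's `statistics_` attribute. The sketch below refits the imputers on the same `data` array; `new_rows` is a hypothetical extra array, added only to illustrate that `transform` reuses the values learned during `fit`.

```python
from sklearn.impute import SimpleImputer
import numpy as np

data = [
    [1., 2., np.nan, 2.],
    [5., np.nan, 1., 2.],
    [4., np.nan, 3., np.nan],
    [5., 6., 8., 1.],
    [np.nan, 7., np.nan, 0.]
]

# Fit one imputer per strategy and collect the per-column fill values.
fill_values = {}
for strategy in ('mean', 'median', 'most_frequent'):
    imputer = SimpleImputer(strategy=strategy)
    imputer.fit(data)
    # statistics_ holds one fill value per column.
    fill_values[strategy] = imputer.statistics_
    print(strategy, imputer.statistics_)

# Hypothetical new rows (not part of the original data). Missing entries
# are filled with the means learned from `data`, not recomputed here.
new_rows = [[np.nan, 1., np.nan, 3.]]
imp_mean = SimpleImputer(strategy='mean').fit(data)
filled_new = imp_mean.transform(new_rows)
print(repr(filled_new))
```

The fill values match the numbers that replaced each NaN in the outputs above; for example, the first column's mean is 3.75 and its median is 4.5.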



## Other imputation methods

There are also more advanced imputation methods, such as k-Nearest Neighbors (https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm), which fills in missing values based on similarity scores from the kNN algorithm, and MICE, short for Multiple Imputation by Chained Equations (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/), which applies multiple chained imputations under the assumption that the values are missing at random.
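As a rough sketch of how these two approaches might look in scikit-learn, the example below applies KNNImputer and IterativeImputer to the same `data` array used earlier. Several things here are assumptions for illustration: `n_neighbors=2`, `max_iter=10` and `random_state=0` are arbitrary choices, and IterativeImputer is only inspired by MICE. It is marked experimental in scikit-learn and returns a single imputation rather than multiple ones, so it should not be read as a full MICE implementation.

```python
import numpy as np
# IterativeImputer is experimental; this import enables it explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

data = np.array([
    [1., 2., np.nan, 2.],
    [5., np.nan, 1., 2.],
    [4., np.nan, 3., np.nan],
    [5., 6., 8., 1.],
    [np.nan, 7., np.nan, 0.]
])

# kNN imputation: each missing entry is replaced by the average of that
# column over the most similar rows (n_neighbors=2 is an arbitrary choice).
imp_knn = KNNImputer(n_neighbors=2)
knn_filled = imp_knn.fit_transform(data)
print('{}\n'.format(repr(knn_filled)))

# Chained imputation: each column with missing values is modeled from the
# other columns, and the columns are revisited over several rounds.
imp_iter = IterativeImputer(max_iter=10, random_state=0)
iter_filled = imp_iter.fit_transform(data)
print('{}\n'.format(repr(iter_filled)))
```

Unlike SimpleImputer, both of these use the other columns of a row when choosing a replacement, so two missing entries in the same column can receive different values.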

In many industry cases these advanced methods are not required, since the data has either already been cleaned or the missing values are scarce. Nevertheless, the advanced methods can be useful when dealing with open source datasets, since these tend to be more incomplete.