Missing values are very common in real-world data. For various reasons, a dataset may record a missing value as a blank, **nan**, **inf**, or some other designated value. In some cases, otherwise normal values such as **0** or **1** are also used to mark missing data. Why do we care about missing values?

- Some algorithms, or some implementations of them, can't handle missing values. They assume that the dataset is complete.
- Missing values can hurt the performance of our model.

In most cases, the first one is the main reason.

You might consider simply dropping the rows or columns that have too many missing values. That works well if only a small part of the data is dropped. However, when a large part is dropped, it can cause other problems. For example, dropping a whole column loses all the information that column carried. The alternative is to **impute** the missing values, that is, to fill them in with estimates. **sklearn** provides several classes for missing value imputation.
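
Before choosing between dropping and imputing, it helps to measure how much data is actually missing. The snippet below is a minimal sketch on a made-up array `X_demo` (hypothetical data, not from any real dataset): it counts the NaN entries per column and shows how many rows survive if every incomplete row is dropped.

```python
import numpy as np

# Hypothetical toy data; np.nan marks the missing entries.
X_demo = np.array([
    [1.0, 2.0, np.nan],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [np.nan, 11.0, 12.0],
])

# Count the missing entries in each column.
missing_per_column = np.isnan(X_demo).sum(axis=0)
print("Missing per column:", missing_per_column)

# Keep only the rows that have no missing entry at all.
complete_rows = ~np.isnan(X_demo).any(axis=1)
X_dropped = X_demo[complete_rows]
print("Rows before:", X_demo.shape[0], "rows after:", X_dropped.shape[0])
```

In this toy case only a quarter of the entries are missing, yet dropping incomplete rows discards three of the four rows.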

## Imputation with simple methods

As the name suggests, the **SimpleImputer** provides some simple strategies to impute the missing value.

- mean: replace missing values using the mean along each column.
- median: replace missing values using the median along each column. 
- most_frequent: replace missing values using the most frequent value along each column. This one can be used for both string and numeric data.
- constant: replace missing values with the value given by the `fill_value` parameter.

There is another parameter, **missing_values**, which specifies what value is treated as missing. The default is `np.nan`.
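
The `constant` strategy and the `missing_values` parameter can be combined. The following is a small sketch, assuming a dataset in which `-1` is the placeholder for a missing entry; the array `X_flag` and the fill value `0` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data in which -1 is the placeholder for "missing".
X_flag = np.array([
    [1.0, -1.0],
    [-1.0, 4.0],
    [5.0, 6.0],
])

# Treat -1 as missing and replace it with the constant 0.
imp_const = SimpleImputer(missing_values=-1, strategy="constant", fill_value=0)
X_const = imp_const.fit_transform(X_flag)
print(X_const)
```

Every `-1` is replaced by `0`, and all other entries are left unchanged.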

First, let's create a matrix and set some of its values to **np.nan**.

In [1]:
import numpy as np

np.random.seed(42)
X = np.random.random(size=(4, 4))
# set the missing values
X[2, 3] = np.nan
X[3, 0] = np.nan
print("The original value")
print(X)

The original value
[[0.37454012 0.95071431 0.73199394 0.59865848]
 [0.15601864 0.15599452 0.05808361 0.86617615]
 [0.60111501 0.70807258 0.02058449        nan]
 [       nan 0.21233911 0.18182497 0.18340451]]


Then, let's impute the missing values with the **mean**.

In [2]:
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(strategy='mean')
X_mean = imp_mean.fit_transform(X)
print("Fill the missing value by mean")
print(X_mean)

Fill the missing value by mean
[[0.37454012 0.95071431 0.73199394 0.59865848]
 [0.15601864 0.15599452 0.05808361 0.86617615]
 [0.60111501 0.70807258 0.02058449 0.54941305]
 [0.37722459 0.21233911 0.18182497 0.18340451]]


Next, let's impute the missing values with the **median**.

In [3]:
imp_median = SimpleImputer(strategy='median')
X_median = imp_median.fit_transform(X)
print("Fill the missing value by median")
print(X_median)

Fill the missing value by median
[[0.37454012 0.95071431 0.73199394 0.59865848]
 [0.15601864 0.15599452 0.05808361 0.86617615]
 [0.60111501 0.70807258 0.02058449 0.59865848]
 [0.37454012 0.21233911 0.18182497 0.18340451]]
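
A fitted imputer stores the per-column statistic it learned in its `statistics_` attribute, and the same fitted object can fill new rows with those stored values. The sketch below rebuilds the matrix from the example above; `X_new` is a made-up row used only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Rebuild the same matrix as in the example above.
np.random.seed(42)
X = np.random.random(size=(4, 4))
X[2, 3] = np.nan
X[3, 0] = np.nan

imp_mean = SimpleImputer(strategy="mean").fit(X)

# The learned statistics are the column means computed while ignoring NaN.
print(imp_mean.statistics_)
print(np.nanmean(X, axis=0))

# Hypothetical new row: its missing entry is filled with the stored mean
# of the first column, not with a statistic of the new data.
X_new = np.array([[np.nan, 0.5, 0.5, 0.5]])
X_new_filled = imp_mean.transform(X_new)
print(X_new_filled)
```

Separating `fit` from `transform` this way keeps the fill values tied to the data the imputer was fitted on.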


**SimpleImputer** can impute categorical data as well as numeric data. In this example, the missing value is an empty string, not **nan**, and the strategy is **most_frequent**.

In [4]:
X2 = np.array(["Sun", "Sun", "Moon", "Earth", "", "Sun"], dtype='object').reshape(-1, 1)
imp_freq = SimpleImputer(missing_values="", strategy='most_frequent')
print("The original data")
print(X2)
X_freq = imp_freq.fit_transform(X2)
print("Fill the missing value by most frequent item")
print(X_freq)

The original data
[['Sun']
 ['Sun']
 ['Moon']
 ['Earth']
 ['']
 ['Sun']]
Fill the missing value by most frequent item
[['Sun']
 ['Sun']
 ['Moon']
 ['Earth']
 ['Sun']
 ['Sun']]
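
If the most frequent category is not a sensible default, the `constant` strategy can label the gaps explicitly instead. This is a small sketch on the same data; the label `"Unknown"` is an arbitrary choice made for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X2 = np.array(["Sun", "Sun", "Moon", "Earth", "", "Sun"], dtype="object").reshape(-1, 1)

# Replace the blank string with an explicit placeholder category.
imp_label = SimpleImputer(missing_values="", strategy="constant", fill_value="Unknown")
X_label = imp_label.fit_transform(X2)
print(X_label.ravel())
```

The blank entry becomes its own category, so a later step can still tell which values were originally missing.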


## Imputation with KNN

`sklearn` provides another class, `KNNImputer`, to fill in missing values using k-Nearest Neighbors. Each sample's missing values are imputed using the mean value from the `n_neighbors` nearest neighbors found in the training set. The usage is very similar to that of `SimpleImputer`.
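
To find neighbors when some coordinates are missing, `KNNImputer` by default uses a NaN-aware Euclidean distance, which is also available as `nan_euclidean_distances` in `sklearn.metrics.pairwise`. Coordinates missing in either row are skipped, and the squared distance is scaled up by the ratio of total coordinates to coordinates present. The sketch below compares the library result with a manual computation on two illustrative rows `a` and `b`.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Two rows; the third coordinate of the first row is missing.
a = np.array([[1.0, 2.0, np.nan, 2.0]])
b = np.array([[3.0, 4.0, 3.0, 5.0]])

# Library result.
d_lib = nan_euclidean_distances(a, b)[0, 0]

# Manual computation: use only the 3 coordinates present in both rows,
# then rescale the squared distance by 4 / 3.
present = ~np.isnan(a[0]) & ~np.isnan(b[0])
sq = np.sum((a[0, present] - b[0, present]) ** 2)
d_manual = np.sqrt(a.shape[1] / present.sum() * sq)

print(d_lib, d_manual)
```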

In the example below, we use `n_neighbors=3` as the parameter.

In [10]:
from sklearn.impute import KNNImputer

X = np.array([[1, 2, np.nan, 2], [3, 4, 3, 5], [np.nan, 6, 5, 7], [8, 8, 7, 9]])
knn = KNNImputer(n_neighbors=3)
print("The original value")
print(X)
X_trans = knn.fit_transform(X)
print("Fill the missing value by KNN")
print(X_trans)

The original value
[[ 1.  2. nan  2.]
 [ 3.  4.  3.  5.]
 [nan  6.  5.  7.]
 [ 8.  8.  7.  9.]]
Fill the missing value by KNN
[[1. 2. 5. 2.]
 [3. 4. 3. 5.]
 [4. 6. 5. 7.]
 [8. 8. 7. 9.]]
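
In this small matrix only three rows can donate a value for each missing entry, so `n_neighbors=3` uses every available donor and the result coincides with the column mean. A smaller `n_neighbors` makes the choice of neighbors matter. The sketch below reuses the same matrix with `n_neighbors=2`; the variable names `knn2` and `X_knn2` are introduced here for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1, 2, np.nan, 2], [3, 4, 3, 5], [np.nan, 6, 5, 7], [8, 8, 7, 9]])

# Use only the two nearest rows that have a value in the missing column.
knn2 = KNNImputer(n_neighbors=2)
X_knn2 = knn2.fit_transform(X)
print(X_knn2)
```

Here the missing entry in the first row becomes 4 (the mean of 3 and 5 from its two nearest rows) and the missing entry in the third row becomes 5.5 (the mean of 3 and 8), instead of the column means 5 and 4.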
