# Handling Missing Data

Let us take the following dataset as an example.

In [12]:
# Data Preprocessing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data.csv')

print(dataset)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


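As we can see, missing values are shown as *NaN*. First, we separate the data with *iloc* into the feature matrix X (every column except the last) and the target vector y (the Purchased column).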

In [13]:
# Features: every column except the last one
X = dataset.iloc[:, :-1].values
# Target: the Purchased column (index 3)
y = dataset.iloc[:, 3].values

print(X, '\n')
print(y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]] 

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
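
Before imputing, it can help to confirm where the missing entries are. The following is a minimal sketch, not part of the original notebook: it rebuilds the table above inline (so it runs without `Data.csv`) and uses the pandas methods `isna`, `sum`, and `any`. The variable names `missing_per_column` and `rows_with_missing` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Same values as the printed dataset above, rebuilt inline so the sketch is self-contained
dataset = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes',
                  'Yes', 'No', 'Yes', 'No', 'Yes'],
})

# Number of missing entries in each column
missing_per_column = dataset.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing entry
rows_with_missing = dataset[dataset.isna().any(axis=1)]
print(rows_with_missing)
```

For this dataset, only the Age and Salary columns contain a missing entry, which is why only those two columns are passed to the imputer.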


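Now we import *SimpleImputer* and create an imputer object. We then fit it on the numeric columns of X (Age and Salary) so that it learns a fill value for each column. Finally, we replace those columns in X with the transformed values, in which the missing entries are filled in.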

In [14]:
# Taking care of missing data
from sklearn.impute import SimpleImputer

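# Default settings: missing_values=np.nan, strategy='mean'
imputer = SimpleImputer()
# Learn the mean of each numeric column (Age and Salary)
imputer.fit(X[:, 1:3])
# Overwrite the numeric columns with their imputed versions
X[:, 1:3] = imputer.transform(X[:, 1:3])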

print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


As the output shows, each *NaN* has been replaced with the mean of the remaining values in its column: about 38.78 for Age and about 63777.78 for Salary.
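
As a quick sanity check, the two fill values can be reproduced by hand. This is a small sketch that is not part of the original notebook: it copies the Age and Salary columns from the printed dataset and uses NumPy's `nanmean`, which averages while ignoring *NaN* entries.

```python
import numpy as np

# Age and Salary columns copied from the dataset shown earlier
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

# nanmean skips the missing entry, so each mean is taken over 9 values
age_mean = np.nanmean(age)        # 349 / 9
salary_mean = np.nanmean(salary)  # 574000 / 9

print(age_mean)
print(salary_mean)
```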

**Important to note**

The default value of *missing_values* is np.nan, which is how the missing entries appear in our data, so we did not have to change it.

The default value of *strategy* is mean. The other accepted values are median, most_frequent, and constant.
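
The other strategies can be tried in the same way. The following is an illustrative sketch rather than part of the original notebook: it reuses the Age column from the dataset, and the small `codes` array and the fill value `0.0` are made up for the example. `fill_value` is the *SimpleImputer* parameter used together with the constant strategy.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age column from the dataset, as a 2-D array (one feature)
ages = np.array([[44.0], [27.0], [30.0], [38.0], [40.0],
                 [35.0], [np.nan], [48.0], [50.0], [37.0]])

# median: fill with the middle value of the observed entries
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
ages_median = median_imputer.fit_transform(ages)

# constant: fill with a fixed value supplied through fill_value
constant_imputer = SimpleImputer(strategy='constant', fill_value=0.0)
ages_constant = constant_imputer.fit_transform(ages)

# most_frequent: fill with the most common observed value
# (hypothetical toy column, not taken from the dataset)
codes = np.array([[1.0], [2.0], [2.0], [np.nan]])
frequent_imputer = SimpleImputer(strategy='most_frequent')
codes_frequent = frequent_imputer.fit_transform(codes)

print(ages_median[6, 0])
print(ages_constant[6, 0])
print(codes_frequent[3, 0])
```

The median is often preferred over the mean when a column contains outliers, while the constant strategy makes the imputed entries easy to spot afterwards.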

- Methods:

    - **fit**(self, X[, y]): Fit the imputer on X.
    - **fit_transform**(self, X[, y]): Fit to data, then transform it.
    - **get_params**(self[, deep]): Get parameters for this estimator.
    - **set_params**(self, \*\*params): Set the parameters of this estimator.
    - **transform**(self, X): Impute all missing values in X.
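
The methods listed above can be sketched together on a small array. This example is not from the original notebook: the three-row `numeric` array is made up for illustration. It shows that `fit` followed by `transform` gives the same result as `fit_transform`, and it reads the learned column means from the `statistics_` attribute that *SimpleImputer* sets during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two numeric columns with one missing entry each
numeric = np.array([[44.0, 72000.0],
                    [40.0, np.nan],
                    [np.nan, 52000.0]])

# Two steps: fit learns the column means, transform fills the gaps
two_step = SimpleImputer()
two_step.fit(numeric)
filled_two_step = two_step.transform(numeric)

# One step: fit_transform does both on the same data
filled_one_step = SimpleImputer().fit_transform(numeric)

same_result = np.allclose(filled_two_step, filled_one_step)
print(same_result)

# Values learned by fit, one per column
print(two_step.statistics_)

# get_params returns the constructor parameters as a dict
strategy_before = two_step.get_params()['strategy']

# set_params changes them; the imputer must be fitted again to take effect
two_step.set_params(strategy='median')
strategy_after = two_step.get_params()['strategy']
print(strategy_before, strategy_after)
```

Keeping `fit` and `transform` separate is useful when the fill values should be learned on one dataset and then applied to another.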