# Imputation of Missing Values:

Many real-world datasets contain missing values, often encoded as blanks, NaNs, or other placeholders. Such datasets, however, are incompatible with scikit-learn estimators, which assume that all values in an array are numerical and that every value is meaningful.

## SimpleImputer():

The **SimpleImputer** class provides basic strategies for imputing missing values. Missing values can be replaced with a provided constant value, or with a statistic (mean, median, or most frequent value) of the column in which they occur. This class also supports different encodings of missing values.
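As a minimal sketch of the "different encodings" point, the `missing_values` parameter tells the imputer which placeholder to treat as missing. The `readings` array and the choice of `-1` as a placeholder are assumptions made up for illustration; they are not part of the original notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data in which -1 (not NaN) marks a missing reading
readings = np.array([[10.0], [-1.0], [30.0]])

# Treat -1 as the missing-value placeholder and fill it with the column mean
imp_placeholder = SimpleImputer(missing_values=-1, strategy='mean')
filled = imp_placeholder.fit_transform(readings)

print(filled)
```

The mean is computed only from the values that are not marked as missing, so here the placeholder is replaced by the mean of 10 and 30.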

In [1]:
# importing libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

In [2]:
# create a DataFrame
miles = pd.DataFrame({'farthest_run_mi': [50,62,np.nan,100,26,13,31,50]})
miles

| | farthest_run_mi |
|---|---|
| 0 | 50.0 |
| 1 | 62.0 |
| 2 | NaN |
| 3 | 100.0 |
| 4 | 26.0 |
| 5 | 13.0 |
| 6 | 31.0 |
| 7 | 50.0 |


In [3]:
# checking for missing values
miles.isna().sum()

farthest_run_mi    1
dtype: int64

#### Mean Strategy:

In [4]:
imp_mean = SimpleImputer(strategy='mean')

imp_mean.fit_transform(miles)

array([[ 50.        ],
       [ 62.        ],
       [ 47.42857143],
       [100.        ],
       [ 26.        ],
       [ 13.        ],
       [ 31.        ],
       [ 50.        ]])

The missing value has been replaced by the mean of the observed values, which is approximately **47.43**.

#### Median Strategy:

In [5]:
imp_median = SimpleImputer(strategy='median')

imp_median.fit_transform(miles)

array([[ 50.],
       [ 62.],
       [ 50.],
       [100.],
       [ 26.],
       [ 13.],
       [ 31.],
       [ 50.]])

Here the missing value has been replaced by the median of the observed values, which is **50**. For this column, the median is fairly close to the mean.

#### Mode Strategy:

In [6]:
imp_mode = SimpleImputer(strategy='most_frequent')

imp_mode.fit_transform(miles)

array([[ 50.],
       [ 62.],
       [ 50.],
       [100.],
       [ 26.],
       [ 13.],
       [ 31.],
       [ 50.]])
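To see the fill value each strategy learns without scanning the transformed array, one option is to read the fitted imputer's `statistics_` attribute. The sketch below rebuilds the same `miles` DataFrame; the `fill_values` dictionary is just a helper name chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

miles = pd.DataFrame({'farthest_run_mi': [50, 62, np.nan, 100, 26, 13, 31, 50]})

# Fit one imputer per strategy and record the value it would fill in
fill_values = {}
for strategy in ['mean', 'median', 'most_frequent']:
    imputer = SimpleImputer(strategy=strategy).fit(miles)
    # statistics_ holds one fill value per column; this frame has one column
    fill_values[strategy] = float(imputer.statistics_[0])

for strategy, value in fill_values.items():
    print(f"{strategy}: {value:.2f}")
```

For this column the median and the most frequent value coincide at 50, which is why the last two outputs above look identical.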

In [7]:
# a categorical column with one missing value
names = pd.DataFrame({'names': ['Muhammad', 'Alex', 'Warner', 'Ali', np.nan, 'Bilal']})
names

| | names |
|---|---|
| 0 | Muhammad |
| 1 | Alex |
| 2 | Warner |
| 3 | Ali |
| 4 | NaN |
| 5 | Bilal |


In [8]:
# strategy for categorical data
imp_cons_cat = SimpleImputer(strategy='constant', fill_value='Sarah')

imp_cons_cat.fit_transform(names)

array([['Muhammad'],
       ['Alex'],
       ['Warner'],
       ['Ali'],
       ['Sarah'],
       ['Bilal']], dtype=object)
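Besides `constant`, the `most_frequent` strategy also works on string columns. The sketch below uses a made-up `shirt_color` column (an assumption for illustration, not data from this notebook) because the `names` column above has no repeated value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column in which 'red' is the most common value
colors = pd.DataFrame({'shirt_color': ['red', 'blue', 'red', np.nan, 'green']})

# Fill the missing entry with the most frequent category
imp_mode_cat = SimpleImputer(strategy='most_frequent')
colors_filled = imp_mode_cat.fit_transform(colors)

print(colors_filled)
```

This can be a more natural default than a fixed constant when one category clearly dominates the column.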

In [10]:
df = pd.read_csv('impu.csv')
df

| | Name | farthest_run_mi |
|---|---|---|
| 0 | Alex | 50.0 |
| 1 | Nolan | 62.0 |
| 2 | Christopher | NaN |
| 3 | Muhammad | 100.0 |
| 4 | Mustafa | 26.0 |
| 5 | NaN | 13.0 |
| 6 | Yousuf | 31.0 |
| 7 | Mehmet | 50.0 |


In [11]:
from sklearn.compose import make_column_transformer

In [12]:
ct = make_column_transformer(
    (imp_cons_cat, ['Name']),
    (imp_mean, ['farthest_run_mi']),
    remainder='drop'
)

In [13]:
ct.set_output(transform='pandas')

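AttributeError: 'ColumnTransformer' object has no attribute 'set_output'

The `set_output` method was added in scikit-learn 1.2, so this error indicates that an older version is installed. Without it, the transformer returns a NumPy array rather than a DataFrame, as the next cell shows.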

In [14]:
df_pandas = ct.fit_transform(df)
df_pandas

array([['Alex', 50.0],
       ['Nolan', 62.0],
       ['Christopher', 47.42857142857143],
       ['Muhammad', 100.0],
       ['Mustafa', 26.0],
       ['Sarah', 13.0],
       ['Yousuf', 31.0],
       ['Mehmet', 50.0]], dtype=object)
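When `set_output` is not available, one workaround is to wrap the returned array in a DataFrame by hand. The sketch below is an assumption-laden example: it rebuilds the data shown above inline instead of reading `impu.csv`, and the name `df_imputed` is chosen only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

# Same values as the table above, built inline so the example is self-contained
df = pd.DataFrame({
    'Name': ['Alex', 'Nolan', 'Christopher', 'Muhammad',
             'Mustafa', np.nan, 'Yousuf', 'Mehmet'],
    'farthest_run_mi': [50, 62, np.nan, 100, 26, 13, 31, 50],
})

ct = make_column_transformer(
    (SimpleImputer(strategy='constant', fill_value='Sarah'), ['Name']),
    (SimpleImputer(strategy='mean'), ['farthest_run_mi']),
    remainder='drop',
)

# The result is an object array; the columns follow the order of the transformers
imputed = ct.fit_transform(df)
df_imputed = pd.DataFrame(imputed, columns=['Name', 'farthest_run_mi'], index=df.index)

# Restore the numeric dtype that was lost when the columns were stacked together
df_imputed['farthest_run_mi'] = df_imputed['farthest_run_mi'].astype(float)

print(df_imputed)
```

The column names are passed explicitly because the array itself carries no labels.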

That is how you can impute both numerical and categorical columns with **SimpleImputer**.