## Frequent category imputation with Scikit-learn: SimpleImputer

The **SimpleImputer** class provides **most frequent category imputation** for categorical variables. It is **simple to use** when applied to the entire dataframe, the code is of good quality, and computation is **fast** (it relies on numpy). It allows for **grid search** over the various imputation techniques and for **different encodings of missing values** (you can indicate whether the missing values are np.nan, zeroes, etc.).
On the downside, it returns a **numpy array** instead of a pandas dataframe, which is inconvenient for data analysis, and it needs additional classes to select which features to impute, which requires more lines of code.
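
As a minimal sketch of the "different missing value encodings" point, the toy example below assumes that missing values are stored as zeroes rather than np.nan. The dataframe and its column names (`rooms`, `garages`) are made up for illustration and are not part of the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical columns): here 0 stands for "missing", not np.nan.
toy = pd.DataFrame({'rooms': [1.0, 1.0, 0.0, 2.0],
                    'garages': [0.0, 3.0, 3.0, 3.0]})

# missing_values tells the imputer which placeholder to treat as missing.
toy_imputer = SimpleImputer(missing_values=0, strategy='most_frequent')
toy_imputed = toy_imputer.fit_transform(toy)

print(type(toy_imputed))        # a numpy array, not a dataframe
print(toy_imputer.statistics_)  # the mode learnt for each column
print(toy_imputed)
```

The zeroes are replaced by the mode of the remaining values in each column, and the result comes back as a numpy array, which is the inconvenience mentioned above.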

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

**Use only the following variables for the demo: a mix of categorical and numerical!**

In [6]:
cols_to_use = ['BsmtQual', 'FireplaceQu', 'MSZoning',
               'BsmtUnfSF', 'LotFrontage', 'MasVnrArea',
               'Street', 'Alley', 'SalePrice']

In [7]:
data = pd.read_csv('housingPrices_train.csv', usecols=cols_to_use)
print(data.shape)
data.head()

(1460, 9)


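| | MSZoning | LotFrontage | Street | Alley | MasVnrArea | BsmtQual | BsmtUnfSF | FireplaceQu | SalePrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | RL | 65.0 | Pave | NaN | 196.0 | Gd | 150 | NaN | 208500 |
| 1 | RL | 80.0 | Pave | NaN | 0.0 | Gd | 284 | TA | 181500 |
| 2 | RL | 68.0 | Pave | NaN | 162.0 | Gd | 434 | TA | 223500 |
| 3 | RL | 60.0 | Pave | NaN | 0.0 | TA | 540 | Gd | 140000 |
| 4 | RL | 84.0 | Pave | NaN | 350.0 | Gd | 490 | TA | 250000 |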


**Check the proportion of null values in each variable!**

In [8]:
data.isnull().mean()

MSZoning       0.000000
LotFrontage    0.177397
Street         0.000000
Alley          0.937671
MasVnrArea     0.005479
BsmtQual       0.025342
BsmtUnfSF      0.000000
FireplaceQu    0.472603
SalePrice      0.000000
dtype: float64

The categorical variables **Alley, BsmtQual and FireplaceQu** contain missing data.

**Separate into training and testing sets! First, let's remove the target from the features.**

In [9]:
cols_to_use.remove('SalePrice')
X_train, X_test, y_train, y_test = train_test_split(data[cols_to_use], # just the features
                                                    data['SalePrice'], # the target
                                                    test_size=0.3, # the percentage of obs in the test set
                                                    random_state=0) # for reproducibility
X_train.shape, X_test.shape

((1022, 8), (438, 8))

**Check the missing data again!**

In [10]:
X_train.isnull().mean()

BsmtQual       0.023483
FireplaceQu    0.467710
MSZoning       0.000000
BsmtUnfSF      0.000000
LotFrontage    0.184932
MasVnrArea     0.004892
Street         0.000000
Alley          0.939335
dtype: float64

### SimpleImputer on the entire dataset

**Impute the missing values with SimpleImputer! Create an instance and indicate that we want to impute with the most frequent category. The imputer will learn the mode of ALL variables, categorical or not!**

In [11]:
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(X_train[cols_to_use])  # fit the imputer

SimpleImputer(strategy='most_frequent')

**Look at the learnt frequent values like this:**

In [12]:
imputer.statistics_

array(['TA', 'Gd', 'RL', 0, 60.0, 0.0, 'Pave', 'Pave'], dtype=object)

**Note** that the transformer learns the most frequent value for both categorical AND numerical variables.

**Investigate the frequent values to corroborate the imputer did a good job!**

In [13]:
X_train[cols_to_use].mode()

| | BsmtQual | FireplaceQu | MSZoning | BsmtUnfSF | LotFrontage | MasVnrArea | Street | Alley |
|---|---|---|---|---|---|---|---|---|
| 0 | TA | Gd | RL | 0 | 60.0 | 0.0 | Pave | Pave |


**Impute the train and test set! The data will be returned as a numpy array!!!**

In [14]:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
X_train

array([['Gd', 'Gd', 'RL', ..., 573.0, 'Pave', 'Pave'],
       ['Gd', 'Gd', 'RL', ..., 0.0, 'Pave', 'Pave'],
       ['TA', 'Gd', 'RL', ..., 0.0, 'Pave', 'Pave'],
       ...,
       ['TA', 'Gd', 'RM', ..., 0.0, 'Pave', 'Pave'],
       ['Gd', 'TA', 'RL', ..., 18.0, 'Pave', 'Pave'],
       ['Gd', 'Gd', 'RL', ..., 30.0, 'Pave', 'Pave']], dtype=object)

**Convert the train set back to a dataframe:**

In [15]:
pd.DataFrame(X_train, columns=cols_to_use).head()

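| | BsmtQual | FireplaceQu | MSZoning | BsmtUnfSF | LotFrontage | MasVnrArea | Street | Alley |
|---|---|---|---|---|---|---|---|---|
| 0 | Gd | Gd | RL | 318 | 60.0 | 573.0 | Pave | Pave |
| 1 | Gd | Gd | RL | 288 | 60.0 | 0.0 | Pave | Pave |
| 2 | TA | Gd | RL | 162 | 50.0 | 0.0 | Pave | Pave |
| 3 | TA | Gd | RL | 356 | 60.0 | 0.0 | Pave | Pave |
| 4 | TA | Gd | RL | 0 | 60.0 | 0.0 | Pave | Pave |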
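
Rebuilding the dataframe from the array resets the row index and loses the original column types. The sketch below, on a made-up toy frame (the `quality` and `area` columns and the index values are assumptions for illustration), shows one possible way to keep the original index and restore a numeric type.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with a shuffled index, similar to what train_test_split returns.
toy = pd.DataFrame({'quality': ['Gd', np.nan, 'Gd', 'TA'],
                    'area': [10.0, 20.0, np.nan, 10.0]},
                   index=[7, 3, 9, 1])

toy_imputer = SimpleImputer(strategy='most_frequent')
imputed = toy_imputer.fit_transform(toy)  # numpy array holding mixed values

# Rebuild the dataframe, reusing the original column names and row index.
restored = pd.DataFrame(imputed, columns=toy.columns, index=toy.index)

# Columns rebuilt from a mixed array are not numeric; cast them back explicitly.
restored['area'] = restored['area'].astype(float)

print(restored)
print(restored.dtypes)
```

Keeping the index makes it possible to align the imputed features with the target (for example `y_train`) after the transformation.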


### SimpleImputer: different procedures on different features

**Separate into training and testing sets again, because X_train and X_test were replaced by numpy arrays in the previous section!**

In [17]:
X_train, X_test, y_train, y_test = train_test_split(data[cols_to_use],
                                                    data['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((1022, 8), (438, 8))

**Look at the missing values!**

In [18]:
X_train.isnull().mean()

BsmtQual       0.023483
FireplaceQu    0.467710
MSZoning       0.000000
BsmtUnfSF      0.000000
LotFrontage    0.184932
MasVnrArea     0.004892
Street         0.000000
Alley          0.939335
dtype: float64

Now we **impute categorical variables** with the **frequent category** and **numerical variables** with the **mean.**

**Make lists, indicating which features will be imputed with each method! Then put the features list and the transformers together!**

In [19]:
features_numeric = ['BsmtUnfSF', 'LotFrontage', 'MasVnrArea']
features_categoric = ['BsmtQual', 'FireplaceQu', 'MSZoning',
                      'Street', 'Alley']
preprocessor = ColumnTransformer(transformers=[
    ('numeric_imputer', SimpleImputer(strategy='mean'), features_numeric),
    ('categoric_imputer', SimpleImputer(strategy='most_frequent'), features_categoric)])

**Fit the preprocessor!**

In [20]:
preprocessor.fit(X_train)

ColumnTransformer(transformers=[('numeric_imputer', SimpleImputer(),
                                 ['BsmtUnfSF', 'LotFrontage', 'MasVnrArea']),
                                ('categoric_imputer',
                                 SimpleImputer(strategy='most_frequent'),
                                 ['BsmtQual', 'FireplaceQu', 'MSZoning',
                                  'Street', 'Alley'])])

**Explore the transformers like this:**

In [21]:
preprocessor.transformers

[('numeric_imputer',
  SimpleImputer(),
  ['BsmtUnfSF', 'LotFrontage', 'MasVnrArea']),
 ('categoric_imputer',
  SimpleImputer(strategy='most_frequent'),
  ['BsmtQual', 'FireplaceQu', 'MSZoning', 'Street', 'Alley'])]
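
The transformer names matter when the preprocessor is tuned, because grid search addresses each imputer through them. The sketch below illustrates the "grid search over the imputation techniques" idea under several assumptions: the data is synthetic (columns `a` and `b`, not the housing data), only a numeric imputer is used, and `LinearRegression` is an arbitrary model chosen to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic numeric data (an assumption for the demo, not the housing data).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(60, 2)), columns=['a', 'b'])
y_toy = 2 * X_toy['a'] - X_toy['b'] + rng.normal(scale=0.1, size=60)
X_toy.loc[::7, 'a'] = np.nan  # make every 7th value of 'a' missing

pipe = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('numeric_imputer', SimpleImputer(strategy='mean'), ['a', 'b'])])),
    ('model', LinearRegression())])

# Parameter names chain the step name, the transformer name and the argument.
param_grid = {
    'preprocessor__numeric_imputer__strategy': ['mean', 'median', 'most_frequent']}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

Which strategy wins depends on the data, so the printed result should not be read as a general recommendation.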

**Look at the parameters learnt by the mean imputer like this:**

In [22]:
preprocessor.named_transformers_['numeric_imputer'].statistics_

array([565.99217221,  69.66866747, 103.55358899])

**Corroborate those values against the means in the train set!**

In [55]:
X_train[features_numeric].mean()

BsmtUnfSF      565.992172
LotFrontage     69.668667
MasVnrArea     103.553589
dtype: float64

**And for the frequent category imputer:**

In [56]:
preprocessor.named_transformers_['categoric_imputer'].statistics_

array(['TA', 'Gd', 'RL', 'Pave', 'Pave'], dtype=object)

**Corroborate those values in the train set!**

In [34]:
X_train[features_categoric].mode()

| | BsmtQual | FireplaceQu | MSZoning | Street | Alley |
|---|---|---|---|---|---|
| 0 | TA | Gd | RL | Pave | Pave |


**Impute the data**

In [35]:
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)

In [36]:
X_train.shape

(1022, 8)

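**See how the result of the imputation is an 8 column dataset! The numerical variables come first and the categorical ones after, following the order of the transformers.**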

In [37]:
pd.DataFrame(X_train,
             columns=features_numeric + features_categoric).head()

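| | BsmtUnfSF | LotFrontage | MasVnrArea | BsmtQual | FireplaceQu | MSZoning | Street | Alley |
|---|---|---|---|---|---|---|---|---|
| 0 | 318.0 | 69.668667 | 573.0 | Gd | Gd | RL | Pave | Pave |
| 1 | 288.0 | 69.668667 | 0.0 | Gd | Gd | RL | Pave | Pave |
| 2 | 162.0 | 50.0 | 0.0 | TA | Gd | RL | Pave | Pave |
| 3 | 356.0 | 60.0 | 0.0 | TA | Gd | RL | Pave | Pave |
| 4 | 0.0 | 60.0 | 0.0 | TA | Gd | RL | Pave | Pave |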


In this case, we passed all the features available in the dataset to the missing data imputers, so the returned dataset contains all the variables.
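
To see what happens when some features are not passed to any imputer, here is a small sketch on a made-up toy frame (the `area`, `quality` and `street` columns are assumptions for illustration). It compares the default behaviour of ColumnTransformer with `remainder='passthrough'`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({'area': [10.0, np.nan, 30.0],
                    'quality': ['Gd', 'Gd', np.nan],
                    'street': ['Pave', 'Grvl', 'Pave']})

# Only 'area' and 'quality' are handed to an imputer; 'street' is not listed.
transformers = [
    ('numeric_imputer', SimpleImputer(strategy='mean'), ['area']),
    ('categoric_imputer', SimpleImputer(strategy='most_frequent'), ['quality'])]

# Default (remainder='drop'): columns that are not listed are discarded.
dropped = ColumnTransformer(transformers=transformers).fit_transform(toy)

# remainder='passthrough': unlisted columns are appended, untouched, at the end.
kept = ColumnTransformer(transformers=transformers,
                         remainder='passthrough').fit_transform(toy)

print(dropped.shape)
print(kept.shape)
print(kept)
```

With the default setting the `street` column disappears from the output, so when only some variables need imputation, `remainder='passthrough'` is one way to keep the rest.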