## Missing Category imputation with Scikit-learn: SimpleImputer

The **SimpleImputer** class of Scikit-learn imputes missing values with:

- the **mean or median** for numerical variables,
- the **most frequent category** for categorical variables,
- an **arbitrary value** for both categorical and numerical variables.

Advantages:

- Simple to use.
- Good quality code.
- Fast computation.
- Allows for grid search over the various imputation techniques.
- Allows for different encodings of missing values (you can indicate whether the missing values are np.nan, zeroes, etc.).

Limitation:

- It returns a NumPy array instead of a pandas dataframe, which is inconvenient for data analysis.
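As a quick illustration of the strategies and of the `missing_values` parameter described above, here is a minimal sketch on made-up arrays. The numbers and the names `X_toy` and `X_zeros` are invented for this example and do not come from the housing dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data, invented for illustration: two numerical columns with NaNs
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan],
                  [8.0, 60.0]])

# Mean and median imputation: the learnt values are stored in statistics_
mean_stats = SimpleImputer(strategy='mean').fit(X_toy).statistics_
median_stats = SimpleImputer(strategy='median').fit(X_toy).statistics_

# Arbitrary value imputation: every NaN is replaced by -1.0
X_constant = SimpleImputer(strategy='constant',
                           fill_value=-1.0).fit_transform(X_toy)

# A different missing-value encoding: here zeroes mark the missing data
X_zeros = np.array([[0.0, 2.0],
                    [4.0, 0.0],
                    [6.0, 8.0]])
X_zeros_imputed = SimpleImputer(missing_values=0,
                                strategy='mean').fit_transform(X_zeros)

print(mean_stats, median_stats)
print(X_constant)
print(X_zeros_imputed)
```

In every case the output is a plain NumPy array, which is the limitation mentioned above.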

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

**Load the dataset with some categorical columns and the target SalePrice!**

In [2]:
cols_to_use = ['BsmtQual', 'FireplaceQu', 'SalePrice']
data = pd.read_csv('housingPrices_train.csv', usecols=cols_to_use)
data.head()

Unnamed: 0,BsmtQual,FireplaceQu,SalePrice
0,Gd,,208500
1,Gd,TA,181500
2,Gd,TA,223500
3,TA,Gd,140000
4,Gd,TA,250000


**Inspect the percentage of missing values in each variable!**

In [3]:
data.isnull().mean()

BsmtQual       0.025342
FireplaceQu    0.472603
SalePrice      0.000000
dtype: float64

The variables BsmtQual and FireplaceQu contain missing data: about 2.5% and 47% of the observations, respectively.

In [4]:
cols_to_use.remove('SalePrice')
X_train, X_test, y_train, y_test = train_test_split(
    data[cols_to_use],  # just the features
    data['SalePrice'],  # the target
    test_size=0.3,  # the percentage of obs in the test set
    random_state=0)  # for reproducibility
X_train.shape, X_test.shape

((1022, 2), (438, 2))

**Check the missing data in the train set now!**

In [5]:
X_train.isnull().mean()

BsmtQual       0.023483
FireplaceQu    0.467710
dtype: float64

**Inspect the values of the categorical variable BsmtQual!**

In [6]:
X_train['BsmtQual'].unique()

array(['Gd', 'TA', 'Fa', nan, 'Ex'], dtype=object)

**Inspect the values of the categorical variable FireplaceQu!**

In [7]:
X_train['FireplaceQu'].unique()

array([nan, 'Gd', 'TA', 'Fa', 'Po', 'Ex'], dtype=object)

**Impute the missing values with SimpleImputer! Create an instance! Indicate that we want to impute by replacing NA with 'Missing'!**

In [8]:
imputer = SimpleImputer(strategy='constant', fill_value='Missing')
imputer.fit(X_train)  # Fit the imputer

SimpleImputer(fill_value='Missing', strategy='constant')

**Look at the learnt parameters like this (with strategy='constant', statistics_ holds the fill value for each column, not the mode):**

In [9]:
imputer.statistics_

array(['Missing', 'Missing'], dtype=object)
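For comparison, here is a minimal sketch of what `statistics_` holds when the strategy is `'most_frequent'`: the mode of each column, learnt from the data passed to `fit`. The small dataframe `toy` is invented for illustration and only borrows the column names of the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical data, invented for illustration
toy = pd.DataFrame({'BsmtQual': ['Gd', 'TA', 'Gd', np.nan],
                    'FireplaceQu': [np.nan, 'TA', 'TA', 'Gd']},
                   dtype=object)

# With 'most_frequent', fit() learns the mode of each column
mode_imputer = SimpleImputer(strategy='most_frequent').fit(toy)
modes = mode_imputer.statistics_

# transform() replaces each NaN by the mode of its column
filled = mode_imputer.transform(toy)

print(modes)
print(filled)
```

With `strategy='constant'`, nothing is learnt from the data, so `statistics_` simply repeats the `fill_value`.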

**Impute the train and test set! NOTE: the result is a NumPy array, not a dataframe!**

In [10]:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
X_train

array([['Gd', 'Missing'],
       ['Gd', 'Gd'],
       ['TA', 'Missing'],
       ...,
       ['Missing', 'Missing'],
       ['Gd', 'TA'],
       ['Gd', 'Missing']], dtype=object)

**Convert the train set back to a dataframe:**

In [11]:
X_train = pd.DataFrame(X_train, columns=cols_to_use)
X_train.head()

Unnamed: 0,BsmtQual,FireplaceQu
0,Gd,Missing
1,Gd,Gd
2,TA,Missing
3,TA,Missing
4,TA,Missing
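Note that the conversion above resets the row index to 0, 1, 2, ..., whereas the original train set kept the shuffled index produced by train_test_split, so the rebuilt dataframe no longer lines up with y_train by index. A minimal sketch of one way to keep the index, assuming an invented frame `X_toy` with arbitrary index labels:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with a non-default index, as after a shuffled split
# (values and index labels are invented for illustration)
X_toy = pd.DataFrame({'BsmtQual': ['Gd', np.nan, 'TA'],
                      'FireplaceQu': [np.nan, 'Gd', np.nan]},
                     index=[64, 682, 960], dtype=object)

imputer = SimpleImputer(strategy='constant', fill_value='Missing')

# Rebuild the dataframe, reusing both the column names and the row index
X_imputed = pd.DataFrame(imputer.fit_transform(X_toy),
                         columns=X_toy.columns,
                         index=X_toy.index)

print(X_imputed)
```

In scikit-learn 1.2 and later, calling `imputer.set_output(transform='pandas')` before transforming is an alternative that returns a dataframe directly.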


In [12]:
X_train['BsmtQual'].unique()

array(['Gd', 'TA', 'Fa', 'Missing', 'Ex'], dtype=object)

In [13]:
X_train.isnull().mean()

BsmtQual       0.0
FireplaceQu    0.0
dtype: float64

**A MASSIVE NOTE OF CAUTION**:

When SimpleImputer is set up with **strategy='constant'** and **fill_value='Missing'**, it applies that fill value to every column it receives. If your dataframe contains both numerical and categorical variables, NA in both will be replaced by 'Missing', which converts your numerical variables into categorical ones; this is probably not what you are after. Most datasets contain both types of variables, so you will very likely have to use a column transformer, as shown in previous notebooks and again below.
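To make the problem concrete, here is a minimal sketch on an invented three-row frame (`mixed`) with one numerical and one categorical column. It is only meant to show the effect and uses no data from the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame, invented for illustration: one numerical, one categorical column
mixed = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0],
    'BsmtQual': pd.Series(['Gd', np.nan, 'TA'], dtype=object),
})

imputer = SimpleImputer(strategy='constant', fill_value='Missing')
imputed = imputer.fit_transform(mixed)

# The whole array is of dtype object, and the numerical column
# now contains the string 'Missing'
imputed_df = pd.DataFrame(imputed, columns=mixed.columns)
print(imputed.dtype)
print(imputed_df)
```

The numerical column can no longer be used in arithmetic without further cleaning, which is why each group of variables gets its own imputer below.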

**Load the dataset with both numerical and categorical variables!**

In [14]:
cols_to_use = ['BsmtQual', 'FireplaceQu', 'LotFrontage',
               'MasVnrArea', 'GarageYrBlt', 'SalePrice']
data = pd.read_csv('housingPrices_train.csv', usecols=cols_to_use)
data.head()

Unnamed: 0,LotFrontage,MasVnrArea,BsmtQual,FireplaceQu,GarageYrBlt,SalePrice
0,65.0,196.0,Gd,,2003.0,208500
1,80.0,0.0,Gd,TA,1976.0,181500
2,68.0,162.0,Gd,TA,2001.0,223500
3,60.0,0.0,TA,Gd,1998.0,140000
4,84.0,350.0,Gd,TA,2000.0,250000


In [15]:
cols_to_use.remove('SalePrice')
X_train, X_test, y_train, y_test = train_test_split(data[cols_to_use],
                                                    data['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((1022, 5), (438, 5))

**Look at the missing values!**

In [16]:
X_train.isnull().mean()

BsmtQual       0.023483
FireplaceQu    0.467710
LotFrontage    0.184932
MasVnrArea     0.004892
GarageYrBlt    0.052838
dtype: float64

For this demo, I will impute the numerical variables by the mean, and the categorical variables by the arbitrary label 'Missing'.

**Make lists, indicating which features will be imputed with each method!**

In [17]:
features_numeric = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
features_categoric = ['BsmtQual', 'FireplaceQu']
preprocessor = ColumnTransformer(transformers=[ # put the features list and the transformers together
    ('imputer_numeric', SimpleImputer(strategy='mean'), features_numeric),
    ('imputer_categoric', SimpleImputer(strategy='constant', fill_value='Missing'), features_categoric)])

**Fit the preprocessor!**

In [18]:
preprocessor.fit(X_train)

ColumnTransformer(transformers=[('imputer_numeric', SimpleImputer(),
                                 ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']),
                                ('imputer_categoric',
                                 SimpleImputer(fill_value='Missing',
                                               strategy='constant'),
                                 ['BsmtQual', 'FireplaceQu'])])

**Explore the transformers like this:**

In [19]:
preprocessor.transformers

[('imputer_numeric',
  SimpleImputer(),
  ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']),
 ('imputer_categoric',
  SimpleImputer(fill_value='Missing', strategy='constant'),
  ['BsmtQual', 'FireplaceQu'])]
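The transformer names shown above are also what makes grid search over imputation techniques possible: inside a Pipeline, each imputer parameter can be addressed as `<step>__<transformer>__<parameter>`. The following is a minimal sketch on invented data. The frame `toy`, the step names `'preprocessor'` and `'model'`, and the choice of LinearRegression are assumptions made for this example, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy regression data, invented for illustration
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({'area': rng.normal(100, 20, n),
                    'age': rng.normal(30, 10, n)})
target = 2.0 * toy['area'] - 1.5 * toy['age'] + rng.normal(0, 5, n)

# Knock out some values at random to create missing data
toy.loc[rng.choice(n, 10, replace=False), 'area'] = np.nan
toy.loc[rng.choice(n, 10, replace=False), 'age'] = np.nan

preprocessor = ColumnTransformer(transformers=[
    ('imputer_numeric', SimpleImputer(), ['area', 'age'])])

pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('model', LinearRegression())])

# Parameter path: pipeline step, then transformer name, then parameter
param_grid = {
    'preprocessor__imputer_numeric__strategy': ['mean', 'median']}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(toy, target)

best_strategy = search.best_params_[
    'preprocessor__imputer_numeric__strategy']
print(best_strategy)
```

Which strategy wins depends on the data, so no particular result should be expected from this toy example.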

**Look at the parameters learnt like this: for the numerical imputer!**

In [20]:
preprocessor.named_transformers_['imputer_numeric'].statistics_

array([  69.66866747,  103.55358899, 1978.01239669])

**For the categorical imputer**

In [21]:
preprocessor.named_transformers_['imputer_categoric'].statistics_

array(['Missing', 'Missing'], dtype=object)

**Impute the data! Data is a numpy array!**

In [22]:
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)

**Convert the result into a dataframe!**

In [23]:
pd.DataFrame(X_train,
             columns=features_numeric+features_categoric).head()

Unnamed: 0,LotFrontage,MasVnrArea,GarageYrBlt,BsmtQual,FireplaceQu
0,69.668667,573.0,1998.0,Gd,Missing
1,69.668667,0.0,1996.0,Gd,Gd
2,50.0,0.0,1978.012397,TA,Missing
3,60.0,0.0,1939.0,TA,Missing
4,60.0,0.0,1930.0,TA,Missing


**Convert the result into a dataframe! Explore the missing values, there should be none!**

In [24]:
X_train = pd.DataFrame(X_train,
                       columns=features_numeric + features_categoric)
X_train.isnull().mean()

LotFrontage    0.0
MasVnrArea     0.0
GarageYrBlt    0.0
BsmtQual       0.0
FireplaceQu    0.0
dtype: float64
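One thing the check above does not show: because the ColumnTransformer stacks the numerical and categorical columns into a single array, that array has dtype object, so the numerical columns of the rebuilt dataframe are stored as objects too. Below is a minimal sketch of casting them back to float, using an invented two-column frame (`toy`) that only borrows the column names of the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

features_numeric = ['LotFrontage']
features_categoric = ['BsmtQual']

# Toy frame, invented for illustration
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 75.0],
    'BsmtQual': pd.Series(['Gd', np.nan, 'TA'], dtype=object),
})

preprocessor = ColumnTransformer(transformers=[
    ('imputer_numeric', SimpleImputer(strategy='mean'), features_numeric),
    ('imputer_categoric',
     SimpleImputer(strategy='constant', fill_value='Missing'),
     features_categoric)])

# Mixing floats and strings yields a single object array
arr = preprocessor.fit_transform(toy)

out = pd.DataFrame(arr, columns=features_numeric + features_categoric)

# Cast the numerical columns back to float
out = out.astype({col: float for col in features_numeric})
print(out.dtypes)
```

After the cast, the numerical columns support arithmetic and summary statistics again.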