# Handle Unknown Categories

#### This notebook demonstrates how to handle unknown categories with scikit-learn's OneHotEncoder.

#### Author: Priyanka Dave

<img src="img.jpeg" title="Handle Unknown Categories" />

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

### Create Training and Testing Datasets

In [2]:
train_cars = {'Model': ['Jetta','Polo','Vento','Polo'],'Year': ['2016','2016','2017','2018'],'Milage': ['12','15','15','17']}
vw_train = pd.DataFrame(train_cars, columns = ['Model','Year','Milage'])


test_cars = {'Model': ['Jetta','Ameo','Passat'],'Year': ['2017','2017','2017'],'Milage': ['13','17','15']}
vw_test = pd.DataFrame(test_cars, columns = ['Model','Year','Milage'])

### Training Dataset

In [3]:
vw_train

| | Model | Year | Milage |
|---|---|---|---|
| 0 | Jetta | 2016 | 12 |
| 1 | Polo | 2016 | 15 |
| 2 | Vento | 2017 | 15 |
| 3 | Polo | 2018 | 17 |


### Testing Dataset

In [4]:
vw_test

| | Model | Year | Milage |
|---|---|---|---|
| 0 | Jetta | 2017 | 13 |
| 1 | Ameo | 2017 | 17 |
| 2 | Passat | 2017 | 15 |


### Define OneHotEncoder with handle_unknown='ignore'
- It handles categories in the testing set that were not seen during fitting.
- Each unknown category is encoded as a row of all 0's.

In [5]:
# scikit-learn 1.2 renamed sparse to sparse_output; on older versions use sparse=False
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
enc_fit = enc.fit(vw_train[['Model']])

### Transform the training and testing sets and create dataframes

In [6]:
vw_train_transformed = vw_train.join(pd.DataFrame(enc_fit.transform(vw_train[['Model']])))
vw_test_transformed = vw_test.join(pd.DataFrame(enc_fit.transform(vw_test[['Model']])))
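
The joined columns above are labeled only `0`, `1`, `2`. As a minimal sketch, the encoded columns can be named after the categories the encoder learned, which it stores in `categories_`. The helper `encode_with_names` is hypothetical (not part of the original notebook), and the sketch calls `.toarray()` on the default sparse output so it does not depend on the `sparse`/`sparse_output` argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

vw_train = pd.DataFrame({'Model': ['Jetta', 'Polo', 'Vento', 'Polo']})
vw_test = pd.DataFrame({'Model': ['Jetta', 'Ameo', 'Passat']})

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(vw_train[['Model']])

def encode_with_names(df, encoder, column):
    """Hypothetical helper: one-hot encode `column` and join labeled columns."""
    # The default output is a sparse matrix; convert it to a dense array.
    encoded = encoder.transform(df[[column]]).toarray()
    # categories_[0] holds the categories learned for the first (only) column.
    names = [f"{column}_{cat}" for cat in encoder.categories_[0]]
    # Reuse the source index so the join aligns row by row.
    return df.join(pd.DataFrame(encoded, columns=names, index=df.index))

train_named = encode_with_names(vw_train, enc, 'Model')
test_named = encode_with_names(vw_test, enc, 'Model')
print(test_named)
```

Passing `index=df.index` keeps the join aligned even when the input dataframe does not use the default 0..n-1 index.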

### Transformed training set

In [7]:
vw_train_transformed

| | Model | Year | Milage | 0 | 1 | 2 |
|---|---|---|---|---|---|---|
| 0 | Jetta | 2016 | 12 | 1.0 | 0.0 | 0.0 |
| 1 | Polo | 2016 | 15 | 0.0 | 1.0 | 0.0 |
| 2 | Vento | 2017 | 15 | 0.0 | 0.0 | 1.0 |
| 3 | Polo | 2018 | 17 | 0.0 | 1.0 | 0.0 |


### Transformed testing set
- Here we can notice that **Ameo** and **Passat** are unknown categories, as they do not exist in the training set.
- Unknown categories are encoded with all 0's.

In [8]:
vw_test_transformed

| | Model | Year | Milage | 0 | 1 | 2 |
|---|---|---|---|---|---|---|
| 0 | Jetta | 2017 | 13 | 1.0 | 0.0 | 0.0 |
| 1 | Ameo | 2017 | 17 | 0.0 | 0.0 | 0.0 |
| 2 | Passat | 2017 | 15 | 0.0 | 0.0 | 0.0 |
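
Because every known category sets exactly one column to 1, a row that sums to 0 must have come from an unknown category. The following is a small sketch of using that property to list the unknown models in the testing set. The variable names `is_unknown` and `unknown_models` are illustrative, and the check assumes only a single column is being encoded.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

vw_train = pd.DataFrame({'Model': ['Jetta', 'Polo', 'Vento', 'Polo']})
vw_test = pd.DataFrame({'Model': ['Jetta', 'Ameo', 'Passat']})

enc = OneHotEncoder(handle_unknown='ignore').fit(vw_train[['Model']])
encoded_test = enc.transform(vw_test[['Model']]).toarray()

# A row that sums to 0 had no matching category in the training set.
is_unknown = encoded_test.sum(axis=1) == 0
unknown_models = vw_test.loc[is_unknown, 'Model'].tolist()
print(unknown_models)  # ['Ameo', 'Passat']
```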


### Now define OneHotEncoder with handle_unknown='error'

In [9]:
# scikit-learn 1.2 renamed sparse to sparse_output; on older versions use sparse=False
enc = OneHotEncoder(handle_unknown='error', sparse_output=False)
enc_fit = enc.fit(vw_train[['Model']])


### Transform training set and create dataframe

In [10]:
vw_train_transformed = vw_train.join(pd.DataFrame(enc_fit.transform(vw_train[['Model']])))
vw_train_transformed

| | Model | Year | Milage | 0 | 1 | 2 |
|---|---|---|---|---|---|---|
| 0 | Jetta | 2016 | 12 | 1.0 | 0.0 | 0.0 |
| 1 | Polo | 2016 | 15 | 0.0 | 1.0 | 0.0 |
| 2 | Vento | 2017 | 15 | 0.0 | 0.0 | 1.0 |
| 3 | Polo | 2018 | 17 | 0.0 | 1.0 | 0.0 |


### Transform testing set and create dataframe
- This raises an error because the testing dataset contains unknown categories.

In [11]:
vw_test_transformed = vw_test.join(pd.DataFrame(enc_fit.transform(vw_test[['Model']])))

ValueError: Found unknown categories ['Ameo', 'Passat'] in column 0 during transform
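
If the strict behaviour is wanted but the notebook should keep running, the error can be caught explicitly. This is only a sketch of one way to do that: `enc_strict` and `error_message` are names introduced here for illustration, and how to react to the error (log it, drop rows, refit) is left open.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

vw_train = pd.DataFrame({'Model': ['Jetta', 'Polo', 'Vento', 'Polo']})
vw_test = pd.DataFrame({'Model': ['Jetta', 'Ameo', 'Passat']})

enc_strict = OneHotEncoder(handle_unknown='error').fit(vw_train[['Model']])

try:
    enc_strict.transform(vw_test[['Model']])
    error_message = None
except ValueError as exc:
    # Raised because 'Ameo' and 'Passat' were not seen during fit.
    error_message = str(exc)

print(error_message)
```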

### If we remove the unknown categories from the testing set, the above code works.

In [12]:
# Keep only the first row (Jetta), the only model that also appears in the training set
vw_test_new = vw_test[:1]
vw_test_new

| | Model | Year | Milage |
|---|---|---|---|
| 0 | Jetta | 2017 | 13 |


In [13]:
vw_test_transformed = vw_test_new.join(pd.DataFrame(enc_fit.transform(vw_test_new[['Model']])))

In [14]:
vw_test_transformed

| | Model | Year | Milage | 0 | 1 | 2 |
|---|---|---|---|---|---|---|
| 0 | Jetta | 2017 | 13 | 1.0 | 0.0 | 0.0 |
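
Slicing the first row works here only because Jetta happens to be the first entry. A more general sketch, assuming the same data, filters the testing set against the categories the encoder learned (`categories_`) before transforming. The names `known_models`, `vw_test_known` and `result` are illustrative and not from the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

vw_train = pd.DataFrame({'Model': ['Jetta', 'Polo', 'Vento', 'Polo']})
vw_test = pd.DataFrame({'Model': ['Jetta', 'Ameo', 'Passat']})

enc = OneHotEncoder(handle_unknown='error').fit(vw_train[['Model']])

# Categories learned from the training set for the 'Model' column.
known_models = enc.categories_[0]

# Keep only the test rows whose model was seen during fitting.
vw_test_known = vw_test[vw_test['Model'].isin(known_models)]

# No unknown categories remain, so transform no longer raises.
encoded = enc.transform(vw_test_known[['Model']]).toarray()
result = vw_test_known.join(pd.DataFrame(encoded, index=vw_test_known.index))
print(result)
```

Note that filtering discards the Ameo and Passat rows entirely, whereas `handle_unknown='ignore'` keeps them with an all-zero encoding.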
