
## **One hot encoding**

This notebook is a short introduction to one hot encoding, a widely used method for converting categorical data into numerical form: each category becomes its own binary (0/1) column.
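As a minimal illustration of the idea (a sketch with made-up values, not part of the original notebook), each category can be mapped to an integer index, and that index then selects one row of an identity matrix:

```python
import numpy as np

# Hypothetical example values, chosen only for illustration.
values = ["car", "bus", "car", "bike"]

# Sorted unique categories define the column order.
categories = sorted(set(values))
index = {cat: i for i, cat in enumerate(categories)}

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories), dtype=int)[[index[v] for v in values]]
print(categories)
print(one_hot)
```

Each row contains exactly one 1, in the column that belongs to that row's category.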


## **Creating a new dataset**

In [0]:
import pandas as pd

df = pd.DataFrame([["London", "car", 20],
                   ["Cambridge", "car", 10],
                   ["Liverpool", "bus", 30]],
                  columns=["city", "transport", "duration"])

In [0]:
df_test = pd.DataFrame([["Manchester", "bike", 30],
                        ["Cambridge", "car", 40],
                        ["Liverpool", "bike", 10]],
                       columns=["city", "transport", "duration"])

## **Processing the training data:**

First we will define the list of categorical data that needs processing:

In [0]:
cat_columns = ["city", "transport"]

Now we create the dummy features by calling pandas' `get_dummies` function, which returns a new dataframe holding the processed data:

In [0]:
df_processed = pd.get_dummies(df, prefix_sep="__",
                              columns=cat_columns)

In [5]:
df_processed

| | duration | city__Cambridge | city__Liverpool | city__London | transport__bus | transport__car |
|---|---|---|---|---|---|---|
| 0 | 20 | 0 | 0 | 1 | 0 | 1 |
| 1 | 10 | 1 | 0 | 0 | 0 | 1 |
| 2 | 30 | 0 | 1 | 0 | 1 | 0 |
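The 0/1 values above reflect the pandas version the notebook was run with. From pandas 2.0 onwards, `get_dummies` returns boolean columns by default. A sketch of requesting integers explicitly through the `dtype` argument follows; the name `df_int` is introduced here for illustration only:

```python
import pandas as pd

df = pd.DataFrame([["London", "car", 20],
                   ["Cambridge", "car", 10],
                   ["Liverpool", "bus", 30]],
                  columns=["city", "transport", "duration"])

# dtype=int forces 0/1 integers regardless of the pandas default.
df_int = pd.get_dummies(df, prefix_sep="__",
                        columns=["city", "transport"], dtype=int)
print(df_int)
```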


That's it for the training set: we now have a dataframe with one hot encoded features. Next we save a few components in variables so that exactly the same features can be reproduced in the test set.

In [0]:
cat_dummies = [col for col in df_processed 
               if "__" in col
               and col.split("__")[0] in cat_columns]

In [7]:
cat_dummies


['city__Cambridge',
 'city__Liverpool',
 'city__London',
 'transport__bus',
 'transport__car']

Let's also save the list of columns so that we can enforce order later on

In [0]:
processed_columns = list(df_processed.columns)

In [9]:
processed_columns

['duration',
 'city__Cambridge',
 'city__Liverpool',
 'city__London',
 'transport__bus',
 'transport__car']

## **Process test data**

Now we apply the same encoding to the test data, then make sure it ends up with the same columns as the training data:

In [0]:
df_test_processed = pd.get_dummies(df_test, prefix_sep="__",
                                   columns=cat_columns)

In [11]:
df_test_processed

| | duration | city__Cambridge | city__Liverpool | city__Manchester | transport__bike | transport__car |
|---|---|---|---|---|---|---|
| 0 | 30 | 0 | 0 | 1 | 1 | 0 |
| 1 | 40 | 1 | 0 | 0 | 0 | 1 |
| 2 | 10 | 0 | 1 | 0 | 1 | 0 |


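As expected, the test dataframe contains new columns that were not present in the training data (`city__Manchester`, `transport__bike`) and is missing some that were (`city__London`, `transport__bus`).

We can clean this up with the following code, starting with the extra columns: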

In [13]:
# Iterate over a copy of the column list, since columns are dropped inside the loop
for col in list(df_test_processed.columns):
  if ("__" in col) and (col.split("__")[0] in cat_columns) and (col not in cat_dummies):
    print("Removing additional feature {}".format(col))
    df_test_processed.drop(col, axis=1, inplace=True)

Removing additional feature city__Manchester
Removing additional feature transport__bike


Now we need to add the missing columns and fill them with 0:

In [14]:
for col in cat_dummies:
  if col not in df_test_processed.columns:
    print("Adding missing feature {}".format(col))
    df_test_processed[col] = 0
    

Adding missing feature city__London
Adding missing feature transport__bus


Now both sets have the same features. Notice, however, that the column order is not the same in both sets, so we need to reorder the test dataframe:

In [0]:
df_test_processed = df_test_processed[processed_columns]
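The three steps above (dropping unseen dummy columns, adding missing ones as zeros, and reordering) can also be sketched as a single `DataFrame.reindex` call. The snippet rebuilds the data so that it runs on its own; the names `train`, `test` and `aligned` are introduced here for illustration, and `dtype=int` is an assumption made to get 0/1 values on any pandas version:

```python
import pandas as pd

cols = ["city", "transport", "duration"]
df = pd.DataFrame([["London", "car", 20],
                   ["Cambridge", "car", 10],
                   ["Liverpool", "bus", 30]], columns=cols)
df_test = pd.DataFrame([["Manchester", "bike", 30],
                        ["Cambridge", "car", 40],
                        ["Liverpool", "bike", 10]], columns=cols)
cat_columns = ["city", "transport"]

train = pd.get_dummies(df, prefix_sep="__", columns=cat_columns, dtype=int)
test = pd.get_dummies(df_test, prefix_sep="__", columns=cat_columns, dtype=int)

# reindex keeps only the training columns, in training order,
# and fills columns that are absent from the test set with 0.
aligned = test.reindex(columns=train.columns, fill_value=0)
print(aligned)
```

Note that this aligns every column, not only the dummy ones, so it assumes the non-categorical columns are the same in both sets.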

## **Using sklearn's one hot and label encoder**

Now we repeat the same process using sklearn:

In [0]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

First we create a dataframe which contains all the non-categorical features.

Then, for each categorical column, we fit a label encoder, transform the column and add it to the new dataframe:

In [17]:
# Start again from the non-categorical columns only
df_processed = df[[col for col in df.columns
                   if col not in cat_columns]].copy()

label_encoders = {}
for col in cat_columns:
  print("Encoding {}".format(col))
  new_le = LabelEncoder()
  df_processed[col] = new_le.fit_transform(df[col])
  label_encoders[col] = new_le  # keep the fitted encoder for the test data

Encoding city
Encoding transport


In [18]:
df_processed

| | duration | city | transport |
|---|---|---|---|
| 0 | 20 | 2 | 1 |
| 1 | 10 | 0 | 1 |
| 2 | 30 | 1 | 0 |


Now that all features are numerical, we need to one hot encode the categorical ones.

The `categorical_features` argument of `OneHotEncoder` takes column indices rather than column names, so we look the indices up first:

In [0]:
cat_column_idx = [df_processed.columns.get_loc(col)
                  for col in cat_columns]

In [0]:
ohe = OneHotEncoder(categorical_features=cat_column_idx,
                    sparse=False, handle_unknown="ignore")

The one hot encoder returns a numpy array, with the encoded categorical columns placed before the remaining ones.
It is difficult to recreate the dataframe with nice labels, but as most algorithms use numpy arrays we can stop at this step.

In [21]:
df_processed_np = ohe.fit_transform(df_processed)

In [22]:
df_processed_np

array([[ 0.,  0.,  1.,  0.,  1., 20.],
       [ 1.,  0.,  0.,  0.,  1., 10.],
       [ 0.,  1.,  0.,  1.,  0., 30.]])
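The `categorical_features` argument used above was deprecated in scikit-learn 0.20 and removed in 0.22, so that cell only runs on older releases. Below is a minimal sketch of a comparable step on newer releases, assuming `ColumnTransformer` is an acceptable replacement (a swapped-in technique, not the notebook's original method). `OneHotEncoder` accepts string columns directly there, so no label encoding is needed. The names `ct`, `train_np` and `test_np` are introduced here for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

cols = ["city", "transport", "duration"]
df = pd.DataFrame([["London", "car", 20],
                   ["Cambridge", "car", 10],
                   ["Liverpool", "bus", 30]], columns=cols)
df_test = pd.DataFrame([["Manchester", "bike", 30],
                        ["Cambridge", "car", 40],
                        ["Liverpool", "bike", 10]], columns=cols)
cat_columns = ["city", "transport"]

# One hot encode the categorical columns by name and pass the rest through.
# sparse_threshold=0 makes the transformer return a dense NumPy array.
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_columns)],
    remainder="passthrough",
    sparse_threshold=0,
)

train_np = ct.fit_transform(df)
# Categories unseen during fit (Manchester, bike) become all-zero blocks.
test_np = ct.transform(df_test)
print(train_np)
print(test_np)
```

Because the transformer is fitted once on the training data and then reused, the test array gets the same columns in the same order without any manual alignment.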

## **Processing the test data**

In [0]:
# .copy() gives an independent dataframe, so adding columns below does not
# trigger pandas' SettingWithCopyWarning
df_test_processed = df_test[[col for col in df_test.columns
                             if col not in cat_columns]].copy()

We reuse the fitted label encoders so that each category gets the same numerical value as in the training data:

In [25]:
for col in cat_columns:
  print("Encoding {}".format(col))
  # Map each known class to the integer assigned during training
  label_map = {val: label for label, val in enumerate(label_encoders[col].classes_)}
  print(label_map)
  df_test_processed[col] = df_test[col].map(label_map)
  # Categories unseen in training map to NaN; replace them with a sentinel value
  df_test_processed[col] = df_test_processed[col].fillna(9999).astype(int)

Encoding city
{'Cambridge': 0, 'Liverpool': 1, 'London': 2}
Encoding transport
{'bus': 0, 'car': 1}


In [0]:
df_test_processed_np = ohe.transform(df_test_processed)
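
The manual `label_map` step above is needed because `LabelEncoder.transform` raises an error on labels it did not see during fitting. A small sketch of that behaviour, using the city values from this notebook (the names `le` and `raised` are introduced here for illustration):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["London", "Cambridge", "Liverpool"])

try:
    le.transform(["Manchester"])  # not present in the training data
    raised = False
except ValueError as err:
    raised = True
    print("LabelEncoder rejected the unseen label:", err)
```

Mapping through a dictionary and filling the gaps with a sentinel avoids this error and leaves the decision about unknown categories to the one hot encoder.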