#### Tips and tricks for encoding categorical features in classification tasks

Features come in various flavors. Typically we distinguish them as 
* continuous 
* discrete (categorical) features

Discrete features can be further distinguished into 
* Nominal (no order implied)
* Ordinal

Most machine learning algorithms expect numerical inputs, so we need to preprocess 
the data accordingly. This notebook contains some helpful tips on how to encode discrete features
using pandas and scikit-learn.

First, let's create a simple example dataset with 3 different kinds of features:

* color: nominal
* size: ordinal
* price: continuous

In [56]:
import pandas as pd
df = pd.DataFrame([
            ["green", "M", 10.4, "class1"],
            ["red", "L", 13.5, "class2"],
            ["blue", "XXL", 15.40, "class2"]])

df.columns = ["color", "size", "price", "class label"]
df

|   | color | size | price | class label |
|---|---|---|---|---|
| 0 | green | M | 10.4 | class1 |
| 1 | red | L | 13.5 | class2 |
| 2 | blue | XXL | 15.4 | class2 |


#### Class Labels

Machine learning algorithms typically treat class labels as unordered, unless we use
a ranking classifier (e.g., SVM-rank). Thus, it is safe to simply enumerate the unique class labels to 
convert them from their string representation into integers.

In [57]:
# Sort the unique labels so that the label-to-integer assignment is reproducible
class_mapping = {label: idx for idx, label in enumerate(sorted(set(df["class label"])))}

df['class label'] = df["class label"].map(class_mapping)
df

|   | color | size | price | class label |
|---|---|---|---|---|
| 0 | green | M | 10.4 | 0 |
| 1 | red | L | 13.5 | 1 |
| 2 | blue | XXL | 15.4 | 1 |


In [58]:
class_mapping

{'class1': 0, 'class2': 1}
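
Enumerating a plain Python set of strings does not guarantee the same order between interpreter runs, so it helps to fix the order explicitly. The following is a minimal sketch of one deterministic option, assuming that alphabetical order of the labels is acceptable: `pd.factorize` with `sort=True`. The variable names `labels`, `codes`, and `uniques` are illustrative only.

```python
import pandas as pd

# Illustrative labels, matching the example dataset
labels = pd.Series(["class1", "class2", "class2"], name="class label")

# sort=True orders the unique labels before numbering them,
# so the same labels always receive the same integers.
codes, uniques = pd.factorize(labels, sort=True)

# Rebuild an explicit label -> integer dictionary from the sorted uniques
class_mapping = {str(label): idx for idx, label in enumerate(uniques)}

print(codes)
print(class_mapping)
```

The `codes` array can be assigned back to the `class label` column, and `class_mapping` can be kept for the inverse conversion.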

##### Ordinal Features

Ordinal features need special attention. We have to make sure that each string is associated with the correct value, so that the numerical order reflects the order of the categories. Thus, we need to set up an explicit mapping dictionary:

In [59]:
size_mapping = {
        "XXL": 4,
        "L"  : 2,
        "M"  : 1}

df["size"] = df["size"].map(size_mapping)
df

|   | color | size | price | class label |
|---|---|---|---|---|
| 0 | green | 1 | 10.4 | 0 |
| 1 | red | 2 | 13.5 | 1 |
| 2 | blue | 4 | 15.4 | 1 |


In [60]:
size_mapping

{'XXL': 4, 'L': 2, 'M': 1}
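
As an alternative to a hand-written dictionary, pandas can store the order itself in an ordered categorical. The sketch below assumes a full size scale of `M < L < XL < XXL` (the `size_order` list is an assumption, not part of the original data). Note that the resulting codes are zero-based positions, so they differ from the values in the dictionary above while preserving the same order.

```python
import pandas as pd

sizes = pd.Series(["M", "L", "XXL"], name="size")

# Assumed ordering of sizes, smallest first
size_order = ["M", "L", "XL", "XXL"]

# An ordered categorical remembers the ranking of its categories
size_cat = pd.Categorical(sizes, categories=size_order, ordered=True)

# .codes is the zero-based position in size_order; -1 marks values not in the list
size_codes = pd.Series(size_cat.codes, index=sizes.index, name="size")

print(size_codes.tolist())
```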

##### Nominal Features

Unfortunately, we cannot use the same trick we used for size for nominal features, because the colors have no inherent order. However, we can convert each color into a binary feature, with the presence of a color represented by 1.

In [61]:
color_mapping = {
        "green": (0, 0, 1),
        "red"  : (0, 1, 0),
        "blue" : (1, 0, 0)}

df["color"] = df["color"].map(color_mapping)
df

|   | color | size | price | class label |
|---|---|---|---|---|
| 0 | (0, 0, 1) | 1 | 10.4 | 0 |
| 1 | (0, 1, 0) | 2 | 13.5 | 1 |
| 2 | (1, 0, 0) | 4 | 15.4 | 1 |


In [62]:
color_mapping

{'green': (0, 0, 1), 'red': (0, 1, 0), 'blue': (1, 0, 0)}
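
Writing the binary tuples by hand becomes tedious with many categories. A minimal sketch of generating them, assuming the unique colors are sorted alphabetically, takes one row of an identity matrix per color. The bit positions therefore differ from the hand-written dictionary above, which is harmless as long as one mapping is used consistently.

```python
import numpy as np

colors = ["green", "red", "blue"]

# Fix the order of the categories so the mapping is reproducible
unique_colors = sorted(set(colors))
identity = np.eye(len(unique_colors), dtype=int)

# Row i of the identity matrix is the binary code of the i-th color
color_mapping = {
    color: tuple(int(v) for v in identity[i])
    for i, color in enumerate(unique_colors)
}

print(color_mapping)
```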

Using numpy

In [63]:
import numpy as np
y = df["class label"].values
X = df.iloc[:, :-1].values
X = np.apply_along_axis(func1d= lambda x: np.array(list(x[0]) + list(x[1:])), axis=1, arr=X)

print("Class labels:", y)
print("\nFeatures:\n", X)

Class labels: [0 1 1]

Features:
 [[ 0.   0.   1.   1.  10.4]
 [ 0.   1.   0.   2.  13.5]
 [ 1.   0.   0.   4.  15.4]]


#### Inverse mapping

If we want to convert the numerical features back into their original representation, we can simply do so by using inverted mapping dictionaries:

In [64]:
inv_color_mapping = {v:k for k,v in color_mapping.items()}
inv_size_mapping = {v:k for k,v in size_mapping.items()}
inv_class_mapping = {v:k for k,v in class_mapping.items()}

df['color'] = df["color"].map(inv_color_mapping)
df["class label"] = df["class label"].map(inv_class_mapping)
df["size"] = df["size"].map(inv_size_mapping)
df

|   | color | size | price | class label |
|---|---|---|---|---|
| 0 | green | M | 10.4 | class1 |
| 1 | red | L | 13.5 | class2 |
| 2 | blue | XXL | 15.4 | class2 |
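
Inverting a dictionary only works cleanly when every value is unique; otherwise later keys silently overwrite earlier ones. The helper below is a hypothetical convenience function (`invert_mapping` is not part of pandas or scikit-learn) that sketches how such a collision could be detected.

```python
def invert_mapping(mapping):
    """Return {value: key}, raising if two keys share the same value."""
    inverted = {}
    for key, value in mapping.items():
        if value in inverted:
            raise ValueError(
                f"value {value!r} is shared by {inverted[value]!r} and {key!r}"
            )
        inverted[value] = key
    return inverted


size_mapping = {"XXL": 4, "L": 2, "M": 1}
inv_size_mapping = invert_mapping(size_mapping)

print(inv_size_mapping)
```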


#### Using Scikit-learn and pandas to accomplish the same thing

The scikit-learn machine learning library comes with many useful preprocessing functions that we can use
for our convenience.

scikit-learn LabelEncoder

In [65]:
from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
df["class label"] = class_le.fit_transform(df["class label"])

size_mapping = {
        "XXL": 4,
        "L"  : 2,
        "M"  : 1}
df["size"] = df["size"].map(size_mapping)
df

|   | color | size | price | class label |
|---|---|---|---|---|
| 0 | green | 1 | 10.4 | 0 |
| 1 | red | 2 | 13.5 | 1 |
| 2 | blue | 4 | 15.4 | 1 |


Class labels can be converted back from integer to string via the inverse_transform method:

In [66]:
class_le.inverse_transform(df["class label"])

array(['class1', 'class2', 'class2'], dtype=object)
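
The fitted encoder also exposes the learned labels through its `classes_` attribute, which holds the sorted unique labels; the position of a label in that array is its integer code. A small sketch, using a plain list in place of the DataFrame column:

```python
from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
encoded = class_le.fit_transform(["class1", "class2", "class2"])

# classes_ holds the sorted unique labels; index position == integer code
class_mapping = {str(label): idx for idx, label in enumerate(class_le.classes_)}

print(encoded)
print(class_mapping)
```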

##### scikit DictVectorizer

The ```DictVectorizer``` is another handy tool for feature extraction. The ```DictVectorizer``` takes a list of dictionary entries (feature-value mappings) and transforms it to vectors. The expected input looks like this:

In [67]:
df.transpose().to_dict().values()

dict_values([{'color': 'green', 'size': 1, 'price': 10.4, 'class label': 0}, {'color': 'red', 'size': 2, 'price': 13.5, 'class label': 1}, {'color': 'blue', 'size': 4, 'price': 15.4, 'class label': 1}])

The dictionary keys in each row represent the feature columns and the label column.

Now we can use ```DictVectorizer``` to turn this mapping into a matrix:

In [68]:
from sklearn.feature_extraction import DictVectorizer
dvec = DictVectorizer(sparse=False)

X = dvec.fit_transform(df.transpose().to_dict().values())
X

array([[ 0. ,  0. ,  1. ,  0. , 10.4,  1. ],
       [ 1. ,  0. ,  0. ,  1. , 13.5,  2. ],
       [ 1. ,  1. ,  0. ,  0. , 15.4,  4. ]])

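We can see that the columns were reordered during the conversion. We can simply add the column names back via the ```get_feature_names``` method (scikit-learn 1.0 and later provide ```get_feature_names_out``` instead, and ```get_feature_names``` was removed in version 1.2).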

In [69]:
pd.DataFrame(X, columns=dvec.get_feature_names())

|   | class label | color=blue | color=green | color=red | price | size |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 10.4 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 | 1.0 | 13.5 | 2.0 |
| 2 | 1.0 | 1.0 | 0.0 | 0.0 | 15.4 | 4.0 |


In [70]:
dvec.get_feature_names()

['class label', 'color=blue', 'color=green', 'color=red', 'price', 'size']
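
Note that the matrix above also contains the class label as a column. If the result is meant to be a pure feature matrix, the label can be separated before vectorizing. The following sketch assumes a DataFrame with the same columns as in this notebook; the `hasattr` check is only there so the example runs on both older and newer scikit-learn releases.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame(
    [["green", 1, 10.4, 0], ["red", 2, 13.5, 1], ["blue", 4, 15.4, 1]],
    columns=["color", "size", "price", "class label"],
)

# Keep the target apart from the features
y = df["class label"].values
records = df.drop(columns=["class label"]).to_dict(orient="records")

dvec = DictVectorizer(sparse=False)
X = dvec.fit_transform(records)

# Newer scikit-learn offers get_feature_names_out; older releases only get_feature_names
if hasattr(dvec, "get_feature_names_out"):
    feature_names = [str(name) for name in dvec.get_feature_names_out()]
else:
    feature_names = list(dvec.get_feature_names())

features = pd.DataFrame(X, columns=feature_names)
print(features)
```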

##### scikit OneHotEncoder

Another useful tool in scikit-learn is the ```OneHotEncoder```. The idea is the same as in the ```DictVectorizer``` above; the only difference is that the ```OneHotEncoder``` takes integer columns as input (scikit-learn 0.20 and later also accept string columns directly). Here we first use the ```LabelEncoder``` to prepare the ```color``` column and then apply the ```OneHotEncoder```.

In [71]:
color_le = LabelEncoder()
df['color'] = color_le.fit_transform(df['color'])

df

|   | color | size | price | class label |
|---|---|---|---|---|
| 0 | 1 | 1 | 10.4 | 0 |
| 1 | 2 | 2 | 13.5 | 1 |
| 2 | 0 | 4 | 15.4 | 1 |


In [76]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)

X = ohe.fit_transform(df[["color"]].values)
X

array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
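
For comparison, here is a minimal sketch of one-hot encoding the string column directly, assuming scikit-learn 0.20 or later. It relies on the default sparse output and converts it with `.toarray()`, which sidesteps the dense-output keyword (renamed from `sparse` to `sparse_output` in scikit-learn 1.2).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors = pd.DataFrame({"color": ["green", "red", "blue"]})

# Default output is a SciPy sparse matrix; .toarray() makes it dense
ohe = OneHotEncoder()
X = ohe.fit_transform(colors[["color"]]).toarray()

# categories_ lists the learned categories per input column, in column order of X
print(ohe.categories_)
print(X)
```

The column order follows the sorted categories (`blue`, `green`, `red`), so the result matches the matrix obtained via the ```LabelEncoder``` above.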

#### pandas get_dummies

pandas also comes with a convenience function that creates new binary columns for nominal features, namely
```get_dummies```. But first, let us quickly regenerate a fresh example ```DataFrame``` where the size and class label columns are already taken care of.

In [79]:
import pandas as pd
df = pd.DataFrame([
    ["green", "M", 10.1, "class1"],
    ["red", "L", 13.5, "class2"],
    ["blue", "XL", 15.8, "class1"]])

df.columns = ["color", "size", "price", "class label"]

size_mapping = {
    "XL": 3,
    "L": 2,
    "M": 1}

df["size"] = df["size"].map(size_mapping)

# Sort the unique labels so that the label-to-integer assignment is reproducible
class_mapping = {label: idx for idx, label in enumerate(sorted(set(df["class label"])))}
df["class label"] = df["class label"].map(class_mapping)

df

|   | color | size | price | class label |
|---|---|---|---|---|
| 0 | green | 1 | 10.1 | 0 |
| 1 | red | 2 | 13.5 | 1 |
| 2 | blue | 3 | 15.8 | 0 |


Using ```get_dummies``` will create a new column for every unique string in each string-valued column:

In [80]:
pd.get_dummies(df)

|   | size | price | class label | color_blue | color_green | color_red |
|---|---|---|---|---|---|---|
| 0 | 1 | 10.1 | 0 | 0 | 1 | 0 |
| 1 | 2 | 13.5 | 1 | 0 | 0 | 1 |
| 2 | 3 | 15.8 | 0 | 1 | 0 | 0 |


The ```get_dummies``` function leaves the numerical features untouched.
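
When only some columns should be expanded, they can be named explicitly. The sketch below assumes a DataFrame shaped like the one above and passes `columns=["color"]` together with `dtype=int`; depending on the pandas version, the dummy columns may otherwise come back as booleans rather than 0/1 integers.

```python
import pandas as pd

df = pd.DataFrame(
    [["green", 1, 10.1, 0], ["red", 2, 13.5, 1], ["blue", 3, 15.8, 0]],
    columns=["color", "size", "price", "class label"],
)

# Expand only the nominal column and request integer 0/1 indicators
dummies = pd.get_dummies(df, columns=["color"], dtype=int)

print(dummies)
```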