# Encoding Categorical Data

This notebook shows two methods for encoding categorical data, for example converting a yes/no pandas series to its 0/1 equivalent.

### Getting Our DataFrame Ready

I will use a simple dataframe for demonstration. Note that I don't have a CSV file ready, so I'll create the data here.

## Libraries Import

In [85]:
import pandas as pd
from io import StringIO

In [86]:
data = '''
color,size,price,discount,commission,label
yellow,XL,10.5,2.0,3.1,yes
brown,L,8.1,1.6,1.2,no
blue,M,5.2,0.8,1.3,no
green,S,3.5,0.4,0.8,yes
pale,M,,2.0,3.1,no
brown,L,8.1,1.6,,no
blue,M,5.2,0.8,1.3,yes
white,XL,3.5,,0.8,yes
white,,8.5,2.0,3.1,yes
blue,S,,1.6,1.2,no
blue,XL,15.2,2.8,1.3,yes
pale,,6.8,,0.8,no
blue,,,2.0,3.1,yes
,L,8.1,1.6,1.2,yes
blue,M,5.2,0.8,1.3,no
,S,3.5,0.4,0.8,no
yellow,XL,10.5,2.0,3.1,no
brown,L,8.1,1.6,1.2,yes
blue,M,5.2,,1.3,yes
green,S,3.5,0.4,,no
'''

The data above is simply CSV text, not a CSV file. I will use the imported StringIO class to wrap it in a file-like object that pandas can read. These steps are not important to the lesson at hand. I am explaining them only so that you don't wonder whether StringIO is needed to encode categorical data as numerical data. It is not. If we already had a CSV file, none of these steps would be necessary.
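To make that point concrete, here is a small sketch showing that read_csv gives the same dataframe whether it reads from a StringIO object or from a CSV file on disk. The two-row sample text and the file name sample.csv are made up for this illustration.

```python
import os
import tempfile
from io import StringIO

import pandas as pd

# Hypothetical two-row sample, used only for this illustration
csv_text = "color,label\nyellow,yes\nbrown,no\n"

# Option 1: wrap the text in a file-like object
from_text = pd.read_csv(StringIO(csv_text))

# Option 2: write the same text to a real file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    with open(csv_path, "w") as f:
        f.write(csv_text)
    from_file = pd.read_csv(csv_path)

# Both routes should produce identical dataframes
print(from_text.equals(from_file))
```

So StringIO is only a convenience for this notebook; the encoding steps that follow do not depend on it.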

In [87]:
# Wrap the CSV text in a file-like object and read it. The result is a pandas dataframe
dataset = pd.read_csv(StringIO(data))
dataset2 = pd.read_csv(StringIO(data)) # I need two dataframes for the two examples
dataset.head()

Unnamed: 0,color,size,price,discount,commission,label
0,yellow,XL,10.5,2.0,3.1,yes
1,brown,L,8.1,1.6,1.2,no
2,blue,M,5.2,0.8,1.3,no
3,green,S,3.5,0.4,0.8,yes
4,pale,M,,2.0,3.1,no


In [88]:
# View the unique values in the label column. The label is our target in this case, with values yes/no
dataset['label'].value_counts()

yes    10
no     10
Name: label, dtype: int64

As you can see, we now have a pandas dataframe built from the CSV text. The dataset has 10 rows labelled yes and 10 rows labelled no.

# First Method - Mapping

This first method is probably the one you will reach for most often because it is a pure pandas method. We call it mapping.

Let me explain how it works:
>1. We create a dictionary object that converts values from the dataframe into what we want. In this case, the dictionary will convert yes to 1 and no to 0. Dictionaries are data structures that hold key/value pairs. In our case, yes and no are the keys, while 1 and 0 are the values.
>2. We map the dataframe series that we want to convert using the dictionary object.
>Let's see how that works.

In [89]:
# Create a dictionary to map the objects
label_map = {'yes':1, 'no':0}

In [90]:
# Map the series [the part of the dataframe that you want to encode] through the dictionary
dataset['label'] = dataset['label'].map(label_map)

# Print a part of the dataframe to see the result
dataset.head()

Unnamed: 0,color,size,price,discount,commission,label
0,yellow,XL,10.5,2.0,3.1,1
1,brown,L,8.1,1.6,1.2,0
2,blue,M,5.2,0.8,1.3,0
3,green,S,3.5,0.4,0.8,1
4,pale,M,,2.0,3.1,0


In [91]:
dataset['label'].value_counts()

1    10
0    10
Name: label, dtype: int64

BAM! It works just fine. Let us move on to the next example.
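One thing worth checking when you map with a dictionary: any value that is not a key in the dictionary becomes NaN. The sketch below uses a made-up series containing an extra value, maybe, which is not part of the dataset above, to show how to spot that.

```python
import pandas as pd

# Hypothetical series with one value ('maybe') that the dictionary does not cover
labels = pd.Series(['yes', 'no', 'maybe', 'yes'])
label_map = {'yes': 1, 'no': 0}

# Values missing from the dictionary are turned into NaN
encoded = labels.map(label_map)
print(encoded)

# Count the values that could not be mapped
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)  # 1
```

If the count is not zero, the dictionary is missing a category, or the column contains typos or empty cells.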

## Inverse Transformation

Now that you have successfully encoded your classes, what happens if you need to decode the label? In other words, how do you get the dataframe back to the way it was before? This is important, especially in machine learning: after you get a prediction, say 1, you need to know which label corresponds to 1.

What you do is reverse the process above: create an inverse dictionary and map it onto the dataframe again.
Let us copy the dataframe into a new object to demonstrate this process.

In [92]:
# Create an inverse dictionary from the original label_map dictionary above
# what we do here essentially is to reverse the label map.
inv_label_map = {v:k for k, v in label_map.items()} # Did you notice how we swapped v and k inside the dictionary comprehension?
inv_label_map

{1: 'yes', 0: 'no'}

Look at the result above. We have reversed the dictionary. Now, let's map it to the dataframe.
Note: I copied the dataframe into a new object to preserve the original dataset.

In [93]:
# deep=True (the default) gives an independent copy. Plain assignment (dataset_new = dataset) would only create a second name for the same dataframe, so changes would show up in both
dataset_new = dataset.copy(deep=True)
dataset_new.head()

Unnamed: 0,color,size,price,discount,commission,label
0,yellow,XL,10.5,2.0,3.1,1
1,brown,L,8.1,1.6,1.2,0
2,blue,M,5.2,0.8,1.3,0
3,green,S,3.5,0.4,0.8,1
4,pale,M,,2.0,3.1,0


In [94]:
# Now, let us transform back to the original
dataset_new['label'] = dataset_new['label'].map(inv_label_map)
dataset_new.head()

Unnamed: 0,color,size,price,discount,commission,label
0,yellow,XL,10.5,2.0,3.1,yes
1,brown,L,8.1,1.6,1.2,no
2,blue,M,5.2,0.8,1.3,no
3,green,S,3.5,0.4,0.8,yes
4,pale,M,,2.0,3.1,no
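The same inverse dictionary can decode model output. Here is a minimal sketch, assuming a made-up series of predictions (no model is trained in this notebook):

```python
import pandas as pd

label_map = {'yes': 1, 'no': 0}
inv_label_map = {v: k for k, v in label_map.items()}

# Hypothetical model output, made up for this illustration
predictions = pd.Series([1, 0, 0, 1])

# Decode the numbers back to the original labels
decoded = predictions.map(inv_label_map)
print(decoded.tolist())  # ['yes', 'no', 'no', 'yes']

# Inverting only works when the mapping is one-to-one:
# two keys sharing a value would collapse into one entry
assert len(inv_label_map) == len(label_map)
```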


# Second Method - LabelEncoding

LabelEncoder is a class from scikit-learn that helps us encode categorical data very easily. Take note that it is meant only for encoding the labels or targets of your dataset. If you want to encode nominal or ordinal series in your dataframe, you are better off using OneHotEncoder or OrdinalEncoder. As a data analyst, you are probably better off with the first method. This second method is typically used when building machine learning models. We touch on it nonetheless.
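For the feature columns, a rough sketch of the two encoders mentioned above could look like this. The small dataframe is made up for the illustration and has no missing values, and the order S < M < L < XL is my assumption about how the sizes should be ranked.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Small frame without missing values, made up for this illustration
features = pd.DataFrame({
    'color': ['yellow', 'brown', 'blue', 'green'],
    'size': ['XL', 'L', 'M', 'S'],
})

# Ordinal feature: we state the order explicitly (assumed: S < M < L < XL)
size_encoder = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])
size_codes = size_encoder.fit_transform(features[['size']])
print(size_codes.ravel())  # [3. 2. 1. 0.]

# Nominal feature: one 0/1 column per color, no order implied
color_encoder = OneHotEncoder()
color_matrix = color_encoder.fit_transform(features[['color']]).toarray()
print(color_encoder.categories_[0])  # column order of the matrix below
print(color_matrix)
```

Both encoders expect a two-dimensional input, which is why the columns are selected with double brackets.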

In [95]:
# Check the dataset
dataset2.head()

Unnamed: 0,color,size,price,discount,commission,label
0,yellow,XL,10.5,2.0,3.1,yes
1,brown,L,8.1,1.6,1.2,no
2,blue,M,5.2,0.8,1.3,no
3,green,S,3.5,0.4,0.8,yes
4,pale,M,,2.0,3.1,no


In [96]:
dataset2['label'].value_counts()

yes    10
no     10
Name: label, dtype: int64

In [97]:
# Import label encoder class from scikit-learn
from sklearn.preprocessing import LabelEncoder

# Create an object of LabelEncoder
le = LabelEncoder()

# Use the object to transform the column by calling the fit_transform method
dataset2['label'] = le.fit_transform(dataset2['label'])
dataset2.head()

Unnamed: 0,color,size,price,discount,commission,label
0,yellow,XL,10.5,2.0,3.1,1
1,brown,L,8.1,1.6,1.2,0
2,blue,M,5.2,0.8,1.3,0
3,green,S,3.5,0.4,0.8,1
4,pale,M,,2.0,3.1,0


In [98]:
dataset2['label'].value_counts()

1    10
0    10
Name: label, dtype: int64

BAM! This works fine too.

Note: The LabelEncoder class implicitly determined how to encode the data for you. You had no say in how it does it; you simply use the result. To see what encoding took place, look at the classes_ attribute of the LabelEncoder object le. The position of each label in that array is the code it was given.

In [99]:
le.classes_

array(['no', 'yes'], dtype=object)
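To read the array above as a mapping, you can pair each class with its position. The sketch below fits a fresh encoder on a short made-up list so that it runs on its own. It also shows what happens with a label the encoder has never seen (maybe is a made-up value).

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['yes', 'no', 'no', 'yes'])

# classes_ is sorted; each label's code is its position in that array
code_for_label = {str(label): code for code, label in enumerate(le.classes_)}
print(code_for_label)  # {'no': 0, 'yes': 1}

# transform agrees with the positions above
print(le.transform(['yes', 'no']).tolist())  # [1, 0]

# A label that was not seen during fit raises a ValueError
try:
    le.transform(['maybe'])
except ValueError as err:
    print('unseen label:', err)
```

This happens to match the dictionary we wrote by hand in the first method, but only because no sorts before yes alphabetically.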

## Inverse Transformation with LabelEncoders
To carry out inverse transformations with LabelEncoder, we simply call the inverse_transform method.

In [100]:
dataset2['label'] = le.inverse_transform(dataset2['label'])
dataset2.head()

Unnamed: 0,color,size,price,discount,commission,label
0,yellow,XL,10.5,2.0,3.1,yes
1,brown,L,8.1,1.6,1.2,no
2,blue,M,5.2,0.8,1.3,no
3,green,S,3.5,0.4,0.8,yes
4,pale,M,,2.0,3.1,no


# THE END