# Encoding Categorical Variables

This notebook examines a few methods of encoding categorical variables as numbers.

We will evaluate the following options:

1. pandas (`pd.factorize`, `pd.get_dummies`)
2. scikit-learn encoders
3. Category Encoders

### References

- https://stackabuse.com/one-hot-encoding-in-python-with-pandas-and-scikit-learn/

## Data

Let's create the following data.

In [1]:
import pandas as pd
pd.options.display.float_format = '{:.2f}'.format # to make legible


df = pd.DataFrame({"age" : [65, 32, 24, 55, 45, 30, 35 ],
                   "gender" : ['Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
                   "status":['married', 'single', 'single', 'divorced', 'married' ,'single', 'married' ]
                   })
df

Unnamed: 0,age,gender,status
0,65,Male,married
1,32,Male,single
2,24,Female,single
3,55,Male,divorced
4,45,Male,married
5,30,Female,single
6,35,Female,married


## Option 1 : Pandas

### 1A - Pandas: Indexing
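`pd.factorize` converts each distinct category into an integer index, numbered in order of first appearance, and returns the lookup table of unique values alongside the indexes.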


In [2]:
## factorize

df2 = df.copy() 

indexes , lookup_table = pd.factorize(df2['status'])

print ("indexes : ", indexes)
print ("lookup_table : ", lookup_table)

df2['status_index'] = indexes
df2

indexes :  [0 1 1 2 0 1 0]
lookup_table :  Index(['married', 'single', 'divorced'], dtype='object')


Unnamed: 0,age,gender,status,status_index
0,65,Male,married,0
1,32,Male,single,1
2,24,Female,single,1
3,55,Male,divorced,2
4,45,Male,married,0
5,30,Female,single,1
6,35,Female,married,0


In [3]:
# do reverse lookup
df2['reverse_status_lookup'] = lookup_table[df2['status_index']]
df2

Unnamed: 0,age,gender,status,status_index,reverse_status_lookup
0,65,Male,married,0,married
1,32,Male,single,1,single
2,24,Female,single,1,single
3,55,Male,divorced,2,divorced
4,45,Male,married,0,married
5,30,Female,single,1,single
6,35,Female,married,0,married
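
Because `pd.factorize` numbers categories in order of first appearance, the codes depend on row order. As a minimal sketch (the `status` series below, including its missing entry, is made up for illustration), passing `sort=True` numbers the categories in sorted order instead, and missing values receive the sentinel code `-1`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value
status = pd.Series(['married', 'single', np.nan, 'divorced'])

# sort=True assigns codes by sorted category order rather than first appearance
codes, uniques = pd.factorize(status, sort=True)

print("codes   :", codes)     # the missing value is coded as -1
print("uniques :", uniques)   # lookup table, in sorted order
```

A code of `-1` has no entry in the lookup table, so filter it out before doing a reverse lookup.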


### 1B - Pandas: One Hot Encoding

In [5]:
df.dtypes

age        int64
gender    object
status    object
dtype: object

In [4]:
# encode the entire dataframe

print (df)
encoded1 = pd.get_dummies(df)
encoded1

   age  gender    status
0   65    Male   married
1   32    Male    single
2   24  Female    single
3   55    Male  divorced
4   45    Male   married
5   30  Female    single
6   35  Female   married


Unnamed: 0,age,gender_Female,gender_Male,status_divorced,status_married,status_single
0,65,0,1,0,1,0
1,32,0,1,0,0,1
2,24,1,0,0,0,1
3,55,0,1,1,0,0
4,45,0,1,0,1,0
5,30,1,0,0,0,1
6,35,1,0,0,1,0


In [None]:
# Check the types of encoded df
# Question: Why does it say 'uint8'? (pandas 2.0 and later report 'bool' instead.)

encoded1.dtypes
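
The dtype of the indicator columns depends on the pandas version. The sketch below assumes you want integer 0/1 columns regardless of version. The `dtype` argument pins the output type, and `drop_first=True` drops one indicator per encoded column, which avoids perfectly collinear features in linear models. The name `encoded_int` is ours.

```python
import pandas as pd

df = pd.DataFrame({"age": [65, 32, 24, 55, 45, 30, 35],
                   "gender": ['Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
                   "status": ['married', 'single', 'single', 'divorced', 'married', 'single', 'married']})

# dtype=int forces 0/1 integers; drop_first=True omits the first category ('divorced')
encoded_int = pd.get_dummies(df, columns=['status'], dtype=int, drop_first=True)

print(encoded_int.dtypes)
print(encoded_int)
```

A row with zeros in both remaining `status_*` columns then stands for the dropped category.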

In [6]:
## We can also encode specific columns
## encode 'status' column only

print (df)

encoded2 = pd.get_dummies(df, columns=['status'])
encoded2

   age  gender    status
0   65    Male   married
1   32    Male    single
2   24  Female    single
3   55    Male  divorced
4   45    Male   married
5   30  Female    single
6   35  Female   married


Unnamed: 0,age,gender,status_divorced,status_married,status_single
0,65,Male,0,1,0
1,32,Male,0,0,1
2,24,Female,0,0,1
3,55,Male,1,0,0
4,45,Male,0,1,0
5,30,Female,0,0,1
6,35,Female,0,1,0


In [7]:
## Another approach: encode the column on its own

df2 = df.copy()
encoded = pd.get_dummies(df2['status'], prefix='status')
encoded

Unnamed: 0,status_divorced,status_married,status_single
0,0,1,0
1,0,0,1
2,0,0,1
3,1,0,0
4,0,1,0
5,0,0,1
6,0,1,0


In [8]:
## merge with original data 
df3 = df2.merge(encoded, how='outer', left_index=True, right_index=True)
df3

Unnamed: 0,age,gender,status,status_divorced,status_married,status_single
0,65,Male,married,0,1,0
1,32,Male,single,0,0,1
2,24,Female,single,0,0,1
3,55,Male,divorced,1,0,0
4,45,Male,married,0,1,0
5,30,Female,single,0,0,1
6,35,Female,married,0,1,0
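
Since the encoded columns share the index of the original DataFrame, the merge above can also be written as an index-aligned join. A sketch of that alternative (the name `df_joined` is ours):

```python
import pandas as pd

df = pd.DataFrame({"age": [65, 32, 24, 55, 45, 30, 35],
                   "gender": ['Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
                   "status": ['married', 'single', 'single', 'divorced', 'married', 'single', 'married']})

encoded = pd.get_dummies(df['status'], prefix='status')

# DataFrame.join aligns on the index by default, so no key arguments are needed
df_joined = df.join(encoded)
print(df_joined.columns.tolist())
```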


## Option 2 : SciKit Learn

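### 2A - SciKit Label Encoder

`LabelEncoder` is handy for indexing variables. It numbers the categories in sorted order. scikit-learn documents it as an encoder for target labels; `OrdinalEncoder` is the counterpart meant for input features.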

In [9]:
from sklearn.preprocessing import LabelEncoder

print (df)
df2 = df.copy() 

label_encoder = LabelEncoder()
df2['status_encoded'] = label_encoder.fit_transform(df2['status'])
df2

Unnamed: 0,age,gender,status,status_encoded
0,65,Male,married,1
1,32,Male,single,2
2,24,Female,single,2
3,55,Male,divorced,0
4,45,Male,married,1
5,30,Female,single,2
6,35,Female,married,1
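
The fitted encoder keeps its lookup table in `classes_`, and `inverse_transform` maps codes back to the original labels. A short sketch using the same `status` values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

status = pd.Series(['married', 'single', 'single', 'divorced', 'married', 'single', 'married'])

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(status)

# classes_ holds the categories in sorted order; each position is that category's code
print("classes :", label_encoder.classes_)
print("codes   :", codes)

# inverse_transform performs the reverse lookup
decoded = label_encoder.inverse_transform(codes)
print("decoded :", decoded)
```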


### 2B - SciKit One Hot Encoder

In [10]:
from sklearn.preprocessing import OneHotEncoder

print (df)
df2 = df.copy() 

encoder =  OneHotEncoder()

one_hot = encoder.fit_transform(df2[['status']])  # only encoding status col

print ('encoder.categories_ : ', encoder.categories_)
print ('one hot encodings : \n', one_hot.toarray())

encoder.categories_ :  [array(['divorced', 'married', 'single'], dtype=object)]
one hot encodings : 
 [[0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]


In [11]:
# integrate the one-hot encodings into dataframe

one_hot_dense = one_hot.toarray()  # convert the sparse matrix to a dense array once
for i, cat in enumerate(encoder.categories_[0]):
    print (i, cat)
    df2["status_" + cat] = one_hot_dense[:, i]

df2

0 divorced
1 married
2 single


Unnamed: 0,age,gender,status,status_divorced,status_married,status_single
0,65,Male,married,0.0,1.0,0.0
1,32,Male,single,0.0,0.0,1.0
2,24,Female,single,0.0,0.0,1.0
3,55,Male,divorced,1.0,0.0,0.0
4,45,Male,married,0.0,1.0,0.0
5,30,Female,single,0.0,0.0,1.0
6,35,Female,married,0.0,1.0,0.0


In [None]:
## Inspect types
df2.dtypes

In [None]:
# extract X for input
X = df2[['age', 'status_divorced', 'status_married', 'status_single']]
X
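
The encode-then-select steps above can also be combined in one object. As a sketch, and not the approach this notebook follows, scikit-learn's `ColumnTransformer` one-hot encodes the chosen columns and passes the rest through unchanged. The names `transformer` and `X_all` are ours.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"age": [65, 32, 24, 55, 45, 30, 35],
                   "gender": ['Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
                   "status": ['married', 'single', 'single', 'divorced', 'married', 'single', 'married']})

transformer = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['gender', 'status'])],
    remainder='passthrough',  # keep 'age' as-is, appended after the encoded columns
    sparse_threshold=0,       # always return a dense array
)

X_all = transformer.fit_transform(df)
print(X_all.shape)
print(X_all)
```

The encoded columns come first (two for `gender`, three for `status`), followed by `age`.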

## Option 3 : Category Encoders

[Category Encoders](https://contrib.scikit-learn.org/category_encoders/) is a package of easy-to-use encoders that follow the scikit-learn API.

You can install it as follows (choose one based on your system).

In [None]:
## If using pip
# ! pip3 install category_encoders

In [None]:
## If using anaconda
#! conda install -c conda-forge category_encoders

### 3A - CE Indexing / Ordinal

In [None]:
import category_encoders as ce

print (df)
df2 = df.copy() 

encoder = ce.OrdinalEncoder(cols = ['status'])
encoded = encoder.fit_transform(df2)
encoded['status_orig'] = df2['status']  # just for information purposes
encoded

In [None]:
# Inspect the types
encoded.dtypes

### 3B - CE One Hot Encoding

In [None]:
import category_encoders as ce

print (df)
df2 = df.copy() 

encoder = ce.OneHotEncoder()
encoded = encoder.fit_transform(df2)
encoded

In [None]:
encoded.dtypes

In [None]:
## or choose the columns we want

import category_encoders as ce

df2 = df.copy() 

encoder = ce.OneHotEncoder(cols=['status'])
encoded = encoder.fit_transform(df2)
encoded

In [None]:
encoded.dtypes