# Encoding Categorical Variables

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

**Categorical variables are variables whose values are selected from a group of categories or
labels.**
- For example, the variable Gender with the values of male or female is categorical, and so is the variable marital status with the values of never married, married, divorced, or widowed. 
- In some categorical variables, the labels have an **intrinsic order**, for example, in the variable Student's grade, the values of A, B, C, or Fail are ordered, A being the highest grade and Fail the lowest. These are called **ordinal categorical variables.** 
- Variables in which the categories **do not have an intrinsic order** are called **nominal categorical variables**, such as the variable City, with the values of London, Manchester, Bristol, and so on. 


The values of categorical variables are often encoded as strings. Most estimators in scikit-learn, the open
source Python library for machine learning, expect numeric input, so
we need to transform those strings into numbers. The act of replacing strings with numbers
is called **categorical encoding**.

## Category Encoders

A set of scikit-learn-style transformers that encode categorical variables as numbers using different techniques.

https://contrib.scikit-learn.org/category_encoders/

Install: `pip install category_encoders`

<img src="https://feature-engine.trainindata.com/en/latest/_images/categoricalSummary.png">

## Creating binary variables through one-hot encoding

In **one-hot encoding, we represent a categorical variable as a group of binary variables**,
where each binary variable represents one category. The binary variable indicates whether
the category is present in an observation (1) or not (0).

A categorical variable with k unique categories can be encoded in k-1 binary variables.

There are a **few occasions** in which we may prefer to encode the categorical variables with k binary variables:
- When **training decision trees**, as they do not evaluate the entire feature space at the same time
- When **selecting features recursively**
- When determining the importance of each category within a variable
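
The difference between k and k-1 binary variables can be seen on a small example. The sketch below uses pandas' `get_dummies` on a hypothetical toy column (`colour` is not part of the credit approval dataset). With k-1 variables, the dropped category is represented by a row of zeros.

```python
import pandas as pd

# Hypothetical toy data, not taken from the credit approval dataset.
df = pd.DataFrame({"colour": ["blue", "green", "red", "green", "blue"]})

# k binary variables: one column per category.
k_dummies = pd.get_dummies(df["colour"], dtype=int)

# k-1 binary variables: the first category in sorted order ("blue") is dropped.
k_minus_1_dummies = pd.get_dummies(df["colour"], drop_first=True, dtype=int)

print(k_dummies.columns.tolist())          # ['blue', 'green', 'red']
print(k_minus_1_dummies.columns.tolist())  # ['green', 'red']

# Rows where every remaining column is 0 correspond to the dropped category.
is_blue = (k_minus_1_dummies.sum(axis=1) == 0)
print(is_blue.tolist())
```

No information is lost with k-1 variables: an observation that is neither green nor red must be blue.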

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [None]:
data = pd.read_csv('data/creditApprovalUCI.csv')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(labels=['A16'], axis=1), data['A16'], test_size=0.3, random_state=0)

In [None]:
X_train['A4'].unique()

In [None]:
encoder = OneHotEncoder(categories='auto', drop='first', sparse_output=False)

In [None]:
vars_categorical = ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']
encoder.fit(X_train[vars_categorical])

> Scikit-learn's `OneHotEncoder` class will **only encode the categories
learned from the train set**. If there are new categories in the test set, we
can instruct the encoder to ignore them with the
`handle_unknown='ignore'` argument, or to raise an error with
the `handle_unknown='error'` argument (the default).
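
To illustrate the two settings, here is a minimal sketch on hypothetical toy data in which the category `Paris` appears only in the test set. It calls `.toarray()` on the default sparse output, so it does not depend on the name of the sparse-output argument.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "Paris" only appears in the test set.
train = np.array([["London"], ["Bristol"], ["London"]], dtype=object)
test = np.array([["Bristol"], ["Paris"]], dtype=object)

# With handle_unknown='ignore', an unseen category is encoded as all zeros.
enc_ignore = OneHotEncoder(handle_unknown="ignore")
enc_ignore.fit(train)
encoded = enc_ignore.transform(test).toarray()  # learned columns: Bristol, London
print(encoded)

# With handle_unknown='error', transform raises a ValueError instead.
enc_error = OneHotEncoder(handle_unknown="error")
enc_error.fit(train)
try:
    enc_error.transform(test)
    raised = False
except ValueError:
    raised = True
print(raised)
```

The encoder in this notebook uses `drop='first'`. When combined with `handle_unknown='ignore'`, an unseen category would likely be indistinguishable from the dropped one, since both are encoded as all zeros.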

In [None]:
X_train_enc = encoder.transform(X_train[vars_categorical])
X_test_enc = encoder.transform(X_test[vars_categorical])

In [None]:
X_train_enc

## Performing one-hot encoding of frequent categories

One-hot encoding represents each category of a categorical variable with a binary variable.
Hence, **one-hot encoding of high-cardinality variables or datasets with multiple categorical
features can expand the feature space dramatically**.

To reduce the number of binary variables, we can **perform one-hot encoding of the most frequent categories only**. One-hot
encoding of top categories is equivalent to treating the remaining, less frequent categories
as a single, unique category. 
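
The idea can be sketched in plain pandas before turning to a library. The helper `encode_top` and the toy series below are hypothetical, written for illustration only; they are not how feature-engine implements it. The top categories are learned from the train set, and every other category (including ones first seen in the test set) gets zeros in all binary columns.

```python
import pandas as pd

# Hypothetical toy data, split into train and test series.
train = pd.Series(["a", "a", "a", "b", "b", "c", "d"], name="var")
test = pd.Series(["a", "c", "e"], name="var")

# Learn the 2 most frequent categories from the train set only.
top = train.value_counts().head(2).index.tolist()

def encode_top(series, top_categories):
    """Return one binary column per top category; other categories get zeros."""
    return pd.DataFrame(
        {f"{series.name}_{cat}": (series == cat).astype(int) for cat in top_categories},
        index=series.index,
    )

train_enc = encode_top(train, top)
test_enc = encode_top(test, top)

print(top)
print(test_enc)
```

Here `c` and `e` in the test set are both encoded as rows of zeros, which is what treating the infrequent categories as a single group means in practice.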

In [None]:
# Note: this import shadows scikit-learn's OneHotEncoder imported earlier
from feature_engine.encoding import OneHotEncoder
from feature_engine.imputation import CategoricalImputer

In [None]:
data = pd.read_csv('data/creditApprovalUCI.csv')

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['A16'], axis=1), # predictors
    data['A16'], # target
    test_size=0.3, # fraction of observations in test set
    random_state=0) # seed to ensure reproducibility


imputer = CategoricalImputer()
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

In [None]:
X_train['A6'].unique()

In [None]:
X_train['A6'].value_counts().sort_values(ascending=False).head(5)

In [None]:
ohe_enc = OneHotEncoder(top_categories=5, variables=['A6', 'A7'], drop_last=False)

In [None]:
ohe_enc.fit(X_train)

In [None]:
X_train_enc = ohe_enc.transform(X_train)
X_test_enc = ohe_enc.transform(X_test)

In [None]:
X_train_enc.head()

In [None]:
ohe_enc.encoder_dict_

## Replacing categories with ordinal numbers

Ordinal encoding consists of **replacing the categories with integers from 1 to k** (or 0 to k-1,
depending on the implementation), where **k is the number of distinct categories of the
variable**. The numbers are **assigned arbitrarily**.

Ordinal encoding is **better suited for nonlinear
machine learning models**, which can navigate through the arbitrarily assigned integers
to try and find patterns that relate to the target.
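
A minimal sketch of the mapping, on hypothetical toy data, is shown below. It numbers categories in order of first appearance in the train set, which is one arbitrary choice among many; scikit-learn's `OrdinalEncoder` sorts the categories instead, so its numbers would differ.

```python
import pandas as pd

# Hypothetical toy data.
train = pd.Series(["London", "Bristol", "Manchester", "London"])
test = pd.Series(["Manchester", "Leeds"])

# Assign integers 0..k-1 in order of first appearance in the train set.
# The order is arbitrary and carries no meaning.
mapping = {cat: i for i, cat in enumerate(train.unique())}

train_enc = train.map(mapping)
# Categories unseen during fitting have no entry in the mapping and become NaN.
test_enc = test.map(mapping)

print(mapping)
print(train_enc.tolist())
print(test_enc.tolist())
```

Because the integers are arbitrary, a linear model would read `Manchester` (2) as "twice `Bristol`" (1), which is why this encoding tends to suit tree-based and other nonlinear models better.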

In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
data = pd.read_csv('data/creditApprovalUCI.csv')

X_train, X_test, y_train, y_test = train_test_split(data.drop(labels=['A16'], axis=1), data['A16'], test_size=0.3, random_state=0)

In [None]:
vars_categorical = ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']

le = OrdinalEncoder()

In [None]:
le.fit(X_train[vars_categorical])

In [None]:
X_train_enc = le.transform(X_train[vars_categorical])
X_test_enc = le.transform(X_test[vars_categorical])

## Replacing categories with counts or frequency of observations

In **count or frequency encoding**, we replace the categories with the **count or the proportion
of observations with that category**.

That is, if 10 out of 100 observations show the category
blue for the variable color, we would replace blue with 10 when doing count encoding, or
by 0.1 if performing frequency encoding. 

These techniques, which capture the representation of each label in a dataset, are **very popular in data science competitions**. 

The assumption is that the number of observations per category is somewhat predictive of the
target.

> Note that if two different categories are present in the same number or
proportion of observations, they will be replaced by the same value, which may lead to
information loss.
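
Both encodings, and the collision described in the note, can be reproduced with pandas alone. The sketch below uses a hypothetical toy series of 10 observations; it is an illustration of the idea, not feature-engine's implementation.

```python
import pandas as pd

# Hypothetical toy data: 10 observations of a colour variable.
train = pd.Series(["blue"] * 1 + ["red"] * 5 + ["green"] * 2 + ["black"] * 2)

# Learn counts and proportions per category from the train set.
counts = train.value_counts().to_dict()
freqs = train.value_counts(normalize=True).to_dict()

# Replace each category with its count or its proportion.
count_enc = train.map(counts)
freq_enc = train.map(freqs)

print(counts)
print(freqs)

# "green" and "black" both occur twice, so they receive the same encoded value.
print(counts["green"] == counts["black"])
```

After encoding, `green` and `black` can no longer be told apart, which is the information loss mentioned in the note.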

In [None]:
data = pd.read_csv('data/creditApprovalUCI.csv')

In [None]:
data.head(3)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(labels=['A16'], axis=1), data['A16'], test_size=0.3, random_state=0)

imputer = CategoricalImputer()
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

In [None]:
from feature_engine.encoding import CountFrequencyEncoder

count_enc = CountFrequencyEncoder(encoding_method='count', variables=None)

In [None]:
count_enc.fit(X_train)

In [None]:
count_enc.encoder_dict_

In [None]:
X_train_enc = count_enc.transform(X_train)
X_test_enc = count_enc.transform(X_test)

In [None]:
X_train_enc.head()