# Encoding Categorical Variables

In [3]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

**Categorical variables take their values from a fixed group of categories or
labels.**
- For example, the variable Gender, with the values male or female, is categorical, and so is the variable Marital status, with the values never married, married, divorced, or widowed.
- In some categorical variables, the labels have an **intrinsic order**, for example, in the variable Student's grade, the values of A, B, C, or Fail are ordered, A being the highest grade and Fail the lowest. These are called **ordinal categorical variables.** 
- Variables in which the categories **do not have an intrinsic order** are called **nominal categorical variables**, such as the variable City, with the values of London, Manchester, Bristol, and so on. 
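
The distinction between ordinal and nominal variables can be sketched with pandas categoricals. The toy values below are made up for illustration and are not part of the credit approval dataset used later.

```python
import pandas as pd

# Ordinal variable: the categories carry an explicit order (Fail < C < B < A).
grades = pd.Series(
    pd.Categorical(
        ["B", "A", "Fail", "C", "A"],
        categories=["Fail", "C", "B", "A"],
        ordered=True,
    )
)

# Nominal variable: the categories have no intrinsic order.
cities = pd.Series(pd.Categorical(["London", "Bristol", "Manchester", "London"]))

print(grades.cat.ordered)  # True
print(cities.cat.ordered)  # False

# Ordered categoricals support min/max and comparisons against a category.
print(grades.min(), grades.max())  # Fail A
print((grades > "C").tolist())     # [True, True, False, False, True]
```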


The values of categorical variables are often encoded as strings. Most estimators in
scikit-learn, the open source Python library for machine learning, require numeric input
and cannot work with strings directly; therefore, we need to transform those strings into
numbers. The act of replacing strings with numbers is called **categorical encoding**.

## Category Encoders

Category Encoders is a set of scikit-learn-style transformers that encode categorical variables as numeric values using different techniques.

https://contrib.scikit-learn.org/category_encoders/

Install: `pip install category_encoders`

<img src="https://feature-engine.trainindata.com/en/latest/_images/categoricalSummary.png">

## Creating binary variables through one-hot encoding

In **one-hot encoding, we represent a categorical variable as a group of binary variables**,
where each binary variable represents one category. The binary variable indicates whether
the category is present in an observation (1) or not (0).

A categorical variable with k unique categories can be encoded in k-1 binary variables, because the remaining category is implied when all k-1 variables are 0.

There are a **few occasions** in which we may prefer to encode the categorical variables with k binary variables:
- When **training decision trees**, as they do not evaluate the entire feature space at the same time
- When **selecting features recursively**
- When determining the importance of each category within a variable
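
As a minimal sketch of the difference between k and k-1 binary variables, the snippet below uses `pd.get_dummies` on a made-up `color` column (the data is hypothetical, for illustration only).

```python
import pandas as pd

colors = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# k binary variables: one column per category.
k_dummies = pd.get_dummies(colors["color"], dtype=int)

# k-1 binary variables: the first category (alphabetically, "blue") is dropped
# and is represented implicitly by a row of zeros.
k_minus_1_dummies = pd.get_dummies(colors["color"], drop_first=True, dtype=int)

print(list(k_dummies.columns))             # ['blue', 'green', 'red']
print(list(k_minus_1_dummies.columns))     # ['green', 'red']
print(k_minus_1_dummies.iloc[1].tolist())  # [0, 0] -> the dropped category "blue"
```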

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [5]:
data = pd.read_csv('data/creditApprovalUCI.csv')

In [6]:
data

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,b,30.83,0.000,u,g,w,v,1.25,t,t,1,f,g,202.0,0,1
1,a,58.67,4.460,u,g,q,h,3.04,t,t,6,f,g,43.0,560,1
2,a,24.50,,u,g,q,h,,,,0,f,g,280.0,824,1
3,b,27.83,1.540,u,g,w,v,3.75,t,t,5,t,g,100.0,3,1
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,260.0,0,0
686,a,22.67,0.750,u,g,c,v,2.00,f,t,2,t,g,200.0,394,0
687,a,25.25,13.500,y,p,ff,ff,2.00,f,t,1,t,g,200.0,1,0
688,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,280.0,750,0


In [7]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(labels=['A16'], axis=1), data['A16'], test_size=0.3, random_state=0)

In [8]:
X_train['A4'].unique()

array(['u', 'y', nan, 'l'], dtype=object)

In [17]:
encoder = OneHotEncoder(categories='auto', drop='first', sparse_output=False)

In [18]:
vars_categorical = ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']
encoder.fit(X_train[vars_categorical])

> Scikit-learn's `OneHotEncoder` transformer will **only encode the categories
learned from the train set**. If there are new categories in the test set, we
can instruct the encoder to ignore them with the `handle_unknown='ignore'`
argument, or to raise an error with the `handle_unknown='error'` argument.
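
The two `handle_unknown` behaviours can be illustrated on a small made-up example. The `city` data below is hypothetical, and the encoder is left with its default sparse output, so the result is converted with `.toarray()`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"city": ["London", "Bristol", "London"]})
test = pd.DataFrame({"city": ["Bristol", "Manchester"]})  # "Manchester" is unseen

# handle_unknown='ignore': an unseen category becomes a row of zeros.
ignore_enc = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = ignore_enc.transform(test).toarray()
print(ignore_enc.categories_[0].tolist())  # ['Bristol', 'London']
print(encoded.tolist())                    # [[1.0, 0.0], [0.0, 0.0]]

# handle_unknown='error': an unseen category raises a ValueError at transform time.
error_enc = OneHotEncoder(handle_unknown="error").fit(train)
try:
    error_enc.transform(test)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```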

In [19]:
X_train_enc = encoder.transform(X_train[vars_categorical])
X_test_enc = encoder.transform(X_test[vars_categorical])

In [20]:
X_train_enc

array([[0., 0., 1., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 1., 0., 0.],
       [1., 0., 1., ..., 1., 0., 1.]])

### get_dummies()

In [21]:
X_train

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15
596,a,46.08,3.000,u,g,c,v,2.375,t,t,8,t,g,396.0,4159
303,a,15.92,2.875,u,g,q,v,0.085,f,f,0,f,g,120.0,0
204,b,36.33,2.125,y,p,w,v,0.085,t,t,1,f,g,50.0,1187
351,b,22.17,0.585,y,p,ff,ff,0.000,f,f,0,f,g,100.0,0
118,b,57.83,7.040,u,g,m,v,14.000,t,t,6,t,g,360.0,1332
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
359,a,36.75,4.710,u,g,ff,ff,0.000,f,f,0,f,g,160.0,0
192,b,41.75,0.960,u,g,x,v,2.500,t,f,0,f,g,510.0,600
629,a,19.58,0.665,u,g,w,v,1.665,f,f,0,f,g,220.0,5
559,a,22.83,2.290,u,g,q,h,2.290,t,t,7,t,g,140.0,2384


In [23]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 483 entries, 596 to 684
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A1      479 non-null    object 
 1   A2      472 non-null    float64
 2   A3      415 non-null    float64
 3   A4      479 non-null    object 
 4   A5      479 non-null    object 
 5   A6      479 non-null    object 
 6   A7      479 non-null    object 
 7   A8      415 non-null    float64
 8   A9      415 non-null    object 
 9   A10     415 non-null    object 
 10  A11     483 non-null    int64  
 11  A12     483 non-null    object 
 12  A13     483 non-null    object 
 13  A14     476 non-null    float64
 14  A15     483 non-null    int64  
dtypes: float64(4), int64(2), object(9)
memory usage: 60.4+ KB


In [25]:
pd.get_dummies(X_train)

Unnamed: 0,A2,A3,A8,A11,A14,A15,A1_a,A1_b,A4_l,A4_u,...,A7_z,A9_f,A9_t,A10_f,A10_t,A12_f,A12_t,A13_g,A13_p,A13_s
596,46.08,3.000,2.375,8,396.0,4159,True,False,False,True,...,False,False,True,False,True,False,True,True,False,False
303,15.92,2.875,0.085,0,120.0,0,True,False,False,True,...,False,True,False,True,False,True,False,True,False,False
204,36.33,2.125,0.085,1,50.0,1187,False,True,False,False,...,False,False,True,False,True,True,False,True,False,False
351,22.17,0.585,0.000,0,100.0,0,False,True,False,False,...,False,True,False,True,False,True,False,True,False,False
118,57.83,7.040,14.000,6,360.0,1332,False,True,False,True,...,False,False,True,False,True,False,True,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
359,36.75,4.710,0.000,0,160.0,0,True,False,False,True,...,False,True,False,True,False,True,False,True,False,False
192,41.75,0.960,2.500,0,510.0,600,False,True,False,True,...,False,False,True,True,False,True,False,True,False,False
629,19.58,0.665,1.665,0,220.0,5,True,False,False,True,...,False,True,False,True,False,True,False,True,False,False
559,22.83,2.290,2.290,7,140.0,2384,True,False,False,True,...,False,False,True,False,True,False,True,True,False,False


## Performing one-hot encoding of frequent categories

One-hot encoding represents each category of a categorical variable with a binary variable.
Hence, **one-hot encoding of high-cardinality variables, or of datasets with many categorical
features, can expand the feature space dramatically**.

To reduce the number of binary variables, we can **perform one-hot encoding of the most frequent categories only**. One-hot
encoding of top categories is equivalent to treating the remaining, less frequent categories
as a single, unique category. 
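
The idea can be sketched by hand with pandas before turning to a library. The helper `encode_top` and the toy data below are hypothetical and only illustrate the technique; they are not how Feature-engine implements it.

```python
import pandas as pd

train = pd.DataFrame({"A6": ["c", "c", "c", "q", "q", "w", "i", "ff"]})
test = pd.DataFrame({"A6": ["c", "w", "zz"]})

# Learn the most frequent categories from the train set only.
top_n = 2
top_categories = train["A6"].value_counts().head(top_n).index.tolist()


def encode_top(df, variable, categories):
    """Add one binary column per top category and drop the original variable."""
    out = df.copy()
    for category in categories:
        out[f"{variable}_{category}"] = (out[variable] == category).astype(int)
    return out.drop(columns=[variable])


train_enc = encode_top(train, "A6", top_categories)
test_enc = encode_top(test, "A6", top_categories)

print(top_categories)            # ['c', 'q']
print(list(test_enc.columns))    # ['A6_c', 'A6_q']
print(test_enc.values.tolist())  # [[1, 0], [0, 0], [0, 0]]
```

Infrequent categories (`w`) and categories never seen in the train set (`zz`) both end up as a row of zeros, which is what grouping them into a single category amounts to.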

In [27]:
# Note: this import shadows scikit-learn's OneHotEncoder imported earlier
from feature_engine.encoding import OneHotEncoder
from feature_engine.imputation import CategoricalImputer

In [39]:
data = pd.read_csv('data/creditApprovalUCI.csv')

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['A16'], axis=1), # predictors
    data['A16'], # target
    test_size=0.3, # percentage of observations in test set
    random_state=0) # seed to ensure reproducibility


imputer = CategoricalImputer()
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)



In [46]:
X_train['A6'].unique()

array(['c', 'q', 'w', 'ff', 'm', 'i', 'e', 'cc', 'x', 'd', 'k', 'j',
       'Missing', 'aa', 'r'], dtype=object)

In [47]:
X_train['A6'].value_counts().sort_values(ascending=False).head(5)
# X_train['A6'].value_counts().sort_values(ascending=False)

A6
c     93
q     56
w     48
i     41
ff    38
Name: count, dtype: int64

In [48]:
ohe_enc = OneHotEncoder(top_categories=5, variables=['A6', 'A7'], drop_last=False)

In [49]:
ohe_enc.fit(X_train)

In [50]:
X_train_enc = ohe_enc.transform(X_train)
X_test_enc = ohe_enc.transform(X_test)

In [51]:
X_train_enc.head()

Unnamed: 0,A1,A2,A3,A4,A5,A8,A9,A10,A11,A12,...,A6_c,A6_q,A6_w,A6_i,A6_ff,A7_v,A7_h,A7_ff,A7_bb,A7_z
596,a,46.08,3.0,u,g,2.375,t,t,8,t,...,1,0,0,0,0,1,0,0,0,0
303,a,15.92,2.875,u,g,0.085,f,f,0,f,...,0,1,0,0,0,1,0,0,0,0
204,b,36.33,2.125,y,p,0.085,t,t,1,f,...,0,0,1,0,0,1,0,0,0,0
351,b,22.17,0.585,y,p,0.0,f,f,0,f,...,0,0,0,0,1,0,0,1,0,0
118,b,57.83,7.04,u,g,14.0,t,t,6,t,...,0,0,0,0,0,1,0,0,0,0


In [52]:
ohe_enc.encoder_dict_

{'A6': ['c', 'q', 'w', 'i', 'ff'], 'A7': ['v', 'h', 'ff', 'bb', 'z']}

## Replacing categories with ordinal numbers

Ordinal encoding consists of **replacing the categories with integers from 1 to k** (or 0 to k-1,
depending on the implementation), where **k is the number of distinct categories of the
variable**. The numbers are **assigned arbitrarily**, so they do not reflect any real order among the categories.

Ordinal encoding is **better suited for nonlinear
machine learning models**, which can navigate through the arbitrarily assigned integers
to try and find patterns that relate to the target.
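
A minimal sketch with scikit-learn's `OrdinalEncoder` on made-up data shows the 0 to k-1 convention. The `handle_unknown='use_encoded_value'` and `unknown_value=-1` settings are an assumption added here to show one way of handling unseen categories; the cells below use the default configuration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "purple"]})  # "purple" is unseen

# Categories are learned from the train set and numbered 0 to k-1
# in sorted order; unseen categories are mapped to -1.
toy_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
toy_encoder.fit(train)

train_codes = toy_encoder.transform(train).ravel().tolist()
test_codes = toy_encoder.transform(test).ravel().tolist()

print(toy_encoder.categories_[0].tolist())  # ['blue', 'green', 'red']
print(train_codes)                          # [2.0, 0.0, 1.0, 0.0]
print(test_codes)                           # [1.0, -1.0]
```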

In [53]:
from sklearn.preprocessing import OrdinalEncoder

In [54]:
data = pd.read_csv('data/creditApprovalUCI.csv')

X_train, X_test, y_train, y_test = train_test_split(data.drop(labels=['A16'], axis=1), data['A16'], test_size=0.3, random_state=0)

In [55]:
vars_categorical = ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']

le = OrdinalEncoder()

In [56]:
le.fit(X_train[vars_categorical])

In [57]:
X_train_enc = le.transform(X_train[vars_categorical])
X_test_enc = le.transform(X_test[vars_categorical])

In [60]:
X_train[vars_categorical]

Unnamed: 0,A1,A4,A5,A6,A7,A9,A10,A12,A13
596,a,u,g,c,v,t,t,t,g
303,a,u,g,q,v,f,f,f,g
204,b,y,p,w,v,t,t,f,g
351,b,y,p,ff,ff,f,f,f,g
118,b,u,g,m,v,t,t,t,g
...,...,...,...,...,...,...,...,...,...
359,a,u,g,ff,ff,f,f,f,g
192,b,u,g,x,v,t,f,f,g
629,a,u,g,w,v,f,f,f,g
559,a,u,g,q,h,t,t,t,g


In [61]:
X_train_enc

array([[0., 1., 0., ..., 1., 1., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [1., 2., 2., ..., 1., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 1., 1., 0.],
       [1., 1., 0., ..., 0., 1., 2.]])

## Replacing categories with counts or frequency of observations

In **count or frequency encoding**, we replace the categories with the **count or the fraction
of observations with that category**.

That is, if 10 out of 100 observations show the category
blue for the variable color, we would replace blue with 10 when performing count encoding, or
with 0.1 when performing frequency encoding.

These techniques, which capture the representation of each label in a dataset, are **very popular in data science competitions**. 

The assumption is that the number of observations per category is somewhat predictive of the
target.

> Note that if two different categories are present in the same percentage of
observations, they will be replaced by the same value, which may lead to
information loss.
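
The arithmetic above can be reproduced with plain pandas. This is only a sketch on made-up data, where blue appears in 10 of 100 observations; the mappings are learned on the train set and then applied to the test set.

```python
import pandas as pd

train = pd.DataFrame({"color": ["blue"] * 10 + ["red"] * 60 + ["green"] * 30})
test = pd.DataFrame({"color": ["blue", "green", "yellow"]})  # "yellow" is unseen

# Learn the mappings from the train set only.
counts = train["color"].value_counts()                     # count encoding
frequencies = train["color"].value_counts(normalize=True)  # frequency encoding

# Apply the mappings; categories not seen during training become NaN.
test_count = test["color"].map(counts)
test_freq = test["color"].map(frequencies)

print(counts.to_dict())  # {'red': 60, 'green': 30, 'blue': 10}
print(test_count.tolist())
print(test_freq.tolist())
```

With this manual mapping, an unseen category such as `yellow` yields a missing value, so it has to be handled explicitly before modelling.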

In [62]:
data = pd.read_csv('data/creditApprovalUCI.csv')

In [63]:
data.head(3)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202.0,0,1
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,1
2,a,24.5,,u,g,q,h,,,,0,f,g,280.0,824,1


In [65]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(labels=['A16'], axis=1), data['A16'], test_size=0.3, random_state=0)

imputer = CategoricalImputer()
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)



In [66]:
from feature_engine.encoding import CountFrequencyEncoder

count_enc = CountFrequencyEncoder(encoding_method='count', variables=None)

In [67]:
count_enc.fit(X_train)



In [68]:
count_enc.encoder_dict_

{'A1': {'b': 335, 'a': 144, 'Missing': 4},
 'A4': {'u': 363, 'y': 115, 'Missing': 4, 'l': 1},
 'A5': {'g': 363, 'p': 115, 'Missing': 4, 'gg': 1},
 'A6': {'c': 93,
  'q': 56,
  'w': 48,
  'i': 41,
  'ff': 38,
  'k': 38,
  'aa': 34,
  'cc': 30,
  'm': 26,
  'x': 24,
  'e': 21,
  'd': 21,
  'j': 8,
  'Missing': 4,
  'r': 1},
 'A7': {'v': 277,
  'h': 101,
  'ff': 41,
  'bb': 39,
  'z': 7,
  'dd': 5,
  'j': 5,
  'Missing': 4,
  'n': 3,
  'o': 1},
 'A9': {'t': 222, 'f': 193, 'Missing': 68},
 'A10': {'f': 230, 't': 185, 'Missing': 68},
 'A12': {'f': 263, 't': 220},
 'A13': {'g': 441, 's': 38, 'p': 4}}

In [69]:
X_train_enc = count_enc.transform(X_train)
X_test_enc = count_enc.transform(X_test)

In [71]:
X_train

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15
596,a,46.08,3.000,u,g,c,v,2.375,t,t,8,t,g,396.0,4159
303,a,15.92,2.875,u,g,q,v,0.085,f,f,0,f,g,120.0,0
204,b,36.33,2.125,y,p,w,v,0.085,t,t,1,f,g,50.0,1187
351,b,22.17,0.585,y,p,ff,ff,0.000,f,f,0,f,g,100.0,0
118,b,57.83,7.040,u,g,m,v,14.000,t,t,6,t,g,360.0,1332
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
359,a,36.75,4.710,u,g,ff,ff,0.000,f,f,0,f,g,160.0,0
192,b,41.75,0.960,u,g,x,v,2.500,t,f,0,f,g,510.0,600
629,a,19.58,0.665,u,g,w,v,1.665,f,f,0,f,g,220.0,5
559,a,22.83,2.290,u,g,q,h,2.290,t,t,7,t,g,140.0,2384


In [74]:
X_train["A4"].unique()

array(['u', 'y', 'Missing', 'l'], dtype=object)

In [70]:
X_train_enc.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15
596,144,46.08,3.0,363,363,93,277,2.375,222,185,8,220,441,396.0,4159
303,144,15.92,2.875,363,363,56,277,0.085,193,230,0,263,441,120.0,0
204,335,36.33,2.125,115,115,48,277,0.085,222,185,1,263,441,50.0,1187
351,335,22.17,0.585,115,115,38,41,0.0,193,230,0,263,441,100.0,0
118,335,57.83,7.04,363,363,26,277,14.0,222,185,6,220,441,360.0,1332
