# Encoding Categorical Variables

- nominal categorical variables: the categories have no inherent ranking from better to worse (e.g. colors)
- ordinal categorical variables: the categories can be ordered by value (e.g. US letter grades A, B, C, ...)
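
The distinction above can be illustrated with a small, library-free sketch. The grade scale and the color list are made-up illustrative data, not part of the credit approval dataset used below.

```python
# Ordinal: the order is part of the data, so an explicit rank mapping is meaningful.
grade_rank = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}  # assumed grading scale
grades = ["B", "A", "C", "A"]
encoded_grades = [grade_rank[g] for g in grades]

# Nominal: there is no order, so any single integer code would be arbitrary.
# One 0/1 indicator per category avoids implying a ranking.
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))  # ['blue', 'green', 'red']
encoded_colors = [[int(c == cat) for cat in categories] for c in colors]

print(encoded_grades)
print(encoded_colors)
```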

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

## Category Encoders

A set of scikit-learn-style transformers for encoding categorical variables into numeric with different techniques.

- a specialized library for encoding: https://contrib.scikit-learn.org/category_encoders/
- for our examples, sklearn and feature-engine will suffice

Install: `pip install category_encoders`

<img src="https://feature-engine.trainindata.com/en/1.1.x/_images/categoricalSummary.png" />

## Creating binary variables through one-hot encoding
- particularly useful for tree-based models
- also useful when we want to judge the importance of individual categories, since each one becomes its own feature (the others can be dropped)
    - example: one-hot encode the days of the week, then keep only the is/is-not-Sunday indicator and drop the rest
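
The day-of-week example can be sketched without any library to show what one-hot encoding produces. The `days` list and the `day_` column prefix are hypothetical and chosen only for illustration.

```python
days = ["Mon", "Sun", "Wed", "Sun", "Fri"]  # illustrative data

# Full one-hot encoding: one 0/1 column per distinct day.
categories = sorted(set(days))
one_hot = {f"day_{cat}": [int(d == cat) for d in days] for cat in categories}

# Keep only the indicator we care about and discard the other columns.
is_sunday = one_hot["day_Sun"]

print(sorted(one_hot))
print(is_sunday)
```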

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [None]:
data = pd.read_csv('data/creditApprovalUCI.csv')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(labels=['A16'], axis=1), data['A16'], test_size=0.3, random_state=0)

In [None]:
X_train['A4'].unique()

In [None]:
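encoder = OneHotEncoder(categories='auto', drop='first', sparse_output=False)
# drop='first': drop the first category of each feature, so k categories yield k-1 columns (avoids redundant, collinear columns)
# sparse_output=False: return a dense NumPy array instead of a sparse matrix (the parameter was called `sparse` before scikit-learn 1.2)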

In [None]:
vars_categorical = ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']
encoder.fit(X_train[vars_categorical])

In [None]:
X_train_enc = encoder.transform(X_train[vars_categorical])
X_test_enc = encoder.transform(X_test[vars_categorical])

In [None]:
X_train_enc

## Performing one-hot encoding of frequent categories
- this approach keeps the feature space from growing too much
- we choose how many of the most frequent categories to keep; the remaining ones can be replaced, e.g., with an "other" category
- for this example we use feature-engine, which has this approach built in
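
As a rough illustration of the idea, the following library-free sketch creates indicators only for the most frequent categories. The data, the value of `top_n`, and the explicit `is_other` column are illustrative assumptions; a library implementation may instead leave infrequent categories as all-zero rows.

```python
from collections import Counter

values = ["a", "b", "a", "c", "a", "b", "d", "e"]  # illustrative data
top_n = 2  # hypothetical number of frequent categories to keep

# Find the top_n most frequent categories.
top = [cat for cat, _ in Counter(values).most_common(top_n)]

# One indicator column per frequent category.
encoded = {f"is_{cat}": [int(v == cat) for v in values] for cat in top}

# Optional: a single indicator that collects every remaining category.
encoded["is_other"] = [int(v not in top) for v in values]

print(top)
print(encoded)
```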

In [None]:
# note: this import shadows sklearn's OneHotEncoder imported earlier
from feature_engine.encoding import OneHotEncoder
from feature_engine.imputation import CategoricalImputer

In [None]:
data = pd.read_csv('data/creditApprovalUCI.csv')

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['A16'], axis=1), # predictors
    data['A16'], # target
    test_size=0.3, # percentage of observations in test set
    random_state=0) # seed to ensure reproducibility


imputer = CategoricalImputer()
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

In [None]:
X_train['A6'].unique()

In [None]:
X_train['A6'].value_counts().sort_values(ascending=False).head(5)

In [None]:
ohe_enc = OneHotEncoder(top_categories=5, variables=['A6', 'A7'], drop_last=False)

In [None]:
ohe_enc.fit(X_train)

In [None]:
X_train_enc = ohe_enc.transform(X_train)
X_test_enc = ohe_enc.transform(X_test)

In [None]:
X_train_enc.head()

In [None]:
ohe_enc.encoder_dict_

## Replacing categories with ordinal numbers
- better suited to nonlinear ML models (e.g. tree-based ones), because the integers assigned to the categories are arbitrary
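
A minimal sketch of what integer encoding does, assuming categories are numbered in sorted order (the data is illustrative). The resulting numbers carry no real magnitude, which is why a model that treats them as quantities can be misled.

```python
values = ["u", "y", "l", "u", "y"]  # illustrative data

# Assign each distinct category an integer, here by alphabetical order.
mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
encoded = [mapping[v] for v in values]

print(mapping)
print(encoded)
```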

In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
data = pd.read_csv('data/creditApprovalUCI.csv')

X_train, X_test, y_train, y_test = train_test_split(data.drop(labels=['A16'], axis=1), data['A16'],test_size=0.3, random_state=0)

In [None]:
vars_categorical = ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']

le = OrdinalEncoder()

In [None]:
le.fit(X_train[vars_categorical])

In [None]:
X_train_enc = le.transform(X_train[vars_categorical])
X_test_enc = le.transform(X_test[vars_categorical])

In [None]:
X_train_enc

## Replacing categories with counts or frequency of observations
- each category is replaced by its count, or by its frequency (the share of observations in which it occurs)
- if two categories have exactly the same count or frequency, they receive the same value and become indistinguishable after encoding
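
Both variants, and the collision mentioned above, can be seen in a small library-free sketch on made-up data:

```python
from collections import Counter

values = ["a", "b", "a", "c", "b", "d"]  # illustrative data

counts = Counter(values)  # number of occurrences per category
freqs = {cat: n / len(values) for cat, n in counts.items()}  # share of observations

count_encoded = [counts[v] for v in values]
freq_encoded = [freqs[v] for v in values]

# "a" and "b" occur equally often, so they end up with the same encoded value.
print(count_encoded)
print(freq_encoded)
```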

In [None]:
data = pd.read_csv('data/creditApprovalUCI.csv')

In [None]:
data.head(3)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(labels=['A16'], axis=1), data['A16'],test_size=0.3, random_state=0)

imputer = CategoricalImputer()
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

In [None]:
from feature_engine.encoding import CountFrequencyEncoder

count_enc = CountFrequencyEncoder(encoding_method='count', variables=None)

In [None]:
count_enc.fit(X_train)

In [None]:
count_enc.encoder_dict_

In [None]:
X_train_enc = count_enc.transform(X_train)
X_test_enc = count_enc.transform(X_test)

In [None]:
X_train_enc.head()

## Encoding with integers in an ordered manner
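- we compute the mean of the target for each category and sort the categories by that mean
- the integer code is the category's rank in this ordering, so it follows the target mean rather than how often the category occurs (unlike the count encoding above)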

In [None]:
data = pd.read_csv('data/creditApprovalUCI.csv')

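# the target A16 is deliberately kept in X here so that we can group by category and average it below
X_train, X_test, y_train, y_test = train_test_split(data, data['A16'], test_size=0.3, random_state=0)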

imputer = CategoricalImputer()
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

In [None]:
X_train["A7"].unique()

In [None]:
X_train.groupby(['A7'])['A16'].mean().plot()
plt.title('Relationship between A7 and the target')
plt.ylabel('Mean of target')
plt.show()

In [None]:
X_train.groupby(['A7'])['A16'].mean()

In [None]:
X_train.groupby(['A7'])['A16'].mean().sort_values()

In [None]:
ordered_labels = X_train.groupby(['A7'])['A16'].mean().sort_values().index
ordered_labels

In [None]:
ordinal_mapping = {k: i for i, k in enumerate(ordered_labels, 0)}
ordinal_mapping

In [None]:
X_train['A7'] = X_train['A7'].map(ordinal_mapping)
X_test['A7'] = X_test['A7'].map(ordinal_mapping)

In [None]:
X_train.groupby(['A7'])['A16'].mean().plot()
plt.title('Relationship between A7 and the target')
plt.ylabel('Mean of target')
plt.show()

In [None]:
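X_train, X_test, y_train, y_test = train_test_split(data.drop(labels=['A16'], axis=1), data['A16'], test_size=0.3, random_state=0)
X_train, X_test = imputer.fit_transform(X_train), imputer.transform(X_test)  # re-impute the fresh split so the encoder sees no missing values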

In [None]:
from feature_engine.encoding import OrdinalEncoder

ordinal_enc = OrdinalEncoder(encoding_method='ordered', variables=None)

In [None]:
ordinal_enc.fit(X_train, y_train)

In [None]:
ordinal_enc.encoder_dict_

In [None]:
X_train_enc = ordinal_enc.transform(X_train)
X_test_enc = ordinal_enc.transform(X_test)

In [None]:
X_test_enc.head()

## Encoding with the mean of the target
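Here each category is replaced by the mean of the target observed for that category. A minimal library-free sketch on made-up data (the `categories` and `target` lists are illustrative, not taken from the dataset):

```python
from collections import defaultdict

categories = ["a", "b", "a", "c", "b", "a"]  # illustrative feature
target = [1, 0, 1, 0, 1, 0]                  # illustrative binary target

# Accumulate the target sum and the number of rows per category.
sums = defaultdict(float)
counts = defaultdict(int)
for cat, y in zip(categories, target):
    sums[cat] += y
    counts[cat] += 1

# Mean of the target per category, learned from the training data only.
mean_map = {cat: sums[cat] / counts[cat] for cat in counts}
encoded = [mean_map[c] for c in categories]

print(mean_map)
print(encoded)
```

In practice the mapping should be learned on the training set and then applied unchanged to the test set, as the cells below do with `fit` and `transform`.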

In [None]:
data = pd.read_csv('data/creditApprovalUCI.csv')

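# drop the target A16 from the predictors so it is not passed to the encoder as a feature
X_train, X_test, y_train, y_test = train_test_split(data.drop(labels=['A16'], axis=1), data['A16'], test_size=0.3, random_state=0)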

imputer = CategoricalImputer()
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

In [None]:
from feature_engine.encoding import MeanEncoder

mean_enc = MeanEncoder(variables=None)

In [None]:
mean_enc.fit(X_train, y_train)

In [None]:
X_train_enc = mean_enc.transform(X_train)
X_test_enc = mean_enc.transform(X_test)

## Grouping rare or infrequent categories
- in some cases we want to merge a large number of rare categories into a single category (e.g. "other")
- for example, every category that occurs in fewer than 5% of the observations
- this can help reduce overfitting (e.g. in decision trees)
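
The grouping rule can be sketched without any library as follows. The data, the `"Rare"` label, and the choice to treat a frequency of exactly `tol` as frequent are assumptions made for illustration.

```python
from collections import Counter

# 100 illustrative observations: "d" and "e" are rare.
values = ["a"] * 50 + ["b"] * 30 + ["c"] * 17 + ["d"] * 2 + ["e"] * 1
tol = 0.05  # categories below this share of observations are grouped

# Share of observations per category.
freq = {cat: n / len(values) for cat, n in Counter(values).items()}

# Categories that are frequent enough keep their own label.
frequent = {cat for cat, f in freq.items() if f >= tol}
grouped = [v if v in frequent else "Rare" for v in values]

print(sorted(frequent))
print(Counter(grouped))
```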

In [None]:
from feature_engine.encoding import RareLabelEncoder

In [None]:
data = pd.read_csv('data/creditApprovalUCI.csv')

X_train, X_test, y_train, y_test = train_test_split(data.drop(labels=['A16'], axis=1), data['A16'],test_size=0.3, random_state=0)

imputer = CategoricalImputer()
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

In [None]:
rare_encoder = RareLabelEncoder(tol=0.05, # categories below 5% frequency go into the shared category
                                n_categories=4) # only applied to variables with at least 4 distinct categories

In [None]:
rare_encoder.fit(X_train)

In [None]:
X_train_enc = rare_encoder.transform(X_train)
X_test_enc = rare_encoder.transform(X_test)