# Welcome!

Machine learning models need data to be provided in numeric form. Categorical data often contain strings and should therefore be encoded. We will explore low and high cardinality encoding methods.

# <center> Low Cardinality Technique </center>

## One Hot Encoding

One Hot Encoding represents a categorical variable as a set of binary vectors. The procedure creates a dummy variable for each unique value of the categorical variable. Each dummy variable is binary and takes only two values:
- 0 for data that are not represented by the dummy variable
- 1 for data that are represented by the dummy variable

One hot encoding is very popular, but it becomes computationally expensive when a variable has many labels. In addition, the binary variables it creates are perfectly collinear: any one of them can be derived from the others, and this multicollinearity can be problematic for regression models. This is known as the dummy variable trap and is handled by dropping one of the dummy variables, usually the first or the last. The remaining binary variables still represent the original categorical variable completely. However, dropping a label can be problematic for tree-based algorithms, because the dropped label will not be considered as a feature. Tree-based algorithms are not affected by multicollinearity, so the extra label should be kept for them, although such algorithms do not work well with high-dimensional feature spaces. We may also keep the extra label if recursive algorithms (e.g. RFE) are going to be used.
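
As a minimal sketch of the dummy variable trap described above, the toy example below uses pandas' `get_dummies` on a made-up `colour` column (the column name and values are assumptions for illustration only). It shows that the full set of dummies always sums to 1 per row, and that dropping one label loses no information.

```python
import pandas as pd

# Toy data: "colour" is a hypothetical nominal column used only for illustration.
df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# Full one hot encoding: one dummy column per label.
full = pd.get_dummies(df["colour"], prefix="colour", dtype=int)

# Each row contains exactly one 1, so the dummies always sum to 1.
# This linear dependence is the dummy variable trap.
row_sums = full.sum(axis=1)

# Dropping the first label (alphabetically "blue") removes the redundancy:
# a row of all zeros now stands for "blue".
reduced = pd.get_dummies(df["colour"], prefix="colour", drop_first=True, dtype=int)

print(full)
print(reduced)
```

In the reduced frame, the row that was "blue" is encoded as zeros in both remaining columns, which is how the dropped label stays recoverable.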

<u>Prefer to use one hot encoding when:</u>

- Categorical variable is nominal
- There are fewer than 10 labels within the categorical variable
- Tree-based algorithms are not used 
- Recursive algorithms (e.g. RFE) are not used 

Several libraries provide a one hot encoder, including *category_encoders* and *feature_engine*, but I like the sklearn version.

In [None]:
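import pandas as pd
from sklearn.preprocessing import OneHotEncoder

categorical_vars = ["input1", "input2"]

# sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False)
ohe_enc = OneHotEncoder(sparse_output=False, drop="first")
enc_vars_array = ohe_enc.fit_transform(X_train[categorical_vars])
enc_feature_names = ohe_enc.get_feature_names_out(categorical_vars)

enc_vars_df = pd.DataFrame(enc_vars_array, columns=enc_feature_names)

# Replace the original categorical columns with their encoded versions
X_train = pd.concat([X_train.reset_index(drop=True), enc_vars_df], axis=1)
X_train = X_train.drop(columns=categorical_vars)

# Apply the fitted encoder to the test set (transform only, no refit)
test_vars_df = pd.DataFrame(ohe_enc.transform(X_test[categorical_vars]), columns=enc_feature_names)
X_test = pd.concat([X_test.reset_index(drop=True), test_vars_df], axis=1)
X_test = X_test.drop(columns=categorical_vars)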

# <center> High Cardinality Techniques </center>

## Generalized Linear Mixed Model Encoder *(supervised)*

This is similar to target encoding, where a category is encoded based on its average target value. The Generalized Linear Mixed Model (GLMM) version of target encoding is well supported by research and [compares well](https://www.researchgate.net/publication/350578264_Regularized_target_encoding_outperforms_traditional_methods_in_supervised_machine_learning_with_high_cardinality_features) with other encoding techniques. Because target encoding tends to overfit, it may be best to use this encoder with cross validation.
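
To make the idea of "average target value" concrete, here is a minimal sketch of plain target (mean) encoding with pandas. It is not the GLMM encoder itself, only the simpler technique it builds on; the `city` and `price` names and the fallback to the global mean for unseen categories are assumptions chosen for illustration.

```python
import pandas as pd

# Toy training data; "city" and "price" are hypothetical names.
train = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "price": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})
test = pd.DataFrame({"city": ["A", "C", "D"]})  # "D" never appears in training

# Learn the mapping on the training data only.
global_mean = train["price"].mean()
category_means = train.groupby("city")["price"].mean()

# Apply the mapping; unseen categories fall back to the global mean (an assumption).
train["city_enc"] = train["city"].map(category_means)
test["city_enc"] = test["city"].map(category_means).fillna(global_mean)

print(train)
print(test)
```

Category "C" appears only once, so its encoding is exactly its own target value. This is the kind of leakage that makes plain target encoding prone to overfitting.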

In [None]:
# Generalized Linear Mixed Model Encoder Template

from category_encoders.glmm import GLMMEncoder

# binomial_target = True:  the target variable should be binomial (binary).
# binomial_target = False: the target variable should be continuous.
glmm_enc = GLMMEncoder(cols = ["input1", "input2"],
                       handle_missing = "return_nan",
                       handle_unknown = "value",
                       random_state = 42,
                       binomial_target = False)

# Fit and transform the training data with y; transform the test data without it.
X_train = glmm_enc.fit_transform(X_train, y_train)
X_test = glmm_enc.transform(X_test)

## Weight of Evidence *(supervised)*

Weight of Evidence (WoE) creates a monotonic relationship between the input and the target variable. It is defined as the natural logarithm of the proportion of good events divided by the proportion of bad events, so it applies to binary targets.

- Good for regression models (including logistic regression)
- Transformed variables are on the same scale and can be compared to explore their predictive power
- BUT may lead to overfitting
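
The definition above can be sketched by hand with pandas and NumPy. The `grade` and `target` names are hypothetical, and treating `target == 1` as the "good" event is an assumption. Library encoders such as `WOEEncoder` typically add regularization to avoid division by zero, so their values will generally not match this unsmoothed sketch exactly.

```python
import numpy as np
import pandas as pd

# Toy data; "grade" and "target" are hypothetical names.
# Assumption: target == 1 marks a "good" event, target == 0 a "bad" one.
df = pd.DataFrame({
    "grade":  ["A", "A", "A", "B", "B", "B", "B", "C", "C", "C"],
    "target": [1,   1,   0,   1,   0,   0,   1,   0,   0,   1],
})

# Count good and bad events per category.
counts = df.groupby("grade")["target"].agg(good="sum", total="count")
counts["bad"] = counts["total"] - counts["good"]

# Share of all good (bad) events that fall into each category.
good_share = counts["good"] / counts["good"].sum()
bad_share = counts["bad"] / counts["bad"].sum()

# WoE is the natural log of the ratio of the two shares.
counts["woe"] = np.log(good_share / bad_share)

# Replace each label with its WoE value.
df["grade_woe"] = df["grade"].map(counts["woe"])

print(counts)
```

A positive WoE means the category holds a larger share of good events than of bad events, a negative value means the opposite, and zero means the two shares are equal.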

In [None]:
# Weight of Evidence Template

from category_encoders.woe import WOEEncoder

woe_enc = WOEEncoder(cols = ["input1", "input2"],
                     handle_missing = "return_nan",
                     handle_unknown = "value",
                     random_state = 42)

# Fit and transform the training data with y; transform the test data without it.
X_train = woe_enc.fit_transform(X_train, y_train)
X_test = woe_enc.transform(X_test)