In [1]:
%run Latex_macros.ipynb



$$
\newcommand{\o}{\mathbf{o}}
$$


In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1
%matplotlib inline

In [3]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Common imports
import os

import mnist_helper
%aimport mnist_helper

mnh = mnist_helper.MNIST_Helper()

import class_helper
%aimport class_helper

clh = class_helper.Classification_Helper()

import training_models_helper as tmh
%aimport training_models_helper

tm = tmh.TrainingModelsHelper()

# Categorical features: the Dummy Variable Trap for Linear Regression

In general, One Hot Encoding (OHE) of features is the best way to deal with Categorical features in
Machine Learning.

**However**, OHE creates a mathematical issue for some models
- linear models with an intercept term (like Linear Regression and Logistic Regression).

This is called the **Dummy Variable Trap**.

To avoid the trap, we need to perform OHE in a slightly different way for the affected models.

Special cases are unfortunate, so we will only offer a quick explanation here.


For now, when using linear models on a categorical variable $v$ with $||C||$ classes, there are several alternatives to avoid the trap
- Encode $v$ as a vector $\mathbf{v}$ of $||C|| -1$ indicators rather than $||C||$
    - this solution is common enough that several toolkits provide functions to deal with it
        - `sklearn.preprocessing.OneHotEncoder` with argument `drop="first"`
        - Pandas: `pd.get_dummies` with argument `drop_first=True`
- Use a regularizer (e.g., Ridge regression)
- *Don't* include an intercept term
    - But this may cause problems
        - Having an intercept ensures that the training errors have mean $0$
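
As a minimal sketch of the first alternative, the cell below encodes a toy categorical feature with both toolkits. The DataFrame `df` and its `color` column are hypothetical and only serve as an illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical feature with ||C|| = 3 classes
df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Red"]})

# Full OHE: ||C|| = 3 indicator columns (this is what triggers the trap)
full = pd.get_dummies(df["color"])

# Reduced OHE: ||C|| - 1 = 2 indicators; the first class in sorted order is dropped
reduced = pd.get_dummies(df["color"], drop_first=True)

# The same reduction with scikit-learn (the result is sparse by default)
enc = OneHotEncoder(drop="first")
reduced_sk = enc.fit_transform(df[["color"]]).toarray()

print(list(full.columns))     # ['Blue', 'Green', 'Red']
print(list(reduced.columns))  # ['Green', 'Red']
print(reduced_sk.shape)       # (4, 2)
```

The dropped class ("Blue" here) becomes the baseline: an example with all remaining indicators equal to $0$ belongs to it.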

## Dummy variable trap: Multi-collinearity in Linear Regression

Consider the set of classes $C = $ { "Red", "Green", "Blue" } and a categorical variable $v$ that takes values in $C$.

Suppose we create $||C||$ indicator variables
- $\mathbf{v}_{Red}, \mathbf{v}_{Green}, \mathbf{v}_{Blue}$

By construction of the OHE of $v$,
for each example $i$:
$$
\sum_{c \in C} { \mathbf{v}^\ip_c } = 1
$$

This means that the indicators in $\mathbf{v}$ sum to the "constant" attribute 1 that represents the intercept term
in each example (e.g., $\x_0$), so they are perfectly collinear with it.

$$
  \X'' = \begin{pmatrix}
  \textbf{const} & \textbf{Is Red} & \textbf{Is Green} & \textbf{Is Blue}\\
  1 &  1 & 0 & 0 \\
  1 &  0 & 0 & 1 \\
  1 &  0 & 1 & 0 \\
  \vdots & \vdots & \vdots & \vdots \\
  \end{pmatrix}
$$

When one feature (e.g., the constant) is equal to a linear combination of some other features, this is
called Perfect Multi-collinearity.
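
The collinearity can be checked numerically. The sketch below rebuilds the three rows of the matrix shown above (plus one hypothetical extra "Red" example) and inspects its rank.

```python
import numpy as np

# Columns: const, Is Red, Is Green, Is Blue
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],   # hypothetical extra example (Red)
], dtype=float)

# The constant column equals the sum of the three indicator columns
const_equals_sum = np.allclose(X[:, 0], X[:, 1:].sum(axis=1))

# 4 columns, but only 3 of them are linearly independent
rank = np.linalg.matrix_rank(X)

print(const_equals_sum)  # True
print(rank)              # 3
```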

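Linear Regression has mathematical issues with Perfect Multi-collinearity (and numerical issues even with Imperfect Multi-collinearity): under Perfect Multi-collinearity the matrix $\X^T \X$ in the normal equations is singular, so the parameters are not uniquely determined.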

This manifests itself as 
- some variables with huge positive parameter values (e.g., $\Theta_{Red}, \Theta_{Blue}$)
- and other variables with huge (offsetting) negative parameter values (e.g., $\Theta_{Green}$).
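
A small sketch of why offsetting parameter values can grow without bound: shifting the intercept up by any amount $t$ and every indicator's parameter down by $t$ leaves all predictions unchanged. The parameter vector `theta` below is hypothetical, and the offset used here (intercept versus indicators) is just one of the possible offsetting patterns.

```python
import numpy as np

# Columns: const, Is Red, Is Green, Is Blue
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

theta = np.array([2.0, 1.0, -1.0, 0.5])  # hypothetical parameters

# X @ offset == 0 for every example, because exactly one indicator is 1
offset = np.array([1.0, -1.0, -1.0, -1.0])

# Huge positive intercept, huge (offsetting) negative indicator parameters
theta_big = theta + 1e6 * offset

pred = X @ theta
pred_big = X @ theta_big

print(np.allclose(pred, pred_big))  # True: same predictions, wildly different parameters
```

Because both parameter vectors fit the data equally well, an unregularized fit has no reason to prefer the small one.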

Regularization skirts the issue by enforcing a constraint that restricts large values for parameters.

By turning the parameter value of one indicator in a class to $0$, we effectively eliminate
1 indicator and avoid perfect collinearity.
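
Both remedies can be sketched with the closed-form solutions. The targets `y` and the penalty strength `alpha` are hypothetical. For simplicity, this sketch also penalizes the intercept, which library implementations such as `sklearn.linear_model.Ridge` do not do.

```python
import numpy as np

# Columns: const, Is Red, Is Green, Is Blue
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
], dtype=float)
y = np.array([3.0, 1.0, 2.0, 3.5])  # hypothetical targets

# X^T X is singular: rank 3 for a 4 x 4 matrix
rank_xtx = np.linalg.matrix_rank(X.T @ X)

# Remedy 1 (Ridge): X^T X + alpha * I is invertible for any alpha > 0
alpha = 0.1
theta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# Remedy 2 (drop one indicator): keep const, Is Green, Is Blue
X_drop = X[:, [0, 2, 3]]
theta_drop = np.linalg.solve(X_drop.T @ X_drop, X_drop.T @ y)

print(rank_xtx)     # 3
print(theta_ridge)  # finite, moderate values
print(theta_drop)   # intercept = mean target for Red; others are offsets from it
```

With "Is Red" dropped, the intercept is the average target of the Red examples and the remaining parameters measure each class's difference from Red.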

So where did we get lucky in our two versions of Titanic?
- In the first version, a binary variable for `Sex` is the same as $||C|| -1$ indicators since $||C|| = 2$
- In the second version, with a full set of indicators for `Sex` (2) and `Pclass` (3)
    - `LogisticRegression` defaults to a regularized cost function

So by luck or design, we avoided any potential Dummy Variable Trap issues.

In [48]:
print("Done")

Done
