In [1]:
%run Latex_macros.ipynb
%run beautify_plots.py

<IPython.core.display.Latex object>

$$
\newcommand{\o}{\mathbf{o}}
$$


In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1
%matplotlib inline

In [3]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Common imports
import os

import mnist_helper
%aimport mnist_helper

mnh = mnist_helper.MNIST_Helper()

import class_helper
%aimport class_helper

clh= class_helper.Classification_Helper()

import training_models_helper as tmh
%aimport training_models_helper

tm = tmh.TrainingModelsHelper()

# The dummy variable trap for Linear Regression
Suppose we have a categorical variable `Sex` with categories (discrete values)  $\{ \text{Female}, \text{Male} \}$

One Hot Encoding represents each value as a vector $\mathbf{Is}$ of length 2
- Replace feature `Sex`
- with the two binary valued *indicator* variables
    - $\mathbf{Is}_\text{Female}, \mathbf{Is}_\text{Male}$

So for example $i$ 
- where $\text{Sex}^\ip = \text{Female}$
- $\mathbf{Is}^\ip_\text{Female} =1, \mathbf{Is}^\ip_\text{Male} = 0$

And for example $i'$
- where $\text{Sex}^{(i')} = \text{Male}$
- $\mathbf{Is}^{(i')}_\text{Female} = 0, \mathbf{Is}^{(i')}_\text{Male} = 1$

If we were to use the OHE in a Linear Regression, the design matrix might look something like

$
  \X' = \begin{pmatrix}
  \mathbf{const} & \mathbf{Is}_\text{Female} & \mathbf{Is}_\text{Male} & \ldots\\
  1 &  1 & 0 &   \ldots \\ 
  \vdots \\
  1 &  0 & 1 &   \ldots\\ 
   \vdots \\
  \end{pmatrix}
  \begin{matrix}
  \\
  \text{Female}  \\ 
  \vdots \\
  \text{Male} \\ 
   \vdots \\
  \end{matrix}
$
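As a minimal sketch, a design matrix of this shape could be assembled with NumPy. The four `Sex` values below are hypothetical toy data, not taken from the Titanic dataset.

```python
import numpy as np

# Hypothetical toy data: the value of Sex for four examples
sex = np.array(["Female", "Male", "Male", "Female"])

# One binary indicator column per category
is_female = (sex == "Female").astype(int)
is_male = (sex == "Male").astype(int)

# The constant feature: 1 for every example
const = np.ones(len(sex), dtype=int)

# Columns: const, Is_Female, Is_Male
X = np.column_stack([const, is_female, is_male])
print(X)
```

Each row contains exactly one `1` among the two indicator columns.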

Note that, for every example $i$
$$
\mathbf{Is}_\text{Female}^\ip +  \mathbf{Is}_\text{Male}^\ip = 1
$$

More generally, when a categorical variable with categories from the set $C$ is encoded with OHE, for every example $i$
$$
\sum_{c \in C} { \mathbf{Is}^\ip_c } = 1
$$

Also note that $\mathbf{const}^\ip = 1$ for every example $i$
so
$$
\sum_{c \in C} { \mathbf{Is}^\ip_c } = 1 = \mathbf{const}^\ip
$$

That is, for each example
- A linear combination of some features (i.e., the set of indicator variables for a categorical variable)
- Is exactly equal to some other feature (i.e., the constant)

Such a situation is called *Perfect Multi Collinearity*.
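A quick numerical check of this relationship, as a sketch on hypothetical toy data: `np.linalg.matrix_rank` reports that the three columns of the design matrix span only two dimensions.

```python
import numpy as np

# Hypothetical toy data
sex = np.array(["Female", "Male", "Male", "Female"])
is_female = (sex == "Female").astype(float)
is_male = (sex == "Male").astype(float)
const = np.ones(len(sex))

# For every example the indicators sum to the constant
indicators_sum_to_const = np.array_equal(is_female + is_male, const)

# 3 columns, but only 2 of them are linearly independent
X = np.column_stack([const, is_female, is_male])
rank = np.linalg.matrix_rank(X)

print(indicators_sum_to_const, X.shape[1], rank)
```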

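Multi collinearity (either Perfect or Imperfect) poses mathematical difficulties for Linear Regression
and must be eliminated
- Perfect: the design matrix does not have full column rank, so infinitely many parameter vectors fit the data equally well
- Imperfect: the design matrix is close to rank deficient, so the fitted parameters are numerically unstable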


The problem manifests itself in Linear Regression as 
- Some variables with huge positive parameter values (e.g., $\Theta_{\mathbf{Is}_\text{Female}}$)
- And other variables with huge (offsetting) negative parameter values (e.g., $\Theta_{\mathbf{Is}_\text{Male}}$).
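One way to see why the parameters can blow up: since $\mathbf{const} - \mathbf{Is}_\text{Female} - \mathbf{Is}_\text{Male} = 0$ for every example, adding any multiple of $(1, -1, -1)$ to the parameter vector leaves every prediction unchanged. The sketch below illustrates this with hypothetical toy data and hypothetical parameter values.

```python
import numpy as np

sex = np.array(["Female", "Male", "Male", "Female"])
X = np.column_stack([
    np.ones(len(sex)),                   # const
    (sex == "Female").astype(float),     # Is_Female
    (sex == "Male").astype(float),       # Is_Male
])

# Hypothetical parameter values: Theta_0, Theta_Is_Female, Theta_Is_Male
theta = np.array([1.0, 2.0, 3.0])

# Direction along which predictions do not change: X @ null_direction == 0
null_direction = np.array([1.0, -1.0, -1.0])

# A very different parameter vector with huge, offsetting values
theta_huge = theta + 1e6 * null_direction

same_predictions = np.allclose(X @ theta, X @ theta_huge)
print(theta_huge, same_predictions)
```

Nothing in the data distinguishes `theta` from `theta_huge`, so an unregularized fit has no reason to prefer the small values.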


Multi collinearity
- Arises when categorical variables are One Hot Encoded, i.e., replaced by dummy/indicator variables
- Is **not necessarily** a difficulty for models other than Linear Regression
- When caused by OHE, is called the *Dummy Variable Trap for Linear Regression*

It may appear that the simplest way to avoid multi collinearity is to drop the constant feature $\mathbf{const}$

This will work if there is a *single* categorical variable but consider
what happens if there is a second categorical variable
- `Pclass` with values in $\{ \text{First}, \text{Second}, \text{Third} \}$
$$
(\mathbf{Is}_\text{Female} +  \mathbf{Is}_\text{Male}) = (\mathbf{Is}_\text{First} +  \mathbf{Is}_\text{Second} + \mathbf{Is}_\text{Third}) = 1
$$

Thus the indicator variables encoding `Sex` are multi collinear with the binary variables encoding `Pclass`.
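A sketch of this case, assuming hypothetical toy data containing every combination of `Sex` and `Pclass`: even with no constant column, the five indicator columns are linearly dependent.

```python
import numpy as np
from itertools import product

# Hypothetical toy data: every combination of Sex and Pclass
rows = list(product(["Female", "Male"], ["First", "Second", "Third"]))
sex = np.array([s for s, _ in rows])
pclass = np.array([p for _, p in rows])

# Full OHE for both variables, with NO constant column
X = np.column_stack([
    sex == "Female", sex == "Male",
    pclass == "First", pclass == "Second", pclass == "Third",
]).astype(float)

# The Sex indicators and the Pclass indicators both sum to 1 ...
same_sum = np.array_equal(X[:, :2].sum(axis=1), X[:, 2:].sum(axis=1))

# ... so the 5 columns are still linearly dependent
rank = np.linalg.matrix_rank(X)
print(same_sum, X.shape[1], rank)
```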

The better solution is
- To retain the constant $\mathbf{const}$ variable
- Encode a categorical variable having category values in $C$ using $( |C| -1 )$ binary indicator variables
    - eliminate the indicator variable for a single category of each categorical variable
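A minimal sketch of this encoding using `pandas.get_dummies` with `drop_first=True`, on hypothetical toy data. Note that `drop_first` removes the first category in sorted order (here `Female` and `First`), not `Male`; dropping any single category per variable removes the collinearity.

```python
import numpy as np
import pandas as pd
from itertools import product

# Hypothetical toy data: every combination of Sex and Pclass
df = pd.DataFrame(
    list(product(["Female", "Male"], ["First", "Second", "Third"])),
    columns=["Sex", "Pclass"],
)

# drop_first=True keeps |C| - 1 indicators per categorical variable
X = pd.get_dummies(df, columns=["Sex", "Pclass"], drop_first=True, dtype=float)

# Retain the constant feature
X.insert(0, "const", 1.0)

# Full column rank: the rank equals the number of columns
rank = np.linalg.matrix_rank(X.to_numpy())
print(list(X.columns), rank)
```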

The Linear Regression equation changes from
$$
\begin{array}{lll}
\y & = & \Theta_0 * \mathbf{const} + \Theta_{\mathbf{Is}_\text{Female}} * \mathbf{Is}_\text{Female} + \Theta_{\mathbf{Is}_\text{Male}} * \mathbf{Is}_\text{Male} \\
\text{to} \\
\y & = & \Theta'_0 * \mathbf{const} + \Theta'_{\mathbf{Is}_\text{Female}} * \mathbf{Is}_\text{Female} \\
\end{array}
$$

This will eliminate multi collinearity, but where did the indicator (binary variable)
for the missing category go ?

Answer: the constant !
$$
\begin{array}{lllll}
\Theta'_0 & = &  \Theta_0  & + & \Theta_{\mathbf{Is}_\text{Male}} \\
\Theta'_{\mathbf{Is}_\text{Female}} & = &  \Theta_{\mathbf{Is}_\text{Female}} & - & \Theta_{\mathbf{Is}_\text{Male}}
\end{array}
$$



To convince yourself of this, consider the original equation for examples with either of the two 
values for `Sex`
$$
\begin{array}{ll}
\text{Sex}^\ip = \text{Female}: & \y^\ip = \Theta_0 + \Theta_{\mathbf{Is}_\text{Female}} \\
\text{Sex}^\ip = \text{Male}:   & \y^\ip = \Theta_0 + \Theta_{\mathbf{Is}_\text{Male}} \\
\end{array}
$$
and the corresponding changed equations
$$
\begin{array}{llll}
\text{Sex}^\ip = \text{Female}: & \y^\ip & = & \Theta'_0 + \Theta'_{\mathbf{Is}_\text{Female}} \\
& &   = & (\Theta_0   +  \Theta_{\mathbf{Is}_\text{Male}}) +  (\Theta_{\mathbf{Is}_\text{Female}} - \Theta_{\mathbf{Is}_\text{Male}}) \\
& &   = & \Theta_0   + \Theta_{\mathbf{Is}_\text{Female}}\\
\text{Sex}^\ip = \text{Male}:   & \y^\ip  & = & \Theta'_0  \\
& &   = & \Theta_0  +  \Theta_{\mathbf{Is}_\text{Male}} \\
\end{array}
$$

Identical !
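The same check can be done numerically. This sketch picks arbitrary (hypothetical) values for the original parameters, derives the parameters of the changed equation, and compares the two predictions for each value of `Sex`.

```python
# Hypothetical parameter values for the original (full OHE) equation
theta_0, theta_female, theta_male = 0.5, 2.0, -1.0

# Parameters of the changed equation, after dropping Is_Male
theta_0_new = theta_0 + theta_male
theta_female_new = theta_female - theta_male


def predict_original(is_female, is_male):
    """Prediction using the constant and both indicators."""
    return theta_0 + theta_female * is_female + theta_male * is_male


def predict_changed(is_female):
    """Prediction using the constant and the Is_Female indicator only."""
    return theta_0_new + theta_female_new * is_female


# Compare the two equations for a Female example and a Male example
for label, is_female in [("Female", 1), ("Male", 0)]:
    original = predict_original(is_female, 1 - is_female)
    changed = predict_changed(is_female)
    print(label, original, changed)
```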

What has effectively happened
- The new intercept 
    - Captures the "base" contribution (e.g. $\Theta_{\mathbf{Is}_\text{Male}}$) of the missing category

- The remaining binary variables
    - Capture the *incremental* contribution (over the base contribution $\Theta_{\mathbf{Is}_\text{Male}}$)
    of the example being different than the base category (e.g., having  `Sex` = $\text{Female}$)

That is:
- $\Theta_{\mathbf{Is}_\text{Female}}$: *absolute* contribution to $\hat{\y}$ for being Female
- $\Theta'_{\mathbf{Is}_\text{Female}}$: *incremental* contribution to $\hat{\y}$ for being Female
    - Over and above the contribution for being Male

# Titanic example:  why no problem when we added OHE for `Pclass` ?

When we encoded `Pclass` using OHE, we didn't drop one category for the feature.

How come this didn't manifest itself as large, offsetting positive/negative parameter values ?

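- Answer: `LogisticRegression` in `sklearn` defaults to using regularization (an L2 penalty) in the loss function.

Regularization skirts the issue by penalizing large parameter values.

Among the many parameter vectors that fit the data equally well, the penalty favors the one with the smallest values,
so huge offsetting parameters never appear.

This plays the same role as dropping 1 indicator (which amounts to fixing its parameter at $0$):
perfect collinearity no longer leaves the solution undetermined.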

In short: we were lucky !

The regularizer that we didn't even know was there saved us.

Better to be informed than lucky
- that's why we mention the Dummy Variable trap and its solutions
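The effect of the regularizer can be sketched in closed form. To keep the example short, it uses L2-penalized (ridge) Linear Regression on hypothetical toy data as a stand-in for the regularized `LogisticRegression` of the Titanic example; the penalty strength `lam` is an arbitrary choice.

```python
import numpy as np

# Hypothetical toy data with the FULL one hot encoding (no category dropped)
sex = np.array(["Female", "Male", "Male", "Female"])
X = np.column_stack([
    np.ones(len(sex)),                   # const
    (sex == "Female").astype(float),     # Is_Female
    (sex == "Male").astype(float),       # Is_Male
])
y = np.array([3.0, 1.0, 1.0, 3.0])

# Without a penalty, X^T X is singular: no unique solution
gram = X.T @ X
rank_unpenalized = np.linalg.matrix_rank(gram)

# With an L2 penalty of strength lam, X^T X + lam * I is invertible
# (simplification: the constant is penalized along with the other parameters)
lam = 1.0
theta = np.linalg.solve(gram + lam * np.eye(X.shape[1]), X.T @ y)

print(rank_unpenalized, theta)
```

The penalized problem has a single solution with modest parameter values, even though no indicator was dropped.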

# Titanic example: why no problem when feature `Sex` was not OHE ?

Recall our first pass at solving the Titanic classification problem
- Categorical variable `Sex` was encoded as the single binary variable with values 0/1
- Rather than 2 binary indicator variables $\mathbf{Is}_\text{Female}, \mathbf{Is}_\text{Male}$

Essentially: our representation was equivalent to having dropped one value for the variable `Sex`

**Bottom line**

Some models (Linear Regression and Logistic Regression by extension) may encounter problems with OHE of features.

By luck or design, we avoided any potential Dummy Variable Trap issues in the Titanic example.

But it's much better to be smart than lucky: do it the right way !

`sklearn` makes this easy for you
- `OneHotEncoder` transformer has an optional argument `drop` (e.g., `drop='first'`) that can drop one category per feature
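A minimal sketch of the `drop` argument, using hypothetical toy data rather than the Titanic dataset. With `drop="first"`, the first category of each feature (in sorted order) is removed.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one row per example, columns Sex and Pclass
data = [
    ["Female", "First"],
    ["Male", "Third"],
    ["Male", "Second"],
    ["Female", "Third"],
]

# drop="first" removes the first category of each feature
encoder = OneHotEncoder(drop="first")

# The result is sparse by default; convert to a dense array for display
encoded = encoder.fit_transform(data).toarray()

print(encoder.categories_)
print(encoded)
```

The encoded matrix has 3 columns (1 for `Sex`, 2 for `Pclass`) instead of 5, leaving room for the constant that the regression model adds.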

In [4]:
print("Done")

Done
