In [1]:
%run Latex_macros.ipynb
%run beautify_plots.py


In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Common imports
import os

idx = pd.IndexSlice

import mnist_helper
%aimport mnist_helper

mnh = mnist_helper.MNIST_Helper()

import class_helper
%aimport class_helper

clh= class_helper.Classification_Helper()

import training_models_helper as tmh
%aimport training_models_helper

tm = tmh.TrainingModelsHelper()

nbh = class_helper.NB_Helper()

# Classification via counting

A model for the Classification task constructs a probability distribution
$\hat{\y} = \pr{\y | \x}$ 
- Given feature vector $\x$
- Construct *vector*  $\hat{\y}$ (of length $| C |$, where $C$ are the distinct values for the target)
- Whose elements are probabilities: 
$\hat{\y}_c$ is the probability that $\x$ is in class $c$


**Notation abuse alert:**
- Subscripts of vectors should be integers rather than class names
- So technically we should write $\hat{\y}_\loc{c}$ where $\loc{c}$ is the integer index of class named $c$

This sounds difficult at first glance.

Let's start with something simpler: counting.

We will show how to construct this probability using nothing more than counting the features and targets
of the training set!

## From counting to probability

We introduce the topic by assuming all our variables (features and target) are discrete.

We will subsequently adapt this to continuous variables.

First, let's compute the distribution of target classes.

Let $\X = \{ (\x^\ip, \y^\ip) | 1 \le i \le m \}$ be our $m$ training examples.

Then 
$$\cnt{\y = \y'} = \left| \{ i \, | \, \y^\ip = \y' \} \right|$$

is the number of training examples with target $\y'$.

We can easily convert this into an unconditional probability
$$
\pr{\y = \y'} = \frac{\cnt{\y=\y'}}{m}
$$

We can similarly compute the *joint* probability of any two features.

First we count the co-occurrences of the two variables

$$
\cnt{\x_j = \x'_j, \x_k = \x'_k} = \left| \, \{ i | \x^\ip_j = \x'_j, \x^\ip_k = \x'_k \} \, \right|
$$

And the joint probability is
$$
\pr{\x_j = \x'_j, \x_k = \x'_k} = \frac{\cnt{\x_j = \x'_j, \x_k = \x'_k}}{m}
$$

Our illustration is with two features but the notation generalizes for 
- Any number of variables
- Any kind of variables: feature or target

Finally, we can define *conditional* probability
$$
\pr{\y = \y' | \x = \x'} = \frac{ \pr{\y = \y' , \x = \x'} }{\pr{\x=\x'}}
$$

That is, the conditional probability
- Is the joint probability
- As a fraction, relative to the unconditional probability of $\x = \x'$
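
As a minimal sketch of the three definitions above, the following uses only Python's `collections.Counter`. The toy training set (one feature, one target) and the helper names `pr_y`, `pr_x`, `pr_joint` and `pr_y_given_x` are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy training set: (feature value, target) pairs
examples = [
    ("High", "Long"), ("High", "Long"), ("Low", "Long"),
    ("Low", "Short"), ("Low", "Short"), ("High", "Short"),
    ("Low", "Short"), ("High", "Long"),
]
m = len(examples)

# Counts of the feature, the target, and their co-occurrences
cnt_x = Counter(x for x, _ in examples)
cnt_y = Counter(y for _, y in examples)
cnt_xy = Counter(examples)

def pr_y(y):
    """Unconditional probability of target y: count / m."""
    return cnt_y[y] / m

def pr_x(x):
    """Unconditional probability of feature value x: count / m."""
    return cnt_x[x] / m

def pr_joint(x, y):
    """Joint probability of (x, y): co-occurrence count / m."""
    return cnt_xy[(x, y)] / m

def pr_y_given_x(y, x):
    """Conditional probability: joint divided by unconditional pr(x)."""
    return pr_joint(x, y) / pr_x(x)

print(pr_y("Long"), pr_joint("High", "Long"), pr_y_given_x("Long", "High"))
```

Note that the factor $m$ cancels in the conditional probability, so it could equally be computed as a ratio of two counts.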


# Bayes theorem

The key for converting counts (really, associated probabilities) to predictions lies in
Bayes Theorem.


Bayes Theorem relates conditional and unconditional probabilities.

**Bayes Theorem**

$$
\pr{\y = \y' | \x = \x'} = \frac{ \pr{ \x = \x' | \y = \y' } * \pr{\y=\y'} }{\pr{\x=\x'}}
$$


Let's think about Bayes Theorem in terms of our classification task:
- The left hand side is our prediction for the class probabilities, given the features
- The right hand side involves
    - The conditional probability of seeing features $\x'$ among examples with target $\y'$.
    - The unconditional probability of seeing examples with label $\y'$
    - The unconditional probability of seeing examples with feature vector $\x'$.

All these elements can be obtained by counting (and filtering) the training set !

Hence, we can build an extremely simple classifier using nothing more than counting.
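
A quick numeric check may make this concrete. The sketch below assumes some illustrative counts (the specific numbers are assumptions, not a prescribed data set) and uses exact fractions to confirm that the right hand side of Bayes Theorem, built purely from counts, matches the posterior obtained by counting directly.

```python
from fractions import Fraction

# Illustrative counts (assumed for this sketch)
m = 25       # total number of training examples
cnt_y = 9    # examples with target y'
cnt_x = 4    # examples with features x'
cnt_xy = 3   # examples with both features x' and target y'

prior = Fraction(cnt_y, m)            # pr(y = y')
evidence = Fraction(cnt_x, m)         # pr(x = x')
likelihood = Fraction(cnt_xy, cnt_y)  # pr(x = x' | y = y'): filter on y', then count x'

# Right hand side of Bayes Theorem
posterior_bayes = likelihood * prior / evidence

# Left hand side, counted directly: filter on x', then count y'
posterior_direct = Fraction(cnt_xy, cnt_x)

print(posterior_bayes, posterior_direct)
```

Both routes give $3/4$ for these counts.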

## Posterior, Prior Probability, Evidence

Let's break down the parts of Bayes theorem and give them some names:
- $\pr{\y = \y' | \x = \x'}$: *posterior probability*
    - Our prediction
    - This is the probability distribution of $\y$ *conditional* on the features being $\x$
- $\pr{\y=\y'}$: *prior probability*
    - This is the unconditional distribution of $\y$
 
- $\pr{\x = \x '| \y = \y'}$: *likelihood*
    - Given that $\y = \y'$, what is the probability that $\x = \x'$ ?
    - This is the counting part: among examples with label $\y'$, how often are the features $\x'$ ?
- $\pr{\x = \x'}$: *evidence*
    - How often do we see the features $\x'$ ?

We can re-state Bayes Theorem as
$$
\begin{array}{lll}
\text{posterior} = \frac{ \text{prior} * \text{likelihood}} { \text{evidence} } 
\end{array}
$$

That is: 
- Starting from an uninformed *prior* distribution of $\y$
- Derive a conditional *posterior* distribution (i.e., informed by *evidence* $\x$) by updating via the *likelihood* of seeing $\x$ given $\y$.

## Proof of Bayes Theorem

$$
\begin{array}{llll}
\pr{\y = \y' | \x = \x'} & = &  \frac{\pr{\y = \y' ,\x = \x'}} {\pr{\x = \x'} } & \text{(def. of conditional probability)}\\
 & = & \frac{ \pr{\y = \y' ,\x = \x'}} {\pr{\x = \x'} } * \frac{ \frac{1}{\pr{\y = \y'}}}{\frac{1}{\pr{\y = \y'}}} & \text{(multiply by identity)} \\
 & = & \frac{\pr{\x = \x' | \y = \y'}}{\pr{\x = \x'}} *  \frac{1}{\frac{1}{\pr{\y = \y'}}} & \text{(def. of conditional probability)} \\
 & = & \frac{\pr{\x = \x' | \y = \y'}}{\pr{\x = \x'}} * \pr{\y = \y'}
\end{array}
$$


## Length of $\x$ is $n$

Remember that $\x$ is a vector, so that $\pr{\x = \x' \, | \,\y = \y'}$ is
a *joint* probability of $n$ terms
$$
\pr{\x_1 = \x'_1, \x_2 = \x'_2, \ldots, \x_n = \x'_n \,| \, \y = \y'}
$$

We can obtain this by counting (as described above)
- Let $| \x_j |$ denote the number of distinct values for the $j^{th}$ feature
- There are $$\prod_{1 \le j \le n} | \x_j | $$ potential combinations for $\x$

That's a lot of counting !

More importantly, it's a lot of parameters to remember (i.e., the size of $\Theta$ is big).

We need a short-cut.
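
To get a feel for how quickly the number of combinations grows, here is a small sketch. The per-feature cardinalities in `distinct_values` are hypothetical.

```python
import math

# Hypothetical number of distinct values |x_j| for each of n = 4 discrete features
distinct_values = [3, 2, 5, 4]

# Number of potential combinations for the full feature vector x
n_combinations = math.prod(distinct_values)
print(n_combinations)

# Even with only binary features, the count doubles with every added feature
n_binary_combinations = math.prod([2] * 20)
print(n_binary_combinations)
```

The joint conditional probability needs one entry per combination, per class.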

## The Naive part of Naive Bayes

We will assume that each feature is *conditionally* independent of one another
$$
\pr{\x_j = \x'_j, \x_k = \x'_k | \y = \y' } \, =  \,\pr{\x_j = \x'_j | \y = \y' } * \pr{\x_k = \x'_k | \y = \y' }
$$

That is
- $\x_j$ and $\x_k$ are **not** necessarily independent unconditionally
- They **are** independent *conditional* on $\y = \y'$

Think of $\x_j$ and $\x_k$ being correlated through their individual relationships with $\y$.

Excluding that mutual dependence, they may be uncorrelated.

Generalizing the assumption to feature vectors $\x$ of length $n$:

$$
\pr{ \x = \x' | \y = \y' } = \prod_{j=1}^n { \pr{\x_j = \x'_j | \y = \y'} }
$$

That is
- The joint conditional probability of the vector of length $n$ 
- Is **assumed** to be the product of
the individual conditional probabilities of each element of the vector.


This  assumption is probably not true but
- Makes $\pr{ \x = \x' | \y = \y' }$ very easy to compute
    - Don't have to compute it for every possible combination of values for $\x$
- Uses few parameters
- May be close enough

Thus the "naive" assumption has many benefits !

What about computing the *unconditional* $\pr{\x = \x'}$ ?
                           
We can obtain this from conditional probabilities as well
$$
\pr{\x = \x'} = \sum_{c \in C} { \pr{\x = \x' | \y = c} } * \pr{ \y = c }
$$

That is, the unconditional probability follows from the
- Conditional probability given $\y$
- Weighted by the probability $\pr{\y}$ for each possible value of $\y$

This is the law of total probability, which follows from the definition of conditional probability.

What this means is that the only parameters we need to remember are
- The unconditional probabilities $\pr{\y}$
    - Depends on number of classes $|C|$
- Probabilities conditional on $\y$: $\pr{\x | \y}$
    - Depends on length of $\x$: $n$
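
Putting the pieces together, the sketch below stores exactly these two kinds of parameters (as counts) and combines them at prediction time. It is a minimal illustration under stated assumptions, not a definitive implementation: the function names `fit` and `predict_proba` and the toy data are hypothetical, and it applies no correction for feature values that were never seen with a class.

```python
from collections import Counter, defaultdict

def fit(X, y):
    """Count class frequencies and, within each class, per-feature value frequencies."""
    class_counts = Counter(y)
    # feature_counts[c][j][v] = number of class-c examples whose feature j equals v
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in zip(X, y):
        for j, value in enumerate(features):
            feature_counts[label][j][value] += 1
    return class_counts, feature_counts

def predict_proba(x_new, class_counts, feature_counts):
    """Posterior over classes for x_new under the naive independence assumption."""
    m = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / m  # prior pr(y = c)
        for j, value in enumerate(x_new):
            # One naive likelihood factor pr(x_j = value | y = c)
            score *= feature_counts[c][j][value] / n_c
        scores[c] = score
    # Evidence pr(x = x_new): sum over classes of likelihood * prior
    evidence = sum(scores.values())
    return {c: s / evidence for c, s in scores.items()}

# Hypothetical toy data: two categorical features per example
X = [("Cheap", "High"), ("Cheap", "High"), ("Cheap", "Low"), ("Rich", "High"),
     ("Rich", "Low"), ("Rich", "Low"), ("Rich", "High"), ("Cheap", "Low")]
y = ["Long", "Long", "Long", "Long", "Short", "Short", "Short", "Short"]

probs = predict_proba(("Cheap", "High"), *fit(X, y))
print(probs)
```

If every class assigns probability $0$ to the test example, the evidence is $0$ and the final division fails; this sketch does not guard against that case.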

# Example

Here is a hypothetical trading example for equities
- There are two categorical features
    - Valuation: possible values $\{ \text{Rich}, \text{Fair}, \text{Cheap} \}$
        - Is the current stock price expensive (Rich), fairly priced (Fair), or inexpensive (Cheap) ?
    - Yield: possible values $\{ \text{High}, \text{Low} \}$ 
        - Is the dividend yield of the stock desirable (High) or undesirable (Low) ?
- Target: An Action with possible values $\{ \text{Long}, \text{Short}, \text{Neutral} \}$
    - What should our position be ?
    

We are given a number of examples (on which to train).

Our Classification task is 
- Given an equity (test example) with values for the two features Valuation and Yield
- Decide what our position (Long/Short/Neutral) should be


Here are our training examples

In [4]:
d_df = pd.read_csv("valuation_yield_action.csv")
target_name = "Action"
d_df

Unnamed: 0,Valuation,Yield,Action
0,Cheap,High,Long
1,Cheap,High,Long
2,Cheap,High,Long
3,Cheap,High,Neutral
4,Rich,Low,Short
5,Rich,Low,Short
6,Rich,Low,Short
7,Rich,Low,Short
8,Rich,Low,Neutral
9,Cheap,Low,Neutral


And a quick look at the data, sliced by Action

In [5]:
grouped_by_target = d_df.groupby(target_name)
for gp in grouped_by_target.groups.keys():
    print(gp, "\n")
    print(grouped_by_target.get_group(gp).head())
    print("\n\n")
 

Long 

   Valuation Yield Action
0      Cheap  High   Long
1      Cheap  High   Long
2      Cheap  High   Long
10     Cheap   Low   Long
13      Rich  High   Long



Neutral 

   Valuation Yield   Action
3      Cheap  High  Neutral
8       Rich   Low  Neutral
9      Cheap   Low  Neutral
12     Cheap   Low  Neutral
15      Rich  High  Neutral



Short 

   Valuation Yield Action
4       Rich   Low  Short
5       Rich   Low  Short
6       Rich   Low  Short
7       Rich   Low  Short
11     Cheap   Low  Short





Looks like we
- Go Long if the stock is Cheap (Valuation) and High (Yield)
- Go Short if the stock is Rich (expensive Valuation) and Low (Yield)

Here's the empirical distribution of the training examples

In [6]:
d_df["dummy"] = 1  # Need to aggregate on something
t = d_df.pivot_table(index=target_name, columns=["Valuation", "Yield"], values="dummy", aggfunc=["count"],
                 margins=True)

t

Unnamed: 0_level_0,count,count,count,count,count,count,count
Valuation,Cheap,Cheap,Fair,Fair,Rich,Rich,All
Yield,High,Low,High,Low,High,Low,Unnamed: 7_level_2
Action,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3
Long,3.0,2.0,2.0,1.0,1.0,,9
Neutral,1.0,2.0,1.0,1.0,1.0,1.0,7
Short,,1.0,1.0,2.0,1.0,4.0,9
All,4.0,5.0,4.0,4.0,3.0,5.0,25


This gives us everything we need for the Naive Bayes algorithm
- These are counts
- We can easily turn the counts into unconditional probabilities by dividing by total number of examples
- Will leave them as counts for now

Let's parse this table:
- Columns: $\cnt{\y | \x}$
    - A column (defined by concrete values for each of the two attributes)
    - Defines a distribution over the target (Action)
- Column Sum: $\cnt{\x} = \sum_{a \in \text{Action}}{ \cnt{ \x | a} }$
    - Total number of examples with attribute pair $\x$
- Rows: $\cnt{ \x | \y }$
    - A row (defined by a concrete value for the Action)
    - Defines a distribution over the attribute pairs for which this action is taken
- Row sums: $\cnt{a} = \sum_{\x} { \cnt{\x|a } }$
    - Total number of examples with Action $a$

Let's simplify the table by looking at the marginal with respect to each attribute
- Distribution over a single attribute rather than the pair

First, by Valuation

In [7]:
# Single feature (Valuation), rather than pair
d_df.drop(columns=["dummy"]).pivot_table(index=target_name, columns=["Valuation"], aggfunc=["count"], 
                  fill_value=0, margins=True)

Unnamed: 0_level_0,count,count,count,count
Unnamed: 0_level_1,Yield,Yield,Yield,Yield
Valuation,Cheap,Fair,Rich,All
Action,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3
Long,5,3,1,9
Neutral,3,2,2,7
Short,1,3,5,9
All,9,8,8,25


And by Yield

In [8]:
# Single feature (Yield), rather than pair
d_df.drop(columns=["dummy"]).pivot_table(index=target_name, columns=["Yield"], aggfunc=["count"], 
                  fill_value=0, margins=True)

Unnamed: 0_level_0,count,count,count
Unnamed: 0_level_1,Valuation,Valuation,Valuation
Yield,High,Low,All
Action,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3
Long,6,3,9
Neutral,3,4,7
Short,2,7,9
All,11,14,25


And the target (Action) distribution

In [9]:
t.loc[:, idx["count","All",:]]

Unnamed: 0_level_0,count
Valuation,All
Yield,Unnamed: 1_level_2
Action,Unnamed: 1_level_3
Long,9
Neutral,7
Short,9
All,25


Here is the target distribution as probabilities rather than counts

In [10]:
# Use .iloc[0] for positional access; plain [0] on a Series is deprecated in recent pandas
num_examples = t.loc["All", idx["count","All",:]].iloc[0]

print("There are {e:d} training examples".format(e=int(num_examples)) )

# Class probabilities
t.loc[:, idx["count","All",:]]/t.loc["All", idx["count","All",:]]

There are 25 training examples


Unnamed: 0_level_0,count
Valuation,All
Yield,Unnamed: 1_level_2
Action,Unnamed: 1_level_3
Long,0.36
Neutral,0.28
Short,0.36
All,1.0


## Why not just use the empirical distribution ?

At this point, it's fair to ask: 
- Given a test example $\x = \x'$
- We can read $\pr{\y = \y'| \x = \x'}$ *directly from the table*

Why do we need Naive Bayes ?

**Answer**: Because the table can be big !
- One entry for every possible combination of features

The "naive" conditional independence assumption allows us
- To have a single vector for each feature $\x_1, \x_2$ individually: total $|\x_1| + |\x_2|$
- Rather than *combinations* of features $\x_1, \x_2$: total $|\x_1| * |\x_2|$
    - In this case $n=2$ but in general: total $\prod_{j=1}^n{|\x_j|}$

This is usually much smaller.

Moreover:
- There may be **no** training examples for some combination of features $\x'$ that shows up as a test example
- The Naive Bayes method allows us to *interpolate* for such a test example
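
As a sketch of this point, the code below copies the counts from the single-feature tables and the target distribution above into plain dictionaries (the variable and function names are hypothetical). In the pair table, the combination (Rich, Low) has no Long examples, yet the naive product still assigns Long a small non-zero probability.

```python
# Counts copied from the Action, Valuation and Yield tables above
class_counts = {"Long": 9, "Neutral": 7, "Short": 9}
valuation_counts = {
    "Long": {"Cheap": 5, "Fair": 3, "Rich": 1},
    "Neutral": {"Cheap": 3, "Fair": 2, "Rich": 2},
    "Short": {"Cheap": 1, "Fair": 3, "Rich": 5},
}
yield_counts = {
    "Long": {"High": 6, "Low": 3},
    "Neutral": {"High": 3, "Low": 4},
    "Short": {"High": 2, "Low": 7},
}
m = sum(class_counts.values())

def naive_posterior(valuation, yld):
    """Posterior over Actions using only per-feature counts."""
    scores = {}
    for action, n in class_counts.items():
        prior = n / m
        # Naive assumption: product of the per-feature conditional probabilities
        likelihood = (valuation_counts[action][valuation] / n) * (yield_counts[action][yld] / n)
        scores[action] = prior * likelihood
    evidence = sum(scores.values())
    return {a: s / evidence for a, s in scores.items()}

posterior = naive_posterior("Rich", "Low")
for action, p in posterior.items():
    print(f"{action}: {p:.3f}")
```

The most probable Action for (Rich, Low) is still Short, in line with the empirical table, but the result needs only the per-feature counts.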

# Drawbacks

## The zero frequency problem

In order for Naive Bayes to work we must have
$$
\pr{\x_j = \x'_j | \y = \y'}
$$
for *all* possible values of $\x'_j$ that we will encounter during *inference* (test) time.

There is no guarantee that we will see each of these values in the training set.
- Especially when the training set is small

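If we don't, the estimated probability is $0$. This is not only probably wrong: a single zero factor forces the entire product $\prod_j \pr{\x_j = \x'_j | \y = \y'}$, and hence the posterior for class $\y'$, to $0$ regardless of the other features.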

**Note**

The Zero Frequency problem is different than the issue in the previous section
- Here, the issue is: Not seeing a particular value for a single feature
- Previous: the issue was not seeing a particular value for the *entire vector* in training

### Additive smoothing

[additive smoothing](https://en.wikipedia.org/wiki/Additive_smoothing)

There is a simple solution to the zero frequency problem
- Artificially inflate all counts by some parameter $\alpha$.

This eliminates zero counts at the cost of biasing all counts.

Note that when converting counts to probabilities
- We have increased the count of each of the $|C|$ classes by $\alpha$
- So the total count for the denominator is $m + |C| * \alpha$
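
A minimal sketch of additive smoothing follows. The helper name `smoothed_probabilities` is hypothetical; it adds $\alpha$ to every count and enlarges the denominator by $\alpha$ times the number of distinct values, so the result still sums to $1$. The class counts are those of the example above; the second set of counts, which contains a zero, is made up for illustration.

```python
def smoothed_probabilities(counts, alpha=1.0):
    """Convert counts to probabilities after adding alpha to every count."""
    k = len(counts)               # number of distinct values being smoothed
    total = sum(counts.values())  # original total count
    return {v: (c + alpha) / (total + k * alpha) for v, c in counts.items()}

# Class counts from the example: denominator becomes m + |C| * alpha = 25 + 3
class_counts = {"Long": 9, "Neutral": 7, "Short": 9}
smoothed_classes = smoothed_probabilities(class_counts, alpha=1.0)
print(smoothed_classes)

# Hypothetical per-class counts for one feature, with a value never seen in training
sparse_counts = {"Cheap": 5, "Fair": 0, "Rich": 1}
smoothed_sparse = smoothed_probabilities(sparse_counts, alpha=1.0)
print(smoothed_sparse)
```

The same function applies to a conditional distribution of a feature within a class; in that case the number of distinct values is $|\x_j|$ rather than $|C|$.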


### Replace empirical distributions by functional forms

Another way to address the zero frequency problem is to avoid the empirical distribution
of training data (the counts)
- assume the features come from a parameterized distribution
    - Bernoulli distribution for binary variables
    - Multinomial distribution for variables with more than two classes
    
This also has the advantage of fewer parameters: one parameter per feature.

## Assumption of conditional independence

This is a questionable assumption.

In its defense: 
- If $n$ (the number of features) is very large
    - The conditional independence assumption is more likely to hold.

# Advantages

- Very simple: just counting !
    - Easy and powerful Baseline Model to use in your Recipe for Machine Learning


# Continuous variables for features

The above discussion was limited to features that could take on discrete values.

We now discuss how to include features that are continuous variables.

## Discretizing continuous variables

The simplest way to deal with a continuous feature $\x_j$ is to turn it into one or more discrete variables.
- Define a threshold $t_j$ and replace the continuous $\x_j$ with a binary variable
    - $\text{Is}_{\x_j < t_j}$
- Define multiple intervals on the range of $\x_j$  and create a binary variable per interval
    - $\text{Is}_{ t_{j, l-1} \le \x_j < t_{j,l}}$
- The thresholds are a hyper-parameter: can search for optimal
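
The interval approach can be sketched with the standard library alone. The function names `discretize` and `one_hot` and the threshold values are hypothetical; intervals are treated as half-open so that each value falls into exactly one of them.

```python
from bisect import bisect_right

def discretize(value, thresholds):
    """Index of the half-open interval [t_{l-1}, t_l) containing value.

    thresholds must be sorted; values below the first threshold map to 0,
    values at or above the last threshold map to len(thresholds).
    """
    return bisect_right(thresholds, value)

def one_hot(index, n_bins):
    """One binary indicator variable per interval."""
    return [1 if l == index else 0 for l in range(n_bins)]

thresholds = [1.0, 2.0]  # hypothetical thresholds, giving 3 intervals
for v in [0.5, 1.0, 2.5]:
    bin_index = discretize(v, thresholds)
    print(v, bin_index, one_hot(bin_index, len(thresholds) + 1))
```

Once discretized, the indicator variables can be counted exactly like any other discrete feature.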
    
Unfortunately the ordering relationship between continuous values is lost
- We have made them categorical

We see this same technique used in Decision Trees, so it's worth mentioning.

## Replacing empirical distributions by functional forms for continuous variables

Another technique for continuous variables
- Replace the discrete empirical distribution by
a functional form
    - Gaussian

This has the advantage that many distributions are characterized by a small number of parameters
- Gaussian: 2 per feature -- a mean and standard deviation 

This also deals with the zero frequency problem by eliminating the empirical distribution.
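
As a rough sketch of the Gaussian approach for a single continuous feature: estimate a mean and standard deviation per class, then use the Gaussian density in place of the counted conditional probability. The data values and names below are hypothetical, and `statistics.stdev` is the sample standard deviation (one of several reasonable choices).

```python
import math
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical values of one continuous feature, grouped by class
values_by_class = {
    "Long": [4.1, 3.8, 4.5, 4.0],
    "Short": [1.2, 0.9, 1.5, 1.1],
}

# Two parameters per feature per class: mean and standard deviation
params = {c: (mean(v), stdev(v)) for c, v in values_by_class.items()}

# Likelihood of a new value under each class, replacing the empirical count
x_new = 3.5
likelihoods = {c: gaussian_pdf(x_new, mu, sigma) for c, (mu, sigma) in params.items()}
print(likelihoods)
```

These likelihoods are densities rather than probabilities, but they play the same role in the product: they are multiplied by the prior and normalized across classes. A value never seen in training still receives a non-zero density.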

In [11]:
print("Done")

Done
