# The naïve Bayes classifier

Max Collard, May 2021

## 5: "Naïveté" is assumed independence

### Simplifying the dependence graph by assuming independence

We ran up against the problem that dependencies between our neurons ended up making our model's complexity *blow up very quickly*.

<img src="images/5/IMG_0144.PNG" />

The first-pass solution is to ask: what if we just **assume all our neurons are independent**? If we did that, we could immediately delete all of the edges between the neurons' spike counts (the blue dots):

<img src="images/5/IMG_0142.PNG" />

Even for just 5 neurons, this saves us *a lot* of dependencies (10, one for each pair of neurons) to model, and the savings only get larger with the number of neurons in our model.
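
To get a feel for how fast the savings grow, here is a minimal sketch that counts the edges between every pair of neurons. It only counts *pairwise* dependencies (one edge per pair), and the function name `pairwise_dependencies` is made up for this illustration.

```python
from math import comb

def pairwise_dependencies(n_neurons: int) -> int:
    """Number of distinct neuron pairs, i.e. edges in a fully connected graph."""
    return comb(n_neurons, 2)

for n in (5, 10, 100):
    print(n, pairwise_dependencies(n))
# prints:
# 5 10
# 10 45
# 100 4950
```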

This is precisely what makes the **naïve Bayes classifier** so "naïve": it *builds in* the assumption that the variables being used for classification (here, the individual spike counts from the recorded neurons) are all **independent**.

> Is this a good assumption? **No**, absolutely not. But *all models are wrong; some are useful*. And the assumption of independence can serve as a good "starting point" before building larger, more complicated models that incorporate some dependencies: if you get decoding performance in the "naïve" case, it should *get better on average* by incorporating more and more information about the interconnection between neurons (although [there are caveats to this](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff)).

### The naïve Bayes classifier

Our classifier was based on picking the larger of the two "estimated" conditional probabilities (the probability that the mouse went left or right **given the knowledge of the data**), based on some assumed model $\mathcal{M}$ of what generated the data $\mathcal{D}$:

$$
\begin{eqnarray*}
\mathrm{Pr}(\textrm{direction} = \textrm{Left} \mid \mathcal{D}) & = & \mathrm{Pr}(\textrm{direction} = \textrm{Left})\,\frac{\mathcal{M}(\mathcal{D} \mid \textrm{direction} = \textrm{Left})}{\mathrm{Pr}(\mathcal{D})} \\
\mathrm{Pr}(\textrm{direction} = \textrm{Right} \mid \mathcal{D}) & = & \mathrm{Pr}(\textrm{direction} = \textrm{Right})\,\frac{\mathcal{M}(\mathcal{D} \mid \textrm{direction} = \textrm{Right})}{\mathrm{Pr}(\mathcal{D})} \\
\end{eqnarray*}
$$

But $\mathcal{D}$ was really the combination of a bunch of individual data points from each neuron we recorded:

$$ \mathcal{D} = (N_1 = n_1) \cap (N_2 = n_2) \cap \ldots \cap (N_5 = n_5) $$

where each capital $N$ is the *random variable* corresponding to the observation of the spike count in *any possible universe*, and each lowercase $n$ is the *actual data* we know about *the universe we know we live in*. Each one of the terms on the right side is a set of universes corresponding to the recording we made in a single neuron, and the *intersection of all of them* is the set of universes where *all of our simultaneous recordings* happened.

If we **assume that all of these are independent**, then by the definition of independence we can simplify our calculation of the **size** (or **probability**) of these sets:

$$
\begin{eqnarray*}
\mathrm{Pr}(\mathcal{D}) & = & \mathrm{Pr}((N_1 = n_1) \cap (N_2 = n_2) \cap \ldots \cap (N_5 = n_5)) \\
& = & \mathrm{Pr}(N_1 = n_1) \cdot \mathrm{Pr}(N_2 = n_2) \cdot \ldots \cdot \mathrm{Pr}(N_5 = n_5)
\end{eqnarray*}
$$

We *don't know* the true "ruler" $\mathrm{Pr}$, though; instead, we have to use our cruddy model $\mathcal{M}$. The same "rules" for independence still apply, so we define our model by just replacing one ruler $\mathrm{Pr}$ with the other $\mathcal{M}$:

$$
\begin{eqnarray*}
\mathcal{M}(\mathcal{D}) & = & \mathcal{M}(N_1 = n_1) \cdot \mathcal{M}(N_2 = n_2) \cdot \ldots \cdot \mathcal{M}(N_5 = n_5)
\end{eqnarray*}
$$

This works for conditioning, too, as long as we assume the neurons are independent *given the direction*; so, the exact thing we need to plug into Bayes' rule above is actually, for example,

$$
\begin{eqnarray*}
\mathcal{M}(\mathcal{D} \mid \textrm{direction} = \textrm{Left}) & = & \mathcal{M}(N_1 = n_1 \mid \textrm{direction} = \textrm{Left}) \cdot \ldots \cdot \mathcal{M}(N_5 = n_5 \mid \textrm{direction} = \textrm{Left})
\end{eqnarray*}
$$

It's convenient to come up with a simpler notation for the big product on the right. Just like the big sum is written $\sum$ (capital sigma, for **s**um), the big product is written $\prod$ (capital pi, $\Pi$, for **p**roduct):

$$
\begin{eqnarray*}
\mathcal{M}(\mathcal{D} \mid \textrm{direction} = \textrm{Left}) & = & \prod_i \mathcal{M}(N_i = n_i \mid \textrm{direction} = \textrm{Left})
\end{eqnarray*}
$$

This looks complicated, but it is actually great news! This means that *if we can model each neuron on its own*, then **we can model all of the neurons together**! But we already built the model for a single neuron: it was just the **Poisson process model** from the earlier example! So if we want to come up with the model for how a whole bucket of neurons behaves, under these "naïve" assumptions, we just have to generate the predictions from a whole bucket of single-neuron models and then *multiply them all together*.

> This is exactly what we did when we built the axioms for $\mathrm{Pr}$ originally: we are *building a model that is convenient for* **us**.
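
As a rough illustration of "multiply them all together", here is a minimal sketch of the product for one direction. It assumes each neuron's spike count follows a Poisson distribution with its own expected count; the function names and all of the numbers are hypothetical and are not taken from the recordings.

```python
from math import exp, factorial

def poisson_pmf(n: int, rate: float) -> float:
    """Probability of observing n spikes when the expected count is `rate`."""
    return rate ** n * exp(-rate) / factorial(n)

def naive_likelihood(counts, rates):
    """M(D | direction): product of the single-neuron Poisson probabilities."""
    likelihood = 1.0
    for n, rate in zip(counts, rates):
        likelihood *= poisson_pmf(n, rate)
    return likelihood

# Hypothetical numbers, for illustration only
counts = [3, 0, 7, 2, 5]                 # observed spike counts n_1 .. n_5
rates_left = [2.5, 0.4, 6.0, 1.8, 4.2]   # assumed expected counts on "Left" trials
print(naive_likelihood(counts, rates_left))
```

Each factor is a number between 0 and 1, so the product shrinks as more neurons are added.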

Plugging this into the formula for the classification probabilities above,

$$
\begin{eqnarray*}
\mathrm{Pr}(\textrm{direction} = \textrm{Left} \mid \mathcal{D}) & = & \mathrm{Pr}(\textrm{direction} = \textrm{Left})\,\frac{\mathcal{M}(\mathcal{D} \mid \textrm{direction} = \textrm{Left})}{\mathrm{Pr}(\mathcal{D})} \\
\mathrm{Pr}(\textrm{direction} = \textrm{Right} \mid \mathcal{D}) & = & \mathrm{Pr}(\textrm{direction} = \textrm{Right})\,\frac{\mathcal{M}(\mathcal{D} \mid \textrm{direction} = \textrm{Right})}{\mathrm{Pr}(\mathcal{D})} \\
\end{eqnarray*}
$$

we get the formula for the **naïve Bayes classification probabilities**:

$$
\begin{eqnarray*}
\mathrm{Pr}(\textrm{direction} = \textrm{Left} \mid \mathcal{D}) & = & \mathrm{Pr}(\textrm{direction} = \textrm{Left})\,\frac{\prod_i \mathcal{M}(N_i = n_i \mid \textrm{direction} = \textrm{Left})}{\mathrm{Pr}(\mathcal{D})} \\
\mathrm{Pr}(\textrm{direction} = \textrm{Right} \mid \mathcal{D}) & = & \mathrm{Pr}(\textrm{direction} = \textrm{Right})\,\frac{\prod_i \mathcal{M}(N_i = n_i \mid \textrm{direction} = \textrm{Right})}{\mathrm{Pr}(\mathcal{D})} \\
\end{eqnarray*}
$$

As a callback to an earlier point, the denominators $\mathrm{Pr}(\mathcal{D})$ **don't actually matter at all** for picking the larger of the two values, because both values are divided by the same positive number. For our final formula, we will just *take them out completely*. So, here is the final algorithm for naïve Bayes.

> #### The naïve Bayes classifier
> Compute the scores
>
> $$
\begin{eqnarray*}
S_\textrm{Left} & = & \mathrm{Pr}(\textrm{direction} = \textrm{Left})\,\prod_i \mathcal{M}(N_i = n_i \mid \textrm{direction} = \textrm{Left}) \\
S_\textrm{Right} & = & \mathrm{Pr}(\textrm{direction} = \textrm{Right})\,\prod_i \mathcal{M}(N_i = n_i \mid \textrm{direction} = \textrm{Right}) \\
\end{eqnarray*}
$$
>
> Whichever one is larger is the best choice for which direction the mouse went.
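
One optional variation, offered as a sketch and not as part of the algorithm above: because $\log$ is an increasing function, comparing $\log S_\textrm{Left}$ with $\log S_\textrm{Right}$ picks the same winner as comparing the scores themselves, and the logarithm turns the big product into a big sum. This assumes every factor is strictly positive, so that its logarithm exists.

$$
\begin{eqnarray*}
\log S_\textrm{Left} & = & \log \mathrm{Pr}(\textrm{direction} = \textrm{Left}) + \sum_i \log \mathcal{M}(N_i = n_i \mid \textrm{direction} = \textrm{Left}) \\
\log S_\textrm{Right} & = & \log \mathrm{Pr}(\textrm{direction} = \textrm{Right}) + \sum_i \log \mathcal{M}(N_i = n_i \mid \textrm{direction} = \textrm{Right}) \\
\end{eqnarray*}
$$

Adding many moderately sized numbers is easier on a computer than multiplying many tiny ones.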


Let's implement it!

In [None]:
# TODO Need to implement it
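
The cell above is still a placeholder. One possible way to fill it in is sketched here, under loudly stated assumptions: each neuron is modeled as Poisson with a strictly positive expected count per direction, the expected counts and priors are made-up numbers, and the names `poisson_log_pmf`, `log_score`, and `classify` are hypothetical. It compares the *logarithms* of the two scores, which picks the same winner because $\log$ is increasing.

```python
from math import lgamma, log

def poisson_log_pmf(n, rate):
    """Log of the Poisson probability of n spikes; requires rate > 0."""
    return n * log(rate) - rate - lgamma(n + 1)

def log_score(counts, rates, prior):
    """log S = log Pr(direction) + sum_i log M(N_i = n_i | direction)."""
    return log(prior) + sum(poisson_log_pmf(n, r) for n, r in zip(counts, rates))

def classify(counts, rates_by_direction, priors):
    """Return the direction with the larger score, plus all of the log-scores."""
    scores = {
        direction: log_score(counts, rates, priors[direction])
        for direction, rates in rates_by_direction.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical expected spike counts for 5 neurons (not real data)
rates_by_direction = {
    "Left": [2.5, 0.4, 6.0, 1.8, 4.2],
    "Right": [1.0, 0.5, 3.0, 4.0, 4.0],
}
priors = {"Left": 0.5, "Right": 0.5}  # assumed: both directions equally likely

choice, scores = classify([3, 0, 7, 2, 5], rates_by_direction, priors)
print(choice)  # prints: Left
print(scores)
```

In a real analysis the expected counts and the priors would have to be estimated from training trials; here they are simply typed in by hand.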