<a href="https://colab.research.google.com/github/jfogarty/machine-learning-intro-workshop/blob/master/misc/bayes-rule.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Some Notes on Bayes Rule

These notes start from [A different take on Bayes Rule](https://towardsdatascience.com/a-different-take-on-bayes-rule-e303c1d7d5f6) in [towardsdatascience.com](https://towardsdatascience.com) by **Andy Patterson**.

Most people have seen demonstrations of each probability distribution in Bayes Rule. Most people reading this have been formally introduced to the terms “**posterior**”, “**prior**”, and “**likelihood**”. If not, even better!

I think that viewing Bayes Rule as an incremental learning rule would be a novel perspective for many. Further, I believe this perspective gives much better intuition for why we use the terms “posterior” and “prior”. Finally, I think it helps explain why I don’t believe [Bayesian statistics](https://www.analyticsvidhya.com/blog/2016/06/bayesian-statistics-beginners-simple-english/) leads to any more inductive bias than [Frequentist statistics](https://www.statisticshowto.datasciencecentral.com/frequentist-statistics/).

<figure>
  <center><img src="https://www.statisticshowto.datasciencecentral.com/wp-content/uploads/2016/04/frequentists_vs_bayesians.png" />
  </center>
</figure>


## Notation

Before diving too deeply into [Bayes Rule](https://en.wikipedia.org/wiki/Bayes%27_theorem) itself, it is important to agree on notation. $\theta$ refers to our model. A model is identified by a class and a set of parameters. As an example, our model class could be the Gaussian distribution. The parameters of this model are the [**mean**](https://en.wikipedia.org/wiki/Mean) and the [**variance**](https://en.wikipedia.org/wiki/Variance).


Another example model class is a neural network with one hidden layer and 256 nodes. The parameters of this model are the weights associated with each node of the network.

$D$ refers to a set of data points. In the Gaussian example above, $D$ is a set of scalar values.

<figure>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/gaussian.png?raw=1" />
  <figcaption>X is the set of parameters mu and sigma-squared referring to the mean and variance respectively.</figcaption></center>
</figure>

Here the red line is the standard Gaussian curve we've all come to know and love.
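To make the notation concrete, here is a minimal sketch of one model $\theta$ from the Gaussian class and a small data set $D$. The function name `gaussian_pdf`, the dictionary layout for `theta`, and the choice of five samples are my own assumptions for illustration, not anything prescribed by the notation.

```python
import math
import random


def gaussian_pdf(x, mean, variance):
    """Density of a Gaussian with the given mean and variance, evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


# theta: one concrete model from the Gaussian class (here, the standard Gaussian).
theta = {"mean": 0.0, "variance": 1.0}

# D: a set of scalar data points, drawn from theta itself purely for illustration.
rng = random.Random(0)
D = [rng.gauss(theta["mean"], math.sqrt(theta["variance"])) for _ in range(5)]

# Height of the curve at its center, which is 1 / sqrt(2 * pi) for this model.
peak = gaussian_pdf(0.0, **theta)
```

Changing the two entries of `theta` picks a different model from the same class; the class itself (Gaussian) stays fixed.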


## Terminology

To be perfectly honest, I hate jargon. I understand it allows for faster and more precise communication when both parties share a common vocabulary, but I find that over-reliance on jargon leads to forgetting what the information-dense terminology even means. That said, let’s get into some jargon!


### Posterior

$$
  P(\theta\,|\,D)
$$

The probability of the model given the data. In the Bayesian world, we are not dealing with a single instantiated model. We deal with a probability distribution over models. This distribution says “given the data I have observed, I think the probability that your model is θ is 22%.” We can query this distribution with many different models to ask how likely each is given the observed data, or we could ask the distribution what the most likely model is.
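As a rough sketch of what "querying" the posterior can look like, the snippet below keeps a small, hypothetical set of candidate models (Gaussians with variance 1 and different means), starts from a uniform prior, and computes the posterior over them for some made-up observations. The candidate list, the data, and all names are assumptions for illustration.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a Gaussian with the given mean and variance, evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


# Hypothetical candidate models: Gaussians with variance 1 and these means.
candidate_means = [-2.0, -1.0, 0.0, 1.0, 2.0]
prior = {m: 1.0 / len(candidate_means) for m in candidate_means}

# Made-up observations, for illustration only.
D = [0.8, 1.3, 0.9, 1.1]

# Bayes Rule: the posterior is proportional to likelihood times prior.
unnormalized = {
    m: prior[m] * math.prod(gaussian_pdf(x, m, 1.0) for x in D)
    for m in candidate_means
}
evidence = sum(unnormalized.values())  # P(D), summed over the candidates
posterior = {m: u / evidence for m, u in unnormalized.items()}

# Query a single model, or ask for the most likely one.
p_mean_zero = posterior[0.0]
most_likely_mean = max(posterior, key=posterior.get)
```

With a continuous family of models the sum becomes an integral, but the two kinds of query stay the same.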

### Prior

$$
  P(\theta)
$$

The probability of a model. One of the most contentious parts of Bayesian probability is the inclusion of the prior distribution. In machine learning terms, you can think of this as a form of regularizer. The prior allows you to specify the probability of a particular model, regardless of what data you observe. This allows you to say “I know the mean height for male humans is not greater than eight feet, assign zero probability to models that make that claim.”
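The eight-foot claim can be sketched as a prior that is zero for some models. The candidate means, the likelihood numbers, and the helper `posterior_from` below are all made up for illustration; the point is only that a model with zero prior probability stays at zero no matter how strongly the data favors it.

```python
# Hypothetical candidate models: possible values (in feet) for the mean height.
candidate_means = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

# The prior: zero probability for any mean above eight feet, uniform otherwise.
allowed = [m for m in candidate_means if m <= 8.0]
prior = {m: (1.0 / len(allowed) if m <= 8.0 else 0.0) for m in candidate_means}


def posterior_from(likelihood, prior):
    """Posterior proportional to likelihood times prior, over the same models."""
    unnormalized = {m: likelihood[m] * prior[m] for m in prior}
    total = sum(unnormalized.values())
    return {m: u / total for m, u in unnormalized.items()}


# Made-up likelihood values that strongly favor a nine-foot mean.
likelihood = {4.0: 0.01, 5.0: 0.02, 6.0: 0.05, 7.0: 0.10, 8.0: 0.20, 9.0: 0.90, 10.0: 0.30}
posterior = posterior_from(likelihood, prior)

# The nine-foot and ten-foot models are ruled out regardless of the likelihood.
ruled_out = [m for m, p in posterior.items() if p == 0.0]
```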

### Likelihood

$$
  P(D\,|\,\theta\,)
$$

What would be the probability of observing this data if it were generated by my model? The likelihood allows us to ask what would happen if the world behaved according to our model. Would we have seen these particular samples?
Imagine we used a model that said there is a 99% chance that a random person is extremely rich. Then we sample a set of random humans from around the world and ask “what is the probability that we drew this set given our model of the world?” Because most humans are not extremely rich, far fewer than 99% of the humans in the set will be extremely rich. The likelihood of this model is quite low.
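Here is a small sketch of that comparison, assuming each person is an independent yes/no observation. The sample of 100 people with one extremely rich person, the alternative 1% model, and the function name `log_likelihood` are invented for illustration. I work with log-likelihoods because the raw products get tiny very quickly.

```python
import math


def log_likelihood(data, p_rich):
    """Log of P(D | theta) for independent yes/no observations."""
    return sum(math.log(p_rich if is_rich else 1.0 - p_rich) for is_rich in data)


# Made-up sample: 100 random people, 1 of whom is extremely rich.
D = [True] + [False] * 99

ll_99 = log_likelihood(D, 0.99)  # the "99% of people are rich" model from the text
ll_01 = log_likelihood(D, 0.01)  # a hypothetical alternative model

# The 99% model explains this sample far worse than the alternative.
worse_model = "99%" if ll_99 < ll_01 else "1%"
```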

## Incremental Learning

The first step towards viewing Bayes Rule as an incremental learning rule is recognizing the relationship between the posterior and the prior. By naively analyzing the words, ignoring the equation itself, one might guess that the prior comes first, followed by the posterior. Let’s add time indices to the equation to make this clearer.

$$
  P(\theta_{t+1}|D_t) = \frac{P(D_t | \theta_t)P(\theta_t)}{P(D_t)}
$$


It’s also useful to recall a property of the conditional probability. The probability of a random event is unchanged when it is conditioned on a non-random event.

$$
  P(A|B) = P(A)
$$

If $A$ is a random event but $B$ is not, then the probability of $A$ occurring does not change when it is conditioned on $B$.

Using this property, we can then make a final change to our Bayes Rule equation.

$$
  P(\theta_{t+1}|D_t) = \frac{P(D_t | \theta_t)P(\theta_t | D_{t-1})}{P(D_t)}
$$

The only thing that changed was the prior probability distribution.

Our prior distribution is now conditioned on the previous data that we have observed. This is a legal change to the prior because past data is no longer random.

To understand why past data is not random, imagine you are on time-step $t-1$. You’ve just been handed a data sample and you would like to know what the chances are that you received that particular data point. That question makes sense because your data was randomly drawn. You change your model accordingly and take a step forward. On time-step $t$, can the sample that you used at $t-1$ suddenly change? Nope. That sample is fixed in history forever; it is no longer a random variable.

$$
  P(\theta_{t+1}|D_t) = f(P(\theta_t | D_{t-1}))
$$

Writing Bayes Rule in this way, we can see the relationship between the posterior and the prior. After receiving a data point, we can step away from the prior estimate of our model and towards our new estimate, the posterior.
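The loop below is a minimal sketch of this incremental view, using a hypothetical model class of coins with a handful of possible head probabilities. All names, the candidate values, and the stream of flips are assumptions for illustration. Each step feeds the previous posterior back in as the prior. Under the extra assumption that flips are independent given the model, the result matches a single batch update on all the data.

```python
def bayes_update(prior, likelihood_fn, datum):
    """One step: turn P(theta | past data) into P(theta | past data and datum)."""
    unnormalized = {theta: likelihood_fn(datum, theta) * p for theta, p in prior.items()}
    evidence = sum(unnormalized.values())
    return {theta: u / evidence for theta, u in unnormalized.items()}


def coin_likelihood(datum, theta):
    """P(datum | theta), where theta is the probability of heads (datum == 1)."""
    return theta if datum == 1 else 1.0 - theta


# Hypothetical model class: coins with one of these head probabilities.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
belief = {theta: 1.0 / len(thetas) for theta in thetas}  # the starting prior

# Made-up stream of flips, one per time-step.
stream = [1, 1, 0, 1, 1, 1, 0, 1]

history = [belief]
for datum in stream:
    # The posterior from the previous step is the prior for this step.
    belief = bayes_update(belief, coin_likelihood, datum)
    history.append(belief)

# The same answer computed in one batch from the starting prior.
heads = sum(stream)
tails = len(stream) - heads
batch = {theta: (theta ** heads) * ((1.0 - theta) ** tails) / len(thetas) for theta in thetas}
total = sum(batch.values())
batch = {theta: b / total for theta, b in batch.items()}
```

`history` holds the belief after each time-step, so you can watch the distribution move away from the starting prior one data point at a time.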

Here is the stochastic gradient descent update rule for comparison:

$$
  w_{t+1} = w_t - \alpha D_w(X_t)
$$

We compute a new estimate for our weights, $w_{t+1}$, by taking a step of size $\alpha$ away from our previous weights, $w_t$.
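For comparison with the Bayesian loop, here is a sketch of the same "previous estimate in, new estimate out" shape for stochastic gradient descent. The squared-error loss, the step size, the starting weight, and the data stream are all assumptions I picked to keep the example to a single scalar weight.

```python
def sgd_step(w, x, alpha):
    """One update: the new weight is the old weight minus alpha times the gradient."""
    # Gradient of the assumed loss 0.5 * (w - x) ** 2 with respect to w.
    gradient = w - x
    return w - alpha * gradient


w = 0.0      # starting point for the weight
alpha = 0.1  # step size

# Made-up stream of samples, one per time-step.
stream = [2.0, 1.5, 2.5, 2.0, 1.8, 2.2]

trajectory = [w]
for x in stream:
    # The previous weight is the input to the next update.
    w = sgd_step(w, x, alpha)
    trajectory.append(w)
```

Just like the Bayesian update, the loop cannot start until `w` has been given some initial value.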

## Starting Points

A highly controversial point of Bayesian statistics is the choice of the prior distribution. When viewing Bayes Rule as an incremental learning rule, we can easily see that this choice isn’t unique to Bayesian statistics. Any incremental learning rule *must* pick a starting point, an initial set of weights or parameters. Bayes Rule is no different.

I point this out simply because the online learning community seems to take particular issue with Bayesian statistics. There are many arguments against Bayesian statistics (and many for!), but I think that the inductive bias argument is not a valid one.

## Conclusion

Taking an incremental learning perspective with Bayes Rule can make sense out of the choice of terminology associated with it. Understanding this perspective also opens the door to a deeper understanding of Bayesian Inference and Bayesian Optimization.

### End of notebook.