# Probabilistic Machine Learning

By: Chengyi (Jeff) Chen

---
## Introduction

The purpose of these notes is to connect ideas across the frequentist, Bayesian, and probabilistic machine learning vernaculars. I'm in no way an expert on the philosophical and practical differences between the frequentist and Bayesian perspectives, nor am I close to being good at mathematics; here's just what I've gathered from my readings, subject to my own interpretation. Throughout, I'll be drawing ideas from computer programming as well. Starting from first principles, we ask: "What are we even trying to do in machine learning?" Before we distinguish between supervised, unsupervised, and semi-supervised learning, here's the general ML setting:

Given: A matrix of observed training data $\mathbf{X}_{\text{train}} = \{ \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_N \}$ consisting of independent samples generated from a true data distribution $f(\mathcal{X})$, where $\mathbf{x} \in \mathcal{X}$ (the set of observable data values).

Objective: Learn a probabilistic model $p(\mathcal{X}, \mathcal{Z} ; \Theta = \theta)$ from $\mathbf{X}_{\text{train}}$ to approximate $f(\mathcal{X})$, where $\mathbf{z} \in \mathcal{Z}$ is a set of latent / unobserved random variables, included because we make no assumption that the observable dataset contains all information about the system. This probabilistic model is often called the **complete data likelihood**. $p(\mathcal{X} ; \Theta = \theta) = \int_{\mathbf{z} \in \mathcal{Z}} p(\mathcal{X}, \mathcal{Z} = \mathbf{z}; \Theta = \theta) d\mathbf{z}$ is then called the **incomplete data likelihood** / **evidence** / **marginal likelihood** (because we marginalized out $\mathcal{Z}$ to keep only $\mathcal{X}$). $\Theta = \theta$ are fixed parameters ("$;$" is used instead of "$\vert$" in the conditioning on $\theta$ to indicate that it is a "frequentist" fixed parameter and not a "Bayesian" random variable). Furthermore, it's called a likelihood function because it is viewed as a function of $\Theta = \theta$, the thing we're conditioning on, rather than of $\mathcal{X}$ (fixed because it's the data provided) or $\mathcal{Z}$ (unobserved).
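As a small illustration of the difference between the complete and incomplete data likelihoods, here is a sketch for a two-component Gaussian mixture, where the latent $\mathbf{z}$ is a discrete component label, so the integral over $\mathcal{Z}$ becomes a sum. The mixture parameters in `THETA`, the toy data, and all function names are hypothetical choices made for this example only.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical fixed parameters theta: mixing weights, means, standard deviations.
THETA = {"weights": [0.3, 0.7], "means": [-1.0, 2.0], "stds": [1.0, 0.5]}

def complete_likelihood(x, z, theta):
    """p(X=x, Z=z; theta) = p(Z=z; theta) * p(X=x | Z=z; theta)."""
    return theta["weights"][z] * normal_pdf(x, theta["means"][z], theta["stds"][z])

def incomplete_likelihood(x, theta):
    """p(X=x; theta): sum the complete likelihood over every value of the latent z."""
    return sum(complete_likelihood(x, z, theta) for z in range(len(theta["weights"])))

def log_marginal_likelihood(xs, theta):
    """Log of the incomplete data likelihood for independent samples xs."""
    return sum(math.log(incomplete_likelihood(x, theta)) for x in xs)

x_train = [-1.2, 1.9, 2.3, 0.1]  # toy observations
print(log_marginal_likelihood(x_train, THETA))
```

Note that only `x_train` is needed to evaluate the incomplete data likelihood; the component labels are never observed.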

[Formulation](https://slideplayer.com/slide/9502040/): To learn $p(\mathcal{X}, \mathcal{Z} ; \Theta = \theta)$, we can start by trying to minimize a sort of "distance" between the probabilistic model that we're building and the true data distribution. Because we can only observe $\mathcal{X}$, we will minimize the distance between our incomplete data likelihood $p(\mathcal{X} ; \Theta = \theta)$ (instead of the complete data likelihood $p(\mathcal{X}, \mathcal{Z} ; \Theta = \theta)$) and the true data distribution $f(\mathcal{X})$. A common "distance" measure is the [KL Divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) ("distance" in quotes because the KL Divergence is not a true metric: it is asymmetric, $D_{KL}(P \vert\vert Q) \not= D_{KL}(Q \vert\vert P)$, and it does not satisfy the triangle inequality). $D_{KL}(f(\mathcal{X}) \vert \vert p(\mathcal{X};\Theta=\theta))$ measures how well [$p$ approximates $f$](https://stats.stackexchange.com/questions/111445/analysis-of-kullback-leibler-divergence):

\begin{align}
    \theta^* 
    &= \arg\underset{\theta \in \Theta}{\min} D_{KL}(f \vert \vert p) \\
    &= \arg\underset{\theta \in \Theta}{\min}\int_{\mathbf{x} \in \mathcal{X}} f(\mathcal{X}=\mathbf{x}) \log \frac{f(\mathcal{X}=\mathbf{x})}{p(\mathcal{X}=\mathbf{x} ; \Theta = \theta)} d\mathbf{x} \\
    &= \arg\underset{\theta \in \Theta}{\min}\mathbb{E}_{\mathbf{x} \sim f} [\log f(\mathcal{X}=\mathbf{x})] - \mathbb{E}_{\mathbf{x} \sim f} [\log p(\mathcal{X}=\mathbf{x} ; \Theta = \theta)] \\
    &= \arg\underset{\theta \in \Theta}{\min}-\mathbb{H}[f(\mathcal{X})] - \mathbb{E}_{\mathbf{x} \sim f} [\log p(\mathcal{X}=\mathbf{x} ; \Theta = \theta)] \\
    &= \arg\underset{\theta \in \Theta}{\max} \mathbb{E}_{\mathbf{x} \sim f} [\log p(\mathcal{X}=\mathbf{x} ; \Theta = \theta)] \\
    &\approx \arg\underset{\theta \in \Theta}{\max} \frac{1}{N}\sum_{\mathbf{x} \in \mathbf{X}_{\text{train}}} \log p(\mathcal{X}=\mathbf{x} ; \Theta = \theta) \\
    &= \arg\underset{\theta \in \Theta}{\max} \frac{1}{N}\sum_{\mathbf{x} \in \mathbf{X}_{\text{train}}} \log \int_{\mathbf{z} \in \mathcal{Z}} p(\mathcal{X}=\mathbf{x}, \mathcal{Z}=\mathbf{z} ; \Theta = \theta) d\mathbf{z}
\end{align}

The entropy term $\mathbb{H}[f(\mathcal{X})]$ is dropped because it does not depend on $\theta$, and the approximation step replaces the expectation under $f$ with an average over the $N$ training samples. The last line is maximum likelihood estimation on the incomplete data likelihood; note that the logarithm sits outside the integral over $\mathbf{z}$.
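To make the link between minimizing the KL Divergence and maximizing the average log-likelihood concrete, here is a minimal sketch with no latent variables. It assumes, purely for illustration, that the true distribution is $f = \mathcal{N}(2, 1)$ and that the model family is $p(x; \theta) = \mathcal{N}(\theta, 1)$, for which $D_{KL}(f \vert\vert p) = (2 - \theta)^2 / 2$ in closed form. The grid search and all names are hypothetical.

```python
import math
import random

random.seed(0)

TRUE_MEAN, STD = 2.0, 1.0  # assumed "true" distribution f = N(2, 1)

def log_p(x, theta):
    """Log density of the model p(x; theta) = N(theta, STD^2)."""
    return -0.5 * math.log(2 * math.pi * STD**2) - 0.5 * ((x - theta) / STD) ** 2

def kl_f_p(theta):
    """Closed-form D_KL(N(TRUE_MEAN, STD^2) || N(theta, STD^2)) for equal variances."""
    return (TRUE_MEAN - theta) ** 2 / (2 * STD**2)

def avg_log_likelihood(xs, theta):
    """Monte Carlo estimate of E_{x ~ f}[log p(x; theta)] from samples xs."""
    return sum(log_p(x, theta) for x in xs) / len(xs)

# Training data: independent samples from f.
x_train = [random.gauss(TRUE_MEAN, STD) for _ in range(2000)]

# Candidate parameter values theta in [0, 4] with step 0.05.
grid = [i / 20 for i in range(81)]

theta_mle = max(grid, key=lambda t: avg_log_likelihood(x_train, t))
theta_kl = min(grid, key=kl_f_p)

print("argmax of average log-likelihood:", theta_mle)
print("argmin of KL divergence:         ", theta_kl)
```

The two estimates agree up to sampling noise and grid resolution, even though computing `theta_mle` never requires evaluating $f$ itself, only samples from it.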


```{note} Mathematical Notation

The math notation of my content, including that in this post, follows the conventions in Christopher M. Bishop's Pattern Recognition and Machine Learning. In addition, I use calligraphic capital Roman and capital Greek symbols like $\mathcal{X}, \mathcal{Y}, \mathcal{Z}, \Omega, \Psi, \Xi, \ldots$ to represent **BOTH** the set of values that a random variable can take and the argument of a function in Python (e.g. `def p(Θ=θ)`).

```



https://pyro.ai/examples/intro_long.html#Background:-inference,-learning-and-evaluation

---
## MLE Vs. MAP Vs. Full Bayesian 

Objective:

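\begin{align}
    \theta_{\text{MLE}} &= \arg\underset{\theta \in \Theta}{\max} \log p(\mathbf{X}_{\text{train}} ; \Theta = \theta) && \text{(MLE)} \\
    \theta_{\text{MAP}} &= \arg\underset{\theta \in \Theta}{\max} \log p(\mathbf{X}_{\text{train}} \vert \Theta = \theta) + \log p(\Theta = \theta) && \text{(MAP)} \\
    p(\Theta = \theta \vert \mathbf{X}_{\text{train}}) &= \frac{p(\mathbf{X}_{\text{train}} \vert \Theta = \theta) p(\Theta = \theta)}{\int p(\mathbf{X}_{\text{train}} \vert \Theta = \theta') p(\Theta = \theta') d\theta'} && \text{(Full Bayesian)}
\end{align}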

Specifically in [Pyro](https://pyro.ai/examples/mle_map.html), to get an MLE estimate of $\theta$, declare $\theta$ as a fixed parameter using `pyro.param` in the `model` and use an empty `guide` (variational distribution). To get a MAP estimate instead, declare $\theta$ like a regular latent random variable using `pyro.sample` in the `model`, and in the `guide` declare $\theta$ as being drawn from a Dirac delta distribution whose location is a learnable parameter.
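The difference between the two estimates can be seen without any inference library in a case where both have closed forms. The sketch below is not Pyro code: it assumes a coin-flip (Bernoulli) likelihood with a Beta prior on the probability of heads, and the data (6 heads, 4 tails) and the Beta(10, 10) prior are values chosen only for illustration.

```python
def bernoulli_mle(heads, tails):
    """MLE of the heads probability: the empirical frequency."""
    return heads / (heads + tails)

def beta_posterior(heads, tails, alpha, beta):
    """Conjugate update: Beta(alpha, beta) prior -> Beta(alpha + heads, beta + tails)."""
    return alpha + heads, beta + tails

def bernoulli_map(heads, tails, alpha, beta):
    """MAP estimate: the mode of the Beta posterior (valid when both parameters exceed 1)."""
    post_alpha, post_beta = beta_posterior(heads, tails, alpha, beta)
    return (post_alpha - 1) / (post_alpha + post_beta - 2)

heads, tails = 6, 4              # assumed toy data
prior_alpha, prior_beta = 10, 10  # assumed prior, centred on a fair coin

theta_mle = bernoulli_mle(heads, tails)
theta_map = bernoulli_map(heads, tails, prior_alpha, prior_beta)
posterior = beta_posterior(heads, tails, prior_alpha, prior_beta)

print("MLE:", theta_mle)
print("MAP:", theta_map)
print("Full Bayesian posterior: Beta(%d, %d)" % posterior)
```

The MLE uses only the data, the MAP estimate is pulled toward the prior's centre at 0.5, and the full Bayesian answer keeps the whole posterior distribution rather than a single point.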

### Parameter Learning / Inference

Frequentist: Parameters are fixed

Bayesian: Parameters are random variables

We often see 

https://stats.stackexchange.com/questions/74082/what-is-the-difference-in-bayesian-estimate-and-maximum-likelihood-estimate

### Parameter Uncertainty

Frequentist: Uncertainty is estimated with confidence intervals

Bayesian: Uncertainty is estimated with credible intervals
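
As a rough numerical contrast between the two kinds of interval, the sketch below estimates the mean of a Gaussian with known standard deviation. It assumes a conjugate Normal prior on the mean for the Bayesian side; the data, `SIGMA`, and the prior values are hypothetical and chosen only for this example.

```python
import math
from statistics import NormalDist, mean

SIGMA = 1.0                        # assumed known observation noise
PRIOR_MEAN, PRIOR_STD = 0.0, 1.0   # assumed Normal prior on the unknown mean
x_train = [1.8, 2.4, 1.1, 2.9, 2.2, 1.6, 2.7, 2.0]  # toy data

n = len(x_train)
x_bar = mean(x_train)
z = NormalDist().inv_cdf(0.975)    # two-sided 95% critical value

# Frequentist: 95% confidence interval around the sample mean.
half_width = z * SIGMA / math.sqrt(n)
confidence_interval = (x_bar - half_width, x_bar + half_width)

# Bayesian: the posterior over the mean is Normal (conjugate update).
post_precision = 1 / PRIOR_STD**2 + n / SIGMA**2
post_mean = (PRIOR_MEAN / PRIOR_STD**2 + n * x_bar / SIGMA**2) / post_precision
posterior = NormalDist(post_mean, math.sqrt(1 / post_precision))

# 95% equal-tailed credible interval read off the posterior quantiles.
credible_interval = (posterior.inv_cdf(0.025), posterior.inv_cdf(0.975))

print("confidence interval:", confidence_interval)
print("credible interval:  ", credible_interval)
```

The confidence interval is a statement about the procedure under repeated sampling with the parameter held fixed, while the credible interval is a statement about the parameter itself under the posterior, which is why the prior shifts and narrows it here.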

### Prediction Intervals