# Example: Gaussian Linear Regression  
- Input Space: $R^d$  
- Output Space: $R$  
- Family of conditional probability densities  
$$
y \mid x, w \quad \sim \quad \mathcal{N}\left(w^{T} x, \sigma^{2}\right)
$$  
for some known $\sigma^2$  
- Parameter Space $R^d$  
- Data $\mathcal{D} = \{y_1, \ldots, y_n\}$  
- Assume $y_i$’s are conditionally independent, given $x_i$’s and $w$.  

## Gaussian Likelihood and MLE  
- The likelihood of $w \in R^d$ for the data $\mathcal{D}$ is given by the likelihood function  
$$
\begin{aligned}
&L_{\mathcal{D}}(w)=\prod_{i=1}^{n} p\left(y_{i} \mid x_{i}, w\right) \quad \text { by conditional independence }\\
&=\prod_{i=1}^{n}\left[\frac{1}{\sigma \sqrt{2 \pi}} \exp \left(-\frac{\left(y_{i}-w^{T} x_{i}\right)^{2}}{2 \sigma^{2}}\right)\right]
\end{aligned}
$$  
- Taking logs and dropping the terms that do not depend on $w$, you should see in your head that the MLE is
$$
\begin{aligned}
\hat{w}_{\mathrm{MLE}} &=\underset{w \in \mathbf{R}^{d}}{\arg \max } L_{\mathcal{D}}(w) \\
&=\underset{w \in \mathbf{R}^{d}}{\arg \min } \sum_{i=1}^{n}\left(y_{i}-w^{T} x_{i}\right)^{2}
\end{aligned}
$$
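
The equivalence between the MLE and least squares can be checked numerically. The following is a minimal sketch on simulated data; the sample size, noise level, and `w_true` are illustrative assumptions, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.5                 # illustrative sizes and known noise std
w_true = np.array([1.0, -2.0, 0.5])      # hypothetical "true" weights
X = rng.normal(size=(n, d))              # rows are the inputs x_i
y = X @ w_true + sigma * rng.normal(size=n)

def log_likelihood(w):
    """Gaussian log-likelihood log L_D(w) for known sigma."""
    resid = y - X @ w
    return -n * np.log(sigma * np.sqrt(2 * np.pi)) - resid @ resid / (2 * sigma**2)

# Least-squares solution, which is also the MLE.
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

# A perturbed parameter vector should not have a higher log-likelihood.
perturbed = w_mle + 0.1 * rng.normal(size=d)
print(log_likelihood(w_mle) >= log_likelihood(perturbed))
```

The log-likelihood depends on $w$ only through the sum of squared residuals, which is why a least-squares solver is enough here.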


# Bayesian Conditional Probability Models  
## Bayesian Conditional Models
- Input space $X = R^d$, outcome space $Y = R$  
- Two components to Bayesian conditional model:  
  - A parametric family of conditional densities:  
$$
\{p(y \mid x, \theta): \theta \in \Theta\}
$$  
  - A prior distribution for $\theta\in \Theta$.  
  
## The Posterior Distribution
- The posterior distribution for $\theta$ is  
$$
\begin{aligned}
p\left(\theta \mid \mathcal{D}, x_{1}, \ldots, x_{n}\right) & \propto p\left(\mathcal{D} \mid \theta, x_{1}, \ldots, x_{n}\right) p(\theta) \\
&=\underbrace{L_{\mathcal{D}}(\theta)}_{\text {likelihood }} \underbrace{p(\theta)}_{\text {prior }}
\end{aligned}
$$  

## Gaussian Example: Priors and Posteriors
- Choose a Gaussian prior distribution $p(w)$ on $R^d$:  
$$
w \sim \mathcal{N}\left(0, \Sigma_{0}\right)
$$  
for some covariance matrix $\Sigma_{0} \succ 0$ (i.e. $\Sigma_0$ is symmetric positive definite).  
- Posterior distribution
$$
\begin{aligned}
p\left(w \mid \mathcal{D}, x_{1}, \ldots, x_{n}\right) \propto {} & L_{\mathcal{D}}(w) p(w) \\
={} & \prod_{i=1}^{n}\left[\frac{1}{\sigma \sqrt{2 \pi}} \exp \left(-\frac{\left(y_{i}-w^{T} x_{i}\right)^{2}}{2 \sigma^{2}}\right)\right] \quad (\text { likelihood }) \\
& \times\left|2 \pi \Sigma_{0}\right|^{-1 / 2} \exp \left(-\frac{1}{2} w^{T} \Sigma_{0}^{-1} w\right) \quad (\text { prior })
\end{aligned}
$$  
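
Taking logs makes the structure of this product easier to see. As a sketch, write $X$ for the matrix whose rows are $x_i^T$ and $y$ for the column vector of responses (notation assumed here). Then, up to an additive constant that does not depend on $w$,

$$
\begin{aligned}
\log p\left(w \mid \mathcal{D}, x_{1}, \ldots, x_{n}\right) &= -\frac{1}{2 \sigma^{2}} \sum_{i=1}^{n}\left(y_{i}-w^{T} x_{i}\right)^{2}-\frac{1}{2} w^{T} \Sigma_{0}^{-1} w+\text{const} \\
&= -\frac{1}{2} w^{T}\left(\sigma^{-2} X^{T} X+\Sigma_{0}^{-1}\right) w+\sigma^{-2} w^{T} X^{T} y+\text{const}
\end{aligned}
$$

The log posterior is a quadratic function of $w$ with a negative definite quadratic term, so the posterior is itself Gaussian.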

## Predictive Distributions
- We have a parametric family of conditional densities:  
$$
\{p(y \mid x, \theta): \theta \in \Theta\}
$$  
- Each $p(y | x,\theta)$ is a conditional density, but also a prediction function  
  - For $x \in X$, the action produced is a probability density on $y$.  
- In Bayesian statistics we have two distributions on $\Theta$:  
  - the prior distribution $p(\theta)$  
  - the posterior distribution $p(\theta | D,x_1,...,x_n)$.  
- Each distribution on $\Theta$ induces a **distribution over prediction functions**.  
- For any given $x$, this gives a single distribution on $y$.  
- This distribution is called a **predictive distribution**  
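
In symbols, as a sketch: if $q(\theta)$ denotes whichever distribution on $\Theta$ is in use (the symbol $q$ is introduced here only for illustration), the induced distribution on $y$ at input $x$ averages the conditional densities over $\theta$:

$$
p(y \mid x)=\int p(y \mid x, \theta) \, q(\theta) \, d \theta
$$

Taking $q$ to be the prior $p(\theta)$ gives the prior predictive distribution, and taking $q$ to be the posterior gives the posterior predictive distribution.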

# Gaussian Regression Example  
## Example in 1-Dimension: Setup
- Input space $X = [-1,1]$, output space $Y = R$  
- Given $x$, the world generates $y$ as  
$$
y=w_{0}+w_{1} x+\varepsilon
$$  
where $\varepsilon \sim \mathcal{N}\left(0,0.2^{2}\right)$  
- Written another way, the **conditional probability model** is  
$$
y \mid x, w_{0}, w_{1} \sim \mathcal{N}\left(w_{0}+w_{1} x, 0.2^{2}\right)
$$  
- What’s the parameter space? $R^2$  
- **Prior distribution:** $w=\left(w_{0}, w_{1}\right) \sim \mathcal{N}\left(0, \frac{1}{2} I\right)$  
## Example in 1-Dimension: Prior Situation  
<div align="center"><img src = "./1d.jpg" width = '500' height = '100' align = center /></div>    
- On the right, $y(x)=\mathbb{E}[y \mid x, w]=w_{0}+w_{1} x$, for randomly chosen $w \sim p(w)=\mathcal{N}\left(0, \frac{1}{2} I\right)$  
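
Drawing prediction functions from the prior can be sketched in a few lines. This is only an illustration: the number of draws and the grid over $[-1,1]$ are arbitrary choices, and the variable names are not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
prior_cov = 0.5 * np.eye(2)               # Sigma_0 = (1/2) I from the setup
n_draws = 5                               # number of sampled functions (arbitrary)

# Each row is one draw w = (w0, w1) from the prior N(0, Sigma_0).
W = rng.multivariate_normal(mean=np.zeros(2), cov=prior_cov, size=n_draws)

x_grid = np.linspace(-1.0, 1.0, 50)       # points in the input space [-1, 1]
features = np.column_stack([np.ones_like(x_grid), x_grid])  # rows (1, x)

# lines[k, j] = w0 + w1 * x_grid[j] for the k-th sampled w.
lines = W @ features.T
print(lines.shape)
```

Each row of `lines` is one straight line $y(x) = w_0 + w_1 x$ evaluated on the grid, which is what a plot of sampled prediction functions would show.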






# Gaussian Regression Continued  
## Closed Form for Posterior
- Model:  
$$
\begin{array}{r}
w \quad \sim \quad \mathcal{N}\left(0, \Sigma_{0}\right) \\
y_{i} \mid x_{i}, w \quad \sim \quad \mathcal{N}\left(w^{T} x_{i}, \sigma^{2}\right) \quad \text { independently }
\end{array}
$$  
- Design matrix $X$ (with rows $x_i^T$), response column vector $y$  
- **Posterior distribution is a Gaussian distribution:** 
$$
\begin{aligned}
w \mid \mathcal{D} & \sim \mathcal{N}\left(\mu_{P}, \Sigma_{P}\right) \\
\mu_{P} &=\left(X^{T} X+\sigma^{2} \Sigma_{0}^{-1}\right)^{-1} X^{T} y \\
\Sigma_{P} &=\left(\sigma^{-2} X^{T} X+\Sigma_{0}^{-1}\right)^{-1}
\end{aligned}
$$  
- The posterior covariance $\Sigma_P$ gives us a natural uncertainty measure  
- For the prior covariance $\Sigma_{0}=\frac{\sigma^{2}}{\lambda} I$, we get  
$$
\mu_{P}=\left(X^{T} X+\lambda I\right)^{-1} X^{T} y
$$  
which is of course the ridge regression solution.  
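
The closed-form posterior and its agreement with ridge regression can be sketched numerically as follows. The simulated data, `sigma`, and `lam` are illustrative assumptions chosen only to exercise the formulas.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 3
sigma, lam = 0.3, 2.0                     # illustrative noise std and ridge penalty
X = rng.normal(size=(n, d))               # design matrix, rows x_i^T
y = X @ np.array([0.5, -1.0, 2.0]) + sigma * rng.normal(size=n)

Sigma0 = (sigma**2 / lam) * np.eye(d)     # prior covariance (sigma^2 / lambda) I
Sigma0_inv = np.linalg.inv(Sigma0)

# Posterior covariance and mean from the closed-form expressions.
Sigma_P = np.linalg.inv(X.T @ X / sigma**2 + Sigma0_inv)
mu_P = np.linalg.solve(X.T @ X + sigma**2 * Sigma0_inv, X.T @ y)

# Ridge regression solution with penalty lambda.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.allclose(mu_P, w_ridge))
```

Using `np.linalg.solve` for the mean avoids forming an explicit inverse; the explicit inverse is kept for $\Sigma_P$ because the covariance matrix itself is the quantity of interest.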



## Posterior Variance vs. Traditional Uncertainty
- Traditional regression: OLS estimator (also the MLE) is a random variable – why?  
  - Because estimator is a function of data $D$ and data is random  
- Common assumption: data are iid with Gaussian noise: $y=w^{T} x+\varepsilon,$ with $\varepsilon \sim \mathcal{N}\left(0, \sigma^{2}\right)$  
- Then the OLS estimator $\hat{w}$ has a sampling distribution that is Gaussian with mean $w$ and covariance  
$$
\operatorname{Cov}(\hat{w})=\left(\sigma^{-2} X^{T} X\right)^{-1}
$$  
- By comparison, the posterior covariance is  
$$
\Sigma_{P}=\left(\sigma^{-2} X^{T} X+\Sigma_{0}^{-1}\right)^{-1}
$$  
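
A small numerical sketch of this comparison, using an arbitrary simulated design matrix and illustrative values for $\sigma$ and $\Sigma_0$ (none taken from the notes). It checks that $\operatorname{Cov}(\hat{w}) - \Sigma_P$ is positive definite, so the prior can only shrink the uncertainty, and that a very diffuse prior approximately recovers the OLS covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 30, 2, 0.2                  # illustrative sizes and noise std
X = rng.normal(size=(n, d))               # simulated design matrix

def posterior_cov(Sigma0):
    """Sigma_P = (sigma^-2 X^T X + Sigma0^-1)^-1."""
    return np.linalg.inv(X.T @ X / sigma**2 + np.linalg.inv(Sigma0))

cov_ols = np.linalg.inv(X.T @ X / sigma**2)     # sampling covariance of OLS
Sigma_P = posterior_cov(0.5 * np.eye(d))        # informative prior
Sigma_P_flat = posterior_cov(1e8 * np.eye(d))   # nearly flat prior

# All eigenvalues positive means cov_ols - Sigma_P is positive definite.
gap_eigs = np.linalg.eigvalsh(cov_ols - Sigma_P)
print(gap_eigs.min() > 0, np.allclose(Sigma_P_flat, cov_ols))
```

The two quantities answer different questions: the OLS covariance describes variability of the estimator over repeated datasets, while $\Sigma_P$ describes uncertainty about $w$ given the one dataset observed.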

## Posterior Mean and Posterior Mode (MAP)   
- Posterior density for $\Sigma_0 = \frac{\sigma^2}{\lambda}I$  
$$
p(w \mid \mathcal{D}) \propto \underbrace{\exp \left(-\frac{\lambda}{2 \sigma^{2}}\|w\|^{2}\right)}_{\text {prior }} \underbrace{\prod_{i=1}^{n} \exp \left(-\frac{\left(y_{i}-w^{T} x_{i}\right)^{2}}{2 \sigma^{2}}\right)}_{\text {likelihood }}
$$  
- To find the MAP estimate, it is sufficient to minimize the negative log posterior  
$$
\begin{aligned}
\hat{w}_{\mathrm{MAP}} &=\underset{w \in \mathbf{R}^{d}}{\arg \min }[-\log p(w \mid \mathcal{D})] \\
&=\underset{w \in \mathbf{R}^{d}}{\arg \min } \underbrace{\sum_{i=1}^{n}\left(y_{i}-w^{T} x_{i}\right)^{2}}_{\text {from negative log-likelihood }}+\underbrace{\lambda\|w\|^{2}}_{\text {from negative log-prior }}
\end{aligned}
$$  
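
The step between the two lines can be spelled out. As a sketch, with $X$ the design matrix and $y$ the response vector, the negative log posterior is

$$
-\log p(w \mid \mathcal{D})=\frac{1}{2 \sigma^{2}}\left[\sum_{i=1}^{n}\left(y_{i}-w^{T} x_{i}\right)^{2}+\lambda\|w\|^{2}\right]+\text{const}
$$

Multiplying by the positive constant $2\sigma^2$ and dropping the additive constant does not change the minimizer. Setting the gradient of the bracketed term to zero gives

$$
-2 X^{T}(y-X w)+2 \lambda w=0 \quad \Longrightarrow \quad \hat{w}_{\mathrm{MAP}}=\left(X^{T} X+\lambda I\right)^{-1} X^{T} y
$$

so the posterior mode coincides with the posterior mean $\mu_P$, as expected for a Gaussian posterior.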

## Predictive Distribution
- Given a new input point $x_{\text{new}}$, how do we predict $y_{\text{new}}$?  
- Predictive distribution  
$$
\begin{aligned}
p\left(y_{\text {new }} \mid x_{\text {new }}, \mathcal{D}\right) &=\int p\left(y_{\text {new }} \mid x_{\text {new }}, w, \mathcal{D}\right) p(w \mid \mathcal{D}) d w \\
&=\int p\left(y_{\text {new }} \mid x_{\text {new }}, w\right) p(w \mid \mathcal{D}) d w
\end{aligned}
$$  
- For Gaussian regression, predictive distribution has closed form.
$$
\begin{aligned}
y_{\text {new }} \mid x_{\text {new }}, \mathcal{D} & \sim \mathcal{N}\left(\eta_{\text {new }}, \sigma_{\text {new }}^{2}\right) \\
\eta_{\text {new }} &=\mu_{P}^{T} x_{\text {new }} \\
\sigma_{\text {new }}^{2} &=\underbrace{x_{\text {new }}^{T} \Sigma_{P} x_{\text {new }}}_{\text {from variance in } w}+\underbrace{\sigma^{2}}_{\text {inherent variance in } y}
\end{aligned}
$$  
- With predictive distributions, we can give a mean prediction with error bands:
<div align="center"><img src = "./narrow.jpg" width = '500' height = '100' align = center /></div>    
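
The closed-form predictive mean and variance can be computed directly. The following is a minimal sketch for the 1-dimensional example, with $\sigma = 0.2$ and $\Sigma_0 = \frac{1}{2} I$ as in the setup; the "true" weights, the sample size, and the 1.96 multiplier for an approximate 95% band are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.2                                # noise std from the 1-D setup
Sigma0 = 0.5 * np.eye(2)                   # prior covariance from the 1-D setup
w_true = np.array([-0.3, 0.5])             # hypothetical "true" (w0, w1)

# Simulate n observations with x uniform on [-1, 1].
n = 20
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])       # rows (1, x_i), so w^T x = w0 + w1 x
y = X @ w_true + sigma * rng.normal(size=n)

# Posterior N(mu_P, Sigma_P); this form of mu_P is algebraically
# equal to (X^T X + sigma^2 Sigma0^-1)^-1 X^T y.
Sigma_P = np.linalg.inv(X.T @ X / sigma**2 + np.linalg.inv(Sigma0))
mu_P = Sigma_P @ X.T @ y / sigma**2

# Predictive mean and variance at each new input.
x_new = np.linspace(-1.0, 1.0, 9)
X_new = np.column_stack([np.ones_like(x_new), x_new])
eta_new = X_new @ mu_P
var_new = np.einsum("ij,jk,ik->i", X_new, Sigma_P, X_new) + sigma**2

# Approximate 95% band: mean plus or minus 1.96 predictive standard deviations.
lower = eta_new - 1.96 * np.sqrt(var_new)
upper = eta_new + 1.96 * np.sqrt(var_new)
```

The band never gets narrower than $\pm 1.96\,\sigma$, because the inherent noise term $\sigma^2$ remains even when the posterior on $w$ is very concentrated.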
