# Classical Statistics  
## Parametric Family of Densities
- A parametric family of densities is a set  
$$\{p(y \mid \theta): \theta \in \Theta\}$$  
where $p(y \mid \theta)$ is a density on the sample space $Y$ and $\theta$ is a parameter in a finite-dimensional space $\Theta$.  

## Frequentist or “Classical” Statistics
- Assume that $p(y | \theta)$ governs the world we are observing, for some $\theta \in \Theta$.  
- If we knew the right $\theta \in \Theta$, there would be no need for statistics.  
- Instead of $\theta$, we have data $D: y_1,...,y_n$ sampled i.i.d. from $p(y | \theta)$.  
- Statistics is about how to get by with $D$ in place of $\theta$.  

## Point Estimation
- One type of statistical problem is point estimation.  
- A statistic $s = s(D)$ is any function of the data.  
- A statistic $\hat{\theta} = \hat{\theta}(D)$ taking values in $\Theta$ is a point estimator of $\theta$.   
  - A good point estimator will have $\hat{\theta} \approx \theta$.  
  
## Desirable Properties of Point Estimators
- **Consistency**: As data size $n \to \infty$, we get $\hat{\theta}_n \to \theta$.  
- **Efficiency**: (Roughly speaking) $\hat{\theta}_n$ is as accurate as we can get from a sample of size $n$.  
- e.g. Maximum likelihood estimators are consistent and efficient under reasonable conditions.  
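
Consistency can be made concrete with a small simulation. The sketch below assumes a coin-flip (Bernoulli) model with a made-up true parameter `theta_true = 0.3` and uses the sample fraction of heads as the estimator; the function names, the seed, and the sample sizes are illustrative choices, not part of these notes.

```python
import random


def bernoulli_mle(flips):
    """Fraction of heads (1s) in the sample."""
    return sum(flips) / len(flips)


def simulate_errors(theta_true, sample_sizes, seed=0):
    """Return |theta_hat_n - theta_true| for each sample size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    errors = {}
    for n in sample_sizes:
        flips = [1 if rng.random() < theta_true else 0 for _ in range(n)]
        errors[n] = abs(bernoulli_mle(flips) - theta_true)
    return errors


# theta_true and the sample sizes are hypothetical values for illustration.
errors = simulate_errors(theta_true=0.3, sample_sizes=[10, 1000, 100000])
for n, err in errors.items():
    print(n, round(err, 4))
```

A single run is noisy, so the error need not shrink monotonically, but it should typically be much smaller for the largest $n$.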

## The Likelihood Function
- For parametric family $\{p(y | \theta):\theta \in \Theta \}$ and i.i.d. sample $D = (y_1,...,y_n)$
- The density for sample $D$ for $\theta \in \Theta$ is  
$$p(\mathcal{D} \mid \theta)=\prod_{i=1}^{n} p\left(y_{i} \mid \theta\right)$$  
- $p(D | \theta)$ is a function of $D$ and $\theta$.  
- For fixed $\theta$, $p(D | \theta)$ is a density function on $Y^n$.  
- For fixed $D$, the function $\theta \mapsto p(D | \theta)$ is called the **likelihood function**  
$$L_{\mathcal{D}}(\theta):=p(\mathcal{D} \mid \theta)$$  
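
As a minimal sketch of the product formula above, the code below evaluates $L_{\mathcal{D}}(\theta)$ for an i.i.d. sample and also its logarithm, which turns the product into a sum. The exponential density and the three data points are assumptions chosen only for illustration; any density `p(y | theta)` could be passed in.

```python
import math


def likelihood(density, data, theta):
    """L_D(theta) = product over i of p(y_i | theta)."""
    return math.prod(density(y, theta) for y in data)


def log_likelihood(density, data, theta):
    """log L_D(theta) = sum over i of log p(y_i | theta)."""
    return sum(math.log(density(y, theta)) for y in data)


def exponential_density(y, rate):
    """Illustrative density: p(y | rate) = rate * exp(-rate * y) for y >= 0."""
    return rate * math.exp(-rate * y) if y >= 0 else 0.0


data = [0.5, 1.2, 0.3]  # made-up sample
L = likelihood(exponential_density, data, theta=2.0)
log_L = log_likelihood(exponential_density, data, theta=2.0)
print(L, log_L)
```

For long samples the raw product can underflow to zero in floating point, which is one practical reason to work with the log-likelihood.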

## Maximum Likelihood Estimation
Definition:  
- The **maximum likelihood estimator (MLE)** for $\theta$ in the model $\{p(y \mid \theta) : \theta \in \Theta \}$ is  
$$\hat{\theta}_{\mathrm{MLE}}=\underset{\theta \in \Theta}{\arg \max } L_{\mathcal{D}}(\theta)$$  
- Maximum likelihood is just one approach to getting a point estimator for $\theta$.  
- Method of moments is another general approach one learns about in statistics.  
- Later we’ll talk about MAP and posterior mean as approaches to point estimation.

## Coin Flipping: Setup
Parametric family of mass functions:
$$p(\text { Heads } \mid \theta)=\theta$$  
for $\theta \in \Theta = (0,1)$  
Note that every $\theta \in \Theta$ gives us a different probability model for a coin.  

## Coin Flipping: Likelihood function
- Data $D = (H,H,T,T,T,T,T,H,...,T)$  
  - $n_h$: number of heads  
  - $n_t$: number of tails  
- Likelihood function for data $D$:  
$$L_{\mathcal{D}}(\theta)=p(\mathcal{D} \mid \theta)=\theta^{n_{h}}(1-\theta)^{n_{t}}$$  

## Coin Flipping: MLE
- As usual, easier to maximize the log-likelihood function:
$$\begin{aligned}
\hat{\theta}_{\mathrm{MLE}} &=\underset{\theta \in \Theta}{\arg \max } \log L_{\mathcal{D}}(\theta) \\
&=\underset{\theta \in \Theta}{\arg \max }\left[n_{h} \log \theta+n_{t} \log (1-\theta)\right]
\end{aligned}$$  
- First order condition:  
$$\begin{aligned}
\frac{n_{h}}{\theta} &-\frac{n_{t}}{1-\theta}=0 \\
& \Longleftrightarrow \theta=\frac{n_{h}}{n_{h}+n_{t}}
\end{aligned}$$  
- So $\hat{\theta}_{MLE}$ is the empirical fraction of heads
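
The closed form can be sanity-checked numerically. The sketch below maximizes the coin log-likelihood over a grid of $\theta$ values and compares the result with $n_h / (n_h + n_t)$; the counts (3 heads, 7 tails) and the grid resolution are made-up values for illustration.

```python
import math


def coin_log_likelihood(theta, n_heads, n_tails):
    """n_h * log(theta) + n_t * log(1 - theta), for theta in (0, 1)."""
    return n_heads * math.log(theta) + n_tails * math.log(1.0 - theta)


def coin_mle_closed_form(n_heads, n_tails):
    """Empirical fraction of heads."""
    return n_heads / (n_heads + n_tails)


def coin_mle_grid(n_heads, n_tails, grid_size=9999):
    """Brute-force argmax of the log-likelihood over an interior grid."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return max(grid, key=lambda th: coin_log_likelihood(th, n_heads, n_tails))


n_heads, n_tails = 3, 7  # hypothetical counts
closed = coin_mle_closed_form(n_heads, n_tails)
numeric = coin_mle_grid(n_heads, n_tails)
print(closed, numeric)
```

The grid search is only a check; the first-order condition already gives the exact answer.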

# Bayesian Statistics: Introduction  
## Bayesian Statistics
- Introduces a new ingredient: the **prior distribution**  
- A prior distribution $p(\theta)$ is a distribution on parameter space $\Theta$.  
- A prior reflects our belief about $\theta$, before seeing any data.

## A Bayesian Model
- A Bayesian model consists of two pieces  
  -  a parametric family of densities
$$\{p(\mathcal{D} \mid \theta) \mid \theta \in \Theta\}$$  
  - A prior distribution $p(\theta)$ on parameter space $\Theta$.  
- Putting pieces together, we get a joint density on $\theta$ and $D$:  
$$p(\mathcal{D}, \theta)=p(\mathcal{D} \mid \theta) p(\theta)$$  

## The Posterior Distribution
- The posterior distribution for $\theta$ is $p(\theta | D)$.  
- Prior represents belief about $\theta$ before observing data $D$.  
- Posterior represents the rationally “updated” beliefs after seeing $D$.

## Expressing the Posterior Distribution  
- By Bayes rule, can write the posterior distribution as
$$p(\theta \mid \mathcal{D})=\frac{p(\mathcal{D} \mid \theta) p(\theta)}{p(\mathcal{D})}$$  
- Let’s consider both sides as functions of $\theta$ for fixed $D$.  
- Then both sides are densities on $\Theta$, and since $p(\mathcal{D})$ does not depend on $\theta$, we can write  
$$\underbrace{p(\theta \mid \mathcal{D})}_{\text {posterior }} \propto \underbrace{p(\mathcal{D} \mid \theta)}_{\text {likelihood }} \underbrace{p(\theta)}_{\text {prior }}$$  
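
The proportionality can be illustrated on a grid: multiply likelihood by prior at each grid point, then normalize so the weights sum to one. This is a rough sketch that assumes the coin-flip likelihood from earlier, a flat prior, and made-up counts (3 heads, 7 tails); the names `grid_posterior`, `flat_prior`, and `scaled_prior` are hypothetical.

```python
def grid_posterior(likelihood, prior, grid):
    """Normalize likelihood(theta) * prior(theta) over the grid points."""
    unnormalized = [likelihood(th) * prior(th) for th in grid]
    total = sum(unnormalized)  # stands in for p(D), up to the grid spacing
    return [u / total for u in unnormalized]


n_heads, n_tails = 3, 7  # hypothetical counts
grid = [(i + 0.5) / 1000 for i in range(1000)]  # midpoints in (0, 1)


def likelihood(th):
    return th ** n_heads * (1.0 - th) ** n_tails


def flat_prior(th):
    return 1.0


def scaled_prior(th):
    return 5.0  # same prior up to a constant factor


weights = grid_posterior(likelihood, flat_prior, grid)
weights_scaled = grid_posterior(likelihood, scaled_prior, grid)
posterior_mean = sum(th * w for th, w in zip(grid, weights))
print(posterior_mean)
```

Rescaling the prior (or the likelihood) by a constant leaves the normalized weights unchanged, which is why the posterior only needs to be known up to proportionality.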

## Coin Flipping: Bayesian Model
- Parametric family of mass functions:
$$p(\text { Heads } \mid \theta)=\theta$$  
- Need a prior distribution $p(\theta)$ on $\Theta = (0,1)$.  
- A distribution from the Beta family will do the trick...

## Coin Flipping: Beta Prior
- Prior:  
$$\begin{aligned}
\theta & \sim \operatorname{Beta}(h, t) \\
p(\theta) & \propto \theta^{h-1}(1-\theta)^{t-1}
\end{aligned}$$  
<div align="center"><img src = "./beta.jpg" width = '500' height = '100' align = center /></div>    
- Mean of Beta distribution:  
$$\mathbb{E} \theta=\frac{h}{h+t}$$   
- Mode of Beta distribution:
$$\underset{\theta}{\arg \max } p(\theta)=\frac{h-1}{h+t-2}$$  
for $h,t > 1$  
- Likelihood model  
$$p(\mathcal{D} \mid \theta)=\theta^{n_{h}}(1-\theta)^{n_{t}}$$  
- Posterior density  
$$\begin{aligned}
p(\theta \mid \mathcal{D}) & \propto p(\theta) p(\mathcal{D} \mid \theta) \\
& \propto \theta^{h-1}(1-\theta)^{t-1} \times \theta^{n_{h}}(1-\theta)^{n_{t}} \\
&=\theta^{h-1+n_{h}}(1-\theta)^{t-1+n_{t}}
\end{aligned}$$



## Posterior is Beta
- Posterior is in the beta family:   
$$\theta \mid \mathcal{D} \sim \operatorname{Beta}\left(h+n_{h}, t+n_{t}\right)$$  
- Interpretation:   
  - Prior initializes our counts with $h$ heads and $t$ tails  
  - Posterior increments counts by observed $n_h$ and $n_t$  
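
The count interpretation can be written as a one-line update. In this sketch the function name `beta_update` and the example counts are hypothetical; it also checks that updating in two batches gives the same posterior as updating once with all the data.

```python
def beta_update(h, t, n_heads, n_tails):
    """Posterior Beta parameters for a Beta(h, t) prior and observed counts."""
    return h + n_heads, t + n_tails


prior = (2, 2)  # illustrative prior counts

# All data at once: 3 heads and 7 tails.
all_at_once = beta_update(*prior, n_heads=3, n_tails=7)

# Same data split into two batches; the first posterior acts as the next prior.
after_first = beta_update(*prior, n_heads=1, n_tails=4)
two_batches = beta_update(*after_first, n_heads=2, n_tails=3)

print(all_at_once, two_batches)
```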
  
## Sidebar: Conjugate Priors
- Interesting that posterior is in same distribution family as prior.  
- Let $\pi$ be a family of prior distributions on $\Theta$.  
- Let $P$ be a parametric family of distributions with parameter space $\Theta$.  

Definition:  
A family of distributions $\pi$ is conjugate to parametric model $P$ if for any prior in $\pi$, the posterior is always in $\pi$.  
- The beta family is conjugate to the coin-flipping (i.e. Bernoulli) model.  
- The family of all probability distributions is (trivially) conjugate to any parametric model.
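
Conjugacy of the beta family can be checked numerically. The sketch below normalizes prior times likelihood on a grid and compares it with the $\operatorname{Beta}(h+n_h, t+n_t)$ density; the prior $\operatorname{Beta}(2,2)$, the counts, and the grid size are assumptions for illustration.

```python
import math


def beta_pdf(theta, a, b):
    """Beta(a, b) density at theta in (0, 1), computed via log-gamma."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(
        log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta)
    )


h, t = 2.0, 2.0          # illustrative prior parameters
n_heads, n_tails = 3, 7  # hypothetical counts
grid_size = 2000
width = 1.0 / grid_size
grid = [(i + 0.5) * width for i in range(grid_size)]

# Prior times likelihood, normalized by a midpoint-rule estimate of p(D).
unnormalized = [
    beta_pdf(th, h, t) * th ** n_heads * (1.0 - th) ** n_tails for th in grid
]
evidence = sum(unnormalized) * width
numeric_posterior = [u / evidence for u in unnormalized]

# Density predicted by the conjugate update.
exact_posterior = [beta_pdf(th, h + n_heads, t + n_tails) for th in grid]

max_gap = max(abs(p - q) for p, q in zip(numeric_posterior, exact_posterior))
print(max_gap)
```

The gap should be at the level of numerical integration error, consistent with the posterior being exactly a beta density.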

## Example: Coin Flipping - Concrete Example
- Suppose we have a coin, possibly biased (**parametric probability model**):   
$$p(\text{Heads}| \theta) = \theta$$  
  - Parameter space: $\Theta = (0,1)$.  
  - Prior distribution: $\theta \sim \text{Beta}(2,2)$
<div align="center"><img src = "./beta2.jpg" width = '500' height = '100' align = center /></div>   

## Example: Coin Flipping
- Next, we gather some data $D = \{H,H,T,T,T,T,T,H,...,T\}$   
  - Heads: 75 Tails: 60  
  - $\hat{\theta}_{MLE} = \frac{75} {75+60} \approx 0.556$  
- Posterior distribution: $\theta | D \sim \text{Beta}(77,62)$.




## Bayesian Point Estimates
- So we have posterior $\theta | D$...  
- But we want a point estimate $\hat{\theta}$ for $\theta$.  
- Common options:  
   - **posterior mean** $\hat{\theta} = \mathbb{E}[\theta | D]$  
   - **maximum a posteriori (MAP) estimate**  $\hat{\theta} = \arg\max_{\theta}p(\theta | D)$  
     - Note: this is the **mode** of the posterior distribution
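
For the concrete coin example (prior $\text{Beta}(2,2)$, 75 heads, 60 tails), both estimates follow from the beta mean and mode formulas given earlier. A minimal sketch, with hypothetical helper names `beta_mean` and `beta_mode`:

```python
def beta_mean(h, t):
    """Mean of Beta(h, t)."""
    return h / (h + t)


def beta_mode(h, t):
    """Mode of Beta(h, t); the formula is valid only for h, t > 1."""
    if h <= 1 or t <= 1:
        raise ValueError("mode formula requires h > 1 and t > 1")
    return (h - 1) / (h + t - 2)


n_heads, n_tails = 75, 60
post_h, post_t = 2 + n_heads, 2 + n_tails  # Beta(77, 62)

posterior_mean = beta_mean(post_h, post_t)  # 77 / 139
map_estimate = beta_mode(post_h, post_t)    # 76 / 137
mle = n_heads / (n_heads + n_tails)         # 75 / 135

print(round(posterior_mean, 4), round(map_estimate, 4), round(mle, 4))
```

With this much data the three estimates are close; both Bayesian estimates sit slightly nearer to 0.5 than the MLE because the $\text{Beta}(2,2)$ prior is centered there.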

## What else can we do with a posterior?
- Extract “credible set” for $\theta$ (a Bayesian confidence interval).   
  - e.g. Interval $[a,b]$ is a 95% credible set if  
$$
\mathbb{P}(\theta \in[a, b] \mid \mathcal{D}) \geqslant 0.95
$$  
- The most “Bayesian” approach is Bayesian decision theory:   
  - Choose a loss function.  
  - Find action minimizing **expected risk w.r.t. posterior**  
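
One way to extract a credible set from a beta posterior is an equal-tailed interval: cut off 2.5% of posterior mass in each tail. The sketch below does this by numerically accumulating the density on a grid, using the $\text{Beta}(77,62)$ posterior from the coin example; the function names and grid size are assumptions, and an equal-tailed interval is only one of many valid 95% credible sets.

```python
import math


def beta_pdf(theta, a, b):
    """Beta(a, b) density at theta in (0, 1), computed via log-gamma."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(
        log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta)
    )


def equal_tailed_interval(a, b, mass=0.95, grid_size=20000):
    """Approximate [lower, upper] with (1 - mass) / 2 posterior mass in each tail."""
    width = 1.0 / grid_size
    tail = (1.0 - mass) / 2.0
    cdf = 0.0
    lower = upper = None
    for i in range(grid_size):
        theta = (i + 0.5) * width
        cdf += beta_pdf(theta, a, b) * width  # midpoint-rule running CDF
        if lower is None and cdf >= tail:
            lower = theta
        if cdf >= 1.0 - tail:
            upper = theta
            break
    return lower, upper


lower, upper = equal_tailed_interval(77, 62)
print(round(lower, 3), round(upper, 3))
```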
  


# Bayesian Decision Theory  
- Ingredients:   
  - Parameter space $\Theta$.  
  - Prior: Distribution $p(\theta)$ on $\Theta$.  
  - Action space $\mathcal{A}$.   
  - Loss function: $\ell : \Theta \times \mathcal{A} \to \mathbb{R}$.  
- The posterior risk of an action $a\in \mathcal{A}$ is  
$$
\begin{aligned}
r(a) &:=\mathbb{E}[\ell(\theta, a) \mid \mathcal{D}] \\
&=\int \ell(\theta, a) p(\theta \mid \mathcal{D}) d \theta
\end{aligned}
$$  
- It’s the expected loss under the posterior.  
- A Bayes action $a^*$ is an action that minimizes posterior risk:  
$$
r\left(a^{*}\right)=\min _{a \in \mathcal{A}} r(a)
$$  
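
When the parameter space and action space are both finite, the integral becomes a sum and the Bayes action can be found by enumeration. The sketch below uses an entirely made-up problem (two parameter values, two actions, and an invented loss table) just to show the mechanics.

```python
def posterior_risk(action, posterior, loss):
    """Expected loss of an action under a discrete posterior {theta: prob}."""
    return sum(prob * loss(theta, action) for theta, prob in posterior.items())


def bayes_action(actions, posterior, loss):
    """Action with the smallest posterior risk."""
    return min(actions, key=lambda a: posterior_risk(a, posterior, loss))


# Hypothetical posterior over whether a coin is fair or biased.
posterior = {"fair": 0.7, "biased": 0.3}
actions = ["accept", "reject"]

# Hypothetical losses: accepting a biased coin is the costly mistake.
loss_table = {
    ("fair", "accept"): 0.0,
    ("fair", "reject"): 1.0,
    ("biased", "accept"): 5.0,
    ("biased", "reject"): 0.0,
}


def loss(theta, action):
    return loss_table[(theta, action)]


risks = {a: posterior_risk(a, posterior, loss) for a in actions}
best = bayes_action(actions, posterior, loss)
print(risks, best)
```

Here "reject" wins even though "fair" is the more probable state, because the loss table penalizes a wrong "accept" much more heavily.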

## Bayesian Point Estimation  
- General Setup:  
  - Data $D$ generated by $p(y | \theta)$, for unknown $\theta \in \Theta$.   
  - Want to produce a **point estimate** for $\theta$.  
- Choose the following:  
  - **Loss:** $\ell(\theta, \hat{\theta})=(\theta-\hat{\theta})^{2}$  
  - **Prior:** $p(\theta)$ on $\Theta$  
- Find action $\hat{\theta} \in \Theta$ that minimizes posterior risk:   
$$
\begin{aligned}
r(\hat{\theta}) &=\mathbb{E}\left[(\theta-\hat{\theta})^{2} \mid \mathcal{D}\right] \\
&=\int(\theta-\hat{\theta})^{2} p(\theta \mid \mathcal{D}) d \theta
\end{aligned}
$$  

## Bayesian Point Estimation: Square Loss
- Find action $\hat{\theta} \in \Theta$ that minimizes **posterior risk**  
$$
r(\hat{\theta})=\int(\theta-\hat{\theta})^{2} p(\theta \mid \mathcal{D}) d \theta
$$  
- Differentiate:  
$$
\begin{aligned}
\frac{d r(\hat{\theta})}{d \hat{\theta}} &=-\int 2(\theta-\hat{\theta}) p(\theta \mid \mathcal{D}) d \theta \\
&=-2 \int \theta p(\theta \mid \mathcal{D}) d \theta+2 \hat{\theta} \underbrace{\int p(\theta \mid \mathcal{D}) d \theta}_{=1} \\
&=-2 \int \theta p(\theta \mid \mathcal{D}) d \theta+2 \hat{\theta}
\end{aligned}
$$  
- First order condition $\frac{d r(\hat{\theta})}{d \hat{\theta}}=0$ gives:  
$$
\begin{aligned}
\hat{\theta} &=\int \theta p(\theta \mid \mathcal{D}) d \theta \\
&=\mathbb{E}[\theta \mid \mathcal{D}]
\end{aligned}
$$
- Bayes action for square loss is the posterior mean.
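
The result can be checked by brute force. The sketch below uses a small discrete posterior as a stand-in for $p(\theta \mid \mathcal{D})$ (so integrals become sums), evaluates the square-loss posterior risk on a grid of candidate estimates, and compares the minimizer with the posterior mean; all numbers are made up for illustration.

```python
def square_loss_risk(estimate, values, probs):
    """Posterior risk sum_i p_i * (theta_i - estimate)^2."""
    return sum(p * (v - estimate) ** 2 for v, p in zip(values, probs))


# Hypothetical discrete posterior over three parameter values.
values = [0.2, 0.5, 0.9]
probs = [0.5, 0.3, 0.2]

posterior_mean = sum(v * p for v, p in zip(values, probs))

# Brute-force search over candidate estimates in [0, 1].
candidates = [i / 1000 for i in range(1001)]
best = min(candidates, key=lambda c: square_loss_risk(c, values, probs))

print(posterior_mean, best)
```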

## Bayesian Point Estimation: Absolute Loss
- **Loss:** $\ell(\theta, \hat{\theta})=|\theta-\hat{\theta}|$  
- The **Bayes action** for **absolute loss** is the **posterior median**.  
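
One way to see why the median appears, sketched under the assumption that the posterior has a continuous density (taken to be zero outside $\Theta$): split the posterior risk at $\hat{\theta}$ and differentiate. The boundary terms vanish because the integrand is zero at $\theta = \hat{\theta}$.

$$
\begin{aligned}
r(\hat{\theta}) &=\int_{-\infty}^{\hat{\theta}}(\hat{\theta}-\theta) p(\theta \mid \mathcal{D}) d \theta+\int_{\hat{\theta}}^{\infty}(\theta-\hat{\theta}) p(\theta \mid \mathcal{D}) d \theta \\
\frac{d r(\hat{\theta})}{d \hat{\theta}} &=\mathbb{P}(\theta \leqslant \hat{\theta} \mid \mathcal{D})-\mathbb{P}(\theta \geqslant \hat{\theta} \mid \mathcal{D})
\end{aligned}
$$

Setting the derivative to zero gives $\mathbb{P}(\theta \leqslant \hat{\theta} \mid \mathcal{D})=\mathbb{P}(\theta \geqslant \hat{\theta} \mid \mathcal{D})=\frac{1}{2}$, which is the defining property of a median.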

## Bayesian Point Estimation: Zero-One Loss
- Suppose $\Theta$ is discrete  
- **Zero-one loss:** $\ell(\theta, \hat{\theta})=1(\theta \neq \hat{\theta})$  
$$
\begin{aligned}
r(\hat{\theta}) &=\mathbb{E}[1(\theta \neq \hat{\theta}) \mid \mathcal{D}] \\
&=\mathbb{P}(\theta \neq \hat{\theta} \mid \mathcal{D}) \\
&=1-\mathbb{P}(\theta=\hat{\theta} \mid \mathcal{D}) \\
&=1-p(\hat{\theta} \mid \mathcal{D})
\end{aligned}
$$  
- The Bayes action is:  
$$
\hat{\theta}=\underset{\theta \in \Theta}{\arg \max } p(\theta \mid \mathcal{D})
$$  
- This $\hat{\theta}$ is called the maximum a posteriori (MAP) estimate  
- The MAP estimate is the mode of the posterior distribution.  
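
For a discrete posterior, the zero-one risk of each candidate is one minus its posterior probability, so minimizing risk and maximizing posterior probability pick the same value. A minimal sketch with a made-up three-point posterior:

```python
def zero_one_risk(estimate, posterior):
    """Posterior risk under zero-one loss: 1 - p(estimate | D)."""
    return 1.0 - posterior.get(estimate, 0.0)


# Hypothetical discrete posterior {theta: p(theta | D)}.
posterior = {0.25: 0.2, 0.5: 0.5, 0.75: 0.3}

risks = {theta: zero_one_risk(theta, posterior) for theta in posterior}
map_estimate = max(posterior, key=posterior.get)  # posterior mode
min_risk_estimate = min(risks, key=risks.get)     # Bayes action

print(map_estimate, min_risk_estimate, risks)
```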


# Summary  
- Prior represents belief about $\theta$ before observing data $D$.  
- Posterior represents the rationally “updated” beliefs after seeing $D$.  
- All inferences and action-taking are based on the posterior distribution.  
- In the Bayesian approach,   
  - No issue of “choosing a procedure” or justifying an estimator.   
  - Only choices are the prior and the likelihood model. 
  - For decision making, need a loss function. 
  - Everything after that is computation.
