# Linear Probabilistic Models vs GLMs  
- Most books and software libraries related to this topic are actually about 
  - generalized linear models (GLMs). 
- They’re “special” because 
  - they’re a restriction of our setting, but more importantly 
  - we can state theorems for GLMs, and 
  - all GLMs can be implemented in essentially the same way. 

# Generalized Regression  
- Given $x$, predict the probability distribution $p(y \mid x)$
- How do we represent the probability distribution?   
- We’ll consider parametric families of distributions  
  - distribution represented by parameter vector  
- Examples:   
  - Logistic regression (Bernoulli distribution) 
  - Probit regression (Bernoulli distribution) 
  - Poisson regression (Poisson distribution) 
  - Linear regression (Normal distribution, fixed variance) 
  - Generalized Linear Models (GLM) (encompasses all of the above) 
  - Generalized Additive Models (GAM) 
  - Gradient Boosting Machines (GBM) / AnyBoost   
  
# Bernoulli Regression  
## Probabilistic Binary Classifiers
- Setting: $X =R^d$, $Y = \{0,1\}$  
- For each $x$, need to predict a distribution on $Y = \{0,1\}$  
- How can we define a distribution supported on $\{0,1\}$?  
- Sufficient to specify the Bernoulli parameter $\theta = p(y = 1)$.  
- We can refer to this distribution as Bernoulli($\theta$).  

## Linear Probabilistic Classifiers  
- Setting: $X =R^d$, $Y = \{0,1\}$  
- Want prediction function to map each $x \in R^d$ to the right $\theta \in [0,1]$  
- We first extract information from $x \in R^d$ and summarize it in a single number.  
  - That number is analogous to the score in classification.  
- For a linear method, this extraction is done with a linear function  
$$\underbrace{x}_{\in \mathbf{R}^{d}} \mapsto \underbrace{w^{T} x}_{\in \mathbf{R}}$$
- As usual, $x \mapsto w^Tx$ will include affine functions if we include a constant feature in $x$  
- $w^Tx$ is called the linear predictor  
- Still need to map this to $[0,1]$.




## The Transfer Function
Need a function to map the linear predictor in $R$ to $[0,1]$:  
$$\underbrace{x}_{\in \mathbf{R}^{d}} \mapsto \underbrace{w^{T} x}_{\in \mathbf{R}} \mapsto \underbrace{f\left(w^{T} x\right)}_{\in[0,1]}=\theta$$  
where $f : R \to [0,1]$. We’ll call $f$ the transfer function.  
So the prediction function is $x \mapsto f(w^Tx)$, which gives the value of $\theta = p(y = 1 \mid x)$.   
### Terminology Alert  
In generalized linear models (GLMs), if $\theta$ is the distribution mean, then $f$ is called the response function or inverse link function. Transfer function is not standard terminology, but we’re avoiding the heavy set of definitions needed for a full development of GLMs, which are actually more restrictive than our current framework.

## Transfer Functions for Bernoulli
Two commonly used transfer functions to map from $w^Tx$ to $\theta$:  
<div align="center"><img src = "./transfer.jpg" width = '500' height = '100' align = center /></div>    
- Logistic function $$f(\eta)=\frac{1}{1+e^{-\eta}}$$  
- Normal CDF  
$$f(\eta)=\int_{-\infty}^{\eta} \frac{1}{\sqrt{2 \pi}} e^{-x^{2} / 2} \, dx$$  
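
As a rough illustration, the two transfer functions above can be sketched in plain Python. The function names `logistic` and `normal_cdf` are chosen here for illustration only; the normal CDF is written using the identity $\Phi(\eta) = \frac{1}{2}\left[1 + \operatorname{erf}(\eta/\sqrt{2})\right]$.

```python
import math


def logistic(eta: float) -> float:
    """Logistic transfer function: maps a real number into (0, 1)."""
    # Branch on the sign of eta so that exp() is never called on a large
    # positive argument (which would overflow).
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def normal_cdf(eta: float) -> float:
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


# Both functions map the linear predictor eta = w^T x to a Bernoulli parameter.
for eta in (-2.0, 0.0, 2.0):
    print(eta, logistic(eta), normal_cdf(eta))
```

Both functions are increasing and equal $1/2$ at $\eta = 0$; using the logistic function gives logistic regression, and using the normal CDF gives probit regression.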

## Learning  
- $X =R^d$  
- $Y = \{0,1\}$  
- $A = [0,1]$ (Representing Bernoulli($\theta$) distributions by $\theta \in [0,1]$)   
- $H =\{x \mapsto f (w^Tx) \mid w \in R^d\}$	(Each prediction function represented by $w \in R^d$.)   
- We can choose $w$ using maximum likelihood...

## Bernoulli Regression: Likelihood Scoring
Suppose we have data $D = \{(x_1,y_1),...,(x_n,y_n)\}$.  
Compute the model likelihood for $D$:  
$$\begin{aligned}
p_{w}(\mathcal{D}) &=\prod_{i=1}^{n} p_{w}\left(y_{i} \mid x_{i}\right)[\text { by independence }] \\
&=\prod_{i=1}^{n}\left[f\left(w^{T} x_{i}\right)\right]^{y_{i}}\left[1-f\left(w^{T} x_{i}\right)\right]^{1-y_{i}}
\end{aligned}$$  
Easier to work with the log-likelihood:
$$\log p_{w}(\mathcal{D})=\sum_{i=1}^{n} y_{i} \log f\left(w^{T} x_{i}\right)+\left(1-y_{i}\right) \log \left[1-f\left(w^{T} x_{i}\right)\right]$$  

## Bernoulli Regression: MLE
Maximum Likelihood Estimation (MLE) finds $w$ maximizing $\log p_w(D)$.   
Equivalently, minimize the negative log-likelihood objective function   
$$J(w)=-\left[\sum_{i=1}^{n} y_{i} \log f\left(w^{T} x_{i}\right)+\left(1-y_{i}\right) \log \left[1-f\left(w^{T} x_{i}\right)\right]\right]$$  
For differentiable $f$, $J(w)$ is differentiable, and we can use our standard tools (e.g. gradient descent or SGD).
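
A minimal sketch of this procedure, assuming the logistic transfer function and plain gradient descent. The synthetic data, the step size, and the names `sigmoid`, `nll`, and `nll_grad` are all assumptions made for illustration. With logistic $f$, the gradient of $J$ simplifies to $\sum_i \left(f(w^Tx_i) - y_i\right) x_i$.

```python
import numpy as np


def sigmoid(eta):
    """Logistic transfer function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-eta))


def nll(w, X, y):
    """Negative log-likelihood J(w) for Bernoulli regression with logistic f."""
    eta = X @ w
    # -log f(eta) = log(1 + e^{-eta}) and -log(1 - f(eta)) = log(1 + e^{eta})
    return np.sum(y * np.logaddexp(0.0, -eta) + (1 - y) * np.logaddexp(0.0, eta))


def nll_grad(w, X, y):
    """Gradient of J(w) for logistic f: sum_i (f(w^T x_i) - y_i) x_i."""
    return X.T @ (sigmoid(X @ w) - y)


# Synthetic data (illustrative assumption); the last column is a constant feature.
rng = np.random.default_rng(0)
n, d = 200, 3
X = np.hstack([rng.normal(size=(n, d - 1)), np.ones((n, 1))])
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

# Plain gradient descent on J(w); the step size is a hand-picked assumption.
w = np.zeros(d)
step = 0.5 / n
for _ in range(2000):
    w = w - step * nll_grad(w, X, y)

print("estimated w:", w)
print("J(w) at estimate:", nll(w, X, y))
```

For a different differentiable transfer function, such as the normal CDF, only `nll` and `nll_grad` would need to change.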

# Poisson Regression  
## Setup  
- Input space $X =R^d$, Output space $Y = \{0,1,2,3,4,...\}$  
- In Poisson regression, prediction functions produce a **Poisson distribution**  
  - Represent $\text{Poisson}(\lambda)$ distribution by the mean parameter $\lambda \in(0,\infty)$.   
- Action space $A = (0,\infty)$  
- In Poisson Regression, $x$ enters linearly, $$x \mapsto \underbrace{w^{T} x}_{R} \mapsto \lambda=\underbrace{f\left(w^{T} x\right)}_{(0, \infty)}$$  
- What can we use as the transfer function $f$?  
- Standard approach is to take $$f\left(w^{T} x\right)=\exp \left(w^{T} x\right)$$  

**Complementary**  
Poisson distribution:  
If we want to model the number of occurrences of an event per unit of time, we can use the **Poisson distribution**  
e.g.  
- The number of customers entering a store in 1 hour 
- The number of vehicles crossing an intersection in 10 minutes   

Now we want to model the relationship between $X$ and $Y$, where $Y = \{0,1,2,3,4,...\}$ and $y$ follows a Poisson distribution.   
We cannot simply use linear regression to predict $y$ from $x$, since $y$ is a discrete count while the output of linear regression is continuous.  
A better idea is to use $x$ to predict $\lambda$, the parameter of the Poisson distribution. Once we know $\lambda$, we have the full distribution of $y$.  

****  
For any sample $(x_i,y_i)$  
$$P(y_i) = \frac{\lambda_i ^{y_i}}{y_i!}e^{-\lambda_i}$$  
Calculating the likelihood  
$$L = \prod_{i = 1}^n \frac{\lambda_i^{y_i}}{y_i!}e^{-\lambda_i}$$  
Log-likelihood  
$$\log L = \sum_{i = 1}^n \left[ y_i\log(\lambda_i) - \log(y_i!) - \lambda_i \right]$$  
Let $\lambda_i = f(w^Tx_i) = e^{w^Tx_i}$  
then we get  
$$\log L(w) = \sum_{i = 1}^n \left[ y_iw^Tx_i - \log(y_i!) - e^{w^Tx_i} \right]$$  
We can then maximize $\log L(w)$ over $w$ with a standard optimization method such as gradient ascent.
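
A minimal sketch of fitting this model by gradient descent on the negative log-likelihood, assuming the exponential transfer function. The synthetic data, the step size, and the names `poisson_nll` and `poisson_nll_grad` are illustrative assumptions. The term $\log(y_i!)$ does not depend on $w$, so it is dropped from the objective.

```python
import numpy as np


def poisson_nll(w, X, y):
    """Negative Poisson log-likelihood, up to the constant sum_i log(y_i!)."""
    eta = X @ w
    return np.sum(np.exp(eta) - y * eta)


def poisson_nll_grad(w, X, y):
    """Gradient of the objective above: sum_i (exp(w^T x_i) - y_i) x_i."""
    return X.T @ (np.exp(X @ w) - y)


# Synthetic count data (illustrative assumption); second column is a constant feature.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([rng.uniform(-1.0, 1.0, size=n), np.ones(n)])
w_true = np.array([0.8, 0.3])
y = rng.poisson(np.exp(X @ w_true))

# Plain gradient descent; the step size is a hand-picked assumption.
w = np.zeros(2)
step = 0.1 / n
for _ in range(5000):
    w = w - step * poisson_nll_grad(w, X, y)

print("estimated w:", w)
print("predicted rate at x = 0.5:", np.exp(w @ np.array([0.5, 1.0])))
```

Minimizing this objective is the same as maximizing $\log L(w)$, since the two differ only by a sign and a constant.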

 

# Conditional Gaussian Regression  
## Gaussian Linear Regression  
- Input space $X =R^d$, Output space $Y =R$  
- In Gaussian regression, prediction functions produce a distribution $N(\mu, \sigma^2)$  
  - Assume $\sigma^2$ is known  
- Represent $N(\mu,\sigma^2)$ by the mean parameter $\mu\in R$.  
- Action space $A = R$  
- In Gaussian linear regression, $x$ enters **linearly**  
$$x \mapsto \underbrace{w^{T} x}_{\mathbf{R}} \mapsto \mu=\underbrace{f\left(w^{T} x\right)}_{R}$$  
- Since $\mu \in R$, we can take the identity transfer function $f(w^Tx) = w^Tx$  

## Gaussian Regression: Likelihood Scoring
- Suppose we have data $D = \{(x_1,y_1),...,(x_n,y_n)\}$  
- Compute the model log-likelihood for $D$:  
$$\begin{array}{c}
\sum_{i=1}^{n} \log p_{w}\left(y_{i} \mid x_{i}\right) \\
=\sum_{i=1}^{n} \log \left[\frac{1}{\sigma \sqrt{2 \pi}} \exp \left(-\frac{\left(y_{i}-w^{T} x_{i}\right)^{2}}{2 \sigma^{2}}\right)\right] \\
=\underbrace{\sum_{i=1}^{n} \log \left[\frac{1}{\sigma \sqrt{2 \pi}}\right]}_{\text {independent of } w}+\sum_{i=1}^{n}\left(-\frac{\left(y_{i}-w^{T} x_{i}\right)^{2}}{2 \sigma^{2}}\right)
\end{array}$$  
- The first term and the positive factor $\frac{1}{2\sigma^2}$ do not affect the maximizer, so the MLE is
$$w^{*}=\underset{w \in \mathbf{R}^{d}}{\arg \min } \sum_{i=1}^{n}\left(y_{i}-w^{T} x_{i}\right)^{2}$$  
- This is exactly the objective function for least squares  
- From here, we can use the usual approaches to solve for $w^*$ (SGD, linear algebra, calculus, etc.)
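
As a quick check of this equivalence, the sketch below solves the least squares problem with `np.linalg.lstsq` and evaluates the Gaussian log-likelihood at the solution. The synthetic data, the value of `sigma`, and the helper name `gaussian_log_lik` are assumptions for illustration.

```python
import numpy as np


def gaussian_log_lik(w, X, y, sigma):
    """Conditional Gaussian log-likelihood with known standard deviation sigma."""
    resid = y - X @ w
    const = -len(y) * np.log(sigma * np.sqrt(2.0 * np.pi))  # independent of w
    return const - np.sum(resid ** 2) / (2.0 * sigma ** 2)


# Synthetic data (illustrative assumption); the last column is a constant feature.
rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([rng.normal(size=(n, d - 1)), np.ones((n, 1))])
w_true = np.array([2.0, -1.0, 0.5])
sigma = 0.3  # assumed known
y = X @ w_true + sigma * rng.normal(size=n)

# Least squares solution, which is also the maximum likelihood estimate.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

print("w_star:", w_star)
print("log-likelihood at w_star:", gaussian_log_lik(w_star, X, y, sigma))
print("log-likelihood at w_true:", gaussian_log_lik(w_true, X, y, sigma))
```

The log-likelihood at `w_star` is at least as large as at any other $w$, including `w_true`, because `w_star` minimizes the sum of squared residuals.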



# Multinomial Logistic Regression  
- Setting: $X =R^d$, $Y = \{1,...,k\}$  
- For each $x$, we want to produce a distribution on $k$ classes.
- Such a distribution is called a “multinoulli” or “categorical” distribution.  
- Represent categorical distribution by probability vector $\theta = (\theta_1,...,\theta_k) \in R^k$  
  - $\sum_{i = 1}^k \theta_i = 1$, and $\theta_i \geq 0$  
- So $\forall y \in \{1,2,...,k\}$, $p(y) = \theta_y$  
- From each x, we compute a linear score function for each class
$$x \mapsto\left(\left\langle w_{1}, x\right\rangle, \ldots,\left\langle w_{k}, x\right\rangle\right) \in \mathbf{R}^{k}$$  
for parameter vectors $w_1,...,w_k \in R^d$  
We need to map this vector in $R^k$ to a probability vector  
- Using softmax function  
$$\left(\left\langle w_{1}, x\right\rangle, \ldots,\left\langle w_{k}, x\right\rangle\right) \mapsto \theta=\left(\frac{\exp \left(w_{1}^{T} x\right)}{\sum_{i=1}^{k} \exp \left(w_{i}^{T} x\right)}, \cdots, \frac{\exp \left(w_{k}^{T} x\right)}{\sum_{i=1}^{k} \exp \left(w_{i}^{T} x\right)}\right)$$  
- Note that $\theta \in R^k$ and  
$$\begin{aligned}
\theta_{i} &>0 \quad i=1, \ldots, k \\
\sum_{i=1}^{k} \theta_{i} &=1
\end{aligned}$$  
Putting this together, we write multinomial logistic regression as
$$p(y \mid x)=\frac{\exp \left(w_{y}^{T} x\right)}{\sum_{i=1}^{k} \exp \left(w_{i}^{T} x\right)}$$  

- Do we still see score functions in here?  
- Can view $x \mapsto w_y^Tx$ as score for class $y$, for $y \in \{1,...,k\}$  
- How do we do learning here? What parameters are we estimating?   
- Our model is specified once we have $w_1,...,w_k \in R^d$  
- Find parameter settings maximizing the log-likelihood of data $D$.  
- This objective function is concave in the $w$’s and straightforward to optimize  
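
The softmax map and the resulting negative log-likelihood can be sketched as follows. The names `softmax` and `neg_log_lik` and the example numbers are assumptions for illustration, and classes are indexed from 0 in the code rather than from 1 as in the text. Subtracting the maximum score before exponentiating is a common numerical safeguard that leaves the result unchanged.

```python
import numpy as np


def softmax(scores):
    """Map a vector (or rows of a matrix) of scores to probability vectors."""
    # Shifting by the maximum does not change the output but avoids overflow.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def neg_log_lik(W, X, y):
    """Negative log-likelihood; W has shape (k, d), y holds labels in {0, ..., k-1}."""
    theta = softmax(X @ W.T)  # shape (n, k): one probability vector per example
    return -np.sum(np.log(theta[np.arange(len(y)), y]))


# One input with k = 3 classes and d = 2 features (illustrative numbers).
x = np.array([1.0, 2.0])
W = np.array([[0.5, -0.5],
              [0.0, 0.0],
              [-1.0, 1.0]])
theta = softmax(W @ x)  # class scores are (-0.5, 0.0, 1.0)
print("theta:", theta, "sum:", theta.sum())

# Negative log-likelihood on a tiny dataset of two labeled examples.
X = np.array([[1.0, 2.0], [0.0, 1.0]])
y = np.array([2, 0])
print("negative log-likelihood:", neg_log_lik(W, X, y))
```

Learning then amounts to minimizing `neg_log_lik` over `W`, for example with gradient descent.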


# Maximum Likelihood as ERM  
## Conditional Probability Modeling as Statistical Learning
- Input space $X$  
- Outcome space $Y$  
- All pairs $(x,y)$ are drawn i.i.d. from the distribution $P_{X \times Y}$.  
- Action space $A = \{p(y) \mid p \text{ is a probability density or mass function on } Y\}$.  
- Hypothesis space $F$ contains prediction functions $f : X\to A$.  
  - Given an $x \in X$, predict a probability distribution $p(y)$ on $Y$.  
- Maximum likelihood estimation for dataset $D=((x_1,y_1),...,(x_n,y_n))$  
$$\hat{f}_{\mathrm{MLE}}=\underset{f \in \mathcal{F}}{\arg \max } \sum_{i=1}^{n} \log \left[f\left(x_{i}\right)\left(y_{i}\right)\right]$$  
- Take loss $\ell : A \times Y \to R$ for a predicted PDF or PMF $p(y)$ and outcome $y$ to be  
$$\ell(p, y)=-\log p(y)$$  
The risk of decision function $f : X\to A$ is  
$$R(f)=-\mathbb{E}_{x, y} \log [f(x)(y)]$$  
where $f(x)$ is a PDF or PMF on $Y$, and we’re evaluating it on $y$.

The empirical risk of $f$ for a sample $D = ((x_1,y_1),...,(x_n,y_n))$ is  
$$\hat{R}(f)=-\frac{1}{n} \sum_{i=1}^{n} \log \left[f\left(x_{i}\right)\left(y_{i}\right)\right]$$  
This is called the **negative conditional log-likelihood**.  
Thus for the **negative log-likelihood loss**, ERM and MLE are equivalent.
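
To spell out the last step, assuming ERM and MLE are both carried out over the same hypothesis space $F$: multiplying the objective by the positive constant $\frac{1}{n}$ does not change the optimizer, and minimizing a function is the same as maximizing its negative, so  
$$\underset{f \in F}{\arg \min }\; \hat{R}(f)=\underset{f \in F}{\arg \min }\left(-\frac{1}{n} \sum_{i=1}^{n} \log \left[f\left(x_{i}\right)\left(y_{i}\right)\right]\right)=\underset{f \in F}{\arg \max } \sum_{i=1}^{n} \log \left[f\left(x_{i}\right)\left(y_{i}\right)\right]=\hat{f}_{\mathrm{MLE}}$$  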

