# Generalized Linear Models

We have devoted a great deal of attention to the linear model:

$$Y \sim \mathcal{N}(x \cdot \beta, \sigma^2)$$

In lecture we also learned about logistic regression, which can be framed in the following form:

$$
\begin{align*}
Y \sim \textrm{Bernoulli}(p)\\
\log\left( \frac{p}{1-p}\right) = x \cdot \beta
\end{align*} 
$$

Note that in this case $\mathbb{E}(Y|x) = p$.

We can see that in both cases, some function of the conditional outcome means is a linear function of the predictor variables, i.e.

$$
g\left(\mathbb{E}(Y|x)\right) = x \cdot \beta
$$

We also make an assumption on how $(Y|x)$ is distributed:  normal in one case, Bernoulli in the other.  In both cases, these distributions are part of what are known as *exponential families*.  These families of distributions include most of the named distributions you are probably familiar with.  We will see that the maximum likelihood estimation of the parameters is particularly nice for these distributions as well.



### Natural Exponential Families

Let $h:\mathbb{R} \to [0,\infty)$.  The **natural exponential family** of distributions associated with $h$ is a collection of distributions parameterized by a real number $\theta$.  The probability density function of the distribution with parameter $\theta$ is

$$
f_Y(y;\theta) = \exp\left(y\theta - A(\theta)\right)h(y)
$$

where $A(\theta)$ is chosen to make the integral with respect to $y$ equal to $1$:

$$
\begin{align*}
1 &= \int_{-\infty}^{\infty} \exp\left( y\theta - A(\theta)\right)h(y) \mathop{dy}\\
\exp\left(A(\theta)\right) &= \int_{-\infty}^{\infty} \exp\left( y\theta\right)h(y) \mathop{dy}\\
A(\theta) &= \log \int_{-\infty}^{\infty} \exp\left(y\theta\right)h(y) \mathop{dy}\\
\end{align*}
$$

Some terminology:

* We call $\theta$ the *natural parameter*.
* We call $h(y)$ the *base density*.
* We call $A$ the *cumulant function* or *log-partition function*.

Note:  We are sacrificing some generality for the sake of clarity here.  More generally, an exponential family of distributions has the form 

$$
f_Y(y ; \theta) = \exp(T(y) \cdot \eta(\theta) - A(\theta)) h(y)
$$

where $\theta \in \mathbb{R}^s$, $\eta: \mathbb{R}^s \to \mathbb{R}^d$ and $T:\mathbb{R} \to \mathbb{R}^d$. 

#### Example:  Normal distributions with known variance

Say that the variance $\sigma^2$ is known and consider the family

$$
\begin{align*}
f_Y(y) &= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left( -\frac{(y-\mu)^2}{2\sigma^2}\right)\\
&= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left( -\frac{y^2 - 2y\mu +\mu^2}{2\sigma^2}\right)\\
&= \exp\left(y (\frac{\mu}{\sigma^2}) - \frac{1}{2}\sigma^2 (\frac{\mu}{\sigma^2})^2\right) \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{y^2}{2\sigma^2}\right)\\
\end{align*}
$$

So the normal distributions with a known variance form a natural exponential family with 

* $\theta = \mu/\sigma^2$
* $h(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{y^2}{2\sigma^2}\right)$
* $A(\theta) = \frac{1}{2}\sigma^2 \theta^2$
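
As a quick numerical sanity check of this factorization, the sketch below compares the usual normal density with the form $\exp(y\theta - A(\theta))h(y)$.  The function names and the values of $\mu$ and $\sigma^2$ are illustrative choices, not part of the notes.

```python
import math

def normal_pdf(y, mu, sigma2):
    """Standard form of the N(mu, sigma2) density."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def normal_nef_pdf(y, theta, sigma2):
    """The same density written as exp(y*theta - A(theta)) * h(y)."""
    A = 0.5 * sigma2 * theta ** 2  # cumulant function
    h = math.exp(-y ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)  # base density
    return math.exp(y * theta - A) * h

mu, sigma2 = 1.5, 4.0  # arbitrary illustrative values
theta = mu / sigma2    # natural parameter

# Largest disagreement between the two forms over a few test points
max_gap = max(
    abs(normal_pdf(y, mu, sigma2) - normal_nef_pdf(y, theta, sigma2))
    for y in [-3.0, -1.0, 0.0, 0.5, 2.0, 6.0]
)
print(max_gap)  # expected to be at floating-point rounding level
```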


#### Example: Bernoulli distributions

A discrete random variable $Y$ is Bernoulli distributed when $P(Y = 1) = p$ and $P(Y = 0) = 1-p$.  We can rewrite the pmf as follows:

$$
\begin{align*}
f_Y(y) &= p^y(1-p)^{1-y} \textrm{ with $y \in \{0,1\}$}\\
&= \exp(y\log(p) + (1-y)\log(1-p))\\
&= \exp(y\log(\frac{p}{1-p}) + \log(1-p))\\
\end{align*}
$$

To write this in the form of an exponential family we need to express $-\log(1-p)$ as a function of $\theta = \log(\frac{p}{1-p})$.  Let's first solve for $p$ as a function of $\theta$:

$$
\begin{align*}
\theta &= \log(\frac{p}{1-p})\\
e^\theta &= \frac{p}{1-p}\\
e^{-\theta} &= \frac{1-p}{p}\\
e^{-\theta} &= \frac{1}{p} - 1\\
p &= \frac{1}{1+e^{-\theta}}
\end{align*}
$$

Thus 

$$
\begin{align*}
-\log(1-p) &= -\log(1 - \frac{1}{1+e^{-\theta}})\\
&= -\log(\frac{e^{-\theta}}{1+e^{-\theta}})\\
&= -\log(\frac{1}{1+e^{\theta}})\\
&= \log(1+e^{\theta})\\
\end{align*}
$$


So the Bernoulli distributions are a natural exponential family with 

* $\theta = \log(\frac{p}{1-p})$
* $h(y) = 1$
* $A(\theta) = \log(1 + \exp(\theta))$
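
The same kind of check can be sketched for the Bernoulli family.  Here `bernoulli_nef_pmf` is a hypothetical helper and $p = 0.3$ is an arbitrary value; the code also confirms that $p$ is recovered from $\theta$ by the formula derived above.

```python
import math

def bernoulli_pmf(y, p):
    """Usual Bernoulli pmf p^y (1 - p)^(1 - y) for y in {0, 1}."""
    return p ** y * (1 - p) ** (1 - y)

def bernoulli_nef_pmf(y, theta):
    """The same pmf written as exp(y*theta - A(theta)) with h(y) = 1."""
    A = math.log(1 + math.exp(theta))  # cumulant function
    return math.exp(y * theta - A)

p = 0.3                              # arbitrary illustrative value
theta = math.log(p / (1 - p))        # natural parameter (log odds)
p_back = 1 / (1 + math.exp(-theta))  # invert theta -> p

bernoulli_gap = max(
    abs(bernoulli_pmf(y, p) - bernoulli_nef_pmf(y, theta)) for y in (0, 1)
)
print(p_back, bernoulli_gap)
```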

#### Example: Poisson distributions

Say that some event occurs at irregular time intervals.  These events are mutually independent.  If the average number of events in a unit time interval is $\lambda$, then it can be shown that the number of events $Y$ in any unit time interval follows a [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution) with pmf:

$$
f_Y(y) = \frac{1}{y!} \lambda^ye^{-\lambda} \textrm{ where $y \in \{0,1,2,3,\dots\}$}
$$

Let's try to write this in the form of a natural exponential family:

$$
\begin{align*}
\frac{1}{y!} \lambda^ye^{-\lambda} &= \frac{1}{y!} \exp(y \log(\lambda) - \lambda )
\end{align*}
$$

So the Poisson distributions form a natural exponential family with

* $\theta = \log(\lambda)$
* $h(y) = \frac{1}{y!}$
* $A(\theta) = e^\theta$ 
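
A short sketch can confirm this numerically.  For a discrete family the integral defining $A(\theta)$ becomes a sum over $y$, so the code also checks that $\log \sum_y e^{y\theta} h(y)$ agrees with $e^\theta$.  The truncation of the sum at $y = 99$ and the value $\lambda = 2.5$ are assumptions made for illustration.

```python
import math

lam = 2.5              # arbitrary illustrative rate
theta = math.log(lam)  # natural parameter

def poisson_pmf(y, lam):
    """Usual Poisson pmf."""
    return lam ** y * math.exp(-lam) / math.factorial(y)

def poisson_nef_pmf(y, theta):
    """The same pmf written as exp(y*theta - A(theta)) * h(y)."""
    return math.exp(y * theta - math.exp(theta)) / math.factorial(y)

poisson_gap = max(
    abs(poisson_pmf(y, lam) - poisson_nef_pmf(y, theta)) for y in range(20)
)

# Discrete analogue of A(theta) = log of the normalizing sum (truncated at y = 99)
A_numeric = math.log(sum(math.exp(y * theta) / math.factorial(y) for y in range(100)))
print(poisson_gap, A_numeric, math.exp(theta))
```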

### Results about the cumulant function $A(\theta)$

We will prove that for a natural exponential family, $A'(\theta) = \mathbb{E}(Y)$ and $A''(\theta) = \operatorname{Var}(Y)$.  We assume throughout that we may differentiate under the integral sign.

$$
\begin{align}
\exp\left(A(\theta)\right) &= \int_{-\infty}^{\infty} \exp\left( y\theta\right)h(y) \mathop{dy}\\
\frac{d}{d\theta}\exp\left(A(\theta)\right) &= \frac{d}{d\theta}\int_{-\infty}^{\infty} \exp\left( y\theta\right)h(y) \mathop{dy}\\
A'(\theta) \exp\left(A(\theta)\right) &= \int_{-\infty}^{\infty} y\exp\left( y\theta\right)h(y) \mathop{dy}\\
A'(\theta) &= \int_{-\infty}^{\infty} y\exp\left( y\theta - A(\theta)\right)h(y) \mathop{dy}\\
A'(\theta) &= \int_{-\infty}^{\infty} yf_Y(y)\mathop{dy}\\
A'(\theta) &= \mathbb{E}(Y)\\
\end{align}
$$

Starting at line (3) and differentiating with respect to $\theta$ again we have

$$
\begin{align*}
\frac{d}{d\theta} \left[A'(\theta) \exp\left(A(\theta)\right) \right] &= \frac{d}{d\theta}\int_{-\infty}^{\infty} y\exp\left( y\theta\right)h(y) \mathop{dy}\\
A''(\theta) \exp\left(A(\theta)\right) + A'(\theta)^2\exp\left(A(\theta)\right)  &= \int_{-\infty}^{\infty} y^2\exp\left( y\theta\right)h(y) \mathop{dy}\\
A''(\theta) + (\mathbb{E}(Y))^2 &= \mathbb{E}(Y^2)\\
A''(\theta) &= \operatorname{Var}(Y) 
\end{align*}
$$

Note also that $A''(\theta) > 0$ since the variance of a non-degenerate distribution is positive.  This implies that $A'$ is a strictly increasing function, and thus has a well defined inverse on its range.
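
These two identities can be checked numerically with finite differences.  The sketch below is only an illustration: the helper `derivatives`, the step size `eps`, and the parameter values are assumptions, and the cumulant functions are the ones found in the Bernoulli and Poisson examples.

```python
import math

def derivatives(A, theta, eps=1e-4):
    """Central finite-difference estimates of A'(theta) and A''(theta)."""
    d1 = (A(theta + eps) - A(theta - eps)) / (2 * eps)
    d2 = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps ** 2
    return d1, d2

# Bernoulli: A(theta) = log(1 + e^theta), with mean p and variance p(1 - p)
p = 0.3
bern_d1, bern_d2 = derivatives(
    lambda t: math.log(1 + math.exp(t)), math.log(p / (1 - p))
)

# Poisson: A(theta) = e^theta, with mean and variance both equal to lambda
lam = 2.5
pois_d1, pois_d2 = derivatives(math.exp, math.log(lam))

print(bern_d1, p, bern_d2, p * (1 - p))
print(pois_d1, lam, pois_d2, lam)
```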

**Definition**: We call $g = (A')^{-1}$ the **canonical link function** of the natural exponential family.

## Generalized Linear Models

**Definition**:  A **Generalized Linear Model with canonical link** is specified by choosing a natural exponential family for the target variable. We make the assumption that 

$$
g(\mathbb{E}(Y)) = \textbf{x} \cdot \beta
$$

where $g$ is the canonical link function of the family.

Note that since $g$ is the canonical link, we can also say that
$$
\begin{align*}
\textbf{x} \cdot \beta 
&= g(\mathbb{E}(Y))\\
&= g(A'(\theta))\\
&= \theta
\end{align*}
$$

In other words, when we are using a canonical link, the assumption we are making when we specify a GLM is that

$$
f_Y(y) = \exp(y \textbf{x} \cdot \beta - A(\textbf{x} \cdot \beta ))h(y)
$$

Note:  We are restricting ourselves to natural exponential families and canonical links for this notebook.  One can use more general links and more expansive definitions of exponential families, but the math gets a bit more complicated. 

#### Example:  Bernoulli Distribution $\to$ Logistic Regression

As noted previously, the Bernoulli distributions are a natural exponential family with 

* $\theta = \log(\frac{p}{1-p})$
* $h(y) = 1$
* $A(\theta) = \log(1 + \exp(\theta))$

Since $A'(\theta) = \frac{\exp(\theta)}{1 + \exp(\theta)} = \frac{1}{1+\exp(-\theta)}$, the canonical link function is the inverse of $A'$ which is $g(u) = \log(\frac{u}{1-u})$.

Thus the GLM with a Bernoulli distributed outcome and canonical link is specified by

$$
\begin{align*}
Y &\sim \operatorname{Bernoulli}(p)\\
\log(\frac{p}{1-p}) &= \textbf{x} \cdot \beta
\end{align*}
$$

This is exactly Logistic Regression:  the log odds are linear in the predictors, and the response is Bernoulli distributed.

#### Example:  Poisson Distributions $\to$ Poisson Regression

The Poisson distributions form a natural exponential family with

* $\theta = \log(\lambda)$
* $h(y) = \frac{1}{y!}$
* $A(\theta) = e^\theta$ 

Since $A'(\theta) = e^\theta$, the canonical link is its inverse, $g(u) = \log(u)$.  So the GLM with canonical link is specified by

$$
\begin{align*}
Y &\sim \operatorname{Poisson}(\lambda)\\
\log(\lambda) &= \textbf{x} \cdot \beta
\end{align*}
$$

[This Desmos graph](https://www.desmos.com/3d/yiiihjiudp) might help you to visualize what is going on with Poisson Regression.  This form of regression is often used when the outcome is a count.

Note:  Unfortunately vanilla OLS doesn't fit exactly into the simplified framework we are addressing in this notebook.  We would need to introduce an additional dispersion parameter into our definition of "generalized linear model" to get it to work out nicely.  You can try using the normal distribution with fixed variance, and you will get something which is very close to OLS linear regression but with some annoying additional factors of $\sigma^2$ sprinkled around.  Essentially, it models not the mean conditional response, but this mean conditional response when scaled down by a factor of $\sigma^2$.
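
To see where the extra factor of $\sigma^2$ comes from, here is a sketch of the computation for the normal family with known variance, using $\theta = \mu/\sigma^2$ and $A(\theta) = \frac{1}{2}\sigma^2\theta^2$ from the earlier example:

$$
\begin{align*}
A'(\theta) &= \sigma^2 \theta = \mu\\
g(u) &= (A')^{-1}(u) = \frac{u}{\sigma^2}\\
g(\mathbb{E}(Y)) &= \frac{\mu}{\sigma^2} = \textbf{x} \cdot \beta
\end{align*}
$$

So in this simplified framework the linear predictor equals $\mu/\sigma^2$ rather than $\mu$ itself.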

## Maximum likelihood estimation of GLMs with canonical links

Given a GLM, how can we use data to estimate the model parameters $\beta$?


Say that we have fixed predictors $\textbf{x}_i$ for $i=1,2,3, \dots, n$.  Let the corresponding random variable for the outcome be denoted $Y_i$.  Let $y_i$ be the observed outcomes from these distributions.

The likelihood function is then

$$
L(\beta) = \prod_{i=1}^n f_{Y_i}(y_i) = \prod_{i=1}^n \exp(y_i \textbf{x}_i \cdot \beta - A(\textbf{x}_i \cdot \beta))h(y_i)
$$

Then the negative log likelihood is

$$
\ell(\beta) = \sum_{i=1}^n \left(A(\textbf{x}_i \cdot \beta) - y_i \textbf{x}_i \cdot \beta\right) - \sum_{i=1}^n \log(h(y_i))
$$

We want to compute the gradient and Hessian of $\ell$.

Taking the partial derivative with respect to $\beta_j$ we obtain

$$
\begin{align*}
\partial_j \ell 
&= \sum_{i=1}^n \left( A'(\textbf{x}_i \cdot \beta) \textbf{x}_{ij} - y_i \textbf{x}_{ij} \right)\\
&= \sum_{i=1}^n \left(g^{-1}(\textbf{x}_i \cdot \beta)  - y_i \right)\textbf{x}_{ij}\\
\end{align*}
$$

so, packaging the $\textbf{x}_i$ into a matrix $X$ and $y_i$ into a vector $\vec{y}_{\textrm{obs}}$ as usual, we have

$$
\nabla \ell = X^\top (g^{-1}(X \beta) - \vec{y}_{\textrm{obs}})
$$

where $g^{-1}$ is applied coordinatewise.

Introducing the notation $g^{-1}(X \beta) = \hat{y}$ (reasonable since it is the vector of predicted outcome means), we can write this as 

$$
\nabla \ell = X^\top (\hat{y} - \vec{y}_{\textrm{obs}})
$$

In other words, "The gradient of the negative log likelihood is $X^\top$ applied to the deviations of the observed outcomes from the predicted outcome means".  This simple form of the gradient makes gradient descent super easy to implement for fitting GLMs.

In the case of Logistic Regression this is $\nabla \ell = X^\top (\sigma (X\beta) - \vec{y}_{\textrm{obs}})$, where $\sigma(t) = \frac{1}{1+e^{-t}}$ denotes the sigmoid function (not a standard deviation), as claimed in the lecture notebook!

For Poisson regression it would be $\nabla \ell = X^\top (\exp (X\beta) - \vec{y}_{\textrm{obs}})$.
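
To illustrate how little code this gradient needs, here is a minimal gradient descent sketch for Poisson regression using NumPy.  The helper names (`nll_gradient`, `fit_glm_gd`), the fixed step size, the iteration count, and the simulated data are all assumptions made for illustration; this is not a tuned or production-quality fitting routine.

```python
import numpy as np

def nll_gradient(beta, X, y, inv_link):
    """Gradient of the negative log likelihood: X^T (g^{-1}(X beta) - y)."""
    return X.T @ (inv_link(X @ beta) - y)

def fit_glm_gd(X, y, inv_link, step=0.1, n_iter=2000):
    """Plain gradient descent with a fixed step size (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Divide by n so the same step size behaves similarly for any sample size
        beta -= step * nll_gradient(beta, X, y, inv_link) / len(y)
    return beta

# Simulated Poisson regression data (illustrative values only)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.uniform(-1, 1, size=500)])
beta_true = np.array([0.5, 1.0])
y = rng.poisson(np.exp(X @ beta_true))

# For Poisson regression the inverse canonical link is exp
beta_hat = fit_glm_gd(X, y, inv_link=np.exp)
grad_norm = np.linalg.norm(nll_gradient(beta_hat, X, y, np.exp))
print(beta_hat, grad_norm)
```

Passing a sigmoid function as `inv_link` instead of `np.exp` would give the corresponding sketch for Logistic Regression, since only $g^{-1}$ changes between the two models.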

We can also obtain the Hessian

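$$
\nabla^2 \ell = X^\top \operatorname{diag}(A''(X\beta)) X
$$

which is positive semidefinite because the diagonal entries $A''(\textbf{x}_i \cdot \beta)$ are variances, and positive definite when these are strictly positive and $X$ has full column rank.  This shows that fitting the GLM through MLE is a smooth convex optimization problem.  When $X$ has full column rank the problem is strictly convex, which gives us uniqueness of the minimizing parameters (assuming they exist).  Note that it is possible that $\ell$ has no global minimum, as occurs in the case of perfect separation in Logistic Regression.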
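
One way to put the Hessian to use is Newton's method, which is not covered in these notes and is shown here only as a sketch.  The code below applies it to Logistic Regression on simulated, non-separated data, then inspects the smallest eigenvalue of the Hessian at the fitted parameters.  The function name `gradient_and_hessian`, the data, and the fixed number of iterations are assumptions for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gradient_and_hessian(beta, X, y):
    """Gradient and Hessian of the logistic negative log likelihood."""
    p = sigmoid(X @ beta)                      # A'(X beta): predicted means
    grad = X.T @ (p - y)                       # X^T (y_hat - y_obs)
    hess = X.T @ (X * (p * (1 - p))[:, None])  # X^T diag(A''(X beta)) X
    return grad, hess

# Simulated data (illustrative values only)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(300), rng.normal(size=300)])
y = rng.binomial(1, sigmoid(X @ np.array([-0.5, 1.5])))

# Undamped Newton iterations; a fixed count is used for simplicity
beta = np.zeros(2)
for _ in range(25):
    grad, hess = gradient_and_hessian(beta, X, y)
    beta = beta - np.linalg.solve(hess, grad)

grad, hess = gradient_and_hessian(beta, X, y)
grad_norm = np.linalg.norm(grad)
min_eig = np.linalg.eigvalsh(hess).min()  # positive when the Hessian is positive definite
print(beta, grad_norm, min_eig)
```

With perfectly separated data this loop would not settle down: the entries of `p * (1 - p)` shrink toward zero and the parameters grow without bound, which is the failure mode described above.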
