# Generalized linear model

A generalized linear model (GLM) has two essential components: a distribution for the response $Y_i$ and a link function that relates the mean of $Y_i$ to its covariates $X_i$.

### Exponential family distribution
In a GLM, the distribution of $Y_i$ belongs to the exponential family of distributions, with density of the form
$$
  f(y \mid \theta, \phi) = \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right],
$$
where $\theta$ is called the **canonical parameter** or **natural parameter** and represents the location, while $\phi$ is the **dispersion parameter** and represents the scale. Note that the canonical parameter $\theta$ is not necessarily the mean $\mu$. Examples of exponential family distributions include the normal, inverse Gaussian, binomial, Poisson, and gamma distributions. The mean and variance of an exponential family distribution follow from differentiating the normalization identity with respect to $\theta$ under the integral sign.
\begin{eqnarray*}
  \int f(y \mid \theta, \phi) \ dy &=& \int \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right] dy 
  = 1\\
  \frac{\partial}{\partial \theta} \int f(y \mid \theta, \phi) \ dy &=& 
  \int \frac{\partial}{\partial \theta} \ f(y \mid \theta, \phi) \ dy = 0 \\
  &=& \int \frac{y - b'(\theta)}{a(\phi)} \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right] dy \\
  \therefore \mathbb{E}Y &=& \mu = b'(\theta) \\
  \frac{\partial^2}{\partial \theta^2} \int f(y \mid \theta, \phi) \ dy &=& 
  \int \frac{\partial^2}{\partial \theta^2} \ f(y \mid \theta, \phi) \ dy = 0 \\
  &=& \int \frac{(y - b'(\theta))^2 - b''(\theta)a(\phi)}{a(\phi)^2} 
  \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right] dy \\ 
  &=& \int \frac{y^2 - 2 y b'(\theta) + b'(\theta)^2 - b''(\theta)a(\phi)}{a(\phi)^2} 
  \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right] dy \\ 
  \mathbb{E}Y^2 &=& b'(\theta)^2 + b''(\theta)a(\phi) \\
  \therefore \operatorname{Var}Y &=& \sigma^2 = b''(\theta) a(\phi)
\end{eqnarray*}
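
As a quick numerical sanity check of $\mathbb{E}Y = b'(\theta)$ and $\operatorname{Var}Y = b''(\theta) a(\phi)$, the sketch below uses the Poisson distribution as an assumed example, for which $\theta = \log \lambda$, $b(\theta) = e^\theta$, $a(\phi) = 1$, and $c(y, \phi) = -\log y!$. The value $\lambda = 3.5$, the finite-difference step, and the truncation of the support are illustrative choices, not part of the derivation.

```python
import math

# Poisson(lam) in exponential family form (assumed example):
#   theta = log(lam), b(theta) = exp(theta), a(phi) = 1, c(y, phi) = -log(y!)
lam = 3.5
theta = math.log(lam)
b = math.exp

# Central finite differences approximating b'(theta) and b''(theta).
h = 1e-3
b_prime = (b(theta + h) - b(theta - h)) / (2 * h)
b_double_prime = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2


def poisson_pmf(y, lam):
    """Poisson probability mass function, computed on the log scale."""
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))


# Moments computed directly from the pmf; the tail beyond y = 200 is
# negligible for lam = 3.5.
support = range(201)
mean = sum(y * poisson_pmf(y, lam) for y in support)
second_moment = sum(y**2 * poisson_pmf(y, lam) for y in support)
variance = second_moment - mean**2

print(f"E[Y]   = {mean:.6f}   b'(theta)         = {b_prime:.6f}")
print(f"Var[Y] = {variance:.6f}   b''(theta) a(phi) = {b_double_prime:.6f}")
```

Both pairs should agree up to finite-difference error, since for the Poisson distribution $b'(\theta) = b''(\theta) = e^\theta = \lambda$.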

### Link function

Given the linear predictor (or systematic component) $\eta = \mathbf{x}^T \boldsymbol{\beta}$, a link function $g$ relates the mean $\mathbb{E} Y = \mu$ to the covariates through $\eta = g(\mu)$. In principle, any monotone, continuous, and differentiable function can serve as a link function. For the binomial distribution, for example, the logit, probit, complementary log-log, and cauchit links are all viable choices. The **canonical link** is the $g$ for which $\eta = g(\mu) = g(b'(\theta)) = \theta$, that is, $g = (b')^{-1}$.
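
The four binomial links named above can be written down in a few lines. The following is a minimal sketch using only the Python standard library; the dictionary name `LINKS` and its layout (link, inverse link) are hypothetical conveniences, not an established API.

```python
import math
from statistics import NormalDist

_std_normal = NormalDist()  # standard normal, used for the probit link

# Each entry maps a link name to (g, g_inverse), where eta = g(mu) and
# mu = g_inverse(eta), with mu in (0, 1).
LINKS = {
    "logit": (
        lambda mu: math.log(mu / (1 - mu)),
        lambda eta: 1 / (1 + math.exp(-eta)),
    ),
    "probit": (
        _std_normal.inv_cdf,
        _std_normal.cdf,
    ),
    "cloglog": (
        lambda mu: math.log(-math.log(1 - mu)),
        lambda eta: 1 - math.exp(-math.exp(eta)),
    ),
    "cauchit": (
        lambda mu: math.tan(math.pi * (mu - 0.5)),
        lambda eta: 0.5 + math.atan(eta) / math.pi,
    ),
}

# Each link is monotone increasing, and g_inverse undoes g.
for name, (g, g_inv) in LINKS.items():
    etas = [g(mu) for mu in (0.1, 0.3, 0.5, 0.7, 0.9)]
    print(name, [round(e, 3) for e in etas])
```

Of these, only the logit is the canonical link for the binomial distribution; the other three are non-canonical alternatives.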

### Fisher scoring algorithm

GLM regression coefficients are estimated by maximum likelihood. Recall that the Newton-Raphson algorithm for maximizing a log-likelihood $\ell(\boldsymbol{\beta})$ proceeds as
$$
  \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + s [- \nabla^2 \ell(\boldsymbol{\beta}^{(t)})]^{-1} \nabla \ell(\boldsymbol{\beta}^{(t)}),
$$
where $s>0$ is a step length, $\nabla \ell$ is the score (gradient) vector, and $-\nabla^2 \ell$ is the observed information matrix (negative Hessian). For a GLM, let $\ell_i$ denote the $i$-th summand of the log-likelihood. We can use the chain rule $\frac{\partial \ell_i}{\partial \boldsymbol{\beta}} = \frac{\partial \ell_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i} \frac{\partial \eta_i}{\partial \boldsymbol{\beta}}$ together with the identities $\mu_i = b'(\theta_i)$, $\sigma_i^2 = b''(\theta_i) a(\phi)$, $\eta_i = \mathbf{x}_i^T \boldsymbol{\beta} = g(\mu_i)$, and $\frac{\partial \mu_i}{\partial \theta_i} = b''(\theta_i)$ to derive
\begin{eqnarray*}
  \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right] \\
  \nabla \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \frac{(y_i - \mu_i)}{a(\phi)} 
  \frac{1}{b''(\theta_i)} \mu_i'(\eta_i) \mathbf{x}_i = 
  \sum_{i=1}^n \frac{(y_i - \mu_i) \mu_i'(\eta_i)}{\sigma_i^2} \mathbf{x}_i \\
  - \nabla^2 \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \frac{[\mu_i'(\eta_i)]^2}{\sigma_i^2} \mathbf{x}_i
  \mathbf{x}_i^T - \sum_{i=1}^n \frac{(y_i - \mu_i) \mu_i''(\eta_i)}{\sigma_i^2} \mathbf{x}_i \mathbf{x}_i^T \\
  & & + \sum_{i=1}^n \frac{(y_i - \mu_i) [\mu_i'(\eta_i)]^2 (d \sigma_i^{2} / d\mu_i)}{\sigma_i^4} \mathbf{x}_i
  \mathbf{x}_i^T. \\
\end{eqnarray*}
For GLMs with canonical links, the second and third terms cancel, because
$$
\frac{d\mu_i}{d \eta_i} = \frac{d\mu_i}{d \theta_i} = \frac{d \, b'(\theta_i)}{d \theta_i} = b''(\theta_i) = \frac{\sigma_i^2}{a(\phi)}.
$$
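Indeed, under a canonical link $\mu_i''(\eta_i) = b'''(\theta_i)$ and $d\sigma_i^2 / d\mu_i = a(\phi) b'''(\theta_i) / b''(\theta_i)$, so the second and third terms both carry the coefficient $(y_i - \mu_i) b'''(\theta_i) / [a(\phi) b''(\theta_i)]$ with opposite signs. Therefore, for a canonical link the negative Hessian is positive semidefinite and Newton's algorithm with line search is stable. For a non-canonical link, the terms containing $y_i - \mu_i$ have expectation zero, so we can instead use the expected (Fisher) information matrix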
$$
  \mathbb{E} [- \nabla^2 \ell(\boldsymbol{\beta})] = \sum_{i=1}^n \frac{[\mu_i'(\eta_i)]^2}{\sigma_i^2} \mathbf{x}_i \mathbf{x}_i^T = \mathbf{X}^T \mathbf{W} \mathbf{X} \succeq 0,
$$
where $\mathbf{W} = \text{diag}([\mu_i'(\eta_i)]^2/\sigma_i^2)$. This modified Newton-Raphson algorithm is called the **Fisher scoring algorithm**. 
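
For a canonical link the observed and expected information coincide, which can be checked numerically. The sketch below does so for an assumed Poisson model with log link, where $\mu_i'(\eta_i) = \sigma_i^2 = \mu_i$ and hence $\mathbf{W} = \text{diag}(\mu_i)$. The design matrix, responses, evaluation point, and the helper `neg_hessian_fd` are all made up for illustration.

```python
import math

# Toy data (hypothetical): intercept plus one covariate.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1, 0, 3, 4]
beta = [0.2, 0.3]  # arbitrary point at which to compare the two matrices


def loglik(b):
    """Poisson log-likelihood with log link, dropping the constant c(y_i, phi)."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = sum(xj * bj for xj, bj in zip(xi, b))
        total += yi * eta - math.exp(eta)
    return total


def neg_hessian_fd(f, b, h=1e-4):
    """Negative Hessian of f at b by central finite differences."""
    p = len(b)
    H = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(p):
            def shifted(dj, dk):
                v = list(b)
                v[j] += dj * h
                v[k] += dk * h
                return f(v)
            H[j][k] = -(shifted(1, 1) - shifted(1, -1)
                        - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return H


# Expected information X^T W X with W = diag(mu_i) for the Poisson/log model.
mu = [math.exp(sum(xj * bj for xj, bj in zip(xi, beta))) for xi in X]
fisher = [[sum(m * xi[j] * xi[k] for m, xi in zip(mu, X)) for k in range(2)]
          for j in range(2)]

observed = neg_hessian_fd(loglik, beta)
print("observed:", [[round(v, 4) for v in row] for row in observed])
print("expected:", [[round(v, 4) for v in row] for row in fisher])
```

The two matrices should match up to finite-difference error. With a non-canonical link they would generally differ at points where $y_i \neq \mu_i$.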
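The Fisher scoring algorithm proceeds as
\begin{eqnarray*}
  \boldsymbol{\beta}^{(t+1)} &=& \boldsymbol{\beta}^{(t)} + s(\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(t)} (\mathbf{D}^{(t)})^{-1} (\mathbf{y} - \widehat{\boldsymbol{\mu}}^{(t)}) \\
  &=& (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(t)} [\mathbf{X} \boldsymbol{\beta}^{(t)} + s (\mathbf{D}^{(t)})^{-1} (\mathbf{y} - \widehat{\boldsymbol{\mu}}^{(t)})] \\
  &=& (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)},
\end{eqnarray*}
where $\mathbf{D} = \text{diag}(\mu_i'(\eta_i))$, so that with $\mathbf{W} = \text{diag}([\mu_i'(\eta_i)]^2/\sigma_i^2)$ the score is $\nabla \ell(\boldsymbol{\beta}) = \mathbf{X}^T \mathbf{W} \mathbf{D}^{-1} (\mathbf{y} - \boldsymbol{\mu})$, and
$$
  \mathbf{z}^{(t)} = \mathbf{X} \boldsymbol{\beta}^{(t)} + s (\mathbf{D}^{(t)})^{-1} (\mathbf{y} - \widehat{\boldsymbol{\mu}}^{(t)})
$$
are the working responses. For a canonical link with $a(\phi) = 1$, $\mathbf{D} = \mathbf{W}$ and the score reduces to $\mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu})$. Because each iteration solves a weighted least squares problem with weights $\mathbf{W}^{(t)}$ and responses $\mathbf{z}^{(t)}$, the Fisher scoring algorithm for GLMs is also called IRWLS (iteratively reweighted least squares).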
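
The iteration can be sketched in a few dozen lines. The version below is a minimal illustration, not a production fitter: it fixes the step length at $s = 1$, starts from $\boldsymbol{\beta} = \mathbf{0}$, and uses only the standard library, so it includes a small hypothetical `solve` helper where one would normally call a linear algebra library. The working response is computed as $z_i = \eta_i + (y_i - \mu_i)/\mu_i'(\eta_i)$, which is what matching the score $\nabla \ell(\boldsymbol{\beta})$ derived earlier requires. The function names, arguments, and toy data are assumptions made for this example.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[pivot] = M[pivot], M[k]
        for r in range(k + 1, n):
            factor = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= factor * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        tail = sum(M[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (M[k][n] - tail) / M[k][k]
    return x


def irwls(X, y, inv_link, dmu_deta, variance, max_iter=100, tol=1e-10):
    """Fisher scoring with s = 1, written as repeated weighted least squares."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        eta = [sum(xj * bj for xj, bj in zip(xi, beta)) for xi in X]
        mu = [inv_link(e) for e in eta]
        d = [dmu_deta(e) for e in eta]                      # mu_i'(eta_i)
        w = [di**2 / variance(mi) for di, mi in zip(d, mu)]  # diagonal of W
        z = [e + (yi - mi) / di for e, yi, mi, di in zip(eta, y, mu, d)]
        XtWX = [[sum(wi * xi[j] * xi[k] for wi, xi in zip(w, X))
                 for k in range(p)] for j in range(p)]
        XtWz = [sum(wi * xi[j] * zi for wi, xi, zi in zip(w, X, z))
                for j in range(p)]
        new_beta = solve(XtWX, XtWz)
        converged = max(abs(u - v) for u, v in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


# Toy Poisson regression with log link (hypothetical data).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [1, 1, 2, 3, 5]
beta_hat = irwls(X, y, inv_link=math.exp, dmu_deta=math.exp,
                 variance=lambda mu: mu)
print("beta_hat:", [round(v, 4) for v in beta_hat])
```

Passing the inverse link, its derivative, and the variance function separately keeps the same loop usable for non-canonical links, for example a binomial model with probit link. For the canonical Poisson example here, the fitted means satisfy the score equations $\sum_i (y_i - \widehat{\mu}_i) \mathbf{x}_i = \mathbf{0}$ at convergence.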


### Hypothesis testing
TODO