# Generalized linear model

## Q1. Moments of exponential family distributions

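Differentiating the normalization identity $\int f(y \mid \theta, \phi) \, dy = 1$ with respect to $\theta$, and assuming that differentiation and integration can be interchanged, gives the mean and variance of an exponential family distribution: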
\begin{eqnarray*}
  f(y \mid \theta, \phi) &=& \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right] \\
  \int f(y \mid \theta, \phi) \ dy &=& \int \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right] dy 
  = 1\\
  \frac{\partial}{\partial \theta} \int f(y \mid \theta, \phi) \ dy &=& 
  \int \frac{\partial}{\partial \theta} \ f(y \mid \theta, \phi) \ dy = 0 \\
  &=& \int \frac{y - b'(\theta)}{a(\phi)} \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right] dy \\
  \therefore \mathbb{E}Y &=& \mu = b'(\theta) \\
  \frac{\partial^2}{\partial \theta^2} \int f(y \mid \theta, \phi) \ dy &=& 
  \int \frac{\partial^2}{\partial \theta^2} \ f(y \mid \theta, \phi) \ dy = 0 \\
  &=& \int \frac{(y - b'(\theta))^2 - b''(\theta)a(\phi)}{a(\phi)^2} 
  \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right] dy \\ 
  &=& \int \frac{y^2 - 2 y b'(\theta) + b'(\theta)^2 - b''(\theta)a(\phi)}{a(\phi)^2} 
  \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right] dy \\ 
  \mathbb{E}Y^2 &=& b'(\theta)^2 + b''(\theta)a(\phi) \\
  \therefore \operatorname{Var}Y &=& \sigma^2 = b''(\theta) a(\phi)
\end{eqnarray*}
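
As a quick numerical sanity check of $\mathbb{E}Y = b'(\theta)$ and $\operatorname{Var}Y = b''(\theta) a(\phi)$, the sketch below uses the Poisson family as an assumed example ($\theta = \log \mu$, $b(\theta) = e^\theta$, $a(\phi) = 1$). The value $\mu = 3.5$, the truncation point, and the finite-difference step are illustrative choices, not part of the derivation.

```python
import math

def poisson_pmf(y, mu):
    # Poisson pmf evaluated on the log scale for numerical stability
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

def b(theta):
    # Cumulant function of the Poisson family: b(theta) = exp(theta), a(phi) = 1
    return math.exp(theta)

mu = 3.5                 # illustrative mean
theta = math.log(mu)     # canonical parameter
h = 1e-4                 # finite-difference step

# Central finite differences for b'(theta) and b''(theta)
b1 = (b(theta + h) - b(theta - h)) / (2 * h)
b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2

# Moments computed from the pmf, truncating the negligible tail beyond y = 200
support = range(201)
mean = sum(y * poisson_pmf(y, mu) for y in support)
var = sum((y - mean) ** 2 * poisson_pmf(y, mu) for y in support)

print("mean from pmf:", mean, " b'(theta):", b1)
print("var from pmf: ", var, " b''(theta) * a(phi):", b2)
```

Both pairs should agree up to finite-difference error, since for the Poisson family $b'(\theta) = b''(\theta) = \mu$.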

## Q2. Score and information matrix of GLM

The gradient (score), negative Hessian, and Fisher information matrix (expected negative Hessian) of GLM are as below:

\begin{eqnarray*}
  \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right] \\
  
  \mu_i &=& b'(\theta_i) \\

  \sigma_i^2 &=& b''(\theta_i) a(\phi) \\
  
  \eta_i &=& \mathbf{x}_i^T \boldsymbol{\beta} = g(\mu_i) \\
  
  \frac{\partial \mu_i}{\partial \theta_i} &=& b''(\theta_i) \\
  
  \frac{\partial}{\partial \beta_j} &=& \frac{\partial}{\partial \theta_i} 
  \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i} 
  \frac{\partial \eta_i}{\partial \beta_j} \\
  
  \therefore \nabla \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \frac{(y_i - \mu_i)}{a(\phi)} 
  \frac{1}{b''(\theta_i)} \mu_i'(\eta_i) \mathbf{x}_i = 
  \sum_{i=1}^n \frac{(y_i - \mu_i) \mu_i'(\eta_i)}{\sigma_i^2} \mathbf{x}_i \\
  
  - \nabla^2 \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \frac{[\mu_i'(\eta_i)]^2}{\sigma_i^2} \mathbf{x}_i
  \mathbf{x}_i^T - \sum_{i=1}^n \frac{(y_i - \mu_i) \mu_i''(\eta_i)}{\sigma_i^2} \mathbf{x}_i \mathbf{x}_i^T \\
  & & + \sum_{i=1}^n \frac{(y_i - \mu_i) [\mu_i'(\eta_i)]^2 (d \sigma_i^{2} / d\mu_i)}{\sigma_i^4} \mathbf{x}_i
  \mathbf{x}_i^T

\end{eqnarray*}

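Of note, the last equality above (for the Hessian matrix) follows by differentiating the score with the product rule and the chain rule, using $\partial \eta_i / \partial \boldsymbol{\beta} = \mathbf{x}_i$.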
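
The Fisher information matrix itself is not written out above. A short worked step: taking the expectation of the negative Hessian, the two sums containing $y_i - \mu_i$ vanish because $\mathbb{E}Y_i = \mu_i$. The matrix notation $\mathbf{X}$ (rows $\mathbf{x}_i^T$) and $\mathbf{W}$ is introduced here only for compactness.

$$
\mathbb{E} \left[ - \nabla^2 \ell(\boldsymbol{\beta}) \right] = \sum_{i=1}^n \frac{[\mu_i'(\eta_i)]^2}{\sigma_i^2} \mathbf{x}_i \mathbf{x}_i^T = \mathbf{X}^T \mathbf{W} \mathbf{X}, \quad \mathbf{W} = \text{diag} \left( \frac{[\mu_1'(\eta_1)]^2}{\sigma_1^2}, \ldots, \frac{[\mu_n'(\eta_n)]^2}{\sigma_n^2} \right)
$$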

## Q3. ELMR Exercise 8.1 (p171)

### Q3.a 

\begin{eqnarray*}
  f(y \mid \theta, \phi) &=& \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right] \\
  &=& \lambda \exp(-\lambda y) = \exp(-\lambda y + \log \lambda) \\
  
  \therefore \theta &=& -\lambda, \phi = 1, a(\phi) = 1, b(\theta) = -\log(-\theta), c(y, \phi) = 0

\end{eqnarray*}

### Q3.b
The canonical link is $\eta = g(\mu) = \theta = -\lambda = -\frac{1}{\mu}$ and the variance function is $V(\mu) = b''(\theta) = \mu^2$.

### Q3.c
The canonical link $-\frac{1}{\mu} = -\lambda$ can only take negative values, since $\lambda > 0$, whereas the systematic component $\eta = \mathbf{x}^T \boldsymbol{\beta}$ can take any real value.

### Q3.d
We would use a $\chi^2$ test on the difference in deviances when comparing nested models here, since the dispersion $\phi = 1$ is known.

### Q3.e
$$
  D(\mathbf{y}, \widehat{\boldsymbol{\mu}}) = 
  2 \sum_i [y_i(\tilde \theta_i - \hat \theta_i) - b(\tilde \theta_i) + b(\hat \theta_i)] = 
  2 \sum_i \left( \frac{y_i - \hat{\mu}_i}{\hat{\mu}_i} - \log \frac{y_i}{\hat{\mu}_i} \right)
$$
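
A minimal sketch of the exponential deviance formula in code. The data vector is made up for illustration, and the function name `exponential_deviance` is hypothetical. It checks two properties one would expect: the deviance is zero for the saturated fit ($\hat\mu_i = y_i$) and positive for an intercept-only fit ($\hat\mu_i = \bar y$).

```python
import math

def exponential_deviance(y, mu_hat):
    # D = 2 * sum[(y_i - mu_i) / mu_i - log(y_i / mu_i)]; requires y_i > 0 and mu_i > 0
    return 2.0 * sum((yi - mi) / mi - math.log(yi / mi) for yi, mi in zip(y, mu_hat))

y = [0.5, 1.2, 3.0, 0.8]                      # hypothetical positive responses

saturated = exponential_deviance(y, y)        # fitted values equal to the data
ybar = sum(y) / len(y)
null_fit = exponential_deviance(y, [ybar] * len(y))   # intercept-only fit, mu_hat = mean(y)

print("saturated deviance:", saturated)
print("intercept-only deviance:", null_fit)
```

Each summand has the form $t - 1 - \log t$ with $t = y_i / \hat\mu_i$, which is nonnegative and zero only at $t = 1$.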


## Q9. (10 pts) GLM

You can use the formulae in lecture notes or homework. 

1. For a GLM with canonical link, explain why the log-likelihood is concave. 
\begin{eqnarray*}
-\nabla^2 \ell(\boldsymbol{\beta}) = \sum_{i=1}^n \frac{[\mu_i'(\eta_i)]^2}{\sigma_i^2} \mathbf{x}_i \mathbf{x}_i^T
\end{eqnarray*}

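With the canonical link, $\theta_i = \eta_i$, so the terms of the negative Hessian that involve $y_i - \mu_i$ cancel and it reduces to the expression above. Each weight $[\mu_i'(\eta_i)]^2 / \sigma_i^2$ is nonnegative, so the Hessian of the negative log-likelihood is positive semidefinite, and therefore the log-likelihood is concave.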

2. For a GLM with canonical link, explain why the Fisher scoring algorithm is the same as the Newton-Raphson algorithm. 

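Because, with the canonical link, the negative Hessian does not depend on $\mathbf{y}$, the Fisher information (the expected negative Hessian) is the same as the observed negative Hessian, so the two algorithms use the same update.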

3. Use Poisson regression (with log link) as example. Show that the Fisher scoring algorithm is equivalent to the IRWLS (iteratively reweighted least squares) procedure. Clarify what are the weights and working responses in this case (Poisson regression with canonical link). 

\begin{eqnarray*}
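- \nabla^2\ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \mu_i \mathbf{x}_i \mathbf{x}_i^T = \mathbf{X}^T \mathbf{W} \mathbf{X}, \quad \mathbf{W} = \text{diag}(w_1, \ldots, w_n), \quad w_i = \mu_i \\
\boldsymbol{\beta}^{(t+1)} &=& \boldsymbol{\beta}^{(t)} + s \left[ \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X} \right]^{-1} \mathbf{X}^T (\mathbf{y} - \widehat{\boldsymbol{\mu}}^{(t)}) = \left[ \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X} \right]^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)}, \quad s = \text{step length} \\
\mathbf{z}^{(t)} &=& \mathbf{X} \boldsymbol{\beta}^{(t)} + s (\mathbf{W}^{(t)})^{-1} (\mathbf{y} - \widehat{\boldsymbol{\mu}}^{(t)})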
\end{eqnarray*}
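
A minimal sketch of the IRWLS iteration for Poisson regression with log link, written for a single covariate plus intercept so that the weighted least squares step reduces to a $2 \times 2$ solve. The data, the function name `irwls_poisson`, the starting value, the fixed iteration count, and the step length $s = 1$ are all assumptions made for illustration.

```python
import math

def irwls_poisson(x, y, n_iter=25):
    """Fit log(mu_i) = b0 + b1 * x_i by IRWLS with step length s = 1."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # start from the intercept-only fit
    for _ in range(n_iter):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        w = mu                                                   # weights w_i = mu_i
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]   # working responses
        # Weighted least squares of z on (1, x): solve X'WX beta = X'Wz (2x2 system)
        s_w = sum(w)
        s_wx = sum(wi * xi for wi, xi in zip(w, x))
        s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        s_wz = sum(wi * zi for wi, zi in zip(w, z))
        s_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = s_w * s_wxx - s_wx ** 2
        b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        b1 = (s_w * s_wxz - s_wx * s_wz) / det
    return b0, b1

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical covariate
y = [1, 2, 2, 5, 7, 12]              # hypothetical counts

b0, b1 = irwls_poisson(x, y)
mu_hat = [math.exp(b0 + b1 * xi) for xi in x]

# At convergence the score X'(y - mu_hat) should be numerically zero
score = (sum(yi - m for yi, m in zip(y, mu_hat)),
         sum(xi * (yi - m) for xi, yi, m in zip(x, y, mu_hat)))
print("estimates:", b0, b1)
print("score at the estimates:", score)
```

A production implementation would add a convergence criterion and step halving; the fixed number of full steps here is only meant to make the weights and working responses explicit.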


## Q10. (6 pts) Link functions

Write down the (1) names, (2) expressions, and (3) the name of corresponding latent variable distribution of 3 commonly used link functions for a Bernoulli or binomial parameter $p$. 

Example: Identity link, $\eta = g(p) = p$, corresponds to a uniform distribution for the latent variable.

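Three commonly used link functions for a Bernoulli or binomial parameter are, in the order displayed below, the logit (logistic latent variable), the probit (standard normal latent variable), and the complementary log-log (Gumbel, i.e. extreme value, latent variable).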

$$
\eta = g(p) = \log \frac{p}{1-p}
$$

$$
\eta = g(p) = \Phi^{-1}(p)
$$

$$
\eta = g(p) = \log ( - \log(1-p))
$$
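
The three links and their inverses can be sketched with the Python standard library alone. Each inverse link is the CDF of the corresponding latent variable distribution; the function names and the test value $p = 0.3$ are illustrative choices.

```python
import math
from statistics import NormalDist

def logit(p):
    return math.log(p / (1.0 - p))

def probit(p):
    return NormalDist().inv_cdf(p)

def cloglog(p):
    return math.log(-math.log(1.0 - p))

# Inverse links are the CDFs of the corresponding latent variables
def inv_logit(eta):       # standard logistic CDF
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):      # standard normal CDF
    return NormalDist().cdf(eta)

def inv_cloglog(eta):     # CDF of the (minimum) Gumbel / extreme value distribution
    return 1.0 - math.exp(-math.exp(eta))

p = 0.3
round_trip = [inv_logit(logit(p)), inv_probit(probit(p)), inv_cloglog(cloglog(p))]
print(round_trip)   # each entry should recover p up to rounding error
```

Note that logit and probit are symmetric about $p = 0.5$ (so $g(0.5) = 0$), while the complementary log-log link is not.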


## Q11. (10 pts) Inverse Gaussian

The inverse Gaussian distribution $IG(\mu, \lambda)$ has density 
$$
f(y) = \left( \frac{\lambda}{2 \pi y^3} \right)^{1/2} e^{- \frac{\lambda (y - \mu)^2}{2 \mu^2 y}}, \quad y, \mu, \lambda > 0.
$$

1. Show that $IG(\mu, \lambda)$ belongs to the exponential family distributions.
\begin{eqnarray*}
  f(y \mid \theta, \phi) &=& \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right] \\
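  &=& \exp \left[ \lambda \left( -\frac{y}{2 \mu^2} + \frac{1}{\mu} \right) - \frac{\lambda}{2 y} + \frac{1}{2} \log \frac{\lambda}{2 \pi y^3} \right] \\
  \therefore \theta &=& -\frac{1}{2 \mu^2}, \quad b(\theta) = -\frac{1}{\mu} = -\sqrt{-2\theta}, \quad a(\phi) = \phi = \frac{1}{\lambda}, \quad c(y, \phi) = -\frac{1}{2 \phi y} - \frac{1}{2} \log (2 \pi \phi y^3)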
\end{eqnarray*}

2. What is the canonical parameter? 
$\theta = -\frac{1}{2 \mu^2}$

3. Derive the mean and variance of inverse Gaussian.

4. What is the canonical link function for inverse Gaussian? 

5. Derive the deviance formula for $IG(\mu, \lambda)$.
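
Parts 3 to 5 are left without an answer above. A sketch of how they could be worked out, assuming the parametrization $\theta = -\frac{1}{2\mu^2}$, $b(\theta) = -\sqrt{-2\theta} = -\frac{1}{\mu}$, and $a(\phi) = \phi = \frac{1}{\lambda}$ (obtained by factoring $\lambda$ out of the exponent of the density), together with the general results $\mathbb{E}Y = b'(\theta)$ and $\operatorname{Var}Y = b''(\theta) a(\phi)$:

\begin{eqnarray*}
  \mathbb{E}Y &=& b'(\theta) = (-2\theta)^{-1/2} = \mu \\
  \operatorname{Var}Y &=& b''(\theta) a(\phi) = (-2\theta)^{-3/2} \cdot \frac{1}{\lambda} = \frac{\mu^3}{\lambda} \\
  g(\mu) &=& \theta = -\frac{1}{2 \mu^2} \\
  D(\mathbf{y}, \widehat{\boldsymbol{\mu}}) &=& 2 \sum_i \left[ y_i(\tilde \theta_i - \hat \theta_i) - b(\tilde \theta_i) + b(\hat \theta_i) \right] \\
  &=& 2 \sum_i \left[ -\frac{1}{2 y_i} + \frac{y_i}{2 \hat{\mu}_i^2} + \frac{1}{y_i} - \frac{1}{\hat{\mu}_i} \right] = \sum_i \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i^2 y_i}
\end{eqnarray*}

Here $\tilde \theta_i = -\frac{1}{2 y_i^2}$ is the saturated-model value of the canonical parameter. The canonical link is often written as $1/\mu^2$, which differs from $\theta$ only by a constant factor.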

## Q1. Concavity of logistic regression log-likelihood 

### Q1.1

From the lecture notes, the log-likelihood function of logistic regression is as below.

\begin{eqnarray*}
\ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \left[ y_i \log p_i + (m_i - y_i) \log (1 - p_i) + \log \binom{m_i}{y_i} \right] \\
&=& \sum_{i=1}^n \left[ y_i \eta_i - m_i \log ( 1 + e^{\eta_i}) + \log \binom{m_i}{y_i} \right] \\
&=& \sum_{i=1}^n \left[ y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - m_i \log ( 1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}) + \log \binom{m_i}{y_i} \right].
\end{eqnarray*}


### Q1.2

Then the gradient vector of the log-likelihood function is as below, assuming that there are _q_ parameters being estimated. 

\begin{eqnarray*}
\nabla\ell(\boldsymbol{\beta}) = \begin{pmatrix} \sum_{i=1}^n y_i \cdot x_{i1} - m_i \frac{e^{\mathbf{x}_i^T \boldsymbol{\beta}} \cdot x_{i1}}{1+e^{\mathbf{x}_i^T \boldsymbol{\beta}}} \\
\vdots \\
\sum_{i=1}^n y_i \cdot x_{iq} - m_i \frac{e^{\mathbf{x}_i^T \boldsymbol{\beta}} \cdot x_{iq}}{1+e^{\mathbf{x}_i^T \boldsymbol{\beta}}}
\end{pmatrix} = \begin{pmatrix} \sum_{i=1}^n x_{i1} (y_i - m_i \frac{e^{\mathbf{x}_i^T \boldsymbol{\beta}}}{1+e^{\mathbf{x}_i^T \boldsymbol{\beta}}}) \\
\vdots \\
\sum_{i=1}^n x_{iq} (y_i - m_i \frac{e^{\mathbf{x}_i^T \boldsymbol{\beta}}}{1+e^{\mathbf{x}_i^T \boldsymbol{\beta}}})
\end{pmatrix}
\end{eqnarray*}

The Hessian matrix of the log-likelihood function is the Jacobian of gradient vector above.

\begin{eqnarray*}
\nabla^2\ell(\boldsymbol{\beta}) = \begin{pmatrix} \sum_{i=1}^n -x_{i1}^2 \cdot m_i \frac{e^{\mathbf{x}_i^T \boldsymbol{\beta}}}{(1+e^{\mathbf{x}_i^T \boldsymbol{\beta}})^2} & \cdots & \sum_{i=1}^n -x_{i1} \cdot x_{iq} \cdot m_i \frac{e^{\mathbf{x}_i^T \boldsymbol{\beta}}}{(1+e^{\mathbf{x}_i^T \boldsymbol{\beta}})^2}\\
\vdots & \ddots & \vdots \\
\sum_{i=1}^n -x_{i1} \cdot x_{iq} \cdot m_i \frac{e^{\mathbf{x}_i^T \boldsymbol{\beta}}}{(1+e^{\mathbf{x}_i^T \boldsymbol{\beta}})^2} & \cdots & \sum_{i=1}^n -x_{iq}^2 \cdot m_i \frac{e^{\mathbf{x}_i^T \boldsymbol{\beta}}}{(1+e^{\mathbf{x}_i^T \boldsymbol{\beta}})^2}
\end{pmatrix}
\end{eqnarray*}

A positive semidefinite matrix $H$ by definition satisfies $\boldsymbol{w}'H\boldsymbol{w} \ge 0$ for all $\boldsymbol{w} \in \mathbb{R}^{q}$. Take an arbitrary $\boldsymbol{w} = (w_1, w_2, \ldots, w_q)'$ and let $H$ be the negative Hessian. Carrying out the matrix multiplication, the contribution of observation $i$ to the quadratic form is $m_i \frac{e^{\mathbf{x}_i^T \boldsymbol{\beta}}}{(1+e^{\mathbf{x}_i^T \boldsymbol{\beta}})^2} \left[ \sum_{j=1}^q x_{ij} w_j \right]^2$. Each such term is a nonnegative weight times a square, hence greater than or equal to zero, and summing over all $i$ shows that the negative Hessian is a positive semidefinite matrix, i.e. the log-likelihood is concave. 
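
The argument can be spot-checked numerically. The sketch below builds the negative Hessian $\sum_i m_i p_i (1 - p_i) \mathbf{x}_i \mathbf{x}_i^T$ (using $p_i (1 - p_i) = e^{\eta_i} / (1 + e^{\eta_i})^2$) for a randomly generated design and evaluates the quadratic form at random vectors $\boldsymbol{w}$. The design matrix, batch sizes, coefficients, and seed are all made up for illustration; a check at random points is of course not a proof.

```python
import math
import random

random.seed(0)
n, q = 20, 3
X = [[1.0] + [random.gauss(0, 1) for _ in range(q - 1)] for _ in range(n)]  # hypothetical design
m = [random.randint(1, 10) for _ in range(n)]                               # hypothetical batch sizes
beta = [random.gauss(0, 1) for _ in range(q)]                               # hypothetical coefficients

def neg_hessian(X, m, beta):
    # -Hessian = sum_i m_i p_i (1 - p_i) x_i x_i^T, with p_i = 1 / (1 + exp(-x_i^T beta))
    q = len(beta)
    H = [[0.0] * q for _ in range(q)]
    for xi, mi in zip(X, m):
        eta = sum(a * b for a, b in zip(xi, beta))
        p = 1.0 / (1.0 + math.exp(-eta))
        for j in range(q):
            for k in range(q):
                H[j][k] += mi * p * (1.0 - p) * xi[j] * xi[k]
    return H

H = neg_hessian(X, m, beta)

# Evaluate w' H w at 100 random directions w
quad_forms = []
for _ in range(100):
    w = [random.gauss(0, 1) for _ in range(q)]
    quad_forms.append(sum(w[j] * H[j][k] * w[k] for j in range(q) for k in range(q)))

print("smallest quadratic form:", min(quad_forms))
```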

## Q1. Beta-Binomial 

### Q1.1

\begin{eqnarray*}
\pi &\sim& \text{Be}(\alpha, \beta), \quad \pi \in [0, 1], \alpha > 0, \beta > 0 \\
\mathbf{E}[\pi] = \int_{0}^{1} \pi \cdot f(\pi; \alpha, \beta) \; d\pi &=& \int_{0}^{1} \pi \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \pi^{\alpha - 1} (1 - \pi)^{\beta - 1} \; d\pi \\ 
&=& \int_{0}^{1} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \pi^{\alpha} (1 - \pi)^{\beta - 1} \; d\pi \\
&=& \frac{\Gamma(\alpha + \beta) \Gamma(\alpha + 1)}{\Gamma(\alpha) \Gamma(\alpha + \beta + 1)} = \frac{\alpha}{\alpha + \beta} \\
\mathbf{E}[\pi^2] = \int_{0}^{1} \pi^2 \cdot f(\pi; \alpha, \beta) \; d\pi &=& \int_{0}^{1} \pi^2 \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \pi^{\alpha - 1} (1 - \pi)^{\beta - 1} \; d\pi \\ 
&=& \int_{0}^{1} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \pi^{\alpha + 1} (1 - \pi)^{\beta - 1} \; d\pi \\
&=& \frac{\Gamma(\alpha + \beta) \Gamma(\alpha + 2)}{\Gamma(\alpha) \Gamma(\alpha + \beta + 2)} = \frac{(\alpha + 1) \alpha}{(\alpha + \beta + 1)(\alpha + \beta)} \\
\mathbf{E}[\pi^2] - \mathbf{E}[\pi]^2 &=& \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}
\end{eqnarray*}

### Q1.2

\begin{eqnarray*}
Y_i \mid \pi &\sim& \text{Bin}(n_i, \pi), \quad \pi \sim \text{Be}(\alpha, \beta) \\
f(y_i; \alpha, \beta, n_i) &=& \int_{0}^{1} \binom{n_i}{y_i} \pi^{y_i} (1 - \pi)^{n_i - y_i} f(\pi; \alpha, \beta) \; d\pi \\
&=& \int_{0}^{1} \binom{n_i}{y_i} \pi^{y_i} (1 - \pi)^{n_i - y_i} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \pi^{\alpha - 1} (1 - \pi)^{\beta - 1} \; d\pi \\
&=& \int_{0}^{1} \binom{n_i}{y_i} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \pi^{y_i + \alpha - 1} (1 - \pi)^{n_i - y_i + \beta - 1} \; d\pi \\
&=& \binom{n_i}{y_i} \frac{\Gamma(\alpha + \beta) \Gamma(y_i + \alpha) \Gamma(n_i - y_i + \beta)}{\Gamma(\alpha) \Gamma(\beta) \Gamma(n_i + \alpha + \beta)}\\
\mathbf{E}[Y_i] &=& \frac{n_i \alpha}{\alpha + \beta} \\
\mathbf{Var}[Y_i] &=& \frac{n_i \alpha \beta (\alpha + \beta + n_i)}{(\alpha + \beta)^2 (\alpha + \beta + 1)}
\end{eqnarray*}

The mean and variance of $Y_i$ can be calculated from the marginal probability mass function $f(y_i; \alpha, \beta, n_i)$ above. A binomial random variable with the same batch size $n_i$ and the same mean (i.e. $p = \frac{\alpha}{\alpha + \beta}$) has variance $n_i \cdot p(1-p) = n_i \cdot \frac{\alpha \beta}{(\alpha + \beta)^2}$, which is always less than or equal to the variance of $Y_i$ (with equality only when $n_i = 1$), as the ratio below shows. 

\begin{eqnarray*}
\frac{\frac{n_i \alpha \beta (\alpha + \beta + n_i)}{(\alpha + \beta)^2 (\alpha + \beta + 1)}}{\frac{n_i \alpha \beta}{(\alpha + \beta)^2}} = \frac{\alpha + \beta + n_i}{\alpha + \beta + 1} \ge 1
\end{eqnarray*}
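
The mean and variance of $Y_i$ stated in Q1.2 can also be checked without summing over the marginal pmf. A sketch using the laws of total expectation and total variance, conditioning on $\pi$ and reusing the Beta moments from Q1.1:

\begin{eqnarray*}
\mathbf{E}[Y_i] &=& \mathbf{E}[\mathbf{E}[Y_i \mid \pi]] = n_i \mathbf{E}[\pi] = \frac{n_i \alpha}{\alpha + \beta} \\
\mathbf{Var}[Y_i] &=& \mathbf{E}[\mathbf{Var}[Y_i \mid \pi]] + \mathbf{Var}[\mathbf{E}[Y_i \mid \pi]] = n_i \mathbf{E}[\pi (1 - \pi)] + n_i^2 \mathbf{Var}[\pi] \\
&=& \frac{n_i \alpha \beta}{(\alpha + \beta)(\alpha + \beta + 1)} + \frac{n_i^2 \alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} \\
&=& \frac{n_i \alpha \beta (\alpha + \beta + n_i)}{(\alpha + \beta)^2 (\alpha + \beta + 1)}
\end{eqnarray*}

The second term, $n_i^2 \mathbf{Var}[\pi]$, is the extra variation contributed by the random success probability, which is the source of the overdispersion relative to the binomial.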

## Q2. Motivation for quasi-binomial

The log-likelihood $\ell_i$ of a binomial proportion $Y_i$, where $m_i Y_i \sim \text{Bin}(m_i, p_i)$, satisfies  
\begin{eqnarray*}

\ell_i &=& m_i y_i \log p_i + m_i (1 - y_i) \log (1 - p_i) + \log \binom{m_i}{m_i y_i} \\

\mathbb{E} \frac{\partial \ell_i}{\partial \mu_i} &=& \mathbb{E} \left[ \frac{m_i y_i}{p_i} - \frac{m_i(1 - y_i)}{1 - p_i} \right] = \mathbb{E} \frac{m_i y_i - m_i p_i}{p_i (1 - p_i)} = 0 \\
\operatorname{Var} \frac{\partial \ell_i}{\partial \mu_i} &=& \operatorname{Var} \frac{m_i y_i - m_i p_i}{p_i (1 - p_i)} =  \operatorname{Var} \frac{m_i y_i}{p_i (1 - p_i)} = \left[ \frac{m_i}{p_i (1 - p_i)} \right]^2 \frac{p_i (1 - p_i)}{m_i} = \frac{m_i}{p_i (1 - p_i)} = \frac{1}{\phi V(\mu_i)} \\

\mathbb{E} \frac{\partial^2 \ell_i}{\partial \mu_i^2} &=& \mathbb{E} \frac{- m_i p_i (1 - p_i) - m_i (y_i - p_i) (1 - 2 p_i)}{p_i^2 (1 - p_i)^2} = \frac{- m_i}{p_i (1 - p_i)} = - \frac{1}{\phi V(\mu_i)}

\end{eqnarray*}

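where $\phi = 1$, $\mu_i = p_i$, and $V(\mu_i) = p_i (1 - p_i)/m_i$. Therefore, the quasi-score $U_i = (Y_i - \mu_i) / (\phi V(\mu_i))$ used in the quasi-binomial method has the same mean, variance, and expected derivative as the score $\partial \ell_i / \partial \mu_i$ of a binomial model, and so mimics its behavior.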

## Q3. Concavity of Poisson regression log-likelihood 

### Q3.1

Let $Y_1,\ldots,Y_n$ be independent random variables with $Y_i \sim \text{Poisson}(\mu_i)$ and $\log \mu_i = \mathbf{x}_i^T \boldsymbol{\beta}$, $i = 1,\ldots,n$. Then the log-likelihood function is as below. 

\begin{eqnarray*}
\ell(\boldsymbol{\beta}) &=& \sum_i \left[ y_i \log \mu_i - \mu_i - \log y_i! \right] \\
&=& \sum_i \left[ y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - e^{\mathbf{x}_i^T \boldsymbol{\beta}} - \log y_i! \right]
\end{eqnarray*}

### Q3.2

Then the gradient vector and Hessian matrix of the log-likelihood function are as below, assuming that there are _q_ parameters being estimated.

\begin{eqnarray*}
\nabla\ell(\boldsymbol{\beta}) = \begin{pmatrix} \sum_{i=1}^n y_i \cdot x_{i1} - e^{\mathbf{x}_i^T \boldsymbol{\beta}} \cdot x_{i1} \\
\vdots \\
\sum_{i=1}^n y_i \cdot x_{iq} - e^{\mathbf{x}_i^T \boldsymbol{\beta}} \cdot x_{iq}
\end{pmatrix} = \begin{pmatrix} \sum_{i=1}^n x_{i1} (y_i - e^{\mathbf{x}_i^T \boldsymbol{\beta}}) \\
\vdots \\
\sum_{i=1}^n x_{iq} (y_i - e^{\mathbf{x}_i^T \boldsymbol{\beta}})
\end{pmatrix} \\ \\

\nabla^2\ell(\boldsymbol{\beta}) = \begin{pmatrix} \sum_{i=1}^n -x_{i1}^2 e^{\mathbf{x}_i^T \boldsymbol{\beta}} & \cdots & \sum_{i=1}^n -x_{i1} x_{iq} e^{\mathbf{x}_i^T \boldsymbol{\beta}} \\
\vdots & \ddots & \vdots \\
\sum_{i=1}^n -x_{i1} x_{iq} e^{\mathbf{x}_i^T \boldsymbol{\beta}} & \cdots & \sum_{i=1}^n -x_{iq}^2 e^{\mathbf{x}_i^T \boldsymbol{\beta}}
\end{pmatrix}
\end{eqnarray*}


### Q3.3

A positive semidefinite matrix $H$ by definition satisfies $\boldsymbol{w}'H\boldsymbol{w} \ge 0$ for all $\boldsymbol{w} \in \mathbb{R}^{q}$. Take an arbitrary $\boldsymbol{w} = (w_1, w_2, \ldots, w_q)'$ and let $H$ be the negative Hessian. Carrying out the matrix multiplication, the contribution of observation $i$ to the quadratic form is $e^{\mathbf{x}_i^T \boldsymbol{\beta}} \left[  \sum_{j=1}^q x_{ij} w_j \right]^2$. Each such term is a positive weight times a square, hence greater than or equal to zero, and summing over all $i$ shows that the negative Hessian is a positive semidefinite matrix, i.e. the log-likelihood is concave. 

### Q3.4

Assuming that $\beta_1$ is the intercept term (so that $x_{i1} = 1$ for all $i$), the fitted values $\widehat{\mu}_i = e^{\mathbf{x}_i^T \boldsymbol{\widehat{\beta}}}$ from the maximum likelihood estimate satisfy

\begin{eqnarray*}

\nabla\ell(\boldsymbol{\widehat{\beta}}) &=& 0 \\
\sum_{i=1}^n x_{i1} (y_i - e^{\mathbf{x}_i^T \boldsymbol{\widehat{\beta}}}) &=& 
\sum_{i=1}^n x_{i1} (y_i - \widehat{\mu}_i) = 
\sum_{i=1}^n 1 \cdot (y_i - \widehat{\mu}_i) = 0

\end{eqnarray*}
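
Equivalently, $\sum_i y_i = \sum_i \widehat{\mu}_i$: the fitted means reproduce the total observed count. As a small worked case, assume (only for illustration) an intercept-only model with $q = 1$, so that $\widehat{\mu}_i = e^{\widehat{\beta}_1}$ for every $i$:

$$
\sum_{i=1}^n (y_i - e^{\widehat{\beta}_1}) = 0 \quad \Longrightarrow \quad e^{\widehat{\beta}_1} = \frac{1}{n} \sum_{i=1}^n y_i = \bar{y},
$$

so every fitted value equals the sample mean.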

## Q4. Odds ratios

### Q4.1

For the simple logistic model, if there is no difference between exposed and non-exposed groups (i.e. $\beta_1 = \beta_2$), then

\begin{eqnarray*}
\pi_1 &=& \frac{e^{\beta_1}}{1 + e^{\beta_1}} = \pi_2 = \frac{e^{\beta_2}}{1 + e^{\beta_2}} \\ \\
\phi &=& \frac{O_1}{O_2} = \frac{\pi_1(1 - \pi_2)}{\pi_2 (1 - \pi_1)} = \frac{\pi_1(1 - \pi_1)}{\pi_1 (1 - \pi_1)} = 1
\end{eqnarray*}

### Q4.2 

Assume that there are $J$ $2 \times 2$ tables, one for each level $x_j$ of a factor, such as age group, with $j=1,\ldots, J$. We further assume the logistic model as below,
$$
\pi_{ij} = \frac{e^{\alpha_i + \beta_i x_j}}{1 + e^{\alpha_i + \beta_i x_j}}, \quad i = 1,2, \quad j= 1,\ldots, J.
$$
Then for arbitrary $j$, if $\beta_1 = \beta_2$,
\begin{eqnarray*}
\log \phi &=& \log \pi_{1j} - \log (1 - \pi_{1j}) - \log \pi_{2j} + \log (1 - \pi_{2j}) \\
&=& \alpha_1 + \beta_1 x_j - (\alpha_2 + \beta_2 x_j) = \alpha_1 - \alpha_2
\end{eqnarray*}

As such, there is no $j$ term and $\log \phi$ is constant over all tables.
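
A small numerical illustration of this result: with a common slope, the log odds ratio computed from the fitted probabilities is the same at every level $x_j$ and equals $\alpha_1 - \alpha_2$. The parameter values and factor levels below are hypothetical.

```python
import math

def pi(alpha, beta, x):
    # Logistic model: pi = exp(alpha + beta * x) / (1 + exp(alpha + beta * x))
    eta = alpha + beta * x
    return math.exp(eta) / (1.0 + math.exp(eta))

def log_odds_ratio(alpha1, alpha2, beta1, beta2, x):
    # log(phi) = logit(pi_1) - logit(pi_2) at a given level x
    p1, p2 = pi(alpha1, beta1, x), pi(alpha2, beta2, x)
    return math.log(p1 / (1 - p1)) - math.log(p2 / (1 - p2))

alpha1, alpha2, beta = 0.4, -0.3, 0.8    # hypothetical parameter values, beta1 = beta2 = beta
levels = [-1.0, 0.0, 0.5, 2.0]           # hypothetical factor levels x_j

log_phis = [log_odds_ratio(alpha1, alpha2, beta, beta, x) for x in levels]
print(log_phis)   # every entry should be close to alpha1 - alpha2
```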