---
title: "Kenward-Roger"
package: mmrm
bibliography: '`r system.file("REFERENCES.bib", package = "mmrm")`'
csl: '`r system.file("jss.csl", package = "mmrm")`'
output:
rmarkdown::html_vignette:
toc: true
vignette: |
%\VignetteIndexEntry{Kenward-Roger}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
Here we describe the details of the calculations for the Kenward-Roger degrees of freedom and the adjusted covariance matrix of the coefficients.
## Model definition
The model definition is the same as in [Details of the model fitting in `mmrm`](algorithm.html), and we use the same notation.
### Linear model
For each subject $i$ we observe a vector
\[
Y_i = (y_{i1}, \dotsc, y_{im_i})^\top \in \mathbb{R}^{m_i}
\]
and given a design matrix
\[
X_i \in \mathbb{R}^{m_i \times p}
\]
and a corresponding coefficient vector $\beta \in \mathbb{R}^{p}$ we assume
that the observations are multivariate normal distributed:
\[
Y_i \sim N(X_i\beta, \Sigma_i)
\]
where the covariance matrix $\Sigma_i \in \mathbb{R}^{m_i \times m_i}$ is derived
by subsetting the overall covariance matrix $\Sigma \in \mathbb{R}^{m \times m}$
appropriately by
\[
\Sigma_i = G_i^{-1/2} S_i^\top \Sigma S_i G_i^{-1/2}
\]
where the subsetting matrix $S_i \in \{0, 1\}^{m \times m_i}$ contains
in each of its $m_i$ columns a single 1 indicating which overall time point
matches $t_{ih}$.
$G_i \in \mathbb{R}_{> 0}^{m_i \times m_i}$ is the diagonal weight matrix.
Conditional on the design matrices $X_i$, the coefficient vector $\beta$ and the
covariance matrix $\Sigma$ we assume that the observations are independent between
the subjects.
We can write the linear model for all subjects together as
\[
Y = X\beta + \epsilon
\]
where $Y \in \mathbb{R}^N$ combines all subject-specific observation vectors $Y_i$
such that we have in total $N = \sum_{i = 1}^{n}{m_i}$ observations,
$X \in \mathbb{R}^{N \times p}$ combines all subject-specific design matrices
and $\epsilon \in \mathbb{R}^N$ has a multivariate normal distribution
\[
\epsilon \sim N(0, \Omega)
\]
where $\Omega \in \mathbb{R}^{N \times N}$ is block-diagonal containing the
subject specific $\Sigma_i$ covariance matrices on the diagonal and 0 in the
remaining entries.
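To make the block-diagonal structure concrete, here is a small illustrative sketch (not package code) that assembles $\Omega$ from two hypothetical subject covariance matrices:
```{r}
# Illustrative sketch: assemble the block-diagonal Omega from two
# hypothetical subject-level covariance matrices.
sigma_1 <- matrix(c(1.0, 0.5,
                    0.5, 2.0), nrow = 2)        # subject 1: m_1 = 2 visits
sigma_2 <- matrix(c(1.5, 0.3, 0.2,
                    0.3, 1.0, 0.4,
                    0.2, 0.4, 2.5), nrow = 3)   # subject 2: m_2 = 3 visits
omega <- matrix(0, nrow = 5, ncol = 5)          # N = m_1 + m_2 = 5
omega[1:2, 1:2] <- sigma_1                      # block for subject 1
omega[3:5, 3:5] <- sigma_2                      # block for subject 2
omega
```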
## Mathematical Details of Kenward-Roger method
The mathematical derivation of the Kenward-Roger method is based on a Taylor expansion of the obtained covariance matrix
of $\hat\beta$ to obtain a more accurate estimate of it. All of these derivations are based on the restricted maximum likelihood.
Following the same [notation](algorithm.html#covariance-matrix-model), the covariance matrix $\Omega$ can be represented as
a function of the covariance matrix parameters $\theta = (\theta_1, \dotsc, \theta_k)^\top$, i.e. $\Omega(\theta)$.
After fitting the model with `mmrm`, we obtain the estimate $\hat\beta = \Phi(\hat\theta)X^\top\Omega(\hat\theta)^{-1}Y$,
where $\Phi(\theta) = \left\{X^\top \Omega(\theta)^{-1} X\right\} ^{-1}$ is the asymptotic covariance matrix of $\hat\beta$.
However, @kackar1984 suggest that although $\hat\beta$ is unbiased for $\beta$, the covariance matrix estimate
$\hat\Phi = \left\{X^\top \hat\Omega^{-1} X\right\}^{-1}$ can be biased. They showed that the variability of $\hat\beta$ can be partitioned into
two components,
\[
\Phi_A = \Phi + \Lambda
\]
where $\Phi$ is the variance-covariance matrix of the asymptotic distribution of $\hat\beta$ as $n\rightarrow \infty$, as defined above, and $\Lambda$ represents the amount
by which the asymptotic variance-covariance matrix underestimates $\Phi_A$.
Based on a Taylor series expansion around $\theta$, $\Lambda$ can be approximated by
\[
\Lambda \simeq \Phi \left\{\sum_{h=1}^k{\sum_{j=1}^k{W_{hj}(Q_{hj} - P_h \Phi P_j)} }\right\} \Phi
\]
where
\[
P_h = X^\top \frac{\partial{\Omega^{-1}}}{\partial \theta_h} X
\]
\[
Q_{hj} = X^\top \frac{\partial{\Omega^{-1}}}{\partial \theta_h} \Omega \frac{\partial{\Omega^{-1}}}{\partial \theta_j} X
\]
and $W$ is the inverse of the Hessian matrix of the negative log-likelihood function of $\theta$ evaluated at the estimate $\hat\theta$, i.e. the inverse of the observed Fisher information matrix, which is a consistent estimator of the variance-covariance matrix of $\hat\theta$.
Again, based on a Taylor series expansion about $\theta$, @kenward1997 show that
\[
\hat\Phi \simeq \Phi + \sum_{h=1}^k{(\hat\theta_h - \theta_h)\frac{\partial{\Phi}}{\partial{\theta_h}}} + \frac{1}{2} \sum_{h=1}^k{\sum_{j=1}^k{(\hat\theta_h - \theta_h)(\hat\theta_j - \theta_j)\frac{\partial^2{\Phi}}{\partial{\theta_h}\partial{\theta_j}}}}
\]
Ignoring the possible bias in $\hat\theta$,
\[
E(\hat\Phi) \simeq \Phi + \frac{1}{2} \sum_{h=1}^k{\sum_{j=1}^k{W_{hj}\frac{\partial^2{\Phi}}{\partial{\theta_h}\partial{\theta_j}}}}
\]
Using the previously defined notation, the second-order derivative can be further written as
\[
\frac{\partial^2{\Phi}}{\partial{\theta_h}\partial{\theta_j}} = \Phi (P_h \Phi P_j + P_j \Phi P_h - Q_{hj} - Q_{jh} + R_{hj}) \Phi
\]
where
\[
R_{hj} = X^\top\Omega^{-1}\frac{\partial^2\Omega}{\partial{\theta_h}\partial{\theta_j}} \Omega^{-1} X
\]
Substituting $\Phi$ and $\Lambda$ back into $\Phi_A$, we have
\[
\hat\Phi_A = \hat\Phi + 2\hat\Phi \left\{\sum_{h=1}^k{\sum_{j=1}^k{W_{hj}(Q_{hj} - P_h \hat\Phi P_j - \frac{1}{4}R_{hj})} }\right\} \hat\Phi
\]
where $\Omega(\hat\theta)$ replaces $\Omega(\theta)$ on the right-hand side.
Please note that if we ignore $R_{hj}$, which involves the second-order derivatives of $\Omega$, we obtain a different estimate of the adjusted covariance matrix; we call this the linear Kenward-Roger approximation.
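As an illustrative sketch of this formula (with hypothetical inputs; the actual implementation inside the package differs in detail), the adjusted covariance matrix could be computed in R as follows:
```{r}
# Illustrative sketch of the adjusted covariance formula above.
# Hypothetical inputs: Phi (p x p), W (k x k), a list P of matrices P_h,
# and nested lists Q and R such that Q[[h]][[j]] is Q_{hj} and
# R[[h]][[j]] is R_{hj} (all p x p).
kr_adjust <- function(Phi, W, P, Q, R) {
  k <- nrow(W)
  p <- nrow(Phi)
  inner <- matrix(0, p, p)
  for (h in seq_len(k)) {
    for (j in seq_len(k)) {
      inner <- inner +
        W[h, j] * (Q[[h]][[j]] - P[[h]] %*% Phi %*% P[[j]] - R[[h]][[j]] / 4)
    }
  }
  # Phi_A = Phi + 2 * Phi * { double sum } * Phi
  Phi + 2 * Phi %*% inner %*% Phi
}
```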
### Special Considerations for `mmrm` models
In `mmrm` models, $\Omega$ is a block-diagonal matrix, hence we can calculate $P$, $Q$ and $R$ for each subject and sum them up.
\[
P_h = \sum_{i=1}^{n}{P_{ih}} = \sum_{i=1}^{n}{X_i^\top \frac{\partial{\Sigma_i^{-1}}}{\partial \theta_h} X_i}
\]
\[
Q_{hj} = \sum_{i=1}^{n}{Q_{ihj}} = \sum_{i=1}^{n}{X_i^\top \frac{\partial{\Sigma_i^{-1}}}{\partial \theta_h} \Sigma_i \frac{\partial{\Sigma_i^{-1}}}{\partial \theta_j} X_i}
\]
\[
R_{hj} = \sum_{i=1}^{n}{R_{ihj}} = \sum_{i=1}^{n}{X_i^\top\Sigma_i^{-1}\frac{\partial^2\Sigma_i}{\partial{\theta_h}\partial{\theta_j}} \Sigma_i^{-1} X_i}
\]
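The following sketch shows this per-subject accumulation for a single pair $(h, j)$, assuming hypothetical lists `x_list`, `sigma_list`, `d_sigma_inv` and `d2_sigma` that hold $X_i$, $\Sigma_i$ and the derivative matrices (all names are illustrative, not the package internals):
```{r}
# Illustrative sketch: accumulate P_h, Q_{hj} and R_{hj} over subjects.
# Hypothetical inputs: x_list[[i]] is X_i; sigma_list[[i]] is Sigma_i;
# d_sigma_inv[[i]][[h]] is the derivative of Sigma_i^{-1} w.r.t. theta_h;
# d2_sigma[[i]][[h]][[j]] is the second derivative of Sigma_i.
accumulate_pqr <- function(x_list, sigma_list, d_sigma_inv, d2_sigma, h, j) {
  p <- ncol(x_list[[1]])
  P_h <- Q_hj <- R_hj <- matrix(0, p, p)
  for (i in seq_along(x_list)) {
    x_i <- x_list[[i]]
    sigma_inv_i <- solve(sigma_list[[i]])
    P_h <- P_h + t(x_i) %*% d_sigma_inv[[i]][[h]] %*% x_i
    Q_hj <- Q_hj +
      t(x_i) %*% d_sigma_inv[[i]][[h]] %*% sigma_list[[i]] %*%
        d_sigma_inv[[i]][[j]] %*% x_i
    R_hj <- R_hj +
      t(x_i) %*% sigma_inv_i %*% d2_sigma[[i]][[h]][[j]] %*%
        sigma_inv_i %*% x_i
  }
  list(P_h = P_h, Q_hj = Q_hj, R_hj = R_hj)
}
```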
### Derivative of the overall covariance matrix $\Sigma$
The derivative of the overall covariance matrix $\Sigma$ with respect to the variance parameters can be calculated through the derivatives of its Cholesky factor $L$ (where $\Sigma = LL^\top$), which are obtained through automatic differentiation,
following the [matrix identities in differential form](https://en.wikipedia.org/wiki/Matrix_calculus#Identities_in_differential_form).
\[
\frac{\partial{\Sigma}}{\partial{\theta_h}} = \frac{\partial{LL^\top}}{\partial{\theta_h}} = \frac{\partial{L}}{\partial{\theta_h}}L^\top + L\frac{\partial{L^\top}}{\partial{\theta_h}}
\]
\[
\frac{\partial^2{\Sigma}}{\partial{\theta_h}\partial{\theta_j}} = \frac{\partial^2{L}}{\partial{\theta_h}\partial{\theta_j}}L^\top + L\frac{\partial^2{L^\top}}{\partial{\theta_h}\partial{\theta_j}} + \frac{\partial{L}}{\partial{\theta_h}}\frac{\partial{L^\top}}{\partial{\theta_j}} + \frac{\partial{L}}{\partial{\theta_j}}\frac{\partial{L^\top}}{\partial{\theta_h}}
\]
\]
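We can verify the product rule numerically on a small hypothetical example, comparing the analytic expression with a forward finite-difference approximation:
```{r}
# Numeric check of the product rule above on a small hypothetical example:
# L(theta) is 2 x 2 lower-triangular with entries exp(theta[1]), theta[2]
# and exp(theta[3]), and Sigma(theta) = L L^T.
chol_factor <- function(theta) {
  matrix(c(exp(theta[1]), theta[2], 0, exp(theta[3])), nrow = 2)
}
sigma_fun <- function(theta) tcrossprod(chol_factor(theta))

theta <- c(0.1, 0.5, -0.2)
d_l <- matrix(c(exp(theta[1]), 0, 0, 0), nrow = 2)  # dL/dtheta_1
analytic <- d_l %*% t(chol_factor(theta)) + chol_factor(theta) %*% t(d_l)

eps <- 1e-6
numeric <- (sigma_fun(theta + c(eps, 0, 0)) - sigma_fun(theta)) / eps
all.equal(analytic, numeric, tolerance = 1e-4)
```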
### Derivative of $\Sigma^{-1}$
The derivative of $\Sigma^{-1}$ can be obtained by differentiating the identity $\Sigma\Sigma^{-1} = I$:
\[
\frac{\partial{(\Sigma\Sigma^{-1})}}{\partial{\theta_h}}
= \frac{\partial{\Sigma}}{\partial{\theta_h}}\Sigma^{-1} + \Sigma\frac{\partial{\Sigma^{-1}}}{\partial{\theta_h}}
= 0
\]
and hence
\[
\frac{\partial{\Sigma^{-1}}}{\partial{\theta_h}} = - \Sigma^{-1} \frac{\partial{\Sigma}}{\partial{\theta_h}}\Sigma^{-1}
\]
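Similarly, this inverse-derivative identity can be checked numerically, reusing `sigma_fun`, `theta`, `analytic` and `eps` from the previous sketch:
```{r}
# Numeric check of the inverse-derivative identity, reusing objects
# from the previous chunk (analytic is dSigma/dtheta_1).
sigma <- sigma_fun(theta)
d_sigma_inv <- -solve(sigma) %*% analytic %*% solve(sigma)

numeric_inv <- (solve(sigma_fun(theta + c(eps, 0, 0))) - solve(sigma)) / eps
all.equal(d_sigma_inv, numeric_inv, tolerance = 1e-4)
```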
### Subjects with missed visits
If a subject does not have all visits, the corresponding covariance matrix can be represented as
\[
\Sigma_i = S_i^\top \Sigma S_i
\]
and the derivatives can be obtained through
\[
\frac{\partial{\Sigma_i}}{\partial{\theta_h}} = S_i^\top \frac{\partial{\Sigma}}{\partial{\theta_h}} S_i
\]
\[
\frac{\partial^2{\Sigma_i}}{\partial{\theta_h}\partial{\theta_j}} = S_i^\top \frac{\partial^2{\Sigma}}{\partial{\theta_h}\partial{\theta_j}} S_i
\]
The derivative of the $\Sigma_i^{-1}$, $\frac{\partial\Sigma_i^{-1}}{\partial{\theta_h}}$ can be calculated through $\Sigma_i^{-1}$ and $\frac{\partial{\Sigma_i}}{\partial{\theta_h}}$
using the above.
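For illustration, consider a hypothetical subject observed at visits 1 and 3 out of $m = 3$ time points:
```{r}
# Illustrative sketch: a subject observed at visits 1 and 3 out of m = 3.
sigma <- matrix(c(1.0, 0.5, 0.3,
                  0.5, 2.0, 0.6,
                  0.3, 0.6, 1.5), nrow = 3)  # hypothetical overall Sigma
s_i <- matrix(c(1, 0, 0,
                0, 0, 1), nrow = 3)          # columns select visits 1 and 3
sigma_i <- t(s_i) %*% sigma %*% s_i          # 2 x 2 subject covariance
sigma_i
```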
### Scenario under group specific covariance estimates
When fitting grouped `mmrm` models, the covariance matrix for subject $i$ in group $g(i)$ can be written as
\[
\Sigma_i = S_i^\top \Sigma_{g(i)} S_i
\]
Assuming there are $B$ groups, the number of $\theta$ parameters is multiplied by $B$. Since the $\theta$ parameters of one group
do not affect the other groups, the contributions split by group: $P_h$ decomposes into per-group blocks, while $Q$ and $R$ are block-diagonal, with the blocks given by:
\[
P_h = \begin{pmatrix}
P_{h, 1} & \dots & P_{h, B} \\
\end{pmatrix}
\]
\[
Q_{hj} = \begin{pmatrix}
Q_{hj, 1} & 0 & \dots & \dots & 0 \\
0 & Q_{hj, 2} & 0 & \dots & 0\\
\vdots & & \ddots & & \vdots \\
\vdots & & & \ddots & \vdots \\
0 & \dots & \dots & 0 & Q_{hj, B}
\end{pmatrix}
\]
\[
R_{hj} = \begin{pmatrix}
R_{hj, 1} & 0 & \dots & \dots & 0 \\
0 & R_{hj, 2} & 0 & \dots & 0\\
\vdots & & \ddots & & \vdots \\
\vdots & & & \ddots & \vdots \\
0 & \dots & \dots & 0 & R_{hj, B}
\end{pmatrix}
\]
Using $P_{h, b}$, $Q_{hj, b}$ and $R_{hj, b}$ to denote the blocks for group $b$, we have
\[
P_{h,b} = \sum_{g(i) = b}{P_{ih}} = \sum_{g(i) = b}{X_i^\top \frac{\partial{\Sigma_i^{-1}}}{\partial \theta_h} X_i}
\]
\[
Q_{hj,b} = \sum_{g(i) = b}{Q_{ihj}} = \sum_{g(i) = b}{X_i^\top \frac{\partial{\Sigma_i^{-1}}}{\partial \theta_h} \Sigma_i \frac{\partial{\Sigma_i^{-1}}}{\partial \theta_j} X_i}
\]
\[
R_{hj,b} = \sum_{g(i) = b}{R_{ihj}} = \sum_{g(i) = b}{X_i^\top\Sigma_i^{-1}\frac{\partial^2\Sigma_i}{\partial{\theta_h}\partial{\theta_j}} \Sigma_i^{-1} X_i}
\]
### Scenario under weighted mmrm
For a weighted `mmrm` model, the covariance matrix for subject $i$ can be represented as
\[
\Sigma_i = G_i^{-1/2} S_i^\top \Sigma S_i G_i^{-1/2}
\]
where $G_i$ is the diagonal matrix of weights. Since the weights are constants, deriving $P$, $Q$ and $R$ involves no
mathematical differences: having $G_i$ in addition to $S_i$ does not change the algorithm, and we simply multiply the
formulas by $G_i^{-1/2}$ on both sides, analogously to the subsetting matrix above, as shown in the sketch below.
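Continuing the sketch from the previous section, with hypothetical weights for the two observed visits:
```{r}
# Continuing the previous sketch: Sigma_i = G_i^{-1/2} S_i^T Sigma S_i G_i^{-1/2}
# with a hypothetical diagonal weight matrix G_i for the two observed visits.
g_i <- diag(c(2, 0.5))                        # diagonal weight matrix G_i
g_i_inv_sqrt <- diag(1 / sqrt(diag(g_i)))     # G_i^{-1/2}
sigma_i_weighted <- g_i_inv_sqrt %*% t(s_i) %*% sigma %*% s_i %*% g_i_inv_sqrt
sigma_i_weighted
```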
## Inference
Suppose we are testing the linear combination of $\beta$, $C\beta$ with $C \in \mathbb{R}^{c\times p}$; we can use the following F-statistic
\[
F = \frac{1}{c} (\hat\beta - \beta)^\top C^\top (C \hat\Phi_A C^\top)^{-1} C (\hat\beta - \beta)
\]
and
\[
F^* = \lambda F
\]
approximately follows an $F_{c,m}$ distribution.
When we have only one coefficient to test, $C$ is a row vector containing a single $1$ and zeros elsewhere. We can still follow the same calculations as for the multi-dimensional case. This recovers the degrees of freedom results of the Satterthwaite method.
$\lambda$ and $m$ can be calculated through
\[
M = C^\top (C \Phi C^\top)^{-1} C
\]
\[
A_1 = \sum_{h=1}^k{\sum_{j=1}^k{W_{hj} tr(M \Phi P_h \Phi) tr(M \Phi P_j \Phi)}}
\]
\[
A_2 = \sum_{h=1}^k{\sum_{j=1}^k{W_{hj} tr(M \Phi P_h \Phi M \Phi P_j \Phi)}}
\]
\[
B = \frac{1}{2c}(A_1 + 6A_2)
\]
\[
g = \frac{(c+1)A_1 - (c+4)A_2}{(c+2)A_2}
\]
\[
c_1 = \frac{g}{3c+2(1-g)}
\]
\[
c_2 = \frac{c-g}{3c+2(1-g)}
\]
\[
c_3 = \frac{c+2-g}{3c+2(1-g)}
\]
\[E^*={\left\{1-\frac{A_2}{c}\right\}}^{-1}\]
\[V^*=\frac{2}{c}{\left\{\frac{1+c_1 B}{(1-c_2 B)^2(1-c_3 B)}\right\}}\]
\[\rho = \frac{V^{*}}{2(E^*)^2}\]
\[m = 4 + \frac{c+2}{c\rho - 1}\]
\[\lambda = \frac{m}{E^*(m-2)}\]
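These final steps translate directly into code; the following sketch computes $m$ and $\lambda$ from hypothetical values of $A_1$, $A_2$ and $c$ (illustrative only, not the package implementation):
```{r}
# Illustrative sketch of the final steps: compute m and lambda from
# hypothetical values of A_1, A_2 and the rank c of C.
kr_df <- function(A1, A2, c) {
  B <- (A1 + 6 * A2) / (2 * c)
  g <- ((c + 1) * A1 - (c + 4) * A2) / ((c + 2) * A2)
  denom <- 3 * c + 2 * (1 - g)
  c1 <- g / denom
  c2 <- (c - g) / denom
  c3 <- (c + 2 - g) / denom
  e_star <- 1 / (1 - A2 / c)
  v_star <- (2 / c) * (1 + c1 * B) / ((1 - c2 * B)^2 * (1 - c3 * B))
  rho <- v_star / (2 * e_star^2)
  m <- 4 + (c + 2) / (c * rho - 1)
  lambda <- m / (e_star * (m - 2))
  list(m = m, lambda = lambda)
}
kr_df(A1 = 0.05, A2 = 0.02, c = 1)  # hypothetical inputs
```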
## Parameterization methods and Kenward-Roger
While the Kenward-Roger adjusted covariance matrix uses a Taylor series to approximate the true value, the choice
of parameterization can change the result. Take the unstructured covariance structure as a simple example: in our current approach,
where the parameters are elements of the Cholesky factor of $\Sigma$ (see [parameterization](covariance.html)), the second-order derivatives
of $\Sigma$ with respect to our parameters are non-zero matrices, whereas if we used the elements of $\Sigma$ itself as parameters, the second-order
derivatives would be zero matrices. However, the differences are usually small and do not affect the inference in practice.
If you would like to match SAS results for the unstructured covariance model, you can use the linear Kenward-Roger approximation.
## Implementations in `mmrm`
In the `mmrm` package, we have implemented the Kenward-Roger calculations based on the previous sections.
Specifically, for the first-order and second-order derivatives we use automatic differentiation to obtain
the results easily for non-spatial covariance structures. For spatial covariance structures, we use the
exact derivatives given below.
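For example, a Kenward-Roger fit can be requested via the `method` argument, shown here as a minimal sketch using the `fev_data` example dataset shipped with the package (the available arguments may differ between package versions, so please check `?mmrm` for your installed version):
```{r, eval = FALSE}
library(mmrm)

# Kenward-Roger adjusted covariance matrix and degrees of freedom.
fit_kr <- mmrm(
  FEV1 ~ RACE + ARMCD * AVISIT + us(AVISIT | USUBJID),
  data = fev_data,
  method = "Kenward-Roger"
)

# Linear Kenward-Roger approximation (e.g. to match SAS for the
# unstructured covariance model); in recent package versions this is
# selected via the `vcov` argument.
fit_krl <- mmrm(
  FEV1 ~ RACE + ARMCD * AVISIT + us(AVISIT | USUBJID),
  data = fev_data,
  method = "Kenward-Roger",
  vcov = "Kenward-Roger-Linear"
)
summary(fit_kr)
```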
### Spatial Exponential Derivatives
For the spatial exponential covariance structure, we have
\[\theta = (\theta_1,\theta_2)\]
\[\sigma = e^{\theta_1}\]
\[\rho = \frac{e^{\theta_2}}{1 + e^{\theta_2}}\]
\[\Sigma_{ij} = \sigma \rho^{d_{ij}}\]
where $d_{ij}$ is the distance between time point $i$ and time point $j$.
So the first-order derivatives can be written as:
\[
\frac{\partial{\Sigma_{ij}}}{\partial\theta_1} = \frac{\partial\sigma}{\partial\theta_1} \rho^{d_{ij}}\\
= e^{\theta_1}\rho^{d_{ij}} \\
= \Sigma_{ij}
\]
\[
\frac{\partial{\Sigma_{ij}}}{\partial\theta_2} = \sigma\frac{\partial{\rho^{d_{ij}}}}{\partial\theta_2} \\
= \sigma\rho^{d_{ij}-1}{d_{ij}}\frac{\partial\rho}{\partial\theta_2}\\
= \sigma\rho^{d_{ij}-1}{d_{ij}}\rho(1-\rho) \\
= \sigma \rho^{d_{ij}} {d_{ij}} (1-\rho)
\]
Second-order derivatives can be written as:
\[
\frac{\partial^2{\Sigma_{ij}}}{\partial\theta_1\partial\theta_1}\\
= \frac{\partial\Sigma_{ij}}{\partial\theta_1}\\
= \Sigma_{ij}
\]
\[
\frac{\partial^2{\Sigma_{ij}}}{\partial\theta_1\partial\theta_2} = \frac{\partial^2{\Sigma_{ij}}}{\partial\theta_2\partial\theta_1} \\
= \frac{\partial\Sigma_{ij}}{\partial\theta_2}\\
= \sigma\rho^{d_{ij}-1}{d_{ij}}\rho(1-\rho)\\
= \sigma\rho^{d_{ij}}{d_{ij}}(1-\rho)
\]
\[
\frac{\partial^2{\Sigma_{ij}}}{\partial\theta_2\partial\theta_2}\\
= \frac{\partial{\sigma\rho^{d_{ij}}{d_{ij}}(1-\rho)}}{\partial\theta_2}\\
= \sigma\rho^{d_{ij}}{d_{ij}}(1-\rho)(d_{ij} (1-\rho) - \rho)
\]
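These closed-form derivatives can be checked numerically; here is a small sketch for a single entry with a hypothetical distance $d$ between two time points:
```{r}
# Numeric check of the spatial exponential first-order derivatives,
# for a single hypothetical entry with distance d.
sigma_ij <- function(theta, d) {
  sigma <- exp(theta[1])
  rho <- exp(theta[2]) / (1 + exp(theta[2]))
  sigma * rho^d
}

theta <- c(0.3, -0.5)  # hypothetical variance parameters
d <- 1.7               # hypothetical distance
rho <- exp(theta[2]) / (1 + exp(theta[2]))

# Analytic first-order derivatives from the formulas above.
d1 <- sigma_ij(theta, d)                  # w.r.t. theta_1: equals Sigma_ij
d2 <- sigma_ij(theta, d) * d * (1 - rho)  # w.r.t. theta_2

eps <- 1e-6
all.equal(d1, (sigma_ij(theta + c(eps, 0), d) - sigma_ij(theta, d)) / eps,
          tolerance = 1e-4)
all.equal(d2, (sigma_ij(theta + c(0, eps), d) - sigma_ij(theta, d)) / eps,
          tolerance = 1e-4)
```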
# References