
# The Affine Transform Will Explode Your Mind

The non-centered parameterization is recommended for hierarchical models
where each group has relatively few observations.

To see how this works with binomial data, i.e., outcomes from repeated binary trials,
we build a hierarchical model with a *normal prior* on the *log odds of success*.

The data and models are taken from Bob Carpenter's most excellent case study:
[Hierarchical Partial Pooling for Repeated Binary Trials](https://mc-stan.org/users/documentation/case-studies/pool-binary-trials.html)

### Packages used in this notebook

We use [CmdStanPy](https://mc-stan.org/cmdstanpy) to do the model fitting and plot the results using [plotnine](https://plotnine.readthedocs.io/en/stable/), a ggplot2-like Python package.
Pandas and NumPy are also used for data munging.

In [None]:
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from plotnine import *
%matplotlib inline

from cmdstanpy import CmdStanModel

In [None]:
theme_set(
  theme_grey() + 
  theme(text=element_text(size=10),
        plot_title=element_text(size=14),
        axis_title_x=element_text(size=12),
        axis_title_y=element_text(size=12),
        axis_text_x=element_text(size=8),
        axis_text_y=element_text(size=8)
       )
)

### Baseball Data:  Number of hits in 45 at-bats for 18 MLB players in 1971

In [None]:
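# read_csv opens the file itself, so no separate open() call is needed
df = pd.read_csv("efron-morris-75-data.tsv", sep="\t")
# Styler.hide(axis="index") replaces hide_index(), which was removed in pandas 2.0
df.style.hide(axis="index").format(precision=3)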

In [None]:
baseball_data = {"N": df.shape[0],
                 "K": df['At-Bats'],
                 "y": df['Hits'],
                 "K_new": df['RemainingAt-Bats'],
                 "y_new": df['SeasonHits']-df['Hits']}

M = 10000

def bda_plot(df, x_lab, y_lab, title=''):
  return (ggplot(df, aes('x', 'y')) +
          geom_point(alpha=0.2) +
          xlab(x_lab) +
          ylab(y_lab) +
          ggtitle(title) +
          theme(figure_size=(8,6)))

## The Model

The model we are interested in is a hierarchical model
with a *normal prior* on the *log odds of success*.
The mathematical model specification is

$$
p(y_n \, | \, K_n, \alpha_n) 
\ = \ \mathsf{Binomial}(y_n \, | \, K_n, \ \mathrm{logit}^{-1}(\alpha_n))
$$

with a simple normal hierarchical prior

$$
p(\alpha_n \, | \, \mu, \sigma)
= \mathsf{Normal}(\alpha_n \, | \, \mu, \sigma),
$$

a weakly informative hyperprior for $\mu$

$$
p(\mu) = \mathsf{Normal}(\mu \, | \, -1, 1),
$$

and a half-normal prior on $\sigma$

$$
p(\sigma)
\ = \ 2 \, \mathsf{Normal}(\sigma \, | \, 0, 1)
\ \propto \ \mathsf{Normal}(\sigma \, | \, 0, 1).
$$
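To make the generative story concrete, here is a minimal NumPy sketch that draws one simulated data set from the priors above. The function name `simulate_prior_predictive`, the seed, and the choice of 18 groups with 45 trials each (mirroring the baseball data) are assumptions made for illustration; this is not part of the Stan programs.

```python
import numpy as np

def inv_logit(x):
    """Logistic function, the inverse of the logit (log-odds) transform."""
    return 1.0 / (1.0 + np.exp(-x))

def simulate_prior_predictive(K, rng):
    """Draw one data set y from the hierarchical model's priors.

    K is an array of trial counts, one per group (player).
    """
    K = np.asarray(K)
    mu = rng.normal(-1.0, 1.0)                    # mu ~ Normal(-1, 1)
    sigma = abs(rng.normal(0.0, 1.0))             # sigma ~ half-Normal(0, 1)
    alpha = rng.normal(mu, sigma, size=K.size)    # alpha_n ~ Normal(mu, sigma)
    theta = inv_logit(alpha)                      # per-group chance of success
    y = rng.binomial(K, theta)                    # y_n ~ Binomial(K_n, theta_n)
    return {"mu": mu, "sigma": sigma, "alpha": alpha, "theta": theta, "y": y}

rng = np.random.default_rng(1234)   # arbitrary seed, chosen for this sketch
K = np.full(18, 45)                 # 18 groups, 45 trials each
sim = simulate_prior_predictive(K, rng)
print(sim["y"])
```

Taking the absolute value of a `Normal(0, 1)` draw is one simple way to obtain a half-normal draw for `sigma`.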

### Centered Parameterization

The Stan program `hier-logit-centered.stan` is a straightforward encoding of
a hierarchical model with a *normal prior* on the *log odds of success*,
but this is not the optimal way to code this model in Stan, as we will soon demonstrate.

```
parameters {
  real mu;                       // population mean of success log-odds
  real<lower=0> sigma;           // population sd of success log-odds
  vector[N] alpha;               // success log-odds
}
model {
  mu ~ normal(-1, 1);               // hyperprior
  sigma ~ normal(0, 1);             // hyperprior
  alpha ~ normal(mu, sigma);        // prior (hierarchical)
  y ~ binomial_logit(K, alpha);     // likelihood
}
```

In [None]:
hier_logit_centered_model = CmdStanModel(stan_file='hier-logit-centered.stan')
print(hier_logit_centered_model.code())

In [None]:
fit_centered = hier_logit_centered_model.sample(
    data=baseball_data,
    iter_sampling=int(M/4),
    seed=54321)

The variable `theta` is the per-player chance of success, i.e., `theta * 1000` is their batting average in the conventional three-digit form.  The estimates range from roughly 240 to 270, which is in line with what we expect from major league baseball players.

In [None]:
fit_centered.summary(sig_figs=3).round(decimals=3).filter(regex=r'mu|sigma|theta', axis="index")
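As a small worked example of moving between the log-odds scale of `alpha` and the probability scale of `theta`, the sketch below converts a log-odds value into a three-digit batting average. The helper names `logit`, `inv_logit`, and `batting_average` are illustrative and do not come from the notebook.

```python
import math

def inv_logit(alpha):
    """Map log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-alpha))

def logit(theta):
    """Map a probability to log odds."""
    return math.log(theta / (1.0 - theta))

def batting_average(alpha):
    """Batting average in three-digit form, e.g. 269 for a .269 hitter."""
    return round(1000 * inv_logit(alpha))

# The hyperprior on mu is centered at -1 on the log-odds scale.
prior_center = batting_average(-1.0)
print(prior_center)  # -> 269
```

So the `Normal(-1, 1)` hyperprior on `mu` is centered on a .269 hitter, and batting averages between 240 and 270 correspond to log odds slightly below -1.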

The reported effective sample size (ESS) for `sigma` is low and its R_hat value is above 1.   CmdStan's `diagnose` method indicates that this model had problems fitting the data.

In [None]:
print(fit_centered.diagnose())
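To give some intuition for what a low effective sample size means, the following sketch estimates ESS for a single chain from its autocorrelations. It is a deliberately crude estimator (it sums autocorrelations up to the first negative lag) and is not the estimator CmdStan reports; the function names and the AR(1) test chain are assumptions made for illustration.

```python
import numpy as np

def autocorrelation(x):
    """Sample autocorrelation of a 1-D chain at lags 0..n-1, computed via FFT."""
    x = np.asarray(x, dtype=float)
    n = x.size
    centered = x - x.mean()
    spectrum = np.fft.rfft(centered, n=2 * n)   # zero-pad to avoid wrap-around
    acov = np.fft.irfft(spectrum * np.conj(spectrum))[:n] / n
    return acov / acov[0]

def crude_ess(x):
    """Draws / (1 + 2 * sum of autocorrelations before the first negative lag)."""
    rho = autocorrelation(x)
    negative = np.flatnonzero(rho[1:] < 0)
    cutoff = negative[0] + 1 if negative.size else rho.size
    return len(x) / (1.0 + 2.0 * rho[1:cutoff].sum())

rng = np.random.default_rng(2024)   # arbitrary seed
n = 20_000
independent = rng.normal(size=n)

# AR(1) chain: each draw is 0.9 times the previous draw plus fresh noise.
noise = rng.normal(size=n)
sticky = np.empty(n)
sticky[0] = noise[0]
for t in range(1, n):
    sticky[t] = 0.9 * sticky[t - 1] + noise[t]

ess_independent = crude_ess(independent)
ess_sticky = crude_ess(sticky)
print(f"independent draws:    ESS ~ {ess_independent:.0f} of {n}")
print(f"autocorrelated draws: ESS ~ {ess_sticky:.0f} of {n}")
```

For an AR(1) chain with coefficient 0.9 the theoretical ESS is about n/19, so strongly autocorrelated draws carry far less information than their count suggests. A sampler stuck in a difficult region of the posterior produces chains of this kind.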

#### Meet the Funnel

With a small amount of data, the sampler cannot properly determine how much of the observed variance is individual-level and how much is group-level.  The result is low ESS and poor R-hat for `sigma`.
When we plot the individual-level estimates for player one against `log(sigma)`, the range on the y axis is (-4, 0) and there is a clear funnel shape with many draws in the neck of the funnel.

In [None]:
df_x_y = pd.DataFrame(
    data={'x': fit_centered.stan_variable('alpha')[:,0],
          'y': np.log(fit_centered.stan_variable('sigma'))
         }
)

bda_plot(df_x_y,
         x_lab = "alpha[1]: player 1 log odds of success",
         y_lab = "log(sigma): log population scale",
         title = "population vs player params, centered parameterization")
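The funnel geometry is already visible in the hierarchical prior, before any data are involved: conditional on `sigma`, the spread of `alpha - mu` is `sigma` itself, so the region the sampler must explore narrows sharply as `log(sigma)` decreases. The sketch below illustrates this with prior draws only, under the priors stated earlier; the bin edges and seed are arbitrary choices, and the posterior funnel in the plot above additionally reflects the likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed for this sketch
S = 100_000
mu = rng.normal(-1.0, 1.0, size=S)               # mu ~ Normal(-1, 1)
sigma = np.abs(rng.normal(0.0, 1.0, size=S))     # sigma ~ half-Normal(0, 1)
alpha = rng.normal(mu, sigma)                    # one centered alpha per draw

log_sigma = np.log(sigma)
edges = [-4.0, -3.0, -2.0, -1.0, 0.0]
widths = []
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bin = (log_sigma >= lo) & (log_sigma < hi)
    widths.append(np.std(alpha[in_bin] - mu[in_bin]))
    print(f"log(sigma) in [{lo}, {hi}): sd(alpha - mu) = {widths[-1]:.3f}")
```

The printed spread grows by roughly a factor of e from one bin to the next, which is the neck-to-mouth widening of the funnel. A single step size cannot suit both the narrow neck and the wide mouth, which is one common explanation for why the centered parameterization is hard to sample.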

## The Non-Centered Parameterization 

The centered parameterization is challenging for MCMC methods to sample
when there are small counts per group (here, the players are the groups
and each has only 45 at bats observed).
Moving to a non-centered parameterization may mitigate the problem.
This changes the parameterization over which sampling is done,
now placing a standard normal prior on a new variable,

$$
\alpha^{\mathrm{std}}_n = \frac{\alpha_n - \mu}{\sigma}.
$$

Then we can parameterize in terms of $\alpha^{\mathrm{std}}$, which
has a standard-normal distribution

$$
p(\alpha^{\mathrm{std}}_n) = \mathsf{Normal}(\alpha^{\mathrm{std}}_n \, | \, 0, 1).
$$

We can then define our original $\alpha$ as a derived quantity

$$
\alpha_n = \mu + \sigma \, \alpha^{\mathrm{std}}_n.
$$

This decouples the sampling distribution
for $\alpha^{\mathrm{std}}$ from $\mu$ and $\sigma$, greatly reducing
their correlation in the posterior.  
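The decoupling can be checked numerically on prior draws. In the sketch below, which uses only NumPy and an arbitrary seed, the magnitude of the centered deviation `alpha - mu` is strongly correlated with `sigma`, while the magnitude of `alpha_std` is not. This illustrates the prior geometry only; how much the posterior decouples depends on how informative the data are.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed
S = 200_000
mu = rng.normal(-1.0, 1.0, size=S)               # mu ~ Normal(-1, 1)
sigma = np.abs(rng.normal(0.0, 1.0, size=S))     # sigma ~ half-Normal(0, 1)

alpha_std = rng.normal(0.0, 1.0, size=S)   # drawn independently of mu and sigma
alpha = mu + sigma * alpha_std             # original alpha as a derived quantity

corr_centered = np.corrcoef(sigma, np.abs(alpha - mu))[0, 1]
corr_noncentered = np.corrcoef(sigma, np.abs(alpha_std))[0, 1]
print(f"corr(sigma, |alpha - mu|) = {corr_centered:.2f}")
print(f"corr(sigma, |alpha_std|)  = {corr_noncentered:.2f}")
```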

###  Non-centered parameterization using a standard normal distribution

Prior to Stan 2.19, a Stan implementation directly encoded the above reparameterization,
introducing a new variable `alpha_std` which has a standard normal distribution,
thus decoupling the sampling distribution of `alpha_std` from `mu` and `sigma`.
```
parameters {
  real mu; // population mean of success log-odds
  real<lower=0> sigma; // population sd of success log-odds
  vector[N] alpha_std; // success log-odds (standardized)
}
model {
  mu ~ normal(-1, 1); // hyperprior
  sigma ~ normal(0, 1); // hyperprior
  alpha_std ~ normal(0, 1); // prior (hierarchical)
  y ~ binomial_logit(K, mu + sigma * alpha_std); // likelihood
}
```

The Stan program `hier-logit-nc-std-norm.stan` contains this model.

###  Non-centered parameterization using an affine transform


Since Stan version 2.19, the Stan language's
[affine transform](https://mc-stan.org/docs/reference-manual/univariate-data-types-and-variable-declarations.html) construct provides a more efficient way to do this.
For a real variable, the affine transform $x \mapsto \mu + \sigma x$ with offset $\mu$ and (positive) multiplier $\sigma$
is specified using a syntax like that used for upper/lower bounds, with keywords `offset` and `multiplier`.
Specifying the affine transform in the parameter declaration for
$\alpha$ eliminates the need for the intermediate variable $\alpha^{\mathrm{std}}$
and makes it easier to see the hierarchical structure of the model.
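The following sketch shows why an offset-multiplier declaration combined with a `normal(mu, sigma)` prior amounts to a standard normal prior on the underlying untransformed variable. It assumes, as described in the Stan reference manual, that the sampler works with the untransformed values and adds the log Jacobian of the affine map, which is `log(sigma)` per element. The numeric values and the name `alpha_raw` are hypothetical.

```python
import math
import numpy as np

def normal_lpdf(x, loc, scale):
    """Sum of Normal(loc, scale) log densities over the elements of x."""
    x = np.asarray(x, dtype=float)
    z = (x - loc) / scale
    return float(np.sum(-0.5 * z**2 - math.log(scale) - 0.5 * math.log(2 * math.pi)))

mu, sigma = -1.0, 0.3                      # hypothetical hyperparameter values
alpha_raw = np.array([-0.8, 0.1, 1.5])     # hypothetical untransformed values

# The affine map applied to the raw values: alpha = offset + multiplier * raw.
alpha = mu + sigma * alpha_raw
# d(alpha)/d(alpha_raw) = sigma for every element, so the log Jacobian is N * log(sigma).
log_jacobian = alpha_raw.size * math.log(sigma)

lp_affine = normal_lpdf(alpha, mu, sigma) + log_jacobian
lp_std_norm = normal_lpdf(alpha_raw, 0.0, 1.0)
print(lp_affine, lp_std_norm)
```

The two log densities agree, because the `-log(sigma)` term in each normal density is cancelled by the Jacobian term.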

When the parameters to the prior for $\sigma$ are constants, the
normalization for the half-prior (compared to the full prior) is
constant and therefore does not need to be included in the notation.
This only works if the parameters to the density are data or constants;
if they are defined as parameters or as quantities depending on parameters,
then explicit truncation is required.
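A short numerical sketch of this point: the factor that renormalizes a normal density truncated to positive values is constant when the location and scale are fixed, but in general it changes when they change. The helper names are illustrative; the normal CDF is computed from `math.erf`.

```python
import math

def normal_cdf(x, loc, scale):
    """Cumulative distribution function of Normal(loc, scale)."""
    return 0.5 * (1.0 + math.erf((x - loc) / (scale * math.sqrt(2.0))))

def log_truncation_factor(loc, scale):
    """Log of the factor that renormalizes Normal(loc, scale) restricted to x > 0."""
    return -math.log(1.0 - normal_cdf(0.0, loc, scale))

# The prior used here, Normal(0, 1) restricted to sigma > 0: the factor is exactly 2.
half_normal_factor = log_truncation_factor(0.0, 1.0)

# With a nonzero location, the factor depends on the values of loc and scale,
# so it could not be dropped if those were parameters.
factor_narrow = log_truncation_factor(1.0, 0.5)
factor_wide = log_truncation_factor(1.0, 2.0)
print(half_normal_factor, factor_narrow, factor_wide)
```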

The Stan program `hier-logit-nc-affine-xform.stan` uses the affine-transform syntax to
specify the non-centered version of the hierarchical model
with a normal prior on the log odds of success.


```
parameters {
  real mu; // population mean of success log-odds
  real<lower=0> sigma; // population sd of success log-odds
  vector<offset=mu, multiplier=sigma>[N] alpha; // success log-odds
}
model {
  mu ~ normal(-1, 1); // hyperprior
  sigma ~ normal(0, 1); // hyperprior
  alpha ~ normal(mu, sigma); // prior (hierarchical)
  y ~ binomial_logit(K, alpha); // likelihood
}
```


For comparison with the other models,
the chance of success $\theta$ is computed as a generated quantity.

```
generated quantities {
  vector[N] theta = inv_logit(alpha);
  vector[N] alpha_std = (alpha - mu) / sigma;
}
```

### Fitting the standard normal reparameterization

The model `hier-logit-nc-std-norm.stan` fits the model using parameter `alpha_std`.  (*Full disclosure: depending on the random seed, it may report 1 or 2 divergences for a sample of 2500 draws; we've chosen a random seed that avoids this problem and have used it throughout this note.*)

In [None]:
nc_std_norm_model = CmdStanModel(stan_file='hier-logit-nc-std-norm.stan')
print(nc_std_norm_model.code())

In [None]:
fit_nc_std_norm = nc_std_norm_model.sample(
    data=baseball_data,
    iter_sampling=int(M/4),
    seed=54321)

Again, we check for problems by running CmdStan's `diagnose` method.

In [None]:
print(fit_nc_std_norm.diagnose())

The estimates for `mu`, `sigma`, `theta` and `alpha` are roughly the same as for the centered parameterization.  The non-centered parameterization results in a much larger effective sample size.

In [None]:
print("Centered parameterization")
print(fit_centered.summary(sig_figs=3).round(decimals=3).filter(
    ["mu", "sigma",
     "theta[1]", "theta[5]", "theta[10]", "theta[18]",
     "alpha[1]", "alpha[5]", "alpha[10]", "alpha[18]"],
    axis="index"))

print("\nNon-centered parameterization, std_normal reparameterization")
print(fit_nc_std_norm.summary(sig_figs=3).round(decimals=3).filter(
    ["mu", "sigma",
     "theta[1]", "theta[5]", "theta[10]", "theta[18]",
     "alpha[1]", "alpha[5]", "alpha[10]", "alpha[18]"],
    axis="index"))

To see how the reparameterization is working, we plot the
posterior draws of the location $\mu$ and the log scale $\log \sigma$ of the hierarchical prior.
The prior location ($\mu$) and scale ($\sigma$) are coupled in the posterior.

In [None]:
df_x_y = pd.DataFrame(data={'x': fit_nc_std_norm.stan_variable('mu'),
                            'y': np.log(fit_nc_std_norm.stan_variable('sigma'))})

bda_plot(df_x_y,
         x_lab = "mu",
         y_lab = "log(sigma)",
         title = "Hierarchical params, standard normal reparameterization")

Now when we plot the sample values for log scale and the first transformed parameter, `alpha_std[1]`,
the range on the Y axis is (-12, 0) and the narrow funnel neck is gone.   But the values on the X axis range from -3 to 4, and are not directly interpretable as the log odds of success.

In [None]:
df_x_y_71 = pd.DataFrame(
    data={'x': fit_nc_std_norm.stan_variable('alpha_std')[: , 0],
          'y': np.log(fit_nc_std_norm.stan_variable('sigma'))}
)

bda_plot(df_x_y_71,
         x_lab = "alpha_std[1]: player 1 log odds of success (transformed)",
         y_lab = "log(sigma): log population scale",
         title = "population vs player params, non-centered parameterization")

We can also plot the value for the generated quantities variable `alpha[1]`.
This is readily interpretable as the log odds of success.
The y-axis range is again (-12, 0).
There isn't a pileup of draws at the bottom, indicating that the sampler has been able
to explore the posterior.
But it still has the long-tail problem as `sigma` approaches zero.

In [None]:
df_x_y_71 = pd.DataFrame(
    data={'x': fit_nc_std_norm.stan_variable('alpha')[: , 0],
          'y': np.log(fit_nc_std_norm.stan_variable('sigma'))}
)

bda_plot(df_x_y_71,
         x_lab = "alpha[1]: player 1 log odds of success",
         y_lab = "log(sigma): log population scale",
         title = "population vs player, back-transformed alpha")

### Fitting the affine transform parameterization

The model `hier-logit-nc-affine-xform.stan` looks just like the centered parameterization,
with the exception that parameter `alpha` is defined with `<offset = mu, multiplier = sigma>`.

In [None]:
nc_affine_xform_model = CmdStanModel(stan_file='hier-logit-nc-affine-xform.stan')
print(nc_affine_xform_model.code())

In [None]:
fit_nc_affine = nc_affine_xform_model.sample(
    data=baseball_data,
    iter_sampling=int(M/4),
    seed=54321)

As usual, we check for problems by running CmdStan's `diagnose` method.

In [None]:
print(fit_nc_affine.diagnose())

In [None]:
print("Centered parameterization")
print(fit_centered.summary(sig_figs=3).round(decimals=3).filter(
    ["mu", "sigma",
     "theta[1]", "theta[5]", "theta[10]", "theta[18]",
     "alpha[1]", "alpha[5]", "alpha[10]", "alpha[18]"],
    axis="index"))

print("\nNon-centered parameterization, std_normal reparameterization")
print(fit_nc_std_norm.summary(sig_figs=3).round(decimals=3).filter(
    ["mu", "sigma",
     "theta[1]", "theta[5]", "theta[10]", "theta[18]",
     "alpha[1]", "alpha[5]", "alpha[10]", "alpha[18]"],
    axis="index"))

print("\nNon-centered parameterization, affine transform reparameterization")
print(fit_nc_affine.summary(sig_figs=3).round(decimals=3).filter(
    ["mu", "sigma",
     "theta[1]", "theta[5]", "theta[10]", "theta[18]",
     "alpha[1]", "alpha[5]", "alpha[10]", "alpha[18]"],
    axis="index"))
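The three near-identical summary printouts can also be condensed into one side-by-side table. The helper below, `compare_column`, is a hypothetical convenience function and not part of CmdStanPy; it only assumes that each summary is a pandas DataFrame indexed by variable name. The frames used to exercise it are placeholders filled with dummy numbers, not results from these fits.

```python
import pandas as pd

def compare_column(summaries, rows, column):
    """Put one summary column from several fits side by side.

    summaries maps a label to a summary DataFrame indexed by variable name.
    """
    return pd.DataFrame(
        {label: summary.loc[rows, column] for label, summary in summaries.items()}
    )

# Placeholder frames standing in for fit.summary() output; the numbers are dummies.
rows = ["mu", "sigma"]
dummy_a = pd.DataFrame({"Mean": [1.0, 2.0], "StdDev": [0.1, 0.2]}, index=rows)
dummy_b = pd.DataFrame({"Mean": [3.0, 4.0], "StdDev": [0.3, 0.4]}, index=rows)

table = compare_column({"fit A": dummy_a, "fit B": dummy_b}, rows, "Mean")
print(table)
```

With the fits in this notebook, one would pass `fit_centered.summary()`, `fit_nc_std_norm.summary()`, and `fit_nc_affine.summary()` as the dictionary values. The name of the effective-sample-size column varies between CmdStan releases, so it is worth checking `summary().columns` before choosing `column`.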


We plot the sample values for log scale and the first player ability parameter, `alpha[1]`;
this looks pretty much the same as the above plot.

In [None]:
df_x_y_71 = pd.DataFrame(
    data={'x': fit_nc_affine.stan_variable('alpha')[: , 0],
          'y': np.log(fit_nc_affine.stan_variable('sigma'))}
)

bda_plot(df_x_y_71,
         x_lab = "alpha[1]: player 1 log odds of success",
         y_lab = "log(sigma): log population scale",
         title = "population vs player params, affine transform")

Both `hier-logit-nc-std-norm.stan` and `hier-logit-nc-affine-xform.stan` produce essentially
the same results; this is because both models are essentially the same model:
they encode the non-centered parameterization.

Using the affine transform syntax allows us to write models that directly express the hierarchical structure and whose parameters are readily interpretable, unlike the standard normal parameter values.
The only difference between the two programs is that for the standard normal parameterization
the user must apply the affine transform throughout the program to recover
an interpretable parameter estimate, whereas the offset-multiplier construct
does this automatically.

In program `hier-logit-nc-std-norm.stan`
variable `alpha_std` is a parameter with a standard normal prior.
The user must write out the affine transform `mu + sigma * alpha_std`
wherever `alpha` is needed.

In the generated quantities block we recover `theta`,
our estimate of a player's chance of success,
and `alpha`, the log-odds of success.
```
generated quantities {
  vector[N] theta = inv_logit(mu + sigma * alpha_std);
  vector[N] alpha = mu + sigma * alpha_std;  // recover alpha
}
```

In program `hier-logit-nc-affine-xform.stan`
variable `alpha` is a parameter with hierarchical prior `normal(mu, sigma)`.
In the generated quantities block we recover `theta`,
our estimate of a player's chance of success.
Were there a need for it, we would be able to generate variable `alpha_std`
as well.

```
generated quantities {
  vector[N] theta = inv_logit(alpha);
  vector[N] alpha_std = (alpha - mu)/sigma;
}
```

To verify that this is correct, we plot `alpha_std[1]` against `log(sigma)`.

In [None]:
df_x_y_71 = pd.DataFrame(
    data={'x': fit_nc_affine.stan_variable('alpha_std')[: , 0],
          'y': np.log(fit_nc_affine.stan_variable('sigma'))}
)

bda_plot(df_x_y_71,
         x_lab = "alpha_std[1]: player 1 log odds of success (transformed)",
         y_lab = "log(sigma): log population scale",
         title = "population vs player params, non-centered parameterization")
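The same check can be done numerically rather than visually. The sketch below standardizes an array of `alpha` draws, one row per draw and one column per player, using per-draw values of `mu` and `sigma`. The function name `standardize` is hypothetical, and the arrays are synthetic stand-ins for the output of `stan_variable`.

```python
import numpy as np

def standardize(alpha, mu, sigma):
    """Return (alpha - mu) / sigma, draw by draw.

    alpha has shape (draws, N); mu and sigma have shape (draws,).
    """
    alpha = np.asarray(alpha, dtype=float)
    mu = np.asarray(mu, dtype=float)[:, None]        # broadcast across players
    sigma = np.asarray(sigma, dtype=float)[:, None]
    return (alpha - mu) / sigma

# Synthetic draws with known standardized values, to exercise the round trip.
rng = np.random.default_rng(7)   # arbitrary seed
draws, N = 1000, 18
mu = rng.normal(-1.0, 0.1, size=draws)
sigma = np.abs(rng.normal(0.0, 0.2, size=draws)) + 0.01
alpha_std_true = rng.normal(size=(draws, N))
alpha = mu[:, None] + sigma[:, None] * alpha_std_true

alpha_std = standardize(alpha, mu, sigma)
print(np.allclose(alpha_std, alpha_std_true))
```

Applied to the affine-transform fit, `standardize(fit_nc_affine.stan_variable('alpha'), fit_nc_affine.stan_variable('mu'), fit_nc_affine.stan_variable('sigma'))` should agree with `fit_nc_affine.stan_variable('alpha_std')` up to floating-point error.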