# Fitting a hierarchical logistic model to German credit data

This notebook shows how to fit the toy hierarchical logistic regression model using the German credit data from [1]. In this example, we have predictors for 1000 individuals and an outcome variable indicating whether or not each individual should be given credit.

[1] "UCI machine learning repository", 2010. A. Frank and A. Asuncion. https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pints
import pints.toy

To run this example, we first need to get the data from [1] and process it so that we have dichotomous $y\in\{-1,1\}$ outputs and a standardised matrix of predictors. We also add a column of 1s corresponding to a constant term in the regression.
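
The preprocessing described above can be sketched with NumPy as follows. This is only an illustration on synthetic data, not the code Pints uses: the helper names `standardise_with_intercept` and `to_plus_minus_one` are hypothetical, and the assumption that the raw labels are coded 1 and 2 is made for the demo only.

```python
import numpy as np


def standardise_with_intercept(x_raw):
    """Standardise each column to zero mean and unit variance,
    then prepend a column of 1s for the constant term."""
    x_raw = np.asarray(x_raw, dtype=float)
    mean = x_raw.mean(axis=0)
    std = x_raw.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    x_std = (x_raw - mean) / std
    return np.column_stack([np.ones(len(x_std)), x_std])


def to_plus_minus_one(labels, positive_label):
    """Map a two-valued label vector onto {-1, 1}."""
    labels = np.asarray(labels)
    return np.where(labels == positive_label, 1, -1)


# Synthetic stand-in for the real data (hypothetical values)
rng = np.random.default_rng(0)
x_demo = rng.normal(5.0, 2.0, size=(6, 3))
y_demo = np.array([1, 2, 1, 1, 2, 1])  # assumed raw coding: 1 or 2

x_design = standardise_with_intercept(x_demo)
y_pm = to_plus_minus_one(y_demo, positive_label=1)
print(x_design.shape, y_pm)
```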

If you are connected to the internet, instantiating with `download=True` makes Pints fetch the data from the repository for you. If, instead, you have local copies of the `x` and `y` matrices, these can be supplied as arguments.

In [2]:
logpdf = pints.toy.GermanCreditHierarchicalLogPDF(download=True)
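
One reason for coding the outcome as $y\in\{-1,1\}$ is that the logistic log-likelihood takes a compact form, $\sum_i \log \sigma(y_i\, z_i^\top \beta)$, where $\sigma$ is the logistic function. The sketch below evaluates only this likelihood term on synthetic data. It is not the Pints implementation, it omits the hierarchical prior entirely, and the function name `logistic_log_likelihood` is hypothetical.

```python
import numpy as np


def logistic_log_likelihood(beta, z, y):
    """Log-likelihood of a logistic regression with labels in {-1, 1}.

    Uses log sigma(m) = -log(1 + exp(-m)), evaluated stably
    with np.logaddexp.
    """
    margins = y * (z @ beta)
    return -np.sum(np.logaddexp(0.0, -margins))


# Synthetic example (hypothetical values)
rng = np.random.default_rng(0)
z_small = rng.normal(size=(5, 3))
y_small = np.array([1, -1, 1, 1, -1])

# With all coefficients at zero, every outcome has probability 1/2
ll_zero = logistic_log_likelihood(np.zeros(3), z_small, y_small)
beta_demo = np.array([0.5, -1.0, 0.25])
ll_demo = logistic_log_likelihood(beta_demo, z_small, y_small)
```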

Let's look at the data: `x` is a matrix of predictors and `y` is a vector of credit recommendations for 1000 individuals. Pints also handles processing of `x` into a design matrix `z` of all interactions between variables (including with themselves).

In [3]:
x, y, z = logpdf.data()
print(z.shape)

(1000, 325)
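
The 325 columns are consistent with taking every product of two columns of `x`, including each column with itself: if `x` has 25 columns (24 predictors plus the constant), there are $25 \times 26 / 2 = 325$ such products. The sketch below reproduces that count on synthetic data. The column count of `x`, the column ordering, and the helper name `interaction_design_matrix` are assumptions for illustration; Pints may build `z` differently.

```python
from itertools import combinations_with_replacement

import numpy as np


def interaction_design_matrix(x):
    """Return all products x[:, i] * x[:, j] with i <= j."""
    x = np.asarray(x, dtype=float)
    pairs = combinations_with_replacement(range(x.shape[1]), 2)
    return np.column_stack([x[:, i] * x[:, j] for i, j in pairs])


# Synthetic stand-in: a constant column plus 24 random predictors
rng = np.random.default_rng(0)
x_demo = np.column_stack([np.ones(1000), rng.normal(size=(1000, 24))])

z_demo = interaction_design_matrix(x_demo)
print(z_demo.shape)
```

Because the constant column is included, the products with it recover the constant term and the original predictors, so `z_demo` contains the main effects as well as the interactions.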


Now we run HMC to fit the parameters of the model.
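
For orientation, HMC proposes new points by simulating Hamiltonian dynamics with a leapfrog integrator, and the step size and number of leapfrog steps set in the next cell control that simulation. The following is a minimal sketch of one leapfrog trajectory on a standard normal target. It is not the Pints implementation, and the names `leapfrog` and `hamiltonian` are hypothetical.

```python
import numpy as np


def leapfrog(q, p, grad_log_pdf, step_size, n_steps):
    """Simulate n_steps leapfrog steps from position q and momentum p."""
    q = np.array(q, dtype=float)
    p = np.array(p, dtype=float)
    p = p + 0.5 * step_size * grad_log_pdf(q)  # initial half step in momentum
    for i in range(n_steps):
        q = q + step_size * p  # full step in position
        if i < n_steps - 1:
            p = p + step_size * grad_log_pdf(q)  # full step in momentum
    p = p + 0.5 * step_size * grad_log_pdf(q)  # final half step in momentum
    return q, p


def hamiltonian(q, p):
    """Total energy for a standard normal target with unit mass."""
    return 0.5 * q @ q + 0.5 * p @ p


def grad_log_pdf(q):
    """Gradient of the log-density of a standard normal."""
    return -q


q0 = np.array([1.0, -0.5])
p0 = np.array([0.3, 0.8])
q1, p1 = leapfrog(q0, p0, grad_log_pdf, step_size=0.02, n_steps=1)

# A proposal is accepted with probability min(1, exp(-energy change)),
# so a small energy error means a high acceptance probability.
energy_error = abs(hamiltonian(q1, p1) - hamiltonian(q0, p0))
```

A small step size keeps the energy error low, but with a single leapfrog step each proposal moves only a short distance from the current point.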

In [4]:
xs = [
    np.random.uniform(0, 1, size=(logpdf.n_parameters())),
    np.random.uniform(0, 1, size=(logpdf.n_parameters())),
    np.random.uniform(0, 1, size=(logpdf.n_parameters())),
]

mcmc = pints.MCMCController(logpdf, len(xs), xs, method=pints.HamiltonianMCMC)
mcmc.set_max_iterations(400)

# Set up modest logging
mcmc.set_log_to_screen(True)
mcmc.set_log_interval(10)

for sampler in mcmc.samplers():
    sampler.set_leapfrog_step_size(0.02)
    sampler.set_leapfrog_steps(1)

# Run!
print('Running...')
chains = mcmc.run()
print('Done!')

Running...
Using Hamiltonian Monte Carlo
Generating 3 chains.
Running in sequential mode.
Iter. Eval. Accept.   Accept.   Accept.   Time m:s
0     3      0         0         0          0:00.1
1     6      0         0         0          0:00.5
2     9      0         0         0          0:01.1
3     12     0         0         0          0:01.4
10    33     0         0         0          0:03.6
20    63     0         0         0          0:07.5
30    93     0         0         0          0:10.9
40    123    0         0         0          0:14.5
50    153    0         0         0          0:18.4
60    183    0         0         0          0:21.8
70    213    0         0         0          0:25.4
80    243    0         0         0          0:31.2
90    273    0         0         0          0:34.8
100   303    0         0         0          0:38.5
110   333    0         0         0          0:42.4
120   363    0         0         0          0:46.6
130   393    0         0         0         

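The acceptance columns stay at 0 for all three chains, so with these settings no proposal is being accepted. This is clearly a much harder problem than the non-hierarchical version!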
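
The zero acceptance rates in the log above matter for convergence diagnostics. The Gelman-Rubin statistic $\hat{R}$ compares between-chain and within-chain variance, so chains that never move have a within-chain variance close to zero and $\hat{R}$ becomes enormous. The sketch below uses the textbook formula on synthetic chains. The exact variant used by `pints.MCMCSummary` may differ, and the name `gelman_rubin_rhat` and all numbers are hypothetical.

```python
import numpy as np


def gelman_rubin_rhat(chains):
    """Textbook Gelman-Rubin R-hat for one parameter.

    chains: array of shape (n_chains, n_samples).
    """
    chains = np.asarray(chains, dtype=float)
    _, t = chains.shape
    w = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    b = t * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    var_hat = (t - 1) / t * w + b / t          # pooled variance estimate
    return np.sqrt(var_hat / w)


rng = np.random.default_rng(1)

# Three well-mixed chains sampling the same distribution
mixed = rng.normal(0.0, 1.0, size=(3, 400))

# Three chains stuck near different starting points (tiny jitter only)
starts = np.array([0.1, 0.5, 0.9])[:, None]
stuck = starts + 1e-12 * rng.normal(size=(3, 400))

rhat_mixed = gelman_rubin_rhat(mixed)
rhat_stuck = gelman_rubin_rhat(stuck)
print(rhat_mixed, rhat_stuck)
```

Values of $\hat{R}$ close to 1 indicate that the chains agree; the stuck chains give a value many orders of magnitude larger.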

In [5]:
results = pints.MCMCSummary(chains=chains, time=mcmc.time())
print(results)


param      mean    std.    2.5%    25%    50%    75%    97.5%    rhat                  ess    ess per sec.
---------  ------  ------  ------  -----  -----  -----  -------  --------------------  -----  --------------
param 1    0.21    0.08    0.10    0.10   0.24   0.29   0.29     1477951987327431.50   nan    nan
param 2    0.27    0.07    0.18    0.18   0.26   0.37   0.37     2277021055607869.50   nan    nan
param 3    0.22    0.09    0.14    0.14   0.18   0.35   0.35     2716659842192505.00   nan    nan
param 4    0.47    0.14    0.29    0.29   0.50   0.63   0.63     4800474167075437.00   nan    nan
param 5    0.47    0.23    0.19    0.19   0.46   0.75   0.75     3419016103109629.00   2.99   0.02
param 6    0.56    0.34    0.12    0.12   0.61   0.95   0.95     5837769648701615.00   nan    nan
param 7    0.49    0.29    0.09    0.09   0.60   0.77   0.77     3454367561356858.00   nan    nan
param 8    0.49    0.24    0.21    0.21   0.46   0.79   0.79     2986223916444384.50   2.99   0.0