# Fitting a hierarchical logistic model to German credit data

This notebook shows how to run the toy hierarchical logistic regression example using the German credit data from [1]. The data contain predictors for 1000 individuals and an outcome variable indicating whether or not each individual should be given credit.


[1] A. Frank and A. Asuncion, "UCI machine learning repository", 2010. https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)

In [1]:
import pints
import pints.toy
import pints.plot
import numpy as np
import matplotlib.pyplot as plt
import io
import urllib
from scipy import stats
import time

To run this example, we first need to get the data from [1] and process it so that we have dichotomous $y\in\{-1,1\}$ outputs and a standardised matrix of predictors. We also add a column of 1s corresponding to a constant term in the regression.
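
The preprocessing described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the code Pints runs: the function name `preprocess`, the 0/1 label coding of the demo input, and the position of the constant column are all assumptions made for the example.

```python
import numpy as np


def preprocess(raw_x, raw_y):
    """Standardise predictors, append a constant column, map labels to {-1, 1}.

    Hypothetical helper for illustration only; labels are assumed to be
    coded 0/1 on input.
    """
    raw_x = np.asarray(raw_x, dtype=float)
    # Standardise each column to zero mean and unit standard deviation
    x_std = (raw_x - raw_x.mean(axis=0)) / raw_x.std(axis=0)
    # Append a column of 1s for the constant term in the regression
    x = np.column_stack([x_std, np.ones(len(x_std))])
    # Map 0 -> -1 and 1 -> +1
    y = 2 * np.asarray(raw_y, dtype=int) - 1
    return x, y


# Synthetic stand-in for the real data: 200 individuals, 4 predictors
rng = np.random.default_rng(1)
raw_x_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
raw_y_demo = rng.integers(0, 2, size=200)

x_demo, y_demo = preprocess(raw_x_demo, raw_y_demo)
print(x_demo.shape)  # (200, 5): four standardised predictors plus the constant
```

Standardising puts all predictors on a common scale, which tends to make a single leapfrog step size more workable across parameters.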

If you are connected to the internet, instantiating with `x=None` makes Pints fetch the data from the repository for you. If you instead have local copies of the `x` and `y` matrices, you can supply them as arguments.

In [2]:
logpdf = pints.toy.GermanCreditHierarchicalLogPDF()

Let's look at the data: `x` is a matrix of predictors and `y` is a vector of credit recommendations for 1000 individuals. Pints also handles processing of `x` into a design matrix `z` of all interactions between variables (including with themselves).

In [3]:
x, y, z = logpdf.data()
print(z.shape)

(1000, 325)
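
The 325 columns are consistent with taking every product of two columns of `x`, including each column with itself: with $d$ columns there are $d(d+1)/2$ such products, which is 325 when $d = 25$. The sketch below builds such a matrix for random data. It assumes `x` has 25 columns and uses a hypothetical helper `pairwise_products`; the column ordering and the exact construction used inside Pints may differ.

```python
import numpy as np


def pairwise_products(x):
    """Return all products x[:, i] * x[:, j] with i <= j, one per column.

    Hypothetical helper for illustration; not part of Pints.
    """
    # Index pairs (i, j) of the upper triangle, diagonal included
    i_idx, j_idx = np.triu_indices(x.shape[1])
    return x[:, i_idx] * x[:, j_idx]


# Random stand-in for the predictor matrix (25 columns is an assumption)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(1000, 25))

z_demo = pairwise_products(x_demo)
print(z_demo.shape)  # (1000, 325)
```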


Now we run HMC to fit the parameters of the model.
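
The two sampler settings used in the next cell, the leapfrog step size and the number of leapfrog steps, control how each HMC proposal is generated. The sketch below is a generic textbook version of one HMC update with an identity mass matrix, run on a 2D standard normal target. It is not the Pints implementation, and the names `leapfrog`, `hmc_step`, `normal_log_pdf` and `normal_grad` are made up for this example.

```python
import numpy as np


def leapfrog(theta, r, grad_log_pdf, step_size, n_steps):
    """Integrate Hamiltonian dynamics for n_steps leapfrog steps."""
    # Initial half step for the momentum
    r = r + 0.5 * step_size * grad_log_pdf(theta)
    for step in range(n_steps):
        # Full step for the position
        theta = theta + step_size * r
        # Full momentum step between position updates (not after the last)
        if step < n_steps - 1:
            r = r + step_size * grad_log_pdf(theta)
    # Final half step for the momentum
    r = r + 0.5 * step_size * grad_log_pdf(theta)
    return theta, r


def hmc_step(theta, log_pdf, grad_log_pdf, step_size, n_steps, rng):
    """Propose with leapfrog, then accept or reject (Metropolis)."""
    r0 = rng.normal(size=theta.shape)  # fresh momentum draw
    theta_new, r_new = leapfrog(theta, r0, grad_log_pdf, step_size, n_steps)
    # Hamiltonian = negative log density plus kinetic energy
    h_old = -log_pdf(theta) + 0.5 * r0 @ r0
    h_new = -log_pdf(theta_new) + 0.5 * r_new @ r_new
    if np.log(rng.uniform()) < h_old - h_new:
        return theta_new, True
    return theta, False


def normal_log_pdf(theta):
    """Unnormalised log density of a standard normal (demo target)."""
    return -0.5 * theta @ theta


def normal_grad(theta):
    """Gradient of normal_log_pdf."""
    return -theta


rng = np.random.default_rng(2)
theta = np.array([1.0, -1.0])
n_accepted = 0
for _ in range(200):
    # Same step size and step count as the cell below
    theta, accepted = hmc_step(theta, normal_log_pdf, normal_grad, 0.02, 1, rng)
    n_accepted += accepted
print(n_accepted / 200)
```

With a single leapfrog step and a small step size, each accepted proposal moves the chain only a short distance, so many iterations may be needed to explore a high-dimensional posterior.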

In [7]:
# Start three chains from random points in the unit hypercube
n_chains = 3
xs = [
    np.random.uniform(0, 1, size=logpdf.n_parameters())
    for _ in range(n_chains)
]

mcmc = pints.MCMCController(logpdf, len(xs), xs, method=pints.HamiltonianMCMC)
mcmc.set_max_iterations(400)

# Set up modest logging
mcmc.set_log_to_screen(True)
mcmc.set_log_interval(10)

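# Configure the leapfrog integrator of each chain's sampler
for sampler in mcmc.samplers():
    sampler.set_leapfrog_step_size(0.02)
    sampler.set_leapfrog_steps(1)

# Time the run so the summary can report ESS per second
start = time.time()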

# Run!
print('Running...')
chains = mcmc.run()
print('Done!')

end = time.time()  
diff = end - start

Running...
Using Hamiltonian Monte Carlo
Generating 3 chains.
Running in sequential mode.
Iter. Eval. Accept.   Accept.   Accept.   Time m:s
0     3      0         0         0          0:00.1
1     6      0.333     0.333     0          0:00.6
2     9      0.5       0.5       0          0:01.1
3     12     0.6       0.4       0.2        0:01.5
10    33     0.667     0.5       0.5        0:04.6
20    63     0.682     0.5       0.636      0:09.0
30    93     0.71875   0.46875   0.71875    0:13.3
40    123    0.667     0.452381  0.762      0:17.7
50    153    0.654     0.423     0.731      0:22.0
60    183    0.677     0.452     0.71       0:26.3
70    213    0.653     0.458     0.708      0:30.5
80    243    0.659     0.451     0.683      0:35.0
90    273    0.62      0.467     0.652      0:39.2
100   303    0.608     0.48      0.627451   0:43.6
110   333    0.589     0.455     0.643      0:48.0
120   363    0.582     0.434     0.631      0:52.6
130   393    0.576     0.432     0.614     

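This is clearly a much harder problem than the non-hierarchical version! The acceptance rates differ markedly between the three chains, and the summary below reports `rhat` values far above 1 for most parameters, with effective sample sizes of only around 10 to 20, so the chains have not converged.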
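
One way to quantify the difficulty is the Gelman-Rubin $\hat{R}$ statistic, which compares between-chain and within-chain variance and approaches 1 as chains converge to the same distribution. The sketch below implements the classic form of the statistic for a single parameter on synthetic chains. The helper name `gelman_rubin` is made up for this example, and the variant computed by `pints.MCMCSummary` may differ in details such as warm-up handling.

```python
import numpy as np


def gelman_rubin(chains):
    """Classic Gelman-Rubin statistic for one parameter.

    chains: array of shape (n_chains, n_samples).
    Hypothetical helper for illustration only.
    """
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    # Within-chain variance: average of the per-chain sample variances
    w = chains.var(axis=1, ddof=1).mean()
    # Between-chain variance: n times the variance of the chain means
    b = n * chains.mean(axis=1).var(ddof=1)
    # Pooled estimate of the posterior variance
    var_hat = (n - 1) / n * w + b / n
    return np.sqrt(var_hat / w)


rng = np.random.default_rng(3)

# Three chains sampling the same distribution: statistic close to 1
mixed = rng.normal(size=(3, 1000))
print(gelman_rubin(mixed))

# Three chains stuck around different values: statistic well above 1
stuck = mixed + np.array([[0.0], [5.0], [10.0]])
print(gelman_rubin(stuck))
```

Values well above 1, like those for the stuck chains here, indicate that the chains disagree about where the posterior mass lies.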

In [8]:
results = pints.MCMCSummary(chains=chains, time=diff)
print(results)

param      mean    std.    2.5%    25%    50%    75%    97.5%    rhat    ess    ess per sec.
---------  ------  ------  ------  -----  -----  -----  -------  ------  -----  --------------
param 1    0.65    0.08    0.54    0.58   0.63   0.70   0.81     1.05    9.11   0.05
param 2    0.22    0.16    -0.04   0.08   0.21   0.37   0.51     3.80    8.83   0.05
param 3    0.76    0.16    0.52    0.56   0.81   0.91   0.93     13.71   8.98   0.05
param 4    0.67    0.29    0.25    0.26   0.85   0.89   0.92     26.78   13.57  0.07
param 5    0.56    0.25    0.24    0.27   0.55   0.87   0.87     27.42   17.72  0.10
param 6    0.68    0.25    0.31    0.33   0.82   0.89   0.91     29.91   13.81  0.07
param 7    0.57    0.26    0.22    0.24   0.62   0.84   0.90     17.76   11.31  0.06
param 8    0.73    0.24    0.39    0.40   0.89   0.91   0.93     32.07   17.13  0.09
param 9    0.60    0.20    0.32    0.33   0.65   0.80   0.82     39.49   19.62  0.11
param 10   0.54    0.19    0.27    0.29   0.61 