<div style=float:right><img src="assets/img/appliedai-logo.png" width=100 style="margin: 0px 20px"></img></div>


##### Jonathan Sedar Personal Project
## PyMC3 vs PyStan Comparison
_Spring 2016_

This set of Notebooks and scripts comprises the **pymc3_vs_pystan** personal project by Jonathan Sedar of Applied AI Ltd, written primarily for presentation at the PyData London 2016 Conference.

The project demonstrates hierarchical linear regression using two Bayesian inference frameworks: PyMC3 and PyStan. The project borrows heavily from code written for Applied AI Ltd and is supplied here for educational purposes only. No copyright or license is extended to users.


    
# 40_HierarchicalLinearRegression

#### Demonstrate pooling and hierarchical linear regression


Create a set of progressively more complex models, trying to show the effect of manufacturer upon NOx emissions. I'll evaluate the models using the Widely Applicable Information Criterion (WAIC) and posterior predictive checks (PPC).

+ [Setup](#Setup)
    + [Local Functions](#Local-Functions)
    + [Load Data](#Load-Data)
    + [Prepare Dataset](#Prepare-Dataset)
    + [Describe Dataset](#Describe-Dataset)


+ [Choose Features](#Choose-Features)
    + [Create Modelspecs and Design Matrices](#Create-Modelspecs-and-Design-Matrices)


+ [Pooled Model](#Pooled-Model)


+ [Unpooled Model](#Unpooled-Model)
    + [Evaluate Manufacturers using Unpooled Model](#Evaluate-Manufacturers-using-Unpooled-Model)
    
    
+ [Digression: Fully Unpooled Model](#Digression:-Fully-Unpooled-Model)


+ [Partially-Pooled Model](#Partially-Pooled-Model)
    + [Evaluate Manufacturers using Partially-Pooled Model](#Evaluate-Manufacturers-using-Partially-Pooled-Model)
    + [Can we comment on Volkswagen's NOx emissions at `mfr` level?](#Can-we-comment-on-Volkswagen's-NOx-emissions-at-`mfr`-level?)


+ [Hierarchical Model of Parent and Manufacturer](#Hierarchical-Model-of-Parent-and-Manufacturer)



+ [Summary Insights and Notes](#Summary-Insights-and-Notes)
    + [Model Comparison using WAIC](#Model-Comparison-using-WAIC)
    + [Comparing Car Manufacturers and Parent Companies](#Comparing-car-manufacturers-and-parent-companies)




+ [Digression: Model Comparison using WAIC](#Digression:-Model-Comparison-using-WAIC)


+ [Final look at parent with a Partially-Pooled Model](#Final-look-at-parent-with-a-Partially-Pooled-Model)
    + [Evaluate Manufacturers using Partially-Pooled `parent` Model](#Evaluate-Manufacturers-using-Partially-Pooled-`parent`-Model)
    + [Can we comment on Volkswagen's NOx emissions at `parent` level?](#Can-we-comment-on-Volkswagen's-NOx-emissions-at-`parent`-level?)



# Setup

In [None]:
## Interactive magics
%matplotlib inline
%qtconsole --colors=linux
# %connect_info

In [None]:
# general packages
import re
import sys
from time import time
from collections import defaultdict
import sqlite3
from convenience_functions import *
from ipywidgets import interactive, fixed

# scientific packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import patsy as pt
from scipy import optimize
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.neighbors.kde import KernelDensity
import statsmodels.api as sm

# pymc3, theano and pystan libraries
import pymc3 as pm
import theano as thno
import theano.tensor as T 
import pystan

# filter warnings for presentation's sake
import warnings
warnings.filterwarnings('ignore')

sns.set(style="darkgrid", palette="muted")
pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = 12, 4
np.random.seed(0)

dfwaic_pymc = pd.DataFrame() # setup for WAIC evaluations

##### Package Versions

In [None]:
print('Python: {}'.format(sys.version))
print('Recursion limit {}'.format(sys.getrecursionlimit()))
print('theano: {}'.format(thno.__version__))
print('PyMC3: {}'.format(pm.__version__))
print('PyStan: {}'.format(pystan.__version__))

##### Set switches for run-all convenience

In [None]:
sample_switches = {'pooled':{'pymc':1, 'stan':1},
                   'unpooled':{'pymc':1, 'stan':1},
                   'fullyunpooled':{'pymc':0},
                   'partpooled':{'pymc':1, 'stan':1},
                   'hier':{'pymc':1, 'stan':1}}

runtimes = defaultdict(lambda : defaultdict(dict))

## Local Functions

In [None]:
def create_smry(trc, dfs, pname='mfr'):
    ''' Conv fn: create trace summary for sorted forestplot '''

    dfsm = pm.df_summary(trc).reset_index()
    dfsm.rename(columns={'index':'featval'}, inplace=True)
    dfsm = dfsm.loc[dfsm['featval'].apply(
        lambda x: re.search('{}__[0-9]+'.format(pname), x) is not None)]

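    ## LabelEncoder assigns codes in sorted order, so label the rows with sorted uniques
    dfsm.set_index(np.sort(dfs[pname].unique()), inplace=True)
    dfsm.sort_values('mean', ascending=True, inplace=True)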
    dfsm['ypos'] = np.arange(len(dfsm))
    
    return dfsm


def custom_forestplot(df, sg, ylabel='mfr', size=8, aspect=0.8, facetby=None):
    ''' Conv fn: plot features from pm.df_summary using seaborn
        Facet on sets of forests for comparison '''

    g = sns.FacetGrid(col=facetby, hue='mean', data=df, palette='RdBu_r'
                      ,size=size, aspect=aspect, sharey=True)
    _ = g.map(plt.scatter, 'mean', 'ypos'
                ,marker='o', s=100, edgecolor='#333333', linewidth=0.8, zorder=10)
    _ = g.map(plt.hlines, 'ypos', 'hpd_2.5','hpd_97.5', color='#aaaaaa')

    _ = g.axes.flat[0].set_ylabel(ylabel)
    _ = [ax.set_xlabel('coeff value') for ax in g.axes.flat]
    _ = g.axes.flat[0].set_ylim((-1, df['ypos'].max()+1))
    _ = g.axes.flat[0].set_yticks(np.arange(df['ypos'].max()+1))
    _ = g.axes.flat[0].set_yticklabels(df.index)

    if sg is not None:
        # hacky way to show group sizes at rhs
        sg = sg[df.iloc[:len(df)//len(g.axes.flat),:].index]    ## ensure index order
        _ = axr = g.axes.flat[len(g.axes.flat)-1].twinx()
        _ = axr.plot(np.zeros(len(sg)), range(len(sg)), lw=0)
        _ = axr.set_ylim((-1,len(sg)))
        _ = axr.set_yticks([l for l in range(len(sg))])
        _ = axr.set_yticklabels(sg)
        _ = axr.yaxis.grid(False)
        _ = axr.spines['top'].set_visible(False)
        _ = axr.spines['right'].set_visible(False)
        
    
def custom_2d_forestplot(dfg, show='Zoom'):
    ''' Conv fn: custom 2d forestplot of parent and mfr, optional CRs 
        NOTE legend hack to cope with open issue with FacetGrid hue and plt.scatter
        https://github.com/mwaskom/seaborn/issues/840
    
    '''    

    g = sns.FacetGrid(data=dfg, size=12, aspect=0.8)

    _ = g.map(plt.scatter, 'mean_parent', 'mean_mfr',
              marker='o', s=100, edgecolor='#333333', linewidth=0.8, zorder=10,
              c=[sns.color_palette('Spectral', 20).as_hex()[m] for m in dfg['parent_order_id']])
    
    ylim_zoom = g.axes.flat[0].get_ylim()
    xlim_zoom = g.axes.flat[0].get_xlim()
    
    _ = g.map(plt.hlines, 'mean_mfr', 'hpd_2.5_parent', 'hpd_97.5_parent', color='#bbbbbb')
    _ = g.map(plt.vlines, 'mean_parent', 'hpd_2.5_mfr', 'hpd_97.5_mfr', color='#bbbbbb')

    # hack autogen legend to show only parent dots
    # hnd, lbl = g.axes.flat[0].get_legend_handles_labels()
    # _ = g.axes.flat[0].legend(hnd[:20], lbl[:20], loc='upper left', ncol=1,
    #     title='parent', frameon=True, framealpha=0.5).get_frame().set_facecolor('w')

    # Manually create legend from custom artist/label lists
    legcolors = [plt.Line2D([0,0],[0,1],
                            color=sns.color_palette('Spectral', 20).as_hex()[p],
                            ms=10, marker='o', mec='#333333', mew=1, linestyle='')
                 for p in range(20)]
    
    _ = g.axes.flat[0].legend(legcolors,
                              dfg['parent'].drop_duplicates().values.tolist(),
                loc='upper left', ncol=1, title='parent', frameon=True, 
                framealpha=0.5, numpoints=1).get_frame().set_facecolor('w')

    _ = g.axes.flat[0].set_ylabel('mfr')
    _ = g.axes.flat[0].set_xlabel('parent')

    for i, r in dfg[['mfr','n','mean_parent','mean_mfr']].iterrows():
        _ = g.axes.flat[0].annotate('{} ({})'.format(r[0],r[1]), #.split(' - ')[1]
                         xy=(r[2], r[3]), xycoords='data',
                         xytext=(5,5), textcoords='offset points',
                         color='#444444', fontsize=10, rotation=30, va='bottom')
    if show == 'Zoom':
        _ = g.axes.flat[0].set_ylim(ylim_zoom)
        _ = g.axes.flat[0].set_xlim(xlim_zoom) 
        

## Load Data

In [None]:
cnxsql = sqlite3.connect('data/car_emissions.db')
dfs = pd.read_sql('select * from cars_post_exclusions_2sd', cnxsql, index_col=None)

In [None]:
## convert sqlite bool storage (as ints) back to bools
for ft in ['parent_is_vw', 'mfr_is_vw', 'is_tdi']:
    dfs[ft] = dfs[ft].astype(bool)

In [None]:
custom_describe(dfs)

##### Label encode `mfr` and `parent`

In [None]:
le = LabelEncoder()
dfs['mfr_enc'] = le.fit_transform(dfs['mfr'])
dfs['parent_enc'] = le.fit_transform(dfs['parent'])

n_parent = dfs['parent_enc'].max()+1
n_mfr = dfs['mfr_enc'].max()+1

##### Declare feats for use

In [None]:
fts_cat = ['parent_is_vw', 'mfr_is_vw', 'parent', 'mfr', 'trans', 'fuel_type', 'is_tdi']
fts_cat_smp = ['mfr_is_vw','trans','fuel_type','is_tdi']
fts_num = ['metric_combined', 'metric_extra_urban', 'metric_urban_cold'
           ,'engine_capacity', 'emissions_co_mgkm']
fts_num_smp = ['metric_combined', 'engine_capacity', 'emissions_co_mgkm']
ft_endog = 'emissions_nox_mgkm'

## Describe Dataset

+ The dataset is 2593 rows, with 12 exog features, 1 endog feature.
+ These are observations of car emissions tests, one row per car.
+ You can read off the basic distributional statistics of the features in the table above. Numeric features have been standardized according to [Gelman's 2sd principle](http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf).
+ I have selected these particular 12 features to work with. Some are derivatives of original features.
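
As a minimal sketch of the 2sd standardization mentioned above: center each numeric feature and divide by two standard deviations, so that numeric coefficients are roughly comparable to those of binary features. The dataset was standardized before this Notebook, so the function name `standardize_2sd` and the toy values are illustrative assumptions rather than the code actually used.

```python
import numpy as np

def standardize_2sd(x):
    """Center x and divide by two (sample) standard deviations."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (2 * x.std(ddof=1))

# toy engine capacities (hypothetical values, not taken from the dataset)
raw = np.array([998., 1398., 1598., 1998., 2993.])
z = standardize_2sd(raw)
print(z.round(3))
```

After this transform a feature has mean 0 and standard deviation 0.5, so a one-unit change spans roughly the low-to-high range of the data.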

We have the following features to choose from:

```
+ Categoricals:
    + `trans`     - the car transmission, simplified to 'auto', 'semiauto', 'manual'
    + `fuel_type` - the car power supply, simplified to 'petrol', 'diesel'
    + `parent`    - the parent company of the car manufacturer, 20 values
    + `mfr`       - the car manufacturer, 38 values

+ Booleans:
    + `parent_is_vw` - if the parent company of the car manufacturer is Volkswagen
    + `mfr_is_vw`    - if the car manufacturer is Volkswagen
    + `is_tdi`       - (processed feature) if the car engine type is a turbo diesel
    
+ Numerics:
    + `metric_combined`    - a score for fuel efficiency in combined driving
    + `metric_extra_urban` - a score for fuel efficiency in extra-urban driving
    + `metric_urban_cold`  - a score for fuel efficiency in an urban setting, cold start
    + `engine_capacity`    - the engine capacity of the car
    + `emissions_co_mgkm`  - the mass of CO emitted, mg/km
    
+ Numeric endogenous feature:
    + `emissions_nox_mgkm` - the mass of NOx emitted, mg/km    
```

For the purposes of this Notebook, the final feature mentioned `emissions_nox_mgkm` will be used as the _endogenous_ / _dependent_ / _output_ feature of the linear models. All other features may be used as _exogenous_ / _independent_ / _input_ features.

---

---

# Choose Features

In a previous Notebook, I used a Lasso model for feature reduction. I'll broadly follow the results of that exercise here and use the following features for modelling. I include `emissions_co_mgkm` just to demonstrate a continuous feature in there.

Note: I will use this `glm` model specification for the pooled model. I will have to manually specify the unpooled, partially-pooled and hierarchical models.

```
endogenous feature: emissions_nox_mgkm

exogenous features: parent              : multi-class string
                    mfr                 : multi-class string
                    fuel_type           : multi-class string
                    trans               : multi-class string
                    is_tdi              : boolean
                    engine_capacity     : numeric int
                    metric_combined     : numeric int
                    emissions_co_mgkm   : numeric float
```

##### Reminder of mfr and parent counts:

In [None]:
print('parent: {} uniques\nmfr: {} uniques'.format(
        len(dfs['parent'].unique()), len(dfs['mfr'].unique())))

## Create Modelspecs and Design Matrices

##### Only possible to use this formula for pooled model

In [None]:
fml_pooled = '{} ~ '.format(ft_endog) + ' + '.join(['fuel_type','trans'
            ,'is_tdi','engine_capacity','metric_combined','emissions_co_mgkm'])
print(fml_pooled)

In [None]:
(mx_en, mx_ex) = pt.dmatrices(fml_pooled, dfs
                        ,return_type='dataframe', NA_action='raise')
mx_ex.head()
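
To make the column names in `mx_ex` less mysterious, here is a rough pure-Python sketch of the treatment (dummy) coding that patsy applies to a categorical feature by default: the first level in sorted order becomes the reference, and each remaining level gets an indicator column named `feature[T.level]`. The helper `treatment_code` is a hypothetical illustration, not part of patsy.

```python
def treatment_code(name, values):
    """Return {column_name: [0/1, ...]} using the first sorted level as reference."""
    levels = sorted(set(values))
    return {'{}[T.{}]'.format(name, lvl): [int(v == lvl) for v in values]
            for lvl in levels[1:]}

# toy transmission values for four cars
trans = ['manual', 'auto', 'semiauto', 'manual']
coded = treatment_code('trans', trans)
for col in sorted(coded):
    print(col, coded[col])
```

The reference level `auto` is absorbed into the intercept, which is why the manually specified models below only carry coefficients for `trans[T.manual]` and `trans[T.semiauto]`.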

---

---

# Pooled Model

Pool (ignore) the `parent` and `mfr` features.

$$y \sim \mathcal{N}(\beta^{T} \mathbf{x},\epsilon)$$

where:  
$\beta$ are our coeffs in the linear model  
$\mathbf{x}$ is the vector of features describing each car in the dataset  
$\epsilon \sim \mathcal{HalfCauchy}(0, 10)$ 

I'll attempt to handle outliers robustly this time by using a Student-T distribution for the likelihood in place of the Normal written above; the error term $\epsilon$ is the stochastic noise in that likelihood.
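
To illustrate why the Student-T likelihood is more forgiving of outliers, here is a small sketch comparing the log-density that a Normal and a Student-T (with $\nu = 1$, as used in the models below) assign to residuals of increasing size. The helper names and the unit scale are my own assumptions for illustration.

```python
import math

def normal_logpdf(x, mu=0.0, sd=1.0):
    """Log-density of a Normal(mu, sd)."""
    z = (x - mu) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2 * math.pi)

def studentt_logpdf(x, nu=1.0, mu=0.0, sd=1.0):
    """Log-density of a Student-T with nu degrees of freedom, location mu, scale sd."""
    z = (x - mu) / sd
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sd)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

# residuals measured in units of the scale parameter
for resid in (0.5, 3.0, 10.0):
    print(resid, round(normal_logpdf(resid), 2), round(studentt_logpdf(resid), 2))
```

The Normal penalizes a large residual quadratically, whereas the heavy-tailed Student-T penalizes it only logarithmically, so a handful of extreme cars pull the fitted coefficients around far less.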

---

## PyMC3 Model

##### Create model and sample

In [None]:
if sample_switches['pooled']['pymc']:

    with pm.Model() as mdl_pooled_pymc:

        t0 = time()
        pm.glm.glm(fml_pooled, dfs, family=pm.glm.families.StudentT())

        t1 = time()
        runtimes['pooled']['pymc']['build'] = t1 - t0
        
        start_map = pm.find_MAP(fmin=optimize.fmin_powell)
        
        t2 = time()
        runtimes['pooled']['pymc']['find_map'] = t2 - t1
        
        trc_pooled_pymc = pm.sample(2000, njobs=1, step=pm.NUTS(),
                               start=start_map,
                               trace=pm.backends.Text('traces/trc_pooled_pymc'))

        runtimes['pooled']['pymc']['sample'] = time() - t2
        
    ## dump to disk    
    write_pickle(obj=mdl_pooled_pymc, relnm='models/mdl_pooled_pymc.pkl')
    
else:
    with pm.Model():   ## read from disk

        mdl_pooled_pymc = read_pickle(relnm='models/mdl_pooled_pymc.pkl')

    trc_pooled_pymc = pm.backends.text.load('traces/trc_pooled_pymc', model=mdl_pooled_pymc)

##### Save WAIC and view traces

In [None]:
dfwaic_pymc['pooled'] = [pm.stats.waic(model=mdl_pooled_pymc, trace=trc_pooled_pymc[-1000:])]

rvs_pooled = [rv.name for rv in strip_derived_rvs(mdl_pooled_pymc.unobserved_RVs)]
plot_traces_pymc(trc_pooled_pymc[-1000:], varnames=rvs_pooled)
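
For readers unfamiliar with WAIC, the following is a minimal NumPy sketch of the usual calculation from a matrix of pointwise log-likelihoods (rows are posterior samples, columns are observations). It runs on synthetic numbers, the function name `waic_from_loglik` is hypothetical, and I make no claim that it matches the internals of `pm.stats.waic`.

```python
import numpy as np

def waic_from_loglik(loglik):
    """WAIC on the deviance scale from an (n_samples, n_obs) log-likelihood array."""
    loglik = np.asarray(loglik, dtype=float)
    # log pointwise predictive density: log of the mean likelihood per observation
    # (subtract the column max before exponentiating for numerical stability)
    mx = loglik.max(axis=0)
    lppd = np.sum(mx + np.log(np.mean(np.exp(loglik - mx), axis=0)))
    # effective number of parameters: variance of the log-likelihood per observation
    p_waic = np.sum(np.var(loglik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic), p_waic

rng = np.random.RandomState(0)
y = rng.normal(0.0, 1.0, size=50)                 # synthetic observations
mu_samples = rng.normal(0.0, 0.1, size=(200, 1))  # pretend posterior draws of a mean
loglik = -0.5 * (y - mu_samples) ** 2 - 0.5 * np.log(2 * np.pi)
waic, p_waic = waic_from_loglik(loglik)
print(waic, p_waic)
```

Lower values indicate better expected out-of-sample fit; the `p_waic` term penalizes models whose pointwise log-likelihood varies a lot across the posterior.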

**Observe**:


+ Stuff



---

## PyStan Model

##### Spec model

In [None]:
mdl_pooled_stan_spec = """

    /* Spec for pooled regression linear model (robust) */

    data {                                 
        int<lower=0> N;                     // count of observations
        int<lower=0> K;                     // count of exog features
        matrix[N, K] X;                     // exog features
        vector[N] y;                        // endog feature
    }
    parameters {
        vector[K] beta;                     // exog coeffs
        real<lower=0> sigma;                // linear model error
    }
    transformed parameters {}
    model {  
        sigma ~ cauchy(0, 10);              // explicit prior for error (half-cauchy)
        y ~ student_t(1, X * beta, sigma);  // student-T likelihood
    }
    generated quantities {}
"""

In [None]:
mdl_pooled_stan_dict = {}
mdl_pooled_stan_dict['N'] = mx_ex.shape[0]
mdl_pooled_stan_dict['K'] = mx_ex.shape[1]
mdl_pooled_stan_dict['X'] = mx_ex.values
mdl_pooled_stan_dict['y'] = mx_en[ft_endog].values

##### Run model

In [None]:
if sample_switches['pooled']['stan']:

    mdl_pooled_stan_fit = pystan.stan(
            model_code  = mdl_pooled_stan_spec,
            data        = mdl_pooled_stan_dict,
            sample_file = ensure_dir('traces/trc_pooled_stan/chain'),
            model_name  = 'mdl_pooled_stan',
            iter        = 1000,
            warmup      = 750, 
            chains      = 4, 
            n_jobs      = 2, 
            verbose     = False)
    
    # dump model object and model fit object to disk
    write_pickle(obj=mdl_pooled_stan_fit.get_stanmodel(),
                 relnm='models/mdl_pooled_stan_model.pkl')

    write_pickle(mdl_pooled_stan_fit, 'models/mdl_pooled_stan_fit.pkl')
    
else:      
    ## load the model first, and then the fit:
    mdl_pooled_stan_model = read_pickle(relnm='models/mdl_pooled_stan_model.pkl')

    mdl_pooled_stan_fit = read_pickle(relnm='models/mdl_pooled_stan_fit.pkl')

In [None]:
print(mdl_pooled_stan_fit)

In [None]:
mdl_pooled_stan_fit.plot()
plt.show()

---

---

# Unpooled Model

Include the `mfr` feature in the model via its integer encoding `mfr_enc`: each `mfr` value gets a separate intercept, while the slopes are shared.


$$y \sim \mathcal{N}(\beta_{mfr} + \beta^{T} \mathbf{x},\epsilon)$$

where:  
$\beta_{mfr}$ is a separate intercept for each manufacturer  
$\beta$ are our (shared) coeffs in the linear model  
$\mathbf{x}$ is the vector of features describing each car in the dataset  
$\epsilon \sim \mathcal{HalfCauchy}(0, 10)$ 
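
The separate-intercept, shared-slope structure relies on integer indexing: a vector holding one intercept per manufacturer is expanded to one value per car by indexing it with the encoded manufacturer column. A toy NumPy sketch, with made-up numbers standing in for the sampled parameters:

```python
import numpy as np

mfr_enc = np.array([0, 2, 1, 2, 0])        # integer-encoded manufacturer per car
b0 = np.array([10.0, 20.0, 30.0])          # one intercept per manufacturer
beta = np.array([1.5, -2.0])               # slopes shared by all manufacturers
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.5, 0.5],
              [0.0, 0.0]])                 # two toy features per car

# per-car linear predictor: manufacturer intercept + shared slopes
yest = b0[mfr_enc] + X.dot(beta)
print(yest)
```

Stan indexes from 1 rather than 0, which is why the encoded values are shifted by one before being passed to the PyStan models.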


Set priors as Cauchy(0, 2.5) as per Gelman 2008? 
http://www.stat.columbia.edu/~gelman/research/published/priors11.pdf

Nope, in later correspondence, he recommends Normals: http://andrewgelman.com/2015/11/01/cauchy-priors-for-logistic-regression-coefficients/

Lots of other thoughts at: https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations


---

## PyMC3 Model

In [None]:
if sample_switches['unpooled']['pymc']:

    with pm.Model() as mdl_unpooled_pymc:

        b0 = pm.Normal('b0_mfr', mu=0, sd=10, shape=n_mfr)

        b1 = pm.Normal('b1_fuel_type[T.petrol]', mu=0, sd=10)
        b2a = pm.Normal('b2a_trans[T.manual]', mu=0, sd=10)
        b2b = pm.Normal('b2b_trans[T.semiauto]', mu=0, sd=10)
        b3 = pm.Normal('b3_is_tdi[T.True]', mu=0, sd=10)
        b4 = pm.Normal('b4_engine_capacity', mu=0, sd=10)
        b5 = pm.Normal('b5_metric_combined', mu=0, sd=10)
        b6 = pm.Normal('b6_emissions_co_mgkm', mu=0, sd=10)

        # define linear model
        yest = ( b0[dfs['mfr_enc']] +
                 b1 * mx_ex['fuel_type[T.petrol]'] + 
                 b2a * mx_ex['trans[T.manual]'] +
                 b2b * mx_ex['trans[T.semiauto]'] +
                 b3 * mx_ex['is_tdi[T.True]'] +
                 b4 * mx_ex['engine_capacity'] +
                 b5 * mx_ex['metric_combined'] +
                 b6 * mx_ex['emissions_co_mgkm'])

        ## Student T likelihood with HalfCauchy error and fixed DoF nu
        epsilon = pm.HalfCauchy('epsilon', beta=10)
        likelihood = pm.StudentT('likelihood', nu=1, mu=yest
                                 ,sd=epsilon, observed=dfs[ft_endog])
 
        ## sample
        trc_unpooled_pymc = pm.sample(2000, njobs=2, step=pm.NUTS()
                            ,start=pm.find_MAP(fmin=optimize.fmin_powell)     
                            ,trace=pm.backends.Text('traces/trc_unpooled_pymc'))
    ## dump to disk
    write_pickle(obj=mdl_unpooled_pymc, relnm='models/mdl_unpooled_pymc.pkl')
    
else:  
    with pm.Model():   ## read from disk

        mdl_unpooled_pymc = read_pickle(relnm='models/mdl_unpooled_pymc.pkl')

    trc_unpooled_pymc = pm.backends.text.load('traces/trc_unpooled_pymc',
                                              model=mdl_unpooled_pymc)


##### Save WAIC and view traces

In [None]:
dfwaic_pymc['unpooled'] = [pm.stats.waic(model=mdl_unpooled_pymc,
                                         trace=trc_unpooled_pymc[-1000:])]

rvs_unpooled_pymc = [rv.name for rv in strip_derived_rvs(mdl_unpooled_pymc.unobserved_RVs)]
plot_traces_pymc(trc_unpooled_pymc[-1000:], varnames=rvs_unpooled_pymc)

---

## PyStan Model

##### Create model spec

In [None]:
mdl_unpooled_stan_spec = """

    /* Spec for unpooled regression linear model */

    data {                                 
        int<lower=0> N;                           // count of observations
        int<lower=0> K;                           // count of exog features
        matrix[N, K] X;                           // exog features
        vector[N] y;                              // endog feature
        int<lower=0> n_mfr;                   // count of mfr index levels
        int<lower=1,upper=n_mfr> mfr_enc[N];  // mfr index encoding  
    }
    parameters {
        vector[n_mfr] b0_mfr;                 // mfr intercept coeff
        vector[K] beta;                           // exog coeffs
        real<lower=0> sigma;                      // linear model error
    }
    transformed parameters {}
    model {  
        sigma ~ cauchy(0, 10);                    // prior for error
        y ~ student_t(1, b0_mfr[mfr_enc] + X * beta, sigma);  // student-T likelihood
    }
    generated quantities {}
"""

In [None]:
mdl_unpooled_stan_dict = {}
mdl_unpooled_stan_dict['N'] = mx_ex.iloc[:,1:].shape[0]
mdl_unpooled_stan_dict['K'] = mx_ex.iloc[:,1:].shape[1]
mdl_unpooled_stan_dict['X'] = mx_ex.iloc[:,1:].values
mdl_unpooled_stan_dict['y'] = mx_en[ft_endog].values

mdl_unpooled_stan_dict['mfr_enc'] = dfs['mfr_enc'].values + 1
mdl_unpooled_stan_dict['n_mfr'] = n_mfr

In [None]:
if sample_switches['unpooled']['stan']:

    mdl_unpooled_stan_fit = pystan.stan(
            model_code  = mdl_unpooled_stan_spec,
            data        = mdl_unpooled_stan_dict,
            sample_file = ensure_dir('traces/trc_unpooled_stan/chain'),
            model_name  = 'mdl_unpooled_stan',
            iter        = 1000,
            warmup      = 750, 
            chains      = 4, 
            n_jobs      = 2, 
            verbose     = False)
    
    # dump model object and model fit object to disk
    write_pickle(obj=mdl_unpooled_stan_fit.get_stanmodel(),
                 relnm='models/mdl_unpooled_stan_model.pkl')

    write_pickle(mdl_unpooled_stan_fit, 'models/mdl_unpooled_stan_fit.pkl')
    
else:      
    ## load the model first, and then the fit:
    mdl_unpooled_stan_model = read_pickle(relnm='models/mdl_unpooled_stan_model.pkl')

    mdl_unpooled_stan_fit = read_pickle(relnm='models/mdl_unpooled_stan_fit.pkl')

##### View fit

In [None]:
print(mdl_unpooled_stan_fit)

In [None]:
mdl_unpooled_stan_fit.plot()
plt.show()

---

## Evaluate Manufacturers using Unpooled Model

##### View forestplot of the `mfr` feature coeffs

In [None]:
dfsm_unpl_mfr = create_smry(trc_unpooled_pymc[-1000:], dfs, 'mfr')
sg_mfrfreq = dfs.groupby('mfr').size().reindex(dfsm_unpl_mfr.index)

custom_forestplot(dfsm_unpl_mfr, sg_mfrfreq)

**Observe:**

The forestplot lets us compare the effect of `mfr` upon `emissions_nox_mgkm` when all other features in the model (`engine_capacity`, `fuel_type` etc.) are kept equal.

The forest plot shows:

+ The mean value for each parameter value, sorted in descending order
+ The uncertainty in each value, a.k.a. the Credible Region (CR): the region in which the sample values spend X% of their time during the traces. By convention I have chosen the 95% CR, but we could choose the 50% region or indeed anything that is useful for understanding.
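
As a rough sketch of how such a credible region can be read off a trace, the function below finds the narrowest interval containing a given fraction of the samples (a highest posterior density interval, in the spirit of the `hpd_2.5` and `hpd_97.5` columns used by the forestplot). The function name `hpd_interval` and the synthetic draws are assumptions for illustration, not the PyMC3 implementation.

```python
import numpy as np

def hpd_interval(samples, mass=0.95):
    """Narrowest interval containing `mass` of the samples (assumes a unimodal posterior)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n_in = int(np.ceil(mass * len(x)))          # number of samples inside the interval
    # width of every candidate window of n_in consecutive sorted samples
    widths = x[n_in - 1:] - x[:len(x) - n_in + 1]
    i = int(np.argmin(widths))
    return x[i], x[i + n_in - 1]

rng = np.random.RandomState(0)
draws = rng.normal(loc=48.0, scale=5.0, size=4000)   # synthetic stand-in for a trace
lo95, hi95 = hpd_interval(draws, 0.95)
lo50, hi50 = hpd_interval(draws, 0.50)
print(lo95, hi95, lo50, hi50)
```

The 50% region sits inside the 95% region; which one to plot is purely a presentational choice.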

Looking at the manufacturer values:

+ Mitsubishi seems to hang outside the pack by quite a long way
+ Jaguar and Lexus appear to emit least. However, you can see a weakness in this unpooled model: the massive uncertainty region for under-represented manufacturers

Under-representation is a problem:

+ Lamborghini and Lexus are represented in the dataset by 1 and 5 cars respectively and have massive CRs, so wide that we really can't say much about their effect upon emissions with certainty
+ Other manufacturers with few cars also have wide CRs: Mitsubishi (4), Ssangyong (5), Ferrari (2), Aston Martin Lagonda (6), and Smart (5)

We'll see later how this can be improved using a partially-pooled model with a shared hyperparameter to 'share power' between the manufacturer parameters.

---

---

# Digression: Fully Unpooled Model

Of course, we can take this unpooling to an extreme, calculating separate intercepts and slopes for each manufacturer:

$$y \sim \mathcal{N}(\beta_{mfr}^{T} \mathbf{x},\epsilon)$$

where:  
$\beta_{mfr}$ are separate coeffs for each manufacturer  
$\mathbf{x}$ is the vector of features describing each car in the dataset  
$\epsilon \sim \mathcal{HalfCauchy}(0, 10)$ 




---

## PyMC3 Model

In [None]:
if sample_switches['fullyunpooled']['pymc']:

    with pm.Model() as mdl_fullyunpooled_pymc:

        b0 = pm.Normal('b0_mfr', mu=0, sd=10, shape=n_mfr)
        b1 = pm.Normal('b1_fuel_type[T.petrol]', mu=0, sd=10, shape=n_mfr)
        b2a = pm.Normal('b2a_trans[T.manual]', mu=0, sd=10, shape=n_mfr)
        b2b = pm.Normal('b2b_trans[T.semiauto]', mu=0, sd=10, shape=n_mfr)
        b3 = pm.Normal('b3_is_tdi[T.True]', mu=0, sd=10, shape=n_mfr)
        b4 = pm.Normal('b4_engine_capacity', mu=0, sd=10, shape=n_mfr)
        b5 = pm.Normal('b5_metric_combined', mu=0, sd=10, shape=n_mfr)
        b6 = pm.Normal('b6_emissions_co_mgkm', mu=0, sd=10, shape=n_mfr)

        # define linear model
        yest = ( b0[dfs['mfr_enc']] +
                 b1[dfs['mfr_enc']] * mx_ex['fuel_type[T.petrol]'] + 
                 b2a[dfs['mfr_enc']] * mx_ex['trans[T.manual]'] +
                 b2b[dfs['mfr_enc']] * mx_ex['trans[T.semiauto]'] +
                 b3[dfs['mfr_enc']] * mx_ex['is_tdi[T.True]'] +
                 b4[dfs['mfr_enc']] * mx_ex['engine_capacity'] +
                 b5[dfs['mfr_enc']] * mx_ex['metric_combined'] +
                 b6[dfs['mfr_enc']] * mx_ex['emissions_co_mgkm'])

        ## Student T likelihood with fixed degrees of freedom nu
        epsilon = pm.HalfCauchy('epsilon', beta=10)
        likelihood = pm.StudentT('likelihood', nu=1, mu=yest
                                 ,sd=epsilon, observed=dfs[ft_endog])

        ## sample
        trc_fullyunpooled_pymc = pm.sample(10000, njobs=3, step=pm.Metropolis()
                                ,start=pm.find_MAP(fmin=optimize.fmin_powell)    
                                ,trace=pm.backends.Text('traces/trc_fullyunpooled_pymc'))

    ## dump to disk
    write_pickle(obj=mdl_fullyunpooled_pymc, relnm='models/mdl_fullyunpooled_pymc.pkl')
    
else:
    with pm.Model():    ## read from disk

        mdl_fullyunpooled_pymc = read_pickle(relnm='models/mdl_fullyunpooled_pymc.pkl')

    trc_fullyunpooled_pymc = pm.backends.text.load('traces/trc_fullyunpooled_pymc',
                                                   model=mdl_fullyunpooled_pymc)


##### View traces

In [None]:
dfwaic_pymc['fullyunpooled'] = [pm.stats.waic(model=mdl_fullyunpooled_pymc,
                                         trace=trc_fullyunpooled_pymc[-333:])]
rvs_fullyunpooled_pymc = [rv.name for rv in \
                          strip_derived_rvs(mdl_fullyunpooled_pymc.unobserved_RVs)]
plot_traces_pymc(trc_fullyunpooled_pymc[-333:], varnames=rvs_fullyunpooled_pymc)

**Observe:**

There are three big issues with this fully unpooled model:

1. You'll notice I used the Metropolis sampler, rather than NUTS, because the NUTS sampler seemed to 'stall' and fail to move quickly around the posterior distribution: it sampled so slowly that it's unsuitable for this short demo.
2. Relatedly, the traces often show extreme values for parameters, likely because the separate values per manufacturer are simply allowed to vary too much and cause discontinuities in the posterior distribution.
3. Now, the differences between the manufacturers are captured across _all_ parameters in the model, which makes comparing them really difficult! The model may fit better (see the WAIC evaluation below), but we've made the task of human interpretation more difficult.

We have effectively fitted 38 separate regressions, leading to immense complexity, slow sampling and messy traces. For this model at least, we need some degree of pooling.
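
To put a number on that complexity, here is a quick count of free parameters per model, derived from the model definitions in this Notebook (38 manufacturers, 7 slope coefficients, one `epsilon`; the partially-pooled model adds `b0_mu` and `b0_sd`). The dictionary itself is just an illustrative tally.

```python
n_mfr = 38   # manufacturers in the dataset
n_coef = 7   # slopes: fuel_type, trans (x2), is_tdi, engine_capacity,
             #         metric_combined, emissions_co_mgkm

n_params = {
    'unpooled':      n_mfr + n_coef + 1,        # intercept per mfr, shared slopes, epsilon
    'partpooled':    n_mfr + n_coef + 1 + 2,    # as above plus b0_mu and b0_sd
    'fullyunpooled': n_mfr * (1 + n_coef) + 1,  # intercept and slopes per mfr, epsilon
}
for name in sorted(n_params, key=n_params.get):
    print(name, n_params[name])
```

The fully unpooled model has several times as many parameters as the others, yet some manufacturers contribute only a handful of cars to estimate them.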


**NOTE**: Regarding point 1 above: the slowness of NUTS sampling may possibly be something to do with the implementation in PyMC3, and I will look into this in future comparisons with Stan (via PyStan).

---

---

# Partially-Pooled Model

Here we place partial pooling on the intercept only: the shared hyperparameters let us, in a balanced way, estimate the difference between manufacturers (`mfr`, $m \in manufacturer$) while keeping all other features constant.

$$y \sim \mathcal{N}(\beta_{mfr} + \beta^{T} \mathbf{x}, \epsilon)$$

where (tree written upside down):  
$\beta_{mfr} \sim \mathcal{N}(\mu_{mfr}, \sigma_{mfr})$

$\;\;\;\;\;\;\;\;\;\;\;\;|\_\_ \mu_{mfr} \sim \mathcal{N}(0, 10) \;\;;\;\;
\sigma_{mfr} \sim \mathcal{HalfCauchy}(0, 10)$ 

$\beta$ are the other (shared) coeffs in the linear model  
$\mathbf{x}$ is the vector of features describing each car in the dataset  
$\epsilon \sim \mathcal{HalfCauchy}(0, 10)$ 
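
To build some intuition for how the shared hyperparameters 'share power', here is a deliberately simplified sketch. With a Normal likelihood and known standard deviations (neither of which holds for the Student-T models in this Notebook), the posterior mean of a group's intercept is a precision-weighted average of that group's own sample mean and the shared mean. All numbers and the function name `shrunk_mean` are hypothetical.

```python
def shrunk_mean(ybar, n, mu, sigma, tau):
    """Posterior mean of a group intercept in a Normal-Normal model with known sds.

    ybar, n : sample mean and size of the group
    mu, tau : mean and sd of the shared (population) distribution of intercepts
    sigma   : observation noise sd
    """
    # weight placed on the group's own data, between 0 and 1
    w = (n / sigma ** 2) / (n / sigma ** 2 + 1 / tau ** 2)
    return w * ybar + (1 - w) * mu

mu, sigma, tau = 48.0, 20.0, 10.0
for name, ybar, n in [('small group', 80.0, 1), ('large group', 80.0, 200)]:
    print(name, n, round(shrunk_mean(ybar, n, mu, sigma, tau), 1))
```

A manufacturer with a single car is pulled strongly towards the shared mean, while a well-represented manufacturer keeps an estimate close to its own data.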

---

## PyMC3 Model

In [None]:
if sample_switches['partpooled']['pymc']:

    with pm.Model() as mdl_partpooled_pymc:

        # define hyperpriors for intercept
        b0_mu = pm.Normal('b0_mu', mu=0, sd=10)
        b0_sd = pm.HalfCauchy('b0_sd', beta=10)

        # define priors
        b0 = pm.Normal('b0_mfr', mu=b0_mu, sd=b0_sd, shape=n_mfr)
        b1 = pm.Normal('b1_fuel_type[T.petrol]', mu=0, sd=10)
        b2a = pm.Normal('b2a_trans[T.manual]', mu=0, sd=10)
        b2b = pm.Normal('b2b_trans[T.semiauto]', mu=0, sd=10)
        b3 = pm.Normal('b3_is_tdi[T.True]', mu=0, sd=10)
        b4 = pm.Normal('b4_engine_capacity', mu=0, sd=10)
        b5 = pm.Normal('b5_metric_combined', mu=0, sd=10)
        b6 = pm.Normal('b6_emissions_co_mgkm', mu=0, sd=10)

        # define linear model
        yest = ( b0[dfs['mfr_enc']] +
                 b1 * mx_ex['fuel_type[T.petrol]'] + 
                 b2a * mx_ex['trans[T.manual]'] +
                 b2b * mx_ex['trans[T.semiauto]'] +
                 b3 * mx_ex['is_tdi[T.True]'] +
                 b4 * mx_ex['engine_capacity'] +
                 b5 * mx_ex['metric_combined'] +
                 b6 * mx_ex['emissions_co_mgkm'])

        ## Student T likelihood with fixed degrees of freedom nu
        epsilon = pm.HalfCauchy('epsilon', beta=10)
        likelihood = pm.StudentT('likelihood', nu=1, mu=yest
                                 ,sd=epsilon, observed=dfs[ft_endog])

        ## sample
        trc_partpooled_pymc = pm.sample(1000, njobs=2, step=pm.NUTS(),
                                start=pm.find_MAP(fmin=optimize.fmin_powell),
                                trace=pm.backends.Text('traces/trc_partpooled_pymc'))
    ## dump to disk
    write_pickle(obj=mdl_partpooled_pymc, relnm='models/mdl_partpooled_pymc.pkl')
    
else:
    with pm.Model():   ## read from disk

        mdl_partpooled_pymc = read_pickle(relnm='models/mdl_partpooled_pymc.pkl')

    trc_partpooled_pymc = pm.backends.text.load('traces/trc_partpooled_pymc',
                                                   model=mdl_partpooled_pymc)
    

##### Save WAIC and View traces

In [None]:
dfwaic_pymc['partpooled'] = [pm.stats.waic(model=mdl_partpooled_pymc,
                                         trace=trc_partpooled_pymc[-500:])]
rvs_partpooled_pymc = [rv.name for rv in \
                          strip_derived_rvs(mdl_partpooled_pymc.unobserved_RVs)]
plot_traces_pymc(trc_partpooled_pymc[-500:], varnames=rvs_partpooled_pymc)

**Observe:**

This is more like it:

+ The traceplots look pretty well-mixed
+ We have a shared mean for the intercept `b0_mu` at approx. 48.
+ All the 38 `mfr` values are located around this value, with standard deviation `b0_sd`: as we see in the plot for `b0_mfr`

---

## PyStan Model

In [None]:
mdl_partpooled_stan_spec = """

    /* Spec for partpooled regression linear model */

    data {                                 
        int<lower=0> N;                         // count of observations
        int<lower=0> K;                         // count of exog features
        matrix[N, K] X;                         // exog features
        vector[N] y;                            // endog feature
        int<lower=0> n_mfr;                     // count of mfr index levels
        int<lower=1, upper=n_mfr> mfr_enc[N];   // mfr index encoding  
    }
    parameters {
        vector[K] beta;                         // exog coeffs
        real<lower=0> sigma;                    // linear model error       

        real mfr_mu;                            // part-pooling prior mu
        real<lower=0> mfr_sd;                   // part-pooling prior sd    
        vector[n_mfr] b0_mfr;                   // mfr intercept coeff
    }
    transformed parameters {}
    model {  
        
        mfr_mu ~ normal(0, 10);                  // part-pooling prior mu
        mfr_sd ~ cauchy(0, 10);                  // part-pooling prior sd
     
        for (mfr in 1:n_mfr) {
              b0_mfr[mfr] ~ normal(mfr_mu, mfr_sd);
        }
    
        sigma ~ cauchy(0, 10);                    // prior for error
        y ~ student_t(1, b0_mfr[mfr_enc] + X * beta, sigma);  // student-T likelihood
    }
    generated quantities {}
"""

In [None]:
if sample_switches['partpooled']['stan']:

    mdl_partpooled_stan_fit = pystan.stan(
            model_code  = mdl_partpooled_stan_spec,
            data        = mdl_unpooled_stan_dict,
            sample_file = ensure_dir('traces/trc_partpooled_stan/chain'),
            model_name  = 'mdl_partpooled_stan',
            iter        = 1000,
            warmup      = 750, 
            chains      = 4, 
            n_jobs      = 2, 
            verbose     = False)
    
    # dump model object and model fit object to disk
    write_pickle(obj=mdl_partpooled_stan_fit.get_stanmodel(),
                 relnm='models/mdl_partpooled_stan_model.pkl')

    write_pickle(mdl_partpooled_stan_fit, 'models/mdl_partpooled_stan_fit.pkl')
    
else:      
    ## load the model first, and then the fit:
    mdl_partpooled_stan_model = read_pickle(relnm='models/mdl_partpooled_stan_model.pkl')

    mdl_partpooled_stan_fit = read_pickle(relnm='models/mdl_partpooled_stan_fit.pkl')

##### View summary

In [None]:
print(mdl_partpooled_stan_fit)

##### View traces

In [None]:
f = mdl_partpooled_stan_fit.plot()
f.set_size_inches(12, 8)
plt.show()
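
A note on the likelihood: both specifications of this model fix the StudentT degrees of freedom at 1 (`nu=1` in PyMC3, `student_t(1, ...)` in Stan), which makes it a Cauchy distribution with very heavy tails. The following is a small sketch, using only the Python standard library, that checks the equivalence numerically. The helper names `student_t_pdf` and `cauchy_pdf` and the location and scale values are illustrative assumptions and do not come from PyStan or PyMC3.

```python
import math

def student_t_pdf(x, nu, mu=0.0, sigma=1.0):
    """Density of a location-scale Student-T distribution."""
    z = (x - mu) / sigma
    log_norm = (math.lgamma((nu + 1) / 2.0) - math.lgamma(nu / 2.0)
                - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    return math.exp(log_norm - (nu + 1) / 2.0 * math.log1p(z * z / nu))

def cauchy_pdf(x, mu=0.0, sigma=1.0):
    """Density of a location-scale Cauchy distribution."""
    z = (x - mu) / sigma
    return 1.0 / (math.pi * sigma * (1.0 + z * z))

# compare the two densities at a few points, including far-out tails
for x in (-30.0, 0.0, 2.5, 100.0):
    print(x, student_t_pdf(x, nu=1, mu=48.0, sigma=5.0),
          cauchy_pdf(x, mu=48.0, sigma=5.0))
```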

### Compare PyStan model coeffs to PyMC3 coeffs

In [None]:
# return a dictionary of arrays
samples = mdl_partpooled_stan_fit.extract(permuted=True)
for k, v in samples.items():
    print(k, v.shape)

In [None]:
pd.DataFrame(samples['beta'], columns=mx_ex.columns[1:]).describe().T

##### Compare to PyMC3

In [None]:
pm.df_summary(trc_partpooled_pymc[-500:], varnames=rvs_partpooled_pymc).iloc[-9:-2,:]

> Looks similar!

---

## Evaluate Manufacturers using Partially-Pooled PyMC3 Model

Let's compare the forestplot for this partpooled model with that of the unpooled model

In [None]:
dfsm_ptpl_mfr = create_smry(trc_partpooled_pymc[-500:], dfs, 'mfr')

In [None]:
dfsm_mfr_vs = pd.concat((dfsm_ptpl_mfr
                         ,dfsm_unpl_mfr.reindex(dfsm_ptpl_mfr.index)), axis=0)
dfsm_mfr_vs['mdl'] = np.concatenate(
                    (np.repeat(['partpooled'],38), np.repeat('unpooled',38)))
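# align ypos of the unpooled rows to the partpooled ordering
# (avoids chained assignment, which may silently write to a copy)
dfsm_mfr_vs['ypos'] = np.tile(dfsm_mfr_vs['ypos'].values[:38], 2)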

In [None]:
custom_forestplot(dfsm_mfr_vs, sg_mfrfreq, aspect=0.6, facetby='mdl')

**Observe:**

+ The forestplot for the `partpooled` model is shown on the left and the `unpooled` is on the right. Note the rows are ordered according to the `partpooled` model.
+ There are a few small changes in ordering from the `unpooled` model, for instance: Ssangyong, Mercedes-Benz, Ferrari
+ There's a noticeable reduction in uncertainty for some parameters which have low counts, for instance: Lamborghini (1 car), Lexus (5 cars) and Smart (5 cars)

**Shrinkage**

Overall, all the parameters appear to be pulled slightly closer together; this effect is known as 'shrinkage'.
+ The coeffs now span roughly 34 to 62, centered on approx. 50, compared to roughly 32 to 72 in the unpooled model.
+ This reduction in variance would suggest the `partpooled` model is less overfitted than the `unpooled`, and may perform better in hold-out validation.
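
To give a feel for why low-count manufacturers move the most, here is a minimal sketch of shrinkage in the simplest Normal-Normal setting with known variances, where the partially pooled estimate of a group mean is a precision-weighted average of the group's own sample mean and the shared mean. This is a simplification, not the StudentT model fitted above; the function name `partial_pool_mean` and all numbers are made up for illustration.

```python
import numpy as np

def partial_pool_mean(ybar, n, sigma, mu, tau):
    """Posterior mean of a group effect with known noise sd `sigma`
    and a Normal(mu, tau) prior shared across groups."""
    prec_data = n / sigma ** 2      # precision of the group's sample mean
    prec_prior = 1.0 / tau ** 2     # precision of the shared prior
    return (prec_data * ybar + prec_prior * mu) / (prec_data + prec_prior)

ybar = np.array([70.0, 70.0, 70.0])   # identical raw group means (assumed)
n = np.array([1, 5, 200])             # but very different car counts (assumed)
est = partial_pool_mean(ybar, n, sigma=20.0, mu=48.0, tau=9.0)
print(est.round(2))
```

The group with a single car is pulled most of the way towards the shared mean, while the group with 200 cars barely moves.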



---

## Can we comment on Volkswagen's NOx emissions at `mfr` level?

We can see from the above that the intercept parameter for `mfr == volkswagen` is 8th highest in the pack of all 38 manufacturers, seemingly higher than average.

Let's take a more detailed look at the parameter value compared to the group mean

In [None]:
## Summary of Volkswagen

dfsm_ptpl_mfr.loc[['volkswagen']]

In [None]:
## Hyperprior group mean and standard dev

pm.df_summary(trc_partpooled_pymc[-500:], varnames=['b0_mu','b0_sd'])

**Observe:**

The model is specified such that all 38 `mfr` parameters share a common hyperparameter `b0_mu` for their mean, and common hyperparameter `b0_sd` for their standard deviation.

+ Looking at the **mean**:  
Volkswagen has a mean value of `55.32` with a 95% CR from `52.12 to 58.15`  
The group mean `b0_mu` has a mean value much lower at `48.15` with a 95% CR from `44.76 to 51.80`  
The 95% CR for Volkswagen does not overlap with the 95% CR for the group mean, so we can say that it is strongly above the mean for NOx emissions.


+ Looking at the **standard deviation**:  
Volkswagen has a sd value of `1.60`   
The group sd `b0_sd` has a mean value much higher at `8.71` with a 95% CR from `6.34 to 10.99`  
The sd for Volkswagen lies well below the 95% CR for the group sd, so we can say that Volkswagen has a notably narrow distribution for its mean NOx emissions.


+ **In summary**  it seems that Volkswagen has an unusually high and tight parameter explaining their NOx emissions.
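
The overlap argument above can be written down as a tiny check. This sketch uses the interval endpoints quoted in the text; the helper `intervals_overlap` is a hypothetical name and is not one of the notebook's local functions.

```python
def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

vw_cr = (52.12, 58.15)       # 95% CR for the Volkswagen intercept (from the text)
group_cr = (44.76, 51.80)    # 95% CR for the group mean b0_mu (from the text)

print(intervals_overlap(vw_cr, group_cr))   # False: VW sits wholly above
```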



**Caveat: This is far from rigorous!**  

1. The manufacturer parameter values are all quite well distributed through a large range `34 to 62` and don't always overlap with one another. We can make the same inference of 'tightly above average NOx emissions' for several other manufacturers: Rolls Royce, Fiat, Alfa Romeo, McLaren and Subaru.
2. We haven't considered the manufacturer parent company `parent`, which is higher-level information and may help with the class imbalances on `mfr`

---

---

# Hierarchical Model of Parent and Manufacturer

Since I wanted to build the case slowly, I've not yet considered `parent`: the manufacturer parent company.

Now's the time to consider it: because the parent company may provide important information. 

We could create another part-pooled model like above, with `parent` as a separate set of parameter coeffs, but quite obviously `mfr` correlates completely with `parent` in a hierarchical structure; we can't assume orthogonality in the model.

Rather than have all `mfr`s joined by a single hyperparameter, let's instead join them to a hyperparameter set by their parent company `parent`. These 20 `parent` levels will in turn be joined to a single hyperparameter. It's just an extension of the part-pooling model to create a truly hierarchical model which incorporates the ownership structure we know about. 


$$y \sim \mathrm{StudentT}(\nu = 1,\ \beta_{mfr} + \beta^{T} \mathbf{x},\ \epsilon)$$

where (tree written upside down):  
$\beta_{mfr} \sim \mathcal{N}(\mu_{mfr}, \sigma_{mfr})$

$\;\;\;\;|\_\_ \mu_{mfr} \sim \mathcal{N}(\mu_{parent}, \sigma_{parent}) \;\; ; \;\;
\sigma_{mfr} \sim \mathcal{HalfCauchy}(0, 10)$ 

$\;\;\;\;\;\;\;\;\;\;|\_\_  \mu_{parent} \sim \mathcal{N}(0, 10) \;\; ; \;\; \sigma_{parent}\sim \mathcal{HalfCauchy}(0, 10)$  


$\beta$ are the other (shared) coeffs in the linear model  
$\mathbf{x}$ is the vector of features describing each car in the dataset  
$\epsilon \sim \mathcal{HalfCauchy}(0, 10)$ 
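
As a sanity check on the tree above, the following NumPy sketch draws intercepts from the same structure: one global mean, one intercept per parent, and one intercept per manufacturer centred on its parent's value. The scales are fixed to arbitrary values rather than drawn from HalfCauchy priors, and the toy `mfr_parent_map` (6 manufacturers, 3 parents) is invented for illustration, so this only shows the indexing pattern and not the fitted model.

```python
import numpy as np

rng = np.random.RandomState(42)

# toy ownership structure: entry i is the parent index of manufacturer i
mfr_parent_map = np.array([0, 0, 1, 2, 2, 2])
n_parent, n_mfr = 3, len(mfr_parent_map)

# fixed scales (assumed values; the model places HalfCauchy priors on these)
sigma_parent, sigma_mfr = 10.0, 3.0

mu_parent = rng.normal(0.0, 10.0)                            # 1x global mean
b0_parent = rng.normal(mu_parent, sigma_parent, n_parent)    # one per parent
# integer-array indexing expands the parent-level vector to manufacturer length
b0_mfr = rng.normal(b0_parent[mfr_parent_map], sigma_mfr)    # one per mfr

print(b0_parent.shape, b0_mfr.shape)
```

Indexing a length-`n_parent` vector with a length-`n_mfr` integer array is what lets each manufacturer prior pick up its own parent's mean without a loop.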

In [None]:
mfr_parent_map = (dfs.groupby(['mfr_enc','parent_enc']).size()
                  .reset_index()['parent_enc'].values)
mfr_parent_map

---

## PyMC3 Model

In [None]:
if sample_switches['hier']['pymc']:

#     @thno.compile.ops.as_op(itypes=[thno.tensor.dvector, thno.tensor.iscalar],
#                             otypes=[thno.tensor.dvector])
#     def mfr2parent(b0_parent=b0_parent, phi=phi):
#         return [phi[z[i,j]] for i in range(D) for j in range(W)]
    
    time_mdl_hier_pymc = {}
    t0 = time()
    
    with pm.Model() as mdl_hier_pymc:

        # set theano constants for convenient theano object indexing
        #parent_enc = T.constant(dfs['parent_enc'])

        # define hyperpriors for intercept based on parent   # 1x 
        b0_parent_mn = pm.Normal('b0_parent_mn', mu=0, sd=10)
        b0_parent_sd = pm.HalfCauchy('b0_parent_sd', beta=10)

        # define hyperpriors for mfr based on parent         # 20x
        b0_parent = pm.Normal('b0_parent', mu=b0_parent_mn,
                              sd=b0_parent_sd, shape=n_parent)
        b0_mfr_sd = pm.HalfCauchy('b0_mfr_sd', beta=10)

        # define priors

        # b0 vectorised version: NOTE this may be wrong, I get broadcast issues
        # and am unsure how to force it to index mfr -> parent correctly
        b0 = pm.Normal('b0_mfr',
                        mu=b0_parent[mfr_parent_map],
                        sd=b0_mfr_sd, shape=n_mfr)

        # list comp version: this doesn't work either
#         b0 = [pm.Normal('b0_mfr_{}'.format(m),
#                         mu=T.sum(b0_parent[mfr_parent_map[m]]),
#                         sd=b0_mfr_sd) for m in range(38)]

        b1 = pm.Normal('b1_fuel_type[T.petrol]', mu=0, sd=10)
        b2a = pm.Normal('b2a_trans[T.manual]', mu=0, sd=10)
        b2b = pm.Normal('b2b_trans[T.semiauto]', mu=0, sd=10)
        b3 = pm.Normal('b3_is_tdi[T.True]', mu=0, sd=10)
        b4 = pm.Normal('b4_engine_capacity', mu=0, sd=10)
        b5 = pm.Normal('b5_metric_combined', mu=0, sd=10)
        b6 = pm.Normal('b6_emissions_co_mgkm', mu=0, sd=10)
                             
        # define hierarchical linear model
        yest = ( b0[dfs['mfr_enc']] +
                 b1 * mx_ex['fuel_type[T.petrol]'] + 
                 b2a * mx_ex['trans[T.manual]'] +
                 b2b * mx_ex['trans[T.semiauto]'] +
                 b3 * mx_ex['is_tdi[T.True]'] +
                 b4 * mx_ex['engine_capacity'] +
                 b5 * mx_ex['metric_combined'] +
                 b6 * mx_ex['emissions_co_mgkm'])

        ## StudentT likelihood with fixed degrees of freedom nu
        epsilon = pm.HalfCauchy('epsilon', beta=10)
        likelihood = pm.StudentT('likelihood', nu=1, mu=yest
                                 ,sd=epsilon, observed=dfs[ft_endog])

        ## record elapsed durations between timestamps (not epoch minus duration)
        t1 = time()
        time_mdl_hier_pymc['t0_build'] = t1 - t0
        
        #start_MAP = pm.find_MAP(fmin=optimize.fmin_powell)  ## takes a LONG time
        
        t2 = time()
        time_mdl_hier_pymc['t1_findmap'] = t2 - t1
        
        ## sample
        trc_hier_pymc = pm.sample(4000, njobs=2, step=pm.NUTS()
                               #,start=start_MAP
                               ,trace=pm.backends.Text('traces/trc_hier_pymc'))

        time_mdl_hier_pymc['t2_sample'] = time() - t2
        
    ## dump to disk
    write_pickle(obj=mdl_hier_pymc, relnm='models/mdl_hier_pymc.pkl')

else:
    with pm.Model():   ## read from disk

        mdl_hier_pymc = read_pickle(relnm='models/mdl_hier_pymc.pkl')

    trc_hier_pymc = pm.backends.text.load('traces/trc_hier_pymc',
                                          model=mdl_hier_pymc)

##### Save WAIC and view traceplots

In [None]:
# dfwaic_pymc['hier'] = [pm.stats.waic(model=mdl_hier_pymc,
#                                          trace=trc_hier_pymc[-500:])]
rvs_hier_pymc = [rv.name for rv in strip_derived_rvs(mdl_hier_pymc.unobserved_RVs)]
plot_traces_pymc(trc_hier_pymc[-500:], varnames=rvs_hier_pymc)

---

## PyStan Model

In [None]:
mdl_hier_stan_spec = """

    /* Spec for hierarchical regression linear model */

    data {                                 
        int<lower=0> N;                         // count of observations
        int<lower=0> K;                         // count of exog features
        matrix[N, K] X;                         // exog features
        vector[N] y;                            // endog feature

        int<lower=0> n_parent;                      // count of parent index levels
        int<lower=1, upper=n_parent> parent_enc[N]; // parent index encoding  
        
        int<lower=0> n_mfr;                     // count of mfr index levels
        int<lower=1, upper=n_mfr> mfr_enc[N];   // mfr index encoding          
        
        int<lower=1, upper=n_parent> mfr_parent_map[n_mfr];  // parent index for each mfr
    }
    parameters {
        vector[K] beta;                         // exog coeffs
        real<lower=0> sigma;                    // linear model error       

        real parent_mu;                         // parent mu hyperprior
        real<lower=0> parent_sd;                // parent sd hyperprior

        vector [n_parent] b0_parent;            // mfr mu hyperprior (parent prior)
        real<lower=0> mfr_sd;                   // mfr sd hyperprior

        vector[n_mfr] b0_mfr;                   // mfr prior
    }
    transformed parameters {}
    model {  
        
        parent_mu ~ normal(0, 10);              // weakly informative
        parent_sd ~ cauchy(0, 10);              // weakly informative
        mfr_sd ~ cauchy(0, 10);                 // weakly informative
        
                  
        for (parent in 1:n_parent) {            // parent priors (20x)
              b0_parent[parent] ~ normal(parent_mu, parent_sd);
        }

        for (mfr in 1:n_mfr) {                  // mfr priors (38x)
            b0_mfr[mfr] ~ normal(b0_parent[mfr_parent_map[mfr]], mfr_sd);
        }      

        sigma ~ cauchy(0, 10);                  // weakly informative noise
        y ~ student_t(1, b0_mfr[mfr_enc] + X * beta, sigma);    // likelihood
    }
    
"""

In [None]:
mdl_hier_stan_dict = {}
mdl_hier_stan_dict['N'] = mx_ex.iloc[:,1:].shape[0]
mdl_hier_stan_dict['K'] = mx_ex.iloc[:,1:].shape[1]
mdl_hier_stan_dict['X'] = mx_ex.iloc[:,1:].values
mdl_hier_stan_dict['y'] = mx_en[ft_endog].values

mdl_hier_stan_dict['parent_enc'] = dfs['parent_enc'].values + 1
mdl_hier_stan_dict['n_parent'] = n_parent
mdl_hier_stan_dict['mfr_enc'] = dfs['mfr_enc'].values + 1
mdl_hier_stan_dict['n_mfr'] = n_mfr

mdl_hier_stan_dict['mfr_parent_map'] = mfr_parent_map + 1
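
Stan arrays are 1-indexed, which is why each encoding above has 1 added before being passed in. Here is a small sketch of that conversion, with a check that the shifted indices respect the bounds declared in the `data` block. The toy array and the helper `to_stan_index` are assumptions for illustration and are not part of the notebook or of PyStan.

```python
import numpy as np

def to_stan_index(enc, n_levels):
    """Shift 0-based integer codes to Stan's 1-based indexing and
    verify they fall within [1, n_levels]."""
    idx = np.asarray(enc, dtype=int) + 1
    if idx.min() < 1 or idx.max() > n_levels:
        raise ValueError('index outside [1, {}]'.format(n_levels))
    return idx

mfr_parent_map0 = np.array([0, 0, 1, 2, 2, 2])   # toy 0-based mfr -> parent codes
print(to_stan_index(mfr_parent_map0, n_levels=3))   # [1 1 2 3 3 3]
```

Checking the bounds in Python first tends to give a clearer error than waiting for Stan to reject the data.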

In [None]:
if sample_switches['hier']['stan']:

    mdl_hier_stan_fit = pystan.stan(
            model_code  = mdl_hier_stan_spec,
            data        = mdl_hier_stan_dict,
            sample_file = ensure_dir('traces/trc_hier_stan/chain'),
            model_name  = 'mdl_hier_stan',
            iter        = 1000,
            warmup      = 750, 
            chains      = 4, 
            n_jobs      = 2, 
            verbose     = False)
    
    # dump model object and model fit object to disk
    write_pickle(obj=mdl_hier_stan_fit.get_stanmodel(),
                 relnm='models/mdl_hier_stan_model.pkl')

    write_pickle(mdl_hier_stan_fit, 'models/mdl_hier_stan_fit.pkl')
    
else:      
    ## load the model first, and then the fit:
    mdl_hier_stan_model = read_pickle(relnm='models/mdl_hier_stan_model.pkl')

    mdl_hier_stan_fit = read_pickle(relnm='models/mdl_hier_stan_fit.pkl')

##### View summary

In [None]:
print(mdl_hier_stan_fit)

##### View traces

In [None]:
f = mdl_hier_stan_fit.plot()
f.set_size_inches(12, 10)
plt.show()

In [None]:
# TODO: plot all this 

---

## Evaluate manufacturer using the multilevel hierarchical model

##### Compare to unpooled and partpooled models

In [None]:
dfsm_hier_mfr = create_smry(trc_hier_pymc[-333:], dfs, 'mfr')

In [None]:
dfsm_mfr_vs = pd.concat((dfsm_hier_mfr
                         ,dfsm_ptpl_mfr.reindex(dfsm_hier_mfr.index)
                         ,dfsm_unpl_mfr.reindex(dfsm_hier_mfr.index)), axis=0)
dfsm_mfr_vs['mdl'] = np.concatenate(
        (np.repeat(['hier'],38), np.repeat(['partpooled'],38), np.repeat(['unpooled'],38)))
# align ypos across all three models to the ordering of the `hier` model
# (avoids chained assignment, which may silently write to a copy)
dfsm_mfr_vs['ypos'] = np.tile(dfsm_mfr_vs['ypos'].values[:38], 3)

In [None]:
custom_forestplot(dfsm_mfr_vs, sg_mfrfreq, aspect=0.5, facetby='mdl')

---

## Evaluate parent using the multilevel hierarchical model

In [None]:
dfsm_hier_parent = create_smry(trc_hier_pymc[-500:], dfs, 'parent')

In [None]:
sg_parentfreq = dfs.groupby('parent').size().reindex(dfsm_hier_parent.index)
custom_forestplot(dfsm_hier_parent, sg_parentfreq, ylabel='parent', size=6)

**Observe:**

+ 

---

## Evaluate parent-manufacturer using the multilevel hierarchical model

In [None]:
# setup grouping structure
dfg = dfs.groupby(['parent','mfr']).size().reset_index()
dfg.rename(columns={0:'n'}, inplace=True)
dfg['key'] = dfg[['parent','mfr']].apply(
                lambda r: '{} - {}'.format(r['parent'][:5], r['mfr']), axis=1)

# join mfr
dfg = pd.merge(dfg, dfsm_hier_mfr[['mean','hpd_2.5','hpd_97.5']]
                       ,how='left', left_on='mfr', right_index=True)
dfg.rename(columns={k:'{}_mfr'.format(k) for k in ['mean','hpd_2.5','hpd_97.5']}
           ,inplace=True)

# join parent
dfg = pd.merge(dfg, dfsm_hier_parent[['mean','hpd_2.5','hpd_97.5']]
                       ,how='left', left_on='parent', right_index=True)
dfg.rename(columns={k:'{}_parent'.format(k) for k in ['mean','hpd_2.5','hpd_97.5']}
           ,inplace=True)

dfg.sort_values(['mean_parent','mean_mfr'], ascending=False, inplace=True)

dfg['parent_order_id'] = pd.factorize(dfg['parent'])[0]

In [None]:
interactive(custom_2d_forestplot, dfg=fixed(dfg), show=['All','Zoom'])

---
**&copy; Applied AI Ltd 2016**  
<a href='http://www.applied.ai'>applied.ai</a>