<div style=float:right><img src="assets/img/appliedai-logo.png" width=100 style="margin: 0px 20px"></img></div>


##### Jonathan Sedar Personal Project
## PyMC3 vs PyStan Comparison
_Spring 2016_

This set of Notebooks and scripts comprises the **pymc3_vs_pystan** personal project by Jonathan Sedar of Applied AI Ltd, written primarily for presentation at the PyData London 2016 Conference.

The project demonstrates hierarchical linear regression using two Bayesian inference frameworks: PyMC3 and PyStan. The project borrows heavily from code written for Applied AI Ltd and is supplied here for educational purposes only. No copyright or license is extended to users.


    
# 31_BasicModelEvaluation

#### Demonstrate posterior predictive checks and info criteria

      
+ [Setup](#Setup)
    + [Local Functions](#Local-Functions)
    + [Load Data](#Load-Data)
    + [Prepare Dataset](#Prepare-Dataset)


+ [Posterior Predictive Checks on Training Data](#Posterior-Predictive-Checks-on-Training-Data)
    + [Eyeballing](#Eyeballing)
    + [MSE and R-squared](#MSE-and-R-squared)
    + [KS test](#KS-test)  
    

+ [Posterior Predictive Checks on Validation Data](#Posterior-Predictive-Checks-on-Validation-Data)
    + [Held-out test set](#Held-out-test-set)
    + [Cross-Validation](#Cross-Validation)
    + [Leave-One-Out Cross-Val](#Leave-One-Out-Cross-Val)


+ [Information Criteria](#Information-Criteria)
    + [Deviance Information Criterion](#Deviance-Information-Criterion)  
    + [Widely Applicable Information Criterion](#Widely-Applicable-Information-Criterion)  


##### Note:

For more information on Model Selection in PyMC3, and about DIC and WAIC, you could start with:
+ Thomas Wiecki's [detailed response](https://stats.stackexchange.com/questions/161082/bayesian-model-selection-in-pymc3/166383#166383) to a question on Cross Validated
+ The Deviance Information Criterion: 12 Years On ([Spiegelhalter et al 2014](http://onlinelibrary.wiley.com/doi/10.1111/rssb.12062/abstract))
+ A Widely Applicable Bayesian Information Criterion ([Watanabe 2013](http://www.jmlr.org/papers/volume14/watanabe13a/watanabe13a.pdf))
+ Efficient Implementation of Leave-One-Out Cross-Validation and WAIC for Evaluating Fitted Bayesian Models ([Vehtari et al 2015](http://arxiv.org/abs/1507.04544))

---

# Setup

In [1]:
## Interactive magics
%matplotlib inline
%qtconsole --colors=linux

In [2]:
# general packages
import re
import sqlite3
from convenience_functions import *
from ipywidgets import interactive, fixed
#from io import StringIO
#from collections import OrderedDict
from itertools import combinations

# scientific packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import patsy as pt
from scipy import optimize
from scipy.stats import ks_2samp

from sklearn.neighbors import KernelDensity
import statsmodels.api as sm

# pymc3 libraries
import pymc3 as pm
import theano as thno
import theano.tensor as T 
import pystan

sns.set(style="darkgrid", palette="muted")
pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = 12, 4
np.random.seed(0)

## Local Functions

In [3]:
def strip_derived_rvs(rvs):
    '''Convenience fn: remove PyMC3-generated RVs from a list'''
    ret_rvs = []
    for rv in rvs:
        if not (re.search('_log',rv.name) or re.search('_interval',rv.name)):
            ret_rvs.append(rv)     
    return ret_rvs


def trace_median(x):
    return pd.Series(np.median(x,0), name='median')


def plot_traces_pymc(trcs, varnames=None):
    ''' Convenience fn: plot traces with overlaid means and values '''

    nrows = len(trcs.varnames)
    if varnames is not None:
        nrows = len(varnames)
    ax = pm.traceplot(trcs, varnames=varnames, figsize=(12,nrows*1.4)
        ,lines={k: v['mean'] for k, v in 
            pm.df_summary(trcs,varnames=varnames).iterrows()})

    for i, mn in enumerate(pm.df_summary(trcs, varnames=varnames)['mean']):
        ax[i,0].annotate('{:.2f}'.format(mn), xy=(mn,0), xycoords='data'
                    ,xytext=(5,10), textcoords='offset points', rotation=90
                    ,va='bottom', fontsize='large', color='#AA0022')    

        
def plot_stan_trc(dftrc):
    """
       Create simple plots of parameter distributions and traces from 
       output of pystan sampling. Emulates pymc traceplots.
    """

    fig, ax2d = plt.subplots(nrows=dftrc.shape[1], ncols=2, figsize=(14, 1.8*dftrc.shape[1]),
                                facecolor='0.99', edgecolor='k')
    fig.suptitle('Distributions and traceplots for {} samples'.format(
                                dftrc.shape[0]),fontsize=14)
    fig.subplots_adjust(wspace=0.2, hspace=0.5)

    k = 0
    
    # create density and traceplot, per parameter coeff
    for i, (ax1d, col) in enumerate(zip(ax2d, dftrc.columns)):

        samples = dftrc[col].values
        scale = (10**np.round(np.log10(samples.max() - samples.min()))) / 20
        kde = KernelDensity(bandwidth=scale).fit(samples.reshape(-1, 1))
        x = np.linspace(samples.min(), samples.max(), 100).reshape(-1, 1)
        y = np.exp(kde.score_samples(x))
        clr = sns.color_palette()[0]

        # density plot
        ax1d[0].plot(x, y, color=clr, linewidth=1.4)
        ax1d[0].vlines(np.percentile(samples, [2.5, 97.5]), ymin=0, ymax=y.max()*1.1,
                       alpha=1, linestyles='dotted', colors=clr, linewidth=1.2)
        mn = np.mean(samples)
        ax1d[0].vlines(mn, ymin=0, ymax=y.max()*1.1,
                       alpha=1, colors='r', linewidth=1.2)
        ax1d[0].annotate('{:.2f}'.format(mn), xy=(mn,0), xycoords='data'
                    ,xytext=(5,10), textcoords='offset points', rotation=90
                    ,va='bottom', fontsize='large', color='#AA0022')    
        ax1d[0].set_title('{}'.format(col), fontdict={'fontsize':10})


        # traceplot
        ax1d[1].plot(np.arange(len(samples)),samples, alpha=0.2, color=clr, linestyle='solid'
                              ,marker=',', markerfacecolor=clr, markersize=10)
        ax1d[1].hlines(np.percentile(samples,[2.5, 97.5]), xmin=0, xmax=len(samples),
                       alpha=1, linestyles='dotted', colors=clr)
        ax1d[1].hlines(np.mean(samples), xmin=0, xmax=len(samples), alpha=1, colors='r')

        k += 1
                
        ax1d[0].set_title('{}'.format(col), fontdict={'fontsize':14})#,'fontweight':'bold'})
        #ax1d[0].legend(loc='best', shadow=True)
        
        _ = [ax1d[j].axes.grid(True, linestyle='-', color='lightgrey') for j in range(2)]
            
    plt.subplots_adjust(top=0.94)
    plt.show()

## Load Data

In [4]:
cnxsql = sqlite3.connect('data/car_emissions.db')
dfs = pd.read_sql('select * from cars_post_exclusions_2sd', cnxsql, index_col=None)

In [5]:
## convert sqlite bool storage (as ints) back to bools
for ft in ['parent_is_vw', 'mfr_is_vw', 'is_tdi']:
    dfs[ft] = dfs[ft].astype(bool)
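
The conversion above is needed because SQLite has no native boolean type. Here is a minimal sketch of the round trip using an in-memory database; the table and its two columns are made up for illustration and are not the real `car_emissions.db` schema.

```python
import sqlite3

# hypothetical two-column table, in memory only
cnx = sqlite3.connect(':memory:')
cnx.execute('create table cars (mfr text, is_tdi boolean)')
cnx.executemany('insert into cars values (?, ?)', [('vw', True), ('bmw', False)])

# the bools come back as plain ints
rows = cnx.execute('select mfr, is_tdi from cars order by mfr').fetchall()

# so cast them back, which is what .astype(bool) does column-wise in pandas
restored = [(mfr, bool(flag)) for mfr, flag in rows]
cnx.close()

print(rows)
print(restored)
```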

In [6]:
custom_describe(dfs)

(2593, 13)


| feature | 1653 | 835 | 763 | count | mean | std | min | 25% | 50% | 75% | max | dtype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| emissions_nox_mgkm | 12 | 36 | 32 | 2593 | 37.32 | 17.9 | 1 | 23.0 | 35.0 | 51.0 | 76 | float64 |
| parent_is_vw | False | False | False | 2593 | | | False | | | | True | bool |
| mfr_is_vw | False | False | False | 2593 | | | False | | | | True | bool |
| parent | daimler-ag | bmw | bmw | 2593 | | | aston | | | | volksw | object |
| mfr | mercedes-benz | bmw | bmw | 2593 | | | abarth | | | | volvo | object |
| trans | auto | semiauto | auto | 2593 | | | auto | | | | semiau | object |
| fuel_type | petrol | petrol | petrol | 2593 | | | diesel | | | | petrol | object |
| is_tdi | False | False | False | 2593 | | | False | | | | True | bool |
| metric_combined | -0.0728208 | 0.80692 | 0.220426 | 2593 | -0.0 | 0.5 | -0.685973 | -0.339409 | -0.152797 | 0.167108 | 2.75301 | float64 |
| metric_extra_urban | -0.0751821 | 0.47462 | 0.0728415 | 2593 | 0.0 | 0.5 | -0.47696 | -0.180913 | -0.075182 | 0.093988 | 21.5997 | float64 |


##### Declare feats for use

In [7]:
fts_cat = ['parent_is_vw', 'mfr_is_vw', 'parent', 'mfr', 'trans', 'fuel_type', 'is_tdi']
fts_cat_smp = ['mfr_is_vw','trans','fuel_type','is_tdi']
fts_num = ['metric_combined', 'metric_extra_urban', 'metric_urban_cold'
           ,'engine_capacity', 'emissions_co_mgkm']
fts_num_smp = ['metric_combined', 'engine_capacity', 'emissions_co_mgkm']
ft_endog = 'emissions_nox_mgkm'

## Describe dataset

+ The dataset is 2593 rows, with 12 exog features, 1 endog feature.
+ These are observations of car emissions tests, one row per car.
+ You can read off the basic distributional statistics of the features in the table above. Numeric features have been standardized according to [Gelman's 2sd principle](http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf).
+ I have selected these particular 12 features to work with. Some are derivatives of original features.
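
As a reminder of what the 2sd standardization does, here is a minimal sketch on a made-up numeric feature. The helper name `standardize_2sd` and the simulated values are my own illustration, not the preprocessing code that produced the table above.

```python
import numpy as np

def standardize_2sd(x):
    """Centre on the mean and divide by two standard deviations."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (2 * x.std())

rng = np.random.RandomState(0)
raw = rng.normal(loc=120.0, scale=30.0, size=1000)   # made-up numeric feature
scaled = standardize_2sd(raw)

# the scaled feature has mean ~0 and standard deviation 0.5
print(scaled.mean(), scaled.std())
```

Dividing by two standard deviations rather than one puts numeric features on roughly the same scale as binary indicators, which is why the numeric columns in the table show a `std` of 0.5.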

We have the following features to choose from:

```
+ Categoricals:
    + `trans`     - the car transmission, simplified to 'auto', 'semiauto', 'manual'
    + `fuel_type` - the car power supply, simplified to 'petrol', 'diesel'
    + `parent`    - the parent company of the car manufacturer, 20 values
    + `mfr`       - the car manufacturer, 38 values

+ Booleans:
    + `parent_is_vw` - if the parent company of the car manufacturer is Volkswagen
    + `mfr_is_vw`    - if the car manufacturer is Volkswagen
    + `is_tdi`       - (processed feature) if the car engine type is a turbo diesel
    
+ Numerics:
    + `metric_combined`    - a score for fuel efficiency in combined driving
    + `metric_extra_urban` - a score for fuel efficiency in extra-urban driving
    + `metric_urban_cold`  - a score for fuel efficiency in an urban setting, cold start
    + `emissions_co_mgkm`  - a count of CO particulates emitted mg/km
    
+ Numeric endogenous feature:
    + `emissions_nox_mgkm` - a count of NOx particulates emitted mg/km    
```

For the purposes of this Notebook, the final feature mentioned `emissions_nox_mgkm` will be used as the _endogenous_ / _dependent_ / _output_ feature of the linear models. All other features may be used as _exogenous_ / _independent_ / _input_ features.

##### Create design matrices

In [33]:
fml_all = '{} ~ '.format(ft_endog) + ' + '.join(fts_num_smp + fts_cat_smp)
fml_all

'emissions_nox_mgkm ~ metric_combined + engine_capacity + emissions_co_mgkm + mfr_is_vw + trans + fuel_type + is_tdi'

In [34]:
(mx_en, mx_ex) = pt.dmatrices(fml_all, dfs, return_type='dataframe', NA_action='raise')
custom_describe(mx_ex, 2, )

(2593, 9)


| feature | 649 | 1344 | count | mean | std | min | 25% | 50% | 75% | max | dtype |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Intercept | 1.0 | 1.0 | 2593 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | float64 |
| mfr_is_vw[T.True] | 0.0 | 0.0 | 2593 | 0.04 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | float64 |
| trans[T.manual] | 0.0 | 1.0 | 2593 | 0.49 | 0.5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | float64 |
| trans[T.semiauto] | 0.0 | 0.0 | 2593 | 0.12 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | float64 |
| fuel_type[T.petrol] | 0.0 | 0.0 | 2593 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | float64 |
| is_tdi[T.True] | 0.0 | 0.0 | 2593 | 0.12 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | float64 |
| metric_combined | -0.179456 | -0.419385 | 2593 | -0.0 | 0.5 | -0.685973 | -0.339409 | -0.152797 | 0.167108 | 2.753013 | float64 |
| engine_capacity | -0.039416 | -0.369128 | 2593 | 0.0 | 0.5 | -0.920116 | -0.278857 | -0.045471 | -0.037765 | 2.579014 | float64 |
| emissions_co_mgkm | -0.715474 | -0.123882 | 2593 | 0.0 | 0.5 | -0.846545 | -0.382482 | -0.088458 | 0.272874 | 2.281452 | float64 |


---

---

# Reload Models and Traces

## Reload PyMC3 model and traces

In [8]:
with pm.Model():
    mdl_robust_pymc = read_pickle(relnm='models/mdl_robust_pymc.pkl')

Loaded object from disk at models/mdl_robust_pymc.pkl


In [9]:
## load traces within model context
with mdl_robust_pymc:
    trc_robust_pymc = pm.backends.text.load('traces/trc_robust_pymc')

In [10]:
# alternatively, load with a ref to reloaded model
trc_robust_pymc = pm.backends.text.load('traces/trc_robust_pymc', model=mdl_robust_pymc)

## Reload PyStan model and fitted model (contains traces)

In [15]:
## load the model first, and then the fit:

mdl_robust_pystan_model = read_pickle(relnm='models/mdl_robust_stan_model.pkl')

mdl_robust_pystan_fit = read_pickle(relnm='models/mdl_robust_stan_fit.pkl')

Loaded object from disk at models/mdl_robust_stan_model.pkl
Loaded object from disk at models/mdl_robust_stan_fit.pkl


---

---

# Posterior Predictive Checks on Training Data

## Calculate Posterior Samples (PyMC3 native method)

##### Feed native method with model and traces, output ppc

In [50]:
%%time 
ppc_pymc = pm.sample_ppc(trc_robust_pymc[-1000:], samples=1000, 
                           model=mdl_robust_pymc, size=10)

CPU times: user 20.9 s, sys: 530 ms, total: 21.4 s
Wall time: 21.7 s


##### Return is a dict of predicted values (traces * calcs * posterior * datapoints)

    for each datapoint:              2593x
        use each trace value:        1000x
            calc the posterior         10x
            
    creates 10,000 estimates for each data point (25,930,000 values)
    
    (probably a little overkill)

In [51]:
ppc_pymc['y'].shape

(1000, 10, 1, 2593)

In [96]:
yhat_pymc = ppc_pymc['y'].reshape(1000*10*1, 2593).T
yhat_pymc.shape

(2593, 10000)
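
It's worth convincing yourself that this reshape-then-transpose really keeps all the draws for a datapoint together in one row. A small sketch with toy dimensions standing in for 1000, 10 and 2593 (the array contents are just a running counter, not real predictions):

```python
import numpy as np

n_traces, n_size, n_obs = 4, 3, 5            # stand-ins for 1000, 10, 2593
ppc = np.arange(n_traces * n_size * 1 * n_obs).reshape(n_traces, n_size, 1, n_obs)

# same operation as above: flatten the leading axes, then transpose
yhat = ppc.reshape(n_traces * n_size * 1, n_obs).T

# every row of yhat should hold exactly the draws for one datapoint
for j in range(n_obs):
    assert np.array_equal(yhat[j], ppc[:, :, 0, j].ravel())

print(yhat.shape)   # (5, 12)
```

This works because the datapoint axis is last, so collapsing the leading axes never mixes values from different datapoints.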

## Calculate Posterior Samples (Remodel using Stan generated quantities)

Stan can generate posterior predictions for given datapoints within the model spec itself.

Very handy, but it appears to be an option only at modelling time, so I'll rerun the model from earlier, this time specifying the `generated quantities {}` block.

In [98]:
mdl_robustppc_stan_spec = """

    /* Spec for robust regression linear model */

    data {                                 
        int<lower=0> N;                     // count of observations
        int<lower=0> M_ppc;                 // count of ppc samples per observation
        int<lower=0> N_ppc;                 // count of ppc observations
        int<lower=0> K;                     // count of exog features
        matrix[N, K] X;                     // exog features
        matrix[N_ppc, K] X_ppc;             // exog features for posterior pred
        vector[N] y;                        // endog feature
    }
    parameters {
        vector[K] beta;                     // exog coeffs
        real<lower=0> sigma;                // linear model error
    }
    transformed parameters {}
    model {  
        sigma ~ cauchy(0, 10);              // explicit prior for error (half-cauchy)
        y ~ student_t(1, X * beta, sigma);  // student-T likelihood
    }
    generated quantities {
        matrix[M_ppc, N_ppc] y_ppc;
        
        // needs loops (the *_rng functions only support 1D reals inputs)
        
        for (n in 1:N_ppc)
            for (m in 1:M_ppc)
                y_ppc[m, n] <- student_t_rng(1, X_ppc[n] * beta, sigma);  // posterior pred
    }
    """

In [99]:
stan_datadictppc = {}
stan_datadictppc['N'] = mx_ex.shape[0]
stan_datadictppc['K'] = mx_ex.shape[1]
stan_datadictppc['X'] = mx_ex.values
stan_datadictppc['y'] = mx_en[ft_endog].values

stan_datadictppc['M_ppc'] = 10
stan_datadictppc['N_ppc'] = mx_ex.shape[0]
stan_datadictppc['X_ppc'] = mx_ex.values

In [100]:
mdl_robustppc_stan_fit = pystan.stan(
        model_code  = mdl_robustppc_stan_spec,
        data        = stan_datadictppc,
        sample_file = ensure_dir('traces/trc_robustppc_stan/chain'),
        model_name  = 'mdl_robustppc_stan',
        iter        = 1000,
        warmup      = 750, 
        chains      = 4, 
        n_jobs      = 2, 
        verbose     = False)

In [101]:
## Don't print summary this time, since it now also contains 2593 estimates of y_ppc

# print(mdl_robustppc_stan_fit)

In [102]:
## similarly a terrible idea to simply call plot

# mdl_robustppc_stan_fit.plot()
# plt.show()

##### Extract ppc from model fit object

In [103]:
ppc_pystan = mdl_robustppc_stan_fit.extract(pars='y_ppc', permuted=True, inc_warmup=False)

##### Return array is traces * ppc samples * datapoints

    for each datapoint:              2593x
        use each trace value:        1000x
            calc the posterior         10x
            
    creates 10,000 estimates for each data point (25,930,000 values)


In [104]:
ppc_pystan['y_ppc'].shape

(1000, 10, 2593)

In [106]:
yhat_pystan = ppc_pystan['y_ppc'].reshape(1000*10, 2593).T
yhat_pystan.shape

(2593, 10000)

## Deviance Information Criterion (DIC)

The [Deviance Information Criterion](https://en.wikipedia.org/wiki/Deviance_information_criterion) (DIC) is a fairly unsophisticated method for comparing the deviance of likelihood across the sample traces of a model run. 

However, this simplicity apparently yields quite good results in a variety of cases; see the discussion in [Spiegelhalter et al 2014](http://onlinelibrary.wiley.com/doi/10.1111/rssb.12062/abstract), which is worth reading.
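
To make the idea concrete, here is a rough sketch of the usual DIC arithmetic on a toy Gaussian model: the mean deviance across the trace, the deviance at the posterior mean, and their difference as the effective number of parameters. The "trace" is simulated with numpy rather than sampled, and none of the names below come from PyMC3; I'm not claiming this mirrors the library's internals.

```python
import numpy as np

def gaussian_deviance(y, mu, sigma):
    """Deviance D = -2 * log-likelihood of y under Normal(mu, sigma)."""
    loglike = -0.5 * np.log(2 * np.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2)
    return -2.0 * np.sum(loglike)

rng = np.random.RandomState(0)
y = rng.normal(loc=3.0, scale=1.0, size=200)          # toy observations

# stand-in for MCMC output: draws of (mu, sigma) scattered around sensible values
mu_draws = rng.normal(loc=y.mean(), scale=0.07, size=1000)
sigma_draws = np.abs(rng.normal(loc=y.std(), scale=0.05, size=1000))

dev_per_draw = np.array([gaussian_deviance(y, m, s)
                         for m, s in zip(mu_draws, sigma_draws)])
mean_deviance = dev_per_draw.mean()                   # D-bar
deviance_at_mean = gaussian_deviance(y, mu_draws.mean(), sigma_draws.mean())

p_d = mean_deviance - deviance_at_mean     # effective number of parameters
dic = deviance_at_mean + 2 * p_d           # equivalently 2 * D-bar - D(theta-bar)

print(p_d, dic)
```

The penalty term `p_d` grows as the trace spreads further from its mean, so a model needs a genuinely better fit to justify extra flexibility.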

DIC has recently been added to PyMC3, so let's see what it tells us about our model fits. Lower numbers are better.

In [None]:
# OLS
pm.stats.dic(model=mdl_ols, trace=trc_ols[-1000:])

In [None]:
# Lasso
pm.stats.dic(model=mdl_lasso, trace=trc_lasso[-7000::7])

In [None]:
# Ridge
pm.stats.dic(model=mdl_ridge, trace=trc_ridge[-1000:])

**Observe:**

+ All 3 values are quite similar! 
+ Interestingly, the Lasso has the highest (worst) DIC value, probably because I deliberately chose a suboptimal regularization parameter $\lambda$ in order to favour the dropping of feature coefficients. This likely resulted in an _underfitted_ model.

## Posterior Predictive Checks (PPC)

The PPC is another model diagnostic we can easily use in PyMC3:

1. For each datapoint (which naturally has a real value $y$), compute a vector of posterior predicted values $\bf{\hat{y}}$, using each value in the traces, using the model likelihood.
2. Now you have a range of posterior predictions for each datapoint, which you can leave as a distribution or take summary statistics
    1. If we take the mean summary statistic, then we can compute the **mean-squared-error (MSE)**
    2. If we leave as a distribution, then we can compute 1d distributional differences such as the Kolmogorov–Smirnov test or Kullback–Leibler divergence.
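
The steps above can be sketched in plain numpy for a toy linear model. This is an illustration only: the trace values for `alpha`, `beta` and `sigma` are simulated stand-ins rather than sampler output, and the Normal likelihood is an assumption chosen for simplicity.

```python
import numpy as np

rng = np.random.RandomState(0)

# toy data: y = 1 + 2x + noise
n_obs = 50
x = rng.uniform(-1, 1, size=n_obs)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n_obs)

# stand-in for trace values (in the Notebook these come from the sampler)
n_draws = 400
alpha = rng.normal(1.0, 0.05, size=n_draws)
beta = rng.normal(2.0, 0.05, size=n_draws)
sigma = np.abs(rng.normal(0.5, 0.02, size=n_draws))

# step 1: one simulated y_hat per trace value per datapoint -> (n_draws, n_obs)
mu = alpha[:, None] + beta[:, None] * x[None, :]
y_hat = rng.normal(loc=mu, scale=sigma[:, None])

# step 2a: summarise to the mean prediction, then compute MSE
mse = np.mean((y_hat.mean(axis=0) - y) ** 2)

# step 2b: or keep the distribution, e.g. a 95% interval per datapoint
lower, upper = np.percentile(y_hat, [2.5, 97.5], axis=0)
coverage = np.mean((y >= lower) & (y <= upper))

print(y_hat.shape, mse, coverage)
```

Keeping the full `y_hat` array around is what makes the distributional checks later on possible; the MSE throws most of it away.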

### Mean Squared Error (MSE)

Here I use the `sample_ppc()` function to:

+ subsample 500 traces from the 1000 burned-in traces
+ use each of the 500 subsampled trace values to create a model according to the model parameters
+ and generate 50 samples of $\hat{y}$ for each datapoint
+ thereby getting 25,000 estimates of $\hat{y}$ for each of the 2593 datapoints
+ this is probably overkill
+ but very quick to compute, so let's propagate the probability within the model...
+ then I compute the MSE $\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^{2}$ 

In [None]:
def compute_ppc(trc, mdl, samples=500, size=50):
    # use the trace and model passed in, rather than hardcoding the OLS ones
    return pm.sample_ppc(trc, samples=samples, model=mdl, size=size)

def compute_mse(df, ppc, ft_endog):
    return np.sum((ppc['y'].mean(0).mean(0).T - df[[ft_endog]])**2)[0]/df.shape[0]

In [None]:
ppc_ols = compute_ppc(trc_ols[-1000:], mdl_ols, 500, 50)
ppc_lasso = compute_ppc(trc_lasso[-7000::7], mdl_lasso, 500, 50)
ppc_ridge = compute_ppc(trc_ridge[-1000:], mdl_ridge, 500, 50)

In [None]:
df_mse = pd.DataFrame({'method':['OLS','Lasso','Ridge'], 'mse':np.zeros(3)})
df_mse.set_index('method', inplace=True)

for method, ppc in zip(['OLS','Lasso','Ridge'],[ppc_ols, ppc_lasso, ppc_ridge]):
    df_mse.loc[method,'mse'] = compute_mse(dfs, ppc, ft_endog)

In [None]:
df_mse

**Observe:**

+ The MSE for the models is extremely close: the Lasso and Ridge values are nearly identical and both represent about 99.92% of the larger OLS value.

### R-squared

Since we have MSE anyway, let's calculate the (unadjusted) r-squared ($R^{2}$) value for each model. It is one minus the ratio of the sum of squared errors for the model to the sum of squared errors for a hypothetical model that just predicts the mean of the dataset. 

$$R^{2} = 1 - \frac{\sum e_{model}^{2}}{\sum e_{mean}^{2}}$$

Ideally the $R^{2}$ lives in the range $[0,1]$, but in practice it's quite easy for a model to perform arbitrarily worse than the mean guess, particularly for oddly shaped data, resulting in a possible range of $(-\infty, 1]$.
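
A tiny worked example of the three regimes, using made-up numbers and a standalone helper `r_squared` (not the `compute_r2` used in this Notebook):

```python
import numpy as np

def r_squared(y, y_pred):
    """Unadjusted R^2: 1 - SSE(model) / SSE(mean-only model)."""
    sse_model = np.sum((y - y_pred) ** 2)
    sse_mean = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - sse_model / sse_mean

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r2_perfect = r_squared(y, y)                        # 1.0: perfect predictions
r2_mean = r_squared(y, np.full_like(y, y.mean()))   # 0.0: always guess the mean
r2_bad = r_squared(y, y[::-1])                      # -3.0: worse than the mean guess

print(r2_perfect, r2_mean, r2_bad)
```

The reversed predictions have a squared error of 40 against 10 for the mean guess, hence $1 - 40/10 = -3$.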



In [None]:
def compute_r2(df, ppc, ft_endog):
    
    sse_model = np.sum((ppc['y'].mean(0).mean(0).T - df[[ft_endog]])**2)[0]
    sse_mean = np.sum((df[[ft_endog]] - df[ft_endog].mean())**2)[0]
    
    return 1 - (sse_model / sse_mean)

In [None]:
for method, ppc in zip(['OLS','Lasso','Ridge'],[ppc_ols, ppc_lasso, ppc_ridge]):
    df_mse.loc[method,'r2'] = compute_r2(dfs, ppc, ft_endog)

In [None]:
df_mse

**Observe:**

+ Well, this ought not to be surprising given the MSE result: the $R^{2}$ results are nearly identical
+ the $R^{2}$ for the OLS model is _marginally_ worse than the Lasso and the Ridge

### Distributional Check: Kolmogorov–Smirnov (KS) test

Mean values / MLEs are a clumsy way to summarise data, wasting lots of rich information present in our carefully gathered Bayesian samples.

Instead of summarising to calculate MSE and $R^{2}$, let's compare the distributions of the predicted values $\hat{y}$ vs the actual values $y$.

I'll use the [Kolmogorov-Smirnov (KS) Test](https://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test), specifically the 2-sample version, which lets us quantify the distance between the empirical distribution functions of two 'samples': the model predictions in this case.
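
For intuition, the 2-sample KS statistic is just the largest vertical gap between the two empirical distribution functions. A hand-rolled sketch follows; the helper `ks_statistic` and the toy samples are mine, and it returns only the statistic, with no p-value.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])          # ECDFs only jump at data points
    cdf_a = np.searchsorted(a, grid, side='right') / float(len(a))
    cdf_b = np.searchsorted(b, grid, side='right') / float(len(b))
    return np.max(np.abs(cdf_a - cdf_b))

# two small overlapping samples
d = ks_statistic(np.array([1., 2., 3., 4.]), np.array([3., 4., 5., 6.]))
print(d)   # 0.5
```

A statistic near 0 means the two empirical distributions track each other closely; a value near 1 means they barely overlap.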

##### Calculate the empirical distribution functions

In [None]:
df_ppc = pd.DataFrame({'observed': dfs[ft_endog]
                       ,'ols': ppc_ols['y'].mean(0).mean(0).ravel()
                       ,'lasso': ppc_lasso['y'].mean(0).mean(0).ravel()
                       ,'ridge': ppc_ridge['y'].mean(0).mean(0).ravel()}
                     , index=dfs.index)
df_ppc = df_ppc[['observed','ols','lasso','ridge']]
df_ppc_melt = pd.melt(df_ppc, var_name='method', value_name='y')

##### Boxplot the distributions

In [None]:
g = sns.factorplot(x='y', y='method', hue='method'
        ,data=df_ppc_melt, kind='box', size=5, aspect=2, legend=False, showmeans=True)

##### Lineplot the cumulative distributions

In [None]:
df_ppc_reordered = df_ppc.copy()

for i, method in enumerate(['observed','ols','lasso','ridge']):
     df_ppc_reordered[method] = ( (df_ppc[method].sort_values()[::-1].cumsum() /
                                 df_ppc[method].sum()).values )
ax = df_ppc_reordered.plot(kind='line')
_ = ax.set_ylim((0,1))

**Observe:**

+ I have a funny feeling we're going to see that the KS test is almost the same value for each of the methods...

### Perform KS Test using `scipy.stats.ks_2samp`

As per the [docs](http://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.ks_2samp.html), this is a two-sided test for the null hypothesis that 2 independent samples are drawn from the same continuous distribution.

If the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same. For my purposes here, I'm simply looking for mutual differences between the posterior predictive distributions for `ols`, `lasso` and `ridge`, and their individual differences with the `observed` distribution.

In [None]:
for (a,b) in combinations(['observed','ols','lasso','ridge'],2):
    print('\n{} vs {}:'.format(a,b))
    print(ks_2samp(df_ppc_reordered[a].values,df_ppc_reordered[b].values))

**Observe:**

The first 3 results printed above show the difference between the `observed` distribution and the posterior predictive distribution for `ols`, `lasso`, and `ridge`

+ The KS statistic is largest for `observed` vs `ridge`, which would suggest that the difference between these two distributions is smallest: the Ridge model fitted the data best.
+ This is somewhat tenuous however, since the numbers are very very similar.

The latter 3 results show the mutual differences between the three posterior distributions.

+ In this case the KS statistic is very small and the p-values very high, indicating that we ought to treat the three posterior distributions as the same.
+ This serves to further qualify the above observation that differences with `observed` are minimal.

We could read this as a good result: it's possible to create a Ridge regression that uses markedly less input data and achieves the same predictive results.

---

# Final Thoughts

I set out to determine if there's anything strange about Volkswagen's NOx emissions results, and I'll freely admit to getting somewhat distracted by demonstrating some of the flexibility of PyMC3 and methods for evaluating results. 

There are two interesting clues in the Notebook though:

+ The first is that the feature-value `mfr_owner_is_vw[T.True]` was selected by the Lasso model (both Frequentist and Bayesian versions), indicating that it has a non-negligible effect.
+ In the Ridge model (both versions) this feature-value coefficient receives a value quite far above zero, and indeed the vast majority (>>95%) of its distribution lies above zero.

---
**&copy; Applied AI Ltd 2016**  
<a href='http://www.applied.ai'>applied.ai</a>