<div style=float:right><img src="assets/img/appliedai-logo.png" width=100 style="margin: 0px 20px"></img></div>


##### Jonathan Sedar Personal Project
## PyMC3 vs PyStan Comparison
_Spring 2016_

This set of Notebooks and scripts comprises the **pymc3_vs_pystan** personal project by Jonathan Sedar of Applied AI Ltd, written primarily for presentation at the PyData London 2016 Conference.

The project demonstrates hierarchical linear regression using two Bayesian inference frameworks: PyMC3 and PyStan. The project borrows heavily from code written for Applied AI Ltd and is supplied here for educational purposes only. No copyright or license is extended to users.


    
# 32_MoreModelEvaluation_PyMC3

#### Demonstrate some model evaluation built into PyMC3

      
+ [Setup](#Setup)
    + [Local Functions](#Local-Functions)
    + [Load Data](#Load-Data)
    + [Describe Dataset](#Describe-Dataset)


+ [Reload Models and Traces](#Reload-Models-and-Traces)
    + [Reload PyMC3 model and traces](#Reload-PyMC3-model-and-traces)
    + [Reload PyStan fitted model](#Reload-PyStan-fitted-model)


+ [Information Criteria](#Information-Criteria)
    + [Deviance Information Criterion](#Deviance-Information-Criterion)  
    + [Widely Applicable Information Criterion](#Widely-Accepted-Information-Criterion)  


+ [Posterior Predictive Checks on Validation Data](#Posterior-Predictive-Checks-on-Validation-Data)
    + [Held-out test set](#Held-out-test-set)
    + [Cross-Validation](#Cross-Validation)
    + [Leave-One-Out Cross-Val](#Leave-One-Out-Cross-Val)




##### Note:

For more information on Model Selection in PyMC3, and about DIC and WAIC, you could start with:
+ Thomas Wiecki's [detailed response](https://stats.stackexchange.com/questions/161082/bayesian-model-selection-in-pymc3/166383#166383) to a question on Cross Validated
+ The Deviance Information Criterion: 12 Years On ([Spiegelhalter et al 2014](http://onlinelibrary.wiley.com/doi/10.1111/rssb.12062/abstract))
+ A Widely Applicable Bayesian Information Criterion ([Watanabe 2013](http://www.jmlr.org/papers/volume14/watanabe13a/watanabe13a.pdf))
+ Efficient Implementation of Leave-One-Out Cross-Validation and WAIC for Evaluating Fitted Bayesian Models ([Vehtari et al 2015](http://arxiv.org/abs/1507.04544))

---

# Setup

In [1]:
## Interactive magics
%matplotlib inline
%qtconsole --colors=linux

In [2]:
# filter warnings for presentation's sake
import warnings
warnings.filterwarnings('ignore')

# general packages
import re
import sqlite3
from convenience_functions import *
from ipywidgets import interactive, fixed
from itertools import combinations
#from io import StringIO
#from collections import OrderedDict


# scientific packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import patsy as pt
from scipy import optimize
from scipy.stats import ks_2samp

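from sklearn.neighbors import KernelDensity         # used by plot_stan_trc
from sklearn.cross_validation import ShuffleSplit   # used for the hold-out split (pre-0.18 scikit-learn API)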
import statsmodels.api as sm

# pymc3 libraries
import pymc3 as pm
import theano as thno
import theano.tensor as T 
import pystan

sns.set(style="darkgrid", palette="muted")
pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = 12, 4
np.random.seed(0)

## Local Functions

In [3]:
def strip_derived_rvs(rvs):
    '''Convenience fn: remove PyMC3-generated RVs from a list'''
    ret_rvs = []
    for rv in rvs:
        if not (re.search('_log',rv.name) or re.search('_interval',rv.name)):
            ret_rvs.append(rv)     
    return ret_rvs


def trace_median(x):
    return pd.Series(np.median(x,0), name='median')


def plot_traces_pymc(trcs, varnames=None):
    ''' Convenience fn: plot traces with overlaid means and values '''

    nrows = len(trcs.varnames)
    if varnames is not None:
        nrows = len(varnames)
    ax = pm.traceplot(trcs, varnames=varnames, figsize=(12,nrows*1.4)
        ,lines={k: v['mean'] for k, v in 
            pm.df_summary(trcs,varnames=varnames).iterrows()})

    for i, mn in enumerate(pm.df_summary(trcs, varnames=varnames)['mean']):
        ax[i,0].annotate('{:.2f}'.format(mn), xy=(mn,0), xycoords='data'
                    ,xytext=(5,10), textcoords='offset points', rotation=90
                    ,va='bottom', fontsize='large', color='#AA0022')    

        
def plot_stan_trc(dftrc):
    """
       Create simple plots of parameter distributions and traces from 
       output of pystan sampling. Emulates pymc traceplots.
    """

    fig, ax2d = plt.subplots(nrows=dftrc.shape[1], ncols=2, figsize=(14, 1.8*dftrc.shape[1]),
                                facecolor='0.99', edgecolor='k')
    fig.suptitle('Distributions and traceplots for {} samples'.format(
                                dftrc.shape[0]),fontsize=14)
    fig.subplots_adjust(wspace=0.2, hspace=0.5)

    k = 0
    
    # create density and traceplot, per parameter coeff
    for i, (ax1d, col) in enumerate(zip(ax2d, dftrc.columns)):

        samples = dftrc[col].values
        scale = (10**np.round(np.log10(samples.max() - samples.min()))) / 20
        kde = KernelDensity(bandwidth=scale).fit(samples.reshape(-1, 1))
        x = np.linspace(samples.min(), samples.max(), 100).reshape(-1, 1)
        y = np.exp(kde.score_samples(x))
        clr = sns.color_palette()[0]

        # density plot
        ax1d[0].plot(x, y, color=clr, linewidth=1.4)
        ax1d[0].vlines(np.percentile(samples, [2.5, 97.5]), ymin=0, ymax=y.max()*1.1,
                       alpha=1, linestyles='dotted', colors=clr, linewidth=1.2)
        mn = np.mean(samples)
        ax1d[0].vlines(mn, ymin=0, ymax=y.max()*1.1,
                       alpha=1, colors='r', linewidth=1.2)
        ax1d[0].annotate('{:.2f}'.format(mn), xy=(mn,0), xycoords='data'
                    ,xytext=(5,10), textcoords='offset points', rotation=90
                    ,va='bottom', fontsize='large', color='#AA0022')    
        ax1d[0].set_title('{}'.format(col), fontdict={'fontsize':10})


        # traceplot
        ax1d[1].plot(np.arange(len(samples)),samples, alpha=0.2, color=clr, linestyle='solid'
                              ,marker=',', markerfacecolor=clr, markersize=10)
        ax1d[1].hlines(np.percentile(samples,[2.5, 97.5]), xmin=0, xmax=len(samples),
                       alpha=1, linestyles='dotted', colors=clr)
        ax1d[1].hlines(np.mean(samples), xmin=0, xmax=len(samples), alpha=1, colors='r')

        k += 1
                
        ax1d[0].set_title('{}'.format(col), fontdict={'fontsize':14})#,'fontweight':'bold'})
        #ax1d[0].legend(loc='best', shadow=True)
        
        _ = [ax1d[j].axes.grid(True, linestyle='-', color='lightgrey') for j in range(2)]
            
    plt.subplots_adjust(top=0.94)
    plt.show()

## Load Data

In [4]:
cnxsql = sqlite3.connect('data/car_emissions.db')
dfs = pd.read_sql('select * from cars_post_exclusions_2sd', cnxsql, index_col=None)

In [5]:
## convert sqlite bool storage (as ints) back to bools
for ft in ['parent_is_vw', 'mfr_is_vw', 'is_tdi']:
    dfs[ft] = dfs[ft].astype(bool)

In [6]:
custom_describe(dfs)

(2593, 13)


| | 1653 | 835 | 763 | count | mean | std | min | 25% | 50% | 75% | max | dtype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| emissions_nox_mgkm | 12 | 36 | 32 | 2593 | 37.32 | 17.9 | 1 | 23.0 | 35.0 | 51.0 | 76 | float64 |
| parent_is_vw | False | False | False | 2593 | | | False | | | | True | bool |
| mfr_is_vw | False | False | False | 2593 | | | False | | | | True | bool |
| parent | daimler-ag | bmw | bmw | 2593 | | | aston | | | | volksw | object |
| mfr | mercedes-benz | bmw | bmw | 2593 | | | abarth | | | | volvo | object |
| trans | auto | semiauto | auto | 2593 | | | auto | | | | semiau | object |
| fuel_type | petrol | petrol | petrol | 2593 | | | diesel | | | | petrol | object |
| is_tdi | False | False | False | 2593 | | | False | | | | True | bool |
| metric_combined | -0.0728208 | 0.80692 | 0.220426 | 2593 | -0.0 | 0.5 | -0.685973 | -0.339409 | -0.152797 | 0.167108 | 2.75301 | float64 |
| metric_extra_urban | -0.0751821 | 0.47462 | 0.0728415 | 2593 | 0.0 | 0.5 | -0.47696 | -0.180913 | -0.075182 | 0.093988 | 21.5997 | float64 |


##### Declare feats for use

In [7]:
fts_cat = ['parent_is_vw', 'mfr_is_vw', 'parent', 'mfr', 'trans', 'fuel_type', 'is_tdi']
fts_cat_smp = ['mfr_is_vw','trans','fuel_type','is_tdi']
fts_num = ['metric_combined', 'metric_extra_urban', 'metric_urban_cold'
           ,'engine_capacity', 'emissions_co_mgkm']
fts_num_smp = ['metric_combined', 'engine_capacity', 'emissions_co_mgkm']
ft_endog = 'emissions_nox_mgkm'

## Describe dataset

+ The dataset is 2593 rows, with 12 exog features, 1 endog feature.
+ These are observations of car emissions tests, one row per car.
+ You can read off the basic distributional statistics of the features in the table above. Numeric features have been standardized according to [Gelman's 2sd principle](http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf).
+ I have selected these particular 12 features to work with. Some are derivatives of original features.
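
As a rough illustration of the 2sd standardization mentioned above: each numeric feature is centered on its mean and divided by two standard deviations, which is why the numeric features show a mean of 0 and a standard deviation of about 0.5. The sketch below is a minimal NumPy version. The function name `standardize_2sd` and the sample values are hypothetical, and the preprocessing actually applied to this dataset may differ in detail (for example in the degrees-of-freedom correction).

```python
import numpy as np


def standardize_2sd(x):
    """Center on the mean and divide by two standard deviations."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (2.0 * x.std())


# hypothetical raw values, for illustration only
raw = np.array([1.2, 1.6, 2.0, 3.0, 1.4, 2.2])
scaled = standardize_2sd(raw)

print(scaled.mean(), scaled.std())
```
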

We have the following features to choose from:

```
+ Categoricals:
    + `trans`     - the car transmission, simplified to 'auto', 'semiauto', 'manual'
    + `fuel_type` - the car power supply, simplified to 'petrol', 'diesel'
    + `parent`    - the parent company of the car manufacturer, 20 values
    + `mfr`       - the car manufacturer, 38 values

+ Booleans:
    + `parent_is_vw` - if the parent company of the car manufacturer is Volkswagen
    + `mfr_is_vw`    - if the car manufacturer is Volkswagen
    + `is_tdi`       - (processed feature) if the car engine type is a turbo diesel
    
+ Numerics:
    + `metric_combined`    - a score for fuel efficiency in combined driving
    + `metric_extra_urban` - a score for fuel efficiency in extra-urban driving
    + `metric_urban_cold`  - a score for fuel efficiency in an urban setting, cold start
    + `emissions_co_mgkm`  - a count of CO particulates emitted mg/km
    
+ Numeric endogenous feature:
    + `emissions_nox_mgkm` - a count of NOx particulates emitted mg/km    
```

For the purposes of this Notebook, the final feature listed, `emissions_nox_mgkm`, will be used as the _endogenous_ / _dependent_ / _output_ feature of the linear models. All other features may be used as _exogenous_ / _independent_ / _input_ features.

##### Create design matrices

In [8]:
fml_all = '{} ~ '.format(ft_endog) + ' + '.join(fts_num_smp + fts_cat_smp)
fml_all

'emissions_nox_mgkm ~ metric_combined + engine_capacity + emissions_co_mgkm + mfr_is_vw + trans + fuel_type + is_tdi'

In [9]:
(mx_en, mx_ex) = pt.dmatrices(fml_all, dfs, return_type='dataframe', NA_action='raise')
custom_describe(mx_ex, 2, )

(2593, 9)


| | 1731 | 1033 | count | mean | std | min | 25% | 50% | 75% | max | dtype |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Intercept | 1.0 | 1.0 | 2593 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | float64 |
| mfr_is_vw[T.True] | 0.0 | 0.0 | 2593 | 0.04 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | float64 |
| trans[T.manual] | 0.0 | 1.0 | 2593 | 0.49 | 0.5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | float64 |
| trans[T.semiauto] | 0.0 | 0.0 | 2593 | 0.12 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | float64 |
| fuel_type[T.petrol] | 0.0 | 0.0 | 2593 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | float64 |
| is_tdi[T.True] | 0.0 | 0.0 | 2593 | 0.12 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | float64 |
| metric_combined | -0.31275 | -0.472703 | 2593 | -0.0 | 0.5 | -0.685973 | -0.339409 | -0.152797 | 0.167108 | 2.753013 | float64 |
| engine_capacity | 0.042048 | -0.278857 | 2593 | 0.0 | 0.5 | -0.920116 | -0.278857 | -0.045471 | -0.037765 | 2.579014 | float64 |
| emissions_co_mgkm | 0.099293 | -0.421449 | 2593 | 0.0 | 0.5 | -0.846545 | -0.382482 | -0.088458 | 0.272874 | 2.281452 | float64 |
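
The `trans[T.manual]` and `trans[T.semiauto]` columns above come from treatment (dummy) coding: one level of each categorical (here `auto`) is taken as the reference and absorbed into the intercept, and every other level gets its own 0/1 indicator column. The following is a hand-rolled sketch of that idea, not patsy's implementation; `treatment_code` is a hypothetical helper and the input values are made up.

```python
import numpy as np


def treatment_code(name, values):
    """Dummy-code a categorical, using the first sorted level as reference."""
    levels = sorted(set(values))
    cols = ['{}[T.{}]'.format(name, lv) for lv in levels[1:]]
    mx = np.array([[float(v == lv) for lv in levels[1:]] for v in values])
    return cols, mx


cols, mx = treatment_code('trans', ['auto', 'manual', 'semiauto', 'auto'])
print(cols)
print(mx)
```

Rows whose value is the reference level (`auto`) are all zeros, so their effect is carried by the intercept alone.
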


---

---

# Reload Models and Traces

## Reload PyMC3 model and traces

In [10]:
with pm.Model():
    mdl_robust_pymc = read_pickle(relnm='models/mdl_robust_pymc.pkl')

Loaded object from disk at models/mdl_robust_pymc.pkl


In [11]:
## load traces within model context
with mdl_robust_pymc:
    trc_robust_pymc = pm.backends.text.load('traces/trc_robust_pymc')

In [12]:
# alternatively, load with a ref to reloaded model
trc_robust_pymc = pm.backends.text.load('traces/trc_robust_pymc', model=mdl_robust_pymc)

## Reload PyStan fitted model

##### Fit object contains traces

In [13]:
## load the model first, and then the fit:

mdl_robust_pystan_model = read_pickle(relnm='models/mdl_robust_stan_model.pkl')

mdl_robust_pystan_fit = read_pickle(relnm='models/mdl_robust_stan_fit.pkl')

Loaded object from disk at models/mdl_robust_stan_model.pkl
Loaded object from disk at models/mdl_robust_stan_fit.pkl


---

---

# More model comparison techniques available in PyMC3

##### Let's create another PyMC3 model which ought to yield worse results

In [None]:
with pm.Model() as mdl_ols_pymc:
      
    pm.glm.glm(fml_all,
               dfs,
               intercept_prior=pm.Uniform.dist(lower=-1e3, upper=1e3),
               regressor_prior=pm.Uniform.dist(lower=-1e3, upper=1e3),
               family=pm.glm.families.Normal())

In [None]:
with mdl_ols_pymc:

    ## find MAP using Powell, seems to be more robust
    start_MAP = pm.find_MAP(fmin=optimize.fmin_powell)

    ## take samples
    trc_ols_pymc = pm.sample(2000, start=start_MAP, njobs=1, step=pm.NUTS()
                        ,trace=pm.backends.Text('traces/trc_ols_pymc'))

## Deviance Information Criterion (DIC)

Here we compare model fits by using the Deviance Information Criterion (DIC) and the Watanabe - Akaike (or Widely Applicable) Information Criterion (WAIC).
+ DIC (`stats.dic`) and WAIC (`stats.waic`) are new additions to PyMC3, so this example shows their usage in a more concrete fashion, and also shows usage of the new `glm` submodule.
+ The example was inspired by Jake Vanderplas' [recent blogpost](https://jakevdp.github.io/blog/2015/08/07/frequentism-and-bayesianism-5-model-selection/) on model selection, although in this first iteration, Cross-Validation and Bayes Factor comparison are not implemented.
+ The datasets are tiny and generated within this Notebook. They contain errors in the measured value (y) only.







The [Deviance Information Criterion](https://en.wikipedia.org/wiki/Deviance_information_criterion) (DIC) is a fairly unsophisticated method for comparing the deviance of the likelihood across the sample traces of a model run.

However, this simplicity apparently yields quite good results in a variety of cases; see the discussion in [Spiegelhalter et al 2014](http://onlinelibrary.wiley.com/doi/10.1111/rssb.12062/abstract), which is worth reading.

DIC has recently been added to PyMC3, so let's see what it tells us about our model fits. Lower numbers are better.
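
To make the calculation concrete, here is a minimal NumPy sketch of DIC for a toy model: data drawn from a Normal with unknown mean and known unit variance, with posterior samples of the mean drawn directly from the conjugate posterior under a flat prior. Everything in it (the data, the sample sizes, the names `deviance` and `p_d`) is an assumption for illustration. Note that it uses the deviance of the likelihood only, whereas the manual calculation in this Notebook evaluates the PyMC3 model `logp`.

```python
import numpy as np

rng = np.random.RandomState(0)

# toy data: y ~ Normal(mu, 1), true mu = 2
y = rng.normal(loc=2.0, scale=1.0, size=50)

# posterior samples of mu under a flat prior: Normal(mean(y), 1/n)
mu_samples = rng.normal(loc=y.mean(), scale=1.0 / np.sqrt(len(y)), size=4000)


def deviance(mu):
    """-2 * log-likelihood of y under Normal(mu, 1)."""
    loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (y - mu) ** 2
    return -2.0 * loglik.sum()


mean_deviance = np.mean([deviance(mu) for mu in mu_samples])
deviance_at_mean = deviance(mu_samples.mean())

p_d = mean_deviance - deviance_at_mean   # effective number of parameters
dic = mean_deviance + p_d                # same as 2 * mean_deviance - deviance_at_mean

print(p_d, dic)
```

For this one-parameter model `p_d` should come out close to 1.
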


##### Manual

In [None]:
dftrc_lin = pm.trace_to_dataframe(traces_lin['k1'])
trc_lin_logp = dftrc_lin.apply(lambda x: models_lin['k1'].logp(x.to_dict()), axis=1)
mean_deviance = -2 * trc_lin_logp.mean(0)
mean_deviance

In [None]:
deviance_at_mean = -2 * models_lin['k1'].logp(dftrc_lin.mean(0).to_dict())
deviance_at_mean

In [None]:
dic_k1 = 2 * mean_deviance - deviance_at_mean
dic_k1

##### Or use stats.py

In [None]:
pm.stats.dic(model=models_lin['k1'], trace=traces_lin['k1'])

**Observe**

+ We should prefer the model(s) with lower DIC, which (happily) directly opposes the increasing likelihood we see above.


+ Linear-generated data (lhs):
    + The DIC increases monotonically with model complexity, which is great to see!
    + The more complicated the model, the more it would appear we are overfitting.


+ Quadratic-generated data (rhs):
    + The DIC dips slightly for the correct model k2
    + The difference is slight though!



In [None]:
# OLS
pm.stats.dic(model=mdl_ols, trace=trc_ols[-1000:])

In [None]:
# Lasso
pm.stats.dic(model=mdl_lasso, trace=trc_lasso[-7000::7])

In [None]:
# Ridge
pm.stats.dic(model=mdl_ridge, trace=trc_ridge[-1000:])

**Observe:**

+ All 3 values are quite similar! 
+ Interestingly, the Lasso has the highest (worst) DIC value, probably because I deliberately chose a suboptimal regularization parameter $\lambda$ in order to favour the dropping of feature coefficients. This likely resulted in an _underfitted_ model.

In [None]:
dics = [pm.stats.dic(model=mdl, trace=trc) for mdl, trc in zip(mdls, trcs)]
dfdics = pd.DataFrame({'model':['ols','lasso','ridge','ridge_hn','student'],'dic':dics})

In [None]:
waics = [pm.stats.waic(model=mdl, trace=trc) for mdl, trc in zip(mdls, trcs)]
dfwaics = pd.DataFrame({'model':['ols','lasso','ridge','ridge_hn','student'],'waic':waics})

In [None]:
g = sns.barplot(y='model',x='waic', data=dfwaics, orient='h')

See my example in the PyMC3 docs: https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/GLM-model-selection.ipynb


## Watanabe - Akaike Information Criterion (WAIC)

The Widely Applicable Information Criterion, a.k.a. the Watanabe - Akaike Information Criterion (WAIC), is another simple option for calculating the goodness-of-fit of a model using numerical techniques. See [(Watanabe 2013)](http://www.jmlr.org/papers/volume14/watanabe13a/watanabe13a.pdf) for details of the closely related Widely Applicable Bayesian Information Criterion (WBIC).

WAIC has also recently been added to PyMC3, so let's see what it tells us about our model fits for both datasets.
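
As a hedged sketch of what WAIC computes, following the definitions in the LOO and WAIC paper linked in the Note at the top of this Notebook: the log pointwise predictive density (lppd) is penalized by an effective number of parameters, estimated as the per-observation variance of the log-likelihood across posterior samples. The toy Normal model, the data and all variable names below are assumptions for illustration, and this is not the PyMC3 implementation.

```python
import numpy as np

rng = np.random.RandomState(0)

# toy data and posterior samples for a Normal(mu, 1) model with a flat prior
y = rng.normal(loc=2.0, scale=1.0, size=50)
mu_samples = rng.normal(loc=y.mean(), scale=1.0 / np.sqrt(len(y)), size=4000)

# pointwise log-likelihood, shape (n_samples, n_observations)
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu_samples[:, None]) ** 2

# log pointwise predictive density, via a numerically stable log-mean-exp
mx = loglik.max(axis=0)
lppd = np.sum(mx + np.log(np.mean(np.exp(loglik - mx), axis=0)))

# effective number of parameters: variance of the log-likelihood per observation
p_waic = np.sum(np.var(loglik, axis=0, ddof=1))

waic = -2.0 * (lppd - p_waic)
print(lppd, p_waic, waic)
```

As with DIC, the result is on the deviance scale, so lower values are preferred.
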

##### This time go straight for the implementation in PyMC3

In [None]:
pm.stats.waic(model=models_lin['k1'], trace=traces_lin['k1'])

**Observe**

+ We should prefer the model(s) with lower WAIC


+ Linear-generated data (lhs):
    + The WAIC seems quite flat across models
    + The WAIC seems best (lowest) for simpler models, but **k1** doesn't stand out as much as it did when using DIC


+ Quadratic-generated data (rhs):
    + The WAIC is certainly wrong for **k1**, but otherwise also quite flat across the models
    + There does appear to be a slight dip in the right place at **k2**
    
    
For these particular models and data, I would prefer to use the DIC scores in order to choose models.



---

---

# Hold out set (Theano shared vars)

##### Create split

In [None]:
kfold_shuffle = ShuffleSplit(mx_ex.shape[0], n_iter=1, test_size=0.2, random_state=0)

for train_idx, test_idx in kfold_shuffle:

    mx_ex_train = mx_ex.iloc[train_idx]
    mx_en_train = mx_en.iloc[train_idx]
    mx_ex_test = mx_ex.iloc[test_idx]
    mx_en_test = mx_en.iloc[test_idx]

print('ntrain: {}, ntest: {}'.format(mx_ex_train.shape, mx_ex_test.shape))
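
For reference, a single shuffled 80/20 split like the one above can be sketched without scikit-learn, using a random permutation of the row positions. The row count here is a toy value and `holdout_split` is a hypothetical helper; the rounding of the test-set size may differ from `ShuffleSplit`.

```python
import numpy as np


def holdout_split(n_rows, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) from one shuffled split of range(n_rows)."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n_rows)
    n_test = int(np.ceil(test_size * n_rows))
    return idx[n_test:], idx[:n_test]


train_idx, test_idx = holdout_split(n_rows=10)
print(len(train_idx), len(test_idx))
```

The resulting index arrays can be passed to `.iloc` in the same way as `train_idx` and `test_idx` above.
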

##### Use theano shared variables to allow us to switch train for test and re-use the model

In [None]:
shd_b0 = thno.shared(mx_ex_train['Intercept'].values)
shd_b1 = thno.shared(mx_ex_train['is_tdi[T.True]'].values)
shd_b2 = thno.shared(mx_ex_train['engine_capacity'].values)
shd_b3 = thno.shared(mx_ex_train['metric_combined'].values)

shd_obs = thno.shared(mx_en_train['emissions_nox_mgkm'].values)

##### Train model 

In [None]:
with pm.Model() as mdl_holdout:

    # define priors, use Normal for Ridge (sd=100, weakly informative)
    b0 = pm.Normal('b0_intercept', mu=0, sd=100)
    b1 = pm.Normal('b1_is_tdi', mu=0, sd=100)
    b2 = pm.Normal('b2_engine_capacity', mu=0, sd=100)
    b3 = pm.Normal('b3_metric_combined', mu=0, sd=100)    
 
    # define linear model
    yest = (b0 * shd_b0 +
            b1 * shd_b1 +
            b2 * shd_b2 + 
            b3 * shd_b3)

    # create MLE with stdev chosen from HalfNormal dist and free param on nu             
    epsilon = pm.HalfNormal('epsilon', sd=10)
    nu = pm.DiscreteUniform('nu', lower=1, upper=100)
   
    ## Student T likelihood with variable degrees of freedom nu
    likelihood = pm.StudentT('likelihood', nu=nu, mu=yest, sd=epsilon
                            ,observed=shd_obs)
    
    start_MAP = pm.find_MAP(fmin=optimize.fmin_powell, disp=True)
      
    trc_holdout = pm.sample(2000, step=pm.NUTS(), start=start_MAP, progressbar=True)

### Test model

##### Now switch values in the shared variables to the test set

In [None]:
shd_b0.set_value(mx_ex_test['Intercept'].values)
shd_b1.set_value(mx_ex_test['is_tdi[T.True]'].values)
shd_b2.set_value(mx_ex_test['engine_capacity'].values)
shd_b3.set_value(mx_ex_test['metric_combined'].values)

shd_obs.set_value(mx_en_test['emissions_nox_mgkm'].values)

##### And run the posterior check, which will now create estimates for the test data

In [None]:
ppc = pm.sample_ppc(trc_holdout[-1000:], samples=500, model=mdl_holdout, size=50)
ppc['likelihood'].shape

## Evaluate prediction

## PPC

In [None]:
ax = plt.subplot()
sns.distplot([n.mean() for n in ppc['likelihood']], kde=False, ax=ax)
ax.axvline(mx_en_test[ft_endog].mean())
# ax.set(title='Posterior predictive of the mean', xlabel='mean(x)', ylabel='Frequency');

## Coverage

## R2

## MSE

## KS
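
The Coverage, R2, MSE and KS headings above are placeholders. One way they might be filled in is sketched below on synthetic data: `y_obs` stands in for the held-out observations and `ppc_draws` for posterior predictive draws arranged as (n_draws, n_obs). Both arrays, and the helper `ks_statistic`, are assumptions for illustration; the real `ppc['likelihood']` array may need reshaping to this layout first.

```python
import numpy as np

rng = np.random.RandomState(0)

# synthetic stand-ins for held-out observations and posterior predictive draws
n_obs = 200
mu_true = rng.normal(0.0, 2.0, size=n_obs)
y_obs = mu_true + rng.normal(0.0, 1.0, size=n_obs)
ppc_draws = mu_true + rng.normal(0.0, 1.0, size=(500, n_obs))

# point prediction per observation: mean of the predictive draws
y_hat = ppc_draws.mean(axis=0)

# MSE and R2 of the point predictions
mse = np.mean((y_obs - y_hat) ** 2)
r2 = 1.0 - np.sum((y_obs - y_hat) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)

# coverage: share of observations inside their central 95% predictive interval
lower, upper = np.percentile(ppc_draws, [2.5, 97.5], axis=0)
coverage = np.mean((y_obs >= lower) & (y_obs <= upper))


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side='right') / float(len(a))
    cdf_b = np.searchsorted(b, grid, side='right') / float(len(b))
    return np.abs(cdf_a - cdf_b).max()


ks = ks_statistic(y_obs, ppc_draws.ravel())
print(coverage, r2, mse, ks)
```

`scipy.stats.ks_2samp`, already imported in Setup, returns the same statistic together with a p-value.
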

# Leave One Out LOO

## K-Fold Cross Validation and/or Leave-One-Out (LOO)

http://www.stat.columbia.edu/~gelman/research/unpublished/waic_stan.pdf



http://arxiv.org/abs/1507.04544
    

## Bayes Factor

Following text lifted directly from [JakeVDP blogpost](https://jakevdp.github.io/blog/2015/08/07/frequentism-and-bayesianism-5-model-selection/)

The Bayesian approach proceeds very differently. Recall that the Bayesian model involves computing the odds ratio between two models:

$$O_{21}=\frac{P(M_{2} \;|\; D)}{P(M_{1} \;|\; D)}=\frac{P(D \;|\; M_{2})}{P(D \;|\; M_{1})}\frac{P(M_{2})}{P(M_{1})}$$

Here the ratio $\frac{P(M_{2})}{P(M_{1})}$ is the prior odds ratio, and is often assumed to be equal to 1 if no compelling prior evidence favors one model over another. The ratio $\frac{P(D \;|\; M_{2})}{P(D \;|\; M_{1})}$ is the **Bayes factor**, and is the key to Bayesian model selection.


The Bayes factor can be computed by evaluating the integral over the parameter likelihood:

$$P(D \;|\; M)=\int_{\Omega}P(D \;|\; \theta,M) \; P(\theta \;|\; M) \;d\theta$$

This integral is over the entire parameter space of the model, and thus can be extremely computationally intensive, especially as the dimension of the model grows beyond a few. 
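
For a one-parameter model the integral can simply be evaluated on a grid, which gives a feel for what the Bayes factor measures. The sketch below is a toy example and not part of the quoted blogpost: $M_{1}$ fixes the mean of Normal($\theta$, 1) data at $\theta = 0$, while $M_{2}$ gives $\theta$ a Normal(0, $\tau^2$) prior. The data values, the prior scale `tau` and the integration grid are all assumptions.

```python
import numpy as np

# hypothetical observations, assumed Normal(theta, 1)
y = np.array([0.8, 1.3, 0.4, 1.1, 0.9])
tau = 2.0  # prior sd of theta under M2


def likelihood(theta):
    """P(D | theta) for iid Normal(theta, 1) observations."""
    return np.prod(np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2 * np.pi))


# M1: no free parameter, so the evidence is the likelihood at theta = 0
evidence_m1 = likelihood(0.0)

# M2: integrate likelihood * prior over theta with the trapezoid rule
grid = np.linspace(-10.0, 10.0, 4001)
prior = np.exp(-0.5 * (grid / tau) ** 2) / (tau * np.sqrt(2 * np.pi))
integrand = np.array([likelihood(t) for t in grid]) * prior
evidence_m2 = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(grid))

bayes_factor_21 = evidence_m2 / evidence_m1
print(evidence_m1, evidence_m2, bayes_factor_21)
```

This brute-force approach stops being practical once the parameter space has more than a few dimensions, which is the computational difficulty described above.
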

---
**&copy; Applied AI Ltd 2016**  
<a href='http://www.applied.ai'>applied.ai</a>