In [7]:
from IPython.display import HTML
HTML("<style>.container { width:90% !important; } div.cell.selected { border-left-width: 1px !important;}</style>")

<div style="float:right; width:45%"><img src="assets/img/parent_mfr_coeffs.png" style="margin: 0px 20px 0px 20px"></div>



# Hierarchical Modelling with PyMC3 and PyStan

Can we use Bayesian inference to determine unusual car emissions tests for Volkswagen?   
#PyDataLondon Sun 08 May 2016   


<br/>

#### Jonathan Sedar  
Consulting Data Scientist  
Applied AI Ltd  
<a href="https://twitter.com/jonsedar">@jonsedar</a>  
<a href='http://www.applied.ai'>applied.ai</a>  




<div style="float:left"><img src="assets/img/appliedai-logo.png" width="100"/></div>




# In the next 30 mins you will:


1. Gain a better understanding of Bayesian inference

2. Learn a bit about the capabilities of PyMC3 and PyStan

3. Think about giving it a try on your problems


(... hopefully)





<small>1. Gain a better understanding of Bayesian inference</small>

# It tells us more about the "why" and "how"

<br/>

<div style="float:left; width:45%"><img src="assets/img/ppc_outlierdetection.png" style="margin: 0px 20px 0px 20px">
<h4><a href="https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/GLM-robust-with-outlier-detection.ipynb">How do these outliers differ?</a></h4>
</div>

<div style="float:right; width:45%"><img src="assets/img/sneeze_counts.png" style="margin: 0px 20px 0px 20px">
<h4><a href="https://gist.github.com/jonsedar/d600ff796fc8a42555410d72e4426d99">What affects my sneezing?</a></h4>
</div>



<div style="float:left; width:45%"><img src="assets/img/processchange_convrate.png" style="margin: 0px 20px 0px 20px">
<h4><a href="http://blog.applied.ai/a-bayesian-approach-to-monitoring-process-change-part-2/">Was there a change in my conversion rate?</a></h4>
</div>

<div style="float:right; width:45%"><img src="assets/img/mfr_unpooled_forestplot.png" style="margin: 0px 20px 0px 20px">
<h4><a href="https://github.com/jonsedar/pymc3_vs_pystan">What's the impact of car manufacturer on emissions?</a></h4>
</div>



<small>2. Learn a bit about the capabilities of PyMC3 and PyStan</small>

# Two excellent and quite different open-source libraries for Bayesian Inference

<br/>

<div style="float:left; width:30%"><img src="assets/img/pymc3_logo.png" style="margin: 0px 20px 0px 20px">
</div>

<div style="float:right; width:30%"><img src="assets/img/stan_logo.png" style="margin: 0px 20px 0px 20px">
</div>







# PyMC3


+ Pure Python implementation of symbolic statistical modelling
+ Very efficient gradient-based samplers using the Theano library for autodiff / graph computation


<div style="width:90%"><img src="assets/img/github_pymc3.png" style="margin: 0px 20px 0px 20px">
</div>

https://github.com/pymc-devs/pymc3

## Includes

+ Metropolis-Hastings, Hamiltonian Monte-Carlo (HMC), Binary-Metropolis, Slice samplers 
+ No-U-Turn Sampler (NUTS) published by Hoffman & Gelman in 2014 (only otherwise available in Stan)
+ Variational Inference (ADVI)
+ A host of helpful functions for model evaluation 
    + posterior predictive checks (PPC)
    + DIC / WAIC calcs
    + LOO cross-validation
    + Traceplotting / Forestplotting
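
To give a feel for what the simplest of these samplers does, here is a minimal sketch of random-walk Metropolis in plain Python. It is not PyMC3's implementation: the target (a standard normal), the function names, the step size and the seed are all assumptions chosen purely for illustration.

```python
import math
import random
import statistics


def log_target(x):
    """Unnormalised log-density of a standard normal (illustrative target only)."""
    return -0.5 * x * x


def metropolis(log_p, n_samples, start=0.0, step=2.0, seed=42):
    """Random-walk Metropolis: propose x' = x + noise, accept with prob min(1, p(x')/p(x))."""
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_ratio = log_p(proposal) - log_p(x)
        # Always accept uphill moves; accept downhill moves with probability exp(log_ratio)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        # A rejected proposal repeats the current value in the chain
        samples.append(x)
    return samples


samples = metropolis(log_target, n_samples=20000)
sample_mean = statistics.mean(samples)
sample_var = statistics.pvariance(samples)
print(f"mean ~ {sample_mean:.2f}, variance ~ {sample_var:.2f}")
```

The chain's mean and variance should land near 0 and 1. Gradient-based samplers such as HMC and NUTS aim to explore the same target with far less random-walk behaviour.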

# PyStan

+ Thin Python wrapper around Stan (see also RStan, MatlabStan, CmdStan)
+ Stan is a C++ based probabilistic programming language 


<div style="width:90%"><img src="assets/img/github_stan.png" style="margin: 0px 20px 0px 20px">
</div>

+ Strong academic support via researchers inc. Gelman, Carpenter, Hoffman etc. (many of them at Columbia Uni)
+ https://github.com/stan-dev/stan

## Stan includes

+ full Bayesian statistical inference with MCMC sampling (NUTS, HMC)
+ approximate Bayesian inference with variational inference (ADVI)
+ penalized maximum likelihood estimation with optimization (L-BFGS)


## PyStan includes

+ Convenience functions for compiling Stan models and running them
+ A handful of plotting functions 


# Some existing comparisons

+ Jake Vanderplas blogpost [Frequentism and Bayesianism IV: How to be a Bayesian in Python, Jun 2014](http://jakevdp.github.io/blog/2014/06/14/frequentism-and-bayesianism-4-bayesian-in-python/) (Thorough eval of PyMC2 vs PyStan vs emcee)
+ Chris Burns blogpost [A Tale of Three Samplers, Jul 2014](https://users.obs.carnegiescience.edu/~cburns/site/?p=120) (Overview of PyMC2 vs PyStan vs emcee)
+ Llewelyn Richards-Ward blogpost: [You’ll never guess what’s been happening with PyStan and PyMC—Click here to find out, Oct 2015](http://andrewgelman.com/2015/10/15/whats-the-one-thing-you-have-to-know-about-pystan-and-pymc-click-here-to-find-out/) (Overview of PyMC3 vs PyStan)
+ Isaac Slavitt blogpost [Solving the Bayesian German Tank problem with PyMC and PyStan, Dec 2015](http://isaacslavitt.com/2015/12/19/german-tank-problem-with-pymc-and-pystan/) (More detailed comparison of PyMC3 vs PyStan)



<small>3. Think about giving it a try on your problems</small>

# 


In [None]:
`https://github.com/jonsedar/pymc3_vs_pystan`

I'm co-director of Applied AI Ltd, a niche data science consultancy operating in the insurance sector. I've a background in physics and machine learning, and over ten years' experience providing advice and insights to senior audiences in a variety of industries in Europe & USA.

I helped to set up DataKind Dublin and still volunteer with the UK chapter, am co-organiser of the Bayesian Mixer London meetup group, and all too often use speaking events and blogposts to force myself to learn stuff! See http://blog.applied.ai for more material like this talk.

##### Overview

Bayesian inference bridges the gap between white-box model introspection and black-box predictive performance. We gain the ability to fully specify a model and fit it to observed data according to our prior knowledge. Small datasets are handled well and the overall method and results are very intuitive, lending themselves to both statistical insight and future prediction.
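
As a small illustration of combining prior knowledge with a tiny dataset, here is a sketch of the conjugate Beta-Binomial update for a conversion rate. The prior `Beta(2, 2)` and the counts (3 conversions out of 10 visitors) are made up for this example and are not taken from the talk.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data under a Beta(alpha, beta) prior."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior belief: conversion rate is probably near 50%, but weakly held
prior = (2.0, 2.0)

# Hypothetical (made-up) data: 10 visitors, 3 conversions
posterior = beta_binomial_update(*prior, successes=3, failures=7)

mle = 3 / 10  # the plain frequency estimate, for comparison
print("posterior parameters:", posterior)
print("posterior mean:", round(beta_mean(*posterior), 3), "vs frequency estimate:", mle)
```

With so little data the posterior mean sits between the prior mean and the raw frequency, and it moves towards the frequency as more observations arrive.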

This talk will demonstrate the use of Bayesian inference in a real-world scenario: using a set of hierarchical models to compare exhaust emissions data from a set of vehicle manufacturers.

This will be interesting to people who work in the Type A side of data science, and will demonstrate usage of the tools as well as some theory.

<img src="assets/img/continuum.png" style="margin: 0px 20px 0px 20px; width:45%"></img>

# The Frameworks

PyMC3 and PyStan are two of the leading frameworks for Bayesian inference in Python, offering concise model specification, MCMC sampling, and a growing number of built-in conveniences for model validation, verification and prediction.

PyMC3 is the successor to PyMC2, and comprises a comprehensive package of symbolic statistical modelling syntax and very efficient gradient-based samplers using the Theano library (of deep-learning fame) for gradient computation. Of particular interest is that it includes the No-U-Turn Sampler (NUTS) published by Hoffman & Gelman in 2014, which is only otherwise available in Stan.

PyStan is a wrapper around Stan, a major open-source framework for Bayesian inference developed by Gelman, Carpenter, Hoffman and many others. Stan also has HMC and NUTS samplers and, more recently, Variational Inference, which is a very efficient way to approximate the joint posterior distribution. Models are specified in a custom syntax and compiled to C++.
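
To show roughly what a gradient-based sampler does, here is a minimal one-dimensional sketch of Hamiltonian Monte Carlo with a leapfrog integrator. It is not how PyMC3 or Stan implement HMC or NUTS: the target (a standard normal), the hand-written gradient, the fixed step size and path length, and all function names are assumptions for illustration. Both libraries obtain gradients by automatic differentiation, and NUTS chooses the path length adaptively.

```python
import math
import random
import statistics


def neg_log_p(x):
    """Potential energy U(x) = -log p(x) for a standard normal, up to a constant."""
    return 0.5 * x * x


def grad_neg_log_p(x):
    """Gradient of U(x), written by hand here for simplicity."""
    return x


def leapfrog(x, p, step, n_steps):
    """Simulate Hamiltonian dynamics: half momentum step, full steps, half momentum step."""
    p -= 0.5 * step * grad_neg_log_p(x)
    for i in range(n_steps):
        x += step * p
        if i < n_steps - 1:
            p -= step * grad_neg_log_p(x)
    p -= 0.5 * step * grad_neg_log_p(x)
    return x, p


def hmc(n_samples, step=0.2, n_steps=10, start=0.0, seed=0):
    """Basic HMC: resample momentum, integrate, then accept or reject on the energy change."""
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(n_samples):
        p0 = rng.gauss(0.0, 1.0)
        x_new, p_new = leapfrog(x, p0, step, n_steps)
        h_old = neg_log_p(x) + 0.5 * p0 * p0
        h_new = neg_log_p(x_new) + 0.5 * p_new * p_new
        # Metropolis correction for the integrator's discretisation error
        if h_old - h_new >= 0 or rng.random() < math.exp(h_old - h_new):
            x = x_new
        samples.append(x)
    return samples


draws = hmc(5000)
print(f"mean ~ {statistics.mean(draws):.2f}, variance ~ {statistics.pvariance(draws):.2f}")
```

Because the integrator nearly conserves energy, almost every proposal is accepted even though each one travels a long way from the current point.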

# The Real-World Problem & Dataset

I'm currently quite interested in road traffic and vehicle insurance, so I've dug into the [UK VCA](http://www.dft.gov.uk/vca/) (Vehicle Certification Agency, the UK's vehicle type approval authority) to find their [Car Fuel and Emissions Information](http://carfueldata.direct.gov.uk/) for August 2015. The raw dataset is available for [direct download](http://carfueldata.direct.gov.uk/downloads/download.aspx?rg=aug2015) and is small but varied enough for our use here: roughly 2500 cars and 10 features, including hierarchies of car parent-manufacturer > manufacturer > model.
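
To hint at why a hierarchy is useful here, the sketch below shows partial pooling in its simplest closed form: each group's estimate is a precision-weighted blend of its own mean and the overall mean. Everything in it is an assumption for illustration. The readings are entirely made up (not from the VCA dataset), the manufacturer names are placeholders, and the within-group and between-group standard deviations `sigma` and `tau` are fixed by hand, whereas a full hierarchical model in PyMC3 or PyStan would infer them from the data.

```python
from statistics import mean

# Entirely made-up CO2 readings (g/km) for placeholder manufacturers; NOT real VCA data
readings = {
    "mfr_a": [120.0, 125.0, 130.0, 128.0, 122.0, 126.0],
    "mfr_b": [150.0, 155.0],
    "mfr_c": [98.0],
}


def partial_pool(groups, sigma, tau):
    """Shrink each group mean towards the grand mean, assuming known sigma and tau.

    sigma: assumed standard deviation of readings within a group
    tau:   assumed standard deviation of true group means around the grand mean
    """
    grand = mean(y for ys in groups.values() for y in ys)
    pooled = {}
    for name, ys in groups.items():
        precision_data = len(ys) / sigma ** 2   # more observations -> trust the group more
        precision_prior = 1.0 / tau ** 2        # tighter hierarchy -> shrink harder
        weight = precision_data / (precision_data + precision_prior)
        pooled[name] = weight * mean(ys) + (1.0 - weight) * grand
    return grand, pooled


grand_mean, estimates = partial_pool(readings, sigma=10.0, tau=10.0)
for name, value in estimates.items():
    print(f"{name}: raw mean {mean(readings[name]):.1f} -> pooled {value:.1f}")
```

Groups with few observations are pulled strongly towards the overall mean, while well-observed groups barely move. This is the behaviour that makes a manufacturer stand out only when the data genuinely support it.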

I will investigate the car emissions data from the point-of-view of the [Volkswagen Emissions Scandal](https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal) which seems to have [meaningfully damaged their sales](http://www.usatoday.com/story/money/cars/2015/12/01/emissions-scandal-crushes-volkswagen-sales-november/76605062/). Perhaps we can find unusual results in the emissions data for Volkswagen.

# Or, how to cook your laptop in ten easy steps

<img src="assets/img/LaptopCooling03.png" style="margin: 0px 20px 0px 20px; width:45%"></img>

---
**Applied AI Ltd &copy; 2016**  
<a href='http://www.applied.ai'>applied.ai</a>