## Open Source Tools For Rapid Statistical Model Development

### Overview
* Catching up with Probabilistic Programming
* Planning Models -- causalgraphicalmodels
* Loading Data -- pandas
* Preparing Data -- pandas, Seaborn, Scikit-learn
* Rapid Model Development -- PyMC3
* Beyond the Basics

### Notebook Content:
1. [PP Overview](#Overview)
    1. PP as modeling framework
    2. Emphasis on generalizing predictive models
2. [Plan modeling](#ModelPlan)
3. [Loading NASA's SeaBASS Data](#DataLoad)
4. [Prepare Data for Modeling](#DataPrep)
5. [Bayesian Modeling](#PyMC3)
   1. [Model coding](#writemodel)
   2. [Prior evaluation & Model modification](#priors)
   3. [Model fitting & diagnostics](#fit)
   4. [Model Evaluation](#eval)
       1. [Posterior predictive checks (PPC)](#ppc)
       2. [Out-of-sample (test set) performance](#test)
           1. [Usual suspects: $r^2$, $mae$](#mae)
           2. [Unusual suspect: $deviance$](#deviance)
           3. [Approximating test set performance: WAIC](#waic)
6. [Conclusion](#Conclusion)

In [5]:
import sys

import pandas as pd
import seaborn as sb
import pymc3 as pm
import arviz as ar
from matplotlib import rcParams

In [2]:
rcParams['axes.formatter.limits'] = (-2, 3)
rcParams['axes.titlesize'] = 18
rcParams['axes.labelsize'] = 16
rcParams['font.size'] = 16
rcParams['ytick.labelsize'] = 16
rcParams['xtick.labelsize'] = 16
rcParams['legend.fontsize'] = 16
rcParams['xtick.minor.visible'] = True

In [13]:
print(f'Python: version {sys.version.split("|")[0]}')
print(f'pandas version {pd.__version__}')
print(f'Seaborn version {sb.__version__}')
print(f'PyMC3 version {pm.__version__}')

Python: version 3.7.1 
pandas version 0.24.1
Seaborn version 0.9.0
PyMC3 version 3.6


In [8]:
%matplotlib inline

### Loading and preparing data -- pandas
* the NOMAD dataset
* reading in the data
* getting column names
* extracting desired variables
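
The loading steps listed above can be sketched with pandas as follows. This is only an illustration: the in-memory table stands in for the real NOMAD file, and the column names (`rrs443`, `rrs555`, `chl`) and all values are invented placeholders, not the dataset's actual layout.

```python
import io

import pandas as pd

# Toy stand-in for the real file; column names and values are invented.
raw = io.StringIO(
    "id,lat,lon,rrs443,rrs555,chl\n"
    "1,10.0,-150.0,0.0060,0.0015,0.08\n"
    "2,35.5,-70.2,0.0031,0.0022,0.95\n"
    "3,42.1,-68.9,0.0024,0.0030,3.10\n"
)

df = pd.read_csv(raw)           # reading in
columns = df.columns.tolist()   # getting column names

# Extracting desired variables: every (hypothetical) reflectance column
# plus the (hypothetical) target column.
wanted = [c for c in columns if c.startswith("rrs")] + ["chl"]
d = df[wanted].copy()
```

Selecting columns by a name prefix keeps the extraction step short when a file has many similarly named bands; for a real file the prefix and target name would need to be checked against the actual header.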

### Data Exploration -- pandas, Seaborn and Scikit-learn
* isolated distributions of the predictors
* plotting predictors and the predicted variable against each other
* predictor correlation, multicollinearity and PCA
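
To make the correlation and PCA steps concrete, here is a minimal sketch on synthetic data. The predictors `x1`, `x2`, `x3` are made up, with `x1` and `x2` built to be nearly collinear. PCA is computed here with a plain NumPy SVD so the example has no extra dependencies; it is a stand-in for a library PCA, not the notebook's own code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)

# Synthetic predictors; x1 and x2 share a common signal, x3 is independent.
X = pd.DataFrame({
    "x1": base + 0.1 * rng.normal(size=n),
    "x2": base + 0.1 * rng.normal(size=n),
    "x3": rng.normal(size=n),
})

# Pairwise Pearson correlations; values near 1 flag multicollinearity.
corr = X.corr()

# PCA via SVD of the standardized predictors.
Z = ((X - X.mean()) / X.std(ddof=0)).to_numpy()
_, s, vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)   # fraction of variance per component
scores = Z @ vt.T                 # decorrelated coordinates
```

When two predictors are strongly correlated, most of the variance lands in the first component, and the `scores` columns can serve as uncorrelated inputs to a regression.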

### Modeling -- Probabilistic Programming with PyMC3
* simple Bayesian regression to predict chlorophyll from remote-sensing reflectance (Rrs)
* rapid but transparent model development
* evaluation of priors
* fitting and evaluation of the posterior distribution
* model comparison/selection
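
The idea behind evaluating priors can be shown by simulating from them before any data are seen. The sketch below does this with plain NumPy for a one-predictor linear regression. The priors, the standardized predictor grid `x`, and the helper `prior_predictive` are all assumptions made for illustration, and this is not PyMC3 code. PyMC3 offers `pm.sample_prior_predictive` for the same purpose inside a model context.

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws = 1000
x = np.linspace(-2, 2, 50)   # hypothetical standardized predictor


def prior_predictive(prior_sd):
    """Simulate responses from y ~ Normal(alpha + beta * x, sigma)."""
    alpha = rng.normal(0.0, prior_sd, size=n_draws)      # intercept prior
    beta = rng.normal(0.0, prior_sd, size=n_draws)       # slope prior
    sigma = np.abs(rng.normal(0.0, 1.0, size=n_draws))   # half-normal noise scale
    mu = alpha[:, None] + beta[:, None] * x[None, :]
    return rng.normal(mu, sigma[:, None])                # shape (n_draws, len(x))


wide = prior_predictive(10.0)    # vague priors
tight = prior_predictive(1.0)    # weakly informative priors

# 95% range of simulated responses under each choice of prior.
wide_span = np.percentile(wide, [2.5, 97.5])
tight_span = np.percentile(tight, [2.5, 97.5])
```

Comparing the two spans against the physically plausible range of the response is one way to decide whether a prior needs tightening before fitting.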