Use this template for exploratory model development. Modify the cells below to suit your needs, but follow the "input" and "output" object naming conventions where indicated.

# Project title
* Author: `<your name here>`

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#Local-functions" data-toc-modified-id="Local-functions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Local functions</a></span></li><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Load Data</a></span><ul class="toc-item"><li><span><a href="#Import-raw-data" data-toc-modified-id="Import-raw-data-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Import raw data</a></span></li><li><span><a href="#Combine-to-create-a-single-raw-data-frame" data-toc-modified-id="Combine-to-create-a-single-raw-data-frame-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Combine to create a single raw data frame</a></span></li></ul></li><li><span><a href="#Process-data" data-toc-modified-id="Process-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Process data</a></span><ul class="toc-item"><li><span><a href="#Apply-pre-processing" data-toc-modified-id="Apply-pre-processing-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Apply pre-processing</a></span></li></ul></li><li><span><a href="#Feature-Engineering" data-toc-modified-id="Feature-Engineering-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Feature Engineering</a></span><ul class="toc-item"><li><span><a href="#Split-data-into-train-and-test-set" data-toc-modified-id="Split-data-into-train-and-test-set-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Split data into train and test set</a></span></li><li><span><a href="#Create-features" data-toc-modified-id="Create-features-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Create features</a></span></li><li><span><a href="#Combine-features-into-Train-and-Test-sets.-Create-outcome-label-Train-and-Test-sets" 
data-toc-modified-id="Combine-features-into-Train-and-Test-sets.-Create-outcome-label-Train-and-Test-sets-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Combine features into Train and Test sets. Create outcome label Train and Test sets</a></span></li></ul></li><li><span><a href="#Model-Development" data-toc-modified-id="Model-Development-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Model Development</a></span><ul class="toc-item"><li><span><a href="#Train-the-model" data-toc-modified-id="Train-the-model-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Train the model</a></span></li><li><span><a href="#OPTIONAL:-Load-previously-saved-model-artefact-dictionary-from-pickle-file" data-toc-modified-id="OPTIONAL:-Load-previously-saved-model-artefact-dictionary-from-pickle-file-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span><em>OPTIONAL:</em> Load previously saved model artefact dictionary from pickle file</a></span></li><li><span><a href="#Evaluate-model-performance" data-toc-modified-id="Evaluate-model-performance-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Evaluate model performance</a></span></li><li><span><a href="#Interpret-model-predictions" data-toc-modified-id="Interpret-model-predictions-6.4"><span class="toc-item-num">6.4&nbsp;&nbsp;</span>Interpret model predictions</a></span></li><li><span><a href="#Save-model-artefacts" data-toc-modified-id="Save-model-artefacts-6.5"><span class="toc-item-num">6.5&nbsp;&nbsp;</span>Save model artefacts</a></span></li><li><span><a href="#Optional:-Persist-model-data-sets-in-S3/Redshift" data-toc-modified-id="Optional:-Persist-model-data-sets-in-S3/Redshift-6.6"><span class="toc-item-num">6.6&nbsp;&nbsp;</span><em>Optional:</em> Persist model data sets in S3/Redshift</a></span></li></ul></li><li><span><a href="#Appendix-1---Environment-Configuration" data-toc-modified-id="Appendix-1---Environment-Configuration-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Appendix 1 - Environment 
Configuration</a></span></li><li><span><a href="#Appendix-2---Automated-Tests" data-toc-modified-id="Appendix-2---Automated-Tests-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Appendix 2 - Automated Tests</a></span></li></ul></div>

## Setup
* Import standard and project-specific packages
* Specify global settings and configuration

In [14]:
# Standard Python packages
import os
import sys

# Third-party packages
import numpy as np
import pandas as pd
from sklearn import model_selection

# Project packages
from helpers import aws_helpers as aws
# import nabds

# Setup global settings 
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Local functions
* Functions for re-use should be incorporated into project package source code or common functions in src folder

## Load Data 
* For each data source, load data and apply scope
* See templates ["load_data_from_redshift.ipynb"](../../NDC_data_science/notebooks/template_load_data_from_redshift.ipynb), ["load_data_from_s3.ipynb"](../../NDC_data_science/notebooks/template_load_data_from_s3.ipynb) for loading data on NAB Discovery Cloud
* Where applicable, apply high level filters, e.g. selected columns, rows 
* **Input:** source data
* **Output:** *raw_all*: data frame with raw data attributes and outcome values

In [38]:
# Import raw data

# Retain required columns 
# COLUMN_NAMES = ['A','B']

# Drop rows with NaN in all columns


# Combine raw data into a data frame
raw_all = pd.DataFrame({'A': range(10,15),'B': np.random.randn(5), 'outcome': [0,1,0,0,0]})
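
As a minimal sketch of the placeholder steps in the cell above, the snippet below uses a small in-memory frame in place of a real Redshift or S3 source. The `source` frame, its values, the `unused` column and the contents of `COLUMN_NAMES` are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Stand-in for a real data source; all values are made up for illustration
source = pd.DataFrame({
    'A': [10, 11, np.nan, 13, 14],
    'B': [0.5, -1.2, np.nan, 0.3, 2.1],
    'unused': list('vwxyz'),
    'outcome': [0, 1, np.nan, 0, 0],
})

# Retain required columns
COLUMN_NAMES = ['A', 'B', 'outcome']
raw_all = source[COLUMN_NAMES]

# Drop rows with NaN in all (retained) columns
raw_all = raw_all.dropna(how='all').reset_index(drop=True)

print(raw_all.shape)  # (4, 3)
```

Selecting columns before calling `dropna(how='all')` matters here: the third row is only "all NaN" once the `unused` column has been removed.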

## Process data
* Pre-process and cleanse the raw data frame, e.g. impute or discard missing values, text pre-processing (stop words, lemmatisation etc.)
* **Input:** labelled data frame *raw_all* 
* **Output:** *processed*: data frame with pre-processed data attributes and outcome label

In [39]:
processed = raw_all.copy()

# Apply pre-processing
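
The pre-processing step might look something like the sketch below, which discards unlabelled rows, imputes a numeric column and tidies a text column. The column names (`A`, `text`) and the choice of median imputation are assumptions for illustration, not part of the template.

```python
import numpy as np
import pandas as pd

# Illustrative raw frame; column names and values are assumptions, not project data
raw_all = pd.DataFrame({
    'A': [10, 11, np.nan, 13, 14],
    'text': ['  Hello World ', 'FOO bar', None, 'Baz', 'qux  '],
    'outcome': [0, 1, 0, np.nan, 0],
})

processed = raw_all.copy()

# Discard rows without an outcome label
processed = processed.dropna(subset=['outcome']).copy()

# Impute missing numeric values with the column median
processed['A'] = processed['A'].fillna(processed['A'].median())

# Basic text clean-up: fill gaps, trim whitespace, lower-case
processed['text'] = processed['text'].fillna('').str.strip().str.lower()

processed = processed.reset_index(drop=True)
```

Computing the median on the full frame is a simplification. A stricter approach would learn imputation values from the Train set only, after the split.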


## Feature Engineering
* Split pre-processed data frame into Train and Test sets
* Create features, e.g. one-hot encoding of categorical variables, binning numeric variables, text features via tokenisation and term frequency matrices
* **Input:** *processed*: Pre-processed data frame
* **Output:** Train and Test sets, each split into label and feature sets, i.e. 2 feature data frames, 2 outcome series:
    * *X_train*
    * *y_train*
    * *X_test*
    * *y_test*

In [40]:
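# Split data into train and test sets and create outcome sets.
# The outcome column is dropped from the features to avoid label leakage.
X_train, X_test, y_train, y_test = (
    model_selection.train_test_split(processed.drop(columns=['outcome']),
                                     processed['outcome'],
                                     test_size=0.33,
                                     random_state=42))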

# Create features
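
One possible way to fill in the feature step is sketched below: one-hot encode a categorical column and bin a numeric one, then align the Test columns to those seen in Train. The `age` and `segment` columns, the bin edges and the `create_features` helper are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# Illustrative pre-processed frame; columns and values are assumptions
processed = pd.DataFrame({
    'age': [23, 35, 47, 52, 61, 29, 44, 38],
    'segment': ['a', 'b', 'a', 'c', 'b', 'a', 'c', 'b'],
    'outcome': [0, 1, 0, 1, 1, 0, 1, 0],
})

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    processed.drop(columns=['outcome']), processed['outcome'],
    test_size=0.25, random_state=42)


def create_features(df, columns=None):
    """One-hot encode `segment` and bin `age`; align to `columns` if given."""
    features = pd.get_dummies(df['segment'], prefix='segment')
    # Integer bin codes for (0, 30], (30, 50] and (50, inf)
    features['age_band'] = pd.cut(df['age'], bins=[0, 30, 50, np.inf], labels=False)
    if columns is not None:
        # Add columns missing from this set as zeros, drop columns not in `columns`
        features = features.reindex(columns=columns, fill_value=0)
    return features


X_train = create_features(X_train)
X_test = create_features(X_test, columns=X_train.columns)
```

Aligning Test to the Train columns keeps the two feature frames the same shape even when a category appears in only one of them.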


## Model Development
* Train one or more models on the Train data set
* Evaluate model performance on both the Train and Test sets
* Interpret model predictions (optional)
* Save model artefacts in a dictionary as a pickle file, and optionally data sets in Redshift
* **Input:** 4 arrays with features and labels for Train and Test sets: *X_train, X_test, y_train, y_test*
* **Output:** 
    * a dictionary *model_output* with model artefacts (e.g. trained model object, Train and Test sets, predictions)
    * a pickle file *<`model_name`>_model_output_<`timestamp`>.pickle* containing the dictionary
    * file/s in S3 or table/s in Redshift with model data sets (optional)

### Train the model

In [48]:
# Build a classifier/regressor


# Train the classifier/regressor
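
A minimal sketch of the two steps above, assuming a binary outcome and using scikit-learn's `LogisticRegression` purely as an example estimator. The synthetic `X_train` and `y_train` stand in for the sets produced in the Feature Engineering section.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for X_train / y_train (assumption: binary outcome)
rng = np.random.RandomState(42)
X_train = rng.randn(100, 3)
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)

# Build a classifier (logistic regression chosen only for illustration)
model = LogisticRegression()

# Train the classifier
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
print(train_accuracy)
```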


### Evaluate model performance
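
A sketch of evaluating on both the Train and Test sets, assuming a binary classifier. The synthetic data, the logistic regression model and the choice of accuracy and ROC AUC as metrics are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn import metrics, model_selection
from sklearn.linear_model import LogisticRegression

# Synthetic data and model standing in for the objects built earlier in the notebook
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] - X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# Compute the same metrics on both sets so they can be compared side by side
rows = []
for name, X_part, y_part in [('train', X_train, y_train), ('test', X_test, y_test)]:
    predicted = model.predict(X_part)
    scores = model.predict_proba(X_part)[:, 1]
    rows.append({'set': name,
                 'accuracy': metrics.accuracy_score(y_part, predicted),
                 'roc_auc': metrics.roc_auc_score(y_part, scores)})

performance = pd.DataFrame(rows).set_index('set')
print(performance)
```

A large gap between the Train and Test rows is a common sign of over-fitting.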

### Interpret model predictions
* e.g. Feature weights/importance
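
For a linear model, feature weights can be read from the fitted coefficients, as in the sketch below. The feature names and synthetic data are hypothetical. Tree-based scikit-learn estimators expose `feature_importances_` instead of `coef_`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic features with a known strong, weak and irrelevant signal (illustration only)
feature_names = ['f_strong', 'f_weak', 'f_noise']
rng = np.random.RandomState(1)
X_train = pd.DataFrame(rng.randn(300, 3), columns=feature_names)
y_train = (3 * X_train['f_strong'] + 0.5 * X_train['f_weak'] > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# Rank features by absolute coefficient, keeping the sign for interpretation
weights = pd.Series(model.coef_[0], index=feature_names)
ranked = weights.reindex(weights.abs().sort_values(ascending=False).index)
print(ranked)
```

Coefficients are only comparable across features when the features are on similar scales, as they are in this synthetic example.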

### Store / Load the model objects

In [51]:
# Persist the model artefacts in a dictionary as a pickle file in S3
model_output = {}

# Connect to S3, serialise as a pickle file, store in S3


# Persist model train and test data frames in Redshift or S3
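
The serialisation step might be sketched as follows. The interface of the project's `aws_helpers` module is not shown in this notebook, so the upload is left as a commented-out call to boto3's `put_object`. The model name, bucket and key prefix are hypothetical, and the dictionary contents are placeholders.

```python
import pickle
from datetime import datetime

model_name = 'example_model'  # hypothetical model name
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
file_name = '{}_model_output_{}.pickle'.format(model_name, timestamp)

# Placeholder artefacts; in practice these hold the trained model and data sets
model_output = {
    'model': None,  # replace with the trained model object
    'X_train': [[10, 0.1], [11, -0.4]],
    'y_train': [0, 1],
    'predictions': [0, 1],
}

# Serialise the dictionary to bytes
payload = pickle.dumps(model_output)

# Upload sketch (not executed here); bucket and key prefix are hypothetical
# import boto3
# boto3.client('s3').put_object(Bucket='my-bucket', Key='models/' + file_name, Body=payload)

# Loading a previously saved dictionary is the reverse operation
restored = pickle.loads(payload)
```

Only unpickle files from trusted sources, since loading a pickle can execute arbitrary code.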



## Appendix 1 - Environment Configuration

In [15]:
# System configuration
print(os.getcwd())
print(sys.version)
print(sys.executable)
print(sys.path)

c:\Temp\NDC_data_science\notebooks
3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]
C:\Program Files (x86)\Python\Anaconda\envs\p3\python.exe
['c:\\Temp\\NDC_data_science\\src', '', 'C:\\Program Files (x86)\\Python\\Anaconda\\envs\\p3\\python36.zip', 'C:\\Program Files (x86)\\Python\\Anaconda\\envs\\p3\\DLLs', 'C:\\Program Files (x86)\\Python\\Anaconda\\envs\\p3\\lib', 'C:\\Program Files (x86)\\Python\\Anaconda\\envs\\p3', 'C:\\Program Files (x86)\\Python\\Anaconda\\envs\\p3\\lib\\site-packages', 'C:\\Program Files (x86)\\Python\\Anaconda\\envs\\p3\\lib\\site-packages\\Babel-2.5.0-py3.6.egg', 'C:\\Program Files (x86)\\Python\\Anaconda\\envs\\p3\\lib\\site-packages\\win32', 'C:\\Program Files (x86)\\Python\\Anaconda\\envs\\p3\\lib\\site-packages\\win32\\lib', 'C:\\Program Files (x86)\\Python\\Anaconda\\envs\\p3\\lib\\site-packages\\Pythonwin', 'C:\\Program Files (x86)\\Python\\Anaconda\\envs\\p3\\lib\\site-packages\\IPython\\extensions', 'C:\

In [16]:
# Python packages and versions
!pip freeze

absl-py==0.6.1
alabaster==0.7.10
anaconda-client==1.6.5
anaconda-navigator==1.6.9
anaconda-project==0.8.0
arrow==0.13.0
asn1crypto==0.22.0
astor==0.7.1
astroid==1.5.3
astropy==2.0.2
atomicwrites==1.2.1
attrs==18.2.0
autocorrect==0.3.0
autopep8==1.3.4
babel==2.5.0
backports.shutil-get-terminal-size==1.0.0
beautifulsoup4==4.6.0
binaryornot==0.4.4
bitarray==0.8.1
bkcharts==0.2
blaze==0.11.3
bleach==2.0.0
bokeh==0.12.10
boto==2.48.0
boto3==1.6.4
botocore==1.9.4
Bottleneck==1.2.1
CacheControl==0.12.3
cachetools==2.0.1
cchardet==2.1.1
certifi==2018.1.18
cffi==1.10.0
chardet==3.0.4
click==6.7
cloudpickle==0.4.0
clyent==1.2.2
colorama==0.4.1
comtypes==1.1.2
contextlib2==0.5.5
cookiecutter==1.6.0
cryptography==2.0.3
cx-Oracle==6.1
cycler==0.10.0
cymem==1.31.2
Cython==0.26.1
cytoolz==0.8.2
dash==0.18.3
dash-core-components==0.12.6
dash-html-components==0.7.0
dash-renderer==0.10.0
dask==0.17.1
datashape==0.5.4
decorator==4.3.2
defusedxml==0.5.0
dill==0.2.7.1
distlib==0.2.5
distributed==1.19.1
doc

You are using pip version 9.0.1, however version 18.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


## Appendix 2 - Automated Tests

In [17]:
# Run tests within notebook
f_path = os.getcwd()
os.chdir(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))

# Run pytest from the repository root
!pytest

os.chdir(f_path)

platform win32 -- Python 3.6.3, pytest-4.1.1, py-1.7.0, pluggy-0.8.1
rootdir: c:\Temp\NDC_data_science, inifile:
collected 5 items

tests\examplepackage\examplemodule\test_add_value_to_numpy.py ...        [ 60%]
tests\examplepackage\examplemodule\test_hello_world.py ..                [100%]

c:\program files (x86)\python\anaconda\envs\p3\lib\site-packages\urllib3\contrib\pyopenssl.py:46
    import OpenSSL.SSL



# End