# [TITLE] General Data Science/ML Project Template

- Author: Kevin Chuang [@k-chuang](https://github.com/k-chuang)
- Date: 10/07/2018
- Description: A Jupyter notebook template outlining the steps for solving a data science and/or machine learning problem.
- Dataset: [Link to dataset source]()

----------

## Overview

- **Introduction / Abstract**
- **Load libraries & get data**
    - Split data into training and test sets
        - Use stratified sampling based on certain feature(s) or label(s)
- **Exploratory Data Analysis**
    - Discover and visualize the training data to gain insights
- **Data Preprocessing**
    - Prepare data for ML algorithms
    - Write pipelines using transformers to do automated feature engineering:
        - Scale data
        - Impute missing data (or remove)
        - Feature creation
- **Model Selection & Training**
    - Use k-fold cross-validation to select the 2 to 5 most promising models
        - Do not spend too much time tweaking hyperparameters at this stage
    - Typical ML models include kNN, SVM, linear/logistic regression, ensemble methods (RF, XGB), neural networks, etc.
    - [Optional] Save experimental models to a pickle file.
- **Model Tuning**
    - `GridSearchCV`, `RandomizedSearchCV`, or `BayesSearchCV`
        - `GridSearchCV`: exhaustive (brute-force) search over a fixed grid of hyperparameter values
        - `BayesSearchCV`: Bayesian optimization that uses the results of earlier trials to choose which hyperparameters to try next
- **Model Evaluation**
    - Final evaluation on the hold-out test set
    - For regression, calculate a 95% confidence interval for the error metric
        - Use a t-score or z-score to calculate the interval
- **Solution Presentation and/or submission**
    - What I learned, what worked and what did not, what assumptions were made, and what the system's limitations are
    - Create clear visualizations & easy-to-remember statements
- **Deployment**
    - Clean up and concatenate pipelines into a single pipeline that does full data preparation plus final prediction
    - Create programs to monitor & check the system's live performance

## Introduction / Abstract

- Write a paragraph about the project/problem at hand
    - Look at the big picture
    - Frame the problem
        - Business objectives

## Load libraries & data

- Load important libraries
- Load (or acquire) associated data
- Split data into training and test sets
    - When a feature is especially important or the classes are imbalanced, use *stratified sampling* so that the training and test sets keep the same proportions of that feature or label.

In [None]:
__author__ = 'Kevin Chuang (https://www.github.com/k-chuang)' 

# linear algebra
import numpy as np 

# data processing
import pandas as pd 

# data visualization
%matplotlib inline
import seaborn as sns
from matplotlib import pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
import xgboost as xgb

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder

# Metrics
from sklearn.metrics import log_loss

# Model Selection & Hyperparameter tuning
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    StratifiedKFold,
    cross_val_score,
)
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

# Clustering
from sklearn.cluster import KMeans

# Mathematical Functions
import math

# Statistics
from scipy import stats
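
The stratified split described in this section can be sketched as follows. This is a minimal illustration, not a prescribed recipe: the toy DataFrame, its column names, and the 80/20 class ratio are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset with an imbalanced binary label (80% / 20%).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
    "label": [0] * 160 + [1] * 40,
})

X = df.drop(columns="label")
y = df["label"]

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

To stratify on a continuous feature instead of a label, one option is to bin it first (for example with `pd.cut`) and pass the binned column to `stratify`.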

## Exploratory Data Analysis (EDA)

- Visualize training data using different kinds of plots
- Plot independent variables (features) against the dependent variable (target label)
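
One possible starting point for the plots above is a scatter plot of each feature against the target, plus a correlation summary. The `train` DataFrame, its column names, and the synthetic relationship between `feature_a` and `target` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical training data: two numeric features and a numeric target.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.uniform(size=100),
})
train["target"] = 2.0 * train["feature_a"] + rng.normal(scale=0.5, size=100)

features = [col for col in train.columns if col != "target"]

# One scatter plot per feature, with the target on the y-axis.
fig, axes = plt.subplots(1, len(features), figsize=(5 * len(features), 4))
for ax, col in zip(np.atleast_1d(axes), features):
    ax.scatter(train[col], train["target"], alpha=0.6)
    ax.set_xlabel(col)
    ax.set_ylabel("target")
fig.tight_layout()

# Linear correlation of each feature with the target, as a numeric summary.
correlations = train[features].corrwith(train["target"])
print(correlations.sort_values(ascending=False))
```

For a categorical target, box plots or histograms grouped by class are usually a better fit than scatter plots.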

## Data Preprocessing

- Writing pipelines to do automated feature engineering
    - Imputing missing values (or removing values)
    - Scaling data
    - Transforming objects (strings, dates, etc.) to numerical vectors
    - Creating new features
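
The steps above can be combined into a single scikit-learn pipeline. The sketch below is one way to do it under stated assumptions: the `raw` DataFrame, its column names, and the `add_income_per_person` helper are hypothetical and exist only to show imputation, feature creation, scaling, and encoding working together.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical raw data with missing values and one categorical column.
raw = pd.DataFrame({
    "income": [40.0, np.nan, 85.0, 60.0],
    "household_size": [2.0, 4.0, np.nan, 3.0],
    "city": ["A", "B", np.nan, "A"],
})


def add_income_per_person(X):
    """Feature creation: append income / household_size as a new column."""
    ratio = X[:, [0]] / X[:, [1]]
    return np.hstack([X, ratio])


# Numeric columns: impute, create the new feature, then scale.
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("create", FunctionTransformer(add_income_per_person)),
    ("scale", StandardScaler()),
])

# Categorical columns: impute, then convert strings to one-hot vectors.
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipeline, ["income", "household_size"]),
    ("cat", categorical_pipeline, ["city"]),
])

prepared = preprocess.fit_transform(raw)
print(prepared.shape)
```

Fitting the transformers on the training set only, then calling `preprocess.transform` on the test set, avoids leaking test-set statistics into the imputation and scaling steps.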

## Model Selection & Training

- Try several different models and shortlist the best 2-5
    - Use k-fold cross-validation to compare the models
- Typical ML models include kNN, SVM, linear/logistic regression, ensemble methods (RF, XGB), neural networks, etc.
- [Optional] Save experimental models to a pickle file.
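
A rough sketch of this shortlisting step is shown below. The synthetic dataset, the three candidate models, and the accuracy metric are assumptions chosen to keep the example small; a real project would substitute its own prepared data and a metric that matches the problem.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a prepared training set.
X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Mean cross-validated accuracy per model, using near-default hyperparameters.
mean_scores = {
    name: cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Rank models from best to worst and keep the top two for tuning.
ranking = sorted(mean_scores, key=mean_scores.get, reverse=True)
shortlist = ranking[:2]
print(mean_scores)
print(shortlist)

# [Optional] Serialize a fitted candidate. pickle.dumps keeps the bytes in
# memory; pickle.dump would write them to an open file instead.
fitted = candidates[shortlist[0]].fit(X_train, y_train)
restored = pickle.loads(pickle.dumps(fitted))
```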

## Model Tuning

- Tune the hyperparameters of the shortlisted model(s)
    - Ideally, use Bayesian optimization (`BayesSearchCV`) to search for the best hyperparameters more efficiently than an exhaustive grid
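
As a minimal sketch of the tuning step, the example below uses `GridSearchCV` so that it depends on scikit-learn alone. This is a substitution made for the example, not the approach recommended above: `BayesSearchCV` from scikit-optimize follows the same `fit` / `best_params_` pattern, but takes search spaces (`Real`, `Integer`, `Categorical`) rather than a fixed grid. The dataset and the small grid are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for a prepared training set.
X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)

# Deliberately small grid so the search finishes quickly.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)

# With the default refit=True, the best configuration is refit on all of X_train.
best_model = search.best_estimator_
```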

## Model Evaluation

- Run the final evaluation on the test set
- Calculate confidence intervals using t-scores or z-scores to report a range of values at a stated confidence level
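
One common way to get such an interval for a regression model is to build a t-based interval around the mean squared error and take its square root. The sketch below assumes RMSE as the metric and uses made-up `y_test` and `y_pred` arrays in place of real predictions.

```python
import numpy as np
from scipy import stats

# Hypothetical test-set targets and predictions from a regression model.
rng = np.random.default_rng(0)
y_test = rng.normal(loc=50.0, scale=10.0, size=200)
y_pred = y_test + rng.normal(scale=3.0, size=200)

confidence = 0.95
squared_errors = (y_pred - y_test) ** 2
rmse = np.sqrt(squared_errors.mean())

# t-based interval for the mean squared error, then square-rooted to give
# an interval on the RMSE scale.
mse_low, mse_high = stats.t.interval(
    confidence,
    len(squared_errors) - 1,
    loc=squared_errors.mean(),
    scale=stats.sem(squared_errors),
)
rmse_low, rmse_high = np.sqrt(mse_low), np.sqrt(mse_high)
print(rmse, (rmse_low, rmse_high))
```

With a large test set, `stats.norm.interval` gives the z-score version, and the two intervals are nearly identical.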