# [TITLE] General Data Science/ML Project Template

- Author: Kevin Chuang [@k-chuang](https://github.com/k-chuang)
- Date: 10/07/2018
- Description: A Jupyter notebook template for the steps in solving a data science and/or machine learning problem.
- Dataset: [Link to dataset source]()

----------

## Overview

- **Introduction / Abstract**
- **Load libraries & get data**
    - Split data to training and test set
        - stratified sampling based on certain feature(s) or label(s)
- **Exploratory Data Analysis**
    - Discover and visualize the training data to gain insights
- **Data Preprocessing**
    - Prepare data for ML algorithms
    - Write pipelines using transformers to do automated feature engineering:
        - Data scaling
        - Impute missing data (or remove)
        - Feature extraction
            - Create new dimensions by combining existing ones
        - Feature selection
            - Choose subset of features from the existing features
- **Model Selection & Training**
    - Use K-Folds Cross-Validation to select top 2 to 5 most promising models
        - Do not spend too much time tweaking hyperparameters
    - Typical ML models include kNN, SVM, linear/logistic regression, ensemble methods (RF, XGB), neural networks, etc.
    - [Optional] Save experimental models to pickle file.
- **Model Tuning**
    - `GridSearchCV`, `RandomizedSearchCV`, or `BayesSearchCV`
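        - `GridSearchCV`: brute-force search over every combination in a hyperparameter grid
        - `RandomizedSearchCV`: samples a fixed number of hyperparameter combinations at random
        - `BayesSearchCV`: uses Bayesian optimization to pick the next hyperparameters to try based on the results so far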
- **Model Evaluation**
    - Final evaluation on hold out test set
    - If regression, calculate 95% confidence interval range
        - t score or z score to calculate confidence interval
- **Solution Presentation and/or submission**
    - What I learned, what worked & what did not, what assumptions were made, and what the system's limitations are
    - Create clear visualizations & easy-to-remember statements
- **Deployment**
    - Clean up and concatenate pipelines into a single pipeline that does full data preparation plus final prediction
    - Create programs to monitor & check system's live performance    

## Introduction / Abstract

- Write a paragraph about the project/problem at hand
    - Look at the big picture
    - Frame the problem
        - Business objectives

## Load libraries & data

- Load important libraries
- Load (or acquire) associated data
- Split data into training and test set
    - Based on either feature importance or class imbalance, use *stratified sampling* so that the training set and test set keep the same proportions.
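
A stratified split can be sketched as follows. This is a minimal illustration, not part of the original template: the DataFrame, its column names, and the 90/10 class ratio are made up, and `train_test_split` with `stratify=` is only one of several ways to do this in scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data for illustration only (90% class 0, 10% class 1).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "feature_a": rng.normal(size=1000),
    "feature_b": rng.normal(size=1000),
    "label": np.array([0] * 900 + [1] * 100),
})

X = df.drop(columns="label")
Y = df["label"]

# stratify=Y keeps the class proportions the same in both splits.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=1
)

# Fraction of positive labels in each split; both should be close to 0.10.
train_ratio = Y_train.mean()
test_ratio = Y_test.mean()
print(train_ratio, test_ratio)
```

To stratify on a continuous feature instead of the label, one option is to bin it first (for example with `pd.cut`) and pass the binned column to `stratify`.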

In [None]:
__author__ = 'Kevin Chuang (https://www.github.com/k-chuang)' 

# Version check
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

# linear algebra
import numpy as np 

# data processing
import pandas as pd 

# data visualization
%matplotlib inline
import seaborn as sns
from matplotlib import pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
import xgboost as xgb

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder

# Metrics 
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score

# Model Selection & Hyperparameter tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold
from skopt import BayesSearchCV
from skopt.space  import Real, Categorical, Integer


# Clustering
from sklearn.cluster import KMeans

# Mathematical Functions
import math

# Statistics
from scipy import stats

# Ignore useless warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

## Exploratory Data Analysis (EDA)

- Visualize training data using different kinds of plots
- Plot independent variables (features) against the dependent variable (target label)
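
As a starting point before plotting, a quick numeric pass can show which features are worth visualizing against the target. The sketch below uses synthetic data; the column names `feature_a`, `feature_b`, and `target` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic training data for illustration: target depends on feature_a only.
rng = np.random.default_rng(1)
n = 200
train = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
})
train["target"] = 3.0 * train["feature_a"] + rng.normal(scale=0.5, size=n)

# Summary statistics for every column (count, mean, std, quartiles).
summary = train.describe()

# Pearson correlation of each feature with the target.
target_corr = train.corr()["target"].drop("target")

# Rank features by absolute correlation, strongest first.
strongest = target_corr.abs().sort_values(ascending=False)
print(strongest)

# A scatter plot of the top feature against the target would be the next step, e.g.:
# train.plot(kind="scatter", x=strongest.index[0], y="target", alpha=0.5)
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a useful feature.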

## Data Preprocessing

- Writing pipelines to do automated feature engineering
    - Imputing missing values (or removing values)
    - Scaling data
    - Transforming objects (strings, dates, etc.) to numerical vectors
    - Creating new features
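
The steps above can be sketched with scikit-learn's `Pipeline` and `ColumnTransformer`. This is one possible layout under stated assumptions: the toy DataFrame and its column names (`age`, `income`, `city`) are hypothetical, and the imputation strategies are illustrative defaults rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data with missing values; column names are hypothetical.
X_train = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0],
    "income": [40000.0, 52000.0, np.nan, 61000.0],
    "city": ["NY", "SF", np.nan, "NY"],
})

numeric_cols = ["age", "income"]
categorical_cols = ["city"]

# Numeric columns: fill missing values with the median, then standardize.
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill with the most frequent value, then one-hot encode.
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense array so the result is easy to inspect.
preprocessor = ColumnTransformer(
    [
        ("num", numeric_pipeline, numeric_cols),
        ("cat", categorical_pipeline, categorical_cols),
    ],
    sparse_threshold=0,
)

X_prepared = preprocessor.fit_transform(X_train)
print(X_prepared.shape)
```

Fitting the preprocessor on the training set only, and calling `preprocessor.transform` on the test set, keeps test-set statistics from leaking into the imputation and scaling steps.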

## Model Selection & Training

- Try different models and choose best 2-5 models
    - Use K-Fold cross-validation to validate which models are the best
- Typical ML models include kNN, SVM, linear/logistic regression, ensemble methods (RF, XGB), neural networks, etc.
- [Optional] Save experimental models to pickle file.
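
A minimal sketch of this comparison loop is shown below. The dataset comes from `make_classification`, and the three candidate models, the accuracy metric, and the file name `best_model.pkl` are all illustrative assumptions.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data for illustration only.
X_train, Y_train = make_classification(n_samples=300, n_features=10, random_state=1)

# Candidate models with mostly default hyperparameters (no tuning at this stage).
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "rf": RandomForestClassifier(n_estimators=50, random_state=1),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Mean cross-validated accuracy per model.
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, Y_train, cv=cv, scoring="accuracy")
    mean_scores[name] = scores.mean()

# Model names ordered from best to worst mean score.
ranking = sorted(mean_scores, key=mean_scores.get, reverse=True)
print(ranking, mean_scores)

# Optional: fit the top model on the full training set and save it with pickle.
best_model = candidates[ranking[0]].fit(X_train, Y_train)
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(best_model, f)
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)
```

The temporary directory is used here only so the sketch leaves no files behind; in a real project the pickle would go to a persistent path.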

## Model Tuning

- Tune the hyperparameters of the top chosen model(s)
    - Ideally, use Bayes Optimization `BayesSearchCV` to optimally search for best hyperparameters for the model
        - `BayesSearchCV` is from `skopt` or `scikit-optimize` library (There are many different Bayesian Optimization implementations) 
- Below are some common search spaces for ensemble algorithms (which tend to have a lot of hyperparameters), specifically:
    - Random Forest (Variation of Bagging)
    - xgboost (Gradient Boosting)
    - lightgbm (Gradient Boosting)
        - https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html

In [None]:
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

In [None]:
# Random Forest (Classification Example)

from sklearn.ensemble import RandomForestClassifier

n_features = X_train.shape[1]  # number of input columns; assumes X_train is already defined

rf_search_space = {
    'n_estimators': (100, 600),
    'max_depth': (1, 50),  
    'max_features': (1, n_features),
    'min_samples_leaf': (1, 50),  # integer valued parameter
    'min_samples_split': (2, 50),
}

rf_bayes_tuner = BayesSearchCV(
    estimator=RandomForestClassifier(oob_score=True, random_state=1, n_jobs=2),
    search_spaces=rf_search_space,
    n_iter=20,
    optimizer_kwargs={'base_estimator': 'RF'},
    scoring='neg_log_loss',
    n_jobs=5,
    verbose=0,
    cv = StratifiedKFold(
        n_splits=3,
        shuffle=True,
        random_state=1
    ),
    random_state=1
)


def status_print(result):
    """Status callback during Bayesian hyperparameter search"""
    
    # Get all the models tested so far in DataFrame format
    all_models = pd.DataFrame(rf_bayes_tuner.cv_results_)    
    
    # Get current parameters and the best parameters    
    best_params = pd.Series(rf_bayes_tuner.best_params_)
    print('Model #{}\nBest LogLoss: {}\nBest params: {}\n'.format(
        len(all_models),
        np.round(rf_bayes_tuner.best_score_, 6),
        rf_bayes_tuner.best_params_
    ))
    
    # Save all model results
    clf_name = rf_bayes_tuner.estimator.__class__.__name__
    all_models.to_csv(clf_name + "_cv_results.csv")

    
# Fit the model
result = rf_bayes_tuner.fit(X_train.values, Y_train.values, callback=status_print)

In [None]:
# XGB (Classification Example)
import xgboost as xgb

xgb_search_space = { 
        # log-uniform: understand as search over p = exp(x) by varying x
        'learning_rate': (0.01, 1.0, 'log-uniform'),
        'max_depth': (1, 100),
        'max_delta_step': (0, 20),
        'subsample': (0.01, 1.0, 'uniform'),
        'colsample_bytree': (0.01, 1.0, 'uniform'),
        'colsample_bylevel': (0.01, 1.0, 'uniform'),
        'reg_lambda': (1e-9, 1000, 'log-uniform'),
        'reg_alpha': (1e-9, 1.0, 'log-uniform'),
        'gamma': (1e-9, 0.5, 'log-uniform'),
        'min_child_weight': (0, 5),
        'n_estimators': (50, 500),
        'scale_pos_weight': (1e-6, 500, 'log-uniform')
}

xgb_bayes_tuner = BayesSearchCV(
    estimator = xgb.XGBClassifier(
        n_jobs = 3,
        objective = 'multi:softprob',
        eval_metric = 'mlogloss',
        verbosity=0,  # replaces the deprecated `silent` flag in recent xgboost releases
        random_state=1
    ),
    search_spaces = xgb_search_space,    
    scoring = 'neg_log_loss',
    cv = StratifiedKFold(
        n_splits=3,
        shuffle=True,
        random_state=1
    ),
    n_jobs = 6,
    n_iter = 20,   
    verbose = 0,
    refit = True,
    random_state = 1
)

def status_print(result):
    """Status callback during bayesian hyperparameter search"""
    
    # Get all the models tested so far in DataFrame format
    all_models = pd.DataFrame(xgb_bayes_tuner.cv_results_)    
    
    # Get current parameters and the best parameters    
    best_params = pd.Series(xgb_bayes_tuner.best_params_)
    print('Model #{}\nBest Log Loss: {}\nBest params: {}\n'.format(
        len(all_models),
        np.round(xgb_bayes_tuner.best_score_, 8),
        xgb_bayes_tuner.best_params_
    ))
    
    # Save all model results
    clf_name = xgb_bayes_tuner.estimator.__class__.__name__
    all_models.to_csv(clf_name + "_cv_results.csv")

# Fit the model
result = xgb_bayes_tuner.fit(X_train.values, Y_train.values, callback=status_print)

In [None]:
# LGB (Regression Example)

import lightgbm as lgb

lgb_search_space  = {
    'max_depth': (3, 10),
    'num_leaves': (6, 30),
    'min_child_samples': (50, 200),
    'subsample': (0.5, 1.0, 'uniform'),
    'colsample_bytree': (0.01, 1.0, 'uniform'),
    'reg_lambda': (1e-9, 1000, 'log-uniform'),
    'reg_alpha': (1e-9, 1.0, 'log-uniform'),
    'n_estimators': (50, 500),
    # 'scale_pos_weight' is omitted: LightGBM uses it only for binary and multiclassova objectives
    'learning_rate': (0.01, 0.2, 'uniform')
}


lgb_bayes_tuner = BayesSearchCV(
    estimator = lgb.LGBMRegressor(
        n_jobs = 3,
        boosting_type="gbdt",
        objective = 'regression',
        verbose=-1,  # replaces `silent`, which newer lightgbm releases no longer accept
        random_state=1
    ),
    search_spaces = lgb_search_space,    
    scoring = 'neg_mean_squared_error',
    cv = 3,
    n_jobs = 3,
    n_iter = 20,   
    verbose = 3,
    refit = True,
    random_state = 1
)

def status_print(result):
    """Status callback during bayesian hyperparameter search"""
    
    # Get all the models tested so far in DataFrame format
    all_models = pd.DataFrame(lgb_bayes_tuner.cv_results_)    
    
    # Get current parameters and the best parameters    
    best_params = pd.Series(lgb_bayes_tuner.best_params_)
    print('Model #{}\nBest score (neg. MSE): {}\nBest params: {}\n'.format(
        len(all_models),
        np.round(lgb_bayes_tuner.best_score_, 8),
        lgb_bayes_tuner.best_params_
    ))
    
    # Save all model results
    clf_name = lgb_bayes_tuner.estimator.__class__.__name__
    all_models.to_csv(clf_name + "_cv_results.csv")

lgb_bayes_tuner.fit(housing_prepared, housing_labels, callback=status_print)

## Model Evaluation

- Final evaluation on the test set
- Calculate confidence intervals using t-scores or z-scores to report a range of values at a chosen confidence level
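
For a regression model, one common way to do this is to build an interval around the mean squared error on the test set and take its square root to get a range for the RMSE. The sketch below assumes that approach; `y_test` and `final_predictions` are synthetic stand-ins for the real test labels and model predictions.

```python
import numpy as np
from scipy import stats

# Synthetic test labels and predictions for illustration only.
rng = np.random.default_rng(1)
y_test = rng.normal(loc=100.0, scale=20.0, size=500)
final_predictions = y_test + rng.normal(scale=5.0, size=500)

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
mean_se = squared_errors.mean()
rmse = np.sqrt(mean_se)

# t-score interval around the mean squared error, converted to RMSE units.
m = len(squared_errors)
t_low, t_high = stats.t.interval(
    confidence, m - 1, loc=mean_se, scale=stats.sem(squared_errors)
)
ci_rmse_t = np.sqrt([t_low, t_high])

# z-score alternative; nearly identical to the t interval for large samples.
z_score = stats.norm.ppf((1 + confidence) / 2)
z_margin = z_score * stats.sem(squared_errors)
ci_rmse_z = np.sqrt([mean_se - z_margin, mean_se + z_margin])

print(rmse, ci_rmse_t, ci_rmse_z)
```

The t interval is the safer default for small test sets, since it is slightly wider than the z interval when few samples are available.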