## Introduction
---

Authors: Tina Pai, Ravindra Thanniru, Walter Lai, and Jamie Vo


### General Questions:
1. Are the model comparisons accurate when different methods are used for parameter tuning?

In [5]:
# import libraries
import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn.metrics import log_loss, accuracy_score
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import random

## Data Cleaning
---

In [6]:
random.seed(123)

In [7]:
df = pd.read_csv('../Data/case_8.csv')

In [8]:
df.head(2)

Unnamed: 0,ID,target,v1,v2,v3,v4,v5,v6,v7,v8,...,v122,v123,v124,v125,v126,v127,v128,v129,v130,v131
0,3,1,1.335739,8.727474,C,3.921026,7.915266,2.599278,3.176895,0.012941,...,8.0,1.98978,0.035754,AU,1.804126,3.113719,2.024285,0,0.636365,2.857144
1,4,1,1.630686,7.464411,C,4.145098,9.191265,2.436402,2.483921,2.30163,...,6.822439,3.549938,0.598896,AF,1.672658,3.239542,1.957825,0,1.925763,1.739389


## Parameter Tuning
---

[Random Search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) is used as opposed to a grid search because of the lower computation time required. While a grid search is more thorough, a randomized search typically yields comparable metrics at a fraction of the cost.
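As a rough illustration of the idea, the sketch below runs a randomized search over a small random forest. The synthetic data, the search space, and the names `X_demo`, `y_demo`, and `param_distributions` are assumptions made for this example and are not the settings used in the study.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the case-study dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=123)

# Hypothetical search space, chosen only for illustration
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(2, 6),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=123),
    param_distributions=param_distributions,
    n_iter=5,                # sample 5 candidates instead of trying the full grid
    cv=3,
    scoring="neg_log_loss",  # scikit-learn maximizes scores, hence the negated loss
    random_state=123,
)
search.fit(X_demo, y_demo)
print(search.best_params_, -search.best_score_)
```

The `n_iter` argument is what bounds the cost: a full grid over the same ranges would evaluate 160 combinations, while this search evaluates 5.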

### Extreme Gradient Boosting (XGBoost)
---
[xgboost.cv](https://xgboost.readthedocs.io/en/latest/python/python_api.html) is the cross-validation method used, as opposed to a separate K-fold procedure; it performs the fold splitting internally (see its `nfold` argument).


XGBoost is an algorithm that takes a collection of weak learners and creates a strong learner. Weak learners are models that perform only slightly better than a random guess. In each round, the model attempts to improve on the prediction from the previous round. At a high level, XGBoost fits a model, computes the errors (residuals), and then fits a new model using those errors as the new target. Every round therefore has a new error target, and this continues until the residuals look random.

Because each round only needs the previous errors as its target, in principle any model can be fitted using this method. Weak learners are created by predicting on a small portion of the data as opposed to all of the unknowns.
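The round-by-round idea can be sketched with plain gradient boosting on squared error. This is a simplified illustration, not XGBoost's actual implementation: it omits regularization and the second-order loss approximation, and the toy data, the depth-1 trees, and the names `X_toy`, `y_toy`, and `learning_rate` are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(123)
X_toy = rng.uniform(-3, 3, size=(200, 1))               # toy feature (assumption)
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, 200)   # toy target (assumption)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())          # round 0: a constant guess
mse_per_round = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    residuals = y_toy - prediction                      # errors of the current ensemble
    stump = DecisionTreeRegressor(max_depth=1)          # weak learner: a single split
    stump.fit(X_toy, residuals)                         # the residuals become the new target
    prediction += learning_rate * stump.predict(X_toy)  # small step toward the correction
    mse_per_round.append(np.mean((y_toy - prediction) ** 2))

print(mse_per_round[0], mse_per_round[-1])
```

Each stump on its own is a poor model, but the training error shrinks as the corrections accumulate.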

XGBoost models are strongly susceptible to overfitting. To counter this, regularization is enabled by default.


#### Loss Calculation
XGBoost uses an approximate loss calculation: the loss is replaced by its second-order Taylor expansion around the current prediction. As long as a loss function has first- and second-order partial derivatives, XGBoost is applicable.
<img src="../Images/Loss_formula.png" alt="Loss Formula" style="width: 400px;"/>
<img src="../Images/Penalty_formula.png" alt="Penalty Formula" style="width: 200px;"/>
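For the binary logistic loss used in this study, both derivatives have a simple closed form with respect to the raw (pre-sigmoid) score: the gradient is `p - y` and the second derivative is `p * (1 - p)`. The sketch below illustrates this and checks the gradient numerically. The function names and toy values are assumptions made for the example; this is not XGBoost's internal code.

```python
import numpy as np

def sigmoid(margin):
    return 1.0 / (1.0 + np.exp(-margin))

def logloss(margin, y_true):
    """Per-sample log loss as a function of the raw margin."""
    p = sigmoid(margin)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def logloss_grad_hess(margin, y_true):
    """First and second derivatives of log loss with respect to the raw margin."""
    p = sigmoid(margin)
    grad = p - y_true
    hess = p * (1.0 - p)
    return grad, hess

margin = np.array([0.0, 2.0, -1.0])   # toy raw scores (assumption)
y_true = np.array([1.0, 0.0, 1.0])    # toy labels (assumption)
grad, hess = logloss_grad_hess(margin, y_true)

# Compare the analytic gradient with a central finite difference
eps = 1e-5
numeric_grad = (logloss(margin + eps, y_true) - logloss(margin - eps, y_true)) / (2 * eps)
print(np.allclose(grad, numeric_grad, atol=1e-6))
```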

##### Parameters for XGBoost

The bolded parameters are the most significant when tuning the model.

---
"objective": "binary:logistic" - since the target is binary, the binary logistic objective is used <br>
"booster": "gbtree" <br>
"eval_metric": "logloss" - logloss is the evaluation metric used to select the best model <br>
**"eta"**: 0.01 - learning rate; lower is generally better, but be cautious of a learning rate so low that the loss never plateaus within the allotted rounds <br> 
<br>Since the gbtree booster builds partition trees, the parameters below correspond to familiar tree parameters. <br><br>
**"subsample"**: generally closer to 1; row-level subsampling, similar to bagging, which helps prevent overfitting <br> 
**"colsample_bytree"**: generally closer to 1; the same idea as subsample, but columns are sampled once per tree <br> 
**"colsample_bylevel"**: column subsampling applied at each depth level of a tree <br>
**"colsample_bynode"**: column subsampling applied at each node (split) <br>
**"max_depth"**: the depth of the tree; a rough rule of thumb is the square root of the number of features <br>
"num_boost_round": the number of boosting rounds (an argument to `xgb.train` and `xgb.cv`, not a key in the parameter dictionary); a higher number of rounds is recommended, and training should stop once the loss no longer decreases <br>
"gamma": minimum loss reduction required to make a further split, acting as a penalty on additional nodes/leaves <br>
**"min_child_weight"**: the minimum sum of instance weights a child must reach for a partition to be created <br>
"lambda": L2 regularization - leave at the default (on) <br>
"alpha": L1 regularization - leave at the default (off)




In [None]:
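# data prep for XGBoost
# X (numeric feature frame) and y (target column) are assumed to come from the data cleaning step
xg = xgb.DMatrix(X.values, label=y.values)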

In [None]:
# XGBoost params:
xgboost_params = { 
   "objective": "binary:logistic", # do not change
   "booster": "gbtree", # do not change
   "eval_metric": "logloss", # do not change
   "eta": 0.01, 
   "subsample": 0.5,
   "colsample_bytree": 0.5,
   "max_depth": 3
}
boost_round = 50 # recommended to be near 400-500 rounds


In [None]:
# cross validation
xgb.cv(xgboost_params,
 xg, # the training DMatrix built above
 num_boost_round=1000,
 nfold=3, # set to the number of folds required
 stratified=True, # keep the class balance in every fold
 metrics=("error", "logloss"), # "error" is 1 - accuracy; xgboost has no "accuracy" metric
 obj=None, # no custom objective; "binary:logistic" from the params is used
 early_stopping_rounds=None, # recommended ~200-300; this should trigger, and if it never does, suspect an error
 seed=123,
 shuffle=True)



In [None]:
# train the model
clf = xgb.train(xgboost_params, xg, num_boost_round=boost_round, verbose_eval=True, maximize=False)


### Random Forest
---

Random forest or SVM is expected to have the lowest score, with XGBoost expected to score significantly higher.

In [None]:
rf = RandomForestClassifier(n_estimators=50) # n_estimators of 50 is too low

### Support Vector Machine (SVM)
Linear SVM is used for the model analysis.

A single train/test split is used here instead of cross-validation because of the model's high computation time.
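A minimal sketch of that setup is shown below, assuming a linear-kernel `SVC`, a stratified 80/20 split, and standardized features. The synthetic data and the names `X_demo`, `y_demo`, and `svm` are illustrative assumptions, not the study's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the case-study dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=123)

# One hold-out split instead of repeated cross-validation fits
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=123
)

# Scaling is fitted on the training portion only, then applied to the test portion
svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
svm.fit(X_train, y_train)
svm_accuracy = accuracy_score(y_test, svm.predict(X_test))
print(svm_accuracy)
```

The model is fitted once, which is the main saving compared with fitting it once per fold.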

## Statistical Metrics
---

SVM does not have a log-loss score because log-loss requires predicted probabilities, which an SVM does not produce natively. Probability estimates can be added on top of a fitted SVM (for example, scikit-learn's `SVC(probability=True)` adds a calibration step that relies on internal cross-validation), but the extra computation time makes this impractical for this study.
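The difference can be seen in a small sketch. The synthetic data and the names `rf_demo` and `svm_demo` are assumptions for illustration, and the log-loss is computed on the training data only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the case-study dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=123)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=123).fit(X_demo, y_demo)
svm_demo = SVC(kernel="linear").fit(X_demo, y_demo)

# The random forest exposes class probabilities, so log-loss can be computed
rf_logloss = log_loss(y_demo, rf_demo.predict_proba(X_demo)[:, 1])

# A default SVC only yields signed distances to the separating hyperplane
svm_scores = svm_demo.decision_function(X_demo)
svm_has_proba = hasattr(svm_demo, "predict_proba")
print(rf_logloss, svm_has_proba)
```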

### Metric Comparisons

## SVM Scaling
---

## Appendix
---