<a href="https://colab.research.google.com/github/kevinajordan/DS-Training/blob/master/extreme_gradient_boosting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gradient Boosting

Helpful Articles and Videos:
* http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
* https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d
* Gradient Boosting for Regression: https://www.youtube.com/watch?v=3CC4N4z3GJc
* Gradient Boosting for Classification: https://www.youtube.com/watch?v=jxuNLH5dXCs&t=824s

Gradient boosting involves three elements:
1. A loss function to be optimized.
* The framework is generic enough that any **differentiable** loss function can be used.

2. A weak learner to make predictions.
* Decision trees are used as the weak learner in gradient boosting. Specifically, regression trees are used: their leaves output real values, so the outputs of successive trees can be added together, letting each new tree correct the residuals left by the trees before it.

3. An additive model to add weak learners to minimize the loss function.

* Trees are added one at a time, and existing trees in the model are not changed. A gradient descent procedure is used to minimize the loss when adding trees. In ordinary gradient descent the updates are applied to a set of parameters, such as regression coefficients; here, instead of parameters, we have weak learner sub-models, specifically decision trees. After calculating the loss, we perform the gradient descent step by adding a tree that reduces the loss (i.e. follows the gradient). We do this by parameterizing the tree, then adjusting its parameters so that it moves in the direction that reduces the residual loss. This approach is generally called functional gradient descent, or gradient descent with functions.
The output of the new tree is then added to the output of the existing sequence of trees in an effort to correct or improve the final output of the model. Either a fixed number of trees is added, or training stops once the loss reaches an acceptable level or no longer improves on an external validation dataset.
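
The three elements above can be sketched from scratch. This is a minimal illustration, not how XGBoost is implemented: it assumes a squared-error loss (whose negative gradient is simply the residual), a single numeric feature, and depth-1 regression trees (stumps) as the weak learner. The helper names `fit_stump`, `predict_stump` and `gradient_boost` are made up for this example.

```python
import numpy as np


def fit_stump(x, residuals):
    """Fit a depth-1 regression tree (a stump) to the residuals.

    Returns (threshold, left_value, right_value) with the lowest squared error.
    """
    best, best_err = None, np.inf
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive unique feature values.
    for threshold in (values[:-1] + values[1:]) / 2.0:
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best_err = err
            best = (threshold, left.mean(), right.mean())
    return best


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def gradient_boost(x, y, n_trees=50, learning_rate=0.1):
    """Additive model for the squared loss 0.5 * (y - F(x))**2."""
    base = y.mean()                       # initial constant prediction
    prediction = np.full_like(y, base, dtype=float)
    stumps, losses = [], []
    for _ in range(n_trees):
        # Negative gradient of the squared loss w.r.t. the prediction = residual.
        residuals = y - prediction
        stump = fit_stump(x, residuals)   # new tree; earlier trees stay unchanged
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        losses.append(float(np.mean((y - prediction) ** 2)))
    return base, stumps, losses


# Synthetic data (an assumption for the demo): a noisy sine curve.
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

base, stumps, losses = gradient_boost(x, y)
print("MSE after 1 tree: %.4f" % losses[0])
print("MSE after %d trees: %.4f" % (len(losses), losses[-1]))
```

Each iteration fits a new stump to the current residuals and adds a scaled copy of its output to the running prediction, so the training loss should fall as trees are added. Swapping in a different differentiable loss would only change how `residuals` is computed.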

# XGBoost

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

The two reasons to use XGBoost are also the two goals of the project:
1. Execution Speed.
2. Model Performance.

Check this link to see benchmark performance for several ML algorithms:
https://github.com/szilard/benchm-ml

Dataset: Pima Indians Diabetes Dataset


In [0]:
# Clone DS-Training repo for datasets and skeleton code
!git clone https://github.com/kevinajordan/DS-Training.git

In [0]:
# Set our working directory to the dataset folder
import os
os.chdir('DS-Training/datasets')

In [0]:
!ls

In [0]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score

In [0]:
# load data with pandas
pima = pd.read_csv('diabetes.csv')

# Data Cleaning

Insert some cells below and clean your data.
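
As one possible starting point, the sketch below treats zeros in certain measurement columns as missing values. That is an assumption about this dataset (a glucose level or BMI of zero is not physiologically plausible), so inspect the data before adopting it. The small `demo` frame is a stand-in so the cell runs on its own; in the notebook the same steps would apply to `pima`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame; in the notebook this would be the `pima` DataFrame.
demo = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BloodPressure": [72, 66, 0],
    "Insulin": [0, 94, 168],
    "BMI": [33.6, 0.0, 23.3],
    "Pregnancies": [6, 0, 8],  # zero is a valid value here, so it is left alone
})

# Assumption: a zero in these columns means "not recorded".
zero_as_missing = ["Glucose", "BloodPressure", "Insulin", "BMI"]

cleaned = demo.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)

# Count missing values per column to see how much data is affected.
print(cleaned.isna().sum())
```

Whether to leave the values as `NaN` or impute them (for example with the column median) is a modelling choice. XGBoost accepts `NaN` inputs and treats them as missing by default.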

In [0]:
# split data into X and y 
feature_cols = ['Pregnancies', 'Insulin', 'BMI', 'Age','Glucose','BloodPressure','DiabetesPedigreeFunction']
X = pima[feature_cols] # Features
y = pima.Outcome # Target variable

# split data into train and test sets  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)

In [0]:
from xgboost import XGBClassifier 

model = XGBClassifier() 

model.fit(X_train, y_train)

print(model)


In [0]:
# make predictions for test data 
predictions = model.predict(X_test)

# evaluate predictions 
accuracy = accuracy_score(y_test, predictions) 
print("Accuracy: %.2f%%" % (accuracy * 100.0))


# K-fold Cross-Validation
Cross-validation is an approach that you can use to estimate the performance of a machine learning algorithm with less variance than a single train-test split. It works by splitting the dataset into k parts (e.g. k = 5 or k = 10). Each part is called a fold. The algorithm is trained on k−1 folds and tested on the one fold that was held back. This is repeated so that each fold of the dataset gets a turn as the held-back test set. After running cross-validation you end up with k different performance scores that you can summarize using a mean and a standard deviation.

The result is a more reliable estimate of how the algorithm will perform on new data. It is more reliable because the algorithm is trained and evaluated multiple times on different data.

We can use the k-fold cross-validation support provided in scikit-learn. First we create a `KFold` object specifying the number of folds. We can then use this scheme with the specific dataset. The `cross_val_score()` function from scikit-learn evaluates a model using the cross-validation scheme and returns an array of scores, one for each model trained on each fold.

For modest-sized datasets, with thousands or tens of thousands of observations, k values of 3, 5 and 10 are common.
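
To make the procedure concrete, here is a hand-rolled sketch of the k-fold loop using only NumPy. It is meant to illustrate the idea, not to reproduce scikit-learn's implementation. The helpers `k_fold_indices` and `cross_validate`, the toy `nearest_centroid` model, and the synthetic two-class data are all assumptions introduced for this example.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=7):
    """Shuffle the row indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)


def cross_validate(fit_predict, X, y, k=5, seed=7):
    """Return one accuracy score per fold.

    fit_predict(X_train, y_train, X_test) must return predicted labels.
    """
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, then test on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        predictions = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(np.mean(predictions == y[test_idx]))
    return np.array(scores)


def nearest_centroid(X_train, y_train, X_test):
    """Toy stand-in model: predict the class whose mean is closest."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    distances = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[distances.argmin(axis=1)]


# Synthetic two-class data: two Gaussian blobs in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

scores = cross_validate(nearest_centroid, X, y, k=5)
print("Accuracy: %.2f%% (%.2f%%)" % (scores.mean() * 100, scores.std() * 100))
```

Every row lands in exactly one test fold, so each observation is used for testing once and for training k−1 times. The mean and standard deviation of `scores` are the summary described above.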

In [0]:
from sklearn.model_selection import KFold 
from sklearn.model_selection import cross_val_score 

In [0]:
# CV model 
model = XGBClassifier() 
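# shuffle=True is needed when random_state is set; recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=10, shuffle=True, random_state=7)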
results = cross_val_score(model, X, y, cv=kfold) 
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))