# Extreme Gradient Boosting with XGBoost
Gradient boosting is currently one of the most popular techniques for efficient modeling of tabular datasets of all sizes. 

XGBoost is a very fast, scalable implementation of gradient boosting.

Summary
1. Classification with XGBoost
2. Regression with XGBoost
3. Fine-tuning your XGBoost model
4. XGBoost in Pipelines

Reference: Sergey Fogelson, VP Analytics, Viacom, DataCamp


## 1. Classification with XGBoost
Understand the basics of:
- supervised classification
- decision trees
- boosting

Supervised learning
- relies on labeled data - some understanding of past behavior
- 2 kinds of problems: regression and classification
- Classification problems
    - outcomes are binary or multi-class
    - AUC - metric for binary classification models
        - larger area under the ROC curve = better model, more sensitive
    - Accuracy score and confusion matrix - metric for multiclass
    - common algorithms: logistic regression and decision trees
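
The metrics listed above can be computed by hand on a toy example. The following is a minimal sketch in plain Python, not the scikit-learn implementation. The labels and scores are made up for illustration, and `auc_by_ranking` is a hypothetical helper that uses the ranking interpretation of AUC: the share of (positive, negative) pairs in which the positive example receives the higher score.

```python
# Toy binary labels and model scores (made-up values for illustration only).
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_matrix(y_true, y_pred):
    """2x2 matrix: rows are true classes (0, 1), columns are predicted classes."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def auc_by_ranking(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Turn scores into hard labels with a 0.5 threshold.
y_pred = [int(s >= 0.5) for s in scores]

print("accuracy:", accuracy(y_true, y_pred))
print("confusion matrix:", confusion_matrix(y_true, y_pred))
print("AUC:", auc_by_ranking(y_true, scores))
```

Note that accuracy and the confusion matrix depend on the chosen threshold, while AUC only depends on how the scores rank the examples.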

Other supervised learning considerations
- require a table of feature vectors
- Features can be either numeric or categorical
- Numeric features should be scaled (Z-scored)
- Categorical features should be encoded (one-hot)
- Other problems
    - Ranking - predicting an ordering on a set of choices
        - e.g. Google search
    - Recommendation
        - Recommending an item to a user
        - based on consumption history and profile
        - e.g. Netflix
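
As a rough illustration of the two preprocessing steps mentioned above (Z-scoring numeric features and one-hot encoding categorical ones), here is a minimal sketch in plain Python. The helper names `z_score` and `one_hot` and the column values are assumptions made for this example; in practice a library such as scikit-learn or pandas would usually do this.

```python
from statistics import mean, pstdev

def z_score(values):
    """Scale a numeric column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """Encode a categorical column as one 0/1 indicator column per category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]

# Hypothetical columns (made-up values for illustration only).
trips = [2.0, 4.0, 6.0]
city = ["Oslo", "Rome", "Oslo"]

print(z_score(trips))
print(one_hot(city))
```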

XGBoost introduction

What is XGBoost?
- Optimized gradient-boosting ML library
- originally written as a C++ command-line application
- Has APIs in several languages:
    - Python, R, Scala, Julia, Java
    
What makes XGBoost so popular?
- speed and performance
- core algorithm is parallelizable - GPUs and networks of computers
    - feasible to scale to 100s of millions of training examples
- the real draw is: consistently outperforms single-algorithm methods
    - state of the art performance in many ML tasks




### 1.1 XGBoost: Classification example

In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# load data
class_data = pd.read_csv("classification_data.csv")
X, y = class_data.iloc[:,:-1], class_data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                   random_state=123)

# instantiate XGBoost classifier
xg_cl = xgb.XGBClassifier(objective='binary:logistic',
                          n_estimators=10, seed=123)
# fit and predict
xg_cl.fit(X_train, y_train)
preds = xg_cl.predict(X_test)

# evaluate accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))
# 0.743300

### 1.2 XGBoost: Fit/Predict
It's time to create your first XGBoost model! As Sergey showed you in the video, you can use the scikit-learn .fit() / .predict() paradigm that you are already familiar with to build your XGBoost models, as the xgboost library has a scikit-learn compatible API!

Here, you'll be working with churn data. This dataset contains imaginary data from a ride-sharing app with user behaviors over their first month of app usage in a set of imaginary cities as well as whether they used the service 5 months after sign-up. It has been pre-loaded for you into a DataFrame called churn_data - explore it in the Shell!

Your goal is to use the first month's worth of data to predict whether the app's users will remain users of the service at the 5 month mark. This is a typical setup for a churn prediction problem. To do this, you'll split the data into training and test sets, fit a small xgboost model on the training set, and evaluate its performance on the test set by computing its accuracy.

pandas and numpy have been imported as pd and np, and train_test_split has been imported from sklearn.model_selection. Additionally, the arrays for the features and the target have been created as X and y.

In [None]:
# Import xgboost
import xgboost as xgb

# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)

# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))

# accuracy: 0.743300

### 1.3 What is a decision tree?
Decision trees as base learners
- Base learner - individual learning algorithm in an ensemble algorithm
    - note: XGB is an ensemble learning method that uses outputs of many models for a final prediction
- Composed of a series of binary questions/decisions - y/n, T/F
- Predictions happen at the "leaves" of the tree

Decision trees and CART (=classification and regression trees for ML)
- Constructed iteratively (one decision at a time)
    - Until a stopping criterion is met
- Individual decision trees tend to overfit
    - Low Bias and High Variance learning models
        - good at learning relationships but tend to overfit, so generalize poorly
- CART: Classification and Regression Trees
    - XGB uses this
    - Each leaf ALWAYS contains a real-valued score for classification or regression
    - Can later be converted into categories
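
To make the idea of real-valued leaf scores concrete, here is a small sketch of a hand-written tree. It is a simplified illustration, not XGBoost's internal representation: the feature names, split values, and leaf scores are all made up, and the logistic transform shown is the usual way a raw score becomes a probability under a logistic objective.

```python
import math

def tree_score(sample):
    """A hand-written depth-2 tree: each internal node asks one yes/no question."""
    if sample["trips_in_first_30_days"] < 3:   # hypothetical feature name
        return -0.8                            # leaf: real-valued score
    if sample["avg_rating"] < 4.5:             # hypothetical feature name
        return 0.1
    return 0.9

def to_probability(score):
    """Logistic (sigmoid) transform turning a raw score into a probability."""
    return 1.0 / (1.0 + math.exp(-score))

sample = {"trips_in_first_30_days": 5, "avg_rating": 4.8}
score = tree_score(sample)        # real-valued leaf score
prob = to_probability(score)      # converted to a probability
label = int(prob >= 0.5)          # converted to a category
print(score, round(prob, 3), label)
```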
        

### 1.4 Decision trees
Your task in this exercise is to make a simple decision tree using scikit-learn's DecisionTreeClassifier on the breast cancer dataset that comes pre-loaded with scikit-learn.

This dataset contains numeric measurements of various dimensions of individual tumors (such as perimeter and texture) from breast biopsies and a single outcome value (the tumor is either malignant, or benign).

We've preloaded the dataset of samples (measurements) into X and the target values per tumor into y. Now, you have to split the complete dataset into training and testing sets, and then train a DecisionTreeClassifier. You'll specify a parameter called max_depth. Many other parameters can be modified within this model, and you can check all of them out here (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier).

In [None]:
# Import the necessary modules
import numpy as np  # used below to compute the accuracy
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=123)

# Instantiate the classifier: dt_clf_4
# :param max_depth of 4. This parameter specifies the maximum number 
#  of successive split points you can have before reaching a leaf node.
dt_clf_4 = DecisionTreeClassifier(max_depth=4)

# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)


### 1.5 What is Boosting?
Boosting overview
- Not a specific machine learning algorithm
- concept that can be applied to a set of ML models
    - "Meta-algorithm"
- Ensemble meta-algorithm used to convert many weak learners into a strong learner

Weak learners and strong learners
- Weak learner = ML algorithm that is slightly better than chance
    - Example: decision tree whose predictions are slightly better than 50%
- Boosting converts a collection of weak learners into a strong learner
- Strong learner = any algorithm that can be tuned to achieve good performance

How is boosting accomplished?
- Iteratively learning a set of weak models on subsets of the data
- Weighting each weak prediction according to each weak learner's performance
    - final prediction is much better than any individual predictions
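
The weighting idea above can be sketched with an AdaBoost-style toy example. This is an illustration of boosting in general, not of XGBoost's gradient boosting. It also reweights samples between rounds rather than drawing subsets. The 1-D dataset and all helper names are made up for this example.

```python
import math

# Toy 1-D dataset (made-up values); labels are +1 / -1.
X = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

def stump_predict(x, threshold, polarity):
    """Weak learner: a single yes/no question 'is x below the threshold?'."""
    return polarity if x < threshold else -polarity

def best_stump(X, y, weights):
    """Pick the stump with the lowest weighted error under the current weights."""
    best = None
    for threshold in [v + 0.5 for v in range(-1, 10)]:
        for polarity in (1, -1):
            error = sum(w for xi, yi, w in zip(X, y, weights)
                        if stump_predict(xi, threshold, polarity) != yi)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def fit_boosted_stumps(X, y, n_rounds=3):
    """Fit n_rounds stumps, each weighted by how well it performed."""
    weights = [1.0 / len(X)] * len(X)
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(n_rounds):
        error, threshold, polarity = best_stump(X, y, weights)
        # Learner weight: more accurate stumps get a larger say in the vote.
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Up-weight misclassified samples so the next stump focuses on them.
        weights = [w * math.exp(-alpha * yi * stump_predict(xi, threshold, polarity))
                   for xi, yi, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """Weighted vote of all stumps."""
    vote = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if vote >= 0 else -1

ensemble = fit_boosted_stumps(X, y)
single = [stump_predict(x, *ensemble[0][1:]) for x in X]
boosted = [ensemble_predict(ensemble, x) for x in X]
first_acc = sum(p == t for p, t in zip(single, y)) / len(y)
boosted_acc = sum(p == t for p, t in zip(boosted, y)) / len(y)
print("first stump accuracy:", first_acc)
print("boosted accuracy:", boosted_acc)
```

On this toy data no single stump can separate the classes, but the weighted vote of three stumps fits the training set better than the first stump alone.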

Model evaluation through cross-validation
- Cross-validation = robust method for estimating the performance of a model on unseen data by:
- Generating many non-overlapping train/test splits on the training data
- Reporting the average test set performance across all data splits
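
The cross-validation procedure described above can be sketched in a few lines of plain Python. This is a simplified illustration, not what `xgb.cv` does internally: `k_fold_indices` and `majority_class` are hypothetical helpers, the labels are made up, and the "model" simply predicts the most common training label.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists with non-overlapping test folds."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        yield train_idx, test_idx

def majority_class(labels):
    """Stand-in 'model': always predicts the most common training label."""
    return max(set(labels), key=labels.count)

# Made-up binary labels for illustration only.
y = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0]

fold_scores = []
for train_idx, test_idx in k_fold_indices(len(y), n_folds=4):
    prediction = majority_class([y[i] for i in train_idx])
    correct = sum(y[i] == prediction for i in test_idx)
    fold_scores.append(correct / len(test_idx))

# The reported score is the average test accuracy across all folds.
cv_accuracy = sum(fold_scores) / len(fold_scores)
print(fold_scores, cv_accuracy)
```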


#### 1.5.a Cross-validation in XGBoost

In [None]:
import xgboost as xgb
import pandas as pd

# load data (the label column is month_5_still_here)
churn_data = pd.read_csv("classification_data.csv")

# DMatrix
churn_dmatrix = xgb.DMatrix(data=churn_data.iloc[:,:-1],
                            label=churn_data.month_5_still_here)

# parameter dictionary to pass into cross validation
params={"objective":"binary:logistic","max_depth":4}

# use cv method and pass required dmatrix
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=4,
                   num_boost_round=10, metrics="error", as_pandas=True)

# output accuracy
print("Accuracy: %f" %((1-cv_results["test-error-mean"]).iloc[-1]))
# Accuracy: 0.88315

### 1.6 Measuring accuracy
You'll now practice using XGBoost's learning API through its baked in cross-validation capabilities. As Sergey discussed in the previous video, XGBoost gets its lauded performance and efficiency gains by utilizing its own optimized data structure for datasets called a DMatrix.

In the previous exercise, the input datasets were converted into DMatrix data on the fly, but when you use the xgboost cv object, you have to first explicitly convert your data into a DMatrix. So, that's what you will do here before running cross-validation on churn_data.

In [None]:
# Create the DMatrix: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:logistic", "max_depth":3}

# Perform cross-validation: cv_results
# "error" metrics will be converted to accuracy
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, 
                    num_boost_round=5, metrics="error", as_pandas=True, 
                    seed=123)

# Print cv_results
print(cv_results)

# Print the accuracy
print(((1-cv_results["test-error-mean"]).iloc[-1]))

# output
       test-error-mean  test-error-std  train-error-mean  train-error-std
    0          0.28378        0.001932           0.28232         0.002366
    1          0.27190        0.001932           0.26951         0.001855
    2          0.25798        0.003963           0.25605         0.003213
    3          0.25434        0.003827           0.25090         0.001845
    4          0.24852        0.000934           0.24654         0.001981
    0.75148

cv_results stores the training and test mean and standard deviation of the error per boosting round (tree built) as a DataFrame. From cv_results, the final round 'test-error-mean' is extracted and converted into an accuracy, where accuracy is 1-error. The final accuracy of around 75% is an improvement from earlier!

### 1.7 Measuring AUC
Now that you've used cross-validation to compute average out-of-sample accuracy (after converting from an error), it's very easy to compute any other metric you might be interested in. All you have to do is pass it (or a list of metrics) in as an argument to the metrics parameter of xgb.cv().

Your job in this exercise is to compute another common metric used in binary classification - the area under the curve ("auc"). As before, churn_data is available in your workspace, along with the DMatrix churn_dmatrix and parameter dictionary params.

In [None]:
# Perform cross_validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, 
                    num_boost_round=5, metrics="auc", as_pandas=True, 
                    seed=123)

# Print cv_results
print(cv_results)

# Print the AUC
print((cv_results["test-auc-mean"]).iloc[-1])

# output
       test-auc-mean  test-auc-std  train-auc-mean  train-auc-std
    0       0.767863      0.002820        0.768893       0.001544
    1       0.789157      0.006846        0.790864       0.006758
    2       0.814476      0.005997        0.815872       0.003900
    3       0.821682      0.003912        0.822959       0.002018
    4       0.826191      0.001937        0.827528       0.000769
    0.826191

# An AUC of about 0.83 is quite strong. As you have seen, XGBoost's 
# learning API makes it very easy to compute any metric you may be 
# interested in.

### 1.8 When should I use XGBoost?
When to use XGBoost? Criteria:
- large number of training samples
    - > 1000 training samples and < 100 features
    - should be ok when number of features < number of training samples
- you have a mixture of categorical and numeric features
- Or just numeric features

When to NOT use XGBoost? Criteria:
- other algorithms have had more success on the problem type
    - Not suited for (deep learning is better):
        - image recognition
        - computer vision
        - NLP and NL understanding problems
- dataset size issues
    - < 100 training samples
    - when number of training samples is significantly smaller than the number of features
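
The dataset-size criteria above can be written down as a small check. This is only a sketch of the rules of thumb in these notes, not an official guideline: `xgboost_looks_suitable` is a hypothetical helper, it treats "fewer samples than features" as the cutoff, and the feature count in the second call is made up.

```python
def xgboost_looks_suitable(n_samples, n_features):
    """Hypothetical helper encoding the dataset-size rules of thumb from the notes."""
    if n_samples < 100:
        return False   # too few training samples
    if n_samples < n_features:
        return False   # fewer training samples than features
    return True

# 26 genomes with millions of genes: far too few samples.
print(xgboost_looks_suitable(n_samples=26, n_features=1_000_000))
# Millions of users; the feature count of 50 is an assumed value.
print(xgboost_looks_suitable(n_samples=5_000_000, n_features=50))
```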
    

### 1.9 Using XGBoost
XGBoost is a powerful library that scales very well to many samples and works for a variety of supervised learning problems. That said, as Sergey described in the video, you shouldn't always pick it as your default machine learning library when starting a new project, since there are some situations in which it is not the best option. In this exercise, your job is to consider the below examples and select the one which would be the best use of XGBoost.

Possible Answers
- Visualizing the similarity between stocks by comparing the time series of their historical prices relative to each other.
- Predicting whether a person will develop cancer using genetic data with millions of genes, 23 examples of genomes of people that didn't develop cancer, 3 genomes of people who wound up getting cancer.
- Clustering documents into topics based on the terms used in them.
- Predicting the likelihood that a given user will click an ad from a very large clickstream log with millions of users and their web interactions.

Answer:
- D

## 2. Regression with XGBoost


## 3. Fine-tuning your XGBoost model

## 4. XGBoost in Pipelines
