# Chapter 1: XGBoost (DataCamp)

Course notes from the [DataCamp](https://campus.datacamp.com/courses/extreme-gradient-boosting-with-xgboost) XGBoost course.<br>
On Windows with Anaconda, importing xgboost may require some extra [setup](https://www.ibm.com/developerworks/community/blogs/jfp/entry/Installing_XGBoost_For_Anaconda_on_Windows?lang=en).

## XGBoost: Binary Logistic Classifier (Customer Churn)
An introduction to XGBoost using churn data. The dataset contains imaginary data from a ride-sharing app: user behavior over the first month of app usage in a set of imaginary cities, plus whether each user was still using the service 5 months after sign-up. The goal is to use the first month's data to __predict whether the app's users will remain users__ of the service at the 5-month mark, which is a typical setup for a churn prediction problem. The steps are: split the data into training and test sets, fit a small xgboost model on the training set, and evaluate it on the test set by computing its accuracy.

In [33]:
# may be required if xgboost import throws errors 
# import os
# mingw_path = 'C:\\Program Files\\mingw-w64\\x86_64-7.2.0-posix-seh-rt_v5-rev1\\mingw64\\bin'
# os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']

In [34]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [35]:
# load data (may have to re-point once in github)
churn_data = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_3739/datasets/taxi_churn_data_dummified.csv")

In [36]:
# Create df for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

In [37]:
# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)

# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective = 'binary:logistic', n_estimators = 10, seed = 123)

# fit model
xg_cl.fit(X_train, y_train)

# predictions based on X_test
preds = xg_cl.predict(X_test)

# manual calc for accuracy
accuracy = float(np.sum(preds == y_test))/y_test.shape[0]

print('Accuracy (unpreprocessed) with XGBoost: %f' % accuracy)

Accuracy (unpreprocessed) with XGBoost: 0.743300
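
The manual accuracy above is the fraction of predictions that equal the true labels. As a minimal sketch, using made-up toy labels rather than the churn data, the count-and-divide form and the mean of the boolean match vector give the same number:

```python
import numpy as np

# Toy labels and predictions (illustrative values, not from the churn dataset)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Same formula as the cell above: number of matches divided by number of samples
acc_manual = float(np.sum(y_pred == y_true)) / y_true.shape[0]

# Equivalent shortcut: the mean of a boolean array is the fraction of True values
acc_mean = float(np.mean(y_pred == y_true))

print('manual: %.3f, mean: %.3f' % (acc_manual, acc_mean))
```

scikit-learn's `accuracy_score(y_test, preds)` from `sklearn.metrics` should return the same value as the manual calculation.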


## Non-XGBoost Decision Tree and LogReg (Breast Cancer Wisconsin Dataset)
For comparison with XGBoost, a simple decision tree (scikit-learn's DecisionTreeClassifier) and a pipelined logistic regression are fit on the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset.

In [39]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

### Load UCI ML Breast Cancer Wisconsin (Diagnostic) Dataset

In [40]:
print(50 * '=')
print('Loading the Breast Cancer Wisconsin dataset')
print(50 * '-')
breast_cancer_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases'
                 '/breast-cancer-wisconsin/wdbc.data', header=None)

print('Breast Cancer dataset excerpt:\n')
print(breast_cancer_data.head(), '\n\n')

print('Breast Cancer dataset dimensions:\n')
print(breast_cancer_data.shape)

Loading the Breast Cancer Wisconsin dataset
--------------------------------------------------
Breast Cancer dataset excerpt:

         0  1      2      3       4       5        6        7       8   \
0    842302  M  17.99  10.38  122.80  1001.0  0.11840  0.27760  0.3001   
1    842517  M  20.57  17.77  132.90  1326.0  0.08474  0.07864  0.0869   
2  84300903  M  19.69  21.25  130.00  1203.0  0.10960  0.15990  0.1974   
3  84348301  M  11.42  20.38   77.58   386.1  0.14250  0.28390  0.2414   
4  84358402  M  20.29  14.34  135.10  1297.0  0.10030  0.13280  0.1980   

        9    ...        22     23      24      25      26      27      28  \
0  0.14710   ...     25.38  17.33  184.60  2019.0  0.1622  0.6656  0.7119   
1  0.07017   ...     24.99  23.41  158.80  1956.0  0.1238  0.1866  0.2416   
2  0.12790   ...     23.57  25.53  152.50  1709.0  0.1444  0.4245  0.4504   
3  0.10520   ...     14.91  26.50   98.87   567.7  0.2098  0.8663  0.6869   
4  0.10430   ...     22.54  16.67  152.20  

### Split Features and Target

In [41]:
X = breast_cancer_data.loc[:, 2:].values
y = breast_cancer_data.loc[:, 1].values

### Encode Target Labels

In [42]:
le = LabelEncoder()
y = le.fit_transform(y)
# Show how the two class labels are encoded (M = malignant, B = benign)
y_enc = le.transform(['M', 'B'])
print("Label encoding example, le.transform(['M', 'B'])")
print(y_enc)

Label encoding example, le.transform(['M', 'B'])
[1 0]
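
`LabelEncoder` assigns integer codes in sorted order of the distinct labels, which is why 'B' becomes 0 and 'M' becomes 1. The sketch below reproduces that mapping with `np.unique` on toy labels. The `mapping` dict is a hypothetical helper for illustration, not part of scikit-learn.

```python
import numpy as np

# Toy diagnosis labels in the same 'M'/'B' format as column 1 of wdbc.data
labels = np.array(['M', 'B', 'B', 'M', 'B'])

# np.unique returns the sorted distinct labels; return_inverse gives, for each
# element, the index of its label in that sorted array
classes, codes = np.unique(labels, return_inverse=True)

print(classes)  # sorted distinct labels
print(codes)    # integer code for each original label

# Hypothetical lookup dict for encoding new labels the same way
mapping = {label: code for code, label in enumerate(classes)}
encoded = [mapping[label] for label in ['M', 'B']]
print(encoded)
```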


#### Traditional Decision Tree Classifier

In [43]:
from sklearn.tree import DecisionTreeClassifier

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=123)

# Instantiate the classifier: dt_clf_4
dt_clf_4 = DecisionTreeClassifier(max_depth = 4)

# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("Test Accuracy:", accuracy)

Test Accuracy: 0.9736842105263158


#### Pipelined Logistic Regression Classifier with PCA and Standardization 

In [44]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=123)

print(50 * '=')
print('Section: Combining transformers and estimators in a pipeline')
print(50 * '-')


pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=1))])

pipe_lr.fit(X_train, y_train)
print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))
y_pred = pipe_lr.predict(X_test)

Section: Combining transformers and estimators in a pipeline
--------------------------------------------------
Test Accuracy: 0.965
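
When `pipe_lr.fit` is called, each transformer is fit on the training data only, and `score` and `predict` reuse those fitted parameters on the test data. The following is a rough sketch of the standardization step alone, written in plain NumPy with toy arrays. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Toy feature matrices (illustrative values, not from wdbc.data)
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[2.0, 40.0]])

# "fit": learn per-column mean and standard deviation from the training set only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population std (ddof=0)

# "transform": apply the training statistics to both sets
X_train_std = (X_train_toy - mean) / std
X_test_std = (X_test_toy - mean) / std

print(X_train_std.mean(axis=0))  # approximately 0 per column
print(X_train_std.std(axis=0))   # approximately 1 per column
print(X_test_std)                # test row scaled with training statistics
```

Reusing the training statistics on the test set keeps information about the test data out of the fitted model, and the pipeline handles this automatically.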


## XGBoost Native Cross-validation (Accuracy) 
This section uses __xgb.cv__, the built-in cross-validation function in XGBoost's native learning API. XGBoost gets part of its performance and efficiency from its own optimized data structure for datasets, the DMatrix.

In the XGBClassifier example above, the scikit-learn wrapper converted the input data into a DMatrix on the fly. When using xgb.cv, the data must first be converted into a DMatrix explicitly.

In [45]:
# load data (may have to re-point once in github)
churn_data = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_3739/datasets/taxi_churn_data_dummified.csv")

In [46]:
# Create the DMatrix: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data = churn_data.iloc[:,:-1], label=churn_data.month_5_still_here)

# Create the parameter dictionary: params
# ("reg:logistic" optimizes the same logistic loss as "binary:logistic")
params = {"objective":"reg:logistic", "max_depth":3}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, num_boost_round=5, metrics="error", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the accuracy
print('\n Accuracy (unpreprocessed) with XGBoost: %f' % (1 - cv_results["test-error-mean"].iloc[-1]))

   test-error-mean  test-error-std  train-error-mean  train-error-std
0          0.28378        0.001932           0.28232         0.002366
1          0.27190        0.001932           0.26951         0.001855
2          0.25798        0.003963           0.25605         0.003213
3          0.25434        0.003827           0.25090         0.001845
4          0.24852        0.000934           0.24654         0.001981

 Accuracy (unpreprocessed) with XGBoost: 0.751480


cv_results is a DataFrame that stores the mean and standard deviation of the training and test error across folds, with one row per boosting round (tree built). The final round's 'test-error-mean' is extracted and converted into an accuracy, where accuracy is 1 - error. The resulting accuracy of about 75% is close to the 74.3% test accuracy obtained earlier with XGBClassifier on a single train/test split.
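
As a worked example of the error-to-accuracy conversion, the sketch below applies 1 - error to the 'test-error-mean' values copied from the table above. As the XGBoost documentation describes it, the `error` metric is the fraction of misclassified rows when predictions above 0.5 are treated as positive.

```python
# test-error-mean per boosting round, copied from the cv_results table above
test_error_mean = [0.28378, 0.27190, 0.25798, 0.25434, 0.24852]

# Accuracy is the complement of the classification error
accuracy_per_round = [1 - err for err in test_error_mean]

for rnd, acc in enumerate(accuracy_per_round):
    print('round %d: accuracy = %.5f' % (rnd, acc))

# Final-round accuracy, matching the value printed by the cell above
final_accuracy = accuracy_per_round[-1]
print('final accuracy: %f' % final_accuracy)
```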

## XGBoost Native Area Under the Curve (AUC)
Having used cross-validation to compute average out-of-sample accuracy (after converting from an error), it's very easy to compute any other metric you might be interested in. Simply pass it (or a list of metrics) in as an argument to the metrics parameter of xgb.cv().

This exercise computes another common metric for binary classification: the area under the ROC curve ("auc").

In [47]:
# Perform cross_validation using cv_results and AUC as the metric 
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, num_boost_round=5, metrics="auc", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the AUC
print('\n AUC (unpreprocessed) with XGBoost: %f' % cv_results["test-auc-mean"].iloc[-1])


   test-auc-mean  test-auc-std  train-auc-mean  train-auc-std
0       0.767863      0.002820        0.768893       0.001544
1       0.789157      0.006846        0.790864       0.006758
2       0.814476      0.005997        0.815872       0.003900
3       0.821682      0.003912        0.822959       0.002018
4       0.826191      0.001937        0.827528       0.000769

 AUC (unpreprocessed) with XGBoost: 0.826191
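
One way to read the AUC is as the probability that a randomly chosen positive example gets a higher predicted score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on toy data. `roc_auc` is a hypothetical helper written for illustration, not an XGBoost function, and its pairwise loop is only practical for small inputs.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and predicted scores (illustrative values, not from the churn dataset)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print('AUC: %.2f' % auc)
```

Unlike accuracy, this value depends only on how the scores rank the examples, not on a 0.5 threshold.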
