# Assignment Week 1-2: Case Study XGBOOST (June 15)

Created by: **Hosein Beheshtifard & Aref Motamedi**

**Overview**:

XGBoost (eXtreme Gradient Boosting) is an open-source library that provides a regularized gradient boosting framework for C++, Java, Python, R, Julia, Perl, and Scala. It is an optimized, distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework and provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately.
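
To make the idea of boosting concrete, here is a minimal sketch of plain gradient boosting for regression with squared loss, written with scikit-learn trees on synthetic data. It is only an illustration of the general principle (each new tree is fit to the residuals of the current ensemble); it is not XGBoost's actual algorithm, which also adds regularization and other optimizations. All names (`X_toy`, `y_toy`, `learning_rate`) are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative data only).
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
# Start from a constant model: the mean of the targets.
prediction = np.full_like(y_toy, y_toy.mean())
initial_mse = np.mean((y_toy - prediction) ** 2)

trees = []
for _ in range(50):
    # For squared loss, the negative gradient is simply the residual.
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    # Add a damped version of the new tree to the ensemble.
    prediction += learning_rate * tree.predict(X_toy)
    trees.append(tree)

final_mse = np.mean((y_toy - prediction) ** 2)
print(f"MSE before boosting: {initial_mse:.4f}, after: {final_mse:.4f}")
```

The training error shrinks as trees are added, which is the behaviour the boosting libraries build on.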

![](https://upload.wikimedia.org/wikipedia/commons/6/69/XGBoost_logo.png)

![](https://miro.medium.com/max/777/1*l4PN8hyAO4fMLxUbIxcETA.png)



To get a better understanding of this library, let's test it on a small, real project.

<h2> Dataset </h2>

We will use a clean dataset of 70,692 survey responses to the CDC's BRFSS2015. It has an equal 50-50 split of respondents with no diabetes and with either prediabetes or diabetes. The target variable Diabetes_binary has 2 classes. 0 is for no diabetes, and 1 is for prediabetes or diabetes. This dataset has 21 feature variables and is balanced.



Link to download the dataset: https://drive.google.com/file/d/1UXvhGWwUApkDEX9Tt4enQ5l1D6jNBPb4/view?usp=sharing

<h2> Importing libraries and packages </h2>

In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb

from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, cross_val_score, train_test_split, StratifiedKFold, GridSearchCV

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, log_loss
from sklearn.metrics import roc_curve, roc_auc_score, auc
from xgboost import XGBClassifier

plt.style.use("Solarize_Light2")

import warnings
warnings.filterwarnings("ignore")

<h2> Loading the dataset </h2>


In [3]:
data = pd.read_csv("./non_name_data.csv")
data

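|  | clmn1 | clmn2 | clmn3 | clmn4 | clmn5 | clmn6 | clmn7 | clmn8 | clmn9 | clmn10 | ... | clmn17 | clmn18 | clmn19 | clmn20 | clmn21 | clmn22 | clmn23 | clmn24 | lbl | Out_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 5 | 5.353466 | 587.5 | 5 | 3 | 500.0 | 425.0 | 300.0 | 3.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | True | 1.945125e-04 |
| 1 | 1 | 5 | 1.885697 | 525.0 | 1 | 2 | 412.5 | 462.5 | 200.0 | 4.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | False | 8.934743e-03 |
| 2 | 5 | 6 | 1.567379 | 500.0 | 2 | 6 | 487.5 | 337.5 | 200.0 | 2.0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | True | 6.760000e-14 |
| 3 | 6 | 8 | 0.380682 | 425.0 | 2 | 1 | 375.0 | 437.5 | 300.0 | 2.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | False | 5.241387e-03 |
| 4 | 9 | 13 | 0.578078 | 400.0 | 4 | 3 | 612.5 | 500.0 | 400.0 | 2.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | True | 2.252500e-09 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21995 | 6 | 4 | 2.906392 | 3600.0 | 1 | 1 | 2350.0 | 1400.0 | 300.0 | 4.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | False | 1.229091e-02 |
| 21996 | 3 | 11 | 0.789073 | 2900.0 | 6 | 6 | 2100.0 | 2250.0 | 200.0 | 3.0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | False | 1.572687e-03 |
| 21997 | 7 | 0 | 4.620005 | 1800.0 | 4 | 2 | 1350.0 | 3600.0 | 300.0 | 4.0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | False | 2.490360e-02 |
| 21998 | 1 | 12 | 5.864099 | 3050.0 | 5 | 2 | 1800.0 | 2850.0 | 200.0 | 4.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | False | 2.015414e-02 |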


<h2> Pre-processing </h2>

In this section, we prepare the data before using it, in order to ensure or improve model performance.
This step usually involves a variety of tasks, such as normalization, data cleaning, handling missing values, and feature reduction.
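
As a rough illustration of two of these tasks, the sketch below counts missing values, fills them with the column median, and standardizes the result. The tiny DataFrame and its column names (`age`, `income`) are made up for this example, and median imputation is just one possible choice. Tree-based models such as XGBoost are generally insensitive to feature scaling, so the scaling step is shown only for completeness.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0],
    "income": [300.0, 425.0, 500.0, np.nan],
})

# 1) Count missing values per column.
missing_per_column = toy.isna().sum()

# 2) Fill each missing value with the median of its column.
toy_filled = toy.fillna(toy.median())

# 3) Standardize every column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(toy_filled)

print(missing_per_column)
print(scaled.round(3))
```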



In [4]:
data.isna().sum()

clmn1      0
clmn2      0
clmn3      0
clmn4      0
clmn5      0
clmn6      0
clmn7      0
clmn8      0
clmn9      0
clmn10     0
clmn11     0
clmn12     0
clmn13     0
clmn14     0
clmn15     0
clmn16     0
clmn17     0
clmn18     0
clmn19     0
clmn20     0
clmn21     0
clmn22     0
clmn23     0
clmn24     0
lbl        0
Out_num    0
dtype: int64

*   Fortunately, our dataset does not contain any missing values.

Let's see how many unique values are in each attribute:



In [5]:
data.nunique()

clmn1         14
clmn2         14
clmn3      21999
clmn4         77
clmn5         10
clmn6         10
clmn7         77
clmn8         77
clmn9          4
clmn10         5
clmn11        12
clmn12        10
clmn13         2
clmn14         2
clmn15         2
clmn16         2
clmn17         2
clmn18         2
clmn19         2
clmn20         2
clmn21         2
clmn22         2
clmn23         2
clmn24         2
lbl            2
Out_num    14100
dtype: int64
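
The unique-value counts give a quick hint about the type of each attribute: columns with 2 values look binary, columns with a handful of values look categorical or ordinal, and columns with thousands of values look continuous. A small sketch of how such a grouping could be automated is shown below on made-up data; the column names and the `max_levels` threshold are assumptions for illustration, not values taken from our dataset.

```python
import pandas as pd

# Hypothetical toy frame with one column of each kind.
toy = pd.DataFrame({
    "flag": [0, 1, 0, 1, 1, 0],
    "level": [1, 2, 3, 1, 2, 3],
    "measure": [0.11, 0.52, 0.93, 0.34, 0.75, 0.26],
})

max_levels = 4  # assumed cut-off between "few levels" and "continuous"
unique_counts = toy.nunique()

binary_cols = unique_counts[unique_counts == 2].index.tolist()
low_cardinality_cols = unique_counts[
    (unique_counts > 2) & (unique_counts <= max_levels)
].index.tolist()
continuous_cols = unique_counts[unique_counts > max_levels].index.tolist()

print(binary_cols, low_cardinality_cols, continuous_cols)
```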

In [6]:
X = data.drop(['lbl'], axis=1)
y = data['lbl']

In [7]:
X

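|  | clmn1 | clmn2 | clmn3 | clmn4 | clmn5 | clmn6 | clmn7 | clmn8 | clmn9 | clmn10 | ... | clmn16 | clmn17 | clmn18 | clmn19 | clmn20 | clmn21 | clmn22 | clmn23 | clmn24 | Out_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 5 | 5.353466 | 587.5 | 5 | 3 | 500.0 | 425.0 | 300.0 | 3.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1.945125e-04 |
| 1 | 1 | 5 | 1.885697 | 525.0 | 1 | 2 | 412.5 | 462.5 | 200.0 | 4.0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 8.934743e-03 |
| 2 | 5 | 6 | 1.567379 | 500.0 | 2 | 6 | 487.5 | 337.5 | 200.0 | 2.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 6.760000e-14 |
| 3 | 6 | 8 | 0.380682 | 425.0 | 2 | 1 | 375.0 | 437.5 | 300.0 | 2.0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 5.241387e-03 |
| 4 | 9 | 13 | 0.578078 | 400.0 | 4 | 3 | 612.5 | 500.0 | 400.0 | 2.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 2.252500e-09 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21995 | 6 | 4 | 2.906392 | 3600.0 | 1 | 1 | 2350.0 | 1400.0 | 300.0 | 4.0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1.229091e-02 |
| 21996 | 3 | 11 | 0.789073 | 2900.0 | 6 | 6 | 2100.0 | 2250.0 | 200.0 | 3.0 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1.572687e-03 |
| 21997 | 7 | 0 | 4.620005 | 1800.0 | 4 | 2 | 1350.0 | 3600.0 | 300.0 | 4.0 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 2.490360e-02 |
| 21998 | 1 | 12 | 5.864099 | 3050.0 | 5 | 2 | 1800.0 | 2850.0 | 200.0 | 4.0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 2.015414e-02 |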
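
Since **X** keeps every column except `lbl` (including `Out_num`), it can be worth checking, before training, whether any single column on its own almost determines the label. The sketch below shows one possible sanity check on synthetic data: it computes the ROC AUC of each feature used directly as a score. The data and the column names (`noise`, `leaky`) are invented for this example, and the check makes no claim about what our actual columns contain.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
y_toy = rng.integers(0, 2, size=n)

# "noise" is unrelated to the label; "leaky" is almost a copy of it.
toy_X = pd.DataFrame({
    "noise": rng.normal(size=n),
    "leaky": y_toy + rng.normal(scale=0.01, size=n),
})

single_feature_auc = {}
for col in toy_X.columns:
    auc = roc_auc_score(y_toy, toy_X[col])
    # A feature can separate the classes in either direction.
    single_feature_auc[col] = max(auc, 1 - auc)

print(single_feature_auc)
```

A value close to 1.0 for a single column would suggest looking more closely at how that column was produced.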


<h2> XGBOOST </h2>


First, we split our dataset into training and validation sets. For each set, we need to specify the data (**X**) and the labels (**y**).


In [8]:
seed = 123 # a value for random_state
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=seed, shuffle=True, stratify=y)
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

((17600, 25), (4400, 25), (17600,), (4400,))
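
The `stratify=y` argument asks `train_test_split` to keep the class proportions the same in both parts. A minimal sketch on made-up labels (80 negatives and 20 positives, chosen only for illustration) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% class 0, 20% class 1.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123, shuffle=True, stratify=y_toy
)

# Share of positives in each part; both should stay near 0.2.
train_pos_rate = y_tr.mean()
valid_pos_rate = y_va.mean()
print(train_pos_rate, valid_pos_rate)
```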

In [9]:
xgb.XGBClassifier().get_params()

{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 3,
 'min_child_weight': 1,
 'missing': None,
 'n_estimators': 100,
 'n_jobs': 1,
 'nthread': None,
 'objective': 'binary:logistic',
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'seed': None,
 'silent': None,
 'subsample': 1,
 'verbosity': 1}

<h2> Tuning hyperparameters </h2>

In [11]:
# you can use different values for the below hyperparameters.
learning_rate_list = [0.02, 0.1, 0.3]
max_depth_list = [2, 3, 4]
n_estimators_list = [100, 200, 300]

params_dict = {"learning_rate": learning_rate_list,
               "max_depth": max_depth_list,
               "n_estimators": n_estimators_list}

num_combinations = 1
for v in params_dict.values():
    num_combinations *= len(v)

print(num_combinations)
params_dict

27


{'learning_rate': [0.02, 0.1, 0.3],
 'max_depth': [2, 3, 4],
 'n_estimators': [100, 200, 300]}
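
To see where the 27 comes from, the sketch below lists every combination explicitly with `itertools.product`. It also estimates how many models a grid search has to train when each combination is evaluated with 10-fold cross-validation (the `cv_folds` name is introduced here only for illustration).

```python
from itertools import product
from math import prod

params_dict = {"learning_rate": [0.02, 0.1, 0.3],
               "max_depth": [2, 3, 4],
               "n_estimators": [100, 200, 300]}

# Number of combinations = product of the list lengths (3 * 3 * 3).
num_combinations = prod(len(v) for v in params_dict.values())

# Every combination as a dict, in the order the lists are given.
combos = [dict(zip(params_dict, values))
          for values in product(*params_dict.values())]

cv_folds = 10
total_fits = num_combinations * cv_folds  # excludes the final refit

print(num_combinations, total_fits)
print(combos[0])
```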

In [13]:
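def my_roc_auc_score(model, X, y):
    # ROC AUC computed from the predicted probability of the positive class
    return roc_auc_score(y, model.predict_proba(X)[:, 1])

model = XGBClassifier(subsample=0.5,
                      colsample_bytree=1,
                      random_state=seed,
                      eval_metric='auc')

# exhaustive search over params_dict with 10-fold cross-validation
greedy = GridSearchCV(model, params_dict, n_jobs=-1, cv=10)
greedy.fit(X_train, y_train)

# score the refitted best model on the held-out validation set
my_roc_auc_score(greedy, X_valid, y_valid)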


1.0