# XGBoost

### Oscar Briones Ramirez

## Introduction

XGBoost is short for Extreme Gradient Boosting. It is one of the most popular machine learning techniques because it is easy to use, fast, and has strong predictive power.

## Theory

XGBoost is an implementation of gradient boosting that uses decision trees as its base learners. Unlike plain gradient boosting, it adds regularization terms to the objective, which helps control overfitting.
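For reference, the regularized objective can be sketched as follows. The notation follows the XGBoost documentation and is not otherwise used in this tutorial, and the optional L1 penalty is left out for brevity:

$$
\text{obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
$$

Here $l$ is the training loss, $f_k$ is the $k$-th tree, $T$ is the number of leaves in a tree, and $w_j$ are its leaf weights. The strengths $\gamma$ and $\lambda$ correspond to the gamma and reg_lambda parameters of XGBClassifier.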

Gradient boosting builds the model in stages. It first fits a decision tree to the original data. It then fits a second tree to the residuals of the first, a third tree to the residuals left by the first two combined, and so on until it reaches the requested number of estimators. The final prediction is the sum of the individual trees' predictions, each scaled by the learning rate. More generally, each tree is fit to the negative gradient of the loss, which for squared error is exactly the residual.
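The staged procedure can be illustrated with a minimal pure-Python sketch. It is not XGBoost's implementation: it assumes squared-error loss, a single feature, one-split trees ("stumps"), and no regularization, and the data and function names are hypothetical.

```python
# Minimal gradient boosting sketch for squared error (illustration only).

def fit_stump(xs, targets):
    """Find the single split on x that best fits the targets (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((t - left_value) ** 2 for t in left)
                 + sum((t - right_value) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_estimators=20, learning_rate=0.3):
    """Fit each new stump to the residuals of the model built so far."""
    base = sum(ys) / len(ys)  # start from the mean of the response
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the learning-rate-scaled stump predictions on top of the base value."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical toy data: the response jumps between x = 3 and x = 4
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

one_round = boost(xs, ys, n_estimators=1)
twenty_rounds = boost(xs, ys, n_estimators=20)
print(mse(one_round, xs, ys), mse(twenty_rounds, xs, ys))
```

On this toy data the training error shrinks as more stumps are added, which is also why a large number of estimators can eventually overfit.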

## Important Parameters

<b>booster:</b>

booster selects the boosting algorithm. There are three options: gbtree, gblinear, or dart. The default is gbtree, which boosts decision trees. dart also boosts trees but applies dropout to them to reduce overfitting, and gblinear boosts linear models instead of decision trees.


<b>max_depth:</b>

max_depth is the maximum depth of each decision tree. The larger this number, the more complex the trees and the more likely the model is to overfit. The default is 6.

<b>subsample:</b>

subsample is the fraction of the training observations sampled for each boosting iteration. It must be greater than 0 and at most 1. The default is 1, which uses all of the data. If it is set to 0.8, a random 80% of the observations is drawn before each tree is grown. Values below 1 help prevent overfitting.
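As a rough illustration of what subsample = 0.8 means, the sketch below draws a fresh 80% of the row indices for each iteration. The helper name and the ten-row dataset are hypothetical, and XGBoost's internal sampling may differ in its details.

```python
import random


def subsample_rows(n_rows, subsample, rng):
    """Return the row indices used for one boosting iteration (no replacement)."""
    k = max(1, round(n_rows * subsample))
    return sorted(rng.sample(range(n_rows), k))


rng = random.Random(307)  # fixed seed so the draws are reproducible
n_rows = 10               # hypothetical number of training rows

for iteration in range(3):
    rows = subsample_rows(n_rows, 0.8, rng)
    print(iteration, rows)  # a different set of 8 rows each time
```

Because every tree sees a slightly different sample, no single tree can memorize the full training set.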

<b>n_estimators:</b>

n_estimators is the number of boosted trees to use; this is the parameter's name in the scikit-learn style XGBClassifier used below. The larger this number, the greater the risk of overfitting.

## Example

We will work with the <b>Pima Indians Diabetes</b> dataset.


This dataset comes from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to predict whether or not a patient has diabetes, based on a few medical variables.

The dataset has one response variable, Outcome, which tells whether or not the person has diabetes. The explanatory variables are:

- number of pregnancies
- BMI
- insulin level
- age
- glucose
- blood pressure
- skin thickness
- Diabetes Pedigree Function

### Installing XGBoost:

In [2]:
pip install xgboost

Collecting xgboost
  Downloading xgboost-1.7.5-py3-none-macosx_10_15_x86_64.macosx_11_0_x86_64.macosx_12_0_x86_64.whl (1.8 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.7.5
Note: you may need to restart the kernel to use updated packages.


### Classes and Functions:

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

### Loading and exploring the dataset:

This is the link to download the dataset:

https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database

In [2]:
df=pd.read_csv('diabetes.csv')

In [3]:
df.head()

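|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |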


### Splitting dataset into X's and y's:

In [4]:
# Response variable
y = df['Outcome']
# Explanatory variables: every column from Pregnancies through Age
X = df.loc[:, 'Pregnancies':'Age']

### Create train and test datasets:

In [5]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X,y, test_size=0.25, random_state=307)

### Train XGBoost Model:

In [11]:
xgb = XGBClassifier(booster = 'dart', max_depth = 15, subsample = .9)
xgb.fit(Xtrain, ytrain)

### Make predictions:

In [12]:
ypred = xgb.predict(Xtest)
ypred

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1])

### Test accuracy of predictions:

In [13]:
# evaluate predictions
accuracy = accuracy_score(ytest, ypred)
accuracy

0.78125
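The score above is simply the fraction of test observations whose predicted label matches the true label. A minimal sketch of that calculation follows; the toy labels are hypothetical and not taken from the dataset, while the 150 and 192 follow from the reported score and the 192 predictions printed earlier.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: two of the eight predictions are wrong
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]
toy_score = accuracy(y_true, y_pred)

# The test set has 192 rows, so a score of 0.78125 means 150 correct predictions
reported = 150 / 192
print(toy_score, reported)
```

Accuracy treats both kinds of error alike, so for a medical outcome it can be worth also checking how many diabetic patients were missed.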

Accuracy is about 78%, that is, 150 of the 192 test observations are classified correctly, which is a reasonable starting point. We will now tune the parameters with a grid search to see if we can improve the accuracy score.

### Tuning parameters

In [9]:
# Parameter grid to search: 3 boosters x 8 depths x 5 subsample rates
parameters = {
    "booster": ["dart", "gbtree", "gblinear"],
    "max_depth": [2, 4, 6, 8, 10, 12, 15, 20],
    "subsample": [0.2, 0.4, 0.6, 0.8, 0.9],
}

# Create the classifier
xgbc = XGBClassifier()

# Exhaustive search using scikit-learn's default 5-fold cross-validation
clf_GS = GridSearchCV(xgbc, parameters)
clf_GS.fit(Xtrain, ytrain)
clf_GS.best_params_

Parameters: { "max_depth", "subsample" } are not used.

{'booster': 'gblinear', 'max_depth': 20, 'subsample': 0.8}
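The warning printed during the search appears because gblinear has no trees, so max_depth and subsample have no effect on it; the max_depth and subsample values reported next to gblinear are therefore arbitrary. One possible way to avoid the redundant fits, sketched here as an assumption rather than what was run above, is to give GridSearchCV a list of grids so that the tree parameters are only paired with the tree boosters. The helper count_combinations is hypothetical and only counts the candidates.

```python
from itertools import product


def count_combinations(grid):
    """Count parameter combinations in one grid dict or a list of grid dicts."""
    grids = grid if isinstance(grid, list) else [grid]
    return sum(len(list(product(*g.values()))) for g in grids)


depths = [2, 4, 6, 8, 10, 12, 15, 20]
rates = [0.2, 0.4, 0.6, 0.8, 0.9]

# The grid as searched above
full_grid = {
    "booster": ["dart", "gbtree", "gblinear"],
    "max_depth": depths,
    "subsample": rates,
}

# Assumed alternative: tree parameters only for the tree-based boosters
split_grid = [
    {"booster": ["dart", "gbtree"], "max_depth": depths, "subsample": rates},
    {"booster": ["gblinear"]},
]

cv_folds = 5  # GridSearchCV's default number of folds
n_full = count_combinations(full_grid)    # 3 * 8 * 5 candidates
n_split = count_combinations(split_grid)  # 2 * 8 * 5 + 1 candidates
print(n_full * cv_folds, n_split * cv_folds)  # model fits needed by each search
```

A list of dictionaries like split_grid can be passed as the parameter grid in place of the single dictionary.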

In [21]:
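# Refit with the booster chosen by the grid search; gblinear ignores
# max_depth and subsample, so only the booster needs to be set
xgb2 = XGBClassifier(booster = 'gblinear')
xgb2.fit(Xtrain, ytrain)
# Predict with the tuned model xgb2, not with the first model xgb
ypred = xgb2.predict(Xtest)
accuracy = accuracy_score(ytest, ypred)
accuracy

### Conclusion

Our first XGBoost model correctly predicts about 78% of the test observations (accuracy 0.78125), a solid result with almost no tuning. The identical 0.78125 originally reported for the second model was computed with xgb.predict, that is, with the first model, so it does not measure the gblinear model selected by the grid search; the cell above must be run with xgb2.predict to obtain that score. It might also be useful to look into feature engineering to see if the score can be improved.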

### Sources:

<b> Dataset: </b>

https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database


<b> XGBoost Documentation and Information: </b>

https://xgboost.readthedocs.io/en/stable/

https://www.datacamp.com/tutorial/xgboost-in-python

https://towardsdatascience.com/xgboost-theory-and-practice-fb8912930ad6 

https://towardsdatascience.com/xgboost-python-example-42777d01001e