# XGBoost - Lab

## Introduction

In this lab, we'll install the [XGBoost](http://xgboost.readthedocs.io/en/latest/index.html) library and use this popular boosting model to predict the quality of wine using the [Wine Quality Dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality) from the UCI Machine Learning Repository.

## Objectives

You will be able to:

- Fit, tune, and evaluate an XGBoost algorithm

## Installing XGBoost

Run this lab on your local computer.

XGBoost is not included in scikit-learn, so we'll have to install it ourselves using `pip`.

To install XGBoost, follow these steps:

1. Open up a new terminal window 
2. Activate your conda environment
3. Run `pip install xgboost`
4. Once the installation has completed, run the cell below to verify that everything worked 

In [1]:
# Install first (safe to re-run), then import to verify that the installation worked
! pip install xgboost
from xgboost import XGBClassifier
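
If the import above fails, it can help to check whether `xgboost` is visible to the environment your notebook kernel is actually using. The following is a minimal sketch that relies only on the standard library; the helper name `package_status` is made up for this example.

```python
from importlib import metadata, util


def package_status(name):
    """Describe whether a package is importable in the current environment."""
    # find_spec returns None when a top-level package cannot be located
    if util.find_spec(name) is None:
        return f"{name} is not installed in this environment"
    try:
        return f"{name} {metadata.version(name)} is installed"
    except metadata.PackageNotFoundError:
        return f"{name} is importable, but no distribution metadata was found"


status = package_status("xgboost")
print(status)
```

If the message says the package is not installed even though `pip install xgboost` succeeded, the terminal and the notebook are probably using different environments.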






Run the cell below to import everything we'll need for this lab. 

In [2]:
import pandas as pd
import numpy as np
np.random.seed(0)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

### Loading the Data

The dataset we'll be using for this lab is currently stored in the file `'winequality-red.csv'`.  

In the cell below, use pandas to import the dataset into a dataframe, and inspect the `.head()` of the dataframe to ensure everything loaded correctly. 

In [12]:
df = pd.read_csv("../data/winequality-red.csv")  # Move up one directory
df.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [None]:
# This is a multiclass classification problem: we are predicting the quality of each wine

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


For this lab, our target column will be `'quality'`. Because quality takes a small number of discrete integer values, this is a multiclass classification problem. Given the data in the columns from `'fixed acidity'` through `'alcohol'`, we'll predict the quality of the wine.

This means that we need to store our target variable separately from the features, and then split the features and labels into training and test sets that we can use to fit and evaluate the model.

### Splitting the Data

In the cell below:

- Assign the `'quality'` column to `y` 
- Drop this column (`'quality'`) and assign the resulting DataFrame to `X` 
- Split the data into training and test sets. Set the `random_state` to 42   

In [14]:
y = df["quality"]
X = df.drop("quality", axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
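
`train_test_split` was called without a `test_size`, so scikit-learn falls back to its default of holding out 25% of the rows for the test set. Here is a quick check of the arithmetic, assuming the 1,599 rows reported by `df.info()` and scikit-learn's behavior of rounding the test count up:

```python
import math

n_samples = 1599          # rows reported by df.info()
default_test_size = 0.25  # scikit-learn's default when no size is given

# The test count is rounded up; the training set gets the remaining rows
n_test = math.ceil(default_test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The training count should agree with what `y_train.shape` reports.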

### Preprocessing the Data

These are the current target values:

In [None]:
# Check how many training examples fall into each quality class
y_train.value_counts().sort_index()

quality
3      9
4     40
5    517
6    469
7    151
8     13
Name: count, dtype: int64

In [18]:
y_train.shape

(1199,)

XGBoost requires that classification categories be integers that count up from 0, not starting at 3. Therefore you should instantiate a `LabelEncoder` ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)) and convert both `y_train` and `y_test` into arrays containing label encoded values (i.e. integers that count up from 0).
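
To make the idea concrete, here is a plain-Python sketch of what label encoding does: sort the distinct labels, then map each one to its position. This is an illustration only, not scikit-learn's implementation, and the helper names `fit_label_mapping` and `encode` are made up for this example.

```python
def fit_label_mapping(labels):
    """Map each distinct label to its position in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


def encode(mapping, labels):
    """Replace each label with its encoded integer."""
    return [mapping[label] for label in labels]


# Toy training labels covering the quality values seen in this dataset
train_labels = [5, 5, 7, 5, 6, 3, 4, 8]

mapping = fit_label_mapping(train_labels)
encoded = encode(mapping, [5, 5, 7, 5, 6])

print(mapping)
print(encoded)
```

The mapping is learned from the training labels only and then reused for the test labels, which is why the lab calls `fit_transform` on `y_train` but only `transform` on `y_test`.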

In [17]:
# XGBoost requires class labels to be integers that count up from 0,
# so convert y_train and y_test into label-encoded values

le = LabelEncoder()

# Fit the encoder on the training labels and transform them
y_train_le = pd.Series(le.fit_transform(y_train))

# Transform the test labels using the mapping learned from the training labels
y_test_le = pd.Series(le.transform(y_test))

print(y_train_le)

0       2
1       2
2       4
3       2
4       3
       ..
1194    3
1195    3
1196    2
1197    4
1198    3
Length: 1199, dtype: int64



Confirm that the new values start at 0 instead of 3:

In [20]:
y_train_le.value_counts().sort_index()

0      9
1     40
2    517
3    469
4    151
5     13
Name: count, dtype: int64


### Building an XGBoost Model

Now that you have prepared the data for modeling, you can use XGBoost to build a model that can accurately classify wine quality based on the features of the wine!

The API for `xgboost` is purposefully written to mirror the same structure as other models in scikit-learn.  

#### How XGBoost Works

XGBoost works by iteratively adding models that correct the errors made by the existing ensemble. It builds the ensemble in a stage-wise fashion, choosing each new model to reduce a loss function. The primary components of the algorithm are:

* Initialization: Start with a constant initial prediction (for regression with squared error, typically the mean of the target values).

* Iterative Boosting: For each iteration:
    * Compute the gradient of the loss with respect to the current predictions. For squared error, this is the difference between the predicted value and the actual value. XGBoost also uses second-order (Hessian) information.
    * Fit a base learner (e.g., a decision tree) to the negative gradient.
    * Update the prediction by adding the new base learner's output multiplied by a learning rate.

* Regularization: Apply regularization to control model complexity and prevent overfitting.
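
The steps above can be sketched in a few lines of plain Python. This is a simplified illustration of generic gradient boosting for regression with squared error, using one-split "stumps" as base learners. It is not XGBoost's actual algorithm, which adds second-order information and a regularized objective. The function names and toy data are made up for this example.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump that minimizes squared error."""
    best = None
    # Try every threshold that leaves at least one point on each side
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error with stumps as base learners."""
    base = sum(ys) / len(ys)      # initialization: mean of the targets
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - pred)^2 is the residual y - pred
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each new learner's contribution by the learning rate
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


def mean_squared_error(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

model = boost(xs, ys)
mean_y = sum(ys) / len(ys)
mse_before = mean_squared_error(ys, [mean_y] * len(ys))
mse_after = mean_squared_error(ys, [model(x) for x in xs])
print(mse_before, mse_after)
```

Each round fits a stump to what the current ensemble still gets wrong, so the training error shrinks as rounds are added. That same mechanism is why an unregularized model can eventually fit noise in the training data.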

#### About XGBoost (Extreme Gradient Boosting)

* Speed and Performance: XGBoost is designed to be efficient and to handle large, high-dimensional datasets. It is known for its speed and predictive performance.

* Regularization: It includes L1 and L2 regularization parameters (`reg_alpha` and `reg_lambda`), which help to avoid overfitting.

* Parallelization: XGBoost parallelizes the search for splits within each tree, which speeds up training even though the trees themselves are added sequentially.

* Tree Pruning: By default, XGBoost grows each tree up to `max_depth` and then prunes back splits whose loss reduction falls below the `gamma` threshold.

* Handling Missing Values: XGBoost has a built-in routine for missing values: it learns a default direction for them at each split.

* Cross-Validation: The library includes a built-in cross-validation function (`xgboost.cv`) that reports evaluation metrics at each boosting round.
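
The regularization mentioned above can be written out explicitly. As a sketch following the notation used in the XGBoost documentation, for an ensemble of $K$ trees $f_1, \dots, f_K$ the training objective combines a loss term with a complexity penalty on each tree:

```latex
\text{obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} \lvert w_j \rvert
```

Here $l$ is the loss for a single example, $T$ is the number of leaves in a tree, and $w_j$ is the score assigned to leaf $j$. The constants $\gamma$, $\lambda$, and $\alpha$ correspond to the `gamma`, `reg_lambda`, and `reg_alpha` parameters.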


In [24]:
# Build an XGBoost model that classifies wine quality based on the wine's features

# Instantiate XGBClassifier and fit it to X_train and the label-encoded y_train_le
xgb_clf = XGBClassifier().fit(X_train, y_train_le)

# Make predictions on the training and test sets
training_preds = xgb_clf.predict(X_train)
test_preds = xgb_clf.predict(X_test)

In [26]:
# Measure the model's accuracy on the training and test sets
training_accuracy = accuracy_score(y_train_le, training_preds)

test_accuracy = accuracy_score(y_test_le, test_preds)

print("Training accuracy: {:.4}%".format(training_accuracy * 100))
print("Testing (validation) accuracy: {:.4}%".format(test_accuracy * 100))

Training accuracy: 100.0%
Testing (validation) accuracy: 68.0%



In [None]:
# Training accuracy: 100.0%
# Testing (validation) accuracy: 68.0%

# The large gap between training and test accuracy is a clear sign of overfitting:
# the model has fit the training data, including its noise, almost perfectly.


## Tuning XGBoost

The model performed far worse on the test set than on the training set (68.0% versus 100.0% accuracy), which indicates that it is overfitting to the training data. Let's tune the model to try to improve test performance and reduce overfitting.

You've already encountered a lot of parameters when working with Decision Trees, Random Forests, and Gradient Boosted Trees.

For a full list of model parameters, see the [XGBoost Documentation](http://xgboost.readthedocs.io/en/latest/parameter.html).

Examine the tunable parameters for XGBoost, and then fill in appropriate values for the `param_grid` dictionary in the cell below.

**_NOTE:_** Remember, `GridSearchCV` finds the best combination of parameters in the grid through an exhaustive combinatoric search. If you search through too many parameters, the search will take a very long time to run! To ensure your code runs in a reasonable amount of time, we restricted the number of values each parameter can take.

In [28]:
param_grid = {
    'learning_rate': [0.1, 0.2],
    'max_depth': [6],
    'min_child_weight': [1, 2],
    'subsample': [0.5, 0.7],
    'n_estimators': [100],
}
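
To see why the grid is kept small, it helps to count how many models the search has to train. The sketch below enumerates the combinations in the grid above. It assumes scikit-learn's defaults of 5-fold cross-validation (used when `cv=None`) and one final refit of the best combination on the full training set.

```python
from itertools import product

param_grid = {
    'learning_rate': [0.1, 0.2],
    'max_depth': [6],
    'min_child_weight': [1, 2],
    'subsample': [0.5, 0.7],
    'n_estimators': [100],
}

# Every combination of one value per parameter
combinations = list(product(*param_grid.values()))

n_folds = 5  # assumption: scikit-learn's default when cv=None
n_fits = len(combinations) * n_folds + 1  # +1 for the final refit

print(len(combinations), n_fits)
```

Adding just one more value to each of the five parameters would raise the number of combinations from 8 to 3 * 2 * 3 * 3 * 2 = 108, which is why the cost of a grid search grows so quickly.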

Now that we have constructed our `param_grid` dictionary, we can use `GridSearchCV` to tune our XGBoost model.

In the cell below:

* Create a `GridSearchCV` object. Pass in the following parameters:
    * `xgb_clf`, the classifier
    * `param_grid`, the dictionary of parameters we're going to grid search through
    * `scoring='accuracy'`
    * `cv=None` (scikit-learn's default, which is 5-fold cross-validation)
    * `n_jobs=1`
* Fit our `grid_clf` object and pass in `X_train` and `y_train_le`
* Store the best parameter combination found by the grid search in `best_parameters`. You can find these inside the grid search object's `.best_params_` attribute 
* Use `grid_clf` to create predictions for the training and test sets, and store them in separate variables 
* Compute the accuracy score for the training and test predictions  

In [31]:
grid_clf = GridSearchCV(xgb_clf, param_grid, scoring = 'accuracy', cv = None, n_jobs = 1)
grid_clf.fit(X_train, y_train_le)

# find the best parameters..
best_parameters = grid_clf.best_params_

print("The Grid Search found the following optimal parameters: ")
for param_name in sorted(best_parameters.keys()):
    print('%s:%r'%(param_name, best_parameters[param_name]))
    
training_preds =  grid_clf.predict(X_train)
testing_preds = grid_clf.predict(X_test)

training_accuracy = accuracy_score(y_train_le, training_preds)
test_accuracy = accuracy_score(y_test_le, testing_preds)

print("The Training Accuracy: {:.4}%".format(training_accuracy * 100))
print("The Validation Accuracy: {:.4}%".format(test_accuracy * 100))

The Grid Search found the following optimal parameters: 
learning_rate:0.1
max_depth:6
min_child_weight:1
n_estimators:100
subsample:0.7
The Training Accuracy: 99.83%
The Validation Accuracy: 68.25%
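
To put the two validation scores in perspective, it can help to convert them back into counts of correctly classified wines. This is a rough check that assumes the test set holds the 400 rows left over after the 1,199 training rows are taken from the 1,599 total:

```python
n_test = 1599 - 1199  # rows not used for training

# Convert the reported percentages back into numbers of correct predictions
baseline_correct = round(68.0 / 100 * n_test)
tuned_correct = round(68.25 / 100 * n_test)

print(baseline_correct, tuned_correct, tuned_correct - baseline_correct)
```

Under that assumption, the tuned model classifies only one more test wine correctly than the default model, so the two are effectively tied on this split.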



## Summary

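Great! You've now used one of the most powerful boosting models in data science. You've also seen how to tune the model using the grid search methodology we learned previously. Note that in this run the grid search barely changed validation accuracy (68.25% versus 68.0%) and the gap to training accuracy remained large, so a wider search or stronger regularization would be needed to meaningfully reduce overfitting. XGBoost is a powerful modeling tool to have in your arsenal. Don't be afraid to experiment with it!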