Using XGBoost is easy. Maybe too easy, given that it's widely regarded as one of the best ML algorithms around right now for structured, tabular data.

To install it, just:

pip install xgboost

Let's experiment using the Iris data set. This data set includes the width and length of the petals and sepals of many Iris flowers, and the specific species of Iris each flower belongs to. Our challenge is to predict the species of a flower sample based only on the sizes of its petals and sepals. We'll revisit this data set later when we talk about principal component analysis too.


In [38]:
# Importing the 'load_iris' function from the 'sklearn.datasets' module.
# This function is used to load the Iris dataset, a classic dataset in machine learning and statistics.
from sklearn.datasets import load_iris

# Calling the 'load_iris' function to load the Iris dataset.
# The function returns an object that contains the dataset along with additional information.
iris = load_iris()

# The Iris dataset is stored in a structured format where 'iris.data' contains the data points.
# Each data point is a feature vector representing an Iris flower.
# 'iris.data.shape' returns a tuple where the first element is the number of samples (flowers)
# and the second element is the number of features (characteristics of each flower) in the dataset.
numSamples, numFeatures = iris.data.shape

# Printing the number of samples in the Iris dataset.
# This number represents how many individual flowers are included in the dataset.
print(numSamples)

# Printing the number of features in the Iris dataset.
# Each feature represents a specific measurement of the Iris flowers (sepal length, sepal width, petal length, petal width).
print(numFeatures)

# The Iris dataset includes labels for each data point, categorizing them into different types of Iris flowers.
# 'iris.target_names' contains the names of these categories or species.
# Here, we convert this array into a list for easier reading and print it out.
# This shows the different types of Iris flowers that the dataset classifies.
print(list(iris.target_names))

150
4
['setosa', 'versicolor', 'virginica']
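
If you want to see what those four features are and how many flowers of each species the data set holds, a short sketch like this one works. The variable names `feature_names` and `class_counts` are just illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# The names of the four measurement columns, in the order they appear in iris.data.
feature_names = list(iris.feature_names)
print(feature_names)

# iris.target holds an integer label (0, 1, or 2) per flower;
# bincount tallies how many flowers carry each label.
class_counts = np.bincount(iris.target)
for species, count in zip(iris.target_names, class_counts):
    print(species, count)
```

The data set is balanced, with 50 flowers per species, which is part of why plain accuracy is a reasonable metric here.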


Let's divide our data, reserving 20% for testing our model and using the remaining 80% to train it. By withholding the test data, we can make sure we're evaluating the model on new flowers it hasn't seen before. Typically we refer to our features (in this case, the petal and sepal measurements) as X, and the labels (in this case, the species) as y.


In [39]:
# Importing the train_test_split function from the sklearn.model_selection module.
# This function is used for splitting data arrays into two subsets: for training data and for testing data.
from sklearn.model_selection import train_test_split

# Splitting the dataset into training and testing subsets.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data,  # The feature vectors (independent variables) from the Iris dataset.
    iris.target,  # The labels (dependent variable) for each feature vector.
    test_size=0.2,  # Allocating 20% of the dataset for testing and the rest for training.
    random_state=0,  # Setting a seed for the random number generator for reproducibility.
)

# After this split:
# X_train contains the feature vectors for training the model.
# y_train contains the corresponding labels for X_train.
# X_test contains the feature vectors for testing the model.
# y_test contains the corresponding labels for X_test.
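
It can be worth checking what the split actually produced. The sketch below prints the shapes of the two subsets and the number of test flowers per species. It also shows the `stratify` argument of `train_test_split`, which the notebook does not use; it is included here only as an optional variation that keeps the species proportions equal in both subsets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# The same split as in the notebook.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)

# How many test flowers belong to each species; a purely random split
# is not guaranteed to be balanced.
print(np.bincount(y_test, minlength=3))

# Optional variation (not part of the original notebook): a stratified split
# preserves the class proportions of iris.target in both subsets.
_, _, _, y_test_stratified = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target
)
stratified_counts = np.bincount(y_test_stratified, minlength=3)
print(stratified_counts)
```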

Now we'll load up XGBoost and convert our data into the DMatrix format its native training API expects: one DMatrix for the training data, and one for the test data.


In [40]:
# Importing the xgboost library. XGBoost stands for eXtreme Gradient Boosting
# and is an efficient implementation of gradient boosted decision trees.
import xgboost as xgb

# Preparing the training data for XGBoost.
# The DMatrix is a data structure unique to XGBoost that optimizes both memory efficiency
# and training speed, which is especially beneficial for large datasets.
train = xgb.DMatrix(X_train, label=y_train)
# X_train: The feature vectors used for training the model.
# label=y_train: The labels corresponding to each training feature vector.

# Similarly, preparing the testing data for XGBoost.
test = xgb.DMatrix(X_test, label=y_test)
# X_test: The feature vectors used for testing the model.
# label=y_test: The labels corresponding to each testing feature vector.

# After executing this code, you'll have 'train' and 'test' datasets in a format
# that can be efficiently utilized by XGBoost for model training and testing.

Now we'll define our hyperparameters. We're choosing softmax since this is a multiclass classification problem, but the other parameters should ideally be tuned through experimentation.


In [41]:
# Defining the hyperparameters for the XGBoost model in a dictionary named 'param'.
param = {
    "max_depth": 4,  # 'max_depth' specifies the maximum depth of each tree. Deeper trees can learn relations very specific to particular training samples, so lower values help limit overfitting.
    "eta": 0.3,  # 'eta' is the learning rate. It shrinks the weights added on each boosting step, which makes the model more conservative and helps prevent overfitting.
    "objective": "multi:softmax",  # 'objective' specifies the learning task and the corresponding learning objective. Here, 'multi:softmax' is used for multiclass classification, and it returns predicted class labels (not probabilities).
    "num_class": 3,  # 'num_class' is set to 3, as there are three classes in the Iris dataset (setosa, versicolor, virginica).
}

# Setting the number of training iterations.
epochs = 10  # 'epochs' is passed to XGBoost as num_boost_round: the number of boosting rounds, each of which adds new trees to the ensemble. It's only loosely akin to the number of training epochs in deep learning.

Let's go ahead and train our model using these parameters as a first guess.


In [42]:
# Training the XGBoost model using the previously defined parameters and training dataset.
model = xgb.train(
    param,  # The dictionary of hyperparameters defined earlier.
    train,  # The DMatrix containing the training data (feature vectors and labels).
    epochs,  # The number of training iterations (boosting rounds).
)

# After executing this line, 'model' will be an XGBoost model trained on your data
# according to the specified hyperparameters and for the defined number of epochs.

Now we'll use the trained model to predict classifications for the data we set aside for testing. Each classification number we get back corresponds to a specific species of Iris.


In [43]:
# Making predictions using the trained XGBoost model on the test dataset.
predictions = model.predict(test)
# 'test' is the DMatrix containing the test data (feature vectors).

# The variable 'predictions' will now contain the predicted labels for each sample in the test dataset.
# These predictions can be compared against the actual labels ('y_test') to evaluate the model's performance.

In [44]:
# This line will display the array of predicted labels for each sample in the test dataset. Each element in this array corresponds to
# the predicted class for a test sample, as determined by the model.
print(predictions)
# This line will print the total number of predictions made, which should be equal to the number of samples in the test dataset.
print(len(predictions))

[2. 1. 0. 2. 0. 2. 0. 1. 1. 1. 2. 1. 1. 1. 1. 0. 1. 1. 0. 0. 2. 1. 0. 0.
 2. 0. 0. 1. 1. 0.]
30


Let's measure the accuracy on the test data...


In [45]:
# Importing the accuracy_score function from sklearn.metrics.
# This function is used to compute the accuracy of a classification model.
from sklearn.metrics import accuracy_score

# Calculating the accuracy of the model.
# Accuracy is the ratio of correctly predicted observations to the total observations.
accuracy = accuracy_score(y_test, predictions)
# y_test: The actual labels of the test dataset.
# predictions: The predicted labels generated by the XGBoost model.

# The variable 'accuracy' now holds the accuracy score of the model.
# It represents the proportion of correct predictions in all predictions made.

In [46]:
print(accuracy)

1.0


Holy crow! It's perfect, and that's just with us guessing as to the best hyperparameters! Keep in mind that the test set is only 30 flowers, though.

Normally I'd have you experiment to find better hyperparameters as an activity, but you can't improve on those results. Instead, see what it takes to make the results worse! How few epochs (iterations) can you get away with? How low can you set max_depth? Basically, try to optimize the simplicity and performance of the model, now that you already have perfect accuracy.
