## Experimenting with XGBoost on the Iris dataset

### 1. Load all required libraries / packages

In [11]:
# install.packages("xgboost")  # run once if the package is not installed yet
library(xgboost)
library(caret)
library(dplyr)

### 2. Load the dataset

In [13]:
set.seed(1000)
data(iris)
species = iris$Species
label = as.integer(iris$Species)-1
iris$Species = NULL
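
XGBoost's multiclass objectives expect class labels as integers starting at 0, which is why 1 is subtracted above. A minimal base-R sketch of the round trip follows; the `demo_*` names and the four-element factor are illustrative and not part of the notebook.

```r
# Illustrative only: demo_species is a small hypothetical factor, not the iris data.
demo_species <- factor(c("setosa", "virginica", "versicolor", "setosa"))

# as.integer() on a factor returns 1-based level codes; subtract 1 for 0-based labels.
demo_label <- as.integer(demo_species) - 1

# Adding 1 back and indexing into levels() recovers the original names.
demo_roundtrip <- levels(demo_species)[demo_label + 1]

print(demo_label)
print(demo_roundtrip)
```

The same `levels(...)[label + 1]` lookup is what step 8 relies on when it converts the test labels back to species names.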

### 3. Data Preparation: Split the data for training and testing

In [14]:
n=nrow(iris)
train.index = sample(n,floor(0.74*n))
train.data = as.matrix(iris[train.index,])
train.label = label[train.index]
test.data = as.matrix(iris[-train.index,])
test.label = label[-train.index]
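
Because the split is a simple random sample, the class proportions in the two partitions can drift apart. A quick sanity check is sketched below. It works on its own copy of the data and uses its own seed and `chk_*` names (all assumptions), so the counts will not necessarily match the notebook's run.

```r
# Sketch: re-create a 74/26 random split on a fresh copy of iris and inspect it.
chk_iris <- datasets::iris
set.seed(1000)  # assumed seed for this standalone sketch

chk_n <- nrow(chk_iris)
chk_train_index <- sample(chk_n, floor(0.74 * chk_n))

chk_train <- chk_iris[chk_train_index, ]
chk_test <- chk_iris[-chk_train_index, ]

# Row counts of each partition; together they cover every row exactly once.
print(c(train = nrow(chk_train), test = nrow(chk_test)))

# Class counts per partition reveal any imbalance introduced by sampling.
print(table(chk_train$Species))
print(table(chk_test$Species))
```

If the counts look lopsided, a stratified split (sampling within each species) is one possible alternative.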

### 4. Data Preparation: Create xgb.DMatrix objects

In [15]:
xgb.train.object = xgb.DMatrix(data = train.data,label=train.label)
xgb.test.object = xgb.DMatrix(data = test.data,label=test.label)

### 5. Data Preparation: Define parameters

In [16]:
# The multi:softprob objective tells the algorithm to calculate probabilities for every 
# possible outcome (in this case, a probability for each of the three flower species), for 
# every observation.

num_class = length(levels(species))
params = list(
  booster = "gbtree",
  eta = 0.001,
  max_depth = 5,
  gamma = 3,
  subsample = 0.74,
  colsample_bytree = 1,
  objective = "multi:softprob",
  eval_metric = "mlogloss",
  num_class = num_class
)
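
To make the `multi:softprob` output concrete: per-class raw scores are turned into probabilities with a softmax, so each observation gets `num_class` values that sum to 1. The base-R sketch below illustrates the idea. The `softmax` helper and the example scores are hypothetical and are not XGBoost internals.

```r
# Hypothetical helper: convert a vector of raw class scores into probabilities.
softmax <- function(scores) {
  shifted <- scores - max(scores)  # subtract the max for numerical stability
  exp(shifted) / sum(exp(shifted))
}

# Made-up raw scores for one observation across the three classes.
raw_scores <- c(setosa = 2.0, versicolor = 0.5, virginica = -1.0)
probs <- softmax(raw_scores)

print(round(probs, 3))
print(sum(probs))  # the probabilities sum to 1
```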

### 6. Model Development: Train the model

In [17]:
# Train the XGBoost classifier
xgb.fit=xgb.train(
  params=params,
  data=xgb.train.object,
  nrounds=10000,
  nthread=1,  # XGBoost spells this parameter "nthread"; "nthreads" is not a recognised option
  early_stopping_rounds=15,
  watchlist=list(val1=xgb.train.object,val2=xgb.test.object),
  verbose=0
)

In [18]:
# Review the final model and results
xgb.fit

##### xgb.Booster
raw: 3 Mb 
call:
  xgb.train(params = params, data = xgb.train.object, nrounds = 10000, 
    watchlist = list(val1 = xgb.train.object, val2 = xgb.test.object), 
    verbose = 0, early_stopping_rounds = 15, nthreads = 1)
params (as set within xgb.train):
  booster = "gbtree", eta = "0.001", max_depth = "5", gamma = "3", subsample = "0.74", colsample_bytree = "1", objective = "multi:softprob", eval_metric = "mlogloss", num_class = "3", nthreads = "1", silent = "1"
xgb.attributes:
  best_iteration, best_msg, best_ntreelimit, best_score, niter
callbacks:
  cb.evaluation.log()
  cb.early.stop(stopping_rounds = early_stopping_rounds, maximize = maximize, 
    verbose = verbose)
# of features: 4 
niter: 3017
best_iteration : 3002 
best_ntreelimit : 3002 
best_score : 0.256246 
nfeatures : 4 
evaluation_log:
    iter val1_mlogloss val2_mlogloss
       1      1.097309      1.097386
       2      1.095949      1.096094
---                                 
    3016      0.164426
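
The `mlogloss` values in the log are the mean negative log of the probability assigned to the true class. With three classes, a model that predicts 1/3 for every class scores log(3) ≈ 1.0986, which is why the first iterations start just below 1.1. The metric can be sketched in base R as follows; `mlogloss` here is a hypothetical helper, not the XGBoost implementation, and the labels are made up.

```r
# Hypothetical helper: multiclass log loss for a probability matrix
# (rows = observations, columns = classes) and 0-based true labels.
mlogloss <- function(prob, label) {
  # Pick, for each row, the probability in the column of its true class.
  true_class_prob <- prob[cbind(seq_along(label), label + 1)]
  -mean(log(true_class_prob))
}

# Uninformative predictions: every class gets probability 1/3.
uniform_prob <- matrix(1 / 3, nrow = 4, ncol = 3)
demo_labels <- c(0, 1, 2, 1)

baseline <- mlogloss(uniform_prob, demo_labels)
print(baseline)
print(log(3))
```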

### 7. Model Development: Predict new outcomes

In [19]:
# We can now predict outcomes for the test set that we set aside earlier.
# The predict function returns, for each observation in test.data,
# the probability of it belonging to each flower species.

# Predict outcomes with the test data
xgb.pred = predict(xgb.fit,test.data,reshape=TRUE)
xgb.pred = as.data.frame(xgb.pred)
colnames(xgb.pred) = levels(species)
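
In xgboost releases that accept the `reshape` argument, leaving it off makes `predict()` on a `multi:softprob` model return one flat vector in which each observation's class probabilities are stored consecutively. The sketch below uses a made-up vector (not real model output) to show the equivalent manual reshaping.

```r
# Made-up flat predictions for 2 observations x 3 classes, stored row by row.
flat_pred <- c(0.90, 0.07, 0.03,   # observation 1
               0.10, 0.25, 0.65)   # observation 2
n_class <- 3

# byrow = TRUE puts each observation's probabilities on its own row.
pred_matrix <- matrix(flat_pred, ncol = n_class, byrow = TRUE)
colnames(pred_matrix) <- c("setosa", "versicolor", "virginica")

print(pred_matrix)
```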

### 8. Evaluation: Identify the class with highest probability for each prediction

In [20]:
# For each prediction, pick the label (class) with the highest probability.
# Comparing these predicted labels with the actual labels lets us evaluate
# the performance of the model.

# The test labels are 0-based integers, so add 1 before indexing into the
# factor levels to convert them back to species names.

xgb.pred$prediction = apply(xgb.pred,1,function(x) colnames(xgb.pred)[which.max(x)])
xgb.pred$label = levels(species)[test.label+1]
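
The `apply()` call above works row by row. Base R's `max.col()` is a vectorised alternative for the same argmax step. The sketch below uses a small made-up probability matrix and sets `ties.method = "first"` so that ties are resolved deterministically, matching the behaviour of `which.max()`.

```r
# Made-up class probabilities for three observations.
demo_prob <- matrix(
  c(0.90, 0.07, 0.03,
    0.10, 0.25, 0.65,
    0.20, 0.60, 0.20),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("setosa", "versicolor", "virginica"))
)

# max.col() returns, for every row, the column index of the largest value.
demo_prediction <- colnames(demo_prob)[max.col(demo_prob, ties.method = "first")]
print(demo_prediction)
```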


### 9. Evaluation: Check Accuracy of predictions

In [21]:
# Calculate the accuracy of the predictions. This compares the true labels from the test data 
# set with the predicted labels (those with the highest probability) and gives the percentage 
# of flower species that were accurately predicted by the XGBoost model. In my runs on this 
# dataset, XGBoost consistently achieved an accuracy of at least 90%.

result = sum(xgb.pred$prediction == xgb.pred$label)/nrow(xgb.pred)
print(paste("Final prediction Accuracy is ",sprintf("%1.2f%%",100*result)))

[1] "Final prediction Accuracy is  94.87%"
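
A single accuracy figure hides which species are confused with each other. With 39 held-out rows (150 minus the 111 training rows), 94.87% corresponds to 37 correct predictions. A cross-tabulation shows where the misses fall. The vectors below are invented for illustration and are not the notebook's predictions.

```r
demo_levels <- c("setosa", "versicolor", "virginica")

# Invented actual and predicted labels for six observations.
demo_actual <- factor(
  c("setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"),
  levels = demo_levels
)
demo_predicted <- factor(
  c("setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"),
  levels = demo_levels
)

# Rows are predicted classes, columns are actual classes.
confusion <- table(predicted = demo_predicted, actual = demo_actual)
print(confusion)

# Accuracy is the share of counts on the diagonal.
demo_accuracy <- sum(diag(confusion)) / sum(confusion)
print(demo_accuracy)
```

In the notebook itself, the same `table()` call could take `xgb.pred$prediction` and `xgb.pred$label` as inputs.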
