# Introduction

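In this notebook, we train a LightGBM model to classify scikit-learn's handwritten digits dataset (1,797 images of 8×8 pixels, a small MNIST-style benchmark).

Although digit classification is traditionally a neural network task, LightGBM reaches 97.78% test accuracy without any hyperparameter tuning.

In general, early stopping can improve the test accuracy of a gradient-boosted model by halting training once the validation loss stops improving. In this notebook it leaves the accuracy unchanged at 97.78%.

Adding an endogenous feature, the horizontal Sobel filter response of each image, increases the accuracy of the model to 98.15%.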

# Import packages

In [1]:
from sklearn import datasets
import matplotlib.pyplot as plt
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Extract images and labels

In [2]:
digits = datasets.load_digits()
images = digits.images
targets = digits.target

# Flatten the images to make them usable for LightGBM

In [3]:
print("initial image dimensions are ", images.shape)
images = images.reshape(len(images), -1)  # flatten each 8x8 image into 64 features
print("final image dimensions are ", images.shape)

initial image dimensions are  (1797, 8, 8)
final image dimensions are  (1797, 64)
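
As a quick sanity check on the flattening step, the sketch below compares the reshaped array with `digits.data`, which scikit-learn ships as the already-flattened version of `digits.images`. The variable name `flat` is ours and does not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# -1 lets NumPy infer the 64 features per image from the 8x8 shape
flat = digits.images.reshape(len(digits.images), -1)

print(flat.shape)
print(np.array_equal(flat, digits.data))
```

If the two arrays match, `digits.data` could be used directly instead of reshaping by hand.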


# Train Test split

In [4]:
X_train, X_test, y_train, y_test = train_test_split(images, targets, test_size = 0.3, random_state = 42)

# Model fit

In [5]:
model = LGBMClassifier(objective='multiclass')
model.fit(X_train, y_train)
print(model.best_iteration_)  # None here, because no early stopping was configured

None


# Prediction on test set

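Accuracy is the fraction of test samples whose predicted label matches the true label: $\text{accuracy} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}[\hat{y}_i = y_i]$, where $n$ is the number of test samples, $y_i$ the true label and $\hat{y}_i$ the predicted label.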

In [6]:
# num_iteration = 80 uses only the first 80 of the 100 default boosting rounds
y_pred = model.predict(X_test,
                       start_iteration = 0,
                       num_iteration = 80)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy: %.2f%%" %(accuracy * 100.0))

Accuracy: 97.78%
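
To make the accuracy score concrete, here is a minimal sketch that computes it by hand and compares it with `accuracy_score`. The labels `y_true` and `y_hat` are hypothetical values chosen for illustration, not predictions from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, chosen only to illustrate the calculation
y_true = np.array([0, 1, 2, 2, 1, 0, 3, 3])
y_hat = np.array([0, 1, 2, 1, 1, 0, 3, 2])

# Fraction of positions where the predicted label equals the true label
manual_accuracy = np.mean(y_true == y_hat)

print(manual_accuracy, accuracy_score(y_true, y_hat))
```

Six of the eight hypothetical labels match, so both values should agree at 0.75.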


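A confusion matrix tabulates true labels (rows) against predicted labels (columns): entry (i, j) counts the test samples of digit i that were predicted as digit j, so correct predictions lie on the diagonal.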

In [7]:
confusion = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(confusion)

Confusion Matrix:
[[53  0  0  0  0  0  0  0  0  0]
 [ 0 49  1  0  0  0  0  0  0  0]
 [ 0  0 47  0  0  0  0  0  0  0]
 [ 0  0  0 53  0  0  0  0  1  0]
 [ 0  1  0  0 59  0  0  0  0  0]
 [ 0  0  0  0  1 64  1  0  0  0]
 [ 0  0  0  0  1  0 51  0  1  0]
 [ 0  0  0  0  0  0  0 55  0  0]
 [ 0  1  0  0  0  1  0  0 41  0]
 [ 1  0  0  0  0  0  0  1  1 56]]
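
The confusion matrix also gives the overall accuracy and shows which digits are hardest for the model. The sketch below copies the matrix printed above into a NumPy array and derives accuracy and per-class recall and precision from it; the variable names are ours.

```python
import numpy as np

# Confusion matrix copied from the printed output
# (rows: true digit, columns: predicted digit)
confusion = np.array([
    [53, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 49, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 47, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 53, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 59, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 64, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 51, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 55, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0, 41, 0],
    [1, 0, 0, 0, 0, 0, 0, 1, 1, 56],
])

correct = np.trace(confusion)   # diagonal entries are correct predictions
total = confusion.sum()         # number of test samples

# Recall: share of each true digit that was predicted correctly
recall = np.diag(confusion) / confusion.sum(axis=1)
# Precision: share of each predicted digit that was actually correct
precision = np.diag(confusion) / confusion.sum(axis=0)

print("Accuracy: %.2f%%" % (100.0 * correct / total))
print("Digit with lowest recall:", int(np.argmin(recall)))
```

The diagonal sums to 528 out of 540 test images, which reproduces the 97.78% reported above.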


# Additional work

## Early stopping

In [8]:
eval_set = [(X_train, y_train), (X_test, y_test)]
model2 = LGBMClassifier(
        objective = "multiclass",
        n_estimators = 500,
        max_depth = 10)

In [9]:
model2.fit(X_train, y_train,
          early_stopping_rounds = 10,
          eval_metric = ["logloss"],
          eval_set = eval_set)



[1]	training's multi_logloss: 1.65033	valid_1's multi_logloss: 1.66917
[2]	training's multi_logloss: 1.33124	valid_1's multi_logloss: 1.36523
[3]	training's multi_logloss: 1.1057	valid_1's multi_logloss: 1.15501
[4]	training's multi_logloss: 0.930502	valid_1's multi_logloss: 0.999874
[5]	training's multi_logloss: 0.790853	valid_1's multi_logloss: 0.87438
[6]	training's multi_logloss: 0.674841	valid_1's multi_logloss: 0.767337
[7]	training's multi_logloss: 0.578472	valid_1's multi_logloss: 0.681782
[8]	training's multi_logloss: 0.49821	valid_1's multi_logloss: 0.605986
[9]	training's multi_logloss: 0.429049	valid_1's multi_logloss: 0.540851
[10]	training's multi_logloss: 0.370026	valid_1's multi_logloss: 0.484845
[11]	training's multi_logloss: 0.319627	valid_1's multi_logloss: 0.438681
[12]	training's multi_logloss: 0.276665	valid_1's multi_logloss: 0.400871
[13]	training's multi_logloss: 0.239497	valid_1's multi_logloss: 0.36404
[14]	training's multi_logloss: 0.207179	valid_1's multi_l

In [10]:
print(model2.best_iteration_)

64


In [11]:
y_pred = model2.predict(X_test,
                       start_iteration = 0,
                       num_iteration = model2.best_iteration_)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" %(accuracy * 100))

Accuracy: 97.78%


## Adding an endogenous feature using a horizontal Sobel filter

In [12]:
from sklearn.feature_extraction import image
from skimage import filters
import numpy as np

In [13]:
digits = datasets.load_digits()
images = digits.images
targets = digits.target

In [14]:
patches_sobel_h = [filters.sobel_h(img).reshape(8*8) for img in images]

In [15]:
images = images.reshape(1797, 8*8)

In [16]:
print(images.shape)
# append the 64 Sobel responses to the 64 raw pixel values of each image
images = np.hstack([images, np.array(patches_sobel_h)])
print(images.shape)

(1797, 64)
(1797, 128)
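
To build some intuition for what the extra 64 columns contain, the sketch below applies `filters.sobel_h` to two synthetic 8×8 images. Both images are hypothetical test patterns of ours, not samples from the digits data. The horizontal Sobel filter responds to horizontal edges (intensity changes from one row to the next) and should give no response on an image that only varies from left to right.

```python
import numpy as np
from skimage import filters

# Hypothetical test pattern: dark top half, bright bottom half
horizontal_edge = np.zeros((8, 8))
horizontal_edge[4:, :] = 1.0

# Same pattern transposed: dark left half, bright right half
vertical_edge = horizontal_edge.T.copy()

resp_h = filters.sobel_h(horizontal_edge)  # responds along the edge rows
resp_v = filters.sobel_h(vertical_edge)    # no change between rows

print(np.abs(resp_h).max())
print(np.abs(resp_v).max())
```

For the digits, this means the added features emphasize horizontal strokes, such as the top bar of a 7 or the middle of a 4.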


In [17]:
X_train, X_test, y_train, y_test = train_test_split(images, targets, test_size = 0.3, random_state = 42)
eval_set = [(X_train, y_train), (X_test, y_test)]

In [18]:
model3 = LGBMClassifier(
        objective = "multiclass",
        n_estimators = 500,
        max_depth = 10)

In [19]:
model3.fit(X_train, y_train,
          early_stopping_rounds = 10,
          eval_metric = ['logloss'],
          eval_set = eval_set)



[1]	training's multi_logloss: 1.62017	valid_1's multi_logloss: 1.66044
[2]	training's multi_logloss: 1.28108	valid_1's multi_logloss: 1.33729
[3]	training's multi_logloss: 1.05234	valid_1's multi_logloss: 1.12397
[4]	training's multi_logloss: 0.878601	valid_1's multi_logloss: 0.960065
[5]	training's multi_logloss: 0.741819	valid_1's multi_logloss: 0.833048
[6]	training's multi_logloss: 0.629578	valid_1's multi_logloss: 0.727562
[7]	training's multi_logloss: 0.535723	valid_1's multi_logloss: 0.639983
[8]	training's multi_logloss: 0.456998	valid_1's multi_logloss: 0.565885
[9]	training's multi_logloss: 0.3899	valid_1's multi_logloss: 0.503595
[10]	training's multi_logloss: 0.33345	valid_1's multi_logloss: 0.449359
[11]	training's multi_logloss: 0.285108	valid_1's multi_logloss: 0.404238
[12]	training's multi_logloss: 0.244205	valid_1's multi_logloss: 0.365063
[13]	training's multi_logloss: 0.209513	valid_1's multi_logloss: 0.33176
[14]	training's multi_logloss: 0.17942	valid_1's multi_lo

In [20]:
print(model3.best_iteration_)

78


In [21]:
y_pred = model3.predict(X_test,
                       start_iteration = 0,
                       num_iteration = model3.best_iteration_)

In [22]:
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy: %.2f%%" % (accuracy * 100))

Accuracy: 98.15%


In [23]:
confusion = confusion_matrix(y_test, y_pred)
print("Confusion Matrix")
print(confusion)

Confusion Matrix
[[53  0  0  0  0  0  0  0  0  0]
 [ 0 49  1  0  0  0  0  0  0  0]
 [ 0  0 47  0  0  0  0  0  0  0]
 [ 0  0  2 51  0  0  0  0  1  0]
 [ 0  1  0  0 59  0  0  0  0  0]
 [ 0  0  0  1  1 64  0  0  0  0]
 [ 0  0  0  0  1  0 52  0  0  0]
 [ 0  0  0  0  0  0  0 55  0  0]
 [ 0  0  0  0  0  0  0  0 43  0]
 [ 0  0  0  1  0  0  0  0  1 57]]


# Conclusion

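We observe that the accuracy of the model can be improved by constructing endogenous features from the available features. In this example, adding the horizontal Sobel response raises the test accuracy from 97.78% to 98.15%, a gain of about 0.37 percentage points (530 instead of 528 correct predictions out of 540 test images).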

# References

https://towardsdatascience.com/hyperparameter-tuning-to-reduce-overfitting-lightgbm-5eb81a0b464e

Practical Gradient Boosting: A deep dive into Gradient Boosting in Python, Guillaume Saupin, 2022