Comparing a Stacking Ensemble Classifier with a Random Forest Classifier

In [2]:
# initial imports
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier
# global variables
seed = 42

In [3]:
# fetching the MNIST data (70,000 images with 784 pixel features each) as NumPy arrays
X_mnist, y_mnist = fetch_openml('mnist_784', return_X_y = True, as_frame = False)


In [4]:
# splitting the data into training (6/7) and test (1/7) sets
# and scaling the pixel values from [0, 255] to [0, 1]
X_train, y_train = X_mnist[:60_000]/255., y_mnist[:60_000]
X_test, y_test = X_mnist[60_000:]/255., y_mnist[60_000:]

Using principal component analysis (PCA) to reduce the dimensionality of the data while preserving 90% of the training set’s variance.

In [5]:
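# a float between 0 and 1 keeps as many components as needed to explain that share of the variance
pca = PCA(n_components = 0.9)
# fitting on the training set only, then applying the same projection to the test set
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)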
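
As a rough illustration of what a fractional `n_components` does, the sketch below uses scikit-learn's small bundled digits dataset (an assumption made to avoid downloading MNIST; it is not the data used in this notebook). The names `X_demo`, `pca_demo` and `k_expected` are hypothetical. It checks that the number of components kept is the smallest one whose cumulative explained variance ratio exceeds 0.9.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data: 8x8 digits bundled with scikit-learn (not MNIST)
X_demo, _ = load_digits(return_X_y=True)
X_demo = X_demo / 16.0  # pixel values in this dataset range from 0 to 16

# Keep enough components to explain 90% of the variance
pca_demo = PCA(n_components=0.9)
X_reduced = pca_demo.fit_transform(X_demo)

# Recompute the cut-off by hand from a PCA that keeps every component
full_ratios = PCA().fit(X_demo).explained_variance_ratio_
cumulative = np.cumsum(full_ratios)
# Smallest number of components whose cumulative ratio exceeds 0.9
k_expected = int(np.searchsorted(cumulative, 0.9, side="right") + 1)

print("components kept:", pca_demo.n_components_)
print("expected from cumulative variance:", k_expected)
print("variance explained:", pca_demo.explained_variance_ratio_.sum())
```

On the notebook's own data, the fitted `pca.n_components_` attribute reports how many of the 784 pixel dimensions survive the reduction.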


Training five classifiers on the training set: a Decision Tree with maximum depth equal to 10, a Random Forest with 50 estimators, an AdaBoost with 50 estimators, a LinearSVC with maximum iterations equal to 500, and a Logistic Regression with maximum iterations equal to 500.

Calculating the score (mean accuracy) of each one of the estimators on the test set.
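
For scikit-learn classifiers, `score` returns the mean accuracy, i.e. the fraction of test samples whose predicted label matches the true label. A minimal sketch of that computation, using made-up toy labels (hypothetical values, not MNIST predictions) and a hypothetical helper named `accuracy`:

```python
# Toy labels standing in for digit classes (hypothetical values, not MNIST output)
y_true = ["3", "1", "4", "1", "5", "9", "2", "6"]
y_pred = ["3", "1", "4", "7", "5", "9", "2", "0"]


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


acc = accuracy(y_true, y_pred)
print(f"accuracy: {acc:.4f}")
```

Six of the eight toy predictions match, so the helper returns 0.75.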

In [6]:
# Decision tree with maximum depth 10
dt = DecisionTreeClassifier(max_depth = 10, random_state = seed)

# Random Forest with 50 estimators
rf = RandomForestClassifier(n_estimators = 50, random_state = seed)

# AdaBoost with 50 estimators
ada = AdaBoostClassifier(n_estimators = 50, random_state = seed)

# LinearSVC with maximum iterations equal to 500
svc = LinearSVC(max_iter = 500, random_state = seed)

# Logistic Regression with maximum iterations equal to 500
logi = LogisticRegression(max_iter = 500, random_state = seed)

In [11]:
# making a list of (name, classifier) pairs
classifier_list = [("DecisionTreeClassifier", dt),
                   ("RandomForestClassifier", rf),
                   ("AdaBoostClassifier", ada),
                   ("LinearSVC", svc),
                   ("LogisticRegression", logi)]

In [12]:
# creating a dictionary that maps each classifier's name to its test score
scores = {}
for name, classifier in classifier_list:
    print("Training the", classifier)
    classifier.fit(X_train, y_train)
    scores[name] = classifier.score(X_test, y_test)


Training the DecisionTreeClassifier(max_depth=10, random_state=42)
Training the RandomForestClassifier(n_estimators=50, random_state=42)
Training the AdaBoostClassifier(random_state=42)
Training the LinearSVC(max_iter=500, random_state=42)
Training the LogisticRegression(max_iter=500, random_state=42)


In [13]:
# printing the scores
for clf_name, clf_score in scores.items():
    print(f"{clf_name}: {clf_score:.4f}")


DecisionTreeClassifier: 0.7970
RandomForestClassifier: 0.9468
AdaBoostClassifier: 0.7152
LinearSVC: 0.9114
LogisticRegression: 0.9195
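
To make the comparison easier to read, the scores can be ranked from best to worst. The sketch below copies the accuracies printed above into a hypothetical dictionary named `reported_scores` so that it runs without retraining anything.

```python
# Test-set accuracies copied from the printed output above
reported_scores = {
    "DecisionTreeClassifier": 0.7970,
    "RandomForestClassifier": 0.9468,
    "AdaBoostClassifier": 0.7152,
    "LinearSVC": 0.9114,
    "LogisticRegression": 0.9195,
}

# Sort from the highest to the lowest accuracy
ranking = sorted(reported_scores.items(), key=lambda item: item[1], reverse=True)
best_name, best_score = ranking[0]

for name, score in ranking:
    # Gap to the best individual model, in accuracy points
    print(f"{name:<24} {score:.4f}  (gap: {best_score - score:.4f})")
```

The Random Forest is the strongest individual model, followed by the two linear models; the single Decision Tree and AdaBoost trail by a wide margin.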


Combining the previous classifiers into a Stacking Ensemble classifier that uses 3-fold cross-validation and a Random Forest Classifier as the final classifier.

In [18]:
# stacking the classifiers into a stacking ensemble classifier
stack = StackingClassifier(
    estimators = classifier_list,
    final_estimator = RandomForestClassifier(random_state = seed),
    cv = 3)
stack.fit(X_train, y_train)



StackingClassifier(cv=3,
                   estimators=[('DecisionTreeClassifier',
                                DecisionTreeClassifier(max_depth=10,
                                                       random_state=42)),
                               ('RandomForestClassifier',
                                RandomForestClassifier(n_estimators=50,
                                                       random_state=42)),
                               ('AdaBoostClassifier',
                                AdaBoostClassifier(random_state=42)),
                               ('LinearSVC',
                                LinearSVC(max_iter=500, random_state=42)),
                               ('LogisticRegression',
                                LogisticRegression(max_iter=500,
                                                   random_state=42))],
                   final_estimator=RandomForestClassifier(random_state=42))
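
The following is a simplified sketch of the idea behind stacking, not scikit-learn's actual implementation: each base model produces out-of-fold predictions on the training set, those predictions become the features of the final estimator, and the base models are then refit on the whole training set for inference. It assumes the small bundled digits dataset instead of MNIST, only two base models, and `predict_proba` as the stacking output (`StackingClassifier` with its default `stack_method="auto"` falls back to `decision_function` for models such as `LinearSVC` that lack `predict_proba`). All variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: scikit-learn's bundled 8x8 digits (not MNIST)
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X / 16.0, y, random_state=42, stratify=y
)

base_models = [
    DecisionTreeClassifier(max_depth=10, random_state=42),
    LogisticRegression(max_iter=500, random_state=42),
]

# Step 1: out-of-fold class probabilities from each base model (3 folds),
# so no sample is predicted by a model that saw it during training
oof_blocks = [
    cross_val_predict(model, X_tr, y_tr, cv=3, method="predict_proba")
    for model in base_models
]
meta_train = np.hstack(oof_blocks)

# Step 2: the final estimator learns from the base models' predictions
final_model = RandomForestClassifier(random_state=42)
final_model.fit(meta_train, y_tr)

# Step 3: refit the base models on the whole training set for inference
for model in base_models:
    model.fit(X_tr, y_tr)
meta_test = np.hstack([model.predict_proba(X_te) for model in base_models])

manual_stack_score = final_model.score(meta_test, y_te)
print(f"manual stacking accuracy: {manual_stack_score:.4f}")
```

Training the final estimator on out-of-fold predictions rather than on in-sample predictions keeps it from simply trusting whichever base model overfits the training set most.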

In [19]:
# stacking score, formatted like the individual scores
stack_score = stack.score(X_test, y_test)
print(f"Stacking score: {stack_score:.4f}")


Stacking score: 0.9569


The stacking classifier outperforms every individual classifier with a score of 0.9569, just above the Random Forest Classifier, the best individual model, with a score of 0.9468.
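
To put the gap between the two best models in perspective, the short calculation below converts the reported accuracies into a number of test images and a relative error reduction. It uses only the scores printed above and the test set size of 10,000 images (70,000 minus the 60,000 used for training); the variable names are hypothetical.

```python
# Accuracies reported above; the test set holds 10,000 images (70,000 - 60,000)
stack_acc = 0.9569
rf_acc = 0.9468
n_test = 10_000

gain = stack_acc - rf_acc              # absolute accuracy gain
extra_correct = round(gain * n_test)   # additional images classified correctly
error_reduction = gain / (1 - rf_acc)  # share of the Random Forest's errors removed

print(f"absolute gain: {gain:.4f}")
print(f"extra correct predictions: {extra_correct}")
print(f"relative error reduction: {error_reduction:.1%}")
```

A gain of about one accuracy point corresponds to roughly 100 more correctly classified test images, or close to a fifth of the Random Forest's errors.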