# PA5: Estimating Classifier Accuracy

- Programmer: Lydia Lonzarich
- Class: CPSC 322-01, Fall 2025
- Programming Assignment #5
- Date of current version: 11/6/2025
- Description: this notebook evaluates the classification accuracy of kNN and dummy classifiers using different methods of dataset splitting. 

In [35]:
# import sys
# print(sys.executable)

# import sys
# !{sys.executable} -m pip install scikit-learn

In [7]:
# some useful mysklearn package import statements and reloads
import importlib

import numpy as np

import mysklearn.myutils
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils
from mysklearn.myutils import random_subsample, cross_val_predict, bootstrap_method

import mysklearn.mypytable
importlib.reload(mysklearn.mypytable)
from mysklearn.mypytable import MyPyTable 

import mysklearn.myclassifiers
importlib.reload(mysklearn.myclassifiers)
from mysklearn.myclassifiers import MyKNeighborsClassifier, MyDummyClassifier

import mysklearn.myevaluation
importlib.reload(mysklearn.myevaluation)
import mysklearn.myevaluation as myevaluation
from mysklearn.myevaluation import confusion_matrix

from tabulate import tabulate

# Step 0: Load the dataset

In [8]:
filename = "auto-data-removed-NA.txt"
pytable = MyPyTable()
pytable.load_from_file(filename)

# convert all integer values to floats.
pytable.convert_to_numeric()

# convert the dataset to an array.
data = np.array(pytable.data, dtype=object)

# Step 1: Train / Test Sets: Random Sub-sampling
- In this step, I evaluate the performance of both the kNN and dummy classifiers on predicting DOE mpg ratings using random sub-sampling sampling strategy. I used the 'cylinders', 'weight', and 'acceleration' attributes and a randomized 2:1 train-test split ratio of the dataset, repeated k=10 times, to generate these predictions.
- Note: random subsampling == repeated random train/test splits, and is used to get a robust estimate of performance
- I assume this method for evaluating classifier accuracy will be least effective because we don't guarantee that all instances will appear in the train and test set at least once, nor that each instance will appear in the train/test set equally as often.

In [9]:
print("=============================")
print("STEP 1: Predictive Accuracy")
print("=============================")
print("Random Subsample (k=10, 2:1 Train/Test)")

# ** "get" X (data) and y (corresponding labels) out of the pytable.
# find indices of the 'cylinders', 'weight', 'acceleration', and 'mpg' column in the table.
cylinder_indices = pytable.column_names.index("cylinders")
weight_indices = pytable.column_names.index("weight")
acceleration_indices = pytable.column_names.index("acceleration")
mpg_indices = pytable.column_names.index("mpg")

# separate data into X (samples) and y (corresponding labels).
X = np.column_stack((data[:, cylinder_indices], data[:, weight_indices], data[:, acceleration_indices]))
y = data[:, mpg_indices]

# normalize features in X_train and X_test using min-max normalization.
min = X.min(axis=0)
max = X.max(axis=0)
X = (X - min) / (max - min)

# compute the avg acc and error rate for each train/test split of the data.
knn_acc, knn_err_rate = random_subsample(X, y, 10, lambda: MyKNeighborsClassifier(n_neighbors=5)) # lambda creates a fresh instance with parameters when I call classifier_class() in the random_subsample function in utils.py. 
dummy_acc, dummy_err_rate = random_subsample(X, y, 10, MyDummyClassifier)

print("k Nearest Neighbors Classifier: accuracy = ", knn_acc, ", error rate = ", knn_err_rate)
print("Dummy Classifier: accuracy = ", dummy_acc, ", error rate = ", dummy_err_rate)



STEP 1: Predictive Accuracy
Random Subsample (k=10, 2:1 Train/Test)
k Nearest Neighbors Classifier: accuracy =  0.37209302325581395 , error rate =  0.627906976744186
Dummy Classifier: accuracy =  0.09069767441860466 , error rate =  0.9093023255813953


# Step 2: Train / Test Sets: Cross Validation
- In this step, I compute the predictive acccuracy for both the kNN and dummy classifiers using k-fold cross validation with k=10.
- Note: I assume cross validation is a very reliable method estimating classifier accuracy becuase it splits the data and trains the model only on unseen instances.

Stratified cross validation commentary: 
- The stratified cross validation did not significantly improve the kNN classifier performance. Though it did slightly improve the dummy classifier accuracy.

In [12]:
print("=============================")
print("STEP 2: Predictive Accuracy")
print("=============================")
print("10-Fold Cross Validation")

# ** "get" X (data) and y (corresponding labels) out of the pytable.
# find indices of the 'cylinders', 'weight', 'acceleration', and 'mpg' column in the table.
cylinder_indices = pytable.column_names.index("cylinders")
weight_indices = pytable.column_names.index("weight")
acceleration_indices = pytable.column_names.index("acceleration")
mpg_indices = pytable.column_names.index("mpg")

# separate data into X (samples) and y (corresponding labels).
X = np.column_stack((data[:, cylinder_indices], data[:, weight_indices], data[:, acceleration_indices]))
y = data[:, mpg_indices]

# normalize features in X_train and X_test using min-max normalization.
min = X.min(axis=0)
max = X.max(axis=0)
X = (X - min) / (max - min)

# compute the avg acc and error rate for each train/test split of the data.
knn_acc, knn_err_rate, knn_y_trues, knn_y_preds = cross_val_predict(X, y, 10, lambda: MyKNeighborsClassifier(n_neighbors=5), False)
dummy_acc, dummy_err_rate, dummy_y_trues, dummy_y_preds = cross_val_predict(X, y, 10, MyDummyClassifier, False)

print("k Nearest Neighbors Classifier: accuracy = ", knn_acc, ", error rate = ", knn_err_rate)
print("Dummy Classifier: accuracy = ", dummy_acc, ", error rate = ", dummy_err_rate)


# compute the avg acc and error rate for each train/test split of the data using stratified cross validation.
knn_acc2, knn_err_rate2, knn_y_trues2, knn_y_preds2 = cross_val_predict(X, y, 10, lambda: MyKNeighborsClassifier(n_neighbors=5), True)
dummy_acc2, dummy_err_rate2, dummy_y_trues2, dummy_y_preds2 = cross_val_predict(X, y, 10, MyDummyClassifier, True)

print("k Nearest Neighbors Classifier with stratified cross validation: accuracy = ", knn_acc2, ", error rate = ", knn_err_rate2)
print("Dummy Classifier with stratified cross validation: accuracy = ", dummy_acc2, ", error rate = ", dummy_err_rate2)


STEP 2: Predictive Accuracy
10-Fold Cross Validation
k Nearest Neighbors Classifier: accuracy =  0.35 , error rate =  0.65
Dummy Classifier: accuracy =  0.05 , error rate =  0.95
k Nearest Neighbors Classifier with stratified cross validation: accuracy =  0.31109499637418414 , error rate =  0.6889050036258159
Dummy Classifier with stratified cross validation: accuracy =  0.13517041334300217 , error rate =  0.8648295866569979


# Step 3: Train / Test Sets: Bootstrap Method
- In this step, I compute the predictive accuracy and error rate for each classifier using the bootstrap method with k=10. 
- Note: I assume this method of creating train/test splits and evaluating the model on each split might give higher accuracy and lower error rate than the two previous methods because the training and testing data overlap == seen instances could be "reused" for testing. 

In [5]:
print("=============================")
print("STEP 3: Predictive Accuracy")
print("=============================")
print("k=10 Bootstrap Method")

# ** "get" X (data) and y (corresponding labels) out of the pytable.
# find indices of the 'cylinders', 'weight', 'acceleration', and 'mpg' column in the table.
cylinder_indices = pytable.column_names.index("cylinders")
weight_indices = pytable.column_names.index("weight")
acceleration_indices = pytable.column_names.index("acceleration")
mpg_indices = pytable.column_names.index("mpg")

# separate data into X (samples) and y (corresponding labels).
X = np.column_stack((data[:, cylinder_indices], data[:, weight_indices], data[:, acceleration_indices]))
y = data[:, mpg_indices]

# normalize features in X_train and X_test using min-max normalization.
min = X.min(axis=0)
max = X.max(axis=0)
X = (X - min) / (max - min)

# compute the avg acc and error rate for each train/test split of the data.
knn_acc, knn_err_rate = bootstrap_method(X, y, 10, lambda: MyKNeighborsClassifier(n_neighbors=5))
dummy_acc, dummy_err_rate = bootstrap_method(X, y, 10, MyDummyClassifier)

print("k Nearest Neighbors Classifier: accuracy = ", knn_acc, ", error rate = ", knn_err_rate)
print("Dummy Classifier: accuracy = ", dummy_acc, ", error rate = ", dummy_err_rate)


STEP 3: Predictive Accuracy
k=10 Bootstrap Method
k Nearest Neighbors Classifier: accuracy =  0.3320560354375474 , error rate =  0.6679439645624525
Dummy Classifier: accuracy =  0.10852384531400827 , error rate =  0.8914761546859917


# Step 4: Confusion Matrices
- In this step, I create a confusion matrix for each classifier based on the 10-fold stratified cross validation results.
- I used the tabulate package to display a pretty confusion matrix.

In [14]:
print("=============================")
print("STEP 4: Confusion Matrices")
print("=============================")

labels = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]

headers = ["MPG Ranking", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "Total", "Recognition (%)"]



print("kNN Classifier (Stratified 10-Fold Cross Validation Results):")
knn_matrix = confusion_matrix(knn_y_trues2, knn_y_preds2, labels) # ==> a list of lists.
knn_matrix = np.array(knn_matrix)
totals = knn_matrix.sum(axis=1) # get the totals for each class in the table.

# initialize a list to store the recognition % for each row. 
knn_recognition = [] 

# iterate through each row in the knn_matrix to calculate its recognition %.
# recognition: - another measure of how well the model predicted the true label. 
#              - we look at how close the diagonal values in the cm (row i, column i) is to the the total for a row i. 
for row in range(len(knn_matrix)):
    # if there are instances the current row, calcuate its recognition %: diagonal / row_total * 100 
    if totals[row] > 0:
        rec = knn_matrix[row, row] / totals[row] * 100
    else:
        rec = 0

    knn_recognition.append(rec)

# add the total counts and recognition (%) of each row to cm matrix. 
completed_knn_cm = []
for i, label in enumerate(labels):
    # for each row, append the data as: mpg ranking label | the actual data for each ranking | total count for the row | recognition percept for the row
    completed_knn_cm.append([label] + list(knn_matrix[i]) + [totals[i], round(knn_recognition[i], 1)])

# create the formatted cm. 
knn_cm_table = tabulate(completed_knn_cm, headers=headers, tablefmt="grid")
print(knn_cm_table)



print("-------------------------------------------------------------------")
print("Dummy Classifier (Stratified 10-Fold Cross Validation Results):")
dummy_matrix = confusion_matrix(dummy_y_trues, dummy_y_preds, labels)
dummy_matrix = np.array(dummy_matrix)
totals = dummy_matrix.sum(axis=1) # get the totals for each class in the table.

# initialize a list to store the recognition % for each row. 
dummy_recognition = [] 

# iterate through each row in the knn_matrix to calculate its recognition %.
for row in range(len(dummy_matrix)):
    # if there are instances the current row, calcuate its recognition %: diagonal / row_total * 100 
    if totals[row] > 0:
        rec = dummy_matrix[row, row] / totals[row] * 100
    else:
        rec = 0

    dummy_recognition.append(rec)

# add the total counts and recognition (%) of each row to cm matrix. 
completed_dummy_cm = []
for i, label in enumerate(labels):
    # for each row, append the data as: mpg ranking label | the actual data for each ranking | total count for the row | recognition percept for the row
    completed_dummy_cm.append([label] + list(dummy_matrix[i]) + [totals[i], round(dummy_recognition[i], 1)])

# create the formatted cm. 
dummy_cm_table = tabulate(completed_dummy_cm, headers=headers, tablefmt="grid")
print(dummy_cm_table)


STEP 4: Confusion Matrices
kNN Classifier (Stratified 10-Fold Cross Validation Results):
-------------------------------------------------------------------
Dummy Classifier (Stratified 10-Fold Cross Validation Results):
+---------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+---------+-------------------+
|   MPG Ranking |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |   10 |   Total |   Recognition (%) |
|             1 |   7 |  10 |  11 |   0 |   0 |   0 |   0 |   0 |   0 |    0 |      28 |              25   |
+---------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+---------+-------------------+
|             2 |   7 |   1 |   8 |   0 |   0 |   0 |   0 |   0 |   0 |    0 |      16 |               6.2 |
+---------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+---------+-------------------+
|             3 |  10 |  16 |   5 |   0 |   0 |   0 |   0 |   0 |   0 |    0 |      31 |              16.1 |
+---------------

# Mini Reflection
- The low accuracy of my classifiers in steps 1, 2, and 3 could be due to the fact that our prediction class (DOE MPG ratings) has 10 possible values (1-10). The model can struggle to predict this many class labels with only three features, especially given a model our models are very simple. 
- I did refer to AI in some areas of my PA. However, in all such cases, I did not copy/paste code, nor did rely on it for understanding. 