In [1]:
##############################################
# Programmer: Hannah Horn
# Class: CPSC 322, Fall 2024
# Programming Assignment #4
# Last Modified: 10/18/2024
# I did not attempt the bonus.
# 
# Description: This program implements three classifiers (simple linear regression, kNN, and dummy)
# and compares their accuracy on five randomly selected test instances.
##############################################

In [2]:
# some useful mysklearn package import statements and reloads
import importlib
import random
from mypytable import MyPyTable

import mysklearn.myutils
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

# uncomment once you paste your mypytable.py into mysklearn package
# import mysklearn.mypytable
# importlib.reload(mysklearn.mypytable)
# from mysklearn.mypytable import MyPyTable 

import mysklearn.myclassifiers
importlib.reload(mysklearn.myclassifiers)
from mysklearn.myclassifiers import MySimpleLinearRegressionClassifier,\
    MyKNeighborsClassifier,\
    MyDummyClassifier

# Part 2 🚗 Auto Classification 🚗

### Step 0: Train/Test Sets: Random Instances

The dataset, `auto-data-remove-NA copy.txt`, originally had 260 instances. After dropping five random rows to use as the test set, the dataset stored in `table` has 255 rows, which make up the training set.
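
To make the hold-out idea concrete, here is a minimal sketch of removing five random rows from a 260-row dataset. The function name `split_holdout` and the stand-in rows are hypothetical; the actual logic inside `myutils.load_and_prepare_train_data` is not shown in this notebook and may differ.

```python
import random

def split_holdout(rows, n_test, seed=None):
    """Hold out n_test random rows; return (train_rows, test_rows, test_indexes)."""
    rng = random.Random(seed)  # a local generator keeps the split reproducible
    test_indexes = sorted(rng.sample(range(len(rows)), n_test))
    held_out = set(test_indexes)
    train_rows = [row for i, row in enumerate(rows) if i not in held_out]
    test_rows = [rows[i] for i in test_indexes]
    return train_rows, test_rows, test_indexes

# stand-in data with the same number of instances as the auto dataset
rows = [[i, i * 2] for i in range(260)]
train_rows, test_rows, test_indexes = split_holdout(rows, 5, seed=49)
print(len(train_rows), len(test_rows))  # 255 5
```

Fixing the seed means every classifier can be evaluated on the same five held-out rows.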

In [3]:
table = myutils.load_and_prepare_train_data("auto-data-remove-NA copy.txt")

### Step 1: Train/Test Sets: Random Instances and Linear Regression

The first step is to set up the training data: pull the relevant columns and split them into X (weight) and y (mpg) values. Because `get_column` returns a 1D list and the `fit` method expects a 2D list, we reshape `X_values` into a list of single-element rows. We then fit the linear classifier.
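
For intuition, here is a rough sketch of what a simple linear regression classifier with a discretizer might do internally: fit a least-squares line, then pass each numeric prediction through a function that maps it to a label. The helper names and the toy `high_or_low` discretizer are assumptions for illustration, not the code in `mysklearn.myclassifiers`.

```python
def fit_simple_linear_regression(x_values, y_values):
    """Return the least-squares (slope, intercept) for 1D x and y lists."""
    mean_x = sum(x_values) / len(x_values)
    mean_y = sum(y_values) / len(y_values)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    denominator = sum((x - mean_x) ** 2 for x in x_values)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_labels(X_test, slope, intercept, discretizer):
    """Predict a numeric value per 2D test row, then discretize it to a label."""
    return [discretizer(slope * row[0] + intercept) for row in X_test]

def high_or_low(value):
    """Toy discretizer (hypothetical cutoff) standing in for a rating function."""
    return "high" if value >= 6 else "low"

# toy training data that follows y = 2x + 1, in the 2D shape fit expects
X_train = [[1], [2], [3], [4]]
y_train = [3, 5, 7, 9]
slope, intercept = fit_simple_linear_regression([row[0] for row in X_train], y_train)
predictions = predict_labels([[0], [5]], slope, intercept, high_or_low)
print(slope, intercept, predictions)  # 2.0 1.0 ['low', 'high']
```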

In [4]:
# get the data for the training set
X_values = table.get_column("weight")
y_values = table.get_column("mpg")

# get_column returns a 1D list, but the fit method expects a 2D list, so wrap each weight in its own list
X_values_reshaped = []
for weight in X_values:
    X_values_reshaped.append([weight])

# create instance of the MySimpleLinearRegressionClassifier
linear_classifier = MySimpleLinearRegressionClassifier(discretizer = myutils.doe_rating_assign)

# fit the classifier (understand the relationship between features and target values)
linear_classifier.fit(X_values_reshaped, y_values)

The next step is to set up the testing data. We first verify the five random indexes we are working with, then reload the original dataset into a new MyPyTable object so we have access to the rows that were dropped from the training dataset. As with the training data, we get the weight column and convert it to a 2D list. Next, we map the mpg column to the corresponding DOE ratings so we can compare the actual ratings against the classifier's predicted ratings. Finally, after calling the predict method, we calculate and print out the accuracy!
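
The accuracy step boils down to counting matching labels. Here is a minimal sketch, assuming accuracy means the percentage of exact matches; `accuracy_percent` is a hypothetical helper, not the `myutils.calculate_accuracy` implementation. The sample lists are the predicted and actual ratings printed for this run.

```python
def accuracy_percent(predicted, actual):
    """Return (number correct, accuracy as a percentage) for two parallel lists."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct, 100.0 * correct / len(actual)

correct, accuracy = accuracy_percent([2, 4, 7, 5, 4], [3, 4, 7, 5, 4])
print(correct, accuracy)  # 4 80.0
```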

In [6]:
X_test_indexes = myutils.select_random_instances_from_dataset(table.data, 5)
print("DEBUG: these is X_test_indexes:", X_test_indexes)

original_data = myutils.load_and_prepare_test_data("auto-data-remove-NA copy.txt")
original_weight = original_data.get_column("weight")

test_weights = []
for index in X_test_indexes:
    weight = original_weight[index]
    test_weights.append([weight])  # wrap each weight in a list so test_weights is 2D

original_mpg = original_data.get_column("mpg")

test_doe_mpg_rating = []
for index in X_test_indexes:
    mpg = original_mpg[index]
    rating = myutils.doe_rating_assign(mpg)
    test_doe_mpg_rating.append(rating)
print("These are the DOE rating for the test mpg instances: ", test_doe_mpg_rating)

model_predictions = linear_classifier.predict(test_weights)
print("These are the predicted DOE mpg ratings based on the test weights:", model_predictions)

# call calculate accuracy function
myutils.calculate_accuracy(model_predictions, test_doe_mpg_rating, X_test_indexes)

DEBUG: these is X_test_indexes: [154, 164, 38, 237, 241]
These are the DOE rating for the test mpg instances:  [3, 4, 7, 5, 4]
These are the predicted DOE mpg ratings based on the test weights: [2, 4, 7, 5, 4]
This is the number of correct predictions: 4
This is the accuracy of the classifier based on the test instances, 80.0%


### Step 2: Train/Test Sets: Random Instances and kNN

The first step for this classifier is to get the relevant columns (cylinders, weight, acceleration) that will be used to predict the DOE mpg ratings. We then normalize the three attributes and combine the normalized values into a 2D list with one row per instance. We extract the target (y) values by converting each mpg value to its DOE mpg rating. Finally, we fit the classifier with these values.
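
As an illustration of the normalize-and-combine step, here is a sketch that assumes min-max scaling to the range 0 to 1. The helper names are hypothetical, and the notebook does not show which scaling `myutils.normalize_train_attribute` actually uses.

```python
def min_max_normalize(values):
    """Scale a 1D list to [0, 1] using its own minimum and maximum."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:  # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(value - low) / span for value in values]

def combine_columns(*columns):
    """Zip parallel 1D columns into a 2D list with one row per instance."""
    return [list(row) for row in zip(*columns)]

# toy values, not taken from the auto dataset
cylinders = [4, 6, 8]
weights = [2000, 3000, 4000]
accelerations = [20.0, 15.0, 10.0]
X_train = combine_columns(
    min_max_normalize(cylinders),
    min_max_normalize(weights),
    min_max_normalize(accelerations),
)
print(X_train)  # [[0.0, 0.0, 1.0], [0.5, 0.5, 0.5], [1.0, 1.0, 0.0]]
```

Scaling matters for kNN because weight (in the thousands) would otherwise dominate the distance over cylinders and acceleration.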

In [14]:
# get columns 
cylinder_values = table.get_column("cylinders")
weight_values = table.get_column("weight")
acceleration_values = table.get_column("acceleration")

# normalize each feature
normalized_cylinder = myutils.normalize_train_attribute(cylinder_values)
normalized_weight = myutils.normalize_train_attribute(weight_values)
normalized_acceleration = myutils.normalize_train_attribute(acceleration_values)

# combine normalized attributes into a 2D list
combined_X_train = myutils.combine_normalized_attributes(normalized_cylinder, normalized_weight, normalized_acceleration)

# extract target (y_value) labels (DOE mpg ratings) that correspond to each instance
mpg_values = table.get_column("mpg")
y_train = []
for value in mpg_values:
    rating = myutils.DOE_rating_assign(value)
    y_train.append(rating)

# initialize and fit classifier
knn_classifier = MyKNeighborsClassifier()
knn_classifier.fit(combined_X_train, y_train)

The next step is to set up the testing data. We first reload the original data and then repeat the same process used for the training data (normalizing each attribute and combining the results into a 2D list).
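
One common way to normalize test instances is to reuse the minimum and maximum learned from the training data, so both sets are on the same scale. The sketch below assumes that approach; whether `myutils.normalize_test_attribute` works this way is not shown here, and the helper names are hypothetical.

```python
def fit_min_max(train_values):
    """Learn the scaling range from the training column only."""
    return min(train_values), max(train_values)

def apply_min_max(values, low, high):
    """Scale values with a previously learned range; results may fall outside [0, 1]."""
    span = high - low
    if span == 0:
        return [0.0 for _ in values]
    return [(value - low) / span for value in values]

# toy values, not taken from the auto dataset
train_weights = [2000, 3000, 4000]
test_weights = [2500, 4500]
low, high = fit_min_max(train_weights)
scaled_test = apply_min_max(test_weights, low, high)
print(scaled_test)  # [0.25, 1.25]
```

The second value exceeds 1 because that test weight is heavier than anything seen in training, which is expected with this scheme.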

In [15]:
original_table = myutils.load_and_prepare_test_data("auto-data-remove-NA copy.txt")
X_test_indexes = myutils.select_random_instances_from_dataset(table.data, 5)
print("DEBUG: this is X_test_indexes:", X_test_indexes)

# get columns
original_cylinder = original_table.get_column("cylinders")
original_weight = original_table.get_column("weight")
original_acceleration = original_table.get_column("acceleration")

# normalize each attribute
normalized_cylinder2 = myutils.normalize_test_attribute(original_cylinder, X_test_indexes)
normalized_weight2 = myutils.normalize_test_attribute(original_weight, X_test_indexes)
normalized_acceleration2 = myutils.normalize_test_attribute(original_acceleration, X_test_indexes)

# combine normalized attributes into a 2D list
combined_X_test = myutils.combine_normalized_attributes(normalized_cylinder2, normalized_weight2, normalized_acceleration2)

DEBUG: this is X_test_indexes: [154, 164, 38, 237, 241]


The final step is to find the five nearest neighbors of each test instance and call the classifier's predict method.
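
To show the idea behind this step, here is a compact kNN sketch that assumes Euclidean distance and a simple majority vote among the k nearest training rows. `knn_predict` is a hypothetical stand-in, not the `MyKNeighborsClassifier` implementation, and the toy data is made up.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=5):
    """Predict each test row's label by majority vote among its k nearest training rows."""
    predictions = []
    for test_row in X_test:
        # pair each distance with the training index so we can look up labels
        distances = [(math.dist(train_row, test_row), index)
                     for index, train_row in enumerate(X_train)]
        nearest = sorted(distances)[:k]
        votes = Counter(y_train[index] for _, index in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# two well-separated clusters of normalized points
X_train = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [1.0, 1.0], [0.9, 1.0], [1.0, 0.8]]
y_train = [3, 3, 3, 7, 7, 7]
predictions = knn_predict(X_train, y_train, [[0.05, 0.05], [0.95, 0.9]], k=3)
print(predictions)  # [3, 7]
```

When votes tie, `Counter.most_common` returns the label encountered first among the nearest neighbors; a real implementation may break ties differently.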

In [16]:
knn_classifier.kneighbors(combined_X_test, n_neighbors = 5)
predicted_ratings = knn_classifier.predict(combined_X_test)
print("These are the predicted DOE mpg ratings from knn:", predicted_ratings)

# get actual DOE mpg rating for test indexes
original_mpg = original_table.get_column("mpg")
test_actual_doe_rating_mpg = []
for index in X_test_indexes:
    mpg_value = original_mpg[index]
    rating = myutils.DOE_rating_assign(mpg_value)
    test_actual_doe_rating_mpg.append(rating)
print("this is the actual doe rating for each mpg in the test index:", test_actual_doe_rating_mpg)

myutils.calculate_accuracy(predicted_ratings, test_actual_doe_rating_mpg, X_test_indexes)

These are the predicted DOE mpg ratings from knn: [3, 4, 7, 5, 4]
this is the actual doe rating for each mpg in the test index: [3, 4, 7, 5, 4]
this is the number of correct predictions: 5
This is the accuracy of the classifier based on the test instances, 100.0%


### Step 3: Train/Test Sets: Random Instances and Dummy Classification

For the Dummy Classifier, we want to predict the DOE mpg rating. We first set up the training data by calling the function that loads the file into a new MyPyTable object and drops the testing rows from the data. We then create an instance of the class and fit with the data. 
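
Since the dummy classifier serves as a baseline, a short sketch may help: assuming a most-frequent-label strategy, fitting only records the most common training label and predicting repeats it for every test instance. The class name `MostFrequentClassifier` is hypothetical and is not the `MyDummyClassifier` code.

```python
from collections import Counter

class MostFrequentClassifier:
    """Baseline that ignores the features and always predicts the most common label."""

    def __init__(self):
        self.most_common_label = None

    def fit(self, X_train, y_train):
        # X_train is accepted for interface consistency but is not used
        self.most_common_label = Counter(y_train).most_common(1)[0][0]

    def predict(self, X_test):
        return [self.most_common_label for _ in X_test]

baseline = MostFrequentClassifier()
baseline.fit([[1], [2], [3], [4]], [4, 4, 5, 3])
baseline_predictions = baseline.predict([[10], [20]])
print(baseline_predictions)  # [4, 4]
```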

In [17]:
data = myutils.load_and_prepare_train_data("auto-data-remove-NA copy.txt")

# get the mpg column from the training data loaded above; the target (y) values are derived from it
X_values = data.get_column("mpg")

y_values = []
for value in X_values:
    rating = myutils.DOE_rating_assign(value)
    y_values.append(rating)

# create instance of MyDummyClassifier
dummy_classifier = MyDummyClassifier()
# fit the classifier (the dummy classifier only needs the most common label in y_values)
dummy_classifier.fit(X_values, y_values)

Next, we set up the testing data by reloading the original data into a new MyPyTable object and extracting the testing indexes. We get the mpg values at those indexes and call the predict method on them. Finally, we get the actual rating for each test mpg value and calculate the accuracy of the classifier.

In [18]:
data = myutils.load_and_prepare_test_data("auto-data-remove-NA copy.txt")
X_test_indexes = myutils.select_random_instances_from_dataset(table.data, 5)
print("this is X_test_indexes:", X_test_indexes)

original_mpg = data.get_column("mpg")

# now get the actual mpg value for the test index
test_mpg = []
for index in X_test_indexes:
    mpg = original_mpg[index]
    test_mpg.append(mpg)
print("this is after getting mpg for test index: ", test_mpg)

model_predictions = dummy_classifier.predict(test_mpg)
print("These are the predicted DOE mpg ratings based on the test mpg values:", model_predictions)

# now get the actual DOE rating for each test mpg value
test_rating = []
for mpg in test_mpg:
    rating = myutils.DOE_rating_assign(mpg)
    test_rating.append(rating)
print("this is the doe rating for each mpg in test index:", test_rating)

myutils.calculate_accuracy(model_predictions, test_rating, X_test_indexes)

this is X_test_indexes: [154, 164, 38, 237, 241]
this is after getting mpg for test index:  [16.0, 18.0, 30.0, 20.6, 18.2]
These are the predicted DOE mpg ratings based on the test mpg values: [4, 4, 4, 4, 4]
this is the doe rating for each mpg in test index: [3, 4, 7, 5, 4]
this is the number of correct predictions: 2
This is the accuracy of the classifier based on the test instances, 40.0%


### Step 4: Classifier Comparison: Linear Regression vs kNN vs Dummy

We have now implemented the three different classifiers on the same five random testing instances from the dataset and calculated their accuracy. Here are the results from the first run (random seed = 49):
1. `Linear Regression Classifier:` This classifier successfully predicted four out of the five instances, giving it an **80%** accuracy rating. 

2. `KNN Classifier:` This classifier successfully predicted all five of the testing instances, giving it a **100%** accuracy rating. 

3. `Dummy Classifier:` This classifier predicted just two out of the five instances, giving it a **40%** accuracy rating. 

Based on these accuracy ratings, the Linear Regression Classifier performed well, but the KNN Classifier could not be beaten, with a perfect 5/5 score. It doesn't come as a surprise that the Dummy Classifier performed the worst, as it just predicts the most common rating. 

I then adjusted the random seed and re-ran each of the classifiers to see how the results would vary. Here are the results from the next five runs:
1. Random Seed = 12
*   `Linear Regression Classifier:` 80% accuracy
*   `KNN Classifier:` 80% accuracy
*   `Dummy Classifier:` 0% accuracy

2. Random Seed = 1
*   `Linear Regression Classifier:` 40% accuracy
*   `KNN Classifier:` 100% accuracy
*   `Dummy Classifier:` 40% accuracy

3. Random Seed = 33
*   `Linear Regression Classifier:` 80% accuracy
*   `KNN Classifier:` 100% accuracy
*   `Dummy Classifier:` 20% accuracy

4. Random Seed = 76
*   `Linear Regression Classifier:` 60% accuracy
*   `KNN Classifier:` 60% accuracy
*   `Dummy Classifier:` 0% accuracy

5. Random Seed = 147
*   `Linear Regression Classifier:` 80% accuracy
*   `KNN Classifier:` 100% accuracy
*   `Dummy Classifier:` 40% accuracy

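As you can see, overall **KNN remained the most accurate classifier**: it matched or beat the other two classifiers in every run. The dummy classifier had the lowest (or tied for the lowest) score in every run, and the linear classifier scored 80% in three of the five runs. 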
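
To summarize the five seeded runs in one number per classifier, we can average the accuracies listed above. This is a small sketch; the dictionary layout and names are just one convenient way to hold the reported values.

```python
# accuracies (in percent) copied from the five seeded runs listed above
accuracy_by_seed = {
    12: {"linear": 80, "knn": 80, "dummy": 0},
    1: {"linear": 40, "knn": 100, "dummy": 40},
    33: {"linear": 80, "knn": 100, "dummy": 20},
    76: {"linear": 60, "knn": 60, "dummy": 0},
    147: {"linear": 80, "knn": 100, "dummy": 40},
}

mean_accuracy = {}
for name in ("linear", "knn", "dummy"):
    scores = [run[name] for run in accuracy_by_seed.values()]
    mean_accuracy[name] = sum(scores) / len(scores)

print(mean_accuracy)  # {'linear': 68.0, 'knn': 88.0, 'dummy': 20.0}
```

Across these five runs the averages are 68% for linear regression, 88% for kNN, and 20% for the dummy classifier.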

I think that we could improve the reliability of our comparisons by making the testing data more representative of the dataset rather than just a completely random sample of the data. 
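
One possible way to make the test set more representative is stratified sampling, where test instances are drawn from each rating group in proportion to its size. The sketch below is only an assumption about how that could look; `stratified_test_indexes` is a hypothetical helper and the labels are toy values, not the auto dataset.

```python
import random
from collections import defaultdict

def stratified_test_indexes(labels, test_fraction, seed=None):
    """Pick test indexes so each label contributes roughly test_fraction of its instances."""
    rng = random.Random(seed)
    indexes_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indexes_by_label[label].append(index)
    test_indexes = []
    for indexes in indexes_by_label.values():
        # take at least one instance per label so no rating is missing from the test set
        n_test = max(1, round(len(indexes) * test_fraction))
        test_indexes.extend(rng.sample(indexes, n_test))
    return sorted(test_indexes)

# toy ratings: ten 4s, six 5s, four 7s
labels = [4] * 10 + [5] * 6 + [7] * 4
test_indexes = stratified_test_indexes(labels, 0.5, seed=0)
test_labels = [labels[i] for i in test_indexes]
print(test_labels.count(4), test_labels.count(5), test_labels.count(7))  # 5 3 2
```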