# PA7: Decision Trees
- Programmer: Lydia Lonzarich
- Class: CPSC 322-01, Fall 2025
- Programming Assignment #7
- Date of current version: 11/26/2025
- Description: this notebook implements a decision tree classifier using the TDIDT algorithm, selecting attributes to split on using entropy, and examines the effect of using different feature subsets for classification.
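
As a rough illustration of the entropy-based selection mentioned in the description, the sketch below picks the attribute whose split yields the lowest weighted entropy. The names `entropy` and `select_attribute` and the toy rows are hypothetical; they are not taken from `mysklearn`, whose implementation may differ.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def select_attribute(X, y, available):
    """Return (attribute index, weighted entropy) for the best split in `available`."""
    best_att, best_e_new = None, float("inf")
    for att in available:
        # group the class labels by the value this attribute takes
        partitions = {}
        for row, label in zip(X, y):
            partitions.setdefault(row[att], []).append(label)
        # weighted average of the partition entropies
        e_new = sum(len(p) / len(y) * entropy(p) for p in partitions.values())
        if e_new < best_e_new:
            best_att, best_e_new = att, e_new
    return best_att, best_e_new

# toy data (made up for illustration): columns are [odor, habitat]
X_toy = [["foul", "wood"], ["foul", "grass"], ["none", "wood"], ["none", "grass"]]
y_toy = ["poisonous", "poisonous", "edible", "edible"]
best, e_new = select_attribute(X_toy, y_toy, [0, 1])
print(best, e_new)
```

In this toy table, column 0 separates the classes perfectly, so its weighted entropy is zero and it is chosen over column 1.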

In [38]:
# some useful mysklearn package import statements and reloads
import importlib

import numpy as np

import mysklearn.myutils
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils
from mysklearn.myutils import cross_val_predict

import mysklearn.mypytable
importlib.reload(mysklearn.mypytable)
from mysklearn.mypytable import MyPyTable 

import mysklearn.myclassifiers
importlib.reload(mysklearn.myclassifiers)
from mysklearn.myclassifiers import MyDecisionTreeClassifier

import mysklearn.myevaluation
importlib.reload(mysklearn.myevaluation)
# import mysklearn.myevaluation as myevaluation
from mysklearn.myevaluation import confusion_matrix

from tabulate import tabulate

# Load the Dataset

### Notes on the Dataset
- Includes 500 instances
- Has 9 (categorical) predictive features, each with at most 3 values that it can take on
- The target class is 'label' and indicates whether a mushroom is EDIBLE or POISONOUS
- dataset features: 
    - cap-color: gray, brown, other
    - odor: foul, none, other
    - stalk-surface-above-ring: smooth, silky, other
    - stalk-surface-below-ring: smooth, silky, other
    - stalk-color-above-ring: pink, white, other
    - stalk-color-below-ring: pink, white, other
    - ring-type: pendant, evanescent, other
    - population: several, other
    - habitat: wood, grass, other


In [39]:
filename = "input_data/mushroom_reduced.csv"
pytable = MyPyTable()
pytable.load_from_file(filename)

# convert all integer values to floats.
pytable.convert_to_numeric()

# convert the dataset to an array.
data = np.array(pytable.data, dtype=object)

# Step 1: Using only the Odor Feature
- In this step, I create a decision tree classifier to predict the label of a mushroom, using a subset of the mushroom dataset built from only the odor feature. I test the classifier using stratified k-fold cross validation, with k=10.
- The purpose of this step is to give a baseline set of results. It demonstrates how accurately the model can predict whether a mushroom is poisonous using only one feature.
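
For readers unfamiliar with stratified k-fold cross validation, here is a minimal sketch of one way to build the folds. The function `stratified_kfold_indices` is hypothetical; the `cross_val_predict` helper in `mysklearn.myutils` may shuffle or assign instances differently. The class counts (254 edible, 246 poisonous) match the row totals of the confusion matrices in this notebook.

```python
from collections import defaultdict

def stratified_kfold_indices(y, n_splits=10):
    """Deal the indices of each class round-robin into n_splits folds,
    so every fold keeps roughly the class proportions of y."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    folds = [[] for _ in range(n_splits)]
    next_fold = 0
    for indices in by_class.values():
        for i in indices:
            folds[next_fold].append(i)
            next_fold = (next_fold + 1) % n_splits

    # each fold serves once as the test set; the remaining folds form the training set
    splits = []
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, test))
    return splits

y_demo = ["edible"] * 254 + ["poisonous"] * 246
splits = stratified_kfold_indices(y_demo, n_splits=10)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 450 50
```

Every instance appears in exactly one test fold, and each test fold holds 25 or 26 edible instances out of 50.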

In [40]:
# "get" X (data) and y (corresponding labels) out of the pytable.
# find indices of the 'odor' column in the table.
odor_indices = pytable.column_names.index("odor")
label_indices = pytable.column_names.index("label")

# separate data into X (samples) and y (corresponding labels).
X = [[x] for x in data[:, odor_indices]]
y = list(data[:, label_indices])

# compute the accuracy, error rate, precision, recall, and F1, each averaged over the 10 train/test splits of the data.
tree_acc, tree_err_rate, tree_precision, tree_recall, tree_f1, tree_y_trues, tree_y_preds = cross_val_predict(X, y, 10, MyDecisionTreeClassifier, True)

print("Decision Tree Classifier Results:")
print("accuracy = ", tree_acc)
print("error rate = ", tree_err_rate)
print("precision = ", tree_precision)
print("recall: ", tree_recall)
print("F1-score: ", tree_f1)

Decision Tree Classifier Results:
accuracy =  0.845278450363196
error rate =  0.15472154963680387
precision =  0.7891374978068526
recall:  0.9433333333333334
F1-score:  0.8580008749420515


# Step 2: Using Feature Subsets
- In this step, I create a decision tree classifier to predict the label of a mushroom using subsets of 2-5 features. I repeat this for three different subsets to compare classification performance across them.
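
Each subset cell repeats the same column lookup, so a small helper can express the idea once. This is only a sketch: `select_features` is a hypothetical name, and the toy table is made up for illustration.

```python
def select_features(column_names, rows, feature_names, label_name="label"):
    """Return (X, y) holding only the requested feature columns and the label column."""
    feature_idx = [column_names.index(name) for name in feature_names]
    label_idx = column_names.index(label_name)
    X = [[row[i] for i in feature_idx] for row in rows]
    y = [row[label_idx] for row in rows]
    return X, y

# toy table (values made up for illustration)
cols = ["cap-color", "odor", "habitat", "label"]
rows = [["gray", "foul", "wood", "poisonous"],
        ["brown", "none", "grass", "edible"]]
X_sub, y_sub = select_features(cols, rows, ["odor", "habitat"])
print(X_sub)  # [['foul', 'wood'], ['none', 'grass']]
print(y_sub)  # ['poisonous', 'edible']
```

With the objects loaded earlier, a call might look like `select_features(pytable.column_names, pytable.data, ["cap-color", "odor", "stalk-color-above-ring"])`.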

#### Subset 1
(cap-color + odor + stalk-color-above-ring)

In [41]:
# "get" X (data) and y (corresponding labels) out of the pytable.
# find indices of relevant columns in the table.
cap_color_indices = pytable.column_names.index("cap-color")
odor_indices = pytable.column_names.index("odor")
stalk_color_above_ring_indices = pytable.column_names.index("stalk-color-above-ring")
label_indices = pytable.column_names.index("label")

# separate data into X (samples) and y (corresponding labels).
X = np.column_stack((data[:, cap_color_indices], data[:, odor_indices], data[:, stalk_color_above_ring_indices]))
X = X.tolist()
y = list(data[:, label_indices])

# compute the accuracy, error rate, precision, recall, and F1, each averaged over the 10 train/test splits of the data.
tree_acc, tree_err_rate, tree_precision, tree_recall, tree_f1, tree_y_trues, tree_y_preds = cross_val_predict(X, y, 10, MyDecisionTreeClassifier, True)

print("=====================================================")
print("SUBSET 1 RESULTS...")
print("(cap-color + odor + stalk-color-above-ring)")
print("Decision Tree Classifier Results:")
print("accuracy = ", tree_acc)
print("error rate = ", tree_err_rate)
print("precision = ", tree_precision)
print("recall: ", tree_recall)
print("F1-score: ", tree_f1)
print("=====================================================")

print("\n")

print("=====================================================")
print("SUBSET 1 CONFUSION MATRIX...")
print("=====================================================")

labels = ["edible", "poisonous"]

headers = ["class", "edible", "poisonous", "Total", "Recognition (%)"]

print("Decision Tree Classifier (Stratified 10-Fold Cross Validation Results):")
tree_matrix = confusion_matrix(tree_y_trues, tree_y_preds, labels) # ==> a list of lists.
tree_matrix = np.array(tree_matrix)
totals = tree_matrix.sum(axis=1) # get the totals for each class in the table.

# initialize a list to store the recognition % for each row. 
tree_recognition = [] 

# iterate through each row of tree_matrix to calculate its recognition %.
# recognition: - the percentage of a class's instances that the model labeled correctly.
#              - computed by comparing the diagonal value of the cm (row i, column i) to the total for row i.
for row in range(len(tree_matrix)):
    # if there are instances in the current row, calculate its recognition %: diagonal / row_total * 100
    if totals[row] > 0:
        rec = tree_matrix[row, row] / totals[row] * 100
    else:
        rec = 0

    tree_recognition.append(rec)

# add the total counts and recognition (%) of each row to cm matrix. 
completed_tree_cm = []
for i, label in enumerate(labels):
    # for each row, append: class label | predicted counts for each class | total count for the row | recognition percent for the row
    completed_tree_cm.append([label] + list(tree_matrix[i]) + [totals[i], round(tree_recognition[i], 1)])

# create the formatted cm. 
tree_cm_table = tabulate(completed_tree_cm, headers=headers, tablefmt="grid")
print(tree_cm_table)

SUBSET 1 RESULTS...
(cap-color + odor + stalk-color-above-ring)
Decision Tree Classifier Results:
accuracy =  0.8357661708751298
error rate =  0.1642338291248703
precision =  0.8682907772304324
recall:  0.8016666666666667
F1-score:  0.8226902111066531


SUBSET 1 CONFUSION MATRIX...
Decision Tree Classifier (Stratified 10-Fold Cross Validation Results):
+-----------+----------+-------------+---------+-------------------+
| class     |   edible |   poisonous |   Total |   Recognition (%) |
+===========+==========+=============+=========+===================+
| edible    |      221 |          33 |     254 |              87   |
+-----------+----------+-------------+---------+-------------------+
| poisonous |       49 |         197 |     246 |              80.1 |
+-----------+----------+-------------+---------+-------------------+
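
As a quick check of the recognition column, the sketch below recomputes it from the four counts in the subset 1 matrix using plain Python. The function name `recognition_percent` is hypothetical. The pooled accuracy it prints is taken over all 500 predictions at once, so it can differ slightly from the accuracy reported above, which appears to be averaged per fold.

```python
# counts copied from the subset 1 confusion matrix (rows = actual, columns = predicted)
labels = ["edible", "poisonous"]
matrix = [[221, 33],
          [49, 197]]

def recognition_percent(matrix):
    """Per-class recognition: diagonal count / row total * 100 (0 for an empty row)."""
    result = []
    for i, row in enumerate(matrix):
        total = sum(row)
        result.append(round(row[i] / total * 100, 1) if total > 0 else 0.0)
    return result

recognition = recognition_percent(matrix)
# pooled accuracy: correct predictions on the diagonal / all predictions
overall_accuracy = sum(matrix[i][i] for i in range(len(matrix))) / sum(map(sum, matrix))
print(dict(zip(labels, recognition)))  # {'edible': 87.0, 'poisonous': 80.1}
print(overall_accuracy)                # 0.836
```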


#### Subset 2
(cap-color + habitat + population + stalk-color-below-ring)

- note: I expect this subset to give poorer performance metrics than subset 1 because it does not include the odor feature, which on its own reached about 84.5% accuracy in Step 1 and so appears to be a strong feature for mushroom classification in this dataset.

In [42]:
# "get" X (data) and y (corresponding labels) out of the pytable.
# find indices of relevant columns in the table.
cap_color_indices = pytable.column_names.index("cap-color")
habitat_indices = pytable.column_names.index("habitat")
population_indices = pytable.column_names.index("population")
stalk_color_below_ring_indices = pytable.column_names.index("stalk-color-below-ring")
label_indices = pytable.column_names.index("label")

# separate data into X (samples) and y (corresponding labels).
X = np.column_stack((data[:, cap_color_indices], data[:, habitat_indices], data[:, population_indices], data[:, stalk_color_below_ring_indices]))
X = X.tolist()
y = list(data[:, label_indices])

# compute the accuracy, error rate, precision, recall, and F1, each averaged over the 10 train/test splits of the data.
tree_acc, tree_err_rate, tree_precision, tree_recall, tree_f1, tree_y_trues, tree_y_preds = cross_val_predict(X, y, 10, MyDecisionTreeClassifier, True)

print("=====================================================")
print("SUBSET 2 RESULTS...")
print("(cap-color + habitat + population + stalk-color-below-ring)")
print("Decision Tree Classifier Results:")
print("accuracy = ", tree_acc)
print("error rate = ", tree_err_rate)
print("precision = ", tree_precision)
print("recall: ", tree_recall)
print("F1-score: ", tree_f1)
print("=====================================================")

print("\n")

print("=====================================================")
print("SUBSET 2 CONFUSION MATRIX...")
print("=====================================================")

labels = ["edible", "poisonous"]

headers = ["class", "edible", "poisonous", "Total", "Recognition (%)"]

print("Decision Tree Classifier (Stratified 10-Fold Cross Validation Results):")
tree_matrix = confusion_matrix(tree_y_trues, tree_y_preds, labels) # ==> a list of lists.
tree_matrix = np.array(tree_matrix)
totals = tree_matrix.sum(axis=1) # get the totals for each class in the table.

# initialize a list to store the recognition % for each row. 
tree_recognition = [] 

# iterate through each row of tree_matrix to calculate its recognition %.
# recognition: - the percentage of a class's instances that the model labeled correctly.
#              - computed by comparing the diagonal value of the cm (row i, column i) to the total for row i.
for row in range(len(tree_matrix)):
    # if there are instances in the current row, calculate its recognition %: diagonal / row_total * 100
    if totals[row] > 0:
        rec = tree_matrix[row, row] / totals[row] * 100
    else:
        rec = 0

    tree_recognition.append(rec)

# add the total counts and recognition (%) of each row to cm matrix. 
completed_tree_cm = []
for i, label in enumerate(labels):
    # for each row, append: class label | predicted counts for each class | total count for the row | recognition percent for the row
    completed_tree_cm.append([label] + list(tree_matrix[i]) + [totals[i], round(tree_recognition[i], 1)])

# create the formatted cm. 
tree_cm_table = tabulate(completed_tree_cm, headers=headers, tablefmt="grid")
print(tree_cm_table)

SUBSET 2 RESULTS...
(cap-color + habitat + population + stalk-color-below-ring)
Decision Tree Classifier Results:
accuracy =  0.7551712210307853
error rate =  0.24482877896921482
precision =  0.7796986711454942
recall:  0.7108333333333333
F1-score:  0.7400429505451034


SUBSET 2 CONFUSION MATRIX...
Decision Tree Classifier (Stratified 10-Fold Cross Validation Results):
+-----------+----------+-------------+---------+-------------------+
| class     |   edible |   poisonous |   Total |   Recognition (%) |
+===========+==========+=============+=========+===================+
| edible    |      203 |          51 |     254 |              79.9 |
+-----------+----------+-------------+---------+-------------------+
| poisonous |       71 |         175 |     246 |              71.1 |
+-----------+----------+-------------+---------+-------------------+


#### Subset 3
(odor + habitat + population + stalk-color-below-ring)

- note: I expect this subset to give better performance metrics than subset 2 because it uses the same features, except that cap-color is replaced by the odor feature.

In [43]:
# "get" X (data) and y (corresponding labels) out of the pytable.
# find indices of relevant columns in the table.
odor_indices = pytable.column_names.index("odor")
habitat_indices = pytable.column_names.index("habitat")
population_indices = pytable.column_names.index("population")
stalk_color_below_ring_indices = pytable.column_names.index("stalk-color-below-ring")
label_indices = pytable.column_names.index("label")

# separate data into X (samples) and y (corresponding labels).
X = np.column_stack((data[:, odor_indices], data[:, habitat_indices], data[:, population_indices], data[:, stalk_color_below_ring_indices]))
X = X.tolist()
y = list(data[:, label_indices])

# compute the accuracy, error rate, precision, recall, and F1, each averaged over the 10 train/test splits of the data.
tree_acc, tree_err_rate, tree_precision, tree_recall, tree_f1, tree_y_trues, tree_y_preds = cross_val_predict(X, y, 10, MyDecisionTreeClassifier, True)

print("=====================================================")
print("SUBSET 3 RESULTS...")
print("(odor + habitat + population + stalk-color-below-ring)")
print("Decision Tree Classifier Results:")
print("accuracy = ", tree_acc)
print("error rate = ", tree_err_rate)
print("precision = ", tree_precision)
print("recall: ", tree_recall)
print("F1-score: ", tree_f1)
print("=====================================================")

print("\n")

print("=====================================================")
print("SUBSET 3 CONFUSION MATRIX...")
print("=====================================================")

labels = ["edible", "poisonous"]

headers = ["class", "edible", "poisonous", "Total", "Recognition (%)"]

print("Decision Tree Classifier (Stratified 10-Fold Cross Validation Results):")
tree_matrix = confusion_matrix(tree_y_trues, tree_y_preds, labels) # ==> a list of lists.
tree_matrix = np.array(tree_matrix)
totals = tree_matrix.sum(axis=1) # get the totals for each class in the table.

# initialize a list to store the recognition % for each row. 
tree_recognition = [] 

# iterate through each row of tree_matrix to calculate its recognition %.
# recognition: - the percentage of a class's instances that the model labeled correctly.
#              - computed by comparing the diagonal value of the cm (row i, column i) to the total for row i.
for row in range(len(tree_matrix)):
    # if there are instances in the current row, calculate its recognition %: diagonal / row_total * 100
    if totals[row] > 0:
        rec = tree_matrix[row, row] / totals[row] * 100
    else:
        rec = 0

    tree_recognition.append(rec)

# add the total counts and recognition (%) of each row to cm matrix. 
completed_tree_cm = []
for i, label in enumerate(labels):
    # for each row, append: class label | predicted counts for each class | total count for the row | recognition percent for the row
    completed_tree_cm.append([label] + list(tree_matrix[i]) + [totals[i], round(tree_recognition[i], 1)])

# create the formatted cm. 
tree_cm_table = tabulate(completed_tree_cm, headers=headers, tablefmt="grid")
print(tree_cm_table)

SUBSET 3 RESULTS...
(odor + habitat + population + stalk-color-below-ring)
Decision Tree Classifier Results:
accuracy =  0.9272224143894846
error rate =  0.07277758561051538
precision =  0.9394605475040259
recall:  0.9133333333333333
F1-score:  0.924702923227678


SUBSET 3 CONFUSION MATRIX...
Decision Tree Classifier (Stratified 10-Fold Cross Validation Results):
+-----------+----------+-------------+---------+-------------------+
| class     |   edible |   poisonous |   Total |   Recognition (%) |
+===========+==========+=============+=========+===================+
| edible    |      239 |          15 |     254 |              94.1 |
+-----------+----------+-------------+---------+-------------------+
| poisonous |       21 |         225 |     246 |              91.5 |
+-----------+----------+-------------+---------+-------------------+


# Step 3: Print Decision Rules
- In this step, I print the decision rules inferred by my decision tree classifier when it is trained on the entire dataset using the four columns (i.e., features) chosen for subset 3.

In [44]:
# "get" X (data) and y (corresponding labels) out of the pytable.
# find indices of relevant columns in the table.
odor_indices = pytable.column_names.index("odor")
habitat_indices = pytable.column_names.index("habitat")
population_indices = pytable.column_names.index("population")
stalk_color_below_ring_indices = pytable.column_names.index("stalk-color-below-ring")
label_indices = pytable.column_names.index("label")

# separate data into X (samples) and y (corresponding labels).
X = np.column_stack((data[:, odor_indices], data[:, habitat_indices], data[:, population_indices], data[:, stalk_color_below_ring_indices]))
X = X.tolist()
y = list(data[:, label_indices])

# train tree on full dataset
tree = MyDecisionTreeClassifier()
tree.fit(X, y)

# best subset: 3
attribute_names = ["odor", "habitat", "population", "stalk-color-below-ring"] 
class_name = "label"

tree.print_decision_rules(attribute_names, class_name)

IF att0 == foul AND att3 == other AND att1 == grass THEN label = poisonous
IF att0 == foul AND att3 == other AND att1 == other AND att2 == other THEN label = poisonous
IF att0 == foul AND att3 == other AND att1 == other AND att2 == several THEN label = poisonous
IF att0 == foul AND att3 == other AND att1 == wood THEN label = poisonous
IF att0 == foul AND att3 == pink AND att1 == grass THEN label = poisonous
IF att0 == foul AND att3 == pink AND att1 == other AND att2 == other THEN label = poisonous
IF att0 == foul AND att3 == pink AND att1 == other AND att2 == several THEN label = poisonous
IF att0 == foul AND att3 == pink AND att1 == wood AND att2 == other THEN label = poisonous
IF att0 == foul AND att3 == pink AND att1 == wood AND att2 == several THEN label = poisonous
IF att0 == foul AND att3 == white AND att1 == grass AND att2 == other THEN label = poisonous
IF att0 == foul AND att3 == white AND att1 == grass AND att2 == several THEN label = poisonous
IF att0 == foul AND att3 == whi

#### Reflection: based on these decision rules, determine ways my trees can be pruned.
- Note: The rules below appear to be the only ones that reduce the number of decision rules in the raw output.
- Note: The goal of pruning is to simplify the decision rules by collapsing unnecessary or repetitive branches (branches that lead to the same classification).

- IF att0 == foul THEN label = poisonous
    - This collapses rules 1-14 together. In other words, we can replace rules 1-14 with this rule.

- IF att0 == none AND att3 == pink THEN label = edible
    - This collapses rules 17 and 18 together. In other words, we can replace rules 17 and 18 with this rule.

- IF att0 == none AND att3 == white AND att2 == other THEN label = edible
    - This collapses rules 19, 20, and 21 together. In other words, we can replace rules 19, 20, and 21 with this rule.
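
The collapsing described above can be sketched as a bottom-up pass that replaces any subtree whose leaves all share one label with a single leaf. The nested-dict tree layout and the names `prune` and `rules` are hypothetical; they do not reflect how `MyDecisionTreeClassifier` stores its tree, and the toy tree is made up for illustration.

```python
def prune(node):
    """Collapse any subtree whose leaves all predict the same label.

    A node is either a label string (leaf) or a dict of the form
    {"att": name, "children": {value: node}}. This layout is hypothetical.
    """
    if isinstance(node, str):
        return node
    children = {value: prune(child) for value, child in node["children"].items()}
    all_leaves = all(isinstance(child, str) for child in children.values())
    if all_leaves and len(set(children.values())) == 1:
        # every branch leads to the same label, so the split is unnecessary
        return next(iter(children.values()))
    return {"att": node["att"], "children": children}

def rules(node, conditions=()):
    """List the IF ... THEN rules of a tree, one per leaf."""
    if isinstance(node, str):
        lhs = " AND ".join(f"{att} == {val}" for att, val in conditions)
        return [f"IF {lhs} THEN label = {node}"]
    out = []
    for value, child in node["children"].items():
        out.extend(rules(child, conditions + ((node["att"], value),)))
    return out

# toy tree (made up): every leaf under att0 == foul is poisonous
toy = {"att": "att0", "children": {
    "foul": {"att": "att3", "children": {"pink": "poisonous", "white": "poisonous"}},
    "none": {"att": "att3", "children": {"pink": "edible", "white": "poisonous"}},
}}
pruned_rules = rules(prune(toy))
for rule in pruned_rules:
    print(rule)
```

On the toy tree, the two `att0 == foul` rules collapse into one, while the `att0 == none` branch is kept because its leaves disagree.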