# Artificial Intelligence
# 464/664
# Assignment #7

## General Directions for this Assignment

00. We're using a Jupyter Notebook environment (tutorial available here: https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html),
01. Output format should be exactly as requested (it is your responsibility to make sure the notebook looks as expected on Gradescope),
02. Check submission deadline on Gradescope,
03. Rename the file to Last_First_assignment_7,
04. Submit your notebook (as .ipynb, not PDF) using Gradescope, and
05. Do not submit any other files.

## Before You Submit...

1. Re-read the general instructions provided above, and
2. Hit "Kernel"->"Restart & Run All".

## Neural Networks: Architecture

For this assignment we will explore Neural Networks; in particular, we are going to explore model complexity. We will use the same dataset from Assignment #6 to classify a mushroom as either edible ('e') or poisonous ('p'). You are free to use PyTorch, TensorFlow, or scikit-learn, to name a few resources. The goal is to explore different model complexities (architectures) before declaring a winner. Either start with a simple network and make it more complex, or start with a complex model and pare it down. Either way, your submission should clearly demonstrate your exploration.


Your output for each model should look like the output of `cross_validate` from Assignment #6:

```
Fold: 0	Train Error: 15.38%	Validation Error: 0.00%
Fold: 1
...

Mean(Std. Dev.) over all folds:
-------------------------------
Train Error: 100.00%(0.00%) Test Error: 100.00%(0.00%)
```

Notice that "Test Error" has been replaced by "Validation Error." Split your dataset into train, test, and validation sets.
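As a rough illustration of the three-way split described above, the sketch below shuffles row indices once and cuts them into disjoint train, validation, and test index sets. The function name `three_way_split`, the 70/15/15 proportions, and the seed are assumptions made for this example, not requirements of the assignment.

```python
import numpy as np

def three_way_split(n_samples, val_frac=0.15, test_frac=0.15, seed=13):
  """Return shuffled, disjoint index arrays for train, validation, and test."""
  rng = np.random.default_rng(seed)
  order = rng.permutation(n_samples)        # shuffle once so the three sets cannot overlap
  n_test = int(round(n_samples * test_frac))
  n_val = int(round(n_samples * val_frac))
  test_idx = order[:n_test]
  val_idx = order[n_test:n_test + n_val]
  train_idx = order[n_test + n_val:]        # everything left over is training data
  return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The test indices would be set aside until the very end; only the train and validation indices are used while comparing architectures.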


Start with a simple network. Train using the train set. Observe the model's performance using the validation set.


Increase the complexity of your network. Train using the train set. Observe the model's performance using the validation set.


In Assignment #6, model complexity was the depth limit. You can think of it here as the architecture of the network (number of layers and units per layer). Try at least three different network architectures.


We're trying to find a model complexity that generalizes well. (Recall the high-bias vs. high-variance discussion in class.)


Pick the network architecture that you deem best. Use the test set to report your winning model's performance. This is the ONLY time you use the test set.


Try at least three different models; more importantly, document your process: what the results were, how the winning model was determined, and what the winning model's performance was on the test data. Clearly highlight these items to receive full credit.

In [1]:
# Import necessary libraries
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
from typing import List
import numpy as np
import os

# Suppress TensorFlow info and warning messages
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

# Load and preprocess the mushroom dataset
def load_and_preprocess_data(filepath='agaricus-lepiota.data'):
  # Define column names
  columns = ['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
              'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
              'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
              'stalk-surface-below-ring', 'stalk-color-above-ring',
              'stalk-color-below-ring', 'veil-type', 'veil-color',
              'ring-number', 'ring-type', 'spore-print-color',
              'population', 'habitat']
  
  # Load data
  df = pd.read_csv(filepath, names=columns)
  
  # Separate features and target
  X = df.drop('class', axis=1)
  y = df['class'].apply(lambda x: 1 if x == 'e' else 0)  # Encode target: edible=1, poisonous=0
  
  # One-hot encode features
  X_encoded = pd.get_dummies(X)
  
  return X_encoded, y

# Partition data into n contiguous folds whose sizes differ by at most one
def create_folds(data: List, n: int) -> List[List]:
  k, m = divmod(len(data), n)
  return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

# Split the data into a training set and a held-out test set.
# Validation sets are carved out of the training set later, inside cross_validate.
def split_data(X, y):
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=13)
  return X_train, X_test, y_train, y_test

# Define a custom dense layer with a built-in sigmoid activation
class MyDenseLayer(tf.keras.layers.Layer):
  def __init__(self, input_dim, output_dim):
    super(MyDenseLayer, self).__init__()
    self.W = self.add_weight(shape=(input_dim, output_dim), initializer='random_normal')
    self.b = self.add_weight(shape=(output_dim,), initializer='zeros')

  def call(self, inputs):
    z = tf.matmul(inputs, self.W) + self.b
    output = tf.math.sigmoid(z)
    return output
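To show what `create_folds` produces, here is a small standalone sketch on toy data. It repeats the helper so that it runs on its own; the ten-item list is a made-up input, and the loop mirrors how one fold is held out for validation while the others are merged for training.

```python
from typing import List

def create_folds(data: List, n: int) -> List[List]:
  k, m = divmod(len(data), n)
  # The first m folds receive one extra item, so fold sizes differ by at most one
  return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

folds = create_folds(list(range(10)), 3)
print(folds)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

for i, validation in enumerate(folds):
  # Merge every fold except the i-th into the training portion
  training = [x for j, fold in enumerate(folds) if j != i for x in fold]
  print(i, validation, training)
```

Because the folds are contiguous slices, the data should already be shuffled before it is passed in; here `train_test_split` does that shuffling upstream.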


In [2]:
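# Run k-fold cross-validation: each fold is held out once for validation
# while a fresh model is trained on the remaining folds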
def cross_validate(get_model, X, y, n_folds=5):
  Xy = list(zip(X.to_numpy(), y.to_numpy()))
  folds = create_folds(Xy, n_folds)
  train_errors, validation_errors = [], []
  for i, fold in enumerate(folds):
    # Convert validation data to numpy arrays
    X_validate, y_validate = map(list, zip(*fold))  # Convert to lists first
    X_validate = np.array(X_validate)
    y_validate = np.array(y_validate)
    
    # Initialize training data
    X_train_data = []
    y_train_data = []
    
    # Collect training data from other folds
    for j, fold2 in enumerate(folds):
      if i == j: continue
      X_fold, y_fold = map(list, zip(*fold2))  # Convert to lists first
      X_train_data.extend(X_fold)
      y_train_data.extend(y_fold)
    
    # Convert training data to numpy arrays
    X_train = np.array(X_train_data)
    y_train = np.array(y_train_data)
    
    # Train and evaluate
    model = get_model()
    history = model.fit(X_train, y_train, epochs=10, validation_data=(X_validate, y_validate), verbose=0)
    train_acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    train_error_100 = (1 - train_acc[-1]) * 100
    validation_error_100 = (1 - val_acc[-1]) * 100
    train_errors.append(train_error_100)
    validation_errors.append(validation_error_100)
    print(f"Fold {i}: Training Error: {train_error_100:.2g}%\tValidation Error: {validation_error_100:.2g}%")
  print("Mean(Std. Dev.) over all folds:")
  print("-------------------------------")
  print(f"Train Error: {np.mean(train_errors):.2f}%({np.std(train_errors):.2f}%)\tValidation Error: {np.mean(validation_errors):.2g}%({np.std(validation_errors):.2g}%)")
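The per-fold lines and the validation summary above are printed with the `.2g` format (two significant digits), while the training summary uses `.2f` (two decimal places). The short sketch below contrasts the two and shows a hypothetical helper, `fold_line`, that follows the layout in the assignment's sample output; the helper is an illustration only and is not part of the notebook.

```python
# Compare two-significant-digit and two-decimal formatting on sample error values
pairs = [(f"{e:.2g}", f"{e:.2f}") for e in (9.49, 47.25, 0.0527)]
for two_sig, two_dec in pairs:
  print(f"{two_sig}%  vs  {two_dec}%")

def fold_line(fold, train_error, validation_error):
  """Format one fold in the 'Fold: i / Train Error / Validation Error' layout."""
  return f"Fold: {fold}\tTrain Error: {train_error:.2f}%\tValidation Error: {validation_error:.2f}%"

print(fold_line(0, 15.38, 0.0))
```

With `.2g`, an error of 0.0527% prints as `0.053%` but 47.25% prints as `47%`, so small and large errors keep different numbers of decimals.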


In [3]:
X, y = load_and_preprocess_data()
X_train, X_test, y_train, y_test = split_data(X, y)

# Convert data to float32
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Declare Models

In [4]:
def small_model():
  model = tf.keras.Sequential([
    MyDenseLayer(input_dim=X_train.shape[1], output_dim=4),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])
  model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
  return model

def medium_model():
  model = tf.keras.Sequential([
    MyDenseLayer(input_dim=X_train.shape[1], output_dim=16),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])
  model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
  return model

def large_model():
  model = tf.keras.Sequential([
    MyDenseLayer(input_dim=X_train.shape[1], output_dim=32),
    MyDenseLayer(input_dim=32, output_dim=32),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])
  model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
  return model
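One way to quantify the complexity of the three architectures above is to count trainable parameters (weights plus biases). The sketch below does this from the layer widths alone. The value `n_features = 117` is an assumption about the number of one-hot columns and should be replaced by `X_train.shape[1]` in the notebook.

```python
def dense_param_count(layer_sizes):
  """Weights plus biases for a stack of fully connected layers."""
  return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

n_features = 117  # assumed input width; use X_train.shape[1] for the real value

architectures = {
  "small": [n_features, 4, 1],
  "medium": [n_features, 16, 1],
  "large": [n_features, 32, 32, 1],
}

param_counts = {name: dense_param_count(sizes) for name, sizes in architectures.items()}
for name, count in param_counts.items():
  print(f"{name}: {count} parameters")
```

Under that assumption the large model has roughly ten times as many parameters as the small one.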


# Train Small Model

In [5]:
cross_validate(small_model, X_train, y_train, n_folds=5)

Fold 0: Training Error: 10%	Validation Error: 12%
Fold 1: Training Error: 11%	Validation Error: 10%
Fold 2: Training Error: 9.5%	Validation Error: 7.8%
Fold 3: Training Error: 10%	Validation Error: 9.4%
Fold 4: Training Error: 11%	Validation Error: 12%
Mean(Std. Dev.) over all folds:
-------------------------------
Train Error: 10.50%(0.57%)	Validation Error: 10%(1.5%)


# Train Medium Model

In [6]:
cross_validate(medium_model, X_train, y_train, n_folds=5)

Fold 0: Training Error: 9%	Validation Error: 9.9%
Fold 1: Training Error: 9.3%	Validation Error: 8.5%
Fold 2: Training Error: 9.8%	Validation Error: 8%
Fold 3: Training Error: 9.5%	Validation Error: 7.7%
Fold 4: Training Error: 9.8%	Validation Error: 11%
Mean(Std. Dev.) over all folds:
-------------------------------
Train Error: 9.49%(0.32%)	Validation Error: 9%(1.3%)


# Train Large Model

In [7]:
cross_validate(large_model, X_train, y_train, n_folds=5)

Fold 0: Training Error: 47%	Validation Error: 12%
Fold 1: Training Error: 48%	Validation Error: 49%
Fold 2: Training Error: 46%	Validation Error: 49%
Fold 3: Training Error: 47%	Validation Error: 9.9%
Fold 4: Training Error: 48%	Validation Error: 48%
Mean(Std. Dev.) over all folds:
-------------------------------
Train Error: 47.25%(0.75%)	Validation Error: 34%(18%)
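A training error near 47% on a two-class problem is worth comparing with a trivial baseline that always predicts the most frequent class. The sketch below computes that baseline; `toy_labels` is a hypothetical, roughly balanced label vector used only for illustration, and the real check would pass `y_train` instead.

```python
import numpy as np

def majority_baseline_error(y):
  """Error rate (%) of always predicting the most frequent label in y."""
  y = np.asarray(y)
  _, counts = np.unique(y, return_counts=True)
  return 100.0 * (1.0 - counts.max() / y.size)

toy_labels = np.array([1] * 52 + [0] * 48)  # hypothetical labels, not the mushroom data
baseline = majority_baseline_error(toy_labels)
print(f"Majority-class baseline error: {baseline:.2f}%")
```

If the real baseline turns out to be close to the large model's training error, that would suggest the network has not yet learned anything beyond the class prior within 10 epochs of SGD.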


# Evaluate on test set

In [8]:
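# Train each architecture on the full training set and evaluate it on the held-out test set.
# The test set is passed as validation_data, so its error is printed under the "Validation Error" label.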
models = {
  'small sgd': small_model(),
  'medium sgd': medium_model(),
  'large sgd': large_model(),
}

for name, model in models.items():
  history = model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test), verbose=0)
  train_acc = history.history['accuracy']
  val_acc = history.history['val_accuracy']
  train_error_100 = (1 - train_acc[-1]) * 100
  validation_error_100 = (1 - val_acc[-1]) * 100
  print(f"{name}: Training Error: {train_error_100:.2g}%\tValidation Error: {validation_error_100:.2g}%")


small sgd: Training Error: 8.1%	Validation Error: 6.7%
medium sgd: Training Error: 9.3%	Validation Error: 8.2%
large sgd: Training Error: 48%	Validation Error: 48%


## Experiment: Activation Function and Optimizer
Modify 1) the activation function and 2) the optimizer of any chosen model. Try at least one model for each modified component.

Explain the motivation behind the modifications you made.

Explore the effects on the performance.


# Explanation: Modifying Activation Function

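Hard sigmoid is a piecewise-linear approximation of sigmoid. It is cheaper to compute than sigmoid, although that matters little here because the change affects only the single output unit (the hidden `MyDenseLayer` keeps its built-in sigmoid). Inside its linear region the gradient is constant instead of shrinking smoothly toward zero, so it may be less prone to vanishing gradients there and may improve the medium model's performance.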
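To make the comparison concrete, the sketch below evaluates both activations and their gradients in NumPy. The slope of 0.2 is an assumption that matches the definition in older `tf.keras` releases; Keras 3 documents a slope of 1/6 instead, so check the installed version before relying on exact values.

```python
import numpy as np

def sigmoid(x):
  return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
  s = sigmoid(x)
  return s * (1.0 - s)

def hard_sigmoid(x, slope=0.2):
  # slope=0.2 is assumed (older tf.keras); Keras 3 documents slope=1/6
  return np.clip(slope * x + 0.5, 0.0, 1.0)

def hard_sigmoid_grad(x, slope=0.2):
  y = slope * x + 0.5
  # Constant gradient inside the linear region, exactly zero once clipped
  return np.where((y > 0.0) & (y < 1.0), slope, 0.0)

xs = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
print("sigmoid:          ", np.round(sigmoid(xs), 4))
print("hard sigmoid:     ", hard_sigmoid(xs))
print("sigmoid grad:     ", np.round(sigmoid_grad(xs), 4))
print("hard sigmoid grad:", hard_sigmoid_grad(xs))
```

The gradient of hard sigmoid does not decay inside the linear region, but it is exactly zero once the unit is clipped, whereas sigmoid's gradient stays small but positive everywhere.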

In [9]:
# Implementation and exploration.
def medium_model_hard_sigmoid_activation():
  model = tf.keras.Sequential([
    MyDenseLayer(input_dim=X_train.shape[1], output_dim=16),
    tf.keras.layers.Dense(1, activation='hard_sigmoid')
  ])
  model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
  return model


In [10]:
cross_validate(medium_model_hard_sigmoid_activation, X_train, y_train, n_folds=5)


Fold 0: Training Error: 11%	Validation Error: 12%
Fold 1: Training Error: 10%	Validation Error: 9.9%
Fold 2: Training Error: 11%	Validation Error: 9.1%
Fold 3: Training Error: 11%	Validation Error: 9.6%
Fold 4: Training Error: 10%	Validation Error: 12%
Mean(Std. Dev.) over all folds:
-------------------------------
Train Error: 10.53%(0.28%)	Validation Error: 10%(1.2%)


# Review: Effects on performance

The hard sigmoid activation seems to slightly hurt the medium model: mean validation error rises from 9% to 10%, although that difference is within one standard deviation across folds (1.2% to 1.3%). Vanishing gradients are mostly a concern in very deep networks, so that factor probably did not play a big role here. Sigmoid is smoother, which is generally considered better for optimization, so that may be the only relevant factor.

# Explanation: Modifying Optimizer

Adam adapts the learning rate for each parameter individually. This is particularly helpful in deeper or larger networks, where different parameters may need updates of different scales. It therefore makes the most sense to try Adam on the large model, which also performed worst under SGD.
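The per-parameter scaling can be illustrated with a single update step. The sketch below compares a plain SGD step with the first Adam step for two gradients of very different magnitude. The hyperparameters are assumed to be the Keras defaults (learning rate 0.01 for SGD, 0.001 for Adam, `beta1=0.9`, `beta2=0.999`, `eps=1e-7`), and only the first step from zero-initialized moments is shown.

```python
import numpy as np

def sgd_step(grad, lr=0.01):
  # Plain SGD: the step is proportional to the raw gradient
  return -lr * grad

def adam_first_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
  m = (1 - beta1) * grad            # first-moment estimate, starting from zero
  v = (1 - beta2) * grad ** 2       # second-moment estimate, starting from zero
  m_hat = m / (1 - beta1)           # bias correction at step t = 1
  v_hat = v / (1 - beta2)
  return -lr * m_hat / (np.sqrt(v_hat) + eps)

grads = np.array([1e-4, 10.0])      # one tiny and one large gradient
sgd_update = sgd_step(grads)
adam_update = adam_first_step(grads)
print("SGD update: ", sgd_update)
print("Adam update:", adam_update)
```

SGD moves the two parameters by amounts that differ by five orders of magnitude, while Adam's first step is close to the learning rate for both. That normalization may help when stacked sigmoid layers produce very small gradients.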

In [11]:
def large_model_adam_optimizer():
  model = tf.keras.Sequential([
    MyDenseLayer(input_dim=X_train.shape[1], output_dim=32),
    MyDenseLayer(input_dim=32, output_dim=32),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])
  model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
  return model


In [12]:
cross_validate(large_model_adam_optimizer, X_train, y_train, n_folds=5)


Fold 0: Training Error: 0%	Validation Error: 0.18%
Fold 1: Training Error: 0%	Validation Error: 0%
Fold 2: Training Error: 0%	Validation Error: 0%
Fold 3: Training Error: 0.11%	Validation Error: 0.088%
Fold 4: Training Error: 0%	Validation Error: 0%
Mean(Std. Dev.) over all folds:
-------------------------------
Train Error: 0.02%(0.04%)	Validation Error: 0.053%(0.07%)


# Review: Effects on performance

Adam improves the performance of the large model dramatically: mean validation error drops from 34% (std. dev. 18%) with SGD to 0.053% (0.07%). This is consistent with my hypothesis. With SGD, the large model's training error stayed near 47% after 10 epochs, which suggests it had barely started to learn, whereas Adam's per-parameter adaptive learning rates let it converge within the same 10 epochs. Adam does have higher memory requirements than SGD because it tracks extra state for each parameter, but neither memory use nor training time was measured in this comparison, so validation error is the only basis for preferring it here.

## OPTIONAL. BONUS. Experiment: Loss Function

Modify the loss function of any chosen model.

Explain the motivation behind the modifications you made.

Explore the effects on the performance.


# Explanation: Modifying Loss Function

The large model with 2 hidden layers might produce more extreme predictions. MSE could do better for the large model (with the Adam optimizer and sigmoid activation) because it penalizes errors quadratically and its loss stays bounded, whereas BCE grows without bound for confident wrong predictions. MSE's gradient with respect to the predicted probability is also more stable for predictions near 0 or 1, where BCE's gradient with respect to the prediction can become very large. So MSE may be worth trying in place of BCE.
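To make the gradient comparison concrete, assume a single sigmoid output $p = \sigma(z)$ with pre-activation $z$ and label $y \in \{0, 1\}$. These symbols are introduced here for illustration, and MSE is written per example without an averaging constant.

$$
\begin{aligned}
L_{\text{BCE}} &= -\,y \log p - (1-y)\log(1-p), &
\frac{\partial L_{\text{BCE}}}{\partial p} &= \frac{p-y}{p(1-p)}, &
\frac{\partial L_{\text{BCE}}}{\partial z} &= p - y,\\
L_{\text{MSE}} &= (p-y)^2, &
\frac{\partial L_{\text{MSE}}}{\partial p} &= 2(p-y), &
\frac{\partial L_{\text{MSE}}}{\partial z} &= 2(p-y)\,p(1-p).
\end{aligned}
$$

With respect to $p$, the BCE gradient grows without bound as a wrong prediction approaches 0 or 1, while the MSE gradient stays within $[-2, 2]$. With respect to $z$, the factor $p(1-p)$ from the sigmoid cancels for BCE, leaving $p - y$, whereas the MSE gradient shrinks toward zero when the output saturates, so a confidently wrong output may learn slowly under MSE.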

In [13]:
# change loss function of large adam model w/ sigmoid
def large_model_mse_loss():
  model = tf.keras.Sequential([
    MyDenseLayer(input_dim=X_train.shape[1], output_dim=32),
    MyDenseLayer(input_dim=32, output_dim=32),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])
  model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
  return model


In [14]:
cross_validate(large_model_mse_loss, X_train, y_train, n_folds=5)

Fold 0: Training Error: 0%	Validation Error: 0.18%
Fold 1: Training Error: 0%	Validation Error: 0%
Fold 2: Training Error: 0%	Validation Error: 0%
Fold 3: Training Error: 0%	Validation Error: 0%
Fold 4: Training Error: 0%	Validation Error: 0%
Mean(Std. Dev.) over all folds:
-------------------------------
Train Error: 0.00%(0.00%)	Validation Error: 0.035%(0.07%)


# Review: Effects on performance

Both binary cross-entropy and mean squared error do very well. MSE has a slightly lower mean validation error (0.035% vs. 0.053%), but both have a standard deviation of 0.07% across folds, so the difference is probably not meaningful. BCE may still be the more appropriate choice because this is a binary classification task: BCE is the negative log-likelihood of a Bernoulli model, which maps well to our edible/poisonous mushrooms case.
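To make the comparison across all experiments explicit, the sketch below collects the mean validation errors and standard deviations reported in the cross-validation outputs above and ranks them. The values are copied as printed (rounded to two significant digits), and the dictionary keys are labels chosen for this example.

```python
# (mean validation error %, std. dev. %) as printed in the cross-validation summaries
validation_results = {
  "small sgd": (10.0, 1.5),
  "medium sgd": (9.0, 1.3),
  "large sgd": (34.0, 18.0),
  "medium sgd, hard sigmoid": (10.0, 1.2),
  "large adam": (0.053, 0.07),
  "large adam, mse": (0.035, 0.07),
}

# Rank models by mean validation error, lowest first
ranked = sorted(validation_results.items(), key=lambda item: item[1][0])
for name, (mean, std) in ranked:
  print(f"{name:<26} {mean:>7.3f}% ({std:.2f}%)")

(best_name, (best_mean, best_std)), (second_name, (second_mean, _)) = ranked[0], ranked[1]
within_noise = abs(second_mean - best_mean) < best_std
print(f"Lowest: {best_name}; runner-up within one std. dev.: {within_noise}")
```

Since the top two models differ by less than one standard deviation, either could reasonably be declared the winner. Whichever is chosen would then be evaluated once on the test set.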


No other directions for this assignment, other than what's here and in the "General Directions" section. You have a lot of freedom with this assignment. Don't get carried away. It is expected the results may vary, being better or worse, due to the limitations of the dataset. Graders are not going to run your notebooks. The notebook will be read as a report on how different models were explored. Since you'll be using libraries, the emphasis will be on your ability to communicate your findings.
