
# Mathematical Formulation for IMDB Text Classification using an MLP
# Instructor: Dr. Ankur Mali
# University of South Florida (Spring 2025)

This document describes the mathematical framework for processing IMDB text data using a character-level bag-of-characters representation, passing it through a multi-layer perceptron (MLP), and training the model via gradient descent. The evaluation metrics include loss, accuracy, precision, and recall.

---

## 1. Tokenization and Input Representation

Given a raw text review \( T \), we first tokenize it at the character level. Let \( V \) be the vocabulary (i.e., the set of unique characters) extracted from the training data with size \( |V| = d \).

For each text review \( T \), we construct a binary bag-of-characters vector \( x \in \{0,1\}^d \) such that:

$$
x_j =
\begin{cases}
1, & \text{if the } j\text{-th character in } V \text{ appears in } T, \\
0, & \text{otherwise.}
\end{cases}
$$

Thus, each review is represented as:

$$
x = \mathrm{BOW}(T) \in \{0,1\}^d \subset \mathbb{R}^d.
$$
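
As a minimal sketch of this encoding, the snippet below builds a vocabulary from a toy training set and encodes one review. The helper names `build_vocab` and `bag_of_characters` are hypothetical and are not part of the notebook code; the lowercasing mirrors the `lower=True` setting of the tokenizer used later.

```python
def build_vocab(texts):
    """Return a sorted list of the unique (lowercased) characters in texts."""
    return sorted({ch for text in texts for ch in text.lower()})


def bag_of_characters(text, vocab):
    """Return a binary list x with x[j] = 1 if vocab[j] occurs in text."""
    present = set(text.lower())
    return [1 if ch in present else 0 for ch in vocab]


# Toy "training set" used only to build the vocabulary.
vocab = build_vocab(["good film", "bad"])

# Characters that are not in the vocabulary (here '!') are simply ignored.
x = bag_of_characters("God!", vocab)
```

Because the vector only records presence, the order and the count of characters in the review are discarded.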

---

## 2. MLP Model

The MLP we consider has the following structure:
- **Input layer:** Receives $x \in \mathbb{R}^d$.
- **Hidden Layer 1:** With $h_1$ neurons; its post-activation output is written $h^{(1)}$.
- **Hidden Layer 2:** With \( h_2 \) neurons.
- **Output Layer:** With \( c \) neurons (for \( c = 2 \) classes in binary classification).

### 2.1. Model Parameters

- **First Hidden Layer:**
  - Weight matrix: $W^{(1)} \in \mathbb{R}^{d \times h_1}$
  - Bias vector: $b^{(1)} \in \mathbb{R}^{h_1}$

- **Second Hidden Layer:**
  - Weight matrix: $W^{(2)} \in \mathbb{R}^{h_1 \times h_2}$
  - Bias vector: $b^{(2)} \in \mathbb{R}^{h_2}$

- **Output Layer:**
  - Weight matrix: $W^{(3)} \in \mathbb{R}^{h_2 \times c}$
  - Bias vector: $b^{(3)} \in \mathbb{R}^{c}$

> **Note:** In the original code, a third hidden layer size ($h_3$) is provided as a parameter but is not used in the forward computation. Here, the model uses two hidden layers. You can extend the pipeline to any number of layers $N$, as long as you update the parameter shapes and the forward pass accordingly.
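
To make the parameter shapes above concrete, the following sketch counts the trainable parameters of the two-hidden-layer network. The function name `count_parameters` and the layer sizes in the example call are illustrative assumptions, not values prescribed by the assignment.

```python
def count_parameters(d, h1, h2, c):
    """Count weights and biases of a d -> h1 -> h2 -> c MLP."""
    layer_shapes = [(d, h1), (h1, h2), (h2, c)]
    # Each layer has an (n_in x n_out) weight matrix plus n_out biases.
    return sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)


# Illustrative sizes only.
total = count_parameters(d=134, h1=128, h2=128, c=2)
```

Most of the parameters sit in the first layer whenever $d$ is large, which is worth keeping in mind when the vocabulary grows.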

### 2.2. Forward Pass

For an input vector \( x \), the forward propagation through the network is as follows:

1. **First Hidden Layer:**

   $$
   h^{(1)} = \text{ReLU}\Big( x\, W^{(1)} + b^{(1)} \Big)
   $$

2. **Second Hidden Layer:**

   $$
   h^{(2)} = \text{ReLU}\Big( h^{(1)}\, W^{(2)} + b^{(2)} \Big)
   $$

3. **Output Layer (Logits):**

   $$
   z = h^{(2)}\, W^{(3)} + b^{(3)}
   $$

The logits \( z \) are then converted to class probabilities using the softmax function:

$$
\hat{y}_k = \text{softmax}(z)_k = \frac{\exp(z_k)}{\sum_{j=1}^{c} \exp(z_j)}, \qquad k = 1, \dots, c
$$
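
The forward pass above can be sketched in plain NumPy as follows. This is an illustration under stated assumptions rather than the notebook's TensorFlow implementation: the function names, the toy layer sizes, and the random initialisation are all hypothetical.

```python
import numpy as np


def relu(a):
    return np.maximum(a, 0.0)


def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = z - z.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def forward(x, params):
    """x has shape (batch, d); returns the logits and class probabilities."""
    W1, b1, W2, b2, W3, b3 = params
    h1 = relu(x @ W1 + b1)   # first hidden layer
    h2 = relu(h1 @ W2 + b2)  # second hidden layer
    z = h2 @ W3 + b3         # output logits
    return z, softmax(z)


# Toy sizes chosen only for the example.
rng = np.random.default_rng(0)
d, h1_size, h2_size, c = 5, 4, 3, 2
params = (
    rng.normal(0, 0.1, (d, h1_size)), np.zeros(h1_size),
    rng.normal(0, 0.1, (h1_size, h2_size)), np.zeros(h2_size),
    rng.normal(0, 0.1, (h2_size, c)), np.zeros(c),
)
x = np.array([[1, 0, 1, 1, 0], [0, 1, 0, 0, 1]], dtype=float)
logits, probs = forward(x, params)
```

Each row of `probs` sums to one, and the predicted class is the index of the largest entry in that row.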

---

## 3. Loss Function

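We use the **categorical cross-entropy loss** for training. The code evaluates it directly on the logits $z$ (`from_logits=True`), which applies the softmax internally and gives the same value as the formula below. For a single sample with true one-hot label $y$ and predicted probabilities $\hat{y}$, the loss is: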

$$
L(y, \hat{y}) = -\sum_{j=1}^{c} y_j \log(\hat{y}_j)
$$

For a batch of \( N \) samples, the average loss is computed as:

$$
L = \frac{1}{N} \sum_{i=1}^{N} L(y^{(i)}, \hat{y}^{(i)})
$$
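
A small NumPy sketch of the batch loss, computed from logits with the log-sum-exp trick, is given here for illustration. The function name `cross_entropy_from_logits` is hypothetical; the notebook itself relies on `tf.keras.losses.CategoricalCrossentropy`.

```python
import numpy as np


def cross_entropy_from_logits(logits, y_onehot):
    """Mean categorical cross entropy over a batch, computed from logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    # log softmax = shifted logits minus the log of the normaliser.
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_sample = -(y_onehot * log_probs).sum(axis=1)
    return per_sample.mean()


# Two samples, both with true class 0.
logits = np.array([[0.0, 0.0], [2.0, 0.0]])
y = np.array([[1.0, 0.0], [1.0, 0.0]])
loss = cross_entropy_from_logits(logits, y)
```

The first sample has equal logits, so its loss is $\log 2 \approx 0.693$. This is the value an uninformed two-class model pays, and it is roughly where the training loss in the log further below starts (0.6879).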

---

## 4. Training via Gradient Descent

The goal is to minimize the loss $L$ with respect to the model parameters:

$$
\Theta = \{ W^{(1)},\, b^{(1)},\, W^{(2)},\, b^{(2)},\, W^{(3)},\, b^{(3)} \}
$$

Using gradient descent, each parameter $\theta \in \Theta$ is updated as follows (adaptive methods such as Adam use the same gradient but rescale the step for each parameter):

$$
\theta \leftarrow \theta - \eta\, \nabla_\theta L
$$

where:
- $\eta$ is the learning rate.
- $\nabla_\theta L$ denotes the gradient of the loss with respect to $\theta$.

Backpropagation is used to compute these gradients efficiently.
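
To illustrate the update rule, here is a sketch of a single gradient descent step. For brevity it is restricted to one softmax layer, where the gradient has the closed form $\partial L / \partial z = (\hat{y} - y)/N$; in the full MLP the hidden-layer gradients come from backpropagation, which the notebook delegates to `tf.GradientTape`. The toy data and the learning rate are assumptions made for the example.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def mean_cross_entropy(probs, y_onehot):
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))


# Toy data: 4 samples, 3 binary features, 2 classes.
X = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
y = np.eye(2)[[0, 1, 1, 0]]

W = np.zeros((3, 2))
b = np.zeros(2)
eta = 0.5  # learning rate (illustrative value)

loss_before = mean_cross_entropy(softmax(X @ W + b), y)

# For softmax + cross entropy, dL/dz = (y_hat - y) / N.
probs = softmax(X @ W + b)
dz = (probs - y) / X.shape[0]
grad_W = X.T @ dz
grad_b = dz.sum(axis=0)

# Gradient descent update: theta <- theta - eta * grad.
W -= eta * grad_W
b -= eta * grad_b

loss_after = mean_cross_entropy(softmax(X @ W + b), y)
```

With zero initial weights the loss starts at $\log 2$, and after this one step it is lower, which is the behaviour the update rule is meant to produce.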

---

## 5. Evaluation Metrics

In addition to monitoring the loss during training, we evaluate the model performance using:

- **Accuracy:**

  $$
  \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}
  $$

- **Precision:**

  $$
  \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
  $$

- **Recall:**

  $$
  \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
  $$

These metrics are computed on the validation and test sets to assess the model’s generalization performance.
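
The three metrics can be computed by hand as in the sketch below. It assumes that label 1 is the positive class, which matches the default of the scikit-learn functions used later in the notebook. The name `binary_metrics` and the choice to return 0.0 when a denominator is zero are conventions chosen here for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for 0/1 labels; class 1 is positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Six toy predictions: 2 true positives, 1 false positive, 1 false negative.
acc, prec, rec = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

In this example four of the six predictions are correct, so the accuracy is 4/6, while precision and recall are both 2/3.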

---

## 6. Summary of the Pipeline

1. **Tokenization:**  
   Each review \( T \) is tokenized at the character level and converted into a binary vector $x \in \{0,1\}^d$ representing the presence of each character in the vocabulary \( V \).

2. **MLP Forward Propagation:**  
   The input vector \( x \) is propagated through the MLP:
   - First hidden layer: $h^{(1)} = \text{ReLU}\big( x\, W^{(1)} + b^{(1)} \big)$
   - Second hidden layer: $h^{(2)} = \text{ReLU}\big( h^{(1)}\, W^{(2)} + b^{(2)} \big)$
   - Output layer: $z = h^{(2)}\, W^{(3)} + b^{(3)}$
   - Softmax conversion: $\hat{y} = \text{softmax}(z)$

3. **Loss Computation:**  
   The categorical cross entropy loss $L$ is computed using the true labels and the predicted probabilities.

4. **Training:**  
   The model parameters $\Theta$ are updated using gradient descent (or Adam), where:

   $$
   \theta \leftarrow \theta - \eta\, \nabla_\theta L
   $$

5. **Evaluation:**  
   After training, the model is evaluated on the validation and test sets using the loss, accuracy, precision, and recall metrics.

---

This formulation captures the entire process—from transforming raw text into a numeric representation, through the forward and backward passes of an MLP, to the training and evaluation of the system. Shorter version of your slides :)


## MLP on IMDB Dataset

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

tf.random.set_seed(1234)
np.random.seed(1234)


# -------------------------------
# Original MLP Class Definition
# -------------------------------
class MLP(object):
    def __init__(
        self,
        size_input,
        size_output,
        size_hidden1,
        size_hidden2=None,
        size_hidden3=None,
        num_layers=2,
        activation="relu",
        optimizer="Adam",
        learning_rate=0.001,
        device=None,
    ):
        """
        size_input: int, size of input layer
        size_hidden1: int, size of the 1st hidden layer
        size_hidden2: int or None, size of the 2nd hidden layer (if applicable)
        size_hidden3: int or None, size of the 3rd hidden layer (if applicable)
        size_output: int, size of output layer
        num_layers: int, number of hidden layers (1, 2, or 3)
        activation: str, activation function ('relu', 'tanh', 'leaky_relu')
        optimizer: str, optimizer ('Adam', 'SGD', 'RMSprop')
        learning_rate: float, learning rate for optimization
        device: str or None, either 'cpu' or 'gpu' or None.
        """

        self.size_input = size_input
        self.size_hidden1 = size_hidden1
        self.size_hidden2 = size_hidden2 if num_layers > 1 else None
        self.size_hidden3 = size_hidden3 if num_layers > 2 else None
        self.size_output = size_output
        self.num_layers = num_layers
        self.device = device
        self.learning_rate = learning_rate

        activation_functions = {
            "relu": tf.nn.relu,
            "tanh": tf.nn.tanh,
            "leaky_relu": tf.nn.leaky_relu,
        }
        self.activation = activation_functions.get(activation, tf.nn.relu)

        self.W1 = tf.Variable(tf.random.normal([self.size_input, self.size_hidden1], stddev=0.1))
        self.b1 = tf.Variable(tf.zeros([1, self.size_hidden1]))

        if self.num_layers > 1:
            self.W2 = tf.Variable(tf.random.normal([self.size_hidden1, self.size_hidden2], stddev=0.1))
            self.b2 = tf.Variable(tf.zeros([1, self.size_hidden2]))

        if self.num_layers > 2:
            self.W3 = tf.Variable(tf.random.normal([self.size_hidden2, self.size_hidden3], stddev=0.1))
            self.b3 = tf.Variable(tf.zeros([1, self.size_hidden3]))

        self.W_out = tf.Variable(tf.random.normal([
            self.size_hidden3 if self.num_layers == 3 else self.size_hidden2 if self.num_layers == 2 else self.size_hidden1,
            self.size_output
        ], stddev=0.1))
        self.b_out = tf.Variable(tf.zeros([1, self.size_output]))

        self.variables = [self.W1, self.b1]
        if self.num_layers > 1:
            self.variables.extend([self.W2, self.b2])
        if self.num_layers > 2:
            self.variables.extend([self.W3, self.b3])
        self.variables.extend([self.W_out, self.b_out])

        optimizers = {
            "Adam": tf.keras.optimizers.Adam,
            "SGD": tf.keras.optimizers.SGD,
            "RMSprop": tf.keras.optimizers.RMSprop,
        }
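        # Fail with a clear message instead of calling None for an unknown name.
        if optimizer not in optimizers:
            raise ValueError(f"Unsupported optimizer: {optimizer!r}")
        self.optimizer = optimizers[optimizer](learning_rate=self.learning_rate)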

    def forward(self, X):
        if self.device is not None:
            with tf.device("gpu:0" if self.device == "gpu" else "cpu"):
                self.y = self.compute_output(X)
        else:
            self.y = self.compute_output(X)
        return self.y

    def loss(self, y_pred, y_true):
        """
        Computes the loss between predicted and true outputs.
        y_pred: Tensor of shape (batch_size, size_output)
        y_true: Tensor of shape (batch_size, size_output)
        """
        y_true_tf = tf.cast(y_true, dtype=tf.float32)
        y_pred_tf = tf.cast(y_pred, dtype=tf.float32)
        cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
        loss_x = cce(y_true_tf, y_pred_tf)
        return loss_x

    def backward(self, X_train, y_train):
        """
        Backward pass: compute gradients of the loss with respect to the variables.
        """
        with tf.GradientTape() as tape:
            predicted = self.forward(X_train)
            current_loss = self.loss(predicted, y_train)
        grads = tape.gradient(current_loss, self.variables)
        return grads

    def compute_output(self, X):
        X_tf = tf.cast(X, dtype=tf.float32)
        h1 = self.activation(tf.matmul(X_tf, self.W1) + self.b1)

        if self.num_layers > 1:
            h2 = self.activation(tf.matmul(h1, self.W2) + self.b2)
        if self.num_layers > 2:
            h3 = self.activation(tf.matmul(h2, self.W3) + self.b3)

        output = tf.matmul(h3 if self.num_layers == 3 else h2 if self.num_layers == 2 else h1, self.W_out) + self.b_out
        return output


# -------------------------------
# Character-Level Tokenizer and Preprocessing Functions
# -------------------------------
def char_level_tokenizer(texts, num_words=None):
    """
    Create and fit a character-level tokenizer.

    Args:
        texts (list of str): List of texts.
        num_words (int or None): Maximum number of tokens to keep.

    Returns:
        tokenizer: A fitted Tokenizer instance.
    """
    tokenizer = tf.keras.preprocessing.text.Tokenizer(
        num_words=num_words, char_level=True, lower=True
    )
    tokenizer.fit_on_texts(texts)
    return tokenizer


def texts_to_bow(tokenizer, texts):
    """
    Convert texts to a bag-of-characters representation.

    Args:
        tokenizer: A fitted character-level Tokenizer.
        texts (list of str): List of texts.

    Returns:
        Numpy array representing the binary bag-of-characters for each text.
    """
    # texts_to_matrix with mode 'binary' produces a fixed-length binary vector per text.
    matrix = tokenizer.texts_to_matrix(texts, mode="binary")
    return matrix


def one_hot_encode(labels, num_classes=2):
    """
    Convert numeric labels to one-hot encoded vectors.
    """
    return np.eye(num_classes)[labels]


# -------------------------------
# Load and Prepare the IMDB Dataset
# -------------------------------
print("Loading IMDB dataset...")
# Load the IMDB reviews dataset with the 'as_supervised' flag so that we get (text, label) pairs.
(ds_train, ds_test), ds_info = tfds.load(
    "imdb_reviews", split=["train", "test"], as_supervised=True, with_info=True
)

# Convert training dataset to lists.
train_texts = []
train_labels = []
for text, label in tfds.as_numpy(ds_train):
    # Decode byte strings to utf-8 strings.
    train_texts.append(text.decode("utf-8"))
    train_labels.append(label)
train_labels = np.array(train_labels)

# Create a validation set from the training data (20% for validation).
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.2, random_state=42
)

# Convert test dataset to lists.
test_texts = []
test_labels = []
for text, label in tfds.as_numpy(ds_test):
    test_texts.append(text.decode("utf-8"))
    test_labels.append(label)
test_labels = np.array(test_labels)

print(
    f"Train samples: {len(train_texts)}, Validation samples: {len(val_texts)}, Test samples: {len(test_texts)}"
)

# -------------------------------
# Preprocessing: Tokenization and Vectorization
# -------------------------------
# Build the character-level tokenizer on the training texts.
tokenizer = char_level_tokenizer(train_texts)
print("Tokenizer vocabulary size:", len(tokenizer.word_index) + 1)

# Convert texts to bag-of-characters representation.
X_train = texts_to_bow(tokenizer, train_texts)
X_val = texts_to_bow(tokenizer, val_texts)
X_test = texts_to_bow(tokenizer, test_texts)

# Convert labels to one-hot encoding.
y_train = one_hot_encode(train_labels)
y_val = one_hot_encode(val_labels)
y_test = one_hot_encode(test_labels)


print(f"NUM_LAYERS, HIDDEN_SIZE, ACTIVATION, OPTIMIZER, LEARNING, BATCH_SIZE, Test Accuracy\n")
for BATCH_SIZE in [128, 64, 32]:
    for NUM_LAYERS in [2, 3, 1]:
        for HIDDEN_SIZE in [128, 256, 512]:
            for ACTIVATION in ["relu", "tanh", "leaky_relu"]:
                for OPTIMIZER in ["Adam", "SGD", "RMSprop"]:
                    for LEARNING in [0.001, 0.0005, 0.0001]:
                        # -------------------------------
                        # Model Setup
                        # -------------------------------
                        # The input size is determined by the dimension of the bag-of-characters vector.
                        size_input = X_train.shape[1]
                        size_hidden1 = HIDDEN_SIZE
                        size_hidden2 = HIDDEN_SIZE
                        size_hidden3 = HIDDEN_SIZE
                        size_output = 2  # Binary classification.
                        batch_size = BATCH_SIZE
                        epochs = 10
                        num_batches = int(np.ceil(X_train.shape[0] / batch_size))
                        # Instantiate the MLP model.
                        # model = MLP(
                        #     size_input, size_hidden1, size_hidden2, size_hidden3, size_output, device=None
                        # )
                        model = MLP(
                            size_input=size_input,
                            size_output=size_output,
                            size_hidden1=size_hidden1,
                            size_hidden2=size_hidden2,
                            size_hidden3=size_hidden3,
                            num_layers=NUM_LAYERS,
                            activation=ACTIVATION,
                            optimizer=OPTIMIZER,
                            learning_rate=LEARNING,
                            device=None,
                        )

                        # Reuse the optimizer that the model already built from
                        # OPTIMIZER and LEARNING instead of creating a second one.
                        optimizer = model.optimizer

                        for epoch in range(epochs):
                            # Shuffle training data at the start of each epoch.
                            indices = np.arange(X_train.shape[0])
                            np.random.shuffle(indices)
                            X_train = X_train[indices]
                            y_train = y_train[indices]

                            epoch_loss = 0
                            for i in range(num_batches):
                                start = i * batch_size
                                end = min((i + 1) * batch_size, X_train.shape[0])
                                X_batch = X_train[start:end]
                                y_batch = y_train[start:end]
                                predictions = model.forward(X_batch)
                                loss_value = model.loss(predictions, y_batch)
                                grads = model.backward(X_batch, y_batch)
                                optimizer.apply_gradients(zip(grads, model.variables))
                                epoch_loss += loss_value.numpy() * (end - start)

                            epoch_loss /= X_train.shape[0]

                            # Evaluate on validation set.
                            val_logits = model.forward(X_val)
                            val_loss = model.loss(val_logits, y_val).numpy()
                            val_preds = np.argmax(val_logits.numpy(), axis=1)
                            true_val = np.argmax(y_val, axis=1)
                            accuracy = np.mean(val_preds == true_val)
                            precision = precision_score(true_val, val_preds)
                            recall = recall_score(true_val, val_preds)

                            # print(
                            #     f"Epoch {epoch+1:02d} | Training Loss: {epoch_loss:.4f} | Val Loss: {val_loss:.4f} | "
                            #     f"Accuracy: {accuracy:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f}"
                            # )

                        # -------------------------------
                        # Final Evaluation on Test Set
                        # -------------------------------
                        # print("\nEvaluating on test set...")
                        test_logits = model.forward(X_test)
                        test_loss = model.loss(test_logits, y_test).numpy()
                        test_preds = np.argmax(test_logits.numpy(), axis=1)
                        true_test = np.argmax(y_test, axis=1)
                        test_accuracy = np.mean(test_preds == true_test)
                        test_precision = precision_score(true_test, test_preds)
                        test_recall = recall_score(true_test, test_preds)

                            # print(
                            #     f"Test Loss: {test_loss:.4f} | Test Accuracy: {test_accuracy:.4f} | "
                            #     f"Test Precision: {test_precision:.4f} | Test Recall: {test_recall:.4f}"
                            # )

                        print(f"{NUM_LAYERS},{HIDDEN_SIZE},{ACTIVATION},{OPTIMIZER},{LEARNING},{BATCH_SIZE},{test_accuracy:.4f}\n")

Loading IMDB dataset...
Train samples: 20000, Validation samples: 5000, Test samples: 25000
Tokenizer vocabulary size: 80169
NUM_LAYERS, HIDDEN_SIZE, ACTIVATION, OPTIMIZER, LEARNING, BATCH_SIZE, Test Accuracy

1,128,relu,Adam,0.001,128,0.8454

1,128,relu,Adam,0.0005,128,0.8507

1,128,relu,Adam,0.0001,128,0.8559

1,128,relu,SGD,0.001,128,0.6173

1,128,relu,SGD,0.0005,128,0.5524

1,128,relu,SGD,0.0001,128,0.4694

1,128,relu,RMSprop,0.001,128,0.8396

1,128,relu,RMSprop,0.0005,128,0.8494

1,128,relu,RMSprop,0.0001,128,0.8535

1,128,tanh,Adam,0.001,128,0.8504

1,128,tanh,Adam,0.0005,128,0.8532

1,128,tanh,Adam,0.0001,128,0.8577

1,128,tanh,SGD,0.001,128,0.6142

1,128,tanh,SGD,0.0005,128,0.5560

1,128,tanh,SGD,0.0001,128,0.5063

1,128,tanh,RMSprop,0.001,128,0.8534

1,128,tanh,RMSprop,0.0005,128,0.8559

1,128,tanh,RMSprop,0.0001,128,0.8577

1,128,leaky_relu,Adam,0.001,128,0.8482

1,128,leaky_relu,Adam,0.0005,128,0.8522

1,128,leaky_relu,Adam,0.0001,128,0.8580

1,128,leaky_relu,SGD,0.001,128,0

## Random MLP on IMDB Dataset

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

tf.random.set_seed(1221)
np.random.seed(1221)


# ---------------------------
# Hyperparameters
BATCH_SIZE = 128
NUM_LAYERS = 1
HIDDEN_SIZE = 256
ACTIVATION = "tanh"
OPTIMIZER = "RMSprop"
LEARNING = 0.0001
# ---------------------------


# -------------------------------
# Original MLP Class Definition
# -------------------------------
class MLP_rnd(object):
    def __init__(
        self,
        size_input,
        size_output,
        size_hidden1,
        size_hidden2=None,
        size_hidden3=None,
        num_layers=2,
        activation="relu",
        device=None,
    ):
        self.size_input = size_input
        self.size_hidden1 = size_hidden1
        self.size_hidden2 = size_hidden2 if num_layers > 1 else None
        self.size_hidden3 = size_hidden3 if num_layers > 2 else None
        self.size_output = size_output
        self.num_layers = num_layers
        self.device = device

        activation_functions = {
            "relu": tf.nn.relu,
            "tanh": tf.nn.tanh,
            "leaky_relu": tf.nn.leaky_relu,
        }
        self.activation = activation_functions.get(activation, tf.nn.relu)

        self.W1 = tf.Variable(tf.random.normal([self.size_input, self.size_hidden1], stddev=0.1))
        self.b1 = tf.Variable(tf.zeros([1, self.size_hidden1]))

        if self.num_layers > 1:
            self.W2 = tf.Variable(tf.random.normal([self.size_hidden1, self.size_hidden2], stddev=0.1))
            self.b2 = tf.Variable(tf.zeros([1, self.size_hidden2]))

        if self.num_layers > 2:
            self.W3 = tf.Variable(tf.random.normal([self.size_hidden2, self.size_hidden3], stddev=0.1))
            self.b3 = tf.Variable(tf.zeros([1, self.size_hidden3]))

        self.W_out = tf.Variable(tf.random.normal([
            self.size_hidden3 if self.num_layers == 3 else self.size_hidden2 if self.num_layers == 2 else self.size_hidden1,
            self.size_output
        ], stddev=0.1))
        self.b_out = tf.Variable(tf.zeros([1, self.size_output]))

        self.variables = [self.W_out, self.b_out]
        if self.num_layers > 2:
            self.variables.extend([self.W3, self.b3])
        if self.num_layers > 1:
            self.variables.extend([self.W2, self.b2])
        self.variables.extend([self.W1, self.b1])

    def forward(self, X):
        if self.device is not None:
            with tf.device("gpu:0" if self.device == "gpu" else "cpu"):
                self.y = self.compute_output(X)
        else:
            self.y = self.compute_output(X)
        return self.y

    def loss(self, y_pred, y_true):
        y_true_tf = tf.cast(y_true, dtype=tf.float32)
        y_pred_tf = tf.cast(y_pred, dtype=tf.float32)
        cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
        loss_x = cce(y_true_tf, y_pred_tf)
        return loss_x

    def backward(self, X_train, y_train):
        with tf.GradientTape() as tape:
            predicted = self.forward(X_train)
            current_loss = self.loss(predicted, y_train)
        grads = tape.gradient(current_loss, self.variables)
        return grads

    def compute_output(self, X):
        X_tf = tf.cast(X, dtype=tf.float32)
        h1 = self.activation(tf.matmul(X_tf, self.W1) + self.b1)

        if self.num_layers > 1:
            h2 = self.activation(tf.matmul(h1, self.W2) + self.b2)
        if self.num_layers > 2:
            h3 = self.activation(tf.matmul(h2, self.W3) + self.b3)

        output = tf.matmul(h3 if self.num_layers == 3 else h2 if self.num_layers == 2 else h1, self.W_out) + self.b_out
        return output


# -------------------------------
# Character-Level Tokenizer and Preprocessing Functions
# -------------------------------
def char_level_tokenizer(texts, num_words=1000):
    """
    Create and fit a character-level tokenizer.

    Args:
        texts (list of str): List of texts.
        num_words (int or None): Maximum number of tokens to keep.

    Returns:
        tokenizer: A fitted Tokenizer instance.
    """
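    # NOTE: char_level=False makes this a word-level tokenizer capped by
    # num_words, despite the function name. Set char_level=True to obtain the
    # bag-of-characters input described in the formulation above.
    tokenizer = tf.keras.preprocessing.text.Tokenizer(
        num_words=num_words, char_level=False, lower=True
    )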
    tokenizer.fit_on_texts(texts)
    return tokenizer


def texts_to_bow(tokenizer, texts):
    """
    Convert texts to a bag-of-characters representation.

    Args:
        tokenizer: A fitted character-level Tokenizer.
        texts (list of str): List of texts.

    Returns:
        Numpy array representing the binary bag-of-characters for each text.
    """
    # texts_to_matrix with mode 'binary' produces a fixed-length binary vector per text.
    matrix = tokenizer.texts_to_matrix(texts, mode="binary")
    return matrix


def one_hot_encode(labels, num_classes=2):
    """
    Convert numeric labels to one-hot encoded vectors.
    """
    return np.eye(num_classes)[labels]


# -------------------------------
# Load and Prepare the IMDB Dataset
# -------------------------------
print("Loading IMDB dataset...")
# Load the IMDB reviews dataset with the 'as_supervised' flag so that we get (text, label) pairs.
(ds_train, ds_test), ds_info = tfds.load(
    "imdb_reviews", split=["train", "test"], as_supervised=True, with_info=True
)

# Convert training dataset to lists.
train_texts = []
train_labels = []
for text, label in tfds.as_numpy(ds_train):
    # Decode byte strings to utf-8 strings.
    train_texts.append(text.decode("utf-8"))
    train_labels.append(label)
train_labels = np.array(train_labels)

# Create a validation set from the training data (20% for validation).
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.2, random_state=42
)

# Convert test dataset to lists.
test_texts = []
test_labels = []
for text, label in tfds.as_numpy(ds_test):
    test_texts.append(text.decode("utf-8"))
    test_labels.append(label)
test_labels = np.array(test_labels)

print(
    f"Train samples: {len(train_texts)}, Validation samples: {len(val_texts)}, Test samples: {len(test_texts)}"
)

# -------------------------------
# Preprocessing: Tokenization and Vectorization
# -------------------------------
# Build the character-level tokenizer on the training texts.
tokenizer = char_level_tokenizer(train_texts)
print("Tokenizer vocabulary size:", len(tokenizer.word_index) + 1)

# Convert texts to bag-of-characters representation.
X_train = texts_to_bow(tokenizer, train_texts)
X_val = texts_to_bow(tokenizer, val_texts)
X_test = texts_to_bow(tokenizer, test_texts)

# Convert labels to one-hot encoding.
y_train = one_hot_encode(train_labels)
y_val = one_hot_encode(val_labels)
y_test = one_hot_encode(test_labels)



# -------------------------------
# Model Setup
# -------------------------------
# The input size is determined by the dimension of the bag-of-characters vector.
size_input = X_train.shape[1]
# Set hidden layer sizes as desired.
size_hidden1 = HIDDEN_SIZE
size_hidden2 = HIDDEN_SIZE
size_hidden3 = HIDDEN_SIZE
size_output = 2

# Instantiate the MLP model.
model = MLP_rnd(
    size_input=size_input,
    size_output=size_output,
    size_hidden1=size_hidden1,
    size_hidden2=size_hidden2,
    size_hidden3=size_hidden3,
    num_layers=NUM_LAYERS,
    activation=ACTIVATION,
    device=None,
)

optimizers = {
    "Adam": tf.keras.optimizers.Adam,
    "SGD": tf.keras.optimizers.SGD,
    "RMSprop": tf.keras.optimizers.RMSprop,
}
optimizer = optimizers.get(OPTIMIZER)(learning_rate=LEARNING)

# -------------------------------
# Training Parameters and Loop
# -------------------------------
batch_size = BATCH_SIZE
epochs = 10
num_batches = int(np.ceil(X_train.shape[0] / batch_size))

print("\nStarting training...\n")
for epoch in range(epochs):
    # Shuffle training data at the start of each epoch.
    indices = np.arange(X_train.shape[0])
    np.random.shuffle(indices)
    X_train = X_train[indices]
    y_train = y_train[indices]

    epoch_loss = 0
    for i in range(num_batches):
        start = i * batch_size
        end = min((i + 1) * batch_size, X_train.shape[0])
        X_batch = X_train[start:end]
        y_batch = y_train[start:end]

        # Compute gradients and update weights.
        # with tf.GradientTape() as tape:
        #     predictions = model.forward(X_batch)
        #     loss_value = model.loss(predictions, y_batch)
        # grads = tape.gradient(loss_value, model.variables)
        predictions = model.forward(X_batch)
        loss_value = model.loss(predictions, y_batch)
        grads = model.backward(X_batch, y_batch)
        optimizer.apply_gradients(zip(grads, model.variables))
        epoch_loss += loss_value.numpy() * (end - start)

    epoch_loss /= X_train.shape[0]

    # Evaluate on validation set.
    val_logits = model.forward(X_val)
    val_loss = model.loss(val_logits, y_val).numpy()
    val_preds = np.argmax(val_logits.numpy(), axis=1)
    true_val = np.argmax(y_val, axis=1)
    accuracy = np.mean(val_preds == true_val)
    precision = precision_score(true_val, val_preds)
    recall = recall_score(true_val, val_preds)

    print(
        f"Epoch {epoch+1:02d} | Training Loss: {epoch_loss:.4f} | Val Loss: {val_loss:.4f} | "
        f"Accuracy: {accuracy:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f}"
    )

# -------------------------------
# Final Evaluation on Test Set
# -------------------------------
print("\nEvaluating on test set...")
test_logits = model.forward(X_test)
test_loss = model.loss(test_logits, y_test).numpy()
test_preds = np.argmax(test_logits.numpy(), axis=1)
true_test = np.argmax(y_test, axis=1)

test_accuracy = np.mean(test_preds == true_test)
test_precision = precision_score(true_test, test_preds)
test_recall = recall_score(true_test, test_preds)

print(
    f"Test Loss: {test_loss:.4f} | Test Accuracy: {test_accuracy:.4f} | "
    f"Test Precision: {test_precision:.4f} | Test Recall: {test_recall:.4f}"
)


Loading IMDB dataset...
Train samples: 20000, Validation samples: 5000, Test samples: 25000
Tokenizer vocabulary size: 134

Starting training...

Epoch 01 | Training Loss: 0.6879 | Val Loss: 0.6852 | Accuracy: 0.5636 | Precision: 0.5433 | Recall: 0.6258
Epoch 02 | Training Loss: 0.6828 | Val Loss: 0.6820 | Accuracy: 0.5704 | Precision: 0.5454 | Recall: 0.6840
Epoch 03 | Training Loss: 0.6796 | Val Loss: 0.6796 | Accuracy: 0.5764 | Precision: 0.5512 | Recall: 0.6795
Epoch 04 | Training Loss: 0.6775 | Val Loss: 0.6771 | Accuracy: 0.5802 | Precision: 0.5641 | Recall: 0.5903
Epoch 05 | Training Loss: 0.6759 | Val Loss: 0.6767 | Accuracy: 0.5822 | Precision: 0.5592 | Recall: 0.6531
Epoch 06 | Training Loss: 0.6747 | Val Loss: 0.6756 | Accuracy: 0.5852 | Precision: 0.5651 | Recall: 0.6267
Epoch 07 | Training Loss: 0.6739 | Val Loss: 0.6749 | Accuracy: 0.5866 | Precision: 0.5676 | Recall: 0.6180
Epoch 08 | Training Loss: 0.6731 | Val Loss: 0.6746 | Accuracy: 0.5860 | Precision: 0.5660 | Recal

## MLP with feedback alignment on IMDB Dataset

In [None]:
class MLP_FA(object):
    def __init__(self, size_input, size_hidden1, size_hidden2, size_hidden3, size_output, device=None):
        """
        size_input: int, size of input layer
        size_hidden1: int, size of the 1st hidden layer
        size_hidden2: int, size of the 2nd hidden layer
        size_hidden3: int, size of the 3rd hidden layer (Note: Not used in compute_output in this example)
        size_output: int, size of output layer
        device: str or None, either 'cpu' or 'gpu' or None.
        """
        self.size_input = size_input
        self.size_hidden1 = size_hidden1
        self.size_hidden2 = size_hidden2
        self.size_hidden3 = size_hidden3  # (Currently not used)
        self.size_output = size_output
        self.device = device

        # Initialize weights and biases for first hidden layer
        self.W1 = tf.Variable(tf.random.normal([self.size_input, self.size_hidden1], stddev=0.1))
        self.b1 = tf.Variable(tf.zeros([1, self.size_hidden1]))

        # Initialize weights and biases for second hidden layer
        self.W2 = tf.Variable(tf.random.normal([self.size_hidden1, self.size_hidden2], stddev=0.1))
        self.b2 = tf.Variable(tf.zeros([1, self.size_hidden2]))

        # Initialize weights and biases for output layer
        self.W3 = tf.Variable(tf.random.normal([self.size_hidden2, self.size_output], stddev=0.1))
        self.b3 = tf.Variable(tf.zeros([1, self.size_output]))

        # Create fixed random feedback matrices for feedback alignment:
        # B3: used to propagate the error from the output layer to the second hidden layer.
        # It replaces the use of W3^T. Its shape is (size_output, size_hidden2).
        self.B3 = tf.Variable(tf.random.normal([self.size_output, self.size_hidden2]), trainable=False)

        # B2: used to propagate the error from the second hidden layer to the first hidden layer.
        # Its shape is (size_hidden2, size_hidden1).
        self.B2 = tf.Variable(tf.random.normal([self.size_hidden2, self.size_hidden1]), trainable=False)

        # Define variables to be updated during training
        self.variables = [self.W1, self.W2, self.W3, self.b1, self.b2, self.b3]

    def forward(self, X):
        """
        Forward pass.
        X: Tensor, inputs.
        """
        if self.device is not None:
            with tf.device('gpu:0' if self.device == 'gpu' else 'cpu'):
                self.y = self.compute_output(X)
        else:
            self.y = self.compute_output(X)
        return self.y

    def loss(self, y_pred, y_true):
        """
        Computes the loss between predicted and true outputs.
        y_pred - Tensor of shape (batch_size, size_output)
        y_true - Tensor of shape (batch_size, size_output)
        """
        y_true_tf = tf.cast(y_true, dtype=tf.float32)
        y_pred_tf = tf.cast(y_pred, dtype=tf.float32)
        cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
        loss_x = cce(y_true_tf, y_pred_tf)
        return loss_x

    def backward(self, X_train, y_train):
        """
        Backward pass using feedback alignment.
        Computes gradients manually using fixed random feedback matrices.
        X_train: Input data (numpy array)
        y_train: One-hot encoded labels (numpy array)
        Returns: List of gradients corresponding to [dW1, dW2, dW3, db1, db2, db3]
        """
        # Cast input to float32 tensor
        X_tf = tf.cast(X_train, tf.float32)

        # --- Forward Pass ---
        # First hidden layer
        h1 = tf.matmul(X_tf, self.W1) + self.b1
        a1 = tf.nn.relu(h1)
        # Second hidden layer
        h2 = tf.matmul(a1, self.W2) + self.b2
        a2 = tf.nn.relu(h2)
        # Output layer (logits)
        logits = tf.matmul(a2, self.W3) + self.b3
        # Softmax predictions
        y_pred = tf.nn.softmax(logits)

        # --- Compute Output Error ---
        # For cross-entropy with softmax, the derivative is (y_pred - y_true)
        delta3 = y_pred - tf.cast(y_train, tf.float32)  # shape: (batch, size_output)
        batch_size = tf.cast(tf.shape(X_tf)[0], tf.float32)

        # --- Gradients for Output Layer ---
        dW3 = tf.matmul(tf.transpose(a2), delta3) / batch_size
        db3 = tf.reduce_mean(delta3, axis=0, keepdims=True)

        # --- Feedback Alignment for Second Hidden Layer ---
        # Instead of delta2 = (delta3 dot W3^T) * ReLU'(h2), use a fixed random matrix B3.
        relu_grad_h2 = tf.cast(h2 > 0, tf.float32)
        # delta3 has shape (batch, size_output) and B3 has shape (size_output, size_hidden2)
        delta2 = tf.matmul(delta3, self.B3) * relu_grad_h2  # shape: (batch, size_hidden2)

        dW2 = tf.matmul(tf.transpose(a1), delta2) / batch_size
        db2 = tf.reduce_mean(delta2, axis=0, keepdims=True)

        # --- Feedback Alignment for First Hidden Layer ---
        # Instead of delta1 = (delta2 dot W2^T) * ReLU'(h1), use a fixed random matrix B2.
        relu_grad_h1 = tf.cast(h1 > 0, tf.float32)
        # delta2 has shape (batch, size_hidden2) and B2 has shape (size_hidden2, size_hidden1)
        delta1 = tf.matmul(delta2, self.B2) * relu_grad_h1  # shape: (batch, size_hidden1)

        dW1 = tf.matmul(tf.transpose(X_tf), delta1) / batch_size
        db1 = tf.reduce_mean(delta1, axis=0, keepdims=True)

        return [dW1, dW2, dW3, db1, db2, db3]

    def compute_output(self, X):
        """
        Custom method to obtain output tensor during the forward pass.
        """
        X_tf = tf.cast(X, dtype=tf.float32)
        h1 = tf.matmul(X_tf, self.W1) + self.b1
        z1 = tf.nn.relu(h1)
        h2 = tf.matmul(z1, self.W2) + self.b2
        z2 = tf.nn.relu(h2)
        output = tf.matmul(z2, self.W3) + self.b3
        return output


# -------------------------------
# Character-Level Tokenizer and Preprocessing Functions
# -------------------------------
def char_level_tokenizer(texts, num_words=None):
    """
    Create and fit a character-level tokenizer.

    Args:
        texts (list of str): List of texts.
        num_words (int or None): Maximum number of tokens to keep.

    Returns:
        tokenizer: A fitted Tokenizer instance.
    """
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=num_words, char_level=True, lower=True)
    tokenizer.fit_on_texts(texts)
    return tokenizer

def texts_to_bow(tokenizer, texts):
    """
    Convert texts to a bag-of-characters representation.

    Args:
        tokenizer: A fitted character-level Tokenizer.
        texts (list of str): List of texts.

    Returns:
        Numpy array representing the binary bag-of-characters for each text.
    """
    # texts_to_matrix with mode 'binary' produces a fixed-length binary vector per text.
    matrix = tokenizer.texts_to_matrix(texts, mode='binary')
    return matrix

def one_hot_encode(labels, num_classes=2):
    """
    Convert numeric labels to one-hot encoded vectors.
    """
    return np.eye(num_classes)[labels]

# -------------------------------
# Load and Prepare the IMDB Dataset
# -------------------------------
print("Loading IMDB dataset...")
# Load the IMDB reviews dataset with the 'as_supervised' flag so that we get (text, label) pairs.
(ds_train, ds_test), ds_info = tfds.load('imdb_reviews',
                                           split=['train', 'test'],
                                           as_supervised=True,
                                           with_info=True)

# Convert training dataset to lists.
train_texts = []
train_labels = []
for text, label in tfds.as_numpy(ds_train):
    # Decode byte strings to utf-8 strings.
    train_texts.append(text.decode('utf-8'))
    train_labels.append(label)
train_labels = np.array(train_labels)

# Create a validation set from the training data (20% for validation).
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.2, random_state=42)

# Convert test dataset to lists.
test_texts = []
test_labels = []
for text, label in tfds.as_numpy(ds_test):
    test_texts.append(text.decode('utf-8'))
    test_labels.append(label)
test_labels = np.array(test_labels)

print(f"Train samples: {len(train_texts)}, Validation samples: {len(val_texts)}, Test samples: {len(test_texts)}")

# -------------------------------
# Preprocessing: Tokenization and Vectorization
# -------------------------------
# Build the character-level tokenizer on the training texts.
tokenizer = char_level_tokenizer(train_texts)
print("Tokenizer vocabulary size:", len(tokenizer.word_index) + 1)

# Convert texts to bag-of-characters representation.
X_train = texts_to_bow(tokenizer, train_texts)
X_val   = texts_to_bow(tokenizer, val_texts)
X_test  = texts_to_bow(tokenizer, test_texts)

# Convert labels to one-hot encoding.
y_train = one_hot_encode(train_labels)
y_val   = one_hot_encode(val_labels)
y_test  = one_hot_encode(test_labels)

# -------------------------------
# Model Setup
# -------------------------------
# The input size is determined by the dimension of the bag-of-characters vector.
size_input = X_train.shape[1]
# Set hidden layer sizes as desired.
size_hidden1 = 128
size_hidden2 = 64
size_hidden3 = 32  # Placeholder (not used in the forward pass)
size_output  = 2

# Instantiate the MLP model.
model = MLP_FA(size_input, size_hidden1, size_hidden2, size_hidden3, size_output, device=None)

# Define the optimizer.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# -------------------------------
# Training Parameters and Loop
# -------------------------------
batch_size = 128
epochs = 10
num_batches = int(np.ceil(X_train.shape[0] / batch_size))

print("\nStarting training...\n")
for epoch in range(epochs):
    # Shuffle training data at the start of each epoch.
    indices = np.arange(X_train.shape[0])
    np.random.shuffle(indices)
    X_train = X_train[indices]
    y_train = y_train[indices]

    epoch_loss = 0
    for i in range(num_batches):
        start = i * batch_size
        end = min((i+1) * batch_size, X_train.shape[0])
        X_batch = X_train[start:end]
        y_batch = y_train[start:end]

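        # Compute gradients and update weights.
        # Autodiff (exact backpropagation) alternative, kept for reference:
        # with tf.GradientTape() as tape:
        #     predictions = model.forward(X_batch)
        #     loss_value = model.loss(predictions, y_batch)
        # grads = tape.gradient(loss_value, model.variables)
        # Feedback alignment: gradients come from the manual backward pass instead.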
        predictions = model.forward(X_batch)
        loss_value = model.loss(predictions, y_batch)
        grads = model.backward(X_batch, y_batch)
        optimizer.apply_gradients(zip(grads, model.variables))
        epoch_loss += loss_value.numpy() * (end - start)

    epoch_loss /= X_train.shape[0]

    # Evaluate on validation set.
    val_logits = model.forward(X_val)
    val_loss = model.loss(val_logits, y_val).numpy()
    val_preds = np.argmax(val_logits.numpy(), axis=1)
    true_val = np.argmax(y_val, axis=1)
    accuracy = np.mean(val_preds == true_val)
    precision = precision_score(true_val, val_preds)
    recall = recall_score(true_val, val_preds)

    print(f"Epoch {epoch+1:02d} | Training Loss: {epoch_loss:.4f} | Val Loss: {val_loss:.4f} | "
          f"Accuracy: {accuracy:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f}")

# -------------------------------
# Final Evaluation on Test Set
# -------------------------------
print("\nEvaluating on test set...")
test_logits = model.forward(X_test)
test_loss = model.loss(test_logits, y_test).numpy()
test_preds = np.argmax(test_logits.numpy(), axis=1)
true_test = np.argmax(y_test, axis=1)
test_accuracy = np.mean(test_preds == true_test)
test_precision = precision_score(true_test, test_preds)
test_recall = recall_score(true_test, test_preds)

print(f"Test Loss: {test_loss:.4f} | Test Accuracy: {test_accuracy:.4f} | "
      f"Test Precision: {test_precision:.4f} | Test Recall: {test_recall:.4f}")

Loading IMDB dataset...
Train samples: 20000, Validation samples: 5000, Test samples: 25000
Tokenizer vocabulary size: 134

Starting training...

Epoch 01 | Training Loss: 0.6810 | Val Loss: 0.6650 | Accuracy: 0.6048 | Precision: 0.5828 | Recall: 0.6502
Epoch 02 | Training Loss: 0.6634 | Val Loss: 0.6642 | Accuracy: 0.6060 | Precision: 0.5821 | Recall: 0.6638
Epoch 03 | Training Loss: 0.6629 | Val Loss: 0.6626 | Accuracy: 0.6098 | Precision: 0.5844 | Recall: 0.6753
Epoch 04 | Training Loss: 0.6611 | Val Loss: 0.6635 | Accuracy: 0.6066 | Precision: 0.5799 | Recall: 0.6844
Epoch 05 | Training Loss: 0.6625 | Val Loss: 0.6633 | Accuracy: 0.6004 | Precision: 0.6024 | Recall: 0.5169
Epoch 06 | Training Loss: 0.6596 | Val Loss: 0.6615 | Accuracy: 0.6074 | Precision: 0.5905 | Recall: 0.6205
Epoch 07 | Training Loss: 0.6577 | Val Loss: 0.6608 | Accuracy: 0.6074 | Precision: 0.5866 | Recall: 0.6444
Epoch 08 | Training Loss: 0.6555 | Val Loss: 0.6624 | Accuracy: 0.6050 | Precision: 0.5765 | Recal
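
The manual update in `MLP_FA.backward` can be examined on a toy problem. The NumPy sketch below mirrors that computation under stated assumptions: the layer sizes, the random data, and the helper name `fa_gradients` are made up for illustration and are not part of the notebook. It shows that the output-layer gradient does not depend on the feedback matrices, and that passing `W3^T` and `W2^T` as feedback matrices reduces the rule to ordinary backpropagation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fa_gradients(X, Y, params, B3, B2):
    """Feedback-alignment gradients for a 2-hidden-layer ReLU MLP (hypothetical helper)."""
    W1, b1, W2, b2, W3, b3 = params
    n = X.shape[0]
    h1 = X @ W1 + b1
    a1 = np.maximum(h1, 0.0)
    h2 = a1 @ W2 + b2
    a2 = np.maximum(h2, 0.0)
    delta3 = softmax(a2 @ W3 + b3) - Y   # softmax + cross-entropy error at the logits
    delta2 = (delta3 @ B3) * (h2 > 0)    # B3 stands in for W3^T
    delta1 = (delta2 @ B2) * (h1 > 0)    # B2 stands in for W2^T
    # Same ordering as the notebook: [dW1, dW2, dW3, db1, db2, db3]
    return [X.T @ delta1 / n, a1.T @ delta2 / n, a2.T @ delta3 / n,
            delta1.mean(axis=0, keepdims=True),
            delta2.mean(axis=0, keepdims=True),
            delta3.mean(axis=0, keepdims=True)]

rng = np.random.default_rng(0)
d, size_h1, size_h2, c, n = 6, 5, 4, 2, 8   # toy sizes (assumptions)
X = rng.integers(0, 2, size=(n, d)).astype(float)
Y = np.eye(c)[rng.integers(0, c, size=n)]
W1, b1 = rng.normal(0, 0.1, (d, size_h1)), np.zeros((1, size_h1))
W2, b2 = rng.normal(0, 0.1, (size_h1, size_h2)), np.zeros((1, size_h2))
W3, b3 = rng.normal(0, 0.1, (size_h2, c)), np.zeros((1, c))
params = [W1, b1, W2, b2, W3, b3]
B3 = rng.normal(size=(c, size_h2))          # fixed random feedback matrices
B2 = rng.normal(size=(size_h2, size_h1))

fa = fa_gradients(X, Y, params, B3, B2)
bp = fa_gradients(X, Y, params, W3.T, W2.T)  # exact backprop as a special case

targets = [W1, W2, W3, b1, b2, b3]
shapes_match = all(g.shape == p.shape for g, p in zip(fa, targets))
print("gradient shapes match parameters:", shapes_match)
print("output-layer gradient identical:", np.allclose(fa[2], bp[2]))
```

Only the hidden-layer gradients differ between the two calls, which is where feedback alignment departs from backpropagation.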

# Assignment 2 Todos

## Overview
- **Objective:**  
  Modify your model’s text preprocessing by changing from character-level tokenization to word-level tokenization. Compare the performance of both tokenization methods. Additionally, perform hyper-parameter optimization by experimenting with various settings (learning rate, hidden layers, hidden sizes, batch sizes, optimizers, and activation functions) and report your findings.

## 1. Initial Setup
- [ ] **Set Random Seeds:**  
  Ensure reproducibility by setting seeds for all random number generators (e.g., Python’s `random`, NumPy, TensorFlow/PyTorch).
  
- [ ] **Prepare the Environment:**  
  - Create a new or update an existing Jupyter Notebook.
  - Ensure that all necessary libraries (e.g., NumPy, pandas, TensorFlow/PyTorch, matplotlib, etc.) are installed.
  
- [ ] **Version Control:**  
  Initialize a Git repository (if not already done) and commit your initial setup.
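
A minimal sketch of the seed-setting step, assuming Python's `random` and NumPy are the generators in use. The helper name `set_seeds` is hypothetical. With TensorFlow, `tf.random.set_seed(seed)` would be called in the same place; it is left as a comment so the snippet runs without TensorFlow installed.

```python
import random

import numpy as np

def set_seeds(seed):
    """Seed the random number generators used in the experiments (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    # With TensorFlow available, also call: tf.random.set_seed(seed)

set_seeds(42)
first = (random.random(), float(np.random.rand()))
set_seeds(42)
second = (random.random(), float(np.random.rand()))
print("draws repeat after re-seeding:", first == second)
```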

## 2. Data Preprocessing
- [ ] **Load Dataset:**  
  Load your dataset into the notebook.
  
- [ ] **Tokenization:**
  - **Character-Level Tokenization:**  
    - Tokenize the text data at the character level.
    - Save and log the processed data.
  - **Word-Level Tokenization:**  
    - Modify the tokenization process to tokenize the text by words.
    - Save and log the processed data.
    
- [ ] **Comparison:**  
  - Create a section in your notebook to compare the two tokenization approaches.
  - Visualize or tabulate differences in vocabulary size, sequence lengths, and other relevant metrics.
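
The comparison can be prototyped on a toy corpus before running it on IMDB. The sketch below builds a binary bag representation at both granularities and reports vocabulary size and mean sequence length. The regex word splitter, the helper names, and the two example sentences are assumptions for illustration; they do not reproduce the Keras `Tokenizer` rules used in the notebook.

```python
import re

import numpy as np

def char_tokens(text):
    return list(text.lower())

def word_tokens(text):
    # Simple regex splitter (an assumption, not the Keras Tokenizer behaviour).
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocab(texts, tokenize):
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_binary_bag(texts, vocab, tokenize):
    X = np.zeros((len(texts), len(vocab)), dtype=np.float32)
    for row, text in enumerate(texts):
        for tok in tokenize(text):
            if tok in vocab:          # tokens unseen at fit time are dropped
                X[row, vocab[tok]] = 1.0
    return X

texts = ["A great movie", "A bad movie"]  # toy corpus, not IMDB data
stats = {}
for name, tokenize in [("char", char_tokens), ("word", word_tokens)]:
    vocab = build_vocab(texts, tokenize)
    X = to_binary_bag(texts, vocab, tokenize)
    mean_len = float(np.mean([len(tokenize(t)) for t in texts]))
    stats[name] = {"vocab_size": len(vocab), "mean_length": mean_len, "shape": X.shape}
    print(name, stats[name])
```

On real data the word-level vocabulary is far larger than the character-level one, so capping it (for example with the tokenizer's `num_words` argument) keeps the input dimension manageable.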

## 3. Model Architecture
- [ ] **Define the Model:**  
  Develop a model (or models) that can handle both tokenization types. Include the following adjustable hyper-parameters:
  - Learning rate
  - Number of hidden layers
  - Hidden sizes (neurons per layer)
  - Batch sizes
  - Optimizers (e.g., Adam, SGD, RMSProp)
  - Activation functions (e.g., ReLU, Tanh, LeakyReLU)
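
One way to expose the number of hidden layers, their sizes, and the activation function as hyper-parameters is to build the network from a configuration. The NumPy sketch below covers only the forward pass; `init_mlp`, `forward`, the 0.01 LeakyReLU slope, and the example configuration are illustrative assumptions rather than the notebook's implementation.

```python
import numpy as np

ACTIVATIONS = {
    "relu": lambda z: np.maximum(z, 0.0),
    "tanh": np.tanh,
    "leaky_relu": lambda z: np.where(z > 0, z, 0.01 * z),  # slope 0.01 is an assumption
}

def init_mlp(size_input, hidden_sizes, size_output, seed=0):
    """Return one (W, b) pair per layer, for any number of hidden layers."""
    rng = np.random.default_rng(seed)
    sizes = [size_input, *hidden_sizes, size_output]
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros((1, n)))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, X, activation="relu"):
    act = ACTIVATIONS[activation]
    a = X
    for W, b in params[:-1]:
        a = act(a @ W + b)      # hidden layers
    W, b = params[-1]
    return a @ W + b            # output layer returns logits

config = {"hidden_sizes": [128, 64, 32], "activation": "tanh"}  # example values
params = init_mlp(size_input=134, hidden_sizes=config["hidden_sizes"], size_output=2)
X_demo = np.random.default_rng(1).integers(0, 2, size=(4, 134)).astype(float)
logits = forward(params, X_demo, config["activation"])
print(len(params), "layers, logits shape", logits.shape)
```

The input size of 134 matches the character vocabulary size printed earlier in the notebook; a word-level input would use its own vocabulary size instead.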

## 4. Hyper-Parameter Optimization
- [ ] **Experiment Setup:**  
  For each hyper-parameter configuration, perform at least 3 different tests to ensure robustness.
  
- [ ] **Grid/Random Search:**  
  Set up a search over the following hyper-parameter ranges (example values provided):
  - **Learning Rate:** `[0.001, 0.0005, 0.0001]`
  - **Hidden Layers:** `[1, 2, 3]`
  - **Hidden Sizes:** `[128, 256, 512]`
  - **Batch Sizes:** `[32, 64, 128]`
  - **Optimizers:** `[Adam, SGD, RMSProp]`
  - **Activation Functions:** `[ReLU, Tanh, LeakyReLU]`
  
- [ ] **Logging:**  
  Record the results (accuracy, loss, etc.) for each configuration in tables or charts.
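
The example ranges above give 3^6 = 729 combinations, so a full grid with 3 runs per configuration is expensive. The sketch below enumerates the grid with `itertools.product` and draws a random subset as a cheaper alternative. The dictionary keys and the budget of 20 sampled configurations are arbitrary choices for illustration.

```python
import itertools
import random

search_space = {
    "learning_rate": [0.001, 0.0005, 0.0001],
    "hidden_layers": [1, 2, 3],
    "hidden_size": [128, 256, 512],
    "batch_size": [32, 64, 128],
    "optimizer": ["Adam", "SGD", "RMSProp"],
    "activation": ["ReLU", "Tanh", "LeakyReLU"],
}

# Full grid: one dict per combination of values.
keys = list(search_space)
grid = [dict(zip(keys, values))
        for values in itertools.product(*search_space.values())]

# Random search: sample a fixed budget of configurations without replacement.
rng = random.Random(0)
sampled = rng.sample(grid, k=20)

print("grid size:", len(grid))
print("first sampled configuration:", sampled[0])
```

Each sampled dictionary can be passed to a training function and logged together with its validation metrics.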

## 5. Model Training and Evaluation
- [ ] **Training with Each Configuration:**  
  Run experiments for both tokenization approaches with each set of hyper-parameters:
  - Train the model at least 3 times per configuration (keeping the seed constant at this stage).
  - Log training and validation performance.
  
- [ ] **Identify the Best Model:**  
  Select the best performing configuration based on validation metrics (e.g., accuracy).

## 6. Final Experiments
- [ ] **Robustness Check:**  
  Once the best model is identified:
  - Re-run the experiments at least 3 times with different random seeds.
  - Record the performance (accuracy) for each run.
  
- [ ] **Statistical Reporting:**  
  - Compute the **mean accuracy** and **standard error** across these runs.
  - Include these statistics in your report.
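
For n runs with accuracies a_1, ..., a_n, the standard error of the mean is s / sqrt(n), where s is the sample standard deviation (n - 1 in the denominator). A small sketch follows; the accuracy values are placeholders, not results from this notebook, and the helper name is hypothetical.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) for at least two values."""
    n = len(values)
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)  # stdev uses n - 1
    return mean, standard_error

accuracies = [0.80, 0.82, 0.84]  # placeholder values for three seeds
mean_acc, se_acc = mean_and_standard_error(accuracies)
print(f"accuracy = {mean_acc:.4f} +/- {se_acc:.4f} (n={len(accuracies)})")
```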

## 7. Documentation and Reporting
- [ ] **Jupyter Notebook:**  
  - Ensure that your notebook is well-commented and clearly documents each step.
  - Include code cells for setting seeds, data preprocessing, model building, training, evaluation, and visualization.
  
- [ ] **Detailed Report (Word Document):**  
  Prepare a report that includes:
  - **Introduction:** Objectives and overview of the work.
  - **Methodology:** Detailed explanation of tokenization changes and hyper-parameter optimization strategy.
  - **Experiments and Results:**  
    - Comparison between character-level and word-level tokenization.
    - Tables/graphs for hyper-parameter experiments.
    - Final model performance with mean accuracy and standard error.
  - **Discussion:** Analysis of results, challenges encountered, and insights.
  - **Conclusion:** Summarize the key findings.
  
- [ ] **Submission:**  
  - Submit your Jupyter Notebook.
  - Submit your Word document report.
  - Ensure that both files are included in your repository or submission package.

## 8. Final Checklist
- [ ] Every hyper-parameter configuration has been tested at least 3 times.
- [ ] Random seeds are set before any experiment.
- [ ] Hyper-parameter optimization covers changes in learning rate, hidden layers, hidden sizes, batch sizes, optimizers, and activation functions.
- [ ] The best model’s performance is verified with experiments on different seeds.
- [ ] The best model is compared with the random model shown above.
- [ ] The report clearly documents the methodology, experiments, results, and final conclusions.
- [ ] Extra credit (2 points): experiments with a deeper MLP_FA using the best settings are included.

---

> **Note:**  
> Keep thorough logs and document any observations during your experiments. Clear documentation is key to reproducibility and understanding your results.

