# CS145 Introduction to Data Mining - Assignment 3
## Deadline: 11:59PM, May 5, 2025

## Instructions
Each assignment is structured as a Jupyter notebook, offering interactive tutorials that align with our lectures. You will encounter two types of problems: *write-up problems* and *coding problems*.

1. **Write-up Problems:** These problems are primarily theoretical, requiring you to demonstrate your understanding of lecture concepts and to provide mathematical proofs for key theorems. Your answers should include sufficient steps for the mathematical derivations.
2. **Coding Problems:** Here, you will be engaging with practical coding tasks. These may involve completing code segments provided in the notebooks or developing models from scratch.

To ensure clarity and consistency in your submissions, please adhere to the following guidelines:

* For write-up problems, use Markdown bullet points to format text answers. Also, express all mathematical equations using $\LaTeX$ and avoid plain text such as x0, x^1, or R x Q for equations.
* For coding problems, comment on your code thoroughly for readability and ensure your code is executable. Non-runnable code may lead to a loss of **all** points. Coding problems have automated grading, and altering the grading code will result in a deduction of **all** points.
* Your submission should show the entire process of data loading, preprocessing, model implementation, training, and result analysis. This can be achieved through a mix of explanatory text cells, inline comments, intermediate result displays, and experimental visualizations.

### Submission Requirements

* Submit your solutions in .ipynb format through GradeScope in BruinLearn.
* Late submissions are allowed up to 24 hours post-deadline with a penalty factor of $\mathbf{1}(t\leq24)e^{-(\ln(2)/12)t}$.
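
The late-penalty factor can be evaluated numerically. The following is a minimal sketch, assuming $t$ is measured in hours past the deadline (the unit is implied by the 24-hour window but not stated in the formula); the function name `late_penalty_factor` is hypothetical.

```python
import math

def late_penalty_factor(hours_late: float) -> float:
    """Multiplier applied to a late score: 1(t <= 24) * exp(-(ln 2 / 12) * t)."""
    if hours_late > 24:
        return 0.0  # indicator term is zero past the 24-hour window
    return math.exp(-(math.log(2) / 12) * hours_late)

for t in (0, 6, 12, 24, 25):
    print(f"t = {t:>2} h -> factor = {late_penalty_factor(t):.3f}")
```

Under this assumption the factor halves every 12 hours: it is 0.5 at 12 hours and 0.25 at 24 hours, then drops to 0.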

### Collaboration and Integrity

* High-level discussions are allowed and encouraged, but all final submissions must be your own work. Please acknowledge any collaboration or external sources used, including websites, papers, and GitHub repositories.
* Any suspected cases of academic misconduct will be reported to The Office of the Dean of Students.

## Part 1: Write-Up Questions

### 1. Neural Network Derivatives (12 points total)


In this problem, you will analyze the effect of ReLU activation on weight gradients and derive gradients for a binary cross-entropy loss with a sigmoid output. **Answer each sub-question carefully with full justification.**

#### (a) ReLU Derivatives (6 points)

> Definition.
>
> $$
> \operatorname{ReLU}(x)=\max (0, x), \quad \operatorname{ReLU}^{\prime}(x)= \begin{cases}1, & x>0 \\ 0, & x \leq 0\end{cases}
> $$
>
>
> Hint. When a unit's input is non-positive, its ReLU output is zero and "turns off" all gradient flow through that unit.
>
![image-20250420170015320](https://drive.google.com/uc?id=1X0Est80T0gm1N2VrYq8MzmfdoxcdtAjR)

Consider a neural network with the following structure:
- **Input layer**: $ x_1, x_2 $
- **Hidden layer 1**: $ h_3, h_4 $ with ReLU activation
- **Hidden layer 2**: $ h_1, h_2 $ with ReLU activation
- **Output layer**: $\hat{y}$

The weights $w_1, w_2, \ldots, w_5$ connect individual units as shown in the diagram above, and we aim to minimize a loss function $L$ that depends only on the output $\hat{y}$. Suppose one ReLU unit, $h_1$, is “inactive” (its input is negative, so its output is 0).

1. Which partial derivatives
   $$
   \frac{\partial L}{\partial w_1}, \quad \frac{\partial L}{\partial w_2}, \quad \frac{\partial L}{\partial w_3}
   $$
   are guaranteed to be zero?  
2. **Justify** your answer by explaining how the ReLU activation cuts off gradients for inactive units.

### **1(a) ReLU Derivatives**

---

#### **1. Which partial derivatives are guaranteed to be zero?**

- $\frac{\partial L}{\partial w_1} = 0$  
- $\frac{\partial L}{\partial w_2} = 0$  
- $\frac{\partial L}{\partial w_3}$ is **not necessarily** zero

---

#### **2. Justification**

Since ReLU(x) = $\max(0, x)$, its derivative is:

$$
\text{ReLU}'(x) = 
\begin{cases}
1 & \text{if } x > 0 \\
0 & \text{if } x \leq 0
\end{cases}
$$

We’re told that hidden unit $h_1$ is inactive, meaning its pre-activation input (call it $a_1$) satisfies $a_1 \leq 0$, so:

$$
h_1 = \operatorname{ReLU}(a_1) = 0 \quad \text{and} \quad \frac{\partial h_1}{\partial a_1} = 0
$$

Every gradient path that passes through $h_1$ is multiplied by this zero factor, so no gradient flows through $h_1$ to earlier layers.

So:

- $\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h_1} \cdot \frac{\partial h_1}{\partial w_1} = 0$
- $\frac{\partial L}{\partial w_2}$ also involves $h_1$, so it’s 0 too.

However, $w_3$ sits on the connection between $x_1$ and $h_3$. We’re not told that $h_3$ is inactive, and it may still affect $\hat{y}$ through another active path (e.g., $h_2$), so $\frac{\partial L}{\partial w_3}$ is not guaranteed to be zero.

In short: if a ReLU unit is inactive, the gradient flow through it stops. Any weight that only influences the output through that unit gets zero gradient.
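
The blocking effect can be checked numerically. The sketch below uses a hypothetical toy chain (one input, one ReLU unit, one output weight, squared-error loss), not the network from the diagram; the names `w_in`, `w_out`, and `numeric_grad_w_in` are illustrative assumptions.

```python
def relu(a: float) -> float:
    return max(0.0, a)

def loss(w_in: float, w_out: float, x: float, target: float) -> float:
    """Squared error of a toy chain x -> ReLU(w_in * x) -> w_out * h."""
    h = relu(w_in * x)
    y_hat = w_out * h
    return (y_hat - target) ** 2

def numeric_grad_w_in(w_in, w_out, x, target, eps=1e-6):
    """Central finite-difference estimate of dL/dw_in."""
    up = loss(w_in + eps, w_out, x, target)
    down = loss(w_in - eps, w_out, x, target)
    return (up - down) / (2 * eps)

# Inactive unit: pre-activation w_in * x = -0.5 < 0, so the gradient is blocked.
grad_inactive = numeric_grad_w_in(w_in=-0.5, w_out=2.0, x=1.0, target=0.0)
# Active unit: pre-activation w_in * x = 0.5 > 0, so the gradient flows.
grad_active = numeric_grad_w_in(w_in=0.5, w_out=2.0, x=1.0, target=0.0)
print(grad_inactive, grad_active)
```

With the unit inactive, perturbing `w_in` leaves the loss unchanged, so the estimated gradient is zero; with the unit active, the estimate is close to the analytic value $2(\hat{y} - t)\, w_{\text{out}}\, x = 4$.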


#### (b) Cross-Entropy Loss: Chain Rule Derivation (6 points)

> We now replace the softmax/matrix formulation with a single‑neuron network using sigmoid + BCE.

Consider a neural network with:
- Inputs: $x_1, x_2$
- Output neuron $z=w_1 x_1+w_2 x_2+b$
- Activation: $\hat{y}=\sigma(z)=\frac{1}{1+e^{-z}}$
- True label: $y \in\{0,1\}$
- Loss:

$$
L=-[y \log \hat{y}+(1-y) \log (1-\hat{y})] \quad \text { (binary cross-entropy) }
$$


**Task.** Derive expressions for

$$
\frac{\partial L}{\partial w_1}, \quad \frac{\partial L}{\partial w_2}, \quad \frac{\partial L}{\partial b}
$$

by unfolding the chain rule through the sigmoid activation and the BCE loss.
Be sure to show each step:
1. $\partial L / \partial \hat{y}$
2. $\partial \hat{y} / \partial z$
3. $\partial z / \partial w_i$ and $\partial z / \partial b$
4. Combine to get $\partial L / \partial w_i, \partial L / \partial b$.

### **1(b) Cross-Entropy Loss: Chain Rule Derivation**
---

#### **Compute** $\frac{\partial L}{\partial \hat{y}}$

Differentiating the loss directly with respect to $\hat{y}$:

$$
\frac{\partial L}{\partial \hat{y}} = -\left( \frac{y}{\hat{y}} - \frac{1 - y}{1 - \hat{y}} \right)
$$

---

#### **Compute** $\frac{\partial \hat{y}}{\partial z}$

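Since $\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}$:

$$
\frac{\partial \hat{y}}{\partial z} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = \hat{y}(1 - \hat{y})
$$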

---

#### **Compute** $\frac{\partial z}{\partial w_i}$ and $\frac{\partial z}{\partial b}$

From $z = w_1x_1 + w_2x_2 + b$:

- $\frac{\partial z}{\partial w_1} = x_1$
- $\frac{\partial z}{\partial w_2} = x_2$
- $\frac{\partial z}{\partial b} = 1$

---

#### **Combine Using the Chain Rule**

Now apply the full chain rule:

**For $w_1$:**

$$
\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w_1}
= \left( -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} \right) \cdot \hat{y}(1 - \hat{y}) \cdot x_1
$$

Distributing $\hat{y}(1 - \hat{y})$ over the two terms gives $-y(1 - \hat{y}) + (1 - y)\hat{y} = \hat{y} - y$, so:

$$
\frac{\partial L}{\partial w_1} = (\hat{y} - y) \cdot x_1
$$

**Similarly:**

$$
\frac{\partial L}{\partial w_2} = (\hat{y} - y) \cdot x_2
$$

$$
\frac{\partial L}{\partial b} = (\hat{y} - y) \cdot 1 = \hat{y} - y
$$

---

#### **Final Answer**

- $\frac{\partial L}{\partial w_1} = (\hat{y} - y) \cdot x_1$
- $\frac{\partial L}{\partial w_2} = (\hat{y} - y) \cdot x_2$
- $\frac{\partial L}{\partial b} = \hat{y} - y$

These gradients are used to update the weights during training in logistic regression or a single-neuron binary classifier.
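
A finite-difference check is a common way to sanity-test a derivation like this. The following is a minimal sketch with arbitrary illustrative values (not taken from the assignment); the helper names `analytic_grads` and `numeric_grads` are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(w1, w2, b, x1, x2, y):
    """Binary cross-entropy of a single sigmoid neuron."""
    y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def analytic_grads(w1, w2, b, x1, x2, y):
    """Closed-form gradients: (y_hat - y) * x1, (y_hat - y) * x2, (y_hat - y)."""
    y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
    delta = y_hat - y
    return delta * x1, delta * x2, delta

def numeric_grads(w1, w2, b, x1, x2, y, eps=1e-6):
    """Central finite differences with respect to w1, w2, and b."""
    params = [w1, w2, b]
    grads = []
    for i in range(3):
        up, down = params.copy(), params.copy()
        up[i] += eps
        down[i] -= eps
        grads.append((bce_loss(*up, x1, x2, y) - bce_loss(*down, x1, x2, y)) / (2 * eps))
    return tuple(grads)

# Arbitrary illustrative values (an assumption for this check).
point = dict(w1=0.3, w2=-0.7, b=0.1, x1=1.5, x2=2.0, y=1)
print("analytic:", analytic_grads(**point))
print("numeric: ", numeric_grads(**point))
```

The two tuples should agree to several decimal places, which supports the closed-form result $(\hat{y} - y)\,x_i$.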


### 2. Two-Layer MLP for XOR (6 points)

We know a single-layer perceptron cannot represent the XOR function. A **two-layer network** with an appropriate choice of weights and biases can solve it.  

1. **Construct** such a two-layer MLP (with a single hidden layer) that outputs 1 for XOR=1 and 0 for XOR=0, given inputs $\{(x_1, x_2)\mid x_1,x_2 \in \{0,1\}\}$.  
   - Specify your network’s architecture (size of hidden layer, activation function, final layer).
   - Provide **explicit** weight and bias values.  
2. **Demonstrate** that for all $(x_1,x_2)$ in $\{0,1\}\times\{0,1\}$, the final output is correct (0 or 1) for XOR.

### **2. Two-Layer MLP for XOR**

#### 1. Network Architecture

We design a 2-layer MLP that computes the XOR function using:

- **Input:** $x_1, x_2$
- **Hidden layer:** 2 neurons, ReLU activation
- **Output layer:** 1 neuron, ReLU (or identity) activation

#### Weights and Biases

Define:

$$
W^{(1)} = 
\begin{bmatrix}
1 & 1 \\
1 & 1
\end{bmatrix}, \quad
b^{(1)} = 
\begin{bmatrix}
0 \\
-1
\end{bmatrix}
$$

$$
W^{(2)} = 
\begin{bmatrix}
1 & -2
\end{bmatrix}, \quad
b^{(2)} = 0
$$

The network computes:

$$
\begin{aligned}
z_1 &= W^{(1)} x + b^{(1)} \\
h &= \text{ReLU}(z_1) \\
\hat{y} &= \text{ReLU}(W^{(2)} h + b^{(2)})
\end{aligned}
$$

---

### 2. Verifying XOR Outputs

---

**Case 1: $(x_1, x_2) = (0, 0)$**

$$
z_1 = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} 
\begin{bmatrix} 0 \\ 0 \end{bmatrix} 
+ \begin{bmatrix} 0 \\ -1 \end{bmatrix}
= \begin{bmatrix} 0 \\ -1 \end{bmatrix}
$$

$$
h = \text{ReLU}(z_1) = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
$$

$$
\hat{y} = \text{ReLU}\left( \begin{bmatrix} 1 & -2 \end{bmatrix}
\begin{bmatrix} 0 \\ 0 \end{bmatrix} + 0 \right) 
= \text{ReLU}(0) = 0
$$

---

**Case 2: $(x_1, x_2) = (1, 0)$**

$$
z_1 = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} 
\begin{bmatrix} 1 \\ 0 \end{bmatrix} 
+ \begin{bmatrix} 0 \\ -1 \end{bmatrix}
= \begin{bmatrix} 1 \\ 0 \end{bmatrix}
$$

$$
h = \text{ReLU}(z_1) = \begin{bmatrix} 1 \\ 0 \end{bmatrix}
$$

$$
\hat{y} = \text{ReLU}\left( \begin{bmatrix} 1 & -2 \end{bmatrix}
\begin{bmatrix} 1 \\ 0 \end{bmatrix} + 0 \right) 
= \text{ReLU}(1) = 1
$$

---

**Case 3: $(x_1, x_2) = (0, 1)$**

$$
z_1 = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} 
\begin{bmatrix} 0 \\ 1 \end{bmatrix} 
+ \begin{bmatrix} 0 \\ -1 \end{bmatrix}
= \begin{bmatrix} 1 \\ 0 \end{bmatrix}
$$

$$
h = \text{ReLU}(z_1) = \begin{bmatrix} 1 \\ 0 \end{bmatrix}
$$

$$
\hat{y} = \text{ReLU}\left( \begin{bmatrix} 1 & -2 \end{bmatrix}
\begin{bmatrix} 1 \\ 0 \end{bmatrix} + 0 \right) 
= \text{ReLU}(1) = 1
$$

---

**Case 4: $(x_1, x_2) = (1, 1)$**

$$
z_1 = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} 
\begin{bmatrix} 1 \\ 1 \end{bmatrix} 
+ \begin{bmatrix} 0 \\ -1 \end{bmatrix}
= \begin{bmatrix} 2 \\ 1 \end{bmatrix}
$$

$$
h = \text{ReLU}(z_1) = \begin{bmatrix} 2 \\ 1 \end{bmatrix}
$$

$$
\hat{y} = \text{ReLU}\left( \begin{bmatrix} 1 & -2 \end{bmatrix}
\begin{bmatrix} 2 \\ 1 \end{bmatrix} + 0 \right)
= \text{ReLU}(2 - 2) = \text{ReLU}(0) = 0
$$

---

Thus, the network correctly computes the XOR function for all 4 possible binary input combinations.
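
The four cases can also be checked programmatically. This is a small sketch in plain Python using the weights and biases defined in this answer; the function name `xor_mlp` is hypothetical.

```python
def relu(a: float) -> float:
    return max(0.0, a)

# Weights and biases from the construction in this answer.
W1 = [[1.0, 1.0], [1.0, 1.0]]
b1 = [0.0, -1.0]
W2 = [1.0, -2.0]
b2 = 0.0

def xor_mlp(x1: float, x2: float) -> float:
    """Forward pass: h = ReLU(W1 x + b1), y_hat = ReLU(W2 h + b2)."""
    h = [relu(W1[j][0] * x1 + W1[j][1] * x2 + b1[j]) for j in range(2)]
    return relu(W2[0] * h[0] + W2[1] * h[1] + b2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(f"({x1}, {x2}) -> {xor_mlp(x1, x2):.0f}")
```

The loop prints the network output for each of the four binary inputs, which should match the XOR truth table.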


### 3. $K$-Means Clustering (8 points)

![image-20250420170622686](https://drive.google.com/uc?id=1SUOHENOHHkgQvRzSI3FIvaZF37c4wPuQ)

Recall the following data points in $\mathbb{R}^2$:

$$
\mathbf{X} = \begin{bmatrix}
5.9 & 3.2 \\
4.6 & 2.9 \\
6.2 & 2.8 \\
4.7 & 3.2 \\
5.5 & 4.2 \\
5.0 & 3.0 \\
4.9 & 3.1 \\
6.7 & 3.1 \\
5.1 & 3.8 \\
6.0 & 3.0
\end{bmatrix}
$$

We apply $K$-Means with $K=3$ using Euclidean distance, starting from cluster centers $\boldsymbol{\mu}_1=(6.2,3.2), \boldsymbol{\mu}_2=(6.6,3.7), \boldsymbol{\mu}_3=(6.5,3.0)$.

1. **(2 pts)** Calculate the center of Cluster 1 after one iteration. Show your intermediate step of assigning points and then compute the updated mean.
2. **(2 pts)** Compute the center of Cluster 2 after **two** iterations (round to three decimals).
3. **(2 pts)** Find the center of Cluster 3 at convergence (round to three decimals).
4. **(2 pts)** How many iterations does it take to converge (no changes in cluster assignments)?

### **3. $K$-Means Clustering**

We are given:

- Data matrix $\mathbf{X}$ of 10 points in $\mathbb{R}^2$
- Initial cluster centers:

$$
\mu_1 = (6.2, 3.2), \quad \mu_2 = (6.6, 3.7), \quad \mu_3 = (6.5, 3.0)
$$

---

### Iteration 1: Distance Table and Assignments

| Point             | $\mu_1$ (6.2, 3.2) | $\mu_2$ (6.6, 3.7) | $\mu_3$ (6.5, 3.0) | Assigned |
|------------------|-------------------|--------------------|--------------------|----------|
| (5.9, 3.2)        | **0.30**          | 0.86               | 0.63               | 1        |
| (4.6, 2.9)        | **1.63**          | 2.15               | 1.90               | 1        |
| (6.2, 2.8)        | 0.40              | 0.98               | **0.36**           | 3        |
| (4.7, 3.2)        | **1.50**          | 1.96               | 1.81               | 1        |
| (5.5, 4.2)        | 1.22              | **1.21**           | 1.56               | 2        |
| (5.0, 3.0)        | **1.22**          | 1.75               | 1.50               | 1        |
| (4.9, 3.1)        | **1.30**          | 1.80               | 1.60               | 1        |
| (6.7, 3.1)        | 0.51              | 0.61               | **0.22**           | 3        |
| (5.1, 3.8)        | **1.25**          | 1.50               | 1.61               | 1        |
| (6.0, 3.0)        | **0.28**          | 0.92               | 0.50               | 1        |

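**New center for Cluster 1** (assigned points: $(5.9, 3.2)$, $(4.6, 2.9)$, $(4.7, 3.2)$, $(5.0, 3.0)$, $(4.9, 3.1)$, $(5.1, 3.8)$, $(6.0, 3.0)$):

$$
\mu_1^{(1)} = \frac{1}{7} \sum_{i \in C_1} x_i = \frac{1}{7}(36.2,\ 22.2) \approx (5.171,\ 3.171)
$$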

---

### Iteration 2: Distance Table and Assignments

| Point             | $\mu_1$ (5.17, 3.17) | $\mu_2$ (5.5, 4.2) | $\mu_3$ (6.45, 2.95) | Assigned |
|------------------|---------------------|--------------------|----------------------|----------|
| (5.9, 3.2)        | 0.73                | 1.08               | **0.60**             | 3        |
| (4.6, 2.9)        | **0.63**            | 1.58               | 1.85                 | 1        |
| (6.2, 2.8)        | 1.09                | 1.57               | **0.29**             | 3        |
| (4.7, 3.2)        | **0.47**            | 1.28               | 1.77                 | 1        |
| (5.5, 4.2)        | 1.08                | **0.00**           | 1.57                 | 2        |
| (5.0, 3.0)        | **0.24**            | 1.30               | 1.45                 | 1        |
| (4.9, 3.1)        | **0.28**            | 1.25               | 1.56                 | 1        |
| (6.7, 3.1)        | 1.53                | 1.63               | **0.29**             | 3        |
| (5.1, 3.8)        | 0.63                | **0.57**           | 1.60                 | 2        |
| (6.0, 3.0)        | 0.85                | 1.30               | **0.45**             | 3        |

**New center for Cluster 2**:

$$
\mu_2^{(2)} = \frac{1}{2} \left[ (5.1, 3.8) + (5.5, 4.2) \right] = (5.3,\ 4.0)
$$

---

### Iteration 3: Distance Table and Final Assignment

| Point             | $\mu_1$ (4.80, 3.05) | $\mu_2$ (5.3, 4.0) | $\mu_3$ (6.2, 3.03) | Assigned |
|------------------|---------------------|--------------------|---------------------|----------|
| (5.9, 3.2)        | 1.11                | 1.00               | **0.35**            | 3        |
| (4.6, 2.9)        | **0.25**            | 1.30               | 1.60                | 1        |
| (6.2, 2.8)        | 1.42                | 1.50               | **0.23**            | 3        |
| (4.7, 3.2)        | **0.18**            | 1.00               | 1.51                | 1        |
| (5.5, 4.2)        | 1.35                | **0.28**           | 1.37                | 2        |
| (5.0, 3.0)        | **0.21**            | 1.04               | 1.20                | 1        |
| (4.9, 3.1)        | **0.11**            | 0.98               | 1.30                | 1        |
| (6.7, 3.1)        | 1.90                | 1.66               | **0.51**            | 3        |
| (5.1, 3.8)        | 0.81                | **0.28**           | 1.35                | 2        |
| (6.0, 3.0)        | 1.20                | 1.22               | **0.20**            | 3        |

**Final centers:**

- Cluster 1:  
  $$
  \mu_1 = \frac{1}{4} \left[ (4.6, 2.9) + (4.7, 3.2) + (4.9, 3.1) + (5.0, 3.0) \right] = (4.8,\ 3.05)
  $$

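- Cluster 2: $(5.3,\ 4.0)$ (unchanged from iteration 2)  
- Cluster 3: $\frac{1}{4} \left[ (5.9, 3.2) + (6.2, 2.8) + (6.7, 3.1) + (6.0, 3.0) \right] = (6.200,\ 3.025)$  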

---

### Number of Iterations to Converge

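Convergence occurs when assignments no longer change. The assignments in iteration 3 are identical to those in iteration 2, so the centers stop moving after the second update and convergence is confirmed in iteration 3.

**Answer:** $3$ iterations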

---


In [7]:
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Define data points
X = np.array([
    [5.9, 3.2],
    [4.6, 2.9],
    [6.2, 2.8],
    [4.7, 3.2],
    [5.5, 4.2],
    [5.0, 3.0],
    [4.9, 3.1],
    [6.7, 3.1],
    [5.1, 3.8],
    [6.0, 3.0]
])

# Initial centers
mu = np.array([
    [6.2, 3.2],
    [6.6, 3.7],
    [6.5, 3.0]
])

iteration = 0
prev_assignments = None

while True:
    iteration += 1
    print(f"\n=== Iteration {iteration} ===")
    
    # Compute distances
    distances = cdist(X, mu, metric='euclidean')
    
    # Assign clusters
    assignments = np.argmin(distances, axis=1)
    
    # Show distance matrix and assignments
    df = pd.DataFrame(distances, columns=[f'mu{i+1} {tuple(mu[i])}' for i in range(3)])
    df['Assigned Cluster'] = assignments + 1
    print(df.round(2))
    
    # Check for convergence
    if np.array_equal(assignments, prev_assignments):
        print("\nConvergence reached.")
        break
    
    # Update centers
    new_mu = []
    for k in range(3):
        cluster_points = X[assignments == k]
        if len(cluster_points) > 0:
            new_mu.append(cluster_points.mean(axis=0))
        else:
            new_mu.append(mu[k])  # if empty, retain old center
    mu = np.array(new_mu)
    
    prev_assignments = assignments.copy()

# Final cluster centers
print("\nFinal Cluster Centers:")
print(mu.round(3))



=== Iteration 1 ===
   mu1 (6.2, 3.2)  mu2 (6.6, 3.7)  mu3 (6.5, 3.0)  Assigned Cluster
0            0.30            0.86            0.63                 1
1            1.63            2.15            1.90                 1
2            0.40            0.98            0.36                 3
3            1.50            1.96            1.81                 1
4            1.22            1.21            1.56                 2
5            1.22            1.75            1.50                 1
6            1.30            1.80            1.60                 1
7            0.51            0.61            0.22                 3
8            1.25            1.50            1.61                 1
9            0.28            0.92            0.50                 1

=== Iteration 2 ===
   mu1 (5.171428571428572, 3.1714285714285713)  mu2 (5.5, 4.2)  \
0                                         0.73            1.08   
1                                         0.63            1.58   
2           

## Part 2: Coding Problems

Below is a skeleton of a PyTorch-based notebook. You will fill in the indicated `TODO` blocks. We will focus on:

1. **Implementing Dropout** in a fully connected network.
2. **Implementing k-Fold Cross-Validation** for hyperparameter selection.

### **Scoring Breakdown (20 points total)**

1. **Insert Dropout (5 points)**  
   *Implementation of `nn.Dropout` in the network initialization and forward pass.*

2. **Evaluate Dropout Performance (5 points)**  
   *Show how dropout changes training/test performance (e.g., final accuracy). Provide a brief analysis.*

3. **Implement k-Fold Cross-Validation (5 points)**  
   *Write a procedure that splits the training data into $k$ folds, trains on $k-1$ folds, and validates on the remaining fold.*

4. **Hyperparameter Tuning and Final Results (5 points)**  
   *Vary hyperparameters (e.g., learning rate, hidden dimension, dropout probability), identify the best setting, and report final test accuracy.*
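
The fold-splitting step in item 3 can be sketched without any library. The following is an illustrative, index-only version under the assumption that samples are addressed by integer position; the function name `k_fold_indices` is hypothetical, and in practice `sklearn.model_selection.KFold` provides the same splits.

```python
import random

def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once before splitting
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

for fold, (train_idx, val_idx) in enumerate(k_fold_indices(n_samples=10, k=3), start=1):
    print(f"fold {fold}: train={sorted(train_idx)} val={sorted(val_idx)}")
```

Each sample lands in exactly one validation fold, and the training indices for a fold are everything else.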

Below is a condensed example for reference. Use (and modify) as needed, and ensure each `TODO` is addressed in your final notebook.

In [8]:
# %pip install torch torchvision scikit-learn numpy --user # Ensure libraries are installed

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import numpy as np
from torchvision import transforms
from torch.utils.data import DataLoader, RandomSampler, Subset # Added Subset
from sklearn.model_selection import KFold # Added KFold
from itertools import product # Added product

# Set a seed for reproducibility (optional but good practice)
torch.manual_seed(42)
np.random.seed(42)

##################################
# 1. Data Loading and Transforms #
##################################
print("1. Loading Data...") # Keep original print
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

train_dataset = torchvision.datasets.FashionMNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

test_dataset = torchvision.datasets.FashionMNIST(
    root='./data',
    train=False,
    download=True,
    transform=transform
)

batch_size = 64
# Keep original variable name train_loader for the full dataset initially
train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=RandomSampler(train_dataset))
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
print("Data loaded.") # Keep original print

############################################
# 2. Model Definition + Dropout (5 points) #
############################################
print("\n2. Defining Model...") # Keep original print
class SimpleFCNetwork(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=128, num_classes=10, dropout_prob=0.5):
        super(SimpleFCNetwork, self).__init__()

        # Store hyperparams (optional but good practice)
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.num_classes = num_classes
        self.dropout_prob = dropout_prob # Store dropout prob

        # 2-layer fully-connected
        self.fc1 = nn.Linear(self.input_dim, self.hidden_dim)

        # TODO (Insert Dropout layer here) - (3 points)
        # --- Implementation Start ---
        self.dropout = nn.Dropout(p=self.dropout_prob)
        # --- Implementation End ---

        self.fc2 = nn.Linear(self.hidden_dim, self.num_classes)

    def forward(self, x):
        x = x.view(x.size(0), -1) # Flatten
        x = self.fc1(x)
        x = F.relu(x)

        # TODO (Apply dropout here) - (2 points)
        # --- Implementation Start ---
        # Dropout is automatically handled by model.train() vs model.eval()
        x = self.dropout(x)
        # --- Implementation End ---

        x = self.fc2(x)
        return x
print("Model defined.") # Keep original print

##############################
# 3. Training and Evaluation #
##############################
print("\n3. Defining Training and Evaluation Functions...") # Keep original print
def train_model(model, train_loader, criterion, optimizer, epochs=10):
    """ Trains the model, keeping original print statement format """
    model.train() # Set model to training mode
    for epoch in range(epochs):
        total_loss = 0
        for images, labels in train_loader:
            # Optional: device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
            # images, labels = images.to(device), labels.to(device)
            # model.to(device)

            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)
        # Maintain original print statement (prints loss every epoch)
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {avg_loss:.4f}")
    return model

def evaluate_model(model, data_loader):
    """ Evaluates the model, same as before """
    model.eval() # Set model to evaluation mode
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in data_loader:
            # Optional: device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
            # images, labels = images.to(device), labels.to(device)
            # model.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    accuracy = 100.0 * correct / total
    return accuracy
print("Training and evaluation functions defined.") # Keep original print


1. Loading Data...
Data loaded.

2. Defining Model...
Model defined.

3. Defining Training and Evaluation Functions...
Training and evaluation functions defined.



## 1. Insert Dropout (Implementation)

* **Status:** Completed.  
* **Details:** The `SimpleFCNetwork` class was modified as required within the original template structure. An `nn.Dropout` layer was instantiated in `__init__` and applied in the `forward` method after the ReLU activation, replacing the `TODO` placeholders.
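
To illustrate what the dropout layer does in each mode, here is a plain-Python sketch of inverted dropout, the scheme `nn.Dropout` uses: during training each element is zeroed with probability $p$ and the survivors are scaled by $1/(1-p)$, while in evaluation mode the layer is the identity. The function `dropout` below is a hypothetical stand-in that assumes $0 \leq p < 1$; it is not the PyTorch implementation.

```python
import random

def dropout(values, p: float, training: bool, rng: random.Random):
    """Inverted dropout on a list of activations (assumes 0 <= p < 1)."""
    if not training or p == 0.0:
        return list(values)  # evaluation mode: identity
    keep_scale = 1.0 / (1.0 - p)
    # Each unit is zeroed with probability p; survivors are scaled by 1 / (1 - p).
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
print("train:", dropout(activations, p=0.5, training=True, rng=rng))
print("eval: ", dropout(activations, p=0.5, training=False, rng=rng))
```

The rescaling keeps the expected activation the same in both modes, which is why no extra scaling is needed at test time.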


In [9]:

# Example usage (for demonstration; you will do more thorough experiments):
# --- Keep original example ---
print("\nRunning initial example usage...")
model = SimpleFCNetwork(dropout_prob=0.5) # Uses default hidden_dim=128
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3) # Original uses SGD

# Note: This trains a model but its results are not directly used later.
# The dropout evaluation and CV parts will train their own models.
model = train_model(model, train_loader, criterion, optimizer, epochs=10) # Original uses 10 epochs
test_accuracy = evaluate_model(model, test_loader)
print(f"Test Accuracy from initial example (Dropout=0.5, SGD): {test_accuracy:.2f}%") # Modified print slightly for clarity




Running initial example usage...
Epoch [1/10], Loss: 1.7445
Epoch [2/10], Loss: 1.1834
Epoch [3/10], Loss: 0.9802
Epoch [4/10], Loss: 0.8824
Epoch [5/10], Loss: 0.8180
Epoch [6/10], Loss: 0.7761
Epoch [7/10], Loss: 0.7457
Epoch [8/10], Loss: 0.7193
Epoch [9/10], Loss: 0.6998
Epoch [10/10], Loss: 0.6814
Test Accuracy from initial example (Dropout=0.5, SGD): 78.25%


In [10]:
###########################################
# 4. Evaluate Dropout Performance (Added) #
###########################################
print("\n4. Evaluating Dropout Performance (Using Adam Optimizer)...")

# --- Train and Evaluate WITHOUT Dropout ---
print("Training model WITHOUT dropout (p=0.0)...")
model_no_dropout = SimpleFCNetwork(hidden_dim=128, dropout_prob=0.0) # Fixed hidden_dim
criterion_comp = nn.CrossEntropyLoss()
optimizer_no_dropout = optim.Adam(model_no_dropout.parameters(), lr=1e-3) # Use Adam here

train_model(model_no_dropout, train_loader, criterion_comp, optimizer_no_dropout, epochs=10) # Use 10 epochs for comparison
test_accuracy_no_dropout = evaluate_model(model_no_dropout, test_loader)
print(f"Test Accuracy WITHOUT Dropout: {test_accuracy_no_dropout:.2f}%")

# --- Train and Evaluate WITH Dropout ---
print("\nTraining model WITH dropout (p=0.5)...")
model_with_dropout = SimpleFCNetwork(hidden_dim=128, dropout_prob=0.5) # Fixed hidden_dim
optimizer_with_dropout = optim.Adam(model_with_dropout.parameters(), lr=1e-3) # Use Adam here

train_model(model_with_dropout, train_loader, criterion_comp, optimizer_with_dropout, epochs=10) # Use 10 epochs for comparison
test_accuracy_with_dropout = evaluate_model(model_with_dropout, test_loader)
print(f"Test Accuracy WITH Dropout (p=0.5): {test_accuracy_with_dropout:.2f}%")

# --- Analysis printout ---
print("\nDropout Performance Analysis:")
print(f"Comparing test accuracies (Adam, 10 epochs): No Dropout={test_accuracy_no_dropout:.2f}%, Dropout(p=0.5)={test_accuracy_with_dropout:.2f}%.")
if test_accuracy_with_dropout > test_accuracy_no_dropout:
    print("Dropout improved test accuracy in this run.")
elif test_accuracy_with_dropout < test_accuracy_no_dropout:
    print("Dropout slightly decreased test accuracy in this run.")
else:
    print("Dropout had minimal impact in this run.")




4. Evaluating Dropout Performance (Using Adam Optimizer)...
Training model WITHOUT dropout (p=0.0)...
Epoch [1/10], Loss: 0.4971
Epoch [2/10], Loss: 0.3766
Epoch [3/10], Loss: 0.3398
Epoch [4/10], Loss: 0.3157
Epoch [5/10], Loss: 0.2991
Epoch [6/10], Loss: 0.2845
Epoch [7/10], Loss: 0.2698
Epoch [8/10], Loss: 0.2591
Epoch [9/10], Loss: 0.2494
Epoch [10/10], Loss: 0.2396
Test Accuracy WITHOUT Dropout: 88.30%

Training model WITH dropout (p=0.5)...
Epoch [1/10], Loss: 0.6170
Epoch [2/10], Loss: 0.4804
Epoch [3/10], Loss: 0.4532
Epoch [4/10], Loss: 0.4364
Epoch [5/10], Loss: 0.4172
Epoch [6/10], Loss: 0.4086
Epoch [7/10], Loss: 0.3998
Epoch [8/10], Loss: 0.3904
Epoch [9/10], Loss: 0.3841
Epoch [10/10], Loss: 0.3791
Test Accuracy WITH Dropout (p=0.5): 86.94%

Dropout Performance Analysis:
Comparing test accuracies (Adam, 10 epochs): No Dropout=88.30%, Dropout(p=0.5)=86.94%.
Dropout slightly decreased test accuracy in this run.


## 2. Evaluate Dropout Performance

* **Objective:** To compare the test performance of the network with and without dropout enabled, using fixed hyperparameters (hidden_dim=128, lr=1e-3 with Adam optimizer, 10 epochs).
* **Results:**
    * Test Accuracy **WITHOUT** Dropout (p=0.0): `88.30%`
    * Test Accuracy **WITH** Dropout (p=0.5): `86.94%`
* **Analysis:** In the dedicated comparison (section 4 of the code), dropout with p=0.5 *decreased* test accuracy by 1.36 percentage points (`88.30%` to `86.94%`) for this 10-epoch run with the Adam optimizer. The final training loss was also higher with dropout (`0.3791` vs. `0.2396`), which is expected because about half of the hidden units are dropped during training. This suggests the model may not have been overfitting significantly in such a short run, or that p=0.5 was too high. The initial example run (SGD, p=0.5) reached only `78.25%`, highlighting sensitivity to the optimizer and training details.


In [11]:
###############################################
# 5. k-Fold Cross-Validation (5 points)       #
#    + Hyperparameter Tuning & Results (5 pts) #
###############################################
print("\n5. Implementing k-Fold Cross-Validation and Hyperparameter Tuning...")

# --- Define cross_validate function (replaces original skeleton) ---
def cross_validate(model_class, train_dataset, k=5, param_grid=None, epochs_per_fold=5, batch_size_cv=64):
    """
    Performs k-fold cross-validation to find the best hyperparameters.
    Uses original param_grid docstring format.
    """
    # --- Docstring from original template ---
    # param_grid is a dict of hyperparameters, e.g.:
    # {
    #   'lr': [1e-2, 1e-3, 1e-4],
    #   'hidden_dim': [64, 128, 256],
    #   'dropout_prob': [0.25, 0.5]
    # }
    # ---

    if not isinstance(param_grid, dict):
        raise ValueError("param_grid must be a dictionary")

    best_params = None
    best_accuracy = 0.0 # Changed name slightly from template

    # TODO: (Original comments kept for reference, logic implemented below)
    # (1) Shuffle and split 'train_dataset' into k folds
    # (2) For each param combination:
    #     For each fold:
    #         Train on k-1 folds, validate on the remaining fold
    #         Accumulate validation accuracy
    #     Average the accuracy across folds
    #     Track the best param combination

    # --- Implementation Start ---
    param_combinations = list(product(*param_grid.values()))
    param_names = list(param_grid.keys())

    print(f"\nStarting {k}-Fold Cross-Validation with {len(param_combinations)} parameter combinations.")
    print(f"Training for {epochs_per_fold} epochs per fold.")

    kf = KFold(n_splits=k, shuffle=True, random_state=42)

    for i, values in enumerate(param_combinations):
        current_params = dict(zip(param_names, values))
        print(f"\nTesting Combination {i+1}/{len(param_combinations)}: {current_params}")
        fold_accuracies = []

        for fold, (train_idx, val_idx) in enumerate(kf.split(train_dataset)):
            print(f"  Fold {fold+1}/{k}...")
            train_subset = Subset(train_dataset, train_idx)
            val_subset = Subset(train_dataset, val_idx)

            # Use fold-specific loaders
            train_loader_fold = DataLoader(train_subset, batch_size=batch_size_cv, shuffle=True)
            val_loader_fold = DataLoader(val_subset, batch_size=batch_size_cv, shuffle=False)

            # Instantiate model, filtering params
            model_init_params = {
                'hidden_dim': current_params.get('hidden_dim', 128),
                'dropout_prob': current_params.get('dropout_prob', 0.5),
                'input_dim': 784, 'num_classes': 10 # Fixed for FashionMNIST
            }
            model_fold = model_class(**model_init_params)

            if 'lr' not in current_params:
                 raise KeyError("'lr' must be in param_grid.")
            optimizer_fold = optim.Adam(model_fold.parameters(), lr=current_params['lr']) # Use Adam for CV
            criterion_fold = nn.CrossEntropyLoss()

            # Train model (will print loss each epoch as per modified train_model)
            train_model(model_fold, train_loader_fold, criterion_fold, optimizer_fold, epochs=epochs_per_fold)

            # Validate
            val_acc = evaluate_model(model_fold, val_loader_fold)
            fold_accuracies.append(val_acc)
            print(f"    Fold {fold+1} Validation Accuracy: {val_acc:.2f}%") # Indented print

        # Average accuracy for this param set
        avg_acc = np.mean(fold_accuracies)
        print(f"  Params: {current_params}, Avg Validation Accuracy across {k} folds: {avg_acc:.2f}%") # Indented print

        # Track best params
        if avg_acc > best_accuracy:
            best_accuracy = avg_acc
            best_params = current_params
            print("  >> New best parameters found!")

    print("\nCross-validation finished.")
    if best_params:
        print(f"Best parameters found: {best_params}")
        print(f"Best average validation accuracy: {best_accuracy:.2f}%")
    else:
        print("No best parameters found.")
    # --- Implementation End ---

    return best_params, best_accuracy

# --- Define Hyperparameter Grid and Run CV ---
param_grid_to_search = {
    'lr': [1e-3, 5e-4],
    'hidden_dim': [128, 256],
    'dropout_prob': [0.3, 0.5]
}
num_folds = 3
epochs_cv = 5

# Call cross_validate (replaces commented-out line from template)
best_params, best_val_acc = cross_validate(
    SimpleFCNetwork,
    train_dataset,
    k=num_folds,
    param_grid=param_grid_to_search,
    epochs_per_fold=epochs_cv,
    batch_size_cv=batch_size
)



5. Implementing k-Fold Cross-Validation and Hyperparameter Tuning...

Starting 3-Fold Cross-Validation with 8 parameter combinations.
Training for 5 epochs per fold.

Testing Combination 1/8: {'lr': 0.001, 'hidden_dim': 128, 'dropout_prob': 0.3}
  Fold 1/3...
Epoch [1/5], Loss: 0.5913
Epoch [2/5], Loss: 0.4516
Epoch [3/5], Loss: 0.4097
Epoch [4/5], Loss: 0.3894
Epoch [5/5], Loss: 0.3681
    Fold 1 Validation Accuracy: 86.92%
  Fold 2/3...
Epoch [1/5], Loss: 0.5890
Epoch [2/5], Loss: 0.4517
Epoch [3/5], Loss: 0.4110
Epoch [4/5], Loss: 0.3895
Epoch [5/5], Loss: 0.3749
    Fold 2 Validation Accuracy: 86.59%
  Fold 3/3...
Epoch [1/5], Loss: 0.5893
Epoch [2/5], Loss: 0.4499
Epoch [3/5], Loss: 0.4100
Epoch [4/5], Loss: 0.3883
Epoch [5/5], Loss: 0.3731
    Fold 3 Validation Accuracy: 87.08%
  Params: {'lr': 0.001, 'hidden_dim': 128, 'dropout_prob': 0.3}, Avg Validation Accuracy across 3 folds: 86.86%
  >> New best parameters found!

Testing Combination 2/8: {'lr': 0.001, 'hidden_dim': 128, '

## 3. Implement k-Fold Cross-Validation (Implementation)

* **Status:** Completed.  
* **Details:** The skeleton `cross_validate` function in the template was replaced with a full implementation. It uses `sklearn.model_selection.KFold`, iterates through parameter combinations defined in a `param_grid`, trains/validates on the appropriate data subsets for each fold, averages validation accuracies, and identifies the best parameter set.
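
The two mechanics described above, expanding `param_grid` into combinations and splitting sample indices into folds, can be sketched in isolation. This is a minimal standard-library illustration, not the notebook's implementation: `expand_grid` and `k_fold_indices` are hypothetical helper names, and `k_fold_indices` is only a stand-in for `KFold(shuffle=True)`, so its shuffle order will not match sklearn's. The dataset size of 10 is an arbitrary toy value.

```python
import random
from itertools import product

# Grid searched in this notebook.
param_grid = {
    'lr': [1e-3, 5e-4],
    'hidden_dim': [128, 256],
    'dropout_prob': [0.3, 0.5],
}

def expand_grid(grid):
    """Return one dict per combination, in itertools.product order."""
    names = list(grid.keys())
    return [dict(zip(names, values)) for values in product(*grid.values())]

def k_fold_indices(n_samples, k, seed=42):
    """Shuffle 0..n_samples-1 and yield (train_idx, val_idx) for each of k folds.

    Hypothetical stand-in for sklearn's KFold(shuffle=True).
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

combinations = expand_grid(param_grid)
folds = list(k_fold_indices(n_samples=10, k=3))

print(len(combinations))                      # 8
print(combinations[0])                        # {'lr': 0.001, 'hidden_dim': 128, 'dropout_prob': 0.3}
print([(len(t), len(v)) for t, v in folds])   # [(6, 4), (7, 3), (7, 3)]
```

Each sample lands in exactly one validation fold, so every combination is scored on all of the training data once, at the cost of 8 × 3 = 24 separate training runs for this grid.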


In [12]:
# --- Train Final Model with Best Parameters ---
print("\n6. Training Final Model with Best Parameters Found...") # New section header

if best_params:
    # Instantiate final model (replaces commented-out line from template)
    final_model_init_params = {
        'hidden_dim': best_params.get('hidden_dim', 128),
        'dropout_prob': best_params.get('dropout_prob', 0.5),
        'input_dim': 784, 'num_classes': 10
    }
    final_model = SimpleFCNetwork(**final_model_init_params)

    # Create optimizer and criterion for final model
    final_optimizer = optim.Adam(final_model.parameters(), lr=best_params['lr'])
    final_criterion = nn.CrossEntropyLoss()

    # Train on entire train_dataset (replaces commented-out line from template)
    final_epochs = 15 # Use more epochs for final training
    print(f"Training final model on full training data for {final_epochs} epochs with params: {best_params}")
    # Build a fresh shuffled loader over the full training set
    full_train_loader_final = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    train_model(final_model, full_train_loader_final, final_criterion, final_optimizer, epochs=final_epochs)

    # Evaluate final model (replaces commented-out line from template)
    print("\n7. Evaluating Final Model on Test Set...") # New section header
    test_acc = evaluate_model(final_model, test_loader)

    # Print final results (replaces commented-out line from template)
    print("\n" + "="*50)
    print("                Final Results Summary")
    print("="*50)
    print(f"Hyperparameters selected via {num_folds}-Fold CV:")
    print(f"  Best Parameters: {best_params}")
    print(f"  Best Avg. Validation Acc (CV): {best_val_acc:.2f}%")
    print("-"*50)
    print("Final Model Performance:")
    print(f"  Test Accuracy: {test_acc:.2f}%") # Original template format: print("Final Test Accuracy =", test_acc)
    print("="*50)
else:
    print("Cross-validation did not yield best parameters. Skipping final training.")

print("\nScript finished.") # New final print


6. Training Final Model with Best Parameters Found...
Training final model on full training data for 15 epochs with params: {'lr': 0.0005, 'hidden_dim': 256, 'dropout_prob': 0.3}
Epoch [1/15], Loss: 0.5486
Epoch [2/15], Loss: 0.4133
Epoch [3/15], Loss: 0.3783
Epoch [4/15], Loss: 0.3557
Epoch [5/15], Loss: 0.3373
Epoch [6/15], Loss: 0.3254
Epoch [7/15], Loss: 0.3130
Epoch [8/15], Loss: 0.3047
Epoch [9/15], Loss: 0.2945
Epoch [10/15], Loss: 0.2876
Epoch [11/15], Loss: 0.2809
Epoch [12/15], Loss: 0.2723
Epoch [13/15], Loss: 0.2663
Epoch [14/15], Loss: 0.2591
Epoch [15/15], Loss: 0.2559

7. Evaluating Final Model on Test Set...

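==================================================
                Final Results Summary
==================================================
Hyperparameters selected via 3-Fold CV:
  Best Parameters: {'lr': 0.0005, 'hidden_dim': 256, 'dropout_prob': 0.3}
  Best Avg. Validation Acc (CV): 87.40%
--------------------------------------------------
Final Model Performance:
  Test Accuracy: 88.58%
==================================================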

Script finished.


## 4. Hyperparameter Tuning and Final Results

* **Objective:** Use 3-fold cross-validation to find the best combination of learning rate, hidden dimension size, and dropout probability from the specified grid, then report the final test accuracy using these parameters.

* **Hyperparameter Grid Searched:**
    * `lr`: [0.001, 0.0005]  
    * `hidden_dim`: [128, 256]  
    * `dropout_prob`: [0.3, 0.5]

* **Cross-Validation Results:**
    * **Best Parameters Found:** `{'lr': 0.0005, 'hidden_dim': 256, 'dropout_prob': 0.3}`
    * **Best Average Validation Accuracy (during CV):** `87.40%`

* **Final Model Training and Evaluation:**
    * A final model was trained on the *entire* training dataset using the best parameters (`lr=0.0005`, `hidden_dim=256`, `dropout_prob=0.3`) for 15 epochs.
    * **Final Test Accuracy:** `88.58%`

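* **Analysis:** Cross-validation selected the combination with the highest average validation accuracy among the 8 grid points: the larger hidden layer (256), the lower dropout rate (0.3), and the lower learning rate (0.0005). This is the best setting within the grid rather than a global optimum, since each hyperparameter was searched over only two values. The final test accuracy (`88.58%`) is higher than both the initial baseline runs and the best average validation accuracy (`87.40%`), but the test and validation figures are not directly comparable: the final model was trained for 15 epochs on the entire training set, whereas each CV model was trained for only 5 epochs on two of the three folds. The gap between them therefore likely reflects the extra data and training time more than the tuning itself. The selected dropout rate (0.3) also proved more effective on average than the higher 0.5 rate during cross-validation.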
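
The comparison between dropout rates "on average" can be made explicit by averaging the per-combination CV accuracy over every combination that shares a given value. The sketch below assumes the average accuracy of each combination was recorded as `(params, avg_acc)` pairs; `cross_validate` as written returns only the best combination, so `cv_results` and `marginal_means` are hypothetical names, and the accuracies shown are placeholders rather than numbers from this run.

```python
from collections import defaultdict
from statistics import mean

def marginal_means(cv_results, name):
    """Average CV accuracy over all combinations sharing each value of `name`."""
    grouped = defaultdict(list)
    for params, avg_acc in cv_results:
        grouped[params[name]].append(avg_acc)
    return {value: mean(accs) for value, accs in grouped.items()}

# Placeholder accuracies for illustration only; NOT results from this notebook.
example_results = [
    ({'lr': 1e-3, 'dropout_prob': 0.3}, 80.0),
    ({'lr': 1e-3, 'dropout_prob': 0.5}, 78.0),
    ({'lr': 5e-4, 'dropout_prob': 0.3}, 82.0),
    ({'lr': 5e-4, 'dropout_prob': 0.5}, 76.0),
]

print(marginal_means(example_results, 'dropout_prob'))  # {0.3: 81.0, 0.5: 77.0}
```

With only 3 folds per combination, small differences between marginal means may fall within fold-to-fold noise, so such a comparison is suggestive rather than conclusive.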
