# PyTorch Assignment: Convolutional Neural Network (CNN)

**[Duke Community Standard](http://integrity.duke.edu/standard.html): By typing your name below, you are certifying that you have adhered to the Duke Community Standard in completing this assignment.**

Name: 

### Convolutional Neural Network

Adapt the CNN example for MNIST digit classification from Notebook 3A. 
Feel free to play around with the model architecture and see how the training time/performance changes, but to begin, try the following:

Image ->  
convolution (32 3x3 filters) -> nonlinearity (ReLU) ->  
convolution (32 3x3 filters) -> nonlinearity (ReLU) -> (2x2 max pool) ->  
convolution (64 3x3 filters) -> nonlinearity (ReLU) ->  
convolution (64 3x3 filters) -> nonlinearity (ReLU) -> (2x2 max pool) -> flatten ->  
fully connected (256 hidden units) -> nonlinearity (ReLU) ->  
fully connected (10 output units) -> softmax 

Note: The CNN model might take a while to train. Depending on your machine, you might expect this to take up to half an hour. If you see your validation performance start to plateau, you can kill the training.

original: $5 \times 5$ convolution -> $2 \times 2$ max pool -> $5 \times 5$ convolution -> $2 \times 2$ max pool -> fully connected to $\mathbb R^{256}$ -> fully connected to $\mathbb R^{10}$ (prediction)
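The input size of the first fully connected layer follows from the architecture above. The sketch below traces the spatial size through the two conv-conv-pool blocks. It assumes 28x28 MNIST inputs, stride 1, and 2x2 pooling with stride 2; the padding is not specified in the prompt, so both `padding=1` ("same" size) and `padding=0` are shown. The helper names `conv_out`, `pool_out`, and `flattened_features` are hypothetical and exist only for this illustration.

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    """Spatial size after a convolution: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2):
    """Spatial size after non-overlapping max pooling (stride equal to kernel size)."""
    return size // kernel


def flattened_features(size=28, padding=1, channels=64):
    """Number of features after two conv -> conv -> pool blocks and a flatten."""
    for _ in range(2):
        size = conv_out(size, padding=padding)  # first 3x3 convolution in the block
        size = conv_out(size, padding=padding)  # second 3x3 convolution in the block
        size = pool_out(size)                   # 2x2 max pool halves the size
    return channels * size * size


print(flattened_features(padding=1))  # 64 * 7 * 7 = 3136
print(flattened_features(padding=0))  # 64 * 4 * 4 = 1024
```

With `padding=1` the convolutions preserve the spatial size, so only the two pooling steps shrink the image (28 -> 14 -> 7). Without padding, each 3x3 convolution removes two pixels per dimension, and the first fully connected layer would need 1024 inputs instead of 3136.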

In [1]:
### YOUR CODE HERE ###
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from tqdm.notebook import tqdm, trange

class MNIST_CNN(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.conv1_1 = nn.Conv2d(1, 32, kernel_size=3, padding=1) # 1 input channel (MNIST images are grayscale), 32 output channels
        self.conv1_2 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.conv2_1 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.conv2_2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        
        # Flatten -> fully connected layers
        # The input to the first fully connected layer will be 64 * 7 * 7 = 3136 features
        self.fc1 = nn.Linear(64 * 7 * 7, 256) # 64 channels * 7 height * 7 width; MNIST images are 28x28 grayscale, halved twice by pooling
        self.fc2 = nn.Linear(256, 10) 

        # self.conv1 = nn.Conv2d(1, 32, kernel_size=5, padding=2)
        # self.conv2 = nn.Conv2d(32, 64, kernel_size=5, padding=2)
        # self.fc1 = nn.Linear(7*7*64, 256)
        # self.fc2 = nn.Linear(256, 10)
        
    def forward(self, x):
        # First Block
        x = F.relu(self.conv1_1(x))
        x = F.relu(self.conv1_2(x))
        x = F.max_pool2d(x, kernel_size=2)

        # Second Block
        x = F.relu(self.conv2_1(x))
        x = F.relu(self.conv2_2(x))
        x = F.max_pool2d(x, kernel_size=2)

        # Flatten for fully connected layers
        # x.size(0) gets the batch size, -1 infers the remaining dimensions
        x = x.view(x.size(0), -1) 

        # Fully Connected Layers
        x = F.relu(self.fc1(x))
        x = self.fc2(x) # No softmax here; CrossEntropyLoss handles it internally

        return x
        
        # conv layer 1
        # x = self.conv1(x)
        # x = F.relu(x)
        # x = F.max_pool2d(x, kernel_size=2)
        
        # # conv layer 2
        # x = self.conv2(x)
        # x = F.relu(x)
        # x = F.max_pool2d(x, kernel_size=2)
        
        # # fc layer 1
        # x = x.view(-1, 7*7*64)
        # x = self.fc1(x)
        # x = F.relu(x)
        
        # # fc layer 2
        # x = self.fc2(x)
        # return x     

# Load the data
mnist_train = datasets.MNIST(root="./datasets", train=True, transform=transforms.ToTensor(), download=True)
mnist_test = datasets.MNIST(root="./datasets", train=False, transform=transforms.ToTensor(), download=True)
train_loader = torch.utils.data.DataLoader(mnist_train, batch_size=100, shuffle=True)
test_loader = torch.utils.data.DataLoader(mnist_test, batch_size=100, shuffle=False)

## Training
# Instantiate model  
model = MNIST_CNN()  # <---- change here

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # <---- change here

# Iterate through train set minibatches 
for epoch in trange(3):  # <---- change here
    for images, labels in tqdm(train_loader):
        # Zero out the gradients
        optimizer.zero_grad()

        # Forward pass
        x = images  # <---- change here 
        y = model(x)
        loss = criterion(y, labels)
        # Backward pass
        loss.backward()
        optimizer.step()

## Testing
correct = 0
total = len(mnist_test)

with torch.no_grad():
    # Iterate through test set minibatches 
    for images, labels in tqdm(test_loader):
        # Forward pass
        x = images  # <---- change here 
        y = model(x)

        predictions = torch.argmax(y, dim=1)
        correct += torch.sum((predictions == labels).float())

print('Test accuracy: {}'.format(correct/total))

Test accuracy: 0.9907000064849854
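The architecture description ends in a softmax, while the model's `forward` returns raw logits. The two are consistent because `nn.CrossEntropyLoss` applies log-softmax internally, and `argmax` over logits picks the same class as `argmax` over softmax probabilities. The following is a minimal pure-Python sketch of that relationship for a single example; the function names and the logit values are made up for illustration and are not part of the assignment code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, target):
    """Cross-entropy computed directly from logits: logsumexp(logits) - logits[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


logits = [2.0, -1.0, 0.5]  # hypothetical network outputs for 3 classes
target = 0
probs = softmax(logits)

loss_from_logits = cross_entropy_from_logits(logits, target)
loss_from_probs = -math.log(probs[target])  # explicit softmax, then negative log-likelihood

# Softmax is monotonic, so the predicted class is the same either way.
pred_from_logits = max(range(len(logits)), key=lambda i: logits[i])
pred_from_probs = max(range(len(probs)), key=lambda i: probs[i])

print(abs(loss_from_logits - loss_from_probs) < 1e-12)
print(pred_from_logits == pred_from_probs)
```

Adding an explicit softmax layer before `nn.CrossEntropyLoss` would therefore apply softmax twice, which is why the model leaves it out.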


### Short answer

1\. How does the CNN compare in accuracy with yesterday's logistic regression and MLP models? How about training time?

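`Higher accuracy: the CNN reached about 99.1% test accuracy after 3 epochs. Training takes noticeably longer, since the convolutional layers require much more computation per batch than logistic regression or an MLP.`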

2\. How many trainable parameters are there in the CNN you built for this assignment?

*Note: The total of trainable parameters counts each element in a tensor. For example, a weight matrix that is 10x5 has 50 trainable parameters.*

`Summing up the parameters from each layer:`

`conv1_1: 320`

`conv1_2: 9,248`

`conv2_1: 18,496`

`conv2_2: 36,928`

`fc1: 803,072`

`fc2: 2,570`

`Total = 320 + 9248 + 18496 + 36928 + 803072 + 2570 = 870,634`

`The MNIST_CNN model built for this assignment has 870,634 trainable parameters.`
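The per-layer counts above can be reproduced with a short calculation. This sketch assumes every layer has a bias term (the PyTorch default for `nn.Conv2d` and `nn.Linear`); the helper names `conv2d_params` and `linear_params` are hypothetical.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights (in * out * k * k) plus one bias per output channel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


def linear_params(in_features, out_features):
    """Weights (in * out) plus one bias per output unit."""
    return in_features * out_features + out_features


layer_params = {
    "conv1_1": conv2d_params(1, 32, 3),
    "conv1_2": conv2d_params(32, 32, 3),
    "conv2_1": conv2d_params(32, 64, 3),
    "conv2_2": conv2d_params(64, 64, 3),
    "fc1": linear_params(64 * 7 * 7, 256),
    "fc2": linear_params(256, 10),
}

for name, count in layer_params.items():
    print(f"{name}: {count:,}")

total = sum(layer_params.values())
print(f"Total: {total:,}")  # Total: 870,634

# With the model instantiated, the same number should come from:
#   sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Note that `fc1` alone accounts for over 90% of the parameters, even though the four convolutional layers do most of the feature extraction.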

3\. When would you use a CNN versus a logistic regression model or an MLP?

`Use a convolutional neural network (CNN) primarily for data with a grid-like topology, such as images, video, or 1D time series. Its architecture (local connectivity, parameter sharing, and pooling) is designed to capture local patterns and spatial hierarchies, which gives robust feature extraction and approximate translation invariance.`

`In contrast, Logistic Regression is suitable for simple, linearly separable tabular data where interpretability is crucial and non-linear relationships are minimal. `

`A multilayer perceptron (MLP) is a more general-purpose neural network for complex, non-linear tabular data, but it typically struggles with raw image data: flattening discards the spatial structure, and fully connected layers need far more parameters than convolutional layers for the same input size.`
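The effect of parameter sharing on translations can be illustrated with a small 1D example. Because the same kernel slides over every position, shifting the input shifts the feature map by the same amount, and a max pool over the whole map then gives the same value. This is only a sketch: the signal and kernel values are made up, and `correlate1d` is a hypothetical helper, not a PyTorch function.

```python
def correlate1d(signal, kernel):
    """Valid (no padding) 1D cross-correlation, the operation a conv layer computes."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


signal = [0, 0, 1, 2, 1, 0, 0, 0]  # a small "bump" surrounded by zeros
shifted = [0] + signal[:-1]        # the same bump moved one step to the right
kernel = [1, 0, -1]                # a simple edge-detecting filter

out = correlate1d(signal, kernel)
out_shifted = correlate1d(shifted, kernel)

print(out)          # [-1, -2, 0, 2, 1, 0]
print(out_shifted)  # [0, -1, -2, 0, 2, 1]

# Global max pooling gives the same response regardless of the shift.
print(max(out) == max(out_shifted))  # True
```

A fully connected layer has a separate weight for every input position, so it would have to learn the shifted pattern separately.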