<h1><center> 
    Neural network design and application
</center></h1>

<h2><center>CPT_S 434/534, 2022 Spring</center></h2>

<h2><center>HW 2: NN basics -- Part 2 (68 pts)</center></h2>

### Name: *[INPUT YOUR NAME HERE]*

## This assignment includes:

## Coding in Python (pytorch): train softmax classifiers on MNIST (68 points)

Step 0: Install and configure: python ([Anaconda platform](https://docs.anaconda.com/anaconda/install/) recommended), [Jupyter Notebook](https://jupyter.org/install) and [pytorch](https://pytorch.org/get-started/) 

**Remark 1.** [Colab](https://colab.research.google.com) is a cloud platform that enables your Jupyter Notebooks (including this .ipynb assignment) to run with different runtime types (hardware acceleration is possible using GPU or TPU). You may also choose Colab to finish assignments (future assignments may require extensive computation that may be time-consuming on your laptop). 

**Remark 2.** If you use Colab, you must still convert your .ipynb to .html and submit **BOTH** files to Canvas. See [this page](https://stackoverflow.com/questions/53460051/convert-ipynb-notebook-to-html-in-google-colab) for how to convert to .html.

Step 1: Read the provided code (written in pytorch) to understand the logic of an MLP with one hidden layer, so that you know how to implement the next step and how to reuse the provided code.

Step 2: Complete the code for an MLP softmax classifier with two hidden layers on [MNIST](http://yann.lecun.com/exdb/mnist/) using different hyper-parameters.

Step 3: Record and plot results to show accuracy convergence (against #epoch)

## Submission:

* Convert the .ipynb file to .html file (**save the execution outputs** to show your progress: otherwise grading may be affected)
    
* Upload **both** your .ipynb and .html files to Canvas.

* Deadline: Feb 20, 11:59 PM, Pacific time.

* Plots should be clear and easy to read.

## 1. (Read and run) Train feedforward networks with one hidden layer (one activation layer)

In [None]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy

# Device configuration: check if there is a configured GPU available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyper-parameters 
input_size = 784        # 28 * 28
hidden_size = 500       # the output dimension of the linear model in each MLP hidden layer
num_classes = 10        # the number of classes
num_epochs = 10         # the number of epochs (each epoch: scanning the entire training set)
batch_size = 100        # how many samples are used in each iteration of SGD/Adam update
learning_rate = 0.001   # learning rate or step size used in gradient-based optimization algorithm

# MNIST dataset 
train_dataset = torchvision.datasets.MNIST(root='data', 
                                           train=True, 
                                           transform=transforms.ToTensor(),  
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='data', 
                                          train=False, 
                                          transform=transforms.ToTensor())

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)

# Fully connected neural network with one hidden layer
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size) 
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)  
    
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Define a model using class NeuralNet()
model = NeuralNet(input_size, hidden_size, num_classes).to(device)

# Define loss function and optimization algorithm (optimizer)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=0.00001)  

# Train the model, recording training/testing accuracy after every epoch
total_step = len(train_loader)
train_acc_list = []   # training accuracy (%) after each epoch
test_acc_list = []    # testing accuracy (%) after each epoch
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):  
        # Move tensors to the configured device
        images = images.reshape(-1, 28*28).to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if (i+1) % 100 == 0:
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}' 
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

    # Evaluate the model at the end of each epoch
    # In test phase, we don't need to compute gradients 
    with torch.no_grad():
        # Accuracy on the test set
        correct = 0
        total = 0
        for images, labels in test_loader:
            images = images.reshape(-1, 28*28).to(device)
            labels = labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        print('Accuracy of the network on the 10000 test images: {} %'.format(100 * correct / total))
        test_acc_list.append(100 * correct / total)

        # Accuracy on the training set
        correct = 0
        total = 0
        for images, labels in train_loader:
            images = images.reshape(-1, 28*28).to(device)
            labels = labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        print('Accuracy of the network on the training images: {} %'.format(100 * correct / total))
        train_acc_list.append(100 * correct / total)
            
plt.plot(train_acc_list, '-b', label='train acc')
plt.plot(test_acc_list, '-r', label='test acc')
plt.legend()
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.xticks(rotation=60)
plt.title('Accuracy ~ Epoch')
# plt.savefig('assets/accr_{}.png'.format(cfg_idx))
plt.show()
        
# Save the model checkpoint
torch.save(model.state_dict(), 'model.ckpt')

## 2. (To finish, 17 pts) Implement and train a feedforward network with two hidden layers (one hidden layer means one nonlinear function, so make sure you have two activation layers)

After training: plot training and testing accuracy (against #epochs)

Hint: modify class NeuralNet

In [None]:
# Your code goes here
# 
# Hint: there is no need to implement the entire training process from scratch as in HW1. 
# Simply modify the code provided above, particularly "class NeuralNet()".
# For example, copy the above "class NeuralNet()" into this cell,
# and modify the functions "__init__()" and "forward()" to re-define its structure.
# After modifying "class NeuralNet()", copy all the code needed to train your network,
# including: define a model from class NeuralNet, define the loss function and optimizer,
# the training for-loops, and the code that plots the figures
# 
# 


## 3. (To finish, 3 pts) Use SGD (instead of Adam) to train your two-hidden-layer network

Hint: read [this document](https://pytorch.org/docs/stable/optim.html) for torch.optim and take a look at its *example* to understand how to change the optimization algorithm. The optimization hyper-parameters can be the same as in the provided code.
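
As a minimal sketch of the pattern the hint refers to: switching the optimization algorithm only changes the line that constructs the optimizer, while the `zero_grad()` / `backward()` / `step()` sequence stays the same. The names `toy_model`, `inputs`, and `targets` below are hypothetical stand-ins with random data; they are not the assignment's network or MNIST.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model and random batch (not the assignment's network or MNIST)
toy_model = nn.Linear(4, 3)
inputs = torch.randn(8, 4)
targets = torch.randint(0, 3, (8,))
criterion = nn.CrossEntropyLoss()

# Only this line differs from an Adam setup
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.001, weight_decay=0.00001)

weights_before = toy_model.weight.detach().clone()

# One update step: the same three calls used in the provided training loop
loss = criterion(toy_model(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Check that the step actually modified the weights
weights_changed = not torch.equal(weights_before, toy_model.weight.detach())
print(type(optimizer).__name__, weights_changed)
```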

In [None]:
# Your code goes here


## 4. (To finish, 3 pts each setting, 15 pts in total) Use SGD to train your two-hidden-layer network with each learning rate in $\{ 0.0001, 0.001, 0.01, 0.1, 1 \}$, and show which learning rate achieves the best testing accuracy.
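
One way to organize such a comparison is sketched below, under stated assumptions: `train_and_evaluate` is a hypothetical helper, the model is a toy linear classifier, and the data is synthetic. Accuracy here is measured on the same synthetic samples used for training, purely to keep the sketch short; the assignment asks for testing accuracy on MNIST. The point of the sketch is that a fresh model and optimizer are created for every learning rate, so runs do not share weights.

```python
import torch
import torch.nn as nn

def train_and_evaluate(lr, num_steps=20):
    """Hypothetical helper: train a fresh toy model with SGD, return accuracy in %."""
    torch.manual_seed(0)                 # same initialization and data for every run
    x = torch.randn(64, 4)               # synthetic inputs (not MNIST)
    y = (x[:, 0] > 0).long()             # synthetic two-class labels
    model = nn.Linear(4, 2)              # re-created on every call
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(num_steps):
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        predicted = model(x).argmax(dim=1)
    return 100.0 * (predicted == y).float().mean().item()

learning_rates = [0.0001, 0.001, 0.01, 0.1, 1]
results = {lr: train_and_evaluate(lr) for lr in learning_rates}
best_lr = max(results, key=results.get)

for lr, acc in results.items():
    print('lr = {}: accuracy = {:.1f} %'.format(lr, acc))
print('best lr:', best_lr)
```

Storing the per-setting results in a dictionary keyed by learning rate makes it easy to report the best setting and to plot all curves on one figure afterwards.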

In [None]:
# Your code goes here



## 5. (To finish, 3 pts each setting, 15 pts in total) Use Adam to train your two-hidden-layer network with each learning rate in $\{ 0.0001, 0.001, 0.01, 0.1, 1 \}$, and show which learning rate achieves the best testing accuracy.

In [None]:
# Your code goes here


## 6. (To finish, 3 pts each setting, 9 pts in total) Change the dimension of the hidden variable (*hidden_size*) from $500$ to $100, 1000, 2000$, train the corresponding networks, and show how their testing accuracies differ.

Hint: you may use exactly the same setting as in section 1 above, e.g., still use Adam with the original setting for optimization
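
To get a feel for how much *hidden_size* changes the model, the sketch below counts trainable parameters for each value. It assumes the one-hidden-layer layer shapes from section 1 (784 inputs, 10 classes); `count_parameters` is a hypothetical helper, and the counts will differ for a two-hidden-layer network.

```python
import torch.nn as nn

def count_parameters(hidden_size, input_size=784, num_classes=10):
    """Hypothetical helper: count trainable parameters of a one-hidden-layer MLP."""
    layers = nn.Sequential(
        nn.Linear(input_size, hidden_size),   # weights: input_size * hidden_size, biases: hidden_size
        nn.ReLU(),                            # no parameters
        nn.Linear(hidden_size, num_classes),  # weights: hidden_size * num_classes, biases: num_classes
    )
    return sum(p.numel() for p in layers.parameters() if p.requires_grad)

param_counts = {h: count_parameters(h) for h in [100, 500, 1000, 2000]}
for h, n in param_counts.items():
    print('hidden_size = {}: {} parameters'.format(h, n))
```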

In [None]:
# Your code goes here


## 7. (To finish, 3 pts each question, 9 pts in total) Answer the following three questions

### Q1 (3 pts): Is the best learning rate for SGD the same as the best learning rate for Adam?

**Answer**:



### Q2 (3 pts): Read [this discussion](https://discuss.pytorch.org/t/how-does-sgd-weight-decay-work/33105/2) about the "weight decay" hyper-parameter of the optimizer and briefly describe how it works (hint: try to link it to something we have learned in class, such as the ML basics section)

**Answer**:



### Q3 (3 pts): In section 6 above, how does the dimension of the hidden variable impact the performance?

**Answer**:

