Here's a suggested structure for your exercise:

---

### **Exercise: Understanding Residual Connections**

#### **Objective**
- To demonstrate how residual connections can improve the training of very deep neural networks.

#### **Dataset**
- Use either MNIST or FashionMNIST.

#### **Steps**

1. **Baseline Model Without Residual Connections**
   - Create a fully connected network with 10+ layers (e.g., each layer has 256 neurons and ReLU activations).
   - Train the model for 10 epochs with a learning rate of 0.001 and either SGD or Adam as the optimizer.
   - Evaluate and visualize:
     - Training loss over epochs.
     - Validation accuracy over epochs.

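   **Expected Outcome:** The model may converge slowly or unstably because gradients can vanish or explode as depth grows. With only 10 layers the effect can be mild; adding more layers usually makes it easier to see.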

2. **Introduce Residual Connections**
   - Modify the network to add a residual (skip) connection around every group of two layers, where \(F\) denotes the two layers being skipped:
     \[
     x_{l+2} = F(x_l) + x_l
     \]
     - Ensure the dimensions of \(x_l\) match the output dimensions of \(F(x_l)\). Where they differ, apply a linear projection to \(x_l\) on the skip path.

3. **Train the Residual Network**
   - Use the same training setup (learning rate, optimizer, and number of epochs) as in Step 1.
   - Evaluate and visualize:
     - Training loss over epochs.
     - Validation accuracy over epochs.

   **Expected Outcome:** The residual network should converge faster and reach higher validation accuracy than the baseline from Step 1.
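
The two models in Steps 1 and 2 could be sketched in PyTorch roughly as follows. This is a minimal sketch, not a reference solution: the class names `PlainMLP`, `ResidualBlock`, and `ResidualMLP`, and the defaults (28x28 inputs, width 256, 11 hidden layers) are assumptions chosen so that both networks have the same number of parameters.

```python
import torch
from torch import nn


class PlainMLP(nn.Module):
    """Deep fully connected baseline with no skip connections (hypothetical name)."""

    def __init__(self, in_features=784, width=256, depth=11, num_classes=10):
        super().__init__()
        layers = [nn.Flatten(), nn.Linear(in_features, width), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(width, width), nn.ReLU()]
        self.body = nn.Sequential(*layers)
        self.head = nn.Linear(width, num_classes)

    def forward(self, x):
        return self.head(self.body(x))


class ResidualBlock(nn.Module):
    """Two linear layers F with an identity skip: out = F(x) + x."""

    def __init__(self, width):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )

    def forward(self, x):
        return self.f(x) + x


class ResidualMLP(nn.Module):
    """Same layer budget as PlainMLP, grouped into residual blocks (hypothetical name)."""

    def __init__(self, in_features=784, width=256, num_blocks=5, num_classes=10):
        super().__init__()
        # Linear projection so the input already has the width the skips expect
        self.stem = nn.Sequential(nn.Flatten(), nn.Linear(in_features, width), nn.ReLU())
        self.blocks = nn.Sequential(*[ResidualBlock(width) for _ in range(num_blocks)])
        self.head = nn.Linear(width, num_classes)

    def forward(self, x):
        return self.head(self.blocks(self.stem(x)))
```

Both models accept a batch shaped like MNIST or FashionMNIST images, for example `ResidualMLP()(torch.randn(4, 1, 28, 28))`. Keeping the parameter counts equal means any difference in the training curves comes from the skip connections rather than from model size.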

#### **Deliverables**
- Plots comparing training loss and validation accuracy for the non-residual and residual networks.
- A short explanation of why residual connections improve training.

#### **Key Discussion Points**
- What causes the deep network without residual connections to fail? (e.g., vanishing/exploding gradients, difficulty in learning identity mappings)
- How do residual connections address these issues?
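
One way to ground the discussion is to look at gradient magnitudes layer by layer after a single backward pass. The helper below is a sketch under assumptions: `layer_grad_norms` is a hypothetical name, and the tiny `toy` model and random data are stand-ins that only show how to call it.

```python
import torch
from torch import nn


def layer_grad_norms(model, inputs, targets):
    """Return {parameter name: gradient L2 norm} for each weight matrix after one backward pass."""
    model.zero_grad()
    loss = nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    return {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None and p.ndim > 1  # skip biases
    }


# Stand-in model and random data, only to demonstrate the call
torch.manual_seed(0)
toy = nn.Sequential(
    nn.Linear(20, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 10),
)
norms = layer_grad_norms(toy, torch.randn(8, 20), torch.randint(0, 10, (8,)))
for name, value in norms.items():
    print(f"{name}: {value:.4f}")
```

Running the same helper on the plain and residual networks with one real batch, and comparing the norms of the earliest layers, gives a concrete measurement to cite in the written explanation.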

---

This exercise provides a hands-on way for students to observe the practical benefits of residual connections, grounding the theory in a simple, relatable example. Would you like help with specific implementation code?

DeepCNNwBNRes: Double the number of channels in each of the convolutional blocks and adjust the fully connected layer appropriately. Train on CIFAR10. Plot metrics. Does it do better than the smaller version in the lesson? How much larger is the network in terms of parameters?
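
For the parameter comparison, a small counting helper is usually enough. The sketch below uses a hypothetical function name, `count_parameters`, and two stand-in convolution layers rather than the lesson's DeepCNNwBNRes, to show how doubling both the input and output channels affects the count.

```python
from torch import nn


def count_parameters(model):
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Stand-in layers, not the lesson's model: same 3x3 kernel, channels doubled
small = nn.Conv2d(16, 32, kernel_size=3, padding=1)
large = nn.Conv2d(32, 64, kernel_size=3, padding=1)

print(count_parameters(small))  # 16*32*9 weights + 32 biases = 4640
print(count_parameters(large))  # 32*64*9 weights + 64 biases = 18496
```

Calling `count_parameters` on both versions of the full network gives the ratio asked for above.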

Consider the model Resnet9.  
* Draw a picture showing how the first residual block is set up here. It should be similar to the pictures in the lesson, but can be hand drawn and included in your notebook. (I used drawio to make those pictures.)
* Use OneCycleLR with AdamW and train_network to train Resnet9 on CIFAR10. You should only need 10 epochs. You may have to experiment a little.

Resnet9 is in the file resnet9.py.  You can import the model and create an instance of it like this:

```python
from resnet9 import ResNet9
model = ResNet9(3,10) # input channels, num classes
```
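
The optimizer and scheduler setup might look like the sketch below. Everything here other than the PyTorch calls is an assumption: the linear stand-in model, the random batches, and the values for `lr`, `max_lr`, and `weight_decay` are placeholders, and the explicit loop only illustrates where the scheduler is stepped. How the scheduler is passed to `train_network` depends on that function's signature in the course materials.

```python
import torch
from torch import nn

# Stand-ins so the sketch runs on its own; swap in ResNet9(3, 10) and CIFAR10 loaders
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
batches = [(torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))) for _ in range(4)]

epochs = 2
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
# OneCycleLR needs the total number of steps, given here as epochs * steps_per_epoch
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, epochs=epochs, steps_per_epoch=len(batches)
)
loss_fn = nn.CrossEntropyLoss()

lrs = []  # learning rate used at each step, handy for plotting the schedule
for _ in range(epochs):
    for inputs, targets in batches:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        lrs.append(optimizer.param_groups[0]["lr"])
        optimizer.step()
        scheduler.step()  # stepped once per batch, not once per epoch
```

Plotting `lrs` shows the warm-up followed by the annealing phase, which can help when experimenting with `max_lr`.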