# **When Learning Rates Decay: What Really Happens Inside a CNN**
### **DS413 | Elective 4: Deep Learning**

A blog post by:

**Kein Jake Culanggo**

---
## **Introduction**

Training a neural network is fundamentally an optimization problem. The model must adjust its parameters to minimize a loss function, and the learning rate controls how large each update step is. Although simple on the surface, this single hyperparameter strongly affects convergence, stability, and generalization. *Goodfellow, Bengio, and Courville (2016)* describe the learning rate as perhaps the most important hyperparameter, the one to tune if there is time to tune only one.

**This blog explores how learning rate values shape the behavior of a convolutional neural network (CNN) trained on MNIST.**

By comparing several fixed learning rates, we show why learning rate scheduling is a useful tool for efficient training.

<a href="https://ibb.co/gb5hBxZ7"><img src="https://i.ibb.co/XfhMvqx4/image.png" alt="image" border="0"></a>

---
## **What is Learning Rate Scheduling?**

#### • **Technical Explanation**

In gradient-based optimization, parameters $\theta$ are updated according to:

<br>

$$
\theta_{t+1} = \theta_t - \eta_t \nabla_\theta L(\theta_t)
$$

where $\eta_t$ is the learning rate at step $t$.
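
As a minimal sketch of this update rule, the snippet below applies it to a toy one-parameter loss $L(\theta) = \theta^2$, whose gradient is $2\theta$. The function names `sgd_step` and `grad_toy_loss`, and the toy loss itself, are illustrative assumptions and not part of the experiment code.

```python
def sgd_step(theta, grad, lr):
    """Apply one update: theta_next = theta - lr * grad."""
    return theta - lr * grad


def grad_toy_loss(theta):
    """Gradient of the toy loss L(theta) = theta**2 (an assumed example)."""
    return 2.0 * theta


theta = 1.0
for step in range(3):
    theta = sgd_step(theta, grad_toy_loss(theta), lr=0.1)
    print(f"step {step + 1}: theta = {theta:.3f}")
```

With a learning rate of 0.1, each step multiplies $\theta$ by 0.8, so the printed values are 0.800, 0.640, and 0.512.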

<br>


A learning rate schedule modifies $\eta_t$ across training epochs. The intuition is simple:
- Early training benefits from larger steps for faster exploration.
- Later training requires smaller steps for stable convergence.

This behavior has been supported by studies such as Smith (2017) and Loshchilov & Hutter (2019), which show improved generalization and faster convergence when learning rates are varied dynamically.
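
To make the idea of a schedule concrete, here is a small sketch of two common decay rules written as plain Python functions. The function names and the default values (`drop`, `every`, `rate`) are assumptions chosen for illustration; they are not the settings of any experiment in this post.

```python
import math


def step_decay(base_lr, epoch, drop=0.5, every=5):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return base_lr * (drop ** (epoch // every))


def exponential_decay(base_lr, epoch, rate=0.1):
    """Shrink the learning rate smoothly: base_lr * exp(-rate * epoch)."""
    return base_lr * math.exp(-rate * epoch)


for epoch in (0, 5, 10, 19):
    print(epoch, step_decay(0.1, epoch), round(exponential_decay(0.1, epoch), 4))
```

Both rules start at the same value and shrink over time; step decay does so in jumps, exponential decay continuously. PyTorch offers ready-made counterparts such as `torch.optim.lr_scheduler.StepLR` and `torch.optim.lr_scheduler.ExponentialLR`.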

<br>

#### • **Intuitive Analogy**

Imagine learning how to sketch. Early in the drawing, broad strokes help form the structure quickly. As details emerge, smaller and more controlled strokes refine the image. If the strokes are always too big, the drawing becomes messy; if always too small, progress is slow.

Learning rate scheduling follows the same logic.


---
## **Experiment Setup**

This experiment tests four fixed learning rates before any scheduling is applied. This baseline helps us understand the behavior that schedulers aim to improve.

#### • **Model: Small CNN**

We use a lightweight convolutional neural network appropriate for MNIST classification. Its simplicity keeps training fast while leaving it sensitive to the choice of learning rate.

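    import torch.nn as nn


    class SimpleCNN(nn.Module):
        """Two convolutional blocks followed by a two-layer classifier for MNIST."""

        def __init__(self):
            super().__init__()
            # Each block: 3x3 convolution (size preserved by padding=1), ReLU, 2x2 max pool.
            self.conv_layer = nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(8, 16, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2)
            )
            # Classifier: flatten the 16 feature maps of size 7x7, then map to 10 digit classes.
            self.fc_layer = nn.Sequential(
                nn.Flatten(),
                nn.Linear(16 * 7 * 7, 128),
                nn.ReLU(),
                nn.Linear(128, 10)
            )

        def forward(self, x):
            x = self.conv_layer(x)
            return self.fc_layer(x)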
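
The `16 * 7 * 7` input size of the first linear layer follows from how each layer changes the image size. The sketch below traces that arithmetic with the standard output-size formula for convolution and pooling layers; the helper name `conv_out` is an assumption introduced here for illustration.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Output width/height of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28                                   # MNIST images are 28x28
size = conv_out(size, kernel=3, padding=1)  # Conv2d(1, 8, 3, padding=1)  -> 28
size = conv_out(size, kernel=2, stride=2)   # MaxPool2d(2)                -> 14
size = conv_out(size, kernel=3, padding=1)  # Conv2d(8, 16, 3, padding=1) -> 14
size = conv_out(size, kernel=2, stride=2)   # MaxPool2d(2)                -> 7

flattened = 16 * size * size                # 16 channels of 7x7 feature maps
print(size, flattened)                      # 7 784
```

If the input resolution or the number of pooling layers changed, this is the number that would need to be updated in `nn.Linear`.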


---
## **Training Pipeline**

To isolate the effect of learning rates, each training run uses:

- **Dataset:** MNIST
- **Optimizer:** SGD (plain, without momentum)
- **Loss:** Cross-entropy
- **Batch size:** 64
- **Epochs:** 20
- **Learning rates tested:**
  - 0.1
  - 0.01
  - 0.001
  - 0.0001

The structure remains constant across experiments, allowing learning rate differences to emerge clearly.

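    import torch.optim as optim


    def train_model(model, train_loader, criterion, lr, epochs=20):
        """Train with plain SGD at a fixed learning rate; return per-epoch loss and accuracy."""
        optimizer = optim.SGD(model.parameters(), lr=lr)
        losses = []
        accuracies = []

        for epoch in range(epochs):
            model.train()
            running_loss = 0.0

            for images, labels in train_loader:
                optimizer.zero_grad()              # clear gradients from the previous batch
                outputs = model(images)
                loss = criterion(outputs, labels)
                loss.backward()                    # backpropagate
                optimizer.step()                   # theta <- theta - lr * grad
                running_loss += loss.item()

            avg_loss = running_loss / len(train_loader)   # mean training loss over batches
            # evaluate_model is a helper that is not shown in this post;
            # it is expected to return the model's accuracy.
            acc = evaluate_model(model)

            losses.append(avg_loss)
            accuracies.append(acc)

        return losses, accuracies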
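
One way to organize the comparison is a small sweep loop that builds a fresh model for every learning rate, so that no run inherits weights from another. This is a sketch under that assumption; `run_sweep`, `fake_model`, and `fake_train` are hypothetical names, and the two stand-ins exist only so the snippet runs on its own.

```python
def run_sweep(learning_rates, make_model, train_fn):
    """Train one freshly built model per learning rate and collect the histories."""
    results = {}
    for lr in learning_rates:
        model = make_model()              # new weights for every run
        losses, accuracies = train_fn(model, lr)
        results[lr] = {"losses": losses, "accuracies": accuracies}
    return results


# Stand-ins so the sketch is self-contained; they do no real training.
def fake_model():
    return object()


def fake_train(model, lr):
    return [1.0 - lr, 0.5 - lr], [50.0, 60.0]


results = run_sweep([0.1, 0.01, 0.001, 0.0001], fake_model, fake_train)
print(sorted(results))
```

With the code from this post, `make_model` could be `SimpleCNN` and `train_fn` could be `lambda m, lr: train_model(m, train_loader, criterion, lr)`.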


---
## **Results and Analysis**

Each discussion below covers one of the four learning rates, with the corresponding plot shown beneath it.


### **Plot 1. Learning Rate = 0.1**

#### • **Technical Interpretation**
The loss curve fluctuates heavily, indicating instability. The learning rate is too large, causing the optimizer to overshoot minima repeatedly. Accuracy rises early but fails to stabilize, revealing convergence issues.

This aligns with Goodfellow et al. (2016), who highlight how excessively large learning rates disrupt optimization.

#### • **Intuitive Explanation**
The model is “running too fast.”  
It keeps overshooting and stumbling, making it impossible to settle on a good solution.
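
Overshooting is easy to reproduce on a toy problem. The sketch below runs gradient descent on the one-dimensional loss $L(x) = \frac{1}{2} c x^2$, where each step multiplies $x$ by $(1 - \eta c)$, so the iterates shrink only when $\eta < 2/c$. The curvature $c = 10$ and the three learning rates are assumptions picked to sit on either side of that threshold; they say nothing about the CNN's actual loss surface.

```python
def run_gd(lr, curvature=10.0, x0=1.0, steps=5):
    """Gradient descent on L(x) = 0.5 * curvature * x**2; returns visited points."""
    xs = [x0]
    for _ in range(steps):
        grad = curvature * xs[-1]
        xs.append(xs[-1] - lr * grad)
    return xs


for lr in (0.05, 0.19, 0.21):
    path = ", ".join(f"{x:+.3f}" for x in run_gd(lr))
    print(f"lr={lr}: {path}")
```

At 0.05 the iterates approach zero smoothly, at 0.19 they flip sign on every step while shrinking slowly, and at 0.21 they flip sign and grow.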


<a href="https://ibb.co/HfjwgXs7"><img src="https://i.ibb.co/G4qYPFL7/plot1-training-loss.png" alt="plot1-training-loss" border="0"></a>

---
### **Plot 2. Learning Rate = 0.01**

#### • **Technical Interpretation**
This learning rate produces stable, consistent training. Loss decreases smoothly, and accuracy improves rapidly. This setting achieves the best balance between speed and stability.

#### • **Intuitive Explanation**
This is the “Goldilocks zone.”  
Not too fast, not too slow. Just right for learning efficiently.


<a href="https://ibb.co/zHjqMmh4"><img src="https://i.ibb.co/k6kN7GgS/plot2-validation-loss.png" alt="plot2-validation-loss" border="0"></a>

---
### **Plot 3. Learning Rate = 0.001**

#### • **Technical Interpretation**
Training is stable but slow. Loss decreases gradually, and accuracy improves, but not as quickly as with 0.01. The model is still underfit after 20 epochs.

Bengio (2012) notes that overly small learning rates underutilize gradient information, slowing convergence.

#### • **Intuitive Explanation**
The model is “walking carefully.”  
It avoids mistakes but moves too slowly to reach strong performance in time.


<a href="https://ibb.co/35vqTbbZ"><img src="https://i.ibb.co/4w2GFLLr/plot3-training-accuracy.png" alt="plot3-training-accuracy" border="0"></a>

---
### **Plot 4. Learning Rate = 0.0001**

#### • **Technical Interpretation**
The model barely learns. Loss decreases minimally, and accuracy plateaus almost immediately. The learning rate is too small for meaningful progress.

#### • **Intuitive Explanation**
The model is “moving in centimeters on a journey that needs meters.”  
It technically improves, but too slowly to matter within the training window.
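
The cost of a tiny step size can be estimated on a toy quadratic loss $L(x) = \frac{1}{2} c x^2$, where every gradient step multiplies the distance to the minimum by $|1 - \eta c|$. The sketch below counts how many steps are needed to shrink that distance to 1% of its starting value. The function name, the curvature $c = 1$, and the 1% target are illustrative assumptions, not measurements from the CNN.

```python
import math


def steps_to_shrink(lr, curvature=1.0, factor=0.01):
    """Steps for gradient descent on 0.5 * curvature * x**2 to shrink |x| to `factor` of its start."""
    contraction = abs(1.0 - lr * curvature)   # |x| is multiplied by this each step
    if contraction == 0.0:
        return 1  # the minimum is reached in a single step
    if contraction >= 1.0:
        raise ValueError("learning rate too large: the iterates do not shrink")
    return math.ceil(math.log(factor) / math.log(contraction))


for lr in (0.1, 0.01, 0.001, 0.0001):
    print(lr, steps_to_shrink(lr))
```

On this toy problem, each tenfold reduction in the learning rate raises the step count roughly tenfold, which is the "centimeters versus meters" effect in numbers.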


<a href="https://ibb.co/cSLv2nMQ"><img src="https://i.ibb.co/KpwGVvRW/plot4-validation-accuracy.png" alt="plot4-validation-accuracy" border="0"></a>

---
## **What These Results Tell Us**

Across the four runs:

1. Large learning rates destabilize training.
2. Extremely small learning rates stall learning.
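3. Middle-range values (0.01 and 0.001) train stably, with 0.01 converging faster within 20 epochs.
4. No single fixed learning rate suits the entire training process: the step size that makes fast early progress is larger than the one that allows careful final refinement.

This trade-off is why learning rate scheduling is widely adopted.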

---
## **Why Learning Rate Scheduling Helps**

#### • **Technical View**

- Early training benefits from larger updates to explore the loss landscape.
- Later training requires smaller steps for fine-grained adjustments.
- Schedulers automate this shift, improving convergence and generalization.
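
The points above can be illustrated with a toy simulation: stochastic gradient descent on $L(x) = \frac{1}{2} x^2$ where every gradient carries random noise, much as mini-batch gradients do. The setup (noise level, starting point, the decay rule $0.5 / (1 + 0.05t)$, and the name `noisy_sgd`) is an assumption made for this sketch and is not taken from the MNIST experiment.

```python
import random


def noisy_sgd(lr_at, steps=2000, seed=0):
    """SGD on L(x) = 0.5 * x**2 with noisy gradients; returns mean x**2 over the last 200 steps."""
    rng = random.Random(seed)
    x = 5.0
    tail = []
    for t in range(steps):
        grad = x + rng.gauss(0.0, 1.0)   # true gradient plus unit-variance noise
        x -= lr_at(t) * grad
        if t >= steps - 200:
            tail.append(x * x)
    return sum(tail) / len(tail)


fixed = noisy_sgd(lambda t: 0.5)                       # constant learning rate
decayed = noisy_sgd(lambda t: 0.5 / (1.0 + 0.05 * t))  # same start, shrinking steps
print(f"fixed: {fixed:.4f}  decayed: {decayed:.4f}")
```

Both runs begin with the same large step, but the fixed rate keeps bouncing around the minimum because each noisy gradient is applied at full strength, while the decayed rate settles much closer to it.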

#### • **Analogy**

Learning is like hiking:
- Large steps help at the beginning.
- Smaller, careful steps matter near a cliff edge.

Schedulers manage this transition automatically.

<a href="https://imgbb.com/"><img src="https://i.ibb.co/YBrXBS71/image.png" alt="image" border="0"></a>


---
## **Conclusion**

This experiment shows how dramatically learning rates influence CNN training dynamics. High rates destabilize optimization, extremely low rates stall it, and optimal rates produce smooth and efficient learning.

Learning rate scheduling addresses this imbalance by adjusting the learning rate throughout training, ensuring fast exploration early and precise refinement later.

This foundational experiment prepares the ground for more advanced methods such as cosine annealing, warm-up scheduling, and cyclical learning rates.
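
As a preview of two of those methods, the sketch below combines a linear warm-up with cosine annealing, using the standard cosine formula $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\pi t / T))$. The function name `warmup_cosine` and the parameter values in the example are assumptions for illustration only.

```python
import math


def warmup_cosine(step, total_steps, base_lr, warmup_steps=0, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine annealing down toward min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [warmup_cosine(s, total_steps=20, base_lr=0.1, warmup_steps=5) for s in range(20)]
print([round(lr, 4) for lr in schedule])
```

The rate climbs from 0.02 to 0.1 over the first five steps and then follows a cosine curve downward. In PyTorch, the cosine part is available as `torch.optim.lr_scheduler.CosineAnnealingLR`, and cyclical schedules as `torch.optim.lr_scheduler.CyclicLR`.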


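The table below lists the final metrics from a sweep over five fixed learning rates, which adds 0.00001 to the four values discussed above. In this sweep the ranking differs from the plot-by-plot reading: 0.0001 gives the lowest training and validation losses.

| Learning Rate Category | Learning Rate | Final Training Loss | Final Validation Loss | Final Validation Accuracy |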
| ---------------------- | ------------- | ------------------- | --------------------- | ------------------------- |
| Too High (Exploding)   | 0.1           | 1.8393              | 0.7631                | 80.87%                    |
| High                   | 0.01          | 3.2777              | 2.6486                | 23.04%                    |
| Just Right             | 0.001         | 0.0300              | 0.0052                | 99.78%                    |
| Low                    | 0.0001        | 0.0083              | 0.0007                | 100.00%                   |
| Too Low (Crawling)     | 0.00001       | 0.1563              | 0.0617                | 100.00%                   |


### **Why a Low Learning Rate (0.0001) Produced the Best Model**

The learning rate controls how much the model updates its weights at each optimization step. If the learning rate is too high (e.g., 0.1 or 0.01), the model can overshoot minima in the loss landscape, leading to unstable training and poor validation accuracy. Conversely, if the learning rate is too low (e.g., 1e-5), training progresses very slowly, so the model may still be short of its best loss when the epoch budget runs out, and computation is wasted.

In our experiments, the low learning rate of 0.0001 struck the best balance. It allowed the model to converge smoothly without overshooting, reaching 100% validation accuracy with the lowest training and validation loss of the five runs. This indicates that the model fit the training data well and also performed well on the validation set, avoiding both the instability observed at higher learning rates and the slow convergence seen at the very lowest rate.

Think of training a neural network as trying to find the bottom of a valley while walking blindfolded. A high learning rate is like taking giant leaps: you might overshoot the bottom and even bounce back up the hill. A very low learning rate is like taking tiny baby steps: you will eventually reach the bottom, but it takes a very long time. The learning rate of 0.0001 hit the sweet spot: small enough to step steadily toward the valley floor without overshooting, but large enough to get there efficiently. This is why it produced the most accurate and stable model.

## **References**

- Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press.
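- Smith, L. N. (2017). Cyclical Learning Rates for Training Neural Networks. *IEEE Winter Conference on Applications of Computer Vision (WACV)*.
- Loshchilov, I., & Hutter, F. (2019). Decoupled Weight Decay Regularization. *International Conference on Learning Representations (ICLR)*.
- Bengio, Y. (2012). Practical Recommendations for Gradient-Based Training of Deep Architectures. In *Neural Networks: Tricks of the Trade* (2nd ed.). Springer.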
