# Riding the Learning Curve: How a Single Number Decides Whether Your Neural Network Succeeds or Crashes

Date: November 19, 2025         

Training a deep learning model is a lot like teaching a student.
If you teach too slowly, the student never learns enough.
If you teach too quickly, the student gets confused and makes wild mistakes.

In neural networks, this teaching speed is controlled by the learning rate — one simple number that can decide whether your model becomes smart… or completely fails.

This blog explores what a learning rate is, why it is so powerful, and how changing it affects training in real-life experiments. Using a simple CNN on the MNIST dataset, we’ll compare learning rates and visualize how training behaves at different speeds.

## What Is Learning Rate?

The learning rate is a key hyperparameter that controls how quickly a neural network learns during training. Each time the weights are updated, it scales how far they move in response to the error. In other words, it sets the size of each step taken toward a minimum of the loss function during optimization.

**_In short, it's a hyperparameter that controls how much the model updates its weights in response to the error it makes._**

Formally, it appears in the gradient descent update rule:

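$$\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta J(\theta_t)$$

where:
- $\theta_t$ = current parameters
- $\theta_{t+1}$ = updated parameters
- $\nabla_\theta J(\theta_t)$ = gradient of the loss function with respect to the parameters
- $\eta$ (eta) = learning rate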
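
As a minimal sketch of the update rule, the snippet below applies it to a toy loss $J(\theta) = \theta^2$, whose gradient is $2\theta$. The toy loss, the helper name `gradient_descent`, and the starting point are assumptions chosen for illustration, not part of the MNIST experiment later in this post.

```python
def gradient_descent(grad_fn, theta0, lr, steps):
    """Apply theta <- theta - lr * grad_fn(theta) repeatedly and return every iterate."""
    thetas = [theta0]
    for _ in range(steps):
        theta = thetas[-1]
        thetas.append(theta - lr * grad_fn(theta))
    return thetas


# Toy loss J(theta) = theta**2 (an assumption for illustration); its gradient is 2 * theta.
def grad_J(theta):
    return 2.0 * theta


path = gradient_descent(grad_J, theta0=1.0, lr=0.1, steps=5)
for t, theta in enumerate(path):
    print(f"step {t}: theta = {theta:.4f}")
```

With `lr=0.1`, every step multiplies the parameter by $1 - 2 \cdot 0.1 = 0.8$, so it shrinks steadily toward the minimum at zero. Changing `lr` changes that factor, which is the whole story of the sections that follow.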

## Why is this important?

_Because the learning rate determines how big each update step is._

| Learning Rate | What Happens                                    |
| ------------- | ----------------------------------------------- |
| **Too Low**   | Model learns extremely slowly, may get stuck    |
| **Ideal**     | Smooth, stable learning and fast convergence    |
| **Too High**  | Model becomes unstable → oscillates or diverges |

**Visual analogy:** <br>
Small LR → baby steps <br>
Medium LR → walking normally<br>
Large LR → running downhill<br>
Very large LR → falling off a cliff<br>
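
To see where these regimes come from, consider a toy quadratic loss. This is an assumption made only because it can be solved by hand; real loss surfaces are far less regular.

$$J(\theta) = \tfrac{1}{2} a \theta^2, \quad a > 0 \quad\Rightarrow\quad \nabla_\theta J(\theta) = a\theta$$

Substituting into the gradient descent update rule gives

$$\theta_{t+1} = \theta_t - \eta a \theta_t = (1 - \eta a)\,\theta_t \quad\Rightarrow\quad \theta_t = (1 - \eta a)^t\,\theta_0$$

so everything depends on the factor $1 - \eta a$:

- $0 < \eta < 1/a$: the factor lies between 0 and 1, so $\theta_t$ shrinks steadily (very slowly when $\eta$ is tiny).
- $\eta = 1/a$: the factor is 0, and the minimum is reached in a single step.
- $1/a < \eta < 2/a$: the factor is negative but smaller than 1 in magnitude, so $\theta_t$ overshoots and oscillates while still converging.
- $\eta > 2/a$: the factor exceeds 1 in magnitude, so every step lands farther from the minimum and training diverges.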

## Why Learning Rate Matters

1. Controls how fast the model learns <br>
   A higher learning rate speeds up learning at the cost of stability.
2. Affects the quality of the final solution <br>
   A poorly chosen learning rate can leave the model stuck in poor local minima or near saddle points.
3. Determines training stability <br>
   Too high → training “explodes”. <br>
   Too low → training drags on for hours.
4. Interacts with optimizers <br>
   Optimizers such as Adam, RMSProp, and SGD are usually run with default learning rates chosen to balance speed against reliability, but no single LR is perfect for every model.
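
As a small sketch of how this looks in PyTorch, the snippet below builds three optimizers and prints the learning rate each one is using. The tiny `nn.Linear` model is a hypothetical stand-in, and the defaults printed for Adam and RMSprop are those of the PyTorch releases we are aware of, so it is worth checking them against your installed version.

```python
import torch.nn as nn
import torch.optim as optim

# A tiny stand-in model (hypothetical) so the optimizers have parameters to manage.
model = nn.Linear(4, 2)

# Omitting lr falls back to the optimizer's built-in default.
adam = optim.Adam(model.parameters())
rmsprop = optim.RMSprop(model.parameters())
# SGD is given an explicit lr here because older PyTorch releases require one.
sgd = optim.SGD(model.parameters(), lr=0.01)

for name, opt in [("Adam", adam), ("RMSprop", rmsprop), ("SGD", sgd)]:
    # The rate currently in use is stored in each parameter group.
    print(name, opt.param_groups[0]["lr"])
```

Reading `param_groups[0]["lr"]` is also a convenient way to confirm which rate a training run is actually using.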

## Learning Rates in Action

To see learning rates in action, we trained a simple Convolutional Neural Network (CNN) on the MNIST handwritten digit dataset.

### Experiment Setup

- Dataset: MNIST
- Model: Simple CNN
- Epochs: 5
- Optimizer: Adam
- Learning Rates Tested:
    - **0.0001 (very low)**
    - **0.001 (recommended for Adam)**
    - **0.01 (high)**


### Goal

Show how different learning rates affect:
- Training loss
- Training speed
- Final test accuracy
- Training stability

1. Import Dependencies

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
```

2. Load Dataset

```python
# Convert each 28x28 image to a float tensor with values in [0, 1]
transform = transforms.ToTensor()

train_data = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_data = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# Shuffle only the training set; evaluation order does not matter
train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=32, shuffle=False)
```


3. Build CNN

```python
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Two conv blocks; each 2x2 max-pool halves the spatial size: 28 -> 14 -> 7
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)
        )
        # Classifier head: 32 channels * 7 * 7 features -> 128 -> 10 digit classes
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32*7*7, 128), nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.fc(self.conv(x))
```


4. Training Function

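```python
def train_model(lr, epochs=5):
    """Train a fresh CNN with Adam at the given learning rate; return it with per-epoch losses."""
    model = CNN()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    loss_list = []

    for epoch in range(epochs):
        running_loss = 0.0
        for x, y in train_loader:
            optimizer.zero_grad()          # clear gradients from the previous batch
            pred = model(x)
            loss = criterion(pred, y)
            loss.backward()                # backpropagate
            optimizer.step()               # update weights, scaled by lr
            running_loss += loss.item()
        # Record the mean loss over the epoch; the last batch's loss alone
        # is a noisy single-batch estimate.
        epoch_loss = running_loss / len(train_loader)
        loss_list.append(epoch_loss)
        print(f"LR={lr}, Epoch={epoch+1}, Loss={epoch_loss:.4f}")
    return model, loss_list
```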
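
The goal list above includes final test accuracy, but the training function only reports loss. One way to measure accuracy is sketched below. The helper name `evaluate` is an assumption rather than part of the original notebook, and the two-class demo uses hand-built tensors instead of MNIST so that it runs on its own.

```python
import torch
import torch.nn as nn


def evaluate(model, loader):
    """Return the fraction of examples in `loader` that `model` classifies correctly."""
    model.eval()                          # switch off training-only behaviour
    correct, total = 0, 0
    with torch.no_grad():                 # gradients are not needed for evaluation
        for x, y in loader:
            pred = model(x).argmax(dim=1) # class with the highest logit
            correct += (pred == y).sum().item()
            total += y.size(0)
    return correct / total


# Sanity check on synthetic data: an identity "classifier" over two classes.
demo_model = nn.Linear(2, 2)
with torch.no_grad():
    demo_model.weight.copy_(torch.eye(2))
    demo_model.bias.zero_()

demo_x = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
demo_y = torch.tensor([0, 1, 1, 1])       # the third label is deliberately "wrong"
demo_loader = [(demo_x[:2], demo_y[:2]), (demo_x[2:], demo_y[2:])]
print(f"accuracy = {evaluate(demo_model, demo_loader):.2f}")
```

In the experiment itself, this would be called as `evaluate(model, test_loader)` on each trained model.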


5. Run Experiments

```python
learning_rates = [0.0001, 0.001, 0.01]
results = {}

for lr in learning_rates:
    model, losses = train_model(lr)
    results[lr] = losses   # per-epoch losses keyed by learning rate
```


6. Plot Loss Curves

```python
for lr, loss in results.items():
    epochs = range(1, len(loss) + 1)   # number epochs from 1 rather than 0
    plt.plot(epochs, loss, label=f"LR={lr}")
plt.legend()
plt.title("Loss Comparison Across Learning Rates")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()
```

## Key Takeaways

Hyperparameter Sensitivity: The learning rate is a decisive factor in model training. Even with identical architectures, variations in LR can determine the difference between rapid convergence and model divergence.

The "Goldilocks" Principle:
- Too Low: Can lead to slow training or, as seen here, getting stuck in suboptimal states.
- Too High: Causes oscillation around a minimum, or outright divergence.
- Optimal: Facilitates smooth, efficient descent (e.g., $\eta = 0.001$ in this experiment).

Visual Diagnostics: Plotting loss curves is essential. Numerical accuracy metrics alone often hide how the model is learning (or failing to learn).
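
A plot can be paired with a quick numeric summary of the same curve. The helper below is a hypothetical sketch, not part of the original experiment: it reports the first, last, and best loss and counts how many epochs ended with a higher loss than the one before. The two example curves are made-up numbers for illustration, not results from the MNIST runs.

```python
def summarize_loss_curve(losses):
    """Summarize a per-epoch loss list (hypothetical helper for illustration)."""
    # Count epochs where the loss went up relative to the previous epoch.
    increases = sum(1 for prev, cur in zip(losses, losses[1:]) if cur > prev)
    return {
        "first": losses[0],
        "last": losses[-1],
        "best": min(losses),
        "increases": increases,
    }


# Made-up curves purely for illustration; these are not measured results.
smooth = [0.90, 0.50, 0.30, 0.20, 0.15]
bumpy = [0.90, 0.40, 0.70, 0.35, 0.60]

print(summarize_loss_curve(smooth))
print(summarize_loss_curve(bumpy))
```

A curve with several increases, or a final loss well above its best, is a hint that the learning rate may be too high, though a look at the plot is still the better judge.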

----------

## References

### Academic Sources

- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org
- Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1412.6980

### Technical Documentation & Tutorials

- PyTorch. (2024). Optimizers - PyTorch Documentation. Retrieved from https://pytorch.org/docs/stable/optim.html
- Li, F., Karpathy, A., & Johnson, J. (n.d.). CS231n: Convolutional Neural Networks for Visual Recognition (Optimization). Stanford University. https://cs231n.github.io/optimization-1/
- Brownlee, J. (2022). A Gentle Introduction to Learning Rate in Deep Learning. Machine Learning Mastery.

------

The learning rate is often the single most significant hyperparameter to tune in deep learning. As shown in the visual analysis, an order-of-magnitude change in learning rate (e.g., from $10^{-4}$ to $10^{-3}$) dramatically alters the training trajectory. For future iterations, learning rate schedulers, which decay the rate over time, are recommended: they combine the speed of high initial rates with the precision of low final rates.
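
As a rough sketch of what a scheduler does, the function below implements step decay: the rate is multiplied by `gamma` every `step_size` epochs. The function name and the settings are illustrative assumptions, not tuned values from the experiment. In PyTorch, `torch.optim.lr_scheduler.StepLR` applies the same rule to an optimizer.

```python
def step_decay(initial_lr, epoch, step_size, gamma):
    """Learning rate at `epoch`: multiply by `gamma` once every `step_size` epochs."""
    return initial_lr * gamma ** (epoch // step_size)


# Illustrative settings (assumptions): start high, cut the rate 10x every 2 epochs.
for epoch in range(6):
    lr = step_decay(initial_lr=0.01, epoch=epoch, step_size=2, gamma=0.1)
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

The early epochs run at the high initial rate for fast progress, and later epochs take progressively smaller steps to settle near a minimum.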