Here are the key topic sentences extracted from the Microsoft Learn page on Convolutional Neural Networks (CNNs):

* Deep learning models are particularly useful for data consisting of large arrays of numeric values, such as images.
* At the heart of deep learning's success in computer vision is the convolutional neural network (CNN).
* CNNs consist of multiple layers, each performing a specific task in extracting features or predicting labels.
* The layer types break down as follows: convolution, pooling, dropping, flattening, and fully connected.
    * A convolutional layer extracts important features in images by applying a filter defined by a kernel.
    * Pooling layers reduce the number of feature values while retaining key differentiating features.
    * Dropping layers help mitigate overfitting by randomly eliminating feature maps during training.
    * Flattening layers convert multidimensional feature maps into a vector for input to a fully connected layer.
    * Fully connected layers generate predictions by passing feature values through hidden layers to an output layer.
* A CNN is trained by passing batches of data through multiple epochs, adjusting weights via backpropagation.
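
The layer descriptions above can be made concrete with a small NumPy sketch. It is only an illustration, not how PyTorch implements these layers: the random `image`, the edge-detecting `kernel`, and the helper names `conv2d_valid` and `max_pool2d` are all assumptions introduced here. Like most deep learning libraries, the "convolution" slides the kernel without flipping it.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size          # drop rows/cols that do not fill a window
    windows = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return windows.max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((28, 28))                   # stand-in for one grayscale 28x28 digit
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])          # hypothetical edge-detecting kernel

features = np.maximum(conv2d_valid(image, kernel), 0)  # convolution + ReLU
pooled = max_pool2d(features)                          # 26x26 -> 13x13
flat = pooled.reshape(-1)                              # vector for a fully connected layer

print(features.shape, pooled.shape, flat.shape)        # (26, 26) (13, 13) (169,)
```

The shapes show why the fully connected layer in the example below expects `26*26` values per filter: a 3x3 kernel without padding trims one pixel from each border of a 28x28 image.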

In [None]:
import os
torch_data_dir = '../../generated/data/torch'
os.makedirs(torch_data_dir, exist_ok=True)

#### Example: CNN Training with Weight & Bias Updates

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

# 1. Define a simple CNN
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        # Convolutional layer: input channels=1 (grayscale), output channels=10, kernel=3x3
        self.conv1 = nn.Conv2d(1, 10, kernel_size=3)
        # Fully connected layer
        self.fc1 = nn.Linear(10*26*26, 10)  # 28x28 image -> 26x26 after 3x3 conv

    def forward(self, x):
        x = F.relu(self.conv1(x))   # Apply convolution + ReLU
        x = x.view(-1, 10*26*26)    # Flatten
        x = self.fc1(x)             # Fully connected
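        # Note: nn.CrossEntropyLoss (used below) applies log_softmax itself; applying it
        # here as well is redundant but harmless, because log_softmax is idempotent.
        return F.log_softmax(x, dim=1)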

# 2. Load dataset (MNIST digits)
transform = transforms.Compose([transforms.ToTensor()])
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST(torch_data_dir, train=True, download=True, transform=transform),
    batch_size=64, shuffle=True
)

# 3. Initialize model, optimizer, and loss function
model = SimpleCNN()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# 4. Training loop
for epoch in range(3):  # run for 3 epochs
    total_loss = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()              # Reset gradients
        output = model(data)               # Forward pass
        loss = loss_fn(output, target)     # Compute loss
        loss.backward()                    # Backpropagation (compute gradients)
        optimizer.step()                   # Update weights & biases

        total_loss += loss.item()

    print(f"Epoch {epoch+1}, Average Loss: {total_loss/len(train_loader):.4f}")

    # Inspect weight & bias updates (example: first conv layer)
    # Each filter is 3x3, so print the whole first filter; .data leaves out autograd info
    print("Conv1 Weights (first filter):", model.conv1.weight.data[0, 0])
    print("Conv1 Bias:", model.conv1.bias.data)


##### Explanation of Weight & Bias Updates
* Forward pass: Input images go through convolution → ReLU → flatten → fully connected → softmax.
* Loss calculation: CrossEntropy compares predictions vs. true labels.
* Backward pass: loss.backward() computes gradients for each weight and bias.
* Update step: optimizer.step() adjusts weights and biases using gradient descent.
* Inspection: After each epoch, you can print out the weights and biases to see how they change.
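
To see the update step in isolation, the sketch below snapshots a layer's weights before and after one `optimizer.step()` and checks that they moved by `-lr * grad`. It assumes plain SGD without momentum (unlike the training cell above, which uses `momentum=0.9`), a tiny stand-in layer, a synthetic random batch, and an arbitrary loss; none of these come from the original example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Conv2d(1, 2, kernel_size=3)                   # tiny stand-in for conv1
lr = 0.1
optimizer = torch.optim.SGD(layer.parameters(), lr=lr)   # plain SGD, no momentum

x = torch.randn(4, 1, 8, 8)                   # synthetic batch, not real MNIST data
loss = layer(x).pow(2).mean()                 # arbitrary scalar loss for illustration

optimizer.zero_grad()
loss.backward()                               # fills layer.weight.grad

before = layer.weight.detach().clone()        # snapshot before the update
grad = layer.weight.grad.detach().clone()
optimizer.step()
delta = layer.weight.detach() - before        # how far each weight moved

print("largest weight change:", delta.abs().max().item())
print("matches -lr * grad:", torch.allclose(delta, -lr * grad, atol=1e-6))
```

With momentum enabled the step also depends on earlier gradients, so the simple `-lr * grad` relationship only holds for the very first update.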

### 📈 Expected Outcomes of Increasing Epochs
* Training Loss Decreases (at first):
    * With more epochs, the model sees the data multiple times.
    * Weights and biases are adjusted repeatedly, so the loss usually drops further compared to fewer epochs.
* Accuracy Improves (up to a point):
    * The model learns more features and patterns, so accuracy on the training set and often the validation set increases.
    * Example: Going from 3 epochs to 10 epochs might raise accuracy from ~88% to ~95% (illustrative figures, not measured results).
* Risk of Overfitting:
    * After a certain number of epochs, the model may start memorizing training data instead of generalizing.
    * Training accuracy keeps climbing, but validation/test accuracy may plateau or even decline.
* Weight & Bias Adjustments:
    * Early epochs → large changes in weights (big gradient steps).
    * Later epochs → smaller refinements as the optimizer converges.
    * Eventually, updates become tiny because the model is close to a minimum in the loss function.

### 🔮 Assumed Example Outcome (MNIST digits with CNN)

| Epochs |  Training Loss |  Training Accuracy |  Validation Accuracy |
|--------|----------------|--------------------|----------------------|
| 3 |  ~0.35 |  ~88% |  ~87% |
| 10 |  ~0.15 |  ~95% |  ~94% |
| 30 |  ~0.05 |  ~99% |  ~92% (overfitting begins) |

* More epochs = better learning initially.
* Too many epochs = diminishing returns + overfitting risk.
* The sweet spot is usually found by monitoring validation loss/accuracy and using techniques like early stopping or dropout layers.

#### 🎯 How to Determine if 50 Epochs is Excessive
##### 1. Monitor Training vs. Validation Loss
* If training loss keeps decreasing but validation loss plateaus or increases, that's a strong sign of overfitting.
    * Example pattern:
        * Epoch 10: Training loss ↓, Validation loss ↓
        * Epoch 30: Training loss ↓, Validation loss stable
        * Epoch 50: Training loss ↓, Validation loss ↑ → overfitting.
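
The pattern above can be checked programmatically once per-epoch losses are recorded. The helper below is a minimal sketch: the function name `overfitting_onset`, the `patience` threshold, and the loss values are made up for illustration.

```python
def overfitting_onset(train_losses, val_losses, patience=3):
    """Return the 1-based epoch of the best validation loss if at least
    `patience` later epochs failed to beat it while training loss kept
    falling; return None when no such pattern is found."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    later = range(best_epoch + 1, len(val_losses))
    if len(later) < patience:
        return None                      # too few epochs after the best one to judge
    train_still_falling = train_losses[-1] < train_losses[best_epoch]
    return best_epoch + 1 if train_still_falling else None

# Made-up loss histories that follow the pattern described above
train = [0.90, 0.50, 0.30, 0.20, 0.15, 0.10, 0.07, 0.05]
val = [0.95, 0.60, 0.40, 0.35, 0.36, 0.38, 0.41, 0.45]

print(overfitting_onset(train, val))   # prints 4
```

Here validation loss bottoms out at epoch 4 and then rises while training loss keeps dropping, so epoch 4 is reported as the point where further training stops helping.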

##### 2. Check Accuracy Trends
* Training accuracy may approach 99–100%.
* Validation accuracy may peak earlier (say at epoch 20–30) and then stagnate or decline.
* If validation accuracy drops while training accuracy rises, you're overfitting.
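
The training cells in this notebook only track loss, so following accuracy trends needs an extra metric. A minimal NumPy sketch of such a metric is shown below; the `accuracy` helper and the score matrix are hypothetical and stand in for model outputs on a batch.

```python
import numpy as np

def accuracy(scores, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    return float(np.mean(np.argmax(scores, axis=1) == labels))

# Hypothetical class scores for 4 samples and 3 classes
scores = np.array([[2.0, 0.1, 0.3],
                   [0.2, 1.5, 0.1],
                   [0.9, 0.8, 0.7],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 1, 2, 2])

print(accuracy(scores, labels))   # prints 0.75
```

Computing this once per epoch on both the training and validation sets gives the two curves whose divergence signals overfitting.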

##### 3. Use Early Stopping
* Instead of fixing 50 epochs, train with a maximum (like 50) but stop automatically when validation loss stops improving for several epochs.
* This prevents wasting time and avoids overfitting.
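
One way to implement early stopping is a small helper that counts epochs without improvement. This is a sketch under assumptions: the `EarlyStopping` class, its `patience` and `min_delta` parameters, and the validation losses are invented here and are not part of PyTorch.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta        # smallest decrease that counts as an improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses: improvement stalls after the fourth epoch
val_losses = [0.60, 0.45, 0.40, 0.38, 0.39, 0.40, 0.41, 0.43, 0.44]
stopper = EarlyStopping(patience=3)
max_epochs = 50
for epoch, val_loss in zip(range(1, max_epochs + 1), val_losses):
    if stopper.step(val_loss):
        print(f"Stopping at epoch {epoch}; best validation loss {stopper.best:.2f}")
        break
```

In a real training loop, `stopper.step(avg_val_loss)` would be called at the end of each epoch, usually together with saving the model weights whenever a new best loss is seen.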

##### 4. Regularization Helps
* Techniques like dropout, weight decay (L2 regularization), and data augmentation can allow more epochs without severe overfitting.
* But even with these, monitoring validation metrics is essential.
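
Two of these techniques are simple enough to sketch in NumPy. The functions below are illustrative assumptions rather than library code: `sgd_step_with_weight_decay` shows an L2 penalty folded into a plain SGD step, and `dropout` shows the common "inverted dropout" formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step_with_weight_decay(w, grad, lr=0.01, weight_decay=1e-4):
    """One SGD step where an L2 penalty adds weight_decay * w to the gradient."""
    return w - lr * (grad + weight_decay * w)

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: zero each value with probability p and rescale the rest
    by 1 / (1 - p) so the expected activation is unchanged; identity at eval time."""
    if not training or p == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= p
    return activations * keep_mask / (1.0 - p)

w = np.array([1.0, -2.0, 0.5])
zero_grad = np.zeros_like(w)
# Even with a zero data gradient, weight decay shrinks every weight toward 0
print(sgd_step_with_weight_decay(w, zero_grad, lr=0.1, weight_decay=0.1))

a = np.ones((2, 4))
print(dropout(a, p=0.5))                   # entries are either 0.0 or 2.0
print(dropout(a, p=0.5, training=False))   # unchanged at evaluation time
```

In PyTorch the corresponding switches are the `weight_decay` argument of the optimizers and the `nn.Dropout` layer, which is active after `model.train()` and disabled after `model.eval()`.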

###### 🔮 Assumed Outcome for 50 Epochs (CNN on MNIST-like data)

| Epochs |  Training Loss |  Training Accuracy |  Validation Accuracy |
|--------|----------------|--------------------|----------------------|
| 10 |  ~0.15 |  ~95% |  ~94% |
| 30 |  ~0.07 |  ~98% |  ~94% (plateau) |
| 50 |  ~0.03 |  ~99% |  ~92% (overfitting) |

👉 So yes, 50 epochs can be excessive unless you use early stopping or strong regularization.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import matplotlib.pyplot as plt

# 1. Define a simple CNN
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=3)
        self.fc1 = nn.Linear(10*26*26, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = x.view(-1, 10*26*26)
        x = self.fc1(x)
        return F.log_softmax(x, dim=1)

# 2. Load dataset
transform = transforms.Compose([transforms.ToTensor()])
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST(torch_data_dir, train=True, download=True, transform=transform),
    batch_size=64, shuffle=True
)
val_loader = torch.utils.data.DataLoader(
    datasets.MNIST(torch_data_dir, train=False, download=True, transform=transform),
    batch_size=64, shuffle=False
)

# 3. Initialize model, optimizer, and loss function
model = SimpleCNN()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# 4. Training loop with metrics tracking
train_losses = []
val_losses = []
num_epochs = 50

for epoch in range(num_epochs):
    # Training phase
    model.train()
    total_train_loss = 0
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
        total_train_loss += loss.item()

    avg_train_loss = total_train_loss / len(train_loader)
    train_losses.append(avg_train_loss)

    # Validation phase
    model.eval()
    total_val_loss = 0
    with torch.no_grad():
        for data, target in val_loader:
            output = model(data)
            loss = loss_fn(output, target)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(val_loader)
    val_losses.append(avg_val_loss)

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}, Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

# 5. Plot training vs validation loss
plt.figure(figsize=(10, 6))
plt.plot(range(1, num_epochs + 1), train_losses, label='Training Loss', marker='o', markersize=3)
plt.plot(range(1, num_epochs + 1), val_losses, label='Validation Loss', marker='s', markersize=3)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training vs Validation Loss (Overfitting Detection)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
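import os  # imported in the first cell too; repeated so this cell runs on its own
os.makedirs('../../generated/images', exist_ok=True)  # savefig fails if the directory is missing
plt.savefig('../../generated/images/50_epochs_loss.png')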

### Backpropagation in CNNs
* Backpropagation is the algorithm used to train CNNs by minimizing the loss function.

Backpropagation is the process of adjusting weights and biases in a neural network by propagating the error (loss) backward through the layers using calculus (derivatives) to minimize that loss.

* **Efficiency**: Instead of guessing weight changes, it uses calculus to find the best direction.
* **Scalability**: Works for deep networks with many layers.
* **Foundation**: It's the core algorithm behind training CNNs, RNNs, and modern deep learning models.

##### 🧠 Think of backpropagation like learning to throw darts:

* Each throw (forward pass) gives you feedback (loss).
* You measure how far off you are (error).
* You adjust your aim slightly in the opposite direction of the error (gradient update).
* With enough practice (epochs), your throws land closer to the bullseye (minimized loss).

##### 🔍 Explanation Based on Microsoft Learn Module

The Microsoft Learn page on Deep Neural Network Concepts explains backpropagation as part of the training process:

###### 1. Forward Pass

* Input features (like penguin measurements in the example) are passed through the network.
* Each neuron applies its function, producing outputs layer by layer until the final prediction is made.

###### 2. Loss Calculation

* The prediction is compared to the true label.
* The difference (error) is quantified using a loss function (e.g., mean squared error, cross-entropy).
* Example: True label `[1,0,0]` vs. predicted `[0.4,0.3,0.3]` → absolute error `[0.6,0.3,0.3]` → mean squared error 0.18.
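
The arithmetic in this example can be reproduced in a few lines. The cross-entropy value is added here for comparison and is not part of the original example.

```python
import math

y_true = [1, 0, 0]
y_pred = [0.4, 0.3, 0.3]

abs_error = [abs(t - p) for t, p in zip(y_true, y_pred)]
mse = sum(e ** 2 for e in abs_error) / len(abs_error)
# Cross-entropy only looks at the probability assigned to the true class
cross_entropy = -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

print([round(e, 2) for e in abs_error])   # [0.6, 0.3, 0.3]
print(round(mse, 2))                      # 0.18
print(round(cross_entropy, 3))            # 0.916
```

Both losses shrink as the predicted probability of the true class approaches 1, but cross-entropy penalizes confident wrong predictions much more heavily, which is why it is the usual choice for classification.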

###### 3. Backpropagation of Error

* The entire network can be seen as a nested function.
* Using differential calculus, the derivative (gradient) of the loss with respect to each weight and bias is computed.
* This tells us whether increasing or decreasing a weight will reduce the loss.
* Gradients are propagated backward from the output layer to earlier layers.
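
As a minimal illustration of the chain rule at work, the sketch below differentiates the loss of a single sigmoid neuron by hand and compares the result with a finite-difference estimate. The neuron, the input values, and the function names are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Nested function: squared error of a single sigmoid neuron."""
    return (sigmoid(w * x + b) - y) ** 2

def analytic_grads(w, b, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw, and dz/db = 1."""
    a = sigmoid(w * x + b)
    dL_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)
    return dL_da * da_dz * x, dL_da * da_dz

w, b, x, y = 0.5, -0.2, 1.5, 1.0          # arbitrary illustrative values
dw, db = analytic_grads(w, b, x, y)

eps = 1e-6                                # central finite difference as an independent check
dw_num = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
db_num = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)

print(f"dL/dw analytic={dw:.6f} numeric={dw_num:.6f}")
print(f"dL/db analytic={db:.6f} numeric={db_num:.6f}")
```

Both gradients come out negative here, which says that increasing `w` or `b` would reduce the loss. That sign is exactly the information the optimizer uses in the next step.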

###### 4. Weight & Bias Updates (Optimization)

* An optimizer (like Stochastic Gradient Descent, Adam, or AdaDelta) uses the gradients to adjust weights and biases.
* Update rule (simplified):
  $$w_{new} = w_{old} - \eta \cdot \frac{\partial L}{\partial w}$$
  where $\eta$  is the learning rate.

* Small learning rate → slow but stable convergence.
* Large learning rate → faster but risk overshooting the minimum.
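
The update rule and the effect of the learning rate can be seen on a one-parameter toy problem. The sketch assumes the loss $L(w) = w^2$ (so $\partial L / \partial w = 2w$) and three hand-picked learning rates; neither comes from the Microsoft Learn module.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize L(w) = w**2 with the update w <- w - lr * dL/dw, where dL/dw = 2w."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

for lr in (0.01, 0.4, 1.1):
    print(f"lr={lr}: w after 20 steps = {gradient_descent(lr):.4g}")
```

Each step multiplies `w` by `(1 - 2 * lr)`, so a learning rate of 0.01 creeps toward the minimum at 0, a rate of 0.4 gets there almost immediately, and a rate of 1.1 overshoots by more each step and diverges.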

###### 5. Repeat Across Epochs

* Each epoch reuses updated weights and biases.
* Over time, the network learns to minimize loss and improve accuracy.

##### 🧑‍💻 Simple Backpropagation Example (1 Hidden Layer Neural Network)
We'll train a tiny network to learn the XOR function:

In [None]:
import numpy as np

# 1. Input data (XOR truth table)
X = np.array([[0,0],
              [0,1],
              [1,0],
              [1,1]])   # inputs
y = np.array([[0],
              [1],
              [1],
              [0]])     # expected outputs

# 2. Initialize weights & biases randomly
np.random.seed(42)
W1 = np.random.randn(2, 2)   # weights for input -> hidden
b1 = np.zeros((1, 2))        # biases for hidden
W2 = np.random.randn(2, 1)   # weights for hidden -> output
b2 = np.zeros((1, 1))        # biases for output

# 3. Define activation functions
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

# 4. Training loop
lr = 0.1   # learning rate
epochs = 5000

for epoch in range(epochs):
    # ---- Forward pass ----
    z1 = np.dot(X, W1) + b1
    a1 = sigmoid(z1)              # hidden layer output
    z2 = np.dot(a1, W2) + b2
    a2 = sigmoid(z2)              # final prediction

    # ---- Loss (Mean Squared Error) ----
    loss = np.mean((y - a2) ** 2)

    # ---- Backpropagation ----
    # Output layer error
    d_loss_a2 = -(y - a2)         # derivative of loss wrt a2 (constant factor 2/N folded into lr)
    d_a2_z2 = sigmoid_derivative(z2)
    d_z2 = d_loss_a2 * d_a2_z2    # gradient at output

    # Gradients for W2 and b2
    dW2 = np.dot(a1.T, d_z2)
    db2 = np.sum(d_z2, axis=0, keepdims=True)

    # Hidden layer error
    d_a1_z1 = sigmoid_derivative(z1)
    d_z1 = np.dot(d_z2, W2.T) * d_a1_z1

    # Gradients for W1 and b1
    dW1 = np.dot(X.T, d_z1)
    db1 = np.sum(d_z1, axis=0, keepdims=True)

    # ---- Update weights & biases ----
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

    # Print progress every 1000 epochs
    if epoch % 1000 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")


###### 🔍 Explanation of the Code:
1. Forward pass:
   * Compute hidden layer activations (a1) and final predictions (a2).

2. Loss calculation:
   * Compare predictions with true labels using mean squared error.

3. Backward pass (backpropagation):
   * Compute gradients of loss w.r.t. output, then propagate back to hidden layer.
   * Use chain rule to calculate derivatives for each weight and bias.

4. Update step:
   * Adjust weights and biases using gradient descent.

##### 📊 Expected Output (simplified)
```
Epoch 0, Loss: 0.2500
Epoch 1000, Loss: 0.1253
Epoch 2000, Loss: 0.0621
Epoch 3000, Loss: 0.0305
Epoch 4000, Loss: 0.0152
```
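By the end, the network should have learned the XOR mapping, with predictions close to `[0,1,1,0]`. The exact loss values depend on the random initialization, so the numbers above are illustrative; a network this small can also need more epochs or a larger learning rate before the loss drops.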

In [None]:
# Generating XOR backpropagation training with loss visualization over 5000 epochs
import numpy as np
import matplotlib.pyplot as plt

# Sigmoid activation and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # Unlike the previous cell, x here must already be a sigmoid output (an activation)
    return x * (1 - x)

# XOR input and output
X = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([[0], [1], [1], [0]])

# Seed for reproducibility
np.random.seed(42)

# Initialize weights and biases
input_layer_neurons = 2
hidden_layer_neurons = 2
output_neurons = 1

# Weights and biases
wh = np.random.uniform(size=(input_layer_neurons, hidden_layer_neurons))
bh = np.random.uniform(size=(1, hidden_layer_neurons))
wo = np.random.uniform(size=(hidden_layer_neurons, output_neurons))
bo = np.random.uniform(size=(1, output_neurons))

# Training parameters
epochs = 5000
learning_rate = 0.1
loss_history = []

# Training loop
for epoch in range(epochs):
    # Forward pass
    hidden_input = np.dot(X, wh) + bh
    hidden_output = sigmoid(hidden_input)

    final_input = np.dot(hidden_output, wo) + bo
    output = sigmoid(final_input)

    # Compute loss (mean squared error)
    loss = np.mean((y - output) ** 2)
    loss_history.append(loss)

    # Backpropagation
    error = y - output
    d_output = error * sigmoid_derivative(output)

    error_hidden = d_output.dot(wo.T)
    d_hidden = error_hidden * sigmoid_derivative(hidden_output)

    # Update weights and biases
    wo += hidden_output.T.dot(d_output) * learning_rate
    bo += np.sum(d_output, axis=0, keepdims=True) * learning_rate
    wh += X.T.dot(d_hidden) * learning_rate
    bh += np.sum(d_hidden, axis=0, keepdims=True) * learning_rate

# Plotting the loss over epochs
plt.style.use('seaborn-v0_8')
plt.figure(figsize=(10, 6))
plt.plot(loss_history, color='blue', linewidth=2)
plt.title('XOR Training Loss over Epochs', fontsize=16)
plt.xlabel('Epoch', fontsize=14)
plt.ylabel('Mean Squared Error Loss', fontsize=14)
plt.grid(True)
plt.tight_layout()
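import os  # imported in the first cell too; repeated so this cell runs on its own
os.makedirs('../../generated/images', exist_ok=True)  # savefig fails if the directory is missing
plt.savefig('../../generated/images/xor_loss_plot.png')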

print("Trained XOR neural network using backpropagation over 5000 epochs and saved loss plot as xor_loss_plot.png")


###### 📊 What the Chart Shows
* Early epochs: Loss starts relatively high because weights are random.
* Middle epochs: Loss decreases quickly as the network learns the XOR mapping.
* Later epochs: Loss flattens out, showing convergence; the network has effectively learned the function.

##### 🔍 Key Takeaways
* Backpropagation works by iteratively reducing loss through weight and bias updates.
* The curve's downward trend confirms the network is learning.
* Plateauing at the end means additional epochs won't improve much; the model has converged.

The same kind of plot, with a validation curve added, is how you tell overfitting from convergence in larger CNNs: if validation loss starts rising while training loss keeps dropping, that's overfitting.

### 🧑‍💻 Python Simulation: Effect of Learning Rate
We'll train a tiny neural network (same XOR example as before) but run it with different learning rates and compare the loss curves.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# XOR dataset
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([[0],[1],[1],[0]])

# Activation functions
def sigmoid(x): return 1/(1+np.exp(-x))
def sigmoid_derivative(x): return sigmoid(x)*(1-sigmoid(x))

# Training function
def train(lr, epochs=2000):
    np.random.seed(42)
    W1 = np.random.randn(2,2)
    b1 = np.zeros((1,2))
    W2 = np.random.randn(2,1)
    b2 = np.zeros((1,1))
    losses = []

    for epoch in range(epochs):
        # Forward pass
        z1 = np.dot(X, W1) + b1
        a1 = sigmoid(z1)
        z2 = np.dot(a1, W2) + b2
        a2 = sigmoid(z2)

        # Loss
        loss = np.mean((y - a2)**2)
        losses.append(loss)

        # Backpropagation
        d_loss_a2 = -(y - a2)
        d_a2_z2 = sigmoid_derivative(z2)
        d_z2 = d_loss_a2 * d_a2_z2
        dW2 = np.dot(a1.T, d_z2)
        db2 = np.sum(d_z2, axis=0, keepdims=True)

        d_a1_z1 = sigmoid_derivative(z1)
        d_z1 = np.dot(d_z2, W2.T) * d_a1_z1
        dW1 = np.dot(X.T, d_z1)
        db1 = np.sum(d_z1, axis=0, keepdims=True)

        # Update weights
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    return losses

# Compare different learning rates
lrs = [0.01, 0.1, 1.0]
results = {lr: train(lr) for lr in lrs}

# Plot
for lr, losses in results.items():
    plt.plot(losses, label=f"lr={lr}")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Effect of Learning Rate on Training")
plt.legend()
plt.tight_layout()
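import os  # imported in the first cell too; repeated so this cell runs on its own
os.makedirs('../../generated/images', exist_ok=True)  # savefig fails if the directory is missing
plt.savefig('../../generated/images/changing_learning_rate.png')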



##### 📊 What You'll See
* Low learning rate (0.01): Loss decreases very slowly; training is stable but inefficient.
* Moderate learning rate (0.1): Loss decreases steadily and converges well; often the sweet spot.
* High learning rate (1.0): Loss may oscillate or diverge because updates are too aggressive and overshoot the minimum. On a problem this small it can also simply converge fastest, so check the plot.

##### 🧠 Key Insight
* Too small → slow learning.
* Too large → unstable learning.
* Just right → fast convergence without overshooting.

### 📝 Optimizers
> "We use an optimizer to apply this same trick for all of the weight and bias variables in the model and determine in which direction we need to adjust them (up or down) to reduce the overall amount of loss in the model."

Optimizers pick up where backpropagation leaves off: backpropagation computes the gradients, and the optimizer translates them into actual weight and bias updates. Choosing the right optimizer (and learning rate) directly impacts how fast and how well a neural network learns.

* **Context**: After calculating the loss and its derivative (gradient), the next step is to decide how to adjust weights and biases to minimize loss.
* **Optimizers Defined**: An optimizer is the algorithm that uses gradient information to update weights and biases in the right direction.
* **How It Works**:
    * The entire neural network can be seen as a nested function.
    * By applying differential calculus, we calculate the slope (gradient) of the loss with respect to each weight/bias.
    * The optimizer then adjusts parameters up or down depending on whether the gradient is positive or negative.
* **Common Optimizers**:
    * Stochastic Gradient Descent (SGD): Updates weights using gradients from mini-batches of data.
    * AdaDelta: Adapts learning rates dynamically based on recent gradient history.
    * Adam (Adaptive Moment Estimation): Combines momentum and adaptive learning rates for faster, more stable convergence.
* **Purpose**: All optimizers aim to minimize loss efficiently by finding the best path through the parameter space.
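
To make the difference between these optimizers concrete, here is a NumPy sketch of the update rules for SGD with momentum and for Adam. It is a simplified illustration under stated assumptions: the function names, the toy loss $L(w) = \sum w^2$, and the hyperparameter values are chosen here, and real library implementations include further options.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """SGD with momentum: accumulate a running velocity, then step along it."""
    velocity = momentum * velocity + grad
    return w - lr * velocity, velocity

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize L(w) = sum(w**2), whose gradient is 2 * w
w_sgd = np.array([1.0, -3.0])
vel = np.zeros(2)
w_adam = w_sgd.copy()
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 201):
    w_sgd, vel = sgd_momentum_step(w_sgd, 2 * w_sgd, vel, lr=0.01)
    w_adam, m, v = adam_step(w_adam, 2 * w_adam, m, v, t, lr=0.05)

print("SGD + momentum:", w_sgd)
print("Adam:          ", w_adam)
```

Note how Adam divides by the running root-mean-square of the gradient: its first step has size close to `lr` for every parameter regardless of how large the gradient is, which is what "adaptive learning rate" means in practice.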

##### Example: Comparing Optimizers on XOR Problem
The following cell trains a small neural network on the XOR dataset using the SGD and Adam optimizers, each with learning rates of 0.1 and 0.01. The resulting chart shows how the loss changes over 1000 epochs.

In [None]:
# Simulating training of a simple neural network on XOR data using different optimizers and learning rates
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import os

# Ensure output directory exists
os.makedirs("../../generated/images", exist_ok=True)

# Define XOR dataset
X = torch.tensor([[0, 0],
                  [0, 1],
                  [1, 0],
                  [1, 1]], dtype=torch.float32)
y = torch.tensor([[0], [1], [1], [0]], dtype=torch.float32)

# Define a simple neural network
class XORNet(nn.Module):
    def __init__(self):
        super(XORNet, self).__init__()
        self.fc1 = nn.Linear(2, 4)
        self.fc2 = nn.Linear(4, 1)

    def forward(self, x):
        x = torch.sigmoid(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        return x

# Training function
def train_model(optimizer_name, learning_rate, epochs=1000):
    torch.manual_seed(42)  # same initial weights for every run, so the curves are comparable
    model = XORNet()
    criterion = nn.MSELoss()

    if optimizer_name == 'SGD':
        optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    elif optimizer_name == 'Adam':
        optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    else:
        raise ValueError("Unsupported optimizer")

    losses = []
    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(X)
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

# Settings
optimizers = ['SGD', 'Adam']
learning_rates = [0.1, 0.01]
epochs = 1000

# Train models and collect losses
results = {}
for opt in optimizers:
    for lr in learning_rates:
        key = f"{opt}_lr{lr}"
        losses = train_model(opt, lr, epochs)
        results[key] = losses

# Plotting loss curves
plt.style.use('seaborn-v0_8')
plt.figure(figsize=(10, 6))
for label, losses in results.items():
    plt.plot(losses, label=label)
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("XOR Training Loss Comparison: Optimizers and Learning Rates")
plt.legend()
plt.grid(True)
plot_path = "../../generated/images/xor_optimizer_comparison.png"
plt.tight_layout()
plt.savefig(plot_path)

print("Trained XOR neural network using SGD and Adam optimizers with different learning rates. Loss curves saved as xor_optimizer_comparison.png")


##### 🔍 Key Insights
* **SGD (lr=0.1)**: Loss decreases steadily but more slowly compared to Adam.
* **SGD (lr=0.01)**: Very stable, but convergence is slow; the network takes longer to learn.
* **Adam (lr=0.1)**: Fast convergence, loss drops quickly, but can oscillate if the rate is too high.
* **Adam (lr=0.01)**: Smooth and efficient convergence, often the most balanced choice.

##### 📊 Why This Matters
* **Optimizers** (SGD, Adam, AdaDelta, etc.) determine how weights and biases are updated during backpropagation.
* **Learning rate** controls how much they change each step:
    * Too small → slow learning.
    * Too large → unstable or diverging.
    * Just right → fast and stable convergence.

##### 🧠 Practical Takeaway
* Start with **Adam + lr=0.001–0.01** for most problems.
* Use **SGD** when you want more control or are fine-tuning.
* Always monitor **validation loss** to avoid overfitting or divergence.