# Lab 2 - Module 3: Learning Rate Exploration

**Learning Objectives:**
- Understand the critical role of the learning rate in gradient descent (GD)
- Predict and observe: slow convergence, fast convergence, oscillation, divergence
- Develop intuition for choosing learning rates
- Recognize learning rate as a crucial hyperparameter

**Time:** ~20 minutes

---

This module focuses entirely on the **learning rate**, often the most important hyperparameter in gradient descent!

## Why Learning Rate Matters

The learning rate controls the **step size** in gradient descent:

```
new = old - learning_rate × gradient
```
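
As a minimal sketch, the update rule above can be written as a one-line function. The name `gd_step` and the numbers below are illustrative assumptions, not part of the lab code; they only show how the learning rate scales a single step.

```python
def gd_step(old, learning_rate, gradient):
    """Apply one gradient descent update: new = old - learning_rate * gradient."""
    return old - learning_rate * gradient

# Illustrative values: the same position and gradient, three different learning rates
x, grad = 10.0, 10.0
for lr in (0.001, 0.1, 3.0):
    print(f"learning_rate = {lr}: x moves from {x} to {gd_step(x, lr, grad)}")
```

The gradient fixes the direction of the move, and the learning rate alone decides how far the step goes.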

### The Goldilocks Problem:

- **Too small:** Convergence is painfully slow (wastes computation)
- **Too large:** Steps overshoot the minimum, causing oscillation or outright divergence
- **Just right:** Fast and stable convergence

Finding the "just right" learning rate is both an art and a science!

## 1. Setup: A Simple Test Function

We'll use a simple 1D quadratic function for clear visualization:

```
f(x) = 0.5 × x²
```

Minimum at x = 0, where f(0) = 0.
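
Because the derivative of this function is simply f'(x) = x, it makes a convenient sanity check for a numerical gradient. The sketch below compares the analytic derivative with a central-difference estimate; the names `f`, `analytic_gradient`, and `central_difference` are assumptions chosen for illustration.

```python
def f(x):
    """Test function f(x) = 0.5 * x^2."""
    return 0.5 * x**2

def analytic_gradient(x):
    """Exact derivative: d/dx (0.5 * x^2) = x."""
    return x

def central_difference(func, x, h=1e-5):
    """Estimate the derivative from two nearby function values."""
    return (func(x + h) - func(x - h)) / (2 * h)

for x in (-3.0, 0.0, 10.0):
    numeric = central_difference(f, x)
    print(f"x = {x:5.1f}  analytic = {analytic_gradient(x):8.4f}  numeric = {numeric:8.4f}")
```

For a quadratic, the central difference agrees with the exact derivative up to floating-point rounding, so the two columns should match.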

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from ipywidgets import FloatText, Button, Output, VBox, HBox
from IPython.display import display

# Simple quadratic function
def test_function(x):
    """f(x) = 0.5 * x^2"""
    return 0.5 * x**2

def compute_gradient(x, func, h=1e-5):
    """Numerical gradient"""
    return (func(x + h) - func(x - h)) / (2 * h)

def run_gd_with_lr(x_start, learning_rate, max_steps=100, tol=1e-6):
    """
    Run gradient descent and return full history.
    
    Returns:
        dict with 'x', 'f', 'grad', 'converged', 'n_steps', 'status'
    """
    history = {
        'x': [x_start],
        'f': [test_function(x_start)],
        'grad': [],
        'converged': False,
        'n_steps': 0,
        'status': 'unknown'
    }
    
    x_current = x_start
    
    for step in range(max_steps):
        grad = compute_gradient(x_current, test_function)
        history['grad'].append(grad)
        
        # GD update
        x_new = x_current - learning_rate * grad
        f_new = test_function(x_new)
        
        history['x'].append(x_new)
        history['f'].append(f_new)
        history['n_steps'] = step + 1
        
        # Check for divergence
        if abs(x_new) > 100 or abs(f_new) > 1000:
            history['status'] = 'diverged'
            break
        
        # Check for convergence
        if abs(f_new - history['f'][-2]) < tol:
            history['converged'] = True
            history['status'] = 'converged'
            break
        
        x_current = x_new
    
    # Determine status if not already set
    if history['status'] == 'unknown':
        if history['converged']:
            history['status'] = 'converged'
        elif len(history['x']) >= 10:
            # Check for oscillation: if last few steps bounce around
            recent_x = history['x'][-10:]
            if max(recent_x) - min(recent_x) > 0.1 * abs(x_start):
                history['status'] = 'oscillating'
            else:
                history['status'] = 'slow_convergence'
        else:
            history['status'] = 'in_progress'
    
    return history

print("✓ Test function and GD utilities ready")
print("\nTest function: f(x) = 0.5 × x²")
print("Minimum at x = 0, f(0) = 0")

## 2. Prediction Questions (Answer BEFORE running experiments)

**Q10 (PREDICTION):** Starting from x = 10.0, predict what will happen with:

1. **LR = 0.001** (very small):
   - Will it converge in 100 steps?
   - Roughly how many steps would it need?
   - Will the path be smooth?

2. **LR = 0.1** (moderate):
   - Will it converge quickly?
   - How many steps approximately?
   - Will it be stable?

3. **LR = 0.8** (large):
   - Will it converge?
   - Will it oscillate (bounce back and forth)?
   - Will it diverge (explode)?

4. **LR = 3.0** (very large):
   - Will it converge at all?
   - What do you expect to happen?
   - How quickly will it diverge?

Write your predictions on the answer sheet before continuing!

## 3. Four Learning Rates: Side-by-Side Comparison

In [None]:
# Starting point
x_start = 10.0

# Four learning rates to compare
learning_rates = [0.001, 0.1, 0.8, 3.0]
# Neutral labels (matching Q10) so the plots do not pre-judge the outcome
lr_labels = ['0.001 (very small)', '0.1 (moderate)', '0.8 (large)', '3.0 (very large)']
colors = ['blue', 'green', 'orange', 'red']

# Run GD for each learning rate
print("Running gradient descent with four learning rates...")
results = {}
for lr in learning_rates:
    results[lr] = run_gd_with_lr(x_start, lr, max_steps=100)

print("="*80)
print(f"Starting point: x = {x_start}, f(x) = {test_function(x_start):.2f}\n")

for lr, label in zip(learning_rates, lr_labels):
    res = results[lr]
    print(f"Learning Rate = {label}:")
    print(f"  Status: {res['status']}")
    print(f"  Steps taken: {res['n_steps']}")
    print(f"  Final x: {res['x'][-1]:.6f}")
    print(f"  Final f(x): {res['f'][-1]:.6f}")
    print()

## 4. Visualization: Four Scenarios

In [None]:
# Create four subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10), dpi=100)
axes = axes.flatten()

# Plot function curve
x_plot = np.linspace(-12, 12, 300)
f_plot = test_function(x_plot)

for idx, (lr, label, color) in enumerate(zip(learning_rates, lr_labels, colors)):
    ax = axes[idx]
    res = results[lr]
    
    # Plot function
    ax.plot(x_plot, f_plot, 'b-', linewidth=2, alpha=0.3, label='f(x) = 0.5x²')
    
    # Plot GD path
    x_hist = np.array(res['x'])
    f_hist = np.array(res['f'])
    
    # Plot every iterate; start and end points are larger, and the end point is a star
    for i in range(len(x_hist)):
        marker = 'o' if i == 0 else ('*' if i == len(x_hist)-1 else 'o')
        size = 150 if i == 0 or i == len(x_hist)-1 else 60
        alpha = 1.0 if i == 0 or i == len(x_hist)-1 else 0.5
        
        ax.scatter(x_hist[i], f_hist[i], c=color, marker=marker, 
                  s=size, alpha=alpha, edgecolors='black', linewidths=1.5, zorder=3)
    
    # Add arrows
    arrow_freq = max(1, len(x_hist) // 10)
    for i in range(0, min(len(x_hist)-1, 30), arrow_freq):
        ax.annotate('', xy=(x_hist[i+1], f_hist[i+1]), 
                   xytext=(x_hist[i], f_hist[i]),
                   arrowprops=dict(arrowstyle='->', color=color, lw=1.5, alpha=0.6))
    
    ax.set_xlabel('x', fontsize=11)
    ax.set_ylabel('f(x)', fontsize=11)
    ax.set_title(f'LR = {label}\n{res["status"]}', fontsize=11, fontweight='bold')
    ax.grid(True, alpha=0.3)
    ax.set_xlim(-12, 12)
    
    # Adjust ylim based on divergence
    if res['status'] == 'diverged':
        ax.set_ylim(-5, min(200, max(f_hist) + 20))
    else:
        ax.set_ylim(-5, min(60, max(f_hist) + 5))

plt.tight_layout()
plt.show()

## 5. Convergence Curves: Loss Over Time

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5), dpi=100)

# Left: Linear scale
for lr, label, color in zip(learning_rates, lr_labels, colors):
    res = results[lr]
    ax1.plot(range(len(res['f'])), res['f'], 'o-', 
            color=color, label=f'LR = {label}', linewidth=2, markersize=4, alpha=0.8)

ax1.axhline(y=0, color='black', linestyle='--', linewidth=2, alpha=0.3, label='Minimum')
ax1.set_xlabel('Step', fontsize=11)
ax1.set_ylabel('f(x)', fontsize=11)
ax1.set_title('Convergence: Linear Scale', fontsize=12, fontweight='bold')
ax1.legend(fontsize=9)
ax1.grid(True, alpha=0.3)
max_f_val = max([max(results[lr]['f']) for lr in learning_rates])
ax1.set_ylim(-2, min(200, max_f_val + 10))

# Right: Log scale (better for seeing convergence)
for lr, label, color in zip(learning_rates, lr_labels, colors):
    res = results[lr]
    # Add small epsilon to avoid log(0)
    f_vals = np.array(res['f']) + 1e-10
    ax2.semilogy(range(len(f_vals)), f_vals, 'o-',
                color=color, label=f'LR = {label}', linewidth=2, markersize=4, alpha=0.8)

ax2.set_xlabel('Step', fontsize=11)
ax2.set_ylabel('f(x) [log scale]', fontsize=11)
ax2.set_title('Convergence: Log Scale', fontsize=12, fontweight='bold')
ax2.legend(fontsize=9)
ax2.grid(True, alpha=0.3, which='both')

plt.tight_layout()
plt.show()

print("\nNotice:")
print("  • LR = 0.001: Very slow descent (wastes computation)")
print("  • LR = 0.1: Smooth, stable convergence")
print("  • LR = 0.8: Large, yet still stable on this function - each step shrinks x to 0.2x")
print("  • LR = 3.0: Rapid divergence - each step maps x to -2x")

## 6. Detailed Comparison Table

In [None]:
# Create comparison table
comparison_data = []

for lr, label in zip(learning_rates, lr_labels):
    res = results[lr]
    
    # Compute steps to reach f(x) < 0.01 (near minimum)
    steps_to_threshold = None
    for i, f_val in enumerate(res['f']):
        if f_val < 0.01:
            steps_to_threshold = i
            break
    
    comparison_data.append({
        'Learning Rate': label,
        'Total Steps': res['n_steps'],
        'Steps to f < 0.01': steps_to_threshold if steps_to_threshold is not None else 'N/A',
        'Final x': f"{res['x'][-1]:.6f}",
        'Final f(x)': f"{res['f'][-1]:.6f}",
        'Status': res['status'],
        'Converged': 'Yes' if res['converged'] else 'No'
    })

df_comparison = pd.DataFrame(comparison_data)
display(df_comparison)

print("\nKey Observations:")
print("="*80)
print("1. Small LR (0.001): Stable but slow - needs many iterations")
print("2. Moderate LR (0.1): Stable and much faster than 0.001")
print("3. Large LR (0.8): Still stable on this function, and faster again (compare the step counts)")
print("4. Very large LR (3.0): Overshoots further on every step - diverges")
print("\nTrade-off: Speed vs. Stability")

## 7. Interactive: Find Your Own Learning Rate

Now it's your turn! Experiment with different learning rates to find:
1. The **largest LR that still converges**
2. The **smallest LR that converges in < 50 steps**
3. An LR that causes **dramatic divergence**
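
If the widget is unavailable, the same challenges can be explored with a short scripted sweep. This is a sketch under stated assumptions: `steps_to_converge` is a hypothetical helper, it uses the analytic gradient f'(x) = x instead of the lab's numerical gradient, and its defaults (start at 10.0, 100 steps, tolerance 1e-6) copy the values used in this lab. Step counts should be close to what the lab helper reports but are not guaranteed to be identical.

```python
def steps_to_converge(learning_rate, x_start=10.0, max_steps=100, tol=1e-6):
    """Return the step at which f stops changing by more than tol, else None.

    Hypothetical helper for f(x) = 0.5 * x**2, whose gradient is x.
    """
    x = x_start
    f_prev = 0.5 * x**2
    for step in range(1, max_steps + 1):
        x = x - learning_rate * x            # GD update with gradient = x
        f_new = 0.5 * x**2
        if abs(x) > 100 or f_new > 1000:     # divergence guard
            return None
        if abs(f_new - f_prev) < tol:        # convergence test on the loss
            return step
        f_prev = f_new
    return None                              # step budget exhausted

# Placeholder candidates: edit this list to hunt for the challenge answers
for lr in (0.01, 0.1, 0.5, 1.0, 3.0):
    steps = steps_to_converge(lr)
    outcome = "did not converge" if steps is None else f"converged in {steps} steps"
    print(f"LR = {lr}: {outcome}")
```

Returning `None` covers both divergence and running out of steps, so a printed "did not converge" still needs a look at the plot to tell the two apart.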

In [None]:
# Interactive exploration widget
lr_custom_input = FloatText(description="Learning rate:", value=0.1, step=0.05)
run_custom_button = Button(description="Run GD", button_style='success')
output_custom = Output()

def on_run_custom(b):
    lr_custom = lr_custom_input.value
    
    with output_custom:
        output_custom.clear_output(wait=True)
        
        # Run GD
        res = run_gd_with_lr(x_start, lr_custom, max_steps=100)
        
        # Plot
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5), dpi=100)
        
        # Left: Path on function
        ax1.plot(x_plot, f_plot, 'b-', linewidth=2, alpha=0.3, label='f(x) = 0.5x²')
        
        x_hist = np.array(res['x'])
        f_hist = np.array(res['f'])
        
        colors_grad = plt.cm.viridis(np.linspace(0, 1, len(x_hist)))
        for i in range(len(x_hist)):
            marker = 'o' if i == 0 else ('*' if i == len(x_hist)-1 else 'o')
            size = 150 if i == 0 or i == len(x_hist)-1 else 60
            ax1.scatter(x_hist[i], f_hist[i], c=[colors_grad[i]], marker=marker,
                       s=size, edgecolors='black', linewidths=1.5, zorder=3)
        
        ax1.set_xlabel('x', fontsize=11)
        ax1.set_ylabel('f(x)', fontsize=11)
        ax1.set_title(f'GD Path: LR = {lr_custom}\nStatus: {res["status"]}', 
                     fontsize=12, fontweight='bold')
        ax1.grid(True, alpha=0.3)
        ax1.set_xlim(-15, 15)
        ax1.set_ylim(-5, min(100, max(f_hist) + 10))
        
        # Right: Convergence
        ax2.semilogy(range(len(res['f'])), np.array(res['f']) + 1e-10, 'o-',
                    color='purple', linewidth=2, markersize=4)
        ax2.set_xlabel('Step', fontsize=11)
        ax2.set_ylabel('f(x) [log scale]', fontsize=11)
        ax2.set_title('Convergence Curve', fontsize=12, fontweight='bold')
        ax2.grid(True, alpha=0.3, which='both')
        
        plt.tight_layout()
        plt.show()
        
        # Print summary
        print(f"\nLearning Rate: {lr_custom}")
        print("="*60)
        print(f"Status: {res['status']}")
        print(f"Steps: {res['n_steps']}")
        print(f"Final x: {res['x'][-1]:.6f}")
        print(f"Final f(x): {res['f'][-1]:.6f}")
        print(f"Converged: {'Yes' if res['converged'] else 'No'}")
        
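        # Provide feedback
        # A sign change between consecutive x values means a step jumped past the minimum
        overshoots = bool(np.any(x_hist[:-1] * x_hist[1:] < 0))
        if res['status'] == 'converged' and res['n_steps'] < 30:
            print("\n✓ Excellent! Fast and stable convergence.")
        elif res['status'] == 'converged' and overshoots:
            print("\n✓ Converged, but slowly because every step overshoots. Try a smaller LR.")
        elif res['status'] == 'converged':
            print("\n✓ Converged, but slowly. Try a larger LR.")
        elif res['status'] == 'oscillating':
            print("\n⚠ Oscillating! Try a smaller LR.")
        elif res['status'] == 'diverged':
            print("\n✗ Diverged! LR is too large.")
        else:
            print("\n⚠ Slow convergence. Try a larger LR.")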

run_custom_button.on_click(on_run_custom)

print("Interactive Learning Rate Exploration")
print("="*80)
print("Experiment with different learning rates!")
print("\nChallenges:")
print("  1. Find the LARGEST LR that still converges")
print("  2. Find the SMALLEST LR that converges in < 50 steps")
print("  3. Find an LR that causes dramatic divergence\n")

display(VBox([
    lr_custom_input,
    run_custom_button,
    output_custom
]))

## Questions for Your Answer Sheet

**Q11.** Based on your observations, describe the behavior for each learning rate category:
- Very small (e.g., 0.001): What happens? Why is this wasteful?
- Moderate (e.g., 0.1): How does it behave? Is it the fastest of the four?
- Large (e.g., 0.8): Did it oscillate or diverge as you predicted? Why or why not?
- Very large (e.g., 3.0): How quickly does it diverge? What does this tell you?

**Q12.** How would you choose a learning rate for a new optimization problem?
- What strategy would you use?
- What signs would indicate your LR is too large? Too small?
- Why is learning rate called a "hyperparameter"?

## Key Takeaways

### The Learning Rate Trade-off:

| Learning Rate | Speed | Stability | Outcome |
|--------------|-------|-----------|----------|
| Too small | Very slow | Very stable | Wastes computation |
| Optimal | Fast | Stable | Best performance |
| Too large | Fast (initially) | Unstable | Oscillation/divergence |
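
For the test function in this lab, the boundaries between these rows can be worked out exactly. Since f'(x) = x, each update is x_new = x - LR × x = (1 - LR) × x, so the size and sign of the factor (1 - LR) decide the behavior. The sketch below encodes that reasoning; `classify_learning_rate` is a hypothetical name, and the thresholds hold only for f(x) = 0.5 × x². For a quadratic 0.5 × a × x² the stable range would instead be 0 < LR < 2/a.

```python
def classify_learning_rate(lr):
    """Classify GD behavior on f(x) = 0.5 * x**2 from the per-step factor (1 - lr)."""
    factor = 1.0 - lr                      # each step maps x to factor * x
    if abs(factor) > 1.0:
        return "diverges"                  # |x| grows on every step
    if abs(factor) == 1.0:
        return "stalls" if factor == 1.0 else "oscillates forever"
    if factor < 0.0:
        return "converges while oscillating"   # sign flips, |x| shrinks
    return "converges monotonically"           # same side of the minimum, |x| shrinks

for lr in (0.001, 0.1, 0.8, 1.5, 2.0, 3.0):
    print(f"LR = {lr}: {classify_learning_rate(lr)}")
```

On this function the four learning rates from the comparison fall into only two of these classes, which is worth checking against your own observations.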

### In Real Machine Learning:

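- Learning rate is often the **single most important hyperparameter** to tune
- Real models: Common starting points are LR ≈ 0.001-0.01, depending on the optimizer and model
- **Learning rate schedules**: Start large, decrease over time
- **Adaptive methods**: Adam and RMSprop adapt the step size per parameter, but a base LR must still be chosen
- **Grid search / tuning**: Try several LRs, typically spaced on a log scale, and pick the best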
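
To make the idea of a learning rate schedule concrete, here is a minimal sketch of exponential decay applied to this lab's test function. The function names, the initial rate of 1.9, and the decay factor of 0.9 are assumptions chosen for illustration; real training code would normally use the scheduler provided by its framework.

```python
def exponential_decay(initial_lr, decay_rate, step):
    """Learning rate at a given step: initial_lr * decay_rate**step."""
    return initial_lr * decay_rate**step

def gd_with_schedule(x_start, initial_lr, decay_rate, n_steps):
    """Run GD on f(x) = 0.5 * x**2 (gradient = x) with a decaying learning rate."""
    x = x_start
    path = [x]
    for step in range(n_steps):
        lr = exponential_decay(initial_lr, decay_rate, step)
        x = x - lr * x                     # GD update with the current rate
        path.append(x)
    return path

path = gd_with_schedule(x_start=10.0, initial_lr=1.9, decay_rate=0.9, n_steps=30)
print(f"first steps: {[round(x, 3) for x in path[:5]]}")
print(f"final x: {path[-1]:.2e}")
```

The early steps overshoot and flip sign, then settle as the rate shrinks. One caveat: if the rate decays too quickly, the steps can become tiny before the minimum is reached, so the decay factor is itself something to tune.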

## Next Steps

1. **Answer Q10, Q11, Q12** on your answer sheet
2. **Return to the LMS** and continue to Module 4
3. In Module 4, you'll see GD's limitations with local optima on the mountain landscape!