# 📈 Week 4, Day 2: Calculus & Optimization

**🎯 Goal:** Master the math that makes AI learn - derivatives, gradients, and optimization

**⏱️ Time:** 60-90 minutes

**🌟 Why This Matters for AI:**
- **Training IS optimization** - Finding best model parameters
- **Backpropagation** = Chain rule of calculus
- **Gradient descent** = Follow the slope downhill
- **Learning rate** = How big a step to take
- **Every AI model** uses these concepts!

---

## 🔥 2024-2025 AI Trend Alert!

**Large Language Model Training**:
- GPT-4: reportedly around 1.8 trillion parameters (an unconfirmed estimate; OpenAI has not published the figure)
- **Gradient descent in a space with billions to trillions of dimensions!**
- Advanced optimizers: Adam, AdamW, Lion

**Fine-tuning Revolution**:
- LoRA, QLoRA = Efficient fine-tuning that optimizes only a small set of added low-rank weights
- **Understanding gradients = Understanding fine-tuning!**

**Scaling Laws**:
- Loss curves empirically follow power laws
- **Fitting those curves helps estimate how much data/compute you need!**

**You'll learn the core idea behind how ChatGPT, Claude, and Gemini were trained!** 🚀

---

## 📊 Why Calculus?

**Calculus** = Math of change and optimization

Think of it as:
- Algebra: Fixed values (y = 2x + 1) 📏
- Calculus: Rates of change (how fast is y changing?) 📈

**Real AI example:**
```python
# Training loop (simplified)
for epoch in range(1000):
    loss = compute_loss(model, data)
    gradient = compute_gradient(loss)  # ← CALCULUS!
    weights = weights - learning_rate * gradient  # ← OPTIMIZATION!
```

Let's understand the magic! ✨

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

print("NumPy version:", np.__version__)
print("✅ Ready to learn how AI learns!")

## 📐 Derivatives - Rate of Change

### What is a Derivative?

**Derivative** = Slope of a function at a point = "How fast is it changing?"

**Notation:**
- f'(x) = derivative of f with respect to x
- df/dx = same thing, different notation

**In AI:**
- How much does loss change when we change a weight?
- Which direction should we update parameters?
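
One way to make "slope at a point" concrete is to estimate it numerically: nudge x a little in each direction and see how much f changes. The sketch below assumes a hand-picked step size `h` and uses a hypothetical helper name, `numerical_derivative`; it is a sanity check, not how frameworks compute derivatives.

```python
import numpy as np

def numerical_derivative(f, x, h=1e-5):
    """Central-difference estimate of f'(x); h is a small, hand-picked step."""
    return (f(x + h) - f(x - h)) / (2 * h)

def square(x):
    return x ** 2

x0 = 1.5
approx = numerical_derivative(square, x0)
exact = 2 * x0  # analytic derivative of x² is 2x
print(f"numerical: {approx:.6f}, analytic: {exact:.6f}")
```

The two numbers should agree to several decimal places, which is a handy way to double-check any derivative you work out by hand.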

In [None]:
# Visualize derivative as slope
x = np.linspace(-3, 3, 100)
y = x**2  # Function: f(x) = x²

# Pick a point
x0 = 1.5
y0 = x0**2
slope = 2 * x0  # Derivative: f'(x) = 2x

# Tangent line
tangent_y = slope * (x - x0) + y0

plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', linewidth=2, label='f(x) = x²')
plt.plot(x, tangent_y, 'r--', linewidth=2, label=f'Tangent (slope = {slope:.1f})')
plt.plot(x0, y0, 'go', markersize=10, label=f'Point ({x0}, {y0:.2f})')

plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Derivative = Slope of Tangent Line', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='k', linewidth=0.5)
plt.axvline(x=0, color='k', linewidth=0.5)
plt.show()

print(f"At x = {x0}:")
print(f"  Function value: f({x0}) = {y0:.2f}")
print(f"  Derivative: f'({x0}) = {slope:.1f}")
print(f"  Meaning: At this point, y increases by {slope:.1f} for every unit increase in x")

## 🔢 Common Derivative Rules

**You don't need to memorize these: autodiff frameworks such as PyTorch compute them automatically!** (NumPy on its own does not differentiate for you.)

But understanding helps!

In [None]:
# Common derivatives
print("📐 COMMON DERIVATIVE RULES\n")
print("=" * 60)

rules = [
    ("Constant", "f(x) = c", "f'(x) = 0"),
    ("Linear", "f(x) = x", "f'(x) = 1"),
    ("Power", "f(x) = x²", "f'(x) = 2x"),
    ("Power (general)", "f(x) = xⁿ", "f'(x) = n·xⁿ⁻¹"),
    ("Exponential", "f(x) = eˣ", "f'(x) = eˣ"),
    ("Natural log", "f(x) = ln(x)", "f'(x) = 1/x"),
    ("Sum", "f(x) = g(x) + h(x)", "f'(x) = g'(x) + h'(x)"),
    ("Product", "f(x) = g(x)·h(x)", "f'(x) = g'(x)·h(x) + g(x)·h'(x)"),
    ("Chain rule", "f(g(x))", "f'(g(x))·g'(x)")
]

for name, func, deriv in rules:
    print(f"{name:20} {func:20} → {deriv}")

print("\n" + "=" * 60)
print("🧠 Chain rule is CRITICAL for backpropagation!")
print("   Neural networks = nested functions")
print("   Chain rule = how to compute gradients through layers")
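
To see the chain rule at work on a concrete nested function, here is a small sketch that differentiates exp(x²) two ways: with the chain rule and with a finite-difference estimate. The function names (`g`, `f_of_g`, `chain_rule_derivative`) and the test point are illustrative choices, not part of the lesson's code.

```python
import numpy as np

def g(x):
    """Inner function: g(x) = x²."""
    return x ** 2

def f_of_g(x):
    """Composite function: exp(g(x)) = exp(x²)."""
    return np.exp(g(x))

def chain_rule_derivative(x):
    """Outer derivative at g(x) times inner derivative: exp(x²) * 2x."""
    return np.exp(g(x)) * 2 * x

x0 = 0.7
h = 1e-6
numeric = (f_of_g(x0 + h) - f_of_g(x0 - h)) / (2 * h)
analytic = chain_rule_derivative(x0)
print(f"chain rule: {analytic:.6f}, finite difference: {numeric:.6f}")
```

A neural network is the same idea with more layers of nesting, so the product of "outer derivative × inner derivative" simply gets longer.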

## 📊 Partial Derivatives & Gradients

**Partial derivative** = Derivative with respect to ONE variable (others held constant)

**Gradient** = Vector of all partial derivatives

**In AI:** Functions have MILLIONS of parameters!
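
The definitions above can be checked numerically: nudge one variable at a time while holding the others fixed, and stack the results into a vector. This is a minimal sketch on the bowl function x² + y²; the helper name `numerical_gradient` and the step size `h` are assumptions made for illustration.

```python
import numpy as np

def bowl(p):
    """f(x, y) = x² + y², taking the point as one array p = [x, y]."""
    return np.sum(p ** 2)

def numerical_gradient(f, p, h=1e-5):
    """Estimate each partial derivative by nudging one coordinate at a time."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h  # change only coordinate i, hold the others constant
        grad[i] = (f(p + step) - f(p - step)) / (2 * h)
    return grad

point = np.array([1.0, -2.0])
print("numerical gradient:", numerical_gradient(bowl, point))
print("analytic gradient: ", 2 * point)  # [2x, 2y]
```

With millions of parameters this one-at-a-time approach is far too slow, which is part of why backpropagation matters.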

In [None]:
# Function with 2 variables: f(x,y) = x² + y² (bowl shape)
def f(x, y):
    return x**2 + y**2

# Partial derivatives
def df_dx(x, y):
    return 2*x  # ∂f/∂x

def df_dy(x, y):
    return 2*y  # ∂f/∂y

# Gradient vector
def gradient(x, y):
    return np.array([df_dx(x, y), df_dy(x, y)])

# Visualize the function
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

fig = plt.figure(figsize=(14, 5))

# 3D surface
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(X, Y, Z, cmap='viridis', alpha=0.8)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('f(x,y)')
ax1.set_title('f(x,y) = x² + y² (3D View)', fontweight='bold')

# Contour plot with gradient vectors
ax2 = fig.add_subplot(122)
contour = ax2.contour(X, Y, Z, levels=20, cmap='viridis')
ax2.clabel(contour, inline=True, fontsize=8)

# Plot NEGATIVE gradient vectors (the descent direction) at several points
points = [(-2, -2), (-2, 2), (2, -2), (2, 2), (-1, 0), (0, 1.5)]
for px, py in points:
    grad = gradient(px, py)
    # Arrows are scaled by 0.2 and flipped in sign so they point downhill
    ax2.arrow(px, py, -grad[0]*0.2, -grad[1]*0.2, 
             head_width=0.15, head_length=0.1, fc='red', ec='red')
    ax2.plot(px, py, 'ro', markersize=8)

ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Contour Plot with Negative Gradient Vectors (red arrows)', fontweight='bold')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("📊 Key insights:")
print("  - Gradient points in direction of STEEPEST ASCENT")
print("  - Negative gradient (the red arrows) points in direction of STEEPEST DESCENT, here straight toward the MINIMUM")
print("  - At minimum (0,0), gradient = [0,0]")
print("\n🧠 This is the foundation of gradient descent!")

## 🎯 Gradient Descent - How AI Learns!

**Algorithm:**
1. Start with random parameters
2. Compute loss (how wrong is the model?)
3. Compute gradient (which direction to improve?)
4. Update parameters: `w = w - learning_rate × gradient`
5. Repeat until converged!

**This is LITERALLY how neural networks train!**
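
The steps above can be sketched as one small reusable function. This version works on any number of parameters at once and is tried here on the two-variable bowl x² + y²; the function name `gradient_descent`, the starting point, and the step count are illustrative assumptions. It skips the explicit loss computation (step 2) because the update only needs the gradient.

```python
import numpy as np

def gradient_descent(grad_fn, start, learning_rate=0.1, num_steps=50):
    """Repeat the update w = w - learning_rate * gradient and return the path."""
    w = np.asarray(start, dtype=float)
    path = [w.copy()]
    for _ in range(num_steps):
        w = w - learning_rate * grad_fn(w)  # the update rule from step 4
        path.append(w.copy())
    return np.array(path)

def bowl_gradient(w):
    """Gradient of f(x, y) = x² + y² is [2x, 2y]."""
    return 2 * w

path = gradient_descent(bowl_gradient, start=[2.0, -3.0])
print("start:", path[0])
print("end:  ", path[-1])
```

Both coordinates shrink toward 0 together, because the same update is applied to every parameter at every step.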

In [None]:
# Gradient descent on f(x) = x² + 10
def f(x):
    return x**2 + 10

def df(x):
    return 2*x

# Gradient descent
x = 5.0  # Start far from minimum
learning_rate = 0.1
num_iterations = 20

history = {'x': [x], 'f': [f(x)]}

print("🚀 GRADIENT DESCENT IN ACTION\n")
print("=" * 60)
print(f"Starting at x = {x}, f(x) = {f(x):.2f}")
print(f"Learning rate = {learning_rate}")
print(f"\nIteration | x       | f(x)    | gradient used for this step")
print("-" * 60)

for i in range(num_iterations):
    gradient = df(x)
    x = x - learning_rate * gradient  # UPDATE RULE!
    
    history['x'].append(x)
    history['f'].append(f(x))
    
    if i < 10 or i == num_iterations - 1:  # Print first 10 and last
        print(f"{i+1:9} | {x:7.4f} | {f(x):7.4f} | {gradient:7.4f}")

print("=" * 60)
print(f"\n✅ Converged to x = {x:.6f}, f(x) = {f(x):.6f}")
print(f"   True minimum: x = 0, f(x) = 10")

In [None]:
# Visualize gradient descent path
x_range = np.linspace(-6, 6, 100)
y_range = f(x_range)

plt.figure(figsize=(12, 6))

# Plot function
plt.plot(x_range, y_range, 'b-', linewidth=2, label='f(x) = x² + 10')

# Plot gradient descent path
plt.plot(history['x'], history['f'], 'ro-', markersize=8, linewidth=2, 
        label='Gradient Descent Path', alpha=0.7)

# Highlight start and end
plt.plot(history['x'][0], history['f'][0], 'g*', markersize=20, label='Start')
plt.plot(history['x'][-1], history['f'][-1], 'r*', markersize=20, label='End')

plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('Gradient Descent: Finding the Minimum', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("📊 Notice how:")
print("  - Steps are LARGE when far from minimum (steep slope)")
print("  - Steps get SMALLER as we approach minimum (flatter slope)")
print("  - The step size shrinks automatically because it is proportional to the slope!")

## 🎚️ Learning Rate - The Most Important Hyperparameter!

In [None]:
# Compare different learning rates
learning_rates = [0.01, 0.1, 0.5, 1.0]

plt.figure(figsize=(14, 10))

for idx, lr in enumerate(learning_rates, 1):
    plt.subplot(2, 2, idx)
    
    # Run gradient descent
    x = 5.0
    history = {'x': [x], 'f': [f(x)]}
    
    for i in range(20):
        gradient = df(x)
        x = x - lr * gradient
        history['x'].append(x)
        history['f'].append(f(x))
    
    # Plot
    x_range = np.linspace(-6, 6, 100)
    plt.plot(x_range, f(x_range), 'b-', linewidth=2, alpha=0.5)
    plt.plot(history['x'], history['f'], 'ro-', markersize=6)
    plt.plot(history['x'][0], history['f'][0], 'g*', markersize=15)
    plt.plot(history['x'][-1], history['f'][-1], 'r*', markersize=15)
    
    plt.title(f'Learning Rate = {lr}', fontweight='bold')
    plt.xlabel('x')
    plt.ylabel('f(x)')
    plt.grid(True, alpha=0.3)
    plt.ylim(0, 40)

plt.tight_layout()
plt.show()

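print("🎚️ Learning Rate Effects (for f(x) = x² + 10, where each step multiplies x by 1 - 2*lr):\n")
print("  Too small (0.01): Slow convergence, x is still about 3.3 after 20 steps")
print("  Good (0.1): Fast, stable convergence")
print("  Special case (0.5): Lands exactly on the minimum in one step for this quadratic")
print("  Too large (1.0): Bounces between x = 5 and x = -5 forever, never converging")
print("  Anything above 1.0 would diverge: each step lands farther from the minimum")
print("\n🧠 This is why tuning learning rate is crucial for AI training!")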
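
For this particular function the behaviour can be worked out exactly. Since f'(x) = 2x, the update x = x - lr · 2x is the same as multiplying x by (1 - 2·lr) at every step. The sketch below prints that factor alongside the result of 20 updates; the helper name `run_updates` and the extra value 1.1 (not among the four plotted rates) are added here for illustration.

```python
import numpy as np

def run_updates(lr, x_start=5.0, num_steps=20):
    """Apply x = x - lr * f'(x) with f'(x) = 2x and return the final x."""
    x = x_start
    for _ in range(num_steps):
        x = x - lr * 2 * x
    return x

for lr in [0.01, 0.1, 0.5, 1.0, 1.1]:
    factor = 1 - 2 * lr  # each step multiplies x by this number
    final_x = run_updates(lr)
    print(f"lr={lr:<4}  factor={factor:+.2f}  x after 20 steps={final_x:.4f}")
```

A factor between -1 and 1 shrinks x toward the minimum, a factor of exactly -1 flips the sign forever, and a factor below -1 grows without bound. Real loss surfaces are not this tidy, but the same trade-off shows up in practice.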

## 🧠 Real Example: Training a Simple Linear Model

In [None]:
# Generate synthetic data: y = 3x + 2 + noise
np.random.seed(42)
X_train = np.random.randn(100, 1) * 2
y_train = 3 * X_train + 2 + np.random.randn(100, 1) * 0.5

# Visualize data
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, alpha=0.5)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Training Data (y = 3x + 2 + noise)', fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

print("🎯 Goal: Learn a line y = w*x + b that fits this data")
print("   True values: w = 3, b = 2")
print("   We'll use gradient descent to find them!")

In [None]:
# Linear regression with gradient descent
print("🚀 TRAINING LINEAR MODEL WITH GRADIENT DESCENT\n")
print("=" * 60)

# Initialize parameters randomly
w = np.random.randn()
b = np.random.randn()
print(f"Initial parameters: w = {w:.3f}, b = {b:.3f}")

# Hyperparameters
learning_rate = 0.01
epochs = 100

# Track history
loss_history = []
w_history = [w]
b_history = [b]

# Training loop
for epoch in range(epochs):
    # Forward pass: predictions
    y_pred = w * X_train + b
    
    # Compute loss (Mean Squared Error)
    loss = np.mean((y_pred - y_train)**2)
    loss_history.append(loss)
    
    # Compute gradients (partial derivatives)
    dw = (2/len(X_train)) * np.sum((y_pred - y_train) * X_train)
    db = (2/len(X_train)) * np.sum(y_pred - y_train)
    
    # Update parameters (gradient descent!)
    w = w - learning_rate * dw
    b = b - learning_rate * db
    
    w_history.append(w)
    b_history.append(b)
    
    # Print progress
    if epoch % 20 == 0 or epoch == epochs - 1:
        print(f"Epoch {epoch:3d}: Loss = {loss:.4f}, w = {w:.3f}, b = {b:.3f}")

print("=" * 60)
print(f"\n✅ Final parameters: w = {w:.3f}, b = {b:.3f}")
print(f"   True parameters:  w = 3.000, b = 2.000")
print(f"   Error: w_error = {abs(w-3):.3f}, b_error = {abs(b-2):.3f}")
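
For a linear model there is also an exact least-squares answer, which makes a useful reference for judging how close gradient descent got. The sketch below assumes the same seed and data recipe as the cells above and uses `np.linalg.lstsq`; the names `A`, `w_exact`, and `b_exact` are introduced here for illustration.

```python
import numpy as np

# Recreate the same synthetic data as above (same seed, same formula).
np.random.seed(42)
X_train = np.random.randn(100, 1) * 2
y_train = 3 * X_train + 2 + np.random.randn(100, 1) * 0.5

# Design matrix with a column of ones so the intercept b is fitted too.
A = np.hstack([X_train, np.ones_like(X_train)])
solution, *_ = np.linalg.lstsq(A, y_train, rcond=None)
w_exact, b_exact = solution.ravel()

print(f"Least-squares fit: w = {w_exact:.3f}, b = {b_exact:.3f}")
```

If the gradient descent values are still some distance from these, more epochs or a larger learning rate would likely close the gap. The exact fit will not be precisely w = 3, b = 2 either, because of the noise in the data.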

In [None]:
# Visualize training results
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Plot 1: Loss curve
axes[0].plot(loss_history, linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss (MSE)')
axes[0].set_title('Loss Curve (Training Progress)', fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Plot 2: Parameter evolution
axes[1].plot(w_history, label='w (slope)', linewidth=2)
axes[1].plot(b_history, label='b (intercept)', linewidth=2)
axes[1].axhline(y=3, color='blue', linestyle='--', alpha=0.5, label='True w')
axes[1].axhline(y=2, color='orange', linestyle='--', alpha=0.5, label='True b')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Parameter Value')
axes[1].set_title('Parameter Convergence', fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Plot 3: Final fit
axes[2].scatter(X_train, y_train, alpha=0.5, label='Data')
x_line = np.linspace(X_train.min(), X_train.max(), 100)
y_line = w * x_line + b
axes[2].plot(x_line, y_line, 'r-', linewidth=2, label=f'Learned: y={w:.2f}x+{b:.2f}')
axes[2].set_xlabel('X')
axes[2].set_ylabel('y')
axes[2].set_title('Final Model Fit', fontweight='bold')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n🎉 The model learned the relationship from data!")
print("   Neural networks train with this same update rule, just with many more parameters!")

## 🔄 Backpropagation - Chain Rule in Action

In [None]:
# Simple 2-layer neural network (forward & backward pass)
print("🧠 BACKPROPAGATION EXAMPLE (2-Layer Network)\n")
print("=" * 60)

# Input
x = np.array([[0.5, 0.8]])
y_true = np.array([[0.9]])

# Initialize weights (small random values)
np.random.seed(42)
W1 = np.random.randn(2, 3) * 0.1  # Layer 1: 2 → 3
b1 = np.zeros((1, 3))
W2 = np.random.randn(3, 1) * 0.1  # Layer 2: 3 → 1
b2 = np.zeros((1, 1))

# Activation function (sigmoid)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

# FORWARD PASS
print("FORWARD PASS:")
z1 = x @ W1 + b1
a1 = sigmoid(z1)
print(f"  Layer 1: z1 = x @ W1 + b1")
print(f"  Layer 1 output (a1): {a1}")

z2 = a1 @ W2 + b2
a2 = sigmoid(z2)
print(f"  Layer 2: z2 = a1 @ W2 + b2")
print(f"  Final output (a2): {a2}")
print(f"  True value: {y_true}")

# Loss
loss = 0.5 * (a2 - y_true)**2
print(f"  Loss: {loss[0,0]:.6f}\n")

# BACKWARD PASS (Backpropagation!)
print("BACKWARD PASS (computing gradients):")

# Output layer gradients
dL_da2 = a2 - y_true  # Derivative of loss w.r.t. output
dL_dz2 = dL_da2 * sigmoid_derivative(z2)  # Chain rule!
dL_dW2 = a1.T @ dL_dz2  # Gradient for W2
dL_db2 = np.sum(dL_dz2, axis=0, keepdims=True)  # Gradient for b2

print(f"  ∂L/∂W2 shape: {dL_dW2.shape}")
print(f"  ∂L/∂b2 shape: {dL_db2.shape}")

# Hidden layer gradients
dL_da1 = dL_dz2 @ W2.T  # Backpropagate through W2
dL_dz1 = dL_da1 * sigmoid_derivative(z1)  # Chain rule!
dL_dW1 = x.T @ dL_dz1  # Gradient for W1
dL_db1 = np.sum(dL_dz1, axis=0, keepdims=True)  # Gradient for b1

print(f"  ∂L/∂W1 shape: {dL_dW1.shape}")
print(f"  ∂L/∂b1 shape: {dL_db1.shape}")

print("\n" + "=" * 60)
print("✨ This is backpropagation!")
print("   - Forward: Compute outputs")
print("   - Backward: Compute gradients (chain rule)")
print("   - Update: weights = weights - lr * gradients")
print("\n🔥 PyTorch/TensorFlow do this automatically!")
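
A common way to gain confidence in hand-written backpropagation is a gradient check: compare the chain-rule gradient with a finite-difference estimate. The sketch below rebuilds the same 2 → 3 → 1 sigmoid network and loss, but with its own random weights (the seed 0 and the helper `forward_loss` are assumptions for this example), and checks only the W2 gradient to keep it short.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_loss(x, y_true, W1, b1, W2, b2):
    """Same network and loss as above: sigmoid layers, L = 0.5 * (a2 - y)²."""
    a1 = sigmoid(x @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    return float(0.5 * np.sum((a2 - y_true) ** 2))

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
x = np.array([[0.5, 0.8]])
y_true = np.array([[0.9]])
W1, b1 = rng.normal(size=(2, 3)) * 0.1, np.zeros((1, 3))
W2, b2 = rng.normal(size=(3, 1)) * 0.1, np.zeros((1, 1))

# Analytic gradient for W2 via the chain rule
a1 = sigmoid(x @ W1 + b1)
a2 = sigmoid(a1 @ W2 + b2)
dL_dz2 = (a2 - y_true) * a2 * (1 - a2)  # sigmoid'(z2) = a2 * (1 - a2)
dL_dW2 = a1.T @ dL_dz2

# Numerical gradient: nudge one weight at a time
h = 1e-5
numeric = np.zeros_like(W2)
for i in range(W2.shape[0]):
    W_plus, W_minus = W2.copy(), W2.copy()
    W_plus[i, 0] += h
    W_minus[i, 0] -= h
    numeric[i, 0] = (forward_loss(x, y_true, W1, b1, W_plus, b2)
                     - forward_loss(x, y_true, W1, b1, W_minus, b2)) / (2 * h)

print("max difference:", np.max(np.abs(numeric - dL_dW2)))
```

A tiny maximum difference suggests the backward pass is correct; the same loop can be repeated for W1, b1, and b2.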

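## 🎯 MINI CHALLENGE: Train a Neural Network!

In [None]:
# Reference solution: a complete training loop. Try writing it yourself first!
# (Reuses sigmoid and sigmoid_derivative from the backpropagation cell above.)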

# Generate XOR dataset (classic problem linear models can't solve!)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])  # XOR truth table

print("🎯 XOR Problem (Requires Non-Linear Model)\n")
print("Input → Output")
for i in range(len(X)):
    print(f"{X[i]} → {y[i][0]}")

# Initialize network (2 → 4 → 1)
np.random.seed(42)
W1 = np.random.randn(2, 4) * 0.5
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1) * 0.5
b2 = np.zeros((1, 1))

# Hyperparameters
learning_rate = 0.5
epochs = 10000

# Training loop
loss_history = []

for epoch in range(epochs):
    # Forward pass
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)
    
    # Compute loss (mean squared error)
    loss = np.mean((a2 - y)**2)
    loss_history.append(loss)
    
    # Backward pass (backpropagation)
    dL_da2 = 2 * (a2 - y) / len(X)
    dL_dz2 = dL_da2 * sigmoid_derivative(z2)
    dL_dW2 = a1.T @ dL_dz2
    dL_db2 = np.sum(dL_dz2, axis=0, keepdims=True)
    
    dL_da1 = dL_dz2 @ W2.T
    dL_dz1 = dL_da1 * sigmoid_derivative(z1)
    dL_dW1 = X.T @ dL_dz1
    dL_db1 = np.sum(dL_dz1, axis=0, keepdims=True)
    
    # Update weights (gradient descent step)
    W2 -= learning_rate * dL_dW2
    b2 -= learning_rate * dL_db2
    W1 -= learning_rate * dL_dW1
    b1 -= learning_rate * dL_db1
    
    # Print progress
    if epoch % 2000 == 0:
        print(f"Epoch {epoch:5d}: Loss = {loss:.6f}")

# Test the trained network
print("\n" + "=" * 60)
print("✅ TRAINED NETWORK PREDICTIONS:\n")
print("Input    | True | Predicted | Rounded")
print("-" * 60)
for i in range(len(X)):
    z1 = X[i:i+1] @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    pred = sigmoid(z2)
    print(f"{X[i]}  |  {y[i][0]}   |   {pred[0,0]:.4f}   |    {round(pred[0,0])}")

print("\n🎉 If every Rounded value matches the True column, the network learned XOR!")
print("   This requires a non-linear activation (sigmoid here)!")

In [None]:
# Visualize training
plt.figure(figsize=(10, 6))
plt.plot(loss_history, linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss (MSE)', fontsize=12)
plt.title('Neural Network Training: XOR Problem', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.yscale('log')  # Log scale to see details
plt.show()

print("📊 Notice:")
print("  - Loss decreases rapidly at first")
print("  - Then slows down (diminishing returns)")
print("  - This is typical of neural network training!")

## 🎉 Congratulations!

**You just learned:**
- ✅ Derivatives (rate of change, slopes)
- ✅ Partial derivatives & gradients
- ✅ Gradient descent (THE optimization algorithm)
- ✅ Learning rate (most important hyperparameter)
- ✅ Backpropagation (chain rule for neural networks)
- ✅ Trained a linear model from scratch
- ✅ Built and trained a neural network for XOR!

**🎯 Calculus for AI Cheat Sheet:**
```python
# Gradient descent (pseudocode: model, compute_loss, and compute_gradients are placeholders)
for epoch in range(num_epochs):
    # Forward pass
    predictions = model(X)
    loss = compute_loss(predictions, y_true)
    
    # Backward pass (compute gradients)
    gradients = compute_gradients(loss)
    
    # Update (gradient descent step)
    weights = weights - learning_rate * gradients

# Key concepts
# derivative    = slope at a point
# gradient      = vector of partial derivatives
# learning_rate = step size (tune carefully!)
# chain_rule    = backpropagation through layers
```

**🧠 Key Insights:**
- Training = Optimization = Finding minimum loss
- Gradient points toward steepest increase
- Negative gradient points toward steepest decrease (downhill, toward a minimum)
- Learning rate controls step size
- Backpropagation = chain rule applied layer by layer

**🎯 Practice Exercise:**

Train a 3-layer neural network:
1. Architecture: 2 → 8 → 4 → 1
2. Dataset: Generate non-linear classification data
3. Use ReLU activation (try: `np.maximum(0, x)`, which works on whole arrays)
4. Implement Adam optimizer (advanced gradient descent)
5. Plot decision boundary
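
As a possible starting point for steps 3 and 4, here is a minimal sketch of ReLU with its derivative and of a single Adam update, tried on the one-variable function f(w) = w² + 10 from earlier. The default values for `beta1`, `beta2`, and `eps` are the commonly used ones; the learning rate of 0.1, the 500-step loop, and the helper names are assumptions for this toy problem, not tuned recommendations.

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z), applied elementwise."""
    return np.maximum(0, z)

def relu_derivative(z):
    """Slope of ReLU: 1 where z > 0, otherwise 0."""
    return (z > 0).astype(float)

def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running averages of grad and grad²."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Try it on f(w) = w² + 10, whose gradient is 2w
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(f"w after 500 Adam steps: {w:.4f}")
```

For the full exercise, each weight matrix and bias would need its own `m` and `v` arrays of the same shape, all updated with the same step counter `t`.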

---

**📚 Next Lesson:** Day 3 - Probability & Statistics (Understanding Uncertainty!)

**💡 Fun Fact:** 
- GPT-4 training: gradient descent over a reported (unconfirmed) 1.8 trillion parameters!
- Took months of compute on thousands of GPUs
- Cost: Estimated $100+ million
- All using the same principles you just learned!

---

*You now understand the core principles behind how ChatGPT, Claude, and Gemini learn!* 🚀