# Lab 4, Module 2: Training a Neural Network

## Learning to Learn: Automatic Weight Optimization with Gradient Descent

**Estimated time: 15-20 minutes**

---

## Section 1: Why Train Automatically? (2 min)

In Module 1, you manually adjusted **9 sliders** to find weights that could separate XOR data:
- 2 hidden neurons × (2 weights + 1 bias) = 6 parameters
- 1 output neuron × (2 weights + 1 bias) = 3 parameters
- **Total: 9 parameters to tune by hand!**

You discovered that finding the right weights is HARD:
- A perfect solution exists: `H1=(-10,-10,-10), H2=(-10,-10,5), Out=(-10,10,-5)`
- But finding it manually required lots of trial and error
- Most random combinations gave terrible results

### The Problem Gets Worse Fast

Imagine if we had:
- **10 hidden neurons**: 10 × (2 + 1) + (10 + 1) = 41 parameters
- **100 hidden neurons**: 100 × (2 + 1) + (100 + 1) = 401 parameters
- **Modern deep networks**: millions of parameters!

Manual tuning becomes impossible. We need an **automatic** way to find good weights.
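
The parameter count for a network with one hidden layer follows a single formula: each hidden neuron has one weight per input plus a bias, and each output neuron has one weight per hidden neuron plus a bias. Here is a minimal sketch; the helper name `count_parameters` is made up for illustration and is not part of the lab code:

```python
def count_parameters(n_inputs, n_hidden, n_outputs=1):
    """Weights + biases of a fully connected net with one hidden layer."""
    hidden_params = n_hidden * (n_inputs + 1)    # weights + bias per hidden neuron
    output_params = n_outputs * (n_hidden + 1)   # weights + bias per output neuron
    return hidden_params + output_params

for n_hidden in (2, 10, 100):
    print(f"{n_hidden:>3} hidden neurons -> {count_parameters(2, n_hidden)} parameters")
```

For the 2-2-1 network from Module 1 this gives the 9 sliders you tuned by hand.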

### Enter: Gradient Descent

Think of the loss function as a landscape:
- **High points** = bad weights (high error)
- **Low points** = good weights (low error)
- **Goal**: Roll downhill to find the lowest point

Gradient descent is like **rolling a ball downhill** in this loss landscape:
1. Start at a random location (random weights)
2. Look around and find which direction is steepest downward (compute gradients)
3. Take a small step in that direction (update weights)
4. Repeat until you reach a valley bottom (converged!)

**The key insight**: Even though the code looks complex (calculating derivatives, chain rule, etc.), it's just these 4 steps over and over. The math handles **9 weights simultaneously**, but the idea is simple: always move downhill!
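
To make the four steps concrete, here is a toy sketch with a single weight instead of nine. The loss `(w - 3)^2`, the starting point, and the learning rate are all assumptions chosen for illustration; they are not the lab's loss or settings:

```python
def loss(w):
    return (w - 3.0) ** 2        # toy loss with its minimum at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)       # derivative of the toy loss

w = -4.0                         # step 1: start somewhere (fixed here for repeatability)
learning_rate = 0.1

for step in range(50):
    g = gradient(w)              # step 2: find the downhill direction
    w -= learning_rate * g       # step 3: take a small step
                                 # step 4: repeat

print(f"final w = {w:.4f}, final loss = {loss(w):.2e}")
```

The network code later in this module does the same thing, only with nine weights and with gradients computed by backpropagation.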

In this module, you'll **watch this process happen in real time**!

### Making Training Reliable

In this module, we'll see how gradient descent can automatically find good solutions - even for tricky problems like XOR. We'll use:
- **Smart initialization**: Trying multiple random starting points and keeping the one with the lowest initial loss
- **Momentum**: A variant of gradient descent that follows a running average of past gradients, which helps it cross plateaus and shallow local minima
- **Flexible learning rates**: Adjusting the step size to balance speed and stability

These techniques make training **robust and reliable** across different datasets!

### This Module: Compare Algorithms and Datasets

In this module, you'll:
- **Try different datasets**: See how problem difficulty affects convergence
- **Compare algorithms**: Basic gradient descent vs. gradient descent with momentum
- **Experiment with learning rates**: Find the sweet spot between speed and stability

You'll discover that **better algorithms converge more reliably** across different datasets!


---

## Section 2: Setup the Training System (1 min)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from ipywidgets import Button, Output, VBox, HBox, Layout, HTML, Checkbox
from IPython.display import display, clear_output

# Sigmoid activation function
def sigmoid(z):
    """Sigmoid with clipping to prevent overflow."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

In [None]:
def create_xor_dataset(dataset_type='clean', n_per_cluster=25, noise_std=0.2, seed=42):
    """
    Create XOR-style datasets with different difficulty levels.

    Parameters:
    - dataset_type: 'corner', 'corner_noisy', 'clean', 'noisy', or 'perfect'
    - n_per_cluster: number of samples per cluster (default 25)
    - noise_std: standard deviation of Gaussian noise (default 0.2)
    - seed: random seed for reproducibility

    Returns:
    - X: features (n_samples, 2)
    - y: labels (n_samples,)
    """
    np.random.seed(seed)

    if dataset_type == 'corner':
        # One corner vs other three corners (easier - linearly separable)
        corners = np.array([
            [-1.0, -1.0],  # BL - Class 0
            [-1.0, 1.0],   # TL - Class 1
            [1.0, -1.0],   # BR - Class 1
            [1.0, 1.0],    # TR - Class 1
        ])
        labels = np.array([0, 1, 1, 1])
        X = np.repeat(corners, n_per_cluster, axis=0)
        y = np.repeat(labels, n_per_cluster)
        X = X + np.random.randn(len(X), 2) * noise_std

    elif dataset_type == 'corner_noisy':
        # One corner vs three with more noise
        corners = np.array([
            [-1.0, -1.0],  # BL - Class 0
            [-1.0, 1.0],   # TL - Class 1
            [1.0, -1.0],   # BR - Class 1
            [1.0, 1.0],    # TR - Class 1
        ])
        labels = np.array([0, 1, 1, 1])
        X = np.repeat(corners, n_per_cluster, axis=0)
        y = np.repeat(labels, n_per_cluster)
        X = X + np.random.randn(len(X), 2) * (noise_std * 1.5)  # More noise

    elif dataset_type == 'clean':
        # Standard XOR (moderate difficulty)
        corners = np.array([
            [-1.0, -1.0],  # BL - Class 0
            [1.0, 1.0],    # TR - Class 0
            [-1.0, 1.0],   # TL - Class 1
            [1.0, -1.0],   # BR - Class 1
        ])
        labels = np.array([0, 0, 1, 1])
        X = np.repeat(corners, n_per_cluster, axis=0)
        y = np.repeat(labels, n_per_cluster)
        X = X + np.random.randn(len(X), 2) * noise_std

    elif dataset_type == 'noisy':
        # XOR with more overlap (harder)
        corners = np.array([
            [-1.0, -1.0],  # BL - Class 0
            [1.0, 1.0],    # TR - Class 0
            [-1.0, 1.0],   # TL - Class 1
            [1.0, -1.0],   # BR - Class 1
        ])
        labels = np.array([0, 0, 1, 1])
        X = np.repeat(corners, n_per_cluster, axis=0)
        y = np.repeat(labels, n_per_cluster)
        X = X + np.random.randn(len(X), 2) * (noise_std * 2.0)  # Much more noise

    elif dataset_type == 'perfect':
        # Minimal noise XOR (easiest)
        corners = np.array([
            [-1.0, -1.0],  # BL - Class 0
            [1.0, 1.0],    # TR - Class 0
            [-1.0, 1.0],   # TL - Class 1
            [1.0, -1.0],   # BR - Class 1
        ])
        labels = np.array([0, 0, 1, 1])
        X = np.repeat(corners, n_per_cluster, axis=0)
        y = np.repeat(labels, n_per_cluster)
        X = X + np.random.randn(len(X), 2) * 0.05  # Minimal noise

    else:
        raise ValueError(f"Unknown dataset_type: {dataset_type}")

    return X, y

# Create initial dataset (will be updated by dropdown)
current_dataset_type = 'clean'
X_train, y_train = create_xor_dataset(dataset_type=current_dataset_type)

print(f"Dataset created: {len(X_train)} samples")
print(f"Class 0: {np.sum(y_train==0)} samples")
print(f"Class 1: {np.sum(y_train==1)} samples")
print(f"Current dataset: {current_dataset_type}")

# Visualize
plt.figure(figsize=(6, 6))
for label in [0, 1]:
    mask = y_train == label
    color = 'blue' if label == 0 else 'red'
    plt.scatter(X_train[mask, 0], X_train[mask, 1], c=color, s=30, alpha=0.6, edgecolors='k', linewidths=0.5)
plt.xlabel('x\u2081', fontsize=12)
plt.ylabel('x\u2082', fontsize=12)
plt.title(f'Dataset: {current_dataset_type}', fontsize=12, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.tight_layout()
plt.show()

In [3]:
def tanh(x):
    return np.tanh(x)

class TinyNetwork:
    """2-2-1 neural network for XOR classification."""
    
    '''def __init__(self, weights=None):
        if weights is None:
            self.w11 = self.w12 = self.b1 = 0.0
            self.w21 = self.w22 = self.b2 = 0.0
            self.w_out1 = self.w_out2 = self.b_out = 0.0
        else:
            self.set_weights(weights)'''
    
    
    def __init__(self, weights=None, rng=None, scale=0.5):
        if rng is None:
            rng = np.random.default_rng()
        if weights is None:
            # Small random initialization to break symmetry
            w = rng.normal(loc=0.0, scale=scale, size=9)
            self.set_weights(w)
        else:
            self.set_weights(weights)
    
    def forward(self, x1, x2):
        """Forward pass for a single input."""
        z1 = self.w11 * x1 + self.w12 * x2 + self.b1
        h1 = tanh(z1)
        z2 = self.w21 * x1 + self.w22 * x2 + self.b2
        h2 = tanh(z2)
        z_out = self.w_out1 * h1 + self.w_out2 * h2 + self.b_out
        output = sigmoid(z_out)
        return output, h1, h2
    
    def predict_batch(self, X):
        """Forward pass for batch of inputs."""
        predictions = []
        h1_vals = []
        h2_vals = []
        for x in X:
            out, h1, h2 = self.forward(x[0], x[1])
            predictions.append(out)
            h1_vals.append(h1)
            h2_vals.append(h2)
        return np.array(predictions), np.array(h1_vals), np.array(h2_vals)
    
    def get_weights(self):
        """Get all weights as a list."""
        return [self.w11, self.w12, self.b1, self.w21, self.w22,
                self.b2, self.w_out1, self.w_out2, self.b_out]
    
    def set_weights(self, weights):
        """Set all weights from a list."""
        self.w11, self.w12, self.b1, self.w21, self.w22, self.b2, \
        self.w_out1, self.w_out2, self.b_out = weights

print("TinyNetwork class ready!")

TinyNetwork class ready!


In [4]:
def compute_loss(network, X, y):
    """Binary cross-entropy loss."""
    predictions, _, _ = network.predict_batch(X)
    epsilon = 1e-10  # Prevent log(0)
    bce = -np.mean(y * np.log(predictions + epsilon) +
                   (1 - y) * np.log(1 - predictions + epsilon))
    return bce

def compute_accuracy(network, X, y):
    """Classification accuracy."""
    predictions, _, _ = network.predict_batch(X)
    pred_labels = (predictions > 0.5).astype(int)
    return np.mean(pred_labels == y)

print("Loss and accuracy functions ready!")

Loss and accuracy functions ready!


In [5]:
def compute_gradients(network, X, y):
    """Analytical backpropagation for 2-2-1 network."""
    n_samples = len(X)
    
    # Initialize gradient accumulators
    grads = {'w11': 0, 'w12': 0, 'b1': 0,
             'w21': 0, 'w22': 0, 'b2': 0,
             'w_out1': 0, 'w_out2': 0, 'b_out': 0}
    
    # Accumulate gradients over all samples
    for i in range(n_samples):
        x1, x2 = X[i]
        target = y[i]
        
        # Forward pass
        z1 = network.w11 * x1 + network.w12 * x2 + network.b1
        h1 = tanh(z1)
        z2 = network.w21 * x1 + network.w22 * x2 + network.b2
        h2 = tanh(z2)
        z_out = network.w_out1 * h1 + network.w_out2 * h2 + network.b_out
        y_pred = sigmoid(z_out)
        
        # Backpropagation
        epsilon = 1e-10
        dL_dy_pred = -(target / (y_pred + epsilon) - (1 - target) / (1 - y_pred + epsilon))
        dy_pred_dz_out = y_pred * (1 - y_pred)
        delta_out = dL_dy_pred * dy_pred_dz_out
        
        # Output layer gradients
        grads['w_out1'] += delta_out * h1
        grads['w_out2'] += delta_out * h2
        grads['b_out'] += delta_out
        
        # Hidden layer deltas
        #delta_h1 = delta_out * network.w_out1 * h1 * (1 - h1)
        #delta_h2 = delta_out * network.w_out2 * h2 * (1 - h2)
        delta_h1 = delta_out * network.w_out1 * (1 - h1**2)
        delta_h2 = delta_out * network.w_out2 * (1 - h2**2)
        
        # Hidden layer gradients
        grads['w11'] += delta_h1 * x1
        grads['w12'] += delta_h1 * x2
        grads['b1'] += delta_h1
        grads['w21'] += delta_h2 * x1
        grads['w22'] += delta_h2 * x2
        grads['b2'] += delta_h2
    
    # Average gradients
    for key in grads:
        grads[key] /= n_samples
    
    return [grads['w11'], grads['w12'], grads['b1'],
            grads['w21'], grads['w22'], grads['b2'],
            grads['w_out1'], grads['w_out2'], grads['b_out']]

print("Gradient computation (backpropagation) ready!")

Gradient computation (backpropagation) ready!


In [None]:
def initialize_network_multistart(X, y, n_trials=10, scale=0.5, verbose=True):
    """
    Try N random initializations and return the one with lowest initial loss.

    Parameters:
    - X, y: Training data
    - n_trials: Number of random initializations to try (default 10)
    - scale: Standard deviation for weight initialization (default 0.5)
    - verbose: Print progress (default True)

    Returns:
    - best_network: TinyNetwork with best initial weights
    - best_loss: Initial loss of best network
    """
    best_network = None
    best_loss = np.inf

    if verbose:
        print(f"Trying {n_trials} random initializations...")

    for trial in range(n_trials):
        # Create network with random init
        network = TinyNetwork(scale=scale)

        # Compute initial loss
        loss = compute_loss(network, X, y)

        if verbose and trial < 5:  # Show first few trials
            print(f"  Trial {trial+1}: Initial loss = {loss:.4f}")

        # Keep if best so far
        if loss < best_loss:
            best_loss = loss
            best_network = network

    if verbose:
        print(f"[OK] Best initialization: Loss = {best_loss:.4f}")

    return best_network, best_loss

print("Multi-start initialization function ready!")


In [6]:
#def train_step(network, X, y, learning_rate):
#    """Single gradient descent step."""
#    grads = compute_gradients(network, X, y)
#    weights = network.get_weights()
#    new_weights = [w - learning_rate * g for w, g in zip(weights, grads)]
#    network.set_weights(new_weights)
#    loss = compute_loss(network, X, y)
#    acc = compute_accuracy(network, X, y)
#    return loss, acc


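def train_step(network, X, y, base_lr, training_state):
    """Gradient descent step with a 3-point line search over the step size.

    training_state is unused here; it is accepted so this function has the
    same signature as train_step_momentum.
    """
    grads = compute_gradients(network, X, y)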
    w = np.array(network.get_weights(), dtype=float)
    direction = -np.array(grads)

    # Try a few candidate step sizes
    lrs = [base_lr, base_lr / 2, base_lr * 2]
    best_loss = np.inf
    best_w = w

    for eta in lrs:
        cand_w = w + eta * direction
        network.set_weights(cand_w.tolist())
        loss = compute_loss(network, X, y)
        if loss < best_loss:
            best_loss = loss
            best_w = cand_w

    # Commit to the best candidate
    network.set_weights(best_w.tolist())
    loss = best_loss
    acc = compute_accuracy(network, X, y)
    return loss, acc



print("Training step function ready!")

Training step function ready!


In [None]:
def train_step_momentum(network, X, y, base_lr, training_state):
    """Gradient descent with momentum and line search."""
    grads = compute_gradients(network, X, y)
    w = np.array(network.get_weights(), dtype=float)

    # Momentum parameter (typical value: 0.9)
    beta = 0.9

    # Initialize velocity if first iteration
    if 'velocity' not in training_state:
        training_state['velocity'] = np.zeros_like(grads)

    # Update velocity: v = beta * v + (1-beta) * gradient
    training_state['velocity'] = beta * training_state['velocity'] + (1 - beta) * np.array(grads)

    # Direction is now based on velocity instead of raw gradient
    direction = -training_state['velocity']

    # Line search: try 3 learning rates
    lrs = [base_lr, base_lr / 2, base_lr * 2]
    best_loss = np.inf
    best_w = w

    for eta in lrs:
        cand_w = w + eta * direction
        network.set_weights(cand_w.tolist())
        loss = compute_loss(network, X, y)
        if loss < best_loss:
            best_loss = loss
            best_w = cand_w

    # Commit to the best candidate
    network.set_weights(best_w.tolist())
    loss = best_loss
    acc = compute_accuracy(network, X, y)
    return loss, acc

print("Momentum training function ready!")

### How Gradient Descent Works: Simple Rules, Powerful Results

The training code above looks complex, but it follows a few **simple rules**:

**The Algorithm (in plain English):**
1. **Forward pass**: Calculate prediction from current weights
2. **Compute error**: How wrong is the prediction?
3. **Backpropagation**: Calculate which direction to adjust each weight
4. **Update weights**: Take a small step in that direction
5. **Repeat** until error is small enough

**The math behind it:**
- **Derivatives** tell us which direction makes error smaller
- **Chain rule** connects output error back to every weight
- **Learning rate** controls step size (too big = unstable, too small = slow)

Despite ~100 lines of code, it's just these 5 steps repeated!
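
One way to convince yourself that the chain-rule math is right is a gradient check: compare the backpropagation formulas against slopes estimated by nudging each weight slightly. The sketch below is a vectorized stand-in for a 2-2-1 network with tanh hidden units and a sigmoid output, not the lab's own functions. All function names and the random test data are assumptions for illustration. It uses the simplification that, for sigmoid plus cross-entropy, the output delta reduces to `prediction - target`.

```python
import numpy as np

def forward(w, X):
    """2-2-1 network: tanh hidden units, sigmoid output. w holds 9 values."""
    w11, w12, b1, w21, w22, b2, wo1, wo2, bo = w
    h1 = np.tanh(w11 * X[:, 0] + w12 * X[:, 1] + b1)
    h2 = np.tanh(w21 * X[:, 0] + w22 * X[:, 1] + b2)
    p = 1.0 / (1.0 + np.exp(-(wo1 * h1 + wo2 * h2 + bo)))
    return p, h1, h2

def bce(w, X, y):
    p, _, _ = forward(w, X)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(w, X, y):
    """Backpropagation via the chain rule, averaged over samples."""
    p, h1, h2 = forward(w, X)
    d_out = p - y                          # dL/dz_out for sigmoid + cross-entropy
    d_h1 = d_out * w[6] * (1 - h1 ** 2)    # tanh'(z) = 1 - tanh(z)^2
    d_h2 = d_out * w[7] * (1 - h2 ** 2)
    terms = [d_h1 * X[:, 0], d_h1 * X[:, 1], d_h1,
             d_h2 * X[:, 0], d_h2 * X[:, 1], d_h2,
             d_out * h1, d_out * h2, d_out]
    return np.array([np.mean(t) for t in terms])

def numeric_grad(w, X, y, eps=1e-6):
    """Central finite differences: nudge one weight at a time."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (bce(w + step, X, y) - bce(w - step, X, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (X[:, 0] * X[:, 1] < 0).astype(float)   # XOR-like labels
w = rng.normal(scale=0.5, size=9)

max_diff = np.max(np.abs(analytic_grad(w, X, y) - numeric_grad(w, X, y)))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

If the two gradients agree to many decimal places, the derivative formulas are very likely correct. A large gap usually points to a wrong activation derivative or a sign error.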

### Why Multiple Starting Points?

The loss landscape has **hills and valleys**. Gradient descent rolls downhill from wherever it starts:
- **Good starting point**: Near a deep valley → converges quickly
- **Bad starting point**: On a plateau or shallow valley → gets stuck

**Multi-start strategy**: Try 10 random starts, pick the one with lowest initial loss. This gives us the best "head start" down the hill!


---

## Understanding Your Controls

Before you start training, here's what each control does:

### Dataset Selector
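- **Corner / Corner Noisy**: Easier problem (one corner vs. three corners; apart from noise, a single straight line separates the classes)
- **Clean XOR**: Moderate difficulty (standard XOR problem; no single straight line separates the classes)
- **Noisy XOR**: Harder (lots of overlap between classes)
- **Perfect XOR**: Easiest XOR variant (minimal noise)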

**Try different datasets to see how problem difficulty affects learning!**
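
Why is the corner problem easier than XOR? A single straight line can separate one corner from the other three, but no straight line separates XOR. The sketch below checks this by brute force on the four noise-free corner points. The weight grid and the helper name `best_linear_accuracy` are assumptions for illustration, not part of the lab code:

```python
import itertools
import numpy as np

corners = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
corner_labels = np.array([0, 1, 1, 1])   # one corner vs. the other three
xor_labels = np.array([0, 1, 1, 0])      # opposite corners share a class

def best_linear_accuracy(X, y, grid=np.linspace(-2, 2, 21)):
    """Best accuracy of any rule 'w1*x1 + w2*x2 + b > 0' on a coarse grid."""
    best = 0.0
    for w1, w2, b in itertools.product(grid, repeat=3):
        pred = (w1 * X[:, 0] + w2 * X[:, 1] + b > 0).astype(int)
        best = max(best, float(np.mean(pred == y)))
    return best

corner_best = best_linear_accuracy(corners, corner_labels)
xor_best = best_linear_accuracy(corners, xor_labels)
print(f"Best single-line accuracy on corner: {corner_best:.2f}")
print(f"Best single-line accuracy on XOR:    {xor_best:.2f}")
```

A single line gets at most three of the four XOR points right, which is why the network needs its two hidden neurons.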

### Algorithm Selector
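- **Basic Gradient Descent**: Steps along the negative gradient. Each step tries three step sizes (the selected learning rate, half of it, and double it) and keeps the one with the lowest loss
- **Gradient Descent + Momentum**: Uses the same step-size search, but steps along a running average of past gradients ("velocity"), which can help cross plateaus and shallow local minima and speed up convergence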

**Momentum often works better on harder problems!**
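
The momentum code in this module updates its velocity as `v = beta * v + (1 - beta) * gradient`, with `beta = 0.9`. The sketch below isolates that one line and feeds it a constant gradient of 1.0 (an assumption made purely for illustration) to show how the velocity builds up over several steps rather than jumping straight to the gradient:

```python
beta = 0.9          # same momentum parameter as in train_step_momentum
gradient = 1.0      # assumed constant gradient, for illustration only
velocity = 0.0
history = []

for step in range(1, 31):
    velocity = beta * velocity + (1 - beta) * gradient
    history.append(velocity)

for step in (1, 5, 10, 30):
    print(f"step {step:>2}: velocity = {history[step - 1]:.4f}")
```

With a constant gradient, the velocity after `k` steps equals `(1 - beta**k) * gradient`. The first steps are therefore small, and later steps approach the full gradient. When the gradient flips sign from step to step, the same averaging damps the zig-zag.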

### Learning Rate
- **Slow (0.1)**: Safe but may take many steps
- **Moderate (0.3)**: Good default balance
- **Fast (0.5)**: Converges quickly but may be unstable
- **Very Fast (1.0)**: May overshoot or diverge

**Experiment to see the trade-off between speed and stability!**
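
The speed versus stability trade-off can be seen on a toy loss `w^2`, whose gradient is `2w`. This is only a sketch: the thresholds below belong to this toy function, not to the XOR network, and the lab's training step additionally tries half and double the selected rate on every step.

```python
def run_gradient_descent(learning_rate, steps=20, w_start=1.0):
    """Plain gradient descent on the toy loss w^2 (gradient 2w)."""
    w = w_start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w

for lr in (0.1, 0.5, 1.0, 1.1):
    print(f"learning rate {lr}: w after 20 steps = {run_gradient_descent(lr):.4f}")
```

On this toy loss each step multiplies `w` by `(1 - 2 * learning_rate)`. A rate of 0.1 shrinks `w` steadily, 0.5 lands on the minimum in one step, 1.0 bounces between +1 and -1 forever, and 1.1 overshoots further on every step.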

---

## Section 3: Watch Learning Happen! (8-10 min)

Now let's see gradient descent in action. You'll watch the network **learn** to solve XOR automatically.

### What You'll See:
- **Left panel**: Decision boundary evolving in real-time
- **Right panel**: Loss decreasing as the network learns
- **Training controls**: Step through learning at your own pace

### Smart Initialization Strategy

When you click **"Reset Network"**, the system tries **10 different random starting points** and automatically picks the one with the lowest initial loss. This is called **multi-start initialization**.

**Why this helps:**
- The loss landscape is like a bumpy terrain with many hills and valleys
- Some random starting points are near deep valleys (good!) → fast convergence
- Others are on plateaus or shallow valleys (bad!) → slow or no convergence
- By trying multiple starts, we increase the odds of finding a good valley

**The intuition:**
- Imagine dropping 10 balls randomly on a hilly landscape
- Before any ball rolls, we measure how high each one landed (its initial loss)
- We keep the ball that landed **lowest** and let only that one roll downhill
- A lower start doesn't guarantee the deepest valley, but it often gives gradient descent a better chance of finding a great solution!

**Real-world machine learning:**
- Training often uses smart initialization strategies (Xavier, He initialization)
- Advanced optimizers (Adam, RMSprop) adapt learning rates automatically
- Large models sometimes run multiple training sessions and pick the best result
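
For reference, the Xavier (Glorot) and He schemes mentioned above choose the spread of the initial weights from the layer size instead of using a fixed value. The sketch below uses the commonly cited normal-distribution variants. The helper names are assumptions for illustration, and this lab itself uses a fixed `scale=0.5` rather than either scheme:

```python
import numpy as np

def xavier_std(fan_in, fan_out):
    """Xavier/Glorot normal: std = sqrt(2 / (fan_in + fan_out))."""
    return np.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """He normal: std = sqrt(2 / fan_in), commonly paired with ReLU units."""
    return np.sqrt(2.0 / fan_in)

rng = np.random.default_rng(0)
fan_in, fan_out = 2, 2   # hidden layer of a 2-2-1 network: 2 inputs, 2 neurons
W_hidden = rng.normal(0.0, xavier_std(fan_in, fan_out), size=(fan_out, fan_in))

print(f"Xavier std for a 2 -> 2 layer: {xavier_std(fan_in, fan_out):.4f}")
print(f"He std for fan_in = 2:         {he_std(fan_in):.4f}")
print("Sampled hidden-layer weights:")
print(W_hidden)
```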

**Try this:**
1. Train until convergence or until stuck (loss stops decreasing)
2. If stuck, click "Reset Network" to get a fresh starting point
3. Notice how the initial loss varies - some starting points are much better!
4. Compare: Does momentum help escape bad starting points?


In [None]:
def plot_training_state(network, X, y, loss_history, epoch, show_log_scale=True):
    """2-panel visualization: Decision boundary + Loss curve."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
    
    # Left panel: Decision boundary in input space
    x_min, x_max = -2.0, 2.0
    y_min, y_max = -2.0, 2.0
    h = 0.05
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    mesh_points = np.c_[xx.ravel(), yy.ravel()]
    
    Z_out, _, _ = network.predict_batch(mesh_points)
    Z_out = Z_out.reshape(xx.shape)
    
    ax1.contourf(xx, yy, Z_out, levels=20, alpha=0.4, cmap='RdBu_r')
    ax1.contour(xx, yy, Z_out, levels=[0.5], colors='green', linewidths=3)
    ax1.scatter(X[y==0, 0], X[y==0, 1], c='blue', s=50, alpha=0.7,
               edgecolors='k', linewidths=1, label='Class 0')
    ax1.scatter(X[y==1, 0], X[y==1, 1], c='red', s=50, alpha=0.7,
               edgecolors='k', linewidths=1, label='Class 1')
    ax1.set_xlim(x_min, x_max)
    ax1.set_ylim(y_min, y_max)
    ax1.set_xlabel('x\u2081', fontsize=12, fontweight='bold')
    ax1.set_ylabel('x\u2082', fontsize=12, fontweight='bold')
    ax1.set_title(f'Decision Boundary (Epoch {epoch})', fontsize=12, fontweight='bold')
    ax1.legend(loc='upper right')
    ax1.grid(True, alpha=0.3)
    ax1.set_aspect('equal')
    
    # Right panel: Loss curve
    if len(loss_history) > 0:
        ax2.plot(loss_history, 'b-', linewidth=2)
        ax2.set_xlabel('Epoch', fontsize=12, fontweight='bold')
        ax2.set_ylabel('Loss (Binary Cross-Entropy)', fontsize=12, fontweight='bold')
        ax2.set_title('Training Loss Curve', fontsize=12, fontweight='bold')
        ax2.grid(True, alpha=0.3)
        
        if show_log_scale and len(loss_history) > 5:
            ax2.set_yscale('log')
            ax2.set_ylabel('Loss (log scale)', fontsize=12, fontweight='bold')
    else:
        ax2.text(0.5, 0.5, 'No training yet...', ha='center', va='center',
                fontsize=14, transform=ax2.transAxes)
        ax2.set_xlabel('Epoch', fontsize=12)
        ax2.set_ylabel('Loss', fontsize=12)
    
    plt.tight_layout()
    plt.show()

print("Visualization function ready!")

In [13]:
# Training state management
'''training_state = {
    'network': TinyNetwork(CONVERGENT_SEEDS[0]),  # Start with first seed
    'epoch': 0,
    'loss_history': [],
    'learning_rate': 0.1,
    'current_seed_idx': 0
}'''

training_state = {
    'network': None,  # Will be initialized with multi-start below
    'epoch': 0,
    'loss_history': [],
    'learning_rate': 0.3,
    'algorithm': 'basic',  # 'basic' or 'momentum'
    'current_seed_idx': None,
}
# Status display
status_html = HTML(value="<h3>Ready to train!</h3>")
plot_output = Output()


# Algorithm selector dropdown
from ipywidgets import Dropdown

algorithm_dropdown = Dropdown(
    options=[
        ('Basic Gradient Descent', 'basic'),
        ('Gradient Descent + Momentum', 'momentum')
    ],
    value='basic',
    description='Algorithm:',
    layout=Layout(width='300px')
)

def on_algorithm_change(change):
    training_state['algorithm'] = change['new']
    # Reset velocity when switching algorithms
    if 'velocity' in training_state:
        del training_state['velocity']

algorithm_dropdown.observe(on_algorithm_change, names='value')

# Learning rate dropdown
lr_dropdown = Dropdown(
    options=[
        ('Slow (0.1)', 0.1),
        ('Moderate (0.3) - Default', 0.3),
        ('Fast (0.5)', 0.5),
        ('Very Fast (1.0)', 1.0)
    ],
    value=0.3,
    description='Learning Rate:',
    layout=Layout(width='300px')
)

def on_lr_change(change):
    training_state['learning_rate'] = change['new']

lr_dropdown.observe(on_lr_change, names='value')

# Dataset selector dropdown
dataset_dropdown = Dropdown(
    options=[
        ('Corner (easier)', 'corner'),
        ('Corner Noisy (moderate)', 'corner_noisy'),
        ('Clean XOR (moderate)', 'clean'),
        ('Noisy XOR (harder)', 'noisy'),
        ('Perfect XOR (easiest)', 'perfect')
    ],
    value='clean',
    description='Dataset:',
    layout=Layout(width='300px')
)

def on_dataset_change(change):
    global X_train, y_train, current_dataset_type
    current_dataset_type = change['new']
    X_train, y_train = create_xor_dataset(dataset_type=current_dataset_type)
    # Reset training when dataset changes
    best_network, _ = initialize_network_multistart(X_train, y_train, n_trials=10, verbose=False)
    training_state['network'] = best_network
    training_state['epoch'] = 0
    training_state['loss_history'] = []
    if 'velocity' in training_state:
        del training_state['velocity']
    update_display()

dataset_dropdown.observe(on_dataset_change, names='value')

# Buttons
train_1_btn = Button(
    description='Train 1 Step',
    button_style='info',
    layout=Layout(width='150px', height='40px')
)

train_10_btn = Button(
    description='Train 10 Steps',
    button_style='primary',
    layout=Layout(width='150px', height='40px')
)

train_converge_btn = Button(
    description='Train to Convergence',
    button_style='success',
    layout=Layout(width='180px', height='40px')
)

reset_btn = Button(
    description='Reset Network',
    button_style='warning',
    layout=Layout(width='150px', height='40px')
)

def update_display():
    """Update status and visualization."""
    epoch = training_state['epoch']
    loss = training_state['loss_history'][-1] if training_state['loss_history'] else compute_loss(training_state['network'], X_train, y_train)
    acc = compute_accuracy(training_state['network'], X_train, y_train)
    
    status_html.value = f"<h3>Epoch {epoch} | Loss: {loss:.6f} | Accuracy: {acc:.2%}</h3>"
    
    with plot_output:
        clear_output(wait=True)
        plot_training_state(training_state['network'], X_train, y_train, 
                          training_state['loss_history'], epoch)

def train_n_steps(n):
    """Train for n steps using selected algorithm."""
    algorithm = training_state.get('algorithm', 'basic')

    for _ in range(n):
        # Dispatch to correct training function
        if algorithm == 'momentum':
            loss, acc = train_step_momentum(training_state['network'], X_train, y_train,
                                          training_state['learning_rate'], training_state)
        else:
            loss, acc = train_step(training_state['network'], X_train, y_train,
                                 training_state['learning_rate'], training_state)

        training_state['loss_history'].append(loss)
        training_state['epoch'] += 1

        # Early stopping if converged
        if acc >= 0.99:
            break

    update_display()

def on_train_1(btn):
    train_n_steps(1)

def on_train_10(btn):
    train_n_steps(10)

def on_train_converge(btn):
    """Train until convergence or until the epoch counter reaches 500 (total, not additional)."""
    algorithm = training_state.get('algorithm', 'basic')
    max_epochs = 500

    while training_state['epoch'] < max_epochs:
        # Dispatch to correct training function
        if algorithm == 'momentum':
            loss, acc = train_step_momentum(training_state['network'], X_train, y_train,
                                          training_state['learning_rate'], training_state)
        else:
            loss, acc = train_step(training_state['network'], X_train, y_train,
                                 training_state['learning_rate'], training_state)

        training_state['loss_history'].append(loss)
        training_state['epoch'] += 1

        if acc >= 0.99 or loss < 0.01:
            break

    update_display()

'''def on_reset(btn):
    """Reset to new random seed."""
    training_state['current_seed_idx'] = (training_state['current_seed_idx'] + 1) % len(CONVERGENT_SEEDS)
    training_state['network'] = TinyNetwork(CONVERGENT_SEEDS[training_state['current_seed_idx']])
    training_state['epoch'] = 0
    training_state['loss_history'] = []
    update_display()'''

def on_reset(btn):
    """Reset with multi-start initialization (tries multiple random starts)."""
    global X_train, y_train

    # Multi-start: try 10 random inits, pick best
    status_html.value = "<h3>&#128260; Finding good starting point (10 random trials)...</h3>"

    best_network, best_loss = initialize_network_multistart(
        X_train, y_train,
        n_trials=10,
        scale=0.5,
        verbose=False  # Don't clutter output
    )

    # Update state
    training_state['network'] = best_network
    training_state['epoch'] = 0
    training_state['loss_history'] = []

    # Clear momentum velocity
    if 'velocity' in training_state:
        del training_state['velocity']

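    # Redraw first: update_display() overwrites the status line,
    # so the reset message must be set afterwards to stay visible
    update_display()

    acc = compute_accuracy(best_network, X_train, y_train)
    status_html.value = f"<h3>&#10024; Reset! Best init: Loss={best_loss:.4f}, Acc={acc:.2%}</h3>"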

# Connect buttons
train_1_btn.on_click(on_train_1)
train_10_btn.on_click(on_train_10)
train_converge_btn.on_click(on_train_converge)
reset_btn.on_click(on_reset)


# Initialize network with multi-start (try 10 random inits, pick best)
print("Initializing network with multi-start strategy...")
training_state['network'], initial_loss = initialize_network_multistart(
    X_train, y_train,
    n_trials=10,
    scale=0.5,
    verbose=True
)
print(f"Network initialized! Starting loss: {initial_loss:.4f}\n")

print("Training interface ready!")

Training interface ready!


In [14]:
# Display the interactive training interface
print("="*70)
print("INTERACTIVE TRAINING: WATCH GRADIENT DESCENT LEARN!")
print("="*70)
print("\nInstructions:")
print("  1. Click 'Train 1 Step' to see one gradient descent update")
print("  2. Click 'Train 10 Steps' to speed things up")
print("  3. Click 'Train to Convergence' to watch it finish automatically")
print("  4. Click 'Reset Network' to try a different random starting point")
print("\nWatch the LEFT panel: Decision boundary evolves!")
print("Watch the RIGHT panel: Loss decreases!")
print("="*70)

display(status_html)
display(HTML("<h4>Training Configuration:</h4>"))
display(HBox([dataset_dropdown, algorithm_dropdown, lr_dropdown]))
display(HTML("<h4>Training Controls:</h4>"))
display(HBox([train_1_btn, train_10_btn, train_converge_btn, reset_btn]))
display(plot_output)

# Show initial state
update_display()

INTERACTIVE TRAINING: WATCH GRADIENT DESCENT LEARN!

Instructions:
  1. Click 'Train 1 Step' to see one gradient descent update
  2. Click 'Train 10 Steps' to speed things up
  3. Click 'Train to Convergence' to watch it finish automatically
  4. Click 'Reset Network' to try a different random starting point

Watch the LEFT panel: Decision boundary evolves!
Watch the RIGHT panel: Loss decreases!

