<a href="https://colab.research.google.com/github/patemotter/demystifying-ai/blob/main/notebooks/session_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">


# Demystifying AI - Session 2
## How Does AI Learn?

### Pate Motter, PhD

AI Performance Engineer @ Google

</div>


---
# Overview:
Today we will take a look at how AI learns from data.


Sections:

1. How AI Sees the World: Everything is Numbers

1. Measuring Success - How AI Knows It's Improving

1. The Learning Process - Gradient Descent

1. Learning from Mistakes - Backpropagation

1. Putting it all Together - The Training Loop



---

# Section 1 - How AI Sees the World: Everything is Numbers




## The Challenge of Machine Learning
Imagine trying to teach someone to ride a bike, but with an unusual constraint: they're blindfolded and can only receive numerical information through a computer screen. They can't see the road, feel the wind, or use any of their senses. Instead, they only get numbers like:

```
Current readings:
Speed: 5.2 mph
Handlebar angle: 3.4° left
Bike tilt: 2.1° right
Distance from road center: 0.3m left
```
Try to keep the bike upright and on course using only these numbers!

This is similar to how AI systems learn. They don't have our human intuition, senses, or understanding of the world. They can only learn from numbers, and must find patterns in these numbers to make decisions. The systematic process of converting real-world information into numerical data that AI can understand is called **preprocessing** - this encompasses all the steps needed to transform raw data into a format that machine learning models can effectively learn from.

### Images
When you look at a photo of a sunset, you see colors, shapes, and beauty. But an AI sees a grid of numbers, where each number represents the brightness and color of a tiny dot (pixel) in the image.

In this example, suppose we want to represent the letter `X`. On a 3x3 grid of pixels, we might see something like this:
```python
[[189,  23, 212],  # Each number represents
 [ 16, 178,  25],  # the brightness
 [211,  20, 205]]  # of a single pixel
```
This process of converting any type of data into a sequence or array of numbers is called **vectorization** - it's the fundamental way we transform real-world data into vectors (ordered lists of numbers) that AI models can process. Vectorization applies to any kind of data that needs to be represented numerically for machine learning.
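
As a minimal sketch of vectorization, the 3x3 grid above can be stored as a NumPy array and flattened into a single vector. The variable names are illustrative, and dividing by 255 assumes 8-bit brightness values, which the text does not state.

```python
import numpy as np

# The 3x3 "X" image from above: high values are bright pixels
image = np.array([
    [189,  23, 212],
    [ 16, 178,  25],
    [211,  20, 205],
])

# Flatten the 2D grid into a 1D vector of 9 numbers (row by row)
vector = image.flatten()

# Assuming 8-bit pixels (0-255), rescale brightness to the range [0, 1]
scaled = vector / 255.0

print(vector.shape)
print(vector)
```

The flattened vector keeps every pixel value; only the layout changes, from a grid to an ordered list.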

### Text
We understand words based on their meaning, but AI needs them converted to numbers. The process of systematically mapping categorical or textual data to numerical values is called **encoding**. While the simplest form just assigns unique numbers to each character, modern AI uses more sophisticated approaches:

```python
# Simple encoding (one number per character):
"Hello" → [72, 101, 108, 108, 111]

# Modern approach (word embeddings):
"Cat"   → [0.2, -0.5, 0.1, ...]  # Notice how the numbers
"Dog"   → [0.15, -0.4, 0.2, ...] # for similar concepts
"Table" → [-0.8, 0.2, -0.3, ...] # are also similar!
```
These **word embeddings** are a powerful technique that maps words or phrases to high-dimensional vectors where the geometric relationships between vectors capture meaningful semantic relationships between words. That's why the numbers for "cat" and "dog" are more similar to each other than to "table" - just like their meanings!
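
One common way to measure how "similar" two embeddings are is cosine similarity. The sketch below uses hypothetical 3-dimensional vectors built from only the first three values shown above; real embeddings have many more dimensions, and the `cosine_similarity` helper is written here for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, using only the first three
# values shown above; real embeddings have hundreds of dimensions.
embeddings = {
    "cat":   np.array([0.2, -0.5, 0.1]),
    "dog":   np.array([0.15, -0.4, 0.2]),
    "table": np.array([-0.8, 0.2, -0.3]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_table = cosine_similarity(embeddings["cat"], embeddings["table"])

print(f"cat vs dog:   {cat_dog:.2f}")
print(f"cat vs table: {cat_table:.2f}")
```

With these toy numbers, "cat" and "dog" point in nearly the same direction, while "cat" and "table" point in roughly opposite directions.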

### Sound
When we hear sound, we experience it as continuous waves of pressure in the air. For AI to process sound, we need to convert these waves into a sequence of numbers:

```python
# @title Waveform Code
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import numpy as np

# --- 1. Generate Continuous Audio Waveform (low frequency so the shape is easy to see) ---
sample_rate = 44100
duration = 0.2
time_continuous = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)
frequency = 5
amplitude = 0.8

continuous_waveform = (amplitude * np.sin(2 * np.pi * frequency * time_continuous) +
                       0.3 * amplitude * np.sin(2 * np.pi * 2 * frequency * time_continuous) +
                       0.1 * amplitude * np.sin(2 * np.pi * 3 * frequency * time_continuous))


# --- 2. Discretize the Waveform (keep only a handful of samples) ---
n_samples_discrete = 12
time_discrete_indices = np.linspace(0, len(continuous_waveform)-1, n_samples_discrete, endpoint=True, dtype=int)
time_discrete = time_continuous[time_discrete_indices]
discrete_waveform_values = continuous_waveform[time_discrete_indices]

# --- Create Stepwise Data for Plotly ---
stepwise_time = []
stepwise_values = []
for i in range(len(time_discrete)):
    stepwise_time.append(time_discrete[i])
    stepwise_values.append(discrete_waveform_values[i])
    if i < len(time_discrete) - 1:
        stepwise_time.append(time_discrete[i+1])
        stepwise_values.append(discrete_waveform_values[i])


# --- 3. Create Plotly Subplots Figure ---
fig = make_subplots(rows=1, cols=2,
                    subplot_titles=('Continuous Waveform', 'Discretized Waveform (Stepwise)'))

# --- 3.1 Continuous Waveform Trace (add to subplot 1, column 1) ---
fig.add_trace(go.Scatter(x=time_continuous, y=continuous_waveform, mode='lines', line=dict(color='blue')),
              row=1, col=1)

# --- 3.2 Discrete Waveform Stepwise Trace (add to subplot 1, column 2) ---
fig.add_trace(go.Scatter(x=stepwise_time, y=stepwise_values, mode='lines', line=dict(color='red', shape='hv')),
              row=1, col=2)

# --- Annotations for Stepwise Chart (Bottom-left positioning) ---
annotations = []
for x_val, y_val in zip(time_discrete, discrete_waveform_values):
    annotations.append(
        dict(x=x_val, y=y_val, text=f'{y_val:.2f}',
             xanchor='left',
             yanchor='bottom',
             xshift=5,
             yshift=-5,
             xref='x2', yref='y2',
             showarrow=False)  # No arrows, just the value labels
    )
# Subplot titles are stored as annotations too, so append instead of overwriting them
for annotation in annotations:
    fig.add_annotation(annotation)

# --- 3.3 "Tensor" Representation Annotation (Text below right subplot) ---
tensor_values_text = "[" + ", ".join(f"{val:.2f}" for val in discrete_waveform_values) + "]"
tensor_annotation = go.layout.Annotation(
    x=0.8,
    y=-0.3,
    xref="paper", yref="paper",
    text=f"<b>Discrete Values (Tensor Representation):</b><br>{tensor_values_text}",
    showarrow=False,
    align="center"
)
fig.add_annotation(tensor_annotation)


# --- 4. Update Layout for the Subplots Figure ---
fig.update_layout(
    title='Audio Waveform: Continuous vs. Discretized',
    yaxis1_title='Amplitude (Continuous)',
    yaxis2_title='Amplitude (Discrete)',
    xaxis1_title='Time (seconds)',
    xaxis2_title='Time (seconds - Discrete Samples)',
    yaxis_range=[-1.2, 1.2],
    margin=dict(b=80),
    showlegend=False,
    height=450
)

# --- 5. Display the Side-by-Side Plots ---
fig.show()
```

## Making Numbers Work Better
Just converting data to numbers isn't always enough. Think about trying to compare the age of a house (maybe 50 years) with its number of bedrooms (maybe 3). These numbers are on totally different scales! To help the AI learn efficiently and avoid bias towards larger scales, we need to transform these features to comparable ranges, a process called **normalization** - a family of mathematical techniques that adjust features to a standard scale while preserving their relative relationships.

```python
# Before normalization:
House 1: 50 years old, 3 bedrooms
House 2: 2 years old, 5 bedrooms
Overall Data: Age (min=1, max=92), Bedrooms (min=0, max=6)

# Min-Max Formula
x_normalized = (x - min) / (max - min)

# After min-max normalization:
House 1: age = 0.538 = (50 - 1) / (92 - 1)
          br = 0.5   = ( 3 - 0) / ( 6 - 0)
House 2: age = 0.011 = ( 2 - 1) / (92 - 1)
          br = 0.833 = ( 5 - 0) / ( 6 - 0)
```
Now the AI can compare these values fairly!
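
The same calculation can be sketched in NumPy. The `min_max_normalize` helper and the array layout are illustrative choices; the minimum and maximum values are the dataset-wide figures quoted above.

```python
import numpy as np

def min_max_normalize(x, x_min, x_max):
    """Rescale x from [x_min, x_max] to [0, 1]."""
    return (x - x_min) / (x_max - x_min)

# Columns: age in years, number of bedrooms
houses = np.array([
    [50.0, 3.0],  # House 1
    [ 2.0, 5.0],  # House 2
])

# Dataset-wide minimums and maximums quoted above
feature_min = np.array([1.0, 0.0])
feature_max = np.array([92.0, 6.0])

normalized = min_max_normalize(houses, feature_min, feature_max)
print(np.round(normalized, 3))
```

Each column is rescaled independently, so age and bedroom count both end up between 0 and 1.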

### Finding Important Characteristics
When you look at a house, you naturally focus on important details like size, location, and condition while ignoring irrelevant details like the house number or the color of the garden hose. We need to help AI do the same thing. This process, called **feature extraction**, involves systematically identifying, deriving, or computing informative attributes from raw data that will help the model learn patterns effectively.


## [Optional Deep Dive] Advanced Data Representation




### Feature Scaling and Normalization

#### Fundamental Scaling Techniques

1. **Min-Max Scaling (Normalization)**
   - Maps features to [0,1] range
   - Formula: $x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$
   - Variants:
     - Custom range [a,b]: $x_{scaled} = a + \frac{(x - x_{min})(b-a)}{x_{max} - x_{min}}$
   - Properties:
     - Preserves the ordering and relative spacing of values (zero stays zero only if $x_{min} = 0$)
     - Preserves missing values
     - All features have same scale
   - Limitations:
     - Sensitive to outliers
     - Doesn't handle new data outside original range well

2. **Z-Score Standardization**
   - Centers around mean, scales by standard deviation
   - Formula: $x_{scaled} = \frac{x - \mu}{\sigma}$
   - Properties:
     - Mean = 0, SD = 1
     - Preserves outliers
     - Handles normal distributions well
   - Variations:
     - Robust scaling using median/IQR: $x_{scaled} = \frac{x - median}{IQR}$
     - Modified Z-score for non-normal distributions

3. **Robust Scaling**
   - Uses statistics that are robust to outliers
   - Formula: $x_{scaled} = \frac{x - median}{Q_3 - Q_1}$, where $Q_3 - Q_1$ is the interquartile range (IQR)
   - When to use:
     - Data contains outliers
     - Non-normal distributions
     - Skewed features
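
To see how the three techniques react to an outlier, here is a small sketch on a hypothetical feature. Robust scaling is written in its median and IQR form; the data values are made up for illustration.

```python
import numpy as np

# Hypothetical feature with one large outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: the outlier squeezes the other values toward 0
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: mean 0, standard deviation 1
z_score = (x - x.mean()) / x.std()

# Robust scaling: center on the median, scale by the interquartile range
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

print("min-max:", np.round(min_max, 3))
print("z-score:", np.round(z_score, 3))
print("robust: ", np.round(robust, 3))
```

With min-max scaling the four ordinary values are crowded near 0, while robust scaling keeps them evenly spread and leaves only the outlier far away.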

### Advanced Feature Engineering

#### 1. Numerical Transformations
- **Power Transforms**
  - Box-Cox: $y_i^{(\lambda)} = \begin{cases} \frac{y_i^\lambda - 1}{\lambda}, & \lambda \neq 0 \\ \ln(y_i), & \lambda = 0 \end{cases}$
  - Yeo-Johnson: Handles negative values
  - Log Transform: $x_{transformed} = \log(x + c)$
  
- **Polynomial Features**
  - Creates interaction terms
  - Degree 2 example:
    ```python
    [x₁, x₂] → [1, x₁, x₂, x₁², x₁x₂, x₂²]
    ```
  - Number of generated features grows as $O(n^d)$ for $n$ input features and degree $d$

#### 2. Categorical Encoding Techniques

1. **One-Hot Encoding**
   ```python
   # Original:     Encoded:
   Red   →        [1, 0, 0]
   Blue  →        [0, 1, 0]
   Green →        [0, 0, 1]
   ```
   - Pros: No ordinal relationship
   - Cons: Dimensionality explosion
   - Memory usage: O(nk) for n samples, k categories

2. **Label Encoding**
   ```python
   Red   → 0
   Blue  → 1
   Green → 2
   ```
   - Pros: Memory efficient
   - Cons: Implies ordering
   
3. **Target Encoding**
   - Replaces categories with target mean
   - Handles high cardinality
   - Requires cross-validation to prevent leakage
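   ```python
   def target_encode(category, target):
       # category and target are pandas Series with the same index;
       # each row gets the mean target of its category, E[target | category]
       return target.groupby(category).transform("mean")
   ```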

4. **Feature Hashing**
   - Maps high-cardinality features to fixed-size space
   - Formula: $h(x) = hash(x) \mod m$
   - Collision handling strategies
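
The encodings above can be sketched with NumPy and the standard library. This is an illustration rather than a production encoder: `np.unique` assigns label indices in alphabetical order (so the integers differ from the Red=0, Blue=1, Green=2 example above), and CRC32 is an assumed choice of stable hash because Python's built-in `hash()` is randomized per process for strings.

```python
import zlib
import numpy as np

colors = np.array(["Red", "Blue", "Green", "Blue"])

# Label encoding: map each category to an integer index.
# np.unique sorts categories alphabetically: Blue=0, Green=1, Red=2.
categories, labels = np.unique(colors, return_inverse=True)

# One-hot encoding: row i has a single 1 in the column of its category
one_hot = np.eye(len(categories), dtype=int)[labels]

# Feature hashing: map each category into one of m buckets
m = 8
hashed = np.array([zlib.crc32(c.encode("utf-8")) % m for c in colors])

print(labels)
print(one_hot)
print(hashed)
```

One-hot encoding needs one column per category, while hashing always produces values in a fixed range of `m` buckets, at the cost of possible collisions.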

#### 3. Text Vectorization Methods

1. **TF-IDF (Term Frequency-Inverse Document Frequency)**
   - $TF(t,d) = \frac{\text{count of term t in document d}}{\text{total terms in document d}}$
   - $IDF(t) = \log\frac{\text{total documents}}{\text{documents containing t}}$
   - Final score: $TF(t,d) \times IDF(t)$

2. **Word2Vec Embeddings**
   - Skip-gram model
   - Negative sampling
   - Optimization objective:
     $\max_\theta \sum_{(w,c)} \log P(c|w;\theta)$

3. **Subword Tokenization**
   - Byte-Pair Encoding (BPE)
   - WordPiece
   - SentencePiece
   - Handles out-of-vocabulary words
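
The TF-IDF formulas above can be sketched in plain Python. The tiny corpus is hypothetical, and this version assumes every queried term appears in at least one document (otherwise the IDF would divide by zero).

```python
import math

# Tiny hypothetical corpus, already split into lowercase words
documents = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "chased", "the", "dog"],
]

def tf(term, doc):
    """Fraction of the words in doc that are term."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (total documents / documents containing term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", documents[0], documents))     # 0.0: "the" is in every document
print(tf_idf("chased", documents[2], documents))  # rarer word, higher score
```

A word that appears in every document gets an IDF of zero, which is how TF-IDF filters out uninformative words like "the".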

### Dimensionality Considerations

1. **Curse of Dimensionality**
   - Volume of space increases exponentially with dimensions
   - Data becomes sparse
   - Distance metrics become less meaningful
   
2. **Dimensional Reduction Techniques**
   - Principal Component Analysis: computed from the SVD of the mean-centered data matrix, $X = U\Sigma V^T$
   - t-SNE: $p_{j|i} = \frac{\exp(-\|x_i-x_j\|^2/2\sigma_i^2)}{\sum_{k\neq i}\exp(-\|x_i-x_k\|^2/2\sigma_i^2)}$
   - UMAP: Based on Riemannian geometry and algebraic topology

### Information Theory Perspectives

1. **Feature Information Content**
   - Entropy: $H(X) = -\sum_i p(x_i)\log p(x_i)$
   - Mutual Information: $I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$

2. **Optimal Feature Selection**
   - Maximum relevance: $\max_{S}\frac{1}{|S|}\sum_{x_i\in S}I(x_i;y)$
   - Minimum redundancy: $\min_{S}\frac{1}{|S|^2}\sum_{x_i,x_j\in S}I(x_i;x_j)$
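
Entropy and mutual information can be computed directly from probability tables. The sketch below assumes log base 2, so results are in bits (the formulas above leave the base unspecified); the helper names and example tables are illustrative.

```python
import numpy as np

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)  # marginal p(y)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / (px * py)[mask]))

print(entropy([0.5, 0.5]))  # fair coin: 1 bit
print(entropy([0.9, 0.1]))  # biased coin: less than 1 bit

independent = np.array([[0.25, 0.25], [0.25, 0.25]])
identical = np.array([[0.5, 0.0], [0.0, 0.5]])
print(mutual_information(independent))  # X tells us nothing about Y
print(mutual_information(identical))    # X fully determines Y
```

A feature with high mutual information with the target is a good candidate to keep; two features with high mutual information with each other are largely redundant.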

---
# Section 2: Measuring Success - How AI Knows It's Improving

Imagine you're trying to learn a new skill, like throwing darts. When you throw a dart, you can easily see how far you are from the bullseye - maybe 2 inches to the left, or 3 inches too high. This immediate feedback helps you adjust and improve. But how does an AI system know if it's doing well or poorly when it's just working with numbers?

This is where **loss functions** come in - they're the AI's way of measuring how far its predictions are from the correct answers. Just like measuring the distance from your dart to the bullseye, a loss function measures the distance between the AI's prediction and the true answer. The systematic way we quantify the difference between predictions and actual values is called **error measurement**.

When we measure error, we have several ways to think about it:
1. How far off are we? (absolute error)
2. Are we too high or too low? (directional error)
3. How bad are big mistakes compared to small ones? (squared error)

## Types of Loss Functions

Different problems require different ways of measuring error. Let's explore the most common **loss functions** - the mathematical formulas that quantify how wrong our predictions are:







### Mean Squared Error (MSE)
This is like measuring the straight-line distance to your target, but squaring it. This means being wrong by 2 units is treated as 4 times worse than being wrong by 1 unit.

```python
np.mean((predictions - targets) ** 2)
```

### Mean Absolute Error (MAE)
This simply measures the direct distance to the target. Being wrong by 2 units is exactly twice as bad as being wrong by 1 unit.

```python
np.mean(np.abs(predictions - targets))
```

### Cross-Entropy Loss
This is specially designed for probability predictions (like "is this image a cat or dog?"). It measures how well the predicted probabilities match the true values.

```python
def binary_cross_entropy(predictions, targets):
    epsilon = 1e-15  # Prevent log(0)
    return -np.mean(
        targets * np.log(predictions + epsilon) +
        (1 - targets) * np.log(1 - predictions + epsilon)
    )
```
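
A small worked example may help make these three losses concrete. The numbers are hypothetical, and `binary_cross_entropy` is repeated from above only so the example runs on its own.

```python
import numpy as np

# Hypothetical regression predictions and true values
predictions = np.array([2.5, 0.0, 2.0, 8.0])
targets = np.array([3.0, -0.5, 2.0, 7.0])

mse = np.mean((predictions - targets) ** 2)
mae = np.mean(np.abs(predictions - targets))
print(f"MSE: {mse}")  # MSE: 0.375
print(f"MAE: {mae}")  # MAE: 0.5

# Hypothetical classifier outputs: predicted probability of "cat"
probs = np.array([0.9, 0.2, 0.6])
labels = np.array([1.0, 0.0, 1.0])  # 1 = cat, 0 = dog

def binary_cross_entropy(predictions, targets, epsilon=1e-15):
    # Same formula as above, repeated so this example is self-contained
    return -np.mean(
        targets * np.log(predictions + epsilon)
        + (1 - targets) * np.log(1 - predictions + epsilon)
    )

bce = binary_cross_entropy(probs, labels)
confident_wrong = binary_cross_entropy(np.array([0.01]), np.array([1.0]))
print(f"BCE on mostly-correct predictions: {bce:.3f}")
print(f"BCE on one confident mistake:      {confident_wrong:.3f}")
```

The errors here are 0.5, 0.5, 0, and 1, so the single error of 1 contributes most of the MSE. Cross-entropy behaves similarly for probabilities: a confident wrong answer is penalized far more heavily than a hesitant one.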

```python
# @title Loss function visualization code
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import numpy as np

# 1. Generate simple synthetic data along the line y = 2x + 1, plus noise
np.random.seed(42)
x_data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
true_slope = 2
true_intercept = 1
y_data = true_slope * x_data + true_intercept + np.random.normal(0, 2, len(x_data))

# Add one extreme outlier
x_data = np.append(x_data, 8)
y_data = np.append(y_data, 50)

# 2. Fit Linear Regression with MSE (ordinary least squares)
def mse(errors):
    """Point-wise squared error."""
    return errors**2

mse_slope, mse_intercept = np.polyfit(x_data, y_data, 1)

# 3. Approximate an MAE fit: down-weight points that lie far from the true line.
#    This is a conceptual approximation, not an exact MAE minimizer.
def mae(errors):
    """Point-wise absolute error."""
    return np.abs(errors)

mae_slope, mae_intercept = np.polyfit(x_data, y_data, 1, w=1/np.abs(y_data - (true_slope * x_data + true_intercept)))

# 4. Generate Regression Lines for Plotly
mse_y_predicted = mse_slope * x_data + mse_intercept
mae_y_predicted = mae_slope * x_data + mae_intercept

# 5. Calculate Point-wise Loss Values
mse_point_losses = mse(y_data - mse_y_predicted)
mae_point_losses = mae(y_data - mae_y_predicted)

# 6. Create Plotly Scatter Plot and Regression Lines
fig = go.Figure()

# Scatter plot of data points
fig.add_trace(go.Scatter(x=x_data, y=y_data, mode='markers', name='Data Points', marker=dict(color='black', size=10)))

# MSE Regression Line
fig.add_trace(go.Scatter(x=x_data, y=mse_y_predicted, mode='lines', name='MSE Regression Line', line=dict(color='blue', width=2)))
# MAE Regression Line
fig.add_trace(go.Scatter(x=x_data, y=mae_y_predicted, mode='lines', name='MAE Regression Line', line=dict(color='red', width=2)))

# 7. Add Annotations for Point-wise Loss Values (above and below each point)
annotations = []
for i in range(len(x_data)):
    x_val = x_data[i]
    y_val = y_data[i]
    mse_loss_val = mse_point_losses[i]
    mae_loss_val = mae_point_losses[i]

    # MAE Annotation (above the point, red color)
    annotations.append(
        dict(x=x_val, y=y_val + 2,  # Position above the point
             text=f"MAE: {mae_loss_val:.1f}",
             xanchor='center', yanchor='bottom',
             showarrow=False,
             font=dict(color='red'))
    )
    # MSE Annotation (below the point, blue color)
    annotations.append(
        dict(x=x_val, y=y_val - 2,  # Position below the point
             text=f"MSE: {mse_loss_val:.1f}",
             xanchor='center', yanchor='top',
             showarrow=False,
             font=dict(color='blue'))
    )

fig.update_layout(annotations=annotations)  # Apply annotations

# 8. Customize Layout
fig.update_layout(
    title='MSE and MAE Loss Values per Data Point',
    xaxis_title='X',
    yaxis_title='Y',
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
)

# 9. Display Plot
fig.show()
```

## [Optional Deep Dive] Loss Functions



### Advanced Loss Functions

#### Huber Loss
Combines benefits of MSE and MAE:
$$ L_δ(y, f(x)) = \begin{cases}
\frac{1}{2}(y - f(x))^2 & \text{for } |y - f(x)| ≤ δ \\
δ|y - f(x)| - \frac{1}{2}δ^2 & \text{otherwise}
\end{cases} $$

```python
def huber_loss(predictions, targets, delta=1.0):
    """
    Huber Loss
    - MSE for small errors
    - MAE for large errors
    """
    error = predictions - targets
    is_small_error = np.abs(error) <= delta
    squared_loss = 0.5 * error ** 2
    linear_loss = delta * np.abs(error) - 0.5 * delta ** 2
    return np.mean(np.where(is_small_error, squared_loss, linear_loss))
```

#### Focal Loss
Modified cross-entropy that handles class imbalance:
$$ FL(p_t) = -(1-p_t)^γ \log(p_t) $$
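
As a minimal sketch, the focal loss formula can be written for binary labels as follows. Here $p_t$ is taken to be the probability the model assigned to the true class, the optional class-weighting factor used in some formulations is omitted, and the default `gamma=2.0` and the sample values are assumptions for illustration.

```python
import numpy as np

def focal_loss(predictions, targets, gamma=2.0, epsilon=1e-15):
    """
    Binary focal loss without the optional class-weighting term.
    p_t is the probability the model assigned to the true class.
    """
    predictions = np.clip(predictions, epsilon, 1 - epsilon)
    p_t = np.where(targets == 1, predictions, 1 - predictions)
    return np.mean(-((1 - p_t) ** gamma) * np.log(p_t))

targets = np.array([1, 1, 0, 0])
predictions = np.array([0.95, 0.30, 0.10, 0.60])

print(focal_loss(predictions, targets, gamma=0.0))  # gamma=0 is plain cross-entropy
print(focal_loss(predictions, targets, gamma=2.0))  # easy examples are down-weighted
```

The factor $(1-p_t)^γ$ shrinks the loss of examples the model already classifies confidently, so training focuses on the hard ones.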



### Loss Function Properties





#### Convexity
A loss function L(θ) is convex if, for any two parameter settings θ₁ and θ₂:
$$ L(tθ_1 + (1-t)θ_2) ≤ tL(θ_1) + (1-t)L(θ_2) $$
for all t ∈ [0,1]

Important properties:
- Every local minimum is also a global minimum
- No local minima traps
- Efficient optimization

#### Gradient Properties
Effect on learning:
- MSE gradient: Linear in error
- Cross-entropy gradient: Stronger for wrong predictions
- Huber loss gradient: Bounded

### Implementation Considerations



#### Numerical Stability
Common techniques for stable implementations:

```python
def stable_cross_entropy(predictions, targets):
    """Numerically stable cross-entropy"""
    epsilon = 1e-15
    predictions = np.clip(predictions, epsilon, 1 - epsilon)
    return -np.sum(targets * np.log(predictions))
```

#### Gradient Computation
Analytical gradients for common losses:

```python
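def loss_gradients(predictions, targets, loss_type='mse'):
    """Per-element gradient of the loss with respect to the predictions."""
    if loss_type == 'mse':
        return 2 * (predictions - targets)
    elif loss_type == 'cross_entropy':
        # Clip to avoid division by zero when a prediction is exactly 0 or 1
        p = np.clip(predictions, 1e-15, 1 - 1e-15)
        return -(targets / p) + (1 - targets) / (1 - p)
    raise ValueError(f"Unknown loss_type: {loss_type!r}")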
```

---
# Section 3: The Learning Process - Gradient Descent

Now that we can measure how wrong our predictions are (using loss functions), how does the AI actually improve? It uses a process called **gradient descent** - a method of systematically adjusting the model's internal numbers (weights and biases) to reduce the errors we measured.

## Understanding Gradients

Imagine you're blindfolded on a hill and your goal is to reach the bottom. How would you do it? You'd probably:
1. Feel which way the ground slopes
2. Take a small step in the downhill direction
3. Check if you've actually gone downhill
4. Repeat until you can't go any lower

This is exactly how gradient descent works! The **gradient** is simply the slope of the ground under your feet: it tells you which direction is uphill and how steep it is. In machine learning:
- The "hill" is our loss function
- The "slope" is how much the loss would change if we adjust each weight
- "Going downhill" means reducing the loss (making better predictions)

## The Learning Rate

One of the most important concepts in gradient descent is the **learning rate** - how big of a step we take in the downhill direction. This is crucial because:

- Too small: We'll take forever to reach the bottom
- Too large: We might overshoot and bounce around
- Just right: We'll efficiently progress toward the minimum
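
These three cases can be sketched on the simplest possible "hill", $f(x) = x^2$, whose gradient is $2x$. The function name, starting point, step count, and the specific learning rates are illustrative assumptions.

```python
def gradient_descent_1d(learning_rate, start=5.0, steps=20):
    """Minimize f(x) = x**2, whose gradient is 2*x and whose minimum is at x = 0."""
    x = start
    for _ in range(steps):
        grad = 2 * x
        x = x - learning_rate * grad
    return x

too_small = gradient_descent_1d(learning_rate=0.01)
just_right = gradient_descent_1d(learning_rate=0.1)
too_large = gradient_descent_1d(learning_rate=1.1)

print(f"too small:  x = {too_small:.3f}")   # still far from 0
print(f"just right: x = {just_right:.3f}")  # close to 0
print(f"too large:  x = {too_large:.3f}")   # farther from 0 than the start
```

With the largest learning rate, every step overshoots the minimum by more than the previous one, so the value grows instead of shrinking.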


```python
# @title Gradient Descent Example (code)
import plotly.graph_objects as go
import numpy as np

def surface_function(x, y):
    """A bumpy surface with several hills and valleys."""
    return (np.sin(np.sqrt(x**2 + y**2))
            + 0.5 * np.cos(2 * x)
            + 0.3 * np.sin(y/2)
            - 0.05 * x**2 + 0.02 * y
            + 0.2 * np.exp(-((x + 1)**2 + (y - 2)**2) / 2)
            - 0.1 * np.exp(-((x - 3)**2 + (y + 1)**2) / 3)
           )

def gradient(x, y):
    """Numerical gradient (central differences)."""
    h = 1e-5
    dz_dx = (surface_function(x + h, y) - surface_function(x - h, y)) / (2 * h)
    dz_dy = (surface_function(x, y + h) - surface_function(x, y - h)) / (2 * h)
    return np.array([dz_dx, dz_dy])

def vanilla_gradient_descent(start_point, learning_rate=0.1, iterations=200):
    """Vanilla Gradient Descent algorithm."""
    path = [start_point]
    current_point = np.array(start_point)

    for _ in range(iterations):
        grad = gradient(current_point[0], current_point[1])
        current_point = current_point - learning_rate * grad
        path.append(current_point)
    return np.array(path)

def momentum_gradient_descent(start_point, learning_rate=0.01, iterations=200, momentum=0.9):
    """Momentum Gradient Descent algorithm."""
    path = [start_point]
    current_point = np.array(start_point)
    velocity = np.zeros_like(current_point) # Initialize velocity to zero

    for _ in range(iterations):
        grad = gradient(current_point[0], current_point[1])
        velocity = momentum * velocity - learning_rate * grad  # Update velocity with momentum
        current_point = current_point + velocity              # Update position with velocity
        path.append(current_point)
    return np.array(path)


# 1. Create the surface data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = surface_function(X, Y)
Z[np.isnan(Z)] = np.nanmax(Z) * 2

# 2. Perform gradient descent with both algorithms, starting from the same point
start_point = [0.1, 1.6]
vanilla_path = vanilla_gradient_descent(start_point)
momentum_path = momentum_gradient_descent(start_point)

vanilla_path_z = [surface_function(p[0], p[1]) for p in vanilla_path]
momentum_path_z = [surface_function(p[0], p[1]) for p in momentum_path]


# 3. Create the Plotly graph with both paths
fig = go.Figure(data=[
    go.Surface(z=Z, x=X, y=Y, colorscale='Viridis', opacity=0.7), # Surface, reduced opacity for path visibility
    go.Scatter3d(x=vanilla_path[:, 0], y=vanilla_path[:, 1], z=vanilla_path_z,
                 mode='lines+markers',
                 marker=dict(size=5, color='blue'),
                 line=dict(color='blue', width=2),
                 name='Vanilla GD'), # Vanilla GD Path in blue
    go.Scatter3d(x=momentum_path[:, 0], y=momentum_path[:, 1], z=momentum_path_z,
                 mode='lines+markers',
                 marker=dict(size=5, color='red'),
                 line=dict(color='red', width=2),
                 name='Momentum GD')  # Momentum GD Path in red
])

fig.update_layout(
    title='Comparison of Gradient Descent Algorithms',
    scene=dict(
        xaxis_title='X',
        yaxis_title='Y',
        zaxis_title='Z',
        camera=dict(eye=dict(x=1.2, y=-1.2, z=0.8)),
        zaxis=dict(range=[np.min(Z), np.max(Z)])
    ),
    margin=dict(l=20, r=20, b=20, t=40),
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1) # Horizontal legend above the plot
)

fig.show()
```

## Types of Gradient Descent (Optional)

There are several ways to compute and apply gradients:




### Batch Gradient Descent
Uses all training data to compute the gradient:
```python
def batch_gradient_descent(X, y, model, learning_rate, epochs):
    """
    Classic batch gradient descent
    - Compute gradient using all data
    - Update weights once per epoch
    """
    for epoch in range(epochs):
        predictions = model.forward(X)
        loss = compute_loss(predictions, y)
        gradients = compute_gradients(loss)
        model.update_weights(gradients, learning_rate)
```

### Stochastic Gradient Descent (SGD)
Updates using one training example at a time:
```python
def sgd(X, y, model, learning_rate, epochs):
    """
    Stochastic gradient descent
    - Compute gradient using single example
    - Update weights more frequently
    - More noisy but often faster
    """
    for epoch in range(epochs):
        for i in range(len(X)):
            pred = model.forward(X[i:i+1])
            loss = compute_loss(pred, y[i:i+1])
            gradients = compute_gradients(loss)
            model.update_weights(gradients, learning_rate)
```

### Mini-batch Gradient Descent
Compromise between batch and stochastic:
```python
def minibatch_gd(X, y, model, learning_rate, epochs, batch_size):
    """
    Mini-batch gradient descent
    - Compute gradient using batch_size examples
    - Balance between stability and speed
    """
    for epoch in range(epochs):
        for i in range(0, len(X), batch_size):
            batch_X = X[i:i+batch_size]
            batch_y = y[i:i+batch_size]
            pred = model.forward(batch_X)
            loss = compute_loss(pred, batch_y)
            gradients = compute_gradients(loss)
            model.update_weights(gradients, learning_rate)
```
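
The three outlines above rely on `model`, `compute_loss`, and `compute_gradients`, which are placeholders. As a runnable sketch of the same idea, here is mini-batch gradient descent for a linear model with the MSE gradient written out by hand. The synthetic data, the function name `minibatch_gd_linear`, the hyperparameters, and the per-epoch shuffling are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for this sketch): y = 2x + 1 plus small noise
X = rng.uniform(-1, 1, size=(200, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.1, size=200)

def minibatch_gd_linear(X, y, learning_rate=0.1, epochs=50, batch_size=20):
    """Fit y ≈ X @ w + b with mini-batch gradient descent on the MSE loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(len(X))          # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w + b - y[idx]      # prediction minus target
            grad_w = 2 * X[idx].T @ error / len(idx)
            grad_b = 2 * error.mean()
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

w, b = minibatch_gd_linear(X, y)
print(w, b)  # should land near the true slope 2 and intercept 1
```

Setting `batch_size=len(X)` turns this into batch gradient descent, and `batch_size=1` turns it into stochastic gradient descent.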

## [Optional Deep Dive] Optimization Theory



### 1. The Mathematics of Gradients

#### Formal Definition
For a function f(x), the gradient ∇f is the vector of partial derivatives:
$$ ∇f = \begin{bmatrix}
\frac{\partial f}{\partial x_1} \\
\frac{\partial f}{\partial x_2} \\
\vdots \\
\frac{\partial f}{\partial x_n}
\end{bmatrix} $$

#### Properties
- Points in direction of steepest increase
- Perpendicular to level sets
- Magnitude indicates steepness

### 2. Convergence Analysis

#### Gradient Descent Update Rule
$$ θ_{t+1} = θ_t - η∇f(θ_t) $$

For convex functions with Lipschitz continuous gradients, where L is the Lipschitz constant:
$$ \|∇f(x) - ∇f(y)\| ≤ L\|x - y\| $$

Convergence rate for such functions with a fixed step size η ≤ 1/L:
$$ f(θ_t) - f(θ^*) ≤ \frac{\|θ_0 - θ^*\|^2}{2ηt} $$

### 3. Advanced Optimization Techniques

#### Momentum
Adds velocity to updates:
```python
def momentum_update(params, gradients, velocity, momentum, lr):
    """
    Momentum update
    - Helps escape local minima
    - Smoother convergence
    """
    velocity = momentum * velocity - lr * gradients
    return params + velocity, velocity
```

#### Adam Optimizer
Combines momentum with adaptive learning rates:
$$ m_t = β_1m_{t-1} + (1-β_1)g_t $$
$$ v_t = β_2v_{t-1} + (1-β_2)g_t^2 $$
$$ \hat{m}_t = \frac{m_t}{1-β_1^t} $$
$$ \hat{v}_t = \frac{v_t}{1-β_2^t} $$
$$ θ_{t+1} = θ_t - η\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + ε} $$

```python
class AdamOptimizer:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None  # First moment estimate
        self.v = None  # Second moment estimate
        self.t = 0     # Timestep
    
    def update(self, params, gradients):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        
        self.t += 1
        
        # Update biased first moment estimate
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients
        
        # Update biased second raw moment estimate
        self.v = self.beta2 * self.v + (1 - self.beta2) * np.square(gradients)
        
        # Compute bias-corrected first moment estimate
        m_hat = self.m / (1 - self.beta1**self.t)
        
        # Compute bias-corrected second raw moment estimate
        v_hat = self.v / (1 - self.beta2**self.t)
        
        # Update parameters
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
```

### 4. Optimization Landscapes

#### Critical Points
Different types of stationary points:
- Minima (local/global)
- Maxima (local/global)
- Saddle points

Properties:
$$ ∇f(x) = 0 $$
Classify using Hessian:
$$ H = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1\partial x_2} & \cdots \\
\frac{\partial^2 f}{\partial x_2\partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\
\vdots & \vdots & \ddots
\end{bmatrix} $$
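
The classification works through the eigenvalues of the Hessian at the critical point: all positive means a local minimum, all negative means a local maximum, and mixed signs mean a saddle point. A minimal sketch follows, assuming a symmetric Hessian; the function name, tolerance, and example functions are illustrative.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    """Classify a critical point from the eigenvalues of its (symmetric) Hessian."""
    eigenvalues = np.linalg.eigvalsh(hessian)
    if np.all(eigenvalues > tol):
        return "local minimum"
    if np.all(eigenvalues < -tol):
        return "local maximum"
    if np.any(eigenvalues > tol) and np.any(eigenvalues < -tol):
        return "saddle point"
    return "inconclusive"  # some eigenvalue is (near) zero

# Hessians at the origin for three simple functions
bowl = np.array([[2.0, 0.0], [0.0, 2.0]])      # f = x^2 + y^2
dome = np.array([[-2.0, 0.0], [0.0, -2.0]])    # f = -(x^2 + y^2)
saddle = np.array([[2.0, 0.0], [0.0, -2.0]])   # f = x^2 - y^2

for name, H in [("bowl", bowl), ("dome", dome), ("saddle", saddle)]:
    print(name, "->", classify_critical_point(H))
```

At a saddle point the surface curves up in one direction and down in another, which is why the gradient is zero there even though it is not a minimum.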

#### Escaping Saddle Points
Modern optimizers use various techniques:
- Momentum helps escape shallow local minima
- Noise injection helps explore
- Second-order methods use curvature information

---
# Section 4: Learning from Mistakes - Backpropagation

Think about the last time you made a recipe that didn't turn out quite right. What was your thought process? It probably went something like:

1. *First thought*: "Hmm, this cake is too dense and dry"
2. *Mental backtracking*: "Okay, what did I do differently this time?"
3. *Component analysis*: "Wait... I was low on oil and used a bit less"
4. *Impact realization*: "That's probably why it's dry... and probably made it dense too"
5. *Adjustment planning*: "Next time I'll use the full amount of oil"

This intuitive process of:
- Starting with the final result
- Working backwards to find causes
- Understanding how each change affected the outcome
- Planning proportional adjustments

...is exactly how backpropagation works in neural networks! Just as you naturally traced problems back to specific ingredients and their effects, backpropagation traces errors back to specific weights and their impacts.


Cooking Process | Neural Network Process
--------------------|---------------------
Taste final result | Measure prediction error
Think "what did I change?" | Check weights and activations
Connect changes to problems | Compute gradients
Plan proper adjustments | Update weights
Consider all effects | Account for all connections

This systematic process of learning from mistakes is what makes both good bakers and effective neural networks. Let's see how this works in practice...

**Backpropagation: Learning from Mistakes, Step-by-Step**

Imagine our neural network is trying to predict the price of a house (our "cake"). It takes inputs like square footage, number of bedrooms, location, etc. (the "ingredients"). These inputs are processed through layers of interconnected "neurons" (like steps in the recipe), each with its own "weight" (like the amount of each ingredient).

1.  **The Forward Pass (Making the Prediction):**
    *   The network takes the input features and calculates a prediction. This is like following the recipe.
    *   Each neuron receives inputs from the previous layer, multiplies them by their weights, adds a bias, and then applies an activation function (like the sigmoid, which we saw earlier). This process continues until we reach the final output layer, which produces the prediction (e.g., the predicted house price).

2.  **The Loss Function (Tasting the Cake):**
    *   We compare the network's prediction to the actual house price. The difference is the "error" – how far off we were.
    *   We use a "loss function" (like the Mean Squared Error) to quantify this error as a single number.  A larger loss means a bigger mistake.  This is like tasting the cake and deciding how "bad" it is.

3.  **Backpropagation (Figuring Out What Went Wrong):**
    *   This is where the "learning" happens. We need to figure out how much each weight in the network contributed to the error.
    *   We start at the output layer and work *backward* (hence "backpropagation").
    *   We use the *derivative* of the loss function (remember that from the code?). The derivative tells us how the loss changes as the prediction changes, which gives both the *direction* and the *magnitude* of the correction needed.  If the derivative is positive, the prediction is too high and should decrease; if it's negative, the prediction should increase.
    *   We also use the derivatives of the activation functions.  These tell us how sensitive each neuron's output is to changes in its input.
    *   The *chain rule* of calculus is the key here. It allows us to calculate the gradient (the direction and magnitude of change) for each weight, even those in the hidden layers, by combining the derivatives of the loss function and the activation functions.  Think of it like tracing the impact of using less oil, not just on the final cake's dryness, but also on how much the flour could bind, and *then* how that affected the final result.

4.  **Weight Updates (Adjusting the Recipe):**
    *   Once we have the gradients for each weight, we *update* the weights using a simple rule:
        `new_weight = old_weight - (learning_rate * gradient)`
    *   The `learning_rate` is a small number (like 0.01) that controls how much we adjust the weights in each step. We don't want to overcorrect!
    *   If the gradient is positive, it means increasing that weight would *increase* the loss, so we *decrease* the weight (by subtracting a small amount).
    *   If the gradient is negative, it means increasing that weight would *decrease* the loss, so we *increase* the weight.
    * We also update the biases in a similar way, using their gradients.

5.  **Iteration (Baking Again, and Again, and Again...):**
    *   We repeat this entire process (forward pass, loss calculation, backpropagation, weight updates) many times, using many different examples of houses and their prices.
    *   With each iteration, the network's weights are adjusted slightly, gradually reducing the error and improving the predictions.  This is like baking the cake many times, each time making small adjustments based on the previous results.
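
The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's own implementation: the network size (2 inputs, 2 hidden neurons, 1 output), the single made-up training example, and the learning rate of 0.1 are all assumptions chosen to keep the example small.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: one example with 2 (already scaled) features.
x = np.array([[0.5, -1.0]])
y_true = np.array([[1.0]])

# Assumed tiny network: 2 inputs -> 2 hidden neurons -> 1 output.
rng = np.random.default_rng(0)
w1 = rng.normal(size=(2, 2)) * 0.5
b1 = np.zeros((1, 2))
w2 = rng.normal(size=(2, 1)) * 0.5
b2 = np.zeros((1, 1))
learning_rate = 0.1

losses = []
for step in range(20):                      # 5. Iteration
    # 1. Forward pass
    z1 = x @ w1 + b1                        # weighted sum + bias
    a1 = sigmoid(z1)                        # hidden activations
    y_pred = a1 @ w2 + b2                   # linear output (regression)

    # 2. Loss (mean squared error)
    loss = np.mean((y_pred - y_true) ** 2)
    losses.append(loss)

    # 3. Backpropagation: apply the chain rule from the output backward
    d_pred = 2 * (y_pred - y_true) / y_true.size   # dLoss/dy_pred
    grad_w2 = a1.T @ d_pred
    grad_b2 = d_pred.sum(axis=0, keepdims=True)
    d_a1 = d_pred @ w2.T                    # error reaching the hidden layer
    d_z1 = d_a1 * a1 * (1 - a1)             # through the sigmoid
    grad_w1 = x.T @ d_z1
    grad_b1 = d_z1.sum(axis=0, keepdims=True)

    # 4. Weight updates: new_weight = old_weight - learning_rate * gradient
    w2 -= learning_rate * grad_w2
    b2 -= learning_rate * grad_b2
    w1 -= learning_rate * grad_w1
    b1 -= learning_rate * grad_b1

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

With only one example the network simply memorizes it, but the loss shrinking from step to step is the same mechanism that drives training on a full dataset.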

**In Summary: The Magic of Backpropagation**

Backpropagation is the efficient way to compute the gradients that "gradient descent" needs. Gradient descent itself is like rolling a ball down a hill (the "loss landscape"). The gradient tells us the direction of the steepest ascent, so we step the opposite way, and the learning rate controls how big of a step we take in that direction. The goal is to find the bottom of the hill, where the loss is minimized. The chain rule allows us to calculate the "slope" (gradient) even in very complex, multi-layered networks. This combination of a clear loss function, the chain rule for gradient calculation, and iterative weight updates is what allows neural networks to learn complex patterns from data.
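
One way to convince yourself that the chain rule really gives the slope of the loss is to compare it with a brute-force estimate: nudge the weight slightly and measure how the loss changes. The sketch below does this for a hypothetical one-weight "network" (`loss = (tanh(w * x) - y)^2`); the values of `w`, `x`, `y` and the learning rate are illustrative assumptions.

```python
import numpy as np

def loss(w, x, y):
    """Squared error of a one-weight network with a tanh activation."""
    return (np.tanh(w * x) - y) ** 2

def chain_rule_gradient(w, x, y):
    """dLoss/dw as a product of three local derivatives."""
    a = np.tanh(w * x)
    dloss_da = 2 * (a - y)      # loss with respect to the activation
    da_dz = 1 - a ** 2          # tanh with respect to its input
    dz_dw = x                   # weighted input with respect to the weight
    return dloss_da * da_dz * dz_dw

def numerical_gradient(w, x, y, eps=1e-6):
    """Central finite difference: slope measured by nudging w."""
    return (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)

w, x, y = 0.8, 1.5, 0.2         # assumed example values
g_chain = chain_rule_gradient(w, x, y)
g_numeric = numerical_gradient(w, x, y)
print(f"chain rule: {g_chain:.6f}, finite difference: {g_numeric:.6f}")

# Roll the "ball" downhill: repeated small steps against the gradient.
initial_loss = loss(w, x, y)
for _ in range(50):
    w -= 0.1 * chain_rule_gradient(w, x, y)
final_loss = loss(w, x, y)
print(f"loss before: {initial_loss:.4f}, after 50 steps: {final_loss:.6f}")
```

The finite-difference version needs two loss evaluations per weight, which is why it is only practical as a check; backpropagation gets all gradients from a single backward pass.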



In [4]:
# @title Backpropagation Code
import plotly.graph_objects as go
import numpy as np

def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

def backpropagation_visualization(input_size, hidden_size, output_size, learning_rate=0.01, epochs=50, batch_size=10, momentum=0.9, show_steps=False):
    """
    Visualizes backpropagation.  Includes batch training, momentum,
    adjustable scaling, epoch counter (in top-left corner), hover info,
    and optional step-by-step visualization.
    """

    # --- Initialization (small random biases) ---
    w1 = np.random.randn(input_size, hidden_size)
    b1 = np.random.randn(1, hidden_size) * 0.1
    w2 = np.random.randn(hidden_size, output_size)
    b2 = np.random.randn(1, output_size) * 0.1

    # Momentum terms
    v1 = np.zeros_like(w1)
    vb1 = np.zeros_like(b1)
    v2 = np.zeros_like(w2)
    vb2 = np.zeros_like(b2)

    input_nodes = [(0, i) for i in range(input_size)]
    hidden_nodes = [(1, i) for i in range(hidden_size)]
    output_nodes = [(2, i) for i in range(output_size)]
    all_nodes = input_nodes + hidden_nodes + output_nodes

    edges = []
    for i in range(input_size):
        for j in range(hidden_size):
            edges.append((input_nodes[i], hidden_nodes[j], w1[i, j]))
    for j in range(hidden_size):
        for k in range(output_size):
            edges.append((hidden_nodes[j], output_nodes[k], w2[j, k]))

    # --- Node and Edge Traces (with hover text) ---
    node_x = [x for x, y in all_nodes]
    node_y = [y for x, y in all_nodes]
    node_labels = [f'Input {i}' for i in range(input_size)] + \
                  [f'Hidden {i}' for i in range(hidden_size)] + \
                  [f'Output {i}' for i in range(output_size)]
    node_values = [0.0] * len(all_nodes)  # Initial values

    node_trace = go.Scatter(
        x=node_x,
        y=node_y,
        mode='markers+text',
        text=node_labels,
        textposition='bottom center',
        marker=dict(size=25, color='lightblue'),
        hoverinfo='text+name',
        name='Nodes',
        customdata=node_values,
        hovertemplate="<b>%{text}</b><br>Value: %{customdata:.2f}<extra></extra>"
    )

    edge_traces = []
    for (node1, node2, weight) in edges:
        edge_trace = go.Scatter(
            x=[node1[0], node2[0]],
            y=[node1[1], node2[1]],
            mode='lines',
            line=dict(width=abs(weight) * 1, color='gray'),  # Reduced scaling
            hoverinfo='text',
            hovertext=f"Weight: {weight:.3f}",
            showlegend=False,
            name="Edges"
        )
        edge_traces.append(edge_trace)

    # --- Initial Annotation (Epoch 0, Top-Left Corner) ---
    initial_annotation = go.layout.Annotation(
        x=0,  # Top-left corner
        y=max(input_size, hidden_size, output_size) - 0.5,  # Top, adjusted for node size
        xref="x",
        yref="y",
        text="Epoch: 0",
        showarrow=False,
        font=dict(size=16, color="black"),
        align="left" # Left-align the text
    )

    # --- Forward and Backward Pass (with batch training and momentum) ---
    frames = []
    for epoch in range(epochs):
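        # --- batch_size single-example updates per epoch ---
        # (each random example triggers its own weight update, so this is
        #  stochastic gradient descent rather than an averaged mini-batch)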
        for _ in range(batch_size):
            input_data = np.random.randn(1, input_size)
            target_output = np.array([[1 if i == 0 else 0 for i in range(output_size)]])

            # --- Forward Pass ---
            hidden_layer_input = np.dot(input_data, w1) + b1
            hidden_layer_output = tanh(hidden_layer_input)
            output_layer_input = np.dot(hidden_layer_output, w2) + b2
            predicted_output = tanh(output_layer_input)

            # --- Backward Pass ---
            output_error = target_output - predicted_output
            output_delta = output_error * tanh_derivative(output_layer_input)
            hidden_error = output_delta.dot(w2.T)
            hidden_delta = hidden_error * tanh_derivative(hidden_layer_input)

            # --- Weight Updates (with momentum) ---
            v2 = momentum * v2 + learning_rate * hidden_layer_output.T.dot(output_delta)
            vb2 = momentum * vb2 + learning_rate * np.sum(output_delta, axis=0, keepdims=True)
            v1 = momentum * v1 + learning_rate * input_data.T.dot(hidden_delta)
            vb1 = momentum * vb1 + learning_rate * np.sum(hidden_delta, axis=0, keepdims=True)

            w2 += v2
            b2 += vb2
            w1 += v1
            b1 += vb1
        # --- (Frame Creation - with step-by-step option) ---
        steps = []
        if show_steps:
            #0 Initialize input values
            for i in range (input_size):
                node_values[i] = input_data[0,i]
            steps.append(
              ("Input",
              w1.copy(), b1.copy(), w2.copy(), b2.copy(),
              f"Initialize Input: {input_data.flatten()}", node_values.copy())
            )
        if show_steps:
            # 1. Forward Pass: Input to Hidden
            steps.append(
                ("Forward (Input to Hidden)",
                 w1.copy(), b1.copy(), w2.copy(), b2.copy(),
                 f"Hidden Input (pre-activation): {hidden_layer_input.flatten()}", node_values.copy())
            )
        if show_steps:
            # 2. Forward Pass: Hidden Activation

            #Update hidden node values
            for i in range (hidden_size):
                node_values[input_size + i] = hidden_layer_output[0,i]
            steps.append(
              ("Forward (Hidden Activation)",
                w1.copy(), b1.copy(), w2.copy(), b2.copy(),
                f"Hidden Output (post-activation): {hidden_layer_output.flatten()}", node_values.copy())
            )
        if show_steps:
            # 3. Forward Pass: Hidden to Output
            steps.append(
                ("Forward (Hidden to Output)",
                 w1.copy(), b1.copy(), w2.copy(), b2.copy(),
                 f"Output Input (pre-activation): {output_layer_input.flatten()}", node_values.copy())
            )
        if show_steps:
            # 4. Forward Pass: Output Activation

            #Update output node values
            for i in range (output_size):
                node_values[input_size + hidden_size + i] = predicted_output[0,i]
            steps.append(
              ("Forward (Output Activation)",
                w1.copy(), b1.copy(), w2.copy(), b2.copy(),
                f"Predicted Output: {predicted_output.flatten()}", node_values.copy())
            )
            #Reset node values for next step
            node_values = [0.0] * len(all_nodes)
            for i in range (input_size):
                node_values[i] = input_data[0,i]
        if show_steps:
            # 5. Backward Pass: Output Error
            steps.append(
                ("Backward (Output Error)",
                 w1.copy(), b1.copy(), w2.copy(), b2.copy(),
                 f"Output Error: {output_error.flatten()}", node_values.copy())
            )
        if show_steps:
            # 6. Backward Pass: Hidden Error
            steps.append(
                ("Backward (Hidden Error)",
                 w1.copy(), b1.copy(), w2.copy(), b2.copy(),
                 f"Hidden Error: {hidden_error.flatten()}", node_values.copy())
            )
        if show_steps:
            # 7. Weight Updates
            steps.append(
                ("Weight Update",
                 w1.copy(), b1.copy(), w2.copy(), b2.copy(),
                 "Weights and biases updated", node_values.copy())
            )
        if not show_steps:
          steps.append(("Full Epoch", w1.copy(), b1.copy(), w2.copy(), b2.copy(), "Weights and biases updated", node_values.copy()))


        for step_name, w1_step, b1_step, w2_step, b2_step, step_description, node_vals in steps:
            frame_edge_traces = []
            edge_index = 0
            for i in range(input_size):
                for j in range(hidden_size):
                    updated_weight = w1_step[i, j]
                    color = 'green' if updated_weight > 0 else 'red'
                    frame_edge_traces.append(go.Scatter(x=[input_nodes[i][0], hidden_nodes[j][0]],
                                                      y=[input_nodes[i][1], hidden_nodes[j][1]],
                                                      mode='lines',
                                                      line=dict(width=abs(updated_weight) * 1, color=color),
                                                      hoverinfo='text',
                                                      hovertext=f"Weight: {updated_weight:.3f}",
                                                      showlegend=False,
                                                      name = "Edges"
                                                      ))
                    edge_index += 1

            for j in range(hidden_size):
                for k in range(output_size):
                    updated_weight = w2_step[j, k]
                    color = 'green' if updated_weight > 0 else 'red'
                    frame_edge_traces.append(go.Scatter(x=[hidden_nodes[j][0], output_nodes[k][0]],
                                                      y=[hidden_nodes[j][1], output_nodes[k][1]],
                                                      mode='lines',
                                                      line=dict(width=abs(updated_weight) * 1, color=color),
                                                      hoverinfo='text',
                                                      hovertext=f"Weight: {updated_weight:.3f}",
                                                      showlegend=False,
                                                      name="Edges"
                                                      ))
                    edge_index+=1
            frame_node_trace = go.Scatter(
                x=node_x,
                y=node_y,
                mode='markers+text',
                text=node_labels,
                textposition='bottom center',
                marker=dict(size=25, color='lightblue'),
                hoverinfo='text+name',
                name = 'Nodes',
                customdata=node_vals,
                hovertemplate="<b>%{text}</b><br>Value: %{customdata:.2f}<extra></extra>"
            )
            # --- Epoch Counter (Top-Left Corner) ---
            epoch_annotation = go.layout.Annotation(
                x=0,  # Top-left corner
                y=max(input_size, hidden_size, output_size)- 0.5,  # Top, adjusted
                xref="x",
                yref="y",
                text=f"Epoch: {epoch + 1}",
                showarrow=False,
                font=dict(size=16, color="black"),
                align="left"  # Left-align
            )

            # Step description (only if show_steps is True)
            if show_steps:
                step_annotation = go.layout.Annotation(
                    x=0, # Top-left corner
                    y=max(input_size, hidden_size, output_size)-1,  # Top, adjusted
                    xref="x",
                    yref="y",
                    text=f"Step: {step_name}",
                    showarrow=False,
                    font=dict(size=14, color="black"),
                    align="left" # Left align
                )
                description_annotation = go.layout.Annotation(
                    x=0,
                    y= -1,
                    xref = "x",
                    yref = "y",
                    text=step_description,
                    showarrow = False,
                    font=dict(size=12, color = "black"),
                    align="left" # Left align
                )
                # Add *both* annotations to the frame's layout
                frames.append(go.Frame(data=frame_edge_traces + [frame_node_trace],
                                        layout=go.Layout(annotations=[epoch_annotation, step_annotation, description_annotation]),
                                        name=f'Epoch {epoch + 1}, Step {step_name}'))
            else:
                # Only add the epoch annotation if not showing steps
                frames.append(go.Frame(data=frame_edge_traces + [frame_node_trace],
                                        layout=go.Layout(annotations=[epoch_annotation]),
                                        name=f'Epoch {epoch + 1}'))


    # --- Figure Creation ---
    fig = go.Figure(
        data=edge_traces + [node_trace],
        layout=go.Layout(
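            # `titlefont_size` is a deprecated spelling that newer Plotly versions reject;
            # the nested title dict works across versions.
            title=dict(text="Backpropagation Visualization", font=dict(size=16)),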
            showlegend=False,
            hovermode='closest',
            margin=dict(b=20, l=5, r=5, t=40),
            xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
            yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
            updatemenus=[dict(
                type="buttons",
                buttons=[dict(label="Play",
                              method="animate",
                              args=[None, {"frame": {"duration": 500, "redraw": True},
                                            "fromcurrent": True,
                                            "transition": {"duration": 300, "easing": "quadratic-in-out"}}]),
                         dict(label="Pause",
                              method="animate",
                              args=[[None], {"frame": {"duration": 0, "redraw": False},
                                              "mode": "immediate",
                                              "transition": {"duration": 0}}])])],
            annotations=[initial_annotation]  # Initial epoch
        ),
        frames=frames
    )

    return fig

# --- Example Usage ---
input_size = 2
hidden_size = 3
output_size = 2

# Slower, more progressive learning
fig = backpropagation_visualization(input_size, hidden_size, output_size, learning_rate=0.01, epochs=50, batch_size=10, momentum=0.9, show_steps=False)
fig.show()

# With step-by-step visualization
# fig_steps = backpropagation_visualization(input_size, hidden_size, output_size, learning_rate=0.05, epochs=10, batch_size=5, momentum=0.5, show_steps=True)
# fig_steps.show()

### Reading the backpropagation visualization

**Line Thickness:**

*   **Represents:** The *magnitude* (absolute value) of the connection's weight.  Think of it as the "strength" of the connection between two neurons.
*   **Thicker Line:**  Indicates a *larger* weight (either positive or negative).  A larger weight means that the signal passing through that connection has a *stronger* influence on the receiving neuron.  The connected neurons have a large effect on each other.
*   **Thinner Line:** Indicates a *smaller* weight.  A smaller weight means the signal has a *weaker* influence. The connected neurons have less of an effect on each other.
*   **Analogy:** Think of it like the volume of a speaker. A thicker line is like a louder connection; the signal is amplified. A thinner line is like a quieter connection; the signal is dampened.

**Line Color:**

*   **Represents:** The *sign* of the connection's weight (positive or negative). This tells us whether the connection is *excitatory* or *inhibitory*.
*   **Green Line:** Indicates a *positive* weight.  A positive weight means that if the sending neuron's activation is high, it will tend to *increase* the receiving neuron's activation. This is an *excitatory* connection – it encourages the receiving neuron to "fire."
*   **Red Line:** Indicates a *negative* weight.  A negative weight means that if the sending neuron's activation is high, it will tend to *decrease* the receiving neuron's activation. This is an *inhibitory* connection – it discourages the receiving neuron from "firing."
*   **Analogy:**
    *   **Green (Positive):**  Think of it like a "GO" signal or a "YES" vote. The sending neuron is encouraging the receiving neuron.
    *   **Red (Negative):** Think of it like a "STOP" signal or a "NO" vote. The sending neuron is discouraging the receiving neuron.

**Putting it Together: Color and Thickness in Backpropagation**

During training, the thickness (and occasionally the color) of the lines changes to show how the network is learning. Color is binary in this plot: once the animation plays, every line is either green or red. Gray appears only in the initial view shown before you press Play.

*   **Connection Strengthening:** If a weight moves *away from zero*, the line becomes thicker. A positive weight that grows stays green; a negative weight that becomes more negative stays red. In either case, the connection is becoming *stronger*, but in opposite directions (excitatory vs. inhibitory).
*   **Connection Weakening:** If a connection's weight is getting *closer to zero* (regardless of whether it started positive or negative), the line will become *thinner*. This means the connection is becoming *weaker* and having less influence. If the weight crosses zero, the line flips color.

**Example Scenarios:**

1.  **A line starts thin, then becomes thick and green:** The connection was initially weak (close to zero weight).  During training, the network learned that this connection is important for a *positive* contribution to the correct answer. The weight increased, making the connection stronger and excitatory.

2.  **A line starts thick and red, then becomes thin:** The connection was initially strong and *inhibitory*.  During training, the network learned that this connection was hindering the correct answer. The weight became less negative (closer to zero), making the connection weaker. The line stays red unless the weight crosses zero.

3.  **A line starts thin, then becomes thick and red:** The connection started weak. The network learned this connection is important, but for an *inhibitory* contribution. The weight became more negative.

4.  **A line starts thick and green and gets even thicker:** This connection was already strong and excitatory, and training reinforced its importance, making it even stronger.

In essence, the color and thickness provide a visual representation of the "learning" process. They show how the network is adjusting the strengths and types of connections between its neurons to improve its predictions. The changes are often subtle, especially at the beginning of training.
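
The drawing rule behind the animation is small enough to state as code. The helper below is a hypothetical stand-alone restatement of that rule (the name `edge_style` and the `scale` parameter are not part of the notebook): width follows the weight's absolute value and color follows its sign.

```python
def edge_style(weight, scale=1.0):
    """Return the line style for one connection.

    Width is proportional to |weight|; color is green for positive
    weights and red otherwise (so a weight of exactly 0 is drawn red).
    """
    color = 'green' if weight > 0 else 'red'
    return {'width': abs(weight) * scale, 'color': color}

# A few illustrative weights: weak/strong, positive/negative.
for w in (0.05, 1.8, -0.1, -2.3):
    style = edge_style(w)
    print(f"weight {w:+.2f} -> width {style['width']:.2f}, color {style['color']}")
```

Because color depends only on the sign, a line never fades between green and red; it switches at the moment a weight crosses zero.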


---
# Section 5: The Complete Training Loop

## Putting It All Together

We've learned how neural networks:
- Convert inputs to numbers (Section 1)
- Measure their mistakes (Section 2)
- Take steps to improve (Section 3)
- Track down which weights to change (Section 4)

Now let's see how these pieces work together in a complete training loop. Think of it like practicing a skill - you try, measure how you did, figure out what to improve, make adjustments, and repeat until you get better.

## The Basic Training Loop

Here's the full procedure. The body of the inner loop is one complete training iteration:

```
PROCEDURE TrainNeuralNetwork(network, training_examples, learning_rate, number_of_iterations):
  FOR each iteration FROM 1 TO number_of_iterations:
    FOR each (input, correct_answer) pair IN training_examples:

      // 1. Show and Guess (Forward Pass)
      Pass the input through the network's layers.
      Each layer transforms the information.
      prediction = the network's final guess.

      // 2. Check the Answer (Calculate Loss)
      Compare the prediction to the correct answer.
      error_amount = how wrong the prediction was.

      // 3. Blame Assignment (Backward Pass)
      Trace the error back through the network.
      Determine how much each connection contributed to the error.

      // 4. Adjust and Improve (Update Weights)
      FOR each connection IN network:
        // gradient = how much the error grows when this weight increases
        connection.weight = connection.weight - learning_rate * gradient
        // Positive gradient: the weight is decreased.
        // Negative gradient: the weight is increased.
        // The size of the adjustment is controlled by the learning rate.
      END FOR

      // 5. Report the error
      DISPLAY error_amount

    END FOR
  END FOR
END PROCEDURE
```

In [5]:
import numpy as np  # Import the NumPy library for numerical operations (like working with arrays)
import plotly.graph_objects as go  # Import Plotly for creating interactive graphs
import plotly.express as px  # Import Plotly Express for easier graph creation
import pandas as pd  # Import Pandas for data manipulation (used for creating a DataFrame for plotting)

def sigmoid(x):
    """
    The sigmoid function, also known as the "S" curve.  It squashes any input value
    to a value between 0 and 1.  This is useful for turning network outputs into
    probabilities (though we don't use it on the output layer for this *regression* problem).
    """
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
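    """
    The derivative of the sigmoid function, written in terms of the sigmoid's
    *output*: if s = sigmoid(z), then ds/dz = s * (1 - s).  So `x` here must be
    an already-activated value (e.g. hidden_layer_output), not the raw input z.
    Backpropagation uses this to measure how sensitive each neuron's output is
    to changes in its input.
    """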
    return x * (1 - x)

def mse_loss(y_true, y_predicted):
    """
    Calculates the Mean Squared Error (MSE) loss.  This is a common way to measure
    how well the network is doing.  It's the average of the squared differences
    between the network's predictions and the actual target values.  Lower MSE is better.
    """
    return np.mean((y_true - y_predicted) ** 2)

def mse_loss_derivative(y_true, y_predicted):
    """
    The derivative of the MSE loss function.  This is also needed for backpropagation.
    It tells us how much the loss changes with respect to the network's predictions.
    """
    return 2 * (y_predicted - y_true) / y_true.size  # y_true.size = number of target values (one per sample here)


def train_network(X, y, hidden_size, epochs, learning_rate):
    """
    Trains a simple neural network with one hidden layer.

    Args:
        X: The input data.  Each row is a sample, and each column is a feature.
        y: The target data (the correct answers we want the network to learn).
        hidden_size: The number of neurons (think of them as processing units) in the hidden layer.
        epochs: The number of times the network will see the entire training dataset.
        learning_rate:  A small number that controls how much the network's weights
                       and biases are adjusted during each training step.  Too large,
                       and the network might overshoot the best values; too small,
                       and training will be very slow.

    Returns:
        A tuple containing the trained weights and biases, and a history of the
        training process (including the loss at each epoch and predictions at
        specific epochs).
    """
    input_size = X.shape[1]  # Number of input features
    output_size = y.shape[1]  # Number of output values (1 for regression)

    # Initialize weights and biases randomly.  Weights are the "strengths" of
    # connections between neurons, and biases are like thresholds for activation.
    # Xavier/Glorot initialization helps to prevent training problems.
    weights_input_hidden = np.random.randn(input_size, hidden_size) * np.sqrt(2 / (input_size + hidden_size))
    biases_hidden = np.zeros((1, hidden_size))  # Start biases at zero
    weights_hidden_output = np.random.randn(hidden_size, output_size) * np.sqrt(2 / (hidden_size + output_size))
    biases_output = np.zeros((1, output_size))

    history = {'loss': [], 'predictions': {}}  # Keep track of the loss and some predictions

    for epoch in range(epochs):  # Loop through the training data multiple times
        # --- Forward Pass (Make a Prediction) ---

        # 1. Input to Hidden Layer:
        hidden_layer_input = np.dot(X, weights_input_hidden) + biases_hidden  # Multiply inputs by weights, add biases
        hidden_layer_output = sigmoid(hidden_layer_input)  # Apply the sigmoid activation function

        # 2. Hidden Layer to Output:
        output_layer_input = np.dot(hidden_layer_output, weights_hidden_output) + biases_output
        predicted_output = output_layer_input  # NO activation function on the output for regression

        # --- Calculate Loss (How Bad Was the Prediction?) ---
        loss = mse_loss(y, predicted_output)  # Calculate the Mean Squared Error
        history['loss'].append(loss)  # Record the loss

        # --- Store Predictions (for later plotting) ---
        if epoch in [5, 100, epochs - 1]:  # Save predictions at a few key epochs
            x_range = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)  # Create a range of x-values for plotting
            #  Calculate the network's predictions for those x-values *at this epoch*:
            hidden_layer_input_range = np.dot(x_range, weights_input_hidden) + biases_hidden
            hidden_layer_output_range = sigmoid(hidden_layer_input_range)
            output_layer_input_range = np.dot(hidden_layer_output_range, weights_hidden_output) + biases_output
            history['predictions'][epoch] = output_layer_input_range.flatten()  # Store the predictions

        # --- Backward Pass (Learn from Mistakes - Backpropagation) ---

        # Gradient of the loss with respect to the network's predictions:
        output_error = mse_loss_derivative(y, predicted_output)  # Derivative of the loss

        # Chain rule: push that gradient back through the output weights, then
        # through the sigmoid, to get the gradient at the hidden layer:
        hidden_error = np.dot(output_error, weights_hidden_output.T) * sigmoid_derivative(hidden_layer_output)

        # --- Update Weights and Biases (Gradient Descent) ---

        # Adjust the weights and biases based on the error and the learning rate:
        weights_hidden_output -= learning_rate * np.dot(hidden_layer_output.T, output_error)
        biases_output -= learning_rate * np.sum(output_error, axis=0, keepdims=True)
        weights_input_hidden -= learning_rate * np.dot(X.T, hidden_error)
        biases_hidden -= learning_rate * np.sum(hidden_error, axis=0, keepdims=True)

        # Print the loss at epochs 1 and 100, then every 1000 epochs, to monitor training progress
        if (epoch + 1) % 1000 == 0 or (epoch + 1) in [1, 100]:
            print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss:.4f}')

    return weights_input_hidden, biases_hidden, weights_hidden_output, biases_output, history


# --- Generate Synthetic Data (Make Up Some Data for Demonstration) ---
np.random.seed(42)  # Set a "seed" for reproducibility (same random numbers each time)
X = np.random.rand(100, 1) * 10  # 100 data points, each with 1 feature (value between 0 and 10)
true_slope = 2  # The real slope we want the network to learn
true_intercept = 1  # The real intercept we want the network to learn
noise = np.random.randn(100, 1) * 2  # Add some random noise to make it more realistic
y = true_slope * X + true_intercept + noise  # Create the target values (y = mx + b + noise)

# --- Standardize Input Data (Make the Input Data Have Zero Mean and Unit Variance) ---
# This often helps the network learn more easily.
X = (X - X.mean()) / X.std()

# --- Hyperparameters (Settings for the Training Process) ---
learning_rate = 0.01  # Set the learning rate
epochs = 5000  # Set the number of training epochs
hidden_size = 4  # Set the number of neurons in the hidden layer

# --- Train the Network ---
w_ih, b_h, w_ho, b_o, history = train_network(X, y, hidden_size, epochs, learning_rate)

# --- Plotting the Loss Curve (Show How the Error Decreases Over Time) ---
fig_loss = go.Figure(data=[go.Scatter(y=history['loss'], mode='lines')])
fig_loss.update_layout(title='Training Loss Curve', xaxis_title='Epoch', yaxis_title='Mean Squared Error Loss')
fig_loss.show()

# --- Plotting the Predictions and Data (Show How Well the Network Fits the Data) ---
x_range = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)  # Create a range of x-values for plotting
df = pd.DataFrame({'X': X.flatten(), 'y': y.flatten()})  # Create a Pandas DataFrame for Plotly Express
fig_scatter = px.scatter(df, x='X', y='y', title='Data Points and Network Prediction', labels={'y': 'True Y', 'X': 'X'})

# Add predictions from different epochs to the plot
for epoch, predictions in history['predictions'].items():
    fig_scatter.add_trace(go.Scatter(x=x_range.flatten(), y=predictions, mode='lines', name=f'Epoch {epoch+1}'))

fig_scatter.show()  # Display the plot

print(f"Final Loss: {history['loss'][-1]:.4f}")  # Print the final loss value

Epoch [1/5000], Loss: 148.6393
Epoch [100/5000], Loss: 6.1764
Epoch [1000/5000], Loss: 3.3699
Epoch [2000/5000], Loss: 3.2312
Epoch [3000/5000], Loss: 3.1658
Epoch [4000/5000], Loss: 3.1241
Epoch [5000/5000], Loss: 3.0965


Final Loss: 3.0965
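
Why does the loss level off instead of heading to zero? The synthetic targets include random noise with standard deviation 2 (variance 4), and no model can predict that noise. A rough sanity check, sketched below, is to regenerate the same data and compare an ordinary least-squares line with the noise itself. Using `np.polyfit` as the baseline is a choice made for this illustration, not something the notebook does.

```python
import numpy as np

# Regenerate the same synthetic data as the training cell (same seed, same order).
np.random.seed(42)
X = np.random.rand(100, 1) * 10
noise = np.random.randn(100, 1) * 2
y = 2 * X + 1 + noise

# Baseline: the best-fitting straight line (ordinary least squares).
slope, intercept = np.polyfit(X.flatten(), y.flatten(), deg=1)
line_mse = np.mean((slope * X + intercept - y) ** 2)

# The error of the *true* line y = 2x + 1 is just the mean squared noise.
noise_mse = np.mean(noise ** 2)

print(f"fitted line: y = {slope:.2f}x + {intercept:.2f}")
print(f"MSE of fitted line: {line_mse:.4f}")
print(f"mean squared noise: {noise_mse:.4f}")
```

Standardizing `X`, as the training cell does, rescales the inputs but does not change how well a line can fit, so these numbers are comparable with the network's loss. A training loss that flattens near this level suggests the network has captured the underlying trend and what remains is mostly noise.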


## [Optional Deep Dive] Training Optimization



### 1. Batch Processing
Instead of training on one example at a time, we use batches:

```python
class BatchTrainer:
    def __init__(self, model, batch_size=32):
        self.model = model
        self.batch_size = batch_size
    
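    def train_epoch(self, data, targets):
        """Train on the entire dataset in batches and return the mean batch loss."""
        total_loss = 0.0
        num_batches = 0

        # Step through the data in strides of batch_size. The final batch
        # may be smaller, so no examples are dropped.
        for batch_start in range(0, len(data), self.batch_size):
            batch_end = batch_start + self.batch_size
            batch_data = data[batch_start:batch_end]
            batch_targets = targets[batch_start:batch_end]

            # Train on batch (assumes the model exposes train_one_iteration
            # and that it returns the loss for this batch)
            loss = self.model.train_one_iteration(
                batch_data,
                batch_targets
            )
            total_loss += loss
            num_batches += 1

        # Guard against an empty dataset
        return total_loss / max(num_batches, 1)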
```
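Batches are usually drawn in a different random order each epoch so the model does not see the same groupings every time. The sketch below illustrates that idea; `iterate_minibatches`, its `seed` parameter, and the toy arrays are hypothetical names introduced for this example and are not part of the notebook's training code.

```python
import numpy as np

def iterate_minibatches(data, targets, batch_size=32, shuffle=True, seed=0):
    """Yield (batch_data, batch_targets) pairs that cover every example once."""
    rng = np.random.default_rng(seed)
    indices = np.arange(len(data))
    if shuffle:
        rng.shuffle(indices)  # Visit the examples in a random order
    for start in range(0, len(data), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield data[batch_idx], targets[batch_idx]

# Toy data (assumed for illustration): 10 examples, batch size 4
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2 * X_demo
batch_sizes = [len(xb) for xb, _ in iterate_minibatches(X_demo, y_demo, batch_size=4)]
print(batch_sizes)  # [4, 4, 2]
```

The last batch holds the two leftover examples, so every data point contributes to the epoch.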

### 2. Learning Rate Scheduling

Adjust the learning rate during training, starting with larger steps and shrinking them as the loss settles:

```python
import numpy as np


class LRScheduler:
    def __init__(self, initial_lr=0.01):
        self.initial_lr = initial_lr

    def step_decay(self, epoch):
        """Halve the learning rate every `epochs_drop` epochs"""
        drop_rate = 0.5      # Multiply the learning rate by this factor at each drop
        epochs_drop = 10.0   # Number of epochs between drops
        return self.initial_lr * np.power(
            drop_rate,
            np.floor((1 + epoch) / epochs_drop)
        )

    def exponential_decay(self, epoch):
        """Shrink the learning rate by 5% every epoch"""
        decay_rate = 0.95
        return self.initial_lr * np.power(decay_rate, epoch)
```
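To see how the two schedules differ, it helps to print a few values. The snippet below is a minimal standalone sketch that restates the same two formulas as plain functions (the function form and default arguments are assumptions made so the snippet runs on its own).

```python
import math

def step_decay(epoch, initial_lr=0.01, drop_rate=0.5, epochs_drop=10):
    """Same formula as LRScheduler.step_decay: halve the rate every 10 epochs."""
    return initial_lr * drop_rate ** math.floor((1 + epoch) / epochs_drop)

def exponential_decay(epoch, initial_lr=0.01, decay_rate=0.95):
    """Same formula as LRScheduler.exponential_decay: shrink 5% per epoch."""
    return initial_lr * decay_rate ** epoch

# Compare the two schedules at a few epochs (epochs are counted from 0)
for epoch in (0, 9, 19, 29):
    print(f"epoch {epoch:2d}  step={step_decay(epoch):.6f}  exp={exponential_decay(epoch):.6f}")
```

Step decay holds the rate at 0.01 until epoch 9, then drops it to 0.005, 0.0025, and 0.00125 at epochs 9, 19, and 29. Exponential decay shrinks it a little on every epoch instead.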

### 3. Early Stopping

Prevent overfitting by stopping once the validation loss has failed to improve by at least `min_delta` for `patience` consecutive epochs:

```python
class EarlyStopping:
    def __init__(self, patience=5, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = None
        self.counter = 0
        
    def should_stop(self, current_loss):
        """Determine if training should stop"""
        if self.best_loss is None:
            self.best_loss = current_loss
            return False
        
        if current_loss < self.best_loss - self.min_delta:
            # Loss is improving
            self.best_loss = current_loss
            self.counter = 0
            return False
        
        self.counter += 1
        return self.counter >= self.patience
```
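The stopping rule is easiest to follow on a concrete sequence of losses. The sketch below applies the same patience and `min_delta` logic to a made-up list of validation losses; `find_stopping_epoch` and the loss values are hypothetical and exist only for this illustration.

```python
def find_stopping_epoch(val_losses, patience=5, min_delta=0.001):
    """Return the epoch index at which training would stop, or None if it never does."""
    best_loss = None
    counter = 0
    for epoch, loss in enumerate(val_losses):
        if best_loss is None or loss < best_loss - min_delta:
            # Meaningful improvement: remember it and reset the patience counter
            best_loss = loss
            counter = 0
            continue
        counter += 1  # No meaningful improvement this epoch
        if counter >= patience:
            return epoch
    return None

# Made-up validation losses: improvement stalls after the third epoch
simulated_val_losses = [1.0, 0.8, 0.7, 0.71, 0.705, 0.72, 0.69]
stop_epoch = find_stopping_epoch(simulated_val_losses, patience=3)
print(stop_epoch)  # 5
```

The best loss (0.7) is reached at epoch 2. Epochs 3, 4, and 5 fail to beat it by `min_delta`, so with `patience=3` training stops at epoch 5 and never sees the later 0.69. A larger patience trades extra compute for a chance to catch late improvements like that one.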

### 4. Complete Training Pipeline

Putting it all together: one loop that trains in batches, validates, schedules the learning rate, and stops early:

```python
class TrainingPipeline:
    def __init__(self, model, num_epochs=100, batch_size=32):
        self.model = model
        self.num_epochs = num_epochs
        self.batch_size = batch_size
        self.monitor = TrainingMonitor()
        self.scheduler = LRScheduler()
        self.early_stopping = EarlyStopping()
    
    def train(self, train_data, train_targets, val_data, val_targets):
        """Complete training process with monitoring"""
        for epoch in range(self.num_epochs):
            # Set the learning rate for this epoch before training,
            # so that train_epoch can actually use it
            self.current_lr = self.scheduler.step_decay(epoch)

            # Train one epoch
            train_loss = self.train_epoch(train_data, train_targets)

            # Validate on data the model has not trained on
            val_loss = self.validate(val_data, val_targets)

            # Monitor progress
            self.monitor.track_progress(
                epoch, train_loss, val_loss, self.current_lr
            )

            # Check for early stopping
            if self.early_stopping.should_stop(val_loss):
                print(f"Training stopped at epoch {epoch}")
                break

    def train_epoch(self, data, targets):
        """Train on entire dataset and return the training loss"""
        # Implementation details...
        raise NotImplementedError

    def validate(self, data, targets):
        """Compute and return the loss on holdout data (no weight updates)"""
        # Implementation details...
        raise NotImplementedError
```
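The validation step is left open above. One possible way to fill it in, sketched under assumptions, is to measure mean squared error on the holdout set without touching the weights. Here `mse_on_holdout`, `toy_predict`, and the small arrays are hypothetical stand-ins, not the notebook's actual model or data.

```python
import numpy as np

def mse_on_holdout(predict_fn, val_data, val_targets):
    """Mean squared error of predict_fn on holdout data; no weights are updated."""
    predictions = predict_fn(val_data)
    return float(np.mean((predictions - val_targets) ** 2))

def toy_predict(x):
    """Stand-in for a trained model's forward pass (assumed for illustration)."""
    return 2.0 * x

# Holdout examples the stand-in model never trained on
val_X = np.array([[0.0], [1.0], [2.0]])
val_y = np.array([[1.0], [3.0], [5.0]])

val_loss = mse_on_holdout(toy_predict, val_X, val_y)
print(val_loss)  # 1.0
```

Every prediction is off by exactly 1, so the mean squared error is 1.0. A number like this is what an early stopping check would watch from epoch to epoch.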