<div style="text-align: center;">

# **Spring 2026 &mdash; CIS 3813<br>Advanced Data Science<br>(Introduction to Machine Learning)**
### Week 2: How Models Learn

</div>

**Date:** 02 February 2026  
**Time:** 6:00–9:00 PM  
**Instructor:** Dr. Patrick T. Marsh  
**Course Verse:** "He has shown you, O mortal, what is good. And what does the Lord require of you? To act justly and to love mercy and to walk humbly with your God."  &mdash; *Micah 6:8 (NIV)*

---
## **Week 2 Learning Objectives**

By the end of this lecture, you will be able to:

1. **Explain** the concept of a loss function and why it matters for model training
2. **Describe** how gradient descent uses derivatives to find optimal parameters
3. **Visualize** the gradient descent process in 1D and 2D
4. **Implement** a basic gradient descent algorithm from scratch
5. **Identify** the role of the learning rate in convergence
---


## **Today's Outline**
- Lecture  
    1. Review of Last Week  
    2. The Learning Problem  
    3. Gradient Descent  
    4. "Animated" Gradient Descent
    5. Connection to Scikit-Learn
    6. Reflection Question
- Break (10-15 Minutes)
- Lab (or Homework)
- Review
    1. Key Concepts Summary
    2. Preview: Next Week


---

## **Opening Reflection**

> *"I press on toward the goal for the prize of the upward call of God in Christ Jesus."*  
> — **Philippians 3:14 (ESV)**

Today we learn how machine learning models "press on toward the goal" of finding optimal solutions. Just as Paul describes his spiritual journey as a continuous pursuit—not arriving instantly but pressing forward step by step—gradient descent algorithms take iterative steps toward their objective. Each step brings the model closer to the goal, even when the path isn't immediately clear.

In our Christian walk, we don't achieve perfection instantaneously; we grow through persistent effort, guided by wisdom. Similarly, our models learn through persistent iteration, guided by mathematics. This week, we'll see how this process works under the hood.


---

## **1.1 Review of Last Week**

### **1.1.1 What is a Function?**

A **function** maps inputs to outputs. In machine learning:
- **Input**: Features (data) + Parameters (weights)
- **Output**: Predictions or loss values

For example, a simple linear model:

$$\hat{y} = mx + b$$

Where:
- $x$ is our input feature
- $m$ is the slope (weight)
- $b$ is the intercept (bias)
- $\hat{y}$ is our prediction

In [None]:
# Let's start with our imports
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D


In [None]:
# Example: A simple linear function
def linear_function(x, m=2, b=1):
    """A simple linear function: y = mx + b"""
    return m * x + b

# Let's visualize it
x = np.linspace(-5, 5, 100)
y = linear_function(x)

plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', linewidth=2, label='$y = 2x + 1$')
plt.axhline(y=0, color='k', linewidth=0.5)
plt.axvline(x=0, color='k', linewidth=0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.title('A Simple Linear Function')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

### **1.1.2 What is Slope? (The Derivative)**

The **slope** tells us how steep a function is at any point, and **in which direction** it's increasing.

For a linear function $y = mx + b$:
- The slope is constant: $\frac{dy}{dx} = m$

For non-linear functions, the slope changes at every point. The **derivative** gives us the instantaneous rate of change.

#### **Key Insight for Machine Learning:**
- **Positive slope**: Function is increasing → move left to decrease
- **Negative slope**: Function is decreasing → move right to decrease
- **Zero slope**: We're at a critical point! (Potentially a minimum, a maximum, or a saddle point. Thanks, Travis!)
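To make the sign rule above concrete, here is a minimal sketch that estimates the slope numerically with a central difference and uses its sign to pick a direction. The helper names `numerical_slope` and `square` and the step size `h` are illustrative assumptions, not part of the course code.

```python
def numerical_slope(func, x, h=1e-5):
    """Estimate the slope of func at x with a central difference (illustrative helper)."""
    return (func(x + h) - func(x - h)) / (2 * h)

def square(x):
    """Toy function f(x) = x^2, whose exact derivative is 2x."""
    return x ** 2

for x0 in [-2.0, 0.0, 2.0]:
    slope = numerical_slope(square, x0)
    if slope > 1e-8:
        direction = "move left to decrease"
    elif slope < -1e-8:
        direction = "move right to decrease"
    else:
        direction = "critical point"
    print(f"x = {x0:+.1f}: slope ≈ {slope:+.4f} -> {direction}")
```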

In [None]:
# Let's visualize slope with a quadratic function
def quadratic(x):
    """A simple quadratic: f(x) = x^2"""
    return x ** 2

def quadratic_derivative(x):
    """Derivative of x^2 is 2x"""
    return 2 * x

x = np.linspace(-3, 3, 100)
y = quadratic(x)

# Plot the function
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: the function with tangent lines
axes[0].plot(x, y, 'b-', linewidth=2, label='$f(x) = x^2$')

# Add tangent lines at a few points
points = [-2, 0, 2]
colors = ['red', 'green', 'purple']
for pt, color in zip(points, colors):
    slope = quadratic_derivative(pt)
    y_pt = quadratic(pt)
    # Tangent line: y - y_pt = slope * (x - pt)
    tangent_x = np.linspace(pt - 1, pt + 1, 50)
    tangent_y = slope * (tangent_x - pt) + y_pt
    axes[0].plot(tangent_x, tangent_y, color=color, linestyle='--',
                 label=f'Tangent at x={pt}, slope={slope}')
    axes[0].scatter([pt], [y_pt], color=color, s=100, zorder=5)

axes[0].set_xlabel('x')
axes[0].set_ylabel('f(x)')
axes[0].set_title('Quadratic Function with Tangent Lines')
axes[0].legend()
axes[0].set_ylim(-2, 10)

# Right plot: the derivative
axes[1].plot(x, quadratic_derivative(x), 'r-', linewidth=2, label="$f'(x) = 2x$")
axes[1].axhline(y=0, color='k', linewidth=0.5)
axes[1].axvline(x=0, color='k', linewidth=0.5)
for pt, color in zip(points, colors):
    axes[1].scatter([pt], [quadratic_derivative(pt)], color=color, s=100, zorder=5)
axes[1].set_xlabel('x')
axes[1].set_ylabel("f'(x)")
axes[1].set_title('The Derivative (Slope at Each Point)')
axes[1].legend()

plt.tight_layout()
plt.show()

### Discussion Question

Look at the right plot above. At what x-value is the slope zero? What does this correspond to on the left plot?

---
## **1.2 The Learning Problem**

### **1.2.1 What Does "Learning" Mean?**

In machine learning, "learning" means finding the **best parameters** for our model.

**Example**: For a linear model $\hat{y} = mx + b$:
- We have data: $(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)$
- We want to find the values of $m$ and $b$ that make our predictions $\hat{y}$ as close to the actual $y$ values as possible

But how do we measure "close"?

In [None]:
# Let's create some sample data
np.random.seed(42)  # For reproducibility

# True relationship: y = 3x + 2 + noise
true_m = 3
true_b = 2
n_samples = 50

X = np.random.uniform(-5, 5, n_samples)
y = true_m * X + true_b + np.random.normal(0, 2, n_samples)

# Visualize the data
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.7, label='Data points')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Our Dataset (True relationship: y = 3x + 2 + noise)')
plt.legend()
plt.show()

### **1.2.2 The Loss Function (Cost Function)**

A **loss function** measures how wrong our predictions are. The most common choice for regression is **Mean Squared Error (MSE)**:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - (mx_i + b))^2$$

#### **Why squared error?**
1. **Penalizes large errors more** than small errors
2. **Differentiable** everywhere (smooth curve)
3. **Never negative** (no cancellation of positive and negative errors; it is zero only for a perfect fit)
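A small worked example may help with the first and third points. The numbers below are made-up toy values chosen so the arithmetic is easy to check by hand; they are not from the lecture's dataset.

```python
import numpy as np

# Hypothetical toy values: two predictions miss by +2 and -2.
y_true = np.array([5.0, 5.0])
y_pred = np.array([3.0, 7.0])
errors = y_true - y_pred          # [ 2., -2.]

mean_error = np.mean(errors)      # signed errors cancel out
mse = np.mean(errors ** 2)        # squared errors cannot cancel

# Same total absolute error (4), distributed differently
one_big = np.array([4.0, 0.0])    # one large miss
two_small = np.array([2.0, 2.0])  # two smaller misses

print(f"Mean error: {mean_error:.1f}")
print(f"MSE:        {mse:.1f}")
print(f"MSE, one miss of 4:   {np.mean(one_big ** 2):.1f}")
print(f"MSE, two misses of 2: {np.mean(two_small ** 2):.1f}")
```

The signed errors average to zero even though both predictions are wrong, while the MSE is 4. The single miss of 4 gives an MSE of 8, twice that of two misses of 2, which is the "penalizes large errors more" property.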

In [None]:
def compute_mse(m, b, X, y):
    """Compute Mean Squared Error for given parameters."""
    predictions = m * X + b
    errors = y - predictions
    mse = np.mean(errors ** 2)
    return mse

# Let's see how MSE changes with different parameter choices
params = [(true_m, true_b, 'True Parameters'), (1, 0, 'Poor Fit #1'),
          (5, 5, 'Poor Fit #2'), (0, 0, 'Terrible Fit')]

print(f"True parameters: m={params[0][0]}, b={params[0][1]}")
print(f"MSE with true parameters: {compute_mse(params[0][0], params[0][1], X, y):.4f}")
print(f"MSE with m=1, b=0: {compute_mse(params[1][0], params[1][1], X, y):.4f}")
print(f"MSE with m=5, b=5: {compute_mse(params[2][0], params[2][1], X, y):.4f}")
print(f"MSE with m=0, b=0: {compute_mse(params[3][0], params[3][1], X, y):.4f}")

In [None]:
# Visualize different model fits
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
x_line = np.linspace(-5, 5, 100)

for ax, (m, b, title) in zip(axes.flat, params):
    ax.scatter(X, y, alpha=0.5, label='Data')
    ax.plot(x_line, m * x_line + b, 'r-', linewidth=2, label=f'y = {m}x + {b}')
    mse = compute_mse(m, b, X, y)
    ax.set_title(f'{title}\nMSE = {mse:.2f}')
    ax.set_xlabel('X')
    ax.set_ylabel('y')
    ax.legend()
    ax.set_xlim(-6, 6)
    ax.set_ylim(-20, 25)

plt.tight_layout()
plt.show()

### **1.2.3 The Loss Landscape**

For a model with parameters $m$ and $b$, the loss function creates a **surface** over the parameter space.

Our goal: Find the lowest point on this surface!

In [None]:
# Create a grid of m and b values
m_range = np.linspace(-2, 8, 100)
b_range = np.linspace(-5, 10, 100)
M, B = np.meshgrid(m_range, b_range)

# Compute MSE for each combination
Z = np.zeros_like(M)
for i in range(M.shape[0]):
    for j in range(M.shape[1]):
        Z[i, j] = compute_mse(M[i, j], B[i, j], X, y)

# Create 3D visualization
fig = plt.figure(figsize=(15, 5))

# 3D Surface
ax1 = fig.add_subplot(121, projection='3d')
surf = ax1.plot_surface(M, B, Z, cmap=cm.viridis, alpha=0.8)
ax1.scatter([true_m], [true_b], [compute_mse(true_m, true_b, X, y)],
            color='red', s=200, label='True parameters')
ax1.set_xlabel('m (slope)')
ax1.set_ylabel('b (intercept)')
ax1.set_zlabel('MSE', rotation=90)
ax1.set_title('Loss Landscape (3D View)')
# ax1.view_init(elev=0, azim=-30, roll=0)

# Contour plot
ax2 = fig.add_subplot(122)
contour = ax2.contour(M, B, Z, levels=30, cmap=cm.viridis)
ax2.scatter([true_m], [true_b], color='red', s=200, marker='*',
            label='True parameters', zorder=5)
ax2.set_xlabel('m (slope)')
ax2.set_ylabel('b (intercept)')
ax2.set_title('Loss Landscape (Contour View)')
ax2.legend()
plt.colorbar(contour, ax=ax2, label='MSE')

plt.tight_layout()
plt.show()

#### **Key Observation**

The loss surface looks like a **bowl**! There's a clear minimum at the bottom. Our goal is to find this minimum.

But how do we find it without trying every possible combination of parameters?

---

## **1.3 Gradient Descent**

### **1.3.1 The Core Idea**

Imagine you're blindfolded on a hilly landscape, trying to find the lowest point.

**Strategy**: Feel the slope beneath your feet. Take a step in the downhill direction. Repeat.

This is exactly what **gradient descent** does!

#### **The Algorithm**

1. Start with random parameter values
2. Compute the gradient (slope) of the loss function
3. Take a step in the **opposite** direction of the gradient (downhill)
4. Repeat until convergence

#### **The Math**

$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \cdot \nabla L(\theta)$$

Where:
- $\theta$ = parameters (m, b in our case)
- $\alpha$ = learning rate (step size)
- $\nabla L(\theta)$ = gradient of the loss function
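As a sketch of a single update, the snippet below applies the rule once to a two-parameter toy loss $L(\theta) = \theta_1^2 + \theta_2^2$. The toy loss, the starting values, and the helper name `toy_loss_gradient` are assumptions made for illustration only.

```python
import numpy as np

def toy_loss_gradient(theta):
    """Gradient of the assumed toy loss L(theta) = theta_1^2 + theta_2^2."""
    return 2 * theta

theta_old = np.array([4.0, -3.0])   # current parameters
alpha = 0.1                         # learning rate (step size)

grad = toy_loss_gradient(theta_old)      # points uphill: [ 8., -6.]
theta_new = theta_old - alpha * grad     # step downhill: [ 3.2, -2.4]

print("gradient: ", grad)
print("theta_new:", theta_new)
```

Both parameters move toward zero, where this toy loss has its minimum, and each moves by an amount proportional to its own partial derivative.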

### **1.3.2 Let's Start Simple: 1D Gradient Descent**

First, let's understand gradient descent with a simple function: $f(x) = x^2$

We want to find the value of $x$ that minimizes this function.

In [None]:
def f(x):
    """Our simple function to minimize: f(x) = x^2"""
    return x ** 2

def df(x):
    """Derivative: f'(x) = 2x"""
    return 2 * x

def gradient_descent_1d(start, learning_rate, n_iterations):
    """Perform gradient descent on f(x) = x^2"""
    x = start
    history = [(x, f(x))]  # Track our path

    for i in range(n_iterations):
        gradient = df(x)           # Compute the gradient
        x = x - learning_rate * gradient  # Take a step
        history.append((x, f(x)))

    return x, history

In [None]:
# Run gradient descent
start_point = 4.0
learning_rate = 0.1
n_iterations = 200

final_x, history = gradient_descent_1d(start_point, learning_rate, n_iterations)

print(f"Starting point: x = {start_point}")
print(f"Final point after {n_iterations} iterations: x = {final_x:.6f}")
print(f"Final f(x): {f(final_x):.10f}")
print(f"\nTrue minimum: x = 0, f(x) = 0")

In [None]:
# Visualize the gradient descent process
x_plot = np.linspace(-5, 5, 100)
y_plot = f(x_plot)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Show the path on the function
axes[0].plot(x_plot, y_plot, 'b-', linewidth=2, label='$f(x) = x^2$')
x_hist = [h[0] for h in history]
y_hist = [h[1] for h in history]
axes[0].scatter(x_hist, y_hist, c=range(len(history)), cmap='Reds', s=100, zorder=5)
axes[0].plot(x_hist, y_hist, 'r--', alpha=0.5)
axes[0].scatter([start_point], [f(start_point)], color='green', s=200,
                marker='o', label='Start', zorder=6)
axes[0].scatter([final_x], [f(final_x)], color='red', s=200,
                marker='*', label='End', zorder=6)
axes[0].set_xlabel('x')
axes[0].set_ylabel('f(x)')
axes[0].set_title('Gradient Descent on $f(x) = x^2$')
axes[0].legend()

# Right: Show convergence over iterations
axes[1].plot(range(len(history)), y_hist, 'b-o', linewidth=2, markersize=5)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('f(x)')
axes[1].set_title('Loss Over Iterations')
axes[1].set_yscale('log')  # Log scale to see small values

plt.tight_layout()
plt.show()

### **1.3.3 The Importance of Learning Rate**

The **learning rate** ($\alpha$) controls how big our steps are.

- **Too small**: Takes forever to converge
- **Too large**: May overshoot and never converge
- **Just right**: Fast convergence to the minimum
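For the specific function $f(x) = x^2$, these three regimes can be read off directly: the update $x \leftarrow x - \alpha \cdot 2x$ multiplies $x$ by $(1 - 2\alpha)$ at every step, so the iterates shrink only when $|1 - 2\alpha| < 1$. The sketch below checks this; the helper name `run_gd_on_square` and the learning rate 1.1 are illustrative assumptions added to show a diverging case.

```python
def run_gd_on_square(start, learning_rate, n_iterations):
    """Gradient descent on f(x) = x^2, whose derivative is 2x (illustrative helper)."""
    x = start
    for _ in range(n_iterations):
        x = x - learning_rate * 2 * x
    return x

start = 4.0
for lr in [0.01, 0.5, 0.95, 1.1]:
    factor = 1 - 2 * lr   # per-step multiplier applied to x
    final_x = run_gd_on_square(start, lr, n_iterations=20)
    status = "converges" if abs(factor) < 1 else "diverges"
    print(f"lr = {lr:<4}: factor = {factor:+.2f}, "
          f"x after 20 steps = {final_x:.4g} ({status})")
```

A negative factor means the iterate flips sign on every step (oscillation); a factor with magnitude above 1 means each step lands farther from the minimum than the last. This analysis is specific to $x^2$; other loss functions have different thresholds.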

In [None]:
# Compare different learning rates
learning_rates = [0.01, 0.1, 0.5, 0.95]
start = 4.0
n_iterations = 20

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

x_plot = np.linspace(-5, 5, 100)

for ax, lr in zip(axes.flat, learning_rates):
    final_x, history = gradient_descent_1d(start, lr, n_iterations)
    x_hist = [h[0] for h in history]
    y_hist = [h[1] for h in history]

    ax.plot(x_plot, f(x_plot), 'b-', linewidth=2, label='$f(x) = x^2$')
    ax.scatter(x_hist, y_hist, c=range(len(history)), cmap='Reds', s=80, zorder=5)
    ax.plot(x_hist, y_hist, 'r--', alpha=0.5)
    ax.scatter([start], [f(start)], color='green', s=150, marker='o', zorder=6)
    ax.scatter([final_x], [f(final_x)], color='red', s=150, marker='*', zorder=6)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_title(f'Learning Rate = {lr}\nFinal x = {final_x:.4f}')
    ax.set_xlim(-5, 5)
    ax.set_ylim(-1, 20)

plt.tight_layout()
plt.show()

#### **Discussion Questions**

1. What happens with a learning rate of 0.01? Is it converging?
2. What happens with a learning rate of 0.95? Do you see oscillation?
3. Which learning rate seems "just right" for this problem?

### **1.3.4 Gradient Descent for Linear Regression**

Now let's apply this to our actual problem: finding the best $m$ and $b$ for linear regression.

#### **The Gradients**

For MSE loss: $L = \frac{1}{n} \sum_{i=1}^{n} (y_i - (mx_i + b))^2$

The partial derivatives are:

$$\frac{\partial L}{\partial m} = \frac{-2}{n} \sum_{i=1}^{n} x_i(y_i - (mx_i + b))$$

$$\frac{\partial L}{\partial b} = \frac{-2}{n} \sum_{i=1}^{n} (y_i - (mx_i + b))$$
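One way to gain confidence in these formulas is a gradient check: compare them against finite-difference estimates of the same derivatives. The sketch below does this on made-up toy data; the names `mse_loss`, `analytic_gradients`, `X_demo`, and `y_demo` are hypothetical and exist only for this check.

```python
import numpy as np

def mse_loss(m, b, X, y):
    """Mean squared error of the line y = m*x + b."""
    return np.mean((y - (m * X + b)) ** 2)

def analytic_gradients(m, b, X, y):
    """Partial derivatives of the MSE, taken from the formulas above."""
    errors = y - (m * X + b)
    dm = (-2 / len(X)) * np.sum(X * errors)
    db = (-2 / len(X)) * np.sum(errors)
    return dm, db

# Hypothetical toy data, used only for this check
rng = np.random.default_rng(0)
X_demo = rng.uniform(-5, 5, 20)
y_demo = 3 * X_demo + 2 + rng.normal(0, 1, 20)

m0, b0, h = 0.5, -1.0, 1e-5
dm, db = analytic_gradients(m0, b0, X_demo, y_demo)

# Central differences: nudge one parameter at a time and watch the loss change
dm_num = (mse_loss(m0 + h, b0, X_demo, y_demo)
          - mse_loss(m0 - h, b0, X_demo, y_demo)) / (2 * h)
db_num = (mse_loss(m0, b0 + h, X_demo, y_demo)
          - mse_loss(m0, b0 - h, X_demo, y_demo)) / (2 * h)

print(f"dL/dm: analytic = {dm:.6f}, numerical = {dm_num:.6f}")
print(f"dL/db: analytic = {db:.6f}, numerical = {db_num:.6f}")
```

If the two columns agree to several decimal places, the derivation is very likely correct. A sign error or a missing factor of 2 would show up immediately.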

In [None]:
def compute_gradients(m, b, X, y):
    """Compute the gradients of MSE with respect to m and b."""
    n = len(X)
    predictions = m * X + b
    errors = y - predictions

    # Partial derivatives
    dm = (-2/n) * np.sum(X * errors)
    db = (-2/n) * np.sum(errors)

    return dm, db

def gradient_descent_linear(X, y, learning_rate=0.01, n_iterations=1000):
    """Perform gradient descent to find optimal m and b."""
    # Initialize with random values
    m = np.random.randn()
    b = np.random.randn()

    history = []

    for i in range(n_iterations):
        # Compute gradients
        dm, db = compute_gradients(m, b, X, y)

        # Update parameters
        m = m - learning_rate * dm
        b = b - learning_rate * db

        # Track history
        mse = compute_mse(m, b, X, y)
        history.append({'m': m, 'b': b, 'mse': mse})

        # Print progress occasionally
        if (i + 1) % 100 == 0 or i == 0:
            print(f"Iteration {i+1:4d}: m = {m:.4f}, b = {b:.4f}, MSE = {mse:.4f}")

    return m, b, history

In [None]:
# Run gradient descent on our data
np.random.seed(42)
final_m, final_b, history = gradient_descent_linear(X, y, learning_rate=0.01, n_iterations=500)

print("\n" + "=" * 50)
print(f"Final parameters: m = {final_m:.4f}, b = {final_b:.4f}")
print(f"True parameters:  m = {true_m}, b = {true_b}")
print(f"Final MSE: {history[-1]['mse']:.4f}")

In [None]:
# Visualize the learning process
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Left: Final fit
axes[0].scatter(X, y, alpha=0.6, label='Data')
x_line = np.linspace(-5, 5, 100)
axes[0].plot(x_line, final_m * x_line + final_b, 'r-', linewidth=2,
             label=f'Learned: y = {final_m:.2f}x + {final_b:.2f}')
axes[0].plot(x_line, true_m * x_line + true_b, 'g--', linewidth=2,
             label=f'True: y = {true_m}x + {true_b}')
axes[0].set_xlabel('X')
axes[0].set_ylabel('y')
axes[0].set_title('Final Model Fit')
axes[0].legend()

# Middle: Loss over time
mse_history = [h['mse'] for h in history]
axes[1].plot(mse_history, 'b-', linewidth=2)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('MSE')
axes[1].set_title('Loss Over Training')
axes[1].set_yscale('log')

# Right: Path through parameter space
m_hist = [h['m'] for h in history]
b_hist = [h['b'] for h in history]

# Create contour background
m_range = np.linspace(-1, 6, 100)
b_range = np.linspace(-3, 7, 100)
M, B = np.meshgrid(m_range, b_range)
Z = np.zeros_like(M)
for i in range(M.shape[0]):
    for j in range(M.shape[1]):
        Z[i, j] = compute_mse(M[i, j], B[i, j], X, y)

axes[2].contour(M, B, Z, levels=30, cmap='viridis', alpha=0.5)
axes[2].plot(m_hist, b_hist, 'r-', linewidth=1, alpha=0.7)
axes[2].scatter(m_hist[::50], b_hist[::50], c='red', s=50, zorder=5)
axes[2].scatter([m_hist[0]], [b_hist[0]], color='green', s=200,
                marker='o', label='Start', zorder=6)
axes[2].scatter([m_hist[-1]], [b_hist[-1]], color='red', s=200,
                marker='*', label='End', zorder=6)
axes[2].scatter([true_m], [true_b], color='blue', s=200,
                marker='X', label='True', zorder=5)
axes[2].set_xlabel('m (slope)')
axes[2].set_ylabel('b (intercept)')
axes[2].set_title('Path Through Parameter Space')
axes[2].legend()

plt.tight_layout()
plt.show()

---

## **1.4 "Animated" Gradient Descent**

Let's create a step-by-step visualization to really understand how gradient descent works.

In [None]:
# Show gradient descent step by step
def visualize_gd_steps(X, y, n_steps=10, learning_rate=0.05):
    """Create a multi-panel visualization of gradient descent steps."""

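    # Initialize
    m, b = 0.5, -2.0  # Start far from optimum

    # Create figure: 5 panels per row; squeeze=False keeps axes 2D even for one row
    n_cols = 5
    n_rows = int(np.ceil(n_steps / n_cols))
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, n_rows * 4), squeeze=False)
    x_line = np.linspace(-5, 5, 100)

    # Hide any unused panels when n_steps is not a multiple of n_cols
    for unused_ax in axes.ravel()[n_steps:]:
        unused_ax.axis('off')

    for step in range(n_steps):
        ax = axes[step // n_cols, step % n_cols]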

        # Plot data and current fit
        ax.scatter(X, y, alpha=0.4, s=20)
        ax.plot(x_line, m * x_line + b, 'r-', linewidth=2)
        ax.plot(x_line, true_m * x_line + true_b, 'g--', linewidth=1, alpha=0.5)

        # Compute and display info
        mse = compute_mse(m, b, X, y)
        dm, db = compute_gradients(m, b, X, y)

        ax.set_title(f'Step {step}\nm={m:.2f}, b={b:.2f}\nMSE={mse:.2f}')
        ax.set_xlim(-6, 6)
        ax.set_ylim(-20, 25)

        # Update parameters for next step
        m = m - learning_rate * dm
        b = b - learning_rate * db

    plt.suptitle('Gradient Descent: Step by Step\n(Red = Current Fit, Green Dashed = True Line)',
                 fontsize=14, y=1.02)
    plt.tight_layout()
    plt.show()

visualize_gd_steps(X, y, n_steps=10, learning_rate=0.05)

## **1.5 Connection to Scikit-Learn**

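When you call `model.fit()` in scikit-learn, something like this is (roughly) what happens behind the scenes for many models!

Different models use different optimization algorithms, and some (such as `LinearRegression`) solve for the parameters directly, but gradient descent is the foundation for many of them.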

In [None]:
# Compare our result to scikit-learn
from sklearn.linear_model import LinearRegression, SGDRegressor

# Standard linear regression (uses closed-form solution, not gradient descent)
lr = LinearRegression()
lr.fit(X.reshape(-1, 1), y)
print(f"LinearRegression: m = {lr.coef_[0]:.4f}, b = {lr.intercept_:.4f}")

# SGDRegressor (uses stochastic gradient descent)
sgd = SGDRegressor(max_iter=1000, tol=1e-4, random_state=42)
sgd.fit(X.reshape(-1, 1), y)
print(f"SGDRegressor:     m = {sgd.coef_[0]:.4f}, b = {sgd.intercept_[0]:.4f}")

# Our implementation
print(f"Our GD:           m = {final_m:.4f}, b = {final_b:.4f}")

# True values
print(f"True:             m = {true_m}, b = {true_b}")
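For contrast with the iterative approach, here is a minimal sketch of the direct (closed-form) route, using NumPy's least-squares solver on made-up toy data. This illustrates the idea only; scikit-learn's `LinearRegression` may use a different solver internally, and `X_demo`, `y_demo`, and `A` are hypothetical names.

```python
import numpy as np

# Hypothetical toy data with the same form as the lecture's dataset
rng = np.random.default_rng(0)
X_demo = rng.uniform(-5, 5, 50)
y_demo = 3 * X_demo + 2 + rng.normal(0, 2, 50)

# Design matrix: one column for x, one column of ones for the intercept
A = np.column_stack([X_demo, np.ones_like(X_demo)])

# Solve min ||A @ [m, b] - y||^2 in one shot, with no learning rate or iterations
(m_ls, b_ls), *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print(f"Least squares: m = {m_ls:.4f}, b = {b_ls:.4f}")
```

Gradient descent run long enough on the same data should approach these values, since both methods minimize the same squared-error loss.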

## **1.6 Reflection Question**

Today we saw how models learn through iterative improvement—taking small steps toward a goal. 

How does this relate to Philippians 3:14? In what ways is our own learning journey (academic, spiritual, or professional) similar to gradient descent?

---

## **BREAK (10-15 minutes)**

---

## **2.1 Lab Exercises** (new notebook)

---

## **3.1 Key Concepts Summary**

### **3.1.1 What We Learned Today**

1. **Loss Function**: Measures how wrong our model's predictions are (MSE for regression)

2. **Gradient**: The direction and magnitude of steepest ascent
   - Move in the **opposite** direction to descend

3. **Gradient Descent Algorithm**:
   1. Initialize parameters randomly
   2. Compute gradient of loss
   3. Update: params = params - learning_rate × gradient
   4. Repeat until convergence

4. **Learning Rate**: 
   - Too small → slow convergence
   - Too large → oscillation/divergence
   - Just right → efficient convergence

5. **Convergence**: We stop when the loss stops decreasing significantly
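The lecture code runs for a fixed number of iterations; the sketch below shows one possible way to implement the stopping rule from point 5 instead. The function name `gd_with_tolerance`, the tolerance `tol=1e-8`, the zero initialization, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def gd_with_tolerance(X, y, learning_rate=0.01, max_iterations=10_000, tol=1e-8):
    """Gradient descent for y ≈ m*x + b; stops once the MSE improves by less than tol."""
    m, b = 0.0, 0.0
    n = len(X)
    prev_mse = np.mean((y - (m * X + b)) ** 2)
    for i in range(1, max_iterations + 1):
        errors = y - (m * X + b)
        m -= learning_rate * (-2 / n) * np.sum(X * errors)
        b -= learning_rate * (-2 / n) * np.sum(errors)
        mse = np.mean((y - (m * X + b)) ** 2)
        if prev_mse - mse < tol:   # loss has stopped decreasing meaningfully
            return m, b, mse, i
        prev_mse = mse
    return m, b, mse, max_iterations

# Hypothetical toy data
rng = np.random.default_rng(0)
X_demo = rng.uniform(-5, 5, 50)
y_demo = 3 * X_demo + 2 + rng.normal(0, 2, 50)

m_fit, b_fit, final_mse, n_used = gd_with_tolerance(X_demo, y_demo)
print(f"Stopped after {n_used} iterations: "
      f"m = {m_fit:.4f}, b = {b_fit:.4f}, MSE = {final_mse:.4f}")
```

The `max_iterations` cap is a safety net: if the learning rate is too large, the loss may never settle, and the same check also halts the loop as soon as the loss increases.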

## **3.2 Preview: Next Week**

**Week 3: Linear Algebra for Data Science**
- Dot products and why they matter
- From DataFrames to Arrays
- Matrix operations for efficiency
