<div style="text-align: center;">

# **Spring 2026 &mdash; CIS 3813<br>Advanced Data Science<br>(Introduction to Machine Learning)**
### Week 3: Linear Algebra for Data Science

</div>

**Date:** 09 February 2026  
**Time:** 6:00–9:00 PM  
**Instructor:** Dr. Patrick T. Marsh  
**Course Verse:** "He has shown you, O mortal, what is good. And what does the Lord require of you? To act justly and to love mercy and to walk humbly with your God."  &mdash; *Micah 6:8 (NIV)*

---
## **Week 3 Learning Objectives**

By the end of this lecture, you will be able to:

1. Distinguish between scalars, vectors, and matrices, and explain how each maps to data science concepts (individual measurements, samples, and datasets).
2. Compute the dot product of two vectors by hand and explain its role as the core operation behind linear model predictions.
3. Apply the matrix multiplication shape rule to predict whether an operation will succeed and determine the shape of the result.
4. Convert data between pandas DataFrames and NumPy arrays, recognizing when and why each representation is appropriate.
5. Use vectorized NumPy operations instead of Python loops to perform numerical computations efficiently.
6. Connect linear algebra operations (matrix-vector multiplication, transpose) to the vectorized gradient descent process introduced in Week 2.

---


## **Today's Outline**
- Lecture  
    1. Review of Last Week
    2. Why Linear Algebra
    3. Scalars, Vectors, and Matrices
    4. Vector Operations: The Foundation
    5. The Dot Product
    6. Conceptual Introduction to Matrix Operations
    7. Why NumPy? Speed Through Vectorization
    8. From DataFrames to Arrays and Back
    9. Connecting Back to Gradient Descent
- Break (10-15 Minutes)
- Lab (or Homework)
- Review
    1. Review & Wrap-Up
    2. Coming Up


---

## **Opening Reflection**

> *"The heavens declare the glory of God; the skies proclaim the work of his hands. Day after day they pour forth speech; night after night they reveal knowledge."*  
> — **Psalm 19:1–2 (NIV)**

Just as the heavens reveal patterns that declare God's glory, linear algebra gives us the language to uncover hidden patterns in data. Today we learn the mathematical foundation that powers nearly every machine learning algorithm — the ability to represent, transform, and compute with structured numerical data. As we explore dot products and matrix operations, remember that our capacity to discover order in apparent chaos reflects the *Imago Dei* — we are pattern-seekers made in the image of a God of order and wisdom.


---

## **1.1 Review of Last Week**

Last week we answered a fundamental question: once we define a model, how does it actually *learn* the right parameters? The answer is gradient descent — an iterative algorithm that adjusts model parameters step by step to minimize a loss function.

Here are the key ideas you should have coming into today:

- **The gradient** is just the slope of the loss function with respect to a parameter. It tells us two things: which *direction* to move (increase or decrease the parameter) and how *steeply* the loss is changing. We always move in the direction that reduces the loss — that is, opposite the sign of the gradient.

- **The update rule** follows a simple pattern: subtract the gradient (scaled by a learning rate) from the current parameter value. If the gradient is positive, the parameter decreases. If the gradient is negative, the parameter increases. Either way, provided the learning rate is not too large, the loss gets a little smaller and the model's predictions move closer to the actual values.

- **The learning rate (α)** controls step size and requires a balancing act. Too large and the model overshoots the minimum, causing the loss to *increase* or oscillate wildly. Too small and the model inches toward the answer so slowly that training becomes impractical. Finding a good learning rate is one of the first practical skills in machine learning.

- **The loss function** (we used Mean Squared Error) gives us a single number that summarizes how wrong the model is across all training examples. Gradient descent's entire job is to make that number smaller, iteration by iteration.

- Last week we did all of this with a single feature and a single weight. Today we introduce the linear algebra that lets us scale this process to *many* features and *many* weights simultaneously — which is how real machine learning actually works.
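
The learning-rate trade-off described above can be seen on a toy problem. The following is a minimal sketch, not last week's code: it assumes a made-up one-parameter loss $L(w) = (w - 3)^2$ with its minimum at $w = 3$, and the helper name `run_gradient_descent` is hypothetical.

```python
def run_gradient_descent(learning_rate, n_steps=50, w_start=0.0):
    """Minimize the toy loss L(w) = (w - 3)**2 starting from w_start."""
    w = w_start
    for _ in range(n_steps):
        gradient = 2 * (w - 3)            # slope of the loss at the current w
        w = w - learning_rate * gradient  # step opposite the gradient
    return w

w_good = run_gradient_descent(learning_rate=0.1)         # settles near 3
w_too_big = run_gradient_descent(learning_rate=1.1)      # overshoots further each step
w_too_small = run_gradient_descent(learning_rate=0.001)  # barely moves in 50 steps

for name, w in [("0.1", w_good), ("1.1", w_too_big), ("0.001", w_too_small)]:
    print(f"learning rate {name:>5}: final w = {w:.4f}, distance from minimum = {abs(w - 3):.4f}")
```

With the same number of steps, only the middle-sized learning rate ends up close to the minimum; the large one moves away from it and the small one has hardly left the starting point.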

---

## **1.2 Why Linear Algebra?**

Last week we learned how models learn through gradient descent. We saw how a model adjusts its parameters by following the slope of a loss function. But we worked with just **one feature** — a single input variable.

Real-world data has **many features**: age, income, square footage, temperature, dozens or hundreds of columns. To work with all of these features simultaneously, we need **linear algebra**.

### **Where linear algebra shows up in machine learning:**

- **Data representation**: Every tabular dataset can be stored as a matrix (rows = samples, columns = features)
- **Model predictions**: Computing predictions for all samples at once uses matrix multiplication
- **Gradient descent**: Updating multiple weights simultaneously requires vector operations
- **Dimensionality reduction** (PCA — Week 12): Built entirely on matrix decomposition
- **Neural networks** (Week 13): Every layer is a matrix multiplication + activation

**Bottom line:** If you understand vectors, dot products, and the intuition behind matrix operations, you have the mathematical vocabulary for the rest of this course.

---

## **1.3 Scalars, Vectors, and Matrices**

Let's start with the building blocks.

| Term | What It Is | Example | NumPy Shape |
|------|-----------|---------|-------------|
| **Scalar** | A single number | `5`, `3.14`, `-2` | `()` |
| **Vector** | An ordered list of numbers | `[1, 2, 3]` | `(3,)` |
| **Matrix** | A 2D grid of numbers (rows × columns) | A spreadsheet of data | `(m, n)` |

### **How this maps to data science:**
- A single measurement (e.g., one person's age) → **scalar**
- One row of a dataset (one sample with all its features) → **vector**
- An entire dataset (all samples, all features) → **matrix**

In [None]:
import numpy as np
import pandas as pd

# --- Scalar ---
temperature = 72.5
print(f"Scalar: {temperature}")
print(f"  Type: {type(temperature)}")
print()

# --- Vector ---
# A single house: [sqft, bedrooms, age_years]
house_features = np.array([1500, 3, 15])
print(f"Vector: {house_features}")
print(f"  Shape: {house_features.shape}")
print(f"  Dimensions: {house_features.ndim}")
print()

# --- Matrix ---
# 4 houses, each with 3 features: [sqft, bedrooms, age_years]
houses = np.array([
    [1500, 3, 15],
    [2100, 4, 5],
    [900, 2, 30],
    [1800, 3, 10]
])
print(f"Matrix:\n{houses}")
print(f"  Shape: {houses.shape}  →  {houses.shape[0]} samples × {houses.shape[1]} features")
print(f"  Dimensions: {houses.ndim}")
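
The table lists `()` as the NumPy shape of a scalar, but a plain Python float like `temperature` has no `.shape` attribute. As a small sketch (the variable names here are just for illustration), wrapping the value in a zero-dimensional array shows the full progression from 0 to 2 dimensions:

```python
import numpy as np

temp_np = np.array(72.5)               # zero-dimensional array: a NumPy scalar
one_house = np.array([1500, 3, 15])    # one sample -> 1D vector
two_houses = np.array([[1500, 3, 15],
                       [2100, 4, 5]])  # two samples -> 2D matrix

for label, arr in [("scalar", temp_np), ("vector", one_house), ("matrix", two_houses)]:
    print(f"{label:>6}: shape = {arr.shape}, ndim = {arr.ndim}")

# Indexing removes one dimension at a time: matrix -> vector -> scalar
first_row = two_houses[0]       # shape (3,)
first_value = two_houses[0, 0]  # shape ()
print(first_row.shape, first_value.shape)
```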

### **Key Insight: Shape Matters!**

In NumPy (and in machine learning), the **shape** of your data tells you everything:

- `(n,)` — a 1D vector with `n` elements
- `(m, n)` — a matrix with `m` rows and `n` columns
- `(m, 1)` — a **column vector** (matrix with 1 column)
- `(1, n)` — a **row vector** (matrix with 1 row)

Many bugs in data science code come from **shape mismatches**. Always check `.shape`!

In [None]:
# Shape matters: 1D vector vs. column vector vs. row vector
v = np.array([1, 2, 3])
col = np.array([[1], [2], [3]])
row = np.array([[1, 2, 3]])

print(f"1D vector:     shape = {v.shape}")
print(f"Column vector: shape = {col.shape}")
print(f"Row vector:    shape = {row.shape}")
print()
print("They contain the same numbers, but behave differently in operations!")
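
To make "behave differently" concrete, here is a short sketch that builds the column and row versions with `reshape` and compares the result shapes of a few operations. The shapes in the comments follow from NumPy's broadcasting and matrix-multiplication rules.

```python
import numpy as np

v = np.array([1, 2, 3])   # shape (3,)
col = v.reshape(-1, 1)    # shape (3, 1): -1 means "infer this size"
row = v.reshape(1, -1)    # shape (1, 3)

print((v + v).shape)      # (3,)   element-wise, as expected
print((v + col).shape)    # (3, 3) broadcasting builds a full grid
print((row @ col).shape)  # (1, 1) matrix product: a 1x1 matrix
print((col @ row).shape)  # (3, 3) matrix product: an outer product
print((v @ v).shape)      # ()     dot product of two 1D vectors: a scalar

# Converting back to 1D when a plain vector is needed
print(col.ravel().shape)  # (3,)
```

The `(3,) + (3, 1)` case is a common source of silent bugs: NumPy raises no error and simply returns a 3 × 3 grid.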

---
## **1.4 Vector Operations: The Foundation**

Before we get to the star of today's lecture (the dot product), let's make sure we understand basic vector operations.

### **Element-wise Operations**

When you add, subtract, multiply, or divide vectors of the same size, NumPy does it **element by element**.

In [None]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print("Vector a:", a)
print("Vector b:", b)
print()

# Element-wise operations
print("Addition (a + b):      ", a + b)       # [1+4, 2+5, 3+6]
print("Subtraction (a - b):   ", a - b)       # [1-4, 2-5, 3-6]
print("Multiplication (a * b):", a * b)        # [1*4, 2*5, 3*6]
print("Division (a / b):      ", a / b)        # [1/4, 2/5, 3/6]
print()

# Scalar operations (broadcast the scalar to every element)
print("Scalar multiply (a * 3):", a * 3)       # [1*3, 2*3, 3*3]
print("Scalar add (a + 10):   ", a + 10)       # [1+10, 2+10, 3+10]

### **Why element-wise operations matter in data science**

When you **standardize** a feature (subtract the mean, divide by standard deviation), you're doing element-wise vector operations on an entire column at once. This is much faster than looping through each value.

In [None]:
# Example: Standardizing a feature (z-score normalization)
prices = np.array([150000, 210000, 90000, 180000, 320000])

mean_price = prices.mean()
std_price = prices.std()

# This is an element-wise operation on the entire vector!
standardized = (prices - mean_price) / std_price

print(f"Original prices:      {prices}")
print(f"Mean: ${mean_price:,.0f}   Std: ${std_price:,.0f}")
print(f"Standardized prices:  {np.round(standardized, 3)}")
print(f"\nStandardized mean: {standardized.mean():.6f}  (≈ 0)")
print(f"Standardized std:  {standardized.std():.6f}  (≈ 1)")
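
The same idea extends from one column to a whole feature matrix. In this sketch, which reuses the four-house matrix from Section 1.3, `axis=0` asks NumPy for one mean and one standard deviation per column, and broadcasting applies them to every row:

```python
import numpy as np

houses = np.array([
    [1500, 3, 15],
    [2100, 4, 5],
    [900, 2, 30],
    [1800, 3, 10]
], dtype=float)

col_means = houses.mean(axis=0)  # shape (3,): one mean per feature
col_stds = houses.std(axis=0)    # shape (3,): one std per feature

# (4, 3) minus (3,) broadcasts the per-feature values across all 4 rows
houses_standardized = (houses - col_means) / col_stds

print("Per-feature means:", col_means)
print("Per-feature stds: ", np.round(col_stds, 2))
print("Standardized matrix:\n", np.round(houses_standardized, 3))
print("Column means after:", np.round(houses_standardized.mean(axis=0), 6))
print("Column stds after: ", np.round(houses_standardized.std(axis=0), 6))
```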

---

## **1.5 The Dot Product**

The **dot product** is the single most important operation in machine learning. It takes two vectors of the same length and produces a **single number** (a scalar).

### **Formula**

Given two vectors **a** = [a₁, a₂, ..., aₙ] and **b** = [b₁, b₂, ..., bₙ]:

$$\mathbf{a} \cdot \mathbf{b} = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n = \sum_{i=1}^{n} a_i b_i$$

**In words:** Multiply corresponding elements, then sum everything up.

### **Three ways to think about the dot product:**

1. **Algebraic:** Multiply pairs, add them up (the formula above)
2. **Geometric:** Measures how much two vectors point in the same direction
3. **Machine Learning:** It's how a linear model makes a prediction!

In [None]:
# The dot product: four equivalent ways in NumPy
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Method 1: Manual calculation
manual = a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
print(f"Manual:      1×4 + 2×5 + 3×6 = {manual}")

# Method 2: Element-wise multiply then sum
multiply_then_sum = np.sum(a * b)
print(f"Sum of a*b:  {multiply_then_sum}")

# Method 3: np.dot() — the standard way
dot_result = np.dot(a, b)
print(f"np.dot(a,b): {dot_result}")

# Method 4: The @ operator (Python 3.5+) — most common in modern code
at_result = a @ b
print(f"a @ b:       {at_result}")
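
The geometric view listed above rests on the identity $\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \cos\theta$, where $\theta$ is the angle between the two vectors. A minimal sketch follows; the helper name `cosine_similarity` is defined here for illustration and is not a library function. Dividing the dot product by both lengths leaves only the cosine, which is +1 for the same direction, 0 for perpendicular vectors, and -1 for opposite directions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by both lengths: the cosine of the angle between a and b."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

east = np.array([1.0, 0.0])
also_east = np.array([3.0, 0.0])  # same direction, different length
north = np.array([0.0, 1.0])      # perpendicular to east
west = np.array([-2.0, 0.0])      # opposite direction

print(f"same direction:     {cosine_similarity(east, also_east):+.1f}")
print(f"perpendicular:      {cosine_similarity(east, north):+.1f}")
print(f"opposite direction: {cosine_similarity(east, west):+.1f}")
```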

### **The Dot Product IS a Linear Model Prediction**

Remember from Week 1 how a linear model works? Here's the key insight:

$$\hat{y} = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$$

This is just:

$$\hat{y} = \mathbf{w} \cdot \mathbf{x} + b$$

The dot product of the **weights** vector and the **features** vector, plus a bias term!

Let's see this concretely with our housing example:

In [None]:
# A simple house price model
# Features: [sqft, bedrooms, age_years]
# Weights learned by the model:
weights = np.array([150, 20000, -500])  # $ per sqft, $ per bedroom, $ per year of age
bias = 50000  # base price

# A house to predict: 1500 sqft, 3 bedrooms, 15 years old
house = np.array([1500, 3, 15])

# Prediction using a loop (slow, verbose)
prediction_loop = 0
for i in range(len(weights)):
    prediction_loop += weights[i] * house[i]
prediction_loop += bias

# Prediction using the dot product (fast, clean)
prediction_dot = weights @ house + bias

print("House features: [sqft=1500, beds=3, age=15]")
print(f"\nManual breakdown:")
print(f"  150 × 1500  = ${150*1500:>10,}  (sqft contribution)")
print(f"  20000 × 3   = ${20000*3:>10,}  (bedroom contribution)")
print(f"  -500 × 15   = ${-500*15:>10,}  (age penalty)")
print(f"  bias        = ${bias:>10,}")
print(f"  ─────────────────────────")
print(f"  Total       = ${prediction_dot:>10,}")
print(f"\nUsing dot product: weights @ house + bias = ${prediction_dot:,}")
print(f"Both methods agree: {prediction_loop == prediction_dot}")

In [None]:
# Now predict ALL houses at once using matrix-vector multiplication!
houses = np.array([
    [1500, 3, 15],
    [2100, 4, 5],
    [900, 2, 30],
    [1800, 3, 10]
])

# Matrix @ vector = vector of predictions
all_predictions = houses @ weights + bias

print("Houses matrix (4 houses × 3 features):")
print(houses)
print(f"\nWeights: {weights}")
print(f"Bias:    {bias}")
print(f"\nAll predictions at once: {all_predictions}")
print()
for i, pred in enumerate(all_predictions):
    print(f"  House {i+1}: ${pred:>10,}")

### **What just happened?**

We multiplied a **(4 × 3) matrix** by a **(3,) vector** and got a **(4,) vector** of predictions.

Each row of the matrix was combined with the weights vector via a dot product, and the bias (50,000) was then added to every result:

```
houses (4×3)  @  weights (3,)  +  bias  =  predictions (4,)

[1500, 3, 15]  ·  [150, 20000, -500]  +  50,000  =  327,500
[2100, 4,  5]  ·  [150, 20000, -500]  +  50,000  =  442,500
[ 900, 2, 30]  ·  [150, 20000, -500]  +  50,000  =  210,000
[1800, 3, 10]  ·  [150, 20000, -500]  +  50,000  =  375,000
```

This is exactly what scikit-learn does under the hood when you call `.predict()` on a `LinearRegression` model!

---

## **1.6 Conceptual Introduction to Matrix Operations**

Now that we understand the dot product, let's zoom out to see the bigger picture of matrix operations.

### **Matrix Multiplication (Conceptual)**

Matrix multiplication is just **many dot products organized into a grid**.

If **A** is (m × n) and **B** is (n × p), then **A @ B** is (m × p).

**The inner dimensions must match!** The `n` in (m × **n**) must equal the `n` in (**n** × p).

Each element of the result is the dot product of a **row from A** with a **column from B**.

In [None]:
# Matrix multiplication: (m × n) @ (n × p) = (m × p)
A = np.array([
    [1, 2],
    [3, 4],
    [5, 6]
])  # Shape: (3, 2)

B = np.array([
    [7, 8, 9],
    [10, 11, 12]
])  # Shape: (2, 3)

C = A @ B  # Shape: (3, 3)

print(f"A shape: {A.shape}")
print(f"B shape: {B.shape}")
print(f"C = A @ B shape: {C.shape}")
print(f"\nResult:\n{C}")
print()

# Let's verify one element: C[0,0] = row 0 of A · column 0 of B
row0_A = A[0, :]       # [1, 2]
col0_B = B[:, 0]       # [7, 10]
print(f"Verifying C[0,0]:")
print(f"  Row 0 of A: {row0_A}")
print(f"  Col 0 of B: {col0_B}")
print(f"  Dot product: {row0_A[0]}×{col0_B[0]} + {row0_A[1]}×{col0_B[1]} = {row0_A @ col0_B}")
print(f"  C[0,0] = {C[0,0]}  ✓")

### **The Shape Rule (Memorize This!)**

```
(m × n) @ (n × p) = (m × p)  
     ↑     ↑  
These must match!
```

If the inner dimensions don't match, you get an error:

In [None]:
# What happens when shapes don't match?
X = np.array([[1, 2, 3]])       # (1, 3)
Y = np.array([[4, 5, 6]])       # (1, 3)

print(f"X shape: {X.shape}")
print(f"Y shape: {Y.shape}")
print()

try:
    result = X @ Y
except ValueError as e:
    print(f"Error: {e}")
    print("   (1×3) @ (1×3) fails because inner dimensions 3 ≠ 1")

print()

# Fix: transpose Y so shapes align
result = X @ Y.T  # (1×3) @ (3×1) = (1×1)
print(f"X @ Y.T = {result}  ← this is just the dot product of X and Y!")
print(f"  Shape: {result.shape}")
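
The shape rule is mechanical enough to write down as a function. The sketch below covers 2D shapes only and uses a hypothetical helper, `predict_matmul_shape`. It predicts the result shape from the two input shapes, then checks each prediction against NumPy.

```python
import numpy as np

def predict_matmul_shape(a_shape, b_shape):
    """Apply the shape rule to two 2D shapes.

    Returns the result shape, or None if the inner dimensions differ.
    """
    m, n_left = a_shape
    n_right, p = b_shape
    if n_left != n_right:
        return None
    return (m, p)

cases = [((3, 2), (2, 3)), ((1, 3), (1, 3)), ((1, 3), (3, 1)), ((500, 4), (4, 1))]
for a_shape, b_shape in cases:
    predicted = predict_matmul_shape(a_shape, b_shape)
    if predicted is None:
        print(f"{a_shape} @ {b_shape}: inner dimensions differ, NumPy would raise ValueError")
    else:
        # Confirm the prediction against NumPy itself
        actual = (np.zeros(a_shape) @ np.zeros(b_shape)).shape
        print(f"{a_shape} @ {b_shape}: predicted {predicted}, NumPy gives {actual}")
```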

### **Transpose**

The **transpose** of a matrix flips it over its diagonal — rows become columns and columns become rows.

- Shape (m × n) → (n × m)
- In NumPy: `A.T` or `np.transpose(A)`

In [None]:
A = np.array([
    [1, 2, 3],
    [4, 5, 6]
])

print(f"A (shape {A.shape}):")
print(A)
print(f"\nA.T (shape {A.T.shape}):")
print(A.T)
print()
print("Notice: Row 0 of A [1,2,3] became Column 0 of A.T")

### **Where transpose shows up in ML:**

- **Normal equation** for linear regression: $\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
- **Covariance matrix**: $\frac{1}{n} \mathbf{X}^T \mathbf{X}$, where the columns of $\mathbf{X}$ have been mean-centered (used in PCA, Week 12)
- **Batch predictions**: Sometimes you need to transpose your data to make shapes align

You don't need to memorize these formulas right now — just recognize that the transpose is everywhere in ML.
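
For the curious, here is a sketch of the normal equation in action, mainly to show how the transpose makes the shapes line up. Several things here are assumptions made for illustration: the data is synthetic, the bias is handled by appending a column of ones to $\mathbf{X}$, and `np.linalg.solve` is used in place of an explicit matrix inverse (same math, generally better numerical behavior).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y = 2*x1 - 1*x2 + 4, plus a little noise
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 4.0 + rng.normal(scale=0.1, size=100)

# Append a column of ones so the last weight acts as the bias
X_aug = np.column_stack([X, np.ones(len(X))])  # shape (100, 3)

XtX = X_aug.T @ X_aug  # (3, 100) @ (100, 3) = (3, 3)
Xty = X_aug.T @ y      # (3, 100) @ (100,)   = (3,)

# Solve (X^T X) w = X^T y; equivalent to the formula, without forming the inverse
w_normal = np.linalg.solve(XtX, Xty)

print("Shapes:", XtX.shape, Xty.shape, w_normal.shape)
print("Recovered [w1, w2, bias]:", np.round(w_normal, 2))
```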

---
## **1.7 Why NumPy? Speed Through Vectorization**

A natural question: "Why not just use Python lists and for-loops?"

**Answer: Speed.** NumPy operations are **vectorized** — they run in optimized C code instead of slow Python loops. The difference is dramatic, especially with large datasets.

In [None]:
import time

# Create a large dataset: 100,000 samples, 50 features
np.random.seed(42)
n_samples = 100_000
n_features = 50

X = np.random.randn(n_samples, n_features)
w = np.random.randn(n_features)

print(f"Data: {n_samples:,} samples × {n_features} features")
print(f"X shape: {X.shape}")
print(f"w shape: {w.shape}")
print()

In [None]:
# Method 1: Pure Python loops
start = time.perf_counter()  # high-resolution timer, better suited to benchmarking
predictions_loop = []
for i in range(n_samples):
    pred = 0
    for j in range(n_features):
        pred += X[i, j] * w[j]
    predictions_loop.append(pred)
loop_time = time.perf_counter() - start

print(f"Python loops: {loop_time:.4f} seconds")

In [None]:
# Method 2: NumPy dot product (vectorized)
start = time.perf_counter()  # high-resolution timer, so a very fast run is not measured as 0
predictions_numpy = X @ w
numpy_time = time.perf_counter() - start

print(f"NumPy (X @ w): {numpy_time:.6f} seconds")
print(f"\nSpeedup: {loop_time / numpy_time:.0f}x faster!")
print(f"\nResults match: {np.allclose(predictions_loop, predictions_numpy)}")

### **Why is NumPy so fast?**

1. **Compiled C code**: NumPy's core loops are written in C, not Python
2. **Contiguous memory**: A NumPy array stores its values in a single block of memory, whereas a Python list stores references to objects scattered across memory
3. **SIMD instructions**: Modern CPUs can process multiple numbers in a single instruction
4. **Fixed data type**: Every element of an array shares one dtype, so NumPy resolves the type once per operation, whereas a Python loop checks types on every element

**Rule of thumb:** If you find yourself writing a `for` loop over array elements in data science code, there's almost certainly a vectorized NumPy operation that does the same thing faster.
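
As a small illustration of that rule of thumb, the sketch below computes a mean squared error twice, first with a loop and then as a single array expression. The `actual` and `predicted` values are made up for the example.

```python
import numpy as np

actual = np.array([327500.0, 442000.0, 220000.0, 375000.0])
predicted = np.array([320000.0, 450000.0, 215000.0, 380000.0])

# Loop version: accumulate squared errors one element at a time
total = 0.0
for i in range(len(actual)):
    total += (predicted[i] - actual[i]) ** 2
mse_loop = total / len(actual)

# Vectorized version: the same arithmetic expressed on whole arrays
mse_vectorized = np.mean((predicted - actual) ** 2)

# Conditional logic vectorizes too: count predictions within $6,000 of the truth
n_close = np.sum(np.abs(predicted - actual) <= 6000)

print(f"MSE (loop):       {mse_loop:,.0f}")
print(f"MSE (vectorized): {mse_vectorized:,.0f}")
print(f"Predictions within $6,000: {n_close} of {len(actual)}")
```

The comparison `np.abs(...) <= 6000` produces a Boolean array, and summing it counts the `True` entries, which replaces an `if` statement inside a loop.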

---

## **1.8 From DataFrames to Arrays and Back**

In practice, you'll receive data as a pandas DataFrame (from CSV, database, etc.) but scikit-learn models need NumPy arrays under the hood. Let's see how to move between them.

In [None]:
# Create a sample DataFrame
df = pd.DataFrame({
    'sqft': [1500, 2100, 900, 1800, 2500],
    'bedrooms': [3, 4, 2, 3, 5],
    'age': [15, 5, 30, 10, 2],
    'price': [327500, 442000, 220000, 375000, 510000]
})

print("Original DataFrame:")
print(df)
print(f"\nType: {type(df)}")

In [None]:
# DataFrame → NumPy array
X_df = df[['sqft', 'bedrooms', 'age']]     # Feature columns (still a DataFrame)
y_df = df['price']                         # Target column (still a Series)

# Convert to NumPy
X_array = X_df.to_numpy()  # or X_df.values (older style)
y_array = y_df.to_numpy()

print(f"X as DataFrame:\n{X_df}")
print(f"\nX as NumPy array:\n{X_array}")
print(f"\nShape: {X_array.shape}")
print(f"dtype: {X_array.dtype}")

print(f"\ny as Series:\n{y_df}")
print(f"\ny as NumPy array:\n{y_array}")
print(f"\nShape: {y_array.shape}")
print(f"dtype: {y_array.dtype}")

In [None]:
# What scikit-learn does internally:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_df, y_df)  # Scikit-learn accepts DataFrames and converts internally

print(f"Model weights (coef_): {model.coef_}")
print(f"Model bias (intercept_): {model.intercept_:.2f}")
print(f"\nThese are NumPy objects (an array and a scalar):")
print(f"  coef_ type:      {type(model.coef_)}")
print(f"  intercept_ type: {type(model.intercept_)}")
print()

# The prediction is literally: X @ coef_ + intercept_
manual_preds = X_array @ model.coef_ + model.intercept_
sklearn_preds = model.predict(X_df)

print("Manual (X @ w + b) vs sklearn .predict():")
print(f"  Match: {np.allclose(manual_preds, sklearn_preds)}")
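
The "and back" direction deserves its own example. Converting to NumPy drops the column names and the index, so after computing on the array it is common to wrap the result in a new DataFrame. This sketch uses a three-row subset of the housing columns and assumes the simplest case, where row order and column order are unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'sqft': [1500, 2100, 900],
    'bedrooms': [3, 4, 2],
    'age': [15, 5, 30]
})

# DataFrame -> array: column names and the index are dropped
X_array = df.to_numpy()

# Compute in NumPy (here: standardize each column)
X_scaled = (X_array - X_array.mean(axis=0)) / X_array.std(axis=0)

# Array -> DataFrame: reattach the labels so the result is readable again
df_scaled = pd.DataFrame(X_scaled, columns=df.columns, index=df.index)

print(df_scaled.round(3))
print(f"\nColumns preserved: {list(df_scaled.columns)}")
```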

---

## **1.9 Connecting Back to Gradient Descent (Week 2)**

Last week we updated a single weight using gradient descent. With linear algebra, we can update **all weights at once**.

### **Single feature (Week 2):**
```python
prediction = w * x + b
error = prediction - y
w = w - learning_rate * error * x
```

### **Multiple features (with linear algebra):**
```python
predictions = X @ w + b           # dot product: all predictions at once
errors = predictions - y          # vector of errors
gradient = X.T @ errors / n       # all gradients at once! (n = number of samples)
w = w - learning_rate * gradient  # update all weights simultaneously
b = b - learning_rate * errors.mean()  # the bias gets its own update
```

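The `X.T @ errors` line computes the gradient for every weight simultaneously using matrix multiplication: each row of `X.T` holds one feature across all samples, so each entry of the result is that feature's dot product with the error vector. (Differentiating MSE exactly produces an extra factor of 2; it is omitted here because it can be absorbed into the learning rate.) This is **vectorized gradient descent**.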
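
To see that the matrix form really is the single-feature rule applied to every weight, the sketch below computes the gradient both ways on a small made-up dataset: once with a loop over features, and once with `X.T @ errors`. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_features = 6, 3
X = rng.normal(size=(n, n_features))  # made-up features
y = rng.normal(size=n)                # made-up targets
w = np.zeros(n_features)
b = 0.0

errors = X @ w + b - y                # shape (6,)

# Week 2 style: one weight at a time, averaging error * x over the samples
gradient_loop = np.zeros(n_features)
for j in range(n_features):
    gradient_loop[j] = np.mean(errors * X[:, j])

# Vectorized: row j of X.T holds feature j for every sample,
# so X.T @ errors computes all of those sums in one step
gradient_vectorized = X.T @ errors / n

print("Loop:      ", np.round(gradient_loop, 4))
print("Vectorized:", np.round(gradient_vectorized, 4))
print("Match:", np.allclose(gradient_loop, gradient_vectorized))
```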

In [None]:
# Vectorized gradient descent for multiple linear regression
np.random.seed(42)

# Simple synthetic data: y = 3*x1 + 5*x2 + 7 + noise
n = 200
X_train = np.random.randn(n, 2)  # 200 samples, 2 features
true_weights = np.array([3.0, 5.0])
true_bias = 7.0
y_train = X_train @ true_weights + true_bias + np.random.randn(n) * 0.5

# Initialize weights
w = np.zeros(2)
b = 0.0
lr = 0.05

# Train with vectorized gradient descent
print("Epoch | w1     | w2     | bias   | MSE")
print("------+--------+--------+--------+--------")
n_epochs = 200  # number of training passes (separate from n, the number of samples)
for epoch in range(n_epochs + 1):
    # Forward pass (dot product!)
    predictions = X_train @ w + b

    # Compute error
    errors = predictions - y_train
    mse = np.mean(errors ** 2)

    # Compute gradients (matrix transpose!)
    grad_w = (X_train.T @ errors) / n     # gradient for all weights at once
    grad_b = np.mean(errors)              # gradient for bias

    # Update
    w = w - lr * grad_w
    b = b - lr * grad_b

    if epoch % 40 == 0:
        print(f"{epoch:>5} | {w[0]:>6.3f} | {w[1]:>6.3f} | {b:>6.3f} | {mse:.4f}")

print(f"\nLearned:  w = {w},  b = {b:.3f}")
print(f"True:     w = {true_weights}, b = {true_bias}")

---

## **BREAK (10-15 minutes)**

---

## **2.1 Lab Exercises** (new notebook)

---

## **3.1 Review & Wrap-Up**

### **3.1.1 What We Learned Today**

| Concept | Why It Matters |
|---------|----------------|
| **Vectors** | A single data sample is a vector of features |
| **Matrices** | An entire dataset is a matrix (samples × features) |
| **Dot product** | This IS how linear models make predictions: **w · x + b** |
| **Matrix × vector** | Predict all samples at once: **X @ w + b** |
| **Transpose** | Needed for computing gradients: **X.T @ errors** |
| **Vectorization** | NumPy is often 100–1000× faster than Python loops (the exact speedup depends on the operation and hardware) |
| **DataFrame ↔ Array** | Pandas for loading/exploring; NumPy for computing |

### **3.1.2 Discussion Questions**
1. In your own words, what does the dot product compute? Why is it useful in ML?
2. If your data matrix X has shape `(500, 10)` and your weight vector w has shape `(10,)`, what shape is `X @ w`?
3. Why would `X @ w` fail if w had shape `(5,)` instead?
4. When would you use `X.T` (transpose) in a machine learning context?

---

## **3.2 Coming Up**

### **3.2.1 Next Week Preview: Multiple Linear Regression & Regularization**
- We'll use today's linear algebra to build models with many features
- We'll learn about **overfitting** and how **L1 and L2 regularization** prevent it
- Key question: What happens when you have too many features?

### **3.2.2 Homework Reminder**
- **Lab notebook due:** Monday, 16 February @ 6:00 PM (grace period until Wed, 18 February @ 11:59 PM)
- Next week begins with a **mastery assessment** on this week's material (dot products, shapes, vectorization)
- Practice computing dot products by hand — you'll need to do this without AI!

### **3.2.3 Looking Ahead**
- **Week 4 (Multiple Linear Regression):** We'll use everything from today to build multi-feature regression models
- **Week 12 (PCA):** Matrix decomposition for dimensionality reduction
- **Week 13 (Neural Networks):** Every layer is a matrix multiplication + activation function
