<div style="text-align: center;">

# **Spring 2026 &mdash; CIS 3813<br>Advanced Data Science<br>(Introduction to Machine Learning)**
### Week 3: Linear Algebra for Data Science

</div>

---

## **Lab Instructions**

**Due Date**: Monday, February 16, 2026 @ 6:00 PM (with grace period until Wednesday, February 18 @ 11:59 PM)

In this lab, you will:
1. Convert a pandas DataFrame into NumPy arrays and back, practicing the data format transitions you'll use in every scikit-learn workflow
2. Write both a Python loop and a NumPy vectorized implementation of the dot product, then benchmark them across increasing data sizes to see why vectorization matters
3. Use the dot product to make predictions with a simple linear model — first for a single student, then for an entire dataset using matrix-vector multiplication
4. Compute Mean Squared Error two ways (loop vs. vectorized) to evaluate model performance
5. Visualize your results with a log-scale timing plot and a predicted-vs-actual scatter plot
6. Solve shape puzzles that build the intuition you'll need to debug real matrix operation errors

**AI Usage**: 
- You may use AI tools for this lab
- **REQUIRED**: Include AI attribution using the format shown in the syllabus
- For B/A level credit, include detailed attribution in markdown cells

## **Grading**

| Component | Points |
|-----------|--------|
| Exercise 1: DataFrames to Arrays | 10 |
| Exercise 2: Timing Comparisons | 25 |
| Exercise 3: Dot Products in Action | 30 |
| Exercise 4: Shape Puzzles | 5 |
| Exercise 5: Reflection | 20 |
| In-Class Mastery Assessment (Week 4) | 10 |
| **Total** | **100** |


---

## **AI Assistance Declaration**

**Tools used:** [ChatGPT-4 / GitHub Copilot / Claude / None / Other — update this]

**Sections with AI help:** [e.g., "Part 2, Question 3" — update this]

**What I learned:** [Brief description — update this]

**What I did independently:** [Sections completed without AI — update this]

---

In [None]:
# Run this cell first to import required libraries
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set up plotting defaults
# Optional: uncomment the next line to use a seaborn-style grid theme
# plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

# Set random seed for reproducibility
np.random.seed(42)

---

## **Exercise 1: DataFrames to Arrays (10 points)**

In this section you'll practice moving data between pandas DataFrames and NumPy arrays — a critical skill since you'll typically load data with pandas but compute with NumPy/scikit-learn.

We'll work with a small student exam dataset.

In [None]:
# Run this cell to create the dataset (do NOT modify)
student_data = pd.DataFrame({
    'hours_studied': [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7,
                      7.7, 5.9, 4.5, 3.3, 1.1, 8.9, 2.5, 1.9, 6.1, 7.4],
    'hours_slept':   [7.0, 6.5, 8.0, 5.5, 7.5, 8.5, 5.0, 7.0, 6.0, 7.5,
                      6.5, 7.0, 7.5, 8.0, 9.0, 5.5, 8.0, 8.5, 6.5, 6.0],
    'practice_exams': [1, 3, 2, 5, 2, 0, 4, 3, 4, 1,
                       4, 3, 2, 1, 0, 5, 1, 0, 3, 4],
    'exam_score':    [65, 80, 72, 95, 70, 55, 97, 82, 92, 63,
                      90, 83, 76, 68, 50, 96, 64, 58, 84, 88]
})

print(student_data)
print(f"\nShape: {student_data.shape}")

### **Question 1.1 (2.5 points)**

Create two variables:
- `X` — a NumPy array containing the **feature columns** (hours_studied, hours_slept, practice_exams)
- `y` — a NumPy array containing the **target column** (exam_score)

Print the shape of each and verify `X` has shape `(20, 3)` and `y` has shape `(20,)`.
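
As a minimal sketch of the conversion pattern, the example below uses a made-up `toy_df` (not the lab dataset; all names here are hypothetical). Selecting a list of columns and calling `.to_numpy()` gives a 2-D array, while selecting a single column gives a 1-D array:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, unrelated to student_data
toy_df = pd.DataFrame({
    "height_cm": [170.0, 182.5],
    "weight_kg": [65.0, 80.0],
    "age": [30, 41],
})

# A list of column names selects a DataFrame -> 2-D array
toy_features = toy_df[["height_cm", "weight_kg"]].to_numpy()

# A single column name selects a Series -> 1-D array
toy_target = toy_df["age"].to_numpy()

print(toy_features.shape)  # (2, 2)
print(toy_target.shape)    # (2,)
```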

In [None]:
# YOUR CODE HERE
X = ...
y = ...

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"X dtype: {X.dtype}")
print(f"y dtype: {y.dtype}")

### **Question 1.2 (2.5 points)**

Using **NumPy array operations only** (no pandas), compute the following for each feature column in `X`:
- Mean
- Standard deviation
- Minimum
- Maximum

**Hint:** Use `axis=0` to compute along columns. Store the means in a variable called `feature_means`.
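
To illustrate what `axis=0` does, here is a small sketch on a made-up 3×2 array (hypothetical values, not the lab data). It also shows that `np.std` uses the population formula (`ddof=0`) by default, which differs from pandas' `.std()` default of `ddof=1`:

```python
import numpy as np

# Hypothetical array: 3 rows (samples), 2 columns (features)
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])

col_means = toy.mean(axis=0)   # collapses the rows: one value per column, shape (2,)
row_means = toy.mean(axis=1)   # collapses the columns: one value per row, shape (3,)

col_stds = toy.std(axis=0)                  # population std (ddof=0, NumPy default)
col_stds_sample = toy.std(axis=0, ddof=1)   # sample std (ddof=1, pandas default)

print(col_means.shape, row_means.shape)
print(col_stds, col_stds_sample)
```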

In [None]:
# YOUR CODE HERE
feature_names = ['hours_studied', 'hours_slept', 'practice_exams']

feature_means = ...  # Compute the mean of each column
feature_stds = ...   # Compute the std of each column
feature_mins = ...   # Compute the min of each column
feature_maxs = ...   # Compute the max of each column

for i, name in enumerate(feature_names):
    print(f"{name:>18}: mean={feature_means[i]:.2f}, std={feature_stds[i]:.2f}, "
          f"min={feature_mins[i]:.1f}, max={feature_maxs[i]:.1f}")

### **Question 1.3 (2.5 points)**

Create a **standardized** version of X called `X_std` where each feature has mean ≈ 0 and standard deviation ≈ 1.

Use the formula: $X_{std} = \frac{X - \mu}{\sigma}$

Do this using **vectorized NumPy operations** (no loops!).

Then verify that the column means are approximately 0 and column standard deviations are approximately 1.
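
The formula works without loops because of NumPy broadcasting. The sketch below, on a made-up array with hypothetical names, shows how per-column vectors of shape `(2,)` are applied to every row of a `(3, 2)` array:

```python
import numpy as np

# Hypothetical array: 3 samples, 2 features on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])

mu = toy.mean(axis=0)     # shape (2,)
sigma = toy.std(axis=0)   # shape (2,)

# Broadcasting: (3, 2) - (2,) subtracts mu from each row; the division works the same way
toy_scaled = (toy - mu) / sigma

print(toy_scaled.shape)          # (3, 2)
print(toy_scaled.mean(axis=0))   # approximately 0 for each column
print(toy_scaled.std(axis=0))    # approximately 1 for each column
```

After scaling, both toy columns hold the same values even though the originals differed by a factor of 10, which is the point of standardizing.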

In [None]:
# YOUR CODE HERE
X_std = ...

# Verification (don't modify)
print(f"X_std shape: {X_std.shape}")
print(f"Column means (should be ≈ 0): {X_std.mean(axis=0).round(6)}")
print(f"Column stds (should be ≈ 1):  {X_std.std(axis=0).round(6)}")

### **Question 1.4 (2.5 points)**

Convert your standardized NumPy array `X_std` **back** into a pandas DataFrame with the original column names (`hours_studied`, `hours_slept`, `practice_exams`). Store it in a variable called `df_std`.

Display the first 5 rows using `.head()`.
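
For the reverse direction, a minimal sketch with a made-up array and hypothetical column names: the `pd.DataFrame` constructor accepts a 2-D array plus a `columns` list.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 array, unrelated to X_std
toy_array = np.array([[0.5, -1.0],
                      [1.5,  2.0]])

# One column name per array column, in order
toy_frame = pd.DataFrame(toy_array, columns=["feature_a", "feature_b"])

print(toy_frame.head())
```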

In [None]:
# YOUR CODE HERE
df_std = ...

print(type(df_std))  # Should be <class 'pandas.core.frame.DataFrame'>
df_std.head()

---
## **Exercise 2: Timing Comparisons (25 points)**

In this section you'll empirically measure the speed difference between Python loops and NumPy vectorized operations.

### **Question 2.1 (10 points)**

Write **two functions** that each compute the dot product of two vectors:

1. `dot_product_loop(a, b)` — Uses a Python `for` loop to multiply corresponding elements and sum them up. **Do not use any NumPy functions inside the loop** (no `np.sum`, etc.).

2. `dot_product_numpy(a, b)` — Uses NumPy's `@` operator or `np.dot()`. Should be a single line.

Both functions should return a single number (the dot product).
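
For reference, the dot product of two length-$n$ vectors is the sum of their elementwise products. A worked example with made-up vectors (not the test vectors used in the cell below):

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i, \qquad \begin{bmatrix}1 & 0 & 2\end{bmatrix} \cdot \begin{bmatrix}3 & 5 & -1\end{bmatrix} = (1)(3) + (0)(5) + (2)(-1) = 1$$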

In [None]:
def dot_product_loop(a, b):
    """Compute the dot product of vectors a and b using a Python for loop."""
    # YOUR CODE HERE
    pass


def dot_product_numpy(a, b):
    """Compute the dot product of vectors a and b using NumPy."""
    # YOUR CODE HERE
    pass


# Test with small vectors (don't modify)
test_a = np.array([1, 2, 3, 4, 5])
test_b = np.array([6, 7, 8, 9, 10])

result_loop = dot_product_loop(test_a, test_b)
result_numpy = dot_product_numpy(test_a, test_b)
expected = 1*6 + 2*7 + 3*8 + 4*9 + 5*10  # = 130

print(f"Loop result:     {result_loop}")
print(f"NumPy result:    {result_numpy}")
print(f"Expected result: {expected}")
print(f"Both correct:    {result_loop == expected and result_numpy == expected}")

### **Question 2.2 (10 points)**

Now let's time both functions across **increasing vector sizes** to see how the performance gap grows.

Complete the code below to:
1. Time both `dot_product_loop` and `dot_product_numpy` for each size in `sizes`
2. Store the times in the `loop_times` and `numpy_times` lists
3. Compute the speedup ratio (loop time / numpy time)

**Hint:** Use `time.time()` before and after each function call to measure elapsed time.
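
A minimal sketch of the before/after timing pattern, using a made-up workload (`sum_of_squares` is hypothetical and not part of the lab). It uses `time.perf_counter()`, the standard-library clock intended for measuring short durations; `time.time()` can be substituted in exactly the same way.

```python
import time

def sum_of_squares(n):
    """Hypothetical workload used only to demonstrate the timing pattern."""
    total = 0
    for i in range(n):
        total += i * i
    return total

start = time.perf_counter()               # timestamp before the call
result = sum_of_squares(100_000)
elapsed = time.perf_counter() - start     # elapsed time in seconds

print(f"Elapsed: {elapsed:.6f} seconds")
```

A single measurement like this is noisy, so expect the numbers to vary somewhat from run to run.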

In [None]:
sizes = [100, 1_000, 10_000, 100_000, 1_000_000]
loop_times = []
numpy_times = []

print(f"{'Size':>12} | {'Loop (sec)':>12} | {'NumPy (sec)':>12} | {'Speedup':>10}")
print("-" * 55)

for size in sizes:
    # Create random vectors of the given size
    a = np.random.randn(size)
    b = np.random.randn(size)

    # YOUR CODE HERE: Time dot_product_loop
    t_loop = ...  # elapsed time in seconds

    # YOUR CODE HERE: Time dot_product_numpy
    t_numpy = ...  # elapsed time in seconds

    loop_times.append(t_loop)
    numpy_times.append(t_numpy)

    speedup = t_loop / t_numpy if t_numpy > 0 else float('inf')
    print(f"{size:>12,} | {t_loop:>12.6f} | {t_numpy:>12.6f} | {speedup:>9.1f}x")

### **Question 2.3 (5 points)**

Create a **log-scale plot** showing the timing comparison. Plot both `loop_times` and `numpy_times` against `sizes`.

Requirements:
- Use `plt.loglog()` or set both axes to log scale
- Label both lines (use `label=` and `plt.legend()`)
- Add axis labels: "Vector Size" and "Time (seconds)"
- Add a title: "Dot Product: Python Loop vs NumPy"
- Use a grid for readability
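
One reason log-log axes suit this comparison: a power-law relationship $t = c \cdot n^k$ appears as a straight line with slope $k$. The sketch below checks this numerically on synthetic timings (made-up values, not measurements) by fitting a line to the logged data with `np.polyfit`:

```python
import numpy as np

# Synthetic, perfectly linear-time data: t = c * n with a made-up constant c
toy_sizes = np.array([100, 1_000, 10_000, 100_000])
toy_times = 2e-7 * toy_sizes

# Fit a degree-1 polynomial to (log n, log t); the slope estimates the exponent k
slope, intercept = np.polyfit(np.log10(toy_sizes), np.log10(toy_times), 1)

print(f"Estimated exponent: {slope:.2f}")
```

On real measurements the points will not fall exactly on a line, particularly at small sizes where fixed overhead dominates.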

In [None]:
# YOUR CODE HERE
plt.figure(figsize=(10, 6))

# Plot both lines on a log-log scale
# ...

plt.show()

---

## **Exercise 3: Dot Products in Action (30 points)**

Now let's apply the dot product to a real machine learning context.

### **Question 3.1 (5 points)**

Suppose you have a trained linear model with the following weights and bias for predicting exam scores:

- Weight for `hours_studied`: **6.5**
- Weight for `hours_slept`: **3.2**
- Weight for `practice_exams`: **4.8**
- Bias: **20.0**

Create a NumPy array `weights` containing the three weights and a variable `bias` for the bias term.

Then use the dot product to predict the exam score for a student who studied **6 hours**, slept **7 hours**, and took **3 practice exams**.
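
A linear model's prediction is $\hat{y} = \mathbf{w} \cdot \mathbf{x} + b$. The sketch below applies that formula to a hypothetical two-feature model with invented numbers (not the weights listed above):

```python
import numpy as np

# Hypothetical two-feature model; all values are invented for illustration
toy_weights = np.array([2.0, 0.5])
toy_bias = 1.0
toy_input = np.array([3.0, 4.0])

# prediction = w . x + b = (2.0 * 3.0 + 0.5 * 4.0) + 1.0 = 9.0
toy_prediction = toy_weights @ toy_input + toy_bias

print(toy_prediction)  # 9.0
```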

In [None]:
# YOUR CODE HERE
weights = ...
bias = ...

student = ...  # [hours_studied, hours_slept, practice_exams]

prediction = ...  # Use the dot product!

print(f"Weights:    {weights}")
print(f"Student:    {student}")
print(f"Prediction: {prediction}")

### **Question 3.2 (5 points)**

Now use **matrix-vector multiplication** to predict exam scores for **all 20 students** in `X` (from Exercise 1) using the same weights and bias.

Store the result in `all_predictions`. Verify it has shape `(20,)`.
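
To see why the result is one-dimensional, consider this sketch on a hypothetical 3-sample, 2-feature matrix (invented values). Multiplying a `(3, 2)` matrix by a `(2,)` vector performs one dot product per row and returns shape `(3,)`; the scalar bias is then broadcast across all entries.

```python
import numpy as np

# Hypothetical data: 3 samples, 2 features
toy_X = np.array([[3.0, 4.0],
                  [1.0, 2.0],
                  [0.0, 6.0]])
toy_weights = np.array([2.0, 0.5])   # shape (2,)
toy_bias = 1.0

# (3, 2) @ (2,) -> (3,); adding the scalar bias broadcasts over all 3 predictions
toy_predictions = toy_X @ toy_weights + toy_bias

# The same values computed one row at a time, for comparison
row_by_row = np.array([row @ toy_weights + toy_bias for row in toy_X])

print(toy_predictions.shape)                      # (3,)
print(np.allclose(toy_predictions, row_by_row))   # True
```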

In [None]:
# YOUR CODE HERE (should be ONE line of code for the prediction)
all_predictions = ...

print(f"Predictions shape: {all_predictions.shape}")
print(f"\nFirst 5 predictions: {all_predictions[:5].round(1)}")
print(f"First 5 actual:      {y[:5]}")

### **Question 3.3 (10 points)**

Compute the **Mean Squared Error (MSE)** between your predictions and the actual exam scores (`y`).

The formula is:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

Do this **two ways**:
1. Using a Python `for` loop
2. Using vectorized NumPy operations (no loops)

Verify both give the same answer.
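
As a hand-worked check of the formula, take made-up predictions $\hat{y} = [3, 5, 2]$ and made-up actual values $y = [2, 5, 4]$ (so $n = 3$):

$$MSE = \frac{1}{3}\left[(3 - 2)^2 + (5 - 5)^2 + (2 - 4)^2\right] = \frac{1 + 0 + 4}{3} = \frac{5}{3} \approx 1.67$$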

In [None]:
# Method 1: For loop
# YOUR CODE HERE
mse_loop = ...

# Method 2: Vectorized NumPy
# YOUR CODE HERE
mse_numpy = ...

print(f"MSE (loop):     {mse_loop:.4f}")
print(f"MSE (numpy):    {mse_numpy:.4f}")
print(f"Results match:  {np.isclose(mse_loop, mse_numpy)}")

### **Question 3.4 (10 points)**

Create a scatter plot with:
- **X-axis:** Actual exam scores (`y`)
- **Y-axis:** Predicted exam scores (`all_predictions`)
- A **diagonal dashed line** from (50, 50) to (100, 100) representing "perfect predictions"
- Title: "Predicted vs Actual Exam Scores"
- Appropriate axis labels

Then, in a **markdown cell below your plot**, answer this question in 2–3 sentences:  
*How well does this simple model predict exam scores? Where does it tend to over-predict or under-predict?*

In [None]:
# YOUR CODE HERE
plt.figure(figsize=(8, 8))

# Scatter plot of actual vs predicted
# ...

# Diagonal line for perfect predictions
# ...

plt.show()

*YOUR WRITTEN RESPONSE HERE (replace this text)*

---

## **Exercise 4: Shape Puzzles (5 points)**

These questions test your understanding of matrix shapes and compatibility. For each question, **predict the result first**, then run the code to verify.

### **Question 4.1 (1 point each × 5 = 5 points)**

For each operation below:
1. **Predict** whether it will succeed or raise an error
2. If it succeeds, **predict the shape** of the result
3. Write your prediction in the markdown cell, then run the code to check
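
The rule to apply: for 2-D arrays, `(m, n) @ (n, p)` succeeds when the inner dimensions match and produces shape `(m, p)`. A short sketch with hypothetical arrays `P` and `Q` (shapes chosen to differ from the puzzles below):

```python
import numpy as np

P = np.ones((2, 3))
Q = np.ones((3, 4))

# Inner dimensions match (3 == 3), so the result takes the outer dimensions
print((P @ Q).shape)   # (2, 4)

# Reversing the order gives (3, 4) @ (2, 3): inner dimensions 4 and 2 do not match
try:
    Q @ P
except ValueError as err:
    print(f"Error: {err}")
```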

**Puzzle A:** `A` has shape (5, 3) and `B` has shape (3, 2). What is the shape of `A @ B`?

YOUR PREDICTION: *(replace this — e.g., "(5, 2)" or "Error")*

In [None]:
A = np.random.randn(5, 3)
B = np.random.randn(3, 2)
try:
    result = A @ B
    print(f"Success! Shape: {result.shape}")
except ValueError as e:
    print(f"Error: {e}")

**Puzzle B:** `A` has shape (4, 5) and `B` has shape (4, 5). What is the shape of `A @ B`?

YOUR PREDICTION: *(replace this)*

In [None]:
A = np.random.randn(4, 5)
B = np.random.randn(4, 5)
try:
    result = A @ B
    print(f"Success! Shape: {result.shape}")
except ValueError as e:
    print(f"Error: {e}")

**Puzzle C:** `A` has shape (4, 5) and `B` has shape (4, 5). What is the shape of `A @ B.T`?

YOUR PREDICTION: *(replace this)*

In [None]:
A = np.random.randn(4, 5)
B = np.random.randn(4, 5)
try:
    result = A @ B.T
    print(f"Success! Shape: {result.shape}")
except ValueError as e:
    print(f"Error: {e}")

**Puzzle D:** `v` has shape (5,) and `w` has shape (5,). What is the shape of `v @ w`?

YOUR PREDICTION: *(replace this)*

In [None]:
v = np.random.randn(5)
w = np.random.randn(5)
try:
    result = v @ w
    print(f"Success! Result: {result:.4f}  (shape: scalar)")
except ValueError as e:
    print(f"Error: {e}")

**Puzzle E:** `X` has shape (100, 5) and `w` has shape (3,). What is the shape of `X @ w`?

YOUR PREDICTION: *(replace this)*

In [None]:
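# Note: this reassigns X from Exercise 1; re-run the Question 1.1 cell to restore the (20, 3) feature matrix
X = np.random.randn(100, 5)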
w = np.random.randn(3)
try:
    result = X @ w
    print(f"Success! Shape: {result.shape}")
except ValueError as e:
    print(f"Error: {e}")

---
## **Exercise 5: Reflection (20 points)**

Answer the following questions in **your own words** (2–4 sentences each, unless the question states otherwise). These are the kinds of conceptual questions that may appear on the mastery assessment next week.

### **Question 5.1 (3 points)**
In your own words, explain what the dot product computes and why it is the core operation behind linear model predictions.

*YOUR ANSWER HERE*

### **Question 5.2 (3 points)**
Why is NumPy so much faster than Python `for` loops for numerical operations? Name at least two reasons.

*YOUR ANSWER HERE*

### **Question 5.3 (4 points)**
A classmate says: "I don't need to understand shapes — scikit-learn handles all of that for me." Do you agree or disagree? Why? Give one example of when understanding shapes would help you debug a problem.

*YOUR ANSWER HERE*

### **Question 5.4 (5 points)**

In Exercise 3, you used the weights `[6.5, 3.2, 4.8]` to predict exam scores. Notice that `hours_studied` has the largest weight (6.5) while `hours_slept` has the smallest (3.2). Does this necessarily mean that hours studied is the most important feature for predicting exam scores? Consider what you learned about standardization in Exercise 1: how might the different scales of the features affect the weight values, and why might you want to standardize your data before comparing feature importance?

*YOUR ANSWER HERE*

### **Question 5.5 (5 points)**

This week's verse tells us *"The heavens declare the glory of God; the skies proclaim the work of his hands. Day after day they pour forth speech; night after night they reveal knowledge"* (Psalm 19:1–2, NIV). In this lab you used linear algebra to uncover patterns hidden in raw numbers — relationships that aren't obvious until you compute them. In 3–5 sentences, reflect: How does the practice of finding patterns in data connect to the idea that creation reveals knowledge to those who look carefully? What responsibility do we have when the patterns we find (or miss) affect real people's lives?

*YOUR ANSWER HERE*

---

### **Submission Checklist:**

Before uploading your notebook to Canvas, confirm the following:

- [ ] AI Assistance Declaration at the top of the notebook is filled out completely
- [ ] All code cells run without errors (Kernel → Restart & Run All)
- [ ] Exercise 1: `X` has shape `(20, 3)`, `y` has shape `(20,)`, `X_std` column means ≈ 0 and stds ≈ 1, `df_std` is a DataFrame
- [ ] Exercise 2: Both dot product functions return 130 on the test vectors, timing table prints for all 5 sizes, log-scale plot displays with labels and legend
- [ ] Exercise 3: Single prediction computed, all 20 predictions computed in one line, MSE computed both ways with matching results, scatter plot displays with diagonal reference line and written response
- [ ] Exercise 4: All 5 shape puzzles have your written prediction filled in *before* the code cell
- [ ] Exercise 5: All 5 reflection questions answered in your own words (2–4 sentences each; 3–5 for Question 5.5)
- [ ] File saved as `.ipynb` and uploaded to Canvas before the deadline

### **Submission Instructions**

1. Save this notebook
2. **Restart kernel and run all cells** (Kernel → Restart & Run All)
3. Verify all outputs appear correctly (especially visualizations)
4. Check that all written responses are complete
5. Submit the `.ipynb` file to Canvas before Monday, February 16 @ 6:00 PM
   - Grace period until Wednesday, February 18 @ 11:59 PM

**Remember:** This notebook submission is worth 90% of your Week 3 Lab grade. The remaining 10% comes from next week's in-class mastery assessment.

---

### **Mastery Assessment Preparation Tips**

The mastery assessment at the start of Week 4 will test your ability to do the following without AI assistance:
- Compute a dot product by hand for small vectors
- Predict the output shape of a matrix operation given input shapes
- Explain why vectorized NumPy code is faster than Python loops

**Practice:** Try computing $\begin{bmatrix}2 & 3 & 1\end{bmatrix} \cdot \begin{bmatrix}4 & -1 & 5\end{bmatrix}$ by hand. Can you do it in under 30 seconds?