<div style="text-align: center;">

# **Spring 2026 &mdash; CIS 3813<br>Advanced Data Science<br>(Introduction to Machine Learning)**
### Week 1: The Machine Learning Workflow

</div>

**Date:** 26 January 2026  
**Time:** 6:00–9:00 PM  
**Instructor:** Dr. Patrick T. Marsh  
**Course Verse:** "He has shown you, O mortal, what is good. And what does the Lord require of you? To act justly and to love mercy and to walk humbly with your God."  &mdash; *Micah 6:8 (NIV)*

---
## **Welcome and Course Orientation**
Below are some highlights from the course syllabus. You can find the full details, including the course schedule, on Canvas.  

- **Class Time:** Mondays 6–9 PM
- **Office Hours:** Mondays & Wednesdays 4:30–5:50 PM (and by appointment / virtual)
- **Graded Components:** 
    - Labs/Homework (35%)
    - Mini-Kaggle Competition (5%)
    - Exams (10% each; 20% cumulative)
    - Final Project (40%)
        - Final Project Check-in (5%)
        - Final Project Report (25%)
        - Final Project Presentation (10%) 
- **AI Use:** Allowed with attribution and full understanding; please see the syllabus for the full AI policy
- **Grading Rubric Summary:** (Details in the syllabus)
    * **D/F**: Incomplete/Deficient Work
    * **C**: Working Solution (Baseline Competence)
    * **B**: Demonstrated Understanding
    * **A**: Independent Mastery

---
## **Week 1 Learning Objectives**

By the end of today's session, you will be able to:
1. Describe the stages of the machine learning workflow
2. Explain why data leakage occurs and how to prevent it (the Golden Rule)
3. Define bias and variance in the context of machine learning
4. Build a complete ML pipeline using scikit-learn

---


## **Today's Outline**
- Lecture
    1. What is Machine Learning
    2. The Machine Learning Workflow
    3. How Models Fail
    4. The Mathematical Perspective
    5. Review: Functions & Slope from Calculus
    6. Introduction to Scikit-Learn
- Break (10-15 Minutes)
- Lab (or Homework)
- Review
    1. Key Takeaways from Week 1
    2. Common Pitfalls to Avoid


---

## **Opening Reflection**

> *"It is the glory of God to conceal a matter; to search out a matter is the glory of kings."*  
> — Proverbs 25:2 (NIV)

As we begin our journey into advanced data science, remember that uncovering patterns in data is a sacred task. God has embedded order and structure into creation, and our role as data scientists is to discover and interpret these patterns for the good of others. The work we do this semester—finding insights hidden in data—reflects the image of God in us.

---
## **1.1 What is Machine Learning?**

Machine learning is the practice of using algorithms to parse data, learn from it, and make predictions or decisions. Unlike traditional programming where we explicitly code rules, in ML we let the algorithm discover the rules from the data.

**Questions:**
* How does a baby learn to identify a cat?
* How does a scientist discover a new species?

**Three Types of Machine Learning:**
1. **Supervised Learning**: Learning from labeled data (aka with an "Answer Key")
    - **Regression:** Predicting a continuous quantity
    - **Classification:** Predicting a label
2. **Unsupervised Learning**: Finding patterns in unlabeled data (aka without an "Answer Key")
    - **Clustering:** Grouping like data
    - **Dimensionality Reduction:** Compressing Data
3. **Reinforcement Learning**: Learning through trial and error (rewards/punishments)
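
As a rough illustration of the first two categories, the sketch below fits a supervised classifier and an unsupervised clustering model to the same synthetic points. The data (`make_blobs` with hand-picked centers), the variable names, and the choice of `LogisticRegression` and `KMeans` are assumptions made for this example, not part of the course materials.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 points in two well-separated groups (illustrative only)
X_demo, y_demo = make_blobs(
    n_samples=200, centers=[[-3, -3], [3, 3]], cluster_std=1.0, random_state=0
)

# Supervised: the model is given the labels (the "answer key")
classifier = LogisticRegression()
classifier.fit(X_demo, y_demo)
supervised_accuracy = classifier.score(X_demo, y_demo)  # accuracy on the training points

# Unsupervised: the model sees only the features and must group the points itself
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(X_demo)

print(f"Supervised training accuracy: {supervised_accuracy:.2f}")
print(f"Cluster ids found without labels: {sorted(set(cluster_ids.tolist()))}")
```

Note that the cluster ids are arbitrary names: cluster `0` need not correspond to label `0`.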

---

## **1.2 The Machine Learning Workflow**

Below is a commonly used Data Science workflow. We referenced it several times last semester. Understanding this workflow is crucial to becoming an effective data scientist.

**The Complete Data Science Workflow:**
<div style="text-align: center;">

**Problem $\longleftrightarrow$ Data Acquisition $\longleftrightarrow$ Cleaning/Preparation $\longleftrightarrow$ Exploration/Visualization $\longleftrightarrow$ Modeling/Inference $\longleftrightarrow$ Evaluation $\longleftrightarrow$ Communication/Deployment**

</div>

**Notice that the arrows point in both directions.** In practice the arrows are a bit more muddled. You can end up jumping from any point of the process to any other point of the process &mdash; multiple times &mdash; as you discover issues or new questions. 

Every machine learning project follows this same workflow. This semester we will focus on the *Modeling/Inference* and *Evaluation* steps, expanding them into a more detailed sequence.

**Expanded ML Workflow (our focus this semester):**

<div style="text-align: center;">

**... $\longleftrightarrow$ [Data] Cleaning/Preparation $\longleftrightarrow$ Model Selection $\longleftrightarrow$ Data Splitting $\longleftrightarrow$ Model Training $\longleftrightarrow$ Model Evaluation $\longleftrightarrow$ Model Refinement $\longleftrightarrow$ ...**

</div>

This is essentially a "zoom in" on the middle portion of the complete workflow. Each step has a purpose: to keep the model from "memorizing" the data and to allow actual "learning". 

---

## **1.3 How Models Fail**

### **The Exam Analogy**

Think of your time in college. In a lot of your classes you are given a set of homework problems each week to help you learn the content for the course. Now, let's think of your final exam. If your final exam consists solely of the homework problems, you can earn a 100% on the exam simply by memorizing the answers to the homework problems. *There is no guarantee of learning!* If the final exam is similar, but not identical, to the homework problems, studying the homework problems should lead to a good final exam score. This demonstrates you learned the material. 

In the above scenario, the homework problems are analogous to a machine learning model's training dataset, whereas the final exam is analogous to the test dataset. If the final exam solely consists of the homework problems, the model should do extremely well. But the model has not demonstrated its ability to learn. There is no telling how well it will do against questions it has not already seen before.

This concept is summed up in the ***<u>Golden Rule of Data Science</u>***: **Never train on your test data**.
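
The exam analogy can be sketched in code. The example below uses a 1-nearest-neighbor classifier as a stand-in for a "memorizing" student; the synthetic data, the variable names, and the choice of model are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data; flip_y=0.2 assigns a random class to 20% of the samples (label noise)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
X_hw, X_exam, y_hw, y_exam = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# A 1-nearest-neighbor model simply stores the training set: pure memorization
memorizer = KNeighborsClassifier(n_neighbors=1)
memorizer.fit(X_hw, y_hw)

homework_score = memorizer.score(X_hw, y_hw)    # graded on problems it has already seen
exam_score = memorizer.score(X_exam, y_exam)    # graded on problems it has never seen

print(f"Homework (training) accuracy: {homework_score:.2f}")
print(f"Exam (test) accuracy:         {exam_score:.2f}")
```

Grading the model on its own "homework" says nothing about learning; only the held-out "exam" score estimates how it will do on new questions.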

### **What Happens When You Break the Golden Rule?**

Let's see the danger in action:

You are working for a biotech startup trying to develop a diagnostic test for a rare autoimmune disease. You have blood samples from 100 patients (50 sick; 50 healthy) and you run a gene sequencing panel that measures the expression levels of 50,000 genes for each patient. The company founder has asked you to build a model using these data to predict whether a patient has the autoimmune disease.

You recognize that you cannot test all 50,000 gene expression levels due to performance reasons. Instead you select the top 10 gene expression levels that correlate most strongly with the disease. Next you train a classifier using only those 10 gene expression levels and evaluate your results. 

In [None]:
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# ==========================================
# 0. GENERATE DATA (PURE NOISE)
# ==========================================
# 100 Patients, 50,000 Genes. Pure random numbers.
np.random.seed(42)
n_samples = 100
n_features = 50000
n_best_features = 10

X = np.random.normal(size=(n_samples, n_features))
y = np.random.randint(0, 2, size=n_samples) # Random Sick (1) or Healthy (0)

print(f"Dataset: {n_samples} samples, {n_features} features (ALL NOISE)")
print("-" * 60)

# ==========================================
# 1. THE WRONG WAY
# ==========================================
# CRIME: Selecting features BEFORE splitting
selector = SelectKBest(f_classif, k=n_best_features)
X_selected_leak = selector.fit_transform(X, y) # <--- The model sees the test answers here!

X_train_l, X_test_l, y_train_l, y_test_l = train_test_split(
    X_selected_leak, y, test_size=0.3, random_state=42
)

model_leak = LogisticRegression()
model_leak.fit(X_train_l, y_train_l)
acc_leak = accuracy_score(y_test_l, model_leak.predict(X_test_l))

print(f"LEAKY Approach:   {acc_leak*100:.1f}% Accuracy")

Congratulations! You just built a model that correctly identified the disease with 93.3% accuracy. It's time to celebrate!

Or did you ...

It turns out that when you deploy your model in a real hospital, the accuracy drops to around 50%, which is no better than a random guess.

#### **Why did your model fail?**

Your results looked good, but they were artificially inflated! With 50,000 gene expression levels and only 100 patients, *some* gene expression levels will correlate strongly with the target variable purely by random chance. By selecting features on the whole dataset, you cherry-picked these statistical coincidences using the test patients' labels as well as the training patients' labels. This is known as **data leakage**: information from the test dataset *leaked* into the training process. Your model didn't learn biology; it memorized the noise of those specific 100 patients.
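
To see how easily chance correlations arise, here is a minimal sketch that measures the strongest correlation between pure-noise "genes" and random labels. It assumes a smaller dataset than the lecture example (5,000 features instead of 50,000) so that it runs quickly; the variable names are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_genes = 100, 5_000

# Pure noise: no gene has any real relationship with the labels
X_noise = rng.normal(size=(n_patients, n_genes))
y_noise = rng.integers(0, 2, size=n_patients)

# Pearson correlation of every gene with the labels, computed in one matrix product
X_std = (X_noise - X_noise.mean(axis=0)) / X_noise.std(axis=0)
y_std = (y_noise - y_noise.mean()) / y_noise.std()
correlations = X_std.T @ y_std / n_patients

max_abs_corr = np.abs(correlations).max()
print(f"Strongest |correlation| among {n_genes} noise genes: {max_abs_corr:.2f}")
```

With more features to search through, the strongest chance correlation tends to grow, which is why picking the "top 10" from 50,000 candidates is so risky.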

**Real-world consequences of data leakage:**
- Your model performs great in testing but fails in production
- A medical diagnosis model that seems 95% accurate but fails on real patients (our case)
- A fraud detection system that looks perfect but misses actual fraud
- A loan approval model that appears fair but makes biased decisions

In all cases, you've lied to yourself about your model's true performance.

**The correct approach**: Always split FIRST, then fit your preprocessing ONLY on training data:

In [None]:
print(f"Dataset: {n_samples} samples, {n_features} features (ALL NOISE)")
print("-" * 60)

# ==========================================
# 1. THE CORRECT WAY
# ==========================================
# Split FIRST
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step A: Fit the selector ONLY on Training data
selector_manual = SelectKBest(f_classif, k=n_best_features)
selector_manual.fit(X_train, y_train)

# Step B: Transform both Train AND Test separately
# RISK: It is very easy to accidentally type 'fit_transform' on X_test here!
X_train_selected = selector_manual.transform(X_train)
X_test_selected = selector_manual.transform(X_test)

# Step C: Train model
model_manual = LogisticRegression()
model_manual.fit(X_train_selected, y_train)
acc_manual = accuracy_score(y_test, model_manual.predict(X_test_selected))

print(f"LEAKY Approach:   {acc_leak*100:.1f}% Accuracy")
print(f"MANUAL Approach:  {acc_manual*100:.1f}% Accuracy")
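
One way to remove the `fit_transform` risk noted in the comments above is to wrap the selector and the classifier in a `Pipeline`, so the selector is only ever fit on whatever data is passed to `.fit()`. The sketch below is self-contained and assumes a smaller noise dataset (5,000 features) than the lecture cells; the step names `"select"` and `"classify"` are arbitrary labels chosen here.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Smaller pure-noise dataset than the lecture example, to keep the sketch fast
rng = np.random.default_rng(42)
X_noise = rng.normal(size=(100, 5_000))
y_noise = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.3, random_state=42
)

# The Pipeline fits the selector on whatever is passed to .fit(): training rows only
leak_free = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("classify", LogisticRegression()),
])
leak_free.fit(X_tr, y_tr)

n_selected = int(leak_free.named_steps["select"].get_support().sum())
acc_pipeline = leak_free.score(X_te, y_te)  # test rows are only transformed, never fit
print(f"PIPELINE Approach: {acc_pipeline*100:.1f}% Accuracy using {n_selected} features")
```

Because the data are pure noise, an honest evaluation should land somewhere near chance rather than near the inflated leaky score.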


### **From Data Leakage to a Deeper Question: Why Do Models Fail?**

The gene expression example teaches us about data leakage, but it also reveals something deeper: **even with perfect methodology, we can still build models that fail.**

Let's imagine we fixed the leakage problem. We split our data first:
- Training: 70 patients (35 sick, 35 healthy)
- Testing: 30 patients (15 sick, 15 healthy)

Now we select the top 10 genes using ONLY the training data, then evaluate on the test set.

**We might still have a problem.** Why?

### **The Statistical Reality: 50,000 Features vs. 70 Patients**

Think about what we're asking the model to do:
- Search through 50,000 genes
- Find patterns in only 70 patients
- Hope those patterns work on new patients

This is like studying for an exam by:
1. Reading a 50,000-page textbook
2. Doing only 70 practice problems
3. Hoping you can answer any question on the final exam

**What's likely to happen?** You'll memorize specific details from those 70 practice problems instead of learning general principles. When the final exam asks slightly different questions, you're lost.

#### **Two Ways Models Fail**

This brings us to a fundamental insight in machine learning: **models can fail in two completely different ways.**

##### **Failure Mode 1: Too Simple (Missing the Pattern)**

Imagine you're trying to predict whether it will rain tomorrow. Your model is:
- **"It will rain if it rained today"**

This model is too simple! It ignores:
- Temperature, humidity, pressure
- Cloud cover, wind patterns
- Seasonal trends

**Problem**: The model is so simple it misses important patterns.  
**Technical term**: This is called **high bias** or **underfitting**.  
**Analogy**: A student who only memorized one formula and tries to use it for every problem.

##### **Failure Mode 2: Too Complex (Memorizing Noise)**

Now imagine your rain prediction model is:
- **Track the exact temperature, humidity, pressure, wind speed, cloud cover, barometric pressure rate of change, lunar phase, and 42 other variables at 100 locations**
- **Remember the exact weather pattern from every single day you've seen**
- **If tomorrow looks even slightly like October 23, 2019, predict whatever happened on October 24, 2019**

**Problem**: The model memorized specific instances instead of learning general patterns.  
**Technical term**: This is called **high variance** or **overfitting**.  
**Analogy**: A student who memorized all homework problems exactly but can't solve new problems.

**Our gene expression model had Failure Mode 2**: With 50,000 genes and only 70 patients, it memorized which genes happened to correlate in those specific 70 people, not which genes actually cause the disease.
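
Both failure modes can be reproduced on a toy regression problem. The sketch below fits polynomials of increasing degree to noisy samples of a sine curve; the data-generating function, the noise level, and the degrees 1, 4, and 15 are assumptions picked for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def make_noisy_sine(n):
    """Draw n points from y = sin(2*pi*x) plus Gaussian noise (illustrative data)."""
    x = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train_demo = make_noisy_sine(30)
x_test, y_test_demo = make_noisy_sine(200)

train_mse, test_mse = {}, {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train_demo)
    train_mse[degree] = mean_squared_error(y_train_demo, model.predict(x_train))
    test_mse[degree] = mean_squared_error(y_test_demo, model.predict(x_test))
    print(f"degree={degree:2d}  train MSE={train_mse[degree]:.3f}  test MSE={test_mse[degree]:.3f}")
```

The straight line (degree 1) is too simple to follow the curve, so it does poorly on both sets. A very high degree typically drives training error down while test error stops improving or gets worse, which is the signature of memorizing noise.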

### **The Bias-Variance Tradeoff: The Central Challenge of Machine Learning**

Here's the fundamental problem we face as data scientists:

**We want models that generalize to new data, but:**
- Simple models miss important patterns (high bias)
- Complex models memorize noise (high variance)
- We need to find the sweet spot in between

This tension is called the **Bias-Variance Tradeoff**.

### **Defining Bias and Variance**

Before we visualize the tradeoff, we need to define what **bias** and **variance** actually mean in machine learning. These terms have specific technical meanings that are different from everyday usage.

#### **What is Bias?**

**Bias** measures how far off your model's **average predictions** are from the **true values**.

Think of it like this:
- You're throwing darts at a dartboard
- **High bias** = Your darts consistently land away from the bullseye (systematically off-target)
- You might hit the same wrong spot repeatedly
- The problem is your **aim** is wrong, not that your throws are inconsistent

**In machine learning:**
- High bias means your model is **too simple** to capture the true relationship
- It makes systematic errors because it's missing important patterns
- Like using a straight line to fit data that curves

**Example from gene expression:**
- If we only use 2 genes to predict disease, we might have high bias
- If, say, 50 genes truly mattered, we would be systematically wrong because we're ignoring the other 48
- Our model is too simple to capture the biological complexity

#### **What is Variance?**

**Variance** measures how much your model's predictions **change** when trained on **different datasets**.

Back to the dartboard:
- **High variance** = Your darts land all over the place (inconsistent)
- Sometimes left, sometimes right, sometimes up, sometimes down
- The problem is your **consistency**, not necessarily your average aim

**In machine learning:**
- High variance means your model is **too complex** and sensitive to training data specifics
- Train on slightly different data → Get wildly different predictions
- Like fitting a wiggly curve that passes through every single training point

**Example from gene expression:**
- With 50,000 genes and only 70 patients, we have high variance
- Train on 70 different patients → Model picks completely different genes
- Our model memorized noise specific to the training set
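
The "different patients, different genes" behavior can be checked directly. The sketch below draws two independent pure-noise cohorts of 70 patients and compares which 10 genes each one selects. It assumes 5,000 genes rather than 50,000 to keep the run short, and `top_genes` is a helper written for this example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
n_genes = 5_000  # smaller than the lecture's 50,000 to keep this quick

def top_genes(n_patients=70, k=10):
    """Draw a fresh pure-noise cohort and return the indices of its top-k genes."""
    X_cohort = rng.normal(size=(n_patients, n_genes))
    y_cohort = rng.integers(0, 2, size=n_patients)
    selector = SelectKBest(f_classif, k=k).fit(X_cohort, y_cohort)
    return set(np.flatnonzero(selector.get_support()).tolist())

genes_a = top_genes()
genes_b = top_genes()
shared = genes_a & genes_b

print(f"Genes picked from cohort A: {sorted(genes_a)}")
print(f"Genes picked from cohort B: {sorted(genes_b)}")
print(f"Genes picked by both: {len(shared)}")
```

When the selected features change almost completely from one training set to the next, the model built on them will change too. That instability is what "high variance" means.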

### **Visual Intuition: The Dartboard Analogy**

![Bias-Variance Tradeoff](./week01-lecture_files/bias-variance-dart-board.png)

Image source: https://blogs.alisterluiz.com/understanding-the-bias-variance-tradeoff-a-comprehensive-guide/

**The goal:** Low bias AND low variance (top-left quadrant)  
**The reality:** We usually have to trade one for the other

Now that we understand bias and variance, let's revisit our examples:

#### **Gene Expression Model (50,000 genes, 70 patients)**
- **Bias:** LOW - Model is complex enough to fit any pattern
- **Variance:** HIGH - Model changes dramatically with different training data
- **Problem:** Overfitting (memorizing noise)
- **Prediction:** Fails on new patients

#### **"Rain = Yesterday's Weather" Model**
- **Bias:** HIGH - Model is too simple, ignores important factors
- **Variance:** LOW - Model is consistent (always uses same rule)
- **Problem:** Underfitting (missing patterns)
- **Prediction:** Consistently wrong

#### **Homework Memorization**
- **Bias:** LOW - You can recall every homework answer perfectly
- **Variance:** HIGH - Only works for those exact problems
- **Problem:** Overfitting (didn't learn concepts)
- **Prediction:** Fails on exam (slightly different problems)

#### **Not Studying**
- **Bias:** HIGH - Don't understand the material
- **Variance:** LOW - Consistently guessing wrong
- **Problem:** Underfitting (didn't learn enough)
- **Prediction:** Fails on exam (and homework)

### **Now We're Ready for the Tradeoff**

With bias and variance defined, we can understand the fundamental tradeoff:

**As model complexity increases:**
- Bias ↓ (model can fit more complex patterns)
- Variance ↑ (model becomes more sensitive to training data)

**The question:** Where do we stop? When is the model complex enough to capture patterns but not so complex that it memorizes noise?

Answering that question is what navigating the **Bias-Variance Tradeoff** means.


#### **Visualizing the Tradeoff**

| **Model Complexity →** | **Too Simple (Underfit)** | **Sweet Spot (Just Right)** | **Too Complex (Overfit)** |
|---|---|---|---|
| **Bias-Variance** | High Bias, Low Variance | Balanced Bias-Variance | Low Bias, High Variance |
| **Pattern Recognition** | Misses patterns, ignores signal | Learns patterns, captures signal | Memorizes noise, captures noise |
| **Training Performance** | Poor | Good | Excellent (often near-perfect) |
| **Test Performance** | Poor | Good | Poor |
| **Problem** | Model too simple to capture real relationships | Model complexity matches problem complexity | Model too complex, fits random noise |
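
The training and test rows of the table can be reproduced with a single model family by sweeping one complexity knob. The sketch below uses a decision tree's `max_depth` on a noisy synthetic dataset; the dataset (`make_moons`), the noise level, and the depths tried are assumptions for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_moons, y_moons = make_moons(n_samples=500, noise=0.35, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_moons, y_moons, test_size=0.3, random_state=0
)

results = {}
for depth in (1, 4, None):  # None lets the tree grow until it fits every training point
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth!s:>4}  train={results[depth][0]:.2f}  test={results[depth][1]:.2f}")
```

The unrestricted tree reaches perfect training accuracy by construction, yet its test accuracy stays well below that. The gap between the two columns is the practical symptom of overfitting.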

#### **Connecting Back to Our Examples (Again)**

**Gene Expression Example:**
- **50,000 features, 70 patients** → Way too complex → High variance → Overfitting
- Model memorized random correlations specific to those 70 patients
- Failed on new patients because those correlations don't generalize

**Homework vs. Exam Analogy:**
- **Memorizing answers** → High variance → You've overfit to the homework
- **Not studying at all** → High bias → You underfit (too simple)
- **Understanding concepts** → Balanced → You generalize to the exam

**Rain Prediction:**
- **"Rain = Yesterday's weather"** → High bias → Underfit
- **Track 100 variables at 100 locations** → High variance → Overfit
- **Temperature + humidity + pressure** → Balanced → Just right


---

## **1.4 The Mathematical Perspective**

For those interested in the theory, here's what's happening mathematically:

**Total Error = Bias² + Variance + Irreducible Error**

Where:
- **Bias² = Error from wrong assumptions** (model too simple)
- **Variance = Error from sensitivity to training data** (model too complex)
- **Irreducible Error = Natural randomness** (we can't fix this)

As we increase model complexity:
- Bias ↓ (model can capture more patterns)
- Variance ↑ (model becomes more sensitive to training data specifics)

**The sweet spot**: Where total error is minimized.
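
For readers who want the formal statement, one common way to write this decomposition is sketched below. It assumes squared-error loss and data generated as $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$. The symbols $f$, $\hat{f}$, $\varepsilon$, and $\sigma^2$ are notation introduced here, not taken from the lecture.

$$
\mathbb{E}\left[(y - \hat{f}(x))^2\right]
= \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible Error}}
$$

The expectation is taken over different training sets (and over the noise in $y$), which is why variance is described as sensitivity to the training data.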



In [None]:
# Conceptual visualization (we'll code this properly later)
import numpy as np
import matplotlib.pyplot as plt

complexity = np.linspace(0, 10, 100)
bias_squared = 10 / (1 + complexity)  # Decreases with complexity
variance = complexity ** 1.5 / 5       # Increases with complexity
irreducible = np.ones_like(complexity) * 2

total_error = bias_squared + variance + irreducible

plt.figure(figsize=(12, 6))
plt.plot(complexity, bias_squared, label='Bias²', linewidth=2)
plt.plot(complexity, variance, label='Variance', linewidth=2)
plt.plot(complexity, total_error, label='Total Error', linewidth=3, color='purple')
plt.axvline(x=complexity[np.argmin(total_error)], color='green',
            linestyle='--', label='Optimal Complexity')
plt.xlabel('Model Complexity →', fontsize=12)
plt.ylabel('Error', fontsize=12)
plt.title('The Bias-Variance Tradeoff', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()
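
The curves in the conceptual plot are hand-drawn formulas, not measurements. As a rough sketch of how Bias² and Variance could be estimated empirically, the simulation below retrains a polynomial model on many freshly drawn training sets and compares its predictions with a known "true" function. The true function, noise level, sample sizes, and degrees are all assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 1, 50)
f_true = np.sin(2 * np.pi * x_grid)   # assumed "true" function for this simulation
n_trials, n_points, noise_sd = 200, 30, 0.3

bias_sq, variance = {}, {}
for degree in (1, 5):
    preds = np.empty((n_trials, x_grid.size))
    for t in range(n_trials):
        # Each trial draws a brand-new training set from the same process
        x = rng.uniform(0, 1, n_points)
        y = np.sin(2 * np.pi * x) + rng.normal(scale=noise_sd, size=n_points)
        coefs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coefs, x_grid)
    # Bias²: how far the AVERAGE prediction sits from the truth
    bias_sq[degree] = np.mean((preds.mean(axis=0) - f_true) ** 2)
    # Variance: how much predictions move from one training set to the next
    variance[degree] = np.mean(preds.var(axis=0))
    print(f"degree={degree}: bias^2={bias_sq[degree]:.3f}, variance={variance[degree]:.3f}")
```

The simpler model shows the larger bias and the more flexible model shows the larger variance, matching the direction of the arrows above. This kind of simulation is only possible when the true function is known, which is never the case with real data.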


### **How Do We Find the Sweet Spot?**

Great question! Throughout this semester, we'll learn strategies:

#### **Strategy 1: Use More Training Data**
- More data → Harder to overfit
- Gene example: with 50,000 genes we would need far more than 100 patients (as a rough guess, on the order of 100,000+)
- Not always possible (data is expensive!)

#### **Strategy 2: Reduce Model Complexity**
- Fewer features (feature selection done correctly)
- Simpler algorithms
- Gene example: Test only 100 most promising genes

#### **Strategy 3: Regularization (Coming in Week 4!)**
- Add penalty for complexity
- Forces model to focus on strongest patterns
- Ridge, Lasso, Elastic Net

#### **Strategy 4: Cross-Validation (Week 5!)**
- Test on multiple train/test splits
- Get honest estimate of generalization
- Tune complexity to minimize error on the held-out validation folds (not on the final test set)

#### **Strategy 5: Ensemble Methods (Weeks 11-12!)**
- Combine many models
- Average out the variance
- Random Forests, Gradient Boosting

### **Connecting to Our Faith Integration**

Remember our opening verse:

> *"It is the glory of God to conceal a matter; to search out a matter is the glory of kings."*  
> — Proverbs 25:2

The Bias-Variance Tradeoff teaches us about **humility in our models**:

**High Bias (Underfitting)** = **Intellectual laziness**
- We assume the world is simpler than it is
- We ignore important complexities
- Like reducing humans to a single dimension

**High Variance (Overfitting)** = **Intellectual pride**
- We think we can model everything perfectly
- We confuse memorization with understanding
- Like claiming to know exactly why each person makes each choice

**The Sweet Spot** = **Humble wisdom**
- Recognize real patterns God has embedded in creation
- Admit we cannot perfectly model every detail
- Build models that serve others without claiming omniscience

As Micah 6:8 reminds us to "walk humbly," the Bias-Variance Tradeoff reminds us to model humbly—capturing what we can learn while admitting what we cannot.

### **Questions for Discussion**

1. Can you think of real-world decisions where you've seen "underfitting" (too simple thinking)?
2. Can you think of situations where you've seen "overfitting" (over-interpreting limited data)?
3. In the gene expression example, what would you do if you had to deploy a model with limited data?
4. How does the Bias-Variance Tradeoff apply to other fields (medicine, law, theology)?

---

## **1.5 Review: Functions & Slope from Calculus**

Before we dive into machine learning models, we need to review some fundamental concepts from calculus that underpin how models learn.

### **Functions**

A **function** is a relationship between inputs and outputs where each input maps to exactly one output.

In [None]:
# Example: A simple linear function
def f(x):
    return 2 * x + 3

# Test it
print(f(0))   # Output: 3
print(f(1))   # Output: 5
print(f(5))   # Output: 13


In machine learning, we're trying to find the function that best describes the relationship between our input features (X) and our target variable (y).

### **Slope (The Derivative)**

The **slope** tells us how much the output changes when we change the input. In calculus, we use the derivative to find the slope at any point.

For a linear function: `y = mx + b`
- `m` is the slope
- `b` is the y-intercept

**Why does slope matter in ML?**
- Models learn by finding the slope (gradient) of the error
- They adjust parameters to reduce error
- This process is called **gradient descent** (we'll cover this in Week 2)
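
As a preview of the idea, here is a minimal sketch of "follow the slope downhill" on a one-parameter toy error curve. The error function, starting point, learning rate, and number of steps are all made-up values for illustration. This is not how scikit-learn fits `LinearRegression`.

```python
def error(w):
    """Toy error curve with its minimum at w = 3 (a stand-in for a model's loss)."""
    return (w - 3) ** 2

def error_slope(w):
    """Derivative of the toy error: d/dw (w - 3)^2 = 2 * (w - 3)."""
    return 2 * (w - 3)

w = 0.0               # initial guess for the parameter
learning_rate = 0.1   # step size (an assumed value)

for step in range(50):
    # A positive slope means the error rises to the right, so step left (and vice versa)
    w = w - learning_rate * error_slope(w)

print(f"w after 50 steps: {w:.4f}, error: {error(w):.6f}")
```

Each update shrinks the distance to the minimum by a constant factor here, so the parameter settles close to `w = 3` without the loop ever being told where the minimum is.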

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Visualizing slope
x = np.linspace(-5, 5, 100)
y = 2 * x + 3

plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', linewidth=2, label='y = 2x + 3')
plt.axhline(y=0, color='k', linewidth=0.5)
plt.axvline(x=0, color='k', linewidth=0.5)
plt.grid(True, alpha=0.3)
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Linear Function: Slope = 2', fontsize=14)
plt.legend()
plt.show()

# The slope is 2: for every 1 unit increase in x, y increases by 2 units


### **Key Calculus Concept: Rate of Change**

In [None]:
# Approximate derivative (slope at a point)
def approximate_derivative(f, x, h=0.0001):
    """
    Calculate the approximate derivative of function f at point x
    Using the definition: f'(x) ≈ [f(x+h) - f(x)] / h
    """
    return (f(x + h) - f(x)) / h

# Example function
def f(x):
    return x**2  # f(x) = x²

# Find slope at x = 3
# The derivative of x² is 2x, so at x=3, slope should be 6
deltas = [1, 0.1, 0.01, 0.001, 0.0001, 1e-5, 1e-6]
for delta in deltas:
    slope = approximate_derivative(f, 3, delta)
    print(f"Approximate slope of x² at x=3 with h={delta}: {slope:.4f}")
# Exact value from the power rule, for comparison with the approximations above
print(f"Actual slope (derivative = 2x): {2*3}")
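
The one-sided formula above is not the only way to approximate a slope. As a hedged aside, the sketch below compares it with the two-sided (central) difference on a cubic function. The function names and the choice of $f(x) = x^3$ are assumptions made for this example.

```python
def forward_difference(func, x, h):
    """One-sided estimate: [f(x + h) - f(x)] / h."""
    return (func(x + h) - func(x)) / h

def central_difference(func, x, h):
    """Two-sided estimate: [f(x + h) - f(x - h)] / (2h)."""
    return (func(x + h) - func(x - h)) / (2 * h)

def cube(x):
    return x ** 3  # exact derivative is 3x², so the slope at x = 2 is 12

h = 0.01
forward = forward_difference(cube, 2.0, h)
central = central_difference(cube, 2.0, h)
exact = 3 * 2.0 ** 2

print(f"forward: {forward:.6f}  (error {abs(forward - exact):.6f})")
print(f"central: {central:.6f}  (error {abs(central - exact):.6f})")
```

For the same step size, the central difference is noticeably closer to the exact slope, because the errors from the two sides largely cancel.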

---

## **1.6 Introduction to Scikit-Learn**

**Scikit-learn** is Python's premier machine learning library. Its key characteristics:
- Simple and efficient tools for data mining and data analysis
- Accessible to everybody and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable (BSD license)

### **Basic Scikit-Learn Workflow**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Scikit-learn follows a consistent API:
# 1. Import the model
# 2. Instantiate the model
# 3. Fit the model to training data
# 4. Predict on new data
# 5. Evaluate the model
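
The five steps can be sketched end to end on a tiny synthetic problem. The data below (a noisy line, $y = 3x + 2$) and all variable names are made up for this illustration.

```python
# 1. Import the model (plus helpers for splitting and scoring)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3x + 2 plus noise (made up for this illustration)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = 3 * X_toy[:, 0] + 2 + rng.normal(scale=0.5, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# 2. Instantiate the model
toy_model = LinearRegression()
# 3. Fit the model to training data
toy_model.fit(X_tr, y_tr)
# 4. Predict on new data
y_hat = toy_model.predict(X_te)
# 5. Evaluate the model
toy_r2 = r2_score(y_te, y_hat)

print(f"slope={toy_model.coef_[0]:.2f}, intercept={toy_model.intercept_:.2f}, R²={toy_r2:.3f}")
```

Swapping in a different estimator usually changes only steps 1 and 2. The `fit` / `predict` calls stay the same, which is what "consistent API" means in practice.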

### **Scikit-Learn Pipeline: A Practical Example**

Let's work through a complete example using the California Housing dataset.

In [None]:
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for better-looking plots
# sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

#### **Step 1: Load and Explore Data**

In [None]:
# Load the California Housing dataset
housing = fetch_california_housing()

# Create a DataFrame for easier manipulation
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['MedHouseValue'] = housing.target

print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
print(df.head())

print("\nDataset Description:")
print(df.describe())

print("\nFeature names:")
for i, name in enumerate(housing.feature_names):
    print(f"{i+1}. {name}")

#### **Step 2: Visualize the Data**

In [None]:
# Visualize the relationship between features and target
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.ravel()

for i, col in enumerate(housing.feature_names):
    axes[i].scatter(df[col], df['MedHouseValue'], alpha=0.3, s=1)
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Median House Value')
    axes[i].set_title(f'{col} vs House Value')

plt.tight_layout()
plt.show()

#### **Step 3: Prepare the Data**

In [None]:
# Separate features (X) and target (y)
X = df[housing.feature_names]
y = df['MedHouseValue']

# Split into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

#### **Step 4: Scale and Train Manually (without a Pipeline object)**

In [None]:
# Method 1: Manual approach (step-by-step)
print("=" * 50)
print("METHOD 1: MANUAL APPROACH")
print("=" * 50)

# Step 1: Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 2: Train the model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Step 3: Make predictions
y_pred = model.predict(X_test_scaled)

# Step 4: Evaluate
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("\nManual Approach Results:")
# The target is in units of $100,000, so convert the RMSE to dollars for readability
print(f"RMSE: {rmse:.4f} (about ${rmse * 100_000:,.0f})")
print(f"R² Score: {r2:.4f}")

#### **Step 5: Build a Pipeline (with Pipeline object)**

In [None]:
# Method 2: Using Pipeline (RECOMMENDED)
print("\n" + "=" * 50)
print("METHOD 2: USING PIPELINE (RECOMMENDED)")
print("=" * 50)

# Create a pipeline that scales then applies linear regression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Fit the entire pipeline
pipeline.fit(X_train, y_train)

# Make predictions (scaling happens automatically!)
y_pred_pipeline = pipeline.predict(X_test)

# Evaluate
mse_pipeline = mean_squared_error(y_test, y_pred_pipeline)
rmse_pipeline = np.sqrt(mse_pipeline)
r2_pipeline = r2_score(y_test, y_pred_pipeline)

print("\nPipeline Results:")
# The target is in units of $100,000, so convert the RMSE to dollars for readability
print(f"RMSE: {rmse_pipeline:.4f} (about ${rmse_pipeline * 100_000:,.0f})")
print(f"R² Score: {r2_pipeline:.4f}")

print("\n✓ Results are identical! But Pipeline is cleaner and less error-prone.")
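
Rather than printing that the results match, the claim can be checked. The sketch below repeats both methods on synthetic data (so it runs without downloading the housing dataset) and compares the predictions numerically. The dataset and the variable names are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_syn, y_syn = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Manual: fit the scaler on training rows, then reuse it on the test rows
scaler_syn = StandardScaler().fit(X_tr)
manual_model = LinearRegression().fit(scaler_syn.transform(X_tr), y_tr)
pred_manual = manual_model.predict(scaler_syn.transform(X_te))

# Pipeline: the same two steps wrapped in one object
pipe_syn = Pipeline([("scaler", StandardScaler()), ("regressor", LinearRegression())])
pred_pipe = pipe_syn.fit(X_tr, y_tr).predict(X_te)

predictions_match = bool(np.allclose(pred_manual, pred_pipe))
print(f"Manual and Pipeline predictions match: {predictions_match}")
```

The two approaches perform the same arithmetic. The Pipeline's advantage is that there is no separate `transform` call to forget or to apply to the wrong data.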

#### **Step 6: Understand the Model**

In [None]:
# Extract the linear regression model from the pipeline
lr_model = pipeline.named_steps['regressor']

# Show feature importance (coefficients)
feature_importance = pd.DataFrame({
    'Feature': housing.feature_names,
    'Coefficient': lr_model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

print("\nCoefficients on standardized features (sorted by absolute value):")
print(feature_importance)

# Visualize coefficients
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'])
plt.xlabel('Coefficient Value')
plt.title('Feature Importance in Linear Regression Model')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.tight_layout()
plt.show()

#### **Step 7: Visualize Predictions**

In [None]:
# Plot actual vs predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_pipeline, alpha=0.5, s=10)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         'r--', linewidth=2, label='Perfect Prediction')
plt.xlabel('Actual House Value')
plt.ylabel('Predicted House Value')
plt.title('Actual vs Predicted House Values')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate and visualize residuals (prediction errors)
residuals = y_test - y_pred_pipeline

plt.figure(figsize=(10, 6))
plt.scatter(y_pred_pipeline, residuals, alpha=0.5, s=10)
plt.axhline(y=0, color='r', linestyle='--', linewidth=2)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals (Actual - Predicted)')
plt.title('Residual Plot')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### **Why Use Pipelines?**

**Benefits of Scikit-Learn Pipelines:**

1. **Cleaner Code**: All steps in one object
2. **Helps Prevent Data Leakage**: Preprocessing is fit only on the data passed to `fit`, so test data isn't seen during training (as long as you split first)
3. **Easier Deployment**: Save one object instead of multiple
4. **Reproducibility**: Same preprocessing steps every time
5. **Cross-Validation Ready**: Works seamlessly with cross-validation
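
Benefits 3 and 5 can be sketched briefly. The example below cross-validates a pipeline and then saves and reloads it as a single file with `joblib`. The synthetic data and the file name `housing_pipeline.joblib` are hypothetical choices for this illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
pipe_syn = Pipeline([("scaler", StandardScaler()), ("regressor", LinearRegression())])

# Cross-validation refits the whole pipeline (scaler included) inside every fold
cv_r2 = cross_val_score(pipe_syn, X_syn, y_syn, cv=5, scoring="r2")

# Deployment: the fitted scaler and model travel together in a single file
pipe_syn.fit(X_syn, y_syn)
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "housing_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe_syn, model_path)
    restored = joblib.load(model_path)

same_predictions = bool(np.allclose(pipe_syn.predict(X_syn), restored.predict(X_syn)))
print(f"Mean cross-validated R²: {cv_r2.mean():.3f}")
print(f"Reloaded pipeline gives the same predictions: {same_predictions}")
```

Saving the scaler and the model separately would work too, but it leaves room to pair a model with the wrong scaler at prediction time.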

---

## **BREAK (10-15 minutes)**

---

## **2.1 Lab Exercises** (new notebook)

---
## **3.1 Key Takeaways from Week 1**

### **1. The Machine Learning Workflow is Your Roadmap**
Every ML project follows these 6 essential steps:
- [Data] Cleaning/Preparation ↔ Model Selection ↔ Data Splitting ↔ Model Training ↔ Model Evaluation ↔ Model Refinement
- Don't skip steps! Each one is crucial
- Arrows point both directions—iteration is normal and expected

### **2. Calculus Concepts Are the Foundation**
- Functions map inputs to outputs (what we're trying to learn)
- Slope (derivative) tells us rate of change
- Models use gradients to learn from errors (more on this next week!)

### **3. Scikit-Learn Is the Standard ML Library in Python**
- Consistent API across all models
- Extensive documentation and community support
- Built-in tools for preprocessing, modeling, and evaluation

### **4. Pipelines Are Essential for Professional ML**
- Prevent data leakage by ensuring proper train/test separation
- Make code cleaner and more maintainable
- Ensure reproducibility across different runs
- Chain preprocessing and modeling steps together seamlessly

### **5. Data Leakage Causes Overly Optimistic Performance Estimates**
- Training on test data (or using test data in preprocessing) inflates metrics
- Real-world performance will be worse than reported
- **Golden Rule**: Never train on your test data. In practice, always split FIRST, then fit preprocessing on the training data only

### **6. Models Can Fail in Two Fundamentally Different Ways**
Even with perfect methodology and no data leakage:
- **Too simple** (high bias, underfitting): Misses important patterns
- **Too complex** (high variance, overfitting): Memorizes noise instead of learning patterns

### **7. The Bias-Variance Tradeoff Is the Central Challenge of ML**
- As model complexity increases: bias ↓ but variance ↑
- Finding the right complexity is both an art and a science
- This semester, we'll learn techniques to navigate this tradeoff (regularization, cross-validation, ensemble methods)

### **8. Linear Regression Is Your Baseline**
- Simple but powerful first model
- Easy to interpret (with standardized features, coefficient magnitudes indicate relative importance)
- Great starting point for any regression problem
- Provides a benchmark for more complex models

**Next week**: We'll see how models actually learn through **gradient descent**, which will help us understand how to control model complexity during training and why choosing the right learning rate matters.

---

## **3.2 Common Pitfalls to Avoid**

### **Pitfall 1: Data Leakage**
```python
# BAD: the scaler is fit on every row, so it sees the test data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
```

#### **Solution: Use Pipeline or fit only on training data**
```python
# GOOD: split first; the pipeline then fits the scaler on training rows only
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline = Pipeline([('scaler', StandardScaler()), ('model', LinearRegression())])
pipeline.fit(X_train, y_train)
```

### **Pitfall 2: Not understanding what metrics mean**
- Don't just report numbers—interpret them!
- Consider the context of your problem
- RMSE of 0.5 means predictions are typically off by about $50,000 for this dataset (the target is in units of $100,000)
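
A short worked example of that interpretation, using made-up prices expressed in the California Housing target's units (1.0 = $100,000):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values in the dataset's units (1.0 = $100,000)
y_actual = np.array([2.0, 1.5, 3.0, 4.0])
y_predicted = np.array([2.5, 1.0, 3.5, 3.5])

# Every prediction is off by 0.5 units, so MSE = 0.25 and RMSE = 0.5
rmse_units = np.sqrt(mean_squared_error(y_actual, y_predicted))
rmse_dollars = rmse_units * 100_000

print(f"RMSE: {rmse_units:.2f} target units, or about ${rmse_dollars:,.0f}")
```

Whether a $50,000 error is acceptable depends on the context: it is large for a $150,000 home and much less so for a $4,000,000 one.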

### **Pitfall 3: Overfitting to your test set**
- Test set is sacred—use it only once at the end
- Don't repeatedly adjust your model based on test performance
- Use cross-validation instead (coming in Week 5!)
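
Until cross-validation is introduced, one simple way to protect the test set is to carve out a separate validation set and make all tuning decisions there. The sketch below assumes a 60/20/20 split, synthetic data, and a decision tree whose `max_depth` is the setting being tuned. All of these are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_all, y_all = make_regression(n_samples=1000, n_features=5, noise=20.0, random_state=0)

# Carve off the test set first, then split the remainder into train/validation (60/20/20)
X_rest, X_test_final, y_rest, y_test_final = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# Compare candidate settings on the VALIDATION set only
val_scores = {}
for depth in (2, 4, 8):
    candidate = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    val_scores[depth] = candidate.score(X_val, y_val)
best_depth = max(val_scores, key=val_scores.get)

# Touch the test set exactly once, with the chosen setting
final_model = DecisionTreeRegressor(max_depth=best_depth, random_state=0).fit(X_tr, y_tr)
test_r2 = final_model.score(X_test_final, y_test_final)

print(f"Validation R² by depth: { {d: round(s, 3) for d, s in val_scores.items()} }")
print(f"Chosen depth: {best_depth}, final test R²: {test_r2:.3f}")
```

The test score is reported once and is not used to go back and pick a different depth.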