<div style="text-align: center;">

# **Spring 2026 &mdash; CIS 3813<br>Advanced Data Science<br>(Introduction to Machine Learning)**
### Week 1: The Machine Learning Workflow

</div>

---

## **Lab Instructions**

**Due Date**: Monday, 02 February @ 6:00 PM (with grace period until Wednesday, 04 February @ 11:59 PM)

In this lab, you will:  
1. Load, Explore, and Prepare Data
2. Build a Machine Learning Pipeline
3. Evaluate Performance
4. Visualize Results

**AI Usage**: 
- You may use AI tools for this lab
- **REQUIRED**: Include AI attribution using the format shown in the syllabus
- For B/A level credit, include detailed attribution in markdown cells

## **Grading**

| Component | Points |
|-----------|--------|
| Exercise 1: Load and Explore Data | 15 |
| Exercise 2: Train-Test Split | 15 |
| Exercise 3: Build Your First Pipeline | 20 |
| Exercise 4: Make Predictions and Evaluate | 20 |
| Exercise 5: Visualize and Analyze | 25 |
| Faith Integration | 5 |
| Bonus Challenge (Optional) | 5 |
| In-Class Mastery Assessment (Week 1) | No Grade |
| **Total** | **100** |


---

## **AI Assistance Declaration**

**Tools used:** [e.g., ChatGPT-4 / GitHub Copilot / Claude / None]

**Sections with AI help:** [e.g., "Exercise 3: Pipeline Creation"]

**What I learned:** [Brief description of key concepts AI helped you understand]

**What I did independently:** [Sections you completed without AI assistance]

---
## **Exercise 1: Load and Explore Data (15 Points)**

Load the California Housing dataset and answer the questions below. 

Note: This dataset is from the 1990 U.S. Census. The target variable (MedianHomeValue) is expressed in units of $100,000 and is capped at about $500,000 (a value of about 5.0). You may see a "ceiling" effect in your plots, where many points pile up at the maximum. This is expected.
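
To make the "ceiling" effect concrete, here is a minimal sketch using made-up numbers rather than the real dataset. The variable names (`true_values`, `recorded`, `cap`) and the normal distribution are assumptions chosen only for illustration; the point is that every value above a cap is recorded as the cap itself, so a cluster appears at the maximum.

```python
import numpy as np

# Hypothetical home values in units of $100,000 (made-up, NOT the real dataset)
rng = np.random.default_rng(0)
true_values = rng.normal(loc=3.0, scale=1.5, size=1000)

# Apply a reporting cap of 5.0, mimicking how capped values are recorded
cap = 5.0
recorded = np.minimum(true_values, cap)

# Every value above the cap collapses onto the cap itself
n_at_cap = int(np.sum(recorded == cap))
print(f"Largest recorded value: {recorded.max():.1f}")
print(f"Values sitting exactly at the cap: {n_at_cap} of {recorded.size}")
```

A histogram or scatter plot of `recorded` would show a spike or a flat line at 5.0, which is the pattern to expect in your own plots.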

### **Code (8 Points)**

In [None]:
# Your code here
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

# 1. Load the data
# Hint: Look at the Lecture Notes


# 2. Create a DataFrame with proper column names


# 3. Add the target variable to the DataFrame


### **Answer these questions: (7 Points)**
**Question 1a:** How many samples (rows) are in the dataset? (1 Point)  
**Your answer:**



**Question 1b:** How many features (excluding target) are in the dataset? (1 Point)  
**Your answer:**



**Question 1c:** What is the range of the target variable (min and max)? (1 Point)  
**Your answer:**



**Question 1d:** What are the mean and median of MedInc (Median Income)? (2 Points)  
**Your answer:**



**Reflection Question**: Write a brief explanation of what each feature represents. Look up the dataset documentation if needed. (2 Points)  
**Your answer:**



---

## **Exercise 2: Train-Test Split (15 Points)**

Properly split your data into training and testing sets.
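
As a reminder of how `train_test_split` behaves, here is a small sketch on toy data. The arrays `X_toy` and `y_toy` and the 75/25 ratio are assumptions made for illustration; your lab uses the housing data and the ratio stated in the code cell below.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 samples, 2 features (made-up values for illustration only)
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.arange(8)

# Hold out 25% of the rows; random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# Four arrays come back, always in this order: X train, X test, y train, y test
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Rows of `X` and entries of `y` are shuffled together, so each feature row stays paired with its own target value.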

### **Code (12 Points)**

In [None]:
# Your code here
from sklearn.model_selection import train_test_split

# 1. Separate X (features) and y (target)


# 2. Split into train/test with 80/20 ratio, random_state=42


# 3. Print the shapes of all four resulting arrays


# 4. Calculate and print what percentage of the data is in training vs testing

### **Answer this question: (3 Points)**

**Question**: Why do we need to split data into training and testing sets? Write your answer in a markdown cell below.  
**Your Answer:**

---

## **Exercise 3: Build Your First Pipeline (20 Points)**

Create a pipeline that scales the data and fits a linear regression model.
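
The general shape of a scikit-learn `Pipeline` can be sketched as follows. This example deliberately uses different steps (`MinMaxScaler` and `Ridge`) and made-up data, so it illustrates the pattern rather than the answer; the step names `"scaler"` and `"model"` are arbitrary labels chosen here.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge

# Toy data: two made-up features on very different scales
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.uniform(0, 1, 50), rng.uniform(0, 1000, 50)])
y_toy = 2.0 * X_toy[:, 0] + 0.003 * X_toy[:, 1]

# Each step is a (name, estimator) pair; the names are arbitrary labels
toy_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", Ridge(alpha=1.0)),
])

# fit() fits the scaler on X_toy, transforms it, then fits the model on the result
toy_pipeline.fit(X_toy, y_toy)

# predict() reuses the already-fitted scaler before calling the model
preds = toy_pipeline.predict(X_toy[:3])
print(preds)
```

Bundling the steps this way means the scaler only ever learns its statistics from the data passed to `fit()`.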

### **Code (18 Points)**

In [None]:
# Your code here
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# 1. Create a pipeline with StandardScaler and LinearRegression


# 2. Fit the pipeline to the training data


# 3. Print confirmation that training is complete
print("Model training complete!")



**Challenge Question**: What would happen if you forgot to include the StandardScaler in your pipeline? Try it and observe the results. Write your findings below. (2 Points)  
**Your Answer:**

---

## **Exercise 4: Make Predictions and Evaluate (20 Points)**

Use your trained pipeline to make predictions and evaluate performance.
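
Before computing metrics on the real predictions, it can help to check them on numbers small enough to verify by hand. The sketch below uses four made-up values (`y_true` and `y_hat` are hypothetical names, not your lab variables).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Four made-up targets and predictions; errors are 0.5, 0.0, -1.0, -1.0
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 3.0, 8.0])

mse = mean_squared_error(y_true, y_hat)   # mean of squared errors
rmse = np.sqrt(mse)                       # square root puts it back in target units
mae = mean_absolute_error(y_true, y_hat)  # mean of absolute errors
r2 = r2_score(y_true, y_hat)              # 1 - SS_residual / SS_total

print(f"MSE:  {mse:.4f}")   # 0.5625
print(f"RMSE: {rmse:.4f}")  # 0.7500
print(f"MAE:  {mae:.4f}")   # 0.6250
print(f"R²:   {r2:.4f}")    # 0.8475
```

Here the squared errors sum to 2.25 and the total sum of squares around the mean (4.25) is 14.75, so R² = 1 - 2.25/14.75, which is about 0.8475.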

### **Code (17 Points)**

In [None]:
# Your code here
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np

# 1. Make predictions on the test set


# 2. Calculate the following metrics:
#    - Mean Squared Error (MSE)
#    - Root Mean Squared Error (RMSE)
#    - Mean Absolute Error (MAE)
#    - R² Score


# 3. Print all metrics in a clear format


# 4. Interpret the R² score
print("\nInterpretation of R² score:")
print("An R² of [your_value] means that approximately [X]% of the variance")
print("in house prices is explained by our features.")


**Question**: What does each metric tell us about our model's performance? Write a brief explanation for each below. (3 Points)  
**Your Answer:**

---

## **Exercise 5: Visualize and Analyze (25 Points)**

Create visualizations to understand your model's performance.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

### **Actual vs. Predicted Plot (8 Points)**

* **Create a scatter plot of Actual vs. Predicted values.**
    * Include a diagonal line showing perfect predictions  
    * Label axes appropriately  
    * Add a title  

In [None]:
# Your code here


### **Residual Histogram (7 Points)**

* **Create a histogram of prediction errors (residuals)**
    * residuals = y_test - y_pred
    * What pattern do you observe?
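
The residual definition above can be sketched on a handful of made-up values. The arrays `actual` and `predicted` are hypothetical, and `np.histogram` is used here only to show the bin counts that a histogram plot would draw.

```python
import numpy as np

# Made-up actual and predicted values (not from the housing data)
actual = np.array([2.0, 3.5, 1.0, 4.0, 5.0, 2.5])
predicted = np.array([2.4, 3.0, 1.5, 3.8, 4.2, 2.5])

# Positive residual: the model under-predicted; negative: it over-predicted
residuals_toy = actual - predicted
print("Mean residual:", residuals_toy.mean())

# np.histogram returns the counts per bin and the bin edges
counts, bin_edges = np.histogram(residuals_toy, bins=4)
print(counts, bin_edges)
```

In a plot, residuals centered near zero with a roughly symmetric shape suggest the model is not systematically over- or under-predicting.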
    

In [None]:
# Your code here


### **Feature Importance (7 Points)**

* **Extract and visualize feature coefficients from your model**
    * Because the features were standardized, the coefficient magnitudes are on a comparable scale and can be read as a rough measure of feature importance
    * Create a horizontal bar plot
    * Sort by absolute value
    * Which features are most important?
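
One way to reach the fitted coefficients inside a pipeline is through `named_steps`. The sketch below uses toy data with hypothetical feature names (`rooms`, `age`, `distance`) and assumes the final step was named `"model"`; adapt the step name to whatever you used in your own pipeline.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Toy data with hypothetical feature names on very different scales
feature_names = ["rooms", "age", "distance"]
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
y_toy = 3.0 * X_toy[:, 0] - 0.05 * X_toy[:, 1] + 0.001 * X_toy[:, 2]

pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X_toy, y_toy)

# Coefficients live on the fitted estimator, looked up by its step name
coefs = pipe.named_steps["model"].coef_

# Sort feature indices by absolute coefficient size, largest first
order = np.argsort(np.abs(coefs))[::-1]
for i in order:
    print(f"{feature_names[i]:>10}: {coefs[i]:+.3f}")
```

The sign of each coefficient gives the direction of the effect, while the absolute value gives its relative strength on the standardized scale.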

In [None]:
# Your code here



**Analysis Questions (3 Points)**: 
1. Looking at your actual vs predicted plot, does the model perform equally well across all price ranges?
2. What does the residual histogram tell you about your model's errors?
3. Which feature has the strongest positive effect on house prices? The strongest negative effect?

Write your analysis in markdown below.

**Your Answer:**  


---

## **Faith Integration: The Glory of Discovery (5 Points)**

As we close today's session, let's return to our opening verse:

> *"It is the glory of God to conceal a matter; to search out a matter is the glory of kings."*  
> — Proverbs 25:2

This week you've learned to:
- Uncover patterns hidden in housing data
- Build models that make predictions
- Evaluate how well those predictions match reality

Remember: Every data point represents a real person, a real home, a real life. Our models should be built with:
- **Justice**: Fair treatment across all groups
- **Mercy**: Compassion for those affected by our predictions
- **Humility**: Recognizing our models are simplifications, not ultimate truth

**Answer each reflection question below in markdown, in the same cell as the question.**

1. **How does discovering patterns in data reflect the image of God in us? (1.5 Points)**  




2. **What ethical responsibilities do we have when making predictions about housing prices (which affect real people and communities)? (2 Points)**




3. **How can we use these tools to serve others and promote human flourishing? (1.5 Points)**




---

## **Bonus Challenge (Optional; 5 Points Max)**

Try to improve the model's performance by:

1. Adding polynomial features (hint: look up `PolynomialFeatures` from sklearn)
2. Trying a different model (e.g., `Ridge` or `Lasso` regression)
3. Feature engineering: create a new feature by combining existing ones
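
To see what `PolynomialFeatures` actually produces, here is a minimal sketch on a single made-up row. The array `X_one` is an assumption for illustration; with two inputs and `degree=2`, the transformer adds a constant term, the squares, and the pairwise product.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up sample with two features: x0 = 2, x1 = 3
X_one = np.array([[2.0, 3.0]])

# degree=2 yields the columns: 1, x0, x1, x0^2, x0*x1, x1^2
poly = PolynomialFeatures(degree=2)
X_expanded = poly.fit_transform(X_one)

print(X_expanded)  # [[1. 2. 3. 4. 6. 9.]]
```

Note how quickly the column count grows: two features become six, so with more input features the expanded matrix can become large and may overfit.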



In [None]:
# Your bonus code here (if attempting)


Document your approach and results in markdown below. Did performance improve? Why or why not?

**Your Answer:**  


---

## **Submission Checklist**

Before submitting, make sure you have:

- [ ] Completed the AI Assistance Declaration at the top
- [ ] Exercise 1: Data loaded and explored with all questions answered
- [ ] Exercise 2: Train-test split implemented with written explanation
- [ ] Exercise 3: Pipeline created and fitted with challenge question answered
- [ ] Exercise 4: Predictions made and all metrics calculated with interpretations
- [ ] Exercise 5: All three visualizations created (Actual vs Predicted, Residuals, Feature Importance) with analysis questions answered
- [ ] Faith Integration: All three reflection questions answered thoughtfully
- [ ] Bonus Challenge: Attempted (if applicable) with documentation
- [ ] All code cells run without errors
- [ ] All visualizations display correctly
- [ ] Restarted the kernel and ran all cells to verify everything works

### **Submission Instructions**

1. Save this notebook
2. **Restart kernel and run all cells** (Kernel → Restart & Run All)
3. Verify all outputs appear correctly (especially visualizations)
4. Check that all written responses are complete
5. Submit the `.ipynb` file to Canvas before Monday, 02 February @ 6:00 PM
   - Grace period until Wednesday, 04 February @ 11:59 PM

**Remember:** This notebook submission is worth 100% of your Week 1 Lab grade. Normally, the notebook submission is worth 90% of a lab grade, with the remaining 10% coming from the following week's in-class mastery assessment. (The first mastery quiz is for practice only.)

---

## **Next Week Preview**

**Mastery Assessment (Week 2)**: Be prepared to answer 1-2 questions about the machine learning workflow without AI assistance. Focus on:
- What are the 5 stages of the machine learning workflow?
- Why do we split data into training and testing sets?
- What is data leakage and why is it problematic?
- Define bias and variance in your own words
- What is the bias-variance tradeoff?
- What does R² tell us about model performance?
- How do you interpret RMSE in context?
- What is the purpose of StandardScaler?