# Assignment 1: Your First ML Exploration
## Predicting Student Performance

**Due:** 1/23/2026

---

### Learning Objectives

By completing this assignment, you will:
- Gain hands-on experience with Python data analysis libraries
- Explore and visualize a real dataset
- Build and evaluate linear regression models
- Practice interpreting model performance
- Begin thinking about model design decisions

---

### The Problem

You work for a university that wants to predict student final exam scores based on study habits and academic history. You have data on students' **study time**, **previous exam scores**, and **attendance**, and you need to build a predictive model.

Your goal: Build a model that can predict a student's final exam score so advisors can identify at-risk students early in the semester.

---

### Dataset Description

The dataset `student_performance.csv` contains 50 student records with the following columns:

| Column | Description |
|--------|-------------|
| `student_id` | Unique student identifier |
| `study_hours` | Average hours studied per week |
| `previous_score` | Score on the midterm exam (0-100) |
| `attendance` | Percentage of classes attended |
| `final_score` | Final exam score (0-100) — **this is what we're predicting!** |

---

## Part 0: Setup

Run the cell below to import the libraries we'll need. If you get an import error, you may need to install the missing library first. Ask me for help, or check the course discussion board for general tech questions.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Make plots look nice (this style name requires Matplotlib 3.6+;
# on older versions, use 'seaborn-whitegrid' instead)
plt.style.use('seaborn-v0_8-whitegrid')

print("All libraries imported successfully!")

---

## Part 1: Data Exploration (8 points)

Before building any model, we need to understand our data. This is a **critical** step that many beginners skip!

### 1.1 Load the Data

Use `pd.read_csv()` to load the dataset into a DataFrame called `data`.

In [None]:
# TODO: Load the CSV file into a DataFrame called 'data'
# Hint: data = pd.read_csv('filename.csv')

data = # YOUR CODE HERE

### 1.2 First Look at the Data

Use `.head()` to see the first few rows and `.info()` to see the data types and check for missing values.

In [None]:
# TODO: Display the first 5 rows of the data



In [None]:
# TODO: Display info about the DataFrame (data types, missing values)



### 1.3 Summary Statistics

Use `.describe()` to get summary statistics for all numeric columns.
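
As a rough illustration of what to look for in these summaries, the sketch below uses a small made-up DataFrame (the `toy` name and all of its values are hypothetical and are not taken from `student_performance.csv`). It counts missing values per column and shows that `.describe()` leaves them out of its statistics.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the assignment dataset
toy = pd.DataFrame({
    "study_hours": [2.0, 5.0, np.nan, 7.5, 4.0],
    "final_score": [55, 70, 64, 88, 66],
})

# Count missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# describe() ignores NaN values, so 'count' can differ between columns
summary = toy.describe()
print(summary.loc[["count", "mean", "min", "max"]])
```

A `count` that is lower than the number of rows is a quick sign that a column contains missing values.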

In [None]:
# TODO: Display summary statistics



### 1.4 Visualize Relationships

Create scatter plots to see how each feature relates to `final_score`. I've started the first one for you.

In [None]:
# Create a figure with 3 subplots side by side
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: study_hours vs final_score
axes[0].scatter(data['study_hours'], data['final_score'], alpha=0.6)
axes[0].set_xlabel('Study Hours')
axes[0].set_ylabel('Final Score')
axes[0].set_title('Study Hours vs Final Score')

# TODO: Plot 2: previous_score vs final_score
# axes[1].scatter(...)


# TODO: Plot 3: attendance vs final_score
# axes[2].scatter(...)


plt.tight_layout()
plt.show()

### ✏️ Question 1.1

Based on your exploration, which feature do you think will be the **best** predictor of final score? Why?

**Your Answer:** *(Double-click to edit this cell)*



### ✏️ Question 1.2

Do you notice any missing values or obvious outliers in the data? How might these affect a model?

**Your Answer:**



---

## Part 2: Building Your First Model (8 points)

Let's start with a simple **linear regression** model using just one feature: `study_hours`.

### The Linear Regression Equation

Linear regression finds the line of best fit:

$$\hat{y} = \beta_0 + \beta_1 x$$

Where:
- $\hat{y}$ is the predicted final score
- $x$ is the input feature (study hours)
- $\beta_0$ is the intercept (the predicted score when study_hours = 0)
- $\beta_1$ is the slope (how much the score changes for each additional hour of study)
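
To make the equation concrete, here is a minimal sketch that computes $\beta_0$ and $\beta_1$ by hand with the standard least-squares formulas and cross-checks them against `np.polyfit`. The five data points are invented for illustration and do not come from the assignment dataset.

```python
import numpy as np

# Hypothetical toy data (not from student_performance.csv)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # e.g. study hours
y = np.array([52.0, 57.0, 61.0, 68.0, 72.0])   # e.g. final scores

# Closed-form least-squares estimates for a single feature
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

# Cross-check with NumPy's degree-1 polynomial fit (returns slope first)
slope_check, intercept_check = np.polyfit(x, y, deg=1)

print(f"beta_0 = {beta_0:.2f}, beta_1 = {beta_1:.2f}")
```

For these toy numbers the slope works out to 5.1 and the intercept to 46.7, so each extra hour of study adds about 5.1 points to the predicted score.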

### 2.1 Prepare the Data

We need to separate our **features** (inputs) from our **target** (what we're predicting).

In [None]:
# X is our feature (input) - note the double brackets to keep it as a DataFrame
X = data[['study_hours']]

# y is our target (what we're predicting)
y = data['final_score']

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

### 2.2 Create and Train the Model

Now we'll create a `LinearRegression` object and "fit" it to our data.

In [None]:
# TODO: Create a LinearRegression model and fit it to the data
# Hint: model = LinearRegression()
#       model.fit(X, y)

model = # YOUR CODE HERE

# Fit the model


### 2.3 Examine the Model

Let's look at the coefficients our model learned.

In [None]:
# Get the model parameters
intercept = model.intercept_
slope = model.coef_[0]

print(f"Intercept (β₀): {intercept:.2f}")
print(f"Slope (β₁): {slope:.2f}")
print(f"\nOur model equation: final_score = {intercept:.2f} + {slope:.2f} × study_hours")

### 2.4 Visualize the Model

Let's plot our regression line on top of the data.

In [None]:
# TODO: Create a scatter plot of the data and add the regression line
# Hint: Use model.predict(X) to get the predicted values for the line

plt.figure(figsize=(8, 6))

# Scatter plot of actual data
plt.scatter(data['study_hours'], data['final_score'], alpha=0.6, label='Actual Data')

# TODO: Add the regression line
# plt.plot(data['study_hours'], model.predict(X), color='red', label='Regression Line')


plt.xlabel('Study Hours')
plt.ylabel('Final Score')
plt.title('Linear Regression: Study Hours vs Final Score')
plt.legend()
plt.show()

### ✏️ Question 2.1

Interpret the slope coefficient. In plain English, what does this number tell us about the relationship between study hours and final score?

**Your Answer:**



### ✏️ Question 2.2

The intercept represents the predicted score when study_hours = 0. Does this value make practical sense? Why or why not?

**Your Answer:**



---

## Part 3: Evaluating Your Model (8 points)

How do we know if our model is any good? We need **metrics** to measure performance.

### Common Regression Metrics

- **MAE (Mean Absolute Error):** Average absolute difference between predicted and actual values
- **MSE (Mean Squared Error):** Average squared difference (penalizes large errors more)
- **RMSE (Root Mean Squared Error):** Square root of MSE (same units as the target)
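
The three metrics can be worked out by hand on a tiny example. The sketch below uses four invented actual/predicted pairs (assumed values, unrelated to the assignment data) and plain NumPy, so each step of the definitions above is visible.

```python
import numpy as np

# Hypothetical actual and predicted scores
actual = np.array([70.0, 80.0, 90.0, 60.0])
predicted = np.array([72.0, 75.0, 90.0, 70.0])

errors = actual - predicted          # [-2, 5, 0, -10]

mae = np.mean(np.abs(errors))        # (2 + 5 + 0 + 10) / 4 = 4.25
mse = np.mean(errors ** 2)           # (4 + 25 + 0 + 100) / 4 = 32.25
rmse = np.sqrt(mse)                  # about 5.68

print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}")
```

Here the RMSE (about 5.68) is larger than the MAE (4.25) because the single 10-point miss is squared before averaging, which is what "penalizes large errors more" means in practice.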

### 3.1 Generate Predictions

In [None]:
# Generate predictions for all data points
y_pred = model.predict(X)

# Look at the first few predictions vs actual values
comparison = pd.DataFrame({
    'Actual': y.head(10).values,
    'Predicted': y_pred[:10].round(2),
    'Error': (y.head(10).values - y_pred[:10]).round(2)
})
print(comparison)

### 3.2 Calculate Error Metrics

In [None]:
# TODO: Calculate MAE, MSE, and RMSE
# Hint: Use mean_absolute_error(y, y_pred) and mean_squared_error(y, y_pred)
#       RMSE is just np.sqrt(MSE)

mae = # YOUR CODE HERE
mse = # YOUR CODE HERE
rmse = # YOUR CODE HERE

print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

### 3.3 Visualize the Errors

Let's create a plot showing the prediction errors.

In [None]:
# TODO: Create a histogram of the errors (y - y_pred)
# Hint: plt.hist(y - y_pred, bins=15)

plt.figure(figsize=(8, 5))

# YOUR CODE HERE

plt.xlabel('Prediction Error (Actual - Predicted)')
plt.ylabel('Frequency')
plt.title('Distribution of Prediction Errors')
plt.axvline(x=0, color='red', linestyle='--', label='Zero Error')
plt.legend()
plt.show()

### ✏️ Question 3.1

What does your calculated MAE mean in practical terms? Is this level of error acceptable for predicting exam scores?

**Your Answer:**



### ✏️ Question 3.2

Look at the error histogram. Is it roughly centered around zero? What would it mean if the errors were mostly positive or mostly negative?

**Your Answer:**



---

## Part 4: Train/Test Split (8 points)

**Important concept:** We've been evaluating our model on the same data we used to train it. This can give us an overly optimistic view of how well our model will perform on **new, unseen data**.

### The Problem with Training Error

A model can "memorize" the training data without learning the underlying pattern. To get a realistic estimate of performance, we split our data:

- **Training set:** Used to train (fit) the model
- **Test set:** Held out and only used to evaluate the final model
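
The sketch below illustrates the "memorization" idea on synthetic data. It is an assumption-laden toy, not part of the assignment: the data are randomly generated, the last five points stand in for a test set, and a degree-6 polynomial plays the role of an overly flexible model. The helper name `train_test_mse` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: a linear trend plus random noise
x = rng.uniform(0, 10, size=20)
y = 50 + 4 * x + rng.normal(0, 5, size=20)

# Hold out the last 5 points as a stand-in test set
x_train, y_train = x[:15], y[:15]
x_test, y_test = x[15:], y[15:]

def train_test_mse(degree):
    """Fit a polynomial on the training points; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    return train_mse, test_mse

line_train, line_test = train_test_mse(1)   # straight line
flex_train, flex_test = train_test_mse(6)   # much more flexible curve

print(f"Degree 1: train MSE = {line_train:.2f}, test MSE = {line_test:.2f}")
print(f"Degree 6: train MSE = {flex_train:.2f}, test MSE = {flex_test:.2f}")
```

The flexible curve never has a higher training error than the straight line, because it can always reproduce the line. Whether it also does better on the held-out points is a separate question, and often it does not. That gap is what the test set is meant to reveal.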

### 4.1 Create the Split

In [None]:
# TODO: Split the data into training and test sets
# Use test_size=0.2 (20% for testing) and random_state=42 (for reproducibility)
# Hint: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train, X_test, y_train, y_test = # YOUR CODE HERE

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

### 4.2 Train a New Model on Training Data Only

In [None]:
# TODO: Create a new LinearRegression model and train it on ONLY the training data

model_split = # YOUR CODE HERE

# Fit on training data only


### 4.3 Evaluate on Both Training and Test Sets

In [None]:
# TODO: Calculate MAE for both training and test sets
# Hint: First predict, then calculate MAE

# Predictions on training set
y_train_pred = # YOUR CODE HERE

# Predictions on test set
y_test_pred = # YOUR CODE HERE

# Calculate MAE for each
train_mae = # YOUR CODE HERE
test_mae = # YOUR CODE HERE

print(f"Training MAE: {train_mae:.2f}")
print(f"Test MAE: {test_mae:.2f}")

### ✏️ Question 4.1

Compare the training MAE to the test MAE. What do you observe? Why might these be different?

**Your Answer:**



### ✏️ Question 4.2

If the test MAE were much larger than the training MAE, what might that indicate about your model?

**Your Answer:**



---

## Part 5: Adding More Features (10 points)

We've only been using `study_hours` so far. Let's see if we can build a better model by including more features.

### 5.1 Multiple Linear Regression

The equation becomes:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$$

Where $x_1$ = study_hours, $x_2$ = previous_score, $x_3$ = attendance
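
To see how the multi-feature equation is fitted, the sketch below generates synthetic data from known coefficients and recovers them with ordinary least squares via `np.linalg.lstsq`. Everything here is assumed for illustration: the features are random numbers, the "true" coefficients are arbitrary, and no noise is added so the recovery is exact.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical synthetic features: three columns standing in for x1, x2, x3
X = rng.uniform(0, 10, size=(30, 3))

# Targets generated from known coefficients, with no noise
true_intercept = 5.0
true_coefs = np.array([2.0, 0.5, -1.0])
y = true_intercept + X @ true_coefs

# Prepend a column of ones so the intercept is estimated too
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("Estimated [beta_0, beta_1, beta_2, beta_3]:", beta.round(4))
```

With real, noisy data the estimates will not match any "true" values exactly, but the fitting procedure is the same one that `LinearRegression` applies.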

In [None]:
# TODO: Create X with multiple features
# Hint: X_multi = data[['study_hours', 'previous_score', 'attendance']]

X_multi = # YOUR CODE HERE

print(f"Features shape: {X_multi.shape}")

### 5.2 Train/Test Split with Multiple Features

In [None]:
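# TODO: Split the multi-feature data
# Use the same test_size=0.2 and random_state=42 as in Part 4 so both models are tested on the same students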

X_train_multi, X_test_multi, y_train_multi, y_test_multi = # YOUR CODE HERE

### 5.3 Train and Evaluate the Multi-Feature Model

In [None]:
# TODO: Create, train, and evaluate a model with multiple features

model_multi = # YOUR CODE HERE

# Fit the model


# Make predictions
y_test_pred_multi = # YOUR CODE HERE

# Calculate test MAE
test_mae_multi = # YOUR CODE HERE

print(f"Multi-feature Test MAE: {test_mae_multi:.2f}")
print(f"Single-feature Test MAE: {test_mae:.2f}")

### 5.4 Examine the Coefficients

In [None]:
# Look at what the model learned
print(f"Intercept: {model_multi.intercept_:.4f}")
print("\nCoefficients:")
for feature, coef in zip(['study_hours', 'previous_score', 'attendance'], model_multi.coef_):
    print(f"  {feature}: {coef:.4f}")

### ✏️ Question 5.1

Did adding more features improve the model? Compare the test MAE of the multi-feature model to the single-feature model.

**Your Answer:**



### ✏️ Question 5.2

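Look at the coefficients. Which feature has the largest impact on the prediction? Keep in mind that the features are measured on different scales (hours vs. points vs. percentages), so a larger coefficient does not automatically mean a more important feature. Does your answer match your intuition from the data exploration?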

**Your Answer:**



---

## Part 6: Making Predictions (4 points)

Now let's use our model to make predictions for new students.

### 6.1 Predict for New Students

Suppose we have three new students:
- **Student A:** Studies 6 hours/week, previous score 75, attendance 85%
- **Student B:** Studies 2 hours/week, previous score 55, attendance 60%
- **Student C:** Studies 8 hours/week, previous score 90, attendance 95%

In [None]:
# TODO: Create a DataFrame with the new students' data and make predictions
# Hint: new_students = pd.DataFrame({'study_hours': [6, 2, 8], ...})

new_students = pd.DataFrame({
    'study_hours': # YOUR CODE HERE,
    'previous_score': # YOUR CODE HERE,
    'attendance': # YOUR CODE HERE
})

# Make predictions
predictions = model_multi.predict(new_students)

# Display results
new_students['predicted_score'] = predictions.round(1)
new_students['student'] = ['A', 'B', 'C']
print(new_students[['student', 'study_hours', 'previous_score', 'attendance', 'predicted_score']])

### ✏️ Question 6.1

Which student(s) might need academic intervention based on your predictions? What threshold would you use to identify "at-risk" students?

**Your Answer:**



---

## Part 7: Reflection (4 points)

Take a step back and think critically about what you've built.

### ✏️ Question 7.1

What are some limitations of this model? For example, what patterns in student behavior might it miss, and what factors that affect student performance are NOT included in our dataset?

**Your Answer:**



### ✏️ Question 7.2

If you were deploying this model in a real university setting, what ethical considerations should you think about?

**Your Answer:**



### ✏️ Question 7.3

On a scale of 1-10, how confident are you in this model's predictions? Why?

**Your Answer:**



---

## Submission Instructions

Submit TWO files to Canvas:

1. **Your Jupyter Notebook** (`.ipynb` file)
   - Must include all code cells run (with outputs visible)
   - Must include all your answers to questions in markdown cells
   - Must be clearly organized with section headers

2. **A PDF export** of your notebook (File → Download → PDF)
   - This is a backup in case the notebook doesn't load properly

**File naming:** `Assignment1_LastName_FirstName.ipynb`

---

## Optional Challenge

Try one or more of these extensions:

1. **Create a new feature:** Build a feature that combines existing features (e.g., `study_hours × attendance`). Does it improve your model?

2. **Find the worst prediction:** Identify the student where your model was most wrong. Can you explain why?

3. **Different train/test splits:** Try different split ratios (50/50, 90/10). How does it affect performance?
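
As a starting point for the first two extensions, here is a minimal sketch on made-up numbers. The `toy` DataFrame, the `toy_pred` array, and the `hours_x_attendance` column name are all hypothetical. In your notebook you would apply the same ideas to `data` and to your own model's predictions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data and predictions (not the assignment dataset)
toy = pd.DataFrame({
    "study_hours": [2.0, 6.0, 8.0, 4.0],
    "attendance": [60.0, 85.0, 95.0, 70.0],
    "final_score": [58.0, 77.0, 91.0, 80.0],
})
toy_pred = np.array([55.0, 79.0, 90.0, 66.0])

# Extension 1: an interaction feature is the product of two existing columns
toy["hours_x_attendance"] = toy["study_hours"] * toy["attendance"]

# Extension 2: find the row with the largest absolute prediction error
abs_error = (toy["final_score"] - toy_pred).abs()
worst_row = abs_error.idxmax()

print(toy.loc[worst_row])
print(f"Absolute error: {abs_error.loc[worst_row]:.1f}")
```

In this toy example the worst prediction is the last row, where the actual score of 80 was predicted as 66.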