# Chapter 3.1-3.2: Supervised Learning and Classification

Goal: Distinguish supervised from unsupervised learning, and identify problems as regression or classification.

### Topics:
- Supervised vs unsupervised learning paradigms
- Classification vs regression
- Binary vs multi-class classification
- Exploring class distributions

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

## Quick Recap

- **Supervised learning** uses labeled data — you have both features (X) and a target (y)
- **Unsupervised learning** finds patterns in data without labels — no target variable
- **Classification** predicts a category (survived/died, spam/not spam)
- **Regression** predicts a continuous number (price, temperature)
- **Binary classification** has exactly 2 classes; **multi-class classification** has 3 or more
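
To make the recap concrete, here is a minimal sketch on made-up data (the six points and their labels are invented for illustration, and `LogisticRegression` is just one possible classifier). The supervised model is given both `X` and `y`; the unsupervised model only ever sees `X`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Made-up data: two well-separated groups of points, two features each
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels exist only in the supervised setting

# Supervised: learn the mapping from features X to the known labels y
clf = LogisticRegression().fit(X, y)
pred = clf.predict(np.array([[1.0, 1.0], [5.0, 5.0]]))

# Unsupervised: group the same points without ever seeing y
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
clusters = km.labels_

print("predicted labels:", pred)
print("cluster ids:", clusters)
```

The cluster ids are arbitrary: KMeans may call the first group `1` and the second `0`, because it has no labels to tell it which name goes with which group.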

## Data

We'll use the **Titanic** dataset (supervised) and **Iris** (for an unsupervised demo).

In [None]:
# Load Titanic data
titanic = pd.read_csv('../../Textbook/data/titanic.csv')
titanic.head()

## Practice

### 1. Inspect the Titanic data

Look at the columns. What is the **target** column (what we want to predict)? What columns could be useful **features**?
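
As a hint for the cell below, this sketch shows a few pandas calls that are commonly used to inspect a table. It runs on a tiny made-up frame, not the real file; the `Sex` column and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame standing in for the real data
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
})

dtypes = toy.dtypes                      # data type of each column
missing = toy.isna().sum()               # number of missing values per column
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(dtypes)
print(missing)
print(numeric_cols)
```

The same calls should work on `titanic` once it is loaded.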

In [None]:
# Inspect the columns and data types
...

**Your answer:** What is the target column? What features might be useful?

(Write your answer here)

### 2. Check the class distribution of `Survived`

Use `.value_counts()` on the target column. Is the dataset balanced (roughly equal numbers of each class) or imbalanced?
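
The sketch below shows what `.value_counts()` returns, using a made-up label column (the ten values are invented and are not the Titanic data). Passing `normalize=True` gives proportions instead of raw counts, which makes the balance question easier to answer.

```python
import pandas as pd

# Made-up binary labels: seven 0s and three 1s
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="outcome")

counts = labels.value_counts()                # rows per class
shares = labels.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

Here class `0` makes up 70% of the rows and class `1` makes up 30%, so this toy column is moderately imbalanced.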

In [None]:
# Check class distribution
...

**Your interpretation:** Is the dataset balanced or imbalanced? What percentage survived?

(Write your answer here)

### 3. By hand — Classify these problems

For each scenario below, identify whether it is **regression**, **binary classification**, or **multi-class classification**:

1. Predicting whether an email is spam or not spam
2. Predicting tomorrow's high temperature in Houston
3. Predicting which genre a movie belongs to (action, comedy, drama, horror, etc.)
4. Predicting a student's final exam score
5. Predicting whether a credit card transaction is fraudulent
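
If you want to check your reasoning afterwards, scikit-learn has a helper, `type_of_target`, that reports how it would interpret a target column. The sketch below applies it to made-up targets based on the recap examples, not the five scenarios above; the variable names and values are assumptions for illustration.

```python
from sklearn.utils.multiclass import type_of_target

# Made-up target columns for three different kinds of problem
outcomes = ["survived", "died", "died", "survived"]           # two categories
species = ["setosa", "versicolor", "virginica", "setosa"]     # three categories
prices = [199.99, 249.50, 310.25, 275.10]                     # continuous numbers

kinds = {
    "outcomes": type_of_target(outcomes),
    "species": type_of_target(species),
    "prices": type_of_target(prices),
}
print(kinds)
```

Treat the result as a hint rather than an answer: the helper only looks at the values, so a regression target recorded as whole numbers (exam scores, for example) is reported as a set of classes.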

**Your answers:**

1. 
2. 
3. 
4. 
5. 

### 4. Use AI — Unsupervised learning demo with Iris

Load the Iris dataset from this URL: `https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv`. Fit `KMeans(n_clusters=3)` on the feature columns only (leave the species labels out of the fit) and create a scatterplot of sepal length against sepal width, colored by cluster. Then make a second scatterplot colored by the **actual** species to see how well the clustering worked. How well do sepal length and sepal width separate the species?
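
For reference, here is one possible shape of the workflow, as a sketch under stated assumptions. It uses scikit-learn's bundled copy of Iris via `load_iris` instead of the CSV URL, fits on all four features, and sets `n_init` and `random_state` only to make the run repeatable. An AI-generated solution may look different.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: bundled Iris data rather than the CSV from the URL
iris = load_iris()
X = iris.data          # shape (150, 4); columns 0 and 1 are sepal length and width
species = iris.target  # 0, 1, 2; kept aside and never passed to KMeans

# Fit on the features only
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)

# Side-by-side scatterplots: cluster assignments vs. actual species
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
axes[0].scatter(X[:, 0], X[:, 1], c=clusters)
axes[0].set_title("KMeans clusters")
axes[1].scatter(X[:, 0], X[:, 1], c=species)
axes[1].set_title("Actual species")
for ax in axes:
    ax.set_xlabel("sepal length (cm)")
    ax.set_ylabel("sepal width (cm)")

# Cluster ids are arbitrary, so compare with a cross-tabulation, not by equality
table = pd.crosstab(pd.Series(species, name="species"),
                    pd.Series(clusters, name="cluster"))
print(table)
```

Because cluster ids are assigned arbitrarily, the colors in the two plots will not necessarily line up; look at whether the groups have the same shape, and use the cross-tabulation to see which cluster each species mostly falls into.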

In [None]:
# Step 1: Load Iris data (keep the features separate from the species labels; the labels are only needed in Step 4)


# Step 2: Fit KMeans with 3 clusters


# Step 3: Plot clusters using sepal length (col 0) and sepal width (col 1)


# Step 4: Plot again, but color by actual species labels for comparison


(Write your conclusion here)

### 5. Use AI — Prepare Titanic features for modeling

Select the numeric columns (`Pclass`, `Age`, `SibSp`, `Parch`, `Fare`) as features and `Survived` as the target, fill missing `Age` values with the median, and split into train/test sets (80/20).
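
The three steps can be sketched on a small made-up table with the same column names (all ten rows are invented for illustration; on the real data you would start from `titanic` instead of `toy`).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row stand-in for the Titanic table
toy = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3, 2, 1, 3, 2, 3],
    "Age":      [22, 38, np.nan, 35, 35, np.nan, 54, 2, 27, 14],
    "SibSp":    [1, 1, 0, 1, 0, 0, 0, 3, 0, 1],
    "Parch":    [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

# Step 1: select numeric features and target
features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
X = toy[features].copy()
y = toy["Survived"]

# Step 2: fill missing Age with the median of the observed ages
X["Age"] = X["Age"].fillna(X["Age"].median())

# Step 3: 80/20 train/test split; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

This follows the order given in the exercise. A stricter variant would split first and compute the median from the training rows only, so that no information from the test set influences the fill value.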

In [None]:
# Step 1: Select numeric features and target


# Step 2: Fill missing Age with median


# Step 3: Train/test split (80/20, random_state=42)


### 6. Interpretation — Supervised vs unsupervised

In your own words, what's the key difference between supervised and unsupervised learning? When would you use each?

**Your answer:**

(Write your answer here)

## Discussion

Think of a problem from your daily life. Is it regression or classification? How do you know?

(Discuss with a neighbor)