# Class 1: Introduction to Supervised Learning and Logistic Regression

**Week 6: Supervised Learning Algorithms**

## Overview
Welcome to Class 1! Today, we'll explore the basics of **supervised learning**, understand the difference between **classification** and **regression**, and dive into **logistic regression** for classification tasks. By the end, you'll train your first logistic regression model using scikit-learn on a simple dataset.

## Objectives
- Define supervised learning and its role in machine learning.
- Differentiate between classification and regression problems.
- Understand how logistic regression works for binary classification.
- Train and make predictions with a logistic regression model.

## Agenda
1. What is supervised learning?
2. Classification vs. regression
3. Introduction to logistic regression
4. Hands-on: Train a logistic regression model

Let's get started!

## 1. What is Supervised Learning?

**Supervised learning** is a type of machine learning where we train a model on **labeled data**. This means we have input features (X) and corresponding outputs (y), and the model learns to predict y from X.

**Examples**:
- Predicting house prices based on size and location (y = price, X = size, location).
- Classifying emails as spam or not spam (y = spam/not spam, X = email content).

The goal is to learn a function that maps inputs to outputs, then use it to predict on new, unseen data.

## 2. Classification vs. Regression

Supervised learning problems fall into two categories:

- **Regression**: Predicting a **continuous** output.
  - Example: Predicting the price of a house (e.g., $300,000).
  - Output: A number.
- **Classification**: Predicting a **categorical** output.
  - Example: Predicting which species a flower belongs to (e.g., setosa or versicolor).
  - Output: A class label.
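
To make the contrast concrete, here is a minimal sketch that fits one model of each kind. The house sizes and prices are hypothetical toy numbers invented for illustration (they are not a real dataset), and `LinearRegression` is used only as a simple stand-in for "a regression model".

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

# --- Regression: the output is a continuous number ---
# Hypothetical toy data: house size (square meters) and price (dollars)
sizes = np.array([[50.0], [80.0], [120.0], [150.0]])
prices = np.array([150_000.0, 240_000.0, 360_000.0, 450_000.0])

regressor = LinearRegression().fit(sizes, prices)
predicted_price = regressor.predict(np.array([[100.0]]))[0]
print(f"Regression output (a number): {predicted_price:,.0f}")

# --- Classification: the output is a class label ---
iris = load_iris()
mask = iris.target < 2  # keep only setosa (0) and versicolor (1)
classifier = LogisticRegression().fit(iris.data[mask], iris.target[mask])
predicted_label = classifier.predict(iris.data[mask][:1])[0]
print("Classification output (a class label):", iris.target_names[predicted_label])
```

The regression model returns a number on a continuous scale, while the classifier returns one of a fixed set of labels.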

**Question**: Can you think of one regression and one classification problem from real life? (Pause and discuss!)

## 3. Introduction to Logistic Regression

**Logistic regression** is a classification algorithm (despite the name!) most often used for **binary classification** (two classes, e.g., yes/no, 0/1).

**How it works** (simplified):
- Takes input features (e.g., petal length and petal width).
- Computes a weighted sum of the features plus a bias term.
- Passes that sum through a **sigmoid function**, which maps it to a probability between 0 and 1.
- Predicts class 1 if the probability is greater than 0.5; otherwise predicts class 0.
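
The steps above can be sketched by hand in a few lines of NumPy. The weights and bias below are hypothetical values picked for illustration; a real model learns them from the training data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias, chosen only for illustration (not learned)
weights = np.array([1.5, 2.0])
bias = -6.0

# One flower: petal length = 4.0 cm, petal width = 1.2 cm
features = np.array([4.0, 1.2])

# Step 1: weighted sum of the features plus the bias
z = np.dot(weights, features) + bias  # 1.5*4.0 + 2.0*1.2 - 6.0 = 2.4

# Step 2: the sigmoid turns the sum into a probability
probability = sigmoid(z)

# Step 3: threshold at 0.5 to get a class
predicted_class = int(probability > 0.5)

print(f"weighted sum z = {z:.2f}")
print(f"probability of class 1 = {probability:.3f}")
print("predicted class:", predicted_class)
```

A weighted sum of exactly 0 gives a probability of exactly 0.5, which is why the 0.5 threshold corresponds to the sign of the weighted sum.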

**Example**: Predicting whether a flower is *Iris versicolor* (class 1) or *Iris setosa* (class 0) based on its petal measurements.

Today, we'll use logistic regression on the Iris dataset to classify two species.

## 4. Hands-On: Train a Logistic Regression Model

Let's train a logistic regression model on a subset of the Iris dataset to classify two species (*setosa* vs. *versicolor*). We'll use scikit-learn, a powerful Python library for machine learning.

**Steps**:
1. Load and prepare the Iris dataset.
2. Create a binary classification problem (two classes).
3. Split data into training and testing sets.
4. Train a logistic regression model.
5. Make predictions and check results.

Follow along with the code below!

In [None]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create binary classification: setosa (0) vs. versicolor (1)
# We'll exclude class 2 (virginica) for simplicity
mask = y < 2  # Select only setosa and versicolor
X_binary = X[mask]
y_binary = y[mask]

# Use only two features for visualization (petal length, petal width)
X_binary = X_binary[:, 2:4]  # Columns 2 and 3

# Check the data
print("Feature names:", iris.feature_names[2:4])
print("Target names:", iris.target_names[:2])
print("Shape of X:", X_binary.shape)
print("Shape of y:", y_binary.shape)

**What did we do?**
- Loaded the Iris dataset (150 samples, 4 features).
- Selected two classes (setosa, versicolor) for binary classification.
- Kept only petal length and petal width for simplicity.
- Checked the data dimensions (100 samples, 2 features).

Now, let's visualize the data to see how the classes look.

In [None]:
# Visualize the data
plt.scatter(X_binary[y_binary == 0, 0], X_binary[y_binary == 0, 1], label="Setosa", color="blue")
plt.scatter(X_binary[y_binary == 1, 0], X_binary[y_binary == 1, 1], label="Versicolor", color="orange")
plt.xlabel("Petal Length (cm)")
plt.ylabel("Petal Width (cm)")
plt.title("Iris: Setosa vs. Versicolor")
plt.legend()
plt.show()

**Question**: Do the classes look separable? (Hint: Are the blue and orange dots mostly separate?)

Next, we'll split the data into **training** (to learn) and **testing** (to evaluate) sets.

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_binary, y_binary, test_size=0.2, random_state=42)

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

**What’s happening?**
- We split the data: 80% for training (80 samples), 20% for testing (20 samples).
- `random_state=42` fixes the random shuffle, so the split is the same every time the cell runs.
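
Because the split is random, the two classes may not end up in exactly equal proportions in each part. The sketch below repeats the split and counts the labels in each set with `np.bincount`; it is a sanity check, not a required step.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Rebuild the binary petal dataset used in this notebook
iris = load_iris()
mask = iris.target < 2
X_binary = iris.data[mask][:, 2:4]
y_binary = iris.target[mask]

X_train, X_test, y_train, y_test = train_test_split(
    X_binary, y_binary, test_size=0.2, random_state=42
)

# Index 0 counts setosa, index 1 counts versicolor
train_counts = np.bincount(y_train, minlength=2)
test_counts = np.bincount(y_test, minlength=2)
print("Training class counts:", train_counts)
print("Testing class counts:", test_counts)
```

If the counts look lopsided, `train_test_split` also accepts `stratify=y_binary`, which keeps the class proportions the same in both sets.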

Now, let's train the logistic regression model!

In [None]:
# Initialize and train the model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Show predictions
print("Test set predictions:", y_pred)
print("Actual labels:", y_test)

**Your turn!**
- Compare `y_pred` and `y_test`. How many predictions look correct?
- Try predicting on a new sample. Run the cell below and change the values for `petal_length` and `petal_width`.

In [None]:
# Predict on a new sample
new_sample = np.array([[4.0, 1.2]])  # Example: petal length=4.0, petal width=1.2
prediction = model.predict(new_sample)
print("Predicted class:", iris.target_names[prediction][0])

**Discussion**:
- What class did the model predict for the new sample?
- Does the prediction make sense based on the scatter plot?
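
To see how confident the model is, and not just which class it picks, `predict_proba` returns the probability for each class. The sketch below retrains the same model so that it runs on its own; inside the notebook, the existing `model` and `new_sample` could be used directly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Rebuild the data and model from the cells above
iris = load_iris()
mask = iris.target < 2
X_binary = iris.data[mask][:, 2:4]
y_binary = iris.target[mask]
X_train, X_test, y_train, y_test = train_test_split(
    X_binary, y_binary, test_size=0.2, random_state=42
)
model = LogisticRegression(random_state=42).fit(X_train, y_train)

# Same example sample: petal length = 4.0, petal width = 1.2
new_sample = np.array([[4.0, 1.2]])
proba = model.predict_proba(new_sample)[0]  # one probability per class

for name, p in zip(iris.target_names[:2], proba):
    print(f"P({name}) = {p:.3f}")
```

The two probabilities sum to 1, and `model.predict` returns the class with the larger one.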

Finally, let's visualize the decision boundary to see how the model separates the classes.

In [None]:
# Visualize decision boundary
x_min, x_max = X_binary[:, 0].min() - 0.5, X_binary[:, 0].max() + 0.5
y_min, y_max = X_binary[:, 1].min() - 0.5, X_binary[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap="coolwarm")
plt.scatter(X_binary[y_binary == 0, 0], X_binary[y_binary == 0, 1], label="Setosa", color="blue")
plt.scatter(X_binary[y_binary == 1, 0], X_binary[y_binary == 1, 1], label="Versicolor", color="orange")
plt.xlabel("Petal Length (cm)")
plt.ylabel("Petal Width (cm)")
plt.title("Logistic Regression Decision Boundary")
plt.legend()
plt.show()

**What does the plot show?**
- The shaded areas show the model's decision regions (blue for setosa, red for versicolor).
- The edge where the two shaded regions meet is the **decision boundary**, where the model's prediction switches from one class to the other.
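
For logistic regression with two features, the decision boundary is a straight line: the set of points where the weighted sum equals zero, so the predicted probability is exactly 0.5. The sketch below reads the learned weights from `coef_` and `intercept_`, picks a point on that line, and checks its probability. The petal length of 2.5 is an arbitrary value chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Rebuild the data and model from the cells above
iris = load_iris()
mask = iris.target < 2
X_binary = iris.data[mask][:, 2:4]
y_binary = iris.target[mask]
X_train, X_test, y_train, y_test = train_test_split(
    X_binary, y_binary, test_size=0.2, random_state=42
)
model = LogisticRegression(random_state=42).fit(X_train, y_train)

# Learned weights (one per feature) and bias
w_length, w_width = model.coef_[0]
b = model.intercept_[0]
print(f"weights: length={w_length:.3f}, width={w_width:.3f}, bias={b:.3f}")

# Boundary: w_length * length + w_width * width + b = 0
# Pick an arbitrary petal length and solve for the width on the boundary
length = 2.5
width = -(w_length * length + b) / w_width

p_boundary = model.predict_proba([[length, width]])[0, 1]
print(f"point on boundary: length={length}, width={width:.3f}")
print(f"P(versicolor) at that point = {p_boundary:.3f}")
```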

**Question**: Does the boundary make sense? Are most points correctly classified?
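
One way to answer this with a number instead of by eye is to count the correct test predictions. The sketch below uses `accuracy_score` from `sklearn.metrics`, which was not imported in the cells above, and retrains the model so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Rebuild the data and model from the cells above
iris = load_iris()
mask = iris.target < 2
X_binary = iris.data[mask][:, 2:4]
y_binary = iris.target[mask]
X_train, X_test, y_train, y_test = train_test_split(
    X_binary, y_binary, test_size=0.2, random_state=42
)
model = LogisticRegression(random_state=42).fit(X_train, y_train)

y_pred = model.predict(X_test)

# Count matches directly, then compute the same ratio with scikit-learn
n_correct = int((y_pred == y_test).sum())
accuracy = accuracy_score(y_test, y_pred)

print(f"Correct predictions: {n_correct} of {len(y_test)}")
print(f"Accuracy: {accuracy:.2f}")
```

Accuracy is the fraction of test samples whose predicted label matches the actual label; `model.score(X_test, y_test)` returns the same value.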

## Wrap-Up

Today, you:
- Learned what **supervised learning** is.
- Understood the difference between **classification** and **regression**.
- Explored **logistic regression** and how it predicts classes.
- Trained a model on the Iris dataset and visualized its decisions.

**Homework**:
- Explore the [scikit-learn documentation for LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (5-10 min).
- Try changing the `new_sample` values above and predict again. What happens?

**Next Class**:
- We'll explore two more algorithms: k-nearest neighbors (KNN) and decision trees.
- Get ready to compare them with logistic regression!

Any questions? Feel free to ask!