# Lesson 4: Machine Learning Basics

Learn how to build real machine learning models using Python's scikit-learn library.

## What You'll Learn
- Introduction to scikit-learn
- Supervised vs unsupervised learning
- Building your first ML model
- Model evaluation and accuracy

## Types of Machine Learning

- **Supervised Learning**: Learning from labeled examples (like a teacher showing correct answers)
  - Classification: Categorizing things (spam vs not spam)
  - Regression: Predicting numbers (house prices)

- **Unsupervised Learning**: Finding patterns without labels
  - Clustering: Grouping similar items
  - Dimensionality reduction: Simplifying complex data
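
In scikit-learn, the practical difference between the two families shows up in what you pass to `fit()`. The following is a minimal sketch on a tiny made-up dataset (the points and labels are invented purely for illustration): the supervised model receives features and labels, while the unsupervised model receives features only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two well-separated groups of 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, used only by the supervised model

# Supervised: fit() receives features AND labels
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
supervised_pred = clf.predict([[1.1, 1.0], [5.1, 5.0]])

# Unsupervised: fit() receives features only and must discover the groups
km = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X)

print("Supervised predictions:", supervised_pred)
print("Cluster assignments:   ", cluster_ids)
```

The cluster IDs are arbitrary: the clustering may call the first group 0 or 1, because it never saw the labels in `y`.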

## Your First ML Model: Classification

Let's build a model that classifies iris flowers into three species based on four measurements (sepal and petal length and width):

In [None]:
# You'll need to install scikit-learn:
# pip install scikit-learn

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the famous iris dataset
iris = load_iris()
X = iris.data  # Features: measurements
y = iris.target  # Labels: flower species

print("Dataset info:")
print(f"Number of samples: {len(X)}")
print(f"Features: {iris.feature_names}")
print(f"Species: {iris.target_names}")
print(f"\nFirst sample: {X[0]}")
print(f"Its species: {iris.target_names[y[0]]}")
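
Before training, it can help to check the shape of the data and how many samples each class has, since a heavily imbalanced dataset would make accuracy harder to interpret. A minimal sketch of that check, using `np.bincount` to count the labels:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Shape is (number of samples, number of features)
print(f"Feature matrix shape: {iris.data.shape}")

# Count how many samples belong to each species
counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, counts):
    print(f"{name:12} {count} samples")
```

The iris dataset is balanced, so plain accuracy is a reasonable first metric here.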

## Training the Model

Split the data into training and testing sets so the model is evaluated on samples it did not see during training:

In [None]:
# Split data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

# Create and train the model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print("\nModel trained successfully!")
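
A plain random split does not guarantee that each species appears in the test set in the same proportion as in the full dataset. One option, sketched below, is the `stratify` parameter of `train_test_split`. The `_s` variable names are chosen here only so that the split above is not overwritten.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both subsets.
# The _s suffix avoids overwriting the notebook's earlier split.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Test samples per class: ", np.bincount(y_test_s))
print("Train samples per class:", np.bincount(y_train_s))
```

With 50 samples per species and a 20% test size, a stratified split places 10 samples of each species in the test set.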

## Making Predictions

Now use the trained model to predict flower species:

In [None]:
# Make predictions on test data
predictions = model.predict(X_test)

# Show the first 10 predictions next to the actual labels
print("Predictions vs Actual:")
for pred, actual in zip(predictions[:10], y_test[:10]):
    pred_name = iris.target_names[pred]
    actual_name = iris.target_names[actual]
    match = "✓" if pred == actual else "✗"
    print(f"{match} Predicted: {pred_name:12} | Actual: {actual_name}")

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"\nModel Accuracy: {accuracy * 100:.2f}%")
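
Accuracy is simply the fraction of test samples whose prediction matches the true label. The sketch below recomputes it by hand and compares the result with `accuracy_score`. It rebuilds the same split and model so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Recreate the same split and model as above so this cell is self-contained
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
predictions = model.predict(X_test)

# Accuracy = number of correct predictions / total number of predictions
correct = np.sum(predictions == y_test)
manual_accuracy = correct / len(y_test)

print(f"Correct predictions: {correct} of {len(y_test)}")
print(f"Manual accuracy:     {manual_accuracy:.4f}")
print(f"accuracy_score:      {accuracy_score(y_test, predictions):.4f}")
```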

## Regression: Predicting Numbers

Let's predict house prices from a single feature, the size of the house:

In [None]:
from sklearn.linear_model import LinearRegression
import numpy as np

# Simple dataset: house size -> price
# Sizes in square feet
house_sizes = np.array([1000, 1500, 2000, 2500, 3000, 3500]).reshape(-1, 1)
# Prices in thousands of dollars
prices = np.array([200, 280, 350, 420, 500, 580])

# Create and train the model
regression_model = LinearRegression()
regression_model.fit(house_sizes, prices)

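# Predict price for new houses
# Note: 4000 sq ft lies outside the training range (1000-3500 sq ft),
# so that prediction is an extrapolation and should be trusted less
new_houses = np.array([1800, 2700, 4000]).reshape(-1, 1)
predicted_prices = regression_model.predict(new_houses)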

print("Price Predictions:")
for size, price in zip(new_houses.flatten(), predicted_prices):
    print(f"House size: {size} sq ft -> Predicted price: ${price:.2f}k")
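
A fitted `LinearRegression` is just a straight line, and its slope and intercept can be read from the `coef_` and `intercept_` attributes. The sketch below refits the same model, reproduces one prediction by hand, and reports R² via `score()`. Note that R² is computed here on the training data, so it is an optimistic estimate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same data as above: size in square feet, price in thousands of dollars
house_sizes = np.array([1000, 1500, 2000, 2500, 3000, 3500]).reshape(-1, 1)
prices = np.array([200, 280, 350, 420, 500, 580])

regression_model = LinearRegression().fit(house_sizes, prices)

# The learned line is: price = slope * size + intercept
slope = regression_model.coef_[0]
intercept = regression_model.intercept_
print(f"Learned line: price = {slope:.4f} * size + {intercept:.2f}")

# Reproduce one prediction by hand and compare it with predict()
manual_price = slope * 1800 + intercept
model_price = regression_model.predict(np.array([[1800]]))[0]
print(f"Manual:    ${manual_price:.2f}k")
print(f"predict(): ${model_price:.2f}k")

# R^2 on the training data (1.0 would be a perfect fit)
r2 = regression_model.score(house_sizes, prices)
print(f"R^2 on training data: {r2:.4f}")
```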

## Clustering: Finding Groups

Clustering is an unsupervised technique: it finds natural groupings in the data without being given any labels:

In [None]:
import numpy as np  # repeated here so this cell runs on its own
from sklearn.cluster import KMeans

# Customer data: [age, annual_spending_k]
customers = np.array([
    [25, 40], [27, 45], [30, 50], [35, 48],  # Young spenders
    [45, 80], [48, 85], [50, 90], [52, 88],  # Middle-age high spenders
    [60, 30], [62, 28], [65, 32], [67, 35]   # Seniors low spenders
])

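# Find 3 clusters
# n_init=10 is set explicitly because its default differs between scikit-learn versions
# Cluster numbers are arbitrary IDs; they carry no ordering or meaning
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(customers)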

print("Customer Segments:")
for i, (customer, cluster) in enumerate(zip(customers, clusters)):
    print(f"Customer {i+1}: Age {customer[0]}, Spending ${customer[1]}k -> Group {cluster}")
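
To interpret the groups, it helps to look at each cluster's center, which is the mean age and spending of its members. This is available as `cluster_centers_`. The sketch below refits the same model, prints the centers, and assigns a new customer to the nearest one. The new customer `[28, 44]` is a hypothetical value chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same customer data as above: [age, annual_spending_k]
customers = np.array([
    [25, 40], [27, 45], [30, 50], [35, 48],
    [45, 80], [48, 85], [50, 90], [52, 88],
    [60, 30], [62, 28], [65, 32], [67, 35],
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(customers)

# Each center is the average [age, spending] of the customers in that cluster
for label, (age, spending) in enumerate(kmeans.cluster_centers_):
    size = np.sum(clusters == label)
    print(f"Group {label}: {size} customers, "
          f"average age {age:.1f}, average spending ${spending:.1f}k")

# Assign a new (hypothetical) customer to the nearest cluster center
new_customer = np.array([[28, 44]])
new_label = kmeans.predict(new_customer)[0]
print(f"New customer {new_customer[0].tolist()} -> Group {new_label}")
```

Because k-means relies on Euclidean distance, features on very different scales are usually standardized first. Here age and spending have similar ranges, so the raw values are used.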

## Model Evaluation Metrics

Accuracy alone can hide which classes the model gets wrong. The classification report and the confusion matrix break performance down by class:

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# Using our iris model from earlier
print("Detailed Classification Report:")
print(classification_report(y_test, predictions, target_names=iris.target_names))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, predictions))
print("\n(Rows = Actual, Columns = Predicted)")
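
The precision and recall values in the classification report can be derived directly from the confusion matrix. The sketch below does so by hand and checks the result against `precision_score` and `recall_score`. It assumes every class is predicted at least once, since otherwise the precision division would be by zero.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Recreate the iris model from earlier so this cell is self-contained
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
predictions = model.predict(X_test)

cm = confusion_matrix(y_test, predictions)  # rows = actual, columns = predicted
true_positives = cm.diagonal()

# Recall: of the samples that ARE class k, what fraction was predicted as k?
recall_manual = true_positives / cm.sum(axis=1)
# Precision: of the samples PREDICTED as class k, what fraction really is k?
precision_manual = true_positives / cm.sum(axis=0)

for name, p, r in zip(iris.target_names, precision_manual, recall_manual):
    print(f"{name:12} precision={p:.2f} recall={r:.2f}")

# Cross-check against scikit-learn's per-class scores
print(np.allclose(recall_manual, recall_score(y_test, predictions, average=None)))
print(np.allclose(precision_manual, precision_score(y_test, predictions, average=None)))
```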

## Exercise

Build a spam classifier:
1. Create a simple dataset of emails (represented as word counts)
2. Label them as spam (1) or not spam (0)
3. Train a DecisionTreeClassifier
4. Test on new "emails" and calculate accuracy

In [None]:
# Your code here
# Hint: Features could be [word_count, has_money_words, has_urgent_words]
# Example: [50, 0, 0] = 50 words, no money words, no urgent words -> Not spam
#          [30, 1, 1] = 30 words, has money words, has urgent words -> Spam

