# Class 2 Practice: Key Machine Learning Concepts

Welcome to Week 5, Class 2! Today, we’ll dive into the foundational concepts of machine learning: **features**, **labels**, **training/testing data**, and **overfitting/underfitting**. This notebook will guide you through these ideas with examples and exercises to prepare you for building ML models.

## Objectives
- Understand **features** and **labels** in a dataset.
- Learn why we split data into **training** and **testing** sets.
- Explore **overfitting** and **underfitting** and their impact on models.
- Practice identifying these concepts in a sample dataset.

## 1. Features and Labels

In machine learning, data is organized into **features** and **labels**:
- **Features**: The input variables used to make predictions. Think of them as the "questions" the model uses.
  - Example: For predicting house prices, features might be size (sqft), number of bedrooms, and location.
- **Labels**: The output variable we want to predict. This is the "answer" the model learns.
  - Example: For house prices, the label is the actual price ($).
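
The feature/label distinction above can be sketched in plain Python (no pandas needed). The three house records and the names `houses`, `feature_names`, and `label_name` are made up for illustration; they do not come from a real dataset.

```python
# Hypothetical house records, invented purely for illustration.
houses = [
    {"size_sqft": 1400, "bedrooms": 3, "location": "suburb", "price": 250000},
    {"size_sqft": 900, "bedrooms": 2, "location": "city", "price": 310000},
    {"size_sqft": 2100, "bedrooms": 4, "location": "rural", "price": 280000},
]

feature_names = ["size_sqft", "bedrooms", "location"]  # the inputs
label_name = "price"                                   # the value to predict

# X holds one list of feature values per house; y holds the matching labels.
X = [[house[name] for name in feature_names] for house in houses]
y = [house[label_name] for house in houses]

print(X[0])  # [1400, 3, 'suburb']
print(y[0])  # 250000
```

Naming the inputs `X` and the label `y` is a common convention, not a requirement.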

**Analogy**: Imagine baking a cake. Features are the ingredients (flour, sugar, eggs), and the label is the cake’s taste (good or bad).

**Question**: Suppose you’re predicting whether a student passes an exam. What might be one feature and the label?

**Your Answer**:  
- Feature: [Type here, e.g., 'Hours studied']  
- Label: [Type here, e.g., 'Pass or fail']

## 2. Training and Testing Data

To build a reliable model, we split our dataset into:
- **Training Data**: Used to teach the model (e.g., 80% of the data).
- **Testing Data**: Used to evaluate the model’s performance on unseen data (e.g., 20% of the data).

**Why split?** Without a held-out test set, we cannot tell whether the model has learned general patterns or has only memorized the training data. A model that memorizes is like a student who studies only the exact questions on a practice test and then fails the real exam.

**Example**: If you have 100 house price records, you might use 80 to train and 20 to test.
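
As a rough sketch of the 80/20 example, the snippet below shuffles 100 placeholder record IDs and slices them into two groups using only the standard library. The integer IDs stand in for real house records, and the seed value `42` is an arbitrary choice to make the split repeatable.

```python
import random

# 100 placeholder IDs standing in for the 100 house price records.
records = list(range(100))

# Shuffle first so the split does not depend on the original ordering.
rng = random.Random(42)  # arbitrary fixed seed, so the split is repeatable
rng.shuffle(records)

split_point = len(records) * 80 // 100  # 80% of the records go to training
train_records = records[:split_point]
test_records = records[split_point:]

print(len(train_records), len(test_records))  # 80 20
```

Every record ends up in exactly one of the two groups, so the test set really is unseen during training.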

**Question**: Why is it bad to test a model on the same data it was trained on?

**Your Answer**: [Type here, e.g., 'It might memorize the data and not generalize to new examples.']

## 3. Overfitting and Underfitting

Models can fail in two ways:
- **Overfitting**: The model learns the training data *too well*, including noise, and performs poorly on new data.
  - Example: A model that predicts house prices perfectly for your 80 training houses but fails on new houses.
- **Underfitting**: The model is too simple and doesn’t learn enough from the training data, performing poorly on both training and testing.
  - Example: Predicting house prices from the number of windows alone, which tells the model too little about price.

**Analogy**: Overfitting is like memorizing answers without understanding; underfitting is like not studying enough.
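
To make the two failure modes concrete, here is a small sketch in plain Python. The numbers are invented, and the three "models" are deliberately simple stand-ins: one always predicts the average price (underfitting), one memorizes the training pairs (overfitting), and one fits a straight line for comparison. None of this is a recommended modeling approach; it only shows how training and testing error can differ.

```python
# Hypothetical (size in sqft, price in $1000s) pairs, invented for illustration.
train = [(1000, 200), (1500, 290), (2000, 410), (2500, 500)]
test = [(1200, 245), (2200, 435)]

mean_price = sum(price for _, price in train) / len(train)

def underfit_model(size):
    """Too simple: ignores the feature and always predicts the average price."""
    return mean_price

memory = dict(train)  # maps each training size to its exact price

def overfit_model(size):
    """Memorizes training pairs; falls back to the average for unseen sizes."""
    return memory.get(size, mean_price)

# Least-squares straight line through the training points, for comparison.
mean_size = sum(size for size, _ in train) / len(train)
slope = (sum((s - mean_size) * (p - mean_price) for s, p in train)
         / sum((s - mean_size) ** 2 for s, _ in train))
intercept = mean_price - slope * mean_size

def line_model(size):
    """Predicts price from size using the fitted straight line."""
    return slope * size + intercept

def mean_abs_error(model, data):
    """Average absolute gap between predicted and actual price."""
    return sum(abs(model(size) - price) for size, price in data) / len(data)

for name, model in [("underfit", underfit_model),
                    ("overfit", overfit_model),
                    ("line", line_model)]:
    train_err = round(mean_abs_error(model, train), 1)
    test_err = round(mean_abs_error(model, test), 1)
    print(name, "train error:", train_err, "test error:", test_err)
```

With these made-up numbers, the memorizing model has zero training error but is no better than the averaging model on the test houses, while the straight line has a small error on both sets.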

**Question**: If a model gets 100% accuracy on training data but only 50% on testing data, is it overfitting or underfitting? Why?

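**Your Answer**: [Type here, e.g., 'Overfitting, because perfect training accuracy combined with much lower testing accuracy suggests the model memorized the training data instead of learning patterns that generalize.']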

## 4. Exercise: Analyze a Toy Dataset

Let’s practice with a small dataset about **used cars** and their prices:

| Car ID | Mileage (miles) | Age (years) | Brand | Price ($) |
|--------|-----------------|-------------|-------|-----------|
| 1      | 50000           | 3           | Toyota| 15000     |
| 2      | 80000           | 5           | Honda | 12000     |
| 3      | 20000           | 1           | Ford  | 20000     |
| 4      | 100000          | 7           | Toyota| 9000      |
| 5      | 30000           | 2           | Honda | 18000     |

**Tasks**:
1. **Identify Features and Label**:
   - Features: [List them, e.g., 'Mileage, Age, Brand']
   - Label: [Type here, e.g., 'Price']
2. **Plan a Train/Test Split**:
   - If you have only 5 records, how many would you use for training and testing? Why?
   - Your Answer: [Type here, e.g., '4 for training, 1 for testing, because we need enough data to train but at least one to test.']
3. **Think About Overfitting**:
   - What might cause overfitting in this dataset? (Hint: Think about features or data size.)
   - Your Answer: [Type here, e.g., 'Using too many specific features like car color or having too few data points.']

```python
# Optional: Inspect the dataset using Python (run this cell if pandas is installed)
import pandas as pd

# Create the toy dataset
data = {
    'Mileage': [50000, 80000, 20000, 100000, 30000],
    'Age': [3, 5, 1, 7, 2],
    'Brand': ['Toyota', 'Honda', 'Ford', 'Toyota', 'Honda'],
    'Price': [15000, 12000, 20000, 9000, 18000]
}
df = pd.DataFrame(data)

# Display the dataset
df
```
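
If pandas is available, Tasks 1 and 2 could be sketched as below. This is one possible answer, not the only one: it treats Mileage, Age, and Brand as features and Price as the label (Car ID is only an identifier), and it holds out a single row for testing. The `random_state=0` value is an arbitrary choice that makes the held-out row repeatable.

```python
import pandas as pd

# Same toy dataset as in the used-car table above.
df = pd.DataFrame({
    'Mileage': [50000, 80000, 20000, 100000, 30000],
    'Age': [3, 5, 1, 7, 2],
    'Brand': ['Toyota', 'Honda', 'Ford', 'Toyota', 'Honda'],
    'Price': [15000, 12000, 20000, 9000, 18000],
})

# Features are the input columns; the label is the column to predict.
X = df[['Mileage', 'Age', 'Brand']]
y = df['Price']

# Hold out 1 of the 5 rows for testing and keep the other 4 for training.
test_df = df.sample(n=1, random_state=0)
train_df = df.drop(test_df.index)

print(len(train_df), len(test_df))  # 4 1
```

With only one test row, the measured performance depends heavily on which car happens to be held out, which is one reason such a small dataset is hard to evaluate reliably.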

## 5. Reflection

**Question**: How could overfitting affect a real-world ML project, like predicting car prices for a dealership? Write 1-2 sentences.

**Your Answer**: [Type here, e.g., 'Overfitting might make the model predict perfectly for known cars but fail for new ones, costing the dealership money.']

## Next Steps

Great work! You’re building a solid ML foundation. For homework:
- Read a short article on overfitting (e.g., search 'overfitting Towards Data Science').
- Write 1-2 sentences on how to prevent overfitting and bring them to class.

Save this notebook and share it if requested. In Class 3, we’ll start using scikit-learn to load and split datasets!