# Class 3 Practice: Getting Started with scikit-learn

Welcome to Week 5, Class 3! Today, we’ll get hands-on with **scikit-learn**, a powerful Python library for machine learning. You’ll learn how to load datasets and split them into **training** and **testing** sets, a crucial step for building ML models.

## Objectives
- Understand what scikit-learn is and how it’s used in ML.
- Load a dataset using scikit-learn.
- Split a dataset into training and testing sets using `train_test_split`.
- Verify the split to ensure it’s correct.

## 1. Introduction to scikit-learn

**scikit-learn** is a beginner-friendly Python library that provides tools for machine learning, like loading datasets, training models, and evaluating results. Think of it as a toolbox for ML experiments!

Today, we’ll use scikit-learn to:
- Load a dataset (like the Diabetes dataset, which is used to predict disease progression).
- Split it into training and testing sets to prepare for model training.
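Before going further, it can help to confirm that scikit-learn is available in your environment. The check below is a minimal sketch, assuming the package was installed under its usual name (`scikit-learn`); note that it is imported as `sklearn`.

```python
# Quick sanity check: scikit-learn is imported under the name `sklearn`
import sklearn

# The version string tells you which release you are running
installed_version = sklearn.__version__
print(f"scikit-learn version: {installed_version}")
```

If the import fails, the library is most likely not installed in the environment your notebook is using.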

**Question**: Why do we need a library like scikit-learn instead of writing all the ML code ourselves?

**Your Answer**: [Type here, e.g., 'It saves time and provides tested tools for common ML tasks.']

## 2. Loading a Dataset

scikit-learn comes with built-in datasets, like the **Diabetes dataset**, which includes features (e.g., age, BMI) and a label (a score for disease progression one year after baseline). Let’s load it and explore its structure.

Run the cell below to load the dataset and see its shape.

In [None]:
# Import scikit-learn and load the Diabetes dataset
from sklearn.datasets import load_diabetes

# Load the dataset
diabetes = load_diabetes()

# Get features (X) and labels (y)
X = diabetes.data  # Features (e.g., age, BMI)
y = diabetes.target  # Label (disease progression)

# Print the shape of the dataset
print(f"Features shape: {X.shape}")  # Rows, Columns
print(f"Labels shape: {y.shape}")   # Rows

# Optional: Print feature names
print(f"Feature names: {diabetes.feature_names}")

**Explanation**:
- `X` contains the features (inputs), like age and BMI, in a matrix (rows = samples, columns = features).
- `y` contains the labels (outputs), like disease progression, in a vector.
- The `.shape` attribute shows the size of each array: `(442, 10)` for `X` (442 samples, 10 features) and `(442,)` for `y` (one label per sample).
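To look a little closer at what `load_diabetes()` returns, the sketch below prints the available fields and the first sample. One detail worth knowing: in this built-in copy of the dataset the feature columns have already been mean-centered and scaled, so the values are small numbers around zero rather than raw ages or BMI readings. The variable name `first_row` is just for illustration.

```python
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()

# The returned object behaves like a dictionary with named fields
print(f"Available fields: {list(diabetes.keys())}")

# Pair each feature name with its value in the first sample
first_row = dict(zip(diabetes.feature_names, diabetes.data[0]))
for name, value in first_row.items():
    print(f"{name:>4}: {value: .4f}")

# The matching label for that sample
print(f"Label for first sample: {diabetes.target[0]}")
```

Printing `diabetes.DESCR` gives a longer text description of the dataset if you want more background on each column.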

**Question**: Based on the output above, how many samples (rows) are in the Diabetes dataset? How many features?

**Your Answer**: [Type here, e.g., '442 samples, 10 features']

## 3. Splitting the Dataset

To train and test a model, we split the data using `train_test_split`:
- **Training set**: Used to teach the model (e.g., 80% of data).
- **Testing set**: Used to evaluate the model (e.g., 20% of data).

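By default, the function shuffles the rows before splitting, so each set is a random sample of the data rather than, say, the first 80% and the last 20% of the rows.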
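To make the idea of a random split concrete, here is a rough NumPy-only sketch of the concept: shuffle the row indices, then cut the shuffled list into two parts. This is an illustration of the idea, not scikit-learn's actual implementation, and the specific rows it picks will differ from what `train_test_split` picks.

```python
import numpy as np

n_samples = 442          # size of the Diabetes dataset
test_fraction = 0.2      # 20% held out for testing

# Shuffle the row indices with a fixed seed so the result is repeatable
rng = np.random.default_rng(seed=42)
shuffled = rng.permutation(n_samples)

# Take the first chunk as the test set and the rest as the training set
n_test = int(np.ceil(test_fraction * n_samples))
test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

print(f"Test rows: {len(test_idx)}, training rows: {len(train_idx)}")

# No row index appears in both sets
overlap = set(test_idx.tolist()) & set(train_idx.tolist())
print(f"Rows in both sets: {len(overlap)}")
```

The key property is that every row ends up in exactly one of the two sets.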

Run the cell below to split the Diabetes dataset.

In [None]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the splits
print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")
print(f"Training labels shape: {y_train.shape}")
print(f"Testing labels shape: {y_test.shape}")

**Explanation**:
- `test_size=0.2` means 20% of the data is for testing.
- `random_state=42` ensures the split is reproducible (same split every time).
- The output shows how many samples are in each set.
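The effect of `random_state` is easy to check for yourself. The sketch below splits the data three times: twice with the same seed and once with a different one (the seed `7` is an arbitrary choice for illustration). It uses `load_diabetes(return_X_y=True)` as a shortcut that returns the features and labels directly.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Two calls with the same seed
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=42)

# One call with a different seed
_, X_test_c, _, y_test_c = train_test_split(X, y, test_size=0.2, random_state=7)

same_seed_match = np.array_equal(X_test_a, X_test_b)
different_seed_match = np.array_equal(X_test_a, X_test_c)

print(f"Same seed gives the same test set: {same_seed_match}")
print(f"Different seed gives the same test set: {different_seed_match}")
```

The test sets have the same size in all three cases; only which rows land in them changes with the seed.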

**Question**: If the Diabetes dataset has 442 samples, how many should be in the training set (80%)? The testing set (20%)?

**Your Answer**: [Type here, e.g., 'Training: 353, Testing: 89']
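Since 20% of 442 is 88.4, the sizes cannot be an exact 80/20 split. When `test_size` is a fraction and `train_size` is left unset, scikit-learn rounds the test count up and gives the remaining rows to training. The helper below, `expected_split_sizes`, is a hypothetical function written for this worksheet to reproduce that arithmetic; it is not part of scikit-learn.

```python
import math

def expected_split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Mirrors the rounding used when only test_size is given:
    the test count is rounded up, training gets the rest.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = expected_split_sizes(442, 0.2)
print(f"Training: {n_train}, Testing: {n_test}")  # Training: 353, Testing: 89
```

You can call the same helper with a different `test_size` to predict the sizes before running a split.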

## 4. Exercise: Practice Splitting a Dataset

Now it’s your turn! Modify the code below to split the Diabetes dataset with **70% training** and **30% testing**. Then, verify the split by checking the shapes.

**Instructions**:
1. Change `test_size` to 0.3.
2. Run the cell.
3. Check the shapes to confirm the split.

In [None]:
# Your code here
from sklearn.model_selection import train_test_split

# Split the data (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # Change test_size!

# Print the shapes
print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")
print(f"Training labels shape: {y_train.shape}")
print(f"Testing labels shape: {y_test.shape}")

# Calculate the percentage of test samples
test_percentage = X_test.shape[0] / (X_train.shape[0] + X_test.shape[0]) * 100
print(f"Percentage of data in test set: {test_percentage:.1f}%")

**Task**:
- Did the split work correctly? (Check if test set is ~30%.)
- Write the number of samples in the training and testing sets below.

**Your Answer**:
- Training samples: [Type here, e.g., '309']
- Testing samples: [Type here, e.g., '133']
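Checking shapes by eye works, but the same checks can be written as code. The sketch below defines `check_split`, a hypothetical helper (not part of scikit-learn), and tries it on made-up toy data so it does not give away the exercise. The `tolerance` parameter is an assumption: it allows for the small rounding difference between the requested and actual test fraction.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def check_split(X_train, X_test, y_train, y_test, n_total,
                expected_test_fraction, tolerance=0.01):
    """Return True if the split sizes are consistent with the request."""
    # Features and labels must have the same number of rows in each set
    rows_match = (X_train.shape[0] == y_train.shape[0]
                  and X_test.shape[0] == y_test.shape[0])
    # Every original row must be in exactly one of the two sets
    nothing_lost = X_train.shape[0] + X_test.shape[0] == n_total
    # The test set should be close to the requested fraction
    actual_fraction = X_test.shape[0] / n_total
    fraction_ok = abs(actual_fraction - expected_test_fraction) <= tolerance
    return rows_match and nothing_lost and fraction_ok

# Toy data: 200 samples, 3 features (made up for illustration)
X_toy = np.arange(600).reshape(200, 3)
y_toy = np.arange(200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
split_ok = check_split(X_tr, X_te, y_tr, y_te, n_total=200, expected_test_fraction=0.25)
print(f"Split looks correct: {split_ok}")
```

To use it on your own split, pass your four arrays along with `n_total=X.shape[0]` and the test fraction you asked for.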

## 5. Reflection

**Question**: Why is splitting data into training and testing sets important for machine learning? Write 1-2 sentences.

**Your Answer**: [Type here, e.g., 'It ensures the model is evaluated on unseen data, which helps detect overfitting and checks its ability to generalize.']
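If you want to see the effect for yourself, the sketch below is one way to do it. It looks ahead slightly, since training models has not been covered yet. The choice of `DecisionTreeRegressor` is an assumption made only for illustration: an unrestricted decision tree tends to memorize its training data, which makes the gap between training and testing scores easy to see. The variable names ending in `_demo` are chosen so they do not overwrite your exercise variables.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = load_diabetes(return_X_y=True)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# An unrestricted decision tree can memorize the training samples
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train_demo, y_train_demo)

# .score() returns R^2: 1.0 is a perfect fit, lower is worse
train_score = model.score(X_train_demo, y_train_demo)
test_score = model.score(X_test_demo, y_test_demo)

print(f"Score on training data: {train_score:.2f}")
print(f"Score on testing data:  {test_score:.2f}")
```

A much higher score on the training data than on the testing data is the sign that the model has memorized rather than learned, and you can only see it because some data was held back.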

## Next Steps

Awesome job! You’ve learned how to use scikit-learn to prepare data for ML. For homework:
- Practice splitting the **Iris dataset** (use `load_iris` from `sklearn.datasets`).
- Share your code snippet (e.g., print the shapes of the split).
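A possible starting point for the homework is sketched below. It only loads the Iris dataset and prints its structure; the split itself is left for you to write. The variable names `X_iris` and `y_iris` are suggestions, not requirements.

```python
from sklearn.datasets import load_iris

# Load the Iris dataset (flower measurements and species)
iris = load_iris()
X_iris = iris.data      # Features (e.g., petal length)
y_iris = iris.target    # Label (species, encoded as 0, 1, or 2)

print(f"Features shape: {X_iris.shape}")
print(f"Labels shape: {y_iris.shape}")
print(f"Feature names: {iris.feature_names}")
print(f"Species names: {list(iris.target_names)}")
```

Unlike the Diabetes dataset, the label here is a category rather than a number on a scale, but `train_test_split` is called in the same way.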

Save this notebook and submit it if requested. In Class 4, we’ll use these skills to build a linear regression model!