# **Lab 2.2**: Data Preparation Without Data Leakage

In this lesson, we walk through the correct order of data preparation in supervised machine learning. You will practice learning cleaning rules from the training set only and then applying those same rules to the test set.

**Key order of operations:**
- Train-test split ***first!***
- Learn cleaning and transformation rules from the training data
- Apply those rules to both training and test data
- Use the test set only for final evaluation
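To see why the split comes first, here is a minimal sketch using made-up values (the `train` and `test` series are hypothetical and are not the lab's dataset). It compares an imputation mean learned from the training rows only with one computed from all rows, which lets the test set influence the rule.

```python
import pandas as pd

# Hypothetical feature values, already separated into train and test rows.
train = pd.Series([40.0, 50.0, None, 60.0])
test = pd.Series([200.0, None])

# Correct: the imputation value is learned from the training rows only.
# Series.mean() skips missing values by default.
train_mean = train.mean()

# Leaky: the imputation value is computed from every row, test rows included.
leaky_mean = pd.concat([train, test]).mean()

print(f"train-only mean = {train_mean}")
print(f"full-data mean  = {leaky_mean}")
```

In this toy case the single large test value pulls the full-data mean well above the training mean, so any rule built on it would already "know" something about the test set.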

## Step 1: Import Libraries

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
```

## Step 2: Create Example Data

We create a small dataset with:
- One numeric feature (`income`)
- Missing values in `income`
- A target variable (`y`)

```python
data = pd.DataFrame({
    "income": [50000, 60000, None, 80000, None, 120000, 70000, None],
    "y": [0, 1, 0, 1, 0, 1, 0, 1]
})

# display the DataFrame
data
```

## Step 3: Split First

We perform the train/test split before any cleaning or transformation. Some ML workflows relax this order or do not need it, but in this introductory class we always split first.

From this point on:
- The training set is used to learn rules
- The test set is treated as unseen data

Check the scikit-learn documentation for `train_test_split` to fill in the blanks (`XXX`) below.

```python
# define the feature(s) and target
X = XXX
y = XXX

# split the data
X_train, X_test, y_train, y_test = train_test_split(
    XXX, XXX, XXX=XXX, random_state=26
)

# view the split X values
print(f"X_train =\n{X_train}\nX_test =\n{X_test}")
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Perform the train-test split with a 75-25 ratio (75% training, 25% test).
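As a sanity check on how a split behaves, the sketch below uses a separate, hypothetical 10-row dataset (the `toy` frame, its `age` and `label` columns, and the 80-20 ratio are all illustrative assumptions, not the lab's answer). It shows that the held-out fraction determines the row counts and that features and target stay aligned row by row.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature, one binary target.
toy = pd.DataFrame({
    "age": [22, 35, 41, 29, 53, 38, 47, 31, 26, 60],
    "label": [0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
})

features = toy[["age"]]  # double brackets keep a 2-D DataFrame
target = toy["label"]

# Hold out 20% of the rows; random_state makes the shuffle repeatable.
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=0
)

# Row counts follow the requested fraction.
print(len(f_train), len(f_test))

# Features and target keep matching row labels after the split.
print(f_train.index.equals(t_train.index))
```

Printing the lengths of your own `X_train` and `X_test` after the exercise is a quick way to confirm the ratio you asked for.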

## Step 4: Learn Cleaning Rules From Training Data

We decide **how** to clean the data using the training set only.

Here, we learn the mean income to use for missing-value imputation.

```python
income_mean = X_train["income"].XXX
print(income_mean)
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Compute the mean of `income` in the training set.

## Step 5: Apply Cleaning to Both Training and Test Data

The same rule learned from the training data is applied to:
- Training data
- Test data

No statistics are recomputed on the test set.

```python
X_train_clean = XXX.XXX(XXX)
X_test_clean = XXX.XXX(XXX)

# view the cleaned X values
print(f"X_train_clean =\n{X_train_clean}\nX_test_clean =\n{X_test_clean}")
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Fill in the missing values with the mean from the training set.
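The same learn-then-apply pattern is available in scikit-learn as `SimpleImputer`, which is a different tool from the pandas approach used in this lab. The sketch below is only an illustration on hypothetical data (the `train` and `test` frames and the `age` column are made up): the imputer learns the mean when fitted on the training rows and reuses that stored value on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test features with missing values.
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"age": [np.nan, 50.0]})

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)  # learns the column mean from the training rows only

# transform() reuses the stored mean; by default it returns NumPy arrays.
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)

print(imputer.statistics_)  # the learned mean, one entry per column
print(test_filled.ravel())  # the missing test value is filled with the training mean
```

Note that the test set's own mean is never computed: its missing value is filled with the number learned from the training data.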

## Step 6: Learn Scaling Parameters From Training Data

Scaling parameters (mean and standard deviation) are learned using the training data only.

```python
scaler = XXX()
scaler.XXX(XXX)
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Select the scaler and fit it to the cleaned training data.
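To make "scaling parameters" concrete, here is a small sketch that computes them by hand for a hypothetical feature (the `train_values` array is made up). One detail worth knowing: `StandardScaler` uses the population standard deviation, equivalent to NumPy's default `ddof=0`, while the pandas `.std()` method defaults to the sample version (`ddof=1`), so the two can disagree on small datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned training values for a single feature.
train_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# The two parameters a standard scaler needs to learn.
train_mean = train_values.mean()
train_std = train_values.std()  # NumPy default is ddof=0 (population)

# pandas defaults to ddof=1 (sample), which gives a larger value here.
pandas_std = pd.Series(train_values).std()

print(f"mean = {train_mean}")
print(f"std (ddof=0) = {train_std:.4f}")
print(f"std (ddof=1) = {pandas_std:.4f}")
```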

## Step 7: Apply Scaling to Both Sets

The fitted scaler is applied to both the training and test data.

The test set does not influence the scaling parameters.

```python
X_train_scaled = XXX.XXX(X_train_clean)
X_test_scaled = XXX.XXX(X_test_clean)

print(f"X_train_scaled =\n{X_train_scaled}\nX_test_scaled =\n{X_test_scaled}")
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Apply the learned scaling to both the cleaned training set and the cleaned test set.
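One consequence of this step is that only the training data ends up with mean 0 and standard deviation 1 after scaling. The sketch below shows this with a hand-written helper (`standardize` and both arrays are hypothetical, written for illustration rather than taken from scikit-learn) that applies the formula `(x - mean) / std` using training parameters for both sets.

```python
import numpy as np

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using parameters learned elsewhere."""
    return (values - mean) / std

# Hypothetical cleaned feature values.
train_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
test_values = np.array([6.0, 8.0])

# Parameters come from the training data only.
mean, std = train_values.mean(), train_values.std()

train_scaled = standardize(train_values, mean, std)
test_scaled = standardize(test_values, mean, std)

# Training data is centered and has unit spread by construction.
print(train_scaled.mean(), train_scaled.std())

# Test data is on the same scale but is generally not centered at 0.
print(test_scaled.mean())
```

A scaled test set whose mean is far from 0 is not an error. It simply reflects that the test rows were measured against the training distribution.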

## Summary

- The train/test split happens first
- Cleaning and scaling rules are learned from training data
- The same rules are applied to both training and test data
- The test set is never used to make preprocessing decisions

This workflow prevents data leakage and gives an honest estimate of how the model performs on unseen data.