# 📘 Day 2: Data Preprocessing

**🎯 Goal:** Learn to clean and prepare data for machine learning models

**⏱️ Time:** 45-60 minutes

**🌟 Why This Matters for AI:**
- "Garbage in, garbage out" - Bad data = Bad models
- By a common rule of thumb, data preparation takes up most of the time in an ML project (80% is the figure often quoted)
- RAG systems need properly processed documents for accurate retrieval
- Real-world data is ALWAYS messy - you must learn to clean it

---

## 🧹 Why Data Preprocessing?

Real-world data has problems:
- **Missing values** (blank cells) ❌
- **Different scales** (age: 25, salary: 50000) 📏
- **Text categories** ("Red", "Blue") that models can't read 🎨
- **Outliers** (extreme values) 📊

**Most machine learning models need:**
- ✅ No missing values
- ✅ Numbers on similar scales
- ✅ Categories converted to numbers
- ✅ Clean, consistent data

**Today, we'll tackle missing values, scales, and categories!** 👇

In [None]:
# Install required libraries
import sys
!{sys.executable} -m pip install scikit-learn pandas numpy matplotlib --quiet

print("✅ Libraries installed!")

In [None]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

print("üìö All libraries loaded!")

## üìä Let's Create a Messy Dataset

This dataset represents customer data with **real-world problems**:
- Missing values (some data is blank)
- Different scales (age vs income)
- Text categories (country names)

In [None]:
# Create a messy dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'Age': [25, 30, np.nan, 28, 35, 29],  # Missing value!
    'Country': ['USA', 'UK', 'USA', 'Canada', 'UK', np.nan],  # Missing value!
    'Salary': [50000, 60000, 55000, np.nan, 70000, 58000],  # Missing value!
    'Purchased': [0, 1, 0, 1, 1, 0]
}

df = pd.DataFrame(data)

print("üìä Our Messy Dataset:")
print(df)
print("\n‚ö†Ô∏è Notice the 'NaN' values? That's missing data!")

## üîç Step 1: Explore the Data

Always understand your data first!

In [None]:
# Get information about the dataset
print("📋 Dataset Info:")
df.info()  # info() prints its report directly and returns None
print("\n" + "="*50)

# Check for missing values
print("\n❓ Missing Values:")
print(df.isnull().sum())
print("\n" + "="*50)

# Statistical summary
print("\n📊 Statistical Summary:")
print(df.describe())

## 🛠️ Step 2: Handle Missing Values

**Three strategies:**
1. **Delete rows** with missing values (loses data ❌)
2. **Fill with mean/median** for numbers ✅
3. **Fill with mode** (most common value) for categories ✅

We'll use strategies 2 and 3!
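
The three strategies can be sketched with plain pandas before reaching for scikit-learn. This is a minimal illustration on a made-up frame (`toy` and its values are hypothetical, not the lesson's dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "City": ["Paris", "Paris", np.nan],
})

# Strategy 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Strategy 2: fill a numeric column with its mean (median works the same way)
age_filled = toy["Age"].fillna(toy["Age"].mean())

# Strategy 3: fill a categorical column with its mode (most common value)
city_filled = toy["City"].fillna(toy["City"].mode()[0])

print(len(dropped))          # rows that survive dropping
print(age_filled.tolist())
print(city_filled.tolist())
```

Note how dropping keeps only one of the three rows here, which is why filling is usually preferred on small datasets.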

In [None]:
# Create a copy to work with
df_clean = df.copy()

# Handle missing numerical values (Age, Salary)
# Strategy: Fill with MEAN (average)

# For Age (.ravel() flattens the (n, 1) result back into a 1-D column)
imputer_age = SimpleImputer(strategy='mean')
df_clean['Age'] = imputer_age.fit_transform(df_clean[['Age']]).ravel()

# For Salary
imputer_salary = SimpleImputer(strategy='mean')
df_clean['Salary'] = imputer_salary.fit_transform(df_clean[['Salary']]).ravel()

print("✅ Filled missing Age with mean age")
print("✅ Filled missing Salary with mean salary")
print("\n📊 After filling numerical values:")
print(df_clean)

In [None]:
# Handle missing categorical values (Country)
# Strategy: Fill with MOST FREQUENT value

imputer_country = SimpleImputer(strategy='most_frequent')
df_clean['Country'] = imputer_country.fit_transform(df_clean[['Country']]).ravel()

print("✅ Filled missing Country with most frequent country")
print("\n📊 After filling ALL missing values:")
print(df_clean)
print("\n✨ No more NaN values!")

## 🎨 Step 3: Encode Categorical Variables

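**Problem:** Most ML models work on numbers, not text!

**Solution:** Convert categories to numbers

**Label Encoding:** Canada=0, UK=1, USA=2 (`LabelEncoder` assigns the integers in alphabetical order, not in order of appearance). These integers imply a ranking that countries don't really have, so one-hot encoding is often the safer choice for unordered categories.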
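
To see how the integers are actually assigned, here is a small sketch on a hypothetical `countries` array (not the lesson's DataFrame). It also shows one-hot encoding with `OneHotEncoder` as an alternative that avoids implying any order:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample of country labels
countries = np.array(["USA", "UK", "USA", "Canada"])

# LabelEncoder numbers the classes in sorted (alphabetical) order
le = LabelEncoder()
codes = le.fit_transform(countries)
print(le.classes_.tolist())  # ['Canada', 'UK', 'USA']
print(codes.tolist())        # [2, 1, 2, 0]

# One-hot encoding: one 0/1 column per country, in the same sorted order
ohe = OneHotEncoder()
one_hot = ohe.fit_transform(countries.reshape(-1, 1)).toarray()
print(one_hot)
```

With label encoding a model could read "USA (2) > UK (1)" as a real ordering; the one-hot columns carry no such ranking.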

In [None]:
# Encode 'Country' column
label_encoder = LabelEncoder()
df_clean['Country_Encoded'] = label_encoder.fit_transform(df_clean['Country'])

print("üé® Country Encoding:")
print(df_clean[['Country', 'Country_Encoded']])
print("\nüìù Encoding mapping:")
for i, country in enumerate(label_encoder.classes_):
    print(f"  {country} ‚Üí {i}")

## üìè Step 4: Feature Scaling

**Problem:** Features have different scales!
- Age: 25-35
- Salary: 50,000-70,000

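**Why this matters:**
- Distance-based and gradient-based models (k-NN, SVM, neural networks) let Salary dominate simply because its numbers are bigger
- Putting every feature on the same scale lets each one contribute fairly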
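
A rough way to see the problem is to measure how much of the distance between two people comes from salary alone. The sketch below uses made-up values (`people` is hypothetical) and rescales each column by its own mean and spread, which previews the standardization method described next:

```python
import numpy as np

# Hypothetical customers: [age, salary]
people = np.array([
    [25.0, 50000.0],   # A
    [60.0, 50500.0],   # B: 35 years older, almost the same salary
    [26.0, 58000.0],   # C
])

def salary_share(a, b):
    """Fraction of the squared Euclidean distance that comes from salary."""
    sq = (a - b) ** 2
    return float(sq[1] / sq.sum())

# Raw units: the small salary gap still swamps the large age gap
raw_share = salary_share(people[0], people[1])

# Rescale each column to mean 0 and standard deviation 1
scaled = (people - people.mean(axis=0)) / people.std(axis=0)
scaled_share = salary_share(scaled[0], scaled[1])

print(f"Salary share of A-B distance, raw:    {raw_share:.3f}")
print(f"Salary share of A-B distance, scaled: {scaled_share:.3f}")
```

Before rescaling, nearly all of the A-B distance is attributed to salary even though the two differ mainly in age; after rescaling, age accounts for most of it.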

**Two methods:**
1. **Standardization** (mean=0, std=1) ← Most common
2. **Normalization** (scale to 0-1)

### Method 1: Standardization (StandardScaler)

**Formula:** `(value - mean) / standard_deviation`

**Result:** Mean = 0, Standard Deviation = 1
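
As a quick check of the formula, the sketch below standardizes a few illustrative ages by hand and compares the result with `StandardScaler`. The `ages` values are made up for the example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([25.0, 30.0, 35.0])  # illustrative values

# Apply the formula directly
mean = ages.mean()
std = ages.std()  # population standard deviation (ddof=0), as StandardScaler uses
manual = (ages - mean) / std

# Same computation through scikit-learn (expects a 2-D array)
sk = StandardScaler().fit_transform(ages.reshape(-1, 1)).ravel()

print(manual)
print(np.allclose(manual, sk))
```

One detail worth knowing: pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), so a hand calculation in pandas will differ slightly from `StandardScaler` unless you pass `ddof=0`.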

In [None]:
# Select numerical features to scale
features_to_scale = ['Age', 'Salary']

# Create scaler
scaler = StandardScaler()

# Fit and transform
df_clean[['Age_Scaled', 'Salary_Scaled']] = scaler.fit_transform(
    df_clean[features_to_scale]
)

print("üìè Before and After Standardization:")
print(df_clean[['Age', 'Age_Scaled', 'Salary', 'Salary_Scaled']])
print("\n‚úÖ Now Age and Salary are on the same scale!")

### Method 2: Normalization (MinMaxScaler)

**Formula:** `(value - min) / (max - min)`

**Result:** All values between 0 and 1
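
The same kind of check works for min-max normalization. The sketch below uses made-up `salaries` and also shows a caveat: a value outside the range seen during fitting lands outside 0-1 (with `MinMaxScaler`'s default settings).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

salaries = np.array([50000.0, 55000.0, 70000.0])  # illustrative values

# Apply the formula directly
manual = (salaries - salaries.min()) / (salaries.max() - salaries.min())

# Same computation through scikit-learn (expects a 2-D array)
scaler = MinMaxScaler().fit(salaries.reshape(-1, 1))
sk = scaler.transform(salaries.reshape(-1, 1)).ravel()

# A salary larger than anything seen during fit maps above 1
unseen = float(scaler.transform(np.array([[80000.0]]))[0, 0])

print(manual)
print(np.allclose(manual, sk))
print(unseen)
```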

In [None]:
# Create MinMax scaler
minmax_scaler = MinMaxScaler()

# Normalize Age and Salary
df_clean[['Age_Normalized', 'Salary_Normalized']] = minmax_scaler.fit_transform(
    df_clean[['Age', 'Salary']]
)

print("üìè Normalized Values (0 to 1):")
print(df_clean[['Age', 'Age_Normalized', 'Salary', 'Salary_Normalized']])
print("\n‚úÖ All values now between 0 and 1!")

## üéØ When to Use Each Scaling Method?

**StandardScaler (Standardization):**
- ✅ Most scale-sensitive algorithms (regularized linear models, SVM, k-NN, neural networks)
- ✅ When features follow a roughly normal distribution
- ✅ A sensible default choice!

**MinMaxScaler (Normalization):**
- ✅ Neural networks with bounded activation functions
- ✅ Image processing (pixel values have a known 0-255 range that maps cleanly to 0-1)
- ✅ When you need a specific range (0-1)

**No Scaling Needed:**
- ❌ Tree-based models (Decision Trees, Random Forest)
- ❌ Features already on the same scale
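
The claim about tree-based models can be checked on a small example. The sketch below, on made-up data, trains one decision tree on raw features and one on standardized features and compares their predictions for two new customers. It is an illustration of the idea, not a proof for every dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: [age, salary] -> purchased (0/1)
X = np.array([[22, 30000], [25, 32000], [47, 80000],
              [52, 90000], [46, 85000], [23, 31000]], dtype=float)
y = np.array([0, 0, 1, 1, 1, 0])
X_new = np.array([[30, 40000], [50, 70000]], dtype=float)

# Standardize using statistics from the training rows
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
X_new_scaled = scaler.transform(X_new)

# Same model settings, once on raw features and once on scaled features
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

pred_raw = tree_raw.predict(X_new)
pred_scaled = tree_scaled.predict(X_new_scaled)
same = bool((pred_raw == pred_scaled).all())
print(pred_raw, pred_scaled, same)
```

A tree only asks "is this feature above or below a threshold?", so rescaling a feature moves the threshold but leaves the resulting groups unchanged.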

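## 🔀 Step 5: Train/Test Split

**Golden Rule:** NEVER test on training data!

**Process:**
1. Split data FIRST (before scaling!)
2. Fit scaler on training data
3. Transform BOTH training and test data

**Why?** To prevent data leakage! If the scaler sees the test rows, its mean and standard deviation carry information about data the model is supposed to meet for the first time. The same rule applies to imputers and encoders: this notebook filled missing values before splitting, which is fine for a demo, but in a real project those should be fitted on the training set too.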

In [None]:
# Prepare features (X) and target (y)
X = df_clean[['Age', 'Salary', 'Country_Encoded']].values
y = df_clean['Purchased'].values

print("üìä Features (X):")
print(X)
print("\nüéØ Target (y):")
print(y)

In [None]:
# Split: request 20% for testing (with only 6 rows, scikit-learn rounds this up to 2 test rows)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("üîÄ Data Split Complete!")
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

print("\nüìä Training data:")
print(X_train)
print("\nüìä Testing data:")
print(X_test)

## ‚ö†Ô∏è IMPORTANT: The Right Way to Scale

**WRONG ❌:**
```python
# Scale all data, then split
X_scaled = scaler.fit_transform(X)  # the scaler sees the test rows too
X_train, X_test = train_test_split(X_scaled, test_size=0.2, random_state=42)  # DATA LEAKAGE!
```

**RIGHT ✅:**
```python
# Split first, then scale
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
scaler.fit(X_train)  # Learn from training only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
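
One way to make the "fit on training only" habit harder to break is scikit-learn's `Pipeline`, which chains steps so that a single `fit_transform` on the training set fits all of them. The sketch below is an optional alternative to the manual approach, using made-up numeric data with one missing value:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features with one missing value: [age, salary]
X = np.array([[25, 50000], [30, 60000], [np.nan, 55000],
              [28, 52000], [35, 70000], [29, 58000]], dtype=float)

# Split first: 4 training rows, 2 test rows
X_train, X_test = train_test_split(X, test_size=2, random_state=42)

# Chain imputation and scaling into one object
prep = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

X_train_prepared = prep.fit_transform(X_train)  # statistics come from training rows only
X_test_prepared = prep.transform(X_test)        # reuses those training statistics

print(X_train_prepared)
print(X_test_prepared)
```

Because the imputer lives inside the pipeline, the fill value for missing ages is also learned from the training rows only.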

In [None]:
# The RIGHT way to scale
scaler_final = StandardScaler()

# Fit on training data ONLY
scaler_final.fit(X_train)

# Transform both sets
X_train_scaled = scaler_final.transform(X_train)
X_test_scaled = scaler_final.transform(X_test)

print("✅ Scaled correctly!")
print("\n📊 Scaled Training Data:")
print(X_train_scaled)
print("\n📊 Scaled Testing Data:")
print(X_test_scaled)

## 🎯 Real AI Example: Preparing Data for RAG Systems

**RAG (Retrieval-Augmented Generation)** systems, such as a chatbot that answers questions from your own documents, need clean data!

**Scenario:** You're building a RAG system to answer questions about products.

**Your data has:**
- Missing descriptions
- Different price scales
- Category names (Electronics, Books, etc.)

In [None]:
# Product data for RAG system
products_data = {
    'Product': ['Laptop', 'Book', 'Phone', 'Tablet', 'Headphones'],
    'Category': ['Electronics', 'Books', 'Electronics', 'Electronics', np.nan],
    'Price': [1200, 25, np.nan, 800, 150],
    'Rating': [4.5, 4.8, 4.2, np.nan, 4.0],
    'Stock': [50, 200, 100, 75, 150]
}

products_df = pd.DataFrame(products_data)

print("üõçÔ∏è RAG System - Product Data (BEFORE cleaning):")
print(products_df)
print("\n‚ö†Ô∏è Issues: Missing Category, Price, and Rating!")

In [None]:
# Clean the data for RAG system
products_clean = products_df.copy()

# 1. Fill missing Category with most frequent
cat_imputer = SimpleImputer(strategy='most_frequent')
products_clean['Category'] = cat_imputer.fit_transform(
    products_clean[['Category']]
).ravel()

# 2. Fill missing Price with median (better for prices with outliers)
price_imputer = SimpleImputer(strategy='median')
products_clean['Price'] = price_imputer.fit_transform(
    products_clean[['Price']]
)

# 3. Fill missing Rating with mean
rating_imputer = SimpleImputer(strategy='mean')
products_clean['Rating'] = rating_imputer.fit_transform(
    products_clean[['Rating']]
)

# 4. Encode Category
cat_encoder = LabelEncoder()
products_clean['Category_Encoded'] = cat_encoder.fit_transform(
    products_clean['Category']
)

# 5. Scale numerical features
scaler_products = StandardScaler()
products_clean[['Price_Scaled', 'Rating_Scaled', 'Stock_Scaled']] = scaler_products.fit_transform(
    products_clean[['Price', 'Rating', 'Stock']]
)

print("✨ RAG System - Product Data (AFTER cleaning):")
print(products_clean)
print("\n✅ Ready for RAG system!")
print("\n📊 This clean data can now be:")
print("  1. Converted to embeddings (vector representations)")
print("  2. Stored in a vector database")
print("  3. Retrieved when user asks questions")
print("  4. Used by LLM to generate accurate answers!")
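
Embedding models usually take text as input, so a common intermediate step is to render each cleaned row as a short sentence. The sketch below shows one possible way to do that; `row_to_document` and the sample `products` frame are hypothetical, and no specific embedding library or vector database is assumed:

```python
import pandas as pd

# Hypothetical, already-cleaned product rows (values made up for illustration)
products = pd.DataFrame({
    "Product": ["Laptop", "Book"],
    "Category": ["Electronics", "Books"],
    "Price": [1200.0, 25.0],
    "Rating": [4.5, 4.8],
})

def row_to_document(row):
    """Render one product row as a sentence an embedding model could consume."""
    return (
        f"{row['Product']} is a product in the {row['Category']} category, "
        f"priced at ${row['Price']:.2f} with a rating of {row['Rating']:.1f} out of 5."
    )

# One text document per product, ready to be embedded and indexed
documents = products.apply(row_to_document, axis=1).tolist()

for doc in documents:
    print(doc)
```

Filling missing values first matters here too: a row with a missing price would otherwise produce a sentence containing "nan".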

## 🎯 YOUR TURN: Interactive Exercise

**Challenge:** Prepare this messy employee dataset for ML!

**Tasks:**
1. Handle missing values
2. Encode the 'Department' column
3. Split into train/test sets
4. Scale 'Age' and 'Salary' (fit the scaler on the training set only)

In [None]:
# Employee dataset
employee_data = {
    'Name': ['John', 'Sarah', 'Mike', 'Emily', 'David'],
    'Age': [28, np.nan, 35, 42, 31],
    'Department': ['Sales', 'IT', np.nan, 'Sales', 'IT'],
    'Salary': [50000, 70000, 60000, np.nan, 65000],
    'Promoted': [0, 1, 0, 1, 0]  # Target variable
}

employee_df = pd.DataFrame(employee_data)

print("üìä Employee Dataset (MESSY):")
print(employee_df)
print("\nüéØ YOUR TASK: Clean this data!")

In [None]:
# YOUR CODE HERE!

# Step 1: Handle missing Age (use mean)
# TODO: Create imputer and fill missing Age

# Step 2: Handle missing Department (use most_frequent)
# TODO: Create imputer and fill missing Department

# Step 3: Handle missing Salary (use median)
# TODO: Create imputer and fill missing Salary

# Step 4: Encode Department
# TODO: Use LabelEncoder

# Step 5: Create X and y, then split
# TODO: train_test_split

# Step 6: Scale Age and Salary (fit on the training set only)
# TODO: Use StandardScaler

print("Complete the TODOs above!")

### ✅ Solution (Try on your own first!)

In [None]:
# SOLUTION
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Create copy
employee_clean = employee_df.copy()

# Step 1: Fill missing Age
age_imputer = SimpleImputer(strategy='mean')
employee_clean['Age'] = age_imputer.fit_transform(employee_clean[['Age']])

# Step 2: Fill missing Department
dept_imputer = SimpleImputer(strategy='most_frequent')
employee_clean['Department'] = dept_imputer.fit_transform(
    employee_clean[['Department']]
).ravel()

# Step 3: Fill missing Salary
salary_imputer = SimpleImputer(strategy='median')
employee_clean['Salary'] = salary_imputer.fit_transform(employee_clean[['Salary']])

# Step 4: Encode Department
dept_encoder = LabelEncoder()
employee_clean['Department_Encoded'] = dept_encoder.fit_transform(
    employee_clean['Department']
)

print("✨ Cleaned Employee Data:")
print(employee_clean)

# Split BEFORE scaling, following the golden rule from the train/test split step
X = employee_clean[['Age', 'Salary', 'Department_Encoded']].values
y = employee_clean['Promoted'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale Age and Salary (columns 0 and 1), fitting on the training rows only
scaler_emp = StandardScaler()
X_train[:, :2] = scaler_emp.fit_transform(X_train[:, :2])
X_test[:, :2] = scaler_emp.transform(X_test[:, :2])

print("\n✅ Data ready for ML!")
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

## 📋 Data Preprocessing Checklist

**Before training ANY ML model:**

**1. Explore Data** 🔍
- [ ] Check data types (`df.info()`)
- [ ] Look for missing values (`df.isnull().sum()`)
- [ ] Check statistics (`df.describe()`)

**2. Handle Missing Values** 🛠️
- [ ] Numerical: Use mean/median
- [ ] Categorical: Use most_frequent
- [ ] Or drop rows (if very few missing)

**3. Encode Categories** 🎨
- [ ] Use OrdinalEncoder (with an explicit category order) for ordinal features
- [ ] Use OneHotEncoder for nominal features
- [ ] Keep LabelEncoder for the target column (it numbers classes alphabetically)

**4. Scale Features** 📏
- [ ] StandardScaler for most algorithms
- [ ] MinMaxScaler when a bounded range (0-1) is needed, e.g. some neural networks
- [ ] Skip for tree-based models

**5. Split Data** 🔀
- [ ] Split BEFORE scaling
- [ ] Fit scaler on training only
- [ ] Transform both sets

**6. Verify** ✅
- [ ] No missing values
- [ ] All numerical features
- [ ] Similar scales
- [ ] Ready for ML!
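
The checklist can also be expressed as a single preprocessing object. The sketch below is one possible arrangement using scikit-learn's `ColumnTransformer`, on made-up data in the style of this lesson; the column names and the choice of median imputation and one-hot encoding are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data in the style of this lesson's examples
df = pd.DataFrame({
    "Age": [25, 30, np.nan, 28, 35, 29],
    "Salary": [50000, 60000, 55000, np.nan, 70000, 58000],
    "Country": ["USA", "UK", "USA", "Canada", "UK", np.nan],
    "Purchased": [0, 1, 0, 1, 1, 0],
})
X = df[["Age", "Salary", "Country"]]
y = df["Purchased"]

# Checklist item 5: split before fitting anything
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, random_state=42
)

# Items 2 and 4 for numeric columns: impute, then scale
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Items 2 and 3 for categorical columns: impute, then one-hot encode
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [("num", numeric, ["Age", "Salary"]), ("cat", categorical, ["Country"])],
    sparse_threshold=0,  # always return a dense array
)

X_train_ready = preprocess.fit_transform(X_train)  # fit on training rows only
X_test_ready = preprocess.transform(X_test)

# Item 6: verify
print(X_train_ready.shape, X_test_ready.shape)
print(np.isnan(X_train_ready).any(), np.isnan(X_test_ready).any())
```

`handle_unknown="ignore"` means a country that appears only in the test rows is encoded as all zeros instead of raising an error.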

## 🎉 Congratulations!

**You just learned:**
- ✅ Why data preprocessing is crucial (it is often the bulk of ML work!)
- ✅ How to handle missing values (imputation)
- ✅ How to encode categorical variables
- ✅ Feature scaling (Standardization vs Normalization)
- ✅ Proper train/test split workflow
- ✅ How to prepare data for RAG systems

**🎯 Practice Exercise (Do this before Day 3!):**

Download a real dataset from Kaggle and clean it:
1. Titanic dataset (classification)
2. House prices (regression)

Practice the entire preprocessing pipeline!

---

**📚 Next Lesson:** Day 3 - Building Your First ML Model (Linear Regression)

**💬 Key Takeaway:**

*"Garbage in, garbage out" - Even the best ML algorithm can't fix bad data. Spend time on preprocessing, and your models will thank you!* 🚀

---

**🔗 Connections to Modern AI:**
- **RAG Systems**: Clean document data → Better retrieval → Accurate answers
- **LLMs**: Massive text preprocessing before training
- **Multimodal AI**: Normalize images, scale audio, encode text
- **Agentic AI**: Clean sensor data for decision making