# 🔧 Week 2, Day 3: Data Manipulation & Cleaning

**🎯 Goal:** Master data cleaning and transformation - the real AI work!

**⏱️ Time:** 60-90 minutes

**🌟 Why This Matters for AI:**
- **Data preparation is often the bulk of AI work** (figures around 80% are commonly quoted) - not modeling!
- Dirty data = bad AI models (garbage in, garbage out)
- Feature engineering often beats fancy algorithms
- These skills separate junior from senior data scientists

---

## 🔥 2024-2025 AI Trend Alert!

**RAG (Retrieval-Augmented Generation)** powers many modern AI products:
- Combines LLMs with your company's data
- **Pandas can clean and process the documents that feed a knowledge base!**
- Enterprise AI assistants commonly rely on this pattern

**Transformer Models** (BERT, GPT, LLaMA) require:
- Clean, tokenized text data
- **Pandas is handy for inspecting and filtering training examples** (very large corpora usually move to distributed tools)
- Careful preprocessing can noticeably improve accuracy; how much depends on the task and the data

**Foundation Models** trained on trillions of tokens:
- Data quality > Data quantity
- **Filtering and deduplication are core steps**, and pandas lets you practice them at small scale!

**The techniques in this lesson (cleaning, filtering, deduplication, feature engineering) are the same building blocks that large training-data pipelines rely on!** 🚀

---

## 📦 Setup

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

print("Pandas version:", pd.__version__)
print("✅ Ready to clean some data!")

## 🧹 Part 1: Data Cleaning - Handling Missing Values

**Real AI Problem:** Missing data is EVERYWHERE!
- User didn't fill out a form field
- Sensor malfunction
- API timeout
- Data corruption

**You MUST handle it correctly or your AI will fail!**

In [None]:
# Create a messy AI training dataset (realistic!)
np.random.seed(42)

messy_data = pd.DataFrame({
    'user_id': range(1, 21),
    'age': [25, np.nan, 35, 28, np.nan, 45, 32, 29, np.nan, 38,
            42, 31, np.nan, 27, 36, 33, np.nan, 41, 30, 34],
    'income': [50000, 60000, np.nan, 55000, 48000, np.nan, 62000, 58000, 51000, np.nan,
               70000, np.nan, 54000, 56000, 59000, 61000, 53000, np.nan, 57000, 63000],
    'purchase_amount': [100, 250, 150, np.nan, 200, 180, np.nan, 220, 190, 210,
                        np.nan, 170, 160, 240, np.nan, 200, 185, 195, 230, np.nan],
    'email': ['user1@email.com', None, 'user3@email.com', 'user4@email.com', None,
              'user6@email.com', None, 'user8@email.com', 'user9@email.com', None,
              'user11@email.com', 'user12@email.com', None, 'user14@email.com', 'user15@email.com',
              None, 'user17@email.com', 'user18@email.com', None, 'user20@email.com']
})

print("🔍 Messy Dataset (like real AI data!):")
print(messy_data)
print(f"\nShape: {messy_data.shape}")

### 🔍 Step 1: Detect Missing Values

In [None]:
# isnull() or isna() - Find missing values
print("❓ Missing Values (True = Missing):")
print(messy_data.isnull().head(10))

In [None]:
# Count missing values per column
print("📊 Missing Values Summary:")
missing_count = messy_data.isnull().sum()
print(missing_count)

print("\n📊 Percentage Missing:")
missing_pct = (messy_data.isnull().sum() / len(messy_data) * 100).round(2)
print(missing_pct)

In [None]:
# Total missing values in entire dataset
total_missing = messy_data.isnull().sum().sum()
total_cells = messy_data.size
print(f"🔍 Total missing: {total_missing} out of {total_cells} cells ({total_missing/total_cells*100:.1f}%)")

### 🛠️ Step 2: Handle Missing Values

**4 Main Strategy Families** (the cells below walk through seven variations of them):
1. **Drop** - Remove rows/columns with missing values
2. **Fill with constant** - Use 0, "Unknown", etc.
3. **Fill with statistics** - Mean, median, mode
4. **Forward/backward fill** - Use previous/next value
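
Forward/backward fill is the one family the cells below do not demonstrate, so here is a minimal sketch. The daily sensor readings are made-up values for illustration only; `ffill()` and `bfill()` are the pandas methods doing the work.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sensor readings with gaps (illustrative values only)
readings = pd.Series(
    [20.1, np.nan, np.nan, 21.4, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = readings.ffill()    # carry the last seen value forward
backward = readings.bfill()   # pull the next seen value backward

print(pd.DataFrame({"raw": readings, "ffill": forward, "bfill": backward}))
```

Notice that the last row stays missing after `bfill()` because there is no later value to pull from. Order matters too, so sort time series by timestamp before filling.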

In [None]:
# Strategy 1: Drop rows with ANY missing values
clean_dropany = messy_data.dropna()
print(f"📊 Original: {len(messy_data)} rows")
print(f"📊 After dropna(): {len(clean_dropany)} rows")
print(f"❌ Lost {len(messy_data) - len(clean_dropany)} rows ({(len(messy_data) - len(clean_dropany))/len(messy_data)*100:.1f}%)")
print("\n⚠️ Too aggressive! Lost most of our data!")

In [None]:
# Strategy 2: Drop rows where ALL values are missing
clean_dropall = messy_data.dropna(how='all')
print(f"📊 Original: {len(messy_data)} rows")
print(f"📊 After dropna(how='all'): {len(clean_dropall)} rows")
print("✅ Only drops completely empty rows")

In [None]:
# Strategy 3: Drop rows with missing values in SPECIFIC columns
# Keep only rows where both age and income are present
clean_subset = messy_data.dropna(subset=['age', 'income'])
print(f"📊 Original: {len(messy_data)} rows")
print(f"📊 After dropna(subset=['age', 'income']): {len(clean_subset)} rows")
print("✅ More targeted approach!")
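
Another targeted option sits between "drop any" and "drop all": keep rows that have at least a minimum number of non-missing values. This is a small sketch using the `thresh` parameter of `dropna()`; the four-column frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame: 4 columns, varying numbers of gaps per row
df = pd.DataFrame({
    "age":    [25, np.nan, np.nan, 40],
    "income": [50000, 60000, np.nan, np.nan],
    "spend":  [100, np.nan, np.nan, 180],
    "email":  ["a@x.com", "b@x.com", None, "d@x.com"],
})

# Keep rows that have at least 3 non-missing values out of 4 columns
mostly_complete = df.dropna(thresh=3)
print(mostly_complete)
```

Which threshold makes sense depends on how many columns your model really needs, so treat the `3` here as a placeholder.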

In [None]:
# Strategy 4: Fill with constant value
df_filled = messy_data.copy()
df_filled['email'] = df_filled['email'].fillna('no_email@unknown.com')
print("📧 Email column after filling:")
print(df_filled['email'].tail(10))

In [None]:
# Strategy 5: Fill with MEAN (for numerical data)
df_filled = messy_data.copy()
age_mean = df_filled['age'].mean()
df_filled['age'] = df_filled['age'].fillna(age_mean)

print(f"📊 Mean age: {age_mean:.2f}")
print("\n📊 Age column after filling with mean:")
print(df_filled['age'].head(10))
print(f"\n✅ Missing values in age: {df_filled['age'].isnull().sum()}")

In [None]:
# Strategy 6: Fill with MEDIAN (better for outliers!)
df_filled = messy_data.copy()
income_median = df_filled['income'].median()
df_filled['income'] = df_filled['income'].fillna(income_median)

print(f"📊 Median income: ${income_median:,.0f}")
print(f"✅ Missing values in income: {df_filled['income'].isnull().sum()}")
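
Why is the median "better for outliers"? A quick sketch with made-up incomes (five typical values plus one extreme one) shows how far a single outlier can drag the mean while the median barely moves.

```python
import pandas as pd

# Hypothetical incomes: five typical values plus one extreme outlier
incomes = pd.Series([48_000, 52_000, 55_000, 58_000, 61_000, 2_000_000])

mean_income = incomes.mean()      # pulled far upward by the outlier
median_income = incomes.median()  # stays near the typical values

print(f"Mean:   ${mean_income:,.0f}")
print(f"Median: ${median_income:,.0f}")
```

Filling gaps with the mean here would assign every missing customer an income that nobody typical in the data actually has.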

In [None]:
# Strategy 7: Fill ALL numeric columns at once!
df_clean = messy_data.copy()

# Fill each column with its median
for col in ['age', 'income', 'purchase_amount']:
    df_clean[col] = df_clean[col].fillna(df_clean[col].median())

# Fill email with placeholder
df_clean['email'] = df_clean['email'].fillna('unknown@email.com')

print("🎉 Cleaned Dataset:")
print(df_clean)
print(f"\n✅ Missing values: {df_clean.isnull().sum().sum()}")

**🧠 AI Best Practices:**

**Numerical features:**
- Mean: when the data is roughly normally distributed
- Median: when you have outliers (SAFER!)

**Categorical features:**
- An "Unknown" or "Missing" category
- Mode (most common value)

**Time series:**
- Forward fill (use the previous value)
- Backward fill (use the next value)
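
The two categorical options can be sketched side by side. The `plan` column and its values are hypothetical. Note that `mode()` returns a Series (ties are possible), so the sketch simply takes the first entry.

```python
import pandas as pd

# Hypothetical categorical column with gaps
plans = pd.Series(["basic", "pro", None, "basic", None, "enterprise"], name="plan")

# Option A: keep "missing" as its own category
as_unknown = plans.fillna("Unknown")

# Option B: fill with the most common value
most_common = plans.mode().iloc[0]
as_mode = plans.fillna(most_common)

print(as_unknown.value_counts())
print(as_mode.value_counts())
```

Option A keeps the information that a value was missing, which can itself be a useful signal. Option B hides it, so it is usually the riskier choice when many values are missing.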

## 🔄 Part 2: Data Transformation - GroupBy & Aggregation

**Think SQL GROUP BY, but more powerful!**

In [None]:
# Create realistic e-commerce dataset
np.random.seed(42)

sales_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100, freq='D'),
    'product': np.random.choice(['AI Course', 'ML Book', 'GPU Cloud', 'ChatGPT Plus'], 100),
    'category': np.random.choice(['Education', 'Hardware', 'Software'], 100),
    'region': np.random.choice(['North America', 'Europe', 'Asia'], 100),
    'sales': np.random.randint(100, 1000, 100),
    'quantity': np.random.randint(1, 20, 100)
})

print("🛒 E-commerce Sales Data:")
print(sales_data.head(10))

In [None]:
# GroupBy single column
print("📊 Total Sales by Product:")
product_sales = sales_data.groupby('product')['sales'].sum().sort_values(ascending=False)
print(product_sales)

In [None]:
# Multiple aggregations at once
print("📊 Sales Statistics by Region:")
region_stats = sales_data.groupby('region')['sales'].agg([
    'count',   # Number of transactions
    'sum',     # Total sales
    'mean',    # Average sale
    'min',     # Minimum sale
    'max'      # Maximum sale
]).round(2)
print(region_stats)

In [None]:
# GroupBy multiple columns
print("📊 Sales by Region AND Product:")
region_product = sales_data.groupby(['region', 'product'])['sales'].sum()
print(region_product)

In [None]:
# Custom aggregations
print("📊 Custom Aggregation by Category:")
category_stats = sales_data.groupby('category').agg({
    'sales': ['sum', 'mean', 'max'],
    'quantity': ['sum', 'mean'],
    'product': 'count'  # Count of transactions
}).round(2)
print(category_stats)
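
The dictionary form above produces two-level column names such as `('sales', 'sum')`, which can be awkward to work with later. As an alternative sketch, pandas' named aggregation gives flat column names of your choosing. The tiny `sales` frame and the output names are made up for illustration.

```python
import pandas as pd

# Tiny hypothetical stand-in for sales_data
sales = pd.DataFrame({
    "category": ["Education", "Education", "Software", "Software", "Hardware"],
    "sales":    [300, 500, 200, 400, 900],
    "quantity": [3, 5, 2, 8, 1],
})

# Named aggregation: output_column=(input_column, function)
summary = sales.groupby("category").agg(
    total_sales=("sales", "sum"),
    avg_sale=("sales", "mean"),
    units=("quantity", "sum"),
    transactions=("sales", "count"),
)
print(summary)
```

Flat names like `total_sales` are easier to reference in later feature-engineering steps than tuple-style column labels.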

In [None]:
# Add calculated column: Revenue per unit
sales_data['revenue_per_unit'] = sales_data['sales'] / sales_data['quantity']

print("💰 Sales Data with Revenue per Unit:")
print(sales_data[['product', 'sales', 'quantity', 'revenue_per_unit']].head(10))

## 🔀 Part 3: Pivot Tables - Excel-style Analysis

In [None]:
# Create pivot table: Region vs Product
print("📊 Pivot Table - Total Sales by Region and Product:")
pivot = sales_data.pivot_table(
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    fill_value=0
)
print(pivot)
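
If pivot tables feel like magic, it may help to see that a simple one is essentially a GroupBy followed by a reshape. The sketch below uses a small hypothetical `sales` frame and compares `pivot_table` with `groupby(...).sum().unstack()`; both should give the same numbers.

```python
import pandas as pd

# Small hypothetical sales records
sales = pd.DataFrame({
    "region":  ["Asia", "Asia", "Europe", "Europe", "Asia"],
    "product": ["ML Book", "AI Course", "ML Book", "ML Book", "ML Book"],
    "sales":   [100, 300, 150, 250, 50],
})

# Route 1: pivot_table
via_pivot = sales.pivot_table(
    values="sales", index="region", columns="product",
    aggfunc="sum", fill_value=0,
)

# Route 2: group by both keys, sum, then move 'product' into the columns
via_groupby = (
    sales.groupby(["region", "product"])["sales"]
    .sum()
    .unstack(fill_value=0)
)

print(via_pivot)
print(via_groupby)
```

Use whichever reads better in context: `pivot_table` for report-style output, GroupBy when the result feeds further computation.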

In [None]:
# Add row and column totals
print("📊 Pivot Table with Margins (Totals):")
pivot_margins = sales_data.pivot_table(
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    fill_value=0,
    margins=True,
    margins_name='Total'
)
print(pivot_margins)

In [None]:
# Multiple aggregations in pivot table
print("📊 Advanced Pivot - Multiple Metrics:")
pivot_advanced = sales_data.pivot_table(
    values='sales',
    index='region',
    columns='product',
    aggfunc=['sum', 'mean', 'count'],
    fill_value=0
).round(2)
print(pivot_advanced)

## 🔗 Part 4: Merging Datasets - Combining Multiple Data Sources

**Real AI Scenario:**
- User data in one table
- Purchase history in another
- Product details in a third
- **You need to combine them!**

In [None]:
# Create related datasets (like real databases)

# Table 1: Users
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'country': ['USA', 'UK', 'Canada', 'USA', 'Germany']
})

# Table 2: Purchases
purchases = pd.DataFrame({
    'purchase_id': [101, 102, 103, 104, 105, 106],
    'user_id': [1, 2, 1, 3, 2, 6],  # Note: user_id 6 doesn't exist!
    'product': ['AI Course', 'ML Book', 'GPU Cloud', 'AI Course', 'ChatGPT Plus', 'ML Book'],
    'amount': [299, 49, 150, 299, 20, 49]
})

# Table 3: Product Details
products = pd.DataFrame({
    'product': ['AI Course', 'ML Book', 'GPU Cloud', 'ChatGPT Plus'],
    'category': ['Education', 'Education', 'Hardware', 'Software'],
    'rating': [4.8, 4.5, 4.9, 4.7]
})

print("👥 Users:")
print(users)
print("\n🛒 Purchases:")
print(purchases)
print("\n📦 Products:")
print(products)

### 🔗 Join Types Explained:

- **INNER JOIN** - Only matching rows (most common)
- **LEFT JOIN** - All from left table + matches from right
- **RIGHT JOIN** - All from right table + matches from left
- **OUTER JOIN** - All rows from both tables
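
When you are unsure which join you need, one way to audit the match is an outer join with `indicator=True`, which labels where each row came from. The sketch below uses small made-up `users` and `purchases` frames; `validate` is an optional safety check on key uniqueness.

```python
import pandas as pd

# Hypothetical mini-tables (user_id 6 has no matching user)
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
purchases = pd.DataFrame({"user_id": [1, 1, 2, 6], "amount": [299, 150, 49, 49]})

audit = pd.merge(
    users, purchases,
    on="user_id",
    how="outer",
    indicator=True,          # adds a `_merge` column: left_only / right_only / both
    validate="one_to_many",  # raises MergeError if user_id repeats in `users`
)
print(audit)
print(audit["_merge"].value_counts())
```

Rows marked `left_only` are users without purchases and rows marked `right_only` are orphaned purchases. Counting them first tells you how much data an inner join would silently drop.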

In [None]:
# INNER JOIN - Only users who made purchases
print("🔗 INNER JOIN (Users + Purchases):")
inner = pd.merge(users, purchases, on='user_id', how='inner')
print(inner)
print(f"\n📊 Rows: {len(inner)} (only matching user_ids)")

In [None]:
# LEFT JOIN - All users, even without purchases
print("🔗 LEFT JOIN (All Users + Purchases):")
left = pd.merge(users, purchases, on='user_id', how='left')
print(left)
print(f"\n📊 Rows: {len(left)} (all users kept)")
print("\n⚠️ Diana and Eve have NaN - they didn't purchase anything!")

In [None]:
# RIGHT JOIN - All purchases, even orphaned ones
print("🔗 RIGHT JOIN (Users + All Purchases):")
right = pd.merge(users, purchases, on='user_id', how='right')
print(right)
print(f"\n📊 Rows: {len(right)} (all purchases kept)")
print("\n⚠️ Last purchase has NaN user info - user_id 6 doesn't exist!")

In [None]:
# OUTER JOIN - Everything!
print("🔗 OUTER JOIN (All Users + All Purchases):")
outer = pd.merge(users, purchases, on='user_id', how='outer')
print(outer)
print(f"\n📊 Rows: {len(outer)} (everything kept)")

In [None]:
# Chain multiple merges - Real AI workflow!
print("🔗 Multi-Table Join (Users + Purchases + Products):")

# Step 1: Join users and purchases
user_purchases = pd.merge(users, purchases, on='user_id', how='inner')

# Step 2: Join with product details
complete_data = pd.merge(user_purchases, products, on='product', how='left')

print(complete_data)
print("\n✅ Complete dataset ready for AI modeling!")

## 🛠️ Part 5: Feature Engineering for Machine Learning

**Feature engineering = Creating new features from existing data**

**This is where AI magic happens!** 🪄

In [None]:
# Create AI customer dataset
np.random.seed(42)

customers = pd.DataFrame({
    'customer_id': range(1, 101),
    'signup_date': pd.date_range('2023-01-01', periods=100, freq='3D'),
    'total_purchases': np.random.randint(1, 50, 100),
    'total_spent': np.random.randint(100, 5000, 100),
    'support_tickets': np.random.randint(0, 10, 100),
    'days_since_last_purchase': np.random.randint(1, 180, 100),
    'email_opened': np.random.randint(0, 100, 100),
    'email_sent': np.random.randint(50, 150, 100)
})

print("👥 Customer Dataset:")
print(customers.head(10))

In [None]:
# Feature 1: Average Order Value (AOV)
customers['avg_order_value'] = (customers['total_spent'] / customers['total_purchases']).round(2)

print("💰 Feature: Average Order Value")
print(customers[['customer_id', 'total_purchases', 'total_spent', 'avg_order_value']].head())

In [None]:
# Feature 2: Customer Lifetime (days)
# A fixed reference date keeps the output reproducible; use pd.Timestamp.today() for live data
customers['customer_lifetime_days'] = (pd.Timestamp('2024-11-17') - customers['signup_date']).dt.days

print("📅 Feature: Customer Lifetime")
print(customers[['customer_id', 'signup_date', 'customer_lifetime_days']].head())

In [None]:
# Feature 3: Email Engagement Rate
customers['email_open_rate'] = (customers['email_opened'] / customers['email_sent'] * 100).round(2)

print("📧 Feature: Email Engagement Rate")
print(customers[['customer_id', 'email_sent', 'email_opened', 'email_open_rate']].head())

In [None]:
# Feature 4: Purchase Frequency (purchases per month)
customers['purchase_frequency'] = (customers['total_purchases'] / (customers['customer_lifetime_days'] / 30)).round(2)

print("🛒 Feature: Purchase Frequency (per month)")
print(customers[['customer_id', 'total_purchases', 'customer_lifetime_days', 'purchase_frequency']].head())
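
All of the ratio features above divide one column by another. In this synthetic dataset the denominators are never zero, but real data often contains customers with zero purchases or zero emails sent. Here is a hedged sketch of one way to guard a ratio; the frame is hypothetical, and filling with `0.0` is a modeling choice, not the only option.

```python
import pandas as pd

# Hypothetical customers, including one with zero purchases
df = pd.DataFrame({
    "total_spent":     [500, 0, 1200],
    "total_purchases": [5, 0, 4],
})

# Mask zero denominators as NaN so the ratio becomes NaN instead of inf
denominator = df["total_purchases"].where(df["total_purchases"] > 0)
df["avg_order_value"] = (df["total_spent"] / denominator).round(2)

# Decide explicitly how to treat "no purchases yet"; here: 0.0
df["avg_order_value"] = df["avg_order_value"].fillna(0.0)
print(df)
```

Without the guard, pandas would produce `inf` or `NaN` values that many ML libraries reject at training time.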

In [None]:
# Feature 5: Customer Segment (based on spending)
def categorize_customer(spent):
    if spent < 1000:
        return 'Low-Value'
    elif spent < 3000:
        return 'Medium-Value'
    else:
        return 'High-Value'

customers['customer_segment'] = customers['total_spent'].apply(categorize_customer)

print("🏆 Feature: Customer Segment")
print(customers['customer_segment'].value_counts())
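
The `apply()` approach is easy to read, but it calls a Python function once per row. For simple threshold rules like these, `pd.cut` is a vectorized alternative. This sketch reuses the same thresholds on a few made-up spending values; `right=False` makes each bin include its lower edge, matching the `<` comparisons in `categorize_customer`.

```python
import numpy as np
import pandas as pd

# Hypothetical spending values chosen to sit on the bin edges
spent = pd.Series([250, 999, 1000, 2999, 3000, 4800], name="total_spent")

segments = pd.cut(
    spent,
    bins=[-np.inf, 1000, 3000, np.inf],
    labels=["Low-Value", "Medium-Value", "High-Value"],
    right=False,  # intervals are [low, high), so 1000 falls in Medium-Value
)
print(pd.DataFrame({"total_spent": spent, "segment": segments}))
```

The result is an ordered categorical column, which also sorts in a sensible Low/Medium/High order.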

In [None]:
# Feature 6: Churn Risk (Binary classification target!)
# If customer hasn't purchased in 90 days = High churn risk
# Note: this label is computed from days_since_last_purchase, so leave that
# column out of the model inputs to avoid leaking the answer
customers['churn_risk'] = (customers['days_since_last_purchase'] > 90).astype(int)

print("⚠️ Feature: Churn Risk (TARGET for ML model!)")
print(customers['churn_risk'].value_counts())
print(f"\nChurn rate: {customers['churn_risk'].mean() * 100:.1f}%")

In [None]:
# Feature 7: RFM Score (Recency, Frequency, Monetary)
# Industry-standard for customer analysis!

# Recency score (lower days = better)
customers['recency_score'] = pd.qcut(customers['days_since_last_purchase'], 
                                      q=5, labels=[5, 4, 3, 2, 1])

# Frequency score
customers['frequency_score'] = pd.qcut(customers['total_purchases'].rank(method='first'), 
                                        q=5, labels=[1, 2, 3, 4, 5])

# Monetary score
customers['monetary_score'] = pd.qcut(customers['total_spent'].rank(method='first'), 
                                       q=5, labels=[1, 2, 3, 4, 5])

# Combined RFM score
customers['rfm_score'] = (customers['recency_score'].astype(int) + 
                          customers['frequency_score'].astype(int) + 
                          customers['monetary_score'].astype(int))

print("🎯 RFM Scores (a widely used customer-segmentation heuristic):")
print(customers[['customer_id', 'recency_score', 'frequency_score', 'monetary_score', 'rfm_score']].head(10))

In [None]:
# Final dataset ready for ML!
print("🎉 Final Feature-Engineered Dataset:")
print(customers.head())
print(f"\n📊 Shape: {customers.shape}")
print(f"📊 Features: {customers.shape[1]} columns")
print(f"📊 Original: 8 columns → Engineered: {customers.shape[1]} columns!")
print("\n✅ Ready for machine learning!")

## 🚀 Part 6: Complete AI Preprocessing Pipeline

**Let's build a real AI data pipeline from scratch!**

In [None]:
# Create messy, realistic AI training dataset
np.random.seed(100)

raw_data = pd.DataFrame({
    'user_id': range(1, 201),
    'age': np.random.randint(18, 70, 200),
    'income': np.random.randint(30000, 150000, 200),
    'credit_score': np.random.randint(300, 850, 200),
    'loan_amount': np.random.randint(5000, 100000, 200),
    'employment_years': np.random.randint(0, 40, 200),
    'previous_defaults': np.random.randint(0, 5, 200),
    'approved': np.random.choice([0, 1], 200, p=[0.3, 0.7])  # TARGET
})

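# Introduce missing values (realistic!)
# Cast to float first: integer columns cannot hold NaN, and newer pandas
# versions warn (or raise) when NaN is written into an integer column
cols_with_gaps = ['income', 'credit_score', 'employment_years']
raw_data[cols_with_gaps] = raw_data[cols_with_gaps].astype(float)

# replace=False picks distinct rows, so exactly 20 / 15 / 10 values go missing
raw_data.loc[np.random.choice(raw_data.index, 20, replace=False), 'income'] = np.nan
raw_data.loc[np.random.choice(raw_data.index, 15, replace=False), 'credit_score'] = np.nan
raw_data.loc[np.random.choice(raw_data.index, 10, replace=False), 'employment_years'] = np.nan

print("📊 RAW DATA (Messy!):")
print(raw_data.head(10))
print(f"\nMissing values:\n{raw_data.isnull().sum()}")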

In [None]:
# STEP 1: Handle missing values
print("🧹 STEP 1: Cleaning missing values...")

df_clean = raw_data.copy()
df_clean['income'] = df_clean['income'].fillna(df_clean['income'].median())
df_clean['credit_score'] = df_clean['credit_score'].fillna(df_clean['credit_score'].median())
df_clean['employment_years'] = df_clean['employment_years'].fillna(df_clean['employment_years'].median())

print(f"✅ Missing values: {df_clean.isnull().sum().sum()}")

In [None]:
# STEP 2: Feature Engineering
print("\n🛠️ STEP 2: Engineering new features...")

# Debt-to-Income Ratio
df_clean['debt_to_income'] = (df_clean['loan_amount'] / df_clean['income']).round(3)

# Credit score category
def categorize_credit(score):
    if score < 580:
        return 'Poor'
    elif score < 670:
        return 'Fair'
    elif score < 740:
        return 'Good'
    elif score < 800:
        return 'Very Good'
    else:
        return 'Excellent'

df_clean['credit_category'] = df_clean['credit_score'].apply(categorize_credit)

# Risk score (custom formula)
df_clean['risk_score'] = (
    (df_clean['previous_defaults'] * 2) + 
    (df_clean['debt_to_income'] * 10) - 
    (df_clean['credit_score'] / 100)
).round(2)

print("✅ New features created: 3")
print(f"✅ Total features: {df_clean.shape[1]}")

In [None]:
# STEP 3: Encode categorical features
print("\n🔢 STEP 3: Encoding categorical features...")

# One-hot encoding for credit_category (dtype=int gives 0/1 instead of True/False)
credit_dummies = pd.get_dummies(df_clean['credit_category'], prefix='credit', dtype=int)
df_clean = pd.concat([df_clean, credit_dummies], axis=1)

print(f"✅ One-hot encoded: {len(credit_dummies.columns)} dummy variables")
print(f"✅ Total features now: {df_clean.shape[1]}")
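
One pitfall with one-hot encoding: `get_dummies` only creates columns for categories that actually appear, so a new batch of data can end up with fewer columns than the model was trained on. A possible safeguard, sketched here with a hypothetical batch, is to declare the full category list up front with `pd.CategoricalDtype`.

```python
import pandas as pd

# The five labels produced by categorize_credit, in a fixed order
ALL_CREDIT_CATEGORIES = ["Poor", "Fair", "Good", "Very Good", "Excellent"]

# Hypothetical batch in which "Poor" and "Excellent" never occur
batch = pd.Series(["Good", "Fair", "Very Good", "Good"], name="credit_category")

# Declaring the categories makes get_dummies emit a column for each one
typed = batch.astype(pd.CategoricalDtype(categories=ALL_CREDIT_CATEGORIES))
dummies = pd.get_dummies(typed, prefix="credit", dtype=int)
print(dummies)
```

The unseen categories show up as all-zero columns, so the feature matrix keeps the same shape from batch to batch.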

In [None]:
# STEP 4: Normalize numerical features (0-1 scale)
print("\n📏 STEP 4: Normalizing features...")

features_to_normalize = ['age', 'income', 'credit_score', 'loan_amount', 'employment_years']

for col in features_to_normalize:
    min_val = df_clean[col].min()
    max_val = df_clean[col].max()
    df_clean[f'{col}_normalized'] = (df_clean[col] - min_val) / (max_val - min_val)

print(f"✅ Normalized {len(features_to_normalize)} features")
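
The loop above computes min and max over the whole dataset, which is fine for a demo. Once you split into training and test sets, a common practice is to learn the scaling parameters from the training rows only and reuse them on the test rows. Here is a minimal sketch with made-up incomes; `min_max_scale` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

# Hypothetical income column, already split into train and test rows
train = pd.DataFrame({"income": [30_000, 60_000, 90_000]})
test = pd.DataFrame({"income": [45_000, 120_000]})

# Learn min and max from the training rows only
lo, hi = train["income"].min(), train["income"].max()

def min_max_scale(values: pd.Series) -> pd.Series:
    """Scale with the training min/max; unseen values may fall outside [0, 1]."""
    return (values - lo) / (hi - lo)

train["income_normalized"] = min_max_scale(train["income"])
test["income_normalized"] = min_max_scale(test["income"])
print(train)
print(test)
```

The test value of 120,000 scales to 1.5, above the training range. That is expected, and it is a more honest picture of what the model will face on new data.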

In [None]:
# STEP 5: Final dataset preparation
print("\n🎯 STEP 5: Preparing final ML dataset...")

# Select features for ML model
feature_columns = [
    'age_normalized', 'income_normalized', 'credit_score_normalized',
    'loan_amount_normalized', 'employment_years_normalized',
    'debt_to_income', 'risk_score', 'previous_defaults',
    *credit_dummies.columns,  # one-hot columns from Step 3; avoids a KeyError if a category never occurs
]

X = df_clean[feature_columns]  # Features
y = df_clean['approved']        # Target

print("✅ PIPELINE COMPLETE!")
print(f"\n📊 Feature matrix (X): {X.shape}")
print(f"📊 Target vector (y): {y.shape}")
print("\n🎉 Dataset ready for machine learning!")

In [None]:
# Preview final dataset
print("👀 Final ML Dataset Preview:")
print(X.head())
print("\n📊 Target Distribution:")
print(y.value_counts())
print(f"\nApproval rate: {y.mean() * 100:.1f}%")

## 🎯 Practice Exercise: Build Your Own Pipeline

**Scenario:** You're building a customer churn prediction model!

**Your Task:**
1. Load the data below
2. Handle missing values
3. Create 3 new features
4. Prepare for ML modeling

In [None]:
# Exercise dataset
np.random.seed(42)

churn_data = pd.DataFrame({
    'customer_id': range(1, 151),
    'tenure_months': np.random.randint(1, 72, 150),
    'monthly_charges': np.random.uniform(20, 120, 150).round(2),
    'total_charges': np.random.uniform(100, 8000, 150).round(2),
    'support_calls': np.random.randint(0, 15, 150),
    'contract_type': np.random.choice(['Month-to-Month', 'One Year', 'Two Year'], 150),
    'churned': np.random.choice([0, 1], 150, p=[0.73, 0.27])  # TARGET
})

# Add missing values (replace=False picks distinct rows, so exactly 15 / 10 go missing)
churn_data.loc[np.random.choice(churn_data.index, 15, replace=False), 'monthly_charges'] = np.nan
churn_data.loc[np.random.choice(churn_data.index, 10, replace=False), 'total_charges'] = np.nan

print("📊 Customer Churn Dataset:")
print(churn_data.head(10))
print(f"\nMissing values:\n{churn_data.isnull().sum()}")

In [None]:
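# TODO: Your turn! (A reference solution follows; try it yourself first.)

# 1. Fill missing values with median
# Assign the result back: fillna(inplace=True) on a selected column is
# deprecated and may not modify the DataFrame in newer pandas versions
churn_clean = churn_data.copy()
churn_clean['monthly_charges'] = churn_clean['monthly_charges'].fillna(churn_clean['monthly_charges'].median())
churn_clean['total_charges'] = churn_clean['total_charges'].fillna(churn_clean['total_charges'].median())

print("✅ Step 1: Missing values handled")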

# 2. Create new features
churn_clean['avg_monthly_spend'] = (churn_clean['total_charges'] / churn_clean['tenure_months']).round(2)
churn_clean['support_per_month'] = (churn_clean['support_calls'] / churn_clean['tenure_months']).round(3)
churn_clean['high_value_customer'] = (churn_clean['monthly_charges'] > churn_clean['monthly_charges'].median()).astype(int)

print("✅ Step 2: Created 3 new features")

# 3. Encode contract type
contract_dummies = pd.get_dummies(churn_clean['contract_type'], prefix='contract')
churn_clean = pd.concat([churn_clean, contract_dummies], axis=1)

print("✅ Step 3: Encoded categorical variable")

# 4. Show final dataset
print("\n🎉 Final Dataset:")
print(churn_clean.head())
print(f"\nShape: {churn_clean.shape}")
print(f"Churn rate: {churn_clean['churned'].mean() * 100:.1f}%")

## 🎉 Congratulations!

**You just learned:**
- ✅ Detecting and handling missing values (7 strategies!)
- ✅ Data transformation with GroupBy and aggregation
- ✅ Pivot tables for analysis
- ✅ Merging datasets (INNER, LEFT, RIGHT, OUTER joins)
- ✅ Feature engineering (creating powerful new features!)
- ✅ Building complete AI preprocessing pipelines
- ✅ Real-world examples (churn, loans, e-commerce)

**🎯 Your Complete AI Data Pipeline:**
```python
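# 1. Load data
df = pd.read_csv('data.csv')

# 2. Explore
df.info()
df.describe()

# 3. Clean (assign the result; these methods return new objects)
df = df.fillna(df.median(numeric_only=True))
df = df.dropna(subset=['important_col'])

# 4. Transform
summary = df.groupby('category').agg({'sales': 'sum'})

# 5. Merge (other_df is a second table that shares the 'id' key)
merged = pd.merge(df, other_df, on='id', how='inner')

# 6. Engineer features
df['new_feature'] = df['col1'] / df['col2']

# 7. Normalize (numeric columns only)
num = df.select_dtypes('number')
df_normalized = (num - num.min()) / (num.max() - num.min())

# 8. Ready for ML!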
```

---

**📚 Next Week:** Week 3 - Data Visualization (Make your data come alive!)

**💡 Industry Secret:**
> "Feature engineering and data preparation are more important than the algorithm choice. A simple model with great features beats a complex model with poor features."
> -- Every senior data scientist ever

**🏆 You now have skills that:**
- Often take months of practice to build in traditional courses
- Are used daily by working data scientists
- Come up regularly in AI/ML job interviews
- Power real production AI systems

---

*You're now ready to process data for real AI projects!* 🚀