## CS 2100/4700: Introduction to Machine Learning -- Feature Engineering
**Goal:** Transform raw data into powerful predictive features

## Why Feature Engineering is the "Art" of Machine Learning

> "Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering." — Andrew Ng

### The Feature Engineering Impact

```
┌─────────────────────────────────────────────────────────────────────────┐
│                    THE FEATURE ENGINEERING MULTIPLIER                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   RAW DATA          GOOD FEATURES         GREAT MODEL                   │
│   ──────────   +    ────────────    =     ───────────                   │
│                                                                         │
│   Same algorithm + better features >>> better algorithm + poor features │
│                                                                         │
│   Example:                                                              │
│   • Raw: birth_date = "1985-03-15"                                      │
│   • Feature: age = 39                                                   │
│   • Better: age_group = "35-44"                                         │
│   • Even better: is_working_age = 1, years_to_retirement = 26           │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

### What is Feature Engineering?

| Term | Definition | Example |
|------|------------|--------|
| **Feature** | An individual measurable property used as input to a model | `age`, `income`, `hours_worked` |
| **Feature Engineering** | The process of creating new features from existing data | `age` → `age_squared`, `is_senior` |
| **Feature Extraction** | Deriving features from raw data | `timestamp` → `hour`, `day_of_week` |
| **Feature Selection** | Choosing the most relevant features | Keep `age`, drop `ssn` |
| **Feature Transformation** | Changing feature representation | `income` → `log(income)` |

### Why Feature Engineering Matters: Real Examples

| Scenario | Raw Feature | Engineered Feature | Model Improvement |
|----------|-------------|-------------------|-------------------|
| Predicting house prices | `year_built = 1995` | `house_age = 29` | +15% accuracy |
| Fraud detection | `transaction_time = 3:00 AM` | `is_unusual_hour = 1` | +25% recall |
| Customer churn | `last_purchase = 2024-01-15` | `days_since_purchase = 45` | +20% precision |
| Salary prediction | `education = "PhD"` | `education_years = 20` | +10% R² |

### The Feature Engineering Mindset

Think like a domain expert! Ask yourself:

| Question | Feature Ideas |
|----------|---------------|
| What would a human expert look at? | Domain-specific ratios, thresholds |
| What patterns exist in the data? | Trends, seasonality, clusters |
| What combinations might matter? | Interactions, ratios, differences |
| What external knowledge applies? | Industry benchmarks, known thresholds |

## Learning Objectives

By the end of this lecture, you will be able to:

1. Extract features from datetime variables
2. Apply multiple categorical encoding strategies
3. Create numerical transformations and interactions
4. Perform feature selection using multiple methods
5. Build a complete feature engineering pipeline
6. Avoid common feature engineering pitfalls

## Setup: Load and Prepare Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.feature_selection import mutual_info_classif, SelectKBest
from sklearn.ensemble import RandomForestClassifier

# Set visual style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Load data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'race',
           'sex', 'capital_gain', 'capital_loss', 'hours_per_week',
           'native_country', 'income']
# skipinitialspace=True strips the blank after each comma, so missing values arrive as '?'
df = pd.read_csv(url, names=columns, na_values='?', skipinitialspace=True)

# Clean
df = df.dropna()
df['income_binary'] = (df['income'] == '>50K').astype(int)

print(f"Dataset: {df.shape[0]:,} rows, {df.shape[1]} columns")
print(f"\nColumns available for feature engineering:")
print(df.columns.tolist())

## 1. DateTime Feature Engineering

### 1.1 Why DateTime Features Matter

DateTime is one of the richest sources of features! A single timestamp can generate dozens of useful features.

| DateTime Component | What It Captures | Example Use Case |
|--------------------|------------------|------------------|
| **Year** | Long-term trends | Housing prices over decades |
| **Month** | Seasonality | Retail sales patterns |
| **Day of Month** | Pay cycles | Spending behavior |
| **Day of Week** | Weekly patterns | Restaurant traffic |
| **Hour** | Daily patterns | Energy consumption |
| **Is Weekend** | Work vs leisure | Website traffic |
| **Quarter** | Business cycles | Financial reporting |
| **Is Holiday** | Special events | Travel demand |

### 1.2 Basic DateTime Extraction

In [None]:
# Create a sample datetime column for demonstration
# (the Adult dataset has no dates, so we synthesize one hourly timestamp per row)
np.random.seed(42)
df['application_date'] = pd.date_range(
    start='2020-01-01',
    periods=len(df),
    freq='h'  # lowercase alias; the uppercase 'H' is deprecated since pandas 2.2
)

print("Sample datetime values:")
print(df['application_date'].head())

### 1.3 Extracting DateTime Components

In [None]:
def extract_datetime_features(df, date_col):
    """
    Extract comprehensive datetime features from a date column.
    
    Parameters:
    -----------
    df : DataFrame
    date_col : str - name of datetime column
    
    Returns:
    --------
    DataFrame with new datetime features
    """
    df = df.copy()
    
    # Ensure datetime type
    df[date_col] = pd.to_datetime(df[date_col])
    
    # Basic extractions
    df[f'{date_col}_year'] = df[date_col].dt.year
    df[f'{date_col}_month'] = df[date_col].dt.month
    df[f'{date_col}_day'] = df[date_col].dt.day
    df[f'{date_col}_dayofweek'] = df[date_col].dt.dayofweek  # 0=Monday
    df[f'{date_col}_hour'] = df[date_col].dt.hour
    df[f'{date_col}_quarter'] = df[date_col].dt.quarter
    
    # Derived features
    df[f'{date_col}_is_weekend'] = df[date_col].dt.dayofweek.isin([5, 6]).astype(int)
    df[f'{date_col}_is_month_start'] = df[date_col].dt.is_month_start.astype(int)
    df[f'{date_col}_is_month_end'] = df[date_col].dt.is_month_end.astype(int)
    
    return df

# Apply datetime extraction
df = extract_datetime_features(df, 'application_date')

# Show new features
datetime_cols = [col for col in df.columns if 'application_date_' in col]
print("Extracted DateTime Features:")
print(df[datetime_cols].head())
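
Component extraction is one family of datetime features; elapsed time (such as the `days_since_purchase` example from the introduction) is another. A small sketch with made-up dates, where `reference_date` is a hypothetical snapshot date you would normally take from the data or the prediction time:

```python
import pandas as pd

# Made-up purchase dates for illustration
purchases = pd.DataFrame({
    'last_purchase': pd.to_datetime(['2024-01-15', '2024-02-20', '2023-12-31'])
})

# Hypothetical snapshot date: the moment at which predictions are made
reference_date = pd.Timestamp('2024-02-29')

# Subtracting datetimes yields timedeltas; .dt.days converts them to whole days
purchases['days_since_purchase'] = (reference_date - purchases['last_purchase']).dt.days
print(purchases)
```

Using a fixed reference date (rather than "today") keeps the feature reproducible when the notebook is rerun later.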

### 1.4 Cyclical Encoding for DateTime

**Problem:** Month 12 (December) and Month 1 (January) are numerically far apart (12 vs 1) but actually adjacent in time!

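**Solution:** Use sine and cosine transformations to capture the cyclical structure. Each value is mapped to a point on the unit circle, so December and January end up next to each other.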

```
┌─────────────────────────────────────────────────────────────────────────┐
│                    CYCLICAL ENCODING VISUALIZATION                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Linear Encoding:     1 ─── 2 ─── 3 ─── ... ─── 11 ─── 12              │
│                        ↑                                   ↑            │
│                        └───────── FAR APART ───────────────┘            │
│                                                                         │
│   Cyclical Encoding:          12  1                                     │
│                             11      2                                   │
│                           10          3                                 │
│                             9          4                                │
│                              8        5                                 │
│                                 7  6                                    │
└─────────────────────────────────────────────────────────────────────────┘
```

In [None]:
def cyclical_encode(df, col, max_val):
    """
    Apply cyclical encoding using sin/cos transformation.
    
    Parameters:
    -----------
    df : DataFrame
    col : str - column to encode
    max_val : int - length of the cycle (e.g., 12 for months, 7 for days of week, 24 for hours)
    
    Returns:
    --------
    DataFrame with sin and cos encoded columns
    """
    df = df.copy()
    df[f'{col}_sin'] = np.sin(2 * np.pi * df[col] / max_val)
    df[f'{col}_cos'] = np.cos(2 * np.pi * df[col] / max_val)
    return df

# Apply cyclical encoding
df = cyclical_encode(df, 'application_date_month', 12)
df = cyclical_encode(df, 'application_date_dayofweek', 7)
df = cyclical_encode(df, 'application_date_hour', 24)

print("Cyclical Encoding Example (Month):")
print(df[['application_date_month', 'application_date_month_sin', 
          'application_date_month_cos']].drop_duplicates().sort_values(
          'application_date_month').head(12))
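
To check numerically that the encoding places December next to January, we can compare Euclidean distances between encoded months. This is a standalone sketch; the helper names `month_to_point` and `encoded_distance` are made up for the illustration.

```python
import numpy as np

def month_to_point(month, period=12):
    """Map a month number to its (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * month / period
    return np.array([np.sin(angle), np.cos(angle)])

def encoded_distance(a, b):
    """Euclidean distance between two months in (sin, cos) space."""
    return float(np.linalg.norm(month_to_point(a) - month_to_point(b)))

dec_jan = encoded_distance(12, 1)   # adjacent across the year boundary
jan_feb = encoded_distance(1, 2)    # adjacent within the year
jan_jul = encoded_distance(1, 7)    # half a year apart

print(f"Dec-Jan: {dec_jan:.3f}, Jan-Feb: {jan_feb:.3f}, Jan-Jul: {jan_jul:.3f}")
```

Adjacent months are equally far apart regardless of where the year wraps, and months six apart sit on opposite sides of the circle.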

### 1.5 Visual: Cyclical Encoding

In [None]:
# Visualize cyclical encoding
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Month on a circle
months = np.arange(1, 13)
month_sin = np.sin(2 * np.pi * months / 12)
month_cos = np.cos(2 * np.pi * months / 12)

ax = axes[0]
ax.scatter(month_cos, month_sin, s=200, c=months, cmap='hsv')
for i, m in enumerate(months):
    ax.annotate(f'M{m}', (month_cos[i]+0.1, month_sin[i]+0.1), fontsize=12)
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-1.5, 1.5)
ax.set_xlabel('Cosine')
ax.set_ylabel('Sine')
ax.set_title('Months on Unit Circle\n(Adjacent months are close!)', fontsize=14, fontweight='bold')
ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
ax.axvline(0, color='gray', linestyle='--', alpha=0.5)
ax.set_aspect('equal')

# Plot 2: Comparison of encodings
ax = axes[1]
ax.plot(months, months, 'o-', label='Linear (1-12)', linewidth=2)
ax.plot(months, month_sin, 's-', label='Sin encoding', linewidth=2)
ax.plot(months, month_cos, '^-', label='Cos encoding', linewidth=2)
ax.set_xlabel('Month')
ax.set_ylabel('Encoded Value')
ax.set_title('Linear vs Cyclical Encoding', fontsize=14, fontweight='bold')
ax.legend()
ax.set_xticks(months)

plt.tight_layout()
plt.show()

### Exercise 1.1: DateTime Feature Engineering

Create datetime features from a purchase timestamp:

In [None]:
# Given this sample data
sample_dates = pd.DataFrame({
    'purchase_time': pd.to_datetime([
        '2024-12-25 14:30:00',  # Christmas afternoon
        '2024-07-04 09:00:00',  # July 4th morning
        '2024-01-01 00:15:00',  # New Year's midnight
        '2024-03-15 18:45:00',  # Random weekday evening
        '2024-06-15 12:00:00',  # Saturday noon
    ])
})

# Task 1: Extract year, month, day, hour, dayofweek
# Your code here


# Task 2: Create is_weekend feature


# Task 3: Create is_business_hours feature (9 AM - 5 PM on weekdays)


# Task 4: Apply cyclical encoding to hour

<details>
<summary>Click for Solution</summary>

```python
# Task 1: Basic extraction
sample_dates['year'] = sample_dates['purchase_time'].dt.year
sample_dates['month'] = sample_dates['purchase_time'].dt.month
sample_dates['day'] = sample_dates['purchase_time'].dt.day
sample_dates['hour'] = sample_dates['purchase_time'].dt.hour
sample_dates['dayofweek'] = sample_dates['purchase_time'].dt.dayofweek

# Task 2: Is weekend
sample_dates['is_weekend'] = sample_dates['dayofweek'].isin([5, 6]).astype(int)

# Task 3: Is business hours
sample_dates['is_business_hours'] = (
    (sample_dates['hour'] >= 9) & 
    (sample_dates['hour'] < 17) & 
    (sample_dates['dayofweek'] < 5)
).astype(int)

# Task 4: Cyclical encoding for hour
sample_dates['hour_sin'] = np.sin(2 * np.pi * sample_dates['hour'] / 24)
sample_dates['hour_cos'] = np.cos(2 * np.pi * sample_dates['hour'] / 24)

print(sample_dates)
```

</details>

---

## 2. Categorical Feature Engineering

### 2.1 Encoding Methods Overview

| Method | When to Use | Pros | Cons |
|--------|-------------|------|------|
| **Label Encoding** | Ordinal categories | Simple, memory efficient | Implies a false ordering if applied to nominal data |
| **One-Hot Encoding** | Nominal categories, low cardinality | No ordering implied | High dimensionality |
| **Target Encoding** | High cardinality | Captures target relationship | Risk of data leakage |
| **Frequency Encoding** | High cardinality | Simple, no target leakage | Categories with equal frequency become indistinguishable |
| **Binary Encoding** | High cardinality | Compact representation | Less interpretable |

### 2.2 Encoding Decision Guide

```
                    ┌─────────────────────────┐
                    │  Is there a natural     │
                    │  ORDER to categories?   │
                    └───────────┬─────────────┘
                                │
                    ┌───────────┴───────────┐
                    │                       │
                   YES                      NO
                    │                       │
                    v                       v
            ┌───────────────┐       ┌───────────────────┐
            │ LABEL/ORDINAL │       │ How many unique   │
            │ ENCODING      │       │ categories?       │
            └───────────────┘       └─────────┬─────────┘
                                              │
                                    ┌─────────┴─────────┐
                                    │                   │
                                  < 10               >= 10
                                    │                   │
                                    v                   v
                            ┌───────────────┐   ┌───────────────┐
                            │ ONE-HOT       │   │ TARGET or     │
                            │ ENCODING      │   │ FREQUENCY     │
                            └───────────────┘   │ ENCODING      │
                                                └───────────────┘
```
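
The decision guide above can be written down as a small helper. This is only a sketch of the flowchart; `suggest_encoding` and its `max_onehot` parameter are hypothetical names, and the cutoff of 10 is the rule of thumb from the diagram rather than a hard limit.

```python
def suggest_encoding(n_unique, has_order, max_onehot=10):
    """Return the encoding suggested by the decision guide (rule of thumb only)."""
    if has_order:
        return 'label/ordinal'
    if n_unique < max_onehot:
        return 'one-hot'
    return 'target or frequency'

# Columns from the Adult dataset: (unique values, has a natural order?)
adult_columns = {
    'sex': (2, False),
    'education': (16, True),
    'native_country': (41, False),
    'race': (5, False),
}

for name, (n_unique, has_order) in adult_columns.items():
    print(f"{name:15s} -> {suggest_encoding(n_unique, has_order)}")
```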

### 2.3 Label Encoding (for Ordinal Data)

Use when categories have a meaningful order.

In [None]:
# Education has a natural order
education_order = {
    'Preschool': 1, '1st-4th': 2, '5th-6th': 3, '7th-8th': 4,
    '9th': 5, '10th': 6, '11th': 7, '12th': 8, 'HS-grad': 9,
    'Some-college': 10, 'Assoc-voc': 11, 'Assoc-acdm': 12,
    'Bachelors': 13, 'Masters': 14, 'Prof-school': 15, 'Doctorate': 16
}

df['education_ordinal'] = df['education'].map(education_order)

print("Education Ordinal Encoding:")
print(df[['education', 'education_ordinal']].drop_duplicates().sort_values('education_ordinal'))

### 2.4 One-Hot Encoding (for Nominal Data)

Use when categories have no natural order and cardinality is low (< 10).

In [None]:
# One-hot encode 'sex' (only 2 categories)
sex_dummies = pd.get_dummies(df['sex'], prefix='sex', drop_first=True)
print("One-Hot Encoding (sex):")
print(sex_dummies.head())
print(f"New columns: {sex_dummies.columns.tolist()}")

# One-hot encode 'workclass' (7 categories - manageable)
workclass_dummies = pd.get_dummies(df['workclass'], prefix='workclass')
print(f"\nOne-Hot Encoding (workclass):")
print(f"Original: 1 column → New: {len(workclass_dummies.columns)} columns")
print(f"Columns: {workclass_dummies.columns.tolist()}")
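
`pd.get_dummies` derives its columns from whatever data it is given, so train and test sets can end up with different columns. One common alternative is scikit-learn's `OneHotEncoder`, which learns the categories once and reuses them. A minimal sketch with made-up train and test frames:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up split: 'Without-pay' never appears in the training data
train = pd.DataFrame({'workclass': ['Private', 'State-gov', 'Private', 'Self-emp-inc']})
test = pd.DataFrame({'workclass': ['Private', 'Without-pay']})

# handle_unknown='ignore' encodes unseen categories as all zeros instead of raising
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['workclass']])

# transform returns a sparse matrix by default; convert it for display
test_encoded = encoder.transform(test[['workclass']]).toarray()
print(encoder.categories_[0])
print(test_encoded)
```

The test set always gets the same three columns as the training set, and the unseen category becomes a row of zeros.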

### 2.5 Target Encoding (for High Cardinality)

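Replace each category with the (smoothed) mean of the target for that category.

**WARNING:** The encoding is computed from the target itself. If it is fitted on rows that are later used for evaluation, label information leaks into the features. Compute the category statistics on training data only.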

In [None]:
def target_encode(df, cat_col, target_col, smoothing=10):
    """
    Apply target encoding with smoothing to prevent overfitting.
    
    Parameters:
    -----------
    df : DataFrame
    cat_col : str - categorical column to encode
    target_col : str - target column
    smoothing : int - smoothing factor (higher = more regularization)
    
    Returns:
    --------
    Series with target-encoded values
    """
    # Global mean
    global_mean = df[target_col].mean()
    
    # Category statistics
    cat_stats = df.groupby(cat_col)[target_col].agg(['mean', 'count'])
    
    # Smoothed target encoding
    # Formula: (category_mean * count + global_mean * smoothing) / (count + smoothing)
    smooth_mean = (cat_stats['mean'] * cat_stats['count'] + global_mean * smoothing) / (cat_stats['count'] + smoothing)
    
    return df[cat_col].map(smooth_mean)

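# Apply target encoding to 'occupation' (14 categories - high cardinality)
# NOTE: fitted on the full dataset for demonstration only; in a real workflow, fit on the training split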
df['occupation_target_encoded'] = target_encode(df, 'occupation', 'income_binary')

print("Target Encoding (occupation):")
print(df.groupby('occupation').agg({
    'income_binary': 'mean',
    'occupation_target_encoded': 'first'
}).sort_values('income_binary', ascending=False).round(3))
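
One common way to reduce the leakage mentioned in the warning is out-of-fold encoding: each row is encoded with statistics computed from the other folds only. The sketch below reuses the same smoothing formula; `target_encode_oof` is a hypothetical helper and the toy data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(df, cat_col, target_col, smoothing=10, n_splits=5, seed=42):
    """Out-of-fold target encoding: a row never sees its own target value."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(df):
        fit_part = df.iloc[fit_idx]
        prior = fit_part[target_col].mean()
        stats = fit_part.groupby(cat_col)[target_col].agg(['mean', 'count'])
        smooth = (stats['mean'] * stats['count'] + prior * smoothing) / (stats['count'] + smoothing)
        # Categories absent from the fitting folds fall back to the prior
        values = df.iloc[enc_idx][cat_col].map(smooth).fillna(prior)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded

# Made-up toy data
toy = pd.DataFrame({
    'occupation': ['Sales', 'Tech', 'Sales', 'Exec', 'Tech',
                   'Sales', 'Exec', 'Tech', 'Sales', 'Exec'],
    'income_binary': [0, 1, 0, 1, 0, 1, 1, 1, 0, 1],
})
toy['occupation_te_oof'] = target_encode_oof(toy, 'occupation', 'income_binary')
print(toy)
```

Rows of the same category can now receive slightly different values, because each fold sees different statistics. For a held-out test set you would instead fit the statistics once on the full training set.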

### 2.6 Frequency Encoding

Replace category with its frequency in the dataset.

In [None]:
def frequency_encode(df, col):
    """
    Apply frequency encoding.
    
    Parameters:
    -----------
    df : DataFrame
    col : str - column to encode
    
    Returns:
    --------
    Series with frequency-encoded values
    """
    freq = df[col].value_counts(normalize=True)
    return df[col].map(freq)

# Apply to native_country (41 categories - very high cardinality)
df['native_country_freq'] = frequency_encode(df, 'native_country')

print("Frequency Encoding (native_country):")
print(df[['native_country', 'native_country_freq']].drop_duplicates().sort_values(
    'native_country_freq', ascending=False).head(10))

### 2.7 Comparison of Encoding Methods

In [None]:
# Compare encoding methods for 'occupation'
print("="*70)
print("ENCODING COMPARISON: occupation")
print("="*70)

# Original
print(f"\nOriginal unique values: {df['occupation'].nunique()}")

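# Label encoding (integer codes in alphabetical order, so the numbers carry no meaning;
# scikit-learn intends LabelEncoder for targets and OrdinalEncoder for input features)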
le = LabelEncoder()
df['occupation_label'] = le.fit_transform(df['occupation'])

# One-hot would create 14 columns (too many to show)
occupation_onehot = pd.get_dummies(df['occupation'], prefix='occ')

# Target encoding (already done)
# Frequency encoding
df['occupation_freq'] = frequency_encode(df, 'occupation')

print("\nSample comparison:")
comparison = df[['occupation', 'occupation_label', 'occupation_target_encoded', 
                 'occupation_freq']].drop_duplicates().head(5)
comparison.columns = ['Original', 'Label', 'Target', 'Frequency']
print(comparison.to_string(index=False))

print(f"\nOne-Hot would create: {len(occupation_onehot.columns)} new columns")

### Exercise 2.1: Categorical Encoding

Apply different encoding methods to the `marital_status` column:

In [None]:
# Task 1: How many unique values does marital_status have?
n_unique = _____
print(f"Unique values: {n_unique}")

# Task 2: Apply one-hot encoding
marital_onehot = _____
print(f"One-hot columns: {marital_onehot.shape[1]}")

# Task 3: Apply frequency encoding
df['marital_freq'] = _____

# Task 4: Apply target encoding
df['marital_target'] = _____

# Task 5: Which marital status has the highest income rate?
# (Use target encoding to find out)

<details>
<summary>Click for Solution</summary>

```python
# Task 1
n_unique = df['marital_status'].nunique()
print(f"Unique values: {n_unique}")  # 7

# Task 2
marital_onehot = pd.get_dummies(df['marital_status'], prefix='marital')
print(f"One-hot columns: {marital_onehot.shape[1]}")  # 7

# Task 3
df['marital_freq'] = frequency_encode(df, 'marital_status')

# Task 4
df['marital_target'] = target_encode(df, 'marital_status', 'income_binary')

# Task 5
print("\nMarital status by income rate:")
print(df.groupby('marital_status')['income_binary'].mean().sort_values(ascending=False))
# Married-civ-spouse has highest income rate
```

</details>

### Exercise 2.2: Choosing the Right Encoding

For each column, decide which encoding method to use and explain why:

| Column | Unique Values | Has Order? | Your Choice | Why? |
|--------|---------------|------------|-------------|------|
| `sex` | 2 | No | _____ | _____ |
| `education` | 16 | Yes | _____ | _____ |
| `native_country` | 41 | No | _____ | _____ |
| `race` | 5 | No | _____ | _____ |

<details>
<summary>Click for Answers</summary>

| Column | Your Choice | Why? |
|--------|-------------|------|
| `sex` | One-Hot (or binary) | Only 2 categories, no order |
| `education` | Label/Ordinal | Has natural order (education level) |
| `native_country` | Frequency or Target | High cardinality (41), one-hot creates too many columns |
| `race` | One-Hot | Low cardinality (5), no order |

</details>

## 3. Numerical Feature Engineering

### 3.1 Types of Numerical Transformations

| Transformation | When to Use | Example |
|----------------|-------------|--------|
| **Binning** | Create categories from continuous | Age → Age Groups |
| **Log Transform** | Reduce skewness | Income → Log(Income) |
| **Square/Cube** | Capture non-linear relationships | Age → Age² |
| **Interaction** | Capture combined effects | Hours × Education |
| **Ratio** | Normalize by another feature | Income / Hours = Hourly Rate |
| **Binary Flag** | Indicate presence/absence | Capital Gain > 0 |
| **Clipping** | Handle outliers | Cap at 99th percentile |
| **Polynomial** | Capture curves | Age, Age², Age³ |
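
Clipping is listed in the table but not shown in the sections that follow. A minimal sketch on synthetic, right-skewed data (the log-normal sample is made up to mimic a feature like `capital_gain`):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values for illustration
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=5, sigma=2, size=1000))

# Cap everything above the 99th percentile at that percentile
upper = values.quantile(0.99)
clipped = values.clip(upper=upper)

print(f"Max before: {values.max():,.0f}")
print(f"Max after:  {clipped.max():,.0f}")
print(f"Rows changed: {(clipped != values).sum()}")
```

In a train/test setting, the cap would be computed on the training data and then applied unchanged to the test data.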

### 3.2 Binning: Converting Continuous to Categories

In [None]:
# Method 1: Equal-width bins
df['age_bins_equal'] = pd.cut(df['age'], bins=5, labels=['Very Young', 'Young', 'Middle', 'Senior', 'Elder'])

# Method 2: Custom bins (domain knowledge)
# right=False makes bins left-inclusive, so age 25 falls in '25-34' as the labels promise
age_bins = [0, 25, 35, 45, 55, 65, 100]
age_labels = ['<25', '25-34', '35-44', '45-54', '55-64', '65+']
df['age_group'] = pd.cut(df['age'], bins=age_bins, labels=age_labels, right=False)

# Method 3: Quantile bins (equal frequency)
df['age_quantile'] = pd.qcut(df['age'], q=5, labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5'])

print("Binning Comparison:")
print(df[['age', 'age_bins_equal', 'age_group', 'age_quantile']].head(10))

# Show distribution
print("\nAge Group Distribution:")
print(df['age_group'].value_counts().sort_index())

### 3.3 When to Use Each Binning Method

| Method | When to Use | Example |
|--------|-------------|--------|
| **Equal-width** | Data is uniformly distributed | Temperature ranges |
| **Custom bins** | Domain knowledge exists | Age groups, income brackets |
| **Quantile bins** | Data is skewed, want equal samples per bin | Customer segments |
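
With custom bins, it is worth checking which side of each boundary is included. By default `pd.cut` closes bins on the right, so a boundary value such as 25 lands in the lower bin unless `right=False` is passed. A small check with made-up ages:

```python
import pandas as pd

ages = pd.Series([24, 25, 34, 35])
bins = [0, 25, 35, 100]
labels = ['<25', '25-34', '35+']

# Default: intervals are (0, 25], (25, 35], (35, 100]
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [0, 25), [25, 35), [35, 100)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({'age': ages, 'default': right_closed, 'right=False': left_closed}))
```

Only the `right=False` version agrees with labels written as `'<25'` and `'25-34'`.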

### 3.4 Log Transformation

In [None]:
# Log transform highly skewed features
df['capital_gain_log'] = np.log1p(df['capital_gain'])  # log1p handles zeros
df['capital_loss_log'] = np.log1p(df['capital_loss'])

print("Log Transform Effect:")
print(f"capital_gain - Original skew: {df['capital_gain'].skew():.2f}")
print(f"capital_gain - After log:     {df['capital_gain_log'].skew():.2f}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(df['capital_gain'], bins=50, edgecolor='black', alpha=0.7)
axes[0].set_title('Original capital_gain\n(Highly Skewed)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Value')

axes[1].hist(df['capital_gain_log'], bins=50, edgecolor='black', alpha=0.7, color='coral')
axes[1].set_title('Log-transformed capital_gain\n(Less Skewed)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Log(Value + 1)')

plt.tight_layout()
plt.show()

### 3.5 Polynomial Features

In [None]:
# Create polynomial features for age
df['age_squared'] = df['age'] ** 2
df['age_cubed'] = df['age'] ** 3

# Why? Income often increases with age, peaks, then decreases
# A quadratic term can capture this curve!

print("Polynomial Features (age):")
print(df[['age', 'age_squared', 'age_cubed']].describe().round(1))

### 3.6 Interaction Features

Interaction features capture the COMBINED effect of two variables.

In [None]:
# Create interaction features
df['age_x_education'] = df['age'] * df['education_num']
df['age_x_hours'] = df['age'] * df['hours_per_week']
df['education_x_hours'] = df['education_num'] * df['hours_per_week']

print("Interaction Features:")
print(df[['age', 'education_num', 'hours_per_week', 
          'age_x_education', 'age_x_hours', 'education_x_hours']].head())

# Why interactions matter:
# High education ALONE doesn't guarantee high income
# Many hours ALONE doesn't guarantee high income
# But high education AND many hours together often does!

### 3.7 Ratio Features

In [None]:
# Create ratio features
# Capital gain per annual hour worked: a rough proxy only, not an actual wage
df['hourly_rate_proxy'] = df['capital_gain'] / (df['hours_per_week'] * 52 + 1)  # +1 to avoid division by zero
df['capital_net'] = df['capital_gain'] - df['capital_loss']
df['capital_ratio'] = df['capital_gain'] / (df['capital_loss'] + 1)

print("Ratio Features:")
print(df[['capital_gain', 'capital_loss', 'capital_net', 'capital_ratio']].head(10))

### 3.8 Binary Flag Features

In [None]:
# Create binary flag features
df['has_capital_gain'] = (df['capital_gain'] > 0).astype(int)
df['has_capital_loss'] = (df['capital_loss'] > 0).astype(int)
df['works_overtime'] = (df['hours_per_week'] > 40).astype(int)
df['is_senior'] = (df['age'] >= 60).astype(int)
df['is_highly_educated'] = (df['education_num'] >= 13).astype(int)  # Bachelor's or higher

print("Binary Flag Features:")
print(df[['has_capital_gain', 'has_capital_loss', 'works_overtime', 
          'is_senior', 'is_highly_educated']].sum())
print("\nPercentages:")
print((df[['has_capital_gain', 'has_capital_loss', 'works_overtime', 
           'is_senior', 'is_highly_educated']].mean() * 100).round(1))

### Exercise 3.1: Numerical Feature Engineering

Create the following features:

In [None]:
# Task 1: Create 'hours_per_week_squared'
df['hours_squared'] = _____

# Task 2: Create 'age_hours_interaction'
df['age_hours'] = _____

# Task 3: Create binary flag 'works_fulltime' (>= 35 hours)
df['works_fulltime'] = _____

# Task 4: Create age bins: 'Young' (<30), 'Middle' (30-50), 'Senior' (>50)
df['age_category'] = _____

# Task 5: Create log transform of (hours_per_week + 1)
df['hours_log'] = _____

<details>
<summary>Click for Solution</summary>

```python
# Task 1
df['hours_squared'] = df['hours_per_week'] ** 2

# Task 2
df['age_hours'] = df['age'] * df['hours_per_week']

# Task 3
df['works_fulltime'] = (df['hours_per_week'] >= 35).astype(int)

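# Task 4 (right=False gives [0, 30), [30, 51), [51, 100), matching <30, 30-50, >50 for integer ages)
df['age_category'] = pd.cut(df['age'],
                            bins=[0, 30, 51, 100],
                            labels=['Young', 'Middle', 'Senior'],
                            right=False)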

# Task 5
df['hours_log'] = np.log1p(df['hours_per_week'])

print("New features created:")
print(df[['hours_per_week', 'hours_squared', 'age_hours', 
          'works_fulltime', 'age_category', 'hours_log']].head())
```

</details>

### Exercise 3.2: Feature Engineering Creativity

Come up with 3 NEW feature ideas for the Adult dataset that we haven't discussed. Explain why each might be useful:

| Feature Name | Formula/Logic | Why It Might Be Useful |
|--------------|---------------|------------------------|
| ____________ | ____________ | ____________ |
| ____________ | ____________ | ____________ |
| ____________ | ____________ | ____________ |

<details>
<summary>Click for Example Answers</summary>

| Feature Name | Formula/Logic | Why It Might Be Useful |
|--------------|---------------|------------------------|
| `years_until_retirement` | `65 - age` | Captures career stage, may correlate with savings behavior |
| `education_per_age` | `education_num / age` | Early achievers vs late bloomers |
| `is_married` | `marital_status.str.startswith('Married')` | Simpler binary version of marital status |
| `capital_active` | `has_capital_gain OR has_capital_loss` | Shows investment activity |
| `work_intensity` | `hours_per_week / 40` | Normalized work effort |

</details>

---

## 4. Feature Selection

### 4.1 Why Feature Selection Matters

| Problem with Too Many Features | Consequence |
|-------------------------------|-------------|
| **Curse of dimensionality** | Need exponentially more data |
| **Overfitting** | Model learns noise, not signal |
| **Slow training** | More features = more computation |
| **Harder to interpret** | Can't explain 500 features |
| **Multicollinearity** | Correlated features confuse models |

### 4.2 Feature Selection Methods

| Method | Type | How It Works |
|--------|------|-------------|
| **Correlation** | Filter | Remove highly correlated features |
| **Variance Threshold** | Filter | Remove low-variance features |
| **Mutual Information** | Filter | Keep features with high info about target |
| **Chi-Square** | Filter | For categorical features vs categorical target |
| **Feature Importance** | Embedded | Use tree models to rank importance |
| **Recursive Feature Elimination** | Wrapper | Iteratively remove weakest features |
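
Two methods from the table, variance threshold and recursive feature elimination, are not demonstrated in the sections below. A small sketch on synthetic data, where only the column named `signal` carries information (the data and column names are made up; logistic regression is just one possible estimator for RFE):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.linear_model import LogisticRegression

# Synthetic data: one informative column, two noise columns, one constant column
rng = np.random.default_rng(0)
n = 300
signal = rng.normal(size=n)
X_demo = pd.DataFrame({
    'signal': signal,
    'noise_a': rng.normal(size=n),
    'noise_b': rng.normal(size=n),
    'constant': np.ones(n),
})
y_demo = (signal > 0).astype(int)

# Filter: drop columns whose variance is zero
vt = VarianceThreshold(threshold=0.0)
vt.fit(X_demo)
kept_after_vt = X_demo.columns[vt.get_support()].tolist()
print("After variance threshold:", kept_after_vt)

# Wrapper: repeatedly fit the model and drop the weakest feature
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=1)
rfe.fit(X_demo[kept_after_vt], y_demo)
kept_after_rfe = [c for c, keep in zip(kept_after_vt, rfe.support_) if keep]
print("After RFE:", kept_after_rfe)
```

Filter methods like the variance threshold never look at a model, which makes them cheap; wrapper methods like RFE refit the model many times and are correspondingly slower.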

### 4.3 Method 1: Correlation-Based Selection

In [None]:
# Identify highly correlated feature pairs
numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
# Remove the target and fnlwgt (a census sampling weight, not a property of the person)
numeric_features = [c for c in numeric_features if c not in ['income_binary', 'fnlwgt']]

corr_matrix = df[numeric_features].corr().abs()

# Find pairs with correlation > 0.8
high_corr_pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if corr_matrix.iloc[i, j] > 0.8:
            high_corr_pairs.append({
                'Feature 1': corr_matrix.columns[i],
                'Feature 2': corr_matrix.columns[j],
                'Correlation': corr_matrix.iloc[i, j]
            })

print("Highly Correlated Feature Pairs (>0.8):")
if high_corr_pairs:
    print(pd.DataFrame(high_corr_pairs))
else:
    print("No pairs found with correlation > 0.8")

# Rule: When two features are highly correlated, keep only one!
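
One way to apply that rule automatically is to keep, from each highly correlated pair, the feature that is more correlated with the target. The helper `drop_correlated` below is a hypothetical sketch, shown on synthetic data rather than the full Adult frame:

```python
import numpy as np
import pandas as pd

def drop_correlated(X, y, threshold=0.8):
    """From each pair with |corr| > threshold, drop the feature less correlated with y."""
    corr = X.corr().abs()
    target_corr = X.corrwith(y).abs()
    to_drop = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue  # one member of this pair is already gone
            if corr.loc[a, b] > threshold:
                to_drop.add(a if target_corr[a] < target_corr[b] else b)
    return [c for c in cols if c not in to_drop]

# Synthetic data: age and age_squared are almost perfectly correlated
rng = np.random.default_rng(0)
age = rng.integers(18, 70, size=200)
X_demo = pd.DataFrame({
    'age': age,
    'age_squared': age ** 2,
    'hours': rng.integers(20, 60, size=200),
})
y_demo = pd.Series((age + rng.normal(0, 5, size=200) > 45).astype(int))

kept = drop_correlated(X_demo, y_demo)
print("Kept features:", kept)
```

Pearson correlation only captures linear relationships, so this filter is a first pass rather than a guarantee that the remaining features are independent.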

### 4.4 Method 2: Feature Importance (Tree-Based)

In [None]:
# Use Random Forest to get feature importance
from sklearn.ensemble import RandomForestClassifier

# Prepare features (only original numeric + some engineered)
feature_cols = ['age', 'education_num', 'capital_gain', 'capital_loss', 
                'hours_per_week', 'age_squared', 'has_capital_gain']

X = df[feature_cols]
y = df['income_binary']

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X, y)

# Get feature importance
importance_df = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print("Feature Importance (Random Forest):")
print(importance_df.to_string(index=False))

# Visualize
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='steelblue')
plt.xlabel('Importance')
plt.title('Feature Importance (Random Forest)', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
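
The importances above are impurity-based and computed on the training data. A complementary check is permutation importance on held-out data: shuffle one column at a time and measure how much the score drops. A minimal sketch on synthetic data (the columns `signal` and `noise` are made up):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on 'signal'
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame({'signal': rng.normal(size=n), 'noise': rng.normal(size=n)})
y_demo = (X_demo['signal'] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each column of the held-out set and record the drop in accuracy
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
perm_df = pd.DataFrame({
    'Feature': X_demo.columns,
    'Importance': result.importances_mean,
}).sort_values('Importance', ascending=False)
print(perm_df.to_string(index=False))
```

Shuffling `signal` destroys the model's accuracy, while shuffling `noise` changes almost nothing.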

### 4.5 Method 3: Mutual Information

In [None]:
from sklearn.feature_selection import mutual_info_classif

# Calculate mutual information
mi_scores = mutual_info_classif(X, y, random_state=42)

mi_df = pd.DataFrame({
    'Feature': feature_cols,
    'MI Score': mi_scores
}).sort_values('MI Score', ascending=False)

print("Mutual Information Scores:")
print(mi_df.to_string(index=False))

### 4.6 Feature Selection Decision Framework

```
┌─────────────────────────────────────────────────────────────────────────┐
│                  FEATURE SELECTION DECISION FRAMEWORK                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  STEP 1: Remove obviously useless features                              │
│  ─────────────────────────────────────────                              │
│  • IDs, names, timestamps (unless engineered)                           │
│  • Features with >50% missing values                                    │
│  • Features with near-zero variance                                     │
│                          │                                              │
│                          ▼                                              │
│  STEP 2: Handle multicollinearity                                       │
│  ────────────────────────────────                                       │
│  • Remove one from each pair with correlation > 0.9                     │
│  • Keep the one more correlated with target                             │
│                          │                                              │
│                          ▼                                              │
│  STEP 3: Rank remaining features                                        │
│  ───────────────────────────────                                        │
│  • Use feature importance OR mutual information                         │
│  • Keep top K features (K based on domain knowledge or CV)              │
│                          │                                              │
│                          ▼                                              │
│  STEP 4: Validate selection                                             │
│  ─────────────────────────                                              │
│  • Train model with selected features                                   │
│  • Compare to model with all features                                   │
│  • Check for performance vs complexity tradeoff                         │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```
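
Steps 1 to 3 of the framework can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `select_features` and its thresholds are hypothetical, the toy data is synthetic, and dropping IDs, names, or raw timestamps still requires human judgment.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def select_features(X, y, max_missing=0.5, min_variance=1e-8,
                    max_corr=0.9, top_k=4):
    """Hypothetical helper following Steps 1-3 of the framework."""
    # Step 1: drop features that are mostly missing or nearly constant
    keep = [c for c in X.columns
            if X[c].isna().mean() <= max_missing and X[c].var() > min_variance]

    # Step 2: within each highly correlated pair, drop the feature
    # that is less correlated with the target
    target_corr = X[keep].corrwith(y).abs()
    pair_corr = X[keep].corr().abs()
    dropped = set()
    for i, a in enumerate(keep):
        for b in keep[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if pair_corr.loc[a, b] > max_corr:
                dropped.add(a if target_corr[a] < target_corr[b] else b)
    keep = [c for c in keep if c not in dropped]

    # Step 3: rank survivors by mutual information and keep the top K
    filled = X[keep].fillna(X[keep].median())  # the MI estimator cannot handle NaN
    mi = mutual_info_classif(filled, y, random_state=42)
    ranked = pd.Series(mi, index=keep).sort_values(ascending=False)
    return ranked.head(top_k).index.tolist()

# Synthetic toy data (assumption) to exercise each step
rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 80, size=n).astype(float)
toy_X = pd.DataFrame({
    'age': age,
    'age_squared': age ** 2,                         # highly correlated with age
    'hours_per_week': rng.normal(40, 10, size=n),
    'constant': np.ones(n),                          # zero variance
    'mostly_missing': np.where(rng.random(n) < 0.8, np.nan, rng.normal(size=n)),
})
toy_y = pd.Series((age + rng.normal(0, 10, size=n) > 50).astype(int))

selected = select_features(toy_X, toy_y, top_k=2)
print(selected)
```

Step 4 is then a matter of training the model once on `selected` and once on all features, and comparing cross-validated scores.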

### Exercise 4.1: Feature Selection

In [None]:
# Given these features and their importance scores:
features = {
    'age': 0.15,
    'education_num': 0.12,
    'hours_per_week': 0.10,
    'capital_gain': 0.25,
    'occupation_encoded': 0.08,
    'age_squared': 0.14,  # Highly correlated with age (0.95)
    'random_noise': 0.01
}

# Task 1: Which feature would you remove due to low importance?
# Answer: _____

# Task 2: Between 'age' and 'age_squared', which would you keep and why?
# Answer: _____

# Task 3: If you could only keep 4 features, which ones?
# Answer: _____

<details>
<summary>Click for Answers</summary>

1. **Remove:** `random_noise` (importance = 0.01, essentially useless)

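2. **Keep `age_squared`** if capturing the non-linear relationship matters most, even though its importance is slightly lower (0.14 vs 0.15). Alternatively, keep `age` if interpretability matters more. Either way, keep only one of the two, since they are highly correlated (0.95).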

3. **Top 4 features:** `capital_gain` (0.25), `age` (0.15), `age_squared` (0.14), `education_num` (0.12)
   - Or if avoiding correlated features: `capital_gain`, `age`, `education_num`, `hours_per_week`

</details>

## 5. Complete Feature Engineering Pipeline

### 5.1 The Full Pipeline

In [None]:
def engineer_features(df):
    """
    Complete feature engineering pipeline for Adult dataset.
    
    Parameters:
    -----------
    df : DataFrame - raw data
    
    Returns:
    --------
    DataFrame with engineered features
    """
    df = df.copy()
    
    print("="*60)
    print("FEATURE ENGINEERING PIPELINE")
    print("="*60)
    
    # Step 1: Numerical Transformations
    print("\n Step 1: Numerical Transformations")
    df['age_squared'] = df['age'] ** 2
    df['age_cubed'] = df['age'] ** 3
    df['capital_gain_log'] = np.log1p(df['capital_gain'])
    df['capital_loss_log'] = np.log1p(df['capital_loss'])
    df['capital_net'] = df['capital_gain'] - df['capital_loss']
    print("   Created polynomial, log, and arithmetic features")
    
    # Step 2: Binary Flags
    print("\n Step 2: Binary Flags")
    df['has_capital_gain'] = (df['capital_gain'] > 0).astype(int)
    df['has_capital_loss'] = (df['capital_loss'] > 0).astype(int)
    df['works_overtime'] = (df['hours_per_week'] > 40).astype(int)
    df['is_senior'] = (df['age'] >= 60).astype(int)
    df['is_highly_educated'] = (df['education_num'] >= 13).astype(int)
    print("   Created 5 binary flag features")
    
    # Step 3: Binning
    print("\n Step 3: Binning")
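    # right=False makes the bins left-inclusive so they match the labels
    # (age 25 falls in '25-34', not '<25'); np.inf covers every age of 65 or more
    age_bins = [0, 25, 35, 45, 55, 65, np.inf]
    age_labels = ['<25', '25-34', '35-44', '45-54', '55-64', '65+']
    df['age_group'] = pd.cut(df['age'], bins=age_bins, labels=age_labels, right=False)
    
    # Default right=True keeps exactly 40 hours in 'Full-time',
    # consistent with works_overtime (hours_per_week > 40)
    hours_bins = [0, 20, 35, 40, 50, 100]
    hours_labels = ['Part-time', 'Reduced', 'Full-time', 'Overtime', 'Extreme']
    df['hours_group'] = pd.cut(df['hours_per_week'], bins=hours_bins, labels=hours_labels)
    print("   Created age_group and hours_group bins")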
    
    # Step 4: Interaction Features
    print("\n Step 4: Interaction Features")
    df['age_x_education'] = df['age'] * df['education_num']
    df['age_x_hours'] = df['age'] * df['hours_per_week']
    df['education_x_hours'] = df['education_num'] * df['hours_per_week']
    print("   Created 3 interaction features")
    
    # Step 5: Categorical Encoding
    print("\n Step 5: Categorical Encoding")
    
    # Label encode ordinal features
    education_order = {
        'Preschool': 1, '1st-4th': 2, '5th-6th': 3, '7th-8th': 4,
        '9th': 5, '10th': 6, '11th': 7, '12th': 8, 'HS-grad': 9,
        'Some-college': 10, 'Assoc-voc': 11, 'Assoc-acdm': 12,
        'Bachelors': 13, 'Masters': 14, 'Prof-school': 15, 'Doctorate': 16
    }
    df['education_ordinal'] = df['education'].map(education_order)
    
    # Binary encode sex
    df['is_male'] = (df['sex'] == 'Male').astype(int)
    
    # Frequency encode high-cardinality features
    for col in ['occupation', 'native_country']:
        freq = df[col].value_counts(normalize=True)
        df[f'{col}_freq'] = df[col].map(freq)
    
    print("   Encoded education (ordinal), sex (binary), occupation & country (frequency)")
    
    # Summary
    print("\n" + "="*60)
    print("PIPELINE COMPLETE")
    print("="*60)
    original_cols = 15  # column count of the raw Adult dataset (hardcoded); adjust if df already contains extra columns
    new_cols = len(df.columns)
    print(f"Original columns: {original_cols}")
    print(f"Final columns: {new_cols}")
    print(f"New features created: {new_cols - original_cols}")
    
    return df

# Apply the pipeline
df_engineered = engineer_features(df)

### 5.2 Verify Feature Quality

In [None]:
def verify_features(df, target_col='income_binary'):
    """
    Verify quality of engineered features.
    """
    print("="*60)
    print("FEATURE QUALITY VERIFICATION")
    print("="*60)
    
    # Get numeric columns (excluding target)
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    numeric_cols = [c for c in numeric_cols if c != target_col]
    
    # Check 1: Missing values
    print("\n1  Missing Values:")
    missing = df[numeric_cols].isnull().sum()
    missing = missing[missing > 0]
    if len(missing) == 0:
        print("   No missing values in numeric features")
    else:
        print(f"    Features with missing values: {missing.to_dict()}")
    
    # Check 2: Infinite values
    print("\n2  Infinite Values:")
    inf_counts = np.isinf(df[numeric_cols]).sum()
    inf_counts = inf_counts[inf_counts > 0]
    if len(inf_counts) == 0:
        print("   No infinite values")
    else:
        print(f"    Features with infinite values: {inf_counts.to_dict()}")
    
    # Check 3: Constant features
    print("\n3  Constant Features (zero variance):")
    constants = [c for c in numeric_cols if df[c].std() == 0]
    if len(constants) == 0:
        print("   No constant features")
    else:
        print(f"   Constant features: {constants}")
    
    # Check 4: High correlation with target
    print("\n4  Top 5 Features by Target Correlation:")
    correlations = df[numeric_cols].corrwith(df[target_col]).abs().sort_values(ascending=False)
    print(correlations.head().to_string())
    
    print("\n" + "="*60)

# Verify
verify_features(df_engineered)

## 6. Common Pitfalls and Best Practices

### 6.1 The Deadly Sin: Data Leakage

```
┌─────────────────────────────────────────────────────────────────────────┐
│                          ⚠️  DATA LEAKAGE ⚠️                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Data leakage = Using information during training that won't be         │
│                 available at prediction time                            │
│                                                                         │
│  COMMON CAUSES:                                                         │
│  • Target encoding on full data (not just training)                     │
│  • Scaling on full data (not just training)                             │
│  • Features derived from future data                                    │
│  • Including proxy features for target                                  │
│                                                                         │
│  SYMPTOMS:                                                              │
│  • Suspiciously high validation accuracy                                │
│  • Model performs poorly in production                                  │
│                                                                         │
│  PREVENTION:                                                            │
│  • Always split data BEFORE feature engineering                         │
│  • Fit transformers on training data only                               │
│  • Apply same transformers to test/validation data                      │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

### 6.2 Correct vs Incorrect Pipeline

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# WRONG: Leakage! The scaler is fitted on all data, so it sees the test rows
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # ❌ Fitting on ALL data
X_train, X_test = train_test_split(X_scaled)  # ❌ Then splitting

# CORRECT: Fit on training only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # ✅ Fit on training
X_test_scaled = scaler.transform(X_test)  # ✅ Only transform test

print("✅ Correct pipeline: fit_transform on train, transform only on test")
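
One way to make the correct pattern hard to get wrong is to wrap the scaler and the model in a scikit-learn `Pipeline`, so that cross-validation refits the scaler on the training portion of every fold. The sketch below uses synthetic data from `make_classification` and a logistic regression purely as stand-ins; `X_demo`, `y_demo`, and `pipe` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), not the Adult dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

# The scaler is part of the estimator, so each CV fold fits it
# on that fold's training rows only and merely transforms the held-out rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

The same idea applies to encoders and imputers: anything with a `fit` step belongs inside the pipeline rather than in front of the split.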

### 6.3 Best Practices Summary

| Practice | Description | Why It Matters |
|----------|-------------|----------------|
| **Split first** | Split data before any transformation | Prevents leakage |
| **Fit on train** | Fit encoders/scalers on training only | Simulates production |
| **Document features** | Keep track of what you created | Reproducibility |
| **Start simple** | Try simple features first | Easier to debug |
| **Validate impact** | Check if new features improve model | Avoid complexity |
| **Handle edge cases** | What if a new category appears? | Production robustness |
| **Version control** | Save feature engineering code | Reproducibility |

### 6.4 Feature Engineering Checklist

```
┌─────────────────────────────────────────────────────────────────────────┐
│                    FEATURE ENGINEERING CHECKLIST                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  BEFORE ENGINEERING:                                                    │
│  □ Split data into train/test                                           │
│  □ Understand each feature's meaning                                    │
│  □ Identify feature types (numeric, categorical, datetime)              │
│                                                                         │
│  DURING ENGINEERING:                                                    │
│  □ Create meaningful transformations                                    │
│  □ Handle missing values appropriately                                  │
│  □ Encode categorical variables                                         │
│  □ Create interaction features (if appropriate)                         │
│  □ Document each new feature                                            │
│                                                                         │
│  AFTER ENGINEERING:                                                     │
│  □ Check for missing/infinite values                                    │
│  □ Remove constant features                                             │
│  □ Check for high correlations                                          │
│  □ Validate feature importance                                          │
│  □ Test model with new features                                         │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

### Exercise 6.1: Spot the Leakage

Which of the following scenarios have data leakage?

1. **Scenario A:** You calculate the mean income for each occupation using the entire dataset, then use this as a feature.

2. **Scenario B:** You create a feature `days_since_last_purchase` by subtracting purchase date from today's date.

3. **Scenario C:** You split data, then apply StandardScaler fitted on training data to both train and test sets.

4. **Scenario D:** You include `customer_lifetime_value` as a feature to predict `will_purchase_again`.

<details>
<summary>Click for Answers</summary>

1. **Scenario A: LEAKAGE ⚠️**
   - Mean income includes test data information
   - Fix: Calculate mean only from training data

2. **Scenario B: NO LEAKAGE ✅**
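   - Using the current date is fine (it is available at prediction time)
   - But be careful with "days since" features in time series: compute them relative to each row's own reference date, so that historical rows never see later purchases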

3. **Scenario C: NO LEAKAGE ✅**
   - This is the correct approach!

4. **Scenario D: LEAKAGE ⚠️**
   - Lifetime value is calculated AFTER purchases happen
   - It includes future information about the customer

</details>
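
The fix for Scenario A can be sketched as follows, assuming a toy table with hypothetical `occupation` and `income` columns. The per-occupation mean is computed from the training rows only and then mapped onto both splits; occupations missing from the training split fall back to the overall training mean.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (assumption) standing in for the real dataset
toy = pd.DataFrame({
    'occupation': ['Sales', 'Sales', 'Tech', 'Tech', 'Exec', 'Exec', 'Sales', 'Tech'],
    'income': [30, 40, 60, 70, 90, 110, 35, 65],
})

# Split FIRST, then compute statistics
train_df, test_df = train_test_split(toy, test_size=0.25, random_state=0)
train_df, test_df = train_df.copy(), test_df.copy()

# Statistics learned from the training rows only
occ_mean = train_df.groupby('occupation')['income'].mean()
global_mean = train_df['income'].mean()

# Apply the same mapping to both splits; unseen occupations get the global mean
train_df['occ_mean_income'] = train_df['occupation'].map(occ_mean)
test_df['occ_mean_income'] = test_df['occupation'].map(occ_mean).fillna(global_mean)

print(test_df)
```

When the averaged column is the target itself, each training row still contributes to its own encoding; computing the means out-of-fold within the training set is a common, more careful variant.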

## Summary: Feature Engineering Quick Reference

### DateTime Features
```python
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['dayofweek'] = df['date'].dt.dayofweek
df['is_weekend'] = df['date'].dt.dayofweek.isin([5,6]).astype(int)

# Cyclical encoding
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
```
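
A short check of why the cyclical encoding helps: with raw month numbers, December (12) and January (1) look 11 units apart, whereas in sin/cos space every pair of adjacent months is equally close. The helper `month_to_xy` below is a hypothetical name used only for this illustration.

```python
import numpy as np

def month_to_xy(month):
    """Map a month number (1-12) to its (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

# Euclidean distances in the encoded space
d_dec_jan = np.linalg.norm(month_to_xy(12) - month_to_xy(1))
d_jun_jul = np.linalg.norm(month_to_xy(6) - month_to_xy(7))
d_jan_jul = np.linalg.norm(month_to_xy(1) - month_to_xy(7))

print(f"Dec-Jan: {d_dec_jan:.3f}")   # adjacent months, about 0.518
print(f"Jun-Jul: {d_jun_jul:.3f}")   # adjacent months, same distance
print(f"Jan-Jul: {d_jan_jul:.3f}")   # opposite months, 2.000
```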

### Categorical Encoding
```python
# One-hot
pd.get_dummies(df['cat_col'], prefix='cat')

# Label/Ordinal
df['encoded'] = df['cat_col'].map(order_dict)

# Frequency
df['freq'] = df['cat_col'].map(df['cat_col'].value_counts(normalize=True))
```

### Numerical Transformations
```python
# Log transform
df['log_col'] = np.log1p(df['col'])

# Polynomial
df['col_squared'] = df['col'] ** 2

# Binning (include_lowest=True keeps values equal to 0 in the first bin)
df['binned'] = pd.cut(df['col'], bins=[0, 25, 50, 100], labels=['Low', 'Med', 'High'], include_lowest=True)

# Interaction
df['interaction'] = df['col1'] * df['col2']

# Binary flag
df['flag'] = (df['col'] > threshold).astype(int)
```

### Feature Selection
```python
# Correlation
corr_matrix = df.corr(numeric_only=True).abs()  # skip non-numeric columns

# Random Forest importance
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
importance = rf.feature_importances_

# Mutual information
from sklearn.feature_selection import mutual_info_classif
mi = mutual_info_classif(X, y)
```