# X1: Feature Engineering - The Art of Creating Better Features

Feature engineering is often the difference between mediocre and winning models.

## Introduction

**Andrew Ng's famous quote:**

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering."

Good features make simple models work. Bad features make complex models fail.

## Table of Contents

1. Handling Missing Data
2. Encoding Categorical Variables
3. Numerical Transformations
4. Feature Scaling
5. Interaction Features
6. Time-Based Features
7. Automated Feature Engineering

In [None]:
# Install required library if not available
# Uncomment the line below if running locally and category-encoders is not installed
# !pip install category-encoders

try:
    import category_encoders
    print(f'✅ category-encoders version {category_encoders.__version__} found')
except ImportError:
    print('⚠️ category-encoders not found. Installing...')
    import sys
    import subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "category-encoders"])
    import category_encoders
    print(f'✅ category-encoders version {category_encoders.__version__} installed successfully!')

# Import all required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from category_encoders import TargetEncoder
import warnings
warnings.filterwarnings('ignore')

print('✅ All libraries loaded successfully!')

## 1. Handling Missing Data

**Three Strategies:**

**A. Deletion:**
- Drop rows: When <5% missing
- Drop columns: When >70% missing

**B. Imputation:**
- Mean/Median: Numerical features
- Mode: Categorical features
- Forward/Backward fill: Time series
- Model-based: Predict missing values

**C. Flag Creation:**
- Create 'is_missing' binary feature
- Often helps models learn patterns

In [None]:
# Create sample data with missing values
df = pd.DataFrame({
    'age': [25, np.nan, 35, 45, np.nan],
    'income': [50000, 60000, np.nan, 80000, 75000],
    'city': ['NYC', 'LA', np.nan, 'Chicago', 'NYC']
})

print('Original data:')
print(df)
print(f'\nMissing values:\n{df.isnull().sum()}')

# Strategy 1: Simple imputation
from sklearn.impute import SimpleImputer

# fit_transform returns a 2D array of shape (n, 1); ravel() flattens it to one column
num_imputer = SimpleImputer(strategy='mean')
df['age_imputed'] = num_imputer.fit_transform(df[['age']]).ravel()
df['income_imputed'] = SimpleImputer(strategy='median').fit_transform(df[['income']]).ravel()

# Strategy 2: Create missing flag
df['age_was_missing'] = df['age'].isnull().astype(int)

print('\nAfter imputation:')
print(df[['age', 'age_imputed', 'age_was_missing']])
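The imputation and flag-creation strategies above can also be combined in one step, and forward fill is worth seeing on its own. The sketch below is a minimal illustration on hypothetical toy data (`train`, `test`, and `readings` are made-up names, not part of this notebook's dataset). It assumes you want the imputation statistic learned from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test frames for illustration
train = pd.DataFrame({'age': [25.0, np.nan, 35.0, 45.0]})
test = pd.DataFrame({'age': [np.nan, 50.0]})

# add_indicator=True appends a binary "was missing" column for features
# that had missing values when the imputer was fitted
imputer = SimpleImputer(strategy='median', add_indicator=True)
train_out = imputer.fit_transform(train)  # median learned from train only
test_out = imputer.transform(test)        # same median reused on test

print('Train (value, was_missing):')
print(train_out)
print('Test (value, was_missing):')
print(test_out)

# Forward fill for an ordered (time series) column
readings = pd.Series(
    [1.0, np.nan, np.nan, 4.0],
    index=pd.date_range('2024-01-01', periods=4, freq='D'),
)
filled = readings.ffill()  # carry the last observed value forward
print(filled)
```

Fitting the imputer on the training rows and reusing it on the test rows follows the same rule as any other learned preprocessing step: statistics come from training data only.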

## 2. Encoding Categorical Variables

**Methods:**

**A. Label Encoding:** For ordinal data (low, medium, high)

**B. One-Hot Encoding:** For nominal data with few categories (<10)
- Creates binary column for each category
- Avoids imposing false order

**C. Target Encoding:** For high-cardinality features (>10 categories)
- Replace category with mean target value
- Risk of overfitting - use cross-validation

**D. Frequency Encoding:** Replace with category frequency
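Label (ordinal) encoding and frequency encoding are listed above but are not shown in the cells that follow, so here is a minimal sketch of both. The data frame `df_demo` and the mapping `size_order` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration only
df_demo = pd.DataFrame({
    'size': ['low', 'high', 'medium', 'low', 'medium', 'low'],
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'NYC', 'LA'],
})

# Ordinal encoding: map each level to its rank explicitly so the
# numeric order matches the real-world order
size_order = {'low': 0, 'medium': 1, 'high': 2}
df_demo['size_encoded'] = df_demo['size'].map(size_order)

# Frequency encoding: replace each category with its relative frequency
city_freq = df_demo['city'].value_counts(normalize=True)
df_demo['city_freq'] = df_demo['city'].map(city_freq)

print(df_demo)
```

An explicit mapping is used here because an alphabetical encoder would order the levels as high, low, medium. In a real project the frequencies would be computed on the training split only and then mapped onto the test split.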

## ⚠️ CRITICAL: Avoiding Data Leakage in Target Encoding

**Data leakage** is one of the most dangerous mistakes in machine learning. It happens when information from the test set "leaks" into the training process, causing models to appear much better than they actually are.

### The Problem with Target Encoding

When you replace a categorical variable with the mean of the target variable for that category, you MUST compute those means using ONLY the training data, never the full dataset.

### ❌ WRONG (Data Leakage):
```python
# DON'T DO THIS!
target_means = df.groupby('category')['target'].mean()  # Uses ALL data
df['category_encoded'] = df['category'].map(target_means)
# THEN split train/test
X_train, X_test, y_train, y_test = train_test_split(...)
```

**Why it's wrong:** You're using information from the test set to create features. Your model will learn patterns that include test data, making validation metrics unrealistically optimistic. In production, when you encounter truly new data, performance will drop dramatically.

### ✅ CORRECT (No Leakage):
```python
# Split FIRST
X_train, X_test, y_train, y_test = train_test_split(...)

# Compute encoding ONLY on training data
# (the target lives in y_train, so group it by the training categories)
target_means = y_train.groupby(X_train['category']).mean()

# Apply to both sets
X_train['category_encoded'] = X_train['category'].map(target_means)
X_test['category_encoded'] = X_test['category'].map(target_means)

# Handle unseen categories in test set with the training-set global mean
X_test['category_encoded'] = X_test['category_encoded'].fillna(y_train.mean())
```

**The demonstration below shows both approaches side-by-side so you can see the difference and understand why this matters.**

In [None]:
# One-Hot Encoding
df_cat = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})
df_onehot = pd.get_dummies(df_cat, columns=['color'], prefix='color')

print('One-Hot Encoding:')
print(df_onehot)

print('\n' + '='*70)
print('TARGET ENCODING - Data Leakage Demonstration')
print('='*70)

# Create sample data with train/test split
np.random.seed(42)
n_samples = 100
df_full = pd.DataFrame({
    'city': np.random.choice(['NYC', 'LA', 'Chicago', 'Boston'], n_samples),
    'price': np.random.randint(200, 600, n_samples)
})

# Split into train and test
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df_full, test_size=0.3, random_state=42)

print(f'\nDataset: {len(train_df)} train samples, {len(test_df)} test samples')
print('\n⚠️ WARNING: WRONG APPROACH (Data Leakage) ⚠️')
print('-' * 70)

# WRONG: Computing statistics on full dataset before split
wrong_target_means = df_full.groupby('city')['price'].mean()
train_df_wrong = train_df.copy()
test_df_wrong = test_df.copy()
train_df_wrong['city_encoded_WRONG'] = train_df_wrong['city'].map(wrong_target_means)
test_df_wrong['city_encoded_WRONG'] = test_df_wrong['city'].map(wrong_target_means)

print('Encoding computed on FULL dataset (includes test data!):')
print(wrong_target_means)
print('\n❌ Problem: Test set statistics leaked into training!')
print('❌ Model will appear better than it actually is')
print('❌ Performance will drop significantly in production')

print('\n✅ CORRECT APPROACH (No Leakage) ✅')
print('-' * 70)

# RIGHT: Computing statistics ONLY on training data
correct_target_means = train_df.groupby('city')['price'].mean()
train_df_correct = train_df.copy()
test_df_correct = test_df.copy()

# Apply to training data
train_df_correct['city_encoded'] = train_df_correct['city'].map(correct_target_means)

# Apply to test data with fallback for unseen categories
global_mean = train_df['price'].mean()
test_df_correct['city_encoded'] = test_df_correct['city'].map(correct_target_means).fillna(global_mean)

print('Encoding computed ONLY on training data:')
print(correct_target_means)
print('\n✅ Training data uses its own statistics')
print('✅ Test data uses training statistics (as it should)')
print('✅ Unseen categories filled with global mean')
print('✅ No information from test set leaked!')

# Show comparison
print('\n' + '='*70)
print('COMPARISON: Wrong vs Correct Encoding Values')
print('='*70)
comparison = pd.DataFrame({
    'City': ['NYC', 'LA', 'Chicago', 'Boston'],
    'Wrong (full data)': [wrong_target_means.get(c, 0) for c in ['NYC', 'LA', 'Chicago', 'Boston']],
    'Correct (train only)': [correct_target_means.get(c, 0) for c in ['NYC', 'LA', 'Chicago', 'Boston']],
    'Difference': [abs(wrong_target_means.get(c, 0) - correct_target_means.get(c, 0)) 
                   for c in ['NYC', 'LA', 'Chicago', 'Boston']]
})
print(comparison.to_string(index=False))

print('\n' + '='*70)
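print('BEST PRACTICE: Use category_encoders TargetEncoder, fit on training data only')
print('='*70)

# Using TargetEncoder from category_encoders (smooths category means toward the global mean;
# it does not cross-fit internally, so pair it with cross-validation when needed)
from category_encoders import TargetEncoder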

encoder = TargetEncoder(cols=['city'], smoothing=1.0)
train_df_sklearn = train_df[['city', 'price']].copy()
test_df_sklearn = test_df[['city', 'price']].copy()

# Fit on training data only
encoder.fit(train_df_sklearn['city'], train_df_sklearn['price'])

# Transform both sets
train_df_sklearn['city_encoded'] = encoder.transform(train_df_sklearn['city'])
test_df_sklearn['city_encoded'] = encoder.transform(test_df_sklearn['city'])

print('\n✅ TargetEncoder handles:')
print('  • Fitting only on training data')
print('  • Smoothing to prevent overfitting')
print('  • Unseen categories automatically')
print('  • Can be combined with cross-validation to reduce overfitting')

print('\n🎯 KEY TAKEAWAY:')
print('  ALWAYS fit encoders on training data ONLY!')
print('  NEVER use test set statistics during training!')
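Even when the encoding is computed on training data only, each training row still sees its own target value inside its category mean. One common way to reduce that is out-of-fold encoding, sketched below. The helper `oof_target_encode` and the frame `demo_train` are hypothetical names written for this illustration. This is not the notebook's method and not how any particular library implements target encoding.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(train, col, target, n_splits=5, seed=42):
    """Return out-of-fold target-mean encodings for train[col] (illustrative helper)."""
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train):
        fit_part = train.iloc[fit_idx]
        # Category means come only from the rows outside the current fold
        fold_means = fit_part.groupby(col)[target].mean()
        enc_values = train.iloc[enc_idx][col].map(fold_means)
        # Categories absent from the fitting rows fall back to their overall mean
        enc_values = enc_values.fillna(fit_part[target].mean())
        encoded.iloc[enc_idx] = enc_values.to_numpy()
    return encoded


# Hypothetical training data shaped like the city/price example
rng = np.random.default_rng(0)
demo_train = pd.DataFrame({
    'city': rng.choice(['NYC', 'LA', 'Chicago', 'Boston'], 70),
    'price': rng.integers(200, 600, 70),
})
demo_train['city_oof'] = oof_target_encode(demo_train, 'city', 'price')
print(demo_train.head())
```

For the test set you would still use means computed on the full training set, exactly as in the correct approach above.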

## 3. Numerical Transformations

**Common Transformations:**

**A. Log Transform:** For skewed data
- $x' = \log(x + 1)$
- Makes distribution more Gaussian

**B. Square Root:** For count data
- $x' = \sqrt{x}$

**C. Box-Cox:** Automatic optimal transformation

**D. Binning:** Convert continuous to categorical
- Equal-width bins
- Equal-frequency bins (quantiles)
- Custom thresholds

In [None]:
# Log transformation for skewed data
skewed_data = np.exp(np.random.normal(0, 1, 1000))  # Highly skewed

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.hist(skewed_data, bins=50)
ax1.set_title('Original (Skewed)', fontweight='bold')
ax1.set_xlabel('Value')

log_transformed = np.log1p(skewed_data)
ax2.hist(log_transformed, bins=50)
ax2.set_title('Log Transformed (More Gaussian)', fontweight='bold')
ax2.set_xlabel('Log(Value + 1)')

plt.tight_layout()
plt.show()

print(f'Original skewness: {pd.Series(skewed_data).skew():.2f}')
print(f'Log-transformed skewness: {pd.Series(log_transformed).skew():.2f}')
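The three binning variants from the list of transformations can be sketched with pandas as follows. The `ages` values and the child/adult/senior cut-offs are hypothetical, chosen only to make the bins easy to read.

```python
import numpy as np
import pandas as pd

# Hypothetical ages for illustration
ages = pd.Series([3, 17, 25, 34, 48, 52, 67, 81], name='age')

# Equal-width: 4 bins that each span the same range of values
equal_width = pd.cut(ages, bins=4)

# Equal-frequency: 4 bins that each hold roughly the same number of rows
equal_freq = pd.qcut(ages, q=4)

# Custom thresholds with readable labels (intervals are right-inclusive by default)
custom = pd.cut(ages, bins=[0, 18, 65, np.inf], labels=['child', 'adult', 'senior'])

print(pd.DataFrame({
    'age': ages,
    'equal_width': equal_width,
    'equal_freq': equal_freq,
    'custom': custom,
}))
```

Equal-width bins are simple but can end up nearly empty on skewed data, while quantile bins keep the counts balanced at the cost of uneven bin widths.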

## 4. Feature Scaling

**Why Scale?** Many algorithms (KNN, SVM, Neural Networks) are sensitive to feature magnitude.

**Methods:**

**A. Standardization (Z-score):**
- $x' = \frac{x - \mu}{\sigma}$
- Mean=0, Std=1
- Use when data is Gaussian

**B. Min-Max Normalization:**
- $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$
- Range: [0, 1]
- Use when need bounded range

**C. Robust Scaling:**
- Uses median and IQR (robust to outliers)

In [None]:
# Comparison of scaling methods
data = np.array([[1, 2000], [2, 3000], [3, 5000], [4, 10000], [5, 2500]])
df_scale = pd.DataFrame(data, columns=['feature1', 'feature2'])

print('Original data:')
print(df_scale)

# StandardScaler
scaler_std = StandardScaler()
df_std = pd.DataFrame(scaler_std.fit_transform(df_scale), columns=['f1_std', 'f2_std'])

# MinMaxScaler
scaler_mm = MinMaxScaler()
df_mm = pd.DataFrame(scaler_mm.fit_transform(df_scale), columns=['f1_mm', 'f2_mm'])

print('\nStandardized:')
print(df_std.head())
print('\nMin-Max Normalized:')
print(df_mm.head())
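Robust scaling is listed among the methods but not shown in the comparison, so here is a small sketch using scikit-learn's `RobustScaler`. The `incomes` array is hypothetical and contains one deliberate outlier to show why the median and IQR are useful.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical incomes with one extreme outlier
incomes = np.array([[30000], [35000], [40000], [45000], [1000000]], dtype=float)

# RobustScaler centers on the median and divides by the interquartile range
robust = RobustScaler().fit_transform(incomes)

# StandardScaler uses mean and standard deviation, both pulled by the outlier
standard = StandardScaler().fit_transform(incomes)

print('Robust  :', robust.ravel().round(2))
print('Standard:', standard.ravel().round(2))
```

With robust scaling the four typical incomes stay spread between -1 and 0.5, whereas standardization squeezes them together because the outlier inflates the standard deviation. As with encoders, a scaler should be fitted on training data and then applied to test data.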

## 5. Interaction Features

**Idea:** Combine features to capture relationships

**Examples:**
- Price × Quantity = Total Revenue
- Weight / Height² = BMI
- Day × Hour = Time of day patterns

**Polynomial Features:**
- Degree 2: $[x_1, x_2] \to [x_1, x_2, x_1^2, x_1 x_2, x_2^2]$
- Captures non-linear relationships

In [None]:
# Polynomial features
X = np.array([[2, 3], [4, 5]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print('Original features: [x1, x2]')
print(X)
print('\nPolynomial features (degree=2): [x1, x2, x1², x1*x2, x2²]')
print(X_poly)
print(f'\nFeature names: {poly.get_feature_names_out()}')
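Domain-driven interactions are usually written by hand rather than generated. The sketch below builds a product feature and a ratio feature. The frames `orders` and `people` and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical order data
orders = pd.DataFrame({
    'price': [10.0, 25.0, 4.0],
    'quantity': [3, 1, 10],
})

# Product interaction: revenue depends on both columns jointly
orders['revenue'] = orders['price'] * orders['quantity']

# Hypothetical body measurements
people = pd.DataFrame({'weight_kg': [70.0, 90.0], 'height_m': [1.75, 1.80]})

# Ratio interaction: BMI is weight divided by height squared
people['bmi'] = people['weight_kg'] / people['height_m'] ** 2

print(orders)
print(people.round(2))
```

A hand-written interaction like this adds one meaningful column, whereas a degree-2 polynomial expansion adds every pairwise product whether or not it makes sense for the domain.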

## 6. Time-Based Features

**From timestamps, extract:**
- Hour, day of week, month, quarter, year
- Is_weekend, is_holiday
- Days since epoch
- Cyclical encoding (sin/cos for hour/month)

**Cyclical Encoding:**
- Hour 23 and Hour 0 are close, but numerically 23 apart
- Solution: $hour\_sin = \sin(2\pi \times hour / 24)$
- $hour\_cos = \cos(2\pi \times hour / 24)$

In [None]:
# Time-based feature extraction
# Lowercase 'h' is the hourly alias; uppercase 'H' is deprecated in recent pandas
dates = pd.date_range('2024-01-01', periods=5, freq='12h')
df_time = pd.DataFrame({'timestamp': dates})

# Extract features
df_time['hour'] = df_time['timestamp'].dt.hour
df_time['day_of_week'] = df_time['timestamp'].dt.dayofweek
df_time['is_weekend'] = (df_time['day_of_week'] >= 5).astype(int)

# Cyclical encoding for hour
df_time['hour_sin'] = np.sin(2 * np.pi * df_time['hour'] / 24)
df_time['hour_cos'] = np.cos(2 * np.pi * df_time['hour'] / 24)

print(df_time)
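To check that the sin/cos encoding really places hour 23 next to hour 0, you can compare Euclidean distances in the encoded space. The helpers `encode_hour` and `distance` are hypothetical names written for this check.

```python
import numpy as np


def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])


def distance(h1, h2):
    """Euclidean distance between two hours in the encoded space."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))


print(f'23 vs 0 : {distance(23, 0):.3f}')
print(f'23 vs 22: {distance(23, 22):.3f}')
print(f'0 vs 12 : {distance(0, 12):.3f}')
```

Any two adjacent hours are the same distance apart, including the pair that wraps around midnight, and hours 12 apart sit on opposite sides of the circle. Both columns are needed because sine alone maps two different hours to the same value.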

## 7. Automated Feature Engineering

Automated feature engineering tools can generate hundreds of features from your raw data, but they require careful application to avoid overfitting.

### Popular Tools

**Featuretools** - Deep Feature Synthesis
- Automatically creates features from relational datasets
- Handles multi-table databases
- Generates aggregation and transformation features

**tsfresh** - Time Series Features
- Extracts 700+ features from time series data
- Built-in feature selection
- Statistical and signal processing features

**AutoFeat** - Linear Models
- Generates non-linear features for linear models
- Automatic feature selection
- Particularly good for regression problems

### Why We Don't Cover Them In-Depth Here

The manual feature engineering techniques you've learned in this notebook (encoding, scaling, transformations, interactions, time features) form the foundation that you MUST understand before using automated tools. Automated tools should augment your domain knowledge, not replace it.

### Learning Resources

If you want to explore automated feature engineering:

**Featuretools:**
- Official Tutorial: https://docs.featuretools.com/en/stable/getting_started/getting_started.html
- GitHub: https://github.com/alteryx/featuretools
- Best for: Relational databases, multi-table data

**Example Workflow:**
```python
# Install: pip install featuretools
import featuretools as ft

# Create entity set
es = ft.EntitySet(id='my_data')
es = es.add_dataframe(dataframe_name='transactions', 
                      dataframe=df, 
                      index='transaction_id',
                      time_index='timestamp')

# Generate features automatically
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='transactions',
    max_depth=2,  # How many levels of features to create
    verbose=True
)
```

### Best Practices for Automated Feature Engineering

1. **Start Manual** - Create domain-knowledge features first
2. **Understand Output** - Review generated features, don't blindly use all
3. **Feature Selection** - Most generated features won't be useful
4. **Cross-Validation** - Essential to avoid overfitting with many features
5. **Computation Cost** - Can be slow on large datasets
6. **Interpretability** - Complex generated features may be hard to explain
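Points 3 and 4 can be combined by putting feature selection inside a pipeline, so that selection is redone on each cross-validation fold. The sketch below uses synthetic data from `make_regression` as a stand-in for a wide, auto-generated feature matrix. The names `X_wide`, `y_wide`, and `pipe` and the choice of `k=10` are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a wide generated feature matrix:
# 100 features, only 5 of which carry signal
X_wide, y_wide = make_regression(
    n_samples=200, n_features=100, n_informative=5, noise=10.0, random_state=0
)

# Selection lives inside the pipeline, so each CV fold picks its
# features from that fold's training rows only
pipe = Pipeline([
    ('select', SelectKBest(score_func=f_regression, k=10)),
    ('model', LinearRegression()),
])

scores = cross_val_score(pipe, X_wide, y_wide, cv=5, scoring='r2')
print(f'Mean CV R^2: {scores.mean():.3f}')
```

Selecting features on the full dataset before cross-validating would be another form of the leakage described in the target encoding section.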

### When to Use Automated Tools

✅ **Good for:**
- Exploration phase of a project
- Finding non-obvious feature interactions
- Kaggle competitions where performance matters most
- Large relational databases

❌ **Not ideal for:**
- Production systems requiring interpretability
- When you have limited data (risk of overfitting)
- Real-time systems (computation overhead)
- When domain expertise is critical

### The Bottom Line

**Master manual feature engineering first** (this notebook), then explore automated tools to augment your work. The techniques you've learned here - handling missing data, encoding categories correctly, avoiding data leakage, scaling features, and creating interactions - are fundamental skills that will serve you in every machine learning project.

## Conclusion

**Feature Engineering Checklist:**

**Data Cleaning:**
- ☑ Handle missing values
- ☑ Remove duplicates
- ☑ Fix data types

**Encoding:**
- ☑ One-hot encode nominal categories (<10 levels)
- ☑ Target encode high-cardinality (>10 levels)
- ☑ Label encode ordinal categories

**Transformations:**
- ☑ Log transform skewed features
- ☑ Scale features for distance-based models
- ☑ Create polynomial/interaction features

**Domain-Specific:**
- ☑ Extract time-based features
- ☑ Create ratios and aggregations
- ☑ Add domain knowledge features

**Validation:**
- ☑ Check for data leakage
- ☑ Validate with cross-validation
- ☑ Analyze feature importance

**Remember:** Good features > Complex models!