# Feature Engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to predictive models, with the goal of improving model accuracy on unseen data. It is one of the most important steps in the machine learning pipeline.

In this notebook, we'll explore several feature engineering techniques that can improve your model's performance.

## Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

## Sample Dataset

Let's create a sample dataset to demonstrate various feature engineering techniques:

In [None]:
# Create a sample dataset
np.random.seed(42)
n_samples = 1000

# Generate synthetic data
age = np.random.randint(18, 80, n_samples)
income = np.random.normal(50000, 15000, n_samples)
experience = np.random.randint(0, 40, n_samples)
education_years = np.random.randint(8, 20, n_samples)

# Create some relationships
purchase_amount = (
    0.5 * age +
    0.0001 * income +
    2 * experience +
    10 * education_years +
    np.random.normal(0, 100, n_samples)  # noise
)

# Ensure positive values
purchase_amount = np.abs(purchase_amount)

# Create categorical features
cities = np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], n_samples)
membership = np.random.choice(['Basic', 'Premium', 'VIP'], n_samples, p=[0.6, 0.3, 0.1])

# Create DataFrame
data = {
    'age': age,
    'income': income,
    'experience': experience,
    'education_years': education_years,
    'city': cities,
    'membership': membership,
    'purchase_amount': purchase_amount
}

df = pd.DataFrame(data)
print("Original Dataset:")
print(df.head(10))
print("\nDataset Shape:", df.shape)
print("\nDataset Info:")
df.info()  # info() prints its report directly and returns None, so don't wrap it in print()

## 1. Mathematical Transformations

Mathematical transformations can help normalize distributions, stabilize variance, or create new meaningful features.

**Common transformations:**
- Logarithmic transformation
- Square root transformation
- Box-Cox transformation
- Polynomial features
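
Box-Cox is the one transformation in this list that needs a fitted parameter. The following is a minimal sketch, assuming SciPy is installed and using a synthetic right-skewed sample (the variable names are illustrative and not part of this notebook's dataset). `scipy.stats.boxcox` estimates the power parameter by maximum likelihood and requires strictly positive input.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, right-skewed sample (illustrative only)
rng = np.random.default_rng(0)
skewed_values = rng.lognormal(mean=10.0, sigma=0.8, size=1000)

# boxcox returns the transformed data and the fitted lambda
boxcox_values, fitted_lambda = stats.boxcox(skewed_values)

# Compare skewness before and after the transformation
skew_before = stats.skew(skewed_values)
skew_after = stats.skew(boxcox_values)
print(f"Fitted lambda: {fitted_lambda:.3f}")
print(f"Skewness before: {skew_before:.3f}, after: {skew_after:.3f}")
```

In a modeling workflow the lambda should be estimated on the training data only and then reused for new data.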

In [None]:
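# Logarithmic transformation
# Income is drawn from a normal distribution, so rare negative values are possible;
# clip at 0 so that log1p and sqrt stay defined instead of producing NaN
income_nonneg = df['income'].clip(lower=0)
df['log_income'] = np.log1p(income_nonneg)  # log(1+x) handles zero values

# Square root transformation
df['sqrt_income'] = np.sqrt(income_nonneg)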

# Polynomial features (example with age)
df['age_squared'] = df['age'] ** 2
df['age_cubed'] = df['age'] ** 3

print("Dataset after mathematical transformations:")
print(df[['income', 'log_income', 'sqrt_income', 'age', 'age_squared', 'age_cubed']].head())

In [None]:
# Visualize the effect of transformations
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Original income distribution
axes[0, 0].hist(df['income'], bins=30, alpha=0.7)
axes[0, 0].set_title('Original Income Distribution')
axes[0, 0].set_xlabel('Income')

# Log-transformed income
axes[0, 1].hist(df['log_income'], bins=30, alpha=0.7, color='orange')
axes[0, 1].set_title('Log-Transformed Income Distribution')
axes[0, 1].set_xlabel('Log(Income)')

# Square root transformed income
axes[1, 0].hist(df['sqrt_income'], bins=30, alpha=0.7, color='green')
axes[1, 0].set_title('Square Root Transformed Income')
axes[1, 0].set_xlabel('Sqrt(Income)')

# Age squared
axes[1, 1].hist(df['age_squared'], bins=30, alpha=0.7, color='red')
axes[1, 1].set_title('Age Squared Distribution')
axes[1, 1].set_xlabel('Age²')

plt.tight_layout()
plt.show()

## 2. Binning (Discretization)

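Binning converts continuous variables into categorical ones by grouping values into bins. This can help capture non-linear relationships, since a model can then assign each bin its own effect, at the cost of discarding the variation within a bin.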

**Types of binning:**
- Equal-width binning
- Equal-frequency binning
- Custom binning
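
The difference between equal-width and equal-frequency binning is easiest to see on skewed data. The sketch below uses a synthetic exponential sample (an assumption for illustration, not this notebook's dataset) and `retbins=True` to return the learned bin edges alongside the bin codes.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values (illustrative only)
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=100.0, size=1000))

# Equal-width: 4 bins that each span the same range of values
width_codes, width_edges = pd.cut(values, bins=4, labels=False, retbins=True)

# Equal-frequency: 4 bins that each hold the same number of rows
freq_codes, freq_edges = pd.qcut(values, q=4, labels=False, retbins=True)

width_counts = width_codes.value_counts().sort_index()
freq_counts = freq_codes.value_counts().sort_index()
print("Equal-width counts per bin:")
print(width_counts)
print("Equal-frequency counts per bin:")
print(freq_counts)
```

With skewed data, equal-width bins put most rows in the first bin, while quantile bins stay balanced. The returned edges can be passed back to `pd.cut` to bin new data consistently.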

In [None]:
# Equal-width binning
df['age_binned_equal_width'] = pd.cut(df['age'], bins=5, labels=['Very Young', 'Young', 'Middle', 'Old', 'Very Old'])

# Equal-frequency binning (quantile binning)
df['income_binned_quantile'] = pd.qcut(df['income'], q=5, labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5'])

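# Custom binning
# Left-closed intervals ([30000, 50000), ...) with an open lower bound, so zero or
# negative incomes fall into 'Low' instead of becoming NaN
income_bins = [-np.inf, 30000, 50000, 70000, 100000, np.inf]
income_labels = ['Low', 'Lower-Middle', 'Middle', 'Upper-Middle', 'High']
df['income_binned_custom'] = pd.cut(df['income'], bins=income_bins, labels=income_labels, right=False)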

print("Binned features:")
print(df[['age', 'age_binned_equal_width', 'income', 'income_binned_quantile', 'income_binned_custom']].head(10))

In [None]:
# Visualize binning results
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Age distribution with equal-width bins
df['age_binned_equal_width'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color='skyblue')  # sort_index keeps bins in order
axes[0].set_title('Age Distribution (Equal-Width Bins)')
axes[0].tick_params(axis='x', rotation=45)

# Income distribution with quantile bins
df['income_binned_quantile'].value_counts().sort_index().plot(kind='bar', ax=axes[1], color='lightgreen')  # sort_index keeps bins in order
axes[1].set_title('Income Distribution (Quantile Bins)')
axes[1].tick_params(axis='x', rotation=45)

# Income distribution with custom bins
df['income_binned_custom'].value_counts().sort_index().plot(kind='bar', ax=axes[2], color='salmon')  # sort_index keeps bins in order
axes[2].set_title('Income Distribution (Custom Bins)')
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 3. Interaction Features

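Interaction features combine two or more features to capture relationships between them. They let a model represent effects that depend on a combination of inputs, which an additive model cannot express from the individual features alone.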

**Common interactions:**
- Multiplication of features
- Ratio features
- Polynomial combinations
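
To see why a multiplicative interaction can matter, here is a small sketch on synthetic data (an assumption for illustration, separate from this notebook's dataset) where the target depends on the product of two features. A linear model sees almost no signal until the product is supplied as its own column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the target is the product of two features plus small noise
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 500)
x2 = rng.uniform(-1, 1, 500)
y = x1 * x2 + rng.normal(0, 0.01, 500)

X_plain = np.column_stack([x1, x2])           # original features only
X_inter = np.column_stack([x1, x2, x1 * x2])  # plus the interaction column

# Use the same train/test rows for both feature sets
idx_train, idx_test = train_test_split(np.arange(500), test_size=0.2, random_state=0)

r2_plain = LinearRegression().fit(X_plain[idx_train], y[idx_train]).score(X_plain[idx_test], y[idx_test])
r2_inter = LinearRegression().fit(X_inter[idx_train], y[idx_train]).score(X_inter[idx_test], y[idx_test])
print(f"R² without interaction: {r2_plain:.3f}")
print(f"R² with interaction:    {r2_inter:.3f}")
```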

In [None]:
# Multiplication interactions
df['income_age_interaction'] = df['income'] * df['age']
df['experience_education_interaction'] = df['experience'] * df['education_years']

# Ratio features
df['income_per_experience'] = df['income'] / (df['experience'] + 1)  # Adding 1 to avoid division by zero
df['education_per_age'] = df['education_years'] / df['age']

# Polynomial features using sklearn
poly_features = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
features_for_poly = df[['age', 'income', 'experience']].head(5)  # Using subset for demonstration
poly_features_matrix = poly_features.fit_transform(features_for_poly)

print("Original features:")
print(features_for_poly)
print("\nPolynomial interaction features (degree=2, interaction only):")
print(poly_features_matrix)
print("\nFeature names:")
print(poly_features.get_feature_names_out(['age', 'income', 'experience']))

## 4. Date and Time Features

When working with temporal data, extracting meaningful features from dates and times can significantly improve model performance.

**Common date/time features:**
- Year, month, day, hour
- Day of week, weekend indicator
- Seasonal indicators
- Time since a reference point
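
Two items in this list, seasonal indicators and time since a reference point, can be sketched as follows. The reference date and the month-to-season mapping (Northern Hemisphere meteorological seasons) are assumptions chosen for illustration.

```python
import pandas as pd

dates = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-04-15', '2020-07-01', '2020-10-31'])
})

# Time since a reference point, in whole days (hypothetical reference date)
reference_date = pd.Timestamp('2020-01-01')
dates['days_since_reference'] = (dates['date'] - reference_date).dt.days

# Seasonal indicator: Dec-Feb -> Winter, Mar-May -> Spring,
# Jun-Aug -> Summer, Sep-Nov -> Autumn
season_names = {0: 'Winter', 1: 'Spring', 2: 'Summer', 3: 'Autumn'}
dates['season'] = ((dates['date'].dt.month % 12) // 3).map(season_names)

print(dates)
```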

In [None]:
# Create sample date data
date_range = pd.date_range(start='2020-01-01', periods=n_samples, freq='D')
df_dates = pd.DataFrame({'date': date_range})

# Extract date components
df_dates['year'] = df_dates['date'].dt.year
df_dates['month'] = df_dates['date'].dt.month
df_dates['day'] = df_dates['date'].dt.day
df_dates['day_of_week'] = df_dates['date'].dt.dayofweek
df_dates['day_name'] = df_dates['date'].dt.day_name()
df_dates['quarter'] = df_dates['date'].dt.quarter
df_dates['is_weekend'] = df_dates['day_of_week'].isin([5, 6]).astype(int)  # 5=Saturday, 6=Sunday

print("Date feature extraction:")
print(df_dates.head(10))

## 5. Domain-Specific Features

Domain knowledge often suggests features that no generic transformation would produce, because they encode how quantities in the problem domain relate to each other.

**Examples:**
- BMI from height and weight
- Customer lifetime value
- Price-to-income ratio
- Efficiency metrics
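
As a concrete instance of the first example, BMI is weight in kilograms divided by the square of height in meters. The small table below is made-up data, and the column names are hypothetical.

```python
import pandas as pd

# Made-up measurements for illustration
people = pd.DataFrame({
    'height_cm': [175.0, 160.0, 190.0],
    'weight_kg': [70.0, 80.0, 65.0],
})

# BMI = weight (kg) / height (m)^2
height_m = people['height_cm'] / 100
people['bmi'] = people['weight_kg'] / height_m ** 2

print(people.round(2))
```

The ratio carries information that neither height nor weight carries alone, which is what makes it a useful engineered feature.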

In [None]:
# Domain-specific features based on our dataset

# Income efficiency (income per year of education)
df['income_efficiency'] = df['income'] / df['education_years']

# Experience ratio (experience relative to age)
df['experience_ratio'] = df['experience'] / df['age']

# Income category based on domain knowledge
def categorize_income(income):
    if income < 30000:
        return 'Low'
    elif income < 50000:
        return 'Lower-Middle'
    elif income < 70000:
        return 'Middle'
    elif income < 100000:
        return 'Upper-Middle'
    else:
        return 'High'

df['income_category'] = df['income'].apply(categorize_income)

print("Domain-specific features:")
print(df[['income', 'education_years', 'income_efficiency', 'experience', 'age', 'experience_ratio', 'income_category']].head(10))

## 6. Feature Scaling

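Many algorithms are sensitive to feature scale: distance-based methods (such as k-nearest neighbors and SVMs), gradient-based optimizers, and regularized linear models all behave better when features share a similar range. Tree-based models are largely unaffected. Different scaling techniques serve different purposes.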

**Common scaling methods:**
- Standardization (Z-score normalization)
- Min-Max scaling
- Robust scaling
- Unit vector scaling
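
Robust scaling and unit vector scaling can be sketched with scikit-learn's `RobustScaler` and `Normalizer`. The tiny array below is made up and contains one outlier so that the effect of robust scaling is visible.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, RobustScaler

# Made-up data; the first column contains an outlier (100)
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
    [100.0, 50.0],
])

# Robust scaling: (x - median) / IQR, computed per column
robust_scaled = RobustScaler().fit_transform(X)

# Unit vector scaling: each row is divided by its own L2 norm
unit_rows = Normalizer(norm='l2').fit_transform(X)
row_norms = np.linalg.norm(unit_rows, axis=1)

print("Robust-scaled:\n", robust_scaled)
print("Row norms after Normalizer:", row_norms)
```

Note that `Normalizer` works per sample (row), while `StandardScaler`, `MinMaxScaler`, and `RobustScaler` work per feature (column).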

In [None]:
# Select numerical features for scaling
numerical_features = ['age', 'income', 'experience', 'education_years']
df_numerical = df[numerical_features].copy()

# Standardization (Z-score normalization)
scaler_standard = StandardScaler()
df_scaled_standard = scaler_standard.fit_transform(df_numerical)
df_scaled_standard = pd.DataFrame(df_scaled_standard, columns=[f'{col}_standard' for col in numerical_features])

# Min-Max scaling
from sklearn.preprocessing import MinMaxScaler
scaler_minmax = MinMaxScaler()
df_scaled_minmax = scaler_minmax.fit_transform(df_numerical)
df_scaled_minmax = pd.DataFrame(df_scaled_minmax, columns=[f'{col}_minmax' for col in numerical_features])

print("Original numerical features:")
print(df_numerical.describe())
print("\nStandardized features (mean≈0, std≈1):")
print(df_scaled_standard.describe())
print("\nMin-Max scaled features (range 0-1):")
print(df_scaled_minmax.describe())

## 7. Feature Selection

Not all features contribute equally to model performance. Feature selection helps identify the most relevant features.

**Common methods:**
- Correlation analysis
- Variance threshold
- Recursive feature elimination
- Feature importance from tree-based models
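
Variance thresholding and recursive feature elimination can be sketched on synthetic data as follows. The data is made up for illustration: two informative columns, two pure-noise columns, and one constant column.

```python
import numpy as np
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_rows = 500
X = np.column_stack([
    rng.normal(size=n_rows),  # informative
    rng.normal(size=n_rows),  # informative
    rng.normal(size=n_rows),  # noise
    rng.normal(size=n_rows),  # noise
    np.ones(n_rows),          # constant, carries no information
])
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, n_rows)

# Step 1: drop zero-variance (constant) columns
variance_filter = VarianceThreshold(threshold=0.0)
X_nonconstant = variance_filter.fit_transform(X)
kept_after_variance = variance_filter.get_support()

# Step 2: recursively drop the weakest feature until two remain
rfe = RFE(LinearRegression(), n_features_to_select=2)
rfe.fit(X_nonconstant, y)
kept_after_rfe = rfe.support_

print("Kept by variance threshold:", kept_after_variance)
print("Kept by RFE:", kept_after_rfe)
```

With a linear estimator, RFE ranks features by coefficient magnitude, so features should be on comparable scales before it is applied.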

In [None]:
# Correlation analysis
correlation_matrix = df[['age', 'income', 'experience', 'education_years', 'purchase_amount']].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()

print("Correlation with target variable (purchase_amount):")
target_corr = correlation_matrix['purchase_amount'].drop('purchase_amount').sort_values(key=abs, ascending=False)
print(target_corr)

## 8. Putting It All Together: A Complete Example

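Let's compare a baseline model against one trained on engineered features. Because `purchase_amount` was generated as an essentially linear, noisy function of the four original features, the engineered features may add little here, or even hurt slightly; the point is the workflow for measuring their effect: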

In [None]:
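# Create a complete example with feature engineering

# Prepare the dataset: start from the original columns only, so that the
# categorical bin columns and other features added to df in earlier cells
# do not reach the model unencoded
base_columns = ['age', 'income', 'experience', 'education_years', 'city', 'membership', 'purchase_amount']
df_model = df[base_columns].copy()

# Apply feature engineering
# 1. Mathematical transformations
df_model['log_income'] = np.log1p(df_model['income'].clip(lower=0))  # clip guards against rare negative incomes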
df_model['age_squared'] = df_model['age'] ** 2

# 2. Binning
df_model['income_category'] = df_model['income'].apply(categorize_income)

# 3. Interaction features
df_model['income_age_interaction'] = df_model['income'] * df_model['age']
df_model['experience_education_interaction'] = df_model['experience'] * df_model['education_years']

# 4. Domain-specific features
df_model['income_efficiency'] = df_model['income'] / df_model['education_years']
df_model['experience_ratio'] = df_model['experience'] / df_model['age']

# Encode categorical variables
df_model = pd.get_dummies(df_model, columns=['city', 'membership', 'income_category'], prefix=['city', 'membership', 'income_cat'])

# Split the data
features = [col for col in df_model.columns if col != 'purchase_amount']
X = df_model[features]
y = df_model['purchase_amount']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train models with and without feature engineering
# Baseline model (original features only)
original_features = ['age', 'income', 'experience', 'education_years']
X_train_baseline = X_train[original_features]
X_test_baseline = X_test[original_features]

model_baseline = LinearRegression()
model_baseline.fit(X_train_baseline, y_train)
y_pred_baseline = model_baseline.predict(X_test_baseline)

# Enhanced model (with feature engineering)
model_enhanced = LinearRegression()
model_enhanced.fit(X_train, y_train)
y_pred_enhanced = model_enhanced.predict(X_test)

# Compare performance
mse_baseline = mean_squared_error(y_test, y_pred_baseline)
mse_enhanced = mean_squared_error(y_test, y_pred_enhanced)

r2_baseline = r2_score(y_test, y_pred_baseline)
r2_enhanced = r2_score(y_test, y_pred_enhanced)

print("Model Performance Comparison:")
print(f"Baseline Model (Original Features Only):")
print(f"  MSE: {mse_baseline:.2f}")
print(f"  R²: {r2_baseline:.4f}")
print(f"\nEnhanced Model (With Feature Engineering):")
print(f"  MSE: {mse_enhanced:.2f}")
print(f"  R²: {r2_enhanced:.4f}")
print(f"\nChange in R² (enhanced - baseline): {r2_enhanced - r2_baseline:.4f}")

## 9. Using Pipelines for Feature Engineering

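Pipelines bundle preprocessing and modeling steps into a single estimator, which makes the process reproducible. Because every transformer is fit only on the data passed to `fit`, they also help prevent data leakage from the test set into preprocessing statistics.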
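
The leakage point can be checked directly. In this minimal sketch on synthetic data (made up for illustration), the scaler inside the pipeline ends up with the means of the training split, not of the full dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(0, 1.0, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression()),
])
pipeline.fit(X_train, y_train)  # the scaler sees only the training rows

fitted_means = pipeline.named_steps['scaler'].mean_
print("Means stored in the scaler:", fitted_means)
print("Training-set means:        ", X_train.mean(axis=0))
print("Full-dataset means:        ", X.mean(axis=0))

test_r2 = pipeline.score(X_test, y_test)
print(f"Test R²: {test_r2:.4f}")
```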

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define feature engineering pipeline
# Numerical features pipeline
numerical_pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# Categorical features pipeline
categorical_pipeline = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, ['age', 'income', 'experience', 'education_years']),
        ('cat', categorical_pipeline, ['city', 'membership'])
    ]
)

# Complete pipeline
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Prepare data for pipeline
X_pipeline = df[['age', 'income', 'experience', 'education_years', 'city', 'membership']]
y_pipeline = df['purchase_amount']

X_train_pipe, X_test_pipe, y_train_pipe, y_test_pipe = train_test_split(
    X_pipeline, y_pipeline, test_size=0.2, random_state=42
)

# Fit and evaluate pipeline
full_pipeline.fit(X_train_pipe, y_train_pipe)
y_pred_pipe = full_pipeline.predict(X_test_pipe)

mse_pipeline = mean_squared_error(y_test_pipe, y_pred_pipe)
r2_pipeline = r2_score(y_test_pipe, y_pred_pipe)

print("Pipeline Model Performance:")
print(f"  MSE: {mse_pipeline:.2f}")
print(f"  R²: {r2_pipeline:.4f}")

## Feature Engineering Best Practices

| Practice | Description | Benefits |
|----------|-------------|----------|
| Understand your data | Explore distributions, relationships, and anomalies | Better feature creation |
| Use domain knowledge | Leverage expertise to create meaningful features | Higher predictive power |
| Validate your features | Test if new features improve model performance | Avoid useless features |
| Prevent data leakage | Fit transformations on training data only; don't use future or test-set information | Reliable model performance |
| Automate with pipelines | Create reproducible feature engineering workflows | Consistency and efficiency |
| Monitor feature importance | Track which features contribute most | Model interpretability |

**Key Guidelines:**
1. **Start simple**: Begin with basic features and gradually add complexity
2. **Validate everything**: Always test if a new feature improves performance
3. **Avoid overfitting**: Don't create too many features relative to your sample size
4. **Document your process**: Keep track of what works and what doesn't
5. **Think causally**: Consider whether a feature might actually influence the target
6. **Consider computational cost**: Balance feature complexity with model training time
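
The "validate everything" guideline can be sketched as a cross-validated comparison with and without each candidate feature. The data and the helper `mean_cv_r2` below are hypothetical, written only for this illustration: one candidate is useful, the other is pure noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the target depends on x squared (illustrative only)
rng = np.random.default_rng(0)
n_rows = 500
x = rng.normal(size=n_rows)
y = x ** 2 + rng.normal(0, 0.1, n_rows)

X_base = x.reshape(-1, 1)
candidates = {
    'x_squared': x ** 2,                      # matches the true relationship
    'random_noise': rng.normal(size=n_rows),  # unrelated to the target
}

def mean_cv_r2(X, y):
    """Mean 5-fold cross-validated R² of a linear model (hypothetical helper)."""
    return cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2').mean()

base_score = mean_cv_r2(X_base, y)
gains = {}
for name, column in candidates.items():
    X_candidate = np.column_stack([X_base, column])
    gains[name] = mean_cv_r2(X_candidate, y) - base_score
    print(f"{name}: change in mean CV R² = {gains[name]:+.4f}")
```

A feature whose gain is indistinguishable from zero, like the noise column here, is a candidate for removal.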

## Summary

Feature engineering is both an art and a science that requires creativity, domain knowledge, and experimentation. The techniques we've covered include:

1. **Mathematical transformations** to normalize distributions and create polynomial features
2. **Binning** to convert continuous variables into categorical ones
3. **Interaction features** to capture relationships between variables
4. **Date/time features** for temporal data
5. **Domain-specific features** based on expert knowledge
6. **Feature scaling** to prepare data for algorithms
7. **Feature selection** to identify the most relevant features
8. **Pipelines** to automate and organize the process

Remember that feature engineering is an iterative process. Start with simple transformations, validate their impact, and gradually build more complex features. The goal is to create features that help your model better understand the underlying patterns in your data.