# Feature Engineering Basics

**Module 4: Data Cleaning & Transformation**

## Learning Objectives
- Understand what feature engineering is and why it matters
- Create new features from existing data
- Apply binning and discretization techniques
- Encode categorical variables for analysis
- Transform numerical features

## Business Context
> "Feature engineering is where domain knowledge meets data science. The best features tell a story that raw data cannot."

Feature engineering is the process of creating new variables from existing data to improve analysis and modeling. As a data analyst, you can use it to uncover insights that aren't immediately visible in raw data.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
np.random.seed(42)

print("✓ Libraries loaded successfully")

---
## 1. What is Feature Engineering?

**Feature engineering** is the process of using domain knowledge to create new variables (features) that make patterns in data more apparent.

### Types of Feature Engineering

| Type | Description | Example |
|------|-------------|---------|
| **Extraction** | Pull out parts of existing features | Year from date, domain from email |
| **Aggregation** | Combine multiple values | Total spend, average rating |
| **Transformation** | Change feature representation | Log scale, normalization |
| **Binning** | Convert continuous to categorical | Age groups, income brackets |
| **Encoding** | Convert categories to numbers | One-hot, label encoding |
| **Interaction** | Combine features | Price per unit, BMI |
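
As a rough illustration of the table above, the sketch below derives one feature of each type from a tiny, made-up orders table. The column names and values are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
orders = pd.DataFrame({
    'customer': ['a', 'a', 'b', 'b', 'b'],
    'order_date': pd.to_datetime(['2023-01-15', '2023-03-02', '2023-02-10',
                                  '2023-02-11', '2023-06-30']),
    'price': [20.0, 35.0, 5.0, 250.0, 80.0],
    'units': [2, 1, 5, 10, 4],
    'channel': ['web', 'store', 'web', 'web', 'store'],
})

# Extraction: pull the month out of a date
orders['order_month'] = orders['order_date'].dt.month

# Aggregation: total spend per customer, broadcast back to each row
orders['customer_total'] = orders.groupby('customer')['price'].transform('sum')

# Transformation: compress a skewed scale with log(1 + x)
orders['log_price'] = np.log1p(orders['price'])

# Binning: continuous price -> ordered categories
orders['price_band'] = pd.cut(orders['price'], bins=[0, 25, 100, np.inf],
                              labels=['low', 'mid', 'high'])

# Encoding: category -> indicator columns
orders = pd.concat([orders, pd.get_dummies(orders['channel'], prefix='channel')], axis=1)

# Interaction: combine two columns into a ratio
orders['price_per_unit'] = orders['price'] / orders['units']

print(orders)
```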

In [None]:
# Create a realistic e-commerce dataset
np.random.seed(42)
n = 500

customers = pd.DataFrame({
    'customer_id': range(1001, 1001 + n),
    'registration_date': pd.date_range('2020-01-01', periods=n, freq='D'),
    'birth_date': pd.to_datetime('1990-01-01') + pd.to_timedelta(
        np.random.randint(-10000, 10000, n), unit='D'
    ),
    'gender': np.random.choice(['M', 'F', 'Other'], n, p=[0.48, 0.48, 0.04]),
    'city': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], n),
    'total_orders': np.random.poisson(5, n),
    'total_spend': np.random.exponential(500, n),
    'last_order_date': pd.date_range('2023-06-01', periods=n, freq='h'),
    'email': [f'customer_{i}@{np.random.choice(["gmail.com", "yahoo.com", "outlook.com"])}' 
              for i in range(n)],
    'subscription_plan': np.random.choice(['Free', 'Basic', 'Premium', 'Enterprise'], n, 
                                          p=[0.4, 0.3, 0.2, 0.1])
})

print("E-commerce Customer Dataset:")
print(f"Shape: {customers.shape}")
customers.head(10)

---
## 2. Feature Extraction

Create new features by extracting information from existing ones.

### 2.1 Date/Time Features

In [None]:
# Extract date components
customers['registration_year'] = customers['registration_date'].dt.year
customers['registration_month'] = customers['registration_date'].dt.month
customers['registration_quarter'] = customers['registration_date'].dt.quarter
customers['registration_day_of_week'] = customers['registration_date'].dt.dayofweek
customers['registration_day_name'] = customers['registration_date'].dt.day_name()
# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
customers['registered_on_weekend'] = customers['registration_date'].dt.dayofweek >= 5

print("Date Features Extracted:")
print(customers[['registration_date', 'registration_year', 'registration_month', 
                  'registration_quarter', 'registration_day_name', 'registered_on_weekend']].head())

In [None]:
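# Calculate age and customer tenure
# Note: results depend on the run date because 'today' is the current timestamp
today = pd.Timestamp.today()

# Age in whole years (approximate: uses an average year length of 365.25 days)
customers['age'] = ((today - customers['birth_date']).dt.days / 365.25).astype(int)

# Customer tenure (days since registration)
customers['tenure_days'] = (today - customers['registration_date']).dt.days
# Approximate months, assuming 30 days per month
customers['tenure_months'] = (customers['tenure_days'] / 30).round(1)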

# Days since last order
customers['days_since_last_order'] = (today - customers['last_order_date']).dt.days

print("Calculated Time Features:")
print(customers[['customer_id', 'age', 'tenure_days', 'tenure_months', 
                  'days_since_last_order']].head())
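
Because `pd.Timestamp.today()` changes every day, the values above shift on each run. One possible variation, sketched below, fixes a snapshot date and counts completed birthdays instead of dividing by 365.25. The `as_of` date and the sample rows are assumptions made for this example.

```python
import pandas as pd

# Hypothetical snapshot date; in practice this might be the data extract date
as_of = pd.Timestamp('2024-01-01')

people = pd.DataFrame({
    'birth_date': pd.to_datetime(['1990-01-01', '1990-01-02', '2000-06-15']),
    'registration_date': pd.to_datetime(['2020-01-01', '2023-12-31', '2023-07-01']),
})

birth = people['birth_date']

# True where this year's birthday falls on or before the snapshot date
had_birthday = (birth.dt.month < as_of.month) | (
    (birth.dt.month == as_of.month) & (birth.dt.day <= as_of.day)
)

# Age in completed years: year difference, minus one if the birthday is still ahead
people['age'] = as_of.year - birth.dt.year - (~had_birthday).astype(int)

# Tenure in whole days relative to the same snapshot
people['tenure_days'] = (as_of - people['registration_date']).dt.days

print(people)
```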

### 2.2 Text Features

In [None]:
# Extract email domain
customers['email_domain'] = customers['email'].str.split('@').str[1]

# Create email provider category
customers['email_provider'] = customers['email_domain'].map({
    'gmail.com': 'Google',
    'yahoo.com': 'Yahoo',
    'outlook.com': 'Microsoft'
})

print("Email Features:")
print(customers[['email', 'email_domain', 'email_provider']].head())

print("\nEmail Provider Distribution:")
print(customers['email_provider'].value_counts())
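
`Series.map` with a dictionary returns `NaN` for any key that is not listed, so a domain outside the three above would silently become missing. A minimal sketch of one way to guard against that, using made-up addresses and an assumed `'Other'` fallback label:

```python
import pandas as pd

# Hypothetical addresses, including mixed case and an unmapped domain
emails = pd.Series(['ann@Gmail.com', 'bob@yahoo.com', 'cy@example.org'])

provider_map = {'gmail.com': 'Google', 'yahoo.com': 'Yahoo', 'outlook.com': 'Microsoft'}

# Normalize case before looking up the domain
domain = emails.str.split('@').str[1].str.lower()

# Unmapped domains come back as NaN; label them explicitly instead
provider = domain.map(provider_map).fillna('Other')

print(provider.tolist())
```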

---
## 3. Binning (Discretization)

Convert continuous variables into categorical groups.

### 3.1 Equal-Width Binning

In [None]:
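# Bin age into groups using explicit edges
# These edges are only roughly equal-width: 10-year steps in the middle,
# with wider catch-all bins at both ends
# pd.cut intervals are right-closed by default, e.g. (25, 35]
customers['age_group_equal'] = pd.cut(
    customers['age'],
    bins=[0, 25, 35, 45, 55, 100],
    labels=['25 and under', '26-35', '36-45', '46-55', '56+'],
    include_lowest=True
)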

print("Age Groups (Equal Width):")
print(customers['age_group_equal'].value_counts().sort_index())

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(customers['age'], bins=30, edgecolor='black')
axes[0].set_title('Original Age Distribution')
axes[0].set_xlabel('Age')

customers['age_group_equal'].value_counts().sort_index().plot(kind='bar', ax=axes[1], color='skyblue')
axes[1].set_title('Age Groups (Binned)')
axes[1].set_xlabel('Age Group')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()
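
Passing explicit edges to `pd.cut`, as above, gives full control over the boundaries. For strictly equal-width bins, `pd.cut` also accepts an integer number of bins and computes the edges from the data range. A small sketch with invented ages:

```python
import pandas as pd

# Hypothetical ages for illustration
ages = pd.Series([18, 22, 25, 31, 38, 44, 52, 58])

# bins=4 splits the range [18, 58] into four intervals of equal width;
# retbins=True also returns the computed edges
age_bins, edges = pd.cut(ages, bins=4, retbins=True)

# Interior edges are 10 years apart; pandas lowers the first edge slightly
# so that the minimum value is included in the first bin
print(edges)

# Equal width does not mean equal counts
print(age_bins.value_counts().sort_index())
```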

### 3.2 Equal-Frequency Binning (Quantiles)

In [None]:
# Bin total_spend into quartiles
customers['spend_quartile'] = pd.qcut(
    customers['total_spend'],
    q=4,
    labels=['Low', 'Medium', 'High', 'Very High']
)

print("Spend Quartiles:")
print(customers['spend_quartile'].value_counts().sort_index())

# Show the actual cutoff values
print("\nQuartile boundaries:")
for q in [0, 0.25, 0.5, 0.75, 1.0]:
    print(f"  {q*100:.0f}%: ${customers['total_spend'].quantile(q):.2f}")

In [None]:
# Custom bins based on business logic
# Example: Customer segments based on spending
customers['customer_segment'] = pd.cut(
    customers['total_spend'],
    bins=[0, 100, 500, 1000, np.inf],
    labels=['Bronze', 'Silver', 'Gold', 'Platinum']
)

print("Customer Segments (Business-Defined):")
segment_stats = customers.groupby('customer_segment', observed=True).agg({
    'customer_id': 'count',
    'total_spend': ['mean', 'min', 'max'],
    'total_orders': 'mean'
}).round(2)

print(segment_stats)

---
## 4. Ratio and Interaction Features

Create new features by combining existing ones.

In [None]:
# Calculate derived metrics

# Average order value
customers['avg_order_value'] = customers['total_spend'] / customers['total_orders'].replace(0, np.nan)

# Orders per month of tenure (a tenure of 0 months is treated as 1 to avoid dividing by zero)
customers['orders_per_month'] = customers['total_orders'] / (customers['tenure_months'].replace(0, 1))

# Spend per month
customers['spend_per_month'] = customers['total_spend'] / (customers['tenure_months'].replace(0, 1))

print("Derived Metrics:")
print(customers[['customer_id', 'total_orders', 'total_spend', 'tenure_months',
                  'avg_order_value', 'orders_per_month', 'spend_per_month']].head(10))

In [None]:
# RFM (Recency, Frequency, Monetary) - Classic customer analysis
# Recency: Days since last order (lower = better)
# Frequency: Total orders (higher = better)
# Monetary: Total spend (higher = better)

# Score each dimension (1-5, with 5 being best)
customers['R_score'] = pd.qcut(customers['days_since_last_order'], q=5, 
                                labels=[5, 4, 3, 2, 1])  # Reversed: fewer days since last order = higher score

customers['F_score'] = pd.qcut(customers['total_orders'].rank(method='first'), q=5, 
                                labels=[1, 2, 3, 4, 5])

customers['M_score'] = pd.qcut(customers['total_spend'].rank(method='first'), q=5, 
                                labels=[1, 2, 3, 4, 5])

# Create RFM segment
customers['RFM_score'] = (customers['R_score'].astype(str) + 
                          customers['F_score'].astype(str) + 
                          customers['M_score'].astype(str))

customers['RFM_total'] = (customers['R_score'].astype(int) + 
                          customers['F_score'].astype(int) + 
                          customers['M_score'].astype(int))

print("RFM Analysis:")
print(customers[['customer_id', 'days_since_last_order', 'total_orders', 'total_spend',
                  'R_score', 'F_score', 'M_score', 'RFM_score', 'RFM_total']].head(10))
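
The `rank(method='first')` step above matters because `pd.qcut` needs unique bin edges, and heavily tied values such as order counts can produce duplicate quantiles. The sketch below, on invented counts, shows the failure and the ranking workaround. Note that ranking breaks ties by row order, so identical values can land in different score bins.

```python
import pandas as pd

# Hypothetical order counts with many ties at zero
orders = pd.Series([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])

# Direct quintiles fail: several quantile edges collapse to the same value
try:
    pd.qcut(orders, q=5, labels=[1, 2, 3, 4, 5])
    qcut_failed = False
except ValueError as err:
    qcut_failed = True
    print(f"qcut raised: {err}")

# Ranking first gives every row a distinct position, so the edges are unique
scores = pd.qcut(orders.rank(method='first'), q=5, labels=[1, 2, 3, 4, 5])

print(scores.tolist())
```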

In [None]:
# Create customer value segment based on RFM
def rfm_segment(row):
    if row['RFM_total'] >= 12:
        return 'Champions'
    elif row['RFM_total'] >= 9:
        return 'Loyal Customers'
    elif row['RFM_total'] >= 6:
        return 'Potential Loyalists'
    elif row['RFM_total'] >= 4:
        return 'At Risk'
    else:
        return 'Lost'

customers['rfm_segment'] = customers.apply(rfm_segment, axis=1)

print("\nRFM Segments:")
segment_summary = customers.groupby('rfm_segment').agg({
    'customer_id': 'count',
    'total_spend': 'mean',
    'total_orders': 'mean'
}).round(2)
segment_summary.columns = ['Count', 'Avg Spend', 'Avg Orders']
print(segment_summary)
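
Row-wise `apply` works here but can be slow on large tables. As a possible alternative, the same thresholds can be expressed as `pd.cut` edges. The sketch below assumes `RFM_total` is an integer between 3 and 15 and uses a stand-in series rather than the `customers` table.

```python
import pandas as pd

# Stand-in for customers['RFM_total']: every possible total from 3 to 15
rfm_total = pd.Series(range(3, 16))

# Right-closed intervals: (2, 3], (3, 5], (5, 8], (8, 11], (11, 15]
segments = pd.cut(
    rfm_total,
    bins=[2, 3, 5, 8, 11, 15],
    labels=['Lost', 'At Risk', 'Potential Loyalists', 'Loyal Customers', 'Champions'],
)

print(pd.DataFrame({'RFM_total': rfm_total, 'segment': segments}))
```

The result is an ordered categorical, so the segments sort from `Lost` to `Champions` rather than alphabetically.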

---
## 5. Encoding Categorical Variables

Convert categorical data into numerical format.

### 5.1 Label Encoding (Ordinal)

In [None]:
# Label encoding for ordinal categories
# Use when there's a natural order

# Subscription plan has a natural order
plan_order = {'Free': 0, 'Basic': 1, 'Premium': 2, 'Enterprise': 3}
customers['subscription_level'] = customers['subscription_plan'].map(plan_order)

print("Label Encoding (Ordinal):")
print(customers[['subscription_plan', 'subscription_level']].drop_duplicates().sort_values('subscription_level'))
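
An ordered categorical is another way to carry the same ordering. It keeps the readable labels, supports comparisons, and exposes integer codes when needed. A short sketch with sample plan values:

```python
import pandas as pd

plans = pd.Series(['Basic', 'Free', 'Enterprise', 'Premium', 'Basic'])

# Declare the order explicitly, lowest to highest
plan_dtype = pd.CategoricalDtype(['Free', 'Basic', 'Premium', 'Enterprise'], ordered=True)
plans_cat = plans.astype(plan_dtype)

# Integer codes follow the declared order (Free=0 ... Enterprise=3)
codes = plans_cat.cat.codes
print(codes.tolist())

# Comparisons respect the declared order
at_least_premium = plans_cat >= 'Premium'
print(at_least_premium.tolist())
```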

### 5.2 One-Hot Encoding (Nominal)

In [None]:
# One-hot encoding for nominal categories (no natural order)
# Use for: city, gender, color, etc.

# Method 1: pd.get_dummies()
city_dummies = pd.get_dummies(customers['city'], prefix='city')
print("One-Hot Encoding (City):")
print(city_dummies.head())

# Add to dataframe
customers_encoded = pd.concat([customers, city_dummies], axis=1)

In [None]:
# Encode multiple columns at once
encoded_df = pd.get_dummies(
    customers[['gender', 'city', 'email_provider']],
    drop_first=True  # Avoid dummy variable trap
)

print("\nOne-Hot Encoding (Multiple Columns, drop_first=True):")
print(encoded_df.head())

print("\nOriginal columns: gender, city, email_provider")
print(f"Encoded columns: {encoded_df.columns.tolist()}")
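
`pd.get_dummies` builds columns only for categories present in the data it sees, so a new batch with missing or unseen categories yields a different column set. One common workaround, sketched here with invented city values, is to reindex the new batch to the reference columns:

```python
import pandas as pd

# Hypothetical reference data and a later batch with a different mix of cities
reference = pd.DataFrame({'city': ['Chicago', 'Houston', 'Phoenix']})
new_batch = pd.DataFrame({'city': ['Houston', 'Boston']})

ref_dummies = pd.get_dummies(reference['city'], prefix='city', dtype=int)
new_dummies = pd.get_dummies(new_batch['city'], prefix='city', dtype=int)

# Keep exactly the reference columns: missing ones are filled with 0,
# and columns for unseen categories (here city_Boston) are dropped
new_aligned = new_dummies.reindex(columns=ref_dummies.columns, fill_value=0)

print(new_aligned)
```

Dropping unseen categories is a design choice: the row for Boston ends up with all zeros, which may or may not be acceptable depending on the analysis.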

### 5.3 Binary Flags (Indicator Features)

In [None]:
# Create binary flags from conditions

# Is the customer active? (ordered in last 30 days)
customers['is_active'] = (customers['days_since_last_order'] <= 30).astype(int)

# Is the customer a high spender? (above median)
median_spend = customers['total_spend'].median()
customers['is_high_spender'] = (customers['total_spend'] > median_spend).astype(int)

# Is premium customer?
customers['is_premium'] = customers['subscription_plan'].isin(['Premium', 'Enterprise']).astype(int)

print("Binary Features:")
print(customers[['customer_id', 'days_since_last_order', 'is_active', 
                  'total_spend', 'is_high_spender', 
                  'subscription_plan', 'is_premium']].head(10))

---
## 6. Numerical Transformations

Transform numerical features to improve analysis.

### 6.1 Log Transformation

In [None]:
# Log transformation for skewed data
# Useful for: monetary values, counts, highly skewed distributions

# Check skewness before
print(f"Original total_spend skewness: {customers['total_spend'].skew():.2f}")

# Apply log transformation (add 1 to handle zeros)
customers['log_total_spend'] = np.log1p(customers['total_spend'])

print(f"Log-transformed skewness: {customers['log_total_spend'].skew():.2f}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(customers['total_spend'], bins=50, edgecolor='black')
axes[0].set_title('Original Distribution (Skewed)')
axes[0].set_xlabel('Total Spend ($)')

axes[1].hist(customers['log_total_spend'], bins=50, edgecolor='black', color='green')
axes[1].set_title('Log-Transformed Distribution')
axes[1].set_xlabel('Log(Total Spend)')

plt.tight_layout()
plt.show()
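
Values on the log scale usually need to be converted back before they are reported. `np.expm1` is the inverse of `np.log1p`, as this small sketch with invented amounts shows:

```python
import numpy as np
import pandas as pd

# Hypothetical spend values, including a zero
spend = pd.Series([0.0, 9.0, 99.0, 999.0])

log_spend = np.log1p(spend)      # log(1 + x), defined at zero
recovered = np.expm1(log_spend)  # exp(y) - 1, undoes log1p

print(log_spend.round(3).tolist())
print(recovered.round(6).tolist())
```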

### 6.2 Standardization (Z-Score)

In [None]:
# Standardization: Mean = 0, Std = 1
# Useful for: comparing different scales, clustering, some ML algorithms

def standardize(series):
    """Standardize a series to have mean=0 and std=1"""
    return (series - series.mean()) / series.std()

customers['spend_standardized'] = standardize(customers['total_spend'])
customers['orders_standardized'] = standardize(customers['total_orders'])
customers['tenure_standardized'] = standardize(customers['tenure_days'])

print("Standardized Features (Z-Scores):")
print(customers[['total_spend', 'spend_standardized', 
                  'total_orders', 'orders_standardized']].describe().round(2))

### 6.3 Min-Max Normalization

In [None]:
# Min-Max Normalization: Scale to [0, 1]
# Useful for: neural networks, when you need bounded values

def min_max_normalize(series):
    """Normalize a series to range [0, 1]"""
    return (series - series.min()) / (series.max() - series.min())

customers['spend_normalized'] = min_max_normalize(customers['total_spend'])
customers['orders_normalized'] = min_max_normalize(customers['total_orders'])

print("Min-Max Normalized Features:")
print(customers[['total_spend', 'spend_normalized', 
                  'total_orders', 'orders_normalized']].describe().round(3))
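
Both helpers above compute their statistics from the series they are given. When scaled features feed a model, the parameters are usually learned on one reference sample and reused on later data so that both sit on the same scale. A hedged sketch of that idea follows; the function names and sample values are hypothetical.

```python
import pandas as pd

def fit_standardizer(series):
    """Return the mean and sample standard deviation of a reference series."""
    return series.mean(), series.std()

def apply_standardizer(series, mean, std):
    """Scale a series using previously computed parameters."""
    return (series - mean) / std

# Hypothetical reference sample and a later batch
reference = pd.Series([100.0, 200.0, 300.0])
later = pd.Series([200.0, 400.0])

# Learn the parameters once, on the reference sample only
mean, std = fit_standardizer(reference)

# Reuse them on the later batch instead of recomputing
later_scaled = apply_standardizer(later, mean, std)

print(later_scaled.tolist())
```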

---
## 7. Aggregation Features

Create features by aggregating data from related records.

In [None]:
# Create transaction-level data
np.random.seed(42)
n_transactions = 2000

transactions = pd.DataFrame({
    'transaction_id': range(1, n_transactions + 1),
    'customer_id': np.random.choice(customers['customer_id'], n_transactions),
    'transaction_date': pd.date_range('2023-01-01', periods=n_transactions, freq='4h'),
    'amount': np.random.exponential(100, n_transactions),
    'category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Home', 'Books'], n_transactions)
})

print("Transaction Data:")
print(transactions.head(10))

In [None]:
# Aggregate transaction data per customer
customer_agg = transactions.groupby('customer_id').agg({
    'transaction_id': 'count',
    'amount': ['sum', 'mean', 'std', 'min', 'max'],
    'transaction_date': ['min', 'max'],
    'category': 'nunique'
})

# Flatten column names
customer_agg.columns = ['_'.join(col).strip() for col in customer_agg.columns]
customer_agg = customer_agg.reset_index()

# Rename for clarity
customer_agg.columns = [
    'customer_id', 'num_transactions', 'total_amount', 'avg_amount', 
    'std_amount', 'min_amount', 'max_amount', 'first_transaction', 
    'last_transaction', 'num_categories'
]

print("\nAggregated Customer Features:")
print(customer_agg.head())

In [None]:
# Create category-specific features
category_spend = transactions.pivot_table(
    index='customer_id',
    columns='category',
    values='amount',
    aggfunc='sum',
    fill_value=0
).reset_index()

# Add prefix to column names
category_spend.columns = ['customer_id'] + [f'spend_{cat.lower()}' for cat in category_spend.columns[1:]]

print("Category-Specific Spending:")
print(category_spend.head())
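
Aggregated tables like these are typically joined back onto the customer table. Customers with no transactions have no row in the aggregate, so a left join leaves them with `NaN`. A minimal sketch with toy tables; the column names mirror the ones above but the data is invented:

```python
import pandas as pd

# Hypothetical toy tables: customer 2 has no transactions
customers_small = pd.DataFrame({'customer_id': [1, 2, 3]})
tx = pd.DataFrame({'customer_id': [1, 1, 3], 'amount': [10.0, 30.0, 5.0]})

# Named aggregation gives flat, readable column names directly
tx_agg = tx.groupby('customer_id').agg(
    num_transactions=('amount', 'count'),
    total_amount=('amount', 'sum'),
).reset_index()

# Left join keeps every customer, including those without transactions
features = customers_small.merge(tx_agg, on='customer_id', how='left')

# For counts and sums, a missing value means "none", so 0 is a sensible fill
features[['num_transactions', 'total_amount']] = (
    features[['num_transactions', 'total_amount']].fillna(0)
)

print(features)
```

Filling with 0 suits counts and totals; for averages or standard deviations, leaving `NaN` is often the more honest choice.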

---
## 8. Practical Exercises

### Exercise 1: Create Employee Features

In [None]:
# Employee dataset
np.random.seed(42)
n = 200

employees = pd.DataFrame({
    'employee_id': range(1, n + 1),
    'hire_date': pd.date_range('2015-01-01', periods=n, freq='W'),
    'birth_date': pd.to_datetime('1985-01-01') + pd.to_timedelta(
        np.random.randint(-5000, 5000, n), unit='D'
    ),
    'department': np.random.choice(['Sales', 'IT', 'HR', 'Marketing', 'Finance'], n),
    'salary': np.random.normal(60000, 15000, n),
    'performance_rating': np.random.choice([1, 2, 3, 4, 5], n, p=[0.05, 0.15, 0.40, 0.30, 0.10]),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n, p=[0.2, 0.5, 0.25, 0.05]),
    'is_remote': np.random.choice([True, False], n, p=[0.3, 0.7])
})

print("Employee Data:")
employees.head(10)

In [None]:
# TODO: Create the following features:
# 1. age (from birth_date)
# 2. tenure_years (from hire_date)
# 3. age_group (non-overlapping bins: 20-29, 30-39, 40-49, 50+)
# 4. salary_band (Low, Medium, High, Very High based on quartiles)
# 5. education_level (ordinal: 0, 1, 2, 3)
# 6. is_high_performer (performance_rating >= 4)
# 7. salary_per_experience_year (salary / tenure_years)


### Exercise 2: RFM Analysis for Retail

In [None]:
# Retail transaction data
np.random.seed(42)
n = 1000

retail = pd.DataFrame({
    'customer_id': np.random.randint(100, 200, n),
    'purchase_date': pd.date_range('2023-01-01', periods=n, freq='8h'),
    'purchase_amount': np.random.exponential(75, n)
})

print("Retail Transactions:")
print(retail.head())
print(f"\nUnique customers: {retail['customer_id'].nunique()}")

In [None]:
# TODO: Perform RFM Analysis
# 1. Calculate Recency (days since last purchase for each customer)
# 2. Calculate Frequency (number of purchases per customer)
# 3. Calculate Monetary (total spend per customer)
# 4. Score each dimension 1-5 using quintiles
# 5. Create customer segments based on RFM scores


### Exercise 3: Feature Engineering for Prediction

In [None]:
# Website session data
np.random.seed(42)
n = 500

sessions = pd.DataFrame({
    'session_id': range(1, n + 1),
    'user_id': np.random.randint(1, 101, n),
    'session_start': pd.date_range('2024-01-01', periods=n, freq='30min'),
    'duration_seconds': np.random.exponential(300, n),
    'pages_viewed': np.random.poisson(5, n),
    'device': np.random.choice(['Mobile', 'Desktop', 'Tablet'], n, p=[0.6, 0.3, 0.1]),
    'source': np.random.choice(['Organic', 'Paid', 'Social', 'Direct'], n),
    'converted': np.random.choice([0, 1], n, p=[0.9, 0.1])
})

print("Session Data:")
sessions.head()

In [None]:
# TODO: Create features that might help predict conversion:
# 1. hour_of_day (from session_start)
# 2. day_of_week
# 3. is_weekend
# 4. duration_minutes (from duration_seconds)
# 5. pages_per_minute
# 6. is_mobile (binary: 1 if Mobile, 0 otherwise)
# 7. One-hot encode 'source'
# 8. User-level aggregates (avg_session_duration, total_sessions, etc.)


---
## 9. Key Takeaways

### ✅ Feature Engineering Best Practices

1. **Start with domain knowledge** - What makes sense for your business?
2. **Extract date/time features** - Year, month, day of week, is_weekend
3. **Create ratio features** - Value per unit, rate of change
4. **Use binning wisely** - Business-defined bins vs. quantiles
5. **Encode properly** - Label encoding for ordinal, one-hot for nominal
6. **Aggregate related data** - Sum, mean, count, unique count

### 📋 Quick Reference

```python
# Binning
pd.cut(df['col'], bins=[0, 10, 20, 30])  # Custom bins
pd.qcut(df['col'], q=4)  # Quantile-based bins

# Encoding
pd.get_dummies(df['col'], drop_first=True)  # One-hot
df['col'].map({'Low': 0, 'High': 1})  # Label encoding

# Transformations
np.log1p(df['col'])  # Log transform (handles zeros)
(df['col'] - df['col'].mean()) / df['col'].std()  # Standardize

# Date features
df['date'].dt.year       # Also: .dt.month, .dt.quarter
df['date'].dt.dayofweek  # Monday=0, Sunday=6

# Aggregation
df.groupby('key').agg({'val': ['sum', 'mean', 'count']})
```

### ⚠️ Common Mistakes

1. Using label encoding for nominal categories (no order)
2. Not handling zeros before log transformation
3. Creating too many one-hot columns (dimensionality explosion)
4. Not documenting feature creation logic