# Feature Engineering Deep Dive

**Author:** Alexis Alduncin (Data Scientist)
**Team:** MLOps 62

This notebook walks through the feature engineering process in detail, showing a before/after comparison for each engineered feature and its relationship to the prediction target.

In [None]:
# Setup and imports
import sys
import os

# Add project root to path (handles both Docker /work and local environments)
if os.path.exists('/work'):
    sys.path.insert(0, '/work')  # Docker environment
else:
    sys.path.insert(0, os.path.abspath('..'))  # Local environment

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Import our custom modules
from src import config
from src.data_utils import load_data
from src.plots import plot_categorical_analysis, plot_numerical_relationship

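# Import the Phase 1 feature engine from src/features.py.
# Note: importlib.import_module('src.features') resolves exactly like `import src.features`.
# If a src/features/ directory containing __init__.py also exists, Python picks that
# package instead of the .py file, so keep only one of the two importable.
import importlib
features_module = importlib.import_module('src.features')
AbsenteeismFeatureEngine = features_module.AbsenteeismFeatureEngine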

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✅ Modules imported successfully")
print(f"Target Column: {config.TARGET_COLUMN}")

## 1. Load and Prepare Data

In [None]:
# Load raw data through the team's DVC-based loader
df_raw = load_data(config.RAW_DATA_PATH)
print(f"Raw data loaded: {df_raw.shape}")

# Initialize feature engine
engine = AbsenteeismFeatureEngine()

# Clean data first
df = engine.clean_data(df_raw)
print(f"Cleaned data: {df.shape}")
print(f"Records removed: {len(df_raw) - len(df)} ({(len(df_raw)-len(df))/len(df_raw)*100:.1f}%)")

df.head()

## 2. Feature 1: Absence Categories

**Purpose:** Bin continuous absenteeism hours into meaningful categories for classification tasks and better pattern recognition.

**Bins:**
- **Short** (0-4h): Minor absences
- **Half_Day** (4-8h): Half-day absences
- **Full_Day** (8-24h): Full-day absences
- **Extended** (24-120h): Multi-day medical leave
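
The bins above share their endpoints (4, 8 and 24 hours), so any implementation has to decide which side each boundary belongs to. The following is a minimal sketch using `pandas.cut`, assuming right-closed intervals and that zero-hour records belong to `Short`. The helper name `bin_absence_hours` is hypothetical, and this is not necessarily how `create_absence_categories` is implemented.

```python
import pandas as pd

def bin_absence_hours(hours: pd.Series) -> pd.Series:
    """Bin absenteeism hours into four ordered categories (illustrative only)."""
    edges = [0, 4, 8, 24, 120]
    labels = ["Short", "Half_Day", "Full_Day", "Extended"]
    # right=True makes each interval (a, b]; include_lowest=True pulls 0 into the first bin.
    return pd.cut(hours, bins=edges, labels=labels, right=True, include_lowest=True)

hours = pd.Series([0, 4, 5, 8, 16, 24, 40, 120])
categories = bin_absence_hours(hours)
print(pd.DataFrame({"hours": hours, "category": categories}))
```

Under these assumptions an absence of exactly 8 hours is labelled `Half_Day`; switching to `right=False` would move it to `Full_Day`, so it is worth checking which convention the engine uses.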

In [None]:
# Create absence categories
df_absence = engine.create_absence_categories(df.copy())

# Show distribution
print("Absence Category Distribution:")
print(df_absence['Absence_Category'].value_counts().sort_index())
print(f"\nPercentage Distribution:")
print(df_absence['Absence_Category'].value_counts(normalize=True).sort_index() * 100)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Before: Continuous distribution
axes[0].hist(df[config.TARGET_COLUMN], bins=30, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Absenteeism Hours')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Before: Continuous Hours')
axes[0].axvline(df[config.TARGET_COLUMN].mean(), color='red', linestyle='--', label=f'Mean: {df[config.TARGET_COLUMN].mean():.1f}h')
axes[0].legend()

# After: Categorical distribution
category_counts = df_absence['Absence_Category'].value_counts().sort_index()
axes[1].bar(range(len(category_counts)), category_counts.values, edgecolor='black', alpha=0.7)
axes[1].set_xticks(range(len(category_counts)))
axes[1].set_xticklabels(category_counts.index, rotation=45)
axes[1].set_xlabel('Absence Category')
axes[1].set_ylabel('Frequency')
axes[1].set_title('After: Categorical Bins')

for i, v in enumerate(category_counts.values):
    axes[1].text(i, v + 5, str(v), ha='center')

plt.tight_layout()
plt.show()

print("\n✅ Absence categories created successfully")

## 3. Feature 2: BMI Categories

**Purpose:** Convert continuous BMI to WHO-standard health categories to capture non-linear health risk patterns.

**Categories (WHO Standard):**
- **Underweight**: BMI < 18.5
- **Normal**: 18.5 ≤ BMI < 25
- **Overweight**: 25 ≤ BMI < 30
- **Obese**: BMI ≥ 30
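
The inequalities above are left-closed (for example `18.5 ≤ BMI < 25`), which maps onto `pandas.cut` with `right=False`. This is a hedged sketch of that mapping; the function name `bmi_to_who_category` is hypothetical and the engine's own `create_bmi_categories` may be written differently.

```python
import numpy as np
import pandas as pd

def bmi_to_who_category(bmi: pd.Series) -> pd.Series:
    """Map BMI values to the four categories listed above (illustrative only)."""
    edges = [-np.inf, 18.5, 25, 30, np.inf]
    labels = ["Underweight", "Normal", "Overweight", "Obese"]
    # right=False gives left-closed intervals [a, b), matching "18.5 ≤ BMI < 25".
    return pd.cut(bmi, bins=edges, labels=labels, right=False)

bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 38.0])
bmi_category = bmi_to_who_category(bmi)
print(pd.DataFrame({"bmi": bmi, "category": bmi_category}))
```

Using infinite outer edges means no valid BMI value can fall outside the bins and silently become `NaN`.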

In [None]:
# Create BMI categories
df_bmi = engine.create_bmi_categories(df.copy())

# Show distribution
print("BMI Category Distribution:")
print(df_bmi['BMI_Category'].value_counts().sort_index())
print(f"\nPercentage Distribution:")
print(df_bmi['BMI_Category'].value_counts(normalize=True).sort_index() * 100)

# Statistical analysis by category
print("\nAverage Absenteeism by BMI Category:")
bmi_stats = df_bmi.groupby('BMI_Category')[config.TARGET_COLUMN].agg(['mean', 'median', 'std', 'count'])
print(bmi_stats)

# Visualize
fig = plot_categorical_analysis(df_bmi, 'BMI_Category', config.TARGET_COLUMN)
plt.show()

print("\n✅ BMI categories created - reveals health-related absence patterns")

## 4. Feature 3: Age Groups

**Purpose:** Segment employees by age to capture life-stage specific absence patterns.

**Groups:**
- **Young** (18-30): Early career
- **Middle** (30-45): Mid-career with family responsibilities
- **Senior** (45-60): Late career
- **Veteran** (60+): Near retirement
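
Fixed age bins can end up empty if the workforce does not cover the whole range. When the group column is a pandas categorical, `value_counts()` still reports such groups with a count of zero, which affects the bar charts. The sketch below uses toy ages and assumes left-closed bins; neither is taken from the project's data or code.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22, 28, 31, 37, 44, 52])  # toy data: nobody aged 60 or over

age_group = pd.cut(
    ages,
    bins=[18, 30, 45, 60, np.inf],
    labels=["Young", "Middle", "Senior", "Veteran"],
    right=False,  # assumption: [18, 30), [30, 45), [45, 60), [60, inf)
)

# Categorical value_counts keeps unobserved categories, so "Veteran" appears with 0.
counts = age_group.value_counts().sort_index()
print(counts)
```

A zero-count group produces an empty bar and a `NaN` group mean, so it is useful to check the counts before interpreting the mean-absenteeism panel.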

In [None]:
# Create age groups
df_age = engine.create_age_groups(df.copy())

# Show distribution
print("Age Group Distribution:")
print(df_age['Age_Group'].value_counts().sort_index())
print(f"\nPercentage Distribution:")
print(df_age['Age_Group'].value_counts(normalize=True).sort_index() * 100)

# Compare continuous vs categorical
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Continuous age distribution
axes[0].hist(df['Age'], bins=20, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Age (years)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Before: Continuous Age Distribution')
axes[0].axvline(df['Age'].mean(), color='red', linestyle='--', label=f'Mean: {df["Age"].mean():.1f}')
axes[0].legend()

# Categorical age groups
age_counts = df_age['Age_Group'].value_counts().sort_index()
axes[1].bar(range(len(age_counts)), age_counts.values, edgecolor='black', alpha=0.7)
axes[1].set_xticks(range(len(age_counts)))
axes[1].set_xticklabels(age_counts.index, rotation=45)
axes[1].set_xlabel('Age Group')
axes[1].set_ylabel('Frequency')
axes[1].set_title('After: Age Group Categories')

# Absenteeism by age group
age_absence = df_age.groupby('Age_Group')[config.TARGET_COLUMN].mean().sort_index()
axes[2].bar(range(len(age_absence)), age_absence.values, edgecolor='black', alpha=0.7, color='coral')
axes[2].set_xticks(range(len(age_absence)))
axes[2].set_xticklabels(age_absence.index, rotation=45)
axes[2].set_xlabel('Age Group')
axes[2].set_ylabel('Mean Absenteeism (hours)')
axes[2].set_title('Insight: Mean Absenteeism by Age Group')

for i, v in enumerate(age_absence.values):
    axes[2].text(i, v + 0.2, f'{v:.1f}h', ha='center')

plt.tight_layout()
plt.show()

print("\n✅ Age groups created - captures life-stage absence patterns")

## 5. Feature 4: Distance Categories

**Purpose:** Categorize commute distance to capture transportation-related absence patterns.

**Categories:**
- **Near** (0-10 km): Short commute
- **Moderate** (10-25 km): Medium commute
- **Far** (25-40 km): Long commute
- **Very_Far** (40+ km): Very long commute
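
The cells in this notebook call `.value_counts().sort_index()` on each category column. Whether that yields the logical order (Near, Moderate, Far, Very_Far) or an alphabetical one depends on the column's dtype. The sketch below illustrates the difference on toy labels; it makes no claim about which dtype `create_distance_categories` actually returns.

```python
import pandas as pd

order = ["Near", "Moderate", "Far", "Very_Far"]
raw = pd.Series(["Far", "Near", "Moderate", "Near", "Very_Far"])  # toy labels

# Plain strings: sort_index() sorts alphabetically.
alphabetical = raw.value_counts().sort_index().index.tolist()

# Ordered categorical: sort_index() follows the declared category order.
ordered = pd.Series(pd.Categorical(raw, categories=order, ordered=True))
logical = ordered.value_counts().sort_index().index.tolist()

print("string dtype:     ", alphabetical)
print("ordered category: ", logical)
```

If the tables or plots come out in alphabetical order, converting the column to an ordered categorical as above is one way to restore the intended sequence.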

In [None]:
# Create distance categories
df_distance = engine.create_distance_categories(df.copy())

# Show distribution
print("Distance Category Distribution:")
print(df_distance['Distance_Category'].value_counts().sort_index())
print(f"\nPercentage Distribution:")
print(df_distance['Distance_Category'].value_counts(normalize=True).sort_index() * 100)

# Analyze impact on absenteeism
print("\nAverage Absenteeism by Distance Category:")
distance_stats = df_distance.groupby('Distance_Category')[config.TARGET_COLUMN].agg(['mean', 'median', 'count'])
print(distance_stats.sort_index())

# Visualize
fig = plot_categorical_analysis(df_distance, 'Distance_Category', config.TARGET_COLUMN)
plt.show()

print("\n✅ Distance categories created - reveals commute impact on absence")

## 6. Feature 5: Workload Categories

**Purpose:** Segment workload intensity to identify stress-related absence patterns.

**Categories:**
- **Low**: 0-33rd percentile
- **Medium**: 33rd-66th percentile
- **High**: 66th-100th percentile
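
Percentile-based bins like these can be sketched with `pandas.qcut`, passing the 33rd and 66th percentiles explicitly. The workload values below are synthetic, and this is an illustration of the idea rather than the engine's actual `create_workload_categories` code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
workload = pd.Series(rng.normal(270, 40, size=300))  # synthetic workload values

# Thresholds at the 33rd and 66th percentiles, as in the list above.
low_cut, high_cut = workload.quantile([0.33, 0.66])

workload_category = pd.qcut(
    workload,
    q=[0, 0.33, 0.66, 1.0],
    labels=["Low", "Medium", "High"],
)
shares = workload_category.value_counts(normalize=True).sort_index()
print(f"Low/Medium boundary: {low_cut:.2f}, Medium/High boundary: {high_cut:.2f}")
print(shares)
```

On continuous data each group holds roughly a third of the rows. Real workload values that repeat many times can make the groups noticeably uneven, which the percentage distribution printed by the notebook will reveal.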

In [None]:
# Create workload categories
df_workload = engine.create_workload_categories(df.copy())

# Show distribution
print("Workload Category Distribution:")
print(df_workload['Workload_Category'].value_counts().sort_index())
print(f"\nPercentage Distribution:")
print(df_workload['Workload_Category'].value_counts(normalize=True).sort_index() * 100)

# Calculate workload thresholds
workload_col = 'Work load Average/day '  # the trailing space is part of the column name used here
print(f"\nWorkload Thresholds:")
print(f"Low/Medium boundary (33rd percentile): {df[workload_col].quantile(0.33):.2f}")
print(f"Medium/High boundary (66th percentile): {df[workload_col].quantile(0.66):.2f}")

# Analyze impact
print("\nAverage Absenteeism by Workload:")
workload_stats = df_workload.groupby('Workload_Category')[config.TARGET_COLUMN].agg(['mean', 'median', 'std', 'count'])
print(workload_stats.reindex(['Low', 'Medium', 'High']))

# Visualize
fig = plot_categorical_analysis(df_workload, 'Workload_Category', config.TARGET_COLUMN)
plt.show()

print("\n✅ Workload categories created - identifies stress-related absences")

## 7. Feature 6: Season Names

**Purpose:** Convert numeric season codes to meaningful names for better interpretability.

**Mapping:**
- 1 → Summer
- 2 → Autumn
- 3 → Winter
- 4 → Spring
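
A code-to-name mapping like the one above is typically applied with `Series.map` and a dictionary. The sketch below assumes that approach (the engine's `create_season_names` may differ) and adds a check for codes that are missing from the mapping, since `map` turns those into `NaN` without raising.

```python
import pandas as pd

SEASON_NAMES = {1: "Summer", 2: "Autumn", 3: "Winter", 4: "Spring"}

seasons = pd.Series([1, 4, 2, 3, 1, 5])  # toy codes; 5 is deliberately invalid
season_name = seasons.map(SEASON_NAMES)

# Codes without a dictionary entry become NaN, so list them explicitly.
unmapped = seasons[season_name.isna()].unique().tolist()
print(season_name.tolist())
print("Unmapped codes:", unmapped)
```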

In [None]:
# Create season names
df_season = engine.create_season_names(df.copy())

# Compare before and after
print("Before: Numeric Seasons")
print(df['Seasons'].value_counts().sort_index())

print("\nAfter: Named Seasons")
print(df_season['Season_Name'].value_counts())

# Seasonal analysis
print("\nAverage Absenteeism by Season:")
season_stats = df_season.groupby('Season_Name')[config.TARGET_COLUMN].agg(['mean', 'median', 'count'])
print(season_stats)

# Visualize seasonal patterns
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Season distribution
season_counts = df_season['Season_Name'].value_counts().sort_index()  # same season order as the groupby panel
axes[0].bar(range(len(season_counts)), season_counts.values, edgecolor='black', alpha=0.7)
axes[0].set_xticks(range(len(season_counts)))
axes[0].set_xticklabels(season_counts.index, rotation=45)
axes[0].set_ylabel('Frequency')
axes[0].set_title('Season Distribution')
axes[0].grid(axis='y', alpha=0.3)

# Mean absenteeism by season
season_means = df_season.groupby('Season_Name')[config.TARGET_COLUMN].mean()
axes[1].bar(range(len(season_means)), season_means.values, edgecolor='black', alpha=0.7, color='lightcoral')
axes[1].set_xticks(range(len(season_means)))
axes[1].set_xticklabels(season_means.index, rotation=45)
axes[1].set_ylabel('Mean Absenteeism (hours)')
axes[1].set_title('Mean Absenteeism by Season')
axes[1].grid(axis='y', alpha=0.3)

for i, v in enumerate(season_means.values):
    axes[1].text(i, v + 0.1, f'{v:.1f}h', ha='center')

plt.tight_layout()
plt.show()

print("\n✅ Season names created - improves interpretability")

## 8. Feature 7: High Risk Flag

**Purpose:** Create a composite risk indicator that combines several high-risk factors.

**High Risk Criteria (any of):**
- Has disciplinary warnings
- BMI in Obese category (≥30)
- Very far commute distance (>40 km)
- Has 3+ children
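
An "any of" rule like this can be sketched as a logical OR over one boolean column per criterion. The column names and toy values below are hypothetical stand-ins, not the dataset's real columns, and `create_high_risk_flag` may be implemented differently.

```python
import pandas as pd

# Toy data with hypothetical column names.
toy = pd.DataFrame({
    "disciplinary": [0, 1, 0, 0, 0],
    "bmi": [22.0, 24.0, 31.5, 27.0, 23.0],
    "distance_km": [12, 8, 20, 45, 30],
    "children": [0, 1, 2, 0, 3],
})

# One boolean column per criterion keeps the rule auditable.
criteria = pd.DataFrame({
    "disciplinary": toy["disciplinary"] > 0,
    "obese": toy["bmi"] >= 30,
    "very_far": toy["distance_km"] > 40,
    "many_children": toy["children"] >= 3,
})

toy["High_Risk"] = criteria.any(axis=1).astype(int)  # 1 if any criterion holds
triggered = criteria.sum(axis=0)                      # rows flagged per criterion
print(toy)
print(triggered)
```

Keeping the per-criterion frame makes it easy to see which factor drives the flag. With four independent criteria combined by OR, the flagged share can be large, so the distribution printed by the notebook is worth a look before using the flag.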

In [None]:
# Create all prerequisite features first
df_all = df.copy()
df_all = engine.create_bmi_categories(df_all)
df_all = engine.create_distance_categories(df_all)
df_all = engine.create_high_risk_flag(df_all)

# Analyze high-risk flag
print("High Risk Flag Distribution:")
print(df_all['High_Risk'].value_counts())
print(f"\nPercentage High Risk: {df_all['High_Risk'].mean()*100:.1f}%")

# Compare absenteeism between risk groups
print("\nAbsenteeism Comparison:")
risk_comparison = df_all.groupby('High_Risk')[config.TARGET_COLUMN].agg(['mean', 'median', 'std', 'count'])
risk_comparison.index = ['Low Risk', 'High Risk']
print(risk_comparison)

# Test whether mean absenteeism differs between the two risk groups.
# Welch's t-test (equal_var=False) does not assume equal variances or group sizes.
from scipy import stats
high_risk_hours = df_all[df_all['High_Risk'] == 1][config.TARGET_COLUMN]
low_risk_hours = df_all[df_all['High_Risk'] == 0][config.TARGET_COLUMN]
t_stat, p_value = stats.ttest_ind(high_risk_hours, low_risk_hours, equal_var=False)
print("\nStatistical Test (Welch's t-test):")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("✅ Difference is statistically significant (p < 0.05)")
else:
    print("⚠️ Difference is not statistically significant")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

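# Distribution by risk
# sort_index() orders the counts as 0 then 1 so they match the ['Low Risk', 'High Risk'] labels;
# plain value_counts() orders by frequency and can swap them.
risk_counts = df_all['High_Risk'].value_counts().sort_index()
axes[0].bar(['Low Risk', 'High Risk'], risk_counts.values, edgecolor='black', alpha=0.7, color=['green', 'red'])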
axes[0].set_ylabel('Count')
axes[0].set_title('Risk Flag Distribution')
axes[0].grid(axis='y', alpha=0.3)
for i, v in enumerate(risk_counts.values):
    axes[0].text(i, v + 5, str(v), ha='center')

# Mean absenteeism comparison
means = [low_risk_hours.mean(), high_risk_hours.mean()]
axes[1].bar(['Low Risk', 'High Risk'], means, edgecolor='black', alpha=0.7, color=['green', 'red'])
axes[1].set_ylabel('Mean Absenteeism (hours)')
axes[1].set_title('Mean Absenteeism by Risk Level')
axes[1].grid(axis='y', alpha=0.3)
for i, v in enumerate(means):
    axes[1].text(i, v + 0.2, f'{v:.1f}h', ha='center')

# Boxplot comparison
df_all.boxplot(column=config.TARGET_COLUMN, by='High_Risk', ax=axes[2])
axes[2].set_xticklabels(['Low Risk', 'High Risk'])
axes[2].set_xlabel('Risk Level')
axes[2].set_ylabel('Absenteeism (hours)')
axes[2].set_title('Distribution Comparison')
plt.suptitle('')  # Remove auto-generated title

plt.tight_layout()
plt.show()

print("\n✅ High-risk flag created - composite risk indicator for targeted interventions")

## 9. Complete Feature Engineering Pipeline

In [None]:
# Run complete pipeline
df_engineered = engine.engineer_features(df.copy())

print("Feature Engineering Summary:")
print(f"Original features: {len(df.columns)}")
print(f"After engineering: {len(df_engineered.columns)}")
print(f"New features added: {len(df_engineered.columns) - len(df.columns)}")

# List new features
new_features = set(df_engineered.columns) - set(df.columns)
print("\nNew Features Created:")
for i, feat in enumerate(sorted(new_features), 1):
    print(f"{i}. {feat}")

# Show sample
print("\nSample of Engineered Features:")
display_cols = sorted(new_features) + [config.TARGET_COLUMN]  # sorted for a stable column order
df_engineered[display_cols].head(10)

## 10. Feature Importance Analysis

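Train a quick Random Forest regressor to estimate the importance of the engineered features. Because `Absence_Category` is derived directly from the target, it should not be among the model inputs; if it shows up in the ranking, its importance reflects target leakage rather than predictive value.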
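
One point to verify before reading the importances: `Absence_Category` is computed from the absenteeism hours themselves, so any column derived from it would leak the target into the features. The sketch below shows one way to drop such columns before fitting. The helper `drop_target_derived`, the toy column names and the one-hot naming are all assumptions; whether `prepare_for_modeling` already does this is not shown in this notebook.

```python
import pandas as pd

def drop_target_derived(df: pd.DataFrame, target: str, derived_prefixes: list):
    """Split df into X and y, removing columns derived from the target (illustrative)."""
    leaky = [
        col for col in df.columns
        if col != target and col.startswith(tuple(derived_prefixes))
    ]
    X = df.drop(columns=[target] + leaky)
    y = df[target]
    return X, y, leaky

# Toy frame with hypothetical one-hot columns for Absence_Category.
toy = pd.DataFrame({
    "Age": [30, 41, 52],
    "High_Risk": [0, 1, 1],
    "Absence_Category_Short": [1, 0, 0],
    "Absence_Category_Extended": [0, 0, 1],
    "hours": [2, 8, 40],
})

X, y, leaky = drop_target_derived(toy, target="hours", derived_prefixes=["Absence_Category"])
print("Dropped:", leaky)
print("Model inputs:", X.columns.tolist())
```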

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Prepare data for modeling
X, y = engine.prepare_for_modeling(df_engineered.copy(), scale_features=False)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest
rf_model = RandomForestRegressor(n_estimators=100, random_state=42, max_depth=10)
rf_model.fit(X_train, y_train)

# Get feature importance
importance_df = engine.get_feature_importance(rf_model)

print("Top 15 Most Important Features:")
print(importance_df.head(15))

# Visualize feature importance
from src.plots import plot_feature_importance
fig = plot_feature_importance(importance_df, top_n=20)
plt.show()

# Highlight engineered features in top 15
engineered_feature_names = ['Absence_Category', 'BMI_Category', 'Age_Group', 
                           'Distance_Category', 'Workload_Category', 'Season_Name', 'High_Risk']
top_15_features = importance_df.head(15)['Feature'].tolist()
engineered_in_top_15 = [f for f in engineered_feature_names if f in top_15_features]

print(f"\n✅ Engineered features in top 15: {len(engineered_in_top_15)}/{len(engineered_feature_names)}")
if engineered_in_top_15:
    print("Engineered features ranking high:")
    for feat in engineered_in_top_15:
        rank = top_15_features.index(feat) + 1
        importance = importance_df[importance_df['Feature'] == feat]['Importance'].values[0]
        print(f"  #{rank}: {feat} (importance: {importance:.4f})")

## 11. Export Feature-Engineered Dataset
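
The export cell in this section writes a scaled feature matrix computed from all rows. If a later notebook splits that file into training and test sets, the scaling statistics will already include the test rows. A common alternative is to fit the scaling on the training split only, sketched here with plain pandas on synthetic data; the column names are hypothetical, and how `prepare_for_modeling` scales internally is not shown in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
X = pd.DataFrame({
    "age": rng.integers(25, 60, size=100).astype(float),  # synthetic values
    "distance_km": rng.uniform(5, 50, size=100),
})

# Hold out 20% of the rows as a test set.
test_idx = X.sample(frac=0.2, random_state=42).index
X_train, X_test = X.drop(index=test_idx), X.loc[test_idx]

# Fit the scaling statistics on the training rows only, then reuse them for the test rows.
mu = X_train.mean()
sigma = X_train.std(ddof=0)  # population std, as scikit-learn's StandardScaler uses
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma

print(X_train_scaled.describe().loc[["mean", "std"]])
```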

In [None]:
import os

# Create output directory
os.makedirs(config.PROCESSED_DATA_PATH, exist_ok=True)

# Save feature-engineered dataset
output_file = os.path.join(config.PROCESSED_DATA_PATH, 'absenteeism_features_complete.csv')
df_engineered.to_csv(output_file, index=False)

print(f"✅ Feature-engineered dataset saved to: {output_file}")
print(f"Shape: {df_engineered.shape}")
print(f"Features: {df_engineered.shape[1]}")
print(f"Records: {df_engineered.shape[0]}")

# Also save model-ready scaled data
X_scaled, y = engine.prepare_for_modeling(df_engineered.copy(), scale_features=True)

X_file = os.path.join(config.PROCESSED_DATA_PATH, 'X_features_scaled.csv')
y_file = os.path.join(config.PROCESSED_DATA_PATH, 'y_target.csv')

X_scaled.to_csv(X_file, index=False)
y.to_csv(y_file, index=False, header=[config.TARGET_COLUMN])

print(f"\n✅ Model-ready data saved:")
print(f"  Features (scaled): {X_file}")
print(f"  Target: {y_file}")

## Summary

### Features Created (7 total):

1. **Absence_Category** - Bins continuous hours into meaningful categories
   - Enables classification approaches
   - Better pattern recognition

2. **BMI_Category** - WHO-standard health categories
   - Captures non-linear health risk patterns
   - Identifies obesity-related absences

3. **Age_Group** - Life-stage segmentation
   - Reveals age-specific absence patterns
   - Family/health stage correlations

4. **Distance_Category** - Commute distance bins
   - Transportation impact on absence
   - Identifies long-commute risks

5. **Workload_Category** - Workload intensity levels
   - Stress-related absence patterns
   - Burnout indicators

6. **Season_Name** - Named seasons for interpretability
   - Seasonal illness patterns
   - Weather-related absences

7. **High_Risk** - Composite risk indicator
   - Combines disciplinary + health + distance + family factors
   - Intended to help target interventions

### Impact:
- ✅ All 7 features created and validated
- ✅ Feature importance analysis completed
- ✅ Data exported for modeling
- ✅ High_Risk flag tested for statistical significance (see the t-test output in Section 8 for the result)

### Next Steps:
Proceed to `04-aa-model-experiments.ipynb` for MLflow-tracked model training and evaluation.