# Phase 04: Feature Engineering
**Context:** Production-First Methodology

This notebook explores the features generated by the `FeatureEngineer` class. Per the project rules, this phase strictly **enriches historical data**; it does not extend the dataset with future projections, which are handled in the inference phase.

**Key Goals:**
1. Validate feature distributions on historical data.
2. Analyze the correlations of the engineered features.
3. Ensure the dataset maintains its original size (97 rows).
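
As a rough illustration of goal 1, the sketch below runs simple range checks on a toy frame. The helper name `check_feature_ranges` and the bounds are assumptions made for illustration (sine/cosine encodings in [-1, 1], binary flags in {0, 1}, `days_in_month` between 28 and 31); they are not taken from the `FeatureEngineer` implementation.

```python
import pandas as pd

# Hypothetical bounds, assumed for illustration only.
ASSUMED_BOUNDS = {
    "month_sin": (-1.0, 1.0),
    "month_cos": (-1.0, 1.0),
    "is_pandemic": (0, 1),
    "is_bonus_month": (0, 1),
    "days_in_month": (28, 31),
}


def check_feature_ranges(frame, bounds):
    """Count values outside [low, high] for each bounded column present in frame."""
    violations = {}
    for column, (low, high) in bounds.items():
        if column not in frame.columns:
            continue  # skip features the frame does not carry
        # between() is inclusive; NaN is treated as out of range
        outside = ~frame[column].between(low, high)
        violations[column] = int(outside.sum())
    return violations


# Toy data, not the project's dataset.
toy = pd.DataFrame({
    "month_sin": [0.0, 0.5, 1.0],
    "month_cos": [1.0, 0.866, 0.0],
    "is_bonus_month": [0, 1, 0],
    "days_in_month": [31, 28, 32],  # 32 is deliberately invalid
})
report = check_feature_ranges(toy, ASSUMED_BOUNDS)
print(report)
```

A non-zero count points at the column to inspect; the same helper could be applied to the real `df` once the appropriate bounds are agreed on.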

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from src.utils.config_loader import load_config

config = load_config()
plt.style.use('ggplot')
sns.set_palette('viridis')
print('Setup complete.')

In [None]:
# Load the engineered features dataset
features_path = os.path.join(config['general']['paths']['features'], 'master_features.parquet')
df = pd.read_parquet(features_path)
print(f'Shape of features dataset: {df.shape}')

# Integrity check
expected_rows = 97
assert len(df) == expected_rows, f'Error: Expected {expected_rows} rows, but got {len(df)}'
print(f'✅ Integrity check passed: {len(df)} rows.')
df.tail(10)

In [None]:
# Check cyclical features (Sin/Cos)
cols = ['month_sin', 'month_cos', 'quarter_sin', 'quarter_cos', 'semester_sin', 'semester_cos']
df[cols].plot(figsize=(15, 6), title='Cyclical Features over Time')
plt.show()
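
For reference, sine/cosine features like `month_sin` and `month_cos` are commonly built by mapping a periodic value onto the unit circle. The sketch below shows that standard construction; the helper name `encode_cyclical` and the exact formula are assumptions, since the `FeatureEngineer` implementation is not shown in this notebook.

```python
import numpy as np
import pandas as pd


def encode_cyclical(values, period):
    """Map a periodic feature onto the unit circle as a (sin, cos) pair."""
    angle = 2 * np.pi * values / period
    return np.sin(angle), np.cos(angle)


# Toy input: calendar months 1..12 (assumed encoding, for illustration only).
months = pd.Series(range(1, 13), name="month")
month_sin, month_cos = encode_cyclical(months, period=12)


def circle_distance(i, j):
    """Euclidean distance between two months in (sin, cos) space (positional index)."""
    return float(np.hypot(month_sin.iloc[i] - month_sin.iloc[j],
                          month_cos.iloc[i] - month_cos.iloc[j]))


# December -> January is as close as January -> February,
# which a raw 1..12 integer encoding would not preserve.
dec_to_jan = circle_distance(11, 0)
jan_to_feb = circle_distance(0, 1)
print(np.isclose(dec_to_jan, jan_to_feb))
```

Both components are needed together: the sine or cosine alone maps two different months to the same value.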

In [None]:
# Examine business and technical features
tech_cols = ['is_pandemic', 'novenas_intensity', 'is_bonus_month', 'weekend_days_count', 
             'days_in_month', 'holidays_count', 'time_drift_index']
df[tech_cols].tail(12).style.background_gradient(cmap='Blues')

In [None]:
# Visualizing relationship between Novenas/Bonus and Units
fig, ax1 = plt.subplots(figsize=(15, 6))
ax2 = ax1.twinx()
df['total_unidades_entregadas'].plot(ax=ax1, color='gray', alpha=0.5, label='Units')
df['novenas_intensity'].plot(ax=ax2, color='red', style='--', label='Novenas')
df['is_bonus_month'].plot(ax=ax2, color='blue', alpha=0.3, label='Bonus Month')
plt.title('Business Events vs Units')
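# Merge handles from both axes; plt.legend() alone only covers the twin axis (ax2)
handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(handles1 + handles2, labels1 + labels2, loc='upper left')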
plt.show()

In [None]:
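# Correlation with target
target = config['preprocessing']['target_variable']
# numeric_only=True avoids errors on non-numeric columns in pandas >= 2.0;
# the target's self-correlation (always 1.0) is dropped from the ranking
corr = df.corr(numeric_only=True)[target].drop(target).sort_values(ascending=False)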
plt.figure(figsize=(10, 10))
sns.barplot(x=corr.values, y=corr.index)
plt.title(f'Correlation with Target ({target})')
plt.show()