# 🚀 REVENUE FORECASTING - 15 BEST FEATURES WITH REASONING

## ADDRESSING UNDER-PREDICTION ISSUE

Based on the transcript discussions, the model tends to under-predict revenue. We address this by:

1. **Combining features strategically** - Using weighted combinations that capture upside potential
2. **Using remaining sums, not individual forecasts** - Individual forecasts sit too close to the actuals, which biases the model
3. **Trend-based committed ratio** - The ratio increases Jan→Dec; we simulate this with a random % increase each month
4. **Recursive lag features** - Prediction becomes lag_1 for next month
5. **Using 3-month prediction average** - For features requiring smoothed inputs

## 15 SELECTED FEATURES WITH REASONING

Each feature is selected based on:
- Business intuition (explainable to stakeholders)
- Predictive power (correlation + mutual information)
- Stability (minimal imputation needed for simulation)
- Addressing under-prediction (upside capture)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import random
warnings.filterwarnings('ignore')

from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
from sklearn.feature_selection import mutual_info_regression

pd.set_option('display.float_format', lambda x: f'{x:.2f}')
print('✅ Libraries imported!')

In [None]:
# Load data
df = pd.read_csv('mon_final.csv', index_col=0)
df = df.sort_values(['year', 'month_num']).reset_index(drop=True)
df['month_id'] = df['year'] * 100 + df['month_num']  # Unique month identifier

print(f'Dataset shape: {df.shape}')
print(f'Years: {sorted(df["year"].unique())}')
print(f'Sample columns: {list(df.columns)[:15]}...')

## STEP 1: CREATE COMPREHENSIVE FEATURE SET

We create many features but will select only the TOP 15 based on:
1. Correlation with actual revenue
2. Mutual information (non-linear relationships)
3. Business reasoning (explainability)
4. Simulation stability (can we reliably impute during prediction?)
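
As a rough illustration of criteria 1 and 2, the sketch below min-max normalizes a set of correlation scores and a set of mutual-information scores, then averages them with equal weights, which is the same blend the selection cell applies later. The score values are made up for illustration, and `min_max` and `combine_scores` are hypothetical helpers rather than part of the notebook.

```python
def min_max(values):
    """Scale a dict of scores to [0, 1]; a constant dict maps to all zeros."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in values.items()}


def combine_scores(corr, mi, w_corr=0.5):
    """Blend normalized |correlation| and mutual-information scores, best first."""
    corr_n = min_max({k: abs(v) for k, v in corr.items()})
    mi_n = min_max(mi)
    combined = {k: w_corr * corr_n[k] + (1 - w_corr) * mi_n[k] for k in corr}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative numbers only (not results from the dataset).
corr = {'revenue_lag_1': 0.9, 'is_q4': -0.3, 'committed_ratio': 0.5}
mi = {'revenue_lag_1': 0.8, 'is_q4': 0.1, 'committed_ratio': 0.6}
ranking = combine_scores(corr, mi)
for name, score in ranking:
    print(f'{name:20} {score:.3f}')
```

Because both score types are rescaled to the same range first, neither dominates the blend purely because of its units.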

In [None]:
def create_comprehensive_features(df):
    """
    Create a comprehensive feature set.
    Features are designed with UNDER-PREDICTION fix in mind:
    - Combine features to capture upside
    - Use remaining sums (not individual forecasts)
    - Create momentum indicators that capture growth
    """
    df_feat = df.copy().sort_values(['year', 'month_num']).reset_index(drop=True)
    print('\n' + '='*80)
    print('CREATING COMPREHENSIVE FEATURE SET FOR TOP 15 SELECTION')
    print('='*80)
    
    # ========== A. CORE TIME FEATURES ==========
    print('\nüìä A. TIME & SEASONALITY FEATURES')
    
    df_feat['remaining_months'] = 13 - df_feat['month_num']
    df_feat['quarter'] = ((df_feat['month_num'] - 1) // 3) + 1
    df_feat['is_q4'] = (df_feat['quarter'] == 4).astype(int)
    df_feat['is_holiday_month'] = df_feat['month_num'].isin([11, 12]).astype(int)
    df_feat['is_end_of_quarter'] = df_feat['month_num'].isin([3, 6, 9, 12]).astype(int)
    df_feat['year_urgency'] = df_feat['month_num'] / 12  # Increases Jan→Dec
    print('   ✓ Time features: remaining_months, quarter, is_q4, is_holiday_month, is_end_of_quarter, year_urgency')
    
    # ========== B. LAST YEAR ANCHOR FEATURES (STABLE - NO IMPUTATION NEEDED) ==========
    print('\nüìä B. LAST YEAR ANCHOR FEATURES (Stable for simulation)')
    
    # Same month last year - KEY STABLE ANCHOR
    df_feat['ly_same_month_revenue'] = df_feat.groupby('month_num')['actual_revenue'].shift(1)
    print('   ✓ ly_same_month_revenue: Same month last year (STABLE ANCHOR)')
    
    # Same quarter last year average
    df_feat['ly_same_qtr_avg'] = df_feat.groupby(['quarter'])['actual_revenue'].transform(
        lambda x: x.shift(3).rolling(3, min_periods=1).mean()
    )
    print('   ✓ ly_same_qtr_avg: Same quarter last year average')
    
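    # YoY growth per calendar month, lagged by one row.
    # Note: shift(1) runs outside the groupby, so each row receives the YoY growth
    # observed in the PREVIOUS month (already known at prediction time), not a
    # one-year lag of this month's own growth.
    df_feat['ly_yoy_growth_rate'] = df_feat.groupby('month_num')['actual_revenue'].transform(
        lambda x: x.pct_change()
    ).shift(1).clip(-0.5, 1.0)
    print('   ✓ ly_yoy_growth_rate: YoY growth rate observed in the prior month')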
    
    # ========== C. REMAINING FORECAST FEATURES (SUM, NOT INDIVIDUAL!) ==========
    print('\nüìä C. REMAINING FORECAST FEATURES (Sum of remaining months only)')
    
    # Total remaining forecast - CORE FEATURE
    df_feat['fcst_total_rem'] = (
        df_feat['committed_sign_revenue'] + 
        df_feat['committed_unsig_revenue'] + 
        df_feat['wtd_pipeline_revenue']
    )
    df_feat['fcst_signed_rem'] = df_feat['committed_sign_revenue']
    df_feat['fcst_unsigned_rem'] = df_feat['committed_unsig_revenue']
    df_feat['fcst_pipeline_rem'] = df_feat['wtd_pipeline_revenue']
    print('   ✓ fcst_total_rem, fcst_signed_rem, fcst_unsigned_rem, fcst_pipeline_rem')
    
    # Signed per remaining month (density)
    df_feat['signed_per_month'] = df_feat['fcst_signed_rem'] / df_feat['remaining_months'].replace(0, 1)
    print('   ✓ signed_per_month: Signed revenue per remaining month')
    
    # ========== D. COMMITTED RATIO - KEY FOR UNDER-PREDICTION FIX ==========
    print('\nüìä D. COMMITTED RATIO & CONVERSION FEATURES')
    
    # Committed ratio (signed / total) - CRITICAL: increases Jan‚ÜíDec
    df_feat['committed_ratio'] = df_feat['fcst_signed_rem'] / (df_feat['fcst_total_rem'] + 1e-10)
    print('   ‚úì committed_ratio: % of forecast that is committed/signed')
    
    # Unsigned ratio (deals in progress)
    df_feat['unsigned_ratio'] = df_feat['fcst_unsigned_rem'] / (df_feat['fcst_total_rem'] + 1e-10)
    print('   ✓ unsigned_ratio: % of forecast that is unsigned (in negotiation)')
    
    # Pipeline quality (higher = more in late stages)
    df_feat['pipeline_quality'] = (
        df_feat['fcst_signed_rem'] * 1.0 + 
        df_feat['fcst_unsigned_rem'] * 0.7 +
        df_feat['fcst_pipeline_rem'] * 0.3
    ) / (df_feat['fcst_total_rem'] + 1e-10)
    print('   ✓ pipeline_quality: Weighted pipeline maturity score')
    
    # ========== E. REVENUE LAG FEATURES (RECURSIVE DURING SIMULATION) ==========
    print('\nüìä E. REVENUE LAG & VELOCITY FEATURES')
    
    # Lags - CORE RECURSIVE FEATURES
    df_feat['revenue_lag_1'] = df_feat['actual_revenue'].shift(1)
    df_feat['revenue_lag_2'] = df_feat['actual_revenue'].shift(2)
    df_feat['revenue_lag_3'] = df_feat['actual_revenue'].shift(3)
    print('   ✓ revenue_lag_1, revenue_lag_2, revenue_lag_3')
    
    # 3-month average (smoothed baseline) - KEY FOR ADDRESSING UNDER-PREDICTION
    df_feat['revenue_3mo_avg'] = df_feat['actual_revenue'].shift(1).rolling(3, min_periods=1).mean()
    print('   ✓ revenue_3mo_avg: Smoothed 3-month average (reduces volatility)')
    
    # Velocity (momentum indicator)
    df_feat['revenue_velocity'] = df_feat['revenue_lag_1'] - df_feat['revenue_lag_2']
    print('   ✓ revenue_velocity: Month-over-month change')
    
    # Acceleration (is momentum increasing?)
    prev_velocity = df_feat['revenue_velocity'].shift(1)
    df_feat['revenue_acceleration'] = df_feat['revenue_velocity'] - prev_velocity
    print('   ✓ revenue_acceleration: Change in momentum')
    
    # ========== F. COMBINED FEATURES TO ADDRESS UNDER-PREDICTION ==========
    print('\nüìä F. COMBINED FEATURES (Addressing Under-Prediction)')
    
    # Expected revenue = signed + probability-weighted pipeline
    # This captures UPSIDE potential that raw lag features miss
    avg_prob = df_feat['avg_prob_pct'].fillna(30) / 100
    df_feat['expected_revenue'] = (
        df_feat['fcst_signed_rem'] / df_feat['remaining_months'].replace(0, 1) +  # Monthly signed
        df_feat['fcst_unsigned_rem'] * 0.7 / df_feat['remaining_months'].replace(0, 1) +  # 70% of unsigned
        df_feat['fcst_pipeline_rem'] * avg_prob / df_feat['remaining_months'].replace(0, 1)  # Prob-weighted pipeline
    )
    print('   ✓ expected_revenue: Probability-weighted expected monthly revenue')
    
    # Blend of lag and expected (addresses under-prediction by adding upside)
    df_feat['blended_forecast'] = (
        0.6 * df_feat['revenue_lag_1'] +  # Recent actual
        0.4 * df_feat['expected_revenue']  # Expected from pipeline
    )
    print('   ✓ blended_forecast: 60% recent actual + 40% expected (captures upside)')
    
    # YoY-adjusted expectation (if last year grew X%, apply to current)
    df_feat['yoy_adjusted_rev'] = df_feat['ly_same_month_revenue'] * (1 + df_feat['ly_yoy_growth_rate'].fillna(0))
    print('   ✓ yoy_adjusted_rev: Last year × (1 + historical growth rate)')
    
    # Performance vs last year (are we above/below trend?)
    df_feat['perf_vs_ly'] = (
        df_feat['revenue_lag_1'] - df_feat['ly_same_month_revenue'].shift(1)
    ) / (df_feat['ly_same_month_revenue'].shift(1).replace(0, np.nan))
    df_feat['perf_vs_ly'] = df_feat['perf_vs_ly'].clip(-0.5, 1.0).fillna(0)
    print('   ✓ perf_vs_ly: Performance relative to same month last year')
    
    # ========== G. TREND & MOMENTUM INDICATORS ==========
    print('\nüìä G. TREND & MOMENTUM INDICATORS')
    
    # Rolling 6-month average
    df_feat['revenue_6mo_avg'] = df_feat['actual_revenue'].shift(1).rolling(6, min_periods=1).mean()
    
    # Trend direction (3mo vs 6mo)
    df_feat['trend_direction'] = np.sign(df_feat['revenue_3mo_avg'] - df_feat['revenue_6mo_avg'])
    print('   ✓ trend_direction: +1 uptrend, -1 downtrend, 0 flat')
    
    # Trend strength (how much above/below long-term average)
    df_feat['trend_strength'] = (
        df_feat['revenue_3mo_avg'] - df_feat['revenue_6mo_avg']
    ) / (df_feat['revenue_6mo_avg'] + 1e-10)
    df_feat['trend_strength'] = df_feat['trend_strength'].clip(-0.5, 0.5)
    print('   ✓ trend_strength: Magnitude of trend deviation')
    
    # YoY momentum (how we're doing vs last year overall)
    df_feat['yoy_momentum'] = df_feat['revenue_lag_1'] - df_feat['ly_same_month_revenue']
    print('   ✓ yoy_momentum: Current vs same period last year')
    
    # ========== H. FORECAST REALIZATION & CONFIDENCE ==========
    print('\nüìä H. FORECAST CONFIDENCE FEATURES')
    
    # Signed coverage (how much of typical monthly revenue is already signed)
    df_feat['signed_coverage'] = df_feat['signed_per_month'] / (df_feat['revenue_3mo_avg'] + 1e-10)
    df_feat['signed_coverage'] = df_feat['signed_coverage'].clip(0, 3)
    print('   ✓ signed_coverage: Signed per month vs recent average')
    
    # Pipeline health (total pipeline vs what we need)
    df_feat['pipeline_coverage'] = df_feat['fcst_total_rem'] / (
        df_feat['revenue_3mo_avg'] * df_feat['remaining_months'] + 1e-10
    )
    df_feat['pipeline_coverage'] = df_feat['pipeline_coverage'].clip(0, 3)
    print('   ✓ pipeline_coverage: Pipeline vs remaining target')
    
    # Upside potential (what could we get if everything converts)
    df_feat['upside_potential'] = (
        df_feat['fcst_total_rem'] - 
        df_feat['fcst_signed_rem'] - 
        df_feat['fcst_unsigned_rem'] * 0.7
    ) / (df_feat['remaining_months'].replace(0, 1))
    print('   ✓ upside_potential: Monthly upside if pipeline converts')
    
    # ========== I. COMPOSITE SCORE (FINAL COMBINED FEATURE) ==========
    print('\nüìä I. COMPOSITE FEATURES')
    
    # Revenue + Growth composite (weights momentum into prediction)
    df_feat['revenue_growth_composite'] = (
        df_feat['revenue_3mo_avg'] * (1 + df_feat['trend_strength'])
    )
    print('   ✓ revenue_growth_composite: 3mo avg adjusted by trend')
    
    # Final expected (blending all information)
    df_feat['final_expected'] = (
        0.4 * df_feat['revenue_lag_1'] +  # Most recent
        0.3 * df_feat['revenue_3mo_avg'] +  # Smoothed
        0.2 * df_feat['expected_revenue'] +  # Pipeline-based
        0.1 * df_feat['yoy_adjusted_rev']  # YoY trend
    )
    print('   ✓ final_expected: Weighted combination of all signals')
    
    # ========== HANDLE INFINITIES ==========
    for col in df_feat.columns:
        if df_feat[col].dtype in [np.float64, np.int64, np.float32]:
            df_feat[col] = df_feat[col].replace([np.inf, -np.inf], np.nan)
    
    # All features list
    all_features = [
        # Time & Seasonality
        'remaining_months', 'quarter', 'is_q4', 'is_holiday_month', 'is_end_of_quarter', 'year_urgency',
        # Last Year Anchors
        'ly_same_month_revenue', 'ly_same_qtr_avg', 'ly_yoy_growth_rate',
        # Forecast Features
        'fcst_total_rem', 'fcst_signed_rem', 'fcst_unsigned_rem', 'fcst_pipeline_rem', 'signed_per_month',
        # Conversion & Ratios
        'committed_ratio', 'unsigned_ratio', 'pipeline_quality',
        # Lag & Velocity
        'revenue_lag_1', 'revenue_lag_2', 'revenue_lag_3', 'revenue_3mo_avg',
        'revenue_velocity', 'revenue_acceleration',
        # Combined Features
        'expected_revenue', 'blended_forecast', 'yoy_adjusted_rev', 'perf_vs_ly',
        # Trend & Momentum
        'trend_direction', 'trend_strength', 'yoy_momentum',
        # Confidence
        'signed_coverage', 'pipeline_coverage', 'upside_potential',
        # Composite
        'revenue_growth_composite', 'final_expected'
    ]
    
    print(f'\n✅ Created {len(all_features)} features for selection!')
    
    return df_feat, all_features

df_features, all_features = create_comprehensive_features(df)

## STEP 2: SELECT TOP 15 FEATURES WITH DETAILED REASONING

We select the TOP 15 features based on:
1. **Predictive power**: High correlation + mutual information with actual revenue
2. **Business intuition**: Features that stakeholders can understand and trust
3. **Simulation stability**: Features that can be reliably computed during recursive forecasting
4. **Under-prediction fix**: Features that capture upside potential

### FEATURE SELECTION RATIONALE

| # | Feature | Why It's Important |
|---|---------|-------------------|
| 1 | revenue_lag_1 | Most recent performance is the strongest short-term predictor |
| 2 | revenue_3mo_avg | Smooths volatility, provides stable baseline |
| 3 | ly_same_month_revenue | Captures seasonality without needing imputation |
| 4 | fcst_signed_rem | Committed revenue has highest conversion certainty |
| 5 | signed_per_month | Normalizes signed by remaining time |
| 6 | committed_ratio | Shows deal maturity, increases Jan→Dec |
| 7 | expected_revenue | Probability-weighted upside (addresses under-prediction) |
| 8 | blended_forecast | Combines actuals + pipeline expectations |
| 9 | yoy_adjusted_rev | Projects growth based on historical patterns |
| 10 | revenue_velocity | Captures momentum (up or down trend) |
| 11 | trend_strength | Measures deviation from long-term average |
| 12 | yoy_momentum | Year-over-year performance comparison |
| 13 | signed_coverage | Pipeline health indicator |
| 14 | is_end_of_quarter | End-of-quarter push effect |
| 15 | final_expected | Composite of all signals for robustness |
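
To make rows 7, 8, and 15 concrete, here is a small worked example of the arithmetic behind `expected_revenue`, `blended_forecast`, and `final_expected`, using the same weights as the feature-creation cell (70% of unsigned, a default 30% pipeline probability, a 60/40 blend, and 0.4/0.3/0.2/0.1 for the composite). The dollar figures are invented for illustration, and the standalone functions are a sketch, not the notebook's own code.

```python
def expected_revenue(signed, unsigned, pipeline, remaining_months, avg_prob=0.30):
    """Monthly expected revenue: signed + 70% of unsigned + prob-weighted pipeline."""
    months = max(remaining_months, 1)  # guard against division by zero
    return (signed + 0.7 * unsigned + avg_prob * pipeline) / months


def blended_forecast(lag_1, expected):
    """60% most recent actual + 40% pipeline-based expectation."""
    return 0.6 * lag_1 + 0.4 * expected


def final_expected(lag_1, avg_3mo, expected, yoy_adjusted):
    """Weighted composite of the four signals."""
    return 0.4 * lag_1 + 0.3 * avg_3mo + 0.2 * expected + 0.1 * yoy_adjusted


# Made-up figures: 600 signed, 200 unsigned, 400 pipeline, 4 months remaining.
exp_rev = expected_revenue(signed=600.0, unsigned=200.0, pipeline=400.0, remaining_months=4)
blend = blended_forecast(lag_1=200.0, expected=exp_rev)
final = final_expected(lag_1=200.0, avg_3mo=190.0, expected=exp_rev, yoy_adjusted=210.0)
print(f'expected={exp_rev:.1f}  blended={blend:.1f}  final={final:.1f}')
```

With these inputs the pipeline-based expectation (215) sits above the last actual (200), so both blends are pulled upward, which is the upside-capture effect the table describes.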

In [None]:
def select_top_15_features_with_reasoning(df, feature_list):
    """
    Select top 15 features with detailed reasoning for each.
    Uses 2023-2024 data for selection to avoid data leakage.
    
    ADDRESSING UNDER-PREDICTION:
    - Include combined features that capture upside potential
    - Include momentum features that detect growth trends
    - Include YoY features that project based on historical growth
    """
    print('\n' + '='*80)
    print('TOP 15 FEATURE SELECTION WITH BUSINESS REASONING')
    print('='*80)
    
    # Use only 2023-2024 data for feature selection
    df_select = df[df['year'].isin([2023, 2024])].copy().dropna(subset=['actual_revenue'])
    
    # Remove features with too many NaNs
    valid_features = []
    for f in feature_list:
        if f in df_select.columns:
            null_pct = df_select[f].isna().mean()
            if null_pct < 0.3:
                valid_features.append(f)
    
    # Fill NaNs for scoring
    X = df_select[valid_features].fillna(df_select[valid_features].median())
    y = df_select['actual_revenue']
    
    # Compute scores
    correlations = X.corrwith(y).abs()
    mi_scores = mutual_info_regression(X, y, random_state=42)
    mi_df = pd.DataFrame({'Feature': valid_features, 'MI_Score': mi_scores})
    
    # Normalize and combine
    corr_norm = (correlations - correlations.min()) / (correlations.max() - correlations.min() + 1e-10)
    mi_norm = (mi_df.set_index('Feature')['MI_Score'] - mi_df['MI_Score'].min()) / (mi_df['MI_Score'].max() - mi_df['MI_Score'].min() + 1e-10)
    combined_score = 0.5 * corr_norm + 0.5 * mi_norm
    ranking = combined_score.sort_values(ascending=False)
    
    # Define our CURATED top 15 with reasoning
    # (We use data-driven ranking but also apply business logic)
    
    curated_15 = [
        ('revenue_lag_1', 'Most recent actual revenue - strongest short-term predictor'),
        ('revenue_3mo_avg', 'Smoothed baseline - reduces monthly volatility'),
        ('ly_same_month_revenue', 'Same month last year - captures seasonality without imputation'),
        ('fcst_signed_rem', 'Committed signed revenue - highest certainty pipeline'),
        ('signed_per_month', 'Signed per remaining month - time-normalized commitment'),
        ('committed_ratio', 'Signed/Total ratio - deal maturity indicator (increases Jan→Dec)'),
        ('expected_revenue', 'Probability-weighted expected - captures UPSIDE potential'),
        ('blended_forecast', '60% actual + 40% expected - combines past with pipeline'),
        ('yoy_adjusted_rev', 'Last year × growth rate - projects historical patterns'),
        ('revenue_velocity', 'Month-over-month change - captures momentum'),
        ('trend_strength', 'Deviation from long-term average - trend magnitude'),
        ('yoy_momentum', 'Current vs last year difference - YoY performance'),
        ('signed_coverage', 'Signed vs avg revenue - pipeline health ratio'),
        ('is_end_of_quarter', 'Quarter-end indicator - captures closing push'),
        ('final_expected', 'Weighted composite - robust combination of all signals')
    ]
    
    print('\n' + '='*80)
    print('THE 15 SELECTED FEATURES WITH REASONING')
    print('='*80)
    print()
    
    selected_features = []
    feature_reasoning = {}
    
    for i, (feat, reason) in enumerate(curated_15, 1):
        if feat in valid_features:
            score = combined_score.get(feat, 0)
            corr_val = correlations.get(feat, 0)
            mi_val = mi_df[mi_df['Feature'] == feat]['MI_Score'].values[0] if feat in mi_df['Feature'].values else 0
            
            print(f'\nüîπ FEATURE {i}: {feat}')
            print(f'   üìä Correlation: {corr_val:.3f} | MI Score: {mi_val:.3f} | Combined: {score:.3f}')
            print(f'   üí° REASONING: {reason}')
            
            # Additional category explanation
            if 'lag' in feat or 'avg' in feat:
                print(f'   üìà CATEGORY: Historical Performance')
                print(f'   üîß SIMULATION: Updated recursively from predictions')
            elif 'ly_' in feat:
                print(f'   üìà CATEGORY: Year-over-Year Anchor')
                print(f'   üîß SIMULATION: Stable - uses last year data directly')
            elif 'fcst' in feat or 'signed' in feat or 'committed' in feat:
                print(f'   üìà CATEGORY: Pipeline/Forecast')
                print(f'   üîß SIMULATION: Updated with trend-based revision + ratio increase')
            elif 'expected' in feat or 'blended' in feat or 'final' in feat:
                print(f'   üìà CATEGORY: Combined/Composite')
                print(f'   üîß SIMULATION: Computed from other features - captures upside')
            elif 'velocity' in feat or 'trend' in feat or 'momentum' in feat:
                print(f'   üìà CATEGORY: Momentum/Trend')
                print(f'   üîß SIMULATION: Recomputed from lag features')
            elif 'quarter' in feat:
                print(f'   üìà CATEGORY: Seasonality')
                print(f'   üîß SIMULATION: Known from calendar - no imputation needed')
            
            selected_features.append(feat)
            feature_reasoning[feat] = reason
        else:
            print(f'\n⚠️ Feature {feat} not available, skipping...')
    
    print('\n' + '='*80)
    print('SUMMARY: TOP 15 FEATURES SELECTED')
    print('='*80)
    
    for i, f in enumerate(selected_features, 1):
        print(f'   {i:2}. {f}')
    
    print(f'\n✅ Selected {len(selected_features)} features for modeling!')
    
    # Show why these features address UNDER-PREDICTION
    print('\n' + '='*80)
    print('HOW THESE FEATURES ADDRESS UNDER-PREDICTION')
    print('='*80)
    print('''
    ┌────────────────────────────────────────────────────────────────────────┐
    │ PROBLEM: Pure lag-based models tend to under-predict because they      │
    │          only look at past actuals, missing pipeline upside.           │
    ├────────────────────────────────────────────────────────────────────────┤
    │ SOLUTION: Our feature set includes:                                    │
    │                                                                        │
    │   1. UPSIDE CAPTURE FEATURES:                                          │
    │      • expected_revenue: Probability-weighted pipeline potential       │
    │      • blended_forecast: Combines actuals + expected (40% upside)      │
    │      • final_expected: Composite that weights all signals              │
    │                                                                        │
    │   2. MOMENTUM FEATURES:                                                │
    │      • revenue_velocity: Captures if we're trending up                 │
    │      • trend_strength: How much above/below average                    │
    │      • yoy_momentum: Are we beating last year?                         │
    │                                                                        │
    │   3. YoY GROWTH PROJECTION:                                            │
    │      • yoy_adjusted_rev: Projects growth from historical patterns      │
    │      • ly_same_month_revenue: Stable anchor for seasonality            │
    │                                                                        │
    │   4. PIPELINE CONFIDENCE:                                              │
    │      • committed_ratio: Increases as year progresses                   │
    │      • signed_coverage: How healthy is our pipeline?                   │
    │      • signed_per_month: Normalized commitment level                   │
    └────────────────────────────────────────────────────────────────────────┘
    ''')
    
    return selected_features, feature_reasoning, ranking

top_15_features, feature_reasoning, feature_ranking = select_top_15_features_with_reasoning(df_features, all_features)
print(f'\nüéØ Final Feature Set: {top_15_features}')

## STEP 3: ANALYZE COMMITTED RATIO TREND

Key insight from transcript: Committed ratio increases from January to December.
We simulate this by applying a random percentage increase each month.
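
A minimal sketch of that simulation idea: each month the ratio grows by a random step drawn uniformly from a range, and the result is capped (the recursive cell later in this notebook caps at 0.95). The starting ratio, step bounds, and the helper name `simulate_committed_ratio` are illustrative assumptions; in the notebook the bounds come from the historical month-over-month increases.

```python
import random


def simulate_committed_ratio(start_ratio, n_months, min_inc, max_inc, cap=0.95, seed=None):
    """Return a ratio path that rises by a random uniform step each month."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    path = [start_ratio]
    for _ in range(n_months):
        step = rng.uniform(min_inc, max_inc)
        path.append(min(path[-1] + step, cap))  # never exceed the cap
    return path


# Assumed inputs: start at 0.50 in March and simulate the 9 months to December.
path = simulate_committed_ratio(0.50, n_months=9, min_inc=0.01, max_inc=0.05, seed=42)
for offset, ratio in enumerate(path):
    print(f'step {offset}: {ratio:.3f}')
```

Passing a seed keeps a simulation run reproducible, which makes it easier to compare scenarios.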

In [None]:
def analyze_committed_ratio_trend(df_hist):
    """
    Analyze how committed_ratio increases from Jan to Dec.
    This is KEY for simulation - we must simulate this increase!
    """
    print('\n' + '='*80)
    print('COMMITTED RATIO TREND ANALYSIS')
    print('='*80)
    
    # Get monthly averages for 2023-2024
    monthly_ratios = df_hist[df_hist['year'].isin([2023, 2024])].groupby('month_num')['committed_ratio'].agg(['mean', 'std'])
    
    print('\nüìä Committed Ratio by Month (2023-2024 Average):')
    print('-' * 60)
    for m, row in monthly_ratios.iterrows():
        bar = '█' * int(row['mean'] * 40)
        print(f'   Month {m:2}: {row["mean"]:.3f} ± {row["std"]:.3f}  {bar}')
    
    # Calculate month-over-month increase
    increases = []
    for m in range(1, 12):
        if m in monthly_ratios.index and (m+1) in monthly_ratios.index:
            increase = monthly_ratios.loc[m+1, 'mean'] - monthly_ratios.loc[m, 'mean']
            increases.append({'from': m, 'to': m+1, 'increase': increase})
    
    inc_df = pd.DataFrame(increases)
    avg_increase = inc_df['increase'].mean()
    std_increase = inc_df['increase'].std()
    
    print('\nüìà Month-over-Month Increases:')
    for _, row in inc_df.iterrows():
        # iterrows() upcasts the integer month columns to float, so cast back for display
        print(f'   Month {int(row["from"]):2} → {int(row["to"]):2}: {row["increase"]:+.4f}')
    
    print(f'\nüìä Summary:')
    print(f'   Average increase: {avg_increase:.4f}')
    print(f'   Std deviation:    {std_increase:.4f}')
    print(f'   Range for simulation: [{max(0.01, avg_increase-std_increase):.4f}, {avg_increase+std_increase:.4f}]')
    
    return {
        'monthly_ratios': monthly_ratios,
        'avg_increase': avg_increase,
        'std_increase': std_increase,
        'min_increase': max(0.01, avg_increase - std_increase),
        'max_increase': avg_increase + std_increase
    }

ratio_trend = analyze_committed_ratio_trend(df_features)

## STEP 4: TRAIN MODELS ON 2023-2024 DATA

Train Lasso, Ridge, ElasticNet on historical data.
NO SCALING as discussed - improves precision.
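
The training cell below cross-validates with a time-ordered split. The sketch here illustrates the expanding-window idea behind such a split with a hypothetical pure-Python helper (`expanding_window_splits` is not scikit-learn code), assuming 24 monthly rows covering 2023-2024 and 3 folds.

```python
def expanding_window_splits(n_samples, n_splits=3):
    """Yield (train_idx, test_idx) pairs where training rows always precede test rows."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError('Not enough samples for the requested number of splits')
    first_test = n_samples - n_splits * test_size
    for start in range(first_test, n_samples, test_size):
        # Train on everything before the test block, test on the next block
        yield list(range(start)), list(range(start, start + test_size))


# Assumed: 24 monthly rows (two years of history), 3 folds.
splits = list(expanding_window_splits(24, n_splits=3))
for train_idx, test_idx in splits:
    print(f'train rows 0-{train_idx[-1]:2}  ->  test rows {test_idx[0]}-{test_idx[-1]}')
```

Each fold trains only on months before its test block, so no future revenue leaks into the validation score.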

In [None]:
def train_models(df, features, train_years=(2023, 2024)):
    """
    Train Lasso, Ridge, ElasticNet on historical data.
    NO SCALING - improves precision.
    """
    print('\n' + '='*80)
    print(f'TRAINING MODELS ON {train_years} DATA (NO SCALING)')
    print('='*80)
    
    # Get training data
    train_df = df[df['year'].isin(train_years)].copy().dropna(subset=['actual_revenue'])
    
    # Prepare features
    X_train = train_df[features].copy()
    y_train = train_df['actual_revenue'].copy()
    
    # Fill NaNs with median and store for imputation
    feature_medians = {}
    for col in features:
        median_val = X_train[col].median()
        feature_medians[col] = median_val
        X_train[col] = X_train[col].fillna(median_val)
    
    print(f'\nTraining samples: {len(X_train)}')
    print(f'Features: {len(features)}')
    
    print('\nüìä Feature Medians (for imputation):')
    for f, m in feature_medians.items():
        print(f'   {f:30}: {m:>15,.2f}')
    
    # Define models
    models = {
        'Lasso': Lasso(alpha=500, random_state=42, max_iter=10000),
        'Ridge': Ridge(alpha=500, random_state=42),
        'ElasticNet': ElasticNet(alpha=500, l1_ratio=0.5, random_state=42, max_iter=10000)
    }
    
    # Cross-validation
    tscv = TimeSeriesSplit(n_splits=3)
    
    results = {}
    for name, model in models.items():
        print(f'\nüìä Training {name}...')
        
        # CV score
        cv_scores = cross_val_score(model, X_train, y_train, cv=tscv, scoring='neg_mean_absolute_error')
        cv_mae = -cv_scores.mean()
        
        # Fit on full training data
        model.fit(X_train, y_train)
        train_pred = model.predict(X_train)
        train_mae = mean_absolute_error(y_train, train_pred)
        train_mape = mean_absolute_percentage_error(y_train, train_pred) * 100
        
        results[name] = {
            'model': model,
            'cv_mae': cv_mae,
            'train_mae': train_mae,
            'train_mape': train_mape
        }
        
        print(f'   CV MAE:    ${cv_mae:,.0f}')
        print(f'   Train MAE: ${train_mae:,.0f}')
        print(f'   Train MAPE: {train_mape:.2f}%')
        
        # Show coefficients with reasoning
        coef_df = pd.DataFrame({'Feature': features, 'Coefficient': model.coef_})
        coef_df['Abs_Coef'] = coef_df['Coefficient'].abs()
        coef_df = coef_df.sort_values('Abs_Coef', ascending=False)
        
        print(f'   Top 5 Coefficients (what drives predictions):')
        for _, row in coef_df.head(5).iterrows():
            direction = 'üìà' if row['Coefficient'] > 0 else 'üìâ'
            print(f'      {direction} {row["Feature"]}: {row["Coefficient"]:,.2f}')
    
    # Select best model
    best_name = min(results.keys(), key=lambda x: results[x]['cv_mae'])
    print(f'\nüèÜ Best Model: {best_name} (lowest CV MAE: ${results[best_name]["cv_mae"]:,.0f})')
    
    return results, feature_medians, best_name

model_results, feature_medians, best_model_name = train_models(df_features, top_15_features)

## STEP 5: RECURSIVE SIMULATION FOR 2025

### THE CORRECT APPROACH (From Transcript):

1. **Predict March** using actual features
2. **Use March prediction as `revenue_lag_1`** for April
3. **Recompute all features** based on predictions:
   - `revenue_velocity` = lag_1 - lag_2
   - `revenue_3mo_avg` = average of last 3 predictions
   - Combined features recalculated
4. **Update committed_ratio** with a random % increase (simulating the Jan→Dec trend)
5. **Repeat until December**

### ADDRESSING UNDER-PREDICTION:
- Combined features (expected_revenue, blended_forecast) capture pipeline upside
- 3-month average of predictions (as per transcript) provides smoothed input
- YoY-adjusted features project growth
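
The loop described above can be sketched in a few lines. This is a simplified illustration, not the notebook's implementation: it seeds the series with two made-up actuals, rebuilds only three of the lag-based features at each step, and uses a hypothetical `toy_model` in place of the trained regressor.

```python
def recursive_forecast(history, predict_fn, n_steps):
    """Roll a one-step model forward, feeding each prediction back in as a lag."""
    series = list(history)  # known actuals first; predictions get appended
    predictions = []
    for _ in range(n_steps):
        lag_1 = series[-1]
        lag_2 = series[-2] if len(series) >= 2 else lag_1
        window = series[-3:]  # up to the last three values
        features = {
            'revenue_lag_1': lag_1,
            'revenue_3mo_avg': sum(window) / len(window),
            'revenue_velocity': lag_1 - lag_2,
        }
        pred = predict_fn(features)
        predictions.append(pred)
        series.append(pred)  # this prediction becomes lag_1 next month
    return predictions


def toy_model(f):
    """Stand-in for the trained model (assumption for illustration only)."""
    return f['revenue_3mo_avg'] + 0.5 * f['revenue_velocity']


preds = recursive_forecast([100.0, 110.0], toy_model, n_steps=3)
print([round(p, 2) for p in preds])
```

Since every step consumes earlier predictions, any bias in the first month compounds across the rest of the year, which is why the smoothed 3-month average is used as a stabilizing input.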

In [None]:
def recompute_features_from_predictions(running_predictions, last_row, sim_month, features, feature_medians, ratio_trend, df_hist):
    """
    Recompute all features based on predictions.
    
    KEY FROM TRANSCRIPT:
    - revenue_lag_1 = last prediction
    - For June, take average of March, April, May predictions as a feature
    - Committed ratio increases each month
    """
    
    # Get last year data for this month
    ly_data = df_hist[(df_hist['year'] == 2024) & (df_hist['month_num'] == sim_month)]
    ly_same_month_rev = ly_data['actual_revenue'].values[0] if len(ly_data) > 0 else feature_medians.get('ly_same_month_revenue', 0)
    
    # Get predictions list
    pred_list = [p['predicted'] for p in running_predictions]
    
    # === COMPUTE LAG FEATURES FROM PREDICTIONS ===
    revenue_lag_1 = pred_list[-1] if len(pred_list) >= 1 else feature_medians.get('revenue_lag_1', 0)
    revenue_lag_2 = pred_list[-2] if len(pred_list) >= 2 else feature_medians.get('revenue_lag_2', 0)
    revenue_lag_3 = pred_list[-3] if len(pred_list) >= 3 else feature_medians.get('revenue_lag_3', 0)
    
    # === 3-MONTH AVERAGE FROM PREDICTIONS (KEY FROM TRANSCRIPT) ===
    # "Take average of last 3 months (model predictions) to use as feature"
    if len(pred_list) >= 3:
        revenue_3mo_avg = np.mean(pred_list[-3:])
    elif len(pred_list) >= 1:
        revenue_3mo_avg = np.mean(pred_list)
    else:
        revenue_3mo_avg = feature_medians.get('revenue_3mo_avg', 0)
    
    # === VELOCITY & MOMENTUM ===
    revenue_velocity = revenue_lag_1 - revenue_lag_2 if revenue_lag_2 > 0 else 0
    yoy_momentum = revenue_lag_1 - ly_same_month_rev
    
    # === TREND FEATURES ===
    revenue_6mo_avg = np.mean(pred_list[-6:]) if len(pred_list) >= 6 else revenue_3mo_avg
    trend_strength = (revenue_3mo_avg - revenue_6mo_avg) / (revenue_6mo_avg + 1e-10)
    trend_strength = np.clip(trend_strength, -0.5, 0.5)
    
    # === COMMITTED RATIO - INCREASES EACH MONTH ===
    # Random increase within historical range
    base_ratio = running_predictions[-1].get('committed_ratio', 0.5) if running_predictions else 0.5
    ratio_increase = random.uniform(
        ratio_trend['min_increase'],
        ratio_trend['max_increase']
    )
    committed_ratio = min(base_ratio + ratio_increase, 0.95)
    
    # === FORECAST STATE UPDATE ===
    # Burn down forecast by last prediction
    remaining_months = 13 - sim_month
    last_fcst_total = running_predictions[-1].get('fcst_total_rem', 0) if running_predictions else 0
    fcst_total_rem = max(0, last_fcst_total - revenue_lag_1 * 0.8)  # Some remains unconverted
    fcst_signed_rem = fcst_total_rem * committed_ratio
    signed_per_month = fcst_signed_rem / max(remaining_months, 1)
    
    # === YoY ADJUSTED REVENUE ===
    ly_growth = ly_data['ly_yoy_growth_rate'].values[0] if len(ly_data) > 0 and 'ly_yoy_growth_rate' in ly_data.columns else 0.05
    yoy_adjusted_rev = ly_same_month_rev * (1 + ly_growth)
    
    # === EXPECTED REVENUE (UPSIDE CAPTURE) ===
    avg_prob = 0.3  # Default probability
    expected_revenue = (
        signed_per_month +
        (fcst_total_rem - fcst_signed_rem) * 0.5 * avg_prob / max(remaining_months, 1)
    )
    
    # === BLENDED FORECAST (60% lag_1 + 40% expected) ===
    blended_forecast = 0.6 * revenue_lag_1 + 0.4 * expected_revenue
    
    # === SIGNED COVERAGE ===
    signed_coverage = signed_per_month / (revenue_3mo_avg + 1e-10)
    signed_coverage = np.clip(signed_coverage, 0, 3)
    
    # === FINAL EXPECTED (COMPOSITE) ===
    final_expected = (
        0.4 * revenue_lag_1 +
        0.3 * revenue_3mo_avg +
        0.2 * expected_revenue +
        0.1 * yoy_adjusted_rev
    )
    
    # === BUILD FEATURE DICT ===
    computed_features = {
        'revenue_lag_1': revenue_lag_1,
        'revenue_3mo_avg': revenue_3mo_avg,
        'ly_same_month_revenue': ly_same_month_rev,
        'fcst_signed_rem': fcst_signed_rem,
        'signed_per_month': signed_per_month,
        'committed_ratio': committed_ratio,
        'expected_revenue': expected_revenue,
        'blended_forecast': blended_forecast,
        'yoy_adjusted_rev': yoy_adjusted_rev,
        'revenue_velocity': revenue_velocity,
        'trend_strength': trend_strength,
        'yoy_momentum': yoy_momentum,
        'signed_coverage': signed_coverage,
        'is_end_of_quarter': 1 if sim_month in [3, 6, 9, 12] else 0,
        'final_expected': final_expected,
        'fcst_total_rem': fcst_total_rem  # For state tracking
    }
    
    return computed_features, committed_ratio
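
In the function above, `revenue_lag_2` and `revenue_lag_3` fall back to training medians until enough predictions have accumulated, so the first simulated months mix predictions with medians. A minimal sketch of one alternative, assuming the most recent actuals are available at the sitting month: extend the actual series with the predictions and read the lags from the combined series. `lags_from_history` and its arguments are hypothetical names, not part of the notebook.

```python
import numpy as np

def lags_from_history(history, preds, n_lags=3):
    """Hypothetical helper: lag features from actuals followed by predictions.

    history: known actual revenues, oldest first
    preds:   recursive predictions made so far, oldest first
    """
    series = list(history) + list(preds)  # predictions extend the actual series
    lags = {}
    for k in range(1, n_lags + 1):
        # lag_k is the k-th most recent value, NaN if the series is too short
        lags[f'revenue_lag_{k}'] = series[-k] if len(series) >= k else np.nan
    window = series[-n_lags:]
    lags['revenue_3mo_avg'] = float(np.mean(window)) if window else np.nan
    return lags

# Two months of actuals and one prediction so far (made-up numbers)
feats = lags_from_history(history=[100.0, 110.0], preds=[120.0])
print(feats)
```

With this layout the latest prediction is still `revenue_lag_1`, and older lags come from real history rather than medians until predictions replace them.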

In [None]:
def recursive_simulation_2025(df, model, features, feature_medians, ratio_trend, sitting_month=3, sitting_year=2025):
    """
    RECURSIVE SIMULATION WITH TOP 15 FEATURES
    
    Addressing Under-Prediction:
    1. Combined features (expected_revenue, blended_forecast) capture upside
    2. 3-month prediction average as feature (per transcript)
    3. YoY-adjusted projections
    4. Committed ratio increases monthly
    """
    print('\n' + '='*80)
    print('RECURSIVE SIMULATION FOR 2025 (TOP 15 FEATURES)')
    print('='*80)
    print(f'📍 Sitting Position: Month {sitting_month}, Year {sitting_year}')
    print(f'📊 Using {len(features)} selected features')
    
    month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec']
    
    # Get historical data
    df_hist = df[df['year'].isin([2023, 2024])].copy()
    
    # Get all data up to sitting month
    running_df = df[
        (df['year'] < sitting_year) |
        ((df['year'] == sitting_year) & (df['month_num'] <= sitting_month))
    ].copy().sort_values(['year', 'month_num']).reset_index(drop=True)
    
    last_row = running_df.iloc[-1]
    predictions = []
    
    # Track state
    current_fcst_total = last_row.get('fcst_total_rem', 0)
    current_committed_ratio = last_row.get('committed_ratio', 0.5)
    
    print('\n' + '='*80)
    print('MONTH-BY-MONTH PREDICTIONS')
    print('='*80)
    
    for sim_month in range(sitting_month, 13):
        print(f'\n📅 {month_names[sim_month-1]} {sitting_year} (Month {sim_month})')
        print('-' * 60)
        
        if sim_month == sitting_month:
            # First month - use actual features
            month_data = running_df[running_df['month_num'] == sim_month].iloc[-1:].copy()
            
            X_pred = pd.DataFrame()
            for col in features:
                if col in month_data.columns:
                    X_pred[col] = month_data[col].values
                else:
                    X_pred[col] = [feature_medians.get(col, 0)]
            
            current_committed_ratio = month_data['committed_ratio'].values[0] if 'committed_ratio' in month_data.columns else 0.5
            current_fcst_total = month_data['fcst_total_rem'].values[0] if 'fcst_total_rem' in month_data.columns else 0
            
        else:
            # Subsequent months - recompute from predictions
            computed, current_committed_ratio = recompute_features_from_predictions(
                predictions, last_row, sim_month, features, feature_medians, ratio_trend, df_hist
            )
            
            # Build feature vector
            X_pred = pd.DataFrame()
            for col in features:
                if col in computed:
                    X_pred[col] = [computed[col]]
                else:
                    X_pred[col] = [feature_medians.get(col, 0)]
            
            current_fcst_total = computed.get('fcst_total_rem', 0)
            
            # Show key feature values
            print(f'   Key Features:')
            print(f'      revenue_lag_1:        ${computed.get("revenue_lag_1", 0):>12,.0f} (last prediction)')
            print(f'      revenue_3mo_avg:      ${computed.get("revenue_3mo_avg", 0):>12,.0f} (avg of last 3 preds)')
            print(f'      committed_ratio:      {current_committed_ratio:>12.3f} (increasing monthly)')
            print(f'      expected_revenue:     ${computed.get("expected_revenue", 0):>12,.0f} (upside capture)')
            print(f'      blended_forecast:     ${computed.get("blended_forecast", 0):>12,.0f} (60% lag + 40% exp)')
        
        # Fill any NaNs
        for col in features:
            if col not in X_pred.columns:
                X_pred[col] = [feature_medians.get(col, 0)]
            X_pred[col] = X_pred[col].fillna(feature_medians.get(col, 0))
        
        # Make prediction
        # Select columns in training order before converting to an array
        pred = model.predict(X_pred[features].to_numpy())[0]
        pred = max(pred, 0)
        
        # Get actual if available
        actual_row = df[(df['year'] == sitting_year) & (df['month_num'] == sim_month)]
        actual = actual_row['actual_revenue'].values[0] if len(actual_row) > 0 and not pd.isna(actual_row['actual_revenue'].values[0]) else np.nan
        
        # Store prediction
        predictions.append({
            'year': sitting_year,
            'month': month_names[sim_month - 1],
            'month_num': sim_month,
            'actual': actual,
            'predicted': pred,
            'committed_ratio': current_committed_ratio,
            'fcst_total_rem': current_fcst_total
        })
        
        # Display result
        print(f'\n   🎯 PREDICTED: ${pred:>15,.0f}')
        if not pd.isna(actual):
            error = actual - pred
            error_pct = abs(error / actual) * 100
            direction = '📈 Over' if pred > actual else '📉 Under'
            print(f'   📊 ACTUAL:    ${actual:>15,.0f}')
            print(f'   ⚡ ERROR:     ${error:>15,.0f} ({error_pct:.1f}%) {direction}')
        else:
            print(f'   📊 ACTUAL:    (Not available - future month)')
    
    return pd.DataFrame(predictions)

# Run simulation
best_model = model_results[best_model_name]['model']
predictions_df = recursive_simulation_2025(
    df_features,
    best_model,
    top_15_features,
    feature_medians,
    ratio_trend,
    sitting_month=3,
    sitting_year=2025
)
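
Each call to `recursive_simulation_2025` draws new random ratio increases, so repeated runs of the cell above can give different forecasts. The sketch below shows one way to make the committed-ratio path reproducible with a locally seeded generator. `simulate_committed_ratio`, its `seed` argument, and the example bounds are assumptions for illustration; the notebook itself reads the bounds from `ratio_trend`.

```python
import random

def simulate_committed_ratio(start, min_increase, max_increase, n_months,
                             cap=0.95, seed=42):
    """Hypothetical helper: random monthly increases, capped, reproducible."""
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    ratios = [start]
    for _ in range(n_months):
        step = rng.uniform(min_increase, max_increase)
        ratios.append(min(ratios[-1] + step, cap))  # same 0.95 cap as the notebook
    return ratios

# Made-up bounds: each month adds between 1 and 5 percentage points
path = simulate_committed_ratio(start=0.50, min_increase=0.01,
                                max_increase=0.05, n_months=9)
print([round(r, 3) for r in path])
```

Running the simulation under several seeds and comparing the resulting forecasts would also show how sensitive the predictions are to the simulated ratio.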

In [None]:
# Display Final Results Table
print('\n' + '='*100)
print(f'FINAL PREDICTIONS: March-December 2025 ({best_model_name})')
print('='*100)
print('-'*100)
print(f'{"Month":8} {"Year":6} {"Actual Revenue":>20} {"Predicted Revenue":>20} {"Difference":>18} {"Error%":>10}')
print('-'*100)

for _, row in predictions_df.iterrows():
    actual = row['actual']
    pred = row['predicted']
    
    if pd.notna(actual):
        diff = actual - pred
        error_pct = abs(diff / actual) * 100
        # 📉 = under-prediction, 📈 = over-prediction; widths match the table header
        actual_str = f'${actual:>19,.0f}'
        diff_str = f'{"📉" if diff > 0 else "📈"} ${abs(diff):>14,.0f}'
        error_str = f'{error_pct:>9.1f}%'
    else:
        actual_str = f'{"(Future)":>20}'
        diff_str = f'{"N/A":>18}'
        error_str = f'{"N/A":>10}'
    
    pred_str = f'${pred:>19,.0f}'
    print(f'{row["month"]:8} {int(row["year"]):6} {actual_str} {pred_str} {diff_str} {error_str}')

print('-'*100)

In [None]:
# Performance Metrics
print('\n' + '='*80)
print('PERFORMANCE METRICS')
print('='*80)

mask = predictions_df['actual'].notna()
if mask.sum() > 0:
    y_true = predictions_df.loc[mask, 'actual'].values
    y_pred = predictions_df.loc[mask, 'predicted'].values
    
    mae = mean_absolute_error(y_true, y_pred)
    mape = mean_absolute_percentage_error(y_true, y_pred) * 100
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    
    # Bias check (positive = under-prediction, negative = over-prediction)
    bias = np.mean(y_true - y_pred)
    
    print(f'\n📊 Metrics on {mask.sum()} months with actual revenue:')
    print(f'   MAE:  ${mae:>15,.0f}')
    print(f'   MAPE: {mape:>14.2f}%')
    print(f'   RMSE: ${rmse:>15,.0f}')
    print(f'   BIAS: ${bias:>15,.0f} ({"Under-predicting" if bias > 0 else "Over-predicting"})')
    
    print(f'\n📈 Cumulative Totals:')
    print(f'   Actual Total:    ${np.sum(y_true):>15,.0f}')
    print(f'   Predicted Total: ${np.sum(y_pred):>15,.0f}')
    print(f'   Gap:             ${np.sum(y_true) - np.sum(y_pred):>15,.0f}')
    
    # Month-by-month error analysis
    print(f'\n📊 Month-by-Month Error Analysis:')
    for _, row in predictions_df[mask].iterrows():
        error = row['actual'] - row['predicted']
        error_pct = (error / row['actual']) * 100
        bar_len = int(abs(error_pct))
        bar_char = '▓' if error > 0 else '░'
        bar = bar_char * min(bar_len, 30)
        direction = 'Under' if error > 0 else 'Over '
        print(f'   {row["month"]:5}: {direction} by {abs(error_pct):>5.1f}% {bar}')
else:
    print('\n⚠️ No actual values available for comparison')
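
The sign convention in the metrics cell (bias = mean of actual minus predicted, so positive means under-prediction) is easy to check on toy numbers. The values below are invented purely for illustration and are not notebook results.

```python
import numpy as np

# Made-up actuals and predictions for three months
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([90.0, 220.0, 360.0])

errors = y_true - y_pred                        # [10, -20, 40]
bias = errors.mean()                            # positive: under-predicting on average
mae = np.abs(errors).mean()                     # size of misses, sign ignored
mape = np.mean(np.abs(errors) / y_true) * 100   # each month is off by 10%
gap = y_true.sum() - y_pred.sum()               # cumulative shortfall

print(f'bias={bias:.2f}  mae={mae:.2f}  mape={mape:.2f}%  gap={gap:.2f}')
```

Here one month is over-predicted, yet the bias stays positive because the two under-predicted months outweigh it. This is why bias and MAE are worth reporting side by side.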

In [None]:
# Visualization
print('\n' + '='*80)
print('VISUALIZATION')
print('='*80)

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Actual vs Predicted Line Chart
ax1 = axes[0, 0]
months = predictions_df['month'].values
x_pos = range(len(months))

# Leave future months as NaN so predictions are not plotted as actuals
actual_display = predictions_df['actual'].values / 1e6
predicted_display = predictions_df['predicted'].values / 1e6

ax1.plot(x_pos, actual_display, marker='o', label='Actual', color='#2E86AB', linewidth=3, markersize=12)
ax1.plot(x_pos, predicted_display, marker='s', label='Predicted', color='#E94F37', linewidth=3, linestyle='--', markersize=12)
ax1.fill_between(x_pos, actual_display, predicted_display, alpha=0.2, color='gray')
ax1.set_xticks(x_pos)
ax1.set_xticklabels(months, rotation=45, fontsize=11)
ax1.set_ylabel('Revenue (Millions $)', fontsize=12)
ax1.set_title('Actual vs Predicted Revenue (March-December 2025)', fontweight='bold', fontsize=14)
ax1.legend(fontsize=11)
ax1.grid(alpha=0.3)

# 2. Error by Month Bar Chart
ax2 = axes[0, 1]
mask = predictions_df['actual'].notna()
if mask.sum() > 0:
    error_months = predictions_df[mask]['month'].values
    errors = (predictions_df[mask]['actual'] - predictions_df[mask]['predicted']).values / 1e6
    colors = ['#27AE60' if e >= 0 else '#E74C3C' for e in errors]
    x_err = range(len(error_months))
    bars = ax2.bar(x_err, errors, color=colors, alpha=0.8, edgecolor='black', linewidth=1.5)
    ax2.axhline(y=0, color='black', linestyle='-', linewidth=1)
    ax2.set_xticks(x_err)
    ax2.set_xticklabels(error_months, rotation=45, fontsize=11)
    ax2.set_ylabel('Error (Millions $)', fontsize=12)
    ax2.set_title('Prediction Error by Month (Green=Under, Red=Over)', fontweight='bold', fontsize=14)
    ax2.grid(axis='y', alpha=0.3)

# 3. Committed Ratio Trend
ax3 = axes[1, 0]
committed_ratios = predictions_df['committed_ratio'].values
ax3.plot(x_pos, committed_ratios, marker='o', color='#9B59B6', linewidth=3, markersize=12)
ax3.fill_between(x_pos, committed_ratios, alpha=0.3, color='#9B59B6')
ax3.set_xticks(x_pos)
ax3.set_xticklabels(months, rotation=45, fontsize=11)
ax3.set_ylabel('Committed Ratio', fontsize=12)
ax3.set_title('Committed Ratio Trend (Increasing Jan→Dec)', fontweight='bold', fontsize=14)
ax3.grid(alpha=0.3)

# 4. Feature Importance (Top 15)
ax4 = axes[1, 1]
coefs = best_model.coef_
feat_imp = pd.DataFrame({'Feature': top_15_features, 'Coefficient': coefs})
feat_imp['Abs_Coef'] = feat_imp['Coefficient'].abs()
feat_imp = feat_imp.sort_values('Abs_Coef', ascending=True)
colors = ['#27AE60' if c > 0 else '#E74C3C' for c in feat_imp['Coefficient']]
ax4.barh(feat_imp['Feature'], feat_imp['Abs_Coef'], color=colors, edgecolor='black', linewidth=1)
ax4.set_xlabel('|Coefficient| (Green=Positive, Red=Negative)', fontsize=11)
ax4.set_title(f'Top 15 Feature Importance ({best_model_name})', fontweight='bold', fontsize=14)
ax4.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.savefig('revenue_forecast_top15_features.png', dpi=150, bbox_inches='tight')
plt.show()

print('\n✅ Visualization saved!')
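
Because the model is fit without scaling, the raw `|coef_|` values in the fourth panel depend on each feature's units: a ratio-scale feature can carry a very large coefficient while moving the prediction less than a dollar-scale one. The sketch below shows one rough way to compare them, by multiplying each coefficient by its feature's standard deviation. It uses synthetic data and a plain least-squares fit from NumPy as a stand-in for the notebook's model, so every number and name here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Two synthetic features on very different scales (illustrative only)
X = np.column_stack([
    rng.normal(1_000_000, 200_000, size=n),  # dollar-scale feature
    rng.uniform(0.3, 0.9, size=n),           # ratio-scale feature
])
y = 0.5 * X[:, 0] + 200_000 * X[:, 1] + rng.normal(0, 10_000, size=n)

# Ordinary least squares with an intercept column
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

raw = np.abs(coef[:2])           # what a |coefficient| bar chart would show
scaled = raw * X.std(axis=0)     # effect of a one-standard-deviation move

print('raw   :', raw)
print('scaled:', scaled)
```

In this toy setup the ratio feature dominates the raw ranking while the dollar feature dominates after scaling, so the two views can disagree about which feature matters more.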

In [None]:
# Final Summary
print('\n' + '='*100)
print('FINAL SUMMARY - TOP 15 FEATURES WITH REASONING')
print('='*100)

print(f'''
✅ MODEL: {best_model_name}
✅ TRAINING DATA: 2023-2024
✅ TEST DATA: March-December 2025
✅ TOTAL FEATURES CREATED: {len(all_features)}
✅ TOP 15 FEATURES SELECTED

{'='*100}
THE 15 SELECTED FEATURES WITH BUSINESS REASONING
{'='*100}
''')

for i, feat in enumerate(top_15_features, 1):
    reason = feature_reasoning.get(feat, 'Selected based on predictive power')
    print(f'{i:2}. {feat:30} │ {reason}')

print(f'''

{'='*100}
HOW WE ADDRESS UNDER-PREDICTION
{'='*100}

┌────────────────────────────────────────────────────────────────────────────────┐
│ PROBLEM: Traditional lag-based models under-predict because they only look     │
│          at past actuals, missing potential from pipeline/upside.              │
├────────────────────────────────────────────────────────────────────────────────┤
│ OUR SOLUTIONS:                                                                 │
│                                                                                │
│ 1. COMBINED UPSIDE-CAPTURE FEATURES:                                           │
│    • expected_revenue: Probability-weighted pipeline contribution              │
│    • blended_forecast: 60% recent actual + 40% expected (adds upside)          │
│    • final_expected: Weighted composite of all signals                         │
│                                                                                │
│ 2. MOMENTUM FEATURES (detect growth trends):                                   │
│    • revenue_velocity: Is revenue going up month-over-month?                   │
│    • trend_strength: How much above/below long-term average?                   │
│    • yoy_momentum: Are we beating last year?                                   │
│                                                                                │
│ 3. YoY GROWTH PROJECTION:                                                      │
│    • yoy_adjusted_rev: Projects based on historical growth patterns            │
│    • ly_same_month_revenue: Stable seasonality anchor                          │
│                                                                                │
│ 4. CORRECT SIMULATION LOGIC (From Transcript):                                 │
│    • Prediction → becomes lag_1 for next month                                 │
│    • 3-month average of predictions (not just lag_1)                           │
│    • Committed ratio increases monthly (simulated with random %)               │
│    • Only use REMAINING sums, not individual forecasts (avoids bias)           │
└────────────────────────────────────────────────────────────────────────────────┘

{'='*100}
KEY METHODOLOGY POINTS
{'='*100}

1. FEATURE SELECTION BEFORE SIMULATION
   • Selected top 15 on 2023-2024 data (avoid leakage)
   • Used correlation + mutual information scoring
   • Applied business reasoning for interpretability

2. RECURSIVE SIMULATION (Correct Approach)
   • March prediction → lag_1 for April
   • Average of last 3 predictions as feature (per transcript)
   • All derived features recomputed from predictions

3. TREND-BASED RATIO IMPUTATION
   • Committed ratio increases Jan→Dec (historical pattern)
   • Simulate with random % increase each month
   • Avoids static forward-fill that doesn't reflect reality

4. NO SCALING
   • Removed StandardScaler as discussed
   • Coefficients stay in each feature's original units, which makes them easier to interpret

5. STABLE ANCHOR FEATURES
   • ly_same_month_revenue: Always available (last year)
   • is_end_of_quarter: Known from calendar
   • No imputation needed - reduces simulation uncertainty
''')

# Performance summary
mask = predictions_df['actual'].notna()
if mask.sum() > 0:
    y_true = predictions_df.loc[mask, 'actual'].values
    y_pred = predictions_df.loc[mask, 'predicted'].values
    mape = mean_absolute_percentage_error(y_true, y_pred) * 100
    bias = np.mean(y_true - y_pred)
    
    print(f'''{'='*100}
PERFORMANCE SUMMARY
{'='*100}

   MAPE: {mape:.2f}%
   Bias: ${bias:>,.0f} ({"Under-predicting" if bias > 0 else "Over-predicting"})
   
   Compared to a pure lag model, the combined features should reduce under-prediction
   by incorporating pipeline upside and growth momentum.
''')
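
The upside-capture features named in the summary can be traced by hand. The sketch below applies the same formulas as `recompute_features_from_predictions` to a single month. All input numbers are invented for illustration and are not taken from the dataset.

```python
# Made-up state for one simulated month
fcst_total_rem = 1_000_000.0   # forecast still open for the rest of the year
committed_ratio = 0.60         # share of that forecast already signed
remaining_months = 5
avg_prob = 0.3                 # default conversion probability from the notebook
revenue_lag_1 = 150_000.0      # last prediction
revenue_3mo_avg = 140_000.0    # average of the last three predictions
yoy_adjusted_rev = 160_000.0   # last year's same month, grown by the YoY rate

fcst_signed_rem = fcst_total_rem * committed_ratio       # 600,000
signed_per_month = fcst_signed_rem / remaining_months    # 120,000
unsigned_rem = fcst_total_rem - fcst_signed_rem          # 400,000

# Signed run-rate plus a probability-weighted slice of the unsigned pipeline
expected_revenue = signed_per_month + unsigned_rem * 0.5 * avg_prob / remaining_months

blended_forecast = 0.6 * revenue_lag_1 + 0.4 * expected_revenue
final_expected = (0.4 * revenue_lag_1 + 0.3 * revenue_3mo_avg
                  + 0.2 * expected_revenue + 0.1 * yoy_adjusted_rev)

print(expected_revenue, blended_forecast, final_expected)
```

With these inputs the unsigned pipeline adds 12,000 on top of the 120,000 signed run-rate. That is the sense in which `expected_revenue` captures upside.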