# AAVAIL Revenue Prediction - Part 2: Model Iteration

## Assignment 02: Time-Series Forecasting Models

**Objective**: Compare different modeling approaches for predicting revenue over the next 30 days

**Models to Compare**:
1. ARIMA - Traditional time series
2. Exponential Smoothing - Holt-Winters method
3. Random Forest - ML with engineered features
4. Gradient Boosting - Advanced ensemble method
5. LSTM - Deep learning approach
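
The tree-based approaches (3 and 4) learn from engineered features rather than from the raw series. The features actually built in `model_approaches` are not shown in this notebook, so the following is only a minimal sketch of one common choice: lagged values and rolling means of daily revenue. The helper name `make_lag_features` and the column name `revenue` are hypothetical.

```python
import numpy as np
import pandas as pd


def make_lag_features(daily, lags=(1, 7, 28), windows=(7, 28)):
    """Build lag and rolling-mean features from a daily revenue Series (hypothetical helper)."""
    feats = pd.DataFrame({"revenue": daily})
    for lag in lags:
        feats[f"lag_{lag}"] = daily.shift(lag)
    for w in windows:
        # shift(1) so each rolling mean only uses days strictly before the target day
        feats[f"roll_mean_{w}"] = daily.shift(1).rolling(w).mean()
    feats["dayofweek"] = daily.index.dayofweek
    # Drop the warm-up rows that do not yet have a full history
    return feats.dropna()


# Synthetic example data, not AAVAIL data
idx = pd.date_range("2019-01-01", periods=60, freq="D")
daily = pd.Series(np.arange(60, dtype=float), index=idx)
features = make_lag_features(daily)
print(features.head())
```

Shifting before taking the rolling mean keeps the target day out of its own features, which would otherwise leak information into training.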

In [1]:
# Import libraries
import sys
import os
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Import custom modules
from data_ingestion import load_retail_data
from model_approaches import TimeSeriesModelApproaches, run_model_comparison

print("Libraries imported successfully!")

TensorFlow not available. LSTM models will be skipped.
Libraries imported successfully!


In [2]:
# Load processed data from Part 1
print("Loading processed data...")

# Load the focused dataset (top 10 countries)
df_focused = pd.read_csv('../data/processed/focused_data_top10.csv')
df_focused['date'] = pd.to_datetime(df_focused['date'])

print(f"Data loaded: {len(df_focused):,} records")
print(f"Date range: {df_focused['date'].min()} to {df_focused['date'].max()}")
print(f"Countries: {df_focused['country'].nunique()}")

Loading processed data...
Data loaded: 797,424 records
Date range: 2017-11-28 00:00:00 to 2019-07-31 00:00:00
Countries: 10
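
The forecasting models work on one value per day, while the file holds individual records. How `run_model_comparison` aggregates them is not shown here. A minimal sketch, assuming a hypothetical `price` column that holds the revenue of each record (only `date` and `country` are confirmed by the cell above), could look like this:

```python
import pandas as pd


def to_daily_revenue(df, country=None, value_col="price"):
    """Aggregate records to a daily revenue Series (value_col is an assumed column name)."""
    if country is not None:
        df = df[df["country"] == country]
    daily = df.groupby("date")[value_col].sum()
    # Reindex to a continuous daily calendar; days without records become 0 revenue
    return daily.asfreq("D", fill_value=0.0)


# Tiny synthetic example, not AAVAIL data
records = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-01", "2019-01-03"]),
    "country": ["United Kingdom", "France", "United Kingdom"],
    "price": [10.0, 5.0, 7.5],
})
daily_all = to_daily_revenue(records)
daily_uk = to_daily_revenue(records, country="United Kingdom")
print(daily_all)
```

Zero-filled days keep the index regular, but percentage errors such as MAPE are undefined on days where the actual revenue is zero.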


In [3]:
# Run model comparison for all countries combined
print("Running model comparison for all top 10 countries combined...")

results_all, comparison_all = run_model_comparison(df_focused, country=None)

print("\nModel Comparison Results:")
print(comparison_all)

Running model comparison for all top 10 countries combined...


INFO:model_approaches:Comparing all modeling approaches...
INFO:model_approaches:Training ARIMA model...
INFO:model_approaches:ARIMA - MAE: 3646.12, MAPE: 431539536298399170560.00%
INFO:model_approaches:Training Exponential Smoothing model...
INFO:model_approaches:Exponential Smoothing - MAE: 2950.15, MAPE: 197289858494560239616.00%
INFO:model_approaches:Training Random Forest model...
INFO:model_approaches:Random Forest - MAE: 29701.88, MAPE: 18.87%
INFO:model_approaches:Training Gradient Boosting model...
INFO:model_approaches:Gradient Boosting - MAE: 27620.84, MAPE: 16.80%



Model Comparison Results:
                   Model   Status                     Error           MAE  \
0                  ARIMA  Success                      None   3646.119383   
1  Exponential Smoothing  Success                      None   2950.152584   
2          Random Forest  Success                      None  29701.878711   
3      Gradient Boosting  Success                      None  27620.836198   
4                   LSTM   Failed  TensorFlow not available           NaN   

           MAPE   Forecast_30d  
0  4.315395e+18  153418.290627  
1  1.972899e+18  161380.566822  
2  1.886887e-01  203432.217070  
3  1.679624e-01  210383.922274  
4           NaN            NaN  
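
MAPE values around 1e18 for ARIMA and Exponential Smoothing, next to MAE values in the low thousands, are what MAPE typically produces when some actual values are zero or nearly zero and the denominator is guarded only by a tiny epsilon. The metric code in `model_approaches` is not shown, so this is an assumption about the cause. The sketch below reproduces the effect on synthetic numbers and shows two common alternatives: MAPE restricted to non-zero actuals, and WAPE (total absolute error divided by total actual). All function names are illustrative.

```python
import numpy as np


def naive_mape(actual, predicted, eps=1e-15):
    """MAPE with an epsilon guard; explodes when an actual value is 0."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted) / np.maximum(np.abs(actual), eps))


def masked_mape(actual, predicted):
    """MAPE computed only over observations with a non-zero actual value."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    mask = actual != 0
    return np.mean(np.abs(actual[mask] - predicted[mask]) / np.abs(actual[mask]))


def wape(actual, predicted):
    """Weighted absolute percentage error: sum of |errors| / sum of |actuals|."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.abs(actual - predicted).sum() / np.abs(actual).sum()


# Synthetic daily revenue with one zero-revenue day
actual = [1000.0, 0.0, 2000.0, 1000.0]
predicted = [900.0, 100.0, 2200.0, 1000.0]

print(f"naive MAPE:  {naive_mape(actual, predicted):.3e}")
print(f"masked MAPE: {masked_mape(actual, predicted):.2%}")
print(f"WAPE:        {wape(actual, predicted):.2%}")
```

A single zero-revenue day is enough to dominate the naive average, so a ranking based on MAPE is only meaningful when every model is scored with the same zero-safe metric.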


In [4]:
# Run model comparison for United Kingdom (top country)
print("Running model comparison for United Kingdom...")

results_uk, comparison_uk = run_model_comparison(df_focused, country='United Kingdom')

print("\nUK Model Comparison Results:")
print(comparison_uk)

Running model comparison for United Kingdom...


INFO:model_approaches:Comparing all modeling approaches...
INFO:model_approaches:Training ARIMA model...
INFO:model_approaches:ARIMA - MAE: 3292.43, MAPE: 394184826171644706816.00%
INFO:model_approaches:Training Exponential Smoothing model...
INFO:model_approaches:Exponential Smoothing - MAE: 2678.15, MAPE: 182521965313144586240.00%
INFO:model_approaches:Training Random Forest model...
INFO:model_approaches:Random Forest - MAE: 28796.77, MAPE: 20.30%
INFO:model_approaches:Training Gradient Boosting model...
INFO:model_approaches:Gradient Boosting - MAE: 29612.17, MAPE: 20.20%



UK Model Comparison Results:
                   Model   Status                     Error           MAE  \
0                  ARIMA  Success                      None   3292.432805   
1  Exponential Smoothing  Success                      None   2678.151864   
2          Random Forest  Success                      None  28796.769739   
3      Gradient Boosting  Success                      None  29612.169329   
4                   LSTM   Failed  TensorFlow not available           NaN   

           MAPE   Forecast_30d  
0  3.941848e+18  140096.105287  
1  1.825220e+18  149368.327368  
2  2.029795e-01  173196.658140  
3  2.019814e-01  162068.271426  
4           NaN            NaN  


In [5]:
# Select best model and prepare for deployment
modeler = TimeSeriesModelApproaches()
best_model_name, best_model_result = modeler.select_best_model(results_all)

print(f"\nSelected Best Model: {best_model_name}")
print(f"MAPE: {best_model_result['mape']:.2%}")
print(f"MAE: {best_model_result['mae']:.2f}")
print(f"30-day forecast: ${best_model_result['forecast_30d_sum']:,.2f}")

INFO:model_approaches:Best model: Gradient Boosting with MAPE: 16.80%



Selected Best Model: Gradient Boosting
MAPE: 16.80%
MAE: 27620.84
30-day forecast: $210,383.92
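
The log line above indicates that `select_best_model` ranks models by MAPE; its implementation is not part of this notebook. A minimal sketch of that rule, assuming `results` maps model names to dictionaries with the `mape`, `mae` and `forecast_30d_sum` keys used in the cell above, and assuming failed models are represented by `None` or a non-finite MAPE:

```python
import math


def pick_best_by_mape(results):
    """Return (name, result) with the lowest finite MAPE.

    Hypothetical stand-in for select_best_model, not the actual implementation.
    """
    candidates = {
        name: res for name, res in results.items()
        if res is not None and math.isfinite(res.get("mape", math.nan))
    }
    if not candidates:
        raise ValueError("no model produced a finite MAPE")
    best_name = min(candidates, key=lambda name: candidates[name]["mape"])
    return best_name, candidates[best_name]


# Values copied from the combined top-10 comparison table; LSTM failed, so it has no metrics
results = {
    "ARIMA": {"mape": 4.315395e18, "mae": 3646.12, "forecast_30d_sum": 153418.29},
    "Exponential Smoothing": {"mape": 1.972899e18, "mae": 2950.15, "forecast_30d_sum": 161380.57},
    "Random Forest": {"mape": 0.1886887, "mae": 29701.88, "forecast_30d_sum": 203432.22},
    "Gradient Boosting": {"mape": 0.1679624, "mae": 27620.84, "forecast_30d_sum": 210383.92},
    "LSTM": None,
}
name, best = pick_best_by_mape(results)
print(f"{name}: MAPE {best['mape']:.2%}")
```

Filtering out missing and non-finite scores first keeps a failed model from being selected by accident.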


In [6]:
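# Save results and best model
import os
import pickle

# Create the output directories first. In the recorded run, open() raised
# FileNotFoundError: [Errno 2] No such file or directory: '../models/best_model_assignment02.pkl'
# because the '../models' directory did not exist yet.
os.makedirs('../reports', exist_ok=True)
os.makedirs('../models', exist_ok=True)

# Save model comparison results
comparison_all.to_csv('../reports/model_comparison_results.csv', index=False)

# Save the best model's name and metrics together with the comparison table
model_data = {
    'best_model_name': best_model_name,
    'best_model_result': best_model_result,
    'model_comparison': comparison_all.to_dict()
}

with open('../models/best_model_assignment02.pkl', 'wb') as f:
    pickle.dump(model_data, f)

print("Model results saved successfully!")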
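
`open(..., 'wb')` creates the file but not its parent directories, so the target directory has to exist before pickling. A minimal sketch of a save and load round trip that creates the directory first; it writes to a temporary directory, and the dictionary holds placeholder values rather than the real results:

```python
import os
import pickle
import tempfile

# Placeholder payload with the same top-level keys as model_data (values are illustrative)
payload = {
    "best_model_name": "Gradient Boosting",
    "best_model_result": {"mape": 0.168, "mae": 27620.84},
}

with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "models")
    os.makedirs(model_dir, exist_ok=True)  # no error if the directory already exists
    model_path = os.path.join(model_dir, "best_model_assignment02.pkl")

    with open(model_path, "wb") as f:
        pickle.dump(payload, f)

    # Read the file back, as a later notebook would
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored["best_model_name"])
```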

## Summary

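Assignment 02 results:
- Set up 5 modeling approaches and evaluated 4 of them; LSTM was skipped because TensorFlow was not available
- Selected Gradient Boosting as the best model based on MAPE (16.80% on the combined top-10 data, against 18.87% for Random Forest)
- ARIMA and Exponential Smoothing reported MAPE values on the order of 1e18, so their MAPE cannot be used for ranking until the metric calculation is checked
- Saved the comparison report to `../reports/model_comparison_results.csv`
- Saving the best model for Assignment 03 failed in the recorded run with `FileNotFoundError`; the `../models` directory must exist before the pickle file is written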