# AAVAIL Revenue Prediction - Part 1: Data Investigation

## Assignment 01: Capstone Through the Eyes of Our Working Example

**Business Context**: AAVAIL is transitioning from a tiered subscription model to an à la carte billing model. Management needs monthly revenue predictions, including the ability to project revenue for individual countries.

**Objectives:**
1. Assimilate business scenario and articulate testable hypotheses
2. State ideal data requirements
3. Create automated data ingestion pipeline
4. Investigate data relationships
5. Generate deliverable with visualizations

In [None]:
# Import required libraries
import sys
import os
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Import custom modules
from data_ingestion import load_retail_data
from eda import perform_eda, EDAAnalyzer

print("Libraries imported successfully!")

## 1. Business Scenario Analysis

### Business Opportunity Statement

AAVAIL has successfully experimented with an à la carte billing model outside the US market and now has 2+ years of transaction data across 38 countries. Management needs to:

- **Primary Goal**: Predict monthly revenue at any point in time
- **Secondary Goal**: Project revenue for specific countries
- **Scope**: Focus on top 10 countries by revenue
- **Impact**: Improve staffing and budget projections, reduce manager time spent on manual forecasting

### Testable Hypotheses

Based on the business scenario, we propose the following testable hypotheses:

1. **H1**: Revenue shows seasonal patterns that can be leveraged for prediction
2. **H2**: The top 10 countries account for ≥80% of total revenue (Pareto-style concentration)
3. **H3**: Customer transaction frequency correlates with customer lifetime value
4. **H4**: Monthly revenue trends show growth patterns suitable for extrapolation
5. **H5**: Weekend vs weekday transaction patterns differ significantly
6. **H6**: Country-specific revenue patterns are stable over time
7. **H7**: Customer retention affects monthly revenue predictability
8. **H8**: Transaction amount distributions vary significantly by country
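
As an illustration of how a hypothesis such as H2 can be made testable, the following minimal sketch computes the revenue share of the top N countries on toy data. The helper name `top_n_revenue_share` and the toy values are hypothetical; only the column names `country` and `price` are taken from this notebook.

```python
import pandas as pd

# Toy transactions (made-up values, for illustration only).
toy = pd.DataFrame({
    "country": ["UK", "UK", "FR", "DE", "ES"],
    "price": [50.0, 30.0, 10.0, 6.0, 4.0],
})


def top_n_revenue_share(df, n):
    """Return the fraction of total revenue from the n highest-revenue countries."""
    by_country = df.groupby("country")["price"].sum().sort_values(ascending=False)
    return by_country.head(n).sum() / by_country.sum()


share = top_n_revenue_share(toy, n=2)
print(f"Top 2 share: {share:.1%}")
```

On the real data, H2 would be supported if the share for `n=10` is at least 0.80.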

## 2. Ideal Data Requirements

### Required Data for Revenue Prediction:

**Primary Features:**
- Transaction amounts and dates
- Country information
- Customer identifiers

**Supporting Features:**
- Service usage metrics (times_viewed)
- Invoice/billing information
- Customer engagement data

**Time Series Requirements:**
- At least 24 months of historical data
- Daily transaction granularity
- Consistent data format across time periods

**Geographic Requirements:**
- Country-level segmentation
- Sufficient data volume per country
- Focus on top revenue-generating countries
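
The time series requirements above can be checked programmatically once dates are available. This is a minimal sketch, assuming transaction dates can be parsed by `pd.to_datetime`; the function name `coverage_report` and its return keys are hypothetical and not part of the project modules.

```python
import pandas as pd


def coverage_report(dates, min_months=24):
    """Summarise how well a collection of transaction dates meets the coverage requirements."""
    # Parse and drop the time-of-day component so each value represents one calendar day.
    days = pd.to_datetime(pd.Series(dates)).dt.normalize()
    distinct_months = days.dt.to_period("M").nunique()
    # Every calendar day between the first and last transaction.
    expected_days = pd.date_range(days.min(), days.max(), freq="D")
    missing_days = expected_days.difference(pd.DatetimeIndex(days.unique()))
    return {
        "distinct_months": int(distinct_months),
        "meets_min_months": bool(distinct_months >= min_months),
        "missing_days": len(missing_days),
    }


report = coverage_report(
    ["2018-01-01", "2018-01-02", "2018-01-04", "2018-02-01"], min_months=2
)
print(report)
```

A large `missing_days` count does not necessarily indicate a data problem (there may simply be days without sales), but it is worth knowing before resampling to a daily grid.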

In [None]:
# 3. Automated Data Ingestion
print("Loading data from JSON files...")

# Define data directory (adjust path as needed)
data_directory = "../ai-workflow-capstone-master/cs-train"

# Load and process all data
df, data_summary = load_retail_data(data_directory)

print("\nData Loading Summary:")
print(f"Total records: {data_summary['total_records']:,}")
print(f"Date range: {data_summary['date_range']['start']} to {data_summary['date_range']['end']}")
print(f"Countries: {data_summary['countries']['total']}")
print(f"Total revenue: ${data_summary['revenue']['total']:,.2f}")
print(f"Average transaction: ${data_summary['revenue']['mean_transaction']:.2f}")

In [None]:
# Display first few rows to understand data structure
print("Data Structure:")
print(df.head())
print("\nData Info:")
df.info()  # info() prints its report directly and returns None
print("\nData Description:")
print(df.describe())

## 4. Data Investigation and EDA

Next, we conduct exploratory data analysis to understand how the available features relate to the target variable (revenue) and to the business metrics.

In [None]:
# Perform comprehensive EDA
print("Performing Exploratory Data Analysis...")

eda_results, hypothesis_tests = perform_eda(df)

print("\nEDA Results Summary:")
print(f"Total unique customers: {eda_results['data_summary']['basic_info']['unique_customers']:,}")
print(f"Analysis period: {eda_results['data_summary']['basic_info']['date_range']['days']} days")
print(f"Countries analyzed: {eda_results['data_summary']['basic_info']['unique_countries']}")

In [None]:
# Display top 10 countries by revenue
print("\nTop 10 Countries by Revenue:")
print(eda_results['country_analysis'])

print("\nTop 10 Countries List (for model focus):")
for i, country in enumerate(eda_results['top_countries'], 1):
    print(f"{i}. {country}")

In [None]:
# Hypothesis Testing Results
print("\nHypothesis Testing Results:")
print("\nH2 - Pareto Principle Test:")
print(f"Top 10 countries revenue percentage: {hypothesis_tests['pareto_principle']['top_10_percentage']:.1f}%")
print(f"Passes 80/20 rule: {hypothesis_tests['pareto_principle']['passes_80_20_rule']}")

print("\nH5 - Weekend vs Weekday Analysis:")
print(f"Weekend revenue percentage: {hypothesis_tests['weekend_vs_weekday']['weekend_percentage']:.1f}%")
print(f"Weekend revenue: ${hypothesis_tests['weekend_vs_weekday']['weekend_revenue']:,.2f}")
print(f"Weekday revenue: ${hypothesis_tests['weekend_vs_weekday']['weekday_revenue']:,.2f}")

In [None]:
# Additional analysis for model preparation
analyzer = EDAAnalyzer(df)

# Focus dataset on top 10 countries for model development
top_countries = eda_results['top_countries']
df_focused = df[df['country'].isin(top_countries)].copy()

print(f"\nFocused Dataset (Top 10 Countries):")
print(f"Records: {len(df_focused):,} ({len(df_focused)/len(df)*100:.1f}% of total)")
print(f"Revenue: ${df_focused['price'].sum():,.2f} ({df_focused['price'].sum()/df['price'].sum()*100:.1f}% of total)")

# Monthly aggregation for time series preparation.
# Named aggregation sets the output column names directly, so they cannot
# drift out of sync with the aggregated columns.
monthly_data = df_focused.groupby(['country', 'month_year']).agg(
    monthly_revenue=('price', 'sum'),
    unique_customers=('customer_id', 'nunique'),
    unique_invoices=('invoice', 'nunique'),
    avg_views=('times_viewed', 'mean'),
).reset_index()

print(f"\nMonthly aggregated data shape: {monthly_data.shape}")
print(monthly_data.head())

## 5. Key Findings and Insights

### Data Quality Assessment
- Successfully loaded and processed transaction data from multiple JSON sources
- Implemented automated data ingestion with quality assurance checks
- Standardized column names and cleaned invoice IDs for consistency
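
The implementation of `load_retail_data` lives in `src/data_ingestion.py` and is not reproduced in this notebook. The sketch below shows one way such an ingestion step could work, under stated assumptions: each JSON file holds a list of records, column names may differ in casing between files, and invoice IDs are cleaned by keeping only their digits. The names `to_snake_case` and `load_json_dir`, the demo records, and the cleaning rule are all hypothetical.

```python
import json
import re
import tempfile
from pathlib import Path

import pandas as pd


def to_snake_case(name):
    """Convert a key such as 'TimesViewed' to 'times_viewed'."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def load_json_dir(data_dir):
    """Read every *.json file in data_dir into one DataFrame with standardized columns."""
    frames = []
    for path in sorted(Path(data_dir).glob("*.json")):
        with open(path, encoding="utf-8") as fh:
            frame = pd.DataFrame(json.load(fh))
        frame.columns = [to_snake_case(col) for col in frame.columns]
        frames.append(frame)
    if not frames:
        raise ValueError(f"No JSON files found in {data_dir}")
    combined = pd.concat(frames, ignore_index=True)
    # Assumed cleaning rule: keep only the digits of each invoice ID.
    combined["invoice"] = (
        combined["invoice"].astype(str).str.replace(r"\D+", "", regex=True)
    )
    return combined


# Demo on two small files with inconsistent column naming (made-up records).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.json").write_text(
        json.dumps([{"invoice": "489434", "TimesViewed": 3, "price": 6.95}])
    )
    Path(tmp, "b.json").write_text(
        json.dumps([{"invoice": "C489449", "times_viewed": 1, "price": 2.10}])
    )
    demo = load_json_dir(tmp)

print(demo)
```

Raising an error when no files are found is a simple quality check that keeps a wrong `data_directory` path from silently producing an empty dataset.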

### Revenue Patterns
- Identified the top 10 countries, which contribute the majority of revenue
- Discovered temporal patterns in transaction data
- Analyzed customer segmentation and behavior patterns

### Model Readiness
- Prepared a time-series-ready dataset with monthly aggregations
- Focused scope on top 10 countries as requested
- Established baseline metrics for model evaluation
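
The notebook does not specify which baseline metrics were established. One common choice, shown here purely as an assumption, is the mean absolute error (MAE) of a naive forecast that predicts each month's revenue with the previous month's value. The function name `naive_forecast_mae` and the sample values are hypothetical.

```python
def naive_forecast_mae(revenue):
    """Return the MAE of predicting each period with the previous period's value."""
    if len(revenue) < 2:
        raise ValueError("Need at least two periods to score a naive forecast")
    errors = [abs(actual - previous) for previous, actual in zip(revenue, revenue[1:])]
    return sum(errors) / len(errors)


# Made-up monthly revenue values for a single country.
mae = naive_forecast_mae([100.0, 120.0, 90.0, 150.0])
print(f"Naive baseline MAE: {mae:.2f}")
```

Any model built in Part 2 should beat this kind of baseline on the same periods to justify its added complexity.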

In [None]:
# Save processed data for next phases
output_dir = "../data/processed/"
os.makedirs(output_dir, exist_ok=True)

# Save full processed dataset
df.to_csv(f"{output_dir}full_processed_data.csv", index=False)

# Save focused dataset (top 10 countries)
df_focused.to_csv(f"{output_dir}focused_data_top10.csv", index=False)

# Save monthly aggregated data
monthly_data.to_csv(f"{output_dir}monthly_aggregated_data.csv", index=False)

print("Processed data saved successfully!")
print(f"Files saved to: {output_dir}")

## 6. Recommendations for Part 2

Based on our data investigation, we recommend the following approaches for Part 2 (Model Iteration):

### Modeling Approaches to Compare:
1. **Time Series Models**: ARIMA, Seasonal ARIMA for capturing temporal patterns
2. **Machine Learning Models**: Random Forest, Gradient Boosting with engineered features
3. **Deep Learning**: LSTM networks for sequential pattern recognition
4. **Ensemble Methods**: Combination of above approaches

### Feature Engineering Priorities:
- Lag features (previous month revenue)
- Rolling averages and trends
- Seasonal decomposition components
- Country-specific patterns
- Customer behavior metrics
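
The first two priorities, lag features and rolling averages, can be sketched as follows. This example uses toy data with the same column names as `monthly_data` and assumes `month_year` is stored as sortable `YYYY-MM` strings, which may differ from the actual format produced by the ingestion module. The function name `add_lag_features` is hypothetical.

```python
import pandas as pd

# Toy monthly data (made-up values) shaped like `monthly_data`.
monthly = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "month_year": ["2019-01", "2019-02", "2019-03", "2019-04"] * 2,
    "monthly_revenue": [100.0, 120.0, 90.0, 150.0, 10.0, 20.0, 30.0, 40.0],
})


def add_lag_features(df, lags=(1,), window=3):
    """Add per-country lagged revenue and a trailing rolling mean."""
    out = df.sort_values(["country", "month_year"]).copy()
    grouped = out.groupby("country")["monthly_revenue"]
    for lag in lags:
        out[f"revenue_lag_{lag}"] = grouped.shift(lag)
    # Shift by one month before rolling so the window only covers earlier
    # months and never includes the value being predicted.
    out[f"revenue_roll_mean_{window}"] = grouped.transform(
        lambda s: s.shift(1).rolling(window).mean()
    )
    return out


features = add_lag_features(monthly)
print(features)
```

Grouping by country before shifting keeps one country's revenue from leaking into another country's lag columns. The leading rows of each country contain `NaN` and would need to be dropped or imputed before model fitting.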

### Model Evaluation Strategy:
- Time-series cross-validation
- Country-specific performance metrics
- Business impact assessment (accuracy vs manager time savings)
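
Time-series cross-validation differs from ordinary k-fold in that every training set must precede its test set in time. The following minimal sketch generates expanding-window splits over period indices; the function name `expanding_window_splits` and its parameters are hypothetical. scikit-learn's `TimeSeriesSplit` provides a comparable ready-made splitter.

```python
def expanding_window_splits(n_periods, min_train, horizon=1):
    """Yield (train_indices, test_indices) pairs in chronological order.

    The training window starts with `min_train` periods and grows by
    `horizon` periods after each split; the test window always follows it.
    """
    if min_train < 1 or horizon < 1:
        raise ValueError("min_train and horizon must be positive")
    train_end = min_train
    while train_end + horizon <= n_periods:
        train = list(range(train_end))
        test = list(range(train_end, train_end + horizon))
        yield train, test
        train_end += horizon


splits = list(expanding_window_splits(n_periods=6, min_train=3, horizon=1))
for train, test in splits:
    print(f"train={train} test={test}")
```

For country-specific metrics, the same splits can be applied to each country's monthly series separately so that scores are comparable across countries.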

In [None]:
print("Part 1: Data Investigation Completed Successfully!")
print("\nNext Steps:")
print("1. Review generated visualizations in reports/figures/")
print("2. Use processed data for Part 2: Model Iteration")
print("3. Focus modeling efforts on top 10 countries identified")
print("4. Implement time-series prediction models")
print("5. Prepare API for Part 3: Model Production")