# 🚀 AI MarketPulse – AI for Market Trend Analysis

**Module E: AI Applications – Individual Open Project**

---

This notebook presents a complete end-to-end AI system for analyzing retail/store sales data to:
- 📈 Identify market trends (Rising, Falling, Stable)
- 🔮 Forecast future sales using Machine Learning
- 💡 Generate actionable business insights from pricing and discount patterns
- ⚠️ Handle edge cases like cold-start products/stores

**Author**: AI MarketPulse Team  
**Date**: January 2026  
**Version**: 1.0

---
# 1. Problem Definition & Objective

## 1.1 Business Problem

Understanding market trends is central to retail decision-making. Retailers face several challenges:

- **Inventory Management**: Overstocking leads to waste; understocking leads to lost sales
- **Pricing Strategy**: Determining optimal prices and discount strategies
- **Trend Detection**: Identifying which products are gaining or losing popularity
- **Demand Forecasting**: Predicting future sales to plan operations

## 1.2 Objectives

This AI system aims to address these challenges through:

1. **Trend Analysis**: Analyze historical sales data to identify rising, falling, and stable market trends
2. **Sales Forecasting**: Build a machine learning model to predict future sales
3. **Business Insights**: Generate actionable recommendations from pricing and discount patterns
4. **Edge Case Handling**: Implement robust fallback strategies for new/cold-start products

## 1.3 Approach

We will use a combination of:
- Statistical analysis for trend detection
- Random Forest Regression for forecasting
- Rule-based systems for insight generation

In [None]:
# =============================================================================
# IMPORTS AND CONFIGURATION
# =============================================================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
import requests
from io import StringIO
import warnings
from datetime import datetime, timedelta

# Configuration
warnings.filterwarnings('ignore')
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10
np.random.seed(42)

print("=" * 60)
print("AI MarketPulse - Market Trend Analysis System")
print("=" * 60)
print(f"\nLibraries loaded successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

In [None]:
# =============================================================================
# HELPER FUNCTIONS FOR DATA LOADING
# =============================================================================

def generate_synthetic_data(n_records=5000):
    """
    Generate synthetic retail sales data when real dataset is unavailable.
    """
    print("Generating synthetic retail sales dataset...")
    
    stores = [f'S{i:03d}' for i in range(1, 11)]  # 10 stores
    products = [f'P{i:04d}' for i in range(1, 51)]  # 50 products
    categories = ['Electronics', 'Clothing', 'Groceries', 'Home & Garden', 'Sports']
    
    start_date = datetime(2024, 1, 1)
    end_date = datetime(2025, 12, 31)
    date_range = pd.date_range(start=start_date, end=end_date, freq='D')
    
    data = []
    for _ in range(n_records):
        date_idx = np.random.randint(0, len(date_range))
        date = date_range[date_idx]
        store = np.random.choice(stores)
        product = np.random.choice(products)
        category = np.random.choice(categories)
        
        base_prices = {'Electronics': 150, 'Clothing': 50, 'Groceries': 20, 
                       'Home & Garden': 80, 'Sports': 70}
        price = base_prices[category] * np.random.uniform(0.5, 2.0)
        
        has_discount = np.random.random() < 0.3
        discount = np.random.uniform(5, 30) if has_discount else 0
        
        month = date.month
        seasonal_factor = 1.0 + 0.3 * np.sin(2 * np.pi * month / 12)
        discount_factor = 1.0 + (discount / 100) * 0.5
        price_factor = 100 / (price + 50)
        
        base_sales = np.random.poisson(20)
        sales = int(base_sales * seasonal_factor * discount_factor * price_factor)
        sales = max(1, sales)
        
        data.append({
            'date': date,
            'store_id': store,
            'product_id': product,
            'category': category,
            'price': round(price, 2),
            'discount': round(discount, 2),
            'sales': sales
        })
    
    df = pd.DataFrame(data)
    print(f"✓ Generated {len(df)} synthetic records")
    return df


def load_data():
    """
    Return the retail sales dataset used throughout the notebook.

    This always generates synthetic data; no download is attempted.
    """
    print("\n" + "=" * 60)
    print("DATA LOADING")
    print("=" * 60)
    
    print("\n⚠ Using synthetic data for reliable demo.")
    return generate_synthetic_data()

print("✓ Helper functions defined")

---
# 2. Data Understanding & Preparation

## 2.1 Data Loading

We will generate synthetic retail sales data with realistic patterns including:
- Seasonal variations
- Price-demand relationships
- Discount effects on sales
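
As a rough illustration of how these three effects combine, the sketch below recomputes the multipliers used in `generate_synthetic_data()` without the Poisson noise. The helper name `expected_demand` is hypothetical, and unlike the generator it returns an untruncated float rather than an integer floored at 1.

```python
import math

def expected_demand(month, price, discount, base_sales=20.0):
    """Noise-free demand implied by the generator's three multipliers."""
    # Seasonal swing of +/-30%, highest in March and lowest in September
    seasonal_factor = 1.0 + 0.3 * math.sin(2 * math.pi * month / 12)
    # Each discount percentage point adds 0.5% demand
    discount_factor = 1.0 + (discount / 100) * 0.5
    # Demand shrinks as price grows; equals 1.0 at a price of 50
    price_factor = 100 / (price + 50)
    return base_sales * seasonal_factor * discount_factor * price_factor

peak = expected_demand(month=3, price=50, discount=0)         # seasonal high
trough = expected_demand(month=9, price=50, discount=0)       # seasonal low
discounted = expected_demand(month=3, price=50, discount=20)  # 20% off at the peak
print(f"peak={peak:.1f}, trough={trough:.1f}, discounted={discounted:.1f}")
```

At a price of 50 the price factor is 1, so the differences between the three values come only from seasonality and the discount.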

In [None]:
# =============================================================================
# LOAD DATA
# =============================================================================

df_raw = load_data()

print("\n" + "=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)
print(f"\nShape: {df_raw.shape[0]} rows × {df_raw.shape[1]} columns")
print(f"\nColumn Names: {list(df_raw.columns)}")
print(f"\nData Types:")
print(df_raw.dtypes)
print(f"\nFirst 5 rows:")
df_raw.head()

In [None]:
# =============================================================================
# EXPLORATORY DATA ANALYSIS
# =============================================================================

print("=" * 60)
print("EXPLORATORY DATA ANALYSIS")
print("=" * 60)

print("\n📊 Statistical Summary:")
print(df_raw.describe())

print("\n❓ Missing Values:")
missing = df_raw.isnull().sum()
print(missing[missing > 0] if missing.sum() > 0 else "No missing values found!")

print(f"\n📋 Duplicate Rows: {df_raw.duplicated().sum()}")

## 2.2 Data Preprocessing

The preprocessing pipeline includes:
1. **Date Parsing**: Convert date strings to datetime objects
2. **Duplicate Removal**: Remove exact duplicate rows
3. **Missing Value Handling**: Fill missing numeric values with the column median; drop rows whose date cannot be parsed
4. **Outlier Capping**: Use IQR method to cap extreme values in sales
5. **Feature Engineering**: Create time-based, lag, and rolling-window features
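
The outlier-capping step (step 4) can be sketched in isolation with the standard library. The function name `cap_outliers_iqr` and the sample values are illustrative assumptions. `statistics.quantiles` with `method="inclusive"` is used because it interpolates linearly between order statistics, which should match the default behaviour of pandas' `quantile`.

```python
import statistics

def cap_outliers_iqr(values, k=1.5):
    """Clip values to [max(0, Q1 - k*IQR), Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = max(0, q1 - k * iqr)  # sales cannot be negative
    upper = q3 + k * iqr
    capped = [min(max(v, lower), upper) for v in values]
    return capped, (lower, upper)

# Illustrative sales figures with one extreme value
sales = [4, 5, 6, 7, 8, 9, 10, 11, 60]
capped, bounds = cap_outliers_iqr(sales)
print("bounds:", bounds)
print("capped:", capped)
```

Capping keeps the extreme record in the dataset but pulls its value back to the upper bound, so no rows are lost.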

In [None]:
# =============================================================================
# DATA PREPROCESSING FUNCTION
# =============================================================================

def preprocess_data(df):
    """
    Comprehensive data preprocessing pipeline.
    """
    print("\n" + "=" * 60)
    print("DATA PREPROCESSING")
    print("=" * 60)
    
    df = df.copy()
    initial_rows = len(df)
    
    # Step 1: Parse date and sort
    print("\n📅 Step 1: Parsing dates and sorting chronologically...")
    df['date'] = pd.to_datetime(df['date'], errors='coerce')
    df = df.sort_values('date').reset_index(drop=True)
    print(f"   Date range: {df['date'].min()} to {df['date'].max()}")
    
    # Step 2: Remove duplicates
    print("\n🔄 Step 2: Removing duplicates...")
    df = df.drop_duplicates()
    print(f"   Removed {initial_rows - len(df)} duplicate rows")
    
    # Step 3: Handle missing values
    print("\n❓ Step 3: Handling missing values...")
    numeric_cols = ['price', 'discount', 'sales']
    for col in numeric_cols:
        if df[col].isnull().sum() > 0:
            median_val = df[col].median()
            df[col] = df[col].fillna(median_val)
    df = df.dropna(subset=['date'])
    print(f"   Remaining rows: {len(df)}")
    
    # Step 4: Cap outliers using IQR method
    print("\n📊 Step 4: Capping outliers using IQR method...")
    Q1 = df['sales'].quantile(0.25)
    Q3 = df['sales'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers_before = ((df['sales'] < lower_bound) | (df['sales'] > upper_bound)).sum()
    df['sales'] = df['sales'].clip(lower=max(0, lower_bound), upper=upper_bound)
    print(f"   IQR bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")
    print(f"   Capped {outliers_before} outlier values")
    
    # Step 5: Add time features
    print("\n🕐 Step 5: Adding time features...")
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['week'] = df['date'].dt.isocalendar().week.astype(int)
    df['dayofweek'] = df['date'].dt.dayofweek
    print("   Added: year, month, week, dayofweek")
    
    # Step 6: Add lag features (lags count earlier records of the same
    # store-product pair, not fixed calendar intervals)
    print("\n⏮️ Step 6: Adding lag features...")
    df = df.sort_values(['store_id', 'product_id', 'date'])
    for lag in [1, 2, 4]:
        df[f'sales_lag_{lag}'] = df.groupby(['store_id', 'product_id'])['sales'].shift(lag)
    print("   Added: sales_lag_1, sales_lag_2, sales_lag_4")
    
    # Step 7: Add rolling features, shifted by one record so the current
    # row's sales (the forecasting target) never leaks into its own features
    print("\n📈 Step 7: Adding rolling features...")
    df['sales_roll_mean_4'] = df.groupby(['store_id', 'product_id'])['sales'].transform(
        lambda x: x.shift(1).rolling(window=4, min_periods=1).mean())
    df['sales_roll_std_4'] = df.groupby(['store_id', 'product_id'])['sales'].transform(
        lambda x: x.shift(1).rolling(window=4, min_periods=1).std())
    roll_cols = ['sales_roll_mean_4', 'sales_roll_std_4']
    df[roll_cols] = df[roll_cols].fillna(0)
    print("   Added: sales_roll_mean_4, sales_roll_std_4")
    
    # Fill remaining NaN in lag features
    lag_cols = ['sales_lag_1', 'sales_lag_2', 'sales_lag_4']
    df[lag_cols] = df[lag_cols].fillna(0)
    
    print("\n✅ Preprocessing complete!")
    print(f"   Final shape: {df.shape}")
    
    return df

print("✓ preprocess_data() function defined")

In [None]:
# =============================================================================
# APPLY PREPROCESSING
# =============================================================================

df = preprocess_data(df_raw)

print("\n" + "=" * 60)
print("PREPROCESSED DATA SAMPLE")
print("=" * 60)
print(f"\nNew columns added: {[col for col in df.columns if col not in df_raw.columns]}")
print(f"\nSample of preprocessed data:")
df.head(10)

In [None]:
# =============================================================================
# DATA VISUALIZATION
# =============================================================================

print("=" * 60)
print("DATA VISUALIZATION")
print("=" * 60)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Sales distribution
ax1 = axes[0, 0]
ax1.hist(df['sales'], bins=30, color='steelblue', edgecolor='black', alpha=0.7)
ax1.set_title('Sales Distribution', fontsize=12, fontweight='bold')
ax1.set_xlabel('Sales')
ax1.set_ylabel('Frequency')
ax1.axvline(df['sales'].mean(), color='red', linestyle='--', label=f'Mean: {df["sales"].mean():.1f}')
ax1.legend()

# Plot 2: Sales by Category
ax2 = axes[0, 1]
category_sales = df.groupby('category')['sales'].mean().sort_values(ascending=True)
ax2.barh(category_sales.index, category_sales.values, color='teal', edgecolor='black')
ax2.set_title('Average Sales by Category', fontsize=12, fontweight='bold')
ax2.set_xlabel('Average Sales')

# Plot 3: Monthly Sales Trend
ax3 = axes[1, 0]
monthly_sales = df.groupby(df['date'].dt.to_period('M'))['sales'].sum()
ax3.plot(monthly_sales.index.astype(str), monthly_sales.values, marker='o', 
         color='darkgreen', linewidth=2, markersize=4)
ax3.set_title('Monthly Sales Trend', fontsize=12, fontweight='bold')
ax3.set_xlabel('Month')
ax3.set_ylabel('Total Sales')
ax3.tick_params(axis='x', rotation=45)

# Plot 4: Price vs Sales Scatter
ax4 = axes[1, 1]
sample = df.sample(min(500, len(df)), random_state=42)
ax4.scatter(sample['price'], sample['sales'], alpha=0.5, color='coral', edgecolor='black', s=30)
ax4.set_title('Price vs Sales', fontsize=12, fontweight='bold')
ax4.set_xlabel('Price')
ax4.set_ylabel('Sales')

plt.tight_layout()
plt.show()

print("\n✓ Data visualization complete")

---
# 3. Model / System Design

## 3.1 System Architecture

The AI MarketPulse system consists of three main modules:

```
┌─────────────────────────────────────────────────────────────────┐
│                      AI MARKETPULSE SYSTEM                      │
├─────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐         │
│  │   MODULE 1   │   │   MODULE 2   │   │   MODULE 3   │         │
│  │    TREND     │   │  FORECASTING │   │   INSIGHTS   │         │
│  │   ANALYSIS   │   │    MODEL     │   │  GENERATION  │         │
│  └──────────────┘   └──────────────┘   └──────────────┘         │
│                                                                 │
│  • Monthly Aggregation    • RandomForest     • Price Correlation│
│  • MoM Growth %           • Time-based Split • Discount Uplift  │
│  • Trend Classification   • MAE/RMSE/MAPE    • Business Insights│
└─────────────────────────────────────────────────────────────────┘
```

## 3.2 Cold-Start Handling Strategy

For products/stores with **fewer than 4 historical records**:
- ❌ DO NOT use the ML forecasting model (insufficient data)
- ✅ Use **baseline prediction**: the store-and-category average sales, falling back to the category average and then the overall average
- 📝 Log which groups used fallback method
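
The routing rule and the fallback chain can be sketched as two small functions. The names `choose_method` and `baseline_prediction`, and the averages in the lookup tables, are illustrative assumptions rather than part of the notebook's pipeline.

```python
def choose_method(record_count, min_records=4):
    """Route a store-product pair to the ML model or the baseline."""
    return "ml_model" if record_count >= min_records else "fallback_baseline"

def baseline_prediction(store_id, category, store_category_avg, category_avg, overall_avg):
    """Return the most specific average available for a cold-start pair."""
    if (store_id, category) in store_category_avg:
        return store_category_avg[(store_id, category)]
    return category_avg.get(category, overall_avg)

# Illustrative averages, not taken from the dataset
store_category_avg = {("S001", "Clothing"): 12.5}
category_avg = {"Clothing": 10.0, "Sports": 8.0}
overall_avg = 9.0

known_pair = baseline_prediction("S001", "Clothing", store_category_avg, category_avg, overall_avg)
known_category = baseline_prediction("S002", "Sports", store_category_avg, category_avg, overall_avg)
unknown = baseline_prediction("S002", "Toys", store_category_avg, category_avg, overall_avg)
print(choose_method(3), known_pair, known_category, unknown)
```

Each fallback level is coarser than the previous one, so a prediction is always available even for a store or category the lookup tables have never seen.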

---
# 4. Core Implementation

## 4.1 Trend Analysis Module
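
Before the full module, here is a minimal pure-Python sketch of its two core calculations: month-over-month growth and the ±5% classification rule. The function `mom_growth` and the sample series are illustrative assumptions; `classify_trend` mirrors the thresholds used in the module, with `None` standing in for a missing growth value.

```python
def mom_growth(sales_by_month):
    """Percent change between consecutive months; None where undefined."""
    growth = [None]  # the first month has no predecessor
    for prev, curr in zip(sales_by_month, sales_by_month[1:]):
        growth.append(None if prev == 0 else (curr - prev) / prev * 100)
    return growth

def classify_trend(growth, threshold=5.0):
    """Label a growth percentage as Rising, Falling, Stable, or Unknown."""
    if growth is None:
        return "Unknown"
    if growth > threshold:
        return "Rising"
    if growth < -threshold:
        return "Falling"
    return "Stable"

# Illustrative monthly totals for one store-product pair
monthly_sales = [100, 110, 112, 95]
growth = mom_growth(monthly_sales)
labels = [classify_trend(g) for g in growth]
print(labels)
```

Growth against a zero baseline is left undefined rather than treated as infinite, which is why such months end up labeled `Unknown`.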

In [None]:
# =============================================================================
# TREND ANALYSIS MODULE
# =============================================================================

def trend_analysis(df):
    """
    Perform trend analysis on sales data.
    """
    print("\n" + "=" * 60)
    print("TREND ANALYSIS MODULE")
    print("=" * 60)
    
    # Step 1: Aggregate monthly sales (work on a copy so the caller's frame is untouched)
    print("\n📊 Step 1: Aggregating monthly sales...")
    df = df.copy()
    df['year_month'] = df['date'].dt.to_period('M')
    
    monthly_agg = df.groupby(['store_id', 'product_id', 'category', 'year_month']).agg({
        'sales': 'sum', 'price': 'mean', 'discount': 'mean'
    }).reset_index()
    monthly_agg['year_month_str'] = monthly_agg['year_month'].astype(str)
    print(f"   Created {len(monthly_agg)} monthly aggregations")
    
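    # Step 2: Compute MoM growth (relative to the previous month that has records
    # for the pair, which may not be the adjacent calendar month)
    print("\n📈 Step 2: Computing Month-over-Month growth...")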
    monthly_agg = monthly_agg.sort_values(['store_id', 'product_id', 'year_month'])
    monthly_agg['prev_sales'] = monthly_agg.groupby(['store_id', 'product_id'])['sales'].shift(1)
    monthly_agg['mom_growth'] = ((monthly_agg['sales'] - monthly_agg['prev_sales']) / 
                                  monthly_agg['prev_sales'].replace(0, np.nan) * 100)
    
    # Step 3: Classify trends
    print("\n🏷️ Step 3: Classifying trends...")
    def classify_trend(growth):
        if pd.isna(growth): return 'Unknown'
        elif growth > 5: return 'Rising'
        elif growth < -5: return 'Falling'
        else: return 'Stable'
    
    monthly_agg['trend'] = monthly_agg['mom_growth'].apply(classify_trend)
    
    trend_summary = monthly_agg.groupby(['store_id', 'product_id', 'category']).agg({
        'mom_growth': 'mean', 'sales': 'sum'
    }).reset_index()
    trend_summary.columns = ['store_id', 'product_id', 'category', 'avg_growth', 'total_sales']
    trend_summary['trend'] = trend_summary['avg_growth'].apply(classify_trend)
    
    trend_counts = trend_summary['trend'].value_counts()
    print("\n   Trend Distribution:")
    for trend, count in trend_counts.items():
        print(f"   • {trend}: {count} product-store combinations")
    
    # Step 4: Top Rising and Falling
    print("\n🔝 Step 4: Identifying top trends...")
    valid_trends = trend_summary[trend_summary['avg_growth'].notna()]
    top_rising = valid_trends.nlargest(10, 'avg_growth')
    top_falling = valid_trends.nsmallest(10, 'avg_growth')
    
    print("\n" + "─" * 50)
    print("📈 TOP 10 RISING PRODUCTS")
    print("─" * 50)
    for _, row in top_rising.iterrows():
        print(f"   {row['store_id']}-{row['product_id']} ({row['category']}): {row['avg_growth']:+.1f}%")
    
    print("\n" + "─" * 50)
    print("📉 TOP 10 FALLING PRODUCTS")
    print("─" * 50)
    for _, row in top_falling.iterrows():
        print(f"   {row['store_id']}-{row['product_id']} ({row['category']}): {row['avg_growth']:.1f}%")
    
    # Step 5: Plot trend lines
    print("\n📊 Step 5: Plotting trend lines...")
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # Rising Trends
    ax1 = axes[0]
    colors_rising = plt.cm.Greens(np.linspace(0.4, 0.9, min(5, len(top_rising))))
    for idx, (_, row) in enumerate(top_rising.head(5).iterrows()):
        mask = ((monthly_agg['store_id'] == row['store_id']) & 
                (monthly_agg['product_id'] == row['product_id']))
        item_data = monthly_agg[mask].sort_values('year_month')
        if len(item_data) > 0:
            ax1.plot(item_data['year_month_str'], item_data['sales'], 
                    marker='o', linewidth=2, markersize=4, color=colors_rising[idx],
                    label=f"{row['store_id']}-{row['product_id']} ({row['avg_growth']:+.1f}%)")
    ax1.set_title('📈 Top 5 Rising Trends', fontsize=14, fontweight='bold')
    ax1.set_xlabel('Month')
    ax1.set_ylabel('Sales')
    ax1.tick_params(axis='x', rotation=45)
    ax1.legend(loc='upper left', fontsize=8)
    ax1.grid(True, alpha=0.3)
    
    # Falling Trends
    ax2 = axes[1]
    colors_falling = plt.cm.Reds(np.linspace(0.4, 0.9, min(5, len(top_falling))))
    for idx, (_, row) in enumerate(top_falling.head(5).iterrows()):
        mask = ((monthly_agg['store_id'] == row['store_id']) & 
                (monthly_agg['product_id'] == row['product_id']))
        item_data = monthly_agg[mask].sort_values('year_month')
        if len(item_data) > 0:
            ax2.plot(item_data['year_month_str'], item_data['sales'],
                    marker='o', linewidth=2, markersize=4, color=colors_falling[idx],
                    label=f"{row['store_id']}-{row['product_id']} ({row['avg_growth']:.1f}%)")
    ax2.set_title('📉 Top 5 Falling Trends', fontsize=14, fontweight='bold')
    ax2.set_xlabel('Month')
    ax2.set_ylabel('Sales')
    ax2.tick_params(axis='x', rotation=45)
    ax2.legend(loc='upper right', fontsize=8)
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return monthly_agg, trend_summary, top_rising, top_falling

print("✓ trend_analysis() function defined")

In [None]:
# =============================================================================
# RUN TREND ANALYSIS
# =============================================================================

monthly_agg, trend_summary, top_rising, top_falling = trend_analysis(df)

print("\n✅ Trend analysis complete!")

## 4.2 Forecasting Model
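
The model is evaluated on a chronological hold-out rather than a random one, so the test set always lies in the "future" relative to the training set. A minimal sketch of that split, using a hypothetical helper `time_based_split` and made-up records:

```python
from datetime import date

def time_based_split(rows, test_size=0.2):
    """Sort by date, then cut once: earliest rows train, latest rows test."""
    ordered = sorted(rows, key=lambda r: r["date"])
    split_idx = int(len(ordered) * (1 - test_size))
    return ordered[:split_idx], ordered[split_idx:]

# Ten illustrative records in shuffled date order
rows = [{"date": date(2025, 1, d), "sales": d} for d in (5, 1, 9, 3, 7, 2, 8, 4, 10, 6)]
train, test = time_based_split(rows)
print(len(train), "train rows up to", train[-1]["date"])
print(len(test), "test rows from", test[0]["date"])
```

Shuffling before the split would let the model see records dated after some of its test rows, which tends to make the reported error look better than it would be in real forecasting.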

In [None]:
# =============================================================================
# FORECASTING MODEL
# =============================================================================

def train_forecast_model(df, test_size=0.2):
    """
    Train a RandomForest model for sales forecasting.
    """
    print("\n" + "=" * 60)
    print("FORECASTING MODEL")
    print("=" * 60)
    
    feature_cols = ['year', 'month', 'week', 'dayofweek', 'price', 'discount',
                    'sales_lag_1', 'sales_lag_2', 'sales_lag_4',
                    'sales_roll_mean_4', 'sales_roll_std_4']
    target_col = 'sales'
    
    df_model = df.sort_values('date').reset_index(drop=True)
    df_model = df_model.dropna(subset=feature_cols + [target_col])
    
    X = df_model[feature_cols]
    y = df_model[target_col]
    
    # Time-based split (NO shuffle)
    split_idx = int(len(df_model) * (1 - test_size))
    X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
    y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]
    
    print(f"\n📊 Data Split:")
    print(f"   Training set: {len(X_train)} samples ({len(X_train)/len(X)*100:.1f}%)")
    print(f"   Test set: {len(X_test)} samples ({len(X_test)/len(X)*100:.1f}%)")
    
    # Train model
    print("\n🌲 Training RandomForestRegressor...")
    model = RandomForestRegressor(n_estimators=100, max_depth=15, min_samples_split=5,
                                   min_samples_leaf=2, random_state=42, n_jobs=-1)
    model.fit(X_train, y_train)
    print("   ✓ Model trained successfully")
    
    # Predictions and metrics
    predictions = model.predict(X_test)
    mae = mean_absolute_error(y_test, predictions)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    y_test_safe = np.where(y_test == 0, 1, y_test)
    mape = np.mean(np.abs((y_test - predictions) / y_test_safe)) * 100
    
    metrics = {'MAE': mae, 'RMSE': rmse, 'MAPE': mape}
    
    print("\n📈 Model Performance Metrics:")
    print(f"   • MAE  (Mean Absolute Error): {mae:.2f}")
    print(f"   • RMSE (Root Mean Squared Error): {rmse:.2f}")
    print(f"   • MAPE (Mean Absolute Percentage Error): {mape:.2f}%")
    
    # Feature importance
    print("\n🎯 Top 5 Feature Importance:")
    importance = pd.DataFrame({'feature': feature_cols, 'importance': model.feature_importances_})
    importance = importance.sort_values('importance', ascending=False)
    for _, row in importance.head(5).iterrows():
        print(f"   • {row['feature']}: {row['importance']:.4f}")
    
    return model, X_train, X_test, y_train, y_test, predictions, metrics, df_model

print("✓ train_forecast_model() function defined")

In [None]:
# =============================================================================
# TRAIN THE MODEL
# =============================================================================

model, X_train, X_test, y_train, y_test, predictions, metrics, df_model = train_forecast_model(df)

print("\n✅ Model training complete!")

## 4.3 Cold-Start Handling with Fallback Prediction

In [None]:
# =============================================================================
# COLD-START HANDLING - PREDICT WITH FALLBACK
# =============================================================================

def predict_with_fallback(df, model, min_records=4):
    """
    Make predictions with fallback for cold-start products/stores.
    """
    print("\n" + "=" * 60)
    print("COLD-START HANDLING")
    print("=" * 60)
    
    feature_cols = ['year', 'month', 'week', 'dayofweek', 'price', 'discount',
                    'sales_lag_1', 'sales_lag_2', 'sales_lag_4',
                    'sales_roll_mean_4', 'sales_roll_std_4']
    
    group_counts = df.groupby(['store_id', 'product_id']).size().reset_index(name='record_count')
    cold_start_groups = group_counts[group_counts['record_count'] < min_records]
    normal_groups = group_counts[group_counts['record_count'] >= min_records]
    
    print(f"\n📊 Group Analysis:")
    print(f"   Total unique store-product combinations: {len(group_counts)}")
    print(f"   Groups with sufficient data (>= {min_records} records): {len(normal_groups)}")
    print(f"   Cold-start groups (< {min_records} records): {len(cold_start_groups)}")
    
    # Baselines
    store_category_avg = df.groupby(['store_id', 'category'])['sales'].mean().to_dict()
    category_avg = df.groupby('category')['sales'].mean().to_dict()
    overall_avg = df['sales'].mean()
    
    results = []
    fallback_groups = []
    
    for _, group_info in group_counts.iterrows():
        store_id = group_info['store_id']
        product_id = group_info['product_id']
        record_count = group_info['record_count']
        
        mask = (df['store_id'] == store_id) & (df['product_id'] == product_id)
        group_data = df[mask].copy()
        
        if record_count < min_records:
            category = group_data['category'].iloc[0] if len(group_data) > 0 else 'Unknown'
            baseline = store_category_avg.get((store_id, category), 
                       category_avg.get(category, overall_avg))
            prediction = baseline
            method = 'fallback_baseline'
            fallback_groups.append({
                'store_id': store_id, 'product_id': product_id, 'category': category,
                'record_count': record_count, 'baseline_prediction': round(baseline, 2)
            })
        else:
            # Use the most recent record; sort explicitly rather than rely on input order
            latest_data = group_data.sort_values('date').iloc[-1:].copy()
            latest_data[feature_cols] = latest_data[feature_cols].fillna(0)
            prediction = model.predict(latest_data[feature_cols])[0]
            method = 'ml_model'
        
        results.append({
            'store_id': store_id, 'product_id': product_id,
            'predicted_sales': round(prediction, 2), 'method': method, 'record_count': record_count
        })
    
    predictions_df = pd.DataFrame(results)
    fallback_df = pd.DataFrame(fallback_groups)
    
    print(f"\n📈 Prediction Summary:")
    print(f"   ML model predictions: {len(predictions_df[predictions_df['method'] == 'ml_model'])}")
    print(f"   Baseline fallback predictions: {len(predictions_df[predictions_df['method'] == 'fallback_baseline'])}")
    
    return predictions_df, fallback_df

print("✓ predict_with_fallback() function defined")

## 4.4 Business Insights Generation
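
One of the insights computed in this module is discount uplift: the percentage difference between average sales with and without a discount. The sketch below shows the arithmetic on a handful of made-up records; the function name `discount_uplift` is an assumption for illustration.

```python
from statistics import mean

def discount_uplift(records):
    """Percent difference in mean sales between discounted and full-price records."""
    with_discount = [r["sales"] for r in records if r["discount"] > 0]
    without_discount = [r["sales"] for r in records if r["discount"] == 0]
    avg_with = mean(with_discount) if with_discount else 0
    avg_without = mean(without_discount) if without_discount else 0
    if avg_without <= 0:
        return 0.0  # uplift is undefined without a full-price baseline
    return (avg_with - avg_without) / avg_without * 100

# Illustrative records: discounted sales average 13, full-price sales average 10
records = [
    {"discount": 10, "sales": 12},
    {"discount": 20, "sales": 14},
    {"discount": 0, "sales": 10},
    {"discount": 0, "sales": 10},
]
uplift = discount_uplift(records)
print(f"Discount uplift: {uplift:+.1f}%")
```

This is a plain difference in group means, so it does not control for price, category, or season; it is best read as a first indication rather than a causal estimate.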

In [None]:
# =============================================================================
# INSIGHTS GENERATION MODULE
# =============================================================================

def generate_insights(df, trend_summary):
    """
    Generate business insights from pricing and discount patterns.
    """
    print("\n" + "=" * 60)
    print("INSIGHTS GENERATION MODULE")
    print("=" * 60)
    
    insights = {}
    recommendations = []
    
    # Price-Sales Correlation
    print("\n💰 Analyzing Price-Sales Relationship...")
    price_corr = df['price'].corr(df['sales'])
    insights['price_sales_correlation'] = round(price_corr, 4)
    print(f"   Price-Sales Correlation: {price_corr:.4f}")
    
    if price_corr < -0.3:
        recommendations.append("📉 Negative price-sales correlation (r < -0.3). Consider testing targeted price reductions.")
    
    # Discount-Sales Correlation
    print("\n🏷️ Analyzing Discount-Sales Relationship...")
    discount_corr = df['discount'].corr(df['sales'])
    insights['discount_sales_correlation'] = round(discount_corr, 4)
    print(f"   Discount-Sales Correlation: {discount_corr:.4f}")
    
    # Discount Uplift
    print("\n📊 Calculating Discount Uplift...")
    df_with = df[df['discount'] > 0]
    df_without = df[df['discount'] == 0]
    avg_with = df_with['sales'].mean() if len(df_with) > 0 else 0
    avg_without = df_without['sales'].mean() if len(df_without) > 0 else 0
    uplift = ((avg_with - avg_without) / avg_without * 100) if avg_without > 0 else 0
    
    insights['avg_sales_with_discount'] = round(avg_with, 2)
    insights['avg_sales_without_discount'] = round(avg_without, 2)
    insights['discount_uplift_percent'] = round(uplift, 2)
    
    print(f"   Average sales WITH discount: {avg_with:.2f}")
    print(f"   Average sales WITHOUT discount: {avg_without:.2f}")
    print(f"   Discount Uplift: {uplift:+.2f}%")
    
    if uplift > 10:
        recommendations.append(f"✅ Discounts effective! {uplift:.1f}% sales uplift. Maintain strategic discount campaigns.")
    
    # Category Analysis
    print("\n📂 Analyzing Category Performance...")
    category_stats = df.groupby('category').agg({'sales': ['sum', 'mean', 'count']}).round(2)
    category_stats.columns = ['total_sales', 'avg_sales', 'transaction_count']
    category_stats = category_stats.sort_values('total_sales', ascending=False)
    insights['category_stats'] = category_stats
    
    top_cat = category_stats.index[0]
    bottom_cat = category_stats.index[-1]
    recommendations.append(f"🏆 '{top_cat}' is top-performing. Allocate more inventory and marketing.")
    recommendations.append(f"🔍 '{bottom_cat}' shows lowest sales. Investigate causes.")
    
    # Store Analysis
    print("\n🏪 Analyzing Store Performance...")
    store_stats = df.groupby('store_id').agg({'sales': ['sum', 'mean']}).round(2)
    store_stats.columns = ['total_sales', 'avg_sales']
    store_stats = store_stats.sort_values('total_sales', ascending=False)
    insights['store_stats'] = store_stats
    
    top_store = store_stats.index[0]
    bottom_store = store_stats.index[-1]
    recommendations.append(f"⭐ Store '{top_store}' is best performer. Replicate success factors.")
    recommendations.append(f"📋 Store '{bottom_store}' underperforms. Consider operational review.")
    
    # Trend insights
    print("\n📈 Generating Trend-Based Insights...")
    rising = len(trend_summary[trend_summary['trend'] == 'Rising'])
    falling = len(trend_summary[trend_summary['trend'] == 'Falling'])
    insights['trend_distribution'] = {'rising': rising, 'falling': falling}
    
    if rising > falling:
        recommendations.append(f"🌟 Positive momentum: {rising} rising vs {falling} falling trends.")
    
    # Seasonal insight
    monthly_sales = df.groupby('month')['sales'].mean()
    peak_month = monthly_sales.idxmax()
    low_month = monthly_sales.idxmin()
    month_names = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun',
                   7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'}
    recommendations.append(f"📅 Peak month: {month_names[peak_month]}. Plan inventory buildup.")
    recommendations.append(f"📅 Low month: {month_names[low_month]}. Consider promotions.")
    
    insights['recommendations'] = recommendations
    return insights

print("✓ generate_insights() function defined")

---
# 5. Evaluation & Analysis

## 5.1 Model Evaluation
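
For reference, the three metrics reported below can be written out by hand. This is a small pure-Python sketch with made-up numbers; the helper `regression_metrics` is hypothetical. It follows the same convention as the training function of replacing a zero actual value with 1 in the MAPE denominator.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MAE, RMSE, and MAPE (in percent) for paired values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # Guard against division by zero: a zero actual is treated as 1
    mape = sum(abs(e / (a if a != 0 else 1)) for e, a in zip(errors, actual)) / n * 100
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape}

# Illustrative values, not model output
example = regression_metrics(actual=[10, 20, 0], predicted=[12, 18, 1])
print({name: round(value, 2) for name, value in example.items()})
```

MAE and RMSE are in sales units, with RMSE penalizing large misses more heavily, while MAPE is relative and therefore sensitive to records with very low actual sales.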

In [None]:
# =============================================================================
# MODEL EVALUATION - ACTUAL VS PREDICTED
# =============================================================================

print("=" * 60)
print("MODEL EVALUATION")
print("=" * 60)

print("\n📊 Performance Metrics Summary:")
print("─" * 40)
print(f"│ MAE  (Mean Absolute Error)      : {metrics['MAE']:.2f}")
print(f"│ RMSE (Root Mean Squared Error)  : {metrics['RMSE']:.2f}")
print(f"│ MAPE (Mean Absolute % Error)    : {metrics['MAPE']:.2f}%")
print("─" * 40)

# Plot Actual vs Predicted
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ax1 = axes[0]
ax1.scatter(y_test, predictions, alpha=0.5, color='steelblue', edgecolor='black', s=30)
ax1.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2, label='Perfect Prediction')
ax1.set_title('Actual vs Predicted Sales', fontsize=12, fontweight='bold')
ax1.set_xlabel('Actual Sales')
ax1.set_ylabel('Predicted Sales')
ax1.legend()
ax1.grid(True, alpha=0.3)

ax2 = axes[1]
residuals = y_test.values - predictions
ax2.hist(residuals, bins=30, color='coral', edgecolor='black', alpha=0.7)
ax2.axvline(x=0, color='red', linestyle='--', linewidth=2)
ax2.set_title('Residual Distribution', fontsize=12, fontweight='bold')
ax2.set_xlabel('Residual (Actual - Predicted)')
ax2.set_ylabel('Frequency')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✓ Evaluation plots generated")

## 5.2 Future Sales Forecast
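
The forecast cell in this section builds one feature row per store and category from recent averages and reuses the same lag values for every future month. A common alternative is a recursive forecast, where each prediction is fed back in as the newest lag. The sketch below illustrates that idea only; `recursive_forecast` and `predict_fn` are hypothetical names, and the stand-in model is not the trained Random Forest.

```python
from collections import deque

def recursive_forecast(history, predict_fn, steps=4):
    """Forecast `steps` periods ahead, feeding each prediction back as a lag.

    history    : list of past sales, oldest first (at least 4 values)
    predict_fn : hypothetical callable that maps a feature dict to a number
    """
    window = deque(history[-4:], maxlen=4)  # window[-1] is the most recent value
    forecasts = []
    for _ in range(steps):
        features = {
            "sales_lag_1": window[-1],
            "sales_lag_2": window[-2],
            "sales_lag_4": window[-4],
            "sales_roll_mean_4": sum(window) / len(window),
        }
        prediction = max(0.0, float(predict_fn(features)))  # sales cannot be negative
        forecasts.append(prediction)
        window.append(prediction)  # the oldest value drops out automatically
    return forecasts

# Stand-in model: the next value is the mean of the last four observations.
demo = recursive_forecast([10, 12, 14, 16], lambda f: f["sales_roll_mean_4"], steps=2)
print(demo)
```

Recursive forecasts compound their own errors, so the uncertainty grows with each step ahead.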

In [None]:
# =============================================================================
# FUTURE SALES FORECAST (NEXT 4 MONTHS)
# =============================================================================

print("=" * 60)
print("FUTURE SALES FORECAST")
print("=" * 60)

last_date = df['date'].max()
print(f"\nLast date in dataset: {last_date.strftime('%Y-%m-%d')}")

future_months = [last_date + pd.DateOffset(months=i) for i in range(1, 5)]
print(f"Forecasting for: {[d.strftime('%Y-%m') for d in future_months]}")

recent_data = df[df['date'] > (last_date - pd.DateOffset(months=3))].copy()
forecast_base = recent_data.groupby(['store_id', 'category']).agg({
    'price': 'mean', 'discount': 'mean', 'sales': 'mean',
    'sales_lag_1': 'mean', 'sales_lag_2': 'mean', 'sales_lag_4': 'mean',
    'sales_roll_mean_4': 'mean', 'sales_roll_std_4': 'mean'
}).reset_index()

forecasts = []
for future_date in future_months:
    for _, row in forecast_base.iterrows():
        features = pd.DataFrame({
            'year': [future_date.year], 'month': [future_date.month],
            'week': [future_date.isocalendar()[1]], 'dayofweek': [future_date.dayofweek],
            'price': [row['price']], 'discount': [row['discount']],
            'sales_lag_1': [row['sales']], 'sales_lag_2': [row['sales_lag_1']],
            'sales_lag_4': [row['sales_lag_2']], 'sales_roll_mean_4': [row['sales_roll_mean_4']],
            'sales_roll_std_4': [row['sales_roll_std_4']]
        })
        predicted = model.predict(features)[0]
        forecasts.append({'forecast_month': future_date.strftime('%Y-%m'),
                          'store_id': row['store_id'], 'category': row['category'],
                          'predicted_sales': round(max(0, predicted), 2)})

forecast_df = pd.DataFrame(forecasts)
monthly_forecast = forecast_df.groupby('forecast_month').agg({'predicted_sales': 'sum'}).reset_index()
monthly_forecast.columns = ['Month', 'Predicted Total Sales']

print("\n" + "─" * 50)
print("📊 MONTHLY FORECAST SUMMARY")
print("─" * 50)
print(monthly_forecast.to_string(index=False))

# Plot
fig, ax = plt.subplots(figsize=(12, 5))
historical_monthly = df.groupby(df['date'].dt.to_period('M'))['sales'].sum()
hist_dates = historical_monthly.index.astype(str)
hist_values = historical_monthly.values

ax.plot(hist_dates[-12:], hist_values[-12:], marker='o', color='steelblue', linewidth=2, markersize=6, label='Historical Sales')
forecast_dates = monthly_forecast['Month'].values
forecast_values = monthly_forecast['Predicted Total Sales'].values
ax.plot([hist_dates[-1], forecast_dates[0]], [hist_values[-1], forecast_values[0]], color='coral', linewidth=2, linestyle='--')
ax.plot(forecast_dates, forecast_values, marker='s', color='coral', linewidth=2, markersize=8, label='Forecast')
# Note: the shaded band is a fixed ±15% heuristic, not a statistical confidence interval.
ax.fill_between(forecast_dates, forecast_values * 0.85, forecast_values * 1.15, color='coral', alpha=0.2, label='Illustrative Band (±15%)')
ax.set_title('📈 Sales Forecast - Next 4 Months', fontsize=14, fontweight='bold')
ax.set_xlabel('Month')
ax.set_ylabel('Total Sales')
ax.tick_params(axis='x', rotation=45)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n✅ Forecast generation complete!")

## 5.3 Pricing & Discount Insights

In [None]:
# =============================================================================
# PRICING & DISCOUNT INSIGHTS
# =============================================================================

insights = generate_insights(df, trend_summary)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ax1 = axes[0]
cat_stats = insights['category_stats']
colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(cat_stats)))
bars = ax1.barh(cat_stats.index, cat_stats['total_sales'], color=colors, edgecolor='black')
ax1.set_title('📊 Total Sales by Category', fontsize=12, fontweight='bold')
ax1.set_xlabel('Total Sales')
for i, bar in enumerate(bars):
    ax1.text(bar.get_width() + 50, bar.get_y() + bar.get_height()/2, f'{cat_stats["total_sales"].iloc[i]:,.0f}', va='center', fontsize=9)

ax2 = axes[1]
categories = ['With Discount', 'Without Discount']
values = [insights['avg_sales_with_discount'], insights['avg_sales_without_discount']]
colors = ['coral', 'steelblue']
bars = ax2.bar(categories, values, color=colors, edgecolor='black')
ax2.set_title('🏷️ Discount Impact on Average Sales', fontsize=12, fontweight='bold')
ax2.set_ylabel('Average Sales')
for bar, val in zip(bars, values):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5, f'{val:.1f}', ha='center', fontsize=11, fontweight='bold')
uplift = insights['discount_uplift_percent']
ax2.annotate(f'Uplift: {uplift:+.1f}%', xy=(0.5, max(values) * 0.8), fontsize=12, fontweight='bold',
             color='green' if uplift > 0 else 'red', ha='center')

plt.tight_layout()
plt.show()

print("\n✓ Pricing insights visualization complete")

## 5.4 Business Recommendations

In [None]:
# =============================================================================
# BUSINESS RECOMMENDATIONS
# =============================================================================

print("=" * 70)
print("📋 BUSINESS RECOMMENDATIONS")
print("=" * 70)

for i, rec in enumerate(insights['recommendations'], 1):
    print(f"\n{i}. {rec}")

print("\n" + "=" * 70)
print("📊 KEY METRICS SUMMARY")
print("=" * 70)
print(f"""
┌─────────────────────────────────────────────────────────────┐
│  Price-Sales Correlation    : {insights['price_sales_correlation']:>8.4f}                      │
│  Discount-Sales Correlation : {insights['discount_sales_correlation']:>8.4f}                      │
│  Discount Uplift            : {insights['discount_uplift_percent']:>8.2f}%                     │
│  Avg Sales (with discount)  : {insights['avg_sales_with_discount']:>8.2f}                      │
│  Avg Sales (no discount)    : {insights['avg_sales_without_discount']:>8.2f}                      │
└─────────────────────────────────────────────────────────────┘
""")

## 5.5 Edge Case Demonstration

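This section demonstrates the cold-start handling mechanism: product-store groups with fewer than `min_records` historical records receive a baseline prediction instead of an ML model prediction.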
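
`predict_with_fallback()` is defined earlier in the notebook. The sketch below is not that function; it only illustrates the routing idea under stated assumptions. The helper names `split_by_history` and `baseline_prediction`, the `'baseline'` label, and the choice of a category mean as the baseline are all hypothetical.

```python
import pandas as pd

def split_by_history(df, min_records=4):
    """Label each (store_id, product_id) group as 'ml_model' or 'baseline'."""
    counts = (
        df.groupby(["store_id", "product_id"])
          .size()
          .rename("record_count")
          .reset_index()
    )
    # Groups with too little history are routed to the baseline.
    counts["method"] = counts["record_count"].apply(
        lambda n: "ml_model" if n >= min_records else "baseline"
    )
    return counts

def baseline_prediction(df, category):
    """Hypothetical baseline: mean sales of the category, else the global mean."""
    in_category = df.loc[df["category"] == category, "sales"]
    return float(in_category.mean()) if len(in_category) else float(df["sales"].mean())

demo_df = pd.DataFrame({
    "store_id": ["S1"] * 4 + ["S999"],
    "product_id": ["P1"] * 4 + ["P9999"],
    "category": ["Electronics"] * 5,
    "sales": [10, 12, 14, 16, 5],
})
demo_methods = split_by_history(demo_df)
demo_baseline = baseline_prediction(demo_df, "Electronics")
print(demo_methods.to_string(index=False))
print(f"Baseline for Electronics: {demo_baseline:.2f}")
```

A category-level mean is a coarse estimate, but it stays defined for a product with a single record, where lag and rolling features carry little information.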

In [None]:
# =============================================================================
# EDGE CASE DEMONSTRATION - COLD-START HANDLING
# =============================================================================

print("=" * 70)
print("⚠️  EDGE CASE DEMONSTRATION: COLD-START HANDLING")
print("=" * 70)

print("\n📝 Creating cold-start scenario...")

cold_start_data = pd.DataFrame({
    'date': pd.to_datetime(['2025-12-01', '2025-12-02', '2025-12-15']),
    'store_id': ['S999', 'S999', 'S998'],
    'product_id': ['P9999', 'P9999', 'P9998'],
    'category': ['Electronics', 'Electronics', 'Clothing'],
    'price': [150.0, 150.0, 45.0],
    'discount': [10.0, 15.0, 0.0],
    'sales': [5, 8, 3],
    'year': [2025, 2025, 2025],
    'month': [12, 12, 12],
    'week': [49, 49, 51],  # ISO week numbers; 2025-12-15 falls in week 51
    'dayofweek': [0, 1, 0],
    'sales_lag_1': [0, 5, 0],
    'sales_lag_2': [0, 0, 0],
    'sales_lag_4': [0, 0, 0],
    'sales_roll_mean_4': [5.0, 6.5, 3.0],
    'sales_roll_std_4': [0.0, 2.1, 0.0]
})

df_with_cold_start = pd.concat([df, cold_start_data], ignore_index=True)
print(f"✓ Added {len(cold_start_data)} cold-start records")
print(f"   New product-store combinations: S999-P9999, S998-P9998")

predictions_df, fallback_df = predict_with_fallback(df_with_cold_start, model, min_records=4)

print("\n" + "─" * 70)
print("🆘 GROUPS USING BASELINE FALLBACK (< 4 historical records)")
print("─" * 70)

if len(fallback_df) > 0:
    print(fallback_df.to_string(index=False))
    print("\n" + "─" * 70)
    print("📊 FALLBACK PREDICTION DETAILS")
    print("─" * 70)
    for _, row in fallback_df.iterrows():
        print(f"\n   Store: {row['store_id']}, Product: {row['product_id']}")
        print(f"   Category: {row['category']}")
        print(f"   Historical Records: {row['record_count']}")
        print(f"   ❌ ML Model: NOT USED (insufficient data)")
        print(f"   ✅ Baseline Prediction: {row['baseline_prediction']:.2f} units")

print("\n" + "─" * 70)
print("📈 PREDICTION METHOD SUMMARY")
print("─" * 70)
method_counts = predictions_df['method'].value_counts()
for method, count in method_counts.items():
    pct = count / len(predictions_df) * 100
    print(f"   {method}: {count} groups ({pct:.1f}%)")

fig, ax = plt.subplots(figsize=(8, 5))
methods = predictions_df.groupby('method')['predicted_sales'].mean()
colors = ['steelblue' if m == 'ml_model' else 'coral' for m in methods.index]
bars = ax.bar(methods.index, methods.values, color=colors, edgecolor='black')
ax.set_title('🔄 Average Predicted Sales by Method', fontsize=12, fontweight='bold')
ax.set_ylabel('Average Predicted Sales')
ax.set_xlabel('Prediction Method')
for bar, val in zip(bars, methods.values):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5, f'{val:.1f}', ha='center', fontsize=11, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\n✅ Edge case demonstration complete!")

---
# 6. Ethical Considerations & Responsible AI

## 6.1 Bias and Fairness Risks

**Potential Biases:**
- **Regional Bias**: Model may perform better for stores with more historical data
- **Category Bias**: Categories with more data may have more accurate predictions
- **Temporal Bias**: Seasonal patterns may not generalize to unusual years

## 6.2 Dataset Limitations

- Synthetic data may not capture all real-world complexities
- Historical patterns may not predict unprecedented events
- External factors (economic conditions, competitors) not included

## 6.3 Privacy Compliance

✅ **This system does NOT use any personal customer data.**

## 6.4 Responsible Use Guidelines

- Use predictions as one input among many for business decisions
- Regularly validate model performance against actual outcomes
- Do not make critical decisions solely based on AI predictions
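
One way to put the validation guideline into practice is a periodic check that compares recent error against the error measured at training time. This is a minimal sketch; the function name `needs_review` and the 25% tolerance are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np

def needs_review(actual, predicted, baseline_mae, tolerance=1.25):
    """Flag the model for review when recent MAE exceeds baseline_mae * tolerance."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    recent_mae = float(np.mean(np.abs(actual - predicted)))
    return recent_mae > baseline_mae * tolerance, recent_mae

# Illustrative numbers only: three recent periods against an assumed baseline MAE of 8.
flag, recent_mae = needs_review([100, 120, 90], [110, 100, 95], baseline_mae=8.0)
print(f"Recent MAE: {recent_mae:.2f}, review needed: {flag}")
```

A flag like this is a prompt for a human to investigate, in keeping with the guideline that predictions should not drive critical decisions on their own.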

## 6.5 AI Tools Disclosure

This project was developed using Python, pandas, NumPy, scikit-learn, and matplotlib. GitHub Copilot assisted in code development.

In [None]:
# =============================================================================
# BIAS ANALYSIS
# =============================================================================

print("=" * 70)
print("🔍 BIAS ANALYSIS")
print("=" * 70)

print("\n📊 Model Performance by Category:")
print("─" * 50)

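# Assumes `predictions` came from a chronological 80/20 split of df_model,
# so the last 20% of rows line up one-to-one with the predictions.
test_indices = df_model.index[int(len(df_model) * 0.8):]
test_data = df_model.loc[test_indices].copy()
assert len(test_data) == len(predictions), "Test rows and predictions are misaligned"
test_data['predicted'] = predictions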

category_performance = test_data.groupby('category').apply(
    lambda x: pd.Series({
        'count': len(x),
        'actual_mean': x['sales'].mean(),
        'predicted_mean': x['predicted'].mean(),
        'mae': mean_absolute_error(x['sales'], x['predicted']),
        'mape': np.mean(np.abs((x['sales'] - x['predicted']) / x['sales'].replace(0, 1))) * 100
    })
).round(2)

print(category_performance)

max_mape = category_performance['mape'].max()
min_mape = category_performance['mape'].min()
disparity = max_mape - min_mape

print(f"\n⚖️ Fairness Check:")
print(f"   MAPE range across categories: {min_mape:.1f}% - {max_mape:.1f}%")
print(f"   Disparity: {disparity:.1f} percentage points")

if disparity > 20:
    print("   ⚠️ WARNING: Significant performance disparity detected.")
else:
    print("   ✅ Performance is relatively consistent across categories.")

print("\n✓ Bias analysis complete")

---
# 7. Conclusion & Future Scope

## 7.1 Summary of Findings

This AI MarketPulse system demonstrates:

1. **Data Pipeline**: Preprocessing with outlier handling and feature engineering
2. **Trend Analysis**: Classification of product-store combinations as rising, falling, or stable
3. **Sales Forecasting**: A Random Forest model evaluated on held-out data with MAE, RMSE, and MAPE
4. **Business Insights**: Rule-based generation of recommendations from pricing, discount, and seasonal patterns
5. **Cold-Start Handling**: Baseline fallback for new products/stores with too few historical records

In [None]:
# =============================================================================
# FINAL SUMMARY AND RESULTS
# =============================================================================

print("=" * 70)
print("📊 FINAL RESULTS SUMMARY")
print("=" * 70)

print(f"""
┌────────────────────────────────────────────────────────────────────┐
│                    AI MARKETPULSE RESULTS                          │
├────────────────────────────────────────────────────────────────────┤
│  📈 TREND ANALYSIS                                                 │
│  • Rising Trends:  {insights['trend_distribution']['rising']:>4} product-store combinations            │
│  • Falling Trends: {insights['trend_distribution']['falling']:>4} product-store combinations            │
│                                                                    │
│  🔮 FORECASTING MODEL                                              │
│  • Algorithm: RandomForestRegressor                                │
│  • MAE:  {metrics['MAE']:>8.2f}                                             │
│  • RMSE: {metrics['RMSE']:>8.2f}                                             │
│  • MAPE: {metrics['MAPE']:>8.2f}%                                            │
│                                                                    │
│  💡 KEY INSIGHTS                                                   │
│  • Price-Sales Correlation:    {insights['price_sales_correlation']:>8.4f}                       │
│  • Discount Uplift:            {insights['discount_uplift_percent']:>8.2f}%                      │
└────────────────────────────────────────────────────────────────────┘
""")

print("\n" + "─" * 70)
print("📋 TOP 5 RISING TRENDS")
print("─" * 70)
print(top_rising[['store_id', 'product_id', 'category', 'avg_growth', 'total_sales']].head().to_string(index=False))

print("\n" + "─" * 70)
print("📋 TOP 5 FALLING TRENDS")
print("─" * 70)
print(top_falling[['store_id', 'product_id', 'category', 'avg_growth', 'total_sales']].head().to_string(index=False))

## 7.2 Known Limitations

| Limitation | Impact | Mitigation |
|------------|--------|------------|
| Synthetic data | May not capture real-world complexity | Integrate real datasets |
| No external factors | Cannot predict market disruptions | Add economic indicators |
| Cold-start accuracy | Baseline predictions less precise | Collaborative filtering |

## 7.3 Future Scope

1. **Deep Learning Models**: LSTM for complex temporal patterns
2. **External Data**: Economic indicators, weather, social media
3. **Real-time Streaming**: Live data processing
4. **Automated Alerting**: Notify stakeholders of trend changes

---

**Thank you for reviewing AI MarketPulse!**

*Module E: AI Applications – Individual Open Project*

In [None]:
# =============================================================================
# NOTEBOOK EXECUTION COMPLETE
# =============================================================================

print("=" * 70)
print("🎉 AI MARKETPULSE NOTEBOOK EXECUTION COMPLETE!")
print("=" * 70)
print("""
All sections executed successfully:
  ✅ 1. Problem Definition & Objective
  ✅ 2. Data Understanding & Preparation
  ✅ 3. Model / System Design
  ✅ 4. Core Implementation
  ✅ 5. Evaluation & Analysis
  ✅ 6. Ethical Considerations & Responsible AI
  ✅ 7. Conclusion & Future Scope

Thank you for using AI MarketPulse!
""")
print("=" * 70)