# Retail Forecaster

## A Comprehensive System for Retail Sales Forecasting

**Date:** May 1, 2025  
**Last Updated:** May 1, 2025

---

## Overview

This notebook provides a detailed explanation of our retail sales prediction system. The application combines weather data and social media sentiment analysis with historical sales figures to create accurate sales forecasts for retailers in India. The entire system is packaged as a Streamlit web application with a user-friendly interface.

## 1. System Architecture

The application follows a modular architecture with several key components:

1. **Web Interface** (Streamlit): Provides user authentication, data upload, visualization, and prediction interfaces
2. **Data Processing Pipeline**: Handles data cleaning, feature engineering, and data integration
3. **Machine Learning Models**: Uses ensemble techniques to generate accurate predictions
4. **Visualization Engine**: Creates intuitive visualizations for data exploration and reporting
5. **Database Backend**: SQLite database for storing user data, models, and predictions

```
                                  +----------------+
                                  |                |
                                  |   Streamlit    |
                                  |   Web Interface|
                                  |                |
                                  +-------+--------+
                                          |
                                          v
                      +------------------+-------------------+
                      |                                      |
+ - - - - - - - - - - +     SQLite Database Backend         + - - - - - - - - - - +
|                     |                                      |                     |
|                     +--+------------------+---------------+                     |
|                        |                  |                                      |
|                        v                  v                                      v
+--------------------+   |  +-------------+   +--------------+    +----------------+
|                    |   |  |             |   |              |    |                |
| Data Processing    |<--+  | ML Model    |<--+ Visualization|    | External APIs  |
| Pipeline           |----->| Training    |--->| Engine      |    | (Weather, etc.)|
|                    |      |             |   |              |    |                |
+--------------------+      +-------------+   +--------------+    +----------------+
```

## 2. Data Sources

The system utilizes three primary data sources to build its predictive models:

### 2.1 Sales Data

Users upload historical sales data in CSV format with the following key fields:
- Date: When the sale occurred (YYYY-MM-DD format)
- Product_ID: Unique identifier for the product
- Category: Product category (e.g., Electronics, Clothing, Food)
- Quantity: Number of units sold
- Price: Per-unit price in Indian Rupees (₹)
- Total_Sales: Total sales amount (Quantity × Price) in ₹
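
The field list above can be turned into a simple upload check. The following is a minimal sketch, assuming pandas is used for loading; the helper name `load_sales_csv` and the `Total_Mismatch` column are hypothetical and not part of the application.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = ["Date", "Product_ID", "Category", "Quantity", "Price", "Total_Sales"]


def load_sales_csv(source):
    """Read a sales CSV and run basic schema checks (illustrative helper)."""
    df = pd.read_csv(source)
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # Enforce the documented YYYY-MM-DD format; malformed dates raise an error.
    df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
    # Flag rows where Total_Sales disagrees with Quantity x Price.
    expected = df["Quantity"] * df["Price"]
    df["Total_Mismatch"] = (df["Total_Sales"] - expected).abs() > 0.01
    return df


sample = io.StringIO(
    "Date,Product_ID,Category,Quantity,Price,Total_Sales\n"
    "2025-01-05,P001,Electronics,2,1500.0,3000.0\n"
    "2025-01-06,P002,Food,3,40.0,100.0\n"
)
sales = load_sales_csv(sample)
print(sales[["Product_ID", "Total_Mismatch"]])
```

The second sample row is deliberately inconsistent (3 × 40 is not 100), so it is flagged rather than silently accepted.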

### 2.2 Weather Data

Weather data is sourced from the OpenWeatherMap API with fallback to synthetic data when API access is unavailable:
- Date: Date of weather measurement
- Location: Store location (city name)
- Temperature: Average daily temperature in °C
- Weather_Condition: Description (Sunny, Rainy, Cloudy, etc.)
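
The fallback behaviour described above might look roughly like the sketch below. It makes no real API call: the `fetch` argument stands in for whatever function queries OpenWeatherMap, and `get_weather`, `synthetic_weather`, the `Source` field, and the temperature range are all illustrative assumptions.

```python
import random
from datetime import date

CONDITIONS = ["Sunny", "Rainy", "Cloudy"]


def synthetic_weather(day, location):
    """Generate a reproducible placeholder record (not a real observation)."""
    rng = random.Random(f"{location}-{day.isoformat()}")  # same inputs, same output
    return {
        "Date": day.isoformat(),
        "Location": location,
        "Temperature": round(rng.uniform(15.0, 40.0), 1),  # assumed range in °C
        "Weather_Condition": rng.choice(CONDITIONS),
        "Source": "synthetic",
    }


def get_weather(day, location, fetch=None):
    """Try the supplied fetcher first; fall back to synthetic data on failure."""
    if fetch is not None:
        try:
            record = fetch(day, location)
            record["Source"] = "api"
            return record
        except Exception:
            pass  # API unavailable: fall through to the synthetic fallback
    return synthetic_weather(day, location)


def failing_fetch(day, location):
    """Stand-in for an API call that cannot reach the server."""
    raise ConnectionError("API unreachable")


record = get_weather(date(2025, 5, 1), "Mumbai", fetch=failing_fetch)
print(record["Source"])
```

Tagging each record with its source keeps synthetic rows distinguishable from real measurements when models are later evaluated.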

### 2.3 Social Media Sentiment Data

The system analyzes social media sentiment related to the brand or product categories:
- Date: Date of sentiment analysis
- Keywords: Words used for sentiment analysis (brand names, product types)
- Sentiment_Score: Numerical score from -1.0 (negative) to 1.0 (positive)

## 3. Data Processing Pipeline

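The data processing pipeline transforms raw inputs into a prediction-ready dataset in three steps: cleaning the sales data, combining it with weather and sentiment data, and engineering features: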

In [None]:
# Conceptual representation of data processing flow
# This is not functional code, but demonstrates the pipeline structure

def process_data_pipeline(sales_df, weather_df, sentiment_df):
    # Step 1: Clean and preprocess individual datasets
    sales_processed = process_sales_data(sales_df)
    
    # Step 2: Combine datasets
    combined_data = combine_datasets(sales_processed, weather_df, sentiment_df)
    
    # Step 3: Feature engineering
    engineered_data = preprocess_data(combined_data)
    
    return engineered_data
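
The combining step could be sketched as a pair of left joins on `Date`. This is an assumption about how `combine_datasets` works, not the application's actual implementation; it also assumes a single store location, so weather is matched on date alone.

```python
import pandas as pd


def combine_datasets(sales_df, weather_df, sentiment_df):
    """Left-join weather and sentiment onto the sales rows by Date."""
    combined = sales_df.merge(weather_df, on="Date", how="left")
    combined = combined.merge(sentiment_df, on="Date", how="left")
    return combined


sales = pd.DataFrame({
    "Date": pd.to_datetime(["2025-01-05", "2025-01-06"]),
    "Total_Sales": [3000.0, 120.0],
})
weather = pd.DataFrame({
    "Date": pd.to_datetime(["2025-01-05"]),
    "Temperature": [24.5],
})
sentiment = pd.DataFrame({
    "Date": pd.to_datetime(["2025-01-05", "2025-01-06"]),
    "Sentiment_Score": [0.4, -0.2],
})

combined = combine_datasets(sales, weather, sentiment)
print(combined)
```

A left join keeps every sales row even when weather or sentiment is missing for that date; the resulting gaps appear as `NaN` and can be handled by the imputation step.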

### 3.1 Feature Engineering

The system implements extensive feature engineering to capture patterns affecting sales:

#### 3.1.1 Time-Based Features

- Year, Month, Day extraction
- Day of week (0-6, where 0 is Monday)
- Quarter (1-4)
- Day of year (1-366)
- Week of year (1-53)
- Weekend indicator (1 for weekend, 0 for weekday)
- Month start/end indicators
- Season (1:Winter, 2:Spring, 3:Summer, 4:Fall)
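
As a rough illustration, the time-based features listed above can be derived with the pandas `.dt` accessor. The column names and the month-to-season mapping (December to February as Winter, and so on) are assumptions; the source does not state which months belong to each season.

```python
import pandas as pd


def add_time_features(df, date_col="Date"):
    """Return a copy of df with calendar features derived from date_col."""
    out = df.copy()
    d = pd.to_datetime(out[date_col])
    out["Year"] = d.dt.year
    out["Month"] = d.dt.month
    out["Day"] = d.dt.day
    out["DayOfWeek"] = d.dt.dayofweek  # 0 = Monday ... 6 = Sunday
    out["Quarter"] = d.dt.quarter
    out["DayOfYear"] = d.dt.dayofyear
    out["WeekOfYear"] = d.dt.isocalendar().week.astype(int)  # ISO week number
    out["Is_Weekend"] = (d.dt.dayofweek >= 5).astype(int)
    out["Is_Month_Start"] = d.dt.is_month_start.astype(int)
    out["Is_Month_End"] = d.dt.is_month_end.astype(int)
    # Assumed mapping: Dec-Feb = 1, Mar-May = 2, Jun-Aug = 3, Sep-Nov = 4
    out["Season"] = (d.dt.month % 12) // 3 + 1
    return out


features = add_time_features(pd.DataFrame({"Date": ["2025-05-01", "2025-05-31"]}))
print(features.T)
```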

#### 3.1.2 Indian Context-Specific Features

The system includes features specifically relevant to Indian retail patterns:

- Indian festival indicators (Diwali, Holi, Navratri)
- Financial year-end indicator (March in India)
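
A minimal sketch of these two indicators follows. Festival dates shift from year to year, so the sketch takes them as an input instead of hard-coding a calendar; the function names, the `window_days` parameter, and the example date are hypothetical.

```python
from datetime import date


def festival_flag(day, festival_dates, window_days=0):
    """Return 1 if `day` is within `window_days` of any supplied festival date."""
    return int(any(abs((day - f).days) <= window_days for f in festival_dates))


def financial_year_end_flag(day):
    """Return 1 for dates in March, the last month of the Indian financial year."""
    return int(day.month == 3)


# Placeholder date for illustration only, not a real festival calendar.
example_festival_dates = [date(2025, 1, 15)]

print(festival_flag(date(2025, 1, 17), example_festival_dates, window_days=2))
print(financial_year_end_flag(date(2025, 3, 31)))
```

A window around each festival date is one way to capture shopping that builds up in the days before the festival itself.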

#### 3.1.3 Weather Features

- Temperature bins (Freezing, Cold, Mild, Warm, Hot)
- Weather-day interactions (e.g., Rainy_Weekend)
- Weather-season interactions
- Rain and snow indicators
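
The weather features above could be built along these lines. The bin edges, the `Snowy` condition label, and the output column names are assumptions: the source names the bins but does not give the cut points.

```python
import pandas as pd

# Assumed bin edges in °C; the source does not specify them.
TEMP_EDGES = [-float("inf"), 0, 15, 25, 32, float("inf")]
TEMP_LABELS = ["Freezing", "Cold", "Mild", "Warm", "Hot"]


def add_weather_features(df):
    """Return a copy of df with temperature bins and weather indicators."""
    out = df.copy()
    out["Temp_Bin"] = pd.cut(out["Temperature"], bins=TEMP_EDGES, labels=TEMP_LABELS)
    out["Is_Rainy"] = (out["Weather_Condition"] == "Rainy").astype(int)
    out["Is_Snowy"] = (out["Weather_Condition"] == "Snowy").astype(int)
    # Weather-day interaction: rain that falls on a Saturday or Sunday
    is_weekend = (pd.to_datetime(out["Date"]).dt.dayofweek >= 5).astype(int)
    out["Rainy_Weekend"] = out["Is_Rainy"] * is_weekend
    return out


weather = pd.DataFrame({
    "Date": ["2025-05-31", "2025-05-01", "2025-06-01"],
    "Temperature": [28.0, 35.0, 10.0],
    "Weather_Condition": ["Rainy", "Rainy", "Sunny"],
})
weather_features = add_weather_features(weather)
print(weather_features[["Temp_Bin", "Is_Rainy", "Rainy_Weekend"]])
```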

#### 3.1.4 Sentiment Features

- Sentiment bins (Very Negative to Very Positive)
- Sentiment lag features (1-day and 7-day)
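
Sentiment binning can be sketched with a sorted list of thresholds. The cut points and the three middle labels below are assumptions; the source gives only the two ends of the scale.

```python
import bisect

# Assumed thresholds on the -1.0 to 1.0 scale.
SENTIMENT_EDGES = [-0.6, -0.2, 0.2, 0.6]
SENTIMENT_LABELS = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]


def sentiment_bin(score):
    """Map a Sentiment_Score in [-1.0, 1.0] to one of five labels."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("Sentiment_Score must lie in [-1.0, 1.0]")
    # bisect_right counts how many thresholds are <= score
    return SENTIMENT_LABELS[bisect.bisect_right(SENTIMENT_EDGES, score)]


print(sentiment_bin(0.0), sentiment_bin(-1.0), sentiment_bin(0.75))
```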

#### 3.1.5 Sales Pattern Features

- Sales lag features (1-day, 7-day, 30-day)
- Moving averages (7-day, 30-day)
- Expanding mean (cumulative average)
- Sales volatility (7-day standard deviation)
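
A possible implementation of these features is sketched below, assuming the data has already been aggregated to one row per day. The column names are hypothetical. The rolling statistics are computed on the series shifted by one day so that a row's features use only earlier sales; this is a design choice in the sketch, not something the source specifies.

```python
import pandas as pd


def add_sales_pattern_features(df, target="Total_Sales", lags=(1, 7, 30), windows=(7, 30)):
    """Add lag, moving-average, expanding-mean, and volatility features."""
    out = df.sort_values("Date").reset_index(drop=True)
    past = out[target].shift(1)  # only values strictly before the current row
    for lag in lags:
        out[f"Sales_Lag_{lag}"] = out[target].shift(lag)
    for w in windows:
        out[f"Sales_MA_{w}"] = past.rolling(w).mean()
    out["Sales_Expanding_Mean"] = past.expanding().mean()
    out["Sales_Volatility_7"] = past.rolling(7).std()
    return out


daily = pd.DataFrame({
    "Date": pd.date_range("2025-01-01", periods=10),
    "Total_Sales": [float(i) for i in range(1, 11)],
})
pattern_features = add_sales_pattern_features(daily, lags=(1,), windows=(3,))
print(pattern_features.tail(3))
```

The first rows of each lag or rolling column are `NaN` because there is not yet enough history; they need to be dropped or imputed before training.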

## 4. Machine Learning Models

The application uses an ensemble approach, combining multiple models to improve prediction accuracy:

In [None]:
# Conceptual sketch of the ensemble architecture
from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from xgboost import XGBRegressor

def create_ensemble_model(X_train, y_train):
    # Base models
    model1 = RandomForestRegressor(n_estimators=100, n_jobs=-1,
                                   max_depth=20, min_samples_split=5,
                                   min_samples_leaf=2)
    
    model2 = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=7,
                         min_child_weight=1, subsample=0.8, colsample_bytree=0.8,
                         gamma=0)
    
    model3 = CatBoostRegressor(iterations=100, learning_rate=0.1, depth=6,
                              loss_function='RMSE', eval_metric='RMSE',
                              random_strength=0.1)
    
    # Base models are not fitted individually: VotingRegressor.fit() below
    # fits its own clone of each one, so separate training would be wasted work
    
    # Create ensemble model
    ensemble = VotingRegressor([
        ('rf', model1),
        ('xgb', model2),
        ('catboost', model3)
    ])
    
    # Train ensemble model
    ensemble.fit(X_train, y_train)
    
    return ensemble

### 4.1 Model Performance Metrics

The system evaluates model performance using multiple metrics to provide a comprehensive assessment:

- **R² Score**: Proportion of variance explained (at most 1, higher is better; negative when the model fits worse than always predicting the mean)
- **Mean Absolute Error (MAE)**: Average absolute difference between predicted and actual sales in ₹
- **Mean Squared Error (MSE)**: Average squared error, penalizing larger mistakes more heavily
- **Root Mean Squared Error (RMSE)**: Square root of MSE, in the same units as the target variable (₹)
- **Mean Absolute Percentage Error (MAPE)**: Average percentage difference between predicted and actual values (undefined when an actual value is zero)
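
The five metrics can be computed from their definitions in a few lines. This is a plain-Python sketch for illustration; the function name is hypothetical, and skipping zero-valued actuals in MAPE is one possible convention, not necessarily the one the application uses.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R², MAE, MSE, RMSE, and MAPE for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R² = 1 - (residual sum of squares / total sum of squares)
    r2 = 1 - (mse * n) / ss_tot if ss_tot > 0 else float("nan")
    # MAPE is undefined when an actual value is zero, so those rows are skipped.
    pct = [abs(e / t) for e, t in zip(errors, y_true) if t != 0]
    mape = 100 * sum(pct) / len(pct) if pct else float("nan")
    return {"R2": r2, "MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}


metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
print(metrics)
```

MAE and RMSE are both in ₹, but RMSE reacts more strongly to the single 30-rupee miss in this example, which is why the two are reported together.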

### 4.2 Model Selection Criteria

While the application uses an ensemble by default for optimal performance, different models are better suited for different scenarios:

- **Small datasets** (<1000 rows): Linear Regression or SVR
- **Medium datasets**: Random Forest
- **Large datasets**: XGBoost
- **Many categorical features**: CatBoost
- **Maximum accuracy needed**: Ensemble (combines Random Forest, XGBoost, and CatBoost, as in the code above)

## 5. Visualization System

The application's visualizations are designed to be easy to read for users with no technical background:

In [None]:
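# Example visualization function (a sketch: assumes both inputs are pandas
# Series of sales in ₹ indexed by date)
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

def plot_sales_forecast(historical_data, predicted_data):
    # Plot settings for readability, applied only while this figure is built
    rc = {'font.size': 12, 'axes.labelsize': 14, 'axes.titlesize': 16,
          'xtick.labelsize': 12, 'ytick.labelsize': 12}

    # Custom colors for better differentiation
    historical_color = '#1f77b4'  # Blue
    prediction_color = '#ff7f0e'  # Orange
    # '#ffbf80' (light orange) is reserved for a confidence band, which can be
    # drawn with ax.fill_between when interval bounds are available

    # Format currency for Indian Rupees
    def rupee_format(x, pos):
        return f'₹{x:,.0f}'

    with plt.rc_context(rc):
        fig, ax = plt.subplots(figsize=(12, 8))
        ax.plot(historical_data.index, historical_data.values,
                color=historical_color, label='Historical sales')
        ax.plot(predicted_data.index, predicted_data.values,
                color=prediction_color, linestyle='--', label='Forecast')
        ax.yaxis.set_major_formatter(FuncFormatter(rupee_format))
        ax.grid(True, alpha=0.3)  # light grid instead of a named style sheet
        ax.set_xlabel('Date')
        ax.set_ylabel('Sales (₹)')
        ax.set_title('Sales Forecast')
        ax.legend()
        # Explanatory text for non-technical users can be added with fig.text()

    return fig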

### 5.1 Key Visualizations

The system provides several key visualizations:

1. **Sales Trends**: Time-series charts showing sales patterns over time by category
2. **Weather Impact**: Visualizations showing how weather conditions affect sales
3. **Sentiment Impact**: Charts displaying the relationship between social media sentiment and sales
4. **Feature Importance**: Bar charts showing which factors most strongly influence sales predictions
5. **Sales Forecast**: Future sales predictions with confidence intervals and clear markings for weekends and special events

Each visualization includes:
- Clear titles and subtitles
- Simplified language for non-technical users
- Consistent color schemes
- Indian Rupee (₹) formatting
- Explanatory text
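
For the rupee formatting, note that Python's `,` format specifier groups digits in thousands (₹1,234,567), whereas the Indian numbering convention groups the last three digits and then pairs (₹12,34,567). The helper below is a hypothetical sketch of the latter, not the application's formatter.

```python
def format_inr(amount):
    """Format a number as whole rupees with Indian digit grouping."""
    sign = "-" if amount < 0 else ""
    digits = str(int(round(abs(amount))))
    if len(digits) <= 3:
        return f"{sign}₹{digits}"
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Split everything before the last three digits into pairs, right to left
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return f"{sign}₹{','.join(groups)},{tail}"


print(format_inr(1234567))
print(format_inr(100000))
```

To use it on a Matplotlib axis, it can be wrapped as `FuncFormatter(lambda x, pos: format_inr(x))`.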

## 6. Database System

The application uses a local SQLite database with SQLAlchemy ORM for data persistence:

In [None]:
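# Conceptual database schema
from datetime import datetime

from sqlalchemy import Column, Date, DateTime, Float, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()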

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    email = Column(String, unique=True, nullable=False)
    password = Column(String, nullable=False)  # Store a salted hash, never plain text
    display_name = Column(String)

class SalesData(Base):
    __tablename__ = 'sales'
    id = Column(Integer, primary_key=True)
    date = Column(Date, nullable=False)
    user_id = Column(Integer, ForeignKey('users.id'))
    product_id = Column(String)
    category = Column(String)
    quantity = Column(Integer)
    price = Column(Float)

class WeatherData(Base):
    __tablename__ = 'weather'
    id = Column(Integer, primary_key=True)
    date = Column(Date, nullable=False)
    location = Column(String)
    user_id = Column(Integer, ForeignKey('users.id'))
    temperature = Column(Float)
    condition = Column(String)

class SentimentData(Base):
    __tablename__ = 'sentiment'
    id = Column(Integer, primary_key=True)
    date = Column(Date, nullable=False)
    user_id = Column(Integer, ForeignKey('users.id'))
    keywords = Column(String)
    sentiment_score = Column(Float)

class Model(Base):
    __tablename__ = 'models'
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    user_id = Column(Integer, ForeignKey('users.id'))
    type = Column(String, nullable=False)
    metrics = Column(String)  # Serialized metrics
    model_data = Column(String)  # Serialized model data
    created_at = Column(DateTime, default=datetime.now)  # Full timestamp, not just the date

## 7. User Interface Design

The Streamlit web application features a clean, intuitive interface organized into several key pages:

### 7.1 Login/Registration Page
- Simple email/password authentication
- Registration for new users

### 7.2 Home Dashboard
- Overview of uploaded data
- Quick access to key functions
- System status indicators

### 7.3 Data Upload Page
- File upload interface for sales, weather, and sentiment data
- Data format guidelines
- Data preview and validation

### 7.4 Data Visualization Page
- Interactive charts and graphs
- Filter controls for time periods and categories
- Insight generation

### 7.5 Sales Prediction Page
- Prediction parameter controls
- Forecast results with visualizations
- Confidence interval display
- Model explanation features

### 7.6 My Models Page
- Access to saved models
- Model performance metrics
- Model comparison tools

### 7.7 About Page
- System information
- Documentation
- Contact details

## 8. Improving Model Accuracy

To improve model accuracy, the system applies the following techniques:

### 8.1 Data Quality Improvements
- Robust date parsing with multi-stage fallbacks
- Outlier detection and handling using IQR method
- Missing value imputation strategies
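
The IQR method flags values more than `k` times the interquartile range below the first quartile or above the third. The sketch below caps such values at the bounds, assuming the conventional `k = 1.5`; the source does not say whether outliers are capped or dropped, and the function names are hypothetical.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) IQR fences for a numeric Series."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(values, k=1.5):
    """Clip values that fall outside the IQR fences to the nearest fence."""
    lower, upper = iqr_bounds(values, k)
    return values.clip(lower=lower, upper=upper)


daily_sales = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0, 1000.0])
capped = cap_outliers(daily_sales)
print(capped.tolist())
```

Capping keeps the row in the dataset, which matters for time-series features that expect one observation per day. Genuine spikes such as festival sales may deserve an indicator feature instead of being capped.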

### 8.2 Feature Engineering
- Polynomial features for capturing non-linear relationships
- Interaction features between weather and time variables
- Lag features to capture time-series patterns

### 8.3 Model Optimization
- Ensemble modeling combining multiple algorithms
- Hyperparameter optimization
- Cross-validation for robust evaluation
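
For time-ordered sales data, a shuffled k-fold split would let the model train on days that come after the ones it is tested on. One alternative is an expanding-window split, sketched below in plain Python; whether the application uses this scheme is an assumption, and the function name is hypothetical. scikit-learn's `TimeSeriesSplit` provides a similar scheme.

```python
def expanding_window_splits(n_samples, n_splits=3, test_size=None):
    """Yield (train_indices, test_indices) with training always before testing."""
    if test_size is None:
        test_size = n_samples // (n_splits + 1)
    if test_size < 1 or n_samples - n_splits * test_size < 1:
        raise ValueError("Not enough samples for the requested splits")
    for i in range(n_splits):
        # Each fold tests on the next block and trains on everything before it
        test_start = n_samples - (n_splits - i) * test_size
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


splits = list(expanding_window_splits(10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(f"train={train_idx} test={test_idx}")
```

The rows must be sorted by date before the indices are applied.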

### 8.4 User Guidance
- Clear instructions for data format and quality
- Feedback on data quality issues
- Suggestions for improving prediction accuracy

## 9. Deployment Architecture

The application is deployed as a Streamlit web application with the following components:

```
├── app.py                 # Main Streamlit application entry point
├── utils/                 # Utility modules
│   ├── __init__.py
│   ├── data_processing.py # Data processing functions
│   ├── database.py        # Database functions and models
│   ├── model.py           # Machine learning models
│   ├── sentiment_analysis.py # Sentiment analysis functions
│   ├── visualization.py   # Visualization functions
│   ├── weather_api.py     # Weather API interface
│   └── data/              # Data generators
│       └── data_generator.py
├── .streamlit/            # Streamlit configuration
│   └── config.toml        # Server configuration
└── sales_prediction.db    # SQLite database file
```

The application uses the following technology stack:
- **Python 3.11**: Core programming language
- **Streamlit**: Web framework
- **Pandas/NumPy**: Data processing
- **Scikit-learn/XGBoost/CatBoost**: Machine learning
- **Matplotlib/Seaborn**: Visualization
- **SQLAlchemy**: Database ORM
- **TextBlob**: Sentiment analysis
- **Requests**: HTTP API access

## 10. Future Enhancements

The system has several planned future enhancements:

1. **Advanced Time Series Models**: Integration of ARIMA, Prophet, and deep learning models
2. **Competitor Analysis**: Incorporating competitor data into predictions
3. **Economic Indicators**: Adding macroeconomic data like inflation and consumer confidence
4. **Multi-Store Support**: Enhanced features for businesses with multiple locations
5. **Real-time Updating**: Continuous model updating as new data becomes available
6. **Explainable AI**: More detailed model explanations for business users
7. **Mobile App Integration**: Companion mobile application for notifications and quick insights
8. **Supply Chain Integration**: Connecting predictions to inventory management systems

## Conclusion

This sales prediction system gives retailers a tool for forecasting sales from multiple data sources. By combining historical sales data with weather information and social media sentiment, the system generates forecasts that can help businesses plan inventory, staffing, and marketing.

The streamlined user interface, intuitive visualizations, and automated ensemble modeling make the system accessible to users without technical expertise, while the feature engineering and model optimization techniques described above are aimed at improving prediction accuracy.