# US Superstore Sales Analysis & Forecasting ~ 2019

### Executive Summary

This analysis evaluates one year of U.S. retail sales data to uncover performance drivers, customer behavior patterns, and product-level trends, and to build a predictive model for monthly product sales. The project integrates descriptive analytics, feature engineering, and machine learning to support data-driven inventory planning and revenue optimization.

Key findings show strong seasonal concentration in Q4, heavy revenue reliance on a small number of high-value products, and predictable demand patterns across most product categories. A Random Forest forecasting model outperformed linear approaches, achieving lower error and higher explanatory power, though it still systematically underpredicts high-volume sales.

These insights provide actionable guidance for inventory management, demand planning, and strategic growth, while also identifying clear opportunities for model and data improvements.


### Business Objectives

1. Understand overall sales performance and seasonal trends.
2. Identify top-performing products, locations, and customer behavior patterns.
3. Quantify revenue concentration risk across products and regions.
4. Build a forecasting model to predict monthly product-level sales.
5. Translate analytical results into actionable business recommendations.

### Data Overview

The dataset contains 185,916 valid retail transaction records covering the full year of 2019, with the following core fields:

- Order ID
- Product
- Quantity Ordered
- Price Each
- Order Date
- Purchase Address

The data was cleaned to remove missing values, incorrect timestamps, and malformed records. Additional derived fields such as sales value, city, state, and time-based features were created to support analysis and modeling.
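
The cleaning and derivation steps described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the sample rows, the `clean_and_enrich` helper, the `%m/%d/%y %H:%M` date format, and the `street, city, STATE zip` address layout are all assumptions.

```python
import pandas as pd

# Tiny hypothetical sample using the six core fields. The date and address
# formats are assumptions, not confirmed by the source data.
raw = pd.DataFrame({
    "Order ID": ["1001", "1002", "Order ID", None],
    "Product": ["USB-C Charging Cable", "Macbook Pro Laptop", "Product", None],
    "Quantity Ordered": ["2", "1", "Quantity Ordered", None],
    "Price Each": ["11.95", "1700", "Price Each", None],
    "Order Date": ["04/19/19 08:46", "12/07/19 19:30", "Order Date", None],
    "Purchase Address": [
        "917 1st St, Dallas, TX 75001",
        "682 Chestnut St, San Francisco, CA 94016",
        "Purchase Address",
        None,
    ],
})


def clean_and_enrich(df: pd.DataFrame) -> pd.DataFrame:
    """Drop missing/malformed rows, then derive sales, location, and time fields."""
    out = df.dropna(how="any").copy()
    # Malformed values (e.g. repeated header rows) become NaN/NaT and are dropped.
    out["Quantity Ordered"] = pd.to_numeric(out["Quantity Ordered"], errors="coerce")
    out["Price Each"] = pd.to_numeric(out["Price Each"], errors="coerce")
    out["Order Date"] = pd.to_datetime(
        out["Order Date"], format="%m/%d/%y %H:%M", errors="coerce"
    )
    out = out.dropna(subset=["Quantity Ordered", "Price Each", "Order Date"])

    out["Sales"] = out["Quantity Ordered"] * out["Price Each"]
    parts = out["Purchase Address"].str.split(",", expand=True)
    out["City"] = parts[1].str.strip()
    out["State"] = parts[2].str.strip().str.split(" ").str[0]
    out["Month"] = out["Order Date"].dt.month
    out["Hour"] = out["Order Date"].dt.hour
    return out.reset_index(drop=True)


clean = clean_and_enrich(raw)
```

Coercing bad values to `NaN`/`NaT` and dropping them afterwards keeps the cleaning rules in one place, which makes it easier to report how many rows each rule removed.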

### Key Insights

### Revenue & Volume Performance
Total annual revenue reached approximately $34.5M across 178,000+ orders and over 209,000 units sold, indicating a high-volume, fast-moving retail operation.

### Seasonal Trends
Sales increased steadily throughout the year, with a sharp surge in Q4. December generated the highest monthly revenue of the year, which is consistent with holiday demand being the primary driver of annual sales.
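
A seasonal summary of this kind could be computed roughly as below. The `orders` frame and its values are made up for illustration; only the idea of aggregating sales by month and quarter comes from the analysis.

```python
import pandas as pd

# Hypothetical order-level data (illustrative values only).
orders = pd.DataFrame({
    "Order Date": pd.to_datetime(
        ["2019-01-15", "2019-06-03", "2019-10-20", "2019-12-05", "2019-12-18"]
    ),
    "Sales": [100.0, 150.0, 200.0, 300.0, 250.0],
})

# Revenue per calendar month and per quarter.
monthly = orders.groupby(orders["Order Date"].dt.month)["Sales"].sum()
quarterly = orders.groupby(orders["Order Date"].dt.quarter)["Sales"].sum()

peak_month = int(monthly.idxmax())                   # month with highest revenue
q4_share = quarterly.get(4, 0.0) / quarterly.sum()   # Q4 share of annual revenue
```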

### Product Performance
Revenue is heavily concentrated among a small group of high-value electronics. A few top products account for a disproportionately large share of total sales, creating both strong growth drivers and concentration risk.
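
One way to put a number on this concentration is the revenue share of the top N products, optionally alongside a Herfindahl-Hirschman index (HHI). The sketch below uses invented figures, and the `revenue_concentration` helper is hypothetical; the report does not state which concentration measure was actually used.

```python
import pandas as pd

# Hypothetical product-level revenue (illustrative numbers only).
sales = pd.DataFrame({
    "Product": ["A", "B", "C", "D", "E"],
    "Sales": [500.0, 300.0, 100.0, 60.0, 40.0],
})


def revenue_concentration(df: pd.DataFrame, top_n: int = 2):
    """Return (share of revenue from the top_n products, HHI of revenue shares)."""
    by_product = df.groupby("Product")["Sales"].sum().sort_values(ascending=False)
    shares = by_product / by_product.sum()
    top_share = float(shares.head(top_n).sum())
    hhi = float((shares ** 2).sum())  # 1/n for equal shares, 1.0 for a single product
    return top_share, hhi


top_share, hhi = revenue_concentration(sales, top_n=2)
```

The same function can be applied to a `State` or `City` column instead of `Product` to measure regional concentration.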

### Geographic Performance
California dominates regional performance, with cities like San Francisco and Los Angeles contributing a significant share of total revenue. This indicates strong market penetration on the West Coast compared to other regions.

### Customer Purchase Behavior
Customer activity peaks during late morning and early evening hours, with demand remaining stable across days of the week, suggesting consistent purchasing behavior rather than strong weekday/weekend effects.

### Modeling Approach

The forecasting task was framed as a product-level, monthly time-series prediction problem. Sales were aggregated by product and month, and feature engineering was applied, including:

- Time features (month number, quarter)
- Lag features (previous 1–2 months’ sales)
- Rolling averages (3-month historical mean)
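
The feature set listed above can be sketched with pandas group-wise shifts. The data and the `add_features` helper are illustrative. Shifting by one month before taking the rolling mean is an assumption made here so that the average uses only prior months; the source does not say how the rolling window was aligned.

```python
import pandas as pd

# Hypothetical monthly sales for two products.
monthly = pd.DataFrame({
    "Product": ["A"] * 5 + ["B"] * 5,
    "Month": list(range(1, 6)) * 2,
    "Sales": [10.0, 20.0, 30.0, 40.0, 50.0, 5.0, 5.0, 8.0, 8.0, 11.0],
})


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add quarter, lag-1, lag-2, and a 3-month historical mean per product."""
    out = df.sort_values(["Product", "Month"]).copy()
    out["Quarter"] = (out["Month"] - 1) // 3 + 1
    grouped = out.groupby("Product")["Sales"]
    out["Lag1"] = grouped.shift(1)
    out["Lag2"] = grouped.shift(2)
    # Shift first so the window covers the three months before the target month.
    out["Roll3"] = grouped.transform(lambda s: s.shift(1).rolling(3).mean())
    # Early months lack a full history and are dropped.
    return out.dropna().reset_index(drop=True)


feats = add_features(monthly)
```

With these definitions, the first three months of each product have no complete feature row, which matters when only 12 months are available.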

Two models were trained and evaluated:

- Linear Regression (baseline)
- Random Forest Regressor (nonlinear ensemble model)

A time-based train-test split was used to preserve temporal order and avoid data leakage.
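
A time-based split and the two-model comparison might look like the sketch below. The synthetic data, the cutoff month, the feature list, and the hyperparameters are all assumptions for illustration and do not reproduce the project's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic monthly data for three hypothetical products with an upward trend.
rng = np.random.default_rng(0)
rows = []
for product, base in [("A", 100.0), ("B", 1000.0), ("C", 5000.0)]:
    prev = base
    for month in range(1, 13):
        sales = base * (1 + 0.05 * month) + rng.normal(0, base * 0.02)
        rows.append({"Product": product, "Month": month, "Lag1": prev, "Sales": sales})
        prev = sales
data = pd.DataFrame(rows)

# Split by time, not at random: train on earlier months, test on later ones.
cutoff = 9  # hypothetical cutoff month
train = data[data["Month"] <= cutoff]
test = data[data["Month"] > cutoff]
features = ["Month", "Lag1"]

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
predictions = {}
for name, model in models.items():
    model.fit(train[features], train["Sales"])
    predictions[name] = model.predict(test[features])
```

Because a random forest averages training targets, its predictions cannot exceed the largest value seen during training. With a rising trend and a late-year peak in the test period, this is one plausible contributor to underprediction at high volumes.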

### Model Performance Summary

The Random Forest model outperformed Linear Regression across all evaluation metrics:

- Lower MAE, indicating smaller average prediction errors.
- Lower RMSE, indicating fewer large forecasting errors, since RMSE penalizes large misses more heavily than MAE.
- Higher R², indicating stronger explanatory power.

However, diagnostic analysis revealed that while Random Forest performs well for low-to-mid sales volumes, it systematically underpredicts high sales values and exhibits increasing uncertainty at higher volumes.

| Model | MAE | RMSE | R² |
|---|---:|---:|---:|
| Linear Regression | 99,272.35 | 137,304.76 | 0.742 |
| Random Forest | 58,033.21 | 102,825.75 | 0.855 |
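
For reference, the three metrics follow the standard definitions sketched below. The `regression_metrics` helper and the sample numbers are illustrative only and are unrelated to the results reported above.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R² from actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    ss_res = float(np.sum(err ** 2))                        # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total variation
    return {"MAE": mae, "RMSE": rmse, "R2": 1.0 - ss_res / ss_tot}


# Illustrative values only.
metrics = regression_metrics([100, 200, 300, 400], [110, 190, 280, 360])
```

A large gap between RMSE and MAE, as in the Random Forest results, suggests that a few large misses account for much of the error.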

### Business Implications

1. Inventory Risk  
   The model’s tendency to underpredict high-demand periods poses a risk of understocking during peak seasons, particularly in Q4.

2. Revenue Concentration Risk  
   Heavy reliance on a small number of high-value products increases vulnerability to supply disruptions or demand shocks.

3. Regional Strategy  
   The dominance of California suggests opportunities to replicate high-performing strategies in underpenetrated states and cities.

4. Demand Planning  
   The predictable time-of-day purchasing behavior can be leveraged to optimize staffing, logistics, and promotional timing.

### Model Limitations

Despite strong overall performance, the forecasting models exhibit several important limitations.

First, the dataset is structurally small from a time-series perspective, containing only 12 monthly observations per product, and the lag and rolling-average features leave the earliest months without complete inputs. This severely limits the model’s ability to learn long-term seasonal patterns, trends, or cyclical behavior, and increases the risk of overfitting.

Second, the Random Forest model shows a systematic underprediction bias, particularly at higher sales volumes. This is evident in the residual plots and error distribution, where residuals (actual minus predicted) are consistently positive and grow larger as actual sales increase. The model is therefore conservative and tends to underestimate peak demand, which could lead to understocking or missed revenue opportunities.
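
A simple numeric check for this kind of bias is to group residuals by sales-volume bucket. The sketch below uses invented values and a two-bucket split chosen for brevity. Residuals are defined as actual minus predicted, so positive values indicate underprediction.

```python
import pandas as pd

# Hypothetical actual vs. predicted monthly sales (illustrative values only).
results = pd.DataFrame({
    "actual":    [100.0, 200.0, 300.0, 1000.0, 2000.0, 4000.0],
    "predicted": [105.0, 195.0, 290.0,  900.0, 1700.0, 3200.0],
})
results["residual"] = results["actual"] - results["predicted"]  # > 0 means underprediction

# Split observations into equal-sized low- and high-volume buckets.
results["bucket"] = pd.qcut(results["actual"], q=2, labels=["low", "high"])
summary = results.groupby("bucket", observed=True)["residual"].agg(["mean", "median"])
```

A mean residual near zero in the low bucket and clearly positive in the high bucket would match the pattern described above.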

Third, the model does not incorporate external drivers of demand, such as promotions, holidays, marketing campaigns, pricing changes, or macroeconomic conditions. Without these contextual features, the model is limited to learning only from historical sales patterns, which constrains predictive accuracy.

Finally, the evaluation is based on a single time-based split, rather than rolling cross-validation or backtesting, which reduces confidence in how well the model would generalize to unseen future periods.
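
Walk-forward validation with an expanding training window can be sketched as a small split generator. The `walk_forward_splits` helper and its default parameters are hypothetical; scikit-learn's `TimeSeriesSplit` provides similar behavior for row-indexed data.

```python
def walk_forward_splits(months, min_train=6, horizon=1):
    """Yield (train_months, test_months) pairs with an expanding training window."""
    months = sorted(months)
    for end in range(min_train, len(months) - horizon + 1):
        yield months[:end], months[end:end + horizon]


# With 12 months and at least 6 training months, this gives 6 one-month folds.
splits = list(walk_forward_splits(range(1, 13), min_train=6, horizon=1))
for train_months, test_months in splits:
    # Fit on rows whose month is in train_months, evaluate on test_months.
    pass
```

Averaging the error across folds gives a more stable estimate than a single split, although with only 12 months each fold remains small.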

### Next Steps

To improve model reliability and business value, the following steps are recommended:

1. **Expand the time horizon**

    Acquire multi-year sales data to allow the model to learn robust seasonal patterns, trends, and long-term behavior.

2. **Incorporate external features**

    Add variables such as promotions, holidays, pricing changes, marketing spend, and regional events to better explain demand fluctuations.

3. **Use specialized time-series models**
    
    Explore models better suited to temporal forecasting, such as SARIMA or Prophet, gradient-boosted trees (e.g., XGBoost) with lag features, or LSTM-based approaches for sequence learning.

4. **Apply rolling backtesting**

    Implement walk-forward validation to better evaluate how the model performs across multiple forecasting windows.

5. **Segment forecasting strategy**
    
    Develop separate models for high-volume versus low-volume products to reduce bias and improve accuracy where it matters most.

6. **Optimize for business risk**
    
    Shift from pure accuracy metrics to cost-sensitive evaluation, prioritizing errors that cause lost sales or operational disruptions.
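
As an illustration of cost-sensitive evaluation, the sketch below weights underprediction (lost sales) more heavily than overprediction (excess stock). The `asymmetric_cost` helper and the 3:1 cost ratio are hypothetical placeholders; real weights would need to come from margin and holding-cost data.

```python
import numpy as np


def asymmetric_cost(y_true, y_pred, under_cost=3.0, over_cost=1.0):
    """Mean cost where each unit of underprediction costs more than overprediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred  # positive = underprediction
    cost = np.where(err > 0, under_cost * err, over_cost * -err)
    return float(cost.mean())


actual = [100.0, 200.0, 300.0]
conservative = [80.0, 180.0, 280.0]   # under by 20 each
aggressive = [120.0, 220.0, 320.0]    # over by 20 each

cost_conservative = asymmetric_cost(actual, conservative)
cost_aggressive = asymmetric_cost(actual, aggressive)
```

Both forecasts have the same MAE of 20, but the conservative one costs three times as much under this weighting, which is the distinction a pure accuracy metric hides.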


### Final Recommendation

The Random Forest model should be adopted as the baseline forecasting tool for monthly product-level demand, with the understanding that its predictions are conservative and less reliable for high-volume products. Decision-makers should apply safety buffers when planning inventory for peak seasons.
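
One possible way to size such a safety buffer is to scale forecasts by a high quantile of past actual-to-predicted ratios. This is a sketch under stated assumptions: the `buffered_forecast` helper, the 0.9 quantile, and the sample numbers are all hypothetical, and the report does not prescribe a specific buffering rule.

```python
import numpy as np


def buffered_forecast(forecast, past_actual, past_pred, quantile=0.9):
    """Scale forecasts up by a quantile of historical actual/predicted ratios."""
    ratios = np.asarray(past_actual, dtype=float) / np.asarray(past_pred, dtype=float)
    # Never scale forecasts down: the buffer only guards against understocking.
    factor = max(1.0, float(np.quantile(ratios, quantile)))
    return np.asarray(forecast, dtype=float) * factor, factor


# Illustrative history where the model tended to underpredict.
planned, factor = buffered_forecast(
    forecast=[1000.0],
    past_actual=[110.0, 120.0, 100.0, 130.0],
    past_pred=[100.0, 100.0, 100.0, 100.0],
)
```

Computing the ratios separately for high-volume products would reflect the finding that underprediction grows with sales volume.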

To unlock higher business value, the organization should prioritize expanding historical data coverage, integrating external demand drivers, and transitioning to specialized time-series forecasting models. These improvements would significantly enhance predictive accuracy, reduce operational risk, and support more confident strategic planning.