# FreshRetailNet Multi-Region Demand Forecasting  
### Data Science Portfolio Project – Summary Overview

This project builds a complete **end-to-end machine learning forecasting pipeline** 
for FreshRetailNet-50K, a large multi-region fresh retail dataset, using
Google Cloud Storage, BigQuery, Python, and Power BI.

**Objective:**  
Predict product demand at the daily `(store_id, product_id, date)` and hourly `(store_id, product_id, date, hour)` levels 
to reduce waste, prevent stockouts, improve inventory decisions, 
and plan staffing and linehaul (driver and trailer) needs.

**Technologies:**  
Hugging Face, Jupyter, Python, Google Cloud Storage, Google BigQuery (SQL), Vertex AI, pandas, Matplotlib, scikit-learn, Power BI.


## 1. Business Problem

Fresh retailers struggle with overstock (waste) and understock (lost sales). 
They also struggle with accurate labor planning for warehouses, store locations, and linehauls/trailers.
Accurate hourly forecasts help optimize:

- Ordering and replenishment  
- Labor planning  
- Stockout prevention  
- Reduction of perishable waste
- Trailer planning

**Key Questions:**
- “How many units will each product sell in each store each day?”
- “How many units will each product sell in each store each hour of the day?”



## 2. Data Pipeline (High-Level)

1. **Raw Data (GCS):** Parquet file containing multi-region retail sales, imported into GCS from Hugging Face.  
2. **BigQuery:**
   - Load the raw Parquet file into a staging table  
   - Create a deduplicated table, partitioned by date and clustered by store and product  
   - Join with a holiday table so the holiday flag is available as a candidate feature for future models  
   - Produce the final ML table: `FreshRetailData_for_machine_learning_multi_region`
   - Produce a sampled version of the final ML table for analysis and model building: `fresh_retail_sampled`
3. **Jupyter Modeling:**
   - Load data via BigQuery client
   - Remove outliers using triple exponential smoothing
   - Clean the data and engineer features using triple exponential smoothing, linear regression, and moving averages
   - Transform the data into a structured format suited to machine learning
   - Train and compare forecasting models
4. **Hour-of-Day Curve:**
   - Data exploration, feature engineering, and model development are conducted at the daily level, so the hourly forecast has to be derived from the daily forecast.
   - Build an hour-of-day curve by dividing each hour's volume by that day's total volume.
   - Compute this curve for each of the most recent 3 weeks of actuals, then take a weighted average across those 3 weeks.
   - Apply the resulting curve to the daily forecast to get the more granular hourly forecast.
5. **Power BI Dashboard:**
   - Visualize forecasts vs. actuals at both the daily and hourly level  
   - Provide store/product filtering
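
The hour-of-day curve described in step 4 can be sketched in pandas as follows. This is a minimal illustration rather than the project's actual code. It assumes long-format hourly actuals for a single store and product (one row per `date_feature` and `hour`, with volume in `hours_sale`), builds a separate curve for each weekday, and uses placeholder weights of 0.5, 0.3, and 0.2 with the most recent week first. The function names `hour_of_day_curve` and `apply_curve` are hypothetical.

```python
import pandas as pd

def hour_of_day_curve(hourly, weights=(0.5, 0.3, 0.2)):
    """Weighted-average share of daily volume per (weekday, hour).

    hourly: columns date_feature, hour, hours_sale (one store/product).
    weights: one weight per week, most recent first (illustrative values).
    """
    df = hourly.copy()
    df["date_feature"] = pd.to_datetime(df["date_feature"])
    # Share of each hour within its own day; days with no sales are skipped.
    df["daily_total"] = df.groupby("date_feature")["hours_sale"].transform("sum")
    df = df[df["daily_total"] > 0].copy()
    df["share"] = df["hours_sale"] / df["daily_total"]
    # 0 = most recent 7 days, 1 = the 7 days before that, and so on.
    last_day = df["date_feature"].max()
    df["weeks_back"] = (last_day - df["date_feature"]).dt.days // 7
    df = df[df["weeks_back"] < len(weights)].copy()
    df["w"] = df["weeks_back"].map(dict(enumerate(weights)))
    df["weighted_share"] = df["share"] * df["w"]
    df["weekday"] = df["date_feature"].dt.dayofweek
    sums = df.groupby(["weekday", "hour"])[["weighted_share", "w"]].sum()
    curve = sums["weighted_share"] / sums["w"]
    # Renormalize so the shares of each weekday sum to 1.
    curve = curve / curve.groupby(level="weekday").transform("sum")
    return curve.rename("share").reset_index()

def apply_curve(daily_forecast, curve):
    """Spread a daily forecast (columns date_feature, Forecast) across hours."""
    out = daily_forecast.copy()
    out["date_feature"] = pd.to_datetime(out["date_feature"])
    out["weekday"] = out["date_feature"].dt.dayofweek
    out = out.merge(curve, on="weekday", how="left")
    out["hourly_forecast"] = out["Forecast"] * out["share"]
    return out[["date_feature", "hour", "hourly_forecast"]]
```

Because each weekday's shares are renormalized to sum to 1, the hourly forecasts for a day add back up to that day's forecast. For multiple stores and products, the same functions could be applied per group.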



## 3. Dataset & Feature Engineering

**Key fields used:**
- `store_id`, `product_id`, `date_feature`, `hour`, `hours_sale`

**Other important fields:**
- Time features: `week_sat_fri`, `week_day_name`, `date_feature`, `hour`, `Weekend Classifier`  
- Lag features: 1-day and 7-day lags  
- Rolling means: 7-day and 14-day simple moving averages (SMAs)

The lagged forecast features produced by the triple exponential smoothing algorithm give the machine learning model additional information and patterns.

These features capture **trend**, **seasonality**, and **level**.
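
A minimal pandas sketch of the lag and rolling-mean features listed above. The target column name `daily_sale` and the feature names `lag_1`, `lag_7`, and `SMA_14` are assumptions made for illustration. `SMA_7`, `SMA_7_7day_lag`, and `Weekend Classifier` follow the feature names used in this project, though the exact definitions in the notebook may differ (for example, Saturday and Sunday are assumed to be the weekend here).

```python
import pandas as pd

def add_lag_and_rolling_features(daily, target="daily_sale"):
    """Add lag, moving-average, and weekend features per (store_id, product_id).

    Assumes one row per day per group with no missing dates, because
    shift() and rolling() work on row positions, not on calendar days.
    """
    keys = ["store_id", "product_id"]
    df = daily.sort_values(keys + ["date_feature"]).reset_index(drop=True)
    grp = df.groupby(keys)[target]

    # Lags are computed inside each group so values never leak across series.
    df["lag_1"] = grp.shift(1)
    df["lag_7"] = grp.shift(7)

    # Simple moving averages over the trailing 7 and 14 days.
    df["SMA_7"] = grp.transform(lambda s: s.rolling(7).mean())
    df["SMA_14"] = grp.transform(lambda s: s.rolling(14).mean())

    # Lagging the moving average by 7 days keeps it usable as a predictor
    # for a date whose actual sales are not yet known.
    df["SMA_7_7day_lag"] = df.groupby(keys)["SMA_7"].shift(7)

    # dayofweek: Monday = 0 ... Sunday = 6.
    weekday = pd.to_datetime(df["date_feature"]).dt.dayofweek
    df["Weekend Classifier"] = (weekday >= 5).astype(int)
    return df
```

The first rows of each series have missing values for the lag and rolling columns, so they would typically be dropped or imputed before model training.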



## 4. Modeling Summary

### Classical Time-Series Analysis
- Triple Exponential Smoothing (TES, also known as Holt-Winters) was used for outlier removal and feature engineering.
- Many of the features in this analysis were engineered from the TES components.
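
To make the level, trend, and season components concrete, here is a minimal NumPy sketch of additive triple exponential smoothing with a weekly period. It is an illustration under stated assumptions, not the notebook's implementation: the smoothing parameters are placeholders, the initialization is a simple two-season heuristic, and the outlier rule (replace points whose one-step-ahead error exceeds `k` standard deviations) is only one possible way to use the fit for outlier removal. The function names are hypothetical.

```python
import numpy as np

def holt_winters_additive(y, period=7, alpha=0.3, beta=0.1, gamma=0.2):
    """Additive Holt-Winters; returns per-step Forecast, Level, Trend, Season.

    Forecast[t] is the one-step-ahead forecast made before observing y[t].
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    if n < 2 * period:
        raise ValueError("need at least two full seasons of data")
    # Simple initialization from the first two seasons (an assumption).
    first, second = y[:period], y[period:2 * period]
    level = first.mean()
    trend = (second.mean() - first.mean()) / period
    season = list(first - level)
    out = {k: np.empty(n) for k in ("Forecast", "Level", "Trend", "Season")}
    for t in range(n):
        s_prev = season[t % period]
        out["Forecast"][t] = level + trend + s_prev
        new_level = alpha * (y[t] - s_prev) + (1 - alpha) * (level + trend)
        new_trend = beta * (new_level - level) + (1 - beta) * trend
        season[t % period] = gamma * (y[t] - level - trend) + (1 - gamma) * s_prev
        level, trend = new_level, new_trend
        out["Level"][t], out["Trend"][t] = level, trend
        out["Season"][t] = season[t % period]
    return out

def replace_outliers(y, fit, k=3.0):
    """Replace points whose forecast error exceeds k residual std devs."""
    y = np.asarray(y, dtype=float)
    resid = y - fit["Forecast"]
    is_outlier = np.abs(resid) > k * resid.std()
    cleaned = np.where(is_outlier, fit["Forecast"], y)
    return cleaned, is_outlier
```

The returned `Forecast`, `Level`, `Trend`, and `Season` arrays correspond in spirit to the engineered features of the same names, which could then be lagged by 7 days before being passed to a model.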

Several additional features were built and compared:
- Linear regression and moving average algorithms were run at the product-by-store level to create features.

**Engineered features:**
- `week_day_name`, `Forecast`, `Level`, `Trend`, `Season`, `date_rank`, `SMA_7` (7-day moving average), `Regression`, `Average Seasonality`
- `Forecast_7day_lag`, `Level_7day_lag`, `Trend_7day_lag`, `Season_7day_lag`, `Regression_7day_lag`, `SMA_7_7day_lag`
- `Weekend Classifier`
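
As an illustration of the per-series regression feature, the sketch below fits a straight line to each store-and-product series and stores the fitted value. It assumes `date_rank` is the 1-based position of each date within its series, that `Regression` is the fitted value of a simple linear trend, and that the target column is named `daily_sale`. These definitions are guesses for illustration and may not match the notebook.

```python
import numpy as np
import pandas as pd

def add_regression_feature(daily, target="daily_sale"):
    """Fit y = intercept + slope * date_rank per (store_id, product_id)."""
    keys = ["store_id", "product_id"]
    df = daily.sort_values(keys + ["date_feature"]).reset_index(drop=True)
    # 1-based position of each date inside its own series.
    df["date_rank"] = df.groupby(keys).cumcount() + 1
    df["Regression"] = np.nan
    for _, idx in df.groupby(keys).groups.items():
        x = df.loc[idx, "date_rank"].to_numpy(dtype=float)
        y = df.loc[idx, target].to_numpy(dtype=float)
        if len(x) < 2:
            continue  # a line cannot be fitted to a single point
        slope, intercept = np.polyfit(x, y, deg=1)
        df.loc[idx, "Regression"] = intercept + slope * x
    return df
```

Fitting on the full history uses future observations for early dates, so in a forecasting setup the fit would normally be restricted to the training window or the feature lagged, as with `Regression_7day_lag`.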


### Machine Learning
Several machine learning models were explored and tested during development,
including Decision Tree, Random Forest, and Gradient Boosting.
Only one, the Neural Network, was selected for the production model;
it can be viewed in the production model's Jupyter notebook.

- Neural Network (MLPRegressor with tuned parameters)

### Evaluation
The train/test split respects time order.  
Metrics used: **RMSE**, **MAE**, and **Bias**.
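
A minimal sketch of a time-ordered split and the three metrics. The 80/20 split by date, the sign convention for bias (predicted minus actual), and the scaling of RMSE and MAE by mean actual demand to obtain percentages are all assumptions for illustration; the notebook may define them differently. The function names are hypothetical.

```python
import numpy as np
import pandas as pd

def time_ordered_split(df, date_col="date_feature", test_fraction=0.2):
    """Put the most recent dates in the test set; never shuffle rows."""
    dates = pd.to_datetime(df[date_col])
    unique_dates = dates.drop_duplicates().sort_values().reset_index(drop=True)
    cut = min(int(len(unique_dates) * (1 - test_fraction)), len(unique_dates) - 1)
    cutoff = unique_dates.iloc[cut]
    is_test = dates >= cutoff
    return df[~is_test], df[is_test]

def forecast_metrics(actual, predicted):
    """RMSE, MAE, and Bias, plus versions scaled by mean actual demand."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = predicted - actual  # positive values mean over-forecasting
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    bias = float(np.mean(err))
    scale = float(np.mean(actual))
    pct = (lambda v: v / scale) if scale != 0 else (lambda v: float("nan"))
    return {"RMSE": rmse, "MAE": mae, "Bias": bias,
            "RMSE_pct": pct(rmse), "MAE_pct": pct(mae), "Bias_pct": pct(bias)}
```

Splitting on unique dates rather than on rows keeps every store and product on the same cutoff, so no test-period information reaches the training set.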

**Top two performing models:**
- **Neural Network (MLP)** (the only one shown)
- **Gradient Boosting** (not included in this production model)

Both consistently outperformed baselines and classical models.


## 5. Key Results

**Store-by-product selection:** `store_id = 7` and `product_id = 4`

This combination is used as an example because its history shows clear seasonality, trend, and changes in level.
More details are available in the notebook `Retail_Sales_Production_Model`.

| Model                | RMSE (Test) | MAE (Test) |
|----------------------|-------------|------------|
| Neural Network (MLP) | 25%         | 17%        |

**Interpretation:**  
Tree-based and neural methods capture nonlinear relationships between 
the engineered features better than classical time-series models.
