# Walmart Weekly Sales Forecasting  
## Notebook 3: Model Evaluation and Final Results

**Objective:** Validate the best-performing model using cross-validation and hyperparameter tuning, then summarize findings and business recommendations.


In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import KFold, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

full_df = pd.read_csv("/content/sample_data/walmart_prepared.csv")


### 7.1 Model Evaluation

**Model:** RandomForestRegressor (100 trees)  
**Target:** Weekly_Sales  

**Validation metrics:**
- RMSE: 5063.06
- R²: 0.950

**Interpretation:**
- RMSE (root mean squared error) measures the typical size of the prediction error in the same units as sales (dollars), weighting large misses more heavily than small ones. A lower RMSE means the model predicts weekly sales more accurately.

- R² shows what proportion of the variability in weekly sales is explained by the model. Values closer to 1 indicate better explanatory power.
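
As a minimal sketch of how these two metrics are computed, the snippet below uses a few made-up sales figures (hypothetical values, not taken from the Walmart data) and checks a hand-computed R² against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical weekly sales (dollars) and predictions, for illustration only
y_true = np.array([20000.0, 15000.0, 30000.0, 25000.0])
y_pred = np.array([21000.0, 14000.0, 27000.0, 25000.0])

# RMSE: square root of the mean squared error, in the same units as the target
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# scikit-learn's implementation should agree with the manual formula
r2 = r2_score(y_true, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"R2 (manual): {r2_manual:.3f}, R2 (sklearn): {r2:.3f}")
```

Because the errors are squared before averaging, the single 3,000-dollar miss in this toy example contributes far more to the RMSE than the two 1,000-dollar misses combined.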


These results serve as the **baseline** against which the tuned model below is compared.

### Summary

Based on the initial EDA and baseline modeling:

- Walmart weekly sales show **strong seasonality**, with clear spikes around major holidays.
- **Holiday weeks** and **promotional markdowns** are associated with higher sales.
- **Store Type** and **Store Size** significantly influence the level of weekly sales.

- Economic indicators (CPI, Fuel_Price, Unemployment) show weaker direct correlation with sales, but may still be useful in combination with other features.

- The baseline Random Forest model (RMSE ≈ 5,063, R² ≈ 0.95) captures a substantial portion of the variability in weekly sales and provides a solid starting point for further improvement.

These insights support the original problem statement:  
**Historical patterns, store characteristics, holidays, and promotions can be used to forecast sales and guide inventory and promotional strategy.**


### Cross-Validation and Grid Search (Random Forest)

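We tune the Random Forest hyperparameters with a grid search and score each candidate using 5-fold cross-validation. Averaging the error over several folds gives a more reliable estimate of out-of-sample performance than a single train/validation split and reduces the risk of selecting an overfit configuration.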


In [None]:
# Define features/target consistent with Notebook 2
feature_cols = [
    "Store", "Dept", "IsHoliday", "Temperature", "Fuel_Price", "CPI", "Unemployment", "Size",
    "Year", "Month", "WeekOfYear",
]
target_col = "Weekly_Sales"

# Ensure engineered columns exist
missing = [c for c in feature_cols + [target_col] if c not in full_df.columns]
if missing:
    raise ValueError(f"Missing required columns: {missing}")

X = full_df[feature_cols]
y = full_df[target_col]

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 20, 40],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2]
}

rf = RandomForestRegressor(random_state=42, n_jobs=-1)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(
    rf,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=cv,
    n_jobs=-1
)
grid.fit(X, y)

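# best_estimator_ is the winning configuration refit on all of X, y
best_model = grid.best_estimator_
# The scorer returns negative RMSE (higher is better), so flip the sign
best_rmse = -grid.best_score_

print("Best Params:", grid.best_params_)
print("CV RMSE:", round(best_rmse, 2))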
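
To see how all candidates compare rather than only the winner, a fitted search object exposes `cv_results_`. The sketch below is an illustration on small synthetic data (`make_regression`, not the Walmart dataset) with a reduced grid so that it runs quickly; the names `X_demo`, `y_demo`, `demo_grid`, and `summary` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (NOT the Walmart dataset), kept small for speed
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

demo_grid = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    param_grid={"max_depth": [None, 5], "min_samples_leaf": [1, 2]},
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
)
demo_grid.fit(X_demo, y_demo)

# cv_results_ stores per-candidate statistics across folds;
# scores are negative RMSE, so flip the sign for readability
results = pd.DataFrame(demo_grid.cv_results_)
results["mean_rmse"] = -results["mean_test_score"]

summary = (
    results[["params", "mean_rmse", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(summary)
```

Applied to the `grid` object above, the same `pd.DataFrame(grid.cv_results_)` call would show whether the best configuration wins by a clear margin or falls within the fold-to-fold spread of the other candidates.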


### Final Recommendations (Nontechnical)

- Use the best-performing model to forecast weekly sales and support inventory planning.  
- Prioritize holiday periods and promotional windows due to strong seasonal effects.  
- Consider building department-level models for improved accuracy and targeted decisions.
