## 📘 Project Phase: Final Model Expansion

In this notebook, I build on the previously improved XGBoost model (RMSLE of 0.2381) by integrating the additional data sources provided with the competition:

- `stores.csv` – store-level metadata (e.g. city, type, cluster)
- `oil.csv` – daily oil prices (macroeconomic signal)
- `transactions.csv` – store-level daily customer volume

### 🛠️ Planned Steps:
1. **Merge external datasets** into the training and test sets.
2. **Engineer new features** from these sources (e.g. oil trends, transaction lags).
3. **Refine the model** by re-tuning hyperparameters with `TimeSeriesSplit` cross-validation.
4. **Evaluate** the impact of the new data on forecast accuracy (RMSLE).
5. **Re-train on the full data** and prepare the final Kaggle submission.
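
Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the tuning code used later: the `rmsle` helper, the toy target `y`, and the mean-value baseline standing in for the model are all assumptions made for the example. It shows how `TimeSeriesSplit` always validates on rows that come after the training rows, and how RMSLE is computed per fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; negative predictions are clipped to 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Toy target ordered by time (hypothetical values, not competition data).
y = np.array([0.0, 3.0, 5.0, 2.0, 8.0, 13.0, 7.0, 4.0, 9.0, 11.0])

# Each fold trains on an initial segment and validates on the block right after it.
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(y))

fold_scores = []
for train_idx, val_idx in folds:
    # Placeholder "model": predict the training mean for every validation row.
    baseline = np.full(len(val_idx), y[train_idx].mean())
    fold_scores.append(rmsle(y[val_idx], baseline))

print(fold_scores)
```

In the real workflow the baseline would be replaced by an `XGBRegressor` fitted on the training indices of each fold, with the rows sorted by date beforehand so that the split respects time order.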

The goal is to bring the RMSLE below the current 0.2381 by capturing more contextual signals (store metadata, oil prices) and behavioral signals (daily customer volume).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRegressor

In [4]:
# Training data, indexed by date
df = pd.read_csv('train.csv', parse_dates=['date'])
df.set_index('date', inplace=True)

# Holiday and event calendar
holidays = pd.read_csv('holidays_events.csv', parse_dates=['date'])
holidays.set_index('date', inplace=True)

# Daily oil prices (macroeconomic signal)
oil_price = pd.read_csv('oil.csv', parse_dates=['date'])
oil_price.set_index('date', inplace=True)

# Store-level metadata (no date column, so no date index)
stores = pd.read_csv('stores.csv')

# Store-level daily customer volume
transactions = pd.read_csv('transactions.csv', parse_dates=['date'])
transactions.set_index('date', inplace=True)
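
With the files loaded, steps 1 and 2 of the plan could look roughly like the sketch below. It runs on small made-up frames rather than the competition files, and it assumes `date` is a regular column (for example after `reset_index()`) and that the column names `store_nbr`, `dcoilwtico`, and `transactions` match the CSVs. The feature names `oil_roll3` and `transactions_lag1` are hypothetical, and the window and lag lengths are illustrative only.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2017-01-02", periods=5, freq="D")

# Toy stand-ins for the loaded frames (all values are made up).
train = pd.DataFrame({
    "date": dates.repeat(2),
    "store_nbr": [1, 2] * 5,
    "sales": [10.0, 20.0, 12.0, 18.0, 11.0, 25.0, 9.0, 22.0, 14.0, 21.0],
})
oil = pd.DataFrame({"date": dates, "dcoilwtico": [52.0, np.nan, 53.0, np.nan, 51.0]})
stores = pd.DataFrame({"store_nbr": [1, 2], "city": ["Quito", "Guayaquil"], "cluster": [13, 6]})
transactions = pd.DataFrame({
    "date": dates.repeat(2),
    "store_nbr": [1, 2] * 5,
    "transactions": [100, 200, 110, 190, 105, 230, 95, 210, 120, 205],
})

# Fill gaps in the oil series before merging: forward-fill, then back-fill any leading gap.
oil["dcoilwtico"] = oil["dcoilwtico"].ffill().bfill()
# Simple trend feature: 3-day rolling mean of the filled price.
oil["oil_roll3"] = oil["dcoilwtico"].rolling(3, min_periods=1).mean()

# Lag transactions within each store so a row only sees the previous day's volume.
transactions = transactions.sort_values(["store_nbr", "date"])
transactions["transactions_lag1"] = transactions.groupby("store_nbr")["transactions"].shift(1)

# Left joins keep every training row, even when a lookup has no match.
merged = (
    train
    .merge(stores, on="store_nbr", how="left")
    .merge(oil, on="date", how="left")
    .merge(
        transactions[["date", "store_nbr", "transactions_lag1"]],
        on=["date", "store_nbr"],
        how="left",
    )
)

print(merged.head())
```

Only the lagged transaction count is merged because same-day customer volume would not be known at prediction time. For a multi-day forecast horizon the lag would need to be at least as long as the horizon.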