# 02 – Model Baseline (Univariate)

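This notebook builds a **basic time-series forecasting baseline**, using the **`last_updated`** timestamp to keep observations in time order.  
We predict a target (e.g., `temperature_celsius`) from its **lag-1 value** (the value of the immediately preceding observation, which is not necessarily one day earlier because timestamps are irregular), building: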
- **Naive (persistence)** model
- *(Optional)* **Linear Regression** baseline

We evaluate with **RMSE**, **MAE**, **MAPE**, and **R²**.  
This sets a **performance floor** for later multivariate and advanced models and meets the technical assessment requirements.
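
As a reference for how the four metrics are defined, here is a minimal sketch written with NumPy only so the formulas are visible. The helper name `regression_metrics` and its `eps` parameter are illustrative assumptions, not part of this notebook; the MAPE denominator is clipped in the same way as in the evaluation cell further down.

```python
import numpy as np

def regression_metrics(y_true, y_pred, eps=1e-8):
    """Hypothetical helper: RMSE, MAE, MAPE (in %) and R² for one target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    # Clip |y_true| away from zero so MAPE does not divide by zero.
    denom = np.clip(np.abs(y_true), eps, None)

    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean

    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(np.mean(np.abs(err) / denom) * 100),
        "R2": float(1.0 - ss_res / ss_tot),
    }
```

Because MAPE divides by `|y_true|`, it becomes very large for temperatures near 0 °C, so it is best read alongside RMSE and MAE rather than on its own.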


In [3]:
# Load the basics and set the path

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
import sys; sys.path.append("..")
from src.data import ASSETS_DIR

CLEAN_PATH_CSV = ASSETS_DIR / "clean_weather.csv"
TIME_COL = "last_updated"         
TARGET = "temperature_celsius"     


In [4]:
# Load & time-index by last_updated

df = pd.read_csv(CLEAN_PATH_CSV)

if TIME_COL not in df.columns:
    raise ValueError(f"Expected a '{TIME_COL}' column. Columns: {df.columns.tolist()}")

if TARGET not in df.columns:
    raise ValueError(f"Target '{TARGET}' not found. Columns: {df.columns.tolist()}")

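# Unparseable timestamps become NaT under errors="coerce"; drop them before indexing
df[TIME_COL] = pd.to_datetime(df[TIME_COL], errors="coerce")
df = df.dropna(subset=[TIME_COL])
df = df.sort_values(TIME_COL).set_index(TIME_COL)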

df.head(3)


| last_updated | location_name | country | latitude | longitude | temperature_celsius | feels_like_celsius | humidity | pressure_mb | wind_kph | precip_mm | cloud | uv_index | year | month | dayofyear | dow | sin_doy | cos_doy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2024-05-16 09:45:00+00:00 | London | United Kingdom | 51.52 | -0.11 | 14.0 | 14.5 | 88 | 1005.0 | 4.0 | 0.025 | 50 | 3.0 | 2024 | 5 | 137 | 3 | 0.706727 | -0.707487 |
| 2024-05-16 15:15:00+00:00 | London | United Kingdom | 51.52 | -0.11 | 15.0 | 15.1 | 77 | 1005.0 | 11.2 | 0.01 | 50 | 3.0 | 2024 | 5 | 137 | 3 | 0.706727 | -0.707487 |
| 2024-05-16 17:45:00+00:00 | Tokyo | Japan | 35.69 | 139.69 | 24.0 | 25.3 | 47 | 1001.0 | 33.1 | 0.0 | 25 | 2.5 | 2024 | 5 | 137 | 3 | 0.706727 | -0.707487 |


In [10]:
# Create a univariate lag feature
# Note: shift(1) takes the previous row in time order, regardless of location

df["lag1"] = df[TARGET].shift(1)
df = df.dropna(subset=["lag1", TARGET])
len(df), df[[TARGET, "lag1"]].head(3)

(908,
                            temperature_celsius  lag1
 last_updated                                        
 2024-05-16 15:15:00+00:00                 15.0  14.0
 2024-05-16 17:45:00+00:00                 24.0  15.0
 2024-05-16 23:00:00+00:00                 18.3  24.0)

In [11]:
# Train/Test split by time (no shuffle)

split = int(len(df) * 0.8)
train = df.iloc[:split]
test  = df.iloc[split:]

X_train, y_train = train[["lag1"]], train[TARGET]
X_test,  y_test  = test[["lag1"]],  test[TARGET]

(len(train), len(test)), (train.index.min(), train.index.max()), (test.index.min(), test.index.max())

((726, 182),
 (Timestamp('2024-05-16 15:15:00+0000', tz='UTC'),
  Timestamp('2025-05-15 18:00:00+0000', tz='UTC')),
 (Timestamp('2025-05-16 10:00:00+0000', tz='UTC'),
  Timestamp('2025-08-14 17:15:00+0000', tz='UTC')))

In [None]:
# Naive: y_hat_t = y_{t-1}
y_pred_naive = X_test["lag1"].values
rmse_naive = np.sqrt(mean_squared_error(y_test, y_pred_naive))
mae_naive  = mean_absolute_error(y_test, y_pred_naive)
mape_naive = (np.abs((y_test.values - y_pred_naive) / np.clip(np.abs(y_test.values), 1e-8, None)).mean()) * 100
r2_naive   = r2_score(y_test, y_pred_naive)

# Linear regression on lag-1
lr = LinearRegression().fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
mae_lr  = mean_absolute_error(y_test, y_pred_lr)
mape_lr = (np.abs((y_test.values - y_pred_lr) / np.clip(np.abs(y_test.values), 1e-8, None)).mean()) * 100
r2_lr   = r2_score(y_test, y_pred_lr)

pd.DataFrame([
    {"model":"Naive(lag-1)", "RMSE":rmse_naive, "MAE":mae_naive, "MAPE":mape_naive, "R2":r2_naive},
    {"model":"LinearRegression(lag-1)", "RMSE":rmse_lr, "MAE":mae_lr, "MAPE":mape_lr, "R2":r2_lr},
])

|   | model | RMSE | MAE | MAPE | R2 |
|---|---|---|---|---|---|
| 0 | Naive(lag-1) | 10.537776 | 9.582418 | 44.218608 | -1.85094 |
| 1 | LinearRegression(lag-1) | 8.978484 | 7.656789 | 32.31766 | -1.069646 |


## 📊 Baseline Model Interpretation

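- Both **Naive(lag-1)** and **Linear(lag-1)** show high error (RMSE ≈ 10.5 and ≈ 9.0 °C) and **negative R²** (-1.85 and -1.07), meaning each does worse on the test set than always predicting the test-set mean.
- Linear improves on Naive (MAE 7.66 vs. 9.58 °C) but still has little predictive power with a single lag.
- A likely contributor: rows from different locations share one time index (the preview shows London followed by Tokyo), so `lag1` can be another city's reading. Temperature also depends on **seasonality** and **other weather features**, which a single lag cannot capture.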
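
To make the meaning of a negative R² concrete, the sketch below compares a persistence forecast with a constant test-mean forecast on a short toy series. The numbers are made up for illustration and are not this notebook's data; `r2` is a hypothetical helper that applies the usual definition, 1 - SS_res / SS_tot.

```python
import numpy as np

def r2(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy series with large jumps between consecutive rows (illustrative only).
series = np.array([14.0, 15.0, 24.0, 18.3, 30.0, 12.0])

y_true = series[1:]                              # values to predict
y_persist = series[:-1]                          # persistence: previous value
y_mean = np.full_like(y_true, y_true.mean())     # constant mean forecast

r2_persist = r2(y_true, y_persist)
r2_mean = r2(y_true, y_mean)

print(f"persistence R2: {r2_persist:.3f}")
print(f"mean-forecast R2: {r2_mean:.3f}")
```

The mean forecast scores exactly zero by construction, so any model with R² below zero has a larger squared error than that constant forecast, which is the situation in the results table above.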

**Why this is fine:** This notebook sets a **performance floor** and meets the assessment requirements:
- Uses `last_updated` for time-series analysis
- Builds a basic forecasting baseline
- Evaluates with multiple metrics

**Next:** Add more lags (e.g., 2, 7, 14), seasonal signals (`sin_doy`, `cos_doy`), and weather features (humidity, wind, pressure) in multivariate baselines.
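
One possible starting point for the extra lags is sketched below. It assumes lags should be computed separately per location, using the `location_name` column visible in the preview, so that a Tokyo row is never paired with a London reading. The function name `add_lag_features` and the toy frame are hypothetical. Lags count observations rather than days, because the timestamps are irregular.

```python
import pandas as pd

def add_lag_features(frame, target, group_col, lags=(1, 2, 7)):
    """Hypothetical helper: add per-group lag columns, drop rows missing any lag."""
    # Stable sort keeps the original order of rows that share a timestamp.
    out = frame.sort_index(kind="mergesort").copy()
    for k in lags:
        # shift(k) within each group: the k-th previous observation of that group.
        out[f"lag{k}"] = out.groupby(group_col)[target].shift(k)
    return out.dropna(subset=[f"lag{k}" for k in lags])

# Toy data (made up): two locations interleaved on one time index.
toy = pd.DataFrame(
    {
        "location_name": ["London", "Tokyo", "London", "Tokyo", "London", "Tokyo"],
        "temperature_celsius": [14.0, 24.0, 15.0, 25.0, 16.0, 26.0],
    },
    index=pd.to_datetime(
        [
            "2024-05-16 09:00", "2024-05-16 10:00", "2024-05-17 09:00",
            "2024-05-17 10:00", "2024-05-18 09:00", "2024-05-18 10:00",
        ],
        utc=True,
    ),
)

features = add_lag_features(toy, "temperature_celsius", "location_name", lags=(1, 2))
print(features)
```

The resulting `lag*` columns can be combined with the existing `sin_doy`, `cos_doy`, and weather columns as inputs to a multivariate baseline, using the same time-ordered split as above.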
