## Baseline Check: Naive Approach

Before investing in complex models, I need to know how well the "lazy" prediction scores.

**The Logic:** Next year's yield is probably just the same as this year's yield.
If my complex models can't beat this, they are useless.

* **Crop:** Rice
* **Test set:** 2019 onwards
* **Model:** `Prediction = avg_yield_rice_1y`
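
As a rough illustration of what a "same as last year" prediction looks like, here is a minimal sketch on a toy table. The `region` and `year` columns and all values are assumptions made up for this example. How `avg_yield_rice_1y` was actually built is not shown in this notebook, so this is not necessarily the same construction.

```python
import pandas as pd

# Toy panel: column names `region` and `year` and all values are
# hypothetical, used only to illustrate a lag-1 (persistence) forecast.
toy = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B"],
    "year":   [2017, 2018, 2019, 2017, 2018, 2019],
    "Y_rice": [4000.0, 4100.0, 4050.0, 3000.0, 3200.0, 3100.0],
})

# Sort so that shift(1) really means "previous year" within each region.
toy = toy.sort_values(["region", "year"])

# Persistence forecast: last year's observed yield, per region.
toy["naive_pred"] = toy.groupby("region")["Y_rice"].shift(1)

print(toy)
```

The first year of each region has no previous value, so its prediction is NaN. Rows like that are the ones that have to be dropped before scoring.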
    
Checking RMSE, MAE, R2, plus percentage errors (MAPE/RMSPE) to see how far off we are in relative terms.
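
To make the two percentage metrics concrete, here is a small worked example on invented numbers (not from the dataset). It uses the same formulas as the metric cell further down: relative error is `(actual - predicted) / actual`.

```python
import numpy as np

# Made-up numbers, chosen only so the arithmetic is easy to follow.
y_true_demo = np.array([100.0, 200.0, 400.0])
y_pred_demo = np.array([110.0, 180.0, 400.0])

# Relative error per sample: (actual - predicted) / actual
rel_err = (y_true_demo - y_pred_demo) / y_true_demo

# MAPE: mean of the absolute relative errors, in percent
mape_demo = np.mean(np.abs(rel_err)) * 100

# RMSPE: root of the mean squared relative error, in percent
rmspe_demo = np.sqrt(np.mean(rel_err ** 2)) * 100

print(f"MAPE:  {mape_demo:.2f}%")
print(f"RMSPE: {rmspe_demo:.2f}%")
```

On the same set of relative errors, RMSPE is never smaller than MAPE, because squaring gives large misses more weight. A wide gap between the two therefore suggests that a few samples are far off in relative terms.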

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# grabbing the data
df = pd.read_parquet('Parquet/XY_v3.parquet')

# just focusing on rice for now
CROP = 'rice'
TARGET = f'Y_{CROP}'
NAIVE_FEAT = f'avg_yield_{CROP}_1y' # this is just yield(t-1)

# quick look to make sure cols exist
print(f"Target: {TARGET}")
print(f"Using feature: {NAIVE_FEAT}")

Target: Y_rice
Using feature: avg_yield_rice_1y


In [2]:
# drop rows where we don't even have a target (can't validate those)
df_clean = df.dropna(subset=[TARGET]).copy()

# splitting off the test set (2019+)
# not doing a train split because naive models don't need training
test_mask = df_clean['year'] >= 2019
test_df = df_clean[test_mask].copy()

print(f"Test rows: {len(test_df)}")

Test rows: 570


### Calculating Errors
Need to handle NaNs in the lag column. If `avg_yield_rice_1y` is missing (maybe new area?), I can't make a naive prediction, so I'll drop those rows for the metric calc.
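
A minimal sketch of the masking idea on toy values (invented for illustration). The key point is that one boolean mask is applied to both series, so actuals and predictions stay aligned row by row.

```python
import numpy as np
import pandas as pd

# Toy series sharing one index; values are invented for illustration.
y_true_toy = pd.Series([4000.0, 4100.0, 3900.0, 4200.0])
y_pred_toy = pd.Series([np.nan, 4050.0, 4000.0, np.nan])  # NaN = no lagged yield

# Keep a row only when both the actual and the prediction are present.
valid = y_true_toy.notna() & y_pred_toy.notna()

# Apply the same mask to both series so they stay aligned.
y_true_kept = y_true_toy[valid]
y_pred_kept = y_pred_toy[valid]

print(f"Kept {valid.sum()} of {len(valid)} rows")
```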

In [3]:
# set actuals and predictions
y_true = test_df[TARGET]
y_pred = test_df[NAIVE_FEAT]

# filter out missing values so sklearn doesn't raise on NaN
mask = y_pred.notna() & y_true.notna()

y_true_clean = y_true[mask]
y_pred_clean = y_pred[mask]

print(f"Evaluating on {mask.sum()} valid samples...")

# --- Metrics ---

# 1. RMSE
rmse = np.sqrt(mean_squared_error(y_true_clean, y_pred_clean))

# 2. MAE
mae = mean_absolute_error(y_true_clean, y_pred_clean)

# 3. R2
r2 = r2_score(y_true_clean, y_pred_clean)

# 4. MAPE (Mean Absolute Percentage Error)
# manual calc; assumes no zero actuals (a zero in y_true would divide by zero and give inf)
mape = np.mean(np.abs((y_true_clean - y_pred_clean) / y_true_clean)) * 100

# 5. RMSPE (Root Mean Squared Percentage Error)
# (y - y_hat) / y -> square it -> mean -> root
rmspe = np.sqrt(np.mean(np.square((y_true_clean - y_pred_clean) / y_true_clean))) * 100


print("\n--- Naive Baseline Results ---")
print(f"RMSE:  {rmse:.4f}")
print(f"MAE:   {mae:.4f}")
print(f"R2:    {r2:.4f}")
print("-" * 20)
print(f"MAPE:  {mape:.2f}%")
print(f"RMSPE: {rmspe:.2f}%")

Evaluating on 569 valid samples...

--- Naive Baseline Results ---
RMSE:  535.7621
MAE:   290.1016
R2:    0.9418
--------------------
MAPE:  7.76%
RMSPE: 14.38%
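
One possible way to express "beating the baseline" as a single number is a skill score: the fractional RMSE reduction relative to the naive RMSE printed above. This is only a sketch. `rmse_skill` is a hypothetical helper, and the model RMSE below is a placeholder, not a result from this project.

```python
def rmse_skill(rmse_model: float, rmse_baseline: float) -> float:
    """Fractional RMSE reduction versus the baseline (positive = better)."""
    if rmse_baseline <= 0:
        raise ValueError("baseline RMSE must be positive")
    return 1.0 - rmse_model / rmse_baseline


NAIVE_RMSE = 535.7621             # naive baseline RMSE from the results above
HYPOTHETICAL_MODEL_RMSE = 480.0   # placeholder value, NOT a real result

skill = rmse_skill(HYPOTHETICAL_MODEL_RMSE, NAIVE_RMSE)
print(f"Skill vs naive: {skill:.1%}")
```

A score of zero means the model only matches the naive prediction, and a negative score means it does worse.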
