# SARIMA Modeling for Hydro Energy Forecasting
**Objective**: Evaluate SARIMA model performance for forecasting hydro energy generation in New Zealand.

This notebook contributes to **RQ1**: _Which model (SARIMA or ANN) provides the most accurate forecast for renewable energy generation in New Zealand?_

We focus on univariate SARIMA modeling using historical hydro generation data.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from sklearn.metrics import mean_squared_error, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')
sns.set_theme()  # apply seaborn's default plot theme

## Load Hydro Generation Data

In [None]:
# Load dataset
hydro_df = pd.read_csv('hydro_data.csv', parse_dates=['DATE'])
hydro_df = hydro_df.sort_values('DATE')
hydro_df.set_index('DATE', inplace=True)
hydro_df = hydro_df.asfreq('D')  # Ensure daily frequency

# Preview
hydro_df['GENERATION'].plot(title='Daily Hydro Energy Generation', figsize=(12,4))
plt.ylabel('MWh')
plt.show()
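
`asfreq('D')` inserts a `NaN` row for every calendar day that is missing from the file. SARIMAX can handle missing observations through its Kalman filter, but the scikit-learn metrics used later cannot, so it may help to count the gaps up front. A minimal sketch on a made-up series (the dates and values are illustrative, not taken from `hydro_data.csv`); time-based interpolation is shown as one possible way to fill short gaps, not as the choice made in this notebook:

```python
import pandas as pd

# Toy series with one missing calendar day (2024-01-03); values are made up.
raw = pd.Series(
    [100.0, 110.0, 130.0, 140.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05"]),
    name="GENERATION",
)

daily = raw.asfreq("D")              # reindex onto a regular daily grid
n_missing = int(daily.isna().sum())  # days that had no observation
print(f"Missing days after asfreq: {n_missing}")

# One option for short gaps: linear interpolation in time.
filled = daily.interpolate(method="time")
print(filled)
```

On the real data the same check would be `hydro_df['GENERATION'].isna().sum()`.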

## Stationarity Check using Augmented Dickey-Fuller Test

In [None]:
result = adfuller(hydro_df['GENERATION'].dropna())
print(f'ADF Statistic: {result[0]:.4f}')
print(f'p-value: {result[1]:.4f}')
if result[1] < 0.05:
    print('✅ Series is stationary')
else:
    print('⚠️ Series is non-stationary — differencing required')
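
For reference, with its default settings (`regression='c'`, lag length chosen automatically by AIC), `adfuller` tests the regression below, where $y_t$ is the generation series and $p$ is the number of lagged differences:

```latex
\Delta y_t = \alpha + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t,
\qquad H_0 : \gamma = 0 \ \text{(unit root)}, \quad H_1 : \gamma < 0 .
```

A p-value below 0.05 rejects the unit-root null hypothesis. A larger p-value only means the test failed to reject it, which is weaker evidence than a proof of non-stationarity.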

## Differencing to Achieve Stationarity

In [None]:
# Apply first-order differencing (d=1) and re-test for stationarity
hydro_df['DIFF_GEN'] = hydro_df['GENERATION'].diff()
hydro_df['DIFF_GEN'].dropna().plot(title='Differenced Series', figsize=(12,4))
plt.ylabel('Differenced MWh')
plt.show()

# ADF test after differencing
result_diff = adfuller(hydro_df['DIFF_GEN'].dropna())
print(f'ADF Statistic (Differenced): {result_diff[0]:.4f}')
print(f'p-value: {result_diff[1]:.4f}')
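
The model fitted later also uses seasonal differencing (`D=1` with period 7), which the cell above does not apply. A minimal sketch, on a synthetic series with made-up numbers, of how the two differencing steps combine and how many observations they consume:

```python
import numpy as np
import pandas as pd

# Synthetic daily series: linear trend plus a repeating weekly pattern (made up).
idx = pd.date_range("2024-01-01", periods=28, freq="D")
weekly = np.tile([5.0, 3.0, 0.0, -2.0, -4.0, -3.0, 1.0], 4)
y = pd.Series(2.0 * np.arange(28) + weekly, index=idx)

d1 = y.diff()        # first difference (d=1): removes the linear trend
d1_s7 = d1.diff(7)   # seasonal difference at lag 7 (D=1): removes the weekly pattern

# Each step costs observations: 1 for d=1, then 7 more for D=1 with s=7.
lost = int(d1_s7.isna().sum())
residual_max = float(d1_s7.dropna().abs().max())
print(f"Observations lost: {lost}, largest remaining value: {residual_max}")
```

On the notebook's data the equivalent would be `hydro_df['GENERATION'].diff().diff(7)`, which could then be passed to `adfuller` and the ACF/PACF plots.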

## ACF and PACF Plots for Order Selection

In [None]:
fig, ax = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(hydro_df['DIFF_GEN'].dropna(), ax=ax[0], lags=40)
plot_pacf(hydro_df['DIFF_GEN'].dropna(), ax=ax[1], lags=40)
ax[0].set_title('ACF of Differenced Series')
ax[1].set_title('PACF of Differenced Series')
plt.tight_layout()
plt.show()
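
To make the ACF plot easier to read, here is a minimal NumPy sketch of the sample autocorrelation that each bar represents, together with the simple white-noise band of ±1.96/√n. The helper name `sample_acf` and the toy series are illustrative. Note that `plot_acf` by default draws a band that widens with lag (Bartlett's formula), so the flat band here is only an approximation.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    denom = np.sum(dev ** 2)
    return np.array([np.sum(dev[k:] * dev[:len(x) - k]) / denom
                     for k in range(max_lag + 1)])

# Toy series that flips sign at every step (illustrative only).
x = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
r = sample_acf(x, max_lag=2)     # r_0 = 1, r_1 = -0.875, r_2 = 0.75
band = 1.96 / np.sqrt(len(x))    # approximate 95% band under white noise

for lag, value in enumerate(r):
    flag = "significant" if abs(value) > band else "within band"
    print(f"lag {lag}: r = {value:+.3f} ({flag})")
```

In the plots above, a spike at lag 7 (and its multiples) that stays outside the band would point to weekly seasonality.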

## SARIMA Model Fitting

In [None]:
# Fit SARIMA(1,1,1)(1,1,1,7): s=7 models a weekly cycle in the daily data.
# Orders should be chosen from ACF/PACF inspection or AIC minimization.
model = SARIMAX(hydro_df['GENERATION'], 
                order=(1,1,1), 
                seasonal_order=(1,1,1,7), 
                enforce_stationarity=False, 
                enforce_invertibility=False)

results = model.fit(disp=False)
print(results.summary())
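
The comment in the cell above mentions AIC minimization. One way this could be done is a small grid search like the sketch below. The helper names (`sarimax_aic`, `select_by_aic`) and the grid are hypothetical and not part of the original notebook. `d`, `D`, and `s` are held fixed because AIC values are only comparable between models fitted to the same differenced series.

```python
import itertools

def sarimax_aic(series, order, seasonal_order):
    """Fit one SARIMAX candidate and return its AIC (inf if the fit fails)."""
    # Imported here so select_by_aic can also be used with another scorer.
    from statsmodels.tsa.statespace.sarimax import SARIMAX
    try:
        fit = SARIMAX(series, order=order, seasonal_order=seasonal_order,
                      enforce_stationarity=False,
                      enforce_invertibility=False).fit(disp=False)
        return fit.aic
    except Exception:
        return float("inf")

def select_by_aic(series, p_values, q_values, d=1, D=1, s=7, scorer=sarimax_aic):
    """Return (best_aic, order, seasonal_order) over a small (p, q, P, Q) grid."""
    best = (float("inf"), None, None)
    for p, q, P, Q in itertools.product(p_values, q_values, p_values, q_values):
        order, seasonal = (p, d, q), (P, D, Q, s)
        aic = scorer(series, order, seasonal)
        if aic < best[0]:
            best = (aic, order, seasonal)
    return best

# Possible usage in this notebook (every candidate needs a full model fit):
# best_aic, best_order, best_seasonal = select_by_aic(
#     hydro_df['GENERATION'], p_values=[0, 1, 2], q_values=[0, 1, 2])
```

The grid grows as the fourth power of the number of candidate values, so keeping it small matters on long daily series.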

## Model Diagnostics

In [None]:
results.plot_diagnostics(figsize=(12,8))
plt.tight_layout()
plt.show()

## Forecasting Future Hydro Generation

In [None]:
# Forecast the next 30 days
forecast = results.get_forecast(steps=30)
pred_ci = forecast.conf_int()

# Plot
ax = hydro_df['GENERATION'].plot(label='Observed', figsize=(12, 6))
forecast.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='lightblue', alpha=0.4)
plt.title('Hydro Energy Forecast (Next 30 Days)')
plt.legend()
plt.show()

## Forecast Evaluation

In [None]:
# Hold out the last 30 days as a test set for out-of-sample evaluation
train = hydro_df['GENERATION'][:-30]
test = hydro_df['GENERATION'][-30:]

model_eval = SARIMAX(train, order=(1,1,1), seasonal_order=(1,1,1,7), 
                     enforce_stationarity=False, enforce_invertibility=False)
results_eval = model_eval.fit(disp=False)

forecast_eval = results_eval.get_forecast(steps=30)
pred = forecast_eval.predicted_mean

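# Metrics, computed only on days with an observation
# (asfreq('D') can introduce NaNs, which the sklearn metrics reject)
valid = test.notna()
y_true, y_pred = test[valid], pred[valid]
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
nonzero = y_true != 0  # MAPE is undefined where the actual value is 0
mape = np.mean(np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero])) * 100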

print(f'MAE: {mae:.2f}')
print(f'RMSE: {rmse:.2f}')
print(f'MAPE: {mape:.2f}%')
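
Error metrics are easier to judge against a trivial reference. The sketch below shows a seasonal-naive baseline, which repeats the last observed week, together with a MAPE helper that skips zero or missing actuals. Neither is part of the original notebook: the function names and toy numbers are illustrative.

```python
import numpy as np

def seasonal_naive(train, horizon, season=7):
    """Forecast by repeating the last `season` observed values."""
    last_cycle = np.asarray(train, dtype=float)[-season:]
    reps = int(np.ceil(horizon / season))
    return np.tile(last_cycle, reps)[:horizon]

def mape(actual, predicted):
    """Mean absolute percentage error, skipping zero or missing actuals."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ok = np.isfinite(actual) & (actual != 0)
    return float(np.mean(np.abs((actual[ok] - predicted[ok]) / actual[ok])) * 100)

# Toy data (made up): two identical weeks, then a test week that is 10% higher.
week = np.array([100.0, 120.0, 110.0, 90.0, 80.0, 95.0, 105.0])
train_toy = np.tile(week, 2)
test_toy = week * 1.10

baseline = seasonal_naive(train_toy, horizon=7)
print(f"Seasonal-naive MAPE: {mape(test_toy, baseline):.2f}%")
```

In this notebook, `seasonal_naive(train, horizon=30)` could be scored with the same metrics as the SARIMA forecast. If SARIMA does not beat that baseline, the chosen orders may deserve another look.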

### 🔍 Interpretation (RQ1)
The SARIMA model serves as a baseline for forecasting hydro energy generation. Its MAE, RMSE, and MAPE on the 30-day hold-out set will be compared against the ANN model's metrics to determine which model forecasts more accurately (RQ1).

A future extension will fit SARIMAX with lagged climate features as exogenous regressors to address RQ2.