# Advanced Time Series Forecasting — NeuralProphet + Optuna

This notebook implements the project according to the stated project conditions:
- Use **NeuralProphet** as the forecasting model.
- Perform **blocked cross-validation** suitable for time series.
- Use **Optuna** for hyperparameter optimization.
- Provide **production-ready** code, detailed markdown explanations, and final deliverables (top-5 hyperparameters, holdout evaluation, plots, and CSV export).

This notebook follows the submission format and documentation described in the project README (`/mnt/data/README.md`), which serves as the canonical README for the final submission.

## 1) Imports and configuration

In [None]:
"""
neuralprophet_optuna_project.py

Advanced Time Series Forecasting project:
- Generates a synthetic daily time series (5 years by default, 2017-2021)
- Implements NeuralProphet baseline
- Uses Optuna for hyperparameter optimization with blocked CV
- Evaluates with RMSE and MAE and outputs top-5 hyperparameter configs

Requirements (install before running):
    pip install neuralprophet optuna pandas numpy matplotlib scikit-learn

Note: NeuralProphet depends on PyTorch. Ensure your environment supports it:
    pip install torch --index-url https://download.pytorch.org/whl/cpu

Usage:
    python neuralprophet_optuna_project.py
"""

from __future__ import annotations

import os
import csv
import math
import json
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error
import optuna

# Attempt to import NeuralProphet; if missing, instruct the user
try:
    from neuralprophet import NeuralProphet, set_log_level
except Exception as ex:
    raise ImportError(
        "NeuralProphet is not installed or failed to import. "
        "Install with `pip install neuralprophet` and ensure PyTorch is available."
    ) from ex

import matplotlib.pyplot as plt

set_log_level("ERROR")



## 2) Synthetic dataset generation
Generates a 5-year daily time series (2017-01-01 to 2021-12-31 by default) with yearly and weekly seasonality, holiday pulses, level-shift changepoints, and Gaussian noise.
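
As a quick sanity check on the default date range, the sketch below counts the days it spans and rebuilds two of the additive components on their own. The amplitudes (10 for the yearly sine, 3 for the weekend bump) mirror the generator in the next cell; the variable names are illustrative only and are not used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Default range used by the generator: five calendar years, including the 2020 leap day.
dates = pd.date_range(start="2017-01-01", end="2021-12-31", freq="D")
n_days = len(dates)  # 1826

# Time in years, matching the generator's convention of 365 days per year.
t = np.arange(n_days) / 365.0

# Yearly component: one sine cycle per 365 days, amplitude 10.
yearly = 10.0 * np.sin(2 * np.pi * t)

# Weekly component: +3 on Saturdays and Sundays, 0 on weekdays.
weekend_bump = np.where(dates.weekday >= 5, 3.0, 0.0)
n_weekend_days = int((weekend_bump > 0).sum())

print(f"days={n_days}, weekend days={n_weekend_days}")
```

Because each component is a plain array of length `n_days`, the full series is simply their element-wise sum plus trend, holiday pulses, and noise.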

In [None]:
def generate_synthetic_daily_series(
    start_date: str = "2017-01-01",
    end_date: str = "2021-12-31",
    seed: int = 42,
) -> pd.DataFrame:
    """
    Generate a synthetic daily time series with:
      - yearly seasonality
      - weekly seasonality
      - synthetic holiday effects
      - multiple changepoints (abrupt level shifts)
      - Gaussian noise

    Returns:
        DataFrame with columns ['ds','y'] where 'ds' is datetime and 'y' is observed value.
    """
    rng = np.random.default_rng(seed)
    dates = pd.date_range(start=start_date, end=end_date, freq="D")
    n = len(dates)
    t = np.arange(n) / 365.0  # time in years

    # Yearly seasonal component (smooth)
    yearly = 10.0 * np.sin(2 * math.pi * t)  # amplitude 10

    # Weekly seasonality (weekday effects)
    weekday = np.array([0.0 if d.weekday() < 5 else 3.0 for d in dates])

    # Trend (linear + changepoints)
    trend = 0.5 * t  # gentle upward trend
    # Add changepoints: at specific indices add shifts
    changepoints = [
        int(0.6 * n),
        int(0.35 * n),
    ]
    trend_shift = np.zeros(n)
    for cp in changepoints:
        trend_shift[cp:] += rng.normal(5.0, 1.0)  # abrupt shift

    # Synthetic holiday effect: create some date list and add pulses
    holidays = [
        pd.Timestamp("2017-12-25"),
        pd.Timestamp("2018-12-25"),
        pd.Timestamp("2019-12-25"),
        pd.Timestamp("2020-12-25"),
        pd.Timestamp("2021-12-25"),
    ]
    holiday_effect = np.zeros(n)
    for i, d in enumerate(dates):
        if d in holidays:
            holiday_effect[i] += 8.0 + rng.normal(0.0, 1.0)

    # Noise
    noise = rng.normal(0.0, 2.5, size=n)

    y = 20.0 + 3.0 * trend + trend_shift + yearly + weekday + holiday_effect + noise

    df = pd.DataFrame({"ds": dates, "y": y})
    return df



## 3) Blocked cross-validation helper
Create time-based folds with no leakage.
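
To make the fold layout concrete, the sketch below reproduces only the index arithmetic of the helper in the next cell and prints the resulting boundaries. The function name `fold_boundaries` and the toy sizes (1000 days, 100 reserved) are assumptions chosen for illustration. Each fold trains on everything before its validation block, so the training window expands while validation blocks never overlap.

```python
from typing import List, Tuple


def fold_boundaries(
    total_days: int, n_folds: int = 3, test_size_days: int = 90
) -> List[Tuple[int, int, int]]:
    """Return (train_end, val_start, val_end) index triples for each fold."""
    # Same block size as the notebook helper: usable days split into n_folds + 1 blocks.
    fold_size = (total_days - test_size_days) // (n_folds + 1)
    bounds = []
    for i in range(n_folds):
        train_end = (i + 1) * fold_size
        # Validation starts exactly where training ends, so no row is in both.
        bounds.append((train_end, train_end, train_end + fold_size))
    return bounds


for train_end, val_start, val_end in fold_boundaries(1000, n_folds=3, test_size_days=100):
    print(f"train [0, {train_end})  val [{val_start}, {val_end})")
```

With these toy sizes the block length is 225, so the last validation block ends at index 900 and the final 100 rows are never touched by any fold.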

In [None]:
def blocked_time_series_folds(
    df: pd.DataFrame, n_folds: int = 3, test_size_days: int = 90
) -> List[Tuple[pd.DataFrame, pd.DataFrame]]:
    """
    Create blocked (time-based) cross-validation folds.
    Each fold uses earlier data for train and the next contiguous block for validation.
    The last `test_size_days` rows are left out of every fold so they can be
    evaluated separately; this function does not return them.

    Returns:
        list of (train_df, val_df)
    """
    dates = df["ds"].sort_values().unique()
    total_days = len(dates)
    fold_size = (total_days - test_size_days) // (n_folds + 1)

    folds = []
    for i in range(n_folds):
        train_end_idx = (i + 1) * fold_size
        val_start_idx = train_end_idx
        val_end_idx = val_start_idx + fold_size
        train_dates = dates[:train_end_idx]
        val_dates = dates[val_start_idx:val_end_idx]
        train_df = df[df["ds"].isin(train_dates)].reset_index(drop=True)
        val_df = df[df["ds"].isin(val_dates)].reset_index(drop=True)
        folds.append((train_df, val_df))
    return folds



## 4) Training & evaluation function
Train NeuralProphet using given hyperparameters and evaluate RMSE/MAE.
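
For reference, the two metrics can be computed by hand as in the sketch below. The arrays are made-up numbers, not model output. RMSE squares the errors before averaging, so a single large miss moves it more than it moves MAE.

```python
import numpy as np

# Toy values for illustration only: three perfect predictions and one miss of 4.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 8.0])

errors = y_pred - y_true

# Root mean squared error: sqrt(mean(e^2)) = sqrt(16 / 4) = 2.0
rmse = float(np.sqrt(np.mean(errors ** 2)))

# Mean absolute error: mean(|e|) = 4 / 4 = 1.0
mae = float(np.mean(np.abs(errors)))

print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}")
```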

In [None]:
def train_evaluate_np(
    train_df: pd.DataFrame,
    val_df: pd.DataFrame,
    params: Dict,
    verbose: bool = False,
) -> Tuple[float, float, NeuralProphet]:
    """
    Train NeuralProphet on train_df using params and evaluate on val_df.
    Returns (rmse, mae, model)
    """
    # Setup model configuration from params
    model = NeuralProphet(
        n_changepoints=int(params.get("n_changepoints", 10)),
        changepoints_range=params.get("changepoints_range", 0.8),
        yearly_seasonality=params.get("yearly_seasonality", True),
        weekly_seasonality=params.get("weekly_seasonality", True),
        daily_seasonality=False,
        seasonality_mode=params.get("seasonality_mode", "additive"),
        learning_rate=params.get("learning_rate", 1.0),
        epochs=int(params.get("epochs", 50)),
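        # AR regularization is passed as `ar_reg` in recent NeuralProphet releases;
        # it has no effect while n_lags=0 (the default used here).
        ar_reg=params.get("ar_sparsity", 0.0),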
        loss_func=params.get("loss_func", "MSE"),
    )

    # Fit model
    model.fit(train_df, freq="D", progress=None)

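    # Make predictions on validation.
    # 'y' is kept in the frame because some NeuralProphet releases expect it in predict();
    # with n_lags=0 the observed values are not used as model inputs.
    future = val_df[["ds", "y"]].copy()
    forecast = model.predict(future)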
    y_true = val_df["y"].values
    y_pred = forecast["yhat1"].values

    rmse = math.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    if verbose:
        print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}")
    return rmse, mae, model



## 5) Baseline evaluation and Optuna optimization
This cell defines the baseline parameters and the Optuna study (objective).

In [None]:
def baseline_and_optimize(
    df: pd.DataFrame,
    n_folds: int = 3,
    test_size_days: int = 90,
    n_trials: int = 40,
    seed: int = 42,
) -> Dict:
    """
    Run baseline evaluation and Optuna optimization. Returns dictionary of results including top trials.
    """
    # Reserve final holdout test (last test_size_days)
    df_sorted = df.sort_values("ds").reset_index(drop=True)
    holdout = df_sorted.iloc[-test_size_days:].reset_index(drop=True)
    in_sample = df_sorted.iloc[: -test_size_days].reset_index(drop=True)

    # Baseline: default NeuralProphet with modest epochs
    baseline_params = {
        "n_changepoints": 10,
        "changepoints_range": 0.8,
        "seasonality_mode": "additive",
        "learning_rate": 1.0,
        "epochs": 50,
        "ar_sparsity": 0.0,
        "loss_func": "MSE",
        "yearly_seasonality": True,
        "weekly_seasonality": True,
    }

    # Cross-validation folds
    # in_sample already excludes the holdout, so no further rows are reserved here.
    folds = blocked_time_series_folds(in_sample, n_folds=n_folds, test_size_days=0)

    # Baseline CV
    baseline_rmses = []
    baseline_maes = []
    for (tr, va) in folds:
        rmse, mae, _ = train_evaluate_np(tr, va, baseline_params)
        baseline_rmses.append(rmse)
        baseline_maes.append(mae)
    baseline_rmse_mean = float(np.mean(baseline_rmses))
    baseline_mae_mean = float(np.mean(baseline_maes))

    print(f"Baseline CV RMSE: {baseline_rmse_mean:.4f}, MAE: {baseline_mae_mean:.4f}")

    # Optuna study
    def objective(trial: optuna.trial.Trial) -> float:
        params = {
            "n_changepoints": trial.suggest_int("n_changepoints", 5, 50),
            "changepoints_range": trial.suggest_float("changepoints_range", 0.5, 0.95),
            "seasonality_mode": trial.suggest_categorical("seasonality_mode", ["additive", "multiplicative"]),
            # suggest_loguniform is deprecated in Optuna 3.x; use suggest_float(..., log=True)
            "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1.0, log=True),
            "epochs": trial.suggest_int("epochs", 30, 200),
            "ar_sparsity": trial.suggest_float("ar_sparsity", 0.0, 0.9),
            "loss_func": "MSE",
            "yearly_seasonality": True,
            "weekly_seasonality": True,
        }

        # Evaluate across folds and return mean RMSE
        cv_rmse = []
        for (tr, va) in folds:
            try:
                rmse, _, _ = train_evaluate_np(tr, va, params)
            except Exception as e:
                # If training fails, return large penalty
                print("Trial training failed:", e)
                return 1e6
            cv_rmse.append(rmse)
        return float(np.mean(cv_rmse))

    study = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler(seed=seed))
    study.optimize(objective, n_trials=n_trials, show_progress_bar=True)

    print("Optimization completed. Best trial:")
    print(study.best_trial.params)

    # Evaluate top 5 trials on holdout test set and record metrics
    # Keep only trials that produced a value, then sort by value (mean CV RMSE)
    finished_trials = [t for t in study.trials if t.value is not None]
    best_trials = sorted(finished_trials, key=lambda t: t.value)[:5]

    top5_results = []
    for t in best_trials:
        params = dict(t.params)
        # add defaults for keys not present
        params.setdefault("loss_func", "MSE")
        params["yearly_seasonality"] = True
        params["weekly_seasonality"] = True
        params["epochs"] = int(params.get("epochs", 50))
        # retrain on entire in-sample (i.e., before holdout)
        rmse, mae, model = train_evaluate_np(in_sample, holdout, params)
        top5_results.append(
            {
                "trial": t.number,
                "value_cv_rmse": float(t.value),
                "holdout_rmse": float(rmse),
                "holdout_mae": float(mae),
                "params": params,
            }
        )

    results = {
        "baseline": {
            "cv_rmse": baseline_rmse_mean,
            "cv_mae": baseline_mae_mean,
            "params": baseline_params,
        },
        "study": study,
        "top5": top5_results,
        "holdout": {"df": holdout},
    }
    return results



## 6) Helpers to save results and plot forecasts
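
The CSV layout used by the save helper in the next cell stores each trial's parameters as a JSON string in a single cell. The sketch below writes one made-up row to an in-memory buffer and reads it back, to show how `params_json` can be decoded again. The row values are invented for illustration and are not results from this project.

```python
import csv
import io
import json

# One invented result row with the same keys the notebook's top-5 records use.
rows = [
    {
        "trial": 3,
        "value_cv_rmse": 2.5,
        "holdout_rmse": 2.7,
        "holdout_mae": 2.1,
        "params": {"n_changepoints": 12, "seasonality_mode": "additive"},
    }
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["trial", "cv_rmse", "holdout_rmse", "holdout_mae", "params_json"])
for r in rows:
    writer.writerow(
        [r["trial"], r["value_cv_rmse"], r["holdout_rmse"], r["holdout_mae"], json.dumps(r["params"])]
    )

# Read back: every CSV field comes back as a string, so decode the JSON cell explicitly.
buffer.seek(0)
loaded = []
for row in csv.DictReader(buffer):
    row["params"] = json.loads(row.pop("params_json"))
    loaded.append(row)

print(loaded[0]["params"])
```

Numeric columns also come back as strings, so cast them with `float(...)` or `int(...)` before comparing trials.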

In [None]:
def save_top5_to_csv(top5: List[Dict], csv_path: str = "top5_hyperparams.csv") -> None:
    """
    Save the top5 results to a CSV file (one json-encoded params cell).
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["trial", "cv_rmse", "holdout_rmse", "holdout_mae", "params_json"])
        for r in top5:
            writer.writerow(
                [r["trial"], r["value_cv_rmse"], r["holdout_rmse"], r["holdout_mae"], json.dumps(r["params"])]
            )
    print(f"Top 5 hyperparameter results saved to {csv_path}")



In [None]:
def plot_forecast_comparison(model: NeuralProphet, df_true: pd.DataFrame, save_path: str = "forecast.png") -> None:
    """
    Generate a comparison plot of the model forecast vs actual for df_true (must include 'ds' and 'y').
    """
    future = df_true[["ds", "y"]].copy()
    forecast = model.predict(future)
    plt.figure(figsize=(12, 5))
    plt.plot(df_true["ds"], df_true["y"], label="actual")
    plt.plot(df_true["ds"], forecast["yhat1"], label="forecast")
    plt.legend()
    plt.xlabel("Date")
    plt.ylabel("y")
    plt.title("Forecast vs Actual (holdout)")
    plt.tight_layout()
    plt.savefig(save_path)
    plt.close()  # release the figure so repeated calls do not accumulate open figures
    print(f"Forecast plot saved to {save_path}")



## 7) Run the full pipeline

**Notes before running**
- Ensure `neuralprophet`, `optuna`, `pandas`, `numpy`, `matplotlib`, `scikit-learn`, and `torch` are installed.
- For a quick demo use `n_trials=8`. For submission-quality runs, set `n_trials=50` or more.
- The script will:
  - generate the synthetic dataset,
  - run baseline CV,
  - run Optuna optimization,
  - save `top5_hyperparams.csv`,
  - save `best_model_forecast.png`.
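
One way to check the dependencies listed above before starting a long run is sketched below. The helper name `missing_modules` is hypothetical. Note that the check uses import names, so `scikit-learn` appears as `sklearn`.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec locates a module without importing it; None means it is not installed.
    return [name for name in module_names if importlib.util.find_spec(name) is None]


required = ["neuralprophet", "optuna", "pandas", "numpy", "matplotlib", "sklearn", "torch"]
missing = missing_modules(required)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

An empty result only shows that the packages are discoverable; version mismatches between NeuralProphet and PyTorch would still surface at import time.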

Execute the cell below to run the experiment.


In [None]:

# Run the pipeline (demo)
df = generate_synthetic_daily_series(start_date="2017-01-01", end_date="2021-12-31", seed=123)
print(f"Dataset rows: {len(df)}; range: {df['ds'].min()} to {df['ds'].max()}")

results = baseline_and_optimize(df, n_folds=3, test_size_days=90, n_trials=8, seed=42)

# Print summary
print("Baseline CV RMSE: {:.4f}, MAE: {:.4f}".format(results["baseline"]["cv_rmse"], results["baseline"]["cv_mae"]))
print("\nTop 5 trials (summary):")
for r in results["top5"]:
    print(f"Trial {r['trial']}: CV_RMSE={r['value_cv_rmse']:.4f}, Holdout_RMSE={r['holdout_rmse']:.4f}, Holdout_MAE={r['holdout_mae']:.4f}")
    print("Params:", r["params"])
    print("-" * 60)

# Save and show files location
save_top5_to_csv(results["top5"], csv_path="top5_hyperparams.csv")

# Retrain best on in-sample and plot vs holdout
best_params = results["top5"][0]["params"]
_, _, best_model = train_evaluate_np(df.iloc[:-90].reset_index(drop=True), results["holdout"]["df"], best_params)
try:
    plot_forecast_comparison(best_model, results["holdout"]["df"], save_path="best_model_forecast.png")
except Exception as e:
    print("Plotting failed:", e)

print("Artifacts saved: top5_hyperparams.csv, best_model_forecast.png (if plotting succeeded).")


