## **Stock Price Prediction - NIFTY 50**

### **Notebook 04: Classical Time Series Forecasting (ARIMA, SARIMA, Prophet)**

[![Python](https://img.shields.io/badge/Python-3.8%2B-blue)](https://www.python.org/) [![Statsmodels](https://img.shields.io/badge/Statsmodels-Latest-red)](https://www.statsmodels.org/) [![Prophet](https://img.shields.io/badge/Prophet-Latest-lightblue)](https://facebook.github.io/prophet/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

---

**Part of the comprehensive learning series:** [Stock Price Prediction - NIFTY 50](https://github.com/prakash-ukhalkar/stock-price-prediction-nifty50)

**Learning Objectives:**
- Implement classical time series forecasting models (ARIMA, SARIMA, Prophet)
- Establish baseline performance metrics for comparison with ML/DL models
- Apply automatic hyperparameter tuning for optimal model selection
- Evaluate model performance using standard time series metrics
- Understand statistical foundations of econometric forecasting

**Dataset Scope:** Apply classical models to stationary log returns. Establish benchmarks for advanced modeling.

---

* This notebook implements and evaluates **classical time series models** for NIFTY-50 returns.

* These models provide the benchmark against which the later Machine Learning (ML) and Deep Learning (DL) models are compared.

* We model the **Log Returns** rather than prices because returns are treated as a stationary series, which is what ARIMA assumes once any differencing has been applied.
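
As a reminder of what the target series is, a log return is the difference of consecutive log prices, $r_t = \ln(P_t / P_{t-1})$. The sketch below is illustrative only: the `prices` array is made up, and the actual `Log_Return` column was prepared in an earlier notebook, possibly with different handling of missing values.

```python
import numpy as np

# Hypothetical closing prices (not taken from the NIFTY 50 data)
prices = np.array([100.0, 102.0, 101.0, 103.5])

# r_t = ln(P_t) - ln(P_{t-1}); the result has one fewer element than prices
log_returns = np.diff(np.log(prices))

# Log returns are additive: their sum equals the log of the total price ratio
total_log_return = log_returns.sum()
assert np.isclose(total_log_return, np.log(prices[-1] / prices[0]))
```

The additivity checked in the last line is one reason log returns are convenient to model: a multi-period return is the sum of the single-period returns.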

## 1. Setup and Data Loading

We load the clean, feature-engineered training and testing data prepared in Notebook 03.

In [17]:
# Cell 1: Import Libraries
import pandas as pd
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX # Used for final SARIMA fit
from pmdarima import auto_arima # For automatic ARIMA hyperparameter search
from prophet import Prophet # Facebook Prophet model
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Define Paths
TRAIN_DATA_PATH = '../data/processed/nifty50_train_features.csv'
TEST_DATA_PATH = '../data/processed/nifty50_test.csv'
MODEL_RESULTS_PATH = '../models/classical_model_results.csv'

# Load the training and testing datasets (Log Returns is the primary target)
train_df = pd.read_csv(TRAIN_DATA_PATH, index_col='Date', parse_dates=True)
test_df = pd.read_csv(TEST_DATA_PATH, index_col='Date', parse_dates=True)

y_train = train_df['Log_Return'].dropna() # Target series for classical models
y_test = test_df['Log_Return'].dropna()   # Actual values for testing

forecast_steps = len(y_test)

print(f"Training Target Series loaded. Size: {y_train.shape[0]}")
print(f"Testing Target Series loaded. Size: {y_test.shape[0]}")

Training Target Series loaded. Size: 57311
Testing Target Series loaded. Size: 14340


## 2. Model I: ARIMA (AutoRegressive Integrated Moving Average)

* ARIMA models use lagged observations (AR), differencing (I), and lagged forecast errors (MA). 

* We use `auto_arima` to search for the optimal combination of non-seasonal orders ($p, d, q$) that minimizes the **Akaike Information Criterion (AIC)**. 

* We set $d=0$ since we model the already stationary **Log Returns**.
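
With $d=0$ the model reduces to an ARMA($p, q$) process. In standard textbook notation (the symbols $r_t$, $c$, $\phi_i$, $\theta_j$, $\varepsilon_t$, $k$ and $\hat{L}$ are introduced here for illustration and are not used elsewhere in this notebook), the log return $r_t$ is modeled as:

$$
r_t = c + \sum_{i=1}^{p} \phi_i \, r_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t
$$

where $\varepsilon_t$ is a zero-mean white-noise error. The criterion used to compare candidate orders is

$$
\text{AIC} = 2k - 2\ln \hat{L}
$$

with $k$ the number of estimated parameters and $\hat{L}$ the maximized likelihood, so a lower AIC rewards fit while penalizing extra terms.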

In [19]:
# Cell 2: Find Optimal ARIMA Order using auto_arima

print("Searching for optimal ARIMA(p,d,q) order...")

arima_model_fit = auto_arima(y_train, 
                             start_p=1, start_q=1, 
                             max_p=5, max_q=5, 
                             m=1, # Non-seasonal model
                             d=0, 
                             suppress_warnings=True, 
                             stepwise=True, 
                             error_action='ignore')

print("--- Auto ARIMA Results ---")
print(arima_model_fit.summary())
optimal_order_arima = arima_model_fit.order
print(f"Optimal ARIMA order: {optimal_order_arima}")

# Generate Forecast using the best ARIMA model
arima_forecast = arima_model_fit.predict(n_periods=forecast_steps)

# Coerce the prediction to a plain NumPy array (pmdarima may return a Series or an ndarray,
# depending on the version), then attach the test index by position to avoid NaN/alignment issues.
arima_forecast_series = pd.Series(np.asarray(arima_forecast), index=y_test.index)

print(f"ARIMA Forecast generated for {forecast_steps} periods.")

Searching for optimal ARIMA(p,d,q) order...
--- Auto ARIMA Results ---
                               SARIMAX Results                                
Dep. Variable:                      y   No. Observations:                57311
Model:               SARIMAX(2, 0, 1)   Log Likelihood              148571.640
Date:                Sun, 19 Oct 2025   AIC                        -297133.280
Time:                        23:12:58   BIC                        -297088.498
Sample:                             0   HQIC                       -297119.340
                              - 57311                                         
Covariance Type:                  opg                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept   2.102e-05   1.29e-05      1.628      0.103   -4.28e-06    4.63e-05
ar.L1          0.9426      0.003 
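
The AIC and BIC in the summary above can be reproduced by hand from the reported log-likelihood, which is a quick way to see what `auto_arima` is minimizing. This sketch assumes five estimated parameters for the fit (intercept, two AR terms, one MA term, and the innovation variance); that count is inferred from the model order, not read from the truncated coefficient table.

```python
import math

# Values copied from the auto_arima summary for SARIMAX(2, 0, 1)
log_likelihood = 148571.640
n_obs = 57311

# Assumed parameter count: intercept + 2 AR + 1 MA + sigma^2
k = 5

aic = 2 * k - 2 * log_likelihood                 # AIC = 2k - 2 ln(L)
bic = k * math.log(n_obs) - 2 * log_likelihood   # BIC = k ln(n) - 2 ln(L)

print(f"AIC: {aic:.3f}")
print(f"BIC: {bic:.3f}")
```

Under this assumption the AIC matches the reported -297133.280, and the BIC agrees with the reported value to within the rounding of the log-likelihood.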

## 3. Model II: SARIMA (Seasonal ARIMA)

* SARIMA captures seasonal dependencies. Because an `auto_arima` search with $m=21$ over the full daily series is memory-intensive and risks a memory error, we use a two-step approach:

  1.  **Search on Subset:** Use the last year of data (252 trading days) to find the seasonal and non-seasonal orders efficiently.

  2.  **Fit on Full Data:** Use the identified orders to fit the final SARIMAX model on the entire `y_train` dataset.

In [20]:
# Cell 3: Find Optimal SARIMA Order (Memory-Safe Windowing)
print("SARIMA search is memory-intensive. Using the last 252 trading days (approx. 1 year) of data to search for parameters...")

# Create a subset of the training data for parameter search only
y_train_subset = y_train.tail(252) 

# --- Step 1: Search on Subset ---
# Restricting max_P/Q = 1 to manage memory
sarima_search_fit = auto_arima(y_train_subset, 
                              start_p=0, start_q=0, 
                              max_p=2, max_q=2, 
                              m=21, # Seasonal period (approx. 1 month of trading days)
                              start_P=0, start_Q=0,
                              max_P=1, max_Q=1, 
                              d=0, D=0,
                              suppress_warnings=True, 
                              stepwise=True, 
                              error_action='ignore',
                              trace=False)

optimal_sarima_order = sarima_search_fit.order
optimal_seasonal_order = sarima_search_fit.seasonal_order

print(f"Optimal SARIMA order found on subset: {optimal_sarima_order} x {optimal_seasonal_order} (m=21)")


# --- Step 2: Fit Model on FULL Training Data ---
sarima_full_model = SARIMAX(y_train, 
                            order=optimal_sarima_order, 
                            seasonal_order=optimal_seasonal_order, 
                            enforce_stationarity=False, 
                            enforce_invertibility=False)

sarima_model_fit = sarima_full_model.fit(disp=False)

print("--- Full SARIMA Model Fit Summary (using optimal parameters) ---")
print(sarima_model_fit.summary())

# --- Step 3: Generate Out-of-Sample Forecast using Integer Positions ---

# The training index is not a regular-frequency date index, so statsmodels cannot extend it by date.
# Integer positions n_train .. n_train + forecast_steps - 1 address the steps just past the training data.
n_train = len(y_train)
start_step = n_train
end_step = n_train + forecast_steps - 1

# Predict by position, then attach the test index to the resulting values
sarima_forecast = sarima_model_fit.predict(start=start_step, end=end_step)
sarima_forecast_series = pd.Series(sarima_forecast.values, index=y_test.index)

print(f"SARIMA Forecast generated using integer indices {start_step} to {end_step}.")
print(f"Total {forecast_steps} periods mapped to test index.")

SARIMA search is memory-intensive. Using the last 252 trading days (approx. 1 year) of data to search for parameters...
Optimal SARIMA order found on subset: (1, 0, 1) x (0, 0, 0, 21) (m=21)
--- Full SARIMA Model Fit Summary (using optimal parameters) ---
                               SARIMAX Results                                
Dep. Variable:             Log_Return   No. Observations:                57311
Model:               SARIMAX(1, 0, 1)   Log Likelihood              148548.498
Date:                Sun, 19 Oct 2025   AIC                        -297090.995
Time:                        23:14:13   BIC                        -297064.127
Sample:                             0   HQIC                       -297082.632
                              - 57311                                         
Covariance Type:                  opg                  

## 4. Model III: Facebook Prophet

* Prophet is an additive model that easily handles trend and multiple seasonalities. 

* It requires the data to be in a DataFrame with columns named `ds` (DateStamp) and `y` (target value).

In [21]:
# Cell 4: Prepare Data and Fit Prophet Model

# Prophet data preparation
prophet_df_train = y_train.reset_index()
prophet_df_train.columns = ['ds', 'y']

# Prophet rejects timezone-aware timestamps in 'ds', so strip the timezone before fitting
if prophet_df_train['ds'].dt.tz is not None:
    prophet_df_train['ds'] = prophet_df_train['ds'].dt.tz_localize(None)

print("Prophet training data head (Timezone removed):")
print(prophet_df_train.head())

# Initialize and Fit Model
prophet_model = Prophet(yearly_seasonality=True, daily_seasonality=False)
prophet_model.fit(prophet_df_train)

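# Create future dataframe for forecasting
# Note: make_future_dataframe defaults to freq='D', i.e. consecutive calendar days after the last training date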
prophet_future = prophet_model.make_future_dataframe(periods=forecast_steps, include_history=False)

print("Prophet model fitted and future dataframe created.")

Prophet training data head (Timezone removed):
          ds         y
0 2020-01-02  0.013467
1 2020-01-03 -0.012738
2 2020-01-03 -0.006049
3 2020-01-03  0.011214
4 2020-01-03  0.021944


23:15:39 - cmdstanpy - INFO - Chain [1] start processing
23:16:02 - cmdstanpy - INFO - Chain [1] done processing


Prophet model fitted and future dataframe created.


In [22]:
# Cell 5: Generate Prophet Forecast

prophet_forecast_results = prophet_model.predict(prophet_future)

# Extract the main forecast (yhat) and map to the test index
prophet_forecast = prophet_forecast_results['yhat'].values
prophet_forecast_series = pd.Series(prophet_forecast, index=y_test.index)

print("Prophet Forecast Head (Log Returns):\n", prophet_forecast_series.head())

Prophet Forecast Head (Log Returns):
 Date
2024-08-22 00:00:00+05:30    0.001505
2024-08-22 00:00:00+05:30    0.001962
2024-08-22 00:00:00+05:30    0.008448
2024-08-22 00:00:00+05:30   -0.000470
2024-08-22 00:00:00+05:30    0.003674
dtype: float64
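
Prophet predicts for whatever timestamps appear in the `ds` column of the frame passed to `predict`, and `make_future_dataframe` generates consecutive calendar days by default. An alternative worth considering is to build the prediction frame directly from the test dates, so that each forecast corresponds to an actual test timestamp. A minimal sketch, assuming a small hypothetical index in place of `y_test.index` (the name `future_from_test` is introduced here for illustration):

```python
import pandas as pd

# Hypothetical tz-aware index standing in for y_test.index
test_index = pd.DatetimeIndex(
    ["2024-08-22", "2024-08-23", "2024-08-26"], tz="Asia/Kolkata", name="Date"
)

# Prophet expects a tz-naive 'ds' column; tz_localize(None) drops the
# timezone while keeping the local wall-clock time
future_from_test = pd.DataFrame({"ds": test_index.tz_localize(None)})

print(future_from_test)

# With a fitted model, this frame could be passed to predict, e.g.
# prophet_model.predict(future_from_test)
```

Whether this changes the reported metrics depends on the data; it is shown only as an option for aligning forecast dates with test dates.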


## 5. Preliminary Evaluation and Results Consolidation

* We calculate standard error metrics (MSE, MAE, RMSE) for the forecasts generated by the three classical models against the actual test data (`y_test`).
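
For reference, with $y_t$ the actual log return, $\hat{y}_t$ the forecast, and $n$ the number of test observations (notation introduced here for illustration), the three metrics are:

$$
\text{MSE} = \frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2, \qquad
\text{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|, \qquad
\text{RMSE} = \sqrt{\text{MSE}}
$$

MAE and RMSE are expressed in the same units as the log returns, while MSE is in squared units, which is why its values look much smaller in the results table.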

In [23]:
# Cell 6: Define Evaluation Function (Robust against NaNs)
def evaluate_model(y_true, y_pred, model_name):
    
    # 1. Synchronize the series indices and drop NaNs from both simultaneously
    # This fixes the NaN/alignment issue in evaluation.
    y_aligned = pd.concat([y_true, y_pred], axis=1).dropna()
    y_true_clean = y_aligned.iloc[:, 0].values
    y_pred_clean = y_aligned.iloc[:, 1].values
    
    # Check for empty data after dropping NaNs
    if y_true_clean.size == 0:
        print(f"Warning: Model {model_name} generated all NaNs or mismatch.")
        return {'Model': model_name, 'MSE': np.nan, 'MAE': np.nan, 'RMSE': np.nan}
        
    mse = mean_squared_error(y_true_clean, y_pred_clean)
    mae = mean_absolute_error(y_true_clean, y_pred_clean)
    rmse = np.sqrt(mse)
    
    results = {'Model': model_name, 'MSE': mse, 'MAE': mae, 'RMSE': rmse}
    return results

# Evaluate Models
arima_results = evaluate_model(y_test, arima_forecast_series, 'ARIMA')
sarima_results = evaluate_model(y_test, sarima_forecast_series, 'SARIMA')
prophet_results = evaluate_model(y_test, prophet_forecast_series, 'Prophet')

results_df = pd.DataFrame([arima_results, sarima_results, prophet_results])

print("--- Classical Model Evaluation (Log Returns) ---")
print(results_df.set_index('Model'))

--- Classical Model Evaluation (Log Returns) ---
              MSE       MAE      RMSE
Model                                
ARIMA    0.000261  0.010981  0.016141
SARIMA   0.000260  0.010954  0.016126
Prophet  0.000273  0.011414  0.016520
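
When reading an error table for a return series, it can help to have a trivial reference point. One common choice, which is not part of the original notebook, is a forecast of zero return for every period. The sketch below uses synthetic returns (`y_true` is randomly generated, not NIFTY 50 data) purely to show the mechanics; the same calculation could be applied to `y_test`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic stand-in for a test series of log returns (assumption, not real data)
y_true = rng.normal(loc=0.0, scale=0.015, size=1000)

# Naive benchmark: predict a zero return at every step
y_naive = np.zeros_like(y_true)

errors = y_true - y_naive
naive_metrics = {
    "Model": "Zero forecast",
    "MSE": float(np.mean(errors ** 2)),
    "MAE": float(np.mean(np.abs(errors))),
    "RMSE": float(np.sqrt(np.mean(errors ** 2))),
}
print(naive_metrics)
```

The dictionary has the same keys as the output of `evaluate_model`, so a row like this could be appended to `results_df` alongside the three classical models.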


## 6. Final Save

The evaluation results are saved to the `models/` directory for later consolidation and comparison in Notebook 10.

In [25]:
# Cell 7: Save Classical Model Results
results_df.to_csv(MODEL_RESULTS_PATH, index=False)

print(f"\nClassical model evaluation results saved to: {MODEL_RESULTS_PATH}")
print("Proceed to Notebook 05 for Machine Learning (KNN and SVR) implementation.")


Classical model evaluation results saved to: ../models/classical_model_results.csv
Proceed to Notebook 05 for Machine Learning (KNN and SVR) implementation.


## Summary

### What We Accomplished:

  1. **ARIMA Modeling**: Implemented automated ARIMA with optimal (p,d,q) parameter search

  2. **SARIMA Analysis**: Extended to seasonal patterns with memory-safe parameter optimization

  3. **Prophet Forecasting**: Applied Facebook's robust time series framework with timezone handling

  4. **Model Evaluation**: Calculated MSE, MAE, and RMSE with robust NaN handling

  5. **Baseline Establishment**: Created benchmarks for subsequent ML/DL model evaluation

  6. **Results Export**: Saved performance metrics for comprehensive model comparison

### Key Classical Model Insights:

  - **ARIMA Performance**: `auto_arima` selected order (2, 0, 1) by AIC minimization

  - **SARIMA Optimization**: The memory-safe subset search returned (1, 0, 1) x (0, 0, 0, 21), so the final fitted model contains no seasonal terms

  - **Prophet Robustness**: Fitted with yearly seasonality after timezone normalization

  - **Stationarity Assumption**: Log returns were modeled without differencing ($d=0$)

  - **Evaluation Framework**: NaN-safe metric calculation; test RMSE was 0.016126 (SARIMA), 0.016141 (ARIMA), and 0.016520 (Prophet)

### Technical Implementation Notes:

  - **Memory Management**: SARIMA parameter search optimized for large datasets
  
  - **Data Alignment**: Proper index synchronization for accurate evaluation
  
  - **Timezone Handling**: Prophet compatibility ensured through timezone removal
  
  - **Forecast Validation**: Integer-based out-of-sample prediction for SARIMA

### Next Steps:

**Notebook 05**: We'll advance to traditional Machine Learning models including:
- K-Nearest Neighbors (KNN) for pattern recognition
- Support Vector Regression (SVR) for non-linear relationships
- Performance comparison with classical benchmarks
- Feature importance analysis using engineered indicators

---

### *Next Notebook Preview*

Having established classical time series benchmarks, we'll now explore how traditional machine learning algorithms perform on our feature-engineered dataset, leveraging the technical indicators and lag variables created in Notebook 03.

---

#### About This Project

This notebook is part of the **Stock Price Prediction - NIFTY 50** repository, a comprehensive machine learning pipeline for predicting stock prices with techniques ranging from classical to advanced, including ARIMA, LSTM, XGBoost, and evolutionary optimization.

**Repository:** [`stock-price-prediction-nifty50`](https://github.com/prakash-ukhalkar/stock-price-prediction-nifty50)

**Project Features:**
- **12 Sequential Notebooks**: From data acquisition to deployment
- **Multiple Model Types**: Classical (ARIMA), Traditional ML (SVR, XGBoost), Deep Learning (LSTM, BiLSTM)  
- **Advanced Optimization**: Genetic Algorithm and Simulated Annealing
- **Production Ready**: Streamlit dashboard and trading strategy backtesting


#### **Author**

**Prakash Ukhalkar**  
[![GitHub](https://img.shields.io/badge/GitHub-prakash--ukhalkar-blue?style=flat&logo=github)](https://github.com/prakash-ukhalkar)

---

<div align="center">
  <sub>Built with care for the quantitative finance and data science community</sub>
</div>