## **Stock Price Prediction - NIFTY 50**

### **Notebook 06: Ensemble Machine Learning (XGBoost and Random Forest)**

[![Python](https://img.shields.io/badge/Python-3.8%2B-blue)](https://www.python.org/) [![XGBoost](https://img.shields.io/badge/XGBoost-Latest-red)](https://xgboost.readthedocs.io/) [![Scikit-Learn](https://img.shields.io/badge/Scikit--Learn-Latest-orange)](https://scikit-learn.org/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

---

**Part of the comprehensive learning series:** [Stock Price Prediction - NIFTY 50](https://github.com/prakash-ukhalkar/stock-price-prediction-nifty50)

**Learning Objectives:**
- Implement ensemble ML algorithms (XGBoost and Random Forest) for financial prediction
- Apply tree-based models that handle non-linear relationships without feature scaling
- Utilize advanced hyperparameter tuning for gradient boosting optimization
- Analyze feature importance from technical indicators and lag variables
- Establish strong ensemble baselines before deep learning implementation

**Dataset Scope:** The feature-engineered training set from Notebook 03 (`nifty50_train_features.csv`) and the held-out test set (`nifty50_test.csv`), with `Log_Return` as the prediction target.

---

* This notebook implements **Ensemble Machine Learning** algorithms, focusing on **Random Forest** and **XGBoost**. 

* These tree-based models can capture complex, non-linear relationships in noisy financial data without feature scaling, which makes them a strong benchmark for the comparative analysis.

* We utilize the feature-engineered data from Notebook 03.
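
The claim that tree-based models do not need feature scaling can be checked on toy data. The sketch below is illustrative only: it uses synthetic features (not the NIFTY 50 data) and a single `DecisionTreeRegressor`, and it rescales each column by a power of two so the comparison is not disturbed by floating-point rounding. All variable names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # synthetic features (assumption)
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2      # non-linear synthetic target

# Put the three columns on very different scales
X_scaled = X * np.array([1.0, 8.0, 1024.0])

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_scaled, y)

preds_raw = tree_raw.predict(X)
preds_scaled = tree_scaled.predict(X_scaled)

# Splits depend on the ordering of feature values, not their magnitude
same_predictions = np.allclose(preds_raw, preds_scaled)
print("Identical predictions after rescaling:", same_predictions)
```

Because each split only asks whether a value falls above or below a threshold, rescaling a column moves the threshold but leaves the resulting partition of the rows unchanged.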

## 1. Setup and Data Loading

We load the feature-engineered training data (`nifty50_train_features.csv`) and the original raw test data (`nifty50_test.csv`), which will be re-engineered here for prediction.

In [1]:
# Cell 1: Import Libraries
import pandas as pd
import numpy as np
import os
import ta # Needed for test data feature re-engineering
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Define Paths
TRAIN_FEATURES_PATH = '../data/processed/nifty50_train_features.csv'
TEST_DATA_PATH = '../data/processed/nifty50_test.csv'
MODEL_RESULTS_PATH = '../models/ml_model_results.csv'
CLASSICAL_RESULTS_PATH = '../models/classical_model_results.csv'

# Load data
df_train = pd.read_csv(TRAIN_FEATURES_PATH, index_col='Date', parse_dates=True)
df_test = pd.read_csv(TEST_DATA_PATH, index_col='Date', parse_dates=True)

TARGET_COL = 'Log_Return'

# Clean and Define X_train, y_train
y_train = df_train[TARGET_COL]
X_train = df_train.drop(columns=[TARGET_COL])
non_numeric_cols = X_train.select_dtypes(include=['object', 'category']).columns
if not non_numeric_cols.empty:
    X_train = X_train.drop(columns=non_numeric_cols)

print(f"Training Features loaded. Shape: {X_train.shape}")

Training Features loaded. Shape: (57311, 31)


## 2. Time Series Cross-Validation Setup

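We use **TimeSeriesSplit** during hyperparameter tuning so that, assuming the rows are in chronological order, every fold trains on earlier observations and validates on later ones. This keeps future information out of the training folds.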

In [2]:
# Cell 2: Initialize TimeSeriesSplit
N_SPLITS = 5 
tscv = TimeSeriesSplit(n_splits=N_SPLITS)

print(f"Initialized TimeSeriesSplit with {N_SPLITS} folds for tuning.")

Initialized TimeSeriesSplit with 5 folds for tuning.
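
To make the fold layout concrete, here is a small sketch on 12 toy rows (an assumption for illustration; the notebook itself applies the same splitter to `X_train`). Each fold's training indices end before its validation indices begin.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)   # 12 time-ordered toy rows
tscv_demo = TimeSeriesSplit(n_splits=5)

folds = list(tscv_demo.split(X_demo))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```

The training window grows with each fold while the validation window slides forward, so no validation row is ever older than the rows used to fit the model.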


## 3. Model I: Random Forest Regressor

**Explanation:** Random Forest builds an ensemble of decorrelated decision trees, each trained on a bootstrap sample of the data, and averages their predictions. Averaging reduces the variance, and therefore the overfitting, often seen in single-tree models.

In [3]:
# Cell 3: Random Forest Hyperparameter Tuning

rf_model = RandomForestRegressor(random_state=42, n_jobs=-1)

# Note: Scaling is not required for tree-based models
rf_param_grid = {
    'n_estimators': [50, 100], 
    'max_depth': [5, 10], 
    'min_samples_leaf': [5, 10]
}

rf_grid = GridSearchCV(
    estimator=rf_model,
    param_grid=rf_param_grid,
    cv=tscv, 
    scoring='neg_mean_squared_error',
    n_jobs=-1, 
    verbose=1
)

print("Starting Random Forest GridSearchCV...")
rf_grid.fit(X_train, y_train)

rf_best_model = rf_grid.best_estimator_
print("\nRandom Forest Best Parameters:", rf_grid.best_params_)
print("Random Forest Best Score (Neg MSE):", rf_grid.best_score_)

Starting Random Forest GridSearchCV...
Fitting 5 folds for each of 8 candidates, totalling 40 fits

Random Forest Best Parameters: {'max_depth': 10, 'min_samples_leaf': 5, 'n_estimators': 100}
Random Forest Best Score (Neg MSE): -2.778991730885983e-07
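
The averaging step described in the explanation above can be verified directly. This is a minimal sketch on synthetic data (an assumption, not the notebook's features): it averages the predictions of the individual trees by hand and compares the result with the forest's own output.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))                                   # synthetic features
y_demo = X_demo[:, 0] * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=25, max_depth=5, random_state=42)
forest.fit(X_demo, y_demo)

# One row of predictions per tree, then average across trees
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_preds = forest.predict(X_demo)

print("Forest prediction equals mean of trees:", np.allclose(manual_average, forest_preds))
print("Average spread between trees:", per_tree.std(axis=0).mean())
```

The spread between trees gives a rough sense of how much individual trees disagree, which is the variance that averaging smooths out.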


## 4. Model II: XGBoost Regressor

**Explanation:** XGBoost (Extreme Gradient Boosting) is a highly efficient implementation of gradient boosting. It builds decision trees sequentially, fitting each new tree to the errors of the ensemble built so far, and often yields state-of-the-art results on tabular regression tasks.

In [4]:
# Cell 4: XGBoost Hyperparameter Tuning

xgb_model = XGBRegressor(random_state=42, objective='reg:squarederror', n_jobs=-1)

xgb_param_grid = {
    'n_estimators': [50, 100], 
    'max_depth': [3, 5], 
    'learning_rate': [0.05, 0.1]
}

xgb_grid = GridSearchCV(
    estimator=xgb_model,
    param_grid=xgb_param_grid,
    cv=tscv, 
    scoring='neg_mean_squared_error',
    n_jobs=-1, 
    verbose=1
)

print("Starting XGBoost GridSearchCV...")
xgb_grid.fit(X_train, y_train)

xgb_best_model = xgb_grid.best_estimator_
print("\nXGBoost Best Parameters:", xgb_grid.best_params_)
print("XGBoost Best Score (Neg MSE):", xgb_grid.best_score_)

Starting XGBoost GridSearchCV...
Fitting 5 folds for each of 8 candidates, totalling 40 fits

XGBoost Best Parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
XGBoost Best Score (Neg MSE): -3.841978118487281e-06
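
The sequential error-correcting idea behind gradient boosting can be sketched in a few lines. Note that this is plain gradient boosting for squared error built from scikit-learn trees on synthetic data; it is a simplified illustration and not XGBoost's actual algorithm, which adds regularization and second-order gradient information. The names `learning_rate` and `n_rounds` here are local to the sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X_demo = rng.uniform(-3, 3, size=(400, 1))                    # synthetic feature
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=400)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction, then add small corrections
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    residuals = y_demo - prediction            # errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE: start={mse_history[0]:.4f}, end={mse_history[-1]:.4f}")
```

A smaller learning rate makes each correction more cautious, which is why `learning_rate` and `n_estimators` are usually tuned together, as in the grid above.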


## 5. Model Evaluation on Final Test Set

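We evaluate the tuned ensemble models on the unseen test data. As in Notebook 05, the raw test data must be re-engineered before prediction so that its feature columns match those the models were trained on. Rows that fall inside the lag and indicator warm-up windows contain NaN values and are dropped.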

In [5]:
# Cell 5: Prepare Test Data, Re-Engineer Features, and Generate Predictions

# 1. Feature Re-Engineering (MUST match Notebook 03/05 logic)
df_test_features = df_test.copy()

LAG_PERIODS = [1, 2, 3, 5, 10]
for lag in LAG_PERIODS:
    df_test_features[f'Close_Lag_{lag}'] = df_test_features['Close'].shift(lag)
    df_test_features[f'Return_Lag_{lag}'] = df_test_features['Log_Return'].shift(lag)

WINDOW_TREND = [10, 20, 50] 
for window in WINDOW_TREND:
    df_test_features[f'SMA_{window}'] = ta.trend.sma_indicator(df_test_features['Close'], window=window, fillna=False)
    df_test_features[f'EMA_{window}'] = ta.trend.ema_indicator(df_test_features['Close'], window=window, fillna=False)
macd = ta.trend.MACD(df_test_features['Close'], window_fast=12, window_slow=26, window_sign=9, fillna=False)
df_test_features['MACD_Line'] = macd.macd()
df_test_features['MACD_Signal'] = macd.macd_signal()
RSI_WINDOW = 14 
df_test_features[f'RSI_{RSI_WINDOW}'] = ta.momentum.rsi(df_test_features['Close'], window=RSI_WINDOW, fillna=False)
df_test_features['MFI'] = ta.volume.money_flow_index(df_test_features['High'], df_test_features['Low'], df_test_features['Close'], df_test_features['Volume'], window=14, fillna=False)
df_test_features['ATR'] = ta.volatility.average_true_range(df_test_features['High'], df_test_features['Low'], df_test_features['Close'], window=14, fillna=False)

df_test_features = df_test_features.dropna()

# 2. Final Data Cleanup and Alignment

y_test = df_test_features[TARGET_COL]
X_test = df_test_features.drop(columns=[TARGET_COL])

# Drop non-numeric columns from X_test, mirroring the cleanup applied to X_train
non_numeric_cols_test = X_test.select_dtypes(include=['object', 'category']).columns
if not non_numeric_cols_test.empty:
    X_test = X_test.drop(columns=non_numeric_cols_test)

# Align the order and presence of features to match the exact set used to train the models (X_train.columns)
X_test = X_test[X_train.columns]

print(f"Test data re-engineered and cleaned. Final X_test shape: {X_test.shape}")

# 3. Prediction
rf_preds = rf_best_model.predict(X_test)
xgb_preds = xgb_best_model.predict(X_test)

print("Predictions generated for Random Forest and XGBoost on the test set.")

Test data re-engineered and cleaned. Final X_test shape: (14254, 31)
Predictions generated for Random Forest and XGBoost on the test set.
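
The effect of `shift()` and rolling windows followed by `dropna()` is easy to see on a toy price series. In this sketch the prices are synthetic and a pandas rolling mean stands in for the `ta` library's SMA (both are assumptions made to keep the example dependency-free).

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=30, freq="B")
demo = pd.DataFrame({"Close": np.linspace(100.0, 129.0, 30)}, index=dates)
demo["Log_Return"] = np.log(demo["Close"] / demo["Close"].shift(1))

# Lag features: each row only sees values from earlier rows
for lag in [1, 2, 3]:
    demo[f"Return_Lag_{lag}"] = demo["Log_Return"].shift(lag)

# Rolling mean as a stand-in for ta.trend.sma_indicator
demo["SMA_10"] = demo["Close"].rolling(window=10).mean()

rows_before = len(demo)
demo_clean = demo.dropna()
rows_dropped = rows_before - len(demo_clean)
print(f"Rows before: {rows_before}, dropped as warm-up: {rows_dropped}")
```

The longest window determines how many leading rows are lost, so in the notebook the 50-period SMA and EMA set the size of the warm-up gap at the start of the test set.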


In [6]:
# Cell 6: Final Evaluation and Consolidation

def evaluate_and_consolidate(y_true, y_pred, model_name, path):
    
    # y_true and y_pred are already aligned: both derive from the same dropna()-filtered test frame.
    
    mse = mean_squared_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    
    new_results = pd.DataFrame([{'Model': model_name, 'MSE': mse, 'MAE': mae, 'RMSE': rmse}])
        
    # Load earlier results to append to: the ML results file if it exists,
    # otherwise seed from the classical results file, otherwise start empty
    if os.path.exists(path):
        results_df = pd.read_csv(path)
    elif os.path.exists(CLASSICAL_RESULTS_PATH):
        results_df = pd.read_csv(CLASSICAL_RESULTS_PATH)
    else:
        results_df = pd.DataFrame(columns=['Model', 'MSE', 'MAE', 'RMSE'])
        
    results_df = pd.concat([results_df, new_results], ignore_index=True)
    results_df = results_df.drop_duplicates(subset=['Model'], keep='last')
    results_df.to_csv(path, index=False)
    return results_df


# Evaluate and Save Results
evaluate_and_consolidate(y_test, rf_preds, 'RandomForest', MODEL_RESULTS_PATH)
final_results = evaluate_and_consolidate(y_test, xgb_preds, 'XGBoost', MODEL_RESULTS_PATH)

print("--- Ensemble ML Model Evaluation (Log Returns) on Test Set ---")
print(final_results.set_index('Model'))

print("Results consolidated in: ", MODEL_RESULTS_PATH)
print("Proceed to Notebook 07 for Deep Learning models (ANN and LSTM Basics).")

--- Ensemble ML Model Evaluation (Log Returns) on Test Set ---
                   MSE       MAE      RMSE
Model                                     
ARIMA         0.000261  0.010981  0.016141
SARIMA        0.000260  0.010954  0.016126
Prophet       0.000273  0.011414  0.016520
KNN           0.000185  0.008851  0.013611
SVR           0.000026  0.004653  0.005069
RandomForest  0.000004  0.000026  0.001980
XGBoost       0.000007  0.000278  0.002640
Results consolidated in:  ../models/ml_model_results.csv
Proceed to Notebook 07 for Deep Learning models (ANN and LSTM Basics).
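
As a reading aid for the table above, the sketch below computes the same three metrics on four made-up log returns and compares them with a naive baseline that always predicts a zero return. The numbers are invented for illustration and are unrelated to the notebook's results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([0.010, -0.020, 0.005, 0.000])   # made-up log returns
y_pred = np.array([0.008, -0.015, 0.000, 0.002])   # made-up model output

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)

# Naive baseline: always predict a zero return
mse_zero = mean_squared_error(y_true, np.zeros_like(y_true))

print(f"MSE={mse:.6e}  MAE={mae:.6f}  RMSE={rmse:.6f}")
print(f"Zero-return baseline MSE={mse_zero:.6e}")
```

Because daily log returns are small numbers, their squared errors are tiny in absolute terms, so comparing against a simple baseline like this is one way to judge whether a low MSE is meaningful.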


## Summary

### What We Accomplished:

  1. **Ensemble Model Implementation**: Applied Random Forest and XGBoost for advanced prediction

  2. **Tree-Based Advantage**: Leveraged models that handle non-linear relationships without scaling

  3. **Hyperparameter Optimization**: Used GridSearchCV with TimeSeriesSplit for robust tuning

  4. **Feature Engineering Pipeline**: Re-engineered test data to match training feature set

  5. **Performance Evaluation**: Established strong ensemble benchmarks for deep learning comparison

  6. **Model Consolidation**: Integrated results with previous classical and traditional ML models
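
The consolidation step relies on `pd.concat` followed by `drop_duplicates(keep='last')`, so re-running a model overwrites its earlier row instead of adding a second one. A minimal sketch with hypothetical model names and values:

```python
import pandas as pd

# Results already on disk (hypothetical values)
previous = pd.DataFrame([
    {"Model": "ModelA", "MSE": 0.0004},
    {"Model": "ModelB", "MSE": 0.0003},
])

# A re-run of ModelB produces a new row
rerun = pd.DataFrame([{"Model": "ModelB", "MSE": 0.0002}])

combined = pd.concat([previous, rerun], ignore_index=True)
combined = combined.drop_duplicates(subset=["Model"], keep="last")
print(combined.set_index("Model"))
```

Only the most recent row per model survives, which keeps the results file stable across repeated notebook runs.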

### Key Ensemble Learning Insights:

  - **Random Forest Performance**: Random Forest achieved the lowest test error of all models evaluated so far (MSE 0.000004, RMSE 0.001980)
  
  - **XGBoost Effectiveness**: XGBoost ranked second (MSE 0.000007, RMSE 0.002640), ahead of SVR (MSE 0.000026)
  
  - **Feature Handling**: Tree-based models naturally handled mixed-scale technical indicators
  
  - **Non-Linear Patterns**: Ensemble methods effectively modeled complex financial relationships
  
  - **Computational Efficiency**: Parallel processing enabled faster hyperparameter optimization

### Technical Implementation Notes:

  - **No Feature Scaling**: Both models were trained on unscaled features, so no scaler had to be fit or applied
  
  - **Hyperparameters Tuned**: `n_estimators` and `max_depth` for both models, `min_samples_leaf` for Random Forest, and `learning_rate` for XGBoost
  
  - **Feature Engineering**: Consistent technical indicator calculation across train/test sets
  
  - **Cross-Validation**: Maintained temporal integrity during model selection

### Model Performance Framework:

  - **Benchmark Establishment**: Created strong ensemble baselines for deep learning comparison
  
  - **Feature Importance**: Tree-based models provide interpretable feature rankings
  
  - **Robustness**: Ensemble methods showed resilience to financial data noise
  
  - **Scalability**: Each grid search ran 40 fits on 57,311 training rows
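
Feature rankings of the kind mentioned above can be read from a fitted model's `feature_importances_` attribute. The sketch below uses synthetic data with hypothetical column names; in this notebook the analogous call would presumably pair `rf_best_model.feature_importances_` with `X_train.columns`, though that step is not shown in the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X_demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=["signal", "weak", "noise"])
y_demo = 3.0 * X_demo["signal"] + 0.3 * X_demo["weak"] + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=50, random_state=7).fit(X_demo, y_demo)

# Impurity-based importances, normalized to sum to 1
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be inflated for correlated or high-cardinality features, so they are best treated as a first look rather than a definitive ranking.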

### Next Steps:

**Notebook 07**: We'll transition to Deep Learning models including:
- Artificial Neural Networks (ANN) for non-linear pattern recognition
- Basic LSTM implementation for sequential modeling
- Advanced neural architectures for time series prediction
- Comparison of deep learning with ensemble benchmarks

---

### *Next Notebook Preview*

Having established strong ensemble learning benchmarks, we'll now explore deep learning's potential to capture even more complex temporal patterns and non-linear relationships in our financial time series data.

---

#### About This Project

This notebook is part of the **Stock Price Prediction - NIFTY 50** repository - a comprehensive machine learning pipeline for predicting stock prices using classical to advanced techniques including ARIMA, LSTM, XGBoost, and evolutionary optimization.

**Repository:** [`stock-price-prediction-nifty50`](https://github.com/prakash-ukhalkar/stock-price-prediction-nifty50)

**Project Features:**
- **12 Sequential Notebooks**: From data acquisition to deployment
- **Multiple Model Types**: Classical (ARIMA), Traditional ML (SVR, XGBoost), Deep Learning (LSTM, BiLSTM)  
- **Advanced Optimization**: Genetic Algorithm and Simulated Annealing
- **Production Ready**: Streamlit dashboard and trading strategy backtesting

**Notebook Sequence:**
1. COMPLETE - Data Acquisition and Preprocessing
2. COMPLETE - EDA and Time Series Foundations
3. COMPLETE - Feature Engineering and Technical Analysis
4. COMPLETE - Classical Models (ARIMA, SARIMA, Prophet)
5. COMPLETE - Traditional ML (KNN and SVR)
6. COMPLETE - **Ensemble ML (XGBoost, Random Forest)** (Current)
7. NEXT - Deep Learning Basics (ANN, LSTM)
8. PENDING - Advanced Deep Learning (BiLSTM, GRU)

#### **Author**

**Prakash Ukhalkar**  
[![GitHub](https://img.shields.io/badge/GitHub-prakash--ukhalkar-blue?style=flat&logo=github)](https://github.com/prakash-ukhalkar)

---

<div align="center">
  <sub>Built with care for the quantitative finance and data science community</sub>
</div>