## **Stock Price Prediction - NIFTY 50**

### **Notebook 05: Traditional Machine Learning (KNN and SVR)**

[![Python](https://img.shields.io/badge/Python-3.8%2B-blue)](https://www.python.org/) [![Scikit-Learn](https://img.shields.io/badge/Scikit--Learn-Latest-orange)](https://scikit-learn.org/) [![Pandas](https://img.shields.io/badge/Pandas-Latest-green)](https://pandas.pydata.org/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

---

**Part of the comprehensive learning series:** [Stock Price Prediction - NIFTY 50](https://github.com/prakash-ukhalkar/stock-price-prediction-nifty50)

**Learning Objectives:**
- Implement traditional ML algorithms (KNN and SVR) for financial prediction
- Apply Time Series Cross-Validation for robust hyperparameter tuning
- Utilize feature engineering from technical indicators and lag variables
- Compare supervised learning approach with classical time series methods
- Establish ML baselines for subsequent deep learning models

**Dataset Scope:** Apply supervised learning to feature-engineered training data. Predict log returns using KNN and SVR.

---

* This notebook implements two foundational machine learning algorithms, **k-Nearest Neighbors (k-NN)** and **Support Vector Regression (SVR)**, to predict the next day's **Log Return**.

* This shifts our approach from statistical time series modeling to **supervised learning**, utilizing the rich set of technical indicators and lagged variables created in Notebook 03. 

* We use **Time Series Cross-Validation (TSCV)** for robust model tuning.

## 1. Setup and Data Loading

We load the clean, feature-engineered training data (`nifty50_train_features.csv`) and the original raw test data (`nifty50_test.csv`). Note that the raw test data will require feature re-engineering (Step 6).

In [7]:
# Cell 1: Import Libraries
import pandas as pd
import numpy as np
import os
import ta # Needed here for test data feature re-engineering
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Define Paths
TRAIN_FEATURES_PATH = '../data/processed/nifty50_train_features.csv'
TEST_DATA_PATH = '../data/processed/nifty50_test.csv'
MODEL_RESULTS_PATH = '../models/ml_model_results.csv'
CLASSICAL_RESULTS_PATH = '../models/classical_model_results.csv' 

# Load data
df_train = pd.read_csv(TRAIN_FEATURES_PATH, index_col='Date', parse_dates=True)
# Load RAW test data (will be engineered later)
df_test = pd.read_csv(TEST_DATA_PATH, index_col='Date', parse_dates=True)

print(f"Feature-engineered Training Data loaded. Shape: {df_train.shape}")
print(f"Raw Test Data loaded. Shape: {df_test.shape}")

Feature-engineered Training Data loaded. Shape: (57311, 33)
Raw Test Data loaded. Shape: (14340, 12)


## 2. Data Preparation and Target Definition (Training Set)

The target variable ($y$) is the **Log Return**. Before scaling, we drop non-numeric columns (here, `Symbol`) from $X_{train}$; otherwise the scaler raises `ValueError: could not convert string to float`.

In [8]:
# Cell 2: Define X and y for Training (WITH TYPE CHECK AND CLEANUP)

TARGET_COL = 'Log_Return'
y_train = df_train[TARGET_COL]

# X includes all other features (OHLCV, Lags, TAs)
X_train = df_train.drop(columns=[TARGET_COL])

print(f"X_train initial shape: {X_train.shape}")

# --- CRITICAL FIX: Drop Non-Numeric Columns (Fixes ValueError: could not convert string to float) ---
non_numeric_cols = X_train.select_dtypes(include=['object', 'category']).columns

if not non_numeric_cols.empty:
    print(f"Dropping non-numeric columns from X_train: {non_numeric_cols.tolist()}")
    X_train = X_train.drop(columns=non_numeric_cols)
else:
    print("No non-numeric columns detected in X_train.")

print(f"\nX_train final shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")

X_train initial shape: (57311, 32)
Dropping non-numeric columns from X_train: ['Symbol']

X_train final shape: (57311, 31)
y_train shape: (57311,)
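
The same cleanup can also be phrased as "keep numeric columns only", which additionally filters out datetime columns that an `object`/`category` filter would let through. The following is a minimal sketch on a hypothetical three-row frame; the `Listed` column and all values are made up for illustration and are not part of the project data.

```python
import pandas as pd

# Hypothetical miniature of the training frame; values are illustrative only.
df = pd.DataFrame({
    "Symbol": ["AAA", "AAA", "BBB"],                 # string identifier
    "Close": [100.0, 101.5, 250.0],                  # numeric feature
    "Listed": pd.to_datetime(["2020-01-01"] * 3),    # datetime column
    "Log_Return": [0.0, 0.0149, -0.002],             # target
})

X = df.drop(columns=["Log_Return"])

# Keep numeric columns only and report everything that was filtered out.
X_numeric = X.select_dtypes(include="number")
dropped = [c for c in X.columns if c not in X_numeric.columns]

print("Dropped:", dropped)
print("Kept:", X_numeric.columns.tolist())
```

Reporting the dropped columns, as Cell 2 does, keeps the cleanup visible: a feature that was silently read as text would otherwise disappear without notice.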


## 3. Time Series Cross-Validation Setup

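We use **TimeSeriesSplit** to prevent **data leakage**: in every fold, the model is tuned on rows that come strictly before the validation rows. The splitter works on row position, so this guarantee holds only if the rows are sorted chronologically.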

In [9]:
# Cell 3: Initialize TimeSeriesSplit
N_SPLITS = 5 
tscv = TimeSeriesSplit(n_splits=N_SPLITS)

print(f"Initialized TimeSeriesSplit with {N_SPLITS} folds for tuning.")

Initialized TimeSeriesSplit with 5 folds for tuning.
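
To make the fold structure concrete, here is a small sketch that prints the splits for twelve synthetic rows. The array `X_demo` is a stand-in for chronologically sorted observations and is not taken from the project data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve synthetic rows standing in for chronologically sorted observations.
X_demo = np.arange(12).reshape(-1, 1)

folds = list(TimeSeriesSplit(n_splits=5).split(X_demo))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    # The training window always starts at row 0 and grows with each fold.
    print(f"Fold {i}: train rows 0-{train_idx[-1]}, validation rows {val_idx.tolist()}")
```

Each training window ends right before its validation window begins. Because the split is positional, a quick check such as `df_train.index.is_monotonic_increasing` can confirm that row order matches time order before tuning.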


## 4. Model I: k-Nearest Neighbors Regressor (k-NN)

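We wrap the `MinMaxScaler` and the k-NN regressor in a `Pipeline`. k-NN is sensitive to feature scale, and inside `GridSearchCV` the pipeline refits the scaler on each training fold, so validation rows never influence the scaling.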
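
The scale sensitivity can be illustrated with a toy example. The sketch below uses four hypothetical rows of `[Close, RSI]` (values made up for illustration) and finds the nearest neighbour of the first row before and after min-max scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows: [Close price, RSI]. Values are made up for illustration.
X_demo = np.array([
    [18000.0, 30.0],   # row 0: query point
    [18050.0, 75.0],   # row 1: close in price, far in RSI
    [18200.0, 32.0],   # row 2: further in price, almost the same RSI
    [19000.0, 80.0],   # row 3: sets the upper end of both ranges
])

def nearest_to_first(X):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(dists)) + 1

raw_nn = nearest_to_first(X_demo)
scaled_nn = nearest_to_first(MinMaxScaler().fit_transform(X_demo))
print("Nearest neighbour on raw features:", raw_nn)
print("Nearest neighbour after min-max scaling:", scaled_nn)
```

On raw values the price column, which spans a thousand units, decides the ranking; once both columns are mapped to [0, 1], the RSI difference carries comparable weight and a different row becomes the nearest neighbour.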

In [10]:
# Cell 4: KNN Hyperparameter Tuning with GridSearchCV and TSCV

knn_pipeline = Pipeline([('scaler', MinMaxScaler()), ('knn', KNeighborsRegressor())])

knn_param_grid = {
    'knn__n_neighbors': [5, 10, 20], 
    'knn__weights': ['uniform', 'distance'], 
    'knn__p': [1, 2] 
}

knn_grid = GridSearchCV(
    estimator=knn_pipeline,
    param_grid=knn_param_grid,
    cv=tscv, 
    scoring='neg_mean_squared_error',
    n_jobs=-1, 
    verbose=1
)

print("Starting KNN GridSearchCV...")
knn_grid.fit(X_train, y_train)

knn_best_model = knn_grid.best_estimator_
print("\nKNN Best Parameters:", knn_grid.best_params_)
print("KNN Best Score (Neg MSE):", knn_grid.best_score_)


Starting KNN GridSearchCV...
Fitting 5 folds for each of 12 candidates, totalling 60 fits

KNN Best Parameters: {'knn__n_neighbors': 10, 'knn__p': 1, 'knn__weights': 'distance'}
KNN Best Score (Neg MSE): -0.00022666404417937333
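
`GridSearchCV` maximizes its score, which is why the MSE is reported with a negative sign. A short sketch of converting the printed score into an RMSE, which is in the same units as the log returns:

```python
import math

# Best cross-validated score printed by the KNN grid search above (negated MSE).
knn_best_neg_mse = -0.00022666404417937333

# Flip the sign to recover the MSE, then take the square root for the RMSE.
knn_cv_rmse = math.sqrt(-knn_best_neg_mse)
print(f"KNN cross-validated RMSE: {knn_cv_rmse:.5f}")
```

This gives a cross-validated RMSE of roughly 0.015. The per-candidate scores behind the winning configuration are available in `knn_grid.cv_results_`.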



## 5. Model II: Support Vector Regression (SVR)

SVR with an RBF kernel can capture non-linear relationships between the features and the target. We tune the regularization strength (`C`), the width of the error-insensitive margin (`epsilon`), and the kernel coefficient (`gamma`).
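
The role of `epsilon` is easiest to see from the loss that SVR minimizes: residuals smaller than `epsilon` cost nothing, and larger ones are penalized linearly. The sketch below implements that per-sample loss by hand; the helper `epsilon_insensitive_loss` and the sample returns are hypothetical and only serve as illustration.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample SVR loss: zero inside the epsilon tube, linear outside it."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)

# Hypothetical daily log returns and a constant zero prediction.
y_true = np.array([0.004, -0.012, 0.020, -0.001])
y_pred = np.zeros_like(y_true)

for eps in (0.01, 0.1):
    loss = epsilon_insensitive_loss(y_true, y_pred, eps)
    print(f"epsilon={eps}: loss per sample = {np.round(loss, 3).tolist()}")
```

For returns of this size, `epsilon=0.1` makes the tube wider than every residual, so even a constant prediction incurs no loss, while `epsilon=0.01` still penalizes the larger moves. This is one reason to choose the `epsilon` grid relative to the scale of the target.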

In [11]:
# Cell 5: SVR Hyperparameter Tuning with GridSearchCV and TSCV

svr_pipeline = Pipeline([('scaler', MinMaxScaler()), ('svr', SVR())])

svr_param_grid = {
    'svr__kernel': ['rbf'], 
    'svr__C': [0.1, 1, 10], 
    'svr__epsilon': [0.01, 0.1], 
    'svr__gamma': ['scale', 0.01]
}

svr_grid = GridSearchCV(
    estimator=svr_pipeline,
    param_grid=svr_param_grid,
    cv=tscv,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

print("Starting SVR GridSearchCV...")
svr_grid.fit(X_train, y_train)

svr_best_model = svr_grid.best_estimator_
print("\nSVR Best Parameters:", svr_grid.best_params_)
print("SVR Best Score (Neg MSE):", svr_grid.best_score_)


Starting SVR GridSearchCV...
Fitting 5 folds for each of 12 candidates, totalling 60 fits

SVR Best Parameters: {'svr__C': 1, 'svr__epsilon': 0.01, 'svr__gamma': 0.01, 'svr__kernel': 'rbf'}
SVR Best Score (Neg MSE): -1.1234023823023779e-05



## 6. Model Evaluation on Final Test Set

We evaluate the optimized models on the unseen test data. **CRITICAL FIX:** Because the raw test data (`df_test`) lacks the features engineered in Notebook 03, we must re-engineer and align the test features here to match the columns used during model training.
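
The alignment step relies on every training column being present in the re-engineered test frame. A minimal sketch of a stricter variant follows; the helper name `align_features` and the toy frames are hypothetical. It reports all missing columns at once before reordering.

```python
import pandas as pd

def align_features(X_test, train_columns):
    """Return X_test restricted to train_columns, in the same order.

    Raises a KeyError that names every missing column.
    """
    missing = [c for c in train_columns if c not in X_test.columns]
    if missing:
        raise KeyError(f"Test data is missing features: {missing}")
    return X_test.loc[:, list(train_columns)]

# Hypothetical frames: the test frame has an extra column and a different order.
train_cols = ["Close", "SMA_10", "RSI_14"]
X_test_demo = pd.DataFrame({
    "RSI_14": [55.0], "Extra": [1.0], "Close": [100.0], "SMA_10": [99.0],
})

aligned = align_features(X_test_demo, train_cols)
print(aligned.columns.tolist())
```

Extra columns are dropped and the order follows the training set, which matters because the fitted scaler and models expect the columns in the positions they saw during training.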

In [12]:
# Cell 6: Prepare Test Data, Re-Engineer Features, and Generate Predictions

# 1. Feature Re-Engineering (MUST match Notebook 03 logic)
df_test_features = df_test.copy()

LAG_PERIODS = [1, 2, 3, 5, 10]
for lag in LAG_PERIODS:
    df_test_features[f'Close_Lag_{lag}'] = df_test_features['Close'].shift(lag)
    df_test_features[f'Return_Lag_{lag}'] = df_test_features['Log_Return'].shift(lag)

WINDOW_TREND = [10, 20, 50] 
for window in WINDOW_TREND:
    df_test_features[f'SMA_{window}'] = ta.trend.sma_indicator(df_test_features['Close'], window=window, fillna=False)
    df_test_features[f'EMA_{window}'] = ta.trend.ema_indicator(df_test_features['Close'], window=window, fillna=False)
macd = ta.trend.MACD(df_test_features['Close'], window_fast=12, window_slow=26, window_sign=9, fillna=False)
df_test_features['MACD_Line'] = macd.macd()
df_test_features['MACD_Signal'] = macd.macd_signal()
RSI_WINDOW = 14 
df_test_features[f'RSI_{RSI_WINDOW}'] = ta.momentum.rsi(df_test_features['Close'], window=RSI_WINDOW, fillna=False)
df_test_features['MFI'] = ta.volume.money_flow_index(df_test_features['High'], df_test_features['Low'], df_test_features['Close'], df_test_features['Volume'], window=14, fillna=False)
df_test_features['ATR'] = ta.volatility.average_true_range(df_test_features['High'], df_test_features['Low'], df_test_features['Close'], window=14, fillna=False)

df_test_features = df_test_features.dropna()

# 2. Final Data Cleanup and Alignment

# Extract X_test and y_test from the newly featured data
y_test = df_test_features[TARGET_COL]
X_test = df_test_features.drop(columns=[TARGET_COL])

# Drop non-numeric columns from X_test, mirroring the training cleanup in Cell 2
non_numeric_cols_test = X_test.select_dtypes(include=['object', 'category']).columns
if not non_numeric_cols_test.empty:
    X_test = X_test.drop(columns=non_numeric_cols_test)

# Align the order and presence of features to match the exact set used to train the models (X_train.columns)
X_test = X_test[X_train.columns]

print(f"Test data re-engineered and cleaned. Final X_test shape: {X_test.shape}")

# 3. Prediction
knn_preds = knn_best_model.predict(X_test)
svr_preds = svr_best_model.predict(X_test)

print("Predictions generated for KNN and SVR on the test set.")

Test data re-engineered and cleaned. Final X_test shape: (14254, 31)
Predictions generated for KNN and SVR on the test set.
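
The data carries a `Symbol` column, so the frame holds more than one stock. A plain `shift` then lets the first row of one symbol inherit values from the last row of the previous one. Whether Notebook 03 groups by symbol is not shown here, so the following is only a sketch of the difference, on a hypothetical two-symbol frame with made-up values.

```python
import pandas as pd

# Hypothetical two-symbol frame; values are illustrative only.
df = pd.DataFrame({
    "Symbol": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Close": [100.0, 101.0, 102.0, 500.0, 505.0, 510.0],
})

# Plain shift: the first BBB row inherits the last AAA close.
df["Close_Lag_1_plain"] = df["Close"].shift(1)

# Grouped shift: each symbol starts with NaN, so no value crosses the boundary.
df["Close_Lag_1_grouped"] = df.groupby("Symbol")["Close"].shift(1)

print(df)
```

Whichever variant is used, the test features need to follow the same logic as the training features, since the models were fitted on the training version.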


In [13]:
# Cell 7: Final Evaluation and Consolidation

def evaluate_and_consolidate(y_true, y_pred, model_name, path):
    
    # Synchronization is inherent here as X_test/y_test were cleaned in step 6.
    
    mse = mean_squared_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    
    new_results = pd.DataFrame([{'Model': model_name, 'MSE': mse, 'MAE': mae, 'RMSE': rmse}])
        
    # Load previous results (from classical models) and append new results
    if os.path.exists(path):
        results_df = pd.read_csv(path)
    elif os.path.exists(CLASSICAL_RESULTS_PATH):
        # Fallback to load classical results if the ML file doesn't exist yet
        results_df = pd.read_csv(CLASSICAL_RESULTS_PATH)
    else:
        results_df = pd.DataFrame(columns=['Model', 'MSE', 'MAE', 'RMSE'])
        
    results_df = pd.concat([results_df, new_results], ignore_index=True)
    results_df = results_df.drop_duplicates(subset=['Model'], keep='last')
    results_df.to_csv(path, index=False)
    return results_df


# Evaluate and Save Results
evaluate_and_consolidate(y_test, knn_preds, 'KNN', MODEL_RESULTS_PATH)
final_results = evaluate_and_consolidate(y_test, svr_preds, 'SVR', MODEL_RESULTS_PATH)

print("--- Traditional ML Model Evaluation (Log Returns) on Test Set ---")
print(final_results.set_index('Model'))

print("Results consolidated in: ", MODEL_RESULTS_PATH)
print("Proceed to Notebook 06 for Ensemble ML models (XGBoost/Random Forest).")

--- Traditional ML Model Evaluation (Log Returns) on Test Set ---
              MSE       MAE      RMSE
Model                                
ARIMA    0.000261  0.010981  0.016141
SARIMA   0.000260  0.010954  0.016126
Prophet  0.000273  0.011414  0.016520
KNN      0.000185  0.008851  0.013611
SVR      0.000026  0.004653  0.005069
Results consolidated in:  ../models/ml_model_results.csv
Proceed to Notebook 06 for Ensemble ML models (XGBoost/Random Forest).
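
For reference, the three metrics in the table are related in a simple way. The sketch below computes them by hand on hypothetical returns and predictions (values made up for illustration) and checks them against the scikit-learn helpers used in Cell 7.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical log returns and predictions, for illustration only.
y_true = np.array([0.010, -0.005, 0.002, -0.012])
y_pred = np.array([0.008, -0.001, 0.000, -0.010])

errors = y_true - y_pred
mse_manual = np.mean(errors ** 2)         # mean of squared errors
mae_manual = np.mean(np.abs(errors))      # mean of absolute errors
rmse_manual = np.sqrt(mse_manual)         # same units as the log returns

# The scikit-learn helpers should agree with the manual values.
assert np.isclose(mse_manual, mean_squared_error(y_true, y_pred))
assert np.isclose(mae_manual, mean_absolute_error(y_true, y_pred))
print(f"MSE={mse_manual:.6f}  MAE={mae_manual:.6f}  RMSE={rmse_manual:.6f}")
```

The same relation holds in the results table: the RMSE column is the square root of the MSE column, up to the rounding of the displayed values.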


## Summary

### What We Accomplished:

  1. **Supervised Learning Transition**: Shifted from statistical time series models to an ML-based approach using engineered features

  2. **KNN Implementation**: Tuned `n_neighbors`, `weights`, and `p` by grid search; the selected model uses 10 neighbors, Manhattan distance (`p=1`), and distance weighting

  3. **SVR Modeling**: Tuned `C`, `epsilon`, and `gamma` for an RBF kernel; the selected model uses `C=1`, `epsilon=0.01`, and `gamma=0.01`

  4. **Time Series Cross-Validation**: Used TSCV to prevent data leakage during hyperparameter search

  5. **Feature Engineering Pipeline**: Re-engineered test data features to match training set

  6. **Performance Evaluation**: Reported MSE, MAE, and RMSE on the held-out test set

### Key Machine Learning Insights:

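  - **Feature Engineering Value**: In the consolidated results table, both feature-based models score lower test error than the classical models: an MSE of 0.000185 (KNN) and 0.000026 (SVR), against 0.000260 to 0.000273 for ARIMA, SARIMA, and Prophet
  
  - **KNN Performance**: Distance-weighted k-NN with 10 neighbors reached a test RMSE of 0.013611
  
  - **SVR Effectiveness**: RBF-kernel SVR had the lowest test error of the five models, with an RMSE of 0.005069
  
  - **Cross-Validation**: TSCV kept every validation fold later in row order than its training fold during model selection
  
  - **Scaling Importance**: Normalization is critical for distance-based algorithms such as k-NN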

### Technical Implementation Notes:

  - **Pipeline Architecture**: Integrated scaling and modeling for clean workflow
  
  - **Hyperparameter Tuning**: GridSearchCV with temporal splits for robust optimization
  
  - **Data Alignment**: Proper synchronization between predictions and test targets
  
  - **Feature Consistency**: Ensured test data features match training features exactly

### Next Steps:

**Notebook 06**: We'll advance to ensemble Machine Learning models including:
- XGBoost for gradient boosting performance
- Random Forest for ensemble learning
- Advanced hyperparameter optimization
- Feature importance analysis and interpretation

---

### *Next Notebook Preview*

Building on traditional ML foundations, we'll explore ensemble methods that combine multiple learners for enhanced prediction accuracy, representing the next evolution in our modeling progression toward deep learning.

---

#### About This Project

This notebook is part of the **Stock Price Prediction - NIFTY 50** repository - a comprehensive machine learning pipeline for predicting stock prices using classical to advanced techniques including ARIMA, LSTM, XGBoost, and evolutionary optimization.

**Repository:** [`stock-price-prediction-nifty50`](https://github.com/prakash-ukhalkar/stock-price-prediction-nifty50)

**Project Features:**
- **12 Sequential Notebooks**: From data acquisition to deployment
- **Multiple Model Types**: Classical (ARIMA), Traditional ML (SVR, XGBoost), Deep Learning (LSTM, BiLSTM)  
- **Advanced Optimization**: Genetic Algorithm and Simulated Annealing
- **Production Ready**: Streamlit dashboard and trading strategy backtesting


#### **Author**

**Prakash Ukhalkar**  
[![GitHub](https://img.shields.io/badge/GitHub-prakash--ukhalkar-blue?style=flat&logo=github)](https://github.com/prakash-ukhalkar)

---

<div align="center">
  <sub>Built with care for the quantitative finance and data science community</sub>
</div>