## **Stock Price Prediction - NIFTY 50**

### **Notebook 03: Feature Engineering and Technical Analysis**

[![Python](https://img.shields.io/badge/Python-3.8%2B-blue)](https://www.python.org/) [![Pandas](https://img.shields.io/badge/Pandas-Latest-green)](https://pandas.pydata.org/) [![TA-Lib](https://img.shields.io/badge/TA--Lib-Latest-purple)](https://github.com/mrjbq7/ta-lib) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

---

**Part of the comprehensive learning series:** [Stock Price Prediction - NIFTY 50](https://github.com/prakash-ukhalkar/stock-price-prediction-nifty50)

**Learning Objectives:**
- Transform raw OHLCV data into sophisticated technical indicators
- Create lagged features for time series prediction
- Generate momentum indicators (RSI, MACD) and trend indicators (MA, EMA)
- Implement volume and volatility-based features
- Prepare feature-rich dataset for machine learning models

**Dataset Scope:** Engineer features from training data. Create technical indicators and lag variables for modeling.

---


## 1. Setup and Data Loading

* We load the clean training data (`nifty50_train.csv`) generated in Notebook 02. In this notebook, features are engineered on the training set only, so that no information from the test set can leak into the features used for model fitting (**data leakage**).

In [1]:
# Cell 1: Import Libraries (install the 'ta' package beforehand if necessary)
import pandas as pd
import numpy as np
import ta # For Technical Analysis
import os
import sys

# Set path to be able to import custom utility functions from src/
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..', 'src')))
# Assuming a basic __init__.py in src/features for now. 
# For simplicity, TAs are generated directly using the 'ta' library here.

TRAIN_DATA_PATH = '../data/processed/nifty50_train.csv'

# Load the training dataset
df_train = pd.read_csv(TRAIN_DATA_PATH, index_col='Date', parse_dates=True)

print(f"Training data loaded successfully. Shape: {df_train.shape}")
print("Initial Columns: ", df_train.columns.tolist())

Training data loaded successfully. Shape: (57360, 12)
Initial Columns:  ['Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits', 'Symbol', 'Log_Return', 'Simple_Return', 'Price_Change', 'Price_Change_Pct']


## 2. Feature Engineering Step I: Lag Features

**Fundamental Concept:** 

  * For time series forecasting, the most powerful predictors of the future price are often the price and returns from the immediate past. 

  * **Lag features** transform the time series problem into a supervised learning problem.

  * We create lags for the Closing Price and Log Returns for various steps back to capture short-term memory.

In [2]:
# Cell 2: Create Lagged Price Features
LAG_PERIODS = [1, 2, 3, 5, 10]

for lag in LAG_PERIODS:
    # Lagged Closing Price: Close from 'lag' rows earlier (shift is row-based, not calendar-based)
    df_train[f'Close_Lag_{lag}'] = df_train['Close'].shift(lag)

    # Lagged Log Returns: Log_Return from 'lag' rows earlier
    df_train[f'Return_Lag_{lag}'] = df_train['Log_Return'].shift(lag)
    
print(f"Created lagged features for periods: {LAG_PERIODS}")
print(f"Current columns: {df_train.shape[1]}")

Created lagged features for periods: [1, 2, 3, 5, 10]
Current columns: 22
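
The loaded frame contains a `Symbol` column and repeated dates, so a plain `.shift(lag)` takes the value from the previous row, which may belong to a different stock. The following is a minimal sketch of lagging within each symbol instead. The helper name `add_lags` and the toy tickers are hypothetical and not part of the repository; the sketch assumes rows are already in chronological order within each symbol.

```python
import pandas as pd

def add_lags(df, columns, lags, group_col="Symbol"):
    """Return a copy of df with lagged columns computed within each group.

    Hypothetical helper for illustration; assumes rows are already sorted
    chronologically within each group.
    """
    out = df.copy()
    for col in columns:
        for lag in lags:
            # groupby(...).shift keeps the original row order and index
            out[f"{col}_Lag_{lag}"] = df.groupby(group_col, sort=False)[col].shift(lag)
    return out

# Toy data: two made-up tickers interleaved by date
toy = pd.DataFrame(
    {
        "Symbol": ["AAA", "BBB", "AAA", "BBB", "AAA", "BBB"],
        "Close": [10.0, 100.0, 11.0, 101.0, 12.0, 102.0],
    },
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-02",
         "2020-01-02", "2020-01-03", "2020-01-03"]
    ),
)
toy.index.name = "Date"

lagged = add_lags(toy, columns=["Close"], lags=[1])  # lag within each ticker
naive = toy["Close"].shift(1)                        # lag across all rows

print(lagged)
print(naive)
```

In the toy frame, the naive shift gives the first `BBB` row the close of `AAA`, while the grouped version leaves the first row of each ticker as `NaN`.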


## 3. Feature Engineering Step II: Technical Indicators (TA)

**Research Objective:** 

  * The research project explicitly targets indicators such as **Moving Averages (MA)**, **Relative Strength Index (RSI)**, and **Volume**. 
  
  * Technical indicators act as secondary features that summarize price and volume action, providing momentum, volatility, and trend signals for the machine learning models.

  * We use the dedicated `ta` Python library to generate industry-standard indicators efficiently.

### 3.1 Trend Indicators (Moving Averages)

* Moving Averages smooth the price data to identify the trend direction. 

* We include Simple Moving Average (SMA) and Exponential Moving Average (EMA).

In [3]:
# Cell 3: Trend Indicators
WINDOW_TREND = [10, 20, 50] # Short, Medium, Long-term windows

for window in WINDOW_TREND:
    # Simple Moving Average (SMA)
    df_train[f'SMA_{window}'] = ta.trend.sma_indicator(df_train['Close'], window=window, fillna=False)
    # Exponential Moving Average (EMA)
    df_train[f'EMA_{window}'] = ta.trend.ema_indicator(df_train['Close'], window=window, fillna=False)
    
# Moving Average Convergence Divergence (MACD) 
macd = ta.trend.MACD(df_train['Close'], window_fast=12, window_slow=26, window_sign=9, fillna=False)
df_train['MACD_Line'] = macd.macd()
df_train['MACD_Signal'] = macd.macd_signal()

print("Trend Indicators (MA, EMA, MACD) created.")

Trend Indicators (MA, EMA, MACD) created.
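
To make the trend indicators less of a black box, here is a minimal plain-pandas sketch of what SMA, EMA, and MACD compute. It is not the `ta` implementation: the function names are hypothetical, the EMA assumes the common smoothing factor `alpha = 2 / (window + 1)`, and warm-up handling may differ from the library's output.

```python
import pandas as pd

def sma(close, window):
    # Simple moving average: unweighted mean of the last `window` values
    return close.rolling(window=window).mean()

def ema(close, window):
    # Exponential moving average with alpha = 2 / (window + 1);
    # adjust=False gives the recursive form y_t = alpha * x_t + (1 - alpha) * y_{t-1}
    return close.ewm(span=window, adjust=False).mean()

def macd(close, fast=12, slow=26, signal=9):
    # MACD line: fast EMA minus slow EMA; signal line: EMA of the MACD line
    line = ema(close, fast) - ema(close, slow)
    signal_line = line.ewm(span=signal, adjust=False).mean()
    return line, signal_line

# Toy series with a steady uptrend
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

sma_3 = sma(close, 3)
ema_3 = ema(close, 3)
macd_line, macd_signal = macd(close, fast=2, slow=4, signal=2)

print(pd.DataFrame({"Close": close, "SMA_3": sma_3, "EMA_3": ema_3, "MACD": macd_line}))
```

On a steadily rising series the fast EMA sits above the slow EMA, so the MACD line is positive after the first row.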


### 3.2 Momentum Indicators (RSI)

* Momentum indicators measure the speed and change of price movements. 

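* The **Relative Strength Index (RSI)** is one of the most widely used momentum oscillators. It ranges from 0 to 100; readings above 70 are conventionally interpreted as overbought and readings below 30 as oversold.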

In [4]:
# Cell 4: Momentum Indicators
RSI_WINDOW = 14 # Standard 14-day RSI
df_train[f'RSI_{RSI_WINDOW}'] = ta.momentum.rsi(df_train['Close'], window=RSI_WINDOW, fillna=False)

print("Momentum Indicators (RSI) created.")

Momentum Indicators (RSI) created.
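
For intuition about what the RSI measures, the sketch below computes it in plain pandas as `100 - 100 / (1 + RS)`, where `RS` is the ratio of smoothed average gains to smoothed average losses. It assumes Wilder-style smoothing (`alpha = 1 / window`); the function name is hypothetical and the values are not guaranteed to match the `ta` library exactly.

```python
import pandas as pd

def rsi(close, window=14):
    """RSI sketch with Wilder-style smoothing (an assumption, see lead-in)."""
    delta = close.diff()
    gain = delta.clip(lower=0)      # upward moves, 0 otherwise
    loss = (-delta).clip(lower=0)   # downward moves as positive numbers
    avg_gain = gain.ewm(alpha=1 / window, min_periods=window, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1 / window, min_periods=window, adjust=False).mean()
    rs = avg_gain / avg_loss        # inf when there were no losses
    return 100 - 100 / (1 + rs)

rising = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
falling = pd.Series([15.0, 14.0, 13.0, 12.0, 11.0, 10.0])
mixed = pd.Series([10.0, 11.0, 10.5, 11.5, 11.0, 12.0, 11.8])

rsi_up = rsi(rising, window=3)
rsi_down = rsi(falling, window=3)
rsi_mixed = rsi(mixed, window=3)

print(rsi_up.iloc[-1], rsi_down.iloc[-1], rsi_mixed.iloc[-1])
```

A series that only rises saturates at 100 and one that only falls saturates at 0, which is why the extremes are read as overbought and oversold.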


### 3.3 Volume and Volatility Indicators

* Volume provides context on the strength of price movements. 

* Volatility indicators (like ATR) measure price fluctuation.

In [5]:
# Cell 5: Volume and Volatility Indicators

# Volume-based: Money Flow Index (MFI) 
df_train['MFI'] = ta.volume.money_flow_index(df_train['High'], df_train['Low'], df_train['Close'], df_train['Volume'], window=14, fillna=False)

# Volatility: Average True Range (ATR)
df_train['ATR'] = ta.volatility.average_true_range(df_train['High'], df_train['Low'], df_train['Close'], window=14, fillna=False)

print("Volume (MFI) and Volatility (ATR) Indicators created.")

Volume (MFI) and Volatility (ATR) Indicators created.
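
As an illustration of the volatility feature, the sketch below computes the true range and a Wilder-smoothed ATR in plain pandas. The true range is the largest of `High - Low`, `|High - previous Close|`, and `|Low - previous Close|`. The function names and toy prices are hypothetical, and the smoothing and warm-up choices are assumptions that may differ from the `ta` library.

```python
import pandas as pd

def true_range(high, low, close):
    prev_close = close.shift(1)
    candidates = pd.concat(
        [high - low, (high - prev_close).abs(), (low - prev_close).abs()],
        axis=1,
    )
    # The first row has no previous close; max() skips the NaN candidates
    return candidates.max(axis=1)

def atr(high, low, close, window=14):
    # Wilder-style smoothing of the true range (alpha = 1 / window)
    tr = true_range(high, low, close)
    return tr.ewm(alpha=1 / window, min_periods=window, adjust=False).mean()

# Toy prices with gaps between one bar's close and the next bar's range
high = pd.Series([10.0, 12.0, 11.0, 15.0])
low = pd.Series([9.0, 10.0, 10.0, 12.0])
close = pd.Series([9.5, 11.0, 10.5, 14.0])

tr = true_range(high, low, close)
atr_2 = atr(high, low, close, window=2)

print(pd.DataFrame({"TR": tr, "ATR_2": atr_2}))
```

Because the true range looks at the previous close, overnight gaps widen it even when the intraday range is small.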


## 4. Final Feature Set Cleanup

* After generating the technical indicators, the first rows of the DataFrame contain `NaN` values because windowed features need history before they are defined (e.g., a 50-period SMA has no value for the first 49 rows).

* We drop these initial rows, as they cannot be used for training.

* The raw price/volume columns are kept in the saved dataset (see the column list in the output below), although later notebooks will primarily use the derived features and the target variable (`Log_Return`).

In [6]:
# Cell 6: Drop NaN values created by windowed features
df_train_features = df_train.dropna()

print(f"Rows dropped due to TA windows: {df_train.shape[0] - df_train_features.shape[0]}")
print(f"Final Training Data Shape: {df_train_features.shape}")
print(f"Final Feature Set (Columns): {df_train_features.columns.tolist()}")
print("\nFinal Data Head:\n", df_train_features.head())

Rows dropped due to TA windows: 49
Final Training Data Shape: (57311, 33)
Final Feature Set (Columns): ['Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits', 'Symbol', 'Log_Return', 'Simple_Return', 'Price_Change', 'Price_Change_Pct', 'Close_Lag_1', 'Return_Lag_1', 'Close_Lag_2', 'Return_Lag_2', 'Close_Lag_3', 'Return_Lag_3', 'Close_Lag_5', 'Return_Lag_5', 'Close_Lag_10', 'Return_Lag_10', 'SMA_10', 'EMA_10', 'SMA_20', 'EMA_20', 'SMA_50', 'EMA_50', 'MACD_Line', 'MACD_Signal', 'RSI_14', 'MFI', 'ATR']

Final Data Head:
                                   Open         High          Low        Close  \
Date                                                                            
2020-01-02 00:00:00+05:30   935.219909   950.422151   935.214918   947.894287   
2020-01-03 00:00:00+05:30   743.390894   743.390894   725.577296   730.549683   
2020-01-03 00:00:00+05:30  4109.593667  4129.962198  4059.981839  4092.329102   
2020-01-03 00:00:00+05:30   117.492517   118.999434   1
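
To see which feature determines how many rows `dropna()` removes, it can help to count missing values per column before dropping. The sketch below uses a hypothetical helper `nan_report` on synthetic data (a random walk, not NIFTY 50 prices); only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd

def nan_report(df):
    """Columns that contain NaNs, sorted by NaN count (hypothetical helper)."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Synthetic single-series example
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Close": 100 + rng.normal(size=60).cumsum()})
toy["Close_Lag_10"] = toy["Close"].shift(10)
toy["SMA_50"] = toy["Close"].rolling(50).mean()

report = nan_report(toy)
rows_dropped = len(toy) - len(toy.dropna())

print(report)
print(f"Rows dropped: {rows_dropped}")
```

Here the 50-period SMA has the longest warm-up, so it alone accounts for the 49 dropped rows, the same count reported in the output above. If the indicators were computed per `Symbol`, each symbol would contribute its own warm-up rows.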

## 5. Saving the Feature-Rich Dataset

* This feature-rich dataset is saved. 

* It will be the input for all subsequent Machine Learning (ML) and Deep Learning (DL) model notebooks (Notebooks 04 through 09).

In [7]:
# Cell 7: Save the feature-engineered training data
FEATURE_TRAIN_DATA_PATH = '../data/processed/nifty50_train_features.csv'
df_train_features.to_csv(FEATURE_TRAIN_DATA_PATH)

print(f"\nSuccessfully saved feature-engineered training data to: {FEATURE_TRAIN_DATA_PATH}")
print("Ready to begin Classical Model Forecasting in Notebook 04.")


Successfully saved feature-engineered training data to: ../data/processed/nifty50_train_features.csv
Ready to begin Classical Model Forecasting in Notebook 04.


## Summary

### What We Accomplished:

  1. **Lag Features**: Created lagged price and return variables (1, 2, 3, 5, 10 periods)

  2. **Trend Indicators**: Generated SMA, EMA, and MACD across multiple timeframes

  3. **Momentum Indicators**: Implemented RSI for overbought/oversold signals

  4. **Volume & Volatility**: Added MFI and ATR for market strength analysis

  5. **Data Cleaning**: Removed NaN values from windowed calculations

  6. **Feature Export**: Saved enriched dataset for subsequent modeling notebooks

### Key Technical Features Created:

  - **Price Lags**: Close_Lag_1, 2, 3, 5, and 10 for temporal dependencies
  
  - **Return Lags**: Return_Lag_1, 2, 3, 5, and 10 for momentum analysis
  
  - **Moving Averages**: SMA/EMA (10, 20, 50) for trend identification
  
  - **MACD System**: MACD_Line and MACD_Signal for trend convergence
  
  - **Momentum**: RSI_14 for overbought/oversold conditions
  
  - **Volume/Volatility**: MFI and ATR for market context

### Feature Engineering Results:

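  - **Original Columns**: The 12 columns loaded from `nifty50_train.csv` (OHLCV, dividends, stock splits, symbol, and return/price-change columns)
  
  - **Enhanced Features**: 21 new columns (10 lag variables and 11 technical indicator columns), for 33 columns in total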
  
  - **Clean Dataset**: NaN values removed, ready for modeling
  
  - **Temporal Structure**: Maintains chronological order for time series modeling

### Next Steps:

**Notebook 04**: We'll begin classical time series modeling including:
- ARIMA model implementation and tuning
- Seasonal decomposition analysis
- Model evaluation and validation
- Baseline performance establishment

---

### *Next Notebook Preview*

With our comprehensive feature set engineered, we're ready to implement classical time series models. Starting with ARIMA, we'll establish baseline performance metrics and validate our feature engineering approach through traditional econometric methods.

---

#### About This Project

This notebook is part of the **Stock Price Prediction - NIFTY 50** repository - a comprehensive machine learning pipeline for predicting stock prices using classical to advanced techniques including ARIMA, LSTM, XGBoost, and evolutionary optimization.

**Repository:** [`stock-price-prediction-nifty50`](https://github.com/prakash-ukhalkar/stock-price-prediction-nifty50)

**Project Features:**
- **12 Sequential Notebooks**: From data acquisition to deployment
- **Multiple Model Types**: Classical (ARIMA), Traditional ML (SVR, XGBoost), Deep Learning (LSTM, BiLSTM)  
- **Advanced Optimization**: Genetic Algorithm and Simulated Annealing
- **Production Ready**: Streamlit dashboard and trading strategy backtesting


#### **Author**

**Prakash Ukhalkar**  
[![GitHub](https://img.shields.io/badge/GitHub-prakash--ukhalkar-blue?style=flat&logo=github)](https://github.com/prakash-ukhalkar)

---

<div align="center">
  <sub>Built with care for the quantitative finance and data science community</sub>
</div>