## **Stock Price Prediction - NIFTY 50**

### **Notebook 01: Data Acquisition and Preprocessing**

[![Python](https://img.shields.io/badge/Python-3.8%2B-blue)](https://www.python.org/) [![Pandas](https://img.shields.io/badge/Pandas-Latest-green)](https://pandas.pydata.org/) [![TensorFlow](https://img.shields.io/badge/TensorFlow-2.10%2B-orange)](https://tensorflow.org/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

---

**Part of the comprehensive learning series:** [Stock Price Prediction - NIFTY 50](https://github.com/prakash-ukhalkar/stock-price-prediction-nifty50)

**Learning Objectives:**
- Master financial data acquisition from NSE using `yfinance` (`nsepy` is imported but not called in this notebook)
- Understand data preprocessing techniques for financial time series
- Learn to calculate and interpret log returns for stationarity
- Implement robust data cleaning and validation procedures
- Export processed datasets for subsequent modeling notebooks

**Dataset Scope:** Fetch, merge, and clean daily NIFTY 50 constituent data from January 2020 to October 2025, then calculate log returns.

---

## 1. Import Required Libraries

* **Purpose:** Import essential libraries for financial data processing, visualization, and analysis.

* **Key Libraries:**
  - `pandas`: Data manipulation and analysis
  - `numpy`: Numerical computing
  - `yfinance`: Yahoo Finance API for stock data
  - `plotly`: Interactive visualizations
  - `matplotlib/seaborn`: Statistical plotting

* **Note:** We suppress warnings to keep output clean during data processing.

In [2]:
# Core Data Handling
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Data Acquisition
import yfinance as yf
from nsepy import get_history
from datetime import datetime, timedelta
import requests

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# System and File Operations
import os
import sys
from pathlib import Path

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8')

print("Libraries imported successfully!")
print(f"Current date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Libraries imported successfully!
Current date: 2025-10-16 21:50:52
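
The import cell stops at the first package that is not installed. A minimal sketch, using only the standard library, for listing which packages are importable before running the notebook. The helper name `missing_packages` and the required/optional split are assumptions made for illustration, not part of the original notebook:

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    return [name for name in names if find_spec(name) is None]

# Assumed split: nsepy is treated as optional because no later cell calls it
REQUIRED = ["pandas", "numpy", "yfinance", "plotly", "matplotlib", "seaborn"]
OPTIONAL = ["nsepy"]

print("Missing required:", missing_packages(REQUIRED))
print("Missing optional:", missing_packages(OPTIONAL))
```

Checking with `find_spec` does not import the package, so the check itself stays fast and has no side effects.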


## 2. Define NIFTY 50 Stock Symbols

* **Purpose:** Define the complete list of NIFTY 50 constituent stocks for data acquisition.

* **Context:** The NIFTY 50 is the National Stock Exchange of India's flagship index, tracking 50 large, liquid companies weighted by free-float market capitalization.
  
  * We use Yahoo Finance symbols (with the `.NS` suffix) for NSE-listed stocks.

* **Implementation Note:** The list includes all major sectors like Banking (HDFCBANK, ICICIBANK), IT (TCS, INFY), Energy (RELIANCE, ONGC), etc.

In [3]:
# NIFTY 50 Stock Symbols (Updated as of 2024)
NIFTY_50_SYMBOLS = [
    'RELIANCE.NS', 'TCS.NS', 'HDFCBANK.NS', 'ICICIBANK.NS', 'HINDUNILVR.NS',
    'INFY.NS', 'ITC.NS', 'SBIN.NS', 'BHARTIARTL.NS', 'ASIANPAINT.NS',
    'MARUTI.NS', 'HCLTECH.NS', 'AXISBANK.NS', 'LT.NS', 'SUNPHARMA.NS',
    'TITAN.NS', 'ULTRACEMCO.NS', 'WIPRO.NS', 'NESTLEIND.NS', 'POWERGRID.NS',
    'NTPC.NS', 'TECHM.NS', 'ONGC.NS', 'M&M.NS', 'TATAMOTORS.NS',
    'KOTAKBANK.NS', 'HDFCLIFE.NS', 'BAJFINANCE.NS', 'SBILIFE.NS', 'DRREDDY.NS',
    'INDUSINDBK.NS', 'ADANIENT.NS', 'GRASIM.NS', 'CIPLA.NS', 'BRITANNIA.NS',
    'COALINDIA.NS', 'TATASTEEL.NS', 'APOLLOHOSP.NS', 'HINDALCO.NS', 'DIVISLAB.NS',
    'HEROMOTOCO.NS', 'ADANIPORTS.NS', 'UPL.NS', 'BAJAJFINSV.NS', 'JSWSTEEL.NS',
    'EICHERMOT.NS', 'TATACONSUM.NS', 'LTIM.NS', 'BAJAJ-AUTO.NS', 'BPCL.NS'
]

print(f"Total NIFTY 50 symbols defined: {len(NIFTY_50_SYMBOLS)}")
print("\nSample symbols:")
for i, symbol in enumerate(NIFTY_50_SYMBOLS[:50]):
    print(f"{i+1:2d}. {symbol}")

Total NIFTY 50 symbols defined: 50

Sample symbols:
 1. RELIANCE.NS
 2. TCS.NS
 3. HDFCBANK.NS
 4. ICICIBANK.NS
 5. HINDUNILVR.NS
 6. INFY.NS
 7. ITC.NS
 8. SBIN.NS
 9. BHARTIARTL.NS
10. ASIANPAINT.NS
11. MARUTI.NS
12. HCLTECH.NS
13. AXISBANK.NS
14. LT.NS
15. SUNPHARMA.NS
16. TITAN.NS
17. ULTRACEMCO.NS
18. WIPRO.NS
19. NESTLEIND.NS
20. POWERGRID.NS
21. NTPC.NS
22. TECHM.NS
23. ONGC.NS
24. M&M.NS
25. TATAMOTORS.NS
26. KOTAKBANK.NS
27. HDFCLIFE.NS
28. BAJFINANCE.NS
29. SBILIFE.NS
30. DRREDDY.NS
31. INDUSINDBK.NS
32. ADANIENT.NS
33. GRASIM.NS
34. CIPLA.NS
35. BRITANNIA.NS
36. COALINDIA.NS
37. TATASTEEL.NS
38. APOLLOHOSP.NS
39. HINDALCO.NS
40. DIVISLAB.NS
41. HEROMOTOCO.NS
42. ADANIPORTS.NS
43. UPL.NS
44. BAJAJFINSV.NS
45. JSWSTEEL.NS
46. EICHERMOT.NS
47. TATACONSUM.NS
48. LTIM.NS
49. BAJAJ-AUTO.NS
50. BPCL.NS
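
A hand-maintained symbol list is easy to get subtly wrong (a duplicate entry, a missing `.NS` suffix). The following is a small sketch of a sanity check that could run before any download. The function name `validate_symbols` and its parameters are hypothetical, and the input is a toy list rather than the real constituents:

```python
from collections import Counter

def validate_symbols(symbols, suffix=".NS", expected_count=50):
    """Report duplicates, entries lacking the exchange suffix, and a count check."""
    counts = Counter(symbols)
    return {
        "duplicates": sorted(s for s, n in counts.items() if n > 1),
        "bad_suffix": sorted(s for s in counts if not s.endswith(suffix)),
        "count_ok": len(symbols) == expected_count and len(counts) == expected_count,
    }

# Toy input with one duplicate and one entry missing the suffix
report = validate_symbols(["TCS.NS", "INFY.NS", "TCS.NS", "ITC"], expected_count=3)
print(report)
```

Calling `validate_symbols(NIFTY_50_SYMBOLS)` on the list defined above would apply the same three checks with the default count of 50.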


## 3. Data Acquisition Functions

* **Purpose:** Create robust, reusable functions for fetching financial data with comprehensive error handling.

* **Key Features:**
  - **Retry Mechanism:** Makes up to 3 attempts per symbol by default (`max_retries`), pausing 2 seconds after an attempt that raises an exception
  - **Error Handling:** Graceful handling of network issues and invalid symbols  
  - **Data Validation:** Ensures non-empty datasets before processing
  - **Flexible Parameters:** Configurable date ranges and retry limits

* **Function Design:** The `fetch_stock_data()` function wraps the `yfinance` `Ticker.history()` call in a retry loop with exception handling and returns `None` when every attempt fails.

In [4]:
def fetch_stock_data(symbol, start_date='2020-01-01', end_date=None, max_retries=3):
    """
    Fetch stock data for a given symbol using yfinance.
    
    Parameters:
    - symbol: Stock symbol (e.g., 'RELIANCE.NS')
    - start_date: Start date for data (default: '2020-01-01')
    - end_date: End date for data (default: today)
    - max_retries: Maximum number of retry attempts
    
    Returns:
    - DataFrame with stock data or None if failed
    """
    if end_date is None:
        end_date = datetime.now().strftime('%Y-%m-%d')
    
    for attempt in range(max_retries):
        try:
            # Fetch data using yfinance
            ticker = yf.Ticker(symbol)
            data = ticker.history(start=start_date, end=end_date)
            
            if not data.empty:
                # Add symbol column
                data['Symbol'] = symbol.replace('.NS', '')
                data.reset_index(inplace=True)
                
                print(f"Successfully fetched {len(data)} records for {symbol}")
                return data
            else:
                print(f"No data found for {symbol}")
                
        except Exception as e:
            print(f"Attempt {attempt + 1} failed for {symbol}: {str(e)}")
            if attempt < max_retries - 1:
                print("Retrying in 2 seconds...")
                import time  # local import: only needed on the retry path
                time.sleep(2)
    
    print(f"Failed to fetch data for {symbol} after {max_retries} attempts")
    return None

def fetch_nifty_index_data(start_date='2020-01-01', end_date=None):
    """
    Fetch NIFTY 50 index data.
    """
    print("Fetching NIFTY 50 Index data...")
    return fetch_stock_data('^NSEI', start_date, end_date)

print("Data acquisition functions defined successfully!")

Data acquisition functions defined successfully!
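
The function above waits a fixed 2 seconds between attempts. A common variation is exponential backoff, where the wait doubles after each failure. The sketch below is an illustration under stated assumptions, not the notebook's method: `retry`, `base_delay`, and the injectable `sleep` argument are hypothetical names, and the fetcher is a stand-in that fails twice so the example runs offline:

```python
import time

def retry(func, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_error = None
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:  # broad on purpose, mirroring fetch_stock_data
            last_error = exc
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error

# Offline demonstration: a stand-in fetcher that fails twice, then succeeds
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"

delays = []  # record the requested waits instead of actually sleeping
result = retry(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper testable without real pauses. In the notebook, `func` could be a `lambda` wrapping the `yfinance` call.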


## 4. Batch Data Collection

* **Purpose:** Systematically fetch historical data for all NIFTY 50 stocks with comprehensive progress tracking.

* **Implementation Strategy:**
  - **Sequential Processing:** Fetch data one stock at a time to avoid API rate limits
  - **Progress Monitoring:** Track successful vs failed downloads with detailed logging
  - **Error Segregation:** Separate successful and failed symbols for later analysis
  - **Memory Management:** Store data in list format before consolidation

* **Expected Outcome:** Complete dataset spanning 2020-2025 for all available NIFTY 50 constituents.

In [5]:
# Set date range for data collection
START_DATE = '2020-01-01'
END_DATE = datetime.now().strftime('%Y-%m-%d')

print(f"Starting data collection for NIFTY 50 stocks")
print(f"Date range: {START_DATE} to {END_DATE}")
print(f"Total symbols to process: {len(NIFTY_50_SYMBOLS)}")
print("-" * 60)

# Initialize containers
all_stock_data = []
failed_symbols = []
successful_symbols = []

# Fetch data for each symbol
for i, symbol in enumerate(NIFTY_50_SYMBOLS, 1):
    print(f"\n[{i:2d}/{len(NIFTY_50_SYMBOLS)}] Processing {symbol}...")
    
    data = fetch_stock_data(symbol, START_DATE, END_DATE)
    
    if data is not None and not data.empty:
        all_stock_data.append(data)
        successful_symbols.append(symbol)
    else:
        failed_symbols.append(symbol)
    
    # Progress update every 10 stocks
    if i % 10 == 0:
        print(f"\nProgress: {i}/{len(NIFTY_50_SYMBOLS)} completed ({i/len(NIFTY_50_SYMBOLS)*100:.1f}%)")

# Summary
print("\n" + "="*60)
print("DATA COLLECTION SUMMARY")
print("="*60)
print(f"Successful: {len(successful_symbols)} stocks")
print(f"Failed: {len(failed_symbols)} stocks")
print(f"Success Rate: {len(successful_symbols)/len(NIFTY_50_SYMBOLS)*100:.1f}%")

if failed_symbols:
    print(f"\nFailed symbols: {', '.join(failed_symbols)}")

Starting data collection for NIFTY 50 stocks
Date range: 2020-01-01 to 2025-10-16
Total symbols to process: 50
------------------------------------------------------------

[ 1/50] Processing RELIANCE.NS...
Successfully fetched 1435 records for RELIANCE.NS

[ 2/50] Processing TCS.NS...
Successfully fetched 1435 records for TCS.NS

[ 3/50] Processing HDFCBANK.NS...
Successfully fetched 1435 records for HDFCBANK.NS

[ 4/50] Processing ICICIBANK.NS...
Successfully fetched 1435 records for ICICIBANK.NS

[ 5/50] Processing HINDUNILVR.NS...
Successfully fetched 1435 records for HINDUNILVR.NS

[ 6/50] Processing INFY.NS...
Successfully fetched 1435 records for INFY.NS

[ 7/50] Processing ITC.NS...
Successfully fetched 1435 records for ITC.NS

[ 8/50] Processing SBIN.NS...
Successfully fetched 1435 records for SBIN.NS

[ 9/50] Processing BHARTIARTL.NS...
Successfully fetched 1435 records for BHARTIARTL.NS

[10/50] Processing ASIANPAINT.NS...
Successfully fetched 1435 records for ASIANPAINT.NS
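
The collection loop is easier to test when the fetch function is passed in as an argument, so a stand-in can replace the network call. A minimal sketch of that idea follows; `collect` and `fake_store` are hypothetical names introduced for illustration:

```python
def collect(symbols, fetch):
    """Run fetch(symbol) for each symbol; split results into successes and failures."""
    results, failed = {}, []
    for symbol in symbols:
        data = fetch(symbol)
        if data is not None and len(data) > 0:
            results[symbol] = data
        else:
            failed.append(symbol)
    return results, failed

# Stand-in fetcher: returns rows for known symbols and None otherwise
fake_store = {"TCS.NS": [1, 2, 3], "INFY.NS": [4, 5]}
requested = ["TCS.NS", "INFY.NS", "MISSING.NS"]
results, failed = collect(requested, fake_store.get)

success_rate = len(results) / len(requested) * 100
print(f"Successful: {len(results)}, failed: {failed}, rate: {success_rate:.1f}%")
```

In the notebook, `fetch_stock_data` could be passed as `fetch`, since a DataFrame also supports `len()`.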


## 5. Combine and Structure Data

* **Purpose:** Consolidate individual stock datasets into a unified master DataFrame for analysis.

* **Data Operations:**
  - **Concatenation:** Merge all individual DataFrames using pandas concat
  - **Sorting:** Order data by Symbol and Date for chronological analysis  
  - **Indexing:** Reset index for clean sequential numbering
  - **Validation:** Verify data integrity and completeness

* **Output Structure:** Master dataset with columns: Date, Open, High, Low, Close, Volume, Dividends, Stock Splits, Symbol

In [6]:
# Combine all stock data
if all_stock_data:
    # Concatenate all DataFrames
    combined_data = pd.concat(all_stock_data, ignore_index=True)
    
    # Sort by Symbol and Date
    combined_data = combined_data.sort_values(['Symbol', 'Date']).reset_index(drop=True)
    
    print("MASTER DATASET CREATED")
    print("-" * 40)
    print(f"Total Records: {len(combined_data):,}")
    print(f"Unique Stocks: {combined_data['Symbol'].nunique()}")
    print(f"Date Range: {combined_data['Date'].min()} to {combined_data['Date'].max()}")
    print(f"Memory Usage: {combined_data.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    # Display sample data
    print("\nSample Data:")
    display(combined_data.head(10))
    
    # Data info
    print("\nDataset Info:")
    combined_data.info()
    
else:
    print("No data collected. Please check your internet connection and symbol list.")

MASTER DATASET CREATED
----------------------------------------
Total Records: 71,750
Unique Stocks: 50
Date Range: 2020-01-01 00:00:00+05:30 to 2025-10-15 00:00:00+05:30
Memory Usage: 8.78 MB

Sample Data:


| | Date | Open | High | Low | Close | Volume | Dividends | Stock Splits | Symbol |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 00:00:00+05:30 | 206.963633 | 208.399503 | 204.636524 | 205.824844 | 1553127 | 0.0 | 0.0 | ADANIENT |
| 1 | 2020-01-02 00:00:00+05:30 | 205.973393 | 211.122725 | 205.478265 | 209.142212 | 2991937 | 0.0 | 0.0 | ADANIENT |
| 2 | 2020-01-03 00:00:00+05:30 | 208.201475 | 210.28102 | 203.794836 | 206.270477 | 2512421 | 0.0 | 0.0 | ADANIENT |
| 3 | 2020-01-06 00:00:00+05:30 | 205.725814 | 205.725814 | 195.823248 | 197.605713 | 4353179 | 0.0 | 0.0 | ADANIENT |
| 4 | 2020-01-07 00:00:00+05:30 | 198.595991 | 203.695807 | 198.595991 | 202.06189 | 2966120 | 0.0 | 0.0 | ADANIENT |
| 5 | 2020-01-08 00:00:00+05:30 | 199.041584 | 201.566742 | 192.654432 | 199.536713 | 5762654 | 0.0 | 0.0 | ADANIENT |
| 6 | 2020-01-09 00:00:00+05:30 | 199.932817 | 206.864614 | 199.932817 | 205.973389 | 3496063 | 0.0 | 0.0 | ADANIENT |
| 7 | 2020-01-10 00:00:00+05:30 | 205.97338 | 210.924663 | 203.893835 | 207.062668 | 2769132 | 0.0 | 0.0 | ADANIENT |
| 8 | 2020-01-13 00:00:00+05:30 | 208.59756 | 214.143003 | 207.557803 | 211.518829 | 3527267 | 0.0 | 0.0 | ADANIENT |
| 9 | 2020-01-14 00:00:00+05:30 | 211.370296 | 213.400325 | 209.785894 | 212.261536 | 1775182 | 0.0 | 0.0 | ADANIENT |



Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71750 entries, 0 to 71749
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype                       
---  ------        --------------  -----                       
 0   Date          71750 non-null  datetime64[ns, Asia/Kolkata]
 1   Open          71750 non-null  float64                     
 2   High          71750 non-null  float64                     
 3   Low           71750 non-null  float64                     
 4   Close         71750 non-null  float64                     
 5   Volume        71750 non-null  int64                       
 6   Dividends     71750 non-null  float64                     
 7   Stock Splits  71750 non-null  float64                     
 8   Symbol        71750 non-null  object                      
dtypes: datetime64[ns, Asia/Kolkata](1), float64(6), int64(1), object(1)
memory usage: 4.9+ MB
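
The `Date` column in the output above is timezone-aware (`datetime64[ns, Asia/Kolkata]`). If later steps expect plain calendar dates, one option is to drop the timezone while keeping the local date. The sketch below uses a toy frame, so the values and the `toy` name are illustrative only:

```python
import pandas as pd

# Toy frame mimicking the tz-aware Date column shown in the output above
dates = pd.to_datetime(["2020-01-01", "2020-01-02"]).tz_localize("Asia/Kolkata")
toy = pd.DataFrame({"Date": dates, "Close": [205.82, 209.14]})

# Drop the timezone but keep the local (IST) wall-clock timestamp
toy["Date"] = toy["Date"].dt.tz_localize(None)
print(toy.dtypes)
```

`tz_localize(None)` keeps the local wall-clock time. Converting to UTC first would shift midnight IST to the previous calendar day, which is usually not what a daily dataset wants.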


## 6. Data Quality Assessment

* **Purpose:** Conduct comprehensive data quality analysis to identify issues before proceeding with modeling.

* **Quality Checks:**
  - **Missing Values:** Identify columns with null/NaN values and calculate percentages
  - **Duplicates:** Detect duplicate records based on Symbol-Date combinations  
  - **Data Completeness:** Analyze record counts per stock for consistency
  - **Anomaly Detection:** Check for non-positive prices (zero or negative) and zero-volume rows

* **Statistical Validation:** Ensure data ranges are reasonable and consistent with market expectations.

In [7]:
# Data Quality Assessment
print("DATA QUALITY ASSESSMENT")
print("=" * 50)

# Missing values analysis
missing_data = combined_data.isnull().sum()
print("\nMissing Values:")
for col, count in missing_data.items():
    if count > 0:
        percentage = (count / len(combined_data)) * 100
        print(f"  {col}: {count:,} ({percentage:.2f}%)")

# Check for duplicate records
duplicates = combined_data.duplicated(['Symbol', 'Date']).sum()
print(f"\nDuplicate Records: {duplicates}")

# Data completeness by symbol
data_completeness = combined_data.groupby('Symbol').size().describe()
print("\nData Completeness by Symbol:")
print(data_completeness)

# Check for negative or zero prices (data anomalies)
price_anomalies = {
    'Negative Open': (combined_data['Open'] <= 0).sum(),
    'Negative High': (combined_data['High'] <= 0).sum(),
    'Negative Low': (combined_data['Low'] <= 0).sum(),
    'Negative Close': (combined_data['Close'] <= 0).sum(),
    'Zero Volume': (combined_data['Volume'] == 0).sum()
}

print("\nPrice Anomalies:")
for anomaly, count in price_anomalies.items():
    if count > 0:
        print(f"  {anomaly}: {count}")
    else:
        print(f"  {anomaly}: None found")

# Date range consistency
date_ranges = combined_data.groupby('Symbol')['Date'].agg(['min', 'max', 'count'])
print("\nDate Range Summary:")
print(f"  Earliest Date: {date_ranges['min'].min()}")
print(f"  Latest Date: {date_ranges['max'].max()}")
print(f"  Min Records/Stock: {date_ranges['count'].min()}")
print(f"  Max Records/Stock: {date_ranges['count'].max()}")

DATA QUALITY ASSESSMENT

Missing Values:

Duplicate Records: 0

Data Completeness by Symbol:
count      50.0
mean     1435.0
std         0.0
min      1435.0
25%      1435.0
50%      1435.0
75%      1435.0
max      1435.0
dtype: float64

Price Anomalies:
  Negative Open: None found
  Negative High: None found
  Negative Low: None found
  Negative Close: None found
  Zero Volume: 61

Date Range Summary:
  Earliest Date: 2020-01-01 00:00:00+05:30
  Latest Date: 2025-10-15 00:00:00+05:30
  Min Records/Stock: 1435
  Max Records/Stock: 1435
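
One further consistency check that fits alongside those above is whether each row's High is the largest and Low the smallest of the four prices. This is a sketch on toy data and was not run on the notebook's dataset; `ohlc_violations` is a hypothetical helper name:

```python
import pandas as pd

def ohlc_violations(df):
    """Return rows where High is not the largest or Low is not the smallest price."""
    prices = df[["Open", "High", "Low", "Close"]]
    bad_high = df["High"] < prices.max(axis=1)
    bad_low = df["Low"] > prices.min(axis=1)
    return df[bad_high | bad_low]

toy = pd.DataFrame({
    "Open":  [100.0, 100.0],
    "High":  [105.0, 101.0],  # second row: High is below Close
    "Low":   [99.0, 98.0],
    "Close": [102.0, 103.0],
})
bad = ohlc_violations(toy)
print(bad)
```

With adjusted prices, tiny floating-point differences may trigger false positives, so a small tolerance might be needed in practice.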


## 7. Calculate Log Returns

* **Purpose:** Transform raw price data into stationary log returns suitable for time series modeling.

* **Mathematical Foundation:** 
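  - **Log Returns Formula:** 
  $$
  \text{LogReturn}_t = \ln\left(\frac{\text{Close}_t}{\text{Close}_{t-1}}\right)
  $$
  - **Stationarity:** Log returns are usually much closer to stationary than raw prices, which tend to trend; Notebook 02 checks this with ADF tests
  - **Benefits:** Log returns add across time (a multi-day log return is the sum of the daily ones) and treat gains and losses symmetrically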

* **Implementation Details:**
  - Group-wise calculation per stock to maintain chronological order
  - Handle missing values appropriately (first observation per stock)
  - Calculate both log returns and simple returns for comparison
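
To see why the calculation has to be group-wise, the sketch below compares a naive shift over the whole column with a per-symbol shift. The data are made up: the symbols `AAA` and `BBB` and their prices are illustrative only:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Symbol": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Close": [100.0, 110.0, 121.0, 50.0, 45.0, 54.0],
})

# Naive: the first BBB row is divided by the last AAA price (a spurious return)
naive = np.log(toy["Close"] / toy["Close"].shift(1))

# Group-wise: each symbol's first row has no previous price, so it becomes NaN
grouped = np.log(toy["Close"] / toy.groupby("Symbol")["Close"].shift(1))

print(pd.DataFrame({"Symbol": toy["Symbol"], "naive": naive, "grouped": grouped}))
```

The per-symbol version produces exactly one `NaN` per stock, which is the row removed afterwards with `dropna`.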

In [8]:
def calculate_log_returns(df):
    """
    Calculate log returns for each stock in the dataset.
    
    Log Return = ln(Price_t / Price_t-1)
    """
    df = df.copy()
    df = df.sort_values(['Symbol', 'Date'])
    
    # Calculate log returns for each stock separately
    df['Log_Return'] = df.groupby('Symbol')['Close'].transform(
        lambda x: np.log(x / x.shift(1))
    )
    
    # Calculate simple returns as well for comparison
    df['Simple_Return'] = df.groupby('Symbol')['Close'].pct_change()
    
    # Calculate additional return metrics
    df['Price_Change'] = df.groupby('Symbol')['Close'].diff()
    df['Price_Change_Pct'] = df['Price_Change'] / df.groupby('Symbol')['Close'].shift(1) * 100
    
    return df

# Apply log return calculation
print("Calculating Log Returns...")
combined_data_with_returns = calculate_log_returns(combined_data)

# Remove first day for each stock (NaN values due to lag calculation)
combined_data_with_returns = combined_data_with_returns.dropna(subset=['Log_Return'])

print("Log Returns calculated successfully!")
print(f"Records with returns: {len(combined_data_with_returns):,}")

# Display sample with returns
print("\nSample Data with Returns:")
sample_cols = ['Date', 'Symbol', 'Close', 'Log_Return', 'Simple_Return', 'Price_Change_Pct']
display(combined_data_with_returns[sample_cols].head(10))

Calculating Log Returns...
Log Returns calculated successfully!
Records with returns: 71,700

Sample Data with Returns:


| | Date | Symbol | Close | Log_Return | Simple_Return | Price_Change_Pct |
|---|---|---|---|---|---|---|
| 1 | 2020-01-02 00:00:00+05:30 | ADANIENT | 209.142212 | 0.015989 | 0.016117 | 1.611743 |
| 2 | 2020-01-03 00:00:00+05:30 | ADANIENT | 206.270477 | -0.013826 | -0.013731 | -1.373101 |
| 3 | 2020-01-06 00:00:00+05:30 | ADANIENT | 197.605713 | -0.042915 | -0.042007 | -4.200681 |
| 4 | 2020-01-07 00:00:00+05:30 | ADANIENT | 202.06189 | 0.0223 | 0.022551 | 2.255085 |
| 5 | 2020-01-08 00:00:00+05:30 | ADANIENT | 199.536713 | -0.012576 | -0.012497 | -1.249705 |
| 6 | 2020-01-09 00:00:00+05:30 | ADANIENT | 205.973389 | 0.031749 | 0.032258 | 3.22581 |
| 7 | 2020-01-10 00:00:00+05:30 | ADANIENT | 207.062668 | 0.005275 | 0.005288 | 0.528845 |
| 8 | 2020-01-13 00:00:00+05:30 | ADANIENT | 211.518829 | 0.021293 | 0.021521 | 2.152083 |
| 9 | 2020-01-14 00:00:00+05:30 | ADANIENT | 212.261536 | 0.003505 | 0.003511 | 0.35113 |
| 10 | 2020-01-15 00:00:00+05:30 | ADANIENT | 214.489594 | 0.010442 | 0.010497 | 1.049676 |


## 8. Return Statistics and Analysis

* **Purpose:** Perform comprehensive statistical analysis of calculated log returns across all stocks.

* **Statistical Metrics:**
  - **Descriptive Stats:** Mean, standard deviation, min/max, quartiles
  - **Distribution Shape:** Skewness and kurtosis for return distribution analysis
  - **Risk-Return Metrics:** Annualized returns, volatility, and Sharpe ratios (252 trading days per year, risk-free rate taken as zero)
  - **Comparative Analysis:** Rank stocks by risk-adjusted performance

* **Business Value:** Identify top-performing stocks and understand risk characteristics before modeling.

In [9]:
# Calculate return statistics by symbol
return_stats = combined_data_with_returns.groupby('Symbol')['Log_Return'].agg([
    'count', 'mean', 'std', 'min', 'max', 'skew', 
    lambda x: x.quantile(0.25),   # Q1
    'median',                     # Q2
    lambda x: x.quantile(0.75)    # Q3
]).round(6)

# Rename lambda columns
return_stats.columns = ['Count', 'Mean', 'Std', 'Min', 'Max', 'Skewness', 'Q1', 'Median', 'Q3']

# Add additional metrics
return_stats['Annualized_Return'] = return_stats['Mean'] * 252  # 252 trading days
return_stats['Annualized_Volatility'] = return_stats['Std'] * np.sqrt(252)
# Sharpe ratio with the risk-free rate taken as zero (no excess-return adjustment)
return_stats['Sharpe_Ratio'] = return_stats['Annualized_Return'] / return_stats['Annualized_Volatility']

print("LOG RETURN STATISTICS BY STOCK")
print("=" * 60)

# Display top 10 stocks by Sharpe ratio
print("\nTop 10 Stocks by Sharpe Ratio:")
top_sharpe = return_stats.nlargest(10, 'Sharpe_Ratio')[['Mean', 'Std', 'Annualized_Return', 'Sharpe_Ratio']]
display(top_sharpe)

# Overall market statistics
overall_stats = combined_data_with_returns['Log_Return'].agg([
    'count', 'mean', 'std', 'min', 'max', 'skew', 'kurtosis'
])

print("\nOverall Market Statistics:")
print(f"  Total Observations: {overall_stats['count']:,}")
print(f"  Mean Log Return: {overall_stats['mean']:.6f} ({overall_stats['mean']*252:.4f} annualized)")
print(f"  Volatility (Std): {overall_stats['std']:.6f} ({overall_stats['std']*np.sqrt(252):.4f} annualized)")
print(f"  Minimum Return: {overall_stats['min']:.6f} ({overall_stats['min']*100:.2f}%)")
print(f"  Maximum Return: {overall_stats['max']:.6f} ({overall_stats['max']*100:.2f}%)")
print(f"  Skewness: {overall_stats['skew']:.4f}")
print(f"  Kurtosis: {overall_stats['kurtosis']:.4f}")

LOG RETURN STATISTICS BY STOCK

Top 10 Stocks by Sharpe Ratio:


| Symbol | Mean | Std | Annualized_Return | Sharpe_Ratio |
|---|---|---|---|---|
| M&M | 0.001343 | 0.020941 | 0.338436 | 1.018073 |
| SUNPHARMA | 0.000975 | 0.015735 | 0.2457 | 0.983644 |
| BHARTIARTL | 0.001055 | 0.017251 | 0.26586 | 0.970819 |
| APOLLOHOSP | 0.001198 | 0.020375 | 0.301896 | 0.933382 |
| ADANIENT | 0.00175 | 0.033096 | 0.441 | 0.839388 |
| CIPLA | 0.000861 | 0.016792 | 0.216972 | 0.813956 |
| POWERGRID | 0.00091 | 0.01777 | 0.22932 | 0.812932 |
| GRASIM | 0.000954 | 0.018644 | 0.240408 | 0.812287 |
| TATACONSUM | 0.000894 | 0.017801 | 0.225288 | 0.797248 |
| DIVISLAB | 0.000922 | 0.018414 | 0.232344 | 0.794846 |



Overall Market Statistics:
  Total Observations: 71,700.0
  Mean Log Return: 0.000682 (0.1719 annualized)
  Volatility (Std): 0.019921 (0.3162 annualized)
  Minimum Return: -0.513351 (-51.34%)
  Maximum Return: 0.369307 (36.93%)
  Skewness: -0.8059
  Kurtosis: 23.2042
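
The Sharpe ratios in the table divide annualized return by annualized volatility without subtracting a risk-free rate. The sketch below reproduces that arithmetic for the M&M row and shows how an explicit rate would enter. The function name `sharpe_ratio` is hypothetical, and the 0.06 rate is a placeholder chosen for illustration, not a sourced figure:

```python
import math

def sharpe_ratio(mean_daily, std_daily, risk_free_annual=0.0, periods=252):
    """Annualize daily mean and volatility, then divide excess return by volatility."""
    annual_return = mean_daily * periods
    annual_vol = std_daily * math.sqrt(periods)
    return (annual_return - risk_free_annual) / annual_vol

# M&M row from the table above: Mean 0.001343, Std 0.020941
zero_rf = sharpe_ratio(0.001343, 0.020941)

# Placeholder annual risk-free rate (assumption for illustration only)
with_rf = sharpe_ratio(0.001343, 0.020941, risk_free_annual=0.06)

print(round(zero_rf, 4), round(with_rf, 4))
```

Subtracting a positive rate lowers every stock's ratio. Because the numerator shrinks by the same amount for all stocks while the volatilities differ, the ranking can also change.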


## 9. Data Visualization

* **Purpose:** Create comprehensive visual analysis dashboard to understand data characteristics and patterns.

* **Visualization Components:**
  - **Distribution Analysis:** Histogram of log returns across all stocks
  - **Price Evolution:** Time series plot showing stock price movement  
  - **Volatility Ranking:** Bar chart of highest volatility stocks
  - **Performance Comparison:** Cumulative returns for top-performing stocks

* **Interactive Features:** Using Plotly for interactive charts with hover information and zoom capabilities.

In [10]:
# Create comprehensive visualizations
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[
        'Log Returns Distribution (All Stocks)',
        'Sample Stock Price Evolution', 
        'Returns Volatility by Stock',
        'Cumulative Returns (Top 5 by Sharpe)'
    ],
    specs=[[{"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": False}, {"secondary_y": False}]]
)

# 1. Log Returns Distribution
fig.add_trace(
    go.Histogram(
        x=combined_data_with_returns['Log_Return'],
        nbinsx=100,
        name='Log Returns',
        opacity=0.7
    ),
    row=1, col=1
)

# 2. Sample Stock Price Evolution (Reliance)
reliance_data = combined_data_with_returns[combined_data_with_returns['Symbol'] == 'RELIANCE'].copy()
if not reliance_data.empty:
    fig.add_trace(
        go.Scatter(
            x=reliance_data['Date'],
            y=reliance_data['Close'],
            mode='lines',
            name='RELIANCE Close Price',
            line=dict(color='blue')
        ),
        row=1, col=2
    )

# 3. Returns Volatility by Stock
top_10_vol = return_stats.nlargest(10, 'Std')
fig.add_trace(
    go.Bar(
        x=top_10_vol.index,
        y=top_10_vol['Std'],
        name='Daily Volatility',
        marker_color='red',
        opacity=0.7
    ),
    row=2, col=1
)

# 4. Cumulative Returns for Top 5 Sharpe Ratio stocks
top_5_symbols = return_stats.nlargest(5, 'Sharpe_Ratio').index.tolist()
colors = ['blue', 'red', 'green', 'purple', 'orange']

for i, symbol in enumerate(top_5_symbols):
    stock_data = combined_data_with_returns[combined_data_with_returns['Symbol'] == symbol].copy()
    if not stock_data.empty:
        stock_data = stock_data.sort_values('Date')
        # Log returns compound by summation: exp(cumulative sum) - 1 is the cumulative simple return
        stock_data['Cumulative_Return'] = np.exp(stock_data['Log_Return'].cumsum()) - 1
        
        fig.add_trace(
            go.Scatter(
                x=stock_data['Date'],
                y=stock_data['Cumulative_Return'] * 100,
                mode='lines',
                name=f'{symbol}',
                line=dict(color=colors[i])
            ),
            row=2, col=2
        )

# Update layout
fig.update_layout(
    height=800,
    title_text="NIFTY 50 Data Analysis Dashboard",
    title_x=0.5,
    showlegend=True
)

# Update axes labels
fig.update_xaxes(title_text="Log Return", row=1, col=1)
fig.update_yaxes(title_text="Frequency", row=1, col=1)

fig.update_xaxes(title_text="Date", row=1, col=2)
fig.update_yaxes(title_text="Price (Rs)", row=1, col=2)

fig.update_xaxes(title_text="Stock Symbol", row=2, col=1)
fig.update_yaxes(title_text="Daily Volatility", row=2, col=1)

fig.update_xaxes(title_text="Date", row=2, col=2)
fig.update_yaxes(title_text="Cumulative Return (%)", row=2, col=2)

fig.show()

print("Comprehensive data visualization completed!")

Comprehensive data visualization completed!


![Nifty 50 Data Analysis Dashboard](../images/NIFTY50_Data_Analysis_Dashboard.png)
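
Cumulative return curves depend on which return type is compounded. Log returns are summed and then exponentiated, while simple returns are compounded as a running product. The sketch below uses made-up prices to show that the two routes agree, and that feeding log returns into the simple-return formula does not:

```python
import numpy as np

prices = np.array([100.0, 110.0, 99.0, 108.9])  # illustrative prices
log_returns = np.diff(np.log(prices))
simple_returns = prices[1:] / prices[:-1] - 1

cum_from_log = np.exp(np.cumsum(log_returns)) - 1     # sum logs, then exponentiate
cum_from_simple = np.cumprod(1 + simple_returns) - 1  # compound simple returns
mixed = np.cumprod(1 + log_returns) - 1               # log returns in the simple formula

print(cum_from_log[-1], cum_from_simple[-1], mixed[-1])
```

Both correct routes recover `prices[-1] / prices[0] - 1`. The mixed version understates growth because `1 + ln(x)` is never greater than `x`.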

## 10. Save Processed Data

* **Purpose:** Export cleaned and processed datasets for use in subsequent analysis notebooks.

* **Data Products:**
  - **Main Dataset:** Complete processed data with log returns, simple returns, and daily price changes
  - **Statistics Summary:** Pre-calculated return statistics and risk metrics  
  - **Processing Summary:** One-row file recording record counts, date range, success and failure counts, and the processing timestamp

* **File Formats:** CSV format for cross-platform compatibility and easy loading in future notebooks.

In [11]:
# Create data directory if it doesn't exist
data_dir = Path('../data/processed')
data_dir.mkdir(parents=True, exist_ok=True)

# Save the processed data
processed_file = data_dir / 'nifty50_processed_data.csv'
combined_data_with_returns.to_csv(processed_file, index=False)

# Save return statistics
stats_file = data_dir / 'nifty50_return_statistics.csv'
return_stats.to_csv(stats_file)

# Create a summary file
summary_info = {
    'total_records': len(combined_data_with_returns),
    'unique_symbols': combined_data_with_returns['Symbol'].nunique(),
    'date_range_start': str(combined_data_with_returns['Date'].min()),
    'date_range_end': str(combined_data_with_returns['Date'].max()),
    'successful_symbols': len(successful_symbols),
    'failed_symbols': len(failed_symbols),
    'processing_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
}

summary_df = pd.DataFrame([summary_info])
summary_file = data_dir / 'data_processing_summary.csv'
summary_df.to_csv(summary_file, index=False)

print("DATA SAVED SUCCESSFULLY")
print("=" * 40)
print(f"Main Dataset: {processed_file}")
print(f"Statistics: {stats_file}")
print(f"Summary: {summary_file}")
print(f"\nRecords Saved: {len(combined_data_with_returns):,}")
print(f"File Size: {processed_file.stat().st_size / 1024**2:.2f} MB")

DATA SAVED SUCCESSFULLY
Main Dataset: ..\data\processed\nifty50_processed_data.csv
Statistics: ..\data\processed\nifty50_return_statistics.csv
Summary: ..\data\processed\data_processing_summary.csv

Records Saved: 71,700
File Size: 13.77 MB
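
The one-row summary could also be stored as JSON, which keeps integers and strings distinct when the file is read back. This is a minimal sketch using only the standard library; the `.json` file name is hypothetical, and the values are copied from the outputs above:

```python
import json
import tempfile
from pathlib import Path

summary = {
    "total_records": 71700,          # "Records Saved" in the output above
    "unique_symbols": 50,
    "date_range_end": "2025-10-15",  # latest date reported in the quality check
}

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data_processing_summary.json"  # hypothetical file name
    path.write_text(json.dumps(summary, indent=2), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded == summary)
```

In the notebook, the path would sit under `data_dir` rather than a temporary directory.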


## Summary

### What We Accomplished:

  1. **Data Acquisition**: Successfully fetched daily data for all 50 NIFTY 50 stocks from 1 January 2020 to 15 October 2025

  2. **Data Quality**: Assessed and validated data quality with comprehensive checks

  3. **Log Returns**: Calculated log returns for stationarity and modeling readiness

  4. **Statistical Analysis**: Computed return statistics, volatility, and Sharpe ratios

  5. **Visualization**: Created comprehensive dashboards for data exploration

  6. **Data Export**: Saved processed data for subsequent analysis notebooks

### Key Insights:

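  - **Dataset Size**: 71,700 daily return observations across 50 stocks (1,435 price records per stock before dropping each stock's first day)
  
  - **Time Period**: 1 January 2020 to 15 October 2025, just under six years of daily data
  
  - **Return Characteristics**: Pooled daily log returns average 0.000682 with a standard deviation of 0.019921, negative skew (-0.81), and heavy tails (kurtosis 23.2); M&M has the highest Sharpe ratio (1.02)
  
  - **Data Quality**: No missing values or duplicate Symbol-Date rows; 61 zero-volume rows were flagged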

### Next Steps:

**Notebook 02**: We'll perform Exploratory Data Analysis (EDA) and Time Series Foundations including:
- Stationarity testing (ADF tests)
- ACF/PACF analysis
- Time-based data splitting
- Trend and seasonality analysis

---

### *Next Notebook Preview*

Now that we have clean, processed data with calculated log returns, the next step is to dive deep into **Exploratory Data Analysis and Time Series Foundations**. We will explore statistical properties, test for stationarity, and prepare the data for advanced modeling techniques.

---

#### About This Project

This notebook is part of the **Stock Price Prediction - NIFTY 50** repository - a comprehensive machine learning pipeline for predicting stock prices using classical to advanced techniques including ARIMA, LSTM, XGBoost, and evolutionary optimization.

**Repository:** [`stock-price-prediction-nifty50`](https://github.com/prakash-ukhalkar/stock-price-prediction-nifty50)

**Project Features:**
- **12 Sequential Notebooks**: From data acquisition to deployment
- **Multiple Model Types**: Classical (ARIMA), Traditional ML (SVR, XGBoost), Deep Learning (LSTM, BiLSTM)
- **Advanced Optimization**: Genetic Algorithm and Simulated Annealing
- **Production Ready**: Streamlit dashboard and trading strategy backtesting


#### **Author**

**Prakash Ukhalkar**  
[![GitHub](https://img.shields.io/badge/GitHub-prakash--ukhalkar-blue?style=flat&logo=github)](https://github.com/prakash-ukhalkar)

---

<div align="center">
  <sub>Built with ❤️ for the quantitative finance and data science community</sub>
</div>