# Exploratory Data Analysis (EDA)

This notebook performs exploratory data analysis on stock price data to understand its statistical properties, identify patterns, and prepare for LSTM model training.

## Objectives

- Validate statistical properties of price and returns
- Inspect volatility & trend regimes
- Detect potential data leakage
- Support LSTM + walk-forward trading design

In [None]:
# Import required libraries for data analysis and visualization

import sys
import os


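# Add the project root to the Python path so that the src package can be imported.
# Notebooks do not define __file__, so the quoted "__file__" is resolved against the
# current working directory; the project root is therefore taken to be the parent of
# the directory this notebook is run from.
project_root = os.path.dirname(os.path.dirname(os.path.abspath("__file__")))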
if project_root not in sys.path:
    sys.path.insert(0, project_root)


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style for consistent and clean plots
plt.style.use("seaborn-v0_8")
sns.set_context("notebook")

# Configure pandas to display floating point numbers with 6 decimal places
pd.set_option("display.float_format", "{:.6f}".format)

# Import data loading utility from the project
from src.data.loader import load_data

## Data Loading

Load historical stock data for AAPL (Apple Inc.) from January 1, 2022 onwards.

In [None]:
# Define stock ticker and date range for analysis
ticker = "AAPL"
start_date = "2022-01-01"
end_date = None  # None means use the latest available date

# Load the stock data using the custom data loader
df = load_data(ticker, start_date, end_date)

# Display first few rows to verify data was loaded correctly
df.head()

## Data Inspection

Examine the structure, data types, and basic statistics of the loaded dataset.

In [None]:
# Display data types, column info, and memory usage
# This helps identify any missing values or incorrect data types
df.info()

In [None]:
# Display statistical summary of all numeric columns
# Includes count, mean, std, min, max, and quartiles
df.describe()

## Data Quality Validation

Perform assertions to ensure data meets quality requirements for analysis.

In [None]:
# Validate data quality with assertions:

# 1. Check that index is DatetimeIndex for time series analysis
assert isinstance(df.index, pd.DatetimeIndex)

# 2. Ensure dates are sorted in ascending (chronological) order
assert df.index.is_monotonic_increasing

# 3. Verify there are no missing values in the dataset
assert df.isnull().sum().sum() == 0
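
The assertions above stop at the first failure and do not cover duplicate timestamps or non-positive prices, which would break the log-return step later on. A minimal sketch of a checker that collects all problems at once is shown below; the helper name `check_price_frame` and the synthetic frames are hypothetical and only stand in for the output of `load_data`.

```python
import numpy as np
import pandas as pd


def check_price_frame(frame: pd.DataFrame, price_col: str = "Close") -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if not isinstance(frame.index, pd.DatetimeIndex):
        problems.append("index is not a DatetimeIndex")
    if not frame.index.is_monotonic_increasing:
        problems.append("index is not sorted in ascending order")
    if frame.index.has_duplicates:
        problems.append("index contains duplicate timestamps")
    if frame.isnull().to_numpy().any():
        problems.append("frame contains missing values")
    # Non-positive prices would make np.log() return -inf or NaN
    if (frame[price_col] <= 0).to_numpy().any():
        problems.append(f"{price_col} contains non-positive prices")
    return problems


# Synthetic stand-in for the loaded data (hypothetical values)
idx = pd.date_range("2022-01-03", periods=5, freq="B")
good = pd.DataFrame({"Close": [100.0, 101.0, 99.5, 102.0, 103.0]}, index=idx)

bad = good.copy()
bad.index = idx[[0, 1, 1, 3, 4]]  # introduce a duplicate timestamp
bad.iloc[2, 0] = np.nan           # introduce a missing value

print(check_price_frame(good))
print(check_price_frame(bad))
```

Collecting problems into a list rather than asserting makes it easier to see every issue in a single run when a new ticker is loaded.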

## Closing Price Visualization

Plot the historical closing prices to visualize the overall price movement.

In [None]:
# Create a figure with specified size for better visibility
plt.figure(figsize=(12,4))

# Plot the Close price column over time
df["Close"].plot(title=f"{ticker} Closing Price")

# Display the plot
plt.show()

## Log Returns Calculation

Calculate logarithmic returns, which are preferred over simple returns for statistical analysis due to their time-additivity and better statistical properties.

In [None]:
# Calculate log returns: log(P_t / P_{t-1})
# Log returns are preferred because:
# - They are time-additive (a multi-period return is the sum of the single-period returns)
# - They are often closer to normally distributed than simple returns
# - They are symmetric (a +x log return followed by a -x log return restores the original price)
df["log_ret"] = np.log(df["Close"]).diff()

# Remove the first row (NaN from diff operation)
df.dropna(inplace=True)
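
The time-additivity claim can be checked on a toy series. The sketch below uses four made-up prices (not AAPL data) and compares the sum of per-period returns with the total return over the whole span.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration only
prices = pd.Series([100.0, 110.0, 99.0, 105.0])

log_ret = np.log(prices).diff().dropna()
simple_ret = prices.pct_change().dropna()

total_log = np.log(prices.iloc[-1] / prices.iloc[0])
total_simple = prices.iloc[-1] / prices.iloc[0] - 1

# Log returns sum to the total log return over the span
print(np.isclose(log_ret.sum(), total_log))                # True
# Simple returns do not sum to the total simple return
print(np.isclose(simple_ret.sum(), total_simple))          # False
# The summed log return converts back to the total simple return
print(np.isclose(np.expm1(log_ret.sum()), total_simple))   # True
```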

## Log Returns Statistics

Examine the statistical distribution of log returns, including extreme percentiles.

In [None]:
# Display detailed statistics of log returns including extreme percentiles
# This helps identify the range of typical vs extreme returns
df["log_ret"].describe(percentiles=[0.01, 0.05, 0.95, 0.99])

## Log Returns Distribution

Visualize the distribution of log returns to check for normality and identify outliers.

In [None]:
# Create histogram to visualize the distribution of log returns
# 100 bins provide fine granularity to see the shape
plt.figure(figsize=(10,4))
df["log_ret"].hist(bins=100)
plt.title("Log Return Distribution")
plt.show()
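
A histogram gives only a visual impression of normality. One way to put numbers on it is to compute skewness, excess kurtosis, and the share of observations beyond three standard deviations. The sketch below runs on synthetic data (a normal sample and a fat-tailed Student-t sample), not on the notebook's returns; `tail_summary` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def tail_summary(ret: pd.Series) -> pd.Series:
    """Summarise asymmetry and tail heaviness of a return series."""
    z = (ret - ret.mean()) / ret.std()
    return pd.Series({
        "skew": ret.skew(),
        "excess_kurtosis": ret.kurt(),  # 0 for a normal distribution
        "share_beyond_3_sigma": (z.abs() > 3).mean(),
    })


rng = np.random.default_rng(0)
normal_ret = pd.Series(rng.normal(0.0, 0.01, 5000))
fat_tailed_ret = pd.Series(rng.standard_t(df=3, size=5000) * 0.01)

summary = pd.DataFrame({
    "normal": tail_summary(normal_ret),
    "student_t_3": tail_summary(fat_tailed_ret),
})
print(summary.round(4))
```

Applied to `df["log_ret"]`, an excess kurtosis well above zero would indicate fatter tails than a normal distribution.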

## Autocorrelation Analysis

Plot the autocorrelation function (ACF) to check for temporal dependencies in returns.

In [None]:
# Import the ACF plotting function from statsmodels
from statsmodels.graphics.tsaplots import plot_acf

# Plot the autocorrelation function up to lag 60 (lag 0 is included by default)
# This helps identify if past returns predict future returns
plot_acf(df["log_ret"], lags=60)
plt.title("Autocorrelation of Log Returns")
plt.show()
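
Near-zero autocorrelation in returns does not rule out dependence in their magnitude. A common complementary check is the autocorrelation of squared returns, which tends to be positive when volatility clusters. The sketch below illustrates this on a simulated GARCH(1,1)-style series; the simulator `simulate_garch` and its parameters are assumptions chosen for illustration, not estimates from the AAPL data.

```python
import numpy as np
import pandas as pd


def simulate_garch(n, omega=1e-6, alpha=0.15, beta=0.80, seed=0):
    """Simulate returns whose variance depends on past squared returns."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    r = np.empty(n)
    var = omega / (1 - alpha - beta)  # start at the unconditional variance
    for t in range(n):
        r[t] = np.sqrt(var) * z[t]
        var = omega + alpha * r[t] ** 2 + beta * var
    return pd.Series(r)


returns = simulate_garch(5000)
lags = range(1, 6)

acf_table = pd.DataFrame(
    {
        "returns": [returns.autocorr(lag=k) for k in lags],
        "squared_returns": [(returns ** 2).autocorr(lag=k) for k in lags],
    },
    index=pd.Index(lags, name="lag"),
)
print(acf_table.round(3))
```

The same two columns can be computed from `df["log_ret"]` to see whether the real series shows the same pattern.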

## Volatility Calculation

Calculate rolling volatility measures to identify volatility regimes and clustering.

In [None]:
# Calculate 20-day rolling standard deviation of returns (short-term volatility)
df["vol_20"] = df["log_ret"].rolling(20).std()

# Calculate 60-day median of the 20-day rolling volatility (long-term volatility baseline)
# Using median is more robust to outliers than mean
df["vol_60_med"] = df["vol_20"].rolling(60).median()
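
Chaining a 20-day window and a 60-day window means both columns start with a run of NaN values, which matters for every later step that compares them. The sketch below shows the length of that warm-up period on a synthetic return series with a plain integer index.

```python
import numpy as np
import pandas as pd

# Synthetic returns for illustration only
rng = np.random.default_rng(1)
ret = pd.Series(rng.normal(0.0, 0.01, 200))

vol_20 = ret.rolling(20).std()
vol_60_med = vol_20.rolling(60).median()

# The first 19 values of vol_20 are NaN (a full 20-observation window is required)
print(vol_20.first_valid_index())       # 19
# The median additionally needs 60 valid vol_20 values: 19 + 59 = 78
print(vol_60_med.first_valid_index())   # 78
print(int(vol_60_med.isna().sum()))     # 78
```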

## Volatility Regimes Visualization

Plot the short-term and long-term volatility to identify different market regimes.

In [None]:
# Plot both volatility measures over time to visualize regimes
plt.figure(figsize=(12,4))
df["vol_20"].plot(label="20D Vol")  # Short-term volatility
df["vol_60_med"].plot(label="60D Median Vol")  # Long-term median volatility
plt.legend()
plt.title("Volatility Regimes")
plt.show()

## Volatility Ratio Analysis

Calculate the ratio of short-term to long-term volatility to identify regime changes.

In [None]:
# Calculate volatility ratio: short-term / long-term
# Ratio > 1 indicates increased volatility (high volatility regime)
# Ratio < 1 indicates decreased volatility (low volatility regime)
vol_ratio = df["vol_20"] / df["vol_60_med"]

plt.figure(figsize=(12,4))
vol_ratio.plot(title="Volatility Ratio (20D Vol / 60D Median Vol)")
plt.show()

## Volatility Ratio Distribution

Visualize the distribution of volatility ratios to set thresholds for regime classification.

In [None]:
# Plot histogram of volatility ratio to see distribution
plt.figure(figsize=(10,4))
vol_ratio.hist(bins=50)

# Add vertical line at 1.2 to mark potential high-volatility threshold
plt.axvline(1.2, color="red", linestyle="--")
plt.title("Volatility Ratio Distribution")
plt.show()

## Trend Strength Calculation

Calculate a measure of trend strength based on price deviation from moving average.

In [None]:
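# Extract closing prices as a Series. Depending on the loader, df["Close"] may be a
# single-column DataFrame (e.g. with MultiIndex columns), so take its first column if so.
close = df["Close"]
if isinstance(close, pd.DataFrame):
    close = close.iloc[:, 0]

# Calculate 50-day simple moving average of closing prices
df["ma50"] = close.rolling(50).mean()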

# Calculate trend strength as absolute percentage deviation from MA
# Higher values indicate stronger trending behavior
df["trend_strength"] = (close - df["ma50"]).abs() / df["ma50"]
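
To make the trend-strength formula concrete, the sketch below applies it to five made-up prices with a 3-day window instead of 50, so each value can be verified by hand.

```python
import pandas as pd

# Made-up prices and a short window, for illustration only
toy_close = pd.Series([100.0, 102.0, 104.0, 103.0, 110.0])
toy_ma = toy_close.rolling(3).mean()

# Absolute deviation from the moving average, relative to the moving average
toy_trend = (toy_close - toy_ma).abs() / toy_ma

print(pd.DataFrame({"close": toy_close, "ma3": toy_ma, "trend_strength": toy_trend}))
# Row 2: ma3 = (100 + 102 + 104) / 3 = 102, so trend_strength = 2 / 102
# Row 3: ma3 = (102 + 104 + 103) / 3 = 103, so trend_strength = 0
```

Because of the absolute value, the measure does not distinguish prices above the moving average from prices below it.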

## Trend Strength Time Series

Visualize how trend strength evolves over time.

In [None]:
# Plot trend strength over time to see when strong trends occur
plt.figure(figsize=(12,4))
df["trend_strength"].plot(title="Trend Strength")
plt.show()

## Trend Strength Distribution

Visualize the distribution of trend strength to determine threshold for trend identification.

In [None]:
# Plot histogram of trend strength to identify typical vs extreme values
plt.figure(figsize=(10,4))
df["trend_strength"].hist(bins=50)

# Add vertical line at 0.0075 as potential threshold for strong trends
plt.axvline(0.0075, color="red", linestyle="--")
plt.title("Trend Strength Distribution")
plt.show()

## Regime Filtering Definition

Define a regime filter based on volatility and trend strength to identify favorable trading conditions.

In [None]:
# Define regime_ok based on two conditions:
# 1. Short-term volatility is less than 1.2x long-term median (not in high volatility regime)
# 2. Trend strength is greater than 0.75% (strong enough trend to trade)
df["regime_ok"] = (
    (df["vol_20"] < 1.2 * df["vol_60_med"]) &
    (df["trend_strength"] > 0.0075)
)

## Regime Distribution

Examine the proportion of time spent in each regime.

In [None]:
# Show the distribution of regime_ok values as percentages
# This tells us what fraction of time is suitable for trading
df["regime_ok"].value_counts(normalize=True)
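
Comparisons against NaN evaluate to False, so rows in the rolling-window warm-up period are counted as "regime not OK" even though the regime is simply unknown there. The sketch below shows the effect on a tiny hand-built frame (values are made up) and computes the share over fully populated rows only.

```python
import numpy as np
import pandas as pd

# Hand-built example; the first rows mimic the rolling-window warm-up period
frame = pd.DataFrame({
    "vol_20": [np.nan, np.nan, 0.010, 0.020, 0.011],
    "vol_60_med": [np.nan, np.nan, 0.010, 0.010, 0.010],
    "trend_strength": [np.nan, 0.020, 0.020, 0.020, 0.001],
})

regime_ok = (
    (frame["vol_20"] < 1.2 * frame["vol_60_med"]) &
    (frame["trend_strength"] > 0.0075)
)

# Rows where every input is available
valid = frame.notna().all(axis=1)

print(regime_ok.mean())          # share over all rows, warm-up counted as False
print(regime_ok[valid].mean())   # share over rows where the regime is defined
```

With a 50-day moving average and a 20-day plus 60-day volatility chain, the warm-up period covers a noticeable part of a sample that starts in 2022, so the two shares can differ.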

## Feature Correlation Analysis

Examine correlations between different features to understand their relationships.

In [None]:
# Select features for correlation analysis
features = ["log_ret", "vol_20", "trend_strength"]

# Display correlation matrix between selected features
# This helps identify multicollinearity and feature relationships
df[features].corr()

## Feature Correlation with Future Returns

Calculate how well each feature correlates with next-day returns to assess predictive potential.

In [None]:
# Create target variable: next day's return (what we want to predict)
df["future_ret"] = df["log_ret"].shift(-1)

# Calculate correlation between features and future returns
# This helps assess the predictive power of each feature
df[features].corrwith(df["future_ret"])
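
Since the target is built with `shift(-1)`, it is worth confirming that none of the features themselves look ahead. One simple check is a truncation test: compute a feature on the full series and on a series cut off at some date, then confirm the overlapping values agree. The sketch below is a minimal version on synthetic returns; `causal_vol`, `leaky_vol`, and `is_causal` are hypothetical helper names, and the centered window is included only as a deliberately leaky counterexample.

```python
import numpy as np
import pandas as pd


def causal_vol(ret: pd.Series, window: int = 20) -> pd.Series:
    """Trailing rolling volatility: uses only current and past observations."""
    return ret.rolling(window).std()


def leaky_vol(ret: pd.Series, window: int = 20) -> pd.Series:
    """Centered rolling volatility: also uses future observations."""
    return ret.rolling(window, center=True).std()


def is_causal(feature_fn, ret: pd.Series, cutoff: int) -> bool:
    """True if feature values before `cutoff` are unaffected by later data."""
    full = feature_fn(ret).iloc[:cutoff].to_numpy()
    truncated = feature_fn(ret.iloc[:cutoff]).to_numpy()
    return bool(np.allclose(full, truncated, equal_nan=True))


rng = np.random.default_rng(2)
ret = pd.Series(rng.normal(0.0, 0.01, 300))

print(is_causal(causal_vol, ret, cutoff=150))  # True
print(is_causal(leaky_vol, ret, cutoff=150))   # False
```

The same test can be wrapped around the notebook's own feature calculations before they are fed into a walk-forward split.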

## Key Takeaways

Summary of findings from the exploratory data analysis:

- Returns are noisy with near-zero autocorrelation
- Volatility clustering is present
- Trend regimes exist but are intermittent
- Regime filtering is justified
- Forecasting alone is weak → filtering & sizing are critical