# Momentum, Volatility, and Volume Factors in U.S. Stock Returns

**ISYE 4031 Final Project**  
*Regression & Forecasting, Georgia Tech*

## Project Overview

This notebook analyzes the relationship between **momentum**, **volatility**, and **volume** factors in U.S. stock returns using S&P 500 data.

### Research Questions:
1. Do momentum indicators significantly predict future stock returns?
2. How does volatility clustering affect return predictability? 
3. Is trading volume a reliable indicator of price direction?

---

In [1]:
import yfinance as yf
import pandas as pd
from pandas_datareader import data as pdr
import datetime as dt
import numpy as np
from bs4 import BeautifulSoup
import requests, re
import ta

---
### S&P 500 Stock List
We start by scraping the current S&P 500 stock list from a reliable financial data source.

**Data Source**: [Stock Analysis - S&P 500](https://stockanalysis.com/list/sp-500-stocks/)

**Key Information Collected**:
- Stock symbols (tickers)
- Market capitalization



In [2]:
url = 'https://stockanalysis.com/list/sp-500-stocks/'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

# Find the table and extract headers
table = soup.find('table', class_='symbol-table svelte-1ro3niy')
headers = [th.get_text(strip=True) for th in table.find('tr').find_all('th')]

# Extract all row data
stocks_data = []
for row in table.find_all('tr')[1:]:  # Skip header row
    row_data = [cell.get_text(strip=True) for cell in row.find_all('td')]
    stocks_data.append(row_data)

# Create DataFrame and set No. column as index
sp500_df = pd.DataFrame(stocks_data, columns=headers)
sp500_df = sp500_df.set_index('No.')

print("\nFirst 10 rows:")
print(sp500_df.head(10)[['Symbol', 'Market Cap']])


First 10 rows:
    Symbol Market Cap
No.                  
1     NVDA      4.69T
2     AAPL      4.07T
3     MSFT      3.78T
4    GOOGL      3.52T
5     GOOG      3.50T
6     AMZN      2.66T
7     AVGO      1.66T
8     META      1.58T
9     TSLA      1.46T
10   BRK.B      1.07T


---

### Stock Selection & Date Range Setup

**Stock Selection Process:**
- Extract first 50 companies from S&P 500 list for analysis
- Focus on established companies for reliable historical data

**Analysis Time Period:**
- **Start Date**: January 1, 2021
- **End Date**: December 27, 2024
- **Duration**: 4 years of market data
- **Purpose**: Capture post-pandemic market trends and recovery patterns

> **Note**: Using a subset of the top 50 stocks for computational efficiency and financial significance.



In [6]:
stocks = sp500_df.head(15)['Symbol'].tolist()

# Filter out all tickers that contain dots (they often cause yfinance issues)
stocks = [ticker for ticker in stocks if '.' not in ticker]

stocks.sort()
startDate = dt.date(2021, 1, 4)
endDate = dt.date(2024, 12, 27)

print(f"Selected stocks: {stocks}")
print(f"Total stocks: {len(stocks)}")

Selected stocks: ['AAPL', 'AMZN', 'AVGO', 'GOOG', 'GOOGL', 'JPM', 'LLY', 'META', 'MSFT', 'NVDA', 'ORCL', 'TSLA', 'V', 'WMT']
Total stocks: 14


---

### Weekly Returns and Factor Calculation

**Objective**: Calculate weekly log returns and prepare data structure for technical indicator analysis.

**Key Metrics**:
- **Weekly Log Returns**: `ln(Close/Open) √ó 100` for price movement analysis
- **ROC** Rate of Change indicator for momentum analysis.
- **RVOL** Relative Volume indicator for volume analysis.
- **BBW** Bollinger Band width indicator for volatility analysis.

**Data Structure**:
- Multi-level columns for organized factor storage
- Separate columns for each technical indicator per stock
- Week numbering for time series tracking

In [8]:
# Download the data
try:
    download = yf.download(
        tickers = stocks,
        start = startDate,
        end = endDate,
        actions = False, threads = True, auto_adjust = True, rounding = True,
        group_by = 'tickers', 
        interval = '1wk'
    )
    
    # Extract Open and Close to find Log Returns
    open_data = download.xs('Open', level=1, axis=1)
    close_data = download.xs('Close', level=1, axis=1)
    log_returns = (np.log(close_data / open_data) * 100)
    
    # Create MultiIndex DataFrame
    columns = []
    for ticker in stocks:
        columns.extend([(ticker, 'Log_Return_%'), (ticker, 'ROC'), (ticker, 'RVOL'), (ticker, 'BBW')])
    multi_columns = pd.MultiIndex.from_tuples(columns, names=['Ticker', 'Metric'])
    weekly_data = pd.DataFrame(index=open_data.index, columns=multi_columns)
    
    # Fill in the data
    for ticker in stocks:
        weekly_data[(ticker, 'Log_Return_%')] = log_returns[ticker].round(2)
        weekly_data[(ticker, 'ROC')] = "ROC"
        weekly_data[(ticker, 'RVOL')] = "RVOL"
        weekly_data[(ticker, 'BBW')] = "BBW"
    
    # Add week numbers as a separate column
    weekly_data.insert(0, 'Week', range(1, len(weekly_data) + 1))
    
    print(f"Total weeks: {len(weekly_data)}")
    print(f"Date range: {weekly_data.index[0].date()} to {weekly_data.index[-1].date()}")
    print(f"DataFrame shape: {weekly_data.shape}")
    
    display(weekly_data)
        
except Exception as e:
    print(f"Error: {e}")

[*********************100%***********************]  14 of 14 completed

Total weeks: 208
Date range: 2021-01-04 to 2024-12-23
DataFrame shape: (208, 57)





Ticker,Week,AAPL,AAPL,AAPL,AAPL,AMZN,AMZN,AMZN,AMZN,AVGO,...,TSLA,TSLA,V,V,V,V,WMT,WMT,WMT,WMT
Metric,Unnamed: 1_level_1,Log_Return_%,ROC,RVOL,BBW,Log_Return_%,ROC,RVOL,BBW,Log_Return_%,...,RVOL,BBW,Log_Return_%,ROC,RVOL,BBW,Log_Return_%,ROC,RVOL,BBW
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2021-01-04,1,-1.11,ROC,RVOL,BBW,-2.71,ROC,RVOL,BBW,1.43,...,RVOL,BBW,-2.20,ROC,RVOL,BBW,1.61,ROC,RVOL,BBW
2021-01-11,2,-1.60,ROC,RVOL,BBW,-1.40,ROC,RVOL,BBW,1.20,...,RVOL,BBW,-5.95,ROC,RVOL,BBW,-0.93,ROC,RVOL,BBW
2021-01-18,3,8.47,ROC,RVOL,BBW,5.79,ROC,RVOL,BBW,3.67,...,RVOL,BBW,-0.51,ROC,RVOL,BBW,1.17,ROC,RVOL,BBW
2021-01-25,4,-8.08,ROC,RVOL,BBW,-3.75,ROC,RVOL,BBW,-4.15,...,RVOL,BBW,-3.60,ROC,RVOL,BBW,-3.41,ROC,RVOL,BBW
2021-02-01,5,2.23,ROC,RVOL,BBW,3.33,ROC,RVOL,BBW,2.19,...,RVOL,BBW,6.75,ROC,RVOL,BBW,2.43,ROC,RVOL,BBW
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-11-25,204,2.50,ROC,RVOL,BBW,4.23,ROC,RVOL,BBW,-2.02,...,RVOL,BBW,1.03,ROC,RVOL,BBW,2.19,ROC,RVOL,BBW
2024-12-02,205,2.32,ROC,RVOL,BBW,7.82,ROC,RVOL,BBW,9.69,...,RVOL,BBW,-1.88,ROC,RVOL,BBW,3.06,ROC,RVOL,BBW
2024-12-09,206,2.57,ROC,RVOL,BBW,0.11,ROC,RVOL,BBW,22.81,...,RVOL,BBW,0.93,ROC,RVOL,BBW,-1.49,ROC,RVOL,BBW
2024-12-16,207,2.59,ROC,RVOL,BBW,-2.33,ROC,RVOL,BBW,-4.86,...,RVOL,BBW,0.94,ROC,RVOL,BBW,-2.34,ROC,RVOL,BBW


Getting Technical Indicators for Stocks

In [9]:
# Download data for technical indicator calculations using ta library
try:
    # Need more data to calculate technical indicators properly
    extended_start = dt.date(2020, 1, 1)  # Start earlier for indicator calculations
    
    tech_download = yf.download(
        tickers = stocks,
        start = extended_start,
        end = dt.date(2021, 1, 15),  # Just past first week
        actions = False, threads = True, auto_adjust = True, rounding = True,
        group_by = 'tickers', 
        interval = '1d'  # Daily data for better indicator calculations
    )
    
    # Extract price data
    close_prices = tech_download.xs('Close', level=1, axis=1)
    high_prices = tech_download.xs('High', level=1, axis=1)
    low_prices = tech_download.xs('Low', level=1, axis=1)
    volume_data = tech_download.xs('Volume', level=1, axis=1)
    
    # Find the first week of our analysis period (Jan 4-8, 2021)
    first_week_start = pd.Timestamp('2021-01-04')
    first_week_end = pd.Timestamp('2021-01-08')
    
    # Filter to first week
    first_week_mask = (close_prices.index >= first_week_start) & (close_prices.index <= first_week_end)
    first_week_close = close_prices[first_week_mask]
    
    # Create DataFrame with stocks as rows, indicators as columns
    first_week_indicators = pd.DataFrame(index=stocks, columns=['Log_Return_%', 'ROC', 'RVOL', 'BBW'])
    
    for ticker in stocks:
        # Get price data up to first week end for calculations
        ticker_data = pd.DataFrame({
            'close': close_prices.loc[:first_week_end, ticker],
            'high': high_prices.loc[:first_week_end, ticker],
            'low': low_prices.loc[:first_week_end, ticker],
            'volume': volume_data.loc[:first_week_end, ticker]
        }).dropna()
        
        if len(ticker_data) < 20:  # Need enough data for indicators
            first_week_indicators.loc[ticker] = [np.nan, np.nan, np.nan, np.nan]
            continue
            
        # Log Return % for first week (manual calculation)
        week_open = first_week_close[ticker].iloc[0] if len(first_week_close) > 0 else np.nan
        week_close = first_week_close[ticker].iloc[-1] if len(first_week_close) > 0 else np.nan
        log_return = np.log(week_close / week_open) * 100 if not pd.isna(week_open) and not pd.isna(week_close) else np.nan
        
        # Calculate indicators using ta library (with correct class names)
        # Rate of Change (ROC) - 10-day
        roc = ta.momentum.ROCIndicator(close=ticker_data['close'], window=10).roc().iloc[-1] * 100
        
        # Relative Volume (RVOL) - Current week volume vs 20-day SMA volume
        # Use simple rolling mean since ta library doesn't have VolumeSMAIndicator
        volume_sma_20 = ticker_data['volume'].rolling(window=20).mean()
        avg_volume = volume_sma_20.iloc[-1] if not volume_sma_20.empty else np.nan
        current_week_volume = ticker_data['volume'].iloc[-5:].mean()  # Last 5 days average
        rvol = (current_week_volume / avg_volume) if not pd.isna(avg_volume) and avg_volume != 0 else np.nan
        
        # Bollinger Band Width (BBW) using ta library
        bb_indicator = ta.volatility.BollingerBands(close=ticker_data['close'], window=20, window_dev=2)
        upper = bb_indicator.bollinger_hband().iloc[-1]
        lower = bb_indicator.bollinger_lband().iloc[-1]
        middle = bb_indicator.bollinger_mavg().iloc[-1]
        bbw = ((upper - lower) / middle) * 100 if not pd.isna(upper) and not pd.isna(lower) and middle != 0 else np.nan
            
        # Add row to DataFrame
        first_week_indicators.loc[ticker] = [log_return, roc, rvol, bbw]
    
    print(f"üìä First Week Technical Analysis (Using ta library)")
    print(f"Analysis Period: {first_week_start.date()} to {first_week_end.date()}")
    print(f"Stocks Analyzed: {len(stocks)}")
    print(f"\nüîç Technical Indicators for First Week:")
    print("‚Ä¢ Log_Return_%: Weekly log return")
    print("‚Ä¢ ROC: 10-day Rate of Change (ta.momentum)")
    print("‚Ä¢ RVOL: Relative Volume vs 20-day average (pandas rolling)")
    print("‚Ä¢ BBW: Bollinger Band Width (ta.volatility)")
    
    display(first_week_indicators.round(4))
    
except Exception as e:
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()

[*********************100%***********************]  14 of 14 completed

üìä First Week Technical Analysis (Using ta library)
Analysis Period: 2021-01-04 to 2021-01-08
Stocks Analyzed: 14

üîç Technical Indicators for First Week:
‚Ä¢ Log_Return_%: Weekly log return
‚Ä¢ ROC: 10-day Rate of Change (ta.momentum)
‚Ä¢ RVOL: Relative Volume vs 20-day average (pandas rolling)
‚Ä¢ BBW: Bollinger Band Width (ta.volatility)





Unnamed: 0,Log_Return_%,ROC,RVOL,BBW
AAPL,2.019636,83.150298,1.073922,12.090838
AMZN,-0.125604,-8.162753,1.022023,6.977552
AVGO,4.688757,474.576271,1.001516,10.537177
GOOG,4.477595,432.4073,1.24789,5.04911
GOOGL,4.069133,403.169424,1.202613,4.774844
JPM,8.468695,953.117888,1.272108,16.987683
LLY,0.545271,55.954728,0.910911,8.616199
META,-0.51011,-20.264945,0.993752,5.801513
MSFT,0.884877,-63.067727,1.058476,7.507613
NVDA,1.21582,208.172706,1.624698,6.065621


In [None]:
# Goal Code

## üî¨ Multiple Linear Regression Analysis

### Factor Definition & Model Specification

**Our Three-Factor Model:**
```
Return_{i,t} = Œ± + Œ≤_MOM √ó MOM_{i,t-1} + Œ≤_BBW √ó BBW_{i,t-1} + Œ≤_VOL √ó VOL_{i,t-1} + Œµ_{i,t}
```

### Factor Definitions:

1. **Momentum Factor (MOM_Factor)**: 
   - **36-week Rate of Change (ROC)** 
   - Measures intermediate-term price trends
   - Formula: `ROC = ((Close_t - Close_t-36) / Close_t-36) √ó 100`

2. **Volatility Factor (BBW_Factor)**: 
   - **36-week Bollinger Band Width (BBW)**
   - Measures recent price volatility using 2 standard deviations
   - Formula: `BBW = ((Upper_Band - Lower_Band) / Moving_Average) √ó 100`

3. **Volume Factor (VOL_Factor)**: 
   - **50-week Relative Volume** 
   - Compares recent volume to long-term average
   - Formula: `RVOL = Current_Volume / 50_week_Average_Volume`

### Methodology:
- **Cross-sectional regressions** run weekly across our stock universe
- **One-week lag** on all factors to avoid look-ahead bias
- **Weekly time series** from 2021-2024 (4 years)
- **Minimum 10 stocks** required for stable regression results

---

In [78]:
# Step 1: Import required libraries for regression analysis
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

print("üî¨ Multiple Linear Regression: Factor Analysis Setup")
print("=" * 55)
print("üìä Analyzing Momentum, Volatility, and Volume Factors")
print("üéØ Model: Return_{i,t} = Œ± + Œ≤_MOM√óMOM_{i,t-1} + Œ≤_BBW√óBBW_{i,t-1} + Œ≤_VOL√óVOL_{i,t-1} + Œµ_{i,t}")
print("=" * 55)

üî¨ Multiple Linear Regression: Factor Analysis Setup
üìä Analyzing Momentum, Volatility, and Volume Factors
üéØ Model: Return_{i,t} = Œ± + Œ≤_MOM√óMOM_{i,t-1} + Œ≤_BBW√óBBW_{i,t-1} + Œ≤_VOL√óVOL_{i,t-1} + Œµ_{i,t}


In [79]:
# Step 2: Download Extended Dataset for Factor Calculations
print("\nüì• Downloading Extended Dataset for Factor Calculations...")

try:
    # Extended time range - start earlier for proper factor calculations
    extended_start_date = dt.date(2020, 6, 1)  # 6 months before analysis start
    
    print(f"üìÖ Extended date range: {extended_start_date} to {endDate}")
    print(f"üíæ Downloading daily data for {len(stocks)} stocks...")
    
    # Download daily data for accurate factor calculations
    extended_data = yf.download(
        tickers=stocks,
        start=extended_start_date,
        end=endDate,
        actions=False, 
        threads=True, 
        auto_adjust=True, 
        rounding=True,
        group_by='tickers', 
        interval='1d'
    )
    
    # Extract OHLCV data for each component
    daily_close = extended_data.xs('Close', level=1, axis=1)
    daily_high = extended_data.xs('High', level=1, axis=1)
    daily_low = extended_data.xs('Low', level=1, axis=1)
    daily_volume = extended_data.xs('Volume', level=1, axis=1)
    daily_open = extended_data.xs('Open', level=1, axis=1)
    
    # Convert daily to weekly data (Friday close)
    weekly_close_ext = daily_close.resample('W-FRI').last()
    weekly_high_ext = daily_high.resample('W-FRI').max()
    weekly_low_ext = daily_low.resample('W-FRI').min() 
    weekly_volume_ext = daily_volume.resample('W-FRI').sum()
    weekly_open_ext = daily_close.resample('W-FRI').first()
    
    # Filter to our main analysis period (2021 onwards)
    analysis_start_date = pd.Timestamp('2021-01-01')
    mask = weekly_close_ext.index >= analysis_start_date
    
    weekly_close_analysis = weekly_close_ext[mask]
    weekly_high_analysis = weekly_high_ext[mask]
    weekly_low_analysis = weekly_low_ext[mask]
    weekly_volume_analysis = weekly_volume_ext[mask]
    weekly_open_analysis = weekly_open_ext[mask]
    
    print(f"‚úÖ Data download successful!")
    print(f"üìä Analysis period: {weekly_close_analysis.index[0].date()} to {weekly_close_analysis.index[-1].date()}")
    print(f"üìà Total weeks in analysis: {len(weekly_close_analysis)}")
    print(f"üè¢ Stocks: {len(stocks)}")
    print(f"üìã Data shape: {weekly_close_analysis.shape}")
    
except Exception as e:
    print(f"‚ùå Error downloading extended data: {e}")
    raise


üì• Downloading Extended Dataset for Factor Calculations...
üìÖ Extended date range: 2020-06-01 to 2024-12-27
üíæ Downloading daily data for 14 stocks...


[*********************100%***********************]  14 of 14 completed

‚úÖ Data download successful!
üìä Analysis period: 2021-01-01 to 2024-12-27
üìà Total weeks in analysis: 209
üè¢ Stocks: 14
üìã Data shape: (209, 14)





In [80]:
# Step 3: Calculate Three Factor Exposures
print("\nüìä Calculating Factor Exposures for Each Stock...")
print("=" * 50)

# Initialize storage for factors and returns
factor_df = pd.DataFrame(index=weekly_close_analysis.index)

# Process each stock to calculate factors
successful_stocks = []

for ticker in stocks:
    try:
        print(f"üîÑ Processing {ticker}...", end="")
        
        # Get weekly price and volume data for this stock
        weekly_prices = pd.DataFrame({
            'close': weekly_close_ext[ticker],
            'high': weekly_high_ext[ticker],
            'low': weekly_low_ext[ticker],
            'volume': weekly_volume_ext[ticker],
            'open': weekly_open_ext[ticker]
        }).dropna()
        
        # Ensure we have enough data for calculations
        if len(weekly_prices) < 60:  # Need 50+ weeks for volume factor
            print(" ‚ö†Ô∏è Insufficient data")
            continue
            
        # FACTOR 1: MOMENTUM (36-week Rate of Change)
        # ROC = ((Close_t - Close_t-36) / Close_t-36) * 100
        momentum_36w = weekly_prices['close'].pct_change(periods=36) * 100
        
        # FACTOR 2: VOLATILITY (36-week Bollinger Band Width) 
        # Calculate 36-week moving average and standard deviation
        sma_36w = weekly_prices['close'].rolling(window=36).mean()
        std_36w = weekly_prices['close'].rolling(window=36).std()
        
        # Bollinger Bands (¬±2 standard deviations)
        upper_bb = sma_36w + (2 * std_36w)
        lower_bb = sma_36w - (2 * std_36w)
        
        # Bollinger Band Width = ((Upper - Lower) / SMA) * 100
        volatility_36w = ((upper_bb - lower_bb) / sma_36w) * 100
        
        # FACTOR 3: VOLUME (50-week Relative Volume)
        # RVOL = Current_Volume / 50_week_Average_Volume
        vol_sma_50w = weekly_prices['volume'].rolling(window=50).mean()
        volume_50w = weekly_prices['volume'] / vol_sma_50w
        
        # DEPENDENT VARIABLE: Weekly Log Returns
        weekly_log_returns = np.log(weekly_prices['close'] / weekly_prices['open']) * 100
        
        # Filter to analysis period and store
        analysis_mask = weekly_prices.index >= analysis_start_date
        
        factor_df[f'{ticker}_MOM'] = momentum_36w[analysis_mask]
        factor_df[f'{ticker}_VOL'] = volatility_36w[analysis_mask]
        factor_df[f'{ticker}_RVOL'] = volume_50w[analysis_mask]
        factor_df[f'{ticker}_RETURN'] = weekly_log_returns[analysis_mask]
        
        successful_stocks.append(ticker)
        print(" ‚úÖ")
        
    except Exception as e:
        print(f" ‚ùå Error: {e}")
        continue

print(f"\n‚úÖ Factor Calculation Complete!")
print(f"üìà Successfully processed: {len(successful_stocks)} stocks")
print(f"üìä Factor data shape: {factor_df.shape}")

# Remove rows with too many NaN values
factor_df_clean = factor_df.dropna(thresh=len(successful_stocks)*2)  # At least 50% valid data
print(f"üìã Clean data shape: {factor_df_clean.shape}")
print(f"üìÖ Analysis period: {factor_df_clean.index[0].date()} to {factor_df_clean.index[-1].date()}")

# Display sample of factor data
print(f"\nüìã Sample Factor Data (First 5 weeks, First 2 stocks):")
sample_cols = []
for ticker in successful_stocks[:2]:
    sample_cols.extend([f'{ticker}_RETURN', f'{ticker}_MOM', f'{ticker}_VOL', f'{ticker}_RVOL'])

display(factor_df_clean[sample_cols].head().round(3))


üìä Calculating Factor Exposures for Each Stock...
üîÑ Processing AAPL... ‚úÖ
üîÑ Processing AMZN... ‚úÖ
üîÑ Processing AVGO... ‚úÖ
üîÑ Processing GOOG... ‚úÖ
üîÑ Processing GOOGL... ‚úÖ
üîÑ Processing JPM... ‚úÖ
üîÑ Processing LLY... ‚úÖ
üîÑ Processing META... ‚úÖ
üîÑ Processing MSFT... ‚úÖ
üîÑ Processing NVDA... ‚úÖ
üîÑ Processing ORCL... ‚úÖ
üîÑ Processing TSLA... ‚úÖ
üîÑ Processing V... ‚úÖ
üîÑ Processing WMT... ‚úÖ

‚úÖ Factor Calculation Complete!
üìà Successfully processed: 14 stocks
üìä Factor data shape: (209, 56)
üìã Clean data shape: (204, 56)
üìÖ Analysis period: 2021-02-05 to 2024-12-27

üìã Sample Factor Data (First 5 weeks, First 2 stocks):


Unnamed: 0_level_0,AAPL_RETURN,AAPL_MOM,AAPL_VOL,AAPL_RVOL,AMZN_RETURN,AMZN_MOM,AMZN_VOL,AMZN_RVOL
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2021-02-05,2.082,,54.5,,0.281,,27.617,
2021-02-12,-1.129,64.161,51.934,,-1.37,32.01,24.014,
2021-02-19,-2.524,54.098,48.716,,-0.589,27.694,20.334,
2021-02-26,-3.838,39.392,45.234,,-2.799,15.626,17.535,
2021-03-05,-5.108,38.036,41.574,,-4.745,11.423,14.736,


In [None]:
# Step 4: Weekly Cross-Sectional Multiple Linear Regression
print("\nüî¨ Running Weekly Cross-Sectional Multiple Linear Regressions...")
print("=" * 60)
print("üéØ Model: Return_{i,t} = Œ± + Œ≤_MOM√óMOM_{i,t-1} + Œ≤_VOL√óVOL_{i,t-1} + Œ≤_RVOL√óRVOL_{i,t-1} + Œµ_{i,t}")
print("üìÖ Using 1-week lag to avoid look-ahead bias")

# Storage for regression results
regression_results = []

# Get clean weeks for regression (skip first week due to lagging)
regression_weeks = factor_df_clean.index[1:]  # Start from week 2

print(f"üìä Running regressions for {len(regression_weeks)} weeks...")

for i, current_week in enumerate(regression_weeks):
    try:
        # Get previous week index for lagged factors
        prev_week_idx = factor_df_clean.index.get_loc(current_week) - 1
        previous_week = factor_df_clean.index[prev_week_idx]
        
        # Collect cross-sectional data for this week
        weekly_returns = []  # Dependent variable: Returns at time t
        weekly_factors = []  # Independent variables: Factors at time t-1 (lagged)
        stock_names = []
        
        for ticker in successful_stocks:
            # Current week return (dependent variable)
            return_col = f'{ticker}_RETURN'
            # Previous week factors (independent variables) 
            mom_col = f'{ticker}_MOM'
            vol_col = f'{ticker}_VOL'
            rvol_col = f'{ticker}_RVOL'
            
            # Get the data
            if all(col in factor_df_clean.columns for col in [return_col, mom_col, vol_col, rvol_col]):
                current_return = factor_df_clean.loc[current_week, return_col]
                lagged_mom = factor_df_clean.loc[previous_week, mom_col]
                lagged_vol = factor_df_clean.loc[previous_week, vol_col] 
                lagged_rvol = factor_df_clean.loc[previous_week, rvol_col]
                
                # Only include if all values are valid (not NaN or infinite)
                values = [current_return, lagged_mom, lagged_vol, lagged_rvol]
                if all(pd.notna(values)) and all(np.isfinite(values)):
                    weekly_returns.append(current_return)
                    weekly_factors.append([lagged_mom, lagged_vol, lagged_rvol])
                    stock_names.append(ticker)
        
        # Run regression if we have sufficient observations
        min_stocks = 8  # Minimum stocks needed for stable regression
        if len(weekly_returns) >= min_stocks:
            
            # Convert to numpy arrays
            y = np.array(weekly_returns)
            X = np.array(weekly_factors)
            
            # Fit regression using scikit-learn
            reg_model = LinearRegression()
            reg_model.fit(X, y)
            
            # Calculate predictions and residuals
            y_pred = reg_model.predict(X)
            residuals = y - y_pred
            
            # Calculate R-squared
            r_squared = r2_score(y, y_pred)
            
            # Calculate adjusted R-squared  
            n = len(y)  # number of observations
            k = X.shape[1]  # number of features
            adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
            
            # Calculate F-statistic for overall model significance
            mse_model = np.sum((y_pred - np.mean(y))**2) / k
            mse_residual = np.sum(residuals**2) / (n - k - 1)
            f_statistic = mse_model / mse_residual if mse_residual > 0 else 0
            f_pvalue = 1 - stats.f.cdf(f_statistic, k, n - k - 1) if f_statistic > 0 else 1
            
            # Calculate t-statistics and p-values for coefficients
            # Standard errors calculation
            X_with_intercept = np.column_stack([np.ones(len(X)), X])
            try:
                # Calculate standard errors using matrix algebra
                XtX_inv = np.linalg.inv(X_with_intercept.T @ X_with_intercept)
                mse = np.sum(residuals**2) / (n - k - 1)
                var_coeff = mse * np.diag(XtX_inv)
                std_errors = np.sqrt(var_coeff)
                
                # Coefficients (intercept + slopes)
                all_coeffs = np.array([reg_model.intercept_] + list(reg_model.coef_))
                
                # T-statistics
                t_stats = all_coeffs / std_errors
                
                # P-values (two-tailed test)
                p_values = 2 * (1 - stats.t.cdf(np.abs(t_stats), n - k - 1))
                
            except np.linalg.LinAlgError:
                # If matrix inversion fails, set default values
                std_errors = np.ones(k + 1)
                t_stats = np.zeros(k + 1)
                p_values = np.ones(k + 1)
                all_coeffs = np.array([reg_model.intercept_] + list(reg_model.coef_))
            
            # Store comprehensive results
            result_dict = {
                'Week_Number': i + 2,  # Actual week number 
                'Date': current_week.date(),
                'N_Stocks': len(weekly_returns),
                'Alpha': reg_model.intercept_,  # Intercept
                'Beta_MOM': reg_model.coef_[0],  # Momentum coefficient
                'Beta_VOL': reg_model.coef_[1],  # Volatility coefficient  
                'Beta_RVOL': reg_model.coef_[2],  # Relative Volume coefficient
                'R_squared': r_squared,
                'Adj_R_squared': adj_r_squared,
                'F_statistic': f_statistic,
                'F_pvalue': f_pvalue,
                'Alpha_pvalue': p_values[0],
                'Beta_MOM_pvalue': p_values[1], 
                'Beta_VOL_pvalue': p_values[2],
                'Beta_RVOL_pvalue': p_values[3],
                'Alpha_tstat': t_stats[0],
                'Beta_MOM_tstat': t_stats[1],
                'Beta_VOL_tstat': t_stats[2], 
                'Beta_RVOL_tstat': t_stats[3],
                'Mean_Return': np.mean(y),
                'Std_Return': np.std(y)
            }
            
            regression_results.append(result_dict)
            
            # Progress indicator
            if (i + 1) % 20 == 0:
                print(f"üìä Completed {i + 1} regressions...")
                    
        else:
            if (i + 1) % 50 == 0:  # Less frequent warning for insufficient stocks
                print(f"‚ö†Ô∏è Week {i+2}: Only {len(weekly_returns)} stocks available (minimum {min_stocks} required)")
            
    except Exception as e:
        print(f"‚ùå Error processing week {i+2}: {e}")
        continue

# Convert results to DataFrame for analysis
mlr_results_df = pd.DataFrame(regression_results)

print(f"\n‚úÖ Multiple Linear Regression Analysis Complete!")
print(f"üìä Successfully analyzed {len(mlr_results_df)} weeks")
if len(mlr_results_df) > 0:
    print(f"üìà Average stocks per regression: {mlr_results_df['N_Stocks'].mean():.1f}")
    print(f"üìÖ Period coverage: {mlr_results_df['Date'].min()} to {mlr_results_df['Date'].max()}") 
else:
    print("‚ö†Ô∏è No successful regressions completed")

# Store results globally for further analysis  
globals()['mlr_results'] = mlr_results_df
globals()['factor_data_final'] = factor_df_clean

print(f"\nüíæ Results stored in variables:")
print(f"   ‚Ä¢ 'mlr_results': Weekly regression results")
print(f"   ‚Ä¢ 'factor_data_final': Complete factor dataset")


üî¨ Running Weekly Cross-Sectional Multiple Linear Regressions...
üéØ Model: Return_{i,t} = Œ± + Œ≤_MOM√óMOM_{i,t-1} + Œ≤_VOL√óVOL_{i,t-1} + Œ≤_RVOL√óRVOL_{i,t-1} + Œµ_{i,t}
üìÖ Using 1-week lag to avoid look-ahead bias
üìä Running regressions for 203 weeks...
‚ö†Ô∏è Week 2: Only 0 stocks available (minimum 8 required)
‚ö†Ô∏è Week 3: Only 0 stocks available (minimum 8 required)
‚ö†Ô∏è Week 4: Only 0 stocks available (minimum 8 required)
‚ö†Ô∏è Week 5: Only 0 stocks available (minimum 8 required)
‚ö†Ô∏è Week 6: Only 0 stocks available (minimum 8 required)
‚ö†Ô∏è Week 7: Only 0 stocks available (minimum 8 required)
‚ö†Ô∏è Week 8: Only 0 stocks available (minimum 8 required)
‚ö†Ô∏è Week 9: Only 0 stocks available (minimum 8 required)
‚ö†Ô∏è Week 10: Only 0 stocks available (minimum 8 required)
‚ö†Ô∏è Week 11: Only 0 stocks available (minimum 8 required)
‚ö†Ô∏è Week 12: Only 0 stocks available (minimum 8 required)
‚ö†Ô∏è Week 13: Only 0 stocks available (minimum 8 required)
‚ö†Ô∏è Wee

KeyError: 'N_Stocks'

In [76]:
# Step 5: Comprehensive Results Analysis & Summary
print("\nüìà MULTIPLE LINEAR REGRESSION RESULTS SUMMARY")
print("=" * 60)

if len(mlr_results_df) > 0:
    # Model Performance Metrics
    print(f"üî∏ OVERALL MODEL PERFORMANCE")
    print(f"   Total weeks analyzed: {len(mlr_results_df)}")
    print(f"   Average R-squared: {mlr_results_df['R_squared'].mean():.4f}")
    print(f"   Average Adjusted R-squared: {mlr_results_df['Adj_R_squared'].mean():.4f}")
    print(f"   Model significance (F-test p<0.05): {(mlr_results_df['F_pvalue'] < 0.05).sum()}/{len(mlr_results_df)} weeks ({(mlr_results_df['F_pvalue'] < 0.05).mean()*100:.1f}%)")
    
    # Factor Significance Analysis
    print(f"\nüî∏ FACTOR SIGNIFICANCE ANALYSIS (p < 0.05)")
    
    # Calculate significance statistics
    sig_alpha = (mlr_results_df['Alpha_pvalue'] < 0.05).sum()
    sig_mom = (mlr_results_df['Beta_MOM_pvalue'] < 0.05).sum() 
    sig_vol = (mlr_results_df['Beta_VOL_pvalue'] < 0.05).sum()
    sig_rvol = (mlr_results_df['Beta_RVOL_pvalue'] < 0.05).sum()
    total_weeks = len(mlr_results_df)
    
    print(f"   üìä Alpha (Intercept):")
    print(f"      ‚Ä¢ Average coefficient: {mlr_results_df['Alpha'].mean():.4f}")
    print(f"      ‚Ä¢ Significant weeks: {sig_alpha}/{total_weeks} ({sig_alpha/total_weeks*100:.1f}%)")
    print(f"      ‚Ä¢ Average t-statistic: {mlr_results_df['Alpha_tstat'].mean():.2f}")
    
    print(f"   üìà Momentum Factor (Œ≤_MOM - 36w ROC):")
    print(f"      ‚Ä¢ Average coefficient: {mlr_results_df['Beta_MOM'].mean():.4f}")
    print(f"      ‚Ä¢ Significant weeks: {sig_mom}/{total_weeks} ({sig_mom/total_weeks*100:.1f}%)")
    print(f"      ‚Ä¢ Average t-statistic: {mlr_results_df['Beta_MOM_tstat'].mean():.2f}")
    
    print(f"   üìä Volatility Factor (Œ≤_VOL - 36w BBW):")
    print(f"      ‚Ä¢ Average coefficient: {mlr_results_df['Beta_VOL'].mean():.4f}")
    print(f"      ‚Ä¢ Significant weeks: {sig_vol}/{total_weeks} ({sig_vol/total_weeks*100:.1f}%)")
    print(f"      ‚Ä¢ Average t-statistic: {mlr_results_df['Beta_VOL_tstat'].mean():.2f}")
    
    print(f"   üìà Volume Factor (Œ≤_RVOL - 50w Relative Volume):")
    print(f"      ‚Ä¢ Average coefficient: {mlr_results_df['Beta_RVOL'].mean():.4f}")
    print(f"      ‚Ä¢ Significant weeks: {sig_rvol}/{total_weeks} ({sig_rvol/total_weeks*100:.1f}%)")
    print(f"      ‚Ä¢ Average t-statistic: {mlr_results_df['Beta_RVOL_tstat'].mean():.2f}")
    
    # Economic Interpretation
    print(f"\nüî∏ ECONOMIC INTERPRETATION")
    avg_mom = mlr_results_df['Beta_MOM'].mean()
    avg_vol = mlr_results_df['Beta_VOL'].mean() 
    avg_rvol = mlr_results_df['Beta_RVOL'].mean()
    
    if avg_mom > 0:
        print(f"   üìà Momentum: Positive average coefficient ({avg_mom:.4f}) suggests momentum effect")
    else:
        print(f"   üìâ Momentum: Negative average coefficient ({avg_mom:.4f}) suggests mean reversion")
        
    if avg_vol > 0:
        print(f"   üìä Volatility: Positive average coefficient ({avg_vol:.4f}) suggests higher volatility predicts higher returns")
    else:
        print(f"   üìä Volatility: Negative average coefficient ({avg_vol:.4f}) suggests higher volatility predicts lower returns")
        
    if avg_rvol > 0:
        print(f"   üìà Volume: Positive average coefficient ({avg_rvol:.4f}) suggests higher volume predicts higher returns")
    else:
        print(f"   üìâ Volume: Negative average coefficient ({avg_rvol:.4f}) suggests higher volume predicts lower returns")
    
    # Display sample results
    print(f"\nüìã SAMPLE REGRESSION RESULTS (First 10 weeks)")
    print("=" * 80)
    display_cols = ['Week_Number', 'Date', 'N_Stocks', 'Alpha', 'Beta_MOM', 'Beta_VOL', 'Beta_RVOL', 'R_squared', 'F_pvalue']
    display(mlr_results_df[display_cols].head(10).round(4))
    
    print(f"\nüìã STATISTICAL SUMMARY OF COEFFICIENTS")
    print("=" * 50)
    coeff_summary = mlr_results_df[['Alpha', 'Beta_MOM', 'Beta_VOL', 'Beta_RVOL']].describe()
    display(coeff_summary.round(4))
    
else:
    print("‚ùå No regression results available. Check data and parameters.")

print(f"\nüéØ MODEL SPECIFICATION SUMMARY")
print("=" * 35)
print("Return_{i,t} = Œ± + Œ≤_MOM √ó MOM_{i,t-1} + Œ≤_VOL √ó VOL_{i,t-1} + Œ≤_RVOL √ó RVOL_{i,t-1} + Œµ_{i,t}")
print("\nWhere:")
print("‚Ä¢ Return_{i,t}: Weekly log return for stock i at time t")
print("‚Ä¢ MOM_{i,t-1}: 36-week Rate of Change (momentum), lagged 1 week")
print("‚Ä¢ VOL_{i,t-1}: 36-week Bollinger Band Width (volatility), lagged 1 week")
print("‚Ä¢ RVOL_{i,t-1}: 50-week Relative Volume (volume), lagged 1 week")
print("‚Ä¢ Œ±: Intercept (average unexplained return)")
print("‚Ä¢ Œ≤: Factor loadings (sensitivity to each factor)")
print("‚Ä¢ Œµ_{i,t}: Residual error term")

print(f"\n‚úÖ Analysis Complete! Results stored for further investigation.")


üìà MULTIPLE LINEAR REGRESSION RESULTS SUMMARY
‚ùå No regression results available. Check data and parameters.

üéØ MODEL SPECIFICATION SUMMARY
Return_{i,t} = Œ± + Œ≤_MOM √ó MOM_{i,t-1} + Œ≤_VOL √ó VOL_{i,t-1} + Œ≤_RVOL √ó RVOL_{i,t-1} + Œµ_{i,t}

Where:
‚Ä¢ Return_{i,t}: Weekly log return for stock i at time t
‚Ä¢ MOM_{i,t-1}: 36-week Rate of Change (momentum), lagged 1 week
‚Ä¢ VOL_{i,t-1}: 36-week Bollinger Band Width (volatility), lagged 1 week
‚Ä¢ RVOL_{i,t-1}: 50-week Relative Volume (volume), lagged 1 week
‚Ä¢ Œ±: Intercept (average unexplained return)
‚Ä¢ Œ≤: Factor loadings (sensitivity to each factor)
‚Ä¢ Œµ_{i,t}: Residual error term

‚úÖ Analysis Complete! Results stored for further investigation.
