# Momentum, Volatility, and Volume Factors in U.S. Stock Returns

**ISYE 4031 Final Project**  
*Regression & Forecasting, Georgia Tech*

## Project Overview

This notebook analyzes the relationship between **momentum**, **volatility**, and **volume** factors in U.S. stock returns using S&P 500 data.

### Research Questions:
1. Do momentum indicators significantly predict future stock returns?
2. How does volatility clustering affect return predictability? 
3. Is trading volume a reliable indicator of price direction?

---

In [17]:
import yfinance as yf
import pandas as pd
from pandas_datareader import data as pdr
import datetime as dt
import numpy as np
from bs4 import BeautifulSoup
import requests, re
import ta

---
### S&P 500 Stock List
We start by scraping the current S&P 500 stock list from a reliable financial data source.

**Data Source**: [Stock Analysis - S&P 500](https://stockanalysis.com/list/sp-500-stocks/)

**Key Information Collected**:
- Stock symbols (tickers)
- Market capitalization



In [18]:
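url = 'https://stockanalysis.com/list/sp-500-stocks/'
resp = requests.get(url, timeout=30)
resp.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
soup = BeautifulSoup(resp.text, 'html.parser')

# Find the table and extract headers.
# Match on the 'symbol-table' class only; the 'svelte-...' suffix is generated
# by the site's build and is likely to change between deployments.
table = soup.select_one('table.symbol-table')
headers = [th.get_text(strip=True) for th in table.find('tr').find_all('th')]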

# Extract all row data
stocks_data = []
for row in table.find_all('tr')[1:]:  # Skip header row
    row_data = [cell.get_text(strip=True) for cell in row.find_all('td')]
    stocks_data.append(row_data)

# Create DataFrame and set No. column as index
sp500_df = pd.DataFrame(stocks_data, columns=headers)
sp500_df = sp500_df.set_index('No.')

print("\nFirst 10 rows:")
print(sp500_df.head(10)[['Symbol', 'Market Cap']])


First 10 rows:
    Symbol Market Cap
No.                  
1     NVDA      4.68T
2     AAPL      4.07T
3     MSFT      3.79T
4     GOOG      3.52T
5    GOOGL      3.46T
6     AMZN      2.63T
7     AVGO      1.68T
8     META      1.55T
9     TSLA      1.44T
10   BRK.B      1.07T
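
The `Market Cap` column above is scraped as text (for example `4.68T`), so it cannot be sorted or compared numerically as is. The following is a minimal sketch of one way to convert such strings to floats. The helper name `parse_market_cap` and the suffix table are assumptions introduced for illustration; they are not part of the original notebook.

```python
# Hypothetical helper (not in the original notebook): convert abbreviated
# market-cap strings such as '4.68T' into plain floats.
SUFFIX_MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_market_cap(text):
    """Return the market cap as a float, or None if the text cannot be parsed."""
    cleaned = text.strip().replace(",", "").replace("$", "").upper()
    if not cleaned:
        return None
    has_suffix = cleaned[-1] in SUFFIX_MULTIPLIERS
    multiplier = SUFFIX_MULTIPLIERS[cleaned[-1]] if has_suffix else 1.0
    number_part = cleaned[:-1] if has_suffix else cleaned
    try:
        return float(number_part) * multiplier
    except ValueError:
        return None


# Example strings in the same format as the scraped column.
caps = {"NVDA": "4.68T", "AAPL": "4.07T", "BRK.B": "1.07T"}
numeric_caps = {symbol: parse_market_cap(value) for symbol, value in caps.items()}
```

Applied to the scraped table, the same function could be mapped over `sp500_df['Market Cap']` to obtain a numeric column.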


---

### Stock Selection & Date Range Setup

**Stock Selection Process:**
- Extract the first 3 companies (the largest by market capitalization) from the S&P 500 list for analysis
- Drop tickers that contain a dot (for example `BRK.B`), since they often cause yfinance issues
- Focus on established companies for reliable historical data

**Analysis Time Period:**
- **Start Date**: January 4, 2021 (the first trading day of 2021)
- **End Date**: December 27, 2024
- **Duration**: Approximately 4 years of market data
- **Purpose**: Capture post-pandemic market trends and recovery patterns

> **Note**: The analysis uses a small subset of the top 50 stocks (three in this run) for computational efficiency and financial significance.



In [19]:
stocks = sp500_df.head(3)['Symbol'].tolist()

# Filter out all tickers that contain dots (they often cause yfinance issues)
stocks = [ticker for ticker in stocks if '.' not in ticker]

stocks.sort()
startDate = dt.date(2021, 1, 4)
endDate = dt.date(2024, 12, 27)

print(f"Selected stocks: {stocks}")
print(f"Total stocks: {len(stocks)}")

Selected stocks: ['AAPL', 'MSFT', 'NVDA']
Total stocks: 3
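
Filtering out dotted tickers silently removes companies such as `BRK.B` from the sample. Yahoo Finance generally writes share classes with a hyphen (`BRK-B`), so one alternative is to translate the symbol instead of dropping it. The sketch below assumes that convention holds for every dotted ticker in the list; the helper name `to_yahoo_symbol` is hypothetical.

```python
def to_yahoo_symbol(symbol):
    """Map a dotted share-class ticker (e.g. 'BRK.B') to the hyphenated form.

    Assumption: the data provider uses '-' where the source list uses '.'.
    """
    return symbol.strip().upper().replace(".", "-")


listed = ["NVDA", "AAPL", "BRK.B"]  # illustrative input, same format as the scraped symbols
yahoo_symbols = sorted(to_yahoo_symbol(s) for s in listed)
print(yahoo_symbols)
```

Any symbol translated this way should still be checked after download, since a ticker that returns no data would otherwise go unnoticed.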


---

### Weekly Returns and Factor Calculation

**Objective**: Calculate weekly log returns and prepare data structure for technical indicator analysis.

**Key Metrics**:
- **Weekly Log Returns**: `ln(Close/Open) × 100`, for price movement analysis
- **ROC**: Rate of Change indicator, for momentum analysis
- **RVOL**: Relative Volume indicator, for volume analysis
- **BBW**: Bollinger Band Width indicator, for volatility analysis
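
The three indicators listed above can be sketched on a synthetic price and volume series as follows. The window lengths (36 days for ROC and BBW, 50 days for RVOL) follow the notebook's code; the random data and all variable names are illustrative assumptions. The final `shift(1)` is a compact way to express the one-week lag and appears to be equivalent to the explicit previous-week lookup used in the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic daily closes and volumes on business days (illustrative data only).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-09-01", periods=120)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(dates)))), index=dates)
volume = pd.Series(rng.integers(1_000_000, 5_000_000, len(dates)).astype(float), index=dates)

# ROC: percentage change versus the close 36 trading days earlier.
roc = close.pct_change(periods=36) * 100

# RVOL: the day's volume relative to its 50-day simple moving average.
rvol = volume / volume.rolling(window=50).mean()

# BBW: width of the 2-standard-deviation Bollinger Bands as a percentage of the 36-day SMA.
sma = close.rolling(window=36).mean()
std = close.rolling(window=36).std()
bbw = ((sma + 2 * std) - (sma - 2 * std)) / sma * 100

# Keep the last daily value of each week ending Friday, then shift by one
# week so that the row for week t holds the indicator value from week t-1.
factors = pd.DataFrame({"ROC": roc, "RVOL": rvol, "BBW": bbw})
weekly_lagged = factors.resample("W-FRI").last().shift(1)
```

Lagging the indicators keeps the predictors strictly in the past relative to the weekly return they are paired with, which avoids look-ahead bias.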

**Data Structure**:
- Multi-level columns for organized factor storage
- Separate columns for each technical indicator per stock
- Week numbering for time series tracking
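
A minimal sketch of the multi-level column layout described above, using two tickers and three weeks. The values are placeholders, not results, and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

tickers = ["AAPL", "MSFT"]  # illustrative subset
metrics = ["Log_Return_%", "ROC", "RVOL", "BBW"]
weeks = pd.date_range("2021-01-08", periods=3, freq="W-FRI")

# One (Ticker, Metric) column pair per factor and stock.
columns = pd.MultiIndex.from_product([tickers, metrics], names=["Ticker", "Metric"])
frame = pd.DataFrame(np.nan, index=weeks, columns=columns)

# Fill one column with placeholder values.
frame[("AAPL", "ROC")] = [1.0, 2.0, 3.0]

# Select every metric for one ticker, or one metric across all tickers.
aapl = frame["AAPL"]
roc_all = frame.xs("ROC", level="Metric", axis=1)
```

Building the columns with `MultiIndex.from_product` gives the same layout as extending a list of tuples in a loop, and `xs` makes cross-stock comparisons of a single factor a one-line selection.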

In [20]:
# Download the data with extended lookback for indicator calculations
try:
    # Extend start date by 4 months for proper technical indicator calculations (need 50+ trading days)
    extended_start = startDate - pd.DateOffset(months=4)
    
    # Download daily data with extended period for more precise indicator calculations
    daily_download = yf.download(
        tickers = stocks,
        start = extended_start,
        end = endDate,  # Note: yfinance treats 'end' as exclusive, so data for endDate itself is not downloaded
        actions = False, threads = True, auto_adjust = True, rounding = True,
        group_by = 'tickers', 
        interval = '1d'  # Daily data for daily-based indicators
    )
    
    # Extract OHLCV data
    daily_open = daily_download.xs('Open', level=1, axis=1)
    daily_close = daily_download.xs('Close', level=1, axis=1)
    daily_high = daily_download.xs('High', level=1, axis=1)
    daily_low = daily_download.xs('Low', level=1, axis=1)
    daily_volume = daily_download.xs('Volume', level=1, axis=1)
    
    # Convert daily to weekly data (Friday close) for analysis
    weekly_open = daily_open.resample('W-FRI').first()
    weekly_close = daily_close.resample('W-FRI').last()
    weekly_high = daily_high.resample('W-FRI').max()
    weekly_low = daily_low.resample('W-FRI').min()
    weekly_volume = daily_volume.resample('W-FRI').sum()
    
    # Filter to analysis period (Jan 4, 2021 onwards)
    analysis_start = pd.Timestamp('2021-01-04')
    analysis_mask = weekly_close.index >= analysis_start
    
    # Get analysis period data
    analysis_close = weekly_close[analysis_mask]
    analysis_open = weekly_open[analysis_mask]
    analysis_high = weekly_high[analysis_mask]
    analysis_low = weekly_low[analysis_mask]
    analysis_volume = weekly_volume[analysis_mask]
    
    # Calculate weekly log returns
    log_returns = (np.log(analysis_close / analysis_open) * 100)
    
    # Create MultiIndex DataFrame
    columns = []
    for ticker in stocks:
        columns.extend([(ticker, 'Log_Return_%'), (ticker, 'ROC'), (ticker, 'RVOL'), (ticker, 'BBW')])
    multi_columns = pd.MultiIndex.from_tuples(columns, names=['Ticker', 'Metric'])
    weekly_data = pd.DataFrame(index=analysis_close.index, columns=multi_columns)
    
    # Calculate indicators for each stock using daily data
    for ticker in stocks:
        if ticker in daily_close.columns:
            # Get full daily time series for calculations (including lookback period)
            ticker_daily_close = daily_close[ticker].dropna()
            ticker_daily_volume = daily_volume[ticker].dropna()
            
            if len(ticker_daily_close) > 60:  # Need sufficient daily data for indicators
                
                # 1. Log Returns (current week - already calculated from weekly data)
                weekly_data[(ticker, 'Log_Return_%')] = log_returns[ticker].round(2)
                
                # 2. Rate of Change (ROC) - 36-day, lagged by 1 week
                # Calculate daily ROC using the full extended dataset
                roc_36d = ticker_daily_close.pct_change(periods=36) * 100
                # Convert to weekly (take Friday values) using the FULL dataset including lookback
                roc_weekly_full = roc_36d.resample('W-FRI').last()
                
                # Now manually lag by getting previous week's values for each analysis week
                roc_lagged_values = []
                for current_week in analysis_close.index:
                    # Find the previous week in the full dataset
                    prev_week_candidates = roc_weekly_full.index[roc_weekly_full.index < current_week]
                    if len(prev_week_candidates) > 0:
                        prev_week = prev_week_candidates[-1]  # Most recent previous week
                        roc_lagged_values.append(roc_weekly_full.loc[prev_week])
                    else:
                        roc_lagged_values.append(np.nan)
                
                weekly_data[(ticker, 'ROC')] = pd.Series(roc_lagged_values, index=analysis_close.index).round(2)
                
                # 3. Relative Volume (RVOL) - 50-day SMA, lagged by 1 week
                # Calculate daily volume SMA using the full extended dataset
                volume_sma_50d = ticker_daily_volume.rolling(window=50).mean()
                rvol_daily = ticker_daily_volume / volume_sma_50d
                # Convert to weekly using the FULL dataset including lookback
                rvol_weekly_full = rvol_daily.resample('W-FRI').last()
                
                # Manually lag by getting previous week's values
                rvol_lagged_values = []
                for current_week in analysis_close.index:
                    prev_week_candidates = rvol_weekly_full.index[rvol_weekly_full.index < current_week]
                    if len(prev_week_candidates) > 0:
                        prev_week = prev_week_candidates[-1]
                        rvol_lagged_values.append(rvol_weekly_full.loc[prev_week])
                    else:
                        rvol_lagged_values.append(np.nan)
                
                weekly_data[(ticker, 'RVOL')] = pd.Series(rvol_lagged_values, index=analysis_close.index).round(2)
                
                # 4. Bollinger Band Width (BBW) - 36-day, 2 std dev, lagged by 1 week
                # Calculate daily Bollinger Bands using the full extended dataset
                sma_36d = ticker_daily_close.rolling(window=36).mean()
                std_36d = ticker_daily_close.rolling(window=36).std()
                upper_bb = sma_36d + (2 * std_36d)
                lower_bb = sma_36d - (2 * std_36d)
                bbw_daily = ((upper_bb - lower_bb) / sma_36d) * 100
                # Convert to weekly using the FULL dataset including lookback
                bbw_weekly_full = bbw_daily.resample('W-FRI').last()
                
                # Manually lag by getting previous week's values
                bbw_lagged_values = []
                for current_week in analysis_close.index:
                    prev_week_candidates = bbw_weekly_full.index[bbw_weekly_full.index < current_week]
                    if len(prev_week_candidates) > 0:
                        prev_week = prev_week_candidates[-1]
                        bbw_lagged_values.append(bbw_weekly_full.loc[prev_week])
                    else:
                        bbw_lagged_values.append(np.nan)
                
                weekly_data[(ticker, 'BBW')] = pd.Series(bbw_lagged_values, index=analysis_close.index).round(2)
            
            else:
                # Fill with NaN if insufficient data
                weekly_data[(ticker, 'Log_Return_%')] = np.nan
                weekly_data[(ticker, 'ROC')] = np.nan
                weekly_data[(ticker, 'RVOL')] = np.nan
                weekly_data[(ticker, 'BBW')] = np.nan
    
    # Add week numbers as a separate column
    weekly_data.insert(0, 'Week', range(1, len(weekly_data) + 1))
    
    print("📊 Technical Analysis with Optimized Lagged Indicators Complete!")
    print(f"Extended lookback period: {extended_start.date()} to {analysis_start.date()}")
    print(f"Total weeks: {len(weekly_data)}")
    print(f"Date range: {weekly_data.index[0].date()} to {weekly_data.index[-1].date()}")
    print(f"DataFrame shape: {weekly_data.shape}")
    print(f"Stocks analyzed: {len(stocks)}")
    print(f"\nCalculated Daily-Based Indicators (All Lagged by 1 Week):")
    print("• Log_Return_%: Weekly log returns (current week)")
    print("• ROC: 36-day Rate of Change (lagged 1 week)")
    print("• RVOL: Relative Volume vs 50-day SMA (lagged 1 week)")
    print("• BBW: 36-day Bollinger Band Width (lagged 1 week)")
    
    print(f"\nUpdated Regression Format:")
    print("Return_{i,t} = α + β_MOM×ROC_36d_{i,t-1} + β_BBW×BBW_36d_{i,t-1} + β_VOL×RVOL_50d_{i,t-1} + ε_{i,t}")
    
    display(weekly_data)
        
except Exception as e:
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()

[*********************100%***********************]  3 of 3 completed

📊 Technical Analysis with Optimized Lagged Indicators Complete!
Extended lookback period: 2020-09-04 to 2021-01-04
Total weeks: 208
Date range: 2021-01-08 to 2024-12-27
DataFrame shape: (208, 13)
Stocks analyzed: 3

Calculated Daily-Based Indicators (All Lagged by 1 Week):
• Log_Return_%: Weekly log returns (current week)
• ROC: 36-day Rate of Change (lagged 1 week)
• RVOL: Relative Volume vs 50-day SMA (lagged 1 week)
• BBW: 36-day Bollinger Band Width (lagged 1 week)

Updated Regression Format:
Return_{i,t} = α + β_MOM×ROC_36d_{i,t-1} + β_BBW×BBW_36d_{i,t-1} + β_VOL×RVOL_50d_{i,t-1} + ε_{i,t}





| Date | Week | AAPL Log_Return_% | AAPL ROC | AAPL RVOL | AAPL BBW | MSFT Log_Return_% | MSFT ROC | MSFT RVOL | MSFT BBW | NVDA Log_Return_% | NVDA ROC | NVDA RVOL | NVDA BBW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021-01-08 | 1 | -1.11 | 14.08 | 0.92 | 19.81 | -1.32 | 2.11 | 0.73 | 7.97 | 1.29 | -4.19 | 0.63 | 6.50 |
| 2021-01-15 | 2 | -1.60 | 9.76 | 0.95 | 19.68 | -2.70 | 1.36 | 0.80 | 7.94 | -4.28 | -1.71 | 0.91 | 6.62 |
| 2021-01-22 | 3 | 8.47 | 11.67 | 1.05 | 16.88 | 5.55 | 1.21 | 1.15 | 7.22 | 5.18 | -2.14 | 0.87 | 7.19 |
| 2021-01-29 | 4 | -8.08 | 16.82 | 1.10 | 14.35 | 1.23 | 5.55 | 1.10 | 8.02 | -5.92 | 2.32 | 0.81 | 8.13 |
| 2021-02-05 | 5 | 2.38 | 6.64 | 1.65 | 16.90 | 2.99 | 8.25 | 1.44 | 12.21 | 3.99 | -4.57 | 0.94 | 8.23 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-11-29 | 204 | 2.50 | 1.97 | 0.78 | 6.92 | 1.21 | 0.31 | 1.22 | 6.09 | -2.67 | 15.55 | 0.96 | 17.09 |
| 2024-12-06 | 205 | 2.32 | 3.50 | 0.59 | 7.42 | 5.09 | 1.64 | 0.79 | 6.03 | 2.57 | 4.22 | 0.58 | 13.62 |
| 2024-12-13 | 206 | 2.57 | 4.89 | 0.87 | 10.69 | 1.05 | 6.81 | 0.92 | 8.45 | -3.46 | 4.96 | 0.82 | 12.01 |
| 2024-12-20 | 207 | 2.59 | 7.64 | 0.79 | 14.46 | -2.42 | 5.55 | 0.97 | 11.54 | 0.39 | -3.80 | 1.03 | 12.73 |
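
The regression format printed above uses a single set of coefficients for every stock, which suggests a pooled specification. As a rough sketch of how such a model might be estimated, the block below fits ordinary least squares with NumPy on synthetic data. Everything here is an assumption for illustration: the data are randomly generated, the "true" coefficients are made up, and the pooled OLS approach is only one possible estimator, not necessarily the one used later in the project.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs = 600  # stands in for stacked (stock, week) observations; synthetic here

# Synthetic lagged factors standing in for ROC, BBW, and RVOL.
roc_lag = rng.normal(2.0, 8.0, n_obs)
bbw_lag = rng.normal(10.0, 3.0, n_obs)
rvol_lag = rng.normal(1.0, 0.3, n_obs)

# Made-up coefficients used only to generate the toy returns:
# alpha, beta_MOM, beta_BBW, beta_VOL.
true_coefs = np.array([0.10, 0.05, -0.02, 0.30])

# Design matrix with an intercept column, in the same order as true_coefs.
X = np.column_stack([np.ones(n_obs), roc_lag, bbw_lag, rvol_lag])
returns = X @ true_coefs + rng.normal(0.0, 0.01, n_obs)

# Least-squares estimate of (alpha, beta_MOM, beta_BBW, beta_VOL).
coef_hat, _, rank, _ = np.linalg.lstsq(X, returns, rcond=None)
```

With the real `weekly_data` table, the design matrix would be built by stacking each stock's lagged `ROC`, `BBW`, and `RVOL` columns, after dropping rows with missing values.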
