# Notebook for CME Futures Challenge

### The Rough Idea

Model indices as geometric brownian motion (dS/S = mudt + sigmadB)  
Model mu (market line) as a linear regression with numerous factors including economic, credit measures, etc  
Model sigma as a function of volatility including recent volatility and EMA (decay)  
Long/short based on futures mispricings based on our model  

# Downloading historical data for indices (S&P, NASDAQ, DJIA)

Imports

In [343]:
import yfinance as yf
import pandas as pd
import plotly.express as px
from typing import List, Dict

Make get_data function for downloading from yf

In [411]:
timeframe = '1000mo' # set timeframe

def get_data(tickers: List):
    data_dictionary = {}
    for ticker in tickers:
        data_dictionary[ticker] = yf.download(ticker, period=timeframe, interval='1d')
    return data_dictionary

Now let's get data for indices and display with pd

In [412]:
indices = ['^GSPC', '^IXIC', '^DJI'] # S&P, NASDAQ, DJIA
etfs = ['SPY', 'QQQ', 'DIA']
futures = ['ES=F', 'NQ=F', 'YM=F']

data_dictionary = get_data(indices + etfs + futures)

#s_p = pd.DataFrame(data_dictionary['^GSPC'])
#nasdaq = pd.DataFrame(data_dictionary['^IXIC'])
#djia = pd.DataFrame(data_dictionary['^DJI'])

s_p = pd.DataFrame(data_dictionary['SPY'])
nasdaq = pd.DataFrame(data_dictionary['QQQ'])
djia = pd.DataFrame(data_dictionary['DIA'])

s_p_F = pd.DataFrame(data_dictionary['ES=F'])
nasdaq_F = pd.DataFrame(data_dictionary['NQ=F'])
djia_F = pd.DataFrame(data_dictionary['YM=F'])


YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to Tru

In [413]:
s_p

Price,Close,High,Low,Open,Volume
Ticker,SPY,SPY,SPY,SPY,SPY
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
1993-01-29,24.313030,24.330323,24.209276,24.330323,1003200
1993-02-01,24.485964,24.485964,24.330333,24.330333,480500
1993-02-02,24.537844,24.555136,24.416797,24.468674,201300
1993-02-03,24.797222,24.814514,24.555129,24.572422,529400
1993-02-04,24.900986,24.952863,24.607016,24.883693,531500
...,...,...,...,...,...
2025-09-15,659.082703,659.212348,657.517097,657.816255,63772400
2025-09-16,658.175232,659.950340,657.387438,659.641138,61169000
2025-09-17,657.357544,659.890500,652.491031,658.185266,101952200
2025-09-18,660.429016,663.051750,658.444528,660.060044,90459200


We need to flatten this - notice ticker header

In [414]:
s_p = s_p.droplevel(1, axis=1)
nasdaq = nasdaq.droplevel(1, axis=1)
djia = djia.droplevel(1, axis=1)

In [415]:
s_p

Price,Close,High,Low,Open,Volume
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1993-01-29,24.313030,24.330323,24.209276,24.330323,1003200
1993-02-01,24.485964,24.485964,24.330333,24.330333,480500
1993-02-02,24.537844,24.555136,24.416797,24.468674,201300
1993-02-03,24.797222,24.814514,24.555129,24.572422,529400
1993-02-04,24.900986,24.952863,24.607016,24.883693,531500
...,...,...,...,...,...
2025-09-15,659.082703,659.212348,657.517097,657.816255,63772400
2025-09-16,658.175232,659.950340,657.387438,659.641138,61169000
2025-09-17,657.357544,659.890500,652.491031,658.185266,101952200
2025-09-18,660.429016,663.051750,658.444528,660.060044,90459200


Let's drop high, low, and open and rename columns

In [416]:
s_p.drop(columns=['High', 'Low', 'Open'], inplace=True)
nasdaq.drop(columns=['High', 'Low', 'Open'], inplace=True)
djia.drop(columns=['High', 'Low', 'Open'], inplace=True)

s_p = s_p.rename(columns={'Close': 'S&P_Close', 'Volume': 'S&P_Volume'})
nasdaq = nasdaq.rename(columns={'Close': 'NASDAQ_Close', 'Volume': 'NASDAQ_Volume'})
djia = djia.rename(columns={'Close': 'DJIA_Close', 'Volume': 'DJIA_Volume'})

Let's get a quick plot of an index

In [417]:
fig = px.line(s_p, x=s_p.index, y="S&P_Close", title="S&P Daily Past 30 Years")
fig.show()

# Downloading historical data for our factor model

We are going to model the index as a geometric brownian motion, with the mu factor being a linear regression model with numerous inputs.  

## Factor considerations:  
### <u>Term structure</u>
###### Term spread (10Y-3M)

### <u>Credit conditions</u>
###### IG spread (BAA-AAA)

### <u>Valuation</u>
###### Forward E/P - real 10Y
###### Dividend yield

### <u>Economic</u>
###### Fed funds
###### Inflation (CPI)
###### DXY change (dollar index)  

### Some of these we can get from yahoo finance:  

In [418]:
tickers = [
    # Term structure
    '^TNX', # 10yr CBOE
    '^IRX', # 3m bill (on discount basis, need to convert to yield)

    # Economic
    'DX-Y.NYB', # Dollar index
]

data_dictionary = get_data(tickers)

ten_yr = pd.DataFrame(data_dictionary['^TNX']['Close'])
three_m = pd.DataFrame(data_dictionary['^IRX']['Close'])
dollar_index = pd.DataFrame(data_dictionary['DX-Y.NYB']['Close'])


YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed


Rename columns

In [419]:
ten_yr = ten_yr.rename(columns={'^TNX': 'ten_yr'})
three_m = three_m.rename(columns={'^IRX': 'three_m'})
dollar_index = dollar_index.rename(columns={'DX-Y.NYB': 'dollar_index'})

We should get dividend yield too

In [420]:
etfs = ['SPY', 'QQQ', 'DIA']
div_data = {}

for etf in etfs:
    ticker = yf.Ticker(etf)
    div = ticker.dividends
    price = ticker.history(timeframe)['Close']

    # Calculate dividend yield
    div_12m = div.rolling(window='365D', min_periods=1).sum()
    div_12m = div_12m.reindex(price.index, method='ffill')
    div_yield = div_12m / price
    div_data[etf] = div_yield

Fix index for all 3 and rename columns

In [421]:
div_data['SPY'].index = pd.to_datetime(div_data['SPY'].index).normalize().tz_localize(None) # Normalize puts date in format we want
div_data['QQQ'].index = pd.to_datetime(div_data['QQQ'].index).normalize().tz_localize(None) # Localize (none) makes sure it doesn't add our timezone
div_data['DIA'].index = pd.to_datetime(div_data['DIA'].index).normalize().tz_localize(None)

div_data['SPY'].name = 'SPY_div'
div_data['QQQ'].name = 'QQQ_div'
div_data['DIA'].name = 'DIA_div'

In [422]:
div_data['SPY']

Date
1993-01-29         NaN
1993-02-01         NaN
1993-02-02         NaN
1993-02-03         NaN
1993-02-04         NaN
                ...   
2025-09-15    0.013546
2025-09-16    0.013565
2025-09-17    0.013582
2025-09-18    0.013518
2025-09-19    0.013560
Name: SPY_div, Length: 8217, dtype: float64

### pandas_datareader lets us download fred data

In [423]:
from pandas_datareader import data as pdr
from datetime import datetime

In [424]:
start = datetime(1990,1,1) # Start date for download

# Macroeconomic data
gdp = pdr.DataReader("GDP", "fred", start)
cpi = pdr.DataReader("CPIAUCSL", "fred", start)
fedfunds = pdr.DataReader("FEDFUNDS", "fred", start)

# For some reason this download doesn't have the most recent fed funds rate
fedfunds = pd.concat([fedfunds['FEDFUNDS'], pd.Series([4.08], index=[datetime(2025,9,17)])])

# Credit risk data
ig_spread = pdr.DataReader("BAMLC0A4CBBB", "fred", start)   # BofA BBB corp minus Treasuries
#hy_spread = pdr.DataReader("BAMLH0A0HYM2", "fred", start)   # BofA US High Yield spread
#baa_spread = pdr.DataReader("BAA10Y", "fred", start)        # Moody’s Baa – 10Y Treasury

Rename series

In [425]:
cpi.name = 'CPI'
fedfunds.name = 'fed_funds'
ig_spread.name = 'credit_spread'

In [426]:
fred_data = [gdp, cpi, fedfunds, ig_spread]

# Last business day <= today
last_bday = pd.bdate_range(end=pd.Timestamp.today().normalize().tz_localize(None), periods=1)[0]

for i, df in enumerate(fred_data):
    s = df.squeeze() # make it a Series
    # Build a business-day index from the series start to last_bday
    bidx = pd.bdate_range(start=s.index.min(), end=last_bday)
    # Reindex to business days and forward-fill
    s = s.reindex(bidx, method='ffill')
    # Write back as a 1-col DataFrame with a proper name
    name = s.name if s.name else f"series_{i}"
    fred_data[i] = s.to_frame(name)

In [427]:
fred_data[0]

Unnamed: 0,GDP
1990-01-01,5872.701
1990-01-02,5872.701
1990-01-03,5872.701
1990-01-04,5872.701
1990-01-05,5872.701
...,...
2025-09-16,30353.902
2025-09-17,30353.902
2025-09-18,30353.902
2025-09-19,30353.902


Let's build a master dataframe

In [428]:
data = s_p.join([nasdaq, djia, div_data['SPY'], div_data['QQQ'], div_data['DIA'], ten_yr, three_m, dollar_index, fred_data[0], fred_data[1], fred_data[2], fred_data[3]])
data

Unnamed: 0_level_0,S&P_Close,S&P_Volume,NASDAQ_Close,NASDAQ_Volume,DJIA_Close,DJIA_Volume,SPY_div,QQQ_div,DIA_div,ten_yr,three_m,dollar_index,GDP,CPIAUCSL,fed_funds,BAMLC0A4CBBB
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
1993-01-29,24.313030,1003200.0,,,,,,,,6.390,2.900,92.459999,6729.459,142.800,3.02,
1993-02-01,24.485964,480500.0,,,,,,,,6.380,2.900,93.559998,6729.459,143.100,3.03,
1993-02-02,24.537844,201300.0,,,,,,,,6.460,2.960,93.919998,6729.459,143.100,3.03,
1993-02-03,24.797222,529400.0,,,,,,,,6.450,2.930,94.239998,6729.459,143.100,3.03,
1993-02-04,24.900986,531500.0,,,,,,,,6.390,2.900,94.529999,6729.459,143.100,3.03,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-09-15,659.082703,63772400.0,591.679993,44360300.0,458.759308,4862200.0,0.013546,0.006052,0.015259,4.034,3.900,97.300003,30353.902,323.364,4.33,0.95
2025-09-16,658.175232,61169000.0,591.179993,36942100.0,457.472961,4526100.0,0.013565,0.006057,0.015301,4.026,3.890,96.629997,30353.902,323.364,4.33,0.96
2025-09-17,657.357544,101952200.0,590.000000,69384800.0,459.945923,5998000.0,0.013582,0.006069,0.015219,4.076,3.868,96.870003,30353.902,323.364,4.08,0.96
2025-09-18,660.429016,90459200.0,595.320007,61069300.0,461.312012,6681000.0,0.013518,0.006015,0.015174,4.104,3.880,97.349998,30353.902,323.364,4.08,0.94


# Linear regression model

### Feature Engineering

We need to be careful to not include things such as raw moving averages that will leak volatility information into our drift prediction  

In [429]:
import numpy as np

Function definitions to help out

In [430]:
def rolling_mean(data, window):
    return data.rolling(window, min_periods=window).mean()

Features

#  (TODO: look at making features like diffs for economic metrics, figure out when economic metrics are released vs reported in data)

In [431]:
# First, make log prices / volumes of our data, then log normal assumptions are better and everything is additive
data['S&P_log_price'] = np.log(data['S&P_Close'])
data['NASDAQ_log_price'] = np.log(data['NASDAQ_Close'])
data['DJIA_log_price'] = np.log(data['DJIA_Close'])

data['S&P_log_volume'] = np.log(data['S&P_Volume'])
data['NASDAQ_log_volume'] = np.log(data['NASDAQ_Volume'])
data['DJIA_log_volume'] = np.log(data['DJIA_Volume'])

Setting our target returns metric

In [432]:
# Log returns (21 = 1 month)
days = 1
data['S&P_ret'] = data['S&P_log_price'].diff(days)
data['NASDAQ_ret'] = data['NASDAQ_log_price'].diff(days)
data['DJIA_ret'] = data['DJIA_log_price'].diff(days)

# Next month log returns -- This will be our target variable
data[['S&P_next_ret','NASDAQ_next_ret','DJIA_next_ret']] = data[['S&P_ret','NASDAQ_ret','DJIA_ret']].shift(-days).dropna()

ETF Features

In [433]:
# ===== S&P =====
# Price-based
data['S&P_mom_1w'] = data['S&P_log_price'].diff(5) # Total price change / momentum indicator
data['S&P_mom_3m'] = data['S&P_log_price'].diff(63)
data['S&P_3m_rolling_price'] = rolling_mean(data['S&P_log_price'], 63)
data['S&P_trend_speed_price'] = data['S&P_3m_rolling_price'].diff(5)  # How fast the 3m trend is changing on a weekly basis
data['S&P_trend_dist_price'] = data['S&P_log_price'] - data['S&P_3m_rolling_price']


# Volume-based (essentially the same as price for now)
data['S&P_vlm_1w'] = data['S&P_log_volume'].diff(5) # Total volume change / momentum indicator
data['S&P_vlm_1m'] = data['S&P_log_volume'].diff(21)
data['S&P_vlm_3m'] = data['S&P_log_volume'].diff(63)
data['S&P_3m_rolling_volume'] = rolling_mean(data['S&P_log_volume'], 63)
data['S&P_trend_speed_volume'] = data['S&P_3m_rolling_volume'].diff(5)  # How fast the 3m trend is changing on a weekly basis
data['S&P_trend_dist_volume'] = data['S&P_log_volume'] - data['S&P_3m_rolling_volume']

# ===== NASDAQ =====
# Price-based
data['NASDAQ_mom_1w'] = data['NASDAQ_log_price'].diff(5)  # Total price change / momentum indicator
data['NASDAQ_mom_3m'] = data['NASDAQ_log_price'].diff(63)
data['NASDAQ_3m_rolling_price'] = rolling_mean(data['NASDAQ_log_price'], 63)
data['NASDAQ_trend_speed_price'] = data['NASDAQ_3m_rolling_price'].diff(5)  # How fast the 3m trend is changing on a weekly basis
data['NASDAQ_trend_dist_price'] = data['NASDAQ_log_price'] - data['NASDAQ_3m_rolling_price']

# Volume-based (essentially the same as price for now)
data['NASDAQ_vlm_1w'] = data['NASDAQ_log_volume'].diff(5)  # Total volume change / momentum indicator
data['NASDAQ_vlm_1m'] = data['NASDAQ_log_volume'].diff(21)
data['NASDAQ_vlm_3m'] = data['NASDAQ_log_volume'].diff(63)
data['NASDAQ_3m_rolling_volume'] = rolling_mean(data['NASDAQ_log_volume'], 63)
data['NASDAQ_trend_speed_volume'] = data['NASDAQ_3m_rolling_volume'].diff(5)  # How fast the 3m trend is changing on a weekly basis
data['NASDAQ_trend_dist_volume'] = data['NASDAQ_log_volume'] - data['NASDAQ_3m_rolling_volume']

# ===== DJIA =====
# Price-based
data['DJIA_mom_1w'] = data['DJIA_log_price'].diff(5)  # Total price change / momentum indicator
data['DJIA_mom_3m'] = data['DJIA_log_price'].diff(63)
data['DJIA_3m_rolling_price'] = rolling_mean(data['DJIA_log_price'], 63)
data['DJIA_trend_speed_price'] = data['DJIA_3m_rolling_price'].diff(5)  # How fast the 3m trend is changing on a weekly basis
data['DJIA_trend_dist_price'] = data['DJIA_log_price'] - data['DJIA_3m_rolling_price']

# Volume-based (essentially the same as price for now)
data['DJIA_vlm_1w'] = data['DJIA_log_volume'].diff(5)  # Total volume change / momentum indicator
data['DJIA_vlm_1m'] = data['DJIA_log_volume'].diff(21)
data['DJIA_vlm_3m'] = data['DJIA_log_volume'].diff(63)
data['DJIA_3m_rolling_volume'] = rolling_mean(data['DJIA_log_volume'], 63)
data['DJIA_trend_speed_volume'] = data['DJIA_3m_rolling_volume'].diff(5)  # How fast the 3m trend is changing on a weekly basis
data['DJIA_trend_dist_volume'] = data['DJIA_log_volume'] - data['DJIA_3m_rolling_volume']

Macro features

Other features

In [434]:
month_dummies = pd.get_dummies(data.index.month, prefix="month")
month_dummies.set_index(data.index, inplace=True)
data = data.join(month_dummies)

In [435]:
data

Unnamed: 0_level_0,S&P_Close,S&P_Volume,NASDAQ_Close,NASDAQ_Volume,DJIA_Close,DJIA_Volume,SPY_div,QQQ_div,DIA_div,ten_yr,...,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1993-01-29,24.313030,1003200.0,,,,,,,,6.390,...,False,False,False,False,False,False,False,False,False,False
1993-02-01,24.485964,480500.0,,,,,,,,6.380,...,False,False,False,False,False,False,False,False,False,False
1993-02-02,24.537844,201300.0,,,,,,,,6.460,...,False,False,False,False,False,False,False,False,False,False
1993-02-03,24.797222,529400.0,,,,,,,,6.450,...,False,False,False,False,False,False,False,False,False,False
1993-02-04,24.900986,531500.0,,,,,,,,6.390,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-09-15,659.082703,63772400.0,591.679993,44360300.0,458.759308,4862200.0,0.013546,0.006052,0.015259,4.034,...,False,False,False,False,False,False,True,False,False,False
2025-09-16,658.175232,61169000.0,591.179993,36942100.0,457.472961,4526100.0,0.013565,0.006057,0.015301,4.026,...,False,False,False,False,False,False,True,False,False,False
2025-09-17,657.357544,101952200.0,590.000000,69384800.0,459.945923,5998000.0,0.013582,0.006069,0.015219,4.076,...,False,False,False,False,False,False,True,False,False,False
2025-09-18,660.429016,90459200.0,595.320007,61069300.0,461.312012,6681000.0,0.013518,0.006015,0.015174,4.104,...,False,False,False,False,False,False,True,False,False,False


### Preprocessing Data

Let's check for NaNs

In [436]:
data.isna().sum()

S&P_Close           0
S&P_Volume          0
NASDAQ_Close     1542
NASDAQ_Volume    1542
DJIA_Close       1256
                 ... 
month_8             0
month_9             0
month_10            0
month_11            0
month_12            0
Length: 73, dtype: int64

Impute some NaNs with average

In [437]:
data['ten_yr'] = data['ten_yr'].fillna(data['ten_yr'].mean())
data['three_m'] = data['three_m'].fillna(data['three_m'].mean())
data['dollar_index'] = data['dollar_index'].fillna(data['dollar_index'].mean())
data['BAMLC0A4CBBB'] = data['BAMLC0A4CBBB'].fillna(data['BAMLC0A4CBBB'].mean())
#data['S&P_ret'] = data['S&P_ret'].fillna(data['S&P_ret'].mean())
#data['NASDAQ_ret'] = data['NASDAQ_ret'].fillna(data['NASDAQ_ret'].mean())
#data['DJIA_ret'] = data['DJIA_ret'].fillna(data['DJIA_ret'].mean())

Drop others

In [438]:
data = data.dropna()
data

Unnamed: 0_level_0,S&P_Close,S&P_Volume,NASDAQ_Close,NASDAQ_Volume,DJIA_Close,DJIA_Volume,SPY_div,QQQ_div,DIA_div,ten_yr,...,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2003-12-24,73.179520,8055800.0,30.408175,44840300.0,63.607712,3742600.0,0.028232,0.000460,0.033565,4.187,...,False,False,False,False,False,False,False,False,False,True
2003-12-26,73.232903,8308400.0,30.365822,25497600.0,63.650860,2088500.0,0.028211,0.000461,0.033542,4.148,...,False,False,False,False,False,False,False,False,False,True
2003-12-29,74.207581,22483700.0,30.865559,63296200.0,64.519920,6589900.0,0.027841,0.000454,0.033091,4.230,...,False,False,False,False,False,False,False,False,False,True
2003-12-30,74.220917,19559500.0,30.967203,48249400.0,64.304222,5478200.0,0.027836,0.000452,0.033202,4.279,...,False,False,False,False,False,False,False,False,False,True
2003-12-31,74.287666,31501800.0,30.882517,60494600.0,64.452156,5658100.0,0.027811,0.000453,0.033125,4.257,...,False,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-09-12,655.592407,72780100.0,586.659973,50745900.0,458.011444,3784600.0,0.013618,0.006104,0.015283,4.061,...,False,False,False,False,False,False,True,False,False,False
2025-09-15,659.082703,63772400.0,591.679993,44360300.0,458.759308,4862200.0,0.013546,0.006052,0.015259,4.034,...,False,False,False,False,False,False,True,False,False,False
2025-09-16,658.175232,61169000.0,591.179993,36942100.0,457.472961,4526100.0,0.013565,0.006057,0.015301,4.026,...,False,False,False,False,False,False,True,False,False,False
2025-09-17,657.357544,101952200.0,590.000000,69384800.0,459.945923,5998000.0,0.013582,0.006069,0.015219,4.076,...,False,False,False,False,False,False,True,False,False,False


### Split data

Training/testing 80/20 split

In [439]:
import math

In [440]:
cutoff = math.floor(len(data)*.8)
training_data = data.iloc[:cutoff]
testing_data = data.iloc[cutoff:]

In [441]:
display(training_data.tail(5))
display(testing_data.head(5))

Unnamed: 0_level_0,S&P_Close,S&P_Volume,NASDAQ_Close,NASDAQ_Volume,DJIA_Close,DJIA_Volume,SPY_div,QQQ_div,DIA_div,ten_yr,...,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2021-05-05,390.514374,60162200.0,320.486237,46219300.0,315.984833,3506100.0,0.017846,0.005439,0.018026,1.584,...,False,False,True,False,False,False,False,False,False,False
2021-05-06,393.632874,74321400.0,322.901825,46814300.0,318.966553,4276700.0,0.017704,0.005398,0.017858,1.561,...,False,False,True,False,False,False,False,False,False,False
2021-05-07,396.497803,67733800.0,325.521973,53324500.0,321.154572,3526500.0,0.017576,0.005354,0.017736,1.577,...,False,False,True,False,False,False,False,False,False,False
2021-05-10,392.571533,81852400.0,317.301117,60700500.0,321.025208,6092500.0,0.017752,0.005493,0.017743,1.602,...,False,False,True,False,False,False,False,False,False,False
2021-05-11,389.067871,116888000.0,316.862823,71963600.0,316.566315,8396400.0,0.017912,0.005501,0.017993,1.624,...,False,False,True,False,False,False,False,False,False,False


Unnamed: 0_level_0,S&P_Close,S&P_Volume,NASDAQ_Close,NASDAQ_Volume,DJIA_Close,DJIA_Volume,SPY_div,QQQ_div,DIA_div,ten_yr,...,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2021-05-12,380.802002,134811000.0,308.661499,91164900.0,310.187195,7034100.0,0.018301,0.005647,0.018363,1.695,...,False,False,True,False,False,False,False,False,False,False
2021-05-13,385.376434,106394000.0,311.047852,69877800.0,314.295319,5741300.0,0.018084,0.005604,0.018123,1.668,...,False,False,True,False,False,False,False,False,False,False
2021-05-14,391.294067,82201600.0,317.914795,44370000.0,317.812592,4677300.0,0.01781,0.005483,0.017923,1.635,...,False,False,True,False,False,False,False,False,False,False
2021-05-17,390.298309,65129200.0,315.986237,39395000.0,317.249481,3262200.0,0.017856,0.005516,0.017954,1.64,...,False,False,True,False,False,False,False,False,False,False
2021-05-18,386.935669,59810200.0,313.853149,36528500.0,314.904633,3462300.0,0.018011,0.005554,0.018088,1.642,...,False,False,True,False,False,False,False,False,False,False


### Normalize inputs

In [442]:
from sklearn.preprocessing import StandardScaler

In [443]:
# Make sure we only fit on training_data and explanatory variables
targets = ['S&P_next_ret', 'NASDAQ_next_ret', 'DJIA_next_ret']
dummies = [f'month_{month}' for month in range(1,13)]
columns_to_ignore = [] #['NASDAQ_next_ret', 'DJIA_next_ret']
columns_to_ignore.extend(dummies)
features = [column for column in training_data.columns if column not in targets and column not in columns_to_ignore]

scaler = StandardScaler()
scaler.fit(training_data[features]) # Fitting on training data

train_scaled = training_data.copy()
test_scaled = testing_data.copy()

train_scaled[features] = scaler.transform(training_data[features])
test_scaled[features] = scaler.transform(testing_data[features])

# Save info on standardization for later
variables_mu = pd.Series(scaler.mean_, index=features)
variables_sd = pd.Series(scaler.scale_, index=features)

### Linear Regression

In [444]:
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import r2_score, root_mean_squared_error

We are going to test with and without ridge (which will help reduce the impact of collinearity)

In [445]:
# Function to get print results from the models
def eval_and_report(y_true, y_pred, model_name):
    print(f"{model_name:18s} | R^2: {r2_score(y_true, y_pred):.4f} | RMSE: {root_mean_squared_error(y_true, y_pred):.6f}")

In [446]:
# Models
results = {}

for target in targets:
    print(f"\n=== Target: {target} ===")
    X_train = train_scaled[features].copy()
    y_train = train_scaled[target].copy()
    X_test = test_scaled[features].copy()
    y_test = test_scaled[target].copy()

    # 1. Ordinary Least Squares (OLS)
    ols = LinearRegression()
    ols.fit(X_train, y_train)
    yhat_ols = ols.predict(X_test)
    eval_and_report(y_test, yhat_ols, "OLS")

    # Print top coefficients
    ols_coef = pd.Series(ols.coef_, index=features).sort_values(key=np.abs, ascending=False)
    print("Top OLS coeffs:\n", ols_coef.head(10))

    # 2. Ridge with CV over alphas (time-series CV)
    tscv = TimeSeriesSplit(n_splits=5)
    alphas = np.logspace(-4, 3, 30)

    ridge = RidgeCV(alphas=alphas, cv=tscv, fit_intercept=True)
    ridge.fit(X_train, y_train)
    yhat_ridge = ridge.predict(X_test)
    eval_and_report(y_test, yhat_ridge, f"Ridge (alpha={ridge.alpha_:.4g})")

    # Print top coefficients
    ridge_coef = pd.Series(ridge.coef_, index=features).sort_values(key=np.abs, ascending=False)
    print("Top Ridge coeffs:\n", ridge_coef.head(10))

    # 3. Lasso with CV over alphas (time-series CV)
    tscv = TimeSeriesSplit(n_splits=5)
    alphas = np.logspace(-4, 3, 30)

    lasso = LassoCV(alphas=alphas, cv=tscv, fit_intercept=True)
    lasso.fit(X_train, y_train)
    yhat_lasso = lasso.predict(X_test)
    eval_and_report(y_test, yhat_lasso, f"Lasso (alpha={lasso.alpha_:.4g})")

    # Print top coefficients
    lasso_coef = pd.Series(lasso.coef_, index=features).sort_values(key=np.abs, ascending=False)
    print("Top Lasso coeffs:\n", lasso_coef.head(10))

    # Store for later use
    results[target] = {
        "ols_model": ols,
        "ridge_model": ridge,
        "ols_coefs": ols_coef,
        "ridge_coefs": ridge_coef,
        "lasso_coefs": lasso_coef,
        "train_data_ols": pd.Series(ols.predict(X_train), index=y_train.index, name=f"OLS_train"),
        "train_data_ridge": pd.Series(ridge.predict(X_train), index=y_train.index, name=f"Ridge_train"),
        "train_data_lasso": pd.Series(lasso.predict(X_train), index=y_train.index, name=f"Lasso_train"),
        "yhat_ols": pd.Series(yhat_ols, index=y_test.index, name=f"{target}_OLS_pred"),
        "yhat_ridge": pd.Series(yhat_ridge, index=y_test.index, name=f"{target}_Ridge_pred"),
        "yhat_lasso": pd.Series(yhat_lasso, index=y_test.index, name=f"{target}_Ridge_pred"),
    }


=== Target: S&P_next_ret ===
OLS                | R^2: -4.7999 | RMSE: 0.026608
Top OLS coeffs:
 S&P_Close                  0.054545
DJIA_Close                -0.043388
S&P_3m_rolling_price      -0.034082
S&P_log_price             -0.033880
DJIA_3m_rolling_price      0.020646
DJIA_log_price             0.020606
NASDAQ_Close              -0.011978
NASDAQ_3m_rolling_price    0.011178
NASDAQ_log_price           0.011082
S&P_mom_3m                -0.010443
dtype: float64
Ridge (alpha=1000) | R^2: 0.0006 | RMSE: 0.011045
Top Ridge coeffs:
 DJIA_ret        -0.000636
S&P_mom_1w      -0.000599
S&P_ret         -0.000583
DIA_div          0.000439
S&P_vlm_1m      -0.000366
ten_yr          -0.000357
NASDAQ_Volume   -0.000347
S&P_Volume      -0.000308
BAMLC0A4CBBB    -0.000288
S&P_vlm_1w       0.000238
dtype: float64
Lasso (alpha=0.0003039) | R^2: -0.0019 | RMSE: 0.011059
Top Lasso coeffs:
 DJIA_ret        -0.000920
S&P_mom_1w      -0.000222
NASDAQ_Volume   -0.000166
S&P_ret         -0.000134
NASD

Let's try OLS again with PCA

In [447]:
from sklearn.decomposition import PCA

In [448]:
pca = PCA(n_components=.95).fit(train_scaled) # keep 95% of variance and fit to training set
train_pca = pca.transform(train_scaled)
test_pca = pca.transform(test_scaled)

In [449]:
# PCA
for target in targets:
    print(f"\n=== Target: {target} ===")
    y_train = train_scaled[target].copy()
    y_test  = test_scaled[target].copy()

    # 1) OLS on PCA components
    ols_pca = LinearRegression()
    ols_pca.fit(train_pca, y_train)
    yhat_pca = ols_pca.predict(test_pca)
    eval_and_report(y_test, yhat_pca, "OLS+PCA")

    results[target].update({
        "pca_model": ols_pca,
        "train_data_pca": pd.Series(ols_pca.predict(train_pca), index=y_train.index, name=f"PCA_train"),
        "yhat_pca": pd.Series(yhat_pca, index=y_test.index, name=f"{target}_PCA_pred")
    })


=== Target: S&P_next_ret ===
OLS+PCA            | R^2: -0.0086 | RMSE: 0.011096

=== Target: NASDAQ_next_ret ===
OLS+PCA            | R^2: -0.0007 | RMSE: 0.014459

=== Target: DJIA_next_ret ===
OLS+PCA            | R^2: -0.0199 | RMSE: 0.009551


Plot these results

In [450]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [451]:
# Pairs: (pretty name, next-return target key)
pairs = [
    ("S&P 500", "S&P_next_ret"),
    ("NASDAQ",  "NASDAQ_next_ret"),
    ("DJIA",    "DJIA_next_ret"),
]

def get_pred(results_dict, target_key, kind):
    """
    kind: 'ols', 'ridge', or 'pca'
    Prefer unstandardized preds if present, otherwise fall back to raw.
    """
    unstd_key = f"yhat_{kind}_unstd"
    std_key   = f"yhat_{kind}"
    if target_key in results_dict:
        if unstd_key in results_dict[target_key]:
            return results_dict[target_key][unstd_key]
        elif std_key in results_dict[target_key]:
            return results_dict[target_key][std_key]
    return None

fig = make_subplots(
    rows=3, cols=1, shared_xaxes=True, vertical_spacing=0.06,
    subplot_titles=[p[0] for p in pairs]
)

for i, (label, next_col) in enumerate(pairs, start=1):
    if next_col not in testing_data.columns:
        continue

    # True = next-period return from testing_data (unscaled)
    y_true = testing_data[next_col].dropna().sort_index().rename("True")

    # Predictions for the same target
    yhat_ols   = get_pred(results, next_col, "ols")
    yhat_ridge = get_pred(results, next_col, "ridge")
    yhat_pca   = get_pred(results, next_col, "pca")
    yhat_lasso = get_pred(results, next_col, "lasso")

    parts = [y_true]
    if yhat_ols is not None:
        parts.append(yhat_ols.rename("OLS").reindex(y_true.index))
    if yhat_ridge is not None:
        parts.append(yhat_ridge.rename("Ridge").reindex(y_true.index))
    if yhat_pca is not None:
        parts.append(yhat_pca.rename("PCA").reindex(y_true.index))
    if yhat_lasso is not None:
        parts.append(yhat_lasso.rename("Lasso").reindex(y_true.index))

    df = pd.concat(parts, axis=1).dropna(how="any")
    if df.empty:
        continue

    show_leg = (i == 1)

    # True next_ret
    fig.add_trace(
        go.Scatter(x=df.index, y=df["True"], name="True next_ret",
                   mode="lines", line=dict(width=1.6),
                   showlegend=show_leg, legendgroup="true"),
        row=i, col=1
    )
    """
    # OLS prediction
    if "OLS" in df:
        fig.add_trace(
            go.Scatter(x=df.index, y=df["OLS"], name="OLS prediction",
                       mode="lines", line=dict(width=1.4, dash="dash"),
                       showlegend=show_leg, legendgroup="ols"),
            row=i, col=1
        )
    """
    # Ridge prediction
    if "Ridge" in df:
        fig.add_trace(
            go.Scatter(x=df.index, y=df["Ridge"], name="Ridge prediction",
                       mode="lines", line=dict(width=1.4, dash="dot"),
                       showlegend=show_leg, legendgroup="ridge"),
            row=i, col=1
        )
    # PCA prediction
    if "PCA" in df:
        fig.add_trace(
            go.Scatter(x=df.index, y=df["PCA"], name="PCA prediction",
                       mode="lines", line=dict(width=1.4, dash="longdash"),
                       showlegend=show_leg, legendgroup="pca"),
            row=i, col=1
        )
    # Lasso prediction
    if "PCA" in df:
        fig.add_trace(
            go.Scatter(x=df.index, y=df["Lasso"], name="Lasso prediction",
                       mode="lines", line=dict(width=1.4, dash="dash"),
                       showlegend=show_leg, legendgroup="lasso"),
            row=i, col=1
        )

fig.update_layout(
    title="Testing: True Next-Period Log Returns vs OLS, Ridge & PCA Predictions",
    height=900,
    hovermode="x unified",
    template="plotly_white",
    margin=dict(t=80, r=30, b=80, l=70),
    legend=dict(orientation="h", yanchor="top", y=-0.12, xanchor="left", x=0)
)

for r in range(1, 4):
    fig.update_yaxes(title_text="Log return", row=r, col=1)

# Range tools on the bottom axis
fig.update_xaxes(
    rangeselector=dict(buttons=[
        dict(step="all", label="All"),
        dict(count=3, step="year", stepmode="backward", label="3Y"),
        dict(count=1, step="year", stepmode="backward", label="1Y"),
        dict(count=6, step="month", stepmode="backward", label="6M"),
        dict(count=1, step="month", stepmode="backward", label="1M"),
    ]),
    rangeslider=dict(visible=True),
    row=3, col=1
)

fig.show()


# It looks like this may be an ok base for mu. Let's try to build sigma.

In [452]:
vol_data = s_p.join([nasdaq, djia])
vol_data

Price,S&P_Close,S&P_Volume,NASDAQ_Close,NASDAQ_Volume,DJIA_Close,DJIA_Volume
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1993-01-29,24.313030,1003200,,,,
1993-02-01,24.485964,480500,,,,
1993-02-02,24.537844,201300,,,,
1993-02-03,24.797222,529400,,,,
1993-02-04,24.900986,531500,,,,
...,...,...,...,...,...,...
2025-09-15,659.082703,63772400,591.679993,44360300.0,458.759308,4862200.0
2025-09-16,658.175232,61169000,591.179993,36942100.0,457.472961,4526100.0
2025-09-17,657.357544,101952200,590.000000,69384800.0,459.945923,5998000.0
2025-09-18,660.429016,90459200,595.320007,61069300.0,461.312012,6681000.0


In [453]:
# First, make log prices / volumes of our data, then log normal assumptions are better and everything is additive
vol_data['S&P_log_price'] = np.log(vol_data['S&P_Close'])
vol_data['NASDAQ_log_price'] = np.log(vol_data['NASDAQ_Close'])
vol_data['DJIA_log_price'] = np.log(vol_data['DJIA_Close'])

vol_data['S&P_log_volume'] = np.log(data['S&P_Volume'])
vol_data['NASDAQ_log_volume'] = np.log(data['NASDAQ_Volume'])
vol_data['DJIA_log_volume'] = np.log(data['DJIA_Volume'])

# Log returns
vol_data['S&P_ret'] = vol_data['S&P_log_price'].diff(days)
vol_data['NASDAQ_ret'] = vol_data['NASDAQ_log_price'].diff(days)
vol_data['DJIA_ret'] = vol_data['DJIA_log_price'].diff(days)

# Next month log returns
vol_data[['S&P_next_ret','NASDAQ_next_ret','DJIA_next_ret']] = vol_data[['S&P_ret','NASDAQ_ret','DJIA_ret']].shift(-days).dropna()

# Residuals from previous
vol_data['S&P_next_ret'] = vol_data['S&P_next_ret'] - pd.concat([results['S&P_next_ret']['train_data_pca'],results['S&P_next_ret']['yhat_pca']])
vol_data['NASDAQ_ret'] = vol_data['NASDAQ_ret'] - pd.concat([results['NASDAQ_next_ret']['train_data_pca'],results['NASDAQ_next_ret']['yhat_pca']])
vol_data['DJIA_ret'] = vol_data['DJIA_ret'] - pd.concat([results['DJIA_next_ret']['train_data_pca'],results['DJIA_next_ret']['yhat_pca']])


# TODO: Change this, ripped straight from GPT as a quick test

In [454]:
EPS = 1e-12

# If you already have a rolling_mean helper, keep it. Otherwise:
def rolling_mean(s, w):
    return s.rolling(w, min_periods=max(2, int(w*0.6))).mean()

def ewma_vol(r, lam=0.94):
    # EWMA variance per RiskMetrics: sigma_t^2 = (1-lam)*r_{t-1}^2 + lam*sigma_{t-1}^2
    # Use pandas ewm for convenience
    return r.pow(2).ewm(alpha=(1-lam), adjust=False).mean().clip(lower=0)

def rolling_autocorr(x, lag=1, window=63):
    # Rolling autocorrelation of x at a given lag
    # For stability, require at least ~60% of window
    minp = max(10, int(window*0.6))
    x0 = x
    x1 = x.shift(lag)
    return x0.rolling(window, min_periods=minp).corr(x1)

def realized_quarticity(r, window=63):
    # 3-month robust quarticity proxy (if daily): sum r^4 * (n / 3) approximation
    # Here we simply provide rolling sum of r^4; scaling optional depending on use
    minp = max(10, int(window*0.6))
    return (r.pow(4)).rolling(window, min_periods=minp).sum()

def build_vol_features(data, prefix, day_w=21, qtr_w=63, yr_w=252, ewma_lambda=0.94):
    """
    Expects:
      data[f'{prefix}_log_price'] (daily log price)
      data[f'{prefix}_log_volume'] (daily log volume)
    Produces a suite of volatility-centric features for that prefix.
    """
    lp = data[f"{prefix}_log_price"]
    lv = data.get(f"{prefix}_log_volume", None)

    # Daily log return
    r = lp.diff()  # already log-price, so diff = log-return

    # --- Realized volatility proxies ---
    data[f"{prefix}_rv_1m"]  = r.rolling(day_w, min_periods=int(day_w*0.6)).var().clip(lower=0)          # variance
    data[f"{prefix}_rv_3m"]  = r.rolling(qtr_w, min_periods=int(qtr_w*0.6)).var().clip(lower=0)
    data[f"{prefix}_rv_1y"]  = r.rolling(yr_w,  min_periods=int(yr_w*0.6)).var().clip(lower=0)
    data[f"{prefix}_absrv_1m"] = r.abs().rolling(day_w, min_periods=int(day_w*0.6)).mean()               # mean |r|
    data[f"{prefix}_absrv_3m"] = r.abs().rolling(qtr_w, min_periods=int(qtr_w*0.6)).mean()

    # EWMA volatility (RiskMetrics-style)
    data[f"{prefix}_ewma_var"] = ewma_vol(r, lam=ewma_lambda)
    data[f"{prefix}_ewma_vol"] = np.sqrt(data[f"{prefix}_ewma_var"])

    # Volatility-of-volatility (how fast vol is changing)
    data[f"{prefix}_vol_speed_1w"] = data[f"{prefix}_rv_3m"].diff(5)                                      # weekly change in 3m var
    data[f"{prefix}_vol_mom_1m"]   = data[f"{prefix}_rv_3m"] - data[f"{prefix}_rv_1m"]                    # 3m vs 1m
    data[f"{prefix}_vol_mom_1y"]   = data[f"{prefix}_rv_1y"] - data[f"{prefix}_rv_3m"]

    # Volatility clustering proxies
    data[f"{prefix}_acf_sqret_lag1_3m"] = rolling_autocorr(r.pow(2), lag=1, window=qtr_w)
    data[f"{prefix}_acf_absret_lag1_3m"] = rolling_autocorr(r.abs(), lag=1, window=qtr_w)

    # Leverage effect proxy (contemporaneous corr between return and next day's vol)
    # Negative returns often precede higher vol; we proxy with corr(r_t, |r|_{t+1})
    data[f"{prefix}_lev_proxy_3m"] = r.rolling(qtr_w, min_periods=int(qtr_w*0.6)).corr(r.abs().shift(-1))

    # Quarticity (heavy tails proxy)
    data[f"{prefix}_quarticity_3m"] = realized_quarticity(r, window=qtr_w)

    # Ratio features (normalized vol levels)
    data[f"{prefix}_vol_ratio_1m_3m"] = (data[f"{prefix}_rv_1m"] / (data[f"{prefix}_rv_3m"] + EPS))
    data[f"{prefix}_vol_ratio_3m_1y"] = (data[f"{prefix}_rv_3m"] / (data[f"{prefix}_rv_1y"] + EPS))
    data[f"{prefix}_ewma_over_3m"]    = (data[f"{prefix}_ewma_var"] / (data[f"{prefix}_rv_3m"] + EPS))

    # Price–volatility relation: distance from trend as a stress proxy
    data[f"{prefix}_price_trend_3m"]  = rolling_mean(lp, qtr_w)
    data[f"{prefix}_price_trend_dist"] = lp - data[f"{prefix}_price_trend_3m"]
    # Volatility when far below trend often spikes; include interaction
    data[f"{prefix}_vol_x_trend_dist"] = data[f"{prefix}_rv_1m"] * data[f"{prefix}_price_trend_dist"]

    # Volume–volatility links (if volume available)
    if lv is not None:
        dv = lv.diff()  # log-volume change
        data[f"{prefix}_vlm_var_1m"] = dv.rolling(day_w, min_periods=int(day_w*0.6)).var().clip(lower=0)
        data[f"{prefix}_vlm_var_3m"] = dv.rolling(qtr_w, min_periods=int(qtr_w*0.6)).var().clip(lower=0)
        # Corr between |r| and volume changes (vol–volume clustering)
        data[f"{prefix}_corr_absr_dlv_3m"] = r.abs().rolling(qtr_w, min_periods=int(qtr_w*0.6)).corr(dv)
        # Volume surprise proxy: current vs 3m trend
        data[f"{prefix}_vlm_trend_3m"] = rolling_mean(lv, qtr_w)
        data[f"{prefix}_vlm_trend_dist"] = lv - data[f"{prefix}_vlm_trend_3m"]
        # Vol reacts to volume surprises
        data[f"{prefix}_vol_x_vlm_surprise"] = data[f"{prefix}_rv_1m"] * data[f"{prefix}_vlm_trend_dist"]

    # Optional: implied vs realized vol spread if you have VIX-like series
    # if f"{prefix}_impl_vol" in data.columns:
    #     data[f"{prefix}_ivr_spread"] = data[f"{prefix}_impl_vol"]**2 - data[f"{prefix}_rv_1m"]

    # Forward-looking realized vol target example (if needed)
    # data[f"{prefix}_fwd_rv_1m"] = r.shift(-1).rolling(day_w, min_periods=int(day_w*0.6)).var()

    return data

# ---- Apply to all indices ----
IDX_PREFIXES = ["S&P", "NASDAQ", "DJIA"]
for pfx in IDX_PREFIXES:
    vol_data = build_vol_features(vol_data, pfx, day_w=21, qtr_w=63, yr_w=252, ewma_lambda=0.94)

# ---- Cross-index spillover features (optional but useful) ----
# Differences/spreads in contemporaneous vol across indices capture contagion/regime moves
vol_data["SPX_minus_NDX_vol_1m"] = vol_data["S&P_rv_1m"] - vol_data["NASDAQ_rv_1m"]
vol_data["SPX_minus_DJIA_vol_1m"] = vol_data["S&P_rv_1m"] - vol_data["DJIA_rv_1m"]
vol_data["NDX_minus_DJIA_vol_1m"] = vol_data["NASDAQ_rv_1m"] - vol_data["DJIA_rv_1m"]

# A simple global vol factor: first PC of 3m realized vars (if you want a single factor)
try:
    _X = vol_data[["S&P_rv_3m", "NASDAQ_rv_3m", "DJIA_rv_3m"]].dropna()
    _Xc = (_X - _X.mean()) / (_X.std(ddof=0) + EPS)
    # first PC (no scikit-learn to keep it lightweight)
    U, S, Vt = np.linalg.svd(_Xc.values, full_matrices=False)
    gvol = pd.Series(U[:, 0]*S[0], index=_X.index, name="global_vol_pc1")
    vol_data["global_vol_pc1"] = gvol.reindex(data.index)
except Exception:
    pass

vol_data.dropna(inplace=True)

In [455]:
vol_data

Price,S&P_Close,S&P_Volume,NASDAQ_Close,NASDAQ_Volume,DJIA_Close,DJIA_Volume,S&P_log_price,NASDAQ_log_price,DJIA_log_price,S&P_log_volume,...,DJIA_vlm_var_1m,DJIA_vlm_var_3m,DJIA_corr_absr_dlv_3m,DJIA_vlm_trend_3m,DJIA_vlm_trend_dist,DJIA_vol_x_vlm_surprise,SPX_minus_NDX_vol_1m,SPX_minus_DJIA_vol_1m,NDX_minus_DJIA_vol_1m,global_vol_pc1
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2004-02-19,76.924599,51146200,31.322958,100843800.0,66.054520,6430900.0,4.342826,3.444351,4.190480,17.750199,...,0.183854,0.163275,0.362774,15.637369,0.039256,0.000002,-0.000069,0.000010,7.879387e-05,0.765691
2004-02-20,76.690948,46728800,31.221315,132347600.0,65.799736,9334700.0,4.339784,3.441101,4.186616,17.659871,...,0.190730,0.162235,0.352632,15.647930,0.401319,0.000015,-0.000071,0.000009,8.049876e-05,0.787468
2004-02-23,76.497322,36357000,30.874050,125199800.0,65.682281,7350700.0,4.337256,3.429916,4.184829,17.408897,...,0.184047,0.159739,0.360884,15.651989,0.158317,0.000006,-0.000075,0.000009,8.385642e-05,0.788277
2004-02-24,76.363815,43953000,30.797806,124521900.0,65.434921,7973900.0,4.335509,3.427443,4.181056,17.598631,...,0.183610,0.155746,0.359413,15.657835,0.233849,0.000008,-0.000075,0.000010,8.480435e-05,0.797927
2004-02-25,76.684273,31213600,30.984146,74457100.0,65.657547,4212700.0,4.339697,3.433476,4.184453,17.256364,...,0.203312,0.162379,0.360935,15.648211,-0.394597,-0.000011,-0.000070,0.000010,8.023364e-05,0.799117
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-09-12,655.592407,72780100,586.659973,50745900.0,458.011444,3784600.0,6.485539,6.374445,6.126894,18.102953,...,0.144960,0.113274,0.474106,15.188782,-0.042331,-0.000002,-0.000019,-0.000010,9.078153e-06,0.836755
2025-09-15,659.082703,63772400,591.679993,44360300.0,458.759308,4862200.0,6.490849,6.382966,6.128526,17.970831,...,0.143888,0.108058,0.412614,15.188900,0.208102,0.000008,-0.000021,-0.000009,1.212366e-05,0.861911
2025-09-16,658.175232,61169000,591.179993,36942100.0,457.472961,4526100.0,6.489471,6.382121,6.125718,17.929151,...,0.141966,0.105289,0.434828,15.194517,0.130854,0.000005,-0.000020,-0.000010,1.007059e-05,0.868411
2025-09-17,657.357544,101952200,590.000000,69384800.0,459.945923,5998000.0,6.488228,6.380123,6.131109,18.440015,...,0.138226,0.105306,0.442952,15.208728,0.398209,0.000017,-0.000020,-0.000010,9.766296e-06,0.877492


In [456]:
month_dummies = pd.get_dummies(vol_data.index.month, prefix="month")
month_dummies.set_index(vol_data.index, inplace=True)
vol_data = vol_data.join(month_dummies)

In [457]:

cutoff = math.floor(len(vol_data)*.8)
training_data = vol_data.iloc[:cutoff]
testing_data = vol_data.iloc[cutoff:]
# Make sure we only fit on training_data and explanatory variables
targets = ['S&P_next_ret', 'NASDAQ_next_ret', 'DJIA_next_ret']
dummies = [f'month_{month}' for month in range(1,13)]
features = [column for column in training_data.columns if column not in targets and column not in dummies]

scaler = StandardScaler()
scaler.fit(training_data[features]) # Fitting on training data

train_scaled = training_data.copy()
test_scaled = testing_data.copy()

train_scaled[features] = scaler.transform(training_data[features])
test_scaled[features] = scaler.transform(testing_data[features])

# Save info on standardization for later
variables_mu = pd.Series(scaler.mean_, index=features)
variables_sd = pd.Series(scaler.scale_, index=features)

In [458]:
# Models
results_sigma = {}

for target in targets:
    print(f"\n=== Target: {target} ===")
    X_train = train_scaled[features].copy()
    y_train = train_scaled[target].copy()
    X_test = test_scaled[features].copy()
    y_test = test_scaled[target].copy()

    # 1. Ordinary Least Squares (OLS)
    ols = LinearRegression()
    ols.fit(X_train, y_train)
    yhat_ols = ols.predict(X_test)
    eval_and_report(y_test, yhat_ols, "OLS")

    # Print top coefficients
    ols_coef = pd.Series(ols.coef_, index=features).sort_values(key=np.abs, ascending=False)
    print("Top OLS coeffs:\n", ols_coef.head(10))

    # 2. Ridge with CV over alphas (time-series CV)
    tscv = TimeSeriesSplit(n_splits=5)
    alphas = np.logspace(-4, 3, 30)

    ridge = RidgeCV(alphas=alphas, cv=tscv, fit_intercept=True)
    ridge.fit(X_train, y_train)
    yhat_ridge = ridge.predict(X_test)
    eval_and_report(y_test, yhat_ridge, f"Ridge: alpha={ridge.alpha_:.4g})")

    # Print top coefficients
    ridge_coef = pd.Series(ridge.coef_, index=features).sort_values(key=np.abs, ascending=False)
    print("Top Ridge coeffs:\n", ridge_coef.head(10))

    # 3. Lasso with CV over alphas (time-series CV)
    tscv = TimeSeriesSplit(n_splits=5)
    alphas = np.logspace(-4, 3, 30)

    lasso = LassoCV(alphas=alphas, cv=tscv, fit_intercept=True)
    lasso.fit(X_train, y_train)
    yhat_lasso = lasso.predict(X_test)
    eval_and_report(y_test, yhat_lasso, f"Lasso (alpha={lasso.alpha_:.4g})")

    # Print top coefficients
    lasso_coef = pd.Series(lasso.coef_, index=features).sort_values(key=np.abs, ascending=False)
    print("Top Lasso coeffs:\n", lasso_coef.head(10))

    # Store for later use
    results_sigma[target] = {
        "ols_model": ols,
        "ridge_model": ridge,
        "ols_coefs": ols_coef,
        "ridge_coefs": ridge_coef,
        "lasso_coefs": lasso_coef,
        "train_data_ols": pd.Series(ols.predict(X_train), index=y_train.index, name=f"OLS_train"),
        "train_data_ridge": pd.Series(ridge.predict(X_train), index=y_train.index, name=f"Ridge_train"),
        "train_data_lasso": pd.Series(lasso.predict(X_train), index=y_train.index, name=f"Lasso_train"),
        "yhat_ols": pd.Series(yhat_ols, index=y_test.index, name=f"{target}_OLS_pred"),
        "yhat_ridge": pd.Series(yhat_ridge, index=y_test.index, name=f"{target}_Ridge_pred"),
        "yhat_lasso": pd.Series(yhat_lasso, index=y_test.index, name=f"{target}_Ridge_pred"),
    }


=== Target: S&P_next_ret ===
OLS                | R^2: -4.5289 | RMSE: 0.026057
Top OLS coeffs:
 DJIA_quarticity_3m         0.033700
S&P_vol_x_trend_dist      -0.029455
S&P_ewma_vol              -0.029301
DJIA_ewma_vol              0.027753
NASDAQ_quarticity_3m      -0.027753
S&P_Close                 -0.025970
DJIA_Close                 0.025080
NASDAQ_vol_x_trend_dist    0.015279
S&P_absrv_1m               0.013531
S&P_price_trend_3m         0.013341
dtype: float64
Ridge: alpha=1000) | R^2: -0.0078 | RMSE: 0.011125
Top Ridge coeffs:
 S&P_vol_x_trend_dist        -0.000967
DJIA_vol_x_trend_dist       -0.000879
DJIA_vol_x_vlm_surprise      0.000814
NASDAQ_vol_x_vlm_surprise   -0.000748
NASDAQ_ewma_var             -0.000443
S&P_vol_speed_1w             0.000414
S&P_quarticity_3m           -0.000411
NASDAQ_ret                   0.000405
S&P_vol_x_vlm_surprise      -0.000362
NASDAQ_vol_speed_1w         -0.000326
dtype: float64
Lasso (alpha=1000) | R^2: -0.0041 | RMSE: 0.011105
Top Lasso c


Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.355e-04, tolerance: 2.922e-05



Ridge: alpha=1000) | R^2: -0.0035 | RMSE: 0.014495
Top Ridge coeffs:
 DJIA_vol_x_trend_dist       -0.000908
S&P_vol_x_trend_dist        -0.000885
DJIA_ret                    -0.000765
DJIA_vol_x_vlm_surprise      0.000699
NASDAQ_vol_x_vlm_surprise   -0.000582
S&P_quarticity_3m           -0.000453
S&P_price_trend_dist        -0.000446
NASDAQ_Volume               -0.000418
S&P_ret                     -0.000416
NASDAQ_ewma_var             -0.000352
dtype: float64
Lasso (alpha=0.0003039) | R^2: -0.0013 | RMSE: 0.014478
Top Lasso coeffs:
 DJIA_ret                -0.001384
DJIA_vol_x_trend_dist   -0.000341
NASDAQ_Volume           -0.000287
NASDAQ_vol_speed_1w     -0.000161
DJIA_rv_1y               0.000091
DJIA_vlm_var_1m         -0.000037
NASDAQ_vlm_trend_dist   -0.000021
NASDAQ_Close             0.000000
S&P_Close               -0.000000
DJIA_log_price          -0.000000
dtype: float64

=== Target: DJIA_next_ret ===
OLS                | R^2: -5.4688 | RMSE: 0.024078
Top OLS coeffs:
 DJIA_q

In [459]:
pca = PCA(n_components=.95).fit(train_scaled) # keep 95% of variance and fit to training set
train_pca = pca.transform(train_scaled)
test_pca = pca.transform(test_scaled)

# PCA
for target in targets:
    print(f"\n=== Target: {target} ===")
    y_train = train_scaled[target].copy()
    y_test  = test_scaled[target].copy()

    # 1) OLS on PCA components
    ols_pca = LinearRegression()
    ols_pca.fit(train_pca, y_train)
    yhat_pca = ols_pca.predict(test_pca)
    eval_and_report(y_test, yhat_pca, "OLS+PCA")

    results_sigma[target].update({
        "pca_model": ols_pca,
        "train_data_pca": pd.Series(ols_pca.predict(train_pca), index=y_train.index, name=f"PCA_train"),
        "yhat_pca": pd.Series(yhat_pca, index=y_test.index, name=f"{target}_PCA_pred")
    })


=== Target: S&P_next_ret ===
OLS+PCA            | R^2: -0.0074 | RMSE: 0.011123

=== Target: NASDAQ_next_ret ===
OLS+PCA            | R^2: -0.0053 | RMSE: 0.014507

=== Target: DJIA_next_ret ===
OLS+PCA            | R^2: -0.0341 | RMSE: 0.009627


In [460]:
# Pairs: (pretty name, next-return target key)
pairs = [
    ("S&P 500", "S&P_next_ret"),
    ("NASDAQ",  "NASDAQ_next_ret"),
    ("DJIA",    "DJIA_next_ret"),
]

def get_pred(results_dict, target_key, kind):
    """
    kind: 'ols', 'ridge', or 'pca'
    Prefer unstandardized preds if present, otherwise fall back to raw.
    """
    unstd_key = f"yhat_{kind}_unstd"
    std_key   = f"yhat_{kind}"
    if target_key in results_dict:
        if unstd_key in results_dict[target_key]:
            return results_dict[target_key][unstd_key]
        elif std_key in results_dict[target_key]:
            return results_dict[target_key][std_key]
    return None

fig = make_subplots(
    rows=3, cols=1, shared_xaxes=True, vertical_spacing=0.06,
    subplot_titles=[p[0] for p in pairs]
)

for i, (label, next_col) in enumerate(pairs, start=1):
    if next_col not in testing_data.columns:
        continue

    # True = next-period return from testing_data (unscaled)
    y_true = testing_data[next_col].dropna().sort_index().rename("True")

    # Predictions for the same target
    yhat_ols   = get_pred(results_sigma, next_col, "ols")
    yhat_ridge = get_pred(results_sigma, next_col, "ridge")
    yhat_pca   = get_pred(results_sigma, next_col, "pca")
    yhat_lasso   = get_pred(results_sigma, next_col, "lasso")

    parts = [y_true]
    if yhat_ols is not None:
        parts.append(yhat_ols.rename("OLS").reindex(y_true.index))
    if yhat_ridge is not None:
        parts.append(yhat_ridge.rename("Ridge").reindex(y_true.index))
    if yhat_pca is not None:
        parts.append(yhat_pca.rename("PCA").reindex(y_true.index))
    if yhat_lasso is not None:
        parts.append(yhat_lasso.rename("Lasso").reindex(y_true.index))

    df = pd.concat(parts, axis=1).dropna(how="any")
    if df.empty:
        continue

    show_leg = (i == 1)

    # True next_ret
    fig.add_trace(
        go.Scatter(x=df.index, y=df["True"], name="True next_ret",
                   mode="lines", line=dict(width=1.6),
                   showlegend=show_leg, legendgroup="true"),
        row=i, col=1
    )
    """
    # OLS prediction
    if "OLS" in df:
        fig.add_trace(
            go.Scatter(x=df.index, y=df["OLS"], name="OLS prediction",
                       mode="lines", line=dict(width=1.4, dash="dash"),
                       showlegend=show_leg, legendgroup="ols"),
            row=i, col=1
        )
    """
    # Ridge prediction
    if "Ridge" in df:
        fig.add_trace(
            go.Scatter(x=df.index, y=df["Ridge"], name="Ridge prediction",
                       mode="lines", line=dict(width=1.4, dash="dot"),
                       showlegend=show_leg, legendgroup="ridge"),
            row=i, col=1
        )
    # PCA prediction 
    if "PCA" in df:
        fig.add_trace(
            go.Scatter(x=df.index, y=df["PCA"], name="PCA prediction",
                       mode="lines", line=dict(width=1.4, dash="longdash"),
                       showlegend=show_leg, legendgroup="pca"),
            row=i, col=1
        )
    # Lasso prediction  
    if "Lasso" in df:
        fig.add_trace(
            go.Scatter(x=df.index, y=df["Lasso"], name="Lasso prediction",
                       mode="lines", line=dict(width=1.4, dash="dash"),
                       showlegend=show_leg, legendgroup="lasso"),
            row=i, col=1
        )

fig.update_layout(
    title="Testing: True Next-Period Log Returns vs OLS, Ridge & PCA Predictions",
    height=900,
    hovermode="x unified",
    template="plotly_white",
    margin=dict(t=80, r=30, b=80, l=70),
    legend=dict(orientation="h", yanchor="top", y=-0.12, xanchor="left", x=0)
)

for r in range(1, 4):
    fig.update_yaxes(title_text="Log return", row=r, col=1)

# Range tools on the bottom axis
fig.update_xaxes(
    rangeselector=dict(buttons=[
        dict(step="all", label="All"),
        dict(count=3, step="year", stepmode="backward", label="3Y"),
        dict(count=1, step="year", stepmode="backward", label="1Y"),
        dict(count=6, step="month", stepmode="backward", label="6M"),
        dict(count=1, step="month", stepmode="backward", label="1M"),
    ]),
    rangeslider=dict(visible=True),
    row=3, col=1
)

fig.show()


Combine the drift and diffusion

In [461]:
combined = pd.DataFrame(data['S&P_Close']).join([data['NASDAQ_Close'], data['DJIA_Close']])
models = ['ridge','lasso','pca']

for model in models:
    for target in targets:
        df = pd.DataFrame(pd.concat([results_sigma[target][f'train_data_{model}'],results_sigma[target][f'yhat_{model}']])).join(pd.Series(pd.concat([results[target][f'train_data_{model}'],results[target][f'yhat_{model}']]),name='mu'))
        df['mu'] = df['mu'] - (data[target].mean())
        df['total'] = df[0] + df['mu']
        combined[f'{target}_pred_{model}'] = df['total']
combined = combined.dropna()


In [462]:
combined['S&P_next_ret_pred_lasso'].sum()

np.float64(-0.08793744834695257)

In [463]:
# --- config ---
is_log_returns = True
dash_map = {"ridge": "dot", "pca": "dash", "lasso": "longdash"}

pairs = [
    ("S&P 500",  "S&P_next_ret",  "S&P_Close"),
    ("NASDAQ",   "NASDAQ_next_ret","NASDAQ_Close"),
    ("DJIA",     "DJIA_next_ret", "DJIA_Close"),
]

fig = make_subplots(
    rows=3, cols=1, shared_xaxes=True, vertical_spacing=0.06,
    subplot_titles=[p[0] for p in pairs]
)

for i, (label, target_key, close_col) in enumerate(pairs, start=1):
    if close_col not in combined.columns:
        continue

    close = combined[close_col].dropna()
    if close.empty:
        continue

    # Actual Close
    show_leg = (i == 1)
    fig.add_trace(
        go.Scatter(
            x=close.index, y=close.values,
            name=f"{label} — Actual", mode="lines",
            line=dict(width=1.7),
            showlegend=show_leg, legendgroup=f"{label}_actual"
        ),
        row=i, col=1
    )

    # For each model, overlay estimated close path
    for m in models:
        pred_col = f"{target_key}_pred_{m}"
        if pred_col not in combined.columns:
            continue

        rhat = combined[pred_col].dropna()
        idx = close.index.intersection(rhat.index)
        if idx.empty:
            continue

        s = close.loc[idx]
        r = rhat.loc[idx]

        base = float(s.iloc[0])
        if is_log_returns:
            # r_t moves price t-1 -> t, shift to anchor at first date
            cum = r.shift(1).fillna(0.0).cumsum()
            est = base * np.exp(cum)
        else:
            growth = (1.0 + r).shift(1).fillna(1.0).cumprod()
            est = base * growth

        fig.add_trace(
            go.Scatter(
                x=idx, y=est.values,
                name=f"{label} — {m.upper()} (anchored)",
                mode="lines",
                line=dict(width=1.3, dash=dash_map.get(m, "dash")),
                showlegend=show_leg, legendgroup=f"{label}_{m}"
            ),
            row=i, col=1
        )

fig.update_layout(
    title="Actual vs Estimated Close (Anchored) — Ridge & PCA",
    height=900,
    hovermode="x unified",
    template="plotly_white",
    margin=dict(t=80, r=30, b=80, l=70),
    legend=dict(orientation="h", yanchor="top", y=-0.12, xanchor="left", x=0)
)

for r in range(1, 4):
    fig.update_yaxes(title_text="Price", row=r, col=1)

fig.update_xaxes(
    rangeselector=dict(buttons=[
        dict(step="all", label="All"),
        dict(count=3, step="year", stepmode="backward", label="3Y"),
        dict(count=1, step="year", stepmode="backward", label="1Y"),
        dict(count=6, step="month", stepmode="backward", label="6M"),
        dict(count=1, step="month", stepmode="backward", label="1M"),
    ]),
    rangeslider=dict(visible=True),
    row=3, col=1
)

fig.show()


Pretty interesting results here... It looks like we ultimately ended up capturing accurate drift through lasso, but volatility is not being incorporated very well. I think we need to move to backtesting to implement entry and exit strategies and compare against a baseline. Afterwards, we can try to implement some more advanced models as well as refine features.

# Backtesting

Sizing strategy 1: Using Merton-Kelly criterion

In [464]:
def get_merton_kelly_size(mu: float, rf: float, sigma: float):
    size = (mu - rf) / sigma**2
    return size

Sizing strategy 2: Using vol target

In [465]:
def get_vol_target_sizing(target: float, vol: float):
    size = (target/vol)
    return size

## Trading strategy 1: Trade based on our models

Add 10yr, also get cumulative log returns

In [466]:
combined = combined.join(ten_yr)
combined['ten_yr'] = combined['ten_yr']/100

In [467]:
indexes = ['S&P', 'NASDAQ', 'DJIA']
for idx in indexes:
    combined[f'{idx}_log'] = np.log(combined[f'{idx}_Close'])
    combined[f'{idx}_log_ret'] = combined[f'{idx}_log'].diff()
    combined[f'{idx}_cum_log_ret'] = combined[f'{idx}_log_ret'].cumsum()

In [468]:
for idx in indexes:
    for model in models:
        combined[f'{idx}_pred_{model}'] = combined[f'{idx}_Close'] * np.exp(combined[f'{idx}_next_ret_pred_{model}'])
        combined[f'{idx}_vol'] = combined[f'{idx}_Close'].rolling(window=3).std() #vol proxy
combined = combined.dropna()

In [469]:
combined

Unnamed: 0_level_0,S&P_Close,NASDAQ_Close,DJIA_Close,S&P_next_ret_pred_ridge,NASDAQ_next_ret_pred_ridge,DJIA_next_ret_pred_ridge,S&P_next_ret_pred_lasso,NASDAQ_next_ret_pred_lasso,DJIA_next_ret_pred_lasso,S&P_next_ret_pred_pca,...,S&P_pred_lasso,S&P_pred_pca,NASDAQ_pred_ridge,NASDAQ_vol,NASDAQ_pred_lasso,NASDAQ_pred_pca,DJIA_pred_ridge,DJIA_vol,DJIA_pred_lasso,DJIA_pred_pca
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2004-02-23,76.497322,30.874050,65.682281,-0.001068,-0.001610,-0.001377,-0.000084,0.000051,0.000268,0.000221,...,76.490860,76.514227,30.824374,0.235387,30.875628,30.869990,65.591868,0.190294,65.699893,65.687679
2004-02-24,76.363815,30.797806,65.434921,0.000167,-0.000588,-0.000211,0.000159,0.000632,0.000752,0.000775,...,76.375923,76.423022,30.779696,0.225746,30.817285,30.798039,65.421095,0.186222,65.484176,65.463524
2004-02-25,76.684273,30.984146,65.657547,-0.000542,-0.000979,-0.000870,-0.000315,-0.000296,-0.000355,0.000746,...,76.660129,76.741528,30.953812,0.093681,30.974987,30.994419,65.600436,0.136236,65.634237,65.687817
2004-02-26,76.730988,31.060385,65.502968,0.000148,0.000610,0.000546,0.000199,0.001135,0.000994,0.001232,...,76.746258,76.825552,31.079327,0.135082,31.095646,31.112586,65.538714,0.114081,65.568112,65.606169
2004-02-27,76.784454,30.975674,65.583405,-0.000460,-0.000803,-0.000565,-0.000210,-0.000134,-0.000029,0.000380,...,76.768324,76.813613,30.950817,0.046655,30.971513,30.975232,65.546381,0.077311,65.581498,65.605112
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-09-12,655.592407,586.659973,458.011444,-0.001035,-0.001808,-0.000602,0.000287,0.001608,0.001420,-0.001800,...,655.780873,654.413271,585.600262,2.988917,587.604338,584.791927,457.735699,2.995127,458.662427,456.971710
2025-09-15,659.082703,591.679993,458.759308,-0.001395,-0.002602,-0.001526,-0.000335,0.000117,-0.000032,-0.001999,...,658.862186,657.766411,590.142676,3.864721,591.749001,589.304456,458.059864,1.334406,458.744577,457.339990
2025-09-16,658.175232,591.179993,457.472961,-0.000703,-0.000577,0.000134,0.000211,0.001422,0.001117,-0.000830,...,658.314035,657.628973,590.839135,2.765296,592.021444,590.326930,457.534488,0.646007,457.984062,457.080347
2025-09-17,657.357544,590.000000,459.945923,-0.001942,-0.003935,-0.002534,-0.000734,-0.001311,-0.001177,-0.002450,...,656.875296,655.749134,587.682762,0.862628,589.227209,587.071782,458.781711,1.236816,459.405101,458.184611


### Backtest

Set variables

Run backtest

Might be slower, but much easier to implement strategy using iterative process

In [499]:
# Create backtest df copy
bt = combined.copy()

# Set initial conditions
starting_cash = 10000
bt_data = {}

for index in indexes:
    bt_data[index] = {}

    for model in models:
        bt[f'{index}_portfolio_value_{model}'] = 0
        bt[f'{index}_cash_{model}'] = 0
        bt[f'{index}_cash_{model}'].iloc[0] = starting_cash

        bt_data[index][model] = {
            'last_value': 0,
            'last_cash': starting_cash
        }

# Starts at second day
for idx, row in bt.iloc[1:].iterrows():
    for index in indexes:
        for model in models:
            last_value = bt_data[index][model]['last_value']
            last_cash = bt_data[index][model]['last_cash']

            close = row[f'{index}_Close']
            pred = row[f'{index}_pred_{model}']
            distance = pred - close
            mu = pred/close - 1
            vol = row[f'{index}_vol']
            rf = row['ten_yr']
            investment = 0

            # Update portfolio value
            current_value = last_value * np.exp(row[f'{index}_log_ret'])

            # Calculate total equity
            total_equity = current_value + last_cash 

            # Get bet sizing with Merton-Kelly
            #target_pos = get_merton_kelly_size(mu*252, rf, vol*(252**2)) * total_equity # Annualize mu and vol
            #target_pos = max(-total_equity, min(target_pos, total_equity)) # Make sure we don't exceed equity


            # Get bet sizing with vol target
            #target_pos = get_vol_target_sizing(.1, vol*(252**2)) * total_equity
            #target_pos = max(-total_equity, min(target_pos, total_equity)) # Make sure we don't exceed equity
            
            # Bet sizing based on size of potential increase/decrease, signed
            target_pos = (mu*252 - rf)/abs(mu-rf) * 10000
            target_pos = max(-total_equity, min(target_pos, total_equity)) # Make sure we don't exceed equity

            # Execute trade / Update investment amount to reach target position
            investment = target_pos - current_value

            # Update cash
            last_cash = bt_data[index][model]['last_cash'] = last_cash - investment
            bt.loc[idx, f'{index}_cash_{model}'] = last_cash

            # Store portfolio value as last value for calculation
            bt_data[index][model]['last_value'] = current_value + investment
            bt.loc[idx, f'{index}_portfolio_value_{model}'] = bt_data[index][model]['last_value']
    
for index in indexes:
    for model in models:
        bt[f'{index}_total_value_{model}'] = bt[f'{index}_portfolio_value_{model}'] + bt[f'{index}_cash_{model}']
        final_value = bt[f'{index}_total_value_{model}'].iloc[-1]
        print(f'Final value for {index}, {model}: {final_value:.2f}')


ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy




A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy




Final value for S&P, ridge: 732423.16
Final value for S&P, lasso: 104465.20
Final value for S&P, pca: 298824.12
Final value for NASDAQ, ridge: 836688.23
Final value for NASDAQ, lasso: 157995.94
Final value for NASDAQ, pca: 453830.11
Final value for DJIA, ridge: 317554.88
Final value for DJIA, lasso: 75375.50
Final value for DJIA, pca: 178545.94


In [500]:
for index in indexes:
    bt[f'{index}_baseline'] = np.exp(bt[f'{index}_cum_log_ret']) * 10000

In [501]:
colors = {
    'ridge': 'red',
    'lasso': 'green',
    'pca': 'pink'
}

# Initialize a figure with a row for each index
fig = make_subplots(
    rows=len(indexes), 
    cols=1, 
    subplot_titles=[f'{index} Model Performance' for index in indexes],
    shared_xaxes=True # Link the x-axes
)

# --- 3. Loop Through Indexes and Models to Add Traces ---
# Enumerate through indexes to get the row number (i)
for i, index in enumerate(indexes):
    # Add the baseline trace for the current index
    fig.add_trace(go.Scatter(
        x=bt.index, 
        y=bt[f'{index}_baseline'],
        mode='lines',
        name=f'{index} Baseline',
        line=dict(color='blue', width=1)
    ), row=i + 1, col=1) # row is 1-indexed
    
    # Loop through the models to plot each one
    for model in models:
        fig.add_trace(go.Scatter(
            x=bt.index, 
            y=bt[f'{index}_total_value_{model}'],
            mode='lines',
            name=f'{index} {model}',
            line=dict(color=colors[model], width=1, dash='dot')
        ), row=i + 1, col=1)

# --- 4. Update the Layout ---
fig.update_layout(
    title_text='Backtest Performance: Model Comparison by Index',
    title_x=0.5, # Center the title
    legend_title='Metrics',
    height=800 # Adjust height to make plots more readable
)

# --- 5. Display the Chart ---
fig.show()

# Something seems broken here... returns look way too high compared to log growth we plotted earlier...

In [502]:
cutoff = math.floor(len(bt)*.8)
training_returns = bt.iloc[:cutoff]
testing_returns = bt.iloc[cutoff:]

for index in indexes:
    for model in models:
        print(f'Final training value for {index}, {model}: {training_returns[f'{index}_total_value_{model}'].iloc[-1]}, CAGR: {100*((training_returns[f'{index}_total_value_{model}'].iloc[-1]/training_returns[f'{index}_total_value_{model}'].iloc[0])**(1/(training_returns.index[-1].year - training_returns.index[0].year + training_returns.index[-1].month/12))-1):.2f}%')
        print(f'Final testing value for {index}, {model}: {testing_returns[f'{index}_total_value_{model}'].iloc[-1]}, CAGR: {100*((testing_returns[f'{index}_total_value_{model}'].iloc[-1]/testing_returns[f'{index}_total_value_{model}'].iloc[0])**(1/(testing_returns.index[-1].year - testing_returns.index[0].year + testing_returns.index[-1].month/12))-1):.2f}%')
        print('\n')

Final training value for S&P, ridge: 649443.5554735758, CAGR: 27.08%
Final testing value for S&P, ridge: 732423.1605815467, CAGR: 2.57%


Final training value for S&P, lasso: 86760.07700525882, CAGR: 13.21%
Final testing value for S&P, lasso: 104465.19725078132, CAGR: 3.94%


Final training value for S&P, pca: 264597.2377800202, CAGR: 20.69%
Final testing value for S&P, pca: 298824.12371342885, CAGR: 2.57%


Final training value for NASDAQ, ridge: 677147.7348672615, CAGR: 27.38%
Final testing value for NASDAQ, ridge: 836688.228062928, CAGR: 4.56%


Final training value for NASDAQ, lasso: 94876.71602472827, CAGR: 13.79%
Final testing value for NASDAQ, lasso: 157995.9431442135, CAGR: 11.36%


Final training value for NASDAQ, pca: 297603.53514324804, CAGR: 21.51%
Final testing value for NASDAQ, pca: 453830.1124448043, CAGR: 9.31%


Final training value for DJIA, ridge: 269156.15538750484, CAGR: 20.81%
Final testing value for DJIA, ridge: 317554.8810146355, CAGR: 3.53%


Final training val

Trading strategy 2: Trading using bollinger bands and moving average