# Notebook for CME Futures Challenge

### The Rough Idea

Model each index as a geometric Brownian motion: $dS/S = \mu\,dt + \sigma\,dB$  
Model $\mu$ (the drift, i.e. the market line) with a linear regression on several factors, including economic and credit measures  
Model $\sigma$ as a function of recent realized volatility, using an exponential moving average (EMA) so that older observations decay  
Go long or short futures that our model flags as mispriced  
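
As a rough illustration of the first bullet, a geometric Brownian motion path can be simulated with its exact log-normal step. This is only a sketch: `simulate_gbm` is a hypothetical helper, and the drift and volatility are held constant with made-up values, whereas the plan above is to estimate both from data.

```python
import numpy as np

def simulate_gbm(s0, mu, sigma, n_steps, dt=1/252, seed=0):
    """Simulate one GBM path (hypothetical helper, constant mu and sigma)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_steps)
    # Log return per step: (mu - sigma^2 / 2) * dt + sigma * sqrt(dt) * Z
    log_ret = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    # Prepend 0 so the path starts exactly at s0
    return s0 * np.exp(np.concatenate([[0.0], np.cumsum(log_ret)]))

# Illustrative annualized parameters, one year of daily steps
path = simulate_gbm(s0=100.0, mu=0.07, sigma=0.18, n_steps=252)
```

The `- sigma**2 / 2` term comes from Itô's lemma applied to `log S`; leaving it out would bias the simulated drift upward.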

# Downloading historical data for indices (S&P, NASDAQ, DJIA)

Imports

In [131]:
import yfinance as yf
import pandas as pd
import plotly.express as px
from typing import List, Dict

Make get_data function for downloading from yf

In [132]:
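def get_data(tickers: List[str]) -> Dict[str, pd.DataFrame]:
    """Download 20 years (240 months) of daily bars for each ticker."""
    data_dictionary = {}
    for ticker in tickers:
        data_dictionary[ticker] = yf.download(ticker, period='240mo', interval='1d')
    return data_dictionary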

Now let's get data for indices and display with pd

In [133]:
indices = ['^GSPC', '^IXIC', '^DJI'] # S&P, NASDAQ, DJIA
data_dictionary = get_data(indices)

s_p = pd.DataFrame(data_dictionary['^GSPC'])
nasdaq = pd.DataFrame(data_dictionary['^IXIC'])
djia = pd.DataFrame(data_dictionary['^DJI'])


YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed



In [134]:
s_p

Price,Close,High,Low,Open,Volume
Ticker,^GSPC,^GSPC,^GSPC,^GSPC,^GSPC
Date,,,,,
2005-09-20,1221.339966,1236.489990,1220.069946,1231.020020,2319250000
2005-09-21,1210.199951,1221.520020,1209.890015,1221.339966,2548150000
2005-09-22,1214.619995,1216.640015,1205.349976,1210.199951,2424720000
2005-09-23,1215.290039,1218.829956,1209.800049,1214.619995,1973020000
2005-09-26,1215.630005,1222.560059,1211.839966,1215.290039,2022220000
...,...,...,...,...,...
2025-09-15,6615.279785,6619.620117,6602.069824,6603.490234,5045020000
2025-09-16,6606.759766,6626.990234,6600.109863,6624.129883,5359510000
2025-09-17,6600.350098,6624.390137,6551.149902,6604.870117,5805340000
2025-09-18,6631.959961,6656.799805,6611.890137,6626.850098,5292400000


We need to flatten the columns: note the extra `Ticker` header level

In [135]:
s_p = s_p.droplevel(1, axis=1)
nasdaq = nasdaq.droplevel(1, axis=1)
djia = djia.droplevel(1, axis=1)
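
To see what `droplevel` does in isolation, here is a small sketch on a synthetic frame that mimics the `(Price, Ticker)` column layout shown above. The frame `demo` and its values are made up for illustration.

```python
import pandas as pd

# Synthetic frame with the same two-level column layout as a single-ticker download
cols = pd.MultiIndex.from_product([['Close', 'Volume'], ['^GSPC']],
                                  names=['Price', 'Ticker'])
demo = pd.DataFrame([[1221.34, 2319250000], [1210.20, 2548150000]],
                    index=pd.to_datetime(['2005-09-20', '2005-09-21']),
                    columns=cols)

# Dropping level 1 ('Ticker') leaves a flat column index
flat = demo.droplevel('Ticker', axis=1)  # equivalent to droplevel(1, axis=1)
print(list(flat.columns))
```

Referring to the level by name rather than position is slightly more robust if the column layout ever changes.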

In [136]:
s_p

Date,Close,High,Low,Open,Volume
2005-09-20,1221.339966,1236.489990,1220.069946,1231.020020,2319250000
2005-09-21,1210.199951,1221.520020,1209.890015,1221.339966,2548150000
2005-09-22,1214.619995,1216.640015,1205.349976,1210.199951,2424720000
2005-09-23,1215.290039,1218.829956,1209.800049,1214.619995,1973020000
2005-09-26,1215.630005,1222.560059,1211.839966,1215.290039,2022220000
...,...,...,...,...,...
2025-09-15,6615.279785,6619.620117,6602.069824,6603.490234,5045020000
2025-09-16,6606.759766,6626.990234,6600.109863,6624.129883,5359510000
2025-09-17,6600.350098,6624.390137,6551.149902,6604.870117,5805340000
2025-09-18,6631.959961,6656.799805,6611.890137,6626.850098,5292400000


Let's drop high, low, and open and rename columns

In [137]:
s_p.drop(columns=['High', 'Low', 'Open'], inplace=True)
nasdaq.drop(columns=['High', 'Low', 'Open'], inplace=True)
djia.drop(columns=['High', 'Low', 'Open'], inplace=True)

s_p = s_p.rename(columns={'Close': 'S&P_Close', 'Volume': 'S&P_Volume'})
nasdaq = nasdaq.rename(columns={'Close': 'NASDAQ_Close', 'Volume': 'NASDAQ_Volume'})
djia = djia.rename(columns={'Close': 'DJIA_Close', 'Volume': 'DJIA_Volume'})

Let's get a quick plot of an index

In [138]:
fig = px.line(s_p, x=s_p.index, y="S&P_Close", title="S&P Daily Past 20 Years")
fig.show()

# Downloading historical data for our factor model

We model each index as a geometric Brownian motion whose drift $\mu$ is given by a linear regression on several factors.  
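
A minimal sketch of the regression step, assuming ordinary least squares on synthetic data. The factor matrix `X`, the coefficients in `true_beta`, and the noise level are all invented for illustration; the real inputs would be the factors listed in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 3
X = rng.standard_normal((n, k))                 # stand-in for lagged, scaled factors
true_beta = np.array([0.002, -0.001, 0.0005])   # made-up factor loadings
y = 0.0003 + X @ true_beta + 0.0001 * rng.standard_normal(n)  # synthetic daily log returns

# Add an intercept column and solve the least-squares problem
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

mu_hat = A @ coef   # fitted conditional drift for each day
```

Real daily returns are far noisier than this toy example, so the recovered coefficients would be much less precise in practice.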

## Factor considerations:  
### <u>Term structure</u>
###### Term spread (10Y-3M)

### <u>Credit conditions</u>
###### IG spread (BAA-AAA)

### <u>Valuation</u>
###### Forward E/P - real 10Y
###### Dividend yield

### <u>Economic</u>
###### Fed funds
###### Inflation (CPI)
###### DXY change (dollar index)  

### Some of these we can get from Yahoo Finance:  

In [139]:
tickers = [
    # Term structure
    '^TNX', # CBOE 10-year Treasury yield index
    '^IRX', # 13-week T-bill rate (quoted on a discount basis, needs converting to a yield)

    # Economic
    'DX-Y.NYB', # Dollar index
]

data_dictionary = get_data(tickers)

ten_yr = pd.DataFrame(data_dictionary['^TNX']['Close'])
three_m = pd.DataFrame(data_dictionary['^IRX']['Close'])
dollar_index = pd.DataFrame(data_dictionary['DX-Y.NYB']['Close'])




Rename columns

In [150]:
ten_yr = ten_yr.rename(columns={'^TNX': 'ten_yr'})
three_m = three_m.rename(columns={'^IRX': 'three_m'})
dollar_index = dollar_index.rename(columns={'DX-Y.NYB': 'dollar_index'})
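
The `^IRX` comment above mentions converting the discount-basis quote to a yield. One common way to do that is the bond-equivalent yield formula, sketched below. The helper name `discount_to_bey` is hypothetical, and the 91-day maturity and 360-day discount convention are assumptions about how the quote should be interpreted.

```python
import pandas as pd

def discount_to_bey(discount_pct, days_to_maturity=91):
    """Convert a bank-discount rate (percent) to a bond-equivalent yield (percent).

    Assumes the 360-day discount convention and a 13-week (91-day) bill.
    """
    d = discount_pct / 100.0
    return 100.0 * 365.0 * d / (360.0 - d * days_to_maturity)

# Two sample discount quotes taken from the three_m column shown later
quotes = pd.Series([3.525, 3.342],
                   index=pd.to_datetime(['2005-09-20', '2005-09-21']))
yields = discount_to_bey(quotes)
```

The converted yield is always slightly higher than the discount quote, because the discount is earned on the purchase price rather than on face value.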

We should get dividend yield too

In [141]:
etfs = ['SPY', 'QQQ', 'DIA']
div_data = {}

for etf in etfs:
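    ticker = yf.Ticker(etf)
    div = ticker.dividends
    # history() returns split- and dividend-adjusted closes by default (auto_adjust=True),
    # so yields computed against them are overstated for older dates
    price = ticker.history(period='252mo')['Close']

    # Trailing 12-month dividend yield: sum dividends over the past 365 days,
    # carry that sum forward onto each trading day, then divide by price
    div_12m = div.rolling(window='365D', min_periods=1).sum()
    div_12m = div_12m.reindex(price.index, method='ffill')
    div_yield = div_12m / price
    div_data[etf] = div_yield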
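
The trailing-yield calculation is easier to check on a small synthetic example. The dividend amounts, dates, and the flat price of 100 below are made up; only the three pandas steps mirror the cell above.

```python
import pandas as pd

# Synthetic quarterly dividends and a flat daily price (values are made up)
div = pd.Series([0.5, 0.5, 0.5, 0.5, 0.6],
                index=pd.to_datetime(['2023-03-10', '2023-06-16', '2023-09-15',
                                      '2023-12-15', '2024-03-15']))
price = pd.Series(100.0, index=pd.bdate_range('2023-03-10', '2024-03-29'))

# Sum of dividends paid in the trailing 365 days, evaluated on each dividend date
div_12m = div.rolling(window='365D', min_periods=1).sum()
# Carry the latest trailing sum forward onto every trading day
div_12m = div_12m.reindex(price.index, method='ffill')
div_yield = div_12m / price
```

Because the trailing sum is only recomputed on dividend dates, an old payment stays in the sum until the next dividend arrives, which is usually acceptable for quarterly payers.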

Fix index for all 3 and rename columns

In [157]:
div_data['SPY'].index = pd.to_datetime(div_data['SPY'].index).normalize()
div_data['QQQ'].index = pd.to_datetime(div_data['QQQ'].index).normalize()
div_data['DIA'].index = pd.to_datetime(div_data['DIA'].index).normalize()

div_data['SPY'].name = 'SPY_div'
div_data['QQQ'].name = 'QQQ_div'
div_data['DIA'].name = 'DIA_div'

In [158]:
div_data['SPY']

Date
2004-09-20    0.028893
2004-09-21    0.028768
2004-09-22    0.029131
2004-09-23    0.029289
2004-09-24    0.029155
                ...   
2025-09-15    0.013546
2025-09-16    0.013565
2025-09-17    0.013582
2025-09-18    0.013518
2025-09-19    0.013560
Name: SPY_div, Length: 5285, dtype: float64

### pandas_datareader lets us download FRED data

In [143]:
from pandas_datareader import data as pdr
from datetime import datetime

In [144]:
start = datetime(2000,1,1) # Start date for download

# Macroeconomic data
gdp = pdr.DataReader("GDP", "fred", start)
cpi = pdr.DataReader("CPIAUCSL", "fred", start)
fedfunds = pdr.DataReader("FEDFUNDS", "fred", start)

# FEDFUNDS is a monthly series, so the most recent rate is missing; append it by hand
fedfunds = pd.concat([fedfunds['FEDFUNDS'], pd.Series([4.08], index=[datetime(2025,9,17)])])

# Credit risk data
ig_spread = pdr.DataReader("BAMLC0A4CBBB", "fred", start)   # ICE BofA BBB US Corporate option-adjusted spread
#hy_spread = pdr.DataReader("BAMLH0A0HYM2", "fred", start)   # BofA US High Yield spread
#baa_spread = pdr.DataReader("BAA10Y", "fred", start)        # Moody’s Baa – 10Y Treasury

Rename series

In [172]:
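# Only fedfunds is a Series here; cpi and ig_spread are DataFrames, so setting .name
# does not rename their columns (they remain 'CPIAUCSL' and 'BAMLC0A4CBBB' below)
cpi.name = 'CPI'
fedfunds.name = 'fed_funds'
ig_spread.name = 'credit_spread'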

In [173]:
fred_data = [gdp, cpi, fedfunds, ig_spread]

# Last business day <= today
last_bday = pd.bdate_range(end=pd.Timestamp.today().normalize(), periods=1)[0]

for i, df in enumerate(fred_data):
    s = df.squeeze() # make it a Series
    # Build a business-day index from the series start to last_bday
    bidx = pd.bdate_range(start=s.index.min(), end=last_bday)
    # Reindex to business days and forward-fill
    s = s.reindex(bidx, method='ffill')
    # Write back as a 1-col DataFrame with a proper name
    name = s.name if s.name else f"series_{i}"
    fred_data[i] = s.to_frame(name)
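
A small sketch of the same reindex-and-forward-fill step on a synthetic monthly series, to show how low-frequency observations land on business days. The values and the end date are made up for illustration.

```python
import pandas as pd

# Synthetic monthly series stamped on the first of each month (values are made up)
monthly = pd.Series([3.50, 3.62, 3.78],
                    index=pd.to_datetime(['2005-08-01', '2005-09-01', '2005-10-01']),
                    name='fed_funds')

# Business-day index from the first observation to an arbitrary end date
bidx = pd.bdate_range(start=monthly.index.min(), end='2005-10-07')

# Each business day takes the most recent observation at or before it
daily = monthly.reindex(bidx, method='ffill')
```

Note that 2005-10-01 falls on a Saturday. Passing `method='ffill'` to `reindex` still picks that observation up for the following Monday, whereas `reindex(bidx).ffill()` would drop it because the Saturday label is discarded before filling.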

In [174]:
fred_data[0]

Date,GDP
2000-01-03,10002.179
2000-01-04,10002.179
2000-01-05,10002.179
2000-01-06,10002.179
2000-01-07,10002.179
...,...
2025-09-15,30353.902
2025-09-16,30353.902
2025-09-17,30353.902
2025-09-18,30353.902


Let's build a master dataframe

In [175]:
data = s_p.join([nasdaq, djia, div_data['SPY'], div_data['QQQ'], div_data['DIA'], ten_yr, three_m, dollar_index, fred_data[0], fred_data[1], fred_data[2], fred_data[3]])
data

Date,S&P_Close,S&P_Volume,NASDAQ_Close,NASDAQ_Volume,DJIA_Close,DJIA_Volume,SPY_div,QQQ_div,DIA_div,ten_yr,three_m,dollar_index,GDP,CPIAUCSL,fed_funds,BAMLC0A4CBBB
2005-09-20,1221.339966,2.319250e+09,2131.330078,1.845670e+09,10481.519531,244560000.0,0.034084,0.012425,0.037398,4.243,3.525,88.599998,13142.642,198.800,3.62,1.23
2005-09-21,1210.199951,2.548150e+09,2106.639893,1.772370e+09,10378.030273,266650000.0,0.034406,0.012563,0.037758,4.188,3.342,88.139999,13142.642,198.800,3.62,1.24
2005-09-22,1214.619995,2.424720e+09,2110.780029,1.692930e+09,10422.049805,254260000.0,0.034284,0.012518,0.037642,4.176,3.380,88.559998,13142.642,198.800,3.62,1.25
2005-09-23,1215.290039,1.973020e+09,2116.840088,1.604120e+09,10419.589844,238590000.0,0.034255,0.012482,0.037620,4.248,3.390,89.209999,13142.642,198.800,3.62,1.25
2005-09-26,1215.630005,2.022220e+09,2121.459961,1.502410e+09,10443.629883,234320000.0,0.034216,0.012479,0.037498,4.294,3.425,89.110001,13142.642,198.800,3.62,1.24
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-09-15,6615.279785,5.045020e+09,22348.750000,8.904030e+09,45883.449219,401500000.0,0.013546,0.006052,0.015259,4.034,3.900,97.300003,30353.902,323.364,4.33,0.95
2025-09-16,6606.759766,5.359510e+09,22333.960938,8.592240e+09,45757.898438,443400000.0,0.013565,0.006057,0.015301,4.026,3.890,96.629997,30353.902,323.364,4.33,0.96
2025-09-17,6600.350098,5.805340e+09,22261.330078,9.325980e+09,46018.320312,509830000.0,0.013582,0.006069,0.015219,4.076,3.868,96.870003,30353.902,323.364,4.08,0.96
2025-09-18,6631.959961,5.292400e+09,22470.720703,1.047845e+10,46142.421875,489090000.0,0.013518,0.006015,0.015174,4.104,3.880,97.349998,30353.902,323.364,4.08,0.94


# Linear regression model

### Feature Engineering

We need to be careful not to include features such as raw moving averages, which would leak volatility information into our drift prediction.  
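
One possible shape for drift features is sketched below on a tiny synthetic frame: a term spread and a daily dollar-index change, both lagged by one day. The column names `ten_yr`, `three_m`, and `dollar_index` match the master dataframe; the feature names `term_spread` and `dxy_chg`, the values, and the one-day lag are assumptions rather than choices the notebook has settled on.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the master dataframe (values are made up)
demo = pd.DataFrame(
    {'ten_yr': [4.24, 4.19, 4.18, 4.25],
     'three_m': [3.53, 3.34, 3.38, 3.39],
     'dollar_index': [88.6, 88.1, 88.6, 89.2]},
    index=pd.bdate_range('2005-09-20', periods=4))

features = pd.DataFrame(index=demo.index)
features['term_spread'] = demo['ten_yr'] - demo['three_m']   # 10Y minus 3M
features['dxy_chg'] = np.log(demo['dollar_index']).diff()     # daily log change in DXY

# Lag by one day so the return on day t is paired with information from day t-1
features = features.shift(1)
```

Spreads and changes like these describe the level of rates and the dollar rather than the recent dispersion of index returns, which is the kind of volatility information the drift model is meant to avoid.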

In [186]:
import numpy as np

In [None]:
# Price-based

# Volume-based

# Log returns -- This will be our target variable
data['S&P_ret'] = np.log(data['S&P_Close']).diff()
data['NASDAQ_ret'] = np.log(data['NASDAQ_Close']).diff()
data['DJIA_ret'] = np.log(data['DJIA_Close']).diff()
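
A quick sketch of why log returns are a convenient target: they add up across days, so their sum over a window equals the log of the total price ratio. The four closes below are sample values for illustration.

```python
import numpy as np
import pandas as pd

close = pd.Series([1221.34, 1210.20, 1214.62, 1215.29])  # sample closes
log_ret = np.log(close).diff()   # first entry is NaN, as in the dataframe above

# Log returns are additive: their sum equals log(last / first)
total = log_ret.sum()            # sum() skips the leading NaN
direct = np.log(close.iloc[-1] / close.iloc[0])
```

This additivity also matches the GBM setup, where the log price evolves by summing independent increments.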

### Preprocessing Data

Let's check for NaNs

In [188]:
data.isna().sum()

S&P_Close        0
S&P_Volume       0
NASDAQ_Close     0
NASDAQ_Volume    0
DJIA_Close       0
DJIA_Volume      0
SPY_div          0
QQQ_div          0
DIA_div          0
ten_yr           0
three_m          0
dollar_index     0
GDP              0
CPIAUCSL         0
fed_funds        0
BAMLC0A4CBBB     0
S&P_ret          1
NASDAQ_ret       1
DJIA_ret         1
dtype: int64

Impute NaNs with the column mean (here only the first row of each return series is NaN)

In [189]:
data['ten_yr'] = data['ten_yr'].fillna(data['ten_yr'].mean())
data['three_m'] = data['three_m'].fillna(data['three_m'].mean())
data['dollar_index'] = data['dollar_index'].fillna(data['dollar_index'].mean())
data['BAMLC0A4CBBB'] = data['BAMLC0A4CBBB'].fillna(data['BAMLC0A4CBBB'].mean())
data['S&P_ret'] = data['S&P_ret'].fillna(data['S&P_ret'].mean())
data['NASDAQ_ret'] = data['NASDAQ_ret'].fillna(data['NASDAQ_ret'].mean())
data['DJIA_ret'] = data['DJIA_ret'].fillna(data['DJIA_ret'].mean())

In [190]:
data

Date,S&P_Close,S&P_Volume,NASDAQ_Close,NASDAQ_Volume,DJIA_Close,DJIA_Volume,SPY_div,QQQ_div,DIA_div,ten_yr,three_m,dollar_index,GDP,CPIAUCSL,fed_funds,BAMLC0A4CBBB,S&P_ret,NASDAQ_ret,DJIA_ret
2005-09-20,1221.339966,2.319250e+09,2131.330078,1.845670e+09,10481.519531,244560000.0,0.034084,0.012425,0.037398,4.243,3.525,88.599998,13142.642,198.800,3.62,1.23,0.000337,0.000470,0.000295
2005-09-21,1210.199951,2.548150e+09,2106.639893,1.772370e+09,10378.030273,266650000.0,0.034406,0.012563,0.037758,4.188,3.342,88.139999,13142.642,198.800,3.62,1.24,-0.009163,-0.011652,-0.009923
2005-09-22,1214.619995,2.424720e+09,2110.780029,1.692930e+09,10422.049805,254260000.0,0.034284,0.012518,0.037642,4.176,3.380,88.559998,13142.642,198.800,3.62,1.25,0.003646,0.001963,0.004233
2005-09-23,1215.290039,1.973020e+09,2116.840088,1.604120e+09,10419.589844,238590000.0,0.034255,0.012482,0.037620,4.248,3.390,89.209999,13142.642,198.800,3.62,1.25,0.000551,0.002867,-0.000236
2005-09-26,1215.630005,2.022220e+09,2121.459961,1.502410e+09,10443.629883,234320000.0,0.034216,0.012479,0.037498,4.294,3.425,89.110001,13142.642,198.800,3.62,1.24,0.000280,0.002180,0.002305
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-09-15,6615.279785,5.045020e+09,22348.750000,8.904030e+09,45883.449219,401500000.0,0.013546,0.006052,0.015259,4.034,3.900,97.300003,30353.902,323.364,4.33,0.95,0.004696,0.009335,0.001074
2025-09-16,6606.759766,5.359510e+09,22333.960938,8.592240e+09,45757.898438,443400000.0,0.013565,0.006057,0.015301,4.026,3.890,96.629997,30353.902,323.364,4.33,0.96,-0.001289,-0.000662,-0.002740
2025-09-17,6600.350098,5.805340e+09,22261.330078,9.325980e+09,46018.320312,509830000.0,0.013582,0.006069,0.015219,4.076,3.868,96.870003,30353.902,323.364,4.08,0.96,-0.000971,-0.003257,0.005675
2025-09-18,6631.959961,5.292400e+09,22470.720703,1.047845e+10,46142.421875,489090000.0,0.013518,0.006015,0.015174,4.104,3.880,97.349998,30353.902,323.364,4.08,0.94,0.004778,0.009362,0.002693


Let's normalize our input values

In [184]:
from sklearn.preprocessing import StandardScaler

In [None]:
normalizer = StandardScaler()  # instantiate the scaler (the class itself cannot be fitted)

data[[]]
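
A minimal sketch of the standardization step, written with plain pandas so it runs without scikit-learn. It computes the same statistics that `StandardScaler` stores after `fit` (column mean and population standard deviation). The synthetic columns, the 200/100 split, and the choice to fit on the earlier window only are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic factor columns (values are made up)
X = pd.DataFrame(rng.normal(loc=[4.0, 90.0], scale=[0.5, 5.0], size=(300, 2)),
                 columns=['ten_yr', 'dollar_index'])

train, test = X.iloc[:200], X.iloc[200:]   # chronological split, no shuffling

# Fit the scaling statistics on the training window only
mean = train.mean()
std = train.std(ddof=0)                    # population std, as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std          # reuse training statistics on later data
```

Fitting on the training window and reusing those statistics keeps information from later dates out of the scaled inputs, which matters for a time-series backtest.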