Data: 
Option prices and implied volatility from post-no-preference option chain dataset. data spans from 2019-current. collected more recently on mon-wed-fri. Many but not all options are posted there for each security.

AAPL price data from Kaggle(?)

Treasury Bond rates from home.treasury.gov. Daily treasury par yield rates

In [6]:
import pandas as pd
import numpy as np
#10m rows takes about 30 seconds... Expect long processing times for full data. do in batches.
#f10m = pd.read_csv('pnp_options.csv', nrows=10000000)

f1m = pd.read_csv('data/pnp_options.csv', usecols=['date', 'act_symbol', 'expiration', 'strike', 'call_put', 'bid', 'ask',
       'vol'])

**Note**

try using adj close and regular close.. or both?

In [77]:
prices = pd.read_csv('data/AAPL.csv', usecols=['Date', 'Close'])
rates19 = pd.read_csv('data/treasury_2019.csv', usecols=['Date', '1 Mo', '2 Mo', '3 Mo'])
rates20 = pd.read_csv('data/treasury_2020.csv', usecols=['Date', '1 Mo', '2 Mo', '3 Mo'])
rates21 = pd.read_csv('data/treasury_2021.csv', usecols=['Date', '1 Mo', '2 Mo', '3 Mo'])
rates22 = pd.read_csv('data/treasury_2022.csv', usecols=['Date', '1 Mo', '2 Mo', '3 Mo'])
rates23 = pd.read_csv('data/treasury_2023.csv', usecols=['Date', '1 Mo', '2 Mo', '3 Mo'])
rates24 = pd.read_csv('data/treasury_2024.csv', usecols=['Date', '1 Mo', '2 Mo', '3 Mo'])

rates = pd.concat([rates24, rates23, rates22, rates21, rates20, rates19])

In [12]:
display(f1m.head(), f1m.tail())

Unnamed: 0,date,act_symbol,expiration,strike,call_put,bid,ask,vol
0,2019-02-09,A,2019-02-15,65.0,Call,10.5,11.25,0.2705
1,2019-02-09,A,2019-02-15,65.0,Put,0.0,0.03,0.3133
2,2019-02-09,A,2019-02-15,67.5,Call,8.15,8.5,0.2705
3,2019-02-09,A,2019-02-15,67.5,Put,0.0,0.03,0.3133
4,2019-02-09,A,2019-02-15,70.0,Call,5.7,6.0,0.2705


Unnamed: 0,date,act_symbol,expiration,strike,call_put,bid,ask,vol
66911463,2024-08-16,ZTS,2024-10-18,220.0,Put,34.3,38.6,0.2853
66911464,2024-08-16,ZTS,2024-10-18,230.0,Call,0.0,0.75,0.4004
66911465,2024-08-16,ZTS,2024-10-18,230.0,Put,44.1,48.6,0.4004
66911466,2024-08-16,ZTS,2024-10-18,240.0,Call,0.0,2.45,0.5154
66911467,2024-08-16,ZTS,2024-10-18,240.0,Put,54.1,58.6,0.5154


### NOTE...

**The following data engineering assumes**:

1. we are only interested in AAPL options. if we want to expand, we can simply modify the first line to include more act_symbol values

2. our prices dataframe includes prices covering the entire range of dates for option prices, plus an extra n_prices before the earliest option price, such that we can utilize the n_prices preceding the option pricing in our LSTM model down the line 

3. our rates dataframe also contains treasury rates with dates covering all issuance dates of options.
    
4. all of our option time-to-expiries are closest to 1-3 months, not 6+ months. in the 1m row data the longest time-to-expiry is 62 days. if we see in the full data a time-to-expiry longer than 135 days, we need to add an option to use the 6 month treasury rate in our determine_r function.


**Also**, you will probably want to split this cell into multiple smaller ones when we're working with the full data. some of the actions might be computationally expensive.

**Handling Rate Data**

Preparing for merge w/ options data..

In [79]:
rate = rates.copy()
rate['Date'] = pd.to_datetime(rate['Date'], format='%m/%d/%Y')

# Set the 'Date' column as the index
rate.set_index('Date', inplace=True)

# Create a complete date range from the first to the last date in the data
full_date_range = pd.date_range(start=rate.index.min(), end=rate.index.max(), freq='D')

# Reindex the DataFrame to include every day, filling missing dates with NaN
rate_reindexed = rate.reindex(full_date_range)

# Interpolate missing values for all columns
rate_reindexed[['1 Mo', '2 Mo', '3 Mo']] = rate_reindexed[['1 Mo', '2 Mo', '3 Mo']].interpolate(method='linear')

# Reset the index to bring the 'Date' back as a column (if needed)
rate_reindexed.reset_index(inplace=True)
rate_reindexed.rename(columns={'index': 'Date'}, inplace=True)
rate_reindexed['Date'] = rate_reindexed['Date'].dt.strftime('%Y%m%d').astype(int)
display(rate_reindexed.head())

Unnamed: 0,Date,1 Mo,2 Mo,3 Mo
0,20190102,2.4,2.4,2.42
1,20190103,2.42,2.42,2.41
2,20190104,2.4,2.42,2.42
3,20190105,2.406667,2.42,2.43
4,20190106,2.413333,2.42,2.44


**Handling Options Data**

Preparing for merge

In [80]:
pd.options.mode.chained_assignment = None  # default='warn'
options = f1m[f1m['act_symbol'] == 'AAPL']
options['days_expiry'] = (pd.to_datetime(options['expiration']) - pd.to_datetime(options['date'])).dt.days
options.drop(['expiration'], axis=1, inplace=True)

options['date'] = pd.to_numeric(options['date'].str.replace('-',''))
display(options.head())

Unnamed: 0,date,act_symbol,strike,call_put,bid,ask,vol,days_expiry
142,20190209,AAPL,145.0,Call,25.3,26.05,0.4236,13
143,20190209,AAPL,145.0,Put,0.06,0.08,0.3886,13
144,20190209,AAPL,152.5,Call,17.9,18.55,0.3267,13
145,20190209,AAPL,152.5,Put,0.12,0.16,0.3166,13
146,20190209,AAPL,157.5,Call,13.3,13.55,0.295,13


**Merging Options with Rate Data**

Merging, then obtaining appropriate rates based on the time-to-expiry of the option

In [144]:
options_rates = pd.merge(options, rate_reindexed, left_on='date', right_on='Date', how='left')

def determine_r(row):
    if row['days_expiry'] < 45:
        return row['1 Mo']
    elif 45 <= row['days_expiry'] < 75:
        return row['2 Mo']
    else:
        return row['3 Mo']

options_rates['r'] = options_rates.apply(determine_r, axis=1)
options_rates.drop(['Date', '1 Mo', '2 Mo', '3 Mo'], axis=1, inplace=True)
display(options_rates.head(), options_rates.tail())

Unnamed: 0,date,act_symbol,strike,call_put,bid,ask,vol,days_expiry,r
0,20190209,AAPL,145.0,Call,25.3,26.05,0.4236,13,2.433333
1,20190209,AAPL,145.0,Put,0.06,0.08,0.3886,13,2.433333
2,20190209,AAPL,152.5,Call,17.9,18.55,0.3267,13,2.433333
3,20190209,AAPL,152.5,Put,0.12,0.16,0.3166,13,2.433333
4,20190209,AAPL,157.5,Call,13.3,13.55,0.295,13,2.433333


Unnamed: 0,date,act_symbol,strike,call_put,bid,ask,vol,days_expiry,r
100053,20240816,AAPL,285.0,Put,58.1,59.35,0.3585,63,5.4
100054,20240816,AAPL,290.0,Call,0.06,0.09,0.2431,63,5.4
100055,20240816,AAPL,290.0,Put,63.15,64.35,0.3831,63,5.4
100056,20240816,AAPL,295.0,Call,0.05,0.08,0.2535,63,5.4
100057,20240816,AAPL,295.0,Put,68.1,69.35,0.4005,63,5.4


**handling price data**

preparing for merge w/ options data..

includes producing columns representing the preceding (n) days of prices at any given date.

In [143]:
n_timesteps = 20

aapl_prices = prices.copy()
aapl_prices['Date'] = pd.to_datetime(aapl_prices['Date'], format='%Y-%m-%d')
#.dt.strftime('%Y%m%d')
#reindexing..
aapl_prices.set_index('Date', inplace=True)
# Create a complete date range from the first to the last date in the data
full_date_range = pd.date_range(start=aapl_prices.index.min(), end=aapl_prices.index.max(), freq='D')
aapl_reindexed = aapl_prices.reindex(full_date_range)

# fill NaN values for price..
aapl_reindexed.ffill(inplace=True)
aapl_reindexed.reset_index(inplace=True)
aapl_reindexed.rename(columns={'index': 'Date'}, inplace=True)

aapl_reindexed['Date'] = pd.to_numeric(aapl_reindexed['Date'].dt.strftime('%Y%m%d'))
display(aapl_reindexed.head())

#obtaining subset of aapl prices starting n_timesteps days before the earliest date in options_rates, and 
#ending on the latest date in options_rates
startIndex = aapl_reindexed.index[min(options_rates['date']) == aapl_reindexed['Date']].tolist()[0] - n_timesteps
endIndex = aapl_reindexed.index[max(options_rates['date']) == aapl_reindexed['Date']].tolist()[0] + 1
aapl_reindexed = aapl_reindexed.iloc[startIndex:endIndex,:]
#creating dataframe w/ n+2 columns indicating the previous n prices and the current price at a given day
stepData = []
for i in range(len(aapl_reindexed) - n_timesteps):
    date = aapl_reindexed['Date'].iloc[i + n_timesteps] 
    n_prices = aapl_reindexed['Close'].iloc[i:i + n_timesteps + 1].tolist() 
    stepData.append([date] + n_prices)
columns = ['Date'] + [f't{i+1}' for i in range(n_timesteps)] + ['currentP']
lastn_prices = pd.DataFrame(stepData, columns=columns)
display(lastn_prices.head(), lastn_prices.tail())


Unnamed: 0,Date,Close
0,20181206,43.68
1,20181207,42.122501
2,20181208,42.122501
3,20181209,42.122501
4,20181210,42.400002


Unnamed: 0,Date,t1,t2,t3,t4,t5,t6,t7,t8,t9,...,t12,t13,t14,t15,t16,t17,t18,t19,t20,currentP
0,20190209,39.205002,39.205002,38.325001,38.48,38.174999,39.439999,39.439999,39.439999,39.075001,...,41.610001,41.630001,41.630001,41.630001,42.8125,43.544998,43.560001,42.735001,42.602501,42.602501
1,20190210,39.205002,38.325001,38.48,38.174999,39.439999,39.439999,39.439999,39.075001,38.669998,...,41.630001,41.630001,41.630001,42.8125,43.544998,43.560001,42.735001,42.602501,42.602501,42.602501
2,20190211,38.325001,38.48,38.174999,39.439999,39.439999,39.439999,39.075001,38.669998,41.3125,...,41.630001,41.630001,42.8125,43.544998,43.560001,42.735001,42.602501,42.602501,42.602501,42.357498
3,20190212,38.48,38.174999,39.439999,39.439999,39.439999,39.075001,38.669998,41.3125,41.610001,...,41.630001,42.8125,43.544998,43.560001,42.735001,42.602501,42.602501,42.602501,42.357498,42.7225
4,20190213,38.174999,39.439999,39.439999,39.439999,39.075001,38.669998,41.3125,41.610001,41.630001,...,42.8125,43.544998,43.560001,42.735001,42.602501,42.602501,42.602501,42.357498,42.7225,42.544998


Unnamed: 0,Date,t1,t2,t3,t4,t5,t6,t7,t8,t9,...,t12,t13,t14,t15,t16,t17,t18,t19,t20,currentP
2011,20240812,225.009995,218.539993,217.490005,217.960007,217.960007,217.960007,218.240005,218.800003,222.080002,...,219.860001,219.860001,209.270004,207.229996,209.820007,213.309998,216.240005,216.240005,216.240005,217.529999
2012,20240813,218.539993,217.490005,217.960007,217.960007,217.960007,218.240005,218.800003,222.080002,218.360001,...,219.860001,209.270004,207.229996,209.820007,213.309998,216.240005,216.240005,216.240005,217.529999,221.270004
2013,20240814,217.490005,217.960007,217.960007,217.960007,218.240005,218.800003,222.080002,218.360001,219.860001,...,209.270004,207.229996,209.820007,213.309998,216.240005,216.240005,216.240005,217.529999,221.270004,221.720001
2014,20240815,217.960007,217.960007,217.960007,218.240005,218.800003,222.080002,218.360001,219.860001,219.860001,...,207.229996,209.820007,213.309998,216.240005,216.240005,216.240005,217.529999,221.270004,221.720001,224.720001
2015,20240816,217.960007,217.960007,218.240005,218.800003,222.080002,218.360001,219.860001,219.860001,219.860001,...,209.820007,213.309998,216.240005,216.240005,216.240005,217.529999,221.270004,221.720001,224.720001,226.050003


**Complete Dataset Merge**

Merging price step data with options_rates data, and replacing bid/ask price with 'fair price' of option..

In [145]:
full_data = pd.merge_asof(options_rates, lastn_prices, left_on='date', right_on='Date', direction='backward')
column_order = ['currentP'] + full_data.columns.drop(['currentP']).tolist()
full_data = full_data.loc[:,column_order]
full_data['option_fp'] = (full_data['ask'] + full_data['bid'])/2
full_data = full_data.drop(['date', 'act_symbol', 'bid', 'ask', 'Date'], axis=1)
display(full_data.head(), full_data.tail())

Unnamed: 0,currentP,strike,call_put,vol,days_expiry,r,t1,t2,t3,t4,...,t12,t13,t14,t15,t16,t17,t18,t19,t20,option_fp
0,42.602501,145.0,Call,0.4236,13,2.433333,39.205002,39.205002,38.325001,38.48,...,41.610001,41.630001,41.630001,41.630001,42.8125,43.544998,43.560001,42.735001,42.602501,25.675
1,42.602501,145.0,Put,0.3886,13,2.433333,39.205002,39.205002,38.325001,38.48,...,41.610001,41.630001,41.630001,41.630001,42.8125,43.544998,43.560001,42.735001,42.602501,0.07
2,42.602501,152.5,Call,0.3267,13,2.433333,39.205002,39.205002,38.325001,38.48,...,41.610001,41.630001,41.630001,41.630001,42.8125,43.544998,43.560001,42.735001,42.602501,18.225
3,42.602501,152.5,Put,0.3166,13,2.433333,39.205002,39.205002,38.325001,38.48,...,41.610001,41.630001,41.630001,41.630001,42.8125,43.544998,43.560001,42.735001,42.602501,0.14
4,42.602501,157.5,Call,0.295,13,2.433333,39.205002,39.205002,38.325001,38.48,...,41.610001,41.630001,41.630001,41.630001,42.8125,43.544998,43.560001,42.735001,42.602501,13.425


Unnamed: 0,currentP,strike,call_put,vol,days_expiry,r,t1,t2,t3,t4,...,t12,t13,t14,t15,t16,t17,t18,t19,t20,option_fp
100053,226.050003,285.0,Put,0.3585,63,5.4,217.960007,217.960007,218.240005,218.800003,...,209.820007,213.309998,216.240005,216.240005,216.240005,217.529999,221.270004,221.720001,224.720001,58.725
100054,226.050003,290.0,Call,0.2431,63,5.4,217.960007,217.960007,218.240005,218.800003,...,209.820007,213.309998,216.240005,216.240005,216.240005,217.529999,221.270004,221.720001,224.720001,0.075
100055,226.050003,290.0,Put,0.3831,63,5.4,217.960007,217.960007,218.240005,218.800003,...,209.820007,213.309998,216.240005,216.240005,216.240005,217.529999,221.270004,221.720001,224.720001,63.75
100056,226.050003,295.0,Call,0.2535,63,5.4,217.960007,217.960007,218.240005,218.800003,...,209.820007,213.309998,216.240005,216.240005,216.240005,217.529999,221.270004,221.720001,224.720001,0.065
100057,226.050003,295.0,Put,0.4005,63,5.4,217.960007,217.960007,218.240005,218.800003,...,209.820007,213.309998,216.240005,216.240005,216.240005,217.529999,221.270004,221.720001,224.720001,68.725


**Export Data**

In [139]:
full_data.to_csv('data/full_data.csv', index=False)

In [99]:
df = f1m[f1m['act_symbol'] == 'AAPL']
aapl_prices = prices.copy()
rate = rates.copy()

#remove expiration date, replace with int # of days until expiration
df['days_expiry'] = (pd.to_datetime(df['expiration']) - pd.to_datetime(df['date'])).dt.days
df = df.drop(['expiration'], axis=1)

#format dates on df, aapl_prices, rates to match each other
df['date'] = pd.to_numeric(df['date'].str.replace('-',''))
aapl_prices = aapl_prices.drop(['Open', 'High', 'Low', 'Close', 'Volume'], axis=1)
aapl_prices['Date'] = pd.to_datetime(aapl_prices['Date'], format='%Y-%m-%d')
#reindexing aapl prices..
aapl_prices.set_index('Date', inplace=True)
# Create a complete date range from the first to the last date in the data
full_date_range = pd.date_range(start=aapl_prices.index.min(), end=aapl_prices.index.max(), freq='D')
aapl_reindexed = aapl_prices.reindex(full_date_range)

# fill NaN values for price..
aapl_reindexed.ffill(inplace=True)
aapl_reindexed.reset_index(inplace=True)
aapl_reindexed.rename(columns={'index': 'Date'}, inplace=True)

aapl_reindexed['Date'] = pd.to_numeric(aapl_reindexed['Date'].dt.strftime('%Y%m%d'))
rate['Date'] = pd.to_datetime(rate['Date'], format='%m/%d/%Y')

# Set the 'Date' column as the index
rate.set_index('Date', inplace=True)

# Create a complete date range from the first to the last date in the data
full_date_range = pd.date_range(start=rate.index.min(), end=rate.index.max(), freq='D')

# Reindex the DataFrame to include every day, filling missing dates with NaN
rate_reindexed = rate.reindex(full_date_range)

# Interpolate missing values for all columns
rate_reindexed[['1 Mo', '2 Mo', '3 Mo', '6 Mo', '1 Yr']] = rate_reindexed[['1 Mo', '2 Mo', '3 Mo', '6 Mo', '1 Yr']].interpolate(method='linear')

# Reset the index to bring the 'Date' back as a column (if needed)
rate_reindexed.reset_index(inplace=True)
rate_reindexed.rename(columns={'index': 'Date'}, inplace=True)
rate_reindexed['Date'] = rate_reindexed['Date'].dt.strftime('%Y%m%d').astype(int)
df = pd.merge(df, rate_reindexed, left_on='date', right_on='Date', how='left')

#choose risk free rate 'r', based on which treasury rate matures closest to the expiration date of the option.
#Then drop other treasury columns leaving just 'r'
def determine_r(row):
    if row['days_expiry'] < 45:
        return row['1 Mo']
    elif 45 <= row['days_expiry'] < 75:
        return row['2 Mo']
    else:
        return row['3 Mo']

df['r'] = df.apply(determine_r, axis=1)
df = df.drop(['1 Mo', '2 Mo', '3 Mo', '6 Mo', '1 Yr'], axis=1)

#Now to the price dataframe...

#create df with n_prices + 1 columns. first column indicating the date on the last pricing. other n_prices columns will be the n_prices leading up to the current date.
#then we will merge again on date, using most recent closing price preceding option pricing.
n = 20
#obtain df of all prices needed for last_n_prices df... and convert closing price to decimal type
display(aapl_prices.loc[40:45])
print(min(df['date']))
minDateIndex = aapl_prices.index[aapl_prices['Date'] == min(df['date'])-1].tolist()[0]
maxDateIndex = aapl_prices.index[aapl_prices['Date'] == max(df['date'])-1].tolist()[0]
print(minDateIndex, maxDateIndex)
display(aapl_prices.head())
print(aapl_prices.iloc[minDateIndex])
print(aapl_prices.iloc[minDateIndex-1])
prices_all = aapl_prices.loc[maxDateIndex:(minDateIndex+n),].sort_values(by=['Date'])
print(prices_all.head())
prices_all = prices_all.drop(['Open', 'High', 'Low', 'Close', 'Volume'], axis=1)
print(prices_all.head())
#creating dataframe w/ n+2 columns indicating the previous n prices and the current price at a given day
stepData = []
for i in range(len(prices_all) - n):
    date = prices_all['Date'].iloc[i + n] 
    n_prices = prices_all['Adj Close'].iloc[i:i + n + 1].tolist() 
    stepData.append([date] + n_prices)
columns = ['Date'] + [f't{i+1}' for i in range(n)] + ['currentP']
lastn_prices = pd.DataFrame(stepData, columns=columns)
#print(lastn_prices.head())
#merging historical prices with other attributes
df = pd.merge_asof(df, lastn_prices, on='Date', direction='backward')

#Create column for 'fair price' of option, just average of bid and ask... then drop out the rows not being used in this first iteration...
df['option_fp'] = (df['ask'] + df['bid'])/2
inputs_df = df.drop(['date', 'act_symbol', 'bid', 'ask', 'Date'], axis=1)

#shift current price column alongside other spatial (MLP) parameters
inputs_df = inputs_df.loc[:,['currentP', 'strike', 'call_put', 'vol', 'days_expiry', 'r', 't1', 't2', 't3', 't4',
       't5', 't6', 't7', 't8', 't9', 't10', 't11', 't12', 't13', 't14', 't15',
       't16', 't17', 't18', 't19', 't20', 'option_fp']]

display(inputs_df.head())

KeyError: "['Open', 'High', 'Low', 'Volume'] not found in axis"

In [5]:
inputs_df.to_csv('input_data.csv', index=False)

**Preparing Train/Test Data**

In [11]:
from sklearn.model_selection import train_test_split

#y is option_fp
#train and test calls and puts seperately (obviously)
display(inputs_df.head())

call_df = inputs_df[inputs_df['call_put'] == "Call"]
call_df = call_df.drop(['call_put'], axis=1)
put_df = inputs_df[inputs_df['call_put'] == "Put"]
put_df = put_df.drop(['call_put'], axis=1)

display(call_df.tail())
CALL_X_train, CALL_X_test, CALL_Y_train, CALL_Y_test = train_test_split(call_df.drop(['option_fp'], axis=1).values, call_df['option_fp'].values, 
                                                                        test_size=.1, random_state=1)

PUT_X_train, PUT_X_test, PUT_Y_train, PUT_Y_test = train_test_split(put_df.drop(['option_fp'], axis=1).values, put_df['option_fp'].values, 
                                                                        test_size=.1, random_state=1)

#for input to LSTM-MLP, must split inputs into state inputs (for LSTM), and non-state inputs. We can pass the state inputs seperately through the
#LSTM, then take the LSTM output and concatenate it with the remaining non-state inputs. Thus we must split the inputs into two list elements,
#the first of which being a list of state-inputs (all price data), the second of which being a list of the remaining inputs.

#Note... Should we put the current asset price as part of the state inputs, or the non-state inputs?? try both....
#let's start with putting all pricing info in the state:
#dropping volatility from the LSTM-MLP inputs... Need to keep it in train/test split s.t. we can test the error of BSM using same y data.
BSM_CALL_X_train, BSM_CALL_X_test, BSM_CALL_Y_train, BSM_CALL_Y_test = CALL_X_train, CALL_X_test, CALL_Y_train, CALL_Y_test

LSTM_CALL_X_train, LSTM_CALL_X_test, LSTM_CALL_Y_train, LSTM_CALL_Y_test = np.delete(CALL_X_train,2,1), np.delete(CALL_X_test,2,1), CALL_Y_train, CALL_Y_test
LSTM_PUT_X_train, LSTM_PUT_X_test, LSTM_PUT_Y_train, LSTM_PUT_Y_test = np.delete(PUT_X_train,2,1), np.delete(PUT_X_test,2,1), PUT_Y_train, PUT_Y_test
print(LSTM_CALL_X_train)
LSTM_CALL_X_train = [LSTM_CALL_X_train[:,4:].reshape(LSTM_CALL_X_train.shape[0],n,1), LSTM_CALL_X_train[:,:4]]
LSTM_CALL_X_test = [LSTM_CALL_X_test[:,4:].reshape(LSTM_CALL_X_test.shape[0],n,1), LSTM_CALL_X_test[:,:4]]
LSTM_PUT_X_train = [LSTM_PUT_X_train[:,4:].reshape(LSTM_PUT_X_train.shape[0],n,1), LSTM_PUT_X_train[:,:4]]
LSTM_PUT_X_test = [LSTM_PUT_X_test[:,4:].reshape(LSTM_PUT_X_test.shape[0],n,1), LSTM_PUT_X_test[:,:4]]
print(LSTM_CALL_X_train[0][0], LSTM_CALL_X_train[1][0])

Unnamed: 0,currentP,strike,call_put,vol,days_expiry,r,t1,t2,t3,t4,...,t12,t13,t14,t15,t16,t17,t18,t19,t20,option_fp
0,170.41,145.0,Call,0.4236,13,,153.8,152.29,150.0,153.07,...,156.3,154.68,165.25,166.44,166.52,171.25,174.18,174.24,170.94,25.675
1,170.41,145.0,Call,0.4236,13,,153.8,152.29,150.0,153.07,...,156.3,154.68,165.25,166.44,166.52,171.25,174.18,174.24,170.94,25.675
2,170.41,145.0,Call,0.4236,13,,153.8,152.29,150.0,153.07,...,156.3,154.68,165.25,166.44,166.52,171.25,174.18,174.24,170.94,25.675
3,170.41,145.0,Call,0.4236,13,,153.8,152.29,150.0,153.07,...,156.3,154.68,165.25,166.44,166.52,171.25,174.18,174.24,170.94,25.675
4,170.41,145.0,Call,0.4236,13,,153.8,152.29,150.0,153.07,...,156.3,154.68,165.25,166.44,166.52,171.25,174.18,174.24,170.94,25.675


Unnamed: 0,currentP,strike,vol,days_expiry,r,t1,t2,t3,t4,t5,...,t12,t13,t14,t15,t16,t17,t18,t19,t20,option_fp
19945,172.91,200.0,0.1985,48,,170.94,170.41,169.43,170.89,170.18,...,174.23,174.33,174.87,173.15,174.97,175.85,175.53,174.52,172.5,1.13
19946,172.91,200.0,0.1985,48,,170.94,170.41,169.43,170.89,170.18,...,174.23,174.33,174.87,173.15,174.97,175.85,175.53,174.52,172.5,1.13
19947,172.91,200.0,0.1985,48,,170.94,170.41,169.43,170.89,170.18,...,174.23,174.33,174.87,173.15,174.97,175.85,175.53,174.52,172.5,1.13
19948,172.91,200.0,0.1985,48,,170.94,170.41,169.43,170.89,170.18,...,174.23,174.33,174.87,173.15,174.97,175.85,175.53,174.52,172.5,1.13
19949,172.91,200.0,0.1985,48,,170.94,170.41,169.43,170.89,170.18,...,174.23,174.33,174.87,173.15,174.97,175.85,175.53,174.52,172.5,1.13


[[174.97 150.    47.   ... 174.33 174.87 173.15]
 [170.41 195.    27.   ... 174.18 174.24 170.94]
 [172.97 145.    54.   ... 170.93 172.03 171.06]
 ...
 [170.41 157.5   27.   ... 174.18 174.24 170.94]
 [172.97 190.    27.   ... 170.93 172.03 171.06]
 [170.41 162.5   13.   ... 174.18 174.24 170.94]]
[[166.44]
 [166.52]
 [171.25]
 [174.18]
 [174.24]
 [170.94]
 [170.41]
 [169.43]
 [170.89]
 [170.18]
 [170.8 ]
 [170.42]
 [170.93]
 [172.03]
 [171.06]
 [172.97]
 [174.23]
 [174.33]
 [174.87]
 [173.15]] [174.97 150.    47.      nan]


**First benchmarking error on Black-Scholes Model**

BSM Implementation:

In [64]:
from scipy.stats import norm

N = norm.cdf

def BS_CALL(params: np.array):
    K = params[1]
    sigma = params[2]
    T = params[3]/365
    r = params[4]/100
    S = params[0]
    d1 = (np.log(S/K) + (r + sigma**2/2)*T) / (sigma*np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * N(d1) - K * np.exp(-r*T)* N(d2)

def BS_PUT(params: np.array):
    K = params[1]
    sigma = params[2]
    T = params[3]/365
    r = params[4]/100
    S = params[0]
    d1 = (np.log(S/K) + (r + sigma**2/2)*T) / (sigma*np.sqrt(T))
    d2 = d1 - sigma* np.sqrt(T)
    return K*np.exp(-r*T)*N(-d2) - S*N(-d1)

BSM Benchmark Error:

In [65]:
#need following columns in this order:
#CurrentP (S) [0] AKA, strike (K) [1], days_expiry (T) [3], r (r) [4], vol (sigma) [2]

#then calculate price for all rows of test sets, then calculate squared error for each price, take mean, 
#compare with mse from one of the LSTM models on the test set (to-do)

BS_CALL_res = [BS_CALL(elem) for elem in CALL_X_test]
BS_PUT_res = [BS_PUT(elem) for elem in PUT_X_test]

BS_CALL_mse = np.mean((BS_CALL_res - CALL_Y_test)**2)
BS_PUT_mse = np.mean((BS_PUT_res - PUT_Y_test)**2)
print(BS_CALL_mse)
print(BS_PUT_mse)

print(CALL_X_test[0], CALL_Y_test[0])

print(BS_CALL_res[:5])
print(CALL_Y_test[:5])
print((BS_CALL_res[:5] - CALL_Y_test[:5])**2)

0.028180376721682294
0.10644548580248238
[213.26   222.5      0.2283  48.       2.43   203.43   200.99   200.48
 208.97   202.75   201.74   206.5    210.35   210.36   212.64   212.46
 202.64   206.49   204.16   205.53   209.01   208.74   205.7    209.19
 213.28  ] 3.75
[3.708035080253964, 7.093790683373669, 2.7726409121217586, 4.332160836057582, 6.0086885656784546]
[3.75  7.175 2.825 4.35  5.8  ]
[0.00176105 0.00659495 0.00274147 0.00031824 0.04355092]


**Model Params**

In [None]:
layers = 4
features = 4
n_batch = 4096
n_epochs = 100

**Building LSTM-MLP Model**

In [None]:
from keras.models import Sequential, Model, load_model
from keras.layers import Dense, Activation, LeakyReLU, BatchNormalization, LSTM, Bidirectional, Input, Concatenate
from keras import backend as K
from keras.callbacks import TensorBoard
from keras.optimizers import Adam
from keras.utils import plot_model

def make_model():
    close_history = Input((n, 1))
    input2 = Input((features,))
    
    lstm = Sequential()
    lstm.add(LSTM(units=8, input_shape=(n, 1), return_sequences=True))
    lstm.add(LSTM(units=8, return_sequences=True))
    lstm.add(LSTM(units=8, return_sequences=True))
    lstm.add(LSTM(units=8, return_sequences=False))
    input1 = lstm(close_history)
    
    connect = Concatenate()([input1, input2])
    
    for _ in range(layers - 1):
        connect = Dense(400)(connect)
        connect = BatchNormalization()(connect)
        connect = LeakyReLU()(connect)
    
    predict = Dense(1, activation='relu')(connect)

    return Model(inputs=[close_history, input2], outputs=predict)

In [None]:
call_model = make_model()
call_model.summary()

**Testing a few parameters in base LSTM-MLP model**

In [54]:
call_model.compile(optimizer=Adam(learning_rate=1e-2), loss='mse')
history = call_model.fit(LSTM_CALL_X_train, LSTM_CALL_Y_train, 
                    batch_size=n_batch, epochs=10, 
                    validation_split = 0.01,
                    callbacks=[TensorBoard()],
                    verbose=1)
#call_model.save('saved-models/call_test1.keras')

Epoch 1/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 242ms/step - loss: 48.4499 - val_loss: 6819.1099
Epoch 2/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 200ms/step - loss: 2.5083 - val_loss: 2474.0715
Epoch 3/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 202ms/step - loss: 1.4641 - val_loss: 1806.6454
Epoch 4/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 206ms/step - loss: 0.8735 - val_loss: 719.9346
Epoch 5/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 211ms/step - loss: 0.4956 - val_loss: 540.2351
Epoch 6/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 202ms/step - loss: 0.4807 - val_loss: 354.1342
Epoch 7/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 203ms/step - loss: 0.4415 - val_loss: 261.0613
Epoch 8/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 202ms/step - loss: 0.3603 - val_loss: 130.2426
Epoch 9/10
[1m15/1

In [55]:
call_model.evaluate(LSTM_CALL_X_train, LSTM_CALL_Y_train, batch_size=4096)

[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 102ms/step - loss: 65.9484


66.12503814697266

In [56]:
call_model.compile(optimizer=Adam(learning_rate=1e-3), loss='mse')
history = call_model.fit(LSTM_CALL_X_train, LSTM_CALL_Y_train, 
                    batch_size=n_batch, epochs=10, 
                    validation_split = 0.01,
                    callbacks=[TensorBoard()],
                    verbose=1)
#call_model.save('saved-models/call_test2.keras')

Epoch 1/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 250ms/step - loss: 0.4462 - val_loss: 64.0724
Epoch 2/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 207ms/step - loss: 0.3298 - val_loss: 44.0734
Epoch 3/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 201ms/step - loss: 0.2759 - val_loss: 24.6695
Epoch 4/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 205ms/step - loss: 0.2925 - val_loss: 11.9165
Epoch 5/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 201ms/step - loss: 0.3017 - val_loss: 8.9507
Epoch 6/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 209ms/step - loss: 0.2729 - val_loss: 3.9998
Epoch 7/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 201ms/step - loss: 0.3682 - val_loss: 4.1332
Epoch 8/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 203ms/step - loss: 0.3515 - val_loss: 3.6075
Epoch 9/10
[1m15/15[0m [32m━━━━━

In [8]:
call_model.compile(optimizer=Adam(learning_rate=1e-3), loss='mae')
history = call_model.fit(LSTM_CALL_X_train, LSTM_CALL_Y_train, 
                    batch_size=n_batch, epochs=15, 
                    validation_split = 0.01,
                    callbacks=[TensorBoard()],
                    verbose=1)
#call_model.save('saved-models/call_test3.keras')

NameError: name 'call_model' is not defined

In [67]:
predicted_value = call_model.predict(LSTM_CALL_X_train)
print(LSTM_CALL_Y_train[:5])
t = np.transpose(predicted_value)
print(t[0][:5])
print(np.mean(LSTM_CALL_Y_train))
print(np.mean(predicted_value))


[1m1860/1860[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 5ms/step
[ 1.205  1.35   1.295 15.15   0.035]
[ 0.9846658  1.0550512  1.5737005 12.941632   0.       ]
9.087045806017818
8.952071


In [68]:
predicted_valuet = call_model.predict(LSTM_CALL_X_test)
print(LSTM_CALL_Y_test[:5])
t = np.transpose(predicted_valuet)
print(t[0][:5])
print(np.mean(LSTM_CALL_Y_test))
print(np.mean(predicted_valuet))

[1m207/207[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step
[3.75  7.175 2.825 4.35  5.8  ]
[2.9940124 8.082534  2.2449937 4.2151694 5.533465 ]
9.223704236006052
9.09366
