# Predict Range of Daily Price Movement on S&P500 Index Using Vix

## Background

The purpose of his project is tring to find if there are any way that the <span style="color:blue"> Range of Daily Price Change  on S&P500 Index </span> can be predicted by <span style="color:blue"> some form of munipulated data generated from Vix price </span>. 


VIX is the ticker symbol and the popular name for the Chicago Board Options Exchange's CBOE Volatility Index, a popular measure of the stock market's expectation of volatility based on S&P 500 index options. It is calculated and disseminated on a real-time basis by the CBOE. The VIX is a 30-day expectation of volatility given by a weighted portfolio of out-of-the-money European options on the S&P 500. The formular is as follows: 

<div align="center">
    $$VIX = \sqrt{\frac{2e^{r\tau}}{\tau}\Big(\int_{0}^{F}{\frac{P(K)}{K^{2}}}dK + \int_{0}^{\infty}{\frac{C(K)}{K^{2}}}dK\Big)}$$
</div>

Refer to [\[1\]](https://en.wikipedia.org/wiki/VIX) for more details;



## Goal

The final goal of this project is find the optimum model to predict the price range, but a compromise can be made if a given model can predict if the <span style="color:blue">percentage change from open price is less than x% </span> can be predited with a <span style="color:blue">higher probability</span> than those try to predicts size of the range. We also want to gain some insights if possible on predicting daily direction of S&P500.

We will use the following methods(models) to try to find a way to predict the daily S&P500 price range using vix. The following is a list of methods and data sets will be used to construct such a model.




## Hypothesis

### 1: Logistic Regression Can Be Used to Predict If The Price Movement Will Be Within a Pre-set range $\pm x$ ?
Logistic regression is a statistical model uses a logistic function to model a binary dependent variable. In this case the binary dependent variable is defined as if the price moved more than $x$ no matter in which direction.  If possible we also want to find if there are any relasionships between vix and the diraction of SPX movement.

In [52]:
import pandas as pd
from datetime import datetime
import utility as utl
import numpy as np
import talib

vix = pd.read_csv("vix.csv")
vix = utl.clean_up_data(vix, "Vix")

snp = pd.read_csv("snp.csv")
snp = utl.clean_up_data(snp, "SPX")

data_set = pd.concat([vix, snp], axis= 1)
data_set = data_set[data_set.index.map(lambda x: x > datetime(2006,1,1) and x < datetime(2021,2,22) )]
data_set["Vix_Prev_Close"] = data_set["Vix_Close"].shift(1)
data_set["Vix_OC_Gap"] = data_set["Vix_Open"] - data_set["Vix_Prev_Close"]
data_set["SPX_Mov"] = data_set["SPX_Open"] - data_set["SPX_Close"]
for i in range(3):
    data_set[f"SPX_Close_shift_{i+1}"] = data_set["SPX_Close"].shift(i+1)

open = data_set['SPX_Open'].dropna()
close = data_set['SPX_Close'].dropna()
high = data_set['SPX_High'].dropna()
low = data_set['SPX_Low'].dropna()

vix_close = data_set['SPX_Close'].dropna()
vix_high = data_set['SPX_High'].dropna()
vix_low = data_set['SPX_Low'].dropna()
data_set["RSI"] = talib.RSI(close, timeperiod=14)
data_set["RSI"] = data_set["RSI"].shift(1)
data_set["ATR"] = talib.ATR(high, low, close, timeperiod = 5)
data_set["ATR"] = data_set["ATR"].shift(1)
data_set["BB_UPPER"], data_set["BB_MID"], data_set["BB_LOWER"] = talib.BBANDS(close, timeperiod=5, nbdevup=2, nbdevdn=2, matype=talib.MA_Type.T3)
data_set["BB_UPPER"].shift(1)
data_set["BB_MID"].shift(1)
data_set["BB_LOWER"].shift(1)
data_set["ADX"] = talib.ADX(high=high,low=low,close=close,timeperiod=10)
data_set["ADX"] = data_set["ADX"].shift(1)

data_set["DOJI"] = talib.CDLDOJI(open,high,low,close)
data_set["DOJI"] = data_set["DOJI"].shift(1)

data_set["VIX_RSI"] = talib.RSI(vix_close, timeperiod=14)
data_set["VIX_RSI"] = data_set["VIX_RSI"].shift(1)
data_set["VIX_ATR"] = talib.ATR(vix_high, vix_low, vix_close, timeperiod = 5)
data_set["VIX_ATR"] = data_set["VIX_ATR"].shift(1)

data_set["SPX_OC_Gap"] = data_set["SPX_Open"] - data_set["SPX_Close_shift_1"]
data_set["SPX_pre_Mov"] = data_set["SPX_Mov"].shift(1)
data_set["SPX_High_shift_1"] = data_set["SPX_High"].shift(1)
data_set["SPX_Low_shift_1"] = data_set["SPX_Low"].shift(1)
data_set["SPX_PRE_HL_RANGE"] = data_set["SPX_High_shift_1"] - data_set["SPX_Low_shift_1"]
data_set["SPX_ABS_Mov"] = abs(data_set["SPX_Mov"] )
data_set["SPX_Direction"] = data_set["SPX_Mov"] >= 0
data_set["SPX_MOV_Direction"] = data_set["SPX_Direction"].apply(lambda x: 1 if x else -1)

data_set["MONTH"] = data_set.index.map(lambda x: x.month)
data_set["DAY"] = data_set.index.map(lambda x: x.day)
data_set["WEEKDAY"] = data_set.index.map(lambda x: x.weekday())

data_set["SPX_MOVE_GT"] = data_set["SPX_ABS_Mov"]>15
data_set["SPX_MOVE_GT_15"] = data_set["SPX_MOVE_GT"].apply(lambda x: 1 if x else 0)


del data_set["SPX_MOVE_GT"]
del data_set["SPX_Direction"]
del data_set["SPX_Mov"]

data_set

Unnamed: 0_level_0,Vix_Close,Vix_Open,Vix_High,Vix_Low,SPX_Close,SPX_Open,SPX_High,SPX_Low,Vix_Prev_Close,Vix_OC_Gap,...,SPX_pre_Mov,SPX_High_shift_1,SPX_Low_shift_1,SPX_PRE_HL_RANGE,SPX_ABS_Mov,SPX_MOV_Direction,MONTH,DAY,WEEKDAY,SPX_MOVE_GT_15
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2006-01-03,11.14,12.25,12.51,10.99,,,,,,,...,,,,,,-1,1,3,1,0
2006-01-04,11.37,11.22,11.71,10.97,1273.46,1268.80,1275.37,1267.74,11.14,0.08,...,,,,,4.66,-1,1,4,2,0
2006-01-05,11.31,11.43,11.84,11.31,1273.48,1273.46,1276.91,1270.30,11.37,0.06,...,-4.66,1275.37,1267.74,7.63,0.02,-1,1,5,3,0
2006-01-06,11.00,11.23,11.50,10.81,1285.45,1273.48,1286.09,1273.48,11.31,-0.08,...,-0.02,1276.91,1270.30,6.61,11.97,-1,1,6,4,0
2006-01-09,11.13,11.35,11.35,10.98,1290.15,1285.45,1290.78,1284.82,11.00,0.35,...,-11.97,1286.09,1273.48,12.61,4.70,-1,1,9,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-02-12,19.97,21.60,22.45,19.95,3934.83,3911.65,3937.23,3905.78,21.25,0.35,...,0.02,3925.99,3890.39,35.60,23.18,-1,2,12,4,1
2021-02-16,21.46,21.13,22.46,20.88,3932.59,3939.61,3950.43,3923.85,19.97,1.16,...,-23.18,3937.23,3905.78,31.45,7.02,1,2,16,1,0
2021-02-17,21.50,22.02,23.44,21.09,3931.33,3918.50,3933.61,3900.43,21.46,0.56,...,7.02,3950.43,3923.85,26.58,12.83,-1,2,17,2,0
2021-02-18,22.49,21.98,24.23,21.80,3913.97,3915.86,3921.98,3885.03,21.50,0.48,...,-12.83,3933.61,3900.43,33.18,1.89,1,2,18,3,0


After we clean up the data, we split the above data set into train set and test set with the cut-off of 01/01/2019.

In [65]:
from sklearn import metrics 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
training_set = data_set[data_set.index < datetime(2020,1,1)].dropna()
test_set = data_set[data_set.index >= datetime(2020,1,1)].dropna()

logistic_regression = LogisticRegression()
# x_columns = ["SPX_Open", "Vix_OC_Gap", "SPX_pre_Mov", "SPX_OC_Gap", "SPX_High_shift_1", "SPX_Low_shift_1", "RSI", "ATR", "BB_UPPER", "BB_MID", "BB_LOWER", "VIX_RSI", "VIX_ATR"]
# x_columns = ["Vix_OC_Gap", "SPX_pre_Mov", "SPX_PRE_HL_RANGE","SPX_OC_Gap", "RSI", "ATR", "BB_UPPER", "BB_MID", "BB_LOWER", "VIX_RSI", "VIX_ATR"]

x_columns = ["Vix_OC_Gap", "SPX_pre_Mov", "SPX_PRE_HL_RANGE","SPX_OC_Gap", "RSI", "ATR", "ADX","BB_UPPER", "BB_MID", "BB_LOWER", "VIX_RSI", "VIX_ATR", "MONTH", "DAY","WEEKDAY"]
for i in range(3):
    x_columns.append(f"SPX_Close_shift_{i+1}")

x_train = training_set.loc[:, x_columns]
x_test = test_set.loc[:, x_columns]
clf = logistic_regression.fit(x_train, training_set["SPX_MOVE_GT_15"])
y_pred = logistic_regression.predict(x_test)
accuracy = metrics.accuracy_score(test_set["SPX_MOVE_GT_15"], y_pred)
accuracy_percentage = 100 * accuracy

test_set["range_predict"] = y_pred
accuracy_percentage
pd.set_option('display.max_rows', 10)
test_set[test_set["range_predict"] != test_set["SPX_MOVE_GT_15"]]
test_set[(test_set["range_predict"] == 0) & (test_set["SPX_MOVE_GT_15"] == 1)]

# test_set[test_set["SPX_MOVE_GT_15"]==1]

# len(test_set[(test_set["range_predict"] == 0) & (test_set["SPX_MOVE_GT_15"] == 1)])/len(test_set[test_set["SPX_MOVE_GT_15"]==1])
# accuracy_percentage
# x_train

Unnamed: 0_level_0,Vix_Close,Vix_Open,Vix_High,Vix_Low,SPX_Close,SPX_Open,SPX_High,SPX_Low,Vix_Prev_Close,Vix_OC_Gap,...,SPX_High_shift_1,SPX_Low_shift_1,SPX_PRE_HL_RANGE,SPX_ABS_Mov,SPX_MOV_Direction,MONTH,DAY,WEEKDAY,SPX_MOVE_GT_15,range_predict
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2020-01-06,13.85,15.45,16.39,13.54,3246.28,3217.55,3246.84,3214.64,14.02,1.43,...,3246.15,3222.34,23.81,28.73,-1,1,6,0,1,0
2020-01-10,12.56,12.42,12.87,12.09,3265.35,3281.81,3282.99,3260.86,12.54,-0.12,...,3275.58,3263.67,11.91,16.46,1,1,10,4,1,0
2020-01-13,12.32,12.84,13.09,12.32,3288.13,3271.13,3288.13,3268.43,12.56,0.28,...,3282.99,3260.86,22.13,17.00,-1,1,13,0,1,0
2020-01-24,14.56,12.75,15.98,12.62,3295.47,3333.10,3333.18,3281.53,12.98,-0.23,...,3326.88,3301.87,25.01,37.63,1,1,24,4,1,0
2020-01-29,16.39,15.68,16.65,14.94,3273.40,3289.46,3293.47,3271.89,16.28,-0.60,...,3285.78,3253.22,32.56,16.06,1,1,29,2,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-01-14,23.25,22.22,23.47,21.66,3795.54,3814.98,3823.60,3792.86,22.21,0.01,...,3820.96,3791.50,29.46,19.44,1,1,14,3,1,0
2021-01-15,24.34,23.52,25.80,23.08,3768.25,3788.73,3788.73,3749.62,23.25,0.27,...,3823.60,3792.86,30.74,20.48,1,1,15,4,1,0
2021-01-19,23.24,23.03,23.56,22.53,3798.91,3781.88,3804.53,3780.37,24.34,-1.31,...,3788.73,3749.62,39.11,17.03,-1,1,19,1,1,0
2021-01-20,21.58,22.82,22.86,21.37,3851.85,3816.22,3859.75,3816.22,23.24,-0.42,...,3804.53,3780.37,24.16,35.63,-1,1,20,2,1,0


we can see from the above simple logistic regression, we only get a 43 percent of accuracy, which is not good enough. What we want to try before we try to switch to other models , we can try to add more dependent variables. We first try some technical indicators try take some properties of historical datas into account.

We will try ann next

In [80]:
import tensorflow as tf

ann = tf.keras.models.Sequential()
ann.add(tf.keras.layers.Dense(units= 30, activation='relu'))
ann.add(tf.keras.layers.Dense(units= 60, activation='relu'))
ann.add(tf.keras.layers.Dense(units= 30, activation='relu'))
ann.add(tf.keras.layers.Dense(units= 1, activation='sigmoid'))

ann.compile(optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"])

ann.fit(x_train, training_set["SPX_MOVE_GT_15"], batch_size = 32, epochs = 150)
result = ann.predict(x_test)


Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 14/150
Epoch 15/150
Epoch 16/150
Epoch 17/150
Epoch 18/150
Epoch 19/150
Epoch 20/150
Epoch 21/150
Epoch 22/150
Epoch 23/150
Epoch 24/150
Epoch 25/150
Epoch 26/150
Epoch 27/150
Epoch 28/150
Epoch 29/150
Epoch 30/150
Epoch 31/150
Epoch 32/150
Epoch 33/150
Epoch 34/150
Epoch 35/150
Epoch 36/150
Epoch 37/150
Epoch 38/150
Epoch 39/150
Epoch 40/150
Epoch 41/150
Epoch 42/150
Epoch 43/150
Epoch 44/150
Epoch 45/150
Epoch 46/150
Epoch 47/150
Epoch 48/150
Epoch 49/150
Epoch 50/150
Epoch 51/150
Epoch 52/150
Epoch 53/150
Epoch 54/150
Epoch 55/150
Epoch 56/150
Epoch 57/150
Epoch 58/150
Epoch 59/150
Epoch 60/150
Epoch 61/150
Epoch 62/150
Epoch 63/150
Epoch 64/150
Epoch 65/150
Epoch 66/150
Epoch 67/150
Epoch 68/150
Epoch 69/150
Epoch 70/150
Epoch 71/150
Epoch 72/150
Epoch 73/150
Epoch 74/150
Epoch 75/150
Epoch 76/150
Epoch 77/150
Epoch 78

In [81]:
test_set["ann predict"] = [1 if x>0.5 else 0 for x in result]
test_set[test_set["ann predict"] != test_set["SPX_MOVE_GT_15"]]

Unnamed: 0_level_0,Vix_Close,Vix_Open,Vix_High,Vix_Low,SPX_Close,SPX_Open,SPX_High,SPX_Low,Vix_Prev_Close,Vix_OC_Gap,...,SPX_Low_shift_1,SPX_PRE_HL_RANGE,SPX_ABS_Mov,SPX_MOV_Direction,MONTH,DAY,WEEKDAY,SPX_MOVE_GT_15,range_predict,ann predict
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2020-01-06,13.85,15.45,16.39,13.54,3246.28,3217.55,3246.84,3214.64,14.02,1.43,...,3222.34,23.81,28.73,-1,1,6,0,1,0,0
2020-01-10,12.56,12.42,12.87,12.09,3265.35,3281.81,3282.99,3260.86,12.54,-0.12,...,3263.67,11.91,16.46,1,1,10,4,1,0,0
2020-01-13,12.32,12.84,13.09,12.32,3288.13,3271.13,3288.13,3268.43,12.56,0.28,...,3260.86,22.13,17.00,-1,1,13,0,1,0,0
2020-01-24,14.56,12.75,15.98,12.62,3295.47,3333.10,3333.18,3281.53,12.98,-0.23,...,3301.87,25.01,37.63,1,1,24,4,1,0,0
2020-01-27,18.23,17.42,19.02,16.82,3243.63,3247.16,3258.85,3234.50,14.56,2.86,...,3281.53,51.65,3.53,1,1,27,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-02-03,22.91,24.59,25.43,22.91,3830.17,3840.27,3847.51,3816.68,25.56,-0.97,...,3791.84,51.25,10.10,1,2,3,2,0,1,1
2021-02-05,20.87,21.99,22.16,20.86,3886.83,3878.30,3894.56,3874.93,21.77,0.22,...,3836.66,35.76,8.53,-1,2,5,4,0,1,1
2021-02-09,21.63,21.57,22.26,20.65,3911.23,3910.49,3918.35,3902.64,21.24,0.33,...,3892.59,23.18,0.74,-1,2,9,1,0,1,1
2021-02-10,21.99,21.64,23.85,19.69,3909.88,3920.78,3931.50,3884.94,21.63,0.01,...,3902.64,15.71,10.90,1,2,10,2,0,0,1


In [91]:
# pd.set_option('display.max_rows', None)
len(test_set[test_set["ann predict"] == 1 ])
test_set[(test_set["ann predict"] == 0) & (test_set["SPX_MOVE_GT_15"] ==1) & (test_set["WEEKDAY"] == 4)]
# acc = metrics.accuracy_score(test_set["SPX_MOVE_GT_15"], test_set["ann predict"])
# acc

Unnamed: 0_level_0,Vix_Close,Vix_Open,Vix_High,Vix_Low,SPX_Close,SPX_Open,SPX_High,SPX_Low,Vix_Prev_Close,Vix_OC_Gap,...,SPX_Low_shift_1,SPX_PRE_HL_RANGE,SPX_ABS_Mov,SPX_MOV_Direction,MONTH,DAY,WEEKDAY,SPX_MOVE_GT_15,range_predict,ann predict
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2020-01-10,12.56,12.42,12.87,12.09,3265.35,3281.81,3282.99,3260.86,12.54,-0.12,...,3263.67,11.91,16.46,1,1,10,4,1,0,0
2020-01-24,14.56,12.75,15.98,12.62,3295.47,3333.1,3333.18,3281.53,12.98,-0.23,...,3301.87,25.01,37.63,1,1,24,4,1,0,0
2020-02-21,17.08,17.33,18.21,16.19,3337.75,3360.5,3360.76,3328.45,15.56,1.77,...,3341.02,48.13,22.75,1,2,21,4,1,0,0
2021-02-12,19.97,21.6,22.45,19.95,3934.83,3911.65,3937.23,3905.78,21.25,0.35,...,3890.39,35.6,23.18,-1,2,12,4,1,0,0
