#  Final Report
## Sentiment of stock headlines to improve on time-series models predicting VIX  


## 1. Introduction
### a. Predicting the unpredictable

The VIX index, also known as the "fear index", is a measure designed by the Chicago Board Options Exchange (CBOE) to estimate how much volatility will be seen in the S&P 500 (SPX) index over the next 30 days. The CBOE presents the following in its FAQ.

The VIX Index is a financial benchmark designed to be an up-to-the-minute market estimate of expected volatility of the S&P 500 Index, and is calculated by using the midpoint of real-time S&P 500® Index (SPX) option bid/ask quotes. More specifically, the VIX Index is intended to provide an instantaneous measure of how much the market thinks the S&P 500 Index will fluctuate in the 30 days from the time of each tick of the VIX Index.


Cboe Options Exchange® (Cboe Options®) calculates the VIX Index using standard SPX options and weekly SPX options that are listed for trading on Cboe Options. Standard SPX options expire on the third Friday of each month and weekly SPX options expire on all other Fridays. Only SPX options with Friday expirations are used to calculate the VIX Index.* Only SPX options with more than 23 days and less than 37 days to the Friday SPX expiration are used to calculate the VIX Index. These SPX options are then weighted to yield a constant maturity 30-day measure of the expected volatility of the S&P 500 Index.

* Cboe Options lists SPX options that expire on days other than Fridays. Non-Friday SPX expirations are not used to calculate the VIX Index.

Intraday VIX Index values are based on snapshots of SPX option bid/ask quotes every 15 seconds and are intended to provide an indication of the fair market price of expected volatility at particular points in time. As such, these VIX Index values are often referred to as "indicative" or "spot" values. Cboe Options currently calculates VIX Index spot values between 3:15 a.m. ET and 9:15 a.m. ET (Cboe GTH session), and between 9:30 a.m. ET and 4:15 p.m. ET (Cboe RTH session) according to the VIX Index formula that is set forth in the White Paper.

The generalized formula used in the VIX Index calculation is:

![Vix](../Figures/vix_formula.png)

VIX Index Formula
Originally posted (Apr 14 2016); updated (Jun 29 2016); updated (May 15 2018); updated (Oct 8, 2019)[1]

[1]: https://www.cboe.com/tradable_products/vix/faqs/

Since volatility is presumably driven by events the market has not yet foreseen, the objective here is to see whether traditional time-series prediction methods can be improved by including exogenous features based on the sentiment of financial headlines that reference over 8000 stocks and funds. The hope is to refine the predictive ability of traditional autoregressive/moving-average models by using the headline data as an exogenous set, and to see whether other models can generate better predictions than the time-series methods.

This approach differs substantially from the approach in *Predicting VIX with Adaptive Machine Learning* by Bai and Cai, which used a set of 278 technical market indicators and factors but did not examine sentiment or news articles. [2] They also performed binary (and multiclass) classification of the direction signal rather than attempting to predict the price of the index itself. I will infer the direction signal from the direction of the model's predicted values. So although some classification is done, it is derived from the predicted values rather than produced by a typical classification model.

[2]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3866415




### b. Client 

  
The clients for this question would be financial investment firms. If the headline features provide sufficient information gain, an appropriate options strategy could be implemented with large potential returns.
    
    
### c. Time Series Analysis

  
Time series data is any data, such as the price of the VIX index, that is recorded at fixed time intervals. In this case we are looking at closing prices on trading days from 2010 until early June 2020. The standard method of time series analysis is the Box-Jenkins method using the ARIMA (autoregressive integrated moving average) model, which combines autoregression, differencing and moving averages. This is essentially our baseline for performance, as it works purely off the time series data itself.

First, I will attempt to incorporate the headline data with the time series data in a traditional linear regression model. Then I will see whether a neural network will work, primarily testing GRU (Gated Recurrent Unit) and LSTM (Long Short-Term Memory) networks to look for improvements. Finally, I will try a SARIMAX model, which is an ARIMA model combined with a linear regression on an exogenous feature set.

## 2. Data

### a. Raw Sources

The headlines are from *Daily Financial News for 6000+ Stocks* on Kaggle.

https://www.kaggle.com/datasets/miguelaenlle/massive-stock-news-analysis-db-for-nlpbacktests

The price of the VIX was downloaded from Yahoo Finance.



### b. Libraries

<table>

<tr>
  <td>Data Wrangle</td>
  <td>scipy, plotly, numpy, pandas, matplotlib, seaborn, datetime, kaggle</td>
</tr>

<tr>
  <td>Data Visualization</td>
  <td>itertools, collections, PIL, wordcloud, seaborn, nltk, bs4, bokeh, gensim, plotly</td>
</tr>

<tr>
  <td>Classical Classification of Text</td>
  <td>seaborn, sklearn, matplotlib, xgboost</td>
</tr>

<tr>
  <td>BERT and DistilBert</td>
  <td>huggingface, ktrain, pytorch, tensorflow</td>
</tr>

<tr>
  <td>Arima</td>
  <td>scipy, statsmodels, math</td>
</tr>

<tr>
  <td>Xgboost and RandomForest regression</td>
  <td>sklearn, plotly, xgboost</td>
</tr>

<tr>
  <td>RNN, LSTM and GRU Neural Networks</td>
  <td>tensorflow, sklearn, statsmodels</td>
</tr>

</table>


## Data Wrangle

* [Data Wrangle Notebook](https://github.com/mikedshadow/Capstone-3/blob/main/Notebooks/Data-Wrangle.ipynb)

For the Headlines:

* There were about 1,400,000 headlines.
* The Kaggle dataset author scraped them from Benzinga.
* Each instance has an Unnamed:0 index, a title, a date and a stock ticker.
* Sectors and groupings of stocks were considered but required too much labeling.
* Vader was used to create a sentiment of Positive, Negative or Neutral for each headline.
* Just over half the headlines are Neutral, and the other half are split roughly 2 to 1 in favor of Positive.
* The headlines are aggregated by date, with the sentiment and ticker in corresponding lists and the titles concatenated.
* The Vader (compound) scores are summed and averaged as features.
* The features Positive, Negative, Neutral and Total count these sentiments per day.
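
The labeling and per-day aggregation steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: it assumes the compound scores have already been computed and stored in a column (named `compound` here for illustration), and it uses the ±0.05 cutoffs suggested in Vader's own documentation, since the thresholds used in the notebook are not stated in this report. The tickers and scores are made up.

```python
import pandas as pd

def label_sentiment(compound, threshold=0.05):
    """Map a Vader compound score in [-1, 1] to a sentiment label.

    The 0.05 cutoff is Vader's conventional default (an assumption here).
    """
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Toy headlines with precomputed compound scores (illustrative values only).
headlines = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-02", "2020-03-02", "2020-03-02", "2020-03-03"]),
    "stock": ["AAA", "BBB", "CCC", "AAA"],
    "compound": [0.60, -0.40, 0.00, -0.70],
})
headlines["sentiment"] = headlines["compound"].apply(label_sentiment)

# One row per day: a count for each label ...
counts = (
    pd.crosstab(headlines["date"], headlines["sentiment"])
    .reindex(columns=["Positive", "Negative", "Neutral"], fill_value=0)
)
counts.columns.name = None

# ... plus the total number of headlines and the summed/averaged scores.
by_day = headlines.groupby("date")
daily = counts.assign(
    Total=by_day.size(),
    summed_vader=by_day["compound"].sum(),
    ave_vader=by_day["compound"].mean(),
)
```

Each row of `daily` then corresponds to one calendar day of headlines, before any alignment with trading days.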
	
For Financial Data:
	
* Yahoo Finance data was collected using `pandas_datareader`.
* Since the headlines are sparse before 2010, all financial data before that was not used.

Merging:
	
* Since there were many non-trading days, I merged the sets down to just the trading days.
* All headline data dated after the previous trading day is merged into the next trading day.
* An overall sentiment for each day was also found with Vader on the concatenated headlines of the day.
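
One way to roll weekend and holiday headlines into the next trading day is sketched below. This is an assumption about how the merge could be done, not a copy of the notebook: the dates and counts are toy values, and `np.searchsorted` is used to find the first trading day on or after each headline date.

```python
import numpy as np
import pandas as pd

# Trading days (toy example: a Friday, then the following Monday and Tuesday).
trading_days = pd.DatetimeIndex(["2020-03-06", "2020-03-09", "2020-03-10"])

# Daily headline counts, including a weekend with no trading.
headline_days = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-06", "2020-03-07", "2020-03-08", "2020-03-09"]),
    "Total": [10, 3, 2, 8],
})

# For each headline date, find the position of the first trading day
# on or after it (side="left" keeps a trading day mapped to itself).
pos = np.searchsorted(
    trading_days.to_numpy(), headline_days["date"].to_numpy(), side="left"
)

# Drop headlines dated after the final trading day (nothing to roll into).
in_range = pos < len(trading_days)
rolled = headline_days[in_range].assign(trading_day=trading_days[pos[in_range]])

# Aggregate everything that was rolled into the same trading day.
merged = rolled.groupby("trading_day")["Total"].sum()
```

In this toy example the Saturday and Sunday counts are added to Monday's, so Monday ends up with 13 headlines while Friday keeps its 10.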

The idea is to use all of the previous trading day's headline and price data to predict the next day's close.
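
A minimal sketch of that framing, assuming a date-indexed frame with the `CLOSE` and `Neg_pct` columns described below (the values here are invented): shifting the features by one row pairs each day's close with information that was available at the end of the previous trading day.

```python
import pandas as pd

df = pd.DataFrame(
    {"CLOSE": [14.0, 15.5, 13.2, 16.8], "Neg_pct": [0.20, 0.35, 0.15, 0.40]},
    index=pd.to_datetime(["2020-03-02", "2020-03-03", "2020-03-04", "2020-03-05"]),
)

# Features known at the end of trading day t-1 ...
features = df[["CLOSE", "Neg_pct"]].shift(1).add_suffix("_prev")

# ... are paired with the close of trading day t as the target.
# The first row has no previous day, so it is dropped.
supervised = features.assign(target=df["CLOSE"]).dropna()
```

Building the lag explicitly like this makes it harder to leak same-day headline information into the prediction by accident.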

The saved csv file has the following features:

 `CLOSE` the closing price of the VIX 
 
 `Positive` a count of that day's positive headlines
 
 `Negative` a count of that day's negative headlines
 
 `Neutral` a count of that day's neutral headlines
 
 `Total` a total of all headlines for the day
 
 `summed_vader` sum of all the compound Vader scores for each headline in the day
 
 `sentiments` a list (length = Total) of the sentiments for the headlines of the day
 
 `headlines` a concatenated string of all the headlines that day
 
 `stocks` a list, also length = Total and in same order as `sentiments`, of all the stock tickers of the headlines
 
 `date`	the date
 
 `dayOfWeek`	the day of the week
 
 `ave_vader` the average vader score, i.e. `summed_vader` / `Total`
 
 `daily_sentiment`	sentiment of the average vader score (Positive, Negative, Neutral)
 
 `compound` the compound vader score for the day's concatenated headlines
 
 `overall_sentiment` the sentiment of `compound` (Positive, Negative, Neutral)
 
 `Pos_minus_Neg` a count of Positive minus Negative
 
 `Pos_minus_Neg_pct` the difference between Positive and Negative as a fraction of all headlines, i.e. `Pos_minus_Neg` / `Total`
 
 `Neg_pct` the percentage of Negative headlines `Negative` / `Total`
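
The ratio features at the end of this list follow directly from the daily counts. A small sketch with made-up counts, using the column names listed above:

```python
import pandas as pd

daily = pd.DataFrame({"Positive": [30, 10], "Negative": [10, 25], "Neutral": [60, 15]})

daily["Total"] = daily[["Positive", "Negative", "Neutral"]].sum(axis=1)
daily["Pos_minus_Neg"] = daily["Positive"] - daily["Negative"]
# Both ratios are fractions of all headlines for the day, not of
# Positive + Negative only.
daily["Pos_minus_Neg_pct"] = daily["Pos_minus_Neg"] / daily["Total"]
daily["Neg_pct"] = daily["Negative"] / daily["Total"]
```

A day with no headlines at all would give `Total` = 0 and undefined ratios, so such days would need to be dropped or filled before modeling.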
 
## EDA
 
* [EDA Notebook](https://github.com/mikedshadow/Capstone-3/blob/main/Notebooks/Wrangle-2.ipynb)

The VIX is already in good shape for time series analysis.

<img src="../Figures/fig2.jpg" width=800 height=600 />

It is stationary according to the augmented Dickey-Fuller test and shows no seasonality.

<img src="../Figures/fig4.jpg" width=600 height=400 />

Conveniently, the exogenous features also have extremely low p-values in the augmented Dickey-Fuller test, which supports rejecting the null hypothesis that each of these series is non-stationary.

At this point, the baseline case for modeling is determined by finding the best ARIMA model for the time series.

The engineered features `Neg_pct` and `Pos_minus_Neg_pct` both appear to be related to the spikes in the VIX: the average percentage of negative headlines is higher when the VIX is high, and the spread between positive and negative headlines decreases when the VIX is high.

## ARIMA

* [ARIMA Notebook](https://github.com/mikedshadow/Capstone-3/blob/main/Notebooks/EDA-ARIMA.ipynb)

Various ARIMA models were tested using root mean squared error (RMSE) as the metric to determine the best model.

The data was split into a 90% training set and a final 10% test set, with walk-forward validation at each step. The test set was 263 trading days, or slightly over a year.

The best-performing model was ARIMA(1,1,0) with an RMSE of 3.297, which is the level we would like to improve on.

As seen below, the model reacts quickly under walk-forward validation but is out of sync with the actual series, and it therefore does very poorly at predicting the directional signal for each day.
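
To make the "out of sync" problem concrete, here is a toy sketch. It assumes the direction signal is taken as the sign of the day-over-day change in the predicted values, which is one reading of how the signal is inferred in this report. The numbers are invented and deliberately extreme: a forecast that simply echoes the previous close tracks the series one day late and gets every direction wrong on an alternating series.

```python
import numpy as np

def direction(values):
    """Sign of the day-over-day change: +1 up, -1 down, 0 unchanged."""
    return np.sign(np.diff(values))

actual = np.array([15.0, 17.0, 16.0, 19.0, 18.0, 21.0])

# A forecast that echoes the previous close: it follows the series,
# but always one day late.
lagged = np.concatenate(([actual[0]], actual[:-1]))

rmse = np.sqrt(np.mean((lagged - actual) ** 2))
accuracy = np.mean(direction(lagged) == direction(actual))
```

Real series are not perfectly alternating, so the effect is less dramatic in practice, but the same mechanism can leave the RMSE looking reasonable while the direction accuracy stays near chance.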

<img src="../Figures/fig3.jpg" width=800 height=400 />

It is worth noting that the huge spike in the data is from March 2020 and COVID-19.

After setting our baseline with the ARIMA model, we look at the exogenous features and prepare them for adding to the model. The numerical features are scaled using `MinMaxScaler`, while the categorical features are one-hot encoded with `OneHotEncoder`. These features can now be used together with the closing price as the input to a neural network.
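
A sketch of this preprocessing step using scikit-learn's `ColumnTransformer` is shown below. The data is invented, and fitting the transformers on the training rows only is an assumption on my part (it avoids leaking information from the test period); the report does not say how the notebook handles this.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({
    "CLOSE": [14.0, 15.5, 13.2, 16.8, 30.1, 25.4],
    "Neg_pct": [0.20, 0.35, 0.15, 0.40, 0.55, 0.45],
    "overall_sentiment": ["Positive", "Negative", "Positive",
                          "Neutral", "Negative", "Negative"],
})
numeric = ["CLOSE", "Neg_pct"]
categorical = ["overall_sentiment"]

preprocess = ColumnTransformer(
    [
        ("num", MinMaxScaler(), numeric),
        # Ignore categories that first appear after the training period.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Chronological split: fit on the past only, then transform the future.
train, test = df.iloc[:4], df.iloc[4:]
X_train = preprocess.fit_transform(train)
X_test = preprocess.transform(test)
```

Note that a scaler fitted on the training period maps later spikes to values above 1, as happens with the last two rows here. That is worth keeping in mind for a series with an event like March 2020 in the test set.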

## Neural Networks

* [Neural Network Notebook](https://github.com/mikedshadow/Capstone-3/blob/main/Notebooks/Black-box.ipynb)

### GRU 

The Gated Recurrent Unit (GRU) neural network was tried first. It ran fairly slowly and did not produce good results. Since the time per epoch was high (around 30 s), it was largely impractical to use a high number of epochs, but the model seemed to settle after a reasonably small number of epochs. Unfortunately, the RMSE on the walk-forward validation was about 15, far worse than the ARIMA model. Various numbers of neurons and epochs were tried, and while some did end up doing better than the initial attempt, the models started from what was essentially a straight line and made only small adaptations, which missed the peaks altogether. As the number of epochs rose, the model would eventually stabilize and then fail to continue learning. Perhaps with greater processing power and time a model could be trained, but it would likely not be very responsive to sudden changes. The opacity of the neural network also made this model impractical, as it did not indicate which factors might have the greatest impact on the VIX.

### LSTM 

The Long Short-Term Memory (LSTM) model was also tried. Sadly, the results showed little, if any, improvement over the GRU models.
Without a much larger model and far more processing power than my laptop can offer, these neural network models appear to be a dead end.
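
For context, recurrent layers such as GRU and LSTM expect input shaped as (samples, timesteps, features). The sketch below shows one common way to build such windows with NumPy. The window length of 4 and the toy data are placeholders, not the settings used in the notebook.

```python
import numpy as np

def make_windows(features, target, timesteps):
    """Stack sliding windows into a (samples, timesteps, n_features) array.

    Window i covers rows i .. i+timesteps-1 and is paired with the target
    at row i+timesteps, i.e. the value on the following day.
    """
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    n_samples = len(features) - timesteps
    X = np.stack([features[i : i + timesteps] for i in range(n_samples)])
    y = target[timesteps:]
    return X, y

# Ten days with three features per day (toy values); the first column
# plays the role of the closing price.
features = np.arange(30, dtype=float).reshape(10, 3)
close = features[:, 0]
X, y = make_windows(features, close, timesteps=4)
```

With ten rows and a window of four, this yields six training samples, each looking back four days to predict the fifth.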

## SARIMAX 


* [SARIMAX Notebook](https://github.com/mikedshadow/Capstone-3/blob/main/Notebooks/SARIMAX.ipynb)

The SARIMAX model is an ARIMA-type model to which seasonality can be added (although that is not the case here), along with an exogenous feature set that is used as a linear regression in concert with the time series model. At first, I looked for improvements using just two of the engineered features, `Neg_pct` and `Pos_minus_Neg_pct`.

Recall the ARIMA model as shown below with RMSE of 3.297:

<img src="../Figures/fig5.jpg" width=800 height=400 />

The SARIMAX(1,1,0) model shows a slight improvement with the two given features, with an RMSE of 3.164.

The SARIMAX(2,2,2) model was also tested and performed just slightly worse, with an RMSE of 3.181.

<img src="../Figures/fig6.jpg" width=800 height=400 />

In both of these cases a train/test split of 90/10 was used, which resulted in a test set of length 263. This is almost identical to the value of 262 obtained by integer division of the length of the set by 10, which was used as the `test_size` parameter for a `TimeSeriesSplit` when running time series cross-validation with 5 folds. While the final fold was typically extremely close to the earlier 90/10 split, the earlier folds had much better RMSE. Unfortunately, the directional signals were still out of sync and the accuracy was not significantly better than random.
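
The fold layout produced by this kind of split can be inspected with a short sketch. The series length of 2627 is an illustrative value chosen to be consistent with the 263 and 262 figures above; the exact length of the data set is not stated in this report.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_obs = 2627              # illustrative length, not taken from the data
test_size = n_obs // 10   # integer division gives 262

splitter = TimeSeriesSplit(n_splits=5, test_size=test_size)
folds = list(splitter.split(np.arange(n_obs)))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    # Each fold trains on everything before its own test block.
    print(f"fold {i}: train 0..{train_idx[-1]}, "
          f"test {test_idx[0]}..{test_idx[-1]} ({len(test_idx)} rows)")
```

Only the last fold ends at the most recent observation, so it is the one that overlaps the March 2020 spike. That would be consistent with the earlier folds showing much better RMSE.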

<img src="../Figures/fig8.jpg" width=800 height=400 />

Adding more features, especially the categorical features `sentiments` and `stocks`, was problematic. These two features are both lists of length `Total`, so one-hot encoding turns them into thousands of columns, which slowed the SARIMAX model down so much that I could not even measure how long a full run would take. The run with just the two features took under 10 minutes, while with only one of `sentiments` or `stocks` added the model was still running after 15 hours. Removing just those two features while keeping the other categorical features `overall_sentiment`, `daily_sentiment` and `DayofWeek` allowed the model to run in a far more reasonable time. Using the numerical features and the three remaining categorical features gave a noticeable and significant improvement in RMSE of about 10% on the 90/10 split.

At this point the `TimeSeriesSplit` was used with this model, and the improvements were still noticeable at each split. This seemed like a success, as at least the proof of concept was established. The best model ended up being the following, using the three categorical features and the numerical features.

<img src="../Figures/fig7.jpg" width=800 height=400 />

<img src="../Figures/fig11.jpeg" width=600 height=300 />

## Conclusions

Clearly, some things went well and some did not. The SARIMAX model with most of the exogenous features added demonstrated an improvement over a standard ARIMA model. Unfortunately, exploring further improvement based on the `sentiments` or `stocks` features was too computationally expensive for my system. It is not clear whether this would still be an issue with more processing power. Grouping and labeling stocks into S&P (or Russell 2000) categories would be a reasonable next attempt. This would eliminate some of the explosion of the `stocks` / `sentiments` columns under encoding. Another potential solution may be to convert the lists into a set of (stock, sentiment) tuples.
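
As a rough sketch of the grouping idea, the parallel `stocks` and `sentiments` lists for a day could be collapsed into counts per (sector, sentiment) pair. The ticker-to-sector lookup below is entirely hypothetical; building a real one is exactly the labeling effort that was set aside earlier.

```python
from collections import Counter

# Hypothetical ticker-to-sector lookup (made-up tickers and sectors).
SECTOR = {"AAA": "Tech", "BBB": "Energy", "CCC": "Tech"}

def sector_sentiment_counts(stocks, sentiments, sector_map, default="Other"):
    """Count (sector, sentiment) pairs for one day's parallel lists."""
    sectors = (sector_map.get(ticker, default) for ticker in stocks)
    return Counter(zip(sectors, sentiments))

# One day's lists, in the same order as each other.
stocks = ["AAA", "BBB", "CCC", "AAA", "ZZZ"]
sentiments = ["Positive", "Negative", "Positive", "Neutral", "Negative"]

counts = sector_sentiment_counts(stocks, sentiments, SECTOR)
```

With a handful of sectors and three sentiment labels, this would produce a few dozen numeric columns per day instead of thousands of one-hot columns.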

The walk-forward validation used a single-day step, which may not be practical given the data acquisition and cleaning required. Running on a 7-day window instead might lead to greater variability in the predictions, as a tradeoff for the reduced data scraping and cleaning needed in a real-time application. The direction signals were also affected by the walk-forward method: the frequent retraining pushed the predictions to follow the time series while remaining out of phase with it, which leads to the nearly random direction signal seen in the confusion matrix above.

The hope that a neural network using either Gated Recurrent Units or Long Short-Term Memory could beat the baseline model went up in a puff of GPUs: increasing the number of epochs did not seem to decrease the RMSE, the models failed miserably at predicting the sudden increases, and they did not even treat the type of spike that occurred during COVID-19 as possible. These methods should be re-evaluated with a redesign of the experiment as a classification problem on the direction signal, either as a binary up/down problem or as a multiclass problem with big up, up, down, big down or some similar larger set of classes.
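
One possible way to build such multiclass labels is sketched below. The 10% cutoff for a "big" move is an arbitrary placeholder, and the closing values are invented.

```python
import numpy as np

def direction_classes(close, big_move=0.10):
    """Label each day's percentage change in the close.

    0 = big down, 1 = down, 2 = up, 3 = big up. The 10% cutoff for a
    "big" move is a placeholder, not a value from this report.
    """
    close = np.asarray(close, dtype=float)
    pct_change = np.diff(close) / close[:-1]
    # Bins: (-inf, -big), [-big, 0), [0, big), [big, inf)
    return np.digitize(pct_change, [-big_move, 0.0, big_move])

close = np.array([20.0, 21.0, 25.2, 24.0, 18.0])
labels = direction_classes(close)
```

Under this binning an unchanged close counts as "up", and the class balance depends heavily on the cutoff, so both choices would need to be revisited against the actual data.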


Additionally, the model should be integrated with existing models that use additional technical features, such as in [2]. The model did make a modest improvement, so as a proof of concept the project was a success. Future work would be to try some of the refinements above and to examine the data scraping protocols more thoroughly to see if a real-time model using the headlines could be implemented.


