# Environment

Set up the environment according to `environments.yml` or `requirements.txt`. Crucially, the Theano backend for Keras is required.

Ensure the `models` folder and `utils.py` are available so that the module imports work.

In [1]:
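import pandas as pd

# Load the raw inputs: tweets, S&P 500 prices, and the VIX index
tweets = pd.read_csv("data/rich_tweets.csv", low_memory=False)
stock_data = pd.read_csv("data/S&P500.csv", parse_dates=['Date'], infer_datetime_format=True)
vix = pd.read_csv('data/vix.csv', parse_dates=['Date'], infer_datetime_format=True)

# Cut-off times for validation and testing; pd.Timestamp('2019') resolves to 2019-01-01
val_time = pd.Timestamp('2019')
test_time = pd.Timestamp('2020')

# Bundle the inputs under the keys the preprocessing functions receive
data_dict = {}
data_dict['tweets'] = tweets
data_dict['vix'] = vix
data_dict['stock_data'] = stock_data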

# Usage
1. Supply `tweets`: tweet data scraped from Twitter.
2. Supply `vix`: the VIX index, used as the measure of market volatility.
3. Supply `stock_data`: S&P 500 or similar price data, used for labels.
4. Supply `val_time` and `test_time`: the validation and test times for which volatility is predicted.

The model automatically reports test performance, along with validation scores and the selected set of hyperparameters.
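How `val_time` and `test_time` are used inside `utils.py` is not shown here. A minimal sketch of one common convention, assuming rows before `val_time` are used for training, rows from `val_time` up to `test_time` for validation, and rows from `test_time` onward for testing, could look like this. The helper `chronological_split` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd


def chronological_split(df, val_time, test_time, date_col="Date"):
    """Split rows by date into train / validation / test (assumed convention)."""
    train = df[df[date_col] < val_time]
    val = df[(df[date_col] >= val_time) & (df[date_col] < test_time)]
    test = df[df[date_col] >= test_time]
    return train, val, test


# Toy frame standing in for the merged feature table; the values are made up.
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2018-06-01", "2018-12-31",
        "2019-01-01", "2019-07-15",
        "2020-01-01", "2020-03-02",
    ]),
    "feature": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

train, val, test = chronological_split(
    toy, pd.Timestamp("2019"), pd.Timestamp("2020")
)
print(len(train), len(val), len(test))
```

Splitting by time rather than at random keeps later observations out of the training set, which matters when the features are built from lagged values.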

In [2]:
from utils import preprocess_tweets, preprocess_stocks_data, preprocess_vix, apply_sentiments, feature_selection, model_selection

Using Theano backend.


Keras version: 1.1.0
Keras Backend: theano


In [3]:
tweets = preprocess_tweets(data_dict)
stocks_data_df = preprocess_stocks_data(data_dict)
vix_ = preprocess_vix(data_dict)

Set `verbose` to **False** to suppress runtime information; it defaults to **True**.

In [4]:
%%time
features, merged_data = apply_sentiments(stocks_data_df, tweets, verbose=True)
selected_features, shifted_data = feature_selection(features, stocks_data_df, merged_data, verbose=True)

Bidrectional RNN sentiments runtime: 215.7 second(s)
VADER sentiments runtime: 8.4 second(s)
pos affects Close at lag 1, pvalue = 0.004148718448717387
weighted_Disgust affects Close at lag 1, pvalue = 0.00048534438451145335
weighted_Fear affects Close at lag 1, pvalue = 8.911128506877189e-06
weighted_Joy affects Close at lag 1, pvalue = 0.005988865792200514
weighted_Surprise affects Close at lag 1, pvalue = 0.0026136991039728604
weighted_neu affects Close at lag 1, pvalue = 5.858989503785753e-05
['Disgust', 'Fear', 'Joy', 'Surprise', 'neu', 'pos', 'weighted_Disgust', 'weighted_Fear', 'weighted_Joy', 'weighted_Surprise', 'weighted_neu', 'pos1', 'weighted_Disgust1', 'weighted_Fear1', 'weighted_Joy1', 'weighted_Surprise1', 'weighted_neu1']
Wall time: 3min 44s
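The printed feature list contains names such as `pos1` and `weighted_Fear1`, which look like lag-1 copies of the features reported as significant at lag 1. The actual logic lives in `feature_selection` and is not shown here, so the following is only a sketch of how such lagged columns could be built with pandas. The helper `add_lagged` and the toy values are hypothetical.

```python
import pandas as pd


def add_lagged(df, columns, lag=1):
    """Append a lagged copy of each column, named '<col><lag>' (assumed naming)."""
    out = df.copy()
    for col in columns:
        # shift(lag) moves values down by `lag` rows, so each row
        # sees the value from `lag` periods earlier
        out[f"{col}{lag}"] = out[col].shift(lag)
    # The first `lag` rows have no earlier value and are dropped
    return out.dropna().reset_index(drop=True)


# Toy data; the numbers are made up for illustration.
toy = pd.DataFrame({
    "pos": [0.1, 0.2, 0.3],
    "Close": [10.0, 11.0, 12.0],
})

lagged = add_lagged(toy, ["pos"], lag=1)
print(lagged)
```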


Set `verbose` to **False** to suppress runtime information; it defaults to **True**.

In [5]:
estimators, cross_validation_scores, results = model_selection(selected_features, val_time, test_time, shifted_data, vix_, verbose=True)

Nearest Neighbors's Validation F2 Score: 91.51%, Runtime: 2.0 second(s)
Gaussian Process's Validation F2 Score: 95.71%, Runtime: 74.7 second(s)
Decision Tree's Validation F2 Score: 92.82%, Runtime: 1.3 second(s)
Random Forest's Validation F2 Score: 97.62%, Runtime: 6.1 second(s)
Neural Net's Validation F2 Score: 93.14%, Runtime: 1.9 second(s)
AdaBoost's Validation F2 Score: 91.84%, Runtime: 1.4 second(s)
Naive Bayes's Validation F2 Score: 0.64%, Runtime: 0.0 second(s)
QDA's Validation F2 Score: 97.62%, Runtime: 0.0 second(s)
Logistic Regression's Validation F2 Score: 81.47%, Runtime: 0.1 second(s)
LightGBM's Validation F2 Score: 97.62%, Runtime: 11.9 second(s)
XGBoost's Validation F2 Score: 97.62%, Runtime: 1.4 second(s)
Best Estimator: Random Forest, Specifications: Pipeline(steps=[('Scaler', StandardScaler()),
                ('RandomOverSampling', RandomOverSampler(random_state=4222)),
                ('Model',
                 RandomForestClassifier(max_depth=8, n_estimators=50,
  

In [6]:
results

| Metric | sentiment only | vix only | combined |
|--------|----------------|----------|----------|
| acc    | 0.615385       | 0.650641 | 0.701923 |
| f1     | 0.576122       | 0.575467 | 0.578987 |
| f2     | 0.597892       | 0.616747 | 0.646975 |
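For reference, the F2 score is the F-beta score with beta = 2, which weights recall more heavily than precision. The sketch below computes it for binary labels in plain Python. It is an illustration of the standard formula, not the code used in `utils.py`; how the scores reported here are averaged across classes is not stated, and the labels below are made up.

```python
def fbeta_binary(y_true, y_pred, beta=2.0):
    """F-beta for binary labels, treating 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up labels: 3 true positives, 2 false positives, 1 false negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

f1 = fbeta_binary(y_true, y_pred, beta=1.0)
f2 = fbeta_binary(y_true, y_pred, beta=2.0)
print(f"F1 = {f1:.4f}, F2 = {f2:.4f}")
```

In this toy case recall (0.75) exceeds precision (0.60), so F2 comes out higher than F1.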


In [7]:
cross_validation_scores.sort_values(by='F2',ascending=False)

| Model               | F2       |
|---------------------|----------|
| Random Forest       | 0.976206 |
| QDA                 | 0.976206 |
| LightGBM            | 0.976206 |
| XGBoost             | 0.976206 |
| Gaussian Process    | 0.957124 |
| Neural Net          | 0.931391 |
| Decision Tree       | 0.92815  |
| AdaBoost            | 0.918366 |
| Nearest Neighbors   | 0.915136 |
| Logistic Regression | 0.814742 |


In [8]:
estimators

{'Nearest Neighbors': Pipeline(steps=[('Scaler', StandardScaler()),
                 ('RandomOverSampling', RandomOverSampler(random_state=4222)),
                 ('Model', KNeighborsClassifier(n_neighbors=3))]),
 'Gaussian Process': Pipeline(steps=[('Scaler', StandardScaler()),
                 ('RandomOverSampling', RandomOverSampler(random_state=4222)),
                 ('Model',
                  GaussianProcessClassifier(kernel=1**2 * RBF(length_scale=1),
                                            random_state=4222))]),
 'Decision Tree': Pipeline(steps=[('Scaler', StandardScaler()),
                 ('RandomOverSampling', RandomOverSampler(random_state=4222)),
                 ('Model', DecisionTreeClassifier(random_state=4222))]),
 'Random Forest': Pipeline(steps=[('Scaler', StandardScaler()),
                 ('RandomOverSampling', RandomOverSampler(random_state=4222)),
                 ('Model',
                  RandomForestClassifier(max_depth=8, n_estimators=50,
          