## Off-the-shelf Model Test

In [2]:
import Query as query
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline as imb_pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import accuracy_score,f1_score,roc_auc_score

## Goal:
Given a more formal definition of a "market event", can we forecast such events
from the standard indicators in the GDELT events database using simple off-the-shelf methods? This is the "litmus test" for the project: a first pass to get a feel for the likely modeling challenges and for how effective our standard methods are.

## Market Event Definition:
The S&P 500 moves up or down by more than 1.5 standard deviations, where the
standard deviation is the rolling historical daily volatility over the
preceding one-month period (a 20-day window in the code below). This makes it a
classification problem. It can be extended to a more sophisticated time-series
estimate or thresholding scheme.
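
As a minimal sketch of this definition (not the notebook's own labeling code), the snippet below flags days whose absolute log return exceeds 1.5 times the volatility of the preceding 20 trading days. The function name label_events, the synthetic price series, and the choice to exclude the current day from the volatility window via shift(1) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def label_events(prices, window=20, threshold=1.5):
    """Return a 0/1 Series: 1 where |log return| > threshold * trailing volatility."""
    log_ret = np.log(prices).diff()
    # Volatility of the previous `window` returns; shift(1) leaves out today's
    # return (an assumption about what "preceding" means).
    trailing_vol = log_ret.rolling(window=window).std().shift(1)
    valid = trailing_vol.notna()
    events = (log_ret.abs() > threshold * trailing_vol).astype(int)
    # Drop the warm-up days that have no volatility estimate yet.
    return events[valid]


# Synthetic random-walk prices on business days (illustrative data only).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2016-01-04", periods=120)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=120))), index=dates)

labels = label_events(prices)
print(labels.value_counts())
```

Whether the current day belongs in the volatility window is a design choice; leaving it out keeps the threshold free of the very move being classified.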


### Load Data
After you've queried the database once (say, on your first run), set load_data to True so that you don't query it again.

In [3]:
load_data = False
gdelt_df = pd.read_csv('Gdelt_events_20160101_20171006.csv')
gdelt_df = gdelt_df.set_index('sqldate',drop=True).sort_index()

### Query Parameters

In [4]:
proj_id = 'capstone-v0'
start_date = '2016-01-01'
end_date = '2017-10-06'
ticker = '^GSPC'
my_query = query.query_tool(proj_id,start_date,end_date,ticker)
sql_query = """
            SELECT Actor1Name, Actor2Name, GoldsteinScale, NumMentions, sourceurl,
            sqldate, avgtone, numarticles, numsources
            FROM [gdelt-bq:full.events] 
            WHERE sqldate > 20160101 and sqldate <= 20171006  and 
                Actor1Geo_CountryCode like "%US%" and 
                Actor1Code like "%BUS%"
            """
if not load_data:
    my_query.query_gdelt(sql_query)
    my_query.gdelt_df = my_query.gdelt_df.set_index('sqldate',drop=True).sort_index()
    df = my_query.gdelt_df.copy(True)
    gdelt_df = df  # later cells read gdelt_df, so point it at the fresh query result
    my_query.save_gdelt_df('Gdelt_events_20160101_20171006')

### Creating Labels:
That is, if the S&P 500 moves up or down by more than 1.5 standard deviations, we consider that an "event".

In [5]:
spx_prices = my_query.query_yahoo()
spx_return = np.log(spx_prices['Adj Close']).diff()  # daily log return; first entry is NaN
spx_vol = spx_return.rolling(window=20).std().dropna()  # rolling 20-day volatility
spx_return = spx_return.loc[spx_vol.index]  # keep only dates that have a volatility estimate
# Absolute change in the log return from the previous day (not the return itself)
day_over_day_diff = np.abs(spx_return.diff())
# Event = change larger than 1.5x the rolling volatility; the leading NaN compares as False
event_idx = (spx_vol*1.5 < day_over_day_diff).astype(int).values

### Create Feature Vectors from GDELT

In [6]:
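# Daily totals of the numeric indicators; text columns (names, URLs) are excluded
collapsed = gdelt_df.groupby(by=gdelt_df.index).sum(numeric_only=True)
collapsed.index = pd.to_datetime(collapsed.index,format='%Y%m%d')
# Keep only the trading days that have a volatility estimate
collapsed = collapsed.reindex(spx_vol.index).dropna()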
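
The collapsing step can be sketched on a toy table to make the alignment with the labels explicit. This is an illustrative sketch only: the four event rows, the actor names, and the labels Series are made up, and intersecting the two indexes is one possible way to guarantee that features and labels cover exactly the same dates.

```python
import pandas as pd

# Toy event rows in the GDELT layout (illustrative values only).
events = pd.DataFrame({
    "sqldate": [20160104, 20160104, 20160105, 20160109],
    "Actor1Name": ["BANK", "COMPANY", "BANK", "INVESTOR"],
    "NumMentions": [3, 5, 2, 7],
    "avgtone": [1.0, -2.0, 0.5, 3.0],
})

# One row per day: sum the numeric indicators, skipping the text columns.
daily = events.groupby("sqldate").sum(numeric_only=True)
daily.index = pd.to_datetime(daily.index.astype(str), format="%Y%m%d")

# Hypothetical labels indexed by trading day.
labels = pd.Series([0, 1, 0],
                   index=pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06"]))

# Keep only the dates present on both sides so X and y line up row by row.
common = daily.index.intersection(labels.index).sort_values()
X = daily.loc[common]
y = labels.loc[common]
print(X)
print(y)
```

Here the Saturday row (2016-01-09) has no label and the 2016-01-06 label has no events, so both are dropped and two aligned rows remain.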

### Models : Gradient Boosted Trees / Support Vector Machine

Considerations:

1.) Imbalanced data, so we use SMOTE

2.) "Small" dataset, so we leverage the kernel SVM
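
To make the first consideration concrete, here is a simplified, NumPy-only illustration of the idea behind SMOTE: new minority samples are interpolated between a minority point and one of its nearest minority neighbors. It is not imblearn's implementation; the function name smote_like_oversample, the binary-label assumption, and the random data are hypothetical.

```python
import numpy as np


def smote_like_oversample(X, y, k=5, seed=0):
    """Balance a binary dataset by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_new = counts.max() - counts.min()
    X_min = X[y == minority]
    if n_new == 0 or len(X_min) < 2:
        return X, y
    k = min(k, len(X_min) - 1)
    # Pairwise distances among minority points; a point is not its own neighbor.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # Pick a base point, one of its k neighbors, and a random spot between them.
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[chosen] - X_min[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_new, minority)])
    return X_out, y_out


# Hypothetical 80/20 imbalanced data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)
X_res, y_res = smote_like_oversample(X, y)
print(np.bincount(y_res))
```

Resampling should touch the training data only. Placing SMOTE inside an imblearn pipeline, as the next cell does, has that effect because samplers are applied during fitting and skipped at prediction time.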

In [7]:
# Chronological split: shuffle=False keeps the test set after the training set in time
X_train,X_test,Y_train,Y_test = train_test_split(collapsed,event_idx,shuffle=False)
# SMOTE sits inside the imblearn pipeline, so resampling happens only while fitting
pipeline = imb_pipeline(PolynomialFeatures(3),StandardScaler(),SMOTE(),GradientBoostingClassifier())
param_grid = {'gradientboostingclassifier__n_estimators':[100,500,1000,3000],
              'gradientboostingclassifier__max_depth':[1,3,5]}
cv_gbc = GridSearchCV(pipeline,param_grid=param_grid)
cv_gbc.fit(X_train,Y_train)
pred = cv_gbc.predict(X_test)
acc = accuracy_score(Y_test,pred)
f1 = f1_score(Y_test,pred)
roc = roc_auc_score(Y_test,pred)  # computed from hard 0/1 predictions, not scores

print('GBC Accuracy: {}'.format(acc))
print('GBC F1: {}'.format(f1))
print('GBC RoC: {}'.format(roc))

# Same pipeline with an SVC (RBF kernel by default) in place of the boosted trees
pipeline = imb_pipeline(PolynomialFeatures(3),StandardScaler(),SMOTE(),SVC())
param_grid = {'svc__C':[0.5,1,100,500,1000]}
cv_svc = GridSearchCV(pipeline,param_grid=param_grid)
cv_svc.fit(X_train,Y_train)
pred = cv_svc.predict(X_test)
acc = accuracy_score(Y_test,pred)
f1 = f1_score(Y_test,pred)
roc = roc_auc_score(Y_test,pred)  # computed from hard 0/1 predictions, not scores

print('SVC Accuracy: {}'.format(acc))
print('SVC F1: {}'.format(f1))
print('SVC RoC: {}'.format(roc))



GBC Accuracy: 0.5887850467289719
GBC F1: 0.35294117647058815
GBC RoC: 0.5509756097560976
SVC Accuracy: 0.5514018691588785
SVC F1: 0.3142857142857143
SVC RoC: 0.5126829268292683


## Actual Comparison of prediction vs truth

In [10]:
compare = pd.DataFrame([pred,Y_test],index=['Pred','Truth'])
compare

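|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pred | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Truth | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |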


## Criticism

The results leave much to be desired. In fact, if we simply predicted the most probable class (0) we would get around 80% accuracy, but that would not be useful to us. Still, we should be happy with this result: it suggests that this is a challenging modeling problem that cannot be solved immediately with off-the-shelf methods. We will have to employ something cleverer. Next steps are voting ensembles, more feature engineering/selection/discovery, and a more sophisticated definition of "event". Our mentors have stated that they want to stay with traditional ML if possible and to use interpretable models.
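
The point about the majority-class baseline can be checked with a few lines of plain Python. This is a sketch on hypothetical labels with an 80/20 class split, not the notebook's actual test set; the helper names accuracy and f1_binary are made up for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """F1 score for the positive class; 0.0 when there is nothing to score."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels: 80 quiet days, 20 event days.
y_true = [0] * 80 + [1] * 20
always_zero = [0] * len(y_true)  # baseline that never predicts an event

print(accuracy(y_true, always_zero))   # 0.8
print(f1_binary(y_true, always_zero))  # 0.0
```

The baseline reaches 80% accuracy while catching no events at all, which is why F1 and ROC AUC are more informative than accuracy for this problem.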