# Prediction of EUR/USD FX with sentiment data

In this notebook we apply some simple machine learning techniques to try to predict the EUR/USD market price.

We have chronological data spanning 18 March 2016 to 23 September 2016 for both price and sentiment. We also have some linear and nonlinear transformations of these data, in particular simple moving averages and exponential moving averages.

The model we assume here is that the market price at time t follows a Normal distribution whose mean is a linear combination of the historical features mentioned above and whose variance is constant.

```
p[t] ~ N(intercept + alpha_1 * feature_1 + ... + alpha_k * feature_k, sigma)
```

We estimate the `intercept` and the `alpha_i` by maximizing the likelihood of those parameters on the selected training set. Under the Gaussian assumption above, this is equivalent to minimizing the sum of squared errors between observations and estimates: `|p[1] - p*[1]|^2 + ... + |p[N] - p*[N]|^2`, where `p*[t]` is the model's estimate of `p[t]` and `N` is the number of training observations. We assess the quality of the solution by computing the `r2` score (coefficient of determination) on a validation set that is distinct from the training set (it corresponds to the period immediately following the training set). The `r2` score measures the fraction of the variance of the target that is explained by the model.
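
As a small sanity check of the reasoning above, the sketch below fits the coefficients by least squares and computes `r2` by hand on held-out rows. The data are synthetic and every name in it (`true_alpha`, `X_train`, and so on) is made up for illustration; none of it comes from the notebook's data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real features: 200 rows, 3 features (illustrative only)
X = rng.normal(size=(200, 3))
true_alpha = np.array([0.5, -1.0, 2.0])
y = 1.0 + X @ true_alpha + rng.normal(scale=0.1, size=200)

# Chronological split: the first 150 rows train, the last 50 validate
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

# Least squares with an explicit intercept column; under Gaussian noise with
# constant variance this is also the maximum likelihood estimate
A_train = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
intercept, alpha = coef[0], coef[1:]

# r2 = 1 - (residual sum of squares) / (total sum of squares) on the validation rows
pred = intercept + X_val @ alpha
ss_res = np.sum((y_val - pred) ** 2)
ss_tot = np.sum((y_val - y_val.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(intercept, alpha, r2)
```

An `r2` of 1 means perfect prediction, 0 means no better than predicting the validation mean, and negative values mean worse than that.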

Such an optimal set of coefficients is unique if the covariance matrix of the features is nonsingular, which means that the features are linearly independent. In the present case, however, many features are either equal (we drop these to start with) or very similar, which leads to poor generalization on the validation set. We therefore perform some feature selection. Here we picked random feature bags, since the small size of the data set makes training fast. Another approach would be Lasso regularization (L1 norm), which favors sparse solutions: the coefficients of less meaningful features are set to 0.
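
The Lasso alternative mentioned above is not used in this notebook. The following is only a minimal sketch of how it could look with scikit-learn's `Lasso`, on synthetic data where just two of six features matter. The penalty strength `alpha=0.1` is an arbitrary choice for the sketch.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: only the first two of six features drive the target
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha is the L1 penalty strength (an arbitrary value for this sketch)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Indices of the features whose coefficient was not driven to exactly zero
selected = np.flatnonzero(lasso.coef_)
print(lasso.coef_)
print(selected)
```

The features that survive in `selected` play the role of one data-driven feature bag, at the cost of one more hyper-parameter to tune.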

We have used 3 variations of linear regression:
- Ordinary least squares: classical linear regression.
- Ridge regression: penalizes models whose coefficients are too large, in order to limit overfitting and yield better results on the validation set. This method needs one extra hyper-parameter that controls the regularization strength, named `alpha` in the code. From a Bayesian standpoint this amounts to assuming that the prior distribution of the coefficients is Normal with mean 0 and a variance inversely proportional to `alpha`.
- Bayesian regression: as implemented by scikit-learn's `BayesianRidge` (the class used below), this also places a zero-mean Normal prior on the coefficients, but the prior precision and the noise precision are given Gamma hyper-priors and are estimated from the data rather than fixed by hand.
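
To make the three variants concrete, here is a minimal sketch that fits each scikit-learn estimator on the same synthetic, deliberately correlated features and compares the validation `r2` and the size of the coefficient vector. The data, the split sizes and `alpha=1.0` are assumptions made for illustration only.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)

# Synthetic, deliberately correlated features (illustrative only)
base = rng.normal(size=(120, 1))
X = np.hstack([base + 0.05 * rng.normal(size=(120, 1)) for _ in range(4)])
y = X.sum(axis=1) + rng.normal(scale=0.2, size=120)

X_train, y_train = X[:100], y[:100]
X_val, y_val = X[100:], y[100:]

models = {
    "ols": linear_model.LinearRegression(),
    "ridge": linear_model.Ridge(alpha=1.0),   # alpha = regularization strength
    "bayesian": linear_model.BayesianRidge(),
}

scores = {}
coef_norms = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_val, y_val)        # r2 on the validation rows
    coef_norms[name] = np.linalg.norm(model.coef_)  # size of the coefficient vector

print(scores)
print(coef_norms)
```

With near-duplicate features the ridge coefficient vector is never longer than the OLS one, which is the shrinkage effect described in the list above.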

## Results

We carried out 31 trainings on 20-week training sets, each validated on the 2 weeks that follow it. Consecutive pairs of training/validation sets are offset by 1 day.
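
The windowing scheme can be sketched as a small generator. The helper name `sliding_windows` is hypothetical, and `n_rows=185` is an assumed row count chosen only because it reproduces the 31 windows reported here. The stopping rule mirrors the strict `<` comparison used in cell `In [12]`.

```python
def sliding_windows(n_rows, train_size, test_size, offset):
    """Yield (train_start, train_end, test_start, test_end) row positions.

    End positions are exclusive. Iteration continues only while test_end is
    strictly below n_rows, mirroring the strict `<` used in cell In [12].
    """
    start = 0
    while start + train_size + test_size < n_rows:
        train_end = start + train_size
        yield start, train_end, train_end, train_end + test_size
        start += offset


# n_rows=185 is an assumed row count, chosen only because it reproduces 31 windows
windows = list(sliding_windows(n_rows=185, train_size=20 * 7, test_size=2 * 7, offset=1))
print(len(windows))   # 31
print(windows[0])     # (0, 140, 140, 154)
print(windows[-1])    # (30, 170, 170, 184)
```

Because each validation window starts exactly where its training window ends, no validation row is ever seen during the corresponding fit.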

Combining sentiment data with historical price data yields good results compared to basic models that rely only on historical price data. At the end of this notebook, models are ranked according to their average performance (average r2) on all the 2

In [1]:
import pandas as pd 
import numpy as np
import functools
import dateparser

In [2]:
# Load Data Set (dates are stored as day-month-year, e.g. "%d-%m-%Y")
df = pd.read_csv("./data/EUR_USD.csv", parse_dates=[0], dayfirst=True)

In [3]:
# Some columns look the same...
identical_columns = {
    'pOpen' : ['pHigh', 'pLow', 'pClose'],
    'sentiment' : ['sHigh', 'sLow', 'sOpen', 'sClose', 'sBuzz', 'SumSentiment', 'AvgSentiment']
}
for ref, cols_to_check in identical_columns.items() :
    it = zip(cols_to_check, map(lambda x: (df[ref] == df[x]).all(), cols_to_check))
    for col, do_drop in it :
        if do_drop :
            print('removing ' + col + '(same as ' + ref + ')')
            df = df.drop([col], axis = 1)

# ... rename the only one left
df = df.rename(columns = {'pOpen': 'p'})

removing pHigh(same as pOpen)
removing pLow(same as pOpen)
removing pClose(same as pOpen)
removing sOpen(same as sentiment)
removing sClose(same as sentiment)
removing SumSentiment(same as sentiment)
removing AvgSentiment(same as sentiment)


In [4]:
# pVolume appears to be 0 on every row
if (df['pVolume'] == 0).all() :
    print('removing pVolume (all values are 0)')
    df = df.drop(['pVolume'], axis = 1)

removing pVolume (all values are 0)


In [5]:
# price values seem to be missing for some rows, fill them with previous values
pSma_cols = list(filter(lambda col: col.startswith('pSma'), df.columns))
pSma_cols.insert(0, 'p')

null_cols_to_concat = []
for col in pSma_cols:
    null_cols_to_concat.append(df[df[col] == 0.0][col])

null_pSma_df = pd.concat(null_cols_to_concat, axis = 1)

# If null_pSma_df contains any NaN, the zero rows do not line up across the
# selected columns, i.e. these columns are not all zero on the same rows
all_pSma_values_simultaneously_null = not null_pSma_df.isnull().values.any()

df0 = df.copy()
for col in pSma_cols:
    # Treat 0 as a missing value and carry the previous value forward
    df0[col] = df0[col].replace(0, np.nan).ffill()

In [6]:
# Add lagged features for the previous k - 1 days (range(1, k) yields lags 1..k-1)
k = 5
historical_columns = {}
feature_to_historize = ['p', 'sentiment']
for feat in feature_to_historize :
    for i in range(1, k) :
        hist_feat = df0[feat].shift(i)
        historical_columns[feat + '_minus_' + str(i)] = hist_feat

historical_df = pd.DataFrame(historical_columns)
df1 = pd.concat([df0, historical_df], axis = 1)

In [7]:
# Drop SMA_3: since k > 3, each SMA_3 is a linear combination of the lagged features
df1 = df1.drop(['pSma_3', 'sSma_3'], axis = 1)

In [8]:
# Dummy encoding for the 'match' feature
match_feature = df1['match'].apply(lambda boolean: float(boolean))
nomatch_feature = df1['match'].apply(lambda boolean: float(not boolean))
dummy_encoded_match_df = pd.DataFrame({'match': match_feature, 'nomatch': nomatch_feature})
df1 = df1.drop(['match'], axis = 1)
df1 = pd.concat([df1, dummy_encoded_match_df], axis = 1)

In [9]:
# Add p at day + 1 (which is the target variable we will want to predict)
label_df = pd.DataFrame({'p_plus_1': df1['p'].shift(-1)})
df1 = pd.concat([df1, label_df], axis = 1)
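
The lag features and the target both rely on `Series.shift`. As a toy illustration (the prices below are made up), a positive shift looks backward, a negative shift looks forward, and the edge rows end up with `NaN`, which is why the next cell calls `dropna()`.

```python
import pandas as pd

# Toy price series (made-up values)
toy = pd.DataFrame({'p': [1.10, 1.11, 1.12, 1.13, 1.14]})

toy['p_minus_1'] = toy['p'].shift(1)   # yesterday's price, NaN on the first row
toy['p_plus_1'] = toy['p'].shift(-1)   # tomorrow's price (the target), NaN on the last row

complete = toy.dropna()                # keeps only rows where both neighbours exist
print(complete)
```

Only the three middle rows survive, each pairing today's price with the previous and the next one.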

In [10]:
# Drop edge rows where historical values are not defined
df1 = df1.dropna()

In [11]:
df1.columns

Index(['date', 'p', 'pSma_w', 'pEma_w', 'pEma_3', 'pSma_7', 'pEma_7',
       'pSma_15', 'pEma_15', 'pSma_10', 'pEma_10', 'pSma_30', 'pEma_30',
       'pSma_60', 'pEma_60', 'pSma_90', 'pEma_90', 'pK', 'pD', 'pMacd',
       'pSignal', 'sHigh', 'sLow', 'sBuzz', 'sVolume', 'sentiment', 'sSma_w',
       'sEma_w', 'sEma_3', 'sSma_7', 'sEma_7', 'sSma_15', 'sEma_15', 'sSma_10',
       'sEma_10', 'sSma_30', 'sEma_30', 'sSma_60', 'sEma_60', 'sSma_90',
       'sEma_90', 'sK', 'sD', 'sMacd', 'sSignal', 'pMacd_Hist', 'sMacd_Hist',
       'p_minus_1', 'p_minus_2', 'p_minus_3', 'p_minus_4', 'sentiment_minus_1',
       'sentiment_minus_2', 'sentiment_minus_3', 'sentiment_minus_4', 'match',
       'nomatch', 'p_plus_1'],
      dtype='object')

In [12]:
# Select consecutive training and validation sets
train_set_size = 20 * 7 # 20 weeks
test_set_size = 2 * 7   # 2 weeks
sliding_offset = 1      # 1 day

train_sets = []
test_sets = []
train_set_start = 0
train_set_end = train_set_start + train_set_size
test_set_start = train_set_end
test_set_end = test_set_start + test_set_size

while test_set_end < len(df1.index):
    train_sets.append(df1[train_set_start:train_set_end])
    test_sets.append(df1[test_set_start:test_set_end])
    train_set_start += sliding_offset
    train_set_end += sliding_offset
    test_set_start += sliding_offset
    test_set_end += sliding_offset

In [13]:
len(train_sets)

31

In [14]:
len(test_sets)

31

In [15]:
# Set up 20 random bags of 10 features
available_features = df1.drop(['date', 'p_plus_1'], axis = 1).columns.values

random_feature_bags = {}
for i in range(20) :
    copy = available_features.copy()
    np.random.shuffle(copy)
    random_feature_bags['bag' + str(i)] = copy[:10]
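
The bags above are drawn without a fixed seed, so they change on every run. A possible variant, sketched here with NumPy's `default_rng` and a short hypothetical feature list standing in for `available_features`, makes the draw reproducible. The seed value 42 and the bag sizes are arbitrary.

```python
import numpy as np

# Hypothetical feature names standing in for `available_features`
features = np.array(['p', 'sentiment', 'pEma_3', 'sEma_3', 'pSma_7', 'sSma_7'])

rng = np.random.default_rng(42)  # fixed seed (arbitrary) so the bags can be regenerated

bags = {}
for i in range(3):
    # choice(..., replace=False) draws distinct features, like shuffle-then-slice
    bags['bag' + str(i)] = rng.choice(features, size=4, replace=False)

for name in sorted(bags):
    print(name, bags[name])
```

Re-creating the generator with the same seed yields the same bags, which makes the ranking at the end of the notebook repeatable.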

In [16]:
for bag in sorted(random_feature_bags.keys()):
    print(bag + ' :')
    print(random_feature_bags[bag])

bag0 :
['pEma_90' 'sentiment_minus_3' 'sLow' 'p_minus_1' 'pEma_3' 'sSma_60'
 'sEma_3' 'sSma_30' 'pK' 'pSma_60']
bag1 :
['nomatch' 'pSma_7' 'sSma_w' 'pD' 'sEma_w' 'pEma_15' 'p_minus_3' 'pEma_3'
 'p' 'sHigh']
bag10 :
['p_minus_4' 'sD' 'pSma_15' 'pEma_7' 'pMacd_Hist' 'sentiment' 'sSma_w'
 'pEma_w' 'sEma_w' 'sentiment_minus_1']
bag11 :
['pEma_7' 'p' 'sSma_60' 'pMacd' 'sSma_7' 'sSma_10' 'sentiment_minus_3'
 'sEma_7' 'pSma_w' 'pSma_60']
bag12 :
['sSma_90' 'pSma_7' 'pEma_3' 'sEma_7' 'pK' 'sEma_3' 'pSignal' 'sBuzz'
 'match' 'sSma_30']
bag13 :
['sHigh' 'sEma_90' 'sentiment_minus_1' 'pEma_90' 'sSignal' 'p_minus_2'
 'sVolume' 'sEma_30' 'sEma_7' 'sSma_30']
bag14 :
['sMacd_Hist' 'sEma_90' 'pEma_30' 'pEma_7' 'sVolume' 'pSma_w'
 'sentiment_minus_2' 'sSma_90' 'sEma_3' 'sentiment_minus_3']
bag15 :
['nomatch' 'pSma_w' 'sentiment' 'sMacd_Hist' 'pSignal' 'sEma_10' 'pSma_7'
 'p_minus_3' 'sVolume' 'sSma_15']
bag16 :
['sEma_15' 'sentiment_minus_1' 'sentiment_minus_2' 'pSma_60' 'p_minus_1'
 'sD' 'p' 'pEma_10'

In [17]:
# Test some basic bags of features as well along the way
random_feature_bags['baseBag0'] = ['p']
random_feature_bags['baseBag1'] = ['sentiment']
random_feature_bags['baseBag2'] = ['p', 'p_minus_1']
random_feature_bags['baseBag3'] = ['p', 'sentiment']
random_feature_bags['baseBag4'] = ['p', 'p_minus_1', 'sentiment']
random_feature_bags['baseBag5'] = ['p', 'p_minus_1', 'sBuzz', 'sentiment']
random_feature_bags['baseBag6'] = ['p', 'p_minus_1', 'sBuzz', 'sentiment', 'sVolume']

In [18]:
# Evaluate every feature bag with linear regression, Bayesian ridge regression and ridge regression
from sklearn import linear_model

ols = linear_model.LinearRegression()
bayesian = linear_model.BayesianRidge()
ridge = linear_model.Ridge()

def sliding_r2s(model, bag):
    # Fit on each training window and score (r2) on the validation window that follows it
    r2s = []
    for train_df, test_df in zip(train_sets, test_sets):
        # .values replaces DataFrame.as_matrix(), which was removed in pandas 1.0
        model.fit(train_df[bag].values, train_df['p_plus_1'].values)
        r2s.append(model.score(test_df[bag].values, test_df['p_plus_1'].values))
    return r2s

alphas = [0.0001, 0.001, 0.01, 0.1, 1.0, 20.0, 100.0, 10000.0]
res = {}

for key, bag in random_feature_bags.items():
    res[key] = {}
    res[key]["ols"] = {"avg_r2": np.average(sliding_r2s(ols, bag))}
    res[key]["bayesian"] = {"avg_r2": np.average(sliding_r2s(bayesian, bag))}

    # Ridge: keep the alpha with the best average r2 over the validation windows.
    # alpha is picked on the same windows used for scoring, so this r2 is optimistic.
    avg_r2s = []
    for a in alphas:
        ridge.set_params(alpha=a)
        avg_r2s.append(np.average(sliding_r2s(ridge, bag)))

    best_alpha_index = int(np.argmax(avg_r2s))
    res[key]["ridge"] = {
        "avg_r2": avg_r2s[best_alpha_index],
        "best_alpha": alphas[best_alpha_index],
    }



In [19]:
for bag in sorted(random_feature_bags.keys()):
    print(bag + ' :')
    for model in res[bag]:
        print("\t" + str(model) + ' :')
        print("\t" + str(res[bag][model]))

bag0 :
	ridge :
	{'avg_r2': 0.67677018221609264, 'best_alpha': 0.0001}
	bayesian :
	{'avg_r2': 0.67248447912815812}
	ols :
	{'avg_r2': 0.65255133696556811}
bag1 :
	ridge :
	{'avg_r2': 0.64884838165243131, 'best_alpha': 0.0001}
	bayesian :
	{'avg_r2': 0.66119237582282697}
	ols :
	{'avg_r2': 0.51069346712102037}
bag10 :
	ridge :
	{'avg_r2': 0.53767498498467015, 'best_alpha': 0.0001}
	bayesian :
	{'avg_r2': 0.69343128357595962}
	ols :
	{'avg_r2': 0.68329942899647744}
bag11 :
	ridge :
	{'avg_r2': 0.6346855732570591, 'best_alpha': 0.001}
	bayesian :
	{'avg_r2': 0.62937572433744227}
	ols :
	{'avg_r2': 0.57814072928922677}
bag12 :
	ridge :
	{'avg_r2': 0.71441742628924731, 'best_alpha': 0.001}
	bayesian :
	{'avg_r2': 0.67287537566139033}
	ols :
	{'avg_r2': 0.61951756437377881}
bag13 :
	ridge :
	{'avg_r2': -0.66815322424987655, 'best_alpha': 0.1}
	bayesian :
	{'avg_r2': -1.0121726888995655}
	ols :
	{'avg_r2': -1.7157402459543285}
bag14 :
	ridge :
	{'avg_r2': 0.51207574839296166, 'best_alpha': 0

In [20]:
for key, modelDict in res.items():
    best_model = "ols"
    best_r2 = modelDict[best_model]["avg_r2"]
    if modelDict["ridge"]["avg_r2"] > best_r2:
        best_model = "ridge"
        best_r2 = modelDict["ridge"]["avg_r2"]
    if modelDict["bayesian"]["avg_r2"] > best_r2:
        best_model = "bayesian"
        best_r2 = modelDict["bayesian"]["avg_r2"]
    res[key]["best_model"] = best_model

In [21]:
# Rank bags by the average r2 of their best model, highest first.
# Sorting the bag names directly avoids losing a bag when two bags share the same r2.
best_bags = sorted(
    res.keys(),
    key=lambda bag: res[bag][res[bag]["best_model"]]["avg_r2"],
    reverse=True,
)

In [22]:
best_bags

['bag17',
 'bag6',
 'bag12',
 'bag18',
 'bag10',
 'bag19',
 'bag0',
 'bag5',
 'bag1',
 'bag11',
 'bag7',
 'bag3',
 'bag15',
 'bag14',
 'bag16',
 'baseBag0',
 'baseBag3',
 'baseBag2',
 'baseBag4',
 'baseBag6',
 'baseBag5',
 'bag2',
 'bag8',
 'bag4',
 'bag13',
 'bag9',
 'baseBag1']

In [23]:
for bag in best_bags[:10]:
    print(bag + ' :')
    print('features = ' + str(random_feature_bags[bag]))
    print('r2 = ' + str(res[bag][res[bag]["best_model"]]["avg_r2"]))
    print('model = ' + str(res[bag]["best_model"]))
    if res[bag]["best_model"] == "ridge":
        print('alpha = ' + str(res[bag]["ridge"]["best_alpha"]))

bag17 :
features = ['sMacd_Hist' 'match' 'pSignal' 'pSma_60' 'pSma_10' 'nomatch' 'pEma_3'
 'pEma_w' 'pEma_30' 'pEma_15']
r2 = 0.742498195216
model = ridge
alpha = 0.0001
bag6 :
features = ['sEma_w' 'sMacd' 'match' 'pEma_90' 'sSma_15' 'sEma_30' 'pEma_3' 'sEma_10'
 'pSignal' 'sEma_60']
r2 = 0.715620620965
model = ridge
alpha = 0.001
bag12 :
features = ['sSma_90' 'pSma_7' 'pEma_3' 'sEma_7' 'pK' 'sEma_3' 'pSignal' 'sBuzz'
 'match' 'sSma_30']
r2 = 0.714417426289
model = ridge
alpha = 0.001
bag18 :
features = ['pD' 'pSma_10' 'pEma_3' 'pEma_10' 'sEma_30' 'sentiment_minus_1'
 'p_minus_3' 'pEma_90' 'pMacd_Hist' 'p_minus_2']
r2 = 0.699764348777
model = ridge
alpha = 0.0001
bag10 :
features = ['p_minus_4' 'sD' 'pSma_15' 'pEma_7' 'pMacd_Hist' 'sentiment' 'sSma_w'
 'pEma_w' 'sEma_w' 'sentiment_minus_1']
r2 = 0.693431283576
model = bayesian
bag19 :
features = ['pSma_15' 'pSignal' 'p_minus_3' 'p_minus_4' 'nomatch' 'p_minus_2'
 'sSma_60' 'pEma_3' 'sSma_90' 'p']
r2 = 0.685410979198
model = ridge
alpha 