# Exercise 2
Using your selected stock or market index and your decision criterion (binary or ternary) on the next day's daily return or on the trend (daily return after 5 or 10 days), can you generate a correlation visualization of the volume and the moving average (with a period of 5, 10, 20, 50, or 200)?

In [1]:
import pandas as pd

pd.set_option('display.max_colwidth', None)
IBM = pd.read_csv('../Data/IBM.txt.zst', delimiter=' ', index_col='Date')
IBM.head(5)

Date,Open,High,Low,Close,Volume,Adjusted
2007-01-03,97.18,98.400002,96.260002,97.269997,9196800,63.127567
2007-01-04,97.25,98.790001,96.879997,98.309998,10524500,63.802544
2007-01-05,97.599998,97.949997,96.910004,97.419998,7221300,63.22493
2007-01-08,98.5,99.5,98.349998,98.900002,10340000,64.185463
2007-01-09,99.080002,100.330002,99.07,100.07,11108200,64.944771


In [2]:
IBM.describe(include='all')

Statistic,Open,High,Low,Close,Volume,Adjusted
count,3692.0,3692.0,3692.0,3692.0,3692.0,3692.0
mean,147.198976,148.40318,146.074512,147.269085,5576247.0,113.375997
std,30.808862,30.780935,30.844083,30.796603,3229710.0,25.410689
min,72.739998,76.980003,69.5,71.739998,1027500.0,48.040176
25%,124.189999,125.349998,123.072502,124.332496,3482300.0,90.296511
50%,144.75,145.619995,143.830002,144.730003,4653750.0,120.980629
75%,168.8475,170.442501,167.659996,169.169998,6660725.0,132.155647
max,215.380005,215.899994,214.300003,215.800003,38063500.0,155.979538


First, we compute the `Daily Returns` according to the formula:
$r_{t+1} = \frac{p_{t+1} - p_t}{p_t}$

In [None]:
import numpy as np
import matplotlib.pyplot as plt

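# Return of each day relative to the previous day; the first day has no predecessor.
daily_returns = np.empty(IBM['Close'].shape)
daily_returns[0] = float('NaN')
daily_returns[1:] = np.ediff1d(IBM['Close']) / IBM['Close'].to_numpy()[:-1]
IBM['Daily Returns'] = daily_returns

# Histogram of the returns, leaving out the leading NaN.
num_bins = int(len(daily_returns) / 32)
plt.hist(daily_returns[1:], bins=num_bins)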
plt.show()
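
As a quick sanity check of the formula, the sketch below computes the returns for a few made-up prices (the `prices` values are illustrative, not IBM data) and compares the manual calculation with pandas' `Series.pct_change`, which should give the same numbers.

```python
import numpy as np
import pandas as pd

# Illustrative prices, not taken from the IBM data set.
prices = pd.Series([100.0, 102.0, 99.96, 99.96])

# Manual version of r_{t+1} = (p_{t+1} - p_t) / p_t.
manual = prices.diff() / prices.shift(1)

# Built-in equivalent; the first entry is NaN because there is no previous price.
builtin = prices.pct_change()

print(manual.round(4).tolist())
print(builtin.round(4).tolist())
```

Both variants leave the first entry undefined, which is why the notebook sets `daily_returns[0]` to `NaN`.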

Next, we compute the `Binary Decision` (up/down) and the `Ternary Decision` (up/flat/down).

In [None]:
binary_decision = (daily_returns > 0).astype(int)
IBM['Binary Decision'] = binary_decision

CUTOFF = 0.005
# Default to flat (1); returns above +CUTOFF are up (2), returns below -CUTOFF are down (0).
ternary_decision = np.full(shape=daily_returns.shape, fill_value=1)
ternary_decision[daily_returns > CUTOFF] = 2
ternary_decision[daily_returns < -CUTOFF] = 0
IBM['Ternary Decision'] = ternary_decision
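
One way to read the ternary rule is as a symmetric band around zero: returns above `+CUTOFF` count as up (2), returns below `-CUTOFF` as down (0), and everything in between as flat (1). The sketch below assumes that this symmetric band is what is intended and uses made-up returns rather than IBM data.

```python
import numpy as np

CUTOFF = 0.005  # same threshold value as in the notebook

# Illustrative returns, not IBM data.
returns = np.array([0.012, 0.003, -0.001, -0.02, 0.0])

# Start with "flat" (1) and overwrite the entries that fall outside the band.
labels = np.full(returns.shape, 1)
labels[returns > CUTOFF] = 2    # up
labels[returns < -CUTOFF] = 0   # down

print(labels.tolist())  # [2, 1, 1, 0, 1]
```

Note that comparisons with `NaN` are always false, so an undefined return would keep the default label `1` under this scheme.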

We also compute the simple moving average (SMA) of the closing price for each period.

In [None]:
sma_features = []
for period in [5, 10, 20, 50, 200]:
    label = 'SMA-{}'.format(period)
    sma_features.append(label)

    IBM[label] = IBM['Close'].rolling(period).mean()
    IBM[['Close', label]].plot(label=label, figsize=(9, 3), xlabel='days', ylabel='price')
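
To illustrate what `rolling(period).mean()` returns, here is a small sketch on made-up closing prices with a window of 3 (neither the values nor the window come from the IBM data). The first `period - 1` entries are `NaN` because the window is not yet full.

```python
import pandas as pd

# Illustrative closing prices, not IBM data.
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# Simple moving average over the current value and the two before it.
sma3 = close.rolling(3).mean()

print(sma3.tolist())  # [nan, nan, 11.0, 12.0, 13.0]
```

For `SMA-200` this means the first 199 rows of the column are undefined.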

We replace all `NaN` with `0`.

In [None]:
IBM.fillna(0, inplace=True)
IBM.tail(5)

## (a) Select the two features that have the most significant correlation to the target feature, daily return.
We use the `Ternary Decision` as the target feature.

In [None]:
all_features = ['Ternary Decision', 'Volume'] + sma_features
correlation = IBM[all_features].corr()['Ternary Decision'].abs().sort_values(ascending=False)
del correlation['Ternary Decision']
correlation

The two most strongly correlated features are:

In [None]:
ms_features = list(correlation.index[0:2])
ms_features
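
The ranking takes the absolute value of the correlation, so a strongly negative relationship counts as much as a strongly positive one. The sketch below shows this on a toy frame; the column names `target`, `feature_a`, `feature_b`, and `feature_c` are hypothetical and the values are made up.

```python
import pandas as pd

# Toy data: feature_a moves with the target, feature_b moves against it,
# feature_c is only loosely related.
toy = pd.DataFrame({
    'target':    [0, 1, 2, 1, 0, 2],
    'feature_a': [1, 2, 3, 2, 1, 3],
    'feature_b': [3, 2, 1, 2, 3, 1],
    'feature_c': [5, 1, 4, 2, 3, 6],
})

# Pearson correlation with the target, ranked by magnitude.
ranking = toy.corr()['target'].drop('target').abs().sort_values(ascending=False)
top_two = list(ranking.index[:2])

print(ranking)
print(top_two)
```

Here `feature_b` ranks alongside `feature_a` even though its raw correlation with the target is negative.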

## (b) Using a Naive Bayes classifier and the two most significant features, predict the daily return.
You can learn on all days except the last 100 (that will be used as the test set).

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

TEST_SIZE = 100  # the last 100 days form the test set


def bayes_accuracy(X_idx):
    X = IBM[X_idx]
    y = IBM['Ternary Decision']

    # Chronological split: train on everything except the last TEST_SIZE days.
    X_train, X_test = X.iloc[:-TEST_SIZE], X.iloc[-TEST_SIZE:]
    y_train, y_test = y.iloc[:-TEST_SIZE], y.iloc[-TEST_SIZE:]

    clf = GaussianNB()
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    return accuracy_score(y_test, y_pred)
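
The exercise text asks for the last 100 days to be held out as the test set. A minimal sketch of such a chronological split, using a hypothetical helper `split_last_n` and toy data in place of the IBM frame, might look like this:

```python
import pandas as pd


def split_last_n(frame, n_test):
    """Hypothetical helper: keep the row order and use the last n_test rows as the test set."""
    return frame.iloc[:-n_test], frame.iloc[-n_test:]


# Toy frame with a daily date index, standing in for the stock data.
toy = pd.DataFrame({'x': range(10)}, index=pd.date_range('2021-01-01', periods=10))

train, test = split_last_n(toy, n_test=3)
print(len(train), len(test))  # 7 3
```

With this kind of split, every training row precedes every test row, so the classifier is never evaluated on days that lie before its training data.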

In [None]:
ms_accuracies = pd.DataFrame(
    data=[[ms_features, bayes_accuracy(ms_features)]],
    columns=['Features', 'Accuracy'],
)

print('Accuracy using the two most correlated features:')
ms_accuracies

In [None]:
all_accuracies = pd.DataFrame(
    data=[['All', bayes_accuracy(X_idx=all_features)]],
    columns=['Features', 'Accuracy'],
)

print('Accuracy of using all features:')
all_accuracies