Now that our two classifiers are ready to predict which stocks will see a 5% price increase tomorrow, let's test them!

An interesting question is whether the models can only predict price increases for the stocks they were trained on, or whether they generalize to other stocks as-is, without being re-trained on a dataset that includes each new stock's historical data.

In [1]:
!pip install -r requirements.txt



In [2]:
import utils
import pandas as pd
import numpy as np
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, fbeta_score
from sklearn.preprocessing import MinMaxScaler

In [3]:
# Load RF model
with open('randomforest-clf.pickle', 'rb') as f:
    rf_model = pickle.load(f)

In [4]:
# Load KNN model
with open('kneighbors-clf.pickle', 'rb') as f:
    kn_model = pickle.load(f)

Let's build a new test dataset, exclusively made of stocks absent from the initial training and testing sets:

In [5]:
symbol_list = ['HO.PA', 'ALCAR.PA']

X_df = pd.DataFrame()
y_df = pd.DataFrame()

for symbol in symbol_list:
    symbol_X_df = utils.get_stock_feature_dataset(symbol)
    symbol_X_df, symbol_y_df = utils.make_labels_dataset(symbol_X_df)

    # reset index since dates are not required for classification
    X_df = pd.concat([X_df, symbol_X_df.reset_index(drop=True)], ignore_index=True)
    y_df = pd.concat([y_df, symbol_y_df.reset_index(drop=True)], ignore_index=True)
    print('Done processing {}! new X_df shape: {}, new y_df shape: {}'.format(symbol, X_df.shape, y_df.shape))

X_df = X_df.astype(float)
X_df.replace(np.inf, np.nan, inplace=True)
X_df.replace(-np.inf, np.nan, inplace=True)
X_df.interpolate(axis=0, limit_direction='both', inplace=True)

print('Check number of NaNs in X_df: {}, and in y_df: {}'.format(X_df.isna().sum().sum(), y_df.isna().sum().sum()))

print('New testing set contains {:.2f}% records labeled as 1'.format(y_df.values.sum()/y_df.shape[0] * 100))

# Scale all values to have the same range:
X_scaler = MinMaxScaler().fit(X_df.values)
X_scaled = X_scaler.transform(X_df.values)
y_true = y_df.values.reshape(-1).astype(float)

  dip[i] = 100 * (self._dip[i]/self._trs[i])
  din[i] = 100 * (self._din[i]/self._trs[i])


Done processing HO.PA! new X_df shape: (5273, 86), new y_df shape: (5273, 1)
Done processing ALCAR.PA! new X_df shape: (7329, 86), new y_df shape: (7329, 1)
Check number of NaNs in X_df: 0, and in y_df: 0
New testing set contains 1.62% records labeled as 1
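
One caveat about the scaling step in the cell above: it fits a fresh `MinMaxScaler` on the new test set. If the models were trained on features scaled by a scaler fit on the training data (an assumption, since the training notebook is not shown here), the same raw value can map to very different scaled values in the two sets. The sketch below illustrates the effect with made-up prices and two hypothetical pure-Python helpers, `min_max_fit` and `min_max_transform`; it is not the notebook's code.

```python
def min_max_fit(values):
    """Return the (min, max) pair that a min-max scaler would learn."""
    return min(values), max(values)


def min_max_transform(values, params):
    """Scale values using a previously learned (min, max) pair."""
    lo, hi = params
    return [(v - lo) / (hi - lo) for v in values]


# Made-up closing prices, for illustration only
train_prices = [10.0, 20.0, 30.0]
new_prices = [100.0, 150.0, 200.0]

# Scaler refit on the new data: values are spread over [0, 1]
refit = min_max_transform(new_prices, min_max_fit(new_prices))

# Scaler learned on the training data: values fall far outside [0, 1]
reused = min_max_transform(new_prices, min_max_fit(train_prices))

print(refit)   # [0.0, 0.5, 1.0]
print(reused)  # [4.5, 7.0, 9.5]
```

If the training-time scaler was kept, pickling it alongside the models and calling its `transform` here would put both sets on the same scale.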


In [6]:
y_pred = kn_model.predict(X_scaled)

In [7]:
print('Results for KNN model:')
print('\taccuracy: {:.2f}%'.format(accuracy_score(y_true, y_pred) * 100))
print('\tprecision: {:.2f}%'.format(precision_score(y_true, y_pred) * 100))
print('\tfbeta: {:.3f}'.format(fbeta_score(y_true, y_pred, beta=0.5)))

Results for KNN model:
	accuracy: 98.01%
	precision: 11.43%
	fbeta: 0.077


In [8]:
y_pred = rf_model.predict(X_scaled)

In [9]:
print('Results for RF model:')
print('\taccuracy: {:.2f}%'.format(accuracy_score(y_true, y_pred) * 100))
print('\tprecision: {:.2f}%'.format(precision_score(y_true, y_pred) * 100))
print('\tfbeta: {:.3f}'.format(fbeta_score(y_true, y_pred, beta=0.5)))

Results for RF model:
	accuracy: 98.38%
	precision: 0.00%
	fbeta: 0.000


  _warn_prf(average, modifier, msg_start, len(result))


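Neither model is good at predicting increases for stocks outside its training set. The high accuracy is misleading: only 1.62% of the records are labeled 1, so a model that always predicts "no increase" would score 98.38% accuracy, which is exactly the RF result. Its precision and F-beta of 0 are consistent with the RF model never predicting an increase, and KNN's 11.43% precision means that only about one in nine of the increases it predicted actually happened.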
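
To make the printed scores concrete, here is a small sketch that recomputes the three metrics from confusion-matrix counts. The counts are not printed by the notebook: they are inferred as one combination consistent with the reported 7329 records, the 1.62% share of positives and the KNN and RF scores, so treat them as an assumption.

```python
def metrics(tp, fp, fn, tn, beta=0.5):
    """Accuracy, precision and F-beta from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision is undefined with no positive prediction; use 0.0 in that case
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    fbeta = (1 + b2) * tp / denom if denom else 0.0
    return accuracy, precision, fbeta


# Assumed counts: 119 positives out of 7329 records (about 1.62%)
knn = metrics(tp=4, fp=31, fn=115, tn=7179)
rf = metrics(tp=0, fp=0, fn=119, tn=7210)  # a model that never predicts 1

for name, (acc, prec, fb) in [('KNN', knn), ('RF', rf)]:
    print('{}: accuracy {:.2f}%, precision {:.2f}%, fbeta {:.3f}'.format(
        name, acc * 100, prec * 100, fb))
```

With `beta=0.5`, false positives weigh more than false negatives, which suits a setting where acting on a wrong "increase" signal is the costly mistake.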

Let's now check the results on the stocks the KNN and RF models were trained on, and see whether their predictions come true. To do so, I will have them predict for every stock in the training set, then check whether an increase actually occurred on the following day.

I start by building the prediction set:

In [10]:
training_stock_list = ['AI.PA', 'SAF.PA', 'GNFT.PA', 'ALNOV.PA', 'FDJ.PA', 'ETL.PA', 'DBV.PA',
                       'BN.PA', 'KER.PA', 'AIR.PA', 'ENGI.PA', 'FP.PA', 'DG.PA', 'VIV.PA',
                       'UG.PA', 'SU.PA', 'VIE.PA', 'ALPHA.PA', 'ALBIO.PA', 'CRI.PA', 'ALERS.PA']
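
Since the earlier generalization test relies on `HO.PA` and `ALCAR.PA` being absent from this list, a quick set intersection can confirm it. The sketch repeats both lists so that it runs on its own; the variable names `test_symbols`, `training_symbols` and `overlap` are introduced here for illustration.

```python
test_symbols = ['HO.PA', 'ALCAR.PA']
training_symbols = ['AI.PA', 'SAF.PA', 'GNFT.PA', 'ALNOV.PA', 'FDJ.PA', 'ETL.PA', 'DBV.PA',
                    'BN.PA', 'KER.PA', 'AIR.PA', 'ENGI.PA', 'FP.PA', 'DG.PA', 'VIV.PA',
                    'UG.PA', 'SU.PA', 'VIE.PA', 'ALPHA.PA', 'ALBIO.PA', 'CRI.PA', 'ALERS.PA']

# Symbols that appear in both lists; an empty set means the test is truly out-of-sample
overlap = set(test_symbols) & set(training_symbols)
print('Symbols present in both sets:', sorted(overlap))
```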

In [11]:
pred_X = pd.DataFrame()

for symbol in training_stock_list:
    # Download data for today, July 8th 2020
    X_df = utils.get_stock_feature_dataset(symbol, 1594195200, 1594231200)
    pred_X = pd.concat([pred_X, X_df.reset_index(drop=True)], ignore_index=True)

pred_X = pred_X.astype(float)
pred_X.replace(np.inf, np.nan, inplace=True)
pred_X.replace(-np.inf, np.nan, inplace=True)
pred_X.interpolate(axis=0, limit_direction='both', inplace=True)

print('Check number of NaNs in pred_X: {}'.format(pred_X.isna().sum().sum()))

# Scale all values to have the same range:
X_scaler = MinMaxScaler().fit(pred_X.values)
pred_X_scaled = X_scaler.transform(pred_X.values)

IndexError: index 9 is out of bounds for axis 0 with size 1
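
The traceback is not shown, so the cause of this `IndexError` is uncertain. One plausible explanation (an assumption) is that some indicators need several past rows, while a one-day window returns a single row; the `pred_X.head()` output below shows only raw price columns, which suggests the features were not computed. Under that assumption, a possible workaround is to request a longer window ending on the target day and keep only the last row. The helper `window_with_lookback` and its 120-day default are hypothetical.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 24 * 60 * 60


def window_with_lookback(day_start_ts, day_end_ts, lookback_days=120):
    """Extend a one-day window backwards so indicators have history to work with."""
    return day_start_ts - lookback_days * SECONDS_PER_DAY, day_end_ts


# Same epoch timestamps as in the cell above (July 8th 2020)
start_ts, end_ts = window_with_lookback(1594195200, 1594231200)

# Print the resulting window as UTC dates for a quick visual check
print(datetime.fromtimestamp(start_ts, tz=timezone.utc).date(),
      datetime.fromtimestamp(end_ts, tz=timezone.utc).date())
```

The returned timestamps could then be passed to `utils.get_stock_feature_dataset(symbol, start_ts, end_ts)`, keeping only the last row (for example with `.tail(1)`) as the prediction input.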

In [41]:
pred_X.head()

   Open        High        Low         Close      Adj Close  Volume
0  131.100006  133.449997  130.449997  132.75     132.75     564014.0
1  90.099998   90.919998   88.599998   90.160004  90.160004  505772.0
2  4.912       5.03        4.83        4.974      4.974      372150.0
3  3.21        3.21        3.06        3.11       3.11       834793.0
4  28.190001   28.219999   27.76       27.99      27.99      113035.0


In [None]:
# Perform predictions
kn_pred_y = kn_model.predict(pred_X_scaled)
rf_pred_y = rf_model.predict(pred_X_scaled)

In [None]:
# Save KNN preds to a file (NumPy .npy format, despite the .pickle extension)
with open('kneighbors-preds.pickle', 'wb') as outfile:
    np.save(outfile, kn_pred_y)

# Same for RF predictions
with open('randomforest-preds.pickle', 'wb') as outfile:
    np.save(outfile, rf_pred_y)

It is now July 9th 2020 and the market has closed, so let's check yesterday's predictions!

In [None]:
# Construct the true target dataset:

y_true = pd.DataFrame()

for symbol in training_stock_list:
    # Download data for July 8th pre-open and 9th post-close
    X_df = utils.get_stock_feature_dataset(symbol, 1594195200, 1594317600)
    X_df, y_df = utils.make_labels_dataset(X_df)
    y_true = pd.concat([y_true, y_df.reset_index(drop=True)], ignore_index=True)

# Flatten the labels accumulated for all symbols (not just the last y_df)
y_true = y_true.values.reshape(-1).astype(float)
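
`utils.make_labels_dataset` is not shown in this notebook. As a rough illustration of the kind of label being checked, the sketch below marks a day with 1 when the next day's close is at least 5% higher. Both the close-to-close comparison and the function name `next_day_increase_labels` are assumptions; the real helper may use a different price or threshold rule.

```python
def next_day_increase_labels(closes, threshold=0.05):
    """Label day i with 1 if close[i + 1] >= close[i] * (1 + threshold).

    The last day has no following day, so it gets no label.
    """
    return [
        1 if nxt >= cur * (1 + threshold) else 0
        for cur, nxt in zip(closes, closes[1:])
    ]


# Made-up closing prices, for illustration only
closes = [100.0, 106.0, 107.0, 120.0]
labels = next_day_increase_labels(closes)
print(labels)  # [1, 0, 1]
```

Under such a rule, two rows per symbol (July 8th and 9th) would yield a single label per stock, which is what gets compared with yesterday's prediction.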

In [None]:
# Load predictions made yesterday
with open('kneighbors-preds.pickle', 'rb') as infile:
    kn_pred_y = np.load(infile)
with open('randomforest-preds.pickle', 'rb') as infile:
    rf_pred_y = np.load(infile)

In [None]:
print('Results for KNN model:')
print('\taccuracy: {:.2f}%'.format(accuracy_score(y_true, kn_pred_y) * 100))
print('\tprecision: {:.2f}%'.format(precision_score(y_true, kn_pred_y) * 100))
print('\tfbeta: {:.3f}'.format(fbeta_score(y_true, kn_pred_y, beta=0.5)))

In [None]:
print('Results for RF model:')
print('\taccuracy: {:.2f}%'.format(accuracy_score(y_true, rf_pred_y) * 100))
print('\tprecision: {:.2f}%'.format(precision_score(y_true, rf_pred_y) * 100))
print('\tfbeta: {:.3f}'.format(fbeta_score(y_true, rf_pred_y, beta=0.5)))