## Feature Selection and ML to predict stock price movement

To anticipate the stock market is impossible, by all means today. Stock market (or perhaps any other market where irrationality supercedes rationalitity) is the most whimsical, unpredictable field. Even the most advanced machine learning or deep learning techniques have failed to bring consistent sum to gold miners.

However, in a vain attempt to build a reliable, or at least a surviving mechanism that barely loses in the market, I have decided to derive several features, those of particular stock and those of commodities and indices that investors often refer to measure the sentiment of the market.

I will be using ML tools to build a binary classifier, which tells the investors to either buy or sell 5 days prior to the next opening day. After that, we will see in the back testing to find out if the model has any usage. 

We will be testing with random forest classifier and neural network. 

## What features are we using?

There are various, in fact hundreds of possible indicators. The following are examples of indicators we are going to use.

- % Price change from previous trading day
- Trade Volume
- Moving Averages 
- MACD
- Bollinger Bands



In [7]:
# Import necessary packages
%matplotlib inline
import pandas as pd
import numpy as np
import sys
import itertools
import re
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import yfinance as yf
from finta import TA
import ppscore as pps

import torch
import torch.nn.functional as F
import torch.utils.data as data_utils
import torch.optim as optim

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_curve, accuracy_score
from sklearn.decomposition import PCA
from xgboost import XGBClassifier


import warnings
warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (12,6)


In [15]:
def df_caller(name):
    """
    Fetches basic info about the stock
    
    :param 1 name: ticker symbol
    :return: dataframe including volume, closing price, opening price, high price (during the day), low price
    """
    df = yf.download(name)
    if len(df) == 0:
        raise NameError
    
    # Into lower case
    df.rename(columns={'Volume':'volume','Close':'close','Open':'open','High':'high','Low':'low'}, inplace=True)
    
    # Fill any NaN values with prior 
    df.dropna(axis=0, inplace=True)
    
    """
    Add column that shows whether closing price will go up or down the next trading day - serving as our target variable
    """
    df['Target'] = 0
    change_pct = (df.shift(-1)['close'] - df['close']) / df['close']
    df['Target'][change_pct > 0] = 1
    df['Target'][change_pct <= 0] = 0
    
    df = df.round(3)
    
    
    def finta(df):
        """
        Finta is a package that calls financial indicators for each ticker symbol.
        For more info, refer to https://github.com/peerchemist/finta

        Fetches +80 indicators from the given ticker symbol
        """

        finta_df = pd.DataFrame()
        indicators = dir(TA)[0:85]

        # These indicators are either null or non-usable. Remove them from the list
        n_a = ['ALMA','MAMA','FRAMA','LWMA','VIDYA','SWI','TMF','VR','QSTICK','STC','OBV']
        indicators = [i for i in indicators if i not in n_a]

        for ind in indicators:
            series = eval("TA."+ind+"(df)")
            ind_df = pd.DataFrame(series)

            # When the indicator has more than one column
            if len(ind_df.columns) > 1:
                for col in ind_df:
                    # Include only the indicator that has more than 90% of the rows filled
                    if ind_df[col].count()/len(series) > 0.9:
                        finta_df[ind+"_"+ind_df[col].name] = ind_df[col]
            else: 
                if ind_df.count().values[0]/len(ind_df) > 0.9:
                    finta_df[ind] = series

        # Replace all infinity values with NaN
        finta_df.replace([np.inf, -np.inf], np.nan, inplace=True)

        return finta_df
 
    df_finta = finta(df)
    
    # The date indices of df & df_finta must match
    mask = df_finta.index.isin(df.index)
    df = pd.concat([df, df_finta[mask]], axis=1)
    
    # Since the percentage of missing entries are low, fill NaN values with previous values
    df = df.fillna(method='ffill').fillna(method='bfill')
    
    return df 
       
def standardize(df):
    scaler = StandardScaler()
    for col in df:
        df[col] = scaler.fit_transform(np.array(df[col]).reshape(-1,1))
    
    return df
    
def reduce_dimension(df, num_att, mode=["pps","pca"], show_pps=False):
        
    if mode == "pps":
        """
        Returns top num_att number of indicators, based on their predictive power to target value.
        """
            
        # Get top indicators based on their predictive powers with respect to the target variable
        top = pps.predictors(df, "Target", sorted=True)[1:num_att+1]['x']
        df_top = df[top]

        # Drop any rows with nan 
        df_top.dropna(axis='index', inplace=True)
        
        # If the user wants to see the ppscore in the descending order
        if show_pps == True:
            print(pps.predictors(df, "Target", sorted=True)[1:num_att+1])
        
        df_top = df_top.join(df['Target'], on="Date")
        
        return df_top
        
    elif mode == "pca":
        target = df['Target']
        df.drop(['Target'], axis=1, inplace=True)
            
        pca = PCA(n_components=num_att)
        df_top = pd.DataFrame(pca.fit_transform(df))
        df_top.index = df.index
        df_top = df_top.join(target, on="Date")
        
        return df_top
        
    else:
        raise ValueError("Not a valid mode")
   

def split(X, y, train_size=None, start_date=None, split_mode=["Size","Date"]):
    """
    Split based on either time frame, or size
    """
    #x = df.drop(['Target'], axis=1)
    #y = df['Target']
    
    if split_mode == "Size":
        # Split test and train by size
        train_len = int(len(X)*train_size)
        X_train = X.iloc[0:train_len]
        X_test = X.iloc[train_len:]
        y_train = y.iloc[0:train_len]
        y_test = y.iloc[train_len:]
            
    elif split_mode == "Date":
        # Split test and train by starting date
        X_train = X.loc[:start_date]
        X_test = X.loc[start_date:]
        y_train = y.loc[:start_date]
        y_test = y.loc[start_date:]
    else:
        raise ValueError("Not a valid split mode")
    
    return X_train, X_test, y_train, y_test

def add_change(df, n):
    """
    Add changes with respect to previous days' indicators
    """    
    target = df["Target"]
    df = df.drop("Target",1)

    drop_columns = df.loc[:, (df == 0.0).any(axis=0)].columns

    df = df.drop(drop_columns,1)
    df_tmp = df.copy()
    
    for i in range(1,n+1):
        delta = df_tmp - df_tmp.shift(i)
        df_final = delta / df_tmp.shift(i)
        df_final.columns = [c + "_" + str(i) for c in df_final.columns]
        df = df.join(df_final)
        
    df["Target"] = target
    df = df[n:]
    
    return df

In [44]:
"""
Neural Network 
"""

class FeedForward(torch.nn.Module):
    def __init__(self, input_size, hidden_dim, hidden_dim_2):
        """
        In the constructor we instantiate two nn.Linear modules and 
        assign them as member variables.
        """
        super(FeedForward, self).__init__()
        self.lin_1 = torch.nn.Linear(input_size, hidden_dim)
        self.lin_2 = torch.nn.Linear(hidden_dim, hidden_dim_2)
        self.lin_3 = torch.nn.Linear(hidden_dim_2, 2)

    def forward(self, x):
        """
        Compute the forward pass of our model, which outputs logits.
        """
        x = self.lin_1(x)
        x = F.relu(x)
        x = self.lin_2(x)
        x = F.relu(x)
        x = self.lin_3(x)

        return x

In [3]:
"""
Predictor Class 
"""

class Predictor:
    def __init__(self):
        self.model = None
        self.pred = None
    
    def train(self, X_train, y_train, model_type=["random_forest"]): 
    
        if model_type == "random_forest":
            """
            Random Forest Classifier
            """
            """
            # Grid Search - Hyperparameter tuning first
            param_grid = { 
                'n_estimators': [150,200,300,400],
                'max_features': ['auto'],
                'max_depth' : [2,3,4],
                'criterion' :['gini']
            }

            forest_clf = RandomForestClassifier(random_state=42)
            CV = GridSearchCV(estimator=forest_clf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
            CV.fit(X_train, y_train)
            self.model = CV.best_estimator_
            """
            self.model = RandomForestClassifier(max_depth=2, n_estimators=400, random_state=42)
            self.model.fit(X_train, y_train)
            
            print("Cross Validation Score: ", cross_val_score(self.model, X_train, y_train, cv=5))
            
            print("Training Complete")
        
        else:
            raise ValueError("Wrong Model Input")
        
        return self.model
    
    def predict(self, X_test):
        pred = self.model.predict(X_test)
        df_pred = pd.DataFrame(pred)
        df_pred.index = X_test.index
        df_pred.columns = ['Target']
        
        self.pred = df_pred
        
        return df_pred
    
    def confusion_matrix(self, pred, y_test):
        matrix = confusion_matrix(y_test, pred, labels=[1,0,-1])

        """
        # Show precision/recall and F1 Score for each class
        precision_up = matrix[0,0] / np.sum(matrix[:,0])
        recall_up = matrix[0,0] / np.sum(matrix[0,:])
        precision_down = matrix[2,2] / np.sum(matrix[:,2])
        recall_down = matrix[2,2] / np.sum(matrix[2,:])
        f1_up = 2*(precision_up * recall_up)/(precision_up + recall_up)
        f1_down = 2*(precision_down * recall_down)/(precision_down + recall_down)

        print("Up Precision: ", precision_up, "Up Recall: ", recall_up, "Up F1 Score: ", f1_up)
        print("Down Precision: ", precision_down, "Down Recall: ", recall_down, "Down F1 Score: ", f1_down)
        """
        print("<Confusion Matrix>")
        display(matrix)

In [4]:
"""
BackTester Class - We want to see if the trained model will make us money
"""

class BackTester:
    
    def __init__(self, df, df_pred, seed):
        self.df_merge = pd.merge(df, df_pred, how='left', on='Date')
        
        # Dates with long signal
        self.long_date = df_pred[df_pred['Target'] == 1].index.sort_values(ascending=True)
        # Dates with short signal
        self.short_date = df_pred[df_pred['Target'] == -1].index.sort_values(ascending=True)
        # Seed money for each transaction 
        self.seed = 1000 
        self.balance = 0
    
    
    def show(self):
        return self.df_merge
    
    def long_transaction(self):
        df_shift = self.df_merge.shift(-1)['close'] - self.df_merge['close']
        df_share_amt = self.seed / self.df_merge.loc[self.long_date]['close']
        profit = np.sum(df_shift[self.long_date] * df_share_amt)
        self.balance += 100 
        print("Profit from Long Positions: ", profit)
        print("Total invested amount: ", len(self.long_date) * self.seed)

        return profit

    def short_transaction(self):
        df_shift = self.df_merge['close'] - self.df_merge.shift(-1)['close']
        df_share_amt = self.seed / self.df_merge.loc[self.short_date]['close']
        profit = np.sum(df_shift[self.short_date] * df_share_amt)
        print("Profit from Short Positions: ", profit)
        return profit
        


In [16]:
#comp_list = ["WMT","SPY","F","AAL","AMZN"]
comp_list = ["BTC-USD"]
# First make list of data frames, so we only have to call once
df_list = []

for c in comp_list:
    df_list.append(df_caller(c))

[*********************100%***********************]  1 of 1 completed


In [17]:
df_list[0]

Unnamed: 0_level_0,open,high,low,close,Adj Close,volume,Target,up_move,down_move,plus,...,VPT,VWAP,VW_MACD_MACD,VW_MACD_SIGNAL,VZO,WILLIAMS,WMA,WOBV,WTO_WT1.,WTO_WT2.
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2014-09-17,465.864,468.174,452.422,457.334,457.334,21056800,0,-11.314,39.318,0.000,...,-2.280530e+07,459.310000,0.000000,0.000000,0.000000,-84.672796,414.968400,-1.134290e+09,-121.212121,-107.305199
2014-09-18,456.860,456.860,413.104,424.440,424.440,34483200,0,-11.314,39.318,0.000,...,-7.390436e+07,442.023697,-0.674881,-0.374934,-65.392780,-84.672796,414.968400,-1.134290e+09,-121.212121,-107.305199
2014-09-19,424.103,427.835,384.532,394.796,394.796,37919700,1,-29.025,28.572,0.000,...,-1.252316e+08,425.942045,-1.611998,-0.881927,-81.086126,-84.672796,414.968400,-2.258382e+09,-119.243038,-107.305199
2014-09-20,394.673,423.296,389.883,408.904,408.904,36863600,0,-4.539,-5.351,0.000,...,-9.383034e+07,420.686161,-1.383672,-1.051895,-20.030918,-84.672796,414.968400,-1.738310e+09,-99.111982,-107.305199
2014-09-21,408.085,412.426,393.181,398.821,398.821,26580100,1,-10.870,-3.298,0.000,...,-1.194202e+08,417.431878,-1.610392,-1.218035,-37.548910,-84.672796,414.968400,-2.006317e+09,-89.653657,-107.305199
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-12-13,50114.742,50205.000,45894.848,46737.480,46737.480,32166727776,0,-519.867,2831.004,0.000,...,4.089329e+12,26644.066324,-2984.745073,-2609.352038,-12.538408,-76.212118,48743.012067,4.599596e+14,-57.030737,-58.351018
2021-12-14,46709.824,48431.398,46424.496,46612.633,46612.633,34638619079,1,-1773.602,-529.648,0.000,...,4.085974e+12,26662.990709,-3100.444691,-2707.570569,-24.965471,-76.878824,48237.314089,4.556351e+14,-57.889124,-57.654383
2021-12-15,48379.754,49473.957,46671.965,48896.723,48896.723,36541828520,0,1042.559,-247.469,1042.559,...,4.099458e+12,26684.075497,-2935.777347,-2753.211925,-6.539174,-58.774020,48249.683333,5.390999e+14,-55.281234,-56.536352
2021-12-16,48900.465,49425.574,47529.879,47665.426,47665.426,27268150947,0,-48.383,-857.914,0.000,...,4.063928e+12,26699.680705,-2898.727524,-2782.315044,-17.068076,-67.203202,48053.257667,5.055247e+14,-53.070140,-55.817809


In [45]:
# For neural network

for df in df_list:
    df_final = add_change(df,4)
    #df_final = reduce_dimension(df_final, 50, 'pca')
    
    y = df_final['Target']
    X = df_final.drop(['Target'], axis=1)
    
    columns = X.columns
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    X_scaled = pd.DataFrame(X, columns=columns, index=X.index)
    
    # Split the train and test set
    X_train, X_test, y_train, y_test = split(X_scaled, y, train_size = 0.9, split_mode="Size")
    
    model = FeedForward(X_train.shape[1], 100,30)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.002)

    for step in range(1, 30001):
        i = np.random.choice(X_train.shape[0], size=int(X_train.shape[0]/10), replace=False)
        x = torch.from_numpy(np.array(X_train.iloc[i]).astype(np.float32))
        y = torch.from_numpy(np.array(y_train.iloc[i]).astype(np.int))
        y = y.type(torch.LongTensor)
        
        # Forward pass: Get logits for x
        logits = model(x)
        # Compute loss
        loss = F.cross_entropy(logits, y)
        # Zero gradients, perform a backward pass, and update the weights.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % 2000 == 0:
            print("Current Step: ", step)
            idxs = np.random.choice(len(X_train), int(X_train.shape[0]/10), replace=False)
            x = torch.from_numpy(np.array(X_train.iloc[idxs]).astype(np.float32))
            y = torch.from_numpy(np.array(y_train.iloc[idxs]).astype(np.int))
            y = y.type(torch.LongTensor)
            
            logits = model(x)
            loss = F.cross_entropy(logits, y)
            y_pred = torch.max(logits, 1)[1]
            print("Training Accuracy: ", accuracy_score(y_train.iloc[idxs], y_pred.numpy()))
            print("Loss: ", loss.item())

Current Step:  2000
Training Accuracy:  0.45569620253164556
Loss:  1209901.375
Current Step:  4000
Training Accuracy:  0.5316455696202531
Loss:  207104.609375
Current Step:  6000
Training Accuracy:  0.5654008438818565
Loss:  67807.2421875
Current Step:  8000
Training Accuracy:  0.5738396624472574
Loss:  29672.57421875
Current Step:  10000
Training Accuracy:  0.5527426160337553
Loss:  5355.74755859375
Current Step:  12000
Training Accuracy:  0.5611814345991561
Loss:  5277.498046875
Current Step:  14000
Training Accuracy:  0.6413502109704642
Loss:  2788.289794921875
Current Step:  16000
Training Accuracy:  0.5443037974683544
Loss:  2831.52294921875
Current Step:  18000
Training Accuracy:  0.5232067510548524
Loss:  0.6907839179039001
Current Step:  20000
Training Accuracy:  0.5274261603375527
Loss:  0.6922550797462463
Current Step:  22000
Training Accuracy:  0.5358649789029536
Loss:  0.6882731914520264
Current Step:  24000
Training Accuracy:  0.4978902953586498
Loss:  0.6976093649864197
C

In [49]:
for param in model.parameters():
    print(param)

Parameter containing:
tensor([[ 0.9778,  1.0072,  1.0967,  ..., -1.9315, -1.8892, -2.9066],
        [-0.5674, -0.6120, -0.6623,  ..., -4.7677, -1.5417,  3.0041],
        [ 0.0113, -0.0415, -0.0386,  ...,  0.0362, -0.0066,  0.0386],
        ...,
        [-0.0320, -0.0325,  0.0133,  ..., -0.0380, -0.0391,  0.0248],
        [-0.4847, -0.4746, -0.4839,  ..., -0.5350, -1.1098,  1.8839],
        [-0.3863, -0.3519, -0.2928,  ..., -1.9841, -1.9160, -2.9267]],
       requires_grad=True)
Parameter containing:
tensor([-5.8195e+00, -3.5720e+00, -3.3043e-02, -4.2634e+00,  1.9632e-02,
        -9.8726e-02, -3.2843e-02,  2.8026e-02, -8.6571e-03,  2.3046e-02,
         8.2812e-05, -4.1152e-02,  1.4630e+00, -8.2467e+00,  1.9157e-02,
         3.1640e-03,  4.2267e-02,  3.9871e+00, -4.5641e-02,  3.5866e-02,
        -3.3285e-02,  8.7165e+00, -3.4968e-02, -9.2006e+00, -9.1567e-01,
        -1.2748e+00,  1.1885e-02,  8.8165e+00, -4.2710e-02, -4.8104e+00,
        -1.3292e-02,  1.5529e-02,  1.3616e+00, -7.1117e-0

In [46]:
# Test

x = torch.from_numpy(np.array(X_test).astype(np.float32))
logits = model(x)
pred = torch.max(logits, 1)[1]
print("Test accuracy: ", accuracy_score(pred, y_test))
print(pred)

Test accuracy:  0.5056603773584906
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1])


In [8]:
# Training and Testing with Random Forest 

for df in df_list:
    df_final = add_change(df,3)
    df_final = reduce_dimension(df_final, 30, 'pca')
    
    y = df_final['Target']
    X = df_final.drop(['Target'], axis=1)
    
    # Split the train and test set
    X_train, X_test, y_train, y_test = split(X, y, train_size = 0.9, split_mode="Size")

    predictor = Predictor()
    predictor.train(X_train, y_train, model_type = "random_forest")
    pred = predictor.predict(X_test)
    
    print("Accuracy: ", accuracy_score(pred, y_test))
    display(predictor.confusion_matrix(pred, y_test))
    


Cross Validation Score:  [0.45168067 0.54736842 0.54736842 0.54315789 0.45894737]
Training Complete
Accuracy:  0.5056603773584906
<Confusion Matrix>


array([[125,   0,   9],
       [  1,   0,   0],
       [121,   0,   9]], dtype=int64)

None