## Feature Selection and ML to predict stock price movement

Predicting the stock market reliably is, by all accounts, impossible today. The stock market (or perhaps any other market where irrationality supersedes rationality) is among the most whimsical, unpredictable fields. Even the most advanced machine learning and deep learning techniques have failed to deliver consistent returns to the gold miners.

However, in a vain attempt to build a reliable mechanism, or at least one that survives by barely losing money in the market, I have decided to derive several features: those of the particular stock, and those of the commodities and indices that investors often use to gauge market sentiment.

I will be using ML tools to build a classifier that tells the investor to either buy or sell, based on whether the closing price is expected to rise or fall on the next trading day. After that, we will backtest the model to find out whether it has any practical use.

We will be testing with a random forest classifier.
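The buy/sell label described above can be sketched on a handful of made-up closing prices. This is only an illustration, assuming the label is the direction of the next trading day's close; the prices and dates are hypothetical.

```python
import pandas as pd

# Hypothetical closing prices for six consecutive trading days
close = pd.Series(
    [100.0, 101.5, 101.5, 99.0, 100.2, 100.9],
    index=pd.date_range("2021-01-04", periods=6, freq="B"),
    name="close",
)

# Relative change from today's close to the next trading day's close
next_day_change = (close.shift(-1) - close) / close

# 1 = price rose (buy), -1 = price fell (sell),
# 0 = unchanged, or unknown because there is no next day yet (last row)
target = pd.Series(0, index=close.index, name="Target")
target.loc[next_day_change > 0] = 1
target.loc[next_day_change < 0] = -1

print(pd.concat([close, target], axis=1))
```

Note that the label can take a third value, 0, on days where the close does not move, so the classifier is not strictly binary.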

## What features are we using?

There are many possible indicators, in fact hundreds. The following are examples of the indicators we are going to use.

- % Price change from previous trading day
- Trade Volume
- Moving Averages 
- MACD
- Bollinger Bands
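To make the list above concrete, here is a minimal sketch of how these indicators can be computed with plain pandas. It assumes synthetic prices and the commonly used window lengths (20 days for the moving average and Bollinger Bands, 12/26/9 for MACD); the indicator library used later in this notebook may use different defaults, and the column names here are my own.

```python
import numpy as np
import pandas as pd

# Synthetic price and volume data (an assumption; the notebook downloads real quotes)
rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-01", periods=120, freq="B")
close = pd.Series(100 + rng.normal(0, 1, len(dates)).cumsum(), index=dates)
volume = pd.Series(rng.integers(1_000, 5_000, len(dates)), index=dates)

features = pd.DataFrame({"close": close, "volume": volume})

# % price change from the previous trading day
features["pct_change"] = close.pct_change() * 100

# Simple moving average over 20 trading days
features["sma_20"] = close.rolling(20).mean()

# MACD: 12-day EMA minus 26-day EMA, plus a 9-day EMA signal line
ema_12 = close.ewm(span=12, adjust=False).mean()
ema_26 = close.ewm(span=26, adjust=False).mean()
features["macd"] = ema_12 - ema_26
features["macd_signal"] = features["macd"].ewm(span=9, adjust=False).mean()

# Bollinger Bands: 20-day SMA plus/minus two rolling standard deviations
rolling_std = close.rolling(20).std()
features["bb_upper"] = features["sma_20"] + 2 * rolling_std
features["bb_lower"] = features["sma_20"] - 2 * rolling_std

print(features.tail())
```

The rolling indicators are undefined for the first 19 rows, which is why the notebook later has to deal with missing values before training.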



In [93]:
# Import necessary packages
%matplotlib inline
import pandas as pd
import numpy as np
import sys
import itertools
import re
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import yfinance as yf
from finta import TA
import ppscore as pps

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_curve, accuracy_score
from sklearn.decomposition import PCA
from xgboost import XGBClassifier

import warnings
warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (12,6)


In [201]:
def df_caller(name):
    """
    Fetches basic info about the stock
    
    :param name: ticker symbol
    :return: dataframe including volume, closing price, opening price, high price (during the day), low price
    """
    df = yf.download(name)
    if len(df) == 0:
        raise NameError
    
    # Into lower case
    df.rename(columns={'Volume':'volume','Close':'close','Open':'open','High':'high','Low':'low'}, inplace=True)
    
    # Drop any rows with NaN values
    df.dropna(axis=0, inplace=True)
    
    # Add a column that shows whether the closing price goes up or down on the
    # next trading day; this serves as our target variable
    df['Target'] = 0
    change_pct = (df['close'].shift(-1) - df['close']) / df['close']
    df.loc[change_pct > 0, 'Target'] = 1
    df.loc[change_pct < 0, 'Target'] = -1
    
    df = df.round(3)
    
    
    def finta(df):
        """
        Finta is a package that calls financial indicators for each ticker symbol.
        For more info, refer to https://github.com/peerchemist/finta

        Fetches +80 indicators from the given ticker symbol
        """

        finta_df = pd.DataFrame()
        indicators = dir(TA)[0:85]

        # These indicators are either null or non-usable. Remove them from the list
        n_a = ['ALMA','MAMA','FRAMA','LWMA','VIDYA','SWI','TMF','VR','QSTICK','STC','OBV']
        indicators = [i for i in indicators if i not in n_a]

        for ind in indicators:
            # Look the indicator function up by name instead of using eval
            series = getattr(TA, ind)(df)
            ind_df = pd.DataFrame(series)

            # When the indicator has more than one column
            if len(ind_df.columns) > 1:
                for col in ind_df:
                    # Include only the indicator that has more than 90% of the rows filled
                    if ind_df[col].count()/len(series) > 0.9:
                        finta_df[ind+"_"+ind_df[col].name] = ind_df[col]
            else: 
                if ind_df.count().values[0]/len(ind_df) > 0.9:
                    finta_df[ind] = series

        # Replace all infinity values with NaN
        finta_df.replace([np.inf, -np.inf], np.nan, inplace=True)

        return finta_df
 
    df_finta = finta(df)
    
    # The date indices of df & df_finta must match
    mask = df_finta.index.isin(df.index)
    df = pd.concat([df, df_finta[mask]], axis=1)
    
    # Since the percentage of missing entries is low, fill NaN values with the
    # previous values, then back-fill any gaps left at the start
    df = df.ffill().bfill()
    
    return df 
       
def standardize(df):
    scaler = StandardScaler()
    for col in df:
        df[col] = scaler.fit_transform(np.array(df[col]).reshape(-1,1))
    
    return df
    
def reduce_dimension(df, num_att, mode="pps", show_pps=False):
        
    if mode == "pps":
        """
        Returns top num_att number of indicators, based on their predictive power to target value.
        """
            
        # Get top indicators based on their predictive powers with respect to the target variable
        top = pps.predictors(df, "Target", sorted=True)[1:num_att+1]['x']
        df_top = df[top]

        # Drop any rows with NaN
        df_top = df_top.dropna(axis='index')
        
        # If the user wants to see the ppscore in descending order
        if show_pps:
            print(pps.predictors(df, "Target", sorted=True)[1:num_att+1])
        
        df_top = df_top.join(df['Target'], on="Date")
        
        return df_top
        
    elif mode == "pca":
        target = df['Target']
        # Drop the target without modifying the caller's dataframe
        features = df.drop(columns=['Target'])
            
        pca = PCA(n_components=num_att)
        df_top = pd.DataFrame(pca.fit_transform(features), index=features.index)
        df_top = df_top.join(target, on="Date")
        
        return df_top
        
    else:
        raise ValueError("Not a valid mode")
   

def split(X, y, train_size=None, start_date=None, split_mode="Size"):
    """
    Split based on either time frame, or size
    """
    #x = df.drop(['Target'], axis=1)
    #y = df['Target']
    
    if split_mode == "Size":
        # Split test and train by size
        train_len = int(len(X)*train_size)
        X_train = X.iloc[0:train_len]
        X_test = X.iloc[train_len:]
        y_train = y.iloc[0:train_len]
        y_test = y.iloc[train_len:]
            
    elif split_mode == "Date":
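        # Split test and train by starting date. The test set starts on start_date
        # and the training set ends strictly before it, so no row lands in both
        is_test = X.index >= pd.Timestamp(start_date)
        X_train, X_test = X.loc[~is_test], X.loc[is_test]
        y_train, y_test = y.loc[~is_test], y.loc[is_test]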
    else:
        raise ValueError("Not a valid split mode")
    
    return X_train, X_test, y_train, y_test

def add_change(df, n):
    """
    Add changes with respect to previous days' indicators
    """    
    target = df["Target"]
    df = df.drop(columns="Target")

    # Drop columns that contain zeros to avoid division by zero below
    drop_columns = df.loc[:, (df == 0.0).any(axis=0)].columns

    df = df.drop(columns=drop_columns)
    df_tmp = df.copy()
    
    for i in range(1,n+1):
        delta = df_tmp - df_tmp.shift(i)
        df_final = delta / df_tmp.shift(i)
        df_final.columns = [c + "_" + str(i) for c in df_final.columns]
        df = df.join(df_final)
        
    df["Target"] = target
    df = df[n:]
    
    return df
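
To show what the lagged changes look like, here is a small stand-alone sketch of the same idea on a toy dataframe: for each lag `i`, every column gets a companion column holding its relative change versus `i` trading days earlier. The numbers are hypothetical and chosen for easy arithmetic, and the target column is left out for brevity.

```python
import pandas as pd

# Toy indicator values (hypothetical numbers chosen for easy arithmetic)
df = pd.DataFrame(
    {"close": [10.0, 11.0, 12.1, 12.1], "volume": [100.0, 200.0, 100.0, 150.0]},
    index=pd.date_range("2021-01-04", periods=4, freq="B"),
)

n = 2  # number of lags
lagged = df.copy()
for i in range(1, n + 1):
    # Relative change versus the value i trading days earlier
    change = (df - df.shift(i)) / df.shift(i)
    lagged = lagged.join(change.add_suffix(f"_{i}"))

# The first n rows have no complete history, so drop them
lagged = lagged.iloc[n:]
print(lagged.round(3))
```

With `n` lags the feature count grows by a factor of `n + 1`, which is part of why a dimension reduction step follows.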

In [191]:
"""
Predictor Class 
"""

class Predictor:
    def __init__(self):
        self.model = None
        self.pred = None
    
    def train(self, X_train, y_train, model_type="random_forest"): 
    
        if model_type == "random_forest":
            """
            Random Forest Classifier
            """
            """
            # Grid Search - Hyperparameter tuning first
            param_grid = { 
                'n_estimators': [150,200,300,400],
                'max_features': ['auto'],
                'max_depth' : [2,3,4],
                'criterion' :['gini']
            }

            forest_clf = RandomForestClassifier(random_state=42)
            CV = GridSearchCV(estimator=forest_clf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
            CV.fit(X_train, y_train)
            self.model = CV.best_estimator_
            """
            self.model = RandomForestClassifier(max_depth=2, n_estimators=400, random_state=42)
            self.model.fit(X_train, y_train)
            
            print("Cross Validation Score: ", cross_val_score(self.model, X_train, y_train, cv=5))
            
            print("Training Complete")
        
        else:
            raise ValueError("Wrong Model Input")
        
        return self.model
    
    def predict(self, X_test):
        pred = self.model.predict(X_test)
        df_pred = pd.DataFrame(pred)
        df_pred.index = X_test.index
        df_pred.columns = ['Target']
        
        self.pred = df_pred
        
        return df_pred
    
    def confusion_matrix(self, pred, y_test):
        matrix = confusion_matrix(y_test, pred, labels=[1,0,-1])

        """
        # Show precision/recall and F1 Score for each class
        precision_up = matrix[0,0] / np.sum(matrix[:,0])
        recall_up = matrix[0,0] / np.sum(matrix[0,:])
        precision_down = matrix[2,2] / np.sum(matrix[:,2])
        recall_down = matrix[2,2] / np.sum(matrix[2,:])
        f1_up = 2*(precision_up * recall_up)/(precision_up + recall_up)
        f1_down = 2*(precision_down * recall_down)/(precision_down + recall_down)

        print("Up Precision: ", precision_up, "Up Recall: ", recall_up, "Up F1 Score: ", f1_up)
        print("Down Precision: ", precision_down, "Down Recall: ", recall_down, "Down F1 Score: ", f1_down)
        """
        print("<Confusion Matrix>")
        display(matrix)

In [216]:
"""
BackTester Class - We want to see if the trained model will make us money
"""

class BackTester:
    
    def __init__(self, df, df_pred, seed):
        self.df_merge = pd.merge(df, df_pred, how='left', on='Date')
        
        # Dates with long signal
        self.long_date = df_pred[df_pred['Target'] == 1].index.sort_values(ascending=True)
        # Dates with short signal
        self.short_date = df_pred[df_pred['Target'] == -1].index.sort_values(ascending=True)
        # Seed money for each transaction
        self.seed = seed
        self.balance = 0
    
    
    def show(self):
        return self.df_merge
    
    def long_transaction(self):
        df_shift = self.df_merge.shift(-1)['close'] - self.df_merge['close']
        df_share_amt = self.seed / self.df_merge.loc[self.long_date]['close']
        profit = np.sum(df_shift[self.long_date] * df_share_amt)
        self.balance += profit
        print("Profit from Long Positions: ", profit)
        print("Total invested amount: ", len(self.long_date) * self.seed)

        return profit

    def short_transaction(self):
        df_shift = self.df_merge['close'] - self.df_merge.shift(-1)['close']
        df_share_amt = self.seed / self.df_merge.loc[self.short_date]['close']
        profit = np.sum(df_shift[self.short_date] * df_share_amt)
        print("Profit from Short Positions: ", profit)
        return profit
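
The profit arithmetic used by the back tester can be checked by hand on a tiny example. This sketch assumes the same rule as above: every signal commits a fixed seed at that day's close and the position is closed at the next trading day's close. Prices and signals are made up, and transaction costs are ignored.

```python
import pandas as pd

# Hypothetical closing prices; each day moves by 10% up or down
close = pd.Series(
    [100.0, 110.0, 99.0, 108.9],
    index=pd.date_range("2021-01-04", periods=4, freq="B"),
)
# Hypothetical model signals: 1 = long, -1 = short, 0 = no trade
signal = pd.Series([1, -1, 1, 0], index=close.index)

seed = 1000.0  # amount committed to each trade

# Price move from each day's close to the next trading day's close
next_move = close.shift(-1) - close
# Number of shares the seed buys (or sells short) at each day's close
shares = seed / close

# Longs gain when the price rises, shorts gain when it falls
long_profit = (next_move * shares)[signal == 1].sum()
short_profit = (-next_move * shares)[signal == -1].sum()

print(f"long: {long_profit:.2f}, short: {short_profit:.2f}")
```

Each of the three trades here captures a 10% move, so each earns roughly 100 on a seed of 1000.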
        


In [144]:
#quote = df_caller("LADR")

# add_change needs the number of lags; the "_4" column suffixes in the output below imply n=4
display(add_change(quote, 4))

| Date | open | high | low | close | Adj Close | volume | deltawma | ADL | ADX | AO | ... | VFI_4 | VORTEX_VIm_4 | VORTEX_VIp_4 | VPT_4 | VWAP_4 | WMA_4 | WOBV_4 | WTO_WT1._4 | WTO_WT2._4 | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014-02-12 | 17.23 | 17.29 | 16.75 | 16.88 | 8.790 | 145800 | 16.942794 | 3.234239e+06 | 100.000000 | 1.026088 | ... | 0.000000 | 0.000000 | 0.000000 | 0.017691 | 0.004819 | 0.000000 | -0.182852 | -0.712356 | 0.000000 | 1 |
| 2014-02-13 | 16.80 | 17.40 | 16.74 | 17.34 | 9.030 | 243200 | 16.942794 | 3.358050e+06 | 100.000000 | 1.026088 | ... | 0.000000 | 0.000000 | 0.000000 | 0.015742 | 0.002521 | 0.000000 | 0.311027 | -0.817841 | -0.285157 | -1 |
| 2014-02-14 | 17.25 | 17.40 | 17.10 | 17.25 | 8.983 | 153900 | 16.942794 | 3.304185e+06 | 100.000000 | 1.026088 | ... | 0.000000 | 0.000000 | 0.000000 | 0.002730 | 0.001542 | 0.000000 | -0.003074 | -0.789969 | -0.525718 | -1 |
| 2014-02-18 | 17.15 | 17.18 | 16.93 | 16.97 | 8.837 | 53800 | 16.942794 | 3.261145e+06 | 100.000000 | 1.026088 | ... | 0.000000 | 0.000000 | 0.000000 | 0.008799 | 0.000851 | 0.000000 | 0.088177 | -0.959934 | -0.762358 | -1 |
| 2014-02-19 | 16.90 | 17.05 | 16.86 | 16.95 | 8.827 | 30100 | 16.942794 | 3.248012e+06 | 100.000000 | 1.026088 | ... | 0.000000 | 0.000000 | 0.000000 | 0.022793 | 0.000791 | 0.000000 | 0.444926 | -1.336959 | -0.896445 | -1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-11-30 | 11.35 | 11.47 | 11.26 | 11.39 | 11.390 | 1065900 | 11.493840 | -3.305778e+08 | 23.879474 | -0.303294 | ... | -0.189389 | 0.155185 | -0.193166 | -0.014402 | -0.000182 | -0.027905 | -0.012088 | -6.037643 | -1.981816 | -1 |
| 2021-12-01 | 11.64 | 11.82 | 11.31 | 11.35 | 11.350 | 1167400 | 11.373121 | -3.316069e+08 | 23.147617 | -0.378471 | ... | -0.408636 | 0.067302 | -0.154358 | -0.157423 | -0.000249 | -0.034501 | -0.013374 | -12.113066 | -3.621177 | 1 |
| 2021-12-02 | 11.30 | 12.00 | 11.30 | 11.91 | 11.910 | 912000 | 11.467565 | -3.311678e+08 | 21.802548 | -0.439588 | ... | -0.275085 | -0.084015 | 0.109002 | 0.015664 | -0.000261 | -0.022852 | 0.009862 | 1.239369 | -20.748837 | -1 |
| 2021-12-03 | 12.00 | 12.01 | 11.71 | 11.77 | 11.770 | 658700 | 11.516054 | -3.316553e+08 | 20.518060 | -0.355676 | ... | -0.292589 | -0.070408 | 0.181095 | -0.035904 | -0.000242 | -0.013558 | 0.008259 | 0.108972 | 3.415905 | 1 |


## Feature Engineering




In [210]:
#comp_list = ["WMT","SPY","F","AAL","AMZN"]
comp_list = ["BTC-USD"]
# First make list of data frames, so we only have to call once
df_list = []

for c in comp_list:
    df_list.append(df_caller(c))

[*********************100%***********************]  1 of 1 completed


In [213]:
for df in df_list:
    df_final = add_change(df,1)
    df_final = reduce_dimension(df_final, 30, 'pca')
    
    y = df_final['Target']
    X = df_final.drop(['Target'], axis=1)
    
    # Split the train and test set
    #X_train, X_test, y_train, y_test = split(X, y, start_date="2021-09-01", split_mode="Date")
    X_train, X_test, y_train, y_test = split(X, y, train_size = 0.9, split_mode="Size")

    predictor = Predictor()
    predictor.train(X_train, y_train, model_type = "random_forest")
    pred = predictor.predict(X_test)
    
    print("Accuracy: ", accuracy_score(y_test, pred))
    # confusion_matrix() displays the matrix itself and returns None
    predictor.confusion_matrix(pred, y_test)
    


Cross Validation Score:  [0.4535865  0.54852321 0.5464135  0.53586498 0.47468354]
Training Complete
Accuracy:  0.4962121212121212
<Confusion Matrix>


array([[130,   0,   4],
       [  1,   0,   0],
       [128,   0,   1]], dtype=int64)

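
To put the accuracy figure in context, the confusion matrix above can be unpacked with a few lines of arithmetic. The sketch below only re-uses the numbers printed in the output; it relies on scikit-learn's convention that rows are true classes and columns are predicted classes, here in the order `[1, 0, -1]`. The variable names are my own.

```python
import numpy as np

# Confusion matrix from the run above; rows are true classes, columns are
# predicted classes, both ordered as [1, 0, -1] (up, flat, down)
matrix = np.array([[130, 0, 4],
                   [1, 0, 0],
                   [128, 0, 1]])

total = matrix.sum()
accuracy = np.trace(matrix) / total

# Share of test days on which the model predicted "up"
predicted_up_share = matrix[:, 0].sum() / total

# Accuracy of a naive baseline that always predicts the most common true class
majority_baseline = matrix.sum(axis=1).max() / total

print(f"accuracy: {accuracy:.4f}")
print(f"predicted up: {predicted_up_share:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

In this run the model predicts "up" on almost every test day, and its accuracy sits slightly below what always guessing the most common class would achieve.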