Using the CNN benchmark to predict rather than the GAN/XGB hybrid? #9

Open
BreakingDusk397 opened this issue Apr 11, 2023 · 1 comment

BreakingDusk397 commented Apr 11, 2023

Could you walk me through how to modify the code in the predictions cell? I can't seem to untangle the GAN and XGB models. At first, I tried to just replace the GAN model with the CNN, but the CNN class doesn't have a features attribute for the XGB to read. I'm also not very familiar with TF v1 code. Any help would be much appreciated.

This is my code for the XGB training cell:

class TrainXGBBoost:
    def __init__(self, num_historical_days, days=10, pct_change=0):
        self.data = []
        self.labels = []
        self.test_data = []
        self.test_labels = []

        assert os.path.exists(f"{googlepath}cnn_models/checkpoint")
        cnn = CNN(num_features=5, num_historical_days=num_historical_days, is_train=False)
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            saver = tf.train.Saver()
            if os.path.exists(f'{googlepath}cnn_models/checkpoint'):
                # The first line of the checkpoint file holds the latest model name in quotes
                with open(f'{googlepath}cnn_models/checkpoint', 'rb') as f:
                    model_name = next(f).split('"'.encode())[1]
                filename = "{}cnn_models/{}".format(googlepath, model_name.decode())
                currentStep = filename.split("-")[1]
                new_saver = tf.train.import_meta_graph('{}.meta'.format(filename))
                new_saver.restore(sess, "{}".format(filename))
            files = [os.path.join(f'{googlepath}stock_data', f) for f in os.listdir(f'{googlepath}/stock_data')]

            for file in files:
                print(file)
                # Read in file -- note that parse_dates will be needed later
                df = pd.read_csv(file, index_col='timestamp', parse_dates=True)

                if len(df) > 12:
                    df = df[['open', 'high', 'low', 'close', 'volume']]
                    df = df.fillna(0)

                    # Normalize using a window of size num_historical_days
                    labels = df.close.pct_change(days).map(lambda x: int(x > pct_change/100.0))
                    df = ((df -
                           df.rolling(num_historical_days).mean().shift(-num_historical_days))
                          / (df.rolling(num_historical_days).max().shift(-num_historical_days)
                             - df.rolling(num_historical_days).min().shift(-num_historical_days)))

                    df['labels'] = labels
                    df = df.apply(pd.to_numeric, downcast='float')
                    df = df.apply(pd.to_numeric, downcast='integer')

                    df = df.dropna()

                    # Hold out the testing data
                    test_df = df[:500]
                    df = df[500:]

                    data = df[['open', 'high', 'low', 'close', 'volume']].values
                    labels = df['labels'].values
                    for i in range(num_historical_days, len(df), num_historical_days):
                        # This is the line that fails: the CNN object has no 'features' attribute
                        features = sess.run(cnn.features, feed_dict={cnn.X: [data[i-num_historical_days:i]]})
                        self.data.append(features[0])
                        self.labels.append(labels[i-1])

                    data = test_df[['open', 'high', 'low', 'close', 'volume']].values
                    labels = test_df['labels'].values
                    for i in range(num_historical_days, len(test_df), 1):
                        features = sess.run(cnn.features, feed_dict={cnn.X: [data[i-num_historical_days:i]]})
                        self.test_data.append(features[0])
                        self.test_labels.append(labels[i-1])

    def train(self):
        params = {}
        params['objective'] = 'multi:softprob'
        params['eta'] = 0.01
        params['num_class'] = 2
        params['max_depth'] = 20
        params['subsample'] = 0.05
        params['colsample_bytree'] = 0.05
        params['eval_metric'] = 'mlogloss'

        train = xgb.DMatrix(self.data, self.labels)
        test = xgb.DMatrix(self.test_data, self.test_labels)

        watchlist = [(train, 'train'), (test, 'test')]
        clf = xgb.train(params, train, 2000, evals=watchlist, early_stopping_rounds=100)
        joblib.dump(clf, f'{googlepath}models/clf.pkl')
        cm = confusion_matrix(self.test_labels, list(map(lambda x: int(x[1] > .5), clf.predict(test))))
        print(cm)
        plot_confusion_matrix(cm, ['Down', 'Up'], normalize=True, title="Confusion Matrix")

tf.compat.v1.reset_default_graph()
boost_model = TrainXGBBoost(num_historical_days=HISTORICAL_DAYS_AMOUNT, days=DAYS_AHEAD, pct_change=PCT_CHANGE_AMOUNT)
boost_model.train()
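
For anyone reading along, here is how I understand the normalization step in that cell, restated as a minimal plain-Python sketch. Because the rolling statistics are shifted by -num_historical_days, each row appears to be normalized by the window of rows that follow it in file order, not by the rows before it. The function name `normalize_forward_window` is made up for illustration, and returning None for a flat window is a choice made for this sketch only (pandas would produce inf or NaN there).

```python
def normalize_forward_window(values, n):
    """Normalize values[i] by the n rows after it (rows i+1 .. i+n).

    Mirrors (x - rolling(n).mean().shift(-n)) /
            (rolling(n).max().shift(-n) - rolling(n).min().shift(-n))
    for a single column, as I read the pandas expression.
    """
    out = []
    for i in range(len(values)):
        window = values[i + 1 : i + 1 + n]
        if len(window) < n:
            # pandas yields NaN for the last n rows; dropna() removes them later
            out.append(None)
            continue
        spread = max(window) - min(window)
        if spread == 0:
            # Flat window: sketch-only choice, pandas would give inf or NaN
            out.append(None)
            continue
        out.append((values[i] - sum(window) / n) / spread)
    return out


closes = [1.0, 2.0, 3.0, 4.0, 5.0]
print(normalize_forward_window(closes, 2))
# → [-1.5, -1.5, -1.5, None, None]
```

Whether "the rows that follow" means earlier or later dates depends on how the CSV is sorted, which the training cell does not set explicitly.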

This is the error I get when trying to train the XGB on the CNN:

AttributeError                            Traceback (most recent call last)

<ipython-input-7-b77e4ceeeb9a> in <cell line: 118>()
    116
    117 tf.compat.v1.reset_default_graph()
--> 118 boost_model = TrainXGBBoost(num_historical_days=HISTORICAL_DAYS_AMOUNT, days=DAYS_AHEAD, pct_change=PCT_CHANGE_AMOUNT)
    119 boost_model.train()

<ipython-input-7-b77e4ceeeb9a> in __init__(self, num_historical_days, days, pct_change)
     82 labels = df['labels'].values
     83 for i in range(num_historical_days, len(df), num_historical_days):
---> 84 features = sess.run(cnn.features, feed_dict={cnn.X:[data[i-num_historical_days:i]]})
     85 self.data.append(features[0])
     86 # print(features[0])

AttributeError: 'CNN' object has no attribute 'features'
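
As far as I can tell, the error just means the XGB cell reads an attribute named features that the CNN class never assigns. The sketch below illustrates that with a toy stand-in, not the real model: `ToyCNN` and `ToyCNNWithFeatures` are hypothetical classes using plain lists instead of tensors, and the attribute names `X` and `logits` are assumptions about what the real CNN class might expose. It shows how to list what an object actually has, and what exposing a flattened intermediate result as `features` looks like.

```python
class ToyCNN:
    """Hypothetical stand-in for the CNN class (the real one is not shown here)."""

    def __init__(self, num_features, num_historical_days):
        # Stand-in for the input placeholder: one window of rows
        self.X = [[0.0] * num_features for _ in range(num_historical_days)]
        hidden = self._flatten(self.X)
        # Stand-in for the final 2-class output
        self.logits = [-sum(hidden), sum(hidden)]

    @staticmethod
    def _flatten(rows):
        # Stand-in for "flatten the last layer before the output"
        return [value for row in rows for value in row]


class ToyCNNWithFeatures(ToyCNN):
    """Same toy model, but it keeps the flattened layer as an attribute."""

    def __init__(self, num_features, num_historical_days):
        super().__init__(num_features, num_historical_days)
        # This assignment is what the XGB cell relies on
        self.features = self._flatten(self.X)


plain = ToyCNN(num_features=5, num_historical_days=20)
print(sorted(vars(plain)))         # → ['X', 'logits']
print(hasattr(plain, "features"))  # → False

patched = ToyCNNWithFeatures(num_features=5, num_historical_days=20)
print(len(patched.features))       # → 100
```

Running `sorted(vars(cnn))` on the real CNN object should show which tensors it exposes, and therefore which one could be fetched in place of cnn.features.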

Here's what I have for my Predictions cell:

import os
import pandas as pd
import random
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import xgboost as xgb
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

class Predict:

    def __init__(self, num_historical_days=20, days=10, pct_change=0,
                 cnn_model=f'{googlepath}deployed_model/cnn',
                 xgb_model=f'{googlepath}deployed_model/xgb'):
        self.data = []
        self.num_historical_days = num_historical_days
        self.cnn_model = cnn_model
        self.xgb_model = xgb_model

        files = [os.path.join(f'{googlepath}stock_data', f) for f in os.listdir(f'{googlepath}stock_data')]
        for file in files:
            print(file)
            df = pd.read_csv(file, index_col='timestamp', parse_dates=True)
            df = df.sort_index(ascending=False)
            df = df[['open', 'high', 'low', 'close', 'volume']]
            df = ((df -
                   df.rolling(num_historical_days).mean().shift(-num_historical_days)) /
                  (df.rolling(num_historical_days).max().shift(-num_historical_days) -
                   df.rolling(num_historical_days).min().shift(-num_historical_days)))
            df = df.dropna()

            """
            file.split --> is the symbol of the current file. Append a tuple of
            that symbol and the dataframe index[0] which is the timestamp, and
            thirdly append the data for 200 to 200 + num_historical_days values
            (open, high, low, close, volume). For each symbol we have, we are
            predicting based on the df[200:200+num_historical_days].values...
            """
            self.data.append((file.split('/')[-1], df.index[0], df[200:200+num_historical_days].values))

    def cnn_predict(self):
        # Clears the default graph stack and resets the global default graph.
        tf.compat.v1.reset_default_graph()
        cnn = CNN(num_features=5, num_historical_days=self.num_historical_days, is_train=False)
        # A class for running Tensorflow operations. A session object
        # encapsulates the environment in which Operation objects are executed,
        # and Tensor objects are evaluated. A session may own resources, such as
        # tf.Variable, tf.QueueBase and tf.ReaderBase. It is important to
        # release these resources when they are no longer required. Invoke
        # tf.Session.close method on the session or use the session as a context
        # manager.
        # with tf.Session() as sess:
        #   sess.run(...)
        # or
        # sess = tf.Session()
        # sess.run(...)
        # sess.close()
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            saver = tf.train.Saver()
            saver.restore(sess, self.cnn_model)
            # Reconstruct a Python object from a file persisted with joblib.dump
            clf = joblib.load(self.xgb_model)
            for sym, date, data in self.data:
                # run takes in fetches and feed_dict. A feed_dict is a
                # dictionary that maps Tensor objects to feed values. In this
                # case, I believe we are doing run(fetches, feed_dict=...)
                # where the fetches is cnn.features and the feed_dict maps the
                # placeholder cnn.X to data. The fetches argument may be a
                # single graph element, or an arbitrarily nested list, tuple,
                # namedtuple, dict, or OrderedDict containing graph elements
                # at its leaves.
                try:
                    features = sess.run(cnn.features, feed_dict={cnn.X: [data]})
                    # Value returned by run() has the same shape as the fetches
                    # argument, where the leaves are replaced by the corresponding
                    # values returned by TensorFlow.

                    # xgb.DMatrix, construct one from either a dense matrix, a
                    # sparse matrix, or a local file. Supported input file formats
                    # are either a libsvm text file or a binary file that was
                    # created previously by xgb.DMatrix.save. Internal data
                    # structure that is used by XGBoost which is optimized for both
                    # memory efficiency and training speed.
                    features = xgb.DMatrix(features)

                    # clf is the XGBoost classifier applied to the extracted
                    # features (in the original hybrid, the flattened last layer
                    # of the GAN discriminator, which is itself a convolutional
                    # neural network). As far as I can tell, the original used
                    # the GAN on the past 20 days to come up with some features.
                    # These features are then plugged into the XGBoost
                    # classifier, which makes a prediction for the stock
                    # (going Up or Down).
                    print('{} {} {}'.format(str(date).split(' ')[0], sym, clf.predict(features)[0][1] > 0.5))
                except Exception as e:
                    # Print the actual error instead of the Exception class
                    print(e)

p = Predict(num_historical_days=HISTORICAL_DAYS_AMOUNT, days=DAYS_AHEAD, pct_change=PCT_CHANGE_AMOUNT)
p.cnn_predict()

If you could help me through modifying the rest of my prediction cell to only use the CNN model to predict, I would be forever in your debt. I think that would be easier than trying to feed the CNN features into the XGB training. Although on a tangent, I wonder if one could extract more accuracy that way.
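
To make the CNN-only idea concrete, here is a minimal sketch of the last step, turning the network's own 2-class output into an Up/Down call with no XGB in between. It assumes the CNN exposes a 2-class output (the name logits is a guess, I haven't confirmed what the class actually calls it) and that the class order is [Down, Up], matching the confusion matrix labels in the training cell. `softmax` and `predict_direction` are illustrative helpers, not functions from the repo.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_direction(logits, up_index=1, threshold=0.5):
    """Return ('Up' or 'Down', probability of Up) from 2-class scores.

    up_index=1 assumes the class order [Down, Up]; that is an assumption.
    """
    p_up = softmax(logits)[up_index]
    return ("Up" if p_up > threshold else "Down"), p_up


# Made-up scores standing in for one window's CNN output
label, p_up = predict_direction([0.0, 2.0])
print(label, round(p_up, 4))
```

In the prediction loop, the scores would presumably come from a sess.run call on that output tensor with the same feed_dict used for cnn.features, and the joblib.load and xgb.DMatrix lines would no longer be needed.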

BreakingDusk397 (Author) commented

I apologize for the awful formatting. It wouldn't recognize the entire cell as a code block for whatever reason.
