Data Leakage in Training the CNN #10

hamiGH · 2023-06-25T07:44:01Z

Your implementation is an ingenious approach that really stands out. While reading the code, though, I noticed a potential issue that I think needs your attention.
There appears to be data leakage in your CNN classifier: it seems to use information from a given day to predict the outcome for that same day. Leakage of this kind inflates performance metrics during testing, but the model then performs poorly in real-world use.

The leak is in the CNN training section of STOCK_Market_GAN:

# start at num_historical_days and iterate the full length of the training
# data at intervals of num_historical_days
for i in range(num_historical_days, len(df), num_historical_days):
    # split the df into arrays of length num_historical_days and append
    # to data, i.e. array of df[curr - num_days : curr] -> a batch of values
    self.data.append(data[i-num_historical_days:i])

    # appending if price went up or down in curr day of "i" we are looking
    # at
    self.labels.append(labels[i-1])

# do same for test data
data = test_df[['open','high','low','close','volume']].values

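You should replace self.labels.append(labels[i-1]) with self.labels.append(labels[i]). The slice data[i-num_historical_days:i] ends at row i-1, so labels[i-1] belongs to the last day already inside the input window, whereas labels[i] belongs to the first day after it.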
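To make the index arithmetic concrete, here is a minimal sketch on toy data. It is not the repository's code: `make_windows`, `label_offset`, and the toy `rows`/`labels` are hypothetical names introduced only for illustration, and it assumes there is one label per row, as the snippet above suggests.

```python
def make_windows(rows, labels, num_historical_days, label_offset=0):
    """Split rows into non-overlapping windows and pair each with a label.

    label_offset=-1 mimics labels[i-1] (last row inside the window).
    label_offset=0 mimics labels[i] (first row after the window).
    """
    windows, targets, label_rows = [], [], []
    for i in range(num_historical_days, len(rows), num_historical_days):
        windows.append(rows[i - num_historical_days:i])  # rows i-n .. i-1
        targets.append(labels[i + label_offset])
        label_rows.append(i + label_offset)  # remember which row was used
    return windows, targets, label_rows


# Toy data: row k is just the integer k, so each window shows its row indices.
rows = list(range(10))
labels = [k % 2 for k in rows]  # hypothetical up/down labels, one per row

leaky_windows, _, leaky_rows = make_windows(rows, labels, 3, label_offset=-1)
fixed_windows, fixed_targets, fixed_rows = make_windows(rows, labels, 3)

# With labels[i-1], every label row is also a row of its own input window.
leaky_overlap = [r in w for r, w in zip(leaky_rows, leaky_windows)]
# With labels[i], no label row appears in its input window.
fixed_overlap = [r in w for r, w in zip(fixed_rows, fixed_windows)]

print(leaky_rows, leaky_overlap)
print(fixed_rows, fixed_overlap)
```

Whether the overlap is actually harmful depends on how labels is computed, which the snippet above does not show. If labels[k] is derived from row k's own prices, the labels[i-1] variant lets the classifier see the answer in its input.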
