# Candle Dataset Cleaning

After running the first training job on the full dataset, we see that although the training loss decreases, the validation loss stays fairly constant. As a potential remedy, we will remove all closed-market rows from the dataset. This has two advantages:
1. It drastically reduces the size of the data, by approximately 55-60%, keeping only the examples that are relevant to the task we need to perform.
2. It ensures that the loss metrics measure the model's performance on market data where trading is occurring. Currently the loss is likely diluted: so much of the data is a flat time series that the model probably predicts those examples almost perfectly, which pulls the average loss down and likely leaves too small a gradient for the backward pass.
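
To make the dilution argument concrete, here is a minimal sketch with made-up numbers. It assumes 60 flat closed-market windows that the model predicts perfectly and 40 live-market windows with a per-example loss of 1.0; the counts and loss values are hypothetical and are not measurements from this dataset.

```python
import numpy as np


def mean_loss(per_example_loss):
    """Average per-example losses the way a batch-mean loss would."""
    return float(np.mean(per_example_loss))


# Hypothetical per-example losses (illustrative values only).
closed_losses = np.zeros(60)  # flat closed-market windows, predicted perfectly
open_losses = np.ones(40)     # live-market windows, where the model still errs

diluted = mean_loss(np.concatenate([closed_losses, open_losses]))
open_only = mean_loss(open_losses)

print(f"mean loss with closed-market rows:    {diluted:.2f}")
print(f"mean loss on live-market rows only:   {open_only:.2f}")
```

Under these assumptions the reported loss is well below the loss on the examples we actually care about, which is the effect the cleaning step is meant to remove.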

## Plan for Cleaning Data 

1. Remove any rows where the market is closed. We will never be using this in the actual model.
2. This should give us a couple hundred time series per ticker. We want to create a time-series ID for each continuous time series in the dataset. This will be used as an ID column.
3. Then iterate over each ticker. For each ticker:
 - Use our date indices to create a train, validation, and test set for that ticker. Be sure that the sets do not slice any time series (i.e., each of the 3 sets should have its own disjoint set of time-series IDs).
 - Append each of the three sets to a master train, validation, and test set respectively.
 - These steps will replace the current `select_by_index` usage
4. Train the `TimeSeriesPreprocessor` on the train set
5. Create the train, validation, and test datasets, using the trained preprocessor
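
Steps 1 to 3 of the plan can be sketched roughly as follows. This is a sketch under stated assumptions, not the notebook's final implementation: it assumes `cont_market_open == 1` marks the rows to keep, that a gap of more than one minute between consecutive rows of a ticker starts a new time series, and that the split is driven by two cutoff timestamps. The helper names `add_series_id` and `split_by_series`, the `series_id` column, and the cutoff dates are hypothetical. Because each series is assigned by its start time, the per-ticker loop from the plan collapses into one vectorized pass.

```python
import pandas as pd


def add_series_id(df, ticker_col="ticker", ts_col="t",
                  max_gap=pd.Timedelta(minutes=1)):
    """Give every contiguous run of rows (per ticker) its own integer ID."""
    df = df.sort_values([ticker_col, ts_col]).reset_index(drop=True)
    gap = df.groupby(ticker_col)[ts_col].diff()
    # A series starts at the first row of a ticker or after a gap > max_gap.
    starts_new_series = gap.isna() | (gap > max_gap)
    df["series_id"] = starts_new_series.cumsum() - 1
    return df


def split_by_series(df, train_end, valid_end, ts_col="t"):
    """Assign whole series to a split based on each series' first timestamp."""
    start = df.groupby("series_id")[ts_col].transform("min")
    train = df[start < train_end]
    valid = df[(start >= train_end) & (start < valid_end)]
    test = df[start >= valid_end]
    return train, valid, test


# Tiny synthetic frame: 2 tickers, 10 days, 3 open-market minutes per day
# plus one closed-market row per day (all values are made up).
tz = "America/New_York"
days = pd.date_range("2024-01-02", periods=10, freq="D", tz=tz)
rows = []
for ticker in ["AAA", "BBB"]:
    for day in days:
        for minute in range(3):
            rows.append({"ticker": ticker,
                         "t": day + pd.Timedelta(hours=9, minutes=30 + minute),
                         "cont_market_open": 1})
        rows.append({"ticker": ticker,
                     "t": day + pd.Timedelta(hours=22),
                     "cont_market_open": 0})
demo = pd.DataFrame(rows)

open_only = demo[demo["cont_market_open"] == 1]   # step 1
with_ids = add_series_id(open_only)               # step 2
train, valid, test = split_by_series(             # step 3
    with_ids,
    train_end=pd.Timestamp("2024-01-09", tz=tz),
    valid_end=pd.Timestamp("2024-01-11", tz=tz),
)
```

Splitting on the series start time rather than on row positions is what guarantees that no time series is sliced across two sets.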



In [1]:
# Standard
import os
import random

# Third Party
from transformers import (
    EarlyStoppingCallback,
    PatchTSMixerConfig,
    PatchTSMixerForPrediction,
    Trainer,
    TrainingArguments,
)
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split

# First Party
from tsfm_public.toolkit.dataset import ForecastDFDataset
from tsfm_public.toolkit.time_series_preprocessor import TimeSeriesPreprocessor


In [2]:
# Set seed for reproducibility
SEED = 42
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)

## Dataset Preprocessing

We want to meticulously craft our examples for each trading day. Our primary focus is to ensure that the model is only ever asked to forecast into live market data. It needs to learn this type of forecasting without being thrown off by examples from extended-hours trading.

There are a few general steps we must take:
1. Ensure that all the data from the closed market is removed from the dataset, as this data will never be used in context or in forecasting.
2. Ensure that the timestamps are localized to America/New_York, so that we can get accurate date_strings to use for identifying each trading day.
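
A minimal sketch of these two steps on a toy frame is shown here. It assumes the raw timestamps are UTC strings and treats a row as closed-market when both `cont_market_open` and `cont_market_extended` are 0; that reading of the flag columns is an assumption, and the sample rows are made up.

```python
import pandas as pd

# Three made-up rows: regular session, extended hours, and closed market.
raw = pd.DataFrame({
    "t": ["2024-01-02T14:30:00Z", "2024-01-02T21:30:00Z", "2024-01-03T02:30:00Z"],
    "cont_market_open": [1, 0, 0],
    "cont_market_extended": [0, 1, 0],
})

# Parse as UTC, then convert to exchange-local time.
raw["t"] = pd.to_datetime(raw["t"], utc=True).dt.tz_convert("America/New_York")

# Derive the trading-day label from the localized timestamp, not the UTC one.
raw["date_string"] = raw["t"].dt.strftime("%Y-%m-%d")

# Assumed definition of "closed": neither regular nor extended hours.
is_closed = (raw["cont_market_open"] == 0) & (raw["cont_market_extended"] == 0)
live = raw[~is_closed].reset_index(drop=True)
```

The third row shows why the localization matters: its UTC date is January 3, but in New York it still belongs to January 2.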

In [3]:
# Load the Dataset from the CSV file
DATA_DIR = "/home/ubuntu/verb-workspace/data"  # set this to the location of the data

TRAIN_DATASET = f"{DATA_DIR}/1min-candles-train-w-CANDLES.csv"
VALID_DATASET = f"{DATA_DIR}/1min-candles-valid-w-CANDLES.csv"
TEST_DATASET = f"{DATA_DIR}/1min-candles-test-w-CANDLES.csv"

timestamp_col = 't'

train_data = pd.read_csv(
    TRAIN_DATASET,
    parse_dates=[timestamp_col]
)

valid_data = pd.read_csv(
    VALID_DATASET,
    parse_dates=[timestamp_col]
)

test_data = pd.read_csv(
    TEST_DATASET,
    parse_dates=[timestamp_col]
)

In [4]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 937440 entries, 0 to 937439
Data columns (total 13 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   ticker                937440 non-null  object 
 1   date_string           937440 non-null  object 
 2   t                     937440 non-null  object 
 3   targ_o                937440 non-null  float64
 4   targ_h                937440 non-null  float64
 5   targ_l                937440 non-null  float64
 6   targ_c                937440 non-null  float64
 7   targ_v                937440 non-null  float64
 8   targ_vwap             937440 non-null  float64
 9   targ_red              937440 non-null  int64  
 10  targ_green            937440 non-null  int64  
 11  cont_market_open      937440 non-null  int64  
 12  cont_market_extended  937440 non-null  int64  
dtypes: float64(6), int64(4), object(3)
memory usage: 93.0+ MB


In [5]:
# Split the dataset into a morning set, maybe the first 90 minutes of the day

# Convert the 't' column of each DataFrame to a timezone-aware timestamp in America/New_York
train_data['t'] = pd.to_datetime(train_data['t'], utc=True).dt.tz_convert('America/New_York')
valid_data['t'] = pd.to_datetime(valid_data['t'], utc=True).dt.tz_convert('America/New_York')
test_data['t'] = pd.to_datetime(test_data['t'], utc=True).dt.tz_convert('America/New_York')

# Create a time object representing 11 AM
eleven_am = pd.to_datetime('11:00:00', format='%H:%M:%S').time()

# Filter each dataset to keep rows where 't' is before or equal to 11 AM
morning_train_data = train_data[train_data['t'].dt.time <= eleven_am]
morning_valid_data = valid_data[valid_data['t'].dt.time <= eleven_am]
morning_test_data = test_data[test_data['t'].dt.time <= eleven_am]

In [6]:
assert morning_train_data.isna().sum().sum() == 0
assert morning_valid_data.isna().sum().sum() == 0
assert morning_test_data.isna().sum().sum() == 0

In [7]:
TRAIN_DATASET_1_MIN = f"{DATA_DIR}/1min-candles-train-MORNING.csv"
VALID_DATASET_1_MIN = f"{DATA_DIR}/1min-candles-valid-MORNING.csv"
TEST_DATASET_1_MIN = f"{DATA_DIR}/1min-candles-test-MORNING.csv"

morning_train_data.to_csv(TRAIN_DATASET_1_MIN, index=False)
morning_valid_data.to_csv(VALID_DATASET_1_MIN, index=False)
morning_test_data.to_csv(TEST_DATASET_1_MIN, index=False)

In [8]:
# Now, we want to get the day split, which involves taking everything from 7 AM until 2:30 PM

# Create a time object representing 2:30 PM
two_thirty_pm = pd.to_datetime('14:30:00', format='%H:%M:%S').time()

# Create a time object representing 7 AM (4 hours leading up to 11 AM)
seven_am = pd.to_datetime('07:00:00', format='%H:%M:%S').time()


# Filter each dataset to keep rows where 't' is between 7 AM and 2:30 PM (inclusive)
day_train_data = train_data[(train_data['t'].dt.time >= seven_am) & (train_data['t'].dt.time <= two_thirty_pm)]
day_valid_data = valid_data[(valid_data['t'].dt.time >= seven_am) & (valid_data['t'].dt.time <= two_thirty_pm)]
day_test_data = test_data[(test_data['t'].dt.time >= seven_am) & (test_data['t'].dt.time <= two_thirty_pm)]

In [9]:
TRAIN_DATASET_1_MIN = f"{DATA_DIR}/1min-candles-train-DAY.csv"
VALID_DATASET_1_MIN = f"{DATA_DIR}/1min-candles-valid-DAY.csv"
TEST_DATASET_1_MIN = f"{DATA_DIR}/1min-candles-test-DAY.csv"

day_train_data.to_csv(TRAIN_DATASET_1_MIN, index=False)
day_valid_data.to_csv(VALID_DATASET_1_MIN, index=False)
day_test_data.to_csv(TEST_DATASET_1_MIN, index=False)

In [10]:
# Now, we want to get the afternoon split, which covers everything after 2:30 PM
# plus the 4 hours of context leading up to it

# Create a time object representing 10:30 AM (4 hours leading up to 2:30 PM)
ten_thirty_am = pd.to_datetime('10:30:00', format='%H:%M:%S').time()


# Filter each dataset to keep rows where 't' is at or after 10:30 AM
afternoon_train_data = train_data[train_data['t'].dt.time >= ten_thirty_am]
afternoon_valid_data = valid_data[valid_data['t'].dt.time >= ten_thirty_am]
afternoon_test_data = test_data[test_data['t'].dt.time >= ten_thirty_am]

In [11]:
TRAIN_DATASET_1_MIN = f"{DATA_DIR}/1min-candles-train-AFTERNOON.csv"
VALID_DATASET_1_MIN = f"{DATA_DIR}/1min-candles-valid-AFTERNOON.csv"
TEST_DATASET_1_MIN = f"{DATA_DIR}/1min-candles-test-AFTERNOON.csv"

afternoon_train_data.to_csv(TRAIN_DATASET_1_MIN, index=False)
afternoon_valid_data.to_csv(VALID_DATASET_1_MIN, index=False)
afternoon_test_data.to_csv(TEST_DATASET_1_MIN, index=False)