# Candle Dataset Cleaning

After running the first training job on the full dataset, we see that the training loss decreases while the validation loss stays roughly constant. As a potential remedy, we will try removing every closed-market row from the dataset. This has two advantages:
1. It will drastically reduce the size of the data, by approximately 55-60%, keeping only the examples that are relevant to the task we need to perform.
2. It will ensure that the loss metrics measure the model's performance on market data where trading is occurring. Currently the loss is likely diluted: so much of the data is a flat time series that the model probably predicts those examples almost perfectly, which pulls the average loss down and likely leaves too little gradient signal for the backward pass.
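
To make the dilution argument concrete, here is a minimal sketch with made-up numbers. It assumes that about 60% of the rows are flat closed-market rows that the model predicts with zero error; the error values themselves are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example squared errors (not taken from the real model):
# flat closed-market rows are predicted perfectly, open-market rows are not.
n_closed, n_open = 600, 400  # roughly the 60/40 split estimated above
closed_err = np.zeros(n_closed)
open_err = rng.uniform(0.5, 1.5, size=n_open)

# Mean squared error over every row vs. over open-market rows only
mse_all = np.concatenate([closed_err, open_err]).mean()
mse_open = open_err.mean()

print(f"MSE over all rows:         {mse_all:.3f}")
print(f"MSE over open-market rows: {mse_open:.3f}")
print(f"Ratio: {mse_all / mse_open:.2f}")
```

Under these assumptions the reported loss is only 40% of the loss on the rows we actually care about, and the averaged gradient is scaled down by the same factor.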

## Plan for Cleaning Data 

1. Remove any rows where the market is closed. We will never be using this in the actual model.
2. This should give us a couple hundred time series per ticker. We want to create a time-series ID for each continuous time series in the dataset. This will be used as an ID column.
3. Then iterate over each ticker. For each ticker:
 - Use our date indices to create a train, validation, and test set for that ticker. Be sure that the set boundaries do not slice any time series (i.e. each of the 3 sets should have a disjoint set of time-series IDs).
 - Append each of the three sets to a master train, validation, and test set respectively.
 - These steps will replace the current `select_by_index` usage
4. Train the `TimeSeriesPreprocessor` on the train set
5. Create the train, validation, and test datasets, using the trained preprocessor
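
Steps 2 and 3 of the plan can be sketched roughly as follows. This is only an illustration on toy data: the helper names `add_series_id` and `split_by_series` and the `series_id` column are hypothetical, closed-market rows are assumed to be already removed, and the split uses simple fractions of each ticker's series IDs as a stand-in for the date indices mentioned above.

```python
import pandas as pd

def add_series_id(df, ts_col="t", freq="10s"):
    """Give every run of consecutive candles (per ticker) its own ID."""
    df = df.sort_values(["ticker", ts_col]).reset_index(drop=True)
    # A new series starts wherever the gap to the previous row of the same
    # ticker is not exactly one candle (the first row of a ticker has no gap).
    starts_new = df.groupby("ticker")[ts_col].diff() != pd.Timedelta(freq)
    df["series_id"] = starts_new.cumsum()
    return df

def split_by_series(df, fractions=(0.8, 0.1, 0.1)):
    """Split per ticker on whole series, so no series is sliced."""
    train, valid, test = [], [], []
    for _, ticker_df in df.groupby("ticker"):
        ids = ticker_df["series_id"].unique()  # chronological, df is sorted
        n_train = int(len(ids) * fractions[0])
        n_valid = int(len(ids) * fractions[1])
        bounds = [(train, ids[:n_train]),
                  (valid, ids[n_train:n_train + n_valid]),
                  (test, ids[n_train + n_valid:])]
        for bucket, bucket_ids in bounds:
            bucket.append(ticker_df[ticker_df["series_id"].isin(bucket_ids)])
    return tuple(pd.concat(b, ignore_index=True) for b in (train, valid, test))

# Toy data: 2 tickers, 10 days each, 3 consecutive 10-second candles per day
rows = []
for ticker in ["AAPL", "V"]:
    for day in pd.date_range("2023-01-03", periods=10, freq="D"):
        for k in range(3):
            ts = day + pd.Timedelta(hours=9, minutes=30, seconds=10 * k)
            rows.append({"ticker": ticker, "t": ts})

toy = add_series_id(pd.DataFrame(rows))
train, valid, test = split_by_series(toy)
print(toy["series_id"].nunique(), len(train), len(valid), len(test))
```

Because the split operates on whole `series_id` values, the three sets cannot share a time series by construction.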



In [1]:
# Standard
import os
import random

# Third Party
from transformers import (
    EarlyStoppingCallback,
    PatchTSMixerConfig,
    PatchTSMixerForPrediction,
    Trainer,
    TrainingArguments,
)
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split

# First Party
from tsfm_public.toolkit.dataset import ForecastDFDataset
from tsfm_public.toolkit.time_series_preprocessor import TimeSeriesPreprocessor


In [2]:
# Set seed for reproducibility
SEED = 42
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)

## Dataset Preprocessing

We want to meticulously craft our examples for each trading day. Our primary focus here is to ensure that the model is only ever asked to forecast into live market data. We need it to learn this type of forecasting and not be thrown off by examples from extended-hours trading.

There are a few general steps we must take:
1. Ensure that all the data from the closed market is removed from the dataset, as this data will never be used in context or in forecasting.
2. Ensure that the timestamps are localized to America/New_York, so that we can get accurate date_strings to use for identifying each trading day.
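
A minimal sketch of these two steps on a toy frame is shown below. The helper name `keep_open_market` is hypothetical, and it assumes that a "closed" row is one with `market_open == 0`; adjust the mask if extended-hours rows should be treated differently.

```python
import pandas as pd

def keep_open_market(df, ts_col="t", tz="America/New_York"):
    """Drop closed-market rows and express timestamps in New York time."""
    # Assumption: market_open == 1 marks the rows we want to keep
    out = df[df["market_open"] == 1].copy()
    # Parse as UTC first so mixed offsets (EST vs. EDT) are handled,
    # then convert to the exchange's local time zone.
    out[ts_col] = pd.to_datetime(out[ts_col], utc=True).dt.tz_convert(tz)
    # Recompute the trading-day label from the localized timestamp
    out["date_string"] = out[ts_col].dt.strftime("%Y-%m-%d")
    return out

# Toy rows: a winter open, an after-hours row, and a summer open
raw = pd.DataFrame({
    "t": [
        "2023-01-03 14:30:00+00:00",  # 09:30 in New York (EST)
        "2023-01-04 01:00:00+00:00",  # 20:00 on Jan 3 in New York
        "2023-07-03 13:30:00+00:00",  # 09:30 in New York (EDT)
    ],
    "market_open": [1, 0, 1],
})

cleaned = keep_open_market(raw)
print(cleaned[["t", "date_string"]])
```

Deriving `date_string` after the time zone conversion matters because a late New York timestamp can already fall on the next calendar day in UTC.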

In [3]:
# Load the Dataset from the CSV file
DATA_DIR = "/home/ubuntu/verb-workspace/data"  # set this to the location of the data
DATASET = "10s-candles-2023-train.csv"

TRAIN_DATASET = f"{DATA_DIR}/10s-candles-train.csv"
VALID_DATASET = f"{DATA_DIR}/10s-candles-valid.csv"
TEST_DATASET = f"{DATA_DIR}/10s-candles-test.csv"

timestamp_col = 't'

train_data = pd.read_csv(
    TRAIN_DATASET,
    parse_dates=[timestamp_col]
)

valid_data = pd.read_csv(
    VALID_DATASET,
    parse_dates=[timestamp_col]
)

test_data = pd.read_csv(
    TEST_DATASET,
    parse_dates=[timestamp_col]
)


In [14]:
train_data

Unnamed: 0,t,targ_o,targ_h,targ_l,targ_c,targ_v,ticker,market_extended,market_open,date_string
0,2023-01-03 05:30:00-05:00,130.800,130.800,130.80,130.800,0.0,AAPL,1,0,2023-01-03
1,2023-01-03 05:30:10-05:00,130.800,130.800,130.80,130.800,0.0,AAPL,1,0,2023-01-03
2,2023-01-03 05:30:20-05:00,130.800,130.800,130.80,130.800,0.0,AAPL,1,0,2023-01-03
3,2023-01-03 05:30:30-05:00,130.800,130.800,130.80,130.800,0.0,AAPL,1,0,2023-01-03
4,2023-01-03 05:30:40-05:00,130.800,130.800,130.80,130.800,0.0,AAPL,1,0,2023-01-03
...,...,...,...,...,...,...,...,...,...,...
5624635,2023-11-17 15:59:10-05:00,249.640,249.645,249.59,249.610,12233.0,V,0,1,2023-11-17
5624636,2023-11-17 15:59:20-05:00,249.605,249.655,249.59,249.635,13620.0,V,0,1,2023-11-17
5624637,2023-11-17 15:59:30-05:00,249.640,249.640,249.59,249.595,9923.0,V,0,1,2023-11-17
5624638,2023-11-17 15:59:40-05:00,249.590,249.630,249.55,249.550,20377.0,V,0,1,2023-11-17


In [34]:
from tqdm import tqdm

def aggregate_min_candles(df: pd.DataFrame) -> pd.DataFrame:
    
    # Set the timestamp column as the index.
    # set_index returns a new DataFrame, so the original is not mutated.
    _df = df.set_index(timestamp_col)
    
    # Group by 'ticker' and 'date_string', and resample the data to 1-minute frequency
    groups = _df.groupby(['ticker', 'date_string'])
    resampled_groups = []
    for (ticker, date_string), _group in tqdm(groups, total=len(groups)):
        _group.index = pd.to_datetime(_group.index)
        group = _group.resample('1min').agg({
            'targ_o': 'first',
            'targ_h': 'max',
            'targ_l': 'min',
            'targ_c': 'last',
            'targ_v': 'sum',
            'market_extended': 'first',
            'market_open': 'first'
        })
        group['ticker'] = ticker
        group['date_string'] = date_string
        resampled_groups.append(group)
    
    # Concatenate the groups and move the timestamp index back into a column
    resampled_df = pd.concat(resampled_groups)
    resampled_df.reset_index(inplace=True)
    return resampled_df
    
print("Aggregating Training Data")
train_data_1min = aggregate_min_candles(train_data)

print("Aggregating Validation Data")
valid_data_1min = aggregate_min_candles(valid_data)

print("Aggregating Testing Data")
test_data_1min = aggregate_min_candles(test_data)

Aggregating Training Data


100%|██████████| 1488/1488 [00:42<00:00, 35.34it/s]


Aggregating Validation Data


100%|██████████| 186/186 [00:05<00:00, 35.68it/s]


Aggregating Testing Data


100%|██████████| 187/187 [00:05<00:00, 35.70it/s]


In [37]:
assert len(train_data_1min) * 6 == len(train_data)
assert len(valid_data_1min) * 6 == len(valid_data)
assert len(test_data_1min) * 6 == len(test_data)
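
The assertions above only check row counts. As a complementary sanity check, the sketch below applies the same OHLCV aggregation rules to a small synthetic frame where the expected 1-minute candles are easy to verify by hand. The toy values are made up for illustration.

```python
import pandas as pd

# Twelve synthetic 10-second candles, i.e. exactly two minutes of data
idx = pd.date_range("2023-01-03 09:30:00", periods=12, freq="10s",
                    tz="America/New_York")
toy = pd.DataFrame({
    "targ_o": range(100, 112),
    "targ_h": range(101, 113),
    "targ_l": range(99, 111),
    "targ_c": range(100, 112),
    "targ_v": [10.0] * 12,
}, index=idx)

# Same aggregation rules as in aggregate_min_candles
one_min = toy.resample("1min").agg({
    "targ_o": "first",  # open of the first 10-second candle
    "targ_h": "max",    # highest high within the minute
    "targ_l": "min",    # lowest low within the minute
    "targ_c": "last",   # close of the last 10-second candle
    "targ_v": "sum",    # total volume within the minute
})

print(one_min)
```

Each output row should combine six input rows: the first minute opens at 100, closes at 105, and sums to a volume of 60.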

In [38]:
TRAIN_DATASET_1_MIN = f"{DATA_DIR}/1min-candles-train.csv"
VALID_DATASET_1_MIN = f"{DATA_DIR}/1min-candles-valid.csv"
TEST_DATASET_1_MIN = f"{DATA_DIR}/1min-candles-test.csv"

train_data_1min.to_csv(TRAIN_DATASET_1_MIN, index=False)
valid_data_1min.to_csv(VALID_DATASET_1_MIN, index=False)
test_data_1min.to_csv(TEST_DATASET_1_MIN, index=False)