# Candle Dataset Cleaning

After running the first training job on the full dataset, we can see that although the training loss decreases, the validation loss remains fairly constant. As a potential remedy, we are going to try removing all of the rows recorded while the market is closed. This has two advantages:
1. It will drastically reduce the size of the data, by approximately 55-60%, keeping only the examples that are relevant to the task we need to perform.
2. It will ensure that the loss metrics measure the model's performance on market data where trading is occurring. Currently, the loss is likely diluted: so much of the data is a flat time series that the model probably predicts those examples almost perfectly, which pulls the average loss down and likely leaves too small a gradient for the backward pass.
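
To make the dilution argument concrete, here is a small numerical sketch. The per-example losses are made up for illustration (they are not measurements from the model), and the 55% closed-market share is just the rough estimate from the list above.

```python
import numpy as np

# Hypothetical per-example losses, for illustration only.
rng = np.random.default_rng(0)
n_closed, n_open = 550, 450                        # roughly 55% closed-market rows
closed_losses = np.zeros(n_closed)                 # flat series: assume near-perfect predictions
open_losses = rng.uniform(0.5, 1.5, size=n_open)   # assumed loss range while trading occurs

all_losses = np.concatenate([closed_losses, open_losses])

mean_all = all_losses.mean()    # what a mean-reduced loss reports on the mixed data
mean_open = open_losses.mean()  # the loss on the rows we actually care about

print(f"mean loss over all rows:  {mean_all:.3f}")
print(f"mean loss over open rows: {mean_open:.3f}")
print(f"ratio: {mean_all / mean_open:.2f}")  # equals the open-market share, 0.45
```

Under these assumptions the reported loss is the open-market loss scaled by the share of open-market rows, so more than half of the signal is hidden by examples the model already gets right.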

## Plan for Cleaning Data 

1. Remove any rows where the market is closed. We will never be using this in the actual model.
2. This should give us a couple hundred time series per ticker. We want to create a time-series ID for each continuous time series in the dataset. This will be used as an ID column.
3. Then iterate over each ticker. For each ticker:
 - Use our date indices to create a train, validation, and test set for that ticker. Be sure that the sets do not slice any time series (i.e. each of the 3 sets should have its own, non-overlapping set of time-series IDs)
 - Append each of the three sets to a master train, validation, and test set respectively.
 - These steps will replace the current `select_by_index` usage
4. Train the `TimeSeriesPreprocessor` on the train set
5. Create the train, validation, and test datasets, using the trained preprocessor
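
Steps 2 and 3 of the plan can be sketched on a toy frame as follows. This is only an illustration under stated assumptions: the tickers, the `series_id` column name, the `MAX_GAP` threshold, and the 50/25/25 fractions are all hypothetical, and the sketch assumes there are no missing candles inside a session.

```python
import pandas as pd

# Toy stand-in for the candle data: 2 hypothetical tickers, 4 sessions each, 3 candles per session.
rows = []
for ticker in ["AAA", "BBB"]:
    for day in pd.date_range("2023-01-02", periods=4, freq="D", tz="UTC"):
        for k in range(3):
            rows.append({"ticker": ticker, "t": day + pd.Timedelta(hours=14, seconds=10 * k)})
df = pd.DataFrame(rows).sort_values(["ticker", "t"]).reset_index(drop=True)

# Step 2: a new series starts at the first row of a ticker, or after a gap longer than one candle.
MAX_GAP = pd.Timedelta(seconds=10)  # assumption: 10-second candles, no missing bars within a session
same_ticker = df["ticker"].eq(df["ticker"].shift())
new_series = ~same_ticker | (df["t"].diff() > MAX_GAP)
df["series_id"] = new_series.astype("int64").groupby(df["ticker"]).cumsum() - 1  # 0-based per ticker


def split_series_ids(ids, train_frac=0.5, valid_frac=0.25):
    """Split an ordered list of series IDs into three chunks without cutting any series."""
    n_train = int(len(ids) * train_frac)
    n_valid = int(len(ids) * valid_frac)
    return ids[:n_train], ids[n_train:n_train + n_valid], ids[n_train + n_valid:]


# Step 3: split each ticker chronologically by series ID and append to the master sets.
parts = {"train": [], "valid": [], "test": []}
for ticker, ticker_df in df.groupby("ticker", sort=False):
    ids = sorted(ticker_df["series_id"].unique())
    for name, chosen in zip(parts, split_series_ids(ids)):
        parts[name].append(ticker_df[ticker_df["series_id"].isin(chosen)])

train_df, valid_df, test_df = (pd.concat(parts[name], ignore_index=True) for name in parts)
print(len(train_df), len(valid_df), len(test_df))
```

Because whole series IDs are assigned to a split, no continuous series is ever sliced, and within each ticker the validation and test series come after the training series in time.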



In [1]:
# Standard
import os
import random

# Third Party
from transformers import (
    EarlyStoppingCallback,
    PatchTSMixerConfig,
    PatchTSMixerForPrediction,
    Trainer,
    TrainingArguments,
)
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split

# First Party
from tsfm_public.toolkit.dataset import ForecastDFDataset
from tsfm_public.toolkit.time_series_preprocessor import TimeSeriesPreprocessor


In [3]:
# Set seed for reproducibility
SEED = 42
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)

In [5]:
# Load the Dataset from the CSV file
DATA_DIR = "/home/ubuntu/verb-workspace/data" # set this accordingly to the location of the data
DATASET = "10s-candles-2023.csv"

DATASET_PATH = os.path.join(DATA_DIR, DATASET)
timestamp_col = 't'

data = pd.read_csv(
    DATASET_PATH,
    parse_dates=[timestamp_col]
)

In [6]:
import gc

gc.collect()

0

In [7]:
# To start, let's reset the types on the columns to ensure that the dataset size is compressed

# Convert float64 columns to float32
float_columns = data.select_dtypes(include=['float64']).columns
data[float_columns] = data[float_columns].astype('float32')

# Convert object columns to category
object_columns = data.select_dtypes(include=['object']).columns
data[object_columns] = data[object_columns].astype('category')

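# Convert int64 columns to int8 (only safe because the integer columns here are the 0/1 market_state flags)
int_columns = data.select_dtypes(include=['int64']).columns
data[int_columns] = data[int_columns].astype('int8')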

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31160015 entries, 0 to 31160014
Data columns (total 10 columns):
 #   Column                             Dtype              
---  ------                             -----              
 0   t                                  datetime64[ns, UTC]
 1   targ_o                             float32            
 2   targ_h                             float32            
 3   targ_l                             float32            
 4   targ_c                             float32            
 5   targ_v                             float32            
 6   ticker                             category           
 7   market_state_MarketState.CLOSED    int8               
 8   market_state_MarketState.EXTENDED  int8               
 9   market_state_MarketState.OPEN      int8               
dtypes: category(1), datetime64[ns, UTC](1), float32(5), int8(3)
memory usage: 950.9 MB
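
The blanket cast to `int8` above relies on every integer column holding small values. As a hedged alternative, `pd.to_numeric` with `downcast="integer"` picks the smallest signed integer type that can hold each column's values. The toy frame and its `count` column below are hypothetical and only serve to show the difference.

```python
import numpy as np
import pandas as pd

# Toy frame: one 0/1 flag column and one hypothetical column whose values do not fit in int8.
toy = pd.DataFrame({
    "flag": np.array([0, 1, 1, 0], dtype="int64"),
    "count": np.array([10, 200, 3000, 40000], dtype="int64"),
})
original = toy.copy()

# Downcast each int64 column to the smallest signed integer dtype that holds its values.
for col in toy.select_dtypes(include=["int64"]).columns:
    toy[col] = pd.to_numeric(toy[col], downcast="integer")

print(toy.dtypes)
```

Here `flag` ends up as `int8`, while `count` is kept at a wider type because 40000 does not fit in 8 or 16 bits, so the values are preserved.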


In [8]:
# Now, let's trim down the dataset by removing all of the market-closed rows. This will create several individual time series for each ticker.
# We can then add a date_string column to use as a second ID column for each time series

# Remove all rows where the market is closed and reset the index
data = data[data['market_state_MarketState.CLOSED'] != 1].reset_index(drop=True)

# Add a date_string column to use as a second ID for each time series
data['date_string'] = data[timestamp_col].dt.strftime('%Y-%m-%d')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14488415 entries, 0 to 14488414
Data columns (total 11 columns):
 #   Column                             Dtype              
---  ------                             -----              
 0   t                                  datetime64[ns, UTC]
 1   targ_o                             float32            
 2   targ_h                             float32            
 3   targ_l                             float32            
 4   targ_c                             float32            
 5   targ_v                             float32            
 6   ticker                             category           
 7   market_state_MarketState.CLOSED    int8               
 8   market_state_MarketState.EXTENDED  int8               
 9   market_state_MarketState.OPEN      int8               
 10  date_string                        object             
dtypes: category(1), datetime64[ns, UTC](1), float32(5), int8(3), object(1)
memory usage: 552.7+ MB


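As we can see, even after adding the additional column, the dataset has shrunk considerably: the row count dropped from 31,160,015 to 14,488,415 (about 53%), and the reported memory usage from 950.9 MB to 552.7+ MB (about 42%; the `+` indicates that the `object` column is not fully counted). We can now remove all of the `market_state` columns, to ensure that we are ready for the full processing steps.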
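
The `+` in the `info()` output means the figure is a lower bound, because string objects in an `object` column are not measured by default. A minimal sketch of how the true footprint could be checked, using a toy `date_string` column (the data here is made up):

```python
import pandas as pd

# Toy column with two repeated date strings, stored as Python objects like date_string above.
toy = pd.DataFrame({
    "date_string": pd.Series(["2023-01-03"] * 1000 + ["2023-01-04"] * 1000, dtype=object),
})

shallow = toy.memory_usage(deep=False)["date_string"]  # counts only the 8-byte object pointers
deep = toy.memory_usage(deep=True)["date_string"]      # also counts the string objects themselves

# A category stores each distinct string once plus a small integer code per row.
as_category = toy["date_string"].astype("category")
deep_category = as_category.memory_usage(index=False, deep=True)

print(f"object (shallow): {shallow} bytes")
print(f"object (deep):    {deep} bytes")
print(f"category (deep):  {deep_category} bytes")
```

If memory becomes a constraint, converting `date_string` to `category` is one option worth considering, since the column has few distinct values relative to its length.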

In [9]:

# Remove the unused columns; errors="ignore" makes the cell safe to re-run after they are gone
market_state_cols = ["market_state_MarketState.CLOSED", "market_state_MarketState.OPEN", "market_state_MarketState.EXTENDED"]
data.drop(columns=market_state_cols, errors="ignore", inplace=True)

# Forward fill any straggler NA values (note: this fills across ticker and day boundaries)
data.ffill(inplace=True)

gc.collect()

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14488415 entries, 0 to 14488414
Data columns (total 8 columns):
 #   Column       Dtype              
---  ------       -----              
 0   t            datetime64[ns, UTC]
 1   targ_o       float32            
 2   targ_h       float32            
 3   targ_l       float32            
 4   targ_c       float32            
 5   targ_v       float32            
 6   ticker       category           
 7   date_string  object             
dtypes: category(1), datetime64[ns, UTC](1), float32(5), object(1)
memory usage: 511.2+ MB
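
One caveat with a plain `ffill` on the whole frame is that a missing value in the first row of one ticker is filled from the last row of the previous ticker. The toy example below (hypothetical tickers and prices) contrasts that with a per-ticker fill; whether this matters for the real data depends on where the stragglers sit.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB"],  # hypothetical tickers
    "targ_c": [10.0, 11.0, np.nan, 250.0],   # BBB's first close is missing
})

# Frame-wide forward fill: BBB's first row inherits AAA's last close.
global_fill = toy.ffill()

# Per-ticker forward fill: values never cross a ticker boundary, so the gap stays NaN.
# (With a categorical ticker column, passing observed=True to groupby avoids empty groups.)
grouped_fill = toy.copy()
grouped_fill["targ_c"] = toy.groupby("ticker")["targ_c"].ffill()

print(global_fill)
print(grouped_fill)
```

With the per-ticker version, any value that remains missing has no earlier observation for that ticker and needs a separate decision (for example, dropping the row).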


In [10]:
data_ticker_dates = data[['ticker', 'date_string']].drop_duplicates()

print(f"Number of unique ticker-date pairs in the full dataset: {len(data_ticker_dates)}")

Number of unique ticker-date pairs in the full dataset: 2689


In [11]:
# We can't apply train_test_split directly to the rows, as that would cut through each of the 2689 unique ticker-date series.
# Instead, we split the ticker-date groups themselves: 90% of the groups for training, 5% for validation, and 5% for testing.


# Split the ticker-date pairs into train, valid, and test sets
train_ticker_dates, temp_ticker_dates = train_test_split(data_ticker_dates, test_size=0.1, random_state=SEED)
valid_ticker_dates, test_ticker_dates = train_test_split(temp_ticker_dates, test_size=0.5, random_state=SEED)

# Create train, valid, and test dataframes based on the ticker-date pairs
# Build the (ticker, date_string) key for every row once and reuse it for all three masks
id_cols = ['ticker', 'date_string']
row_keys = data.set_index(id_cols).index
train_data = data[row_keys.isin(train_ticker_dates.set_index(id_cols).index)]
valid_data = data[row_keys.isin(valid_ticker_dates.set_index(id_cols).index)]
test_data = data[row_keys.isin(test_ticker_dates.set_index(id_cols).index)]

print(f"Train data shape: {train_data.shape}")
print(f"Validation data shape: {valid_data.shape}")
print(f"Test data shape: {test_data.shape}")

Train data shape: (13027535, 8)
Validation data shape: (739440, 8)
Test data shape: (721440, 8)
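
The group-level split described in the comments above can be illustrated on a toy frame. This sketch uses a merge with `indicator=True` instead of the `MultiIndex.isin` approach in the cell; the tickers, dates, and the 10% holdout fraction are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame: 2 hypothetical tickers x 10 dates, 5 rows per (ticker, date) group -> 20 groups, 100 rows.
dates = [f"2023-01-{d:02d}" for d in range(1, 11)]
toy = pd.DataFrame({
    "ticker": np.repeat(["AAA", "BBB"], 50),
    "date_string": np.tile(np.repeat(dates, 5), 2),
    "targ_c": np.arange(100, dtype="float32"),
})
id_cols = ["ticker", "date_string"]

# Sample whole groups, not rows: pick 10% of the unique (ticker, date) pairs for the holdout set.
pairs = toy[id_cols].drop_duplicates()
holdout_pairs = pairs.sample(frac=0.1, random_state=42)

# A left merge keeps the row order of `toy`; `_merge == "both"` marks rows whose pair was sampled.
marked = toy.merge(holdout_pairs, on=id_cols, how="left", indicator=True)
is_holdout = (marked["_merge"] == "both").to_numpy()

train_toy = toy[~is_holdout]
holdout_toy = toy[is_holdout]

# Every group lands entirely in one split, so each holdout group still has all 5 of its rows.
print(holdout_toy.groupby(id_cols).size())
```

Splitting the rows directly would instead scatter the 5 rows of each group across both sets, which is exactly the slicing the plan is trying to avoid.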


In [12]:
# Validation of the ticker date-strings

# Get all unique pairs of ticker and date_string for each of the three data frames
validated_train_ticker_dates = train_data[['ticker', 'date_string']].drop_duplicates()
validated_valid_ticker_dates = valid_data[['ticker', 'date_string']].drop_duplicates()
validated_test_ticker_dates = test_data[['ticker', 'date_string']].drop_duplicates()

print(f"Number of unique ticker-date pairs in train set: {len(validated_train_ticker_dates)}")
print(f"Number of unique ticker-date pairs in valid set: {len(validated_valid_ticker_dates)}")
print(f"Number of unique ticker-date pairs in test set: {len(validated_test_ticker_dates)}")

Number of unique ticker-date pairs in train set: 2420
Number of unique ticker-date pairs in valid set: 134
Number of unique ticker-date pairs in test set: 135


In [13]:
# Verify no overlap between train and validation sets
train_valid_overlap = validated_train_ticker_dates.merge(validated_valid_ticker_dates, on=['ticker', 'date_string'])
print(f"Overlap between train and validation sets: {len(train_valid_overlap)}")

# Verify no overlap between train and test sets
train_test_overlap = validated_train_ticker_dates.merge(validated_test_ticker_dates, on=['ticker', 'date_string'])
print(f"Overlap between train and test sets: {len(train_test_overlap)}")

# Verify no overlap between validation and test sets
valid_test_overlap = validated_valid_ticker_dates.merge(validated_test_ticker_dates, on=['ticker', 'date_string'])
print(f"Overlap between validation and test sets: {len(valid_test_overlap)}")

Overlap between train and validation sets: 0
Overlap between train and test sets: 0
Overlap between validation and test sets: 0
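
The same check can be turned into assertions so that a leaking split stops the notebook instead of only printing a number. The helper names below (`pair_set`, `assert_disjoint`) are hypothetical, and the frames are toy data.

```python
import pandas as pd


def pair_set(df):
    """Return the set of (ticker, date_string) pairs present in df."""
    return set(zip(df["ticker"], df["date_string"]))


def assert_disjoint(splits):
    """Raise AssertionError if any two named splits share a ticker-date pair."""
    names = list(splits)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = pair_set(splits[first]) & pair_set(splits[second])
            assert not shared, f"{first} and {second} share {len(shared)} ticker-date pairs"


# Toy splits with hypothetical tickers and dates.
toy_train = pd.DataFrame({"ticker": ["AAA", "AAA"], "date_string": ["2023-01-03", "2023-01-04"]})
toy_valid = pd.DataFrame({"ticker": ["AAA", "BBB"], "date_string": ["2023-01-05", "2023-01-03"]})

assert_disjoint({"train": toy_train, "valid": toy_valid})  # passes silently
```

In the notebook this could be called as `assert_disjoint({"train": train_data, "valid": valid_data, "test": test_data})`.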


In [14]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13027535 entries, 0 to 14488414
Data columns (total 8 columns):
 #   Column       Dtype              
---  ------       -----              
 0   t            datetime64[ns, UTC]
 1   targ_o       float32            
 2   targ_h       float32            
 3   targ_l       float32            
 4   targ_c       float32            
 5   targ_v       float32            
 6   ticker       category           
 7   date_string  object             
dtypes: category(1), datetime64[ns, UTC](1), float32(5), object(1)
memory usage: 559.1+ MB


In [15]:
valid_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 739440 entries, 137880 to 14313218
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype              
---  ------       --------------   -----              
 0   t            739440 non-null  datetime64[ns, UTC]
 1   targ_o       739440 non-null  float32            
 2   targ_h       739440 non-null  float32            
 3   targ_l       739440 non-null  float32            
 4   targ_c       739440 non-null  float32            
 5   targ_v       739440 non-null  float32            
 6   ticker       739440 non-null  category           
 7   date_string  739440 non-null  object             
dtypes: category(1), datetime64[ns, UTC](1), float32(5), object(1)
memory usage: 31.7+ MB


In [16]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 721440 entries, 207000 to 14347778
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype              
---  ------       --------------   -----              
 0   t            721440 non-null  datetime64[ns, UTC]
 1   targ_o       721440 non-null  float32            
 2   targ_h       721440 non-null  float32            
 3   targ_l       721440 non-null  float32            
 4   targ_c       721440 non-null  float32            
 5   targ_v       721440 non-null  float32            
 6   ticker       721440 non-null  category           
 7   date_string  721440 non-null  object             
dtypes: category(1), datetime64[ns, UTC](1), float32(5), object(1)
memory usage: 31.0+ MB


In [17]:
train_dataset_path = os.path.join(DATA_DIR, "10s-candles-train.csv")
valid_dataset_path = os.path.join(DATA_DIR, "10s-candles-valid.csv")
test_dataset_path = os.path.join(DATA_DIR, "10s-candles-test.csv")

train_data.to_csv(train_dataset_path, index=False)
valid_data.to_csv(valid_dataset_path, index=False)
test_data.to_csv(test_dataset_path, index=False)
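
Note that CSV does not store dtypes, so the `float32` and `category` conversions made earlier are lost on disk and have to be reapplied when the files are read back. A minimal round-trip sketch on a toy frame (an in-memory buffer stands in for the file, and the values are made up):

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "t": pd.to_datetime(["2023-01-03 14:30:00", "2023-01-03 14:30:10"], utc=True),
    "targ_c": pd.Series([10.5, 10.75], dtype="float32"),
    "ticker": pd.Series(["AAA", "AAA"], dtype="category"),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Re-declare the dtypes on load; without this, targ_c comes back as float64
# and ticker as a plain string column.
restored = pd.read_csv(
    buffer,
    parse_dates=["t"],
    dtype={"targ_c": "float32", "ticker": "category"},
)

print(restored.dtypes)
```

A binary format such as Parquet keeps dtypes without this extra step, at the cost of an additional dependency.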