In [1]:
import pandas as pd
df = pd.read_parquet('TRAIN_Reco_2021_2022_2023.parquet.gzip').reset_index()

In [2]:
df['ExecutionTime'] = pd.to_datetime(df['ExecutionTime'])

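# Downcast to float16 to reduce memory usage. Note: float16 keeps only about 3 significant
# digits and overflows to inf above 65504, so large prices or volumes can be distorted.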
numerical_columns = ['high', 'low', 'close', 'volume']
df[numerical_columns] = df[numerical_columns].astype('float16')
df.dtypes

ExecutionTime    datetime64[ns, Europe/Berlin]
ID                                      object
high                                   float16
low                                    float16
close                                  float16
volume                                 float16
dtype: object

The dataset is split into training and validation sets by date range.

- **Train Set**: Rows with `ExecutionTime` from `'2023-01-01'` to `'2023-09-30'` are used to train the model.
- **Validation Set**: Rows with `ExecutionTime` from `'2023-10-01'` to `'2023-12-31'` are held out for validation.

Restricting the split to 2023 keeps the data small enough to train and validate on the available resources. Because the split is chronological, the validation period lies entirely after the training period.

In [3]:
train_start_date = '2023-01-01'
train_end_date = '2023-09-30'

val_start_date = '2023-10-01'
val_end_date = '2023-12-31'

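# A bare date string is parsed as midnight, so each end date contributes only its 00:00 row.
# .copy() makes the subsets independent of df, so later in-place changes are safe.
train_df = df[(df['ExecutionTime'] >= train_start_date) & (df['ExecutionTime'] <= train_end_date)].copy()
val_df = df[(df['ExecutionTime'] >= val_start_date) & (df['ExecutionTime'] <= val_end_date)].copy()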
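
The effect of comparing timestamps against a bare end date can be checked on toy data. The following is a minimal sketch, not part of the original pipeline: the frame `toy` and the names `inclusive` and `half_open` are made up for illustration, and the timestamps are timezone-naive for simplicity.

```python
import pandas as pd

# Toy 15-minute data covering 29 September to the end of 1 October (timezone-naive).
toy = pd.DataFrame(
    {"ExecutionTime": pd.date_range("2023-09-29", "2023-10-01 23:45", freq="15min")}
)
toy["close"] = 1.0

# Inclusive end date given as a bare string: it is parsed as 2023-09-30 00:00:00,
# so only the midnight row of 30 September is kept.
inclusive = toy[(toy["ExecutionTime"] >= "2023-09-29") & (toy["ExecutionTime"] <= "2023-09-30")]

# Half-open interval: everything strictly before the first validation day.
half_open = toy[(toy["ExecutionTime"] >= "2023-09-29") & (toy["ExecutionTime"] < "2023-10-01")]

print(len(inclusive), inclusive["ExecutionTime"].max())
print(len(half_open), half_open["ExecutionTime"].max())
```

With the half-open form, the last training day is kept in full and no row can fall into both sets.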

In [4]:
print(train_df["ID"].nunique(), val_df["ID"].nunique())

672 672


In [5]:
print(train_df.shape, val_df.shape)

(17545248, 6) (5483520, 6)


In [6]:
train_df.head()

Unnamed: 0,ExecutionTime,ID,high,low,close,volume
69513,2023-01-01 00:00:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0
69514,2023-01-01 00:15:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0
69515,2023-01-01 00:30:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0
69516,2023-01-01 00:45:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0
69517,2023-01-01 01:00:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0


Lag Features: The function `create_lag_rolling_features` creates lagged versions of the 'low', 'high', 'close', and 'volume' columns, so each row also carries the values from earlier time steps and the model can pick up temporal dependencies. The model has to predict the next 10 time steps (as per the project description), so 10 lags are used: they expose the previous 10 time steps, which are likely the most relevant for short-term forecasting.

Rolling Window Features: The function also calculates a rolling mean over a 10-period window for each of the columns. As written, the window covers the current row and the 9 rows before it. The mean smooths these 10 data points into a local average that reflects the recent movement in the data, over a span that matches the 10-step forecast horizon.
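
To make the two feature types concrete, here is a small sketch on a five-value toy series. It uses a window of 3 instead of 10 for brevity, and the column `close_past_mean_3` is a hypothetical variant (not in the notebook) that shifts the series first so the window excludes the current row.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="close")

features = pd.DataFrame({
    "close": s,
    # Value from the previous time step; the first row has no predecessor.
    "close_lag_1": s.shift(1),
    # Mean of the current row and the 2 rows before it.
    "close_rolling_mean_3": s.rolling(window=3).mean(),
    # Hypothetical variant: mean of the 3 rows strictly before the current one.
    "close_past_mean_3": s.shift(1).rolling(window=3).mean(),
})
print(features)

# Rows with incomplete history contain NaN and are lost when dropna() is applied.
print(len(features.dropna()))
```

The first rows of every lag and rolling column are NaN because there is not enough history, which is why a `dropna()` step is needed afterwards.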

In [7]:
train_df.set_index('ExecutionTime', inplace=True)
val_df.set_index('ExecutionTime', inplace=True)

def create_lag_rolling_features(df):
    # Lag features
    for column in ['low', 'high', 'close', 'volume']:
        for lag in range(1, 11):  # Create 10 lags
            df[f'{column}_lag_{lag}'] = df[column].shift(lag)
    
    # Rolling window features (rolling mean of the last 10 periods)
    for column in ['low', 'high', 'close', 'volume']:
        df[f'{column}_rolling_mean_10'] = df[column].rolling(window=10).mean()
    
    return df

In [8]:
# Apply the lag and rolling window function to each asset group separately in the training set
train_df = train_df.groupby('ID').apply(create_lag_rolling_features)

# Apply the lag and rolling window function to each asset group separately in the validation set
val_df = val_df.groupby('ID').apply(create_lag_rolling_features)
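
Grouping by `ID` matters because a plain `shift` would carry the last values of one asset into the first rows of the next. The sketch below illustrates this on a made-up two-asset frame; `toy`, `lag_global`, and `lag_per_id` are illustrative names, and it uses `groupby(...).shift` as one possible alternative to `apply`.

```python
import pandas as pd

toy = pd.DataFrame({
    "ID": ["A", "A", "B", "B"],
    "close": [1.0, 2.0, 10.0, 20.0],
})

# Shifting the whole column leaks asset A's last value into asset B's first row.
toy["lag_global"] = toy["close"].shift(1)

# Shifting within each group keeps every asset's history separate.
toy["lag_per_id"] = toy.groupby("ID")["close"].shift(1)

print(toy)
```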



In [9]:
# Handle missing values resulting from lagging
train_df.dropna(inplace=True)
val_df.dropna(inplace=True)

In [10]:
train_df.head()

ID,ExecutionTime,ID,high,low,close,volume,low_lag_1,low_lag_2,low_lag_3,low_lag_4,low_lag_5,...,volume_lag_5,volume_lag_6,volume_lag_7,volume_lag_8,volume_lag_9,volume_lag_10,low_rolling_mean_10,high_rolling_mean_10,close_rolling_mean_10,volume_rolling_mean_10
Fri00Q1,2023-01-01 02:30:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Fri00Q1,2023-01-01 02:45:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Fri00Q1,2023-01-01 03:00:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Fri00Q1,2023-01-01 03:15:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Fri00Q1,2023-01-01 03:30:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The `MinMaxScaler` is used here because it rescales each column to a fixed range, by default 0 to 1. Bounded inputs of similar magnitude suit models like LSTMs, whose sigmoid and tanh gates saturate for large inputs, and they tend to give more stable gradients and faster convergence. The `MinMaxScaler` also makes no assumption about the distribution of the data, which suits time series whose values may not follow a normal distribution. One scaler is fitted per asset on the training data only and then reused for that asset's validation data, so validation values can fall outside the 0 to 1 range. Only the four raw columns are scaled; the lag and rolling-mean columns created earlier keep their original scale.
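
The behaviour of a scaler fitted on training data only can be seen on a tiny example. This is a sketch with made-up numbers, not the notebook's data: the first column stands in for a price, and the second is constant, like the all-zero columns in the output above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training and validation blocks; the second column is constant.
train = np.array([[10.0, 0.0], [20.0, 0.0], [30.0, 0.0]])
val = np.array([[40.0, 0.0], [5.0, 0.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # min and max are learned from train only
val_scaled = scaler.transform(val)          # the same min and max are reused

print(train_scaled[:, 0])  # training values span the 0 to 1 range
print(val_scaled[:, 0])    # validation values can fall outside that range
print(train_scaled[:, 1])  # a constant column is mapped to 0

# The stored scaler can map scaled values (or predictions) back to the original units.
restored = scaler.inverse_transform(val_scaled)
```

Keeping one fitted scaler per asset, as the `scalers` dictionary does, is what makes the `inverse_transform` step possible later.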

In [11]:
from sklearn.preprocessing import MinMaxScaler

# List of columns to scale
columns_to_scale = ['high', 'low', 'close', 'volume']

# Create copies of the DataFrames
train_df_scaled = train_df.copy()
val_df_scaled = val_df.copy()

# Dictionary to store scalers for each asset
scalers = {}

# Assets present in training data
assets_in_train = train_df_scaled.index.get_level_values('ID').unique()

for asset in assets_in_train:
    # Training data for this asset
    asset_train_data = train_df_scaled.loc[asset, columns_to_scale]
    
    # Initialize and fit the scaler
    scaler = MinMaxScaler()
    scaled_train_values = scaler.fit_transform(asset_train_data)
    
    # Replace training data with scaled values
    train_df_scaled.loc[asset, columns_to_scale] = scaled_train_values
    
    # Store the scaler
    scalers[asset] = scaler
    
    # Check if the asset exists in validation data
    if asset in val_df_scaled.index.get_level_values('ID'):
        asset_val_data = val_df_scaled.loc[asset, columns_to_scale]
        
        # Transform validation data
        scaled_val_values = scaler.transform(asset_val_data)
        
        # Replace validation data with scaled values
        val_df_scaled.loc[asset, columns_to_scale] = scaled_val_values
    else:
        # Asset not in validation data; no action needed
        pass

In [12]:
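# Integer-encode the asset ID (added to the training frame only; val_df_scaled does not get this column)
train_df_scaled['ID_numeric'] = train_df_scaled['ID'].astype('category').cat.codes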
train_df_scaled.head()

ID,ExecutionTime,ID,high,low,close,volume,low_lag_1,low_lag_2,low_lag_3,low_lag_4,low_lag_5,...,volume_lag_6,volume_lag_7,volume_lag_8,volume_lag_9,volume_lag_10,low_rolling_mean_10,high_rolling_mean_10,close_rolling_mean_10,volume_rolling_mean_10,ID_numeric
Fri00Q1,2023-01-01 02:30:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
Fri00Q1,2023-01-01 02:45:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
Fri00Q1,2023-01-01 03:00:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
Fri00Q1,2023-01-01 03:15:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
Fri00Q1,2023-01-01 03:30:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


In [13]:
import pandas as pd
from darts import TimeSeries
from darts.models import RNNModel



In [14]:
train_df_scaled = train_df_scaled.reset_index(level='ID', drop=True)
train_df_scaled = train_df_scaled.reset_index()
train_df_scaled['ExecutionTime'] = pd.to_datetime(train_df_scaled['ExecutionTime']).dt.tz_localize(None)

train_df_scaled.head()

Unnamed: 0,ExecutionTime,ID,high,low,close,volume,low_lag_1,low_lag_2,low_lag_3,low_lag_4,...,volume_lag_6,volume_lag_7,volume_lag_8,volume_lag_9,volume_lag_10,low_rolling_mean_10,high_rolling_mean_10,close_rolling_mean_10,volume_rolling_mean_10,ID_numeric
0,2023-01-01 02:30:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,2023-01-01 02:45:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,2023-01-01 03:00:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3,2023-01-01 03:15:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,2023-01-01 03:30:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


In [15]:
val_df_scaled = val_df_scaled.reset_index(level='ID', drop=True)
val_df_scaled = val_df_scaled.reset_index()
val_df_scaled['ExecutionTime'] = pd.to_datetime(val_df_scaled['ExecutionTime']).dt.tz_localize(None)

val_df_scaled.head()

Unnamed: 0,ExecutionTime,ID,high,low,close,volume,low_lag_1,low_lag_2,low_lag_3,low_lag_4,...,volume_lag_5,volume_lag_6,volume_lag_7,volume_lag_8,volume_lag_9,volume_lag_10,low_rolling_mean_10,high_rolling_mean_10,close_rolling_mean_10,volume_rolling_mean_10
0,2023-10-01 02:30:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2023-10-01 02:45:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2023-10-01 03:00:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2023-10-01 03:15:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2023-10-01 03:30:00,Fri00Q1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
# val_df_scaled.to_csv('val.csv')
# train_df_scaled.to_csv('train.csv')

In [17]:
# Prepare the time series for both targets and covariates for each asset
def create_time_series(df):
    asset_time_series = {}
    asset_covariates = {}
    for asset in df['ID'].unique():
        # Filter the data for each asset
        asset_data = df[df['ID'] == asset]
        
        # Create TimeSeries object for target columns (high, low, close, volume).
        # fill_missing_dates=True inserts NaN rows for any missing 15-minute timestamps.
        # '15min' replaces the deprecated '15T' frequency alias.
        ts = TimeSeries.from_dataframe(asset_data, 'ExecutionTime', 
                                       ['high', 'low', 'close', 'volume'],
                                       fill_missing_dates=True, freq='15min')
        
        # Create TimeSeries object for covariates (lag features and rolling means)
        covariates = TimeSeries.from_dataframe(asset_data, 'ExecutionTime', 
                                               [col for col in df.columns if 'lag' in col or 'rolling_mean' in col],
                                               fill_missing_dates=True, freq='15min')
        
        asset_time_series[asset] = ts
        asset_covariates[asset] = covariates
    return asset_time_series, asset_covariates

asset_time_series, asset_covariates = create_time_series(train_df_scaled)
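
Because `fill_missing_dates=True` pads missing timestamps with NaN, it is worth checking how many gaps a series has before it is passed to the model. The sketch below is a pandas-only illustration on toy data (it does not call darts): it shows how dropping the `Europe/Berlin` timezone, as done with `tz_localize(None)` above, leaves a one-hour gap at the spring daylight-saving change, which a regular 15-minute grid then exposes as missing rows. Rows removed by `dropna()` would leave similar gaps; whether this is the only source of NaN values in the real data is not verified here.

```python
import pandas as pd

# Toy 15-minute series spanning the 2023 spring DST change in Europe/Berlin.
utc_index = pd.date_range("2023-03-25 23:00", periods=21, freq="15min", tz="UTC")
local_index = utc_index.tz_convert("Europe/Berlin")
toy = pd.Series(1.0, index=local_index, name="close")

# Dropping the timezone keeps the local wall-clock times, so 02:00 to 02:45 never occurs.
toy.index = toy.index.tz_localize(None)

# Reindexing onto a regular 15-minute grid turns the skipped hour into NaN rows.
full_grid = pd.date_range(toy.index.min(), toy.index.max(), freq="15min")
regular = toy.reindex(full_grid)
missing = regular[regular.isna()].index

print(len(missing), list(missing.strftime("%H:%M")))
```

A check like `regular.isna().sum()` per asset would show whether gaps need to be filled or interpolated before training.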


In [21]:
# Run the model for each asset with covariates
def run_model_for_each_asset(asset_time_series, asset_covariates):
    models = {}
    predictions = {}

    for asset, ts in asset_time_series.items():
        covariates = asset_covariates[asset]
        
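        # Define the model. Note: RNNModel ignores output_chunk_length and uses a fixed value of 1,
        # so the 10-step forecast is produced one step at a time.
        model = RNNModel(input_chunk_length=15, output_chunk_length=10, model="LSTM", n_epochs=10)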
        
        # Train the model on the asset's time series and covariates
        model.fit(ts, future_covariates=covariates)
        models[asset] = model
        
        # Predict the next 10 time steps
        prediction = model.predict(10, future_covariates=covariates)
        predictions[asset] = prediction
        print(f"Asset {asset} prediction:\n", prediction)
    
    return models, predictions

models, predictions = run_model_for_each_asset(asset_time_series, asset_covariates)

ignoring user defined `output_chunk_length`. RNNModel uses a fixed `output_chunk_length=1`.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name            | Type             | Params | Mode 
-------------------------------------------------------------
0 | criterion       | MSELoss          | 0      | train
1 | train_criterion | MSELoss          | 0      | train
2 | val_criterion   | MSELoss          | 0      | train
3 | train_metrics   | MetricCollection | 0      | train
4 | val_metrics     | MetricCollection | 0      | train
5 | rnn             | LSTM             | 7.5 K  | train
6 | V               | Linear           | 104    | train
-------------------------------------------------------------
7.6 K     Trainable params
0         Non-trainable params
7.6 K     Total params
0.030     Total estimated model params size (MB)
7         Modules in train mode
0         Modules in eval mode


Epoch 9: 100%|██████████| 815/815 [00:17<00:00, 46.18it/s, train_loss=nan.0] 

`Trainer.fit` stopped: `max_epochs=10` reached.



ValueError: For the given forecasting horizon `n=10`, the provided future covariates at dataset index `0` do not extend far enough into the future. As `n > output_chunk_length` the future covariates must end at time step `2023-09-30 02:30:00`, whereas now they end at time step `2023-09-30 00:00:00`.
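
The timestamps in this error follow from simple arithmetic: the covariates stop at `2023-09-30 00:00:00`, and a forecast of `n=10` steps at 15-minute resolution needs them to reach 150 minutes further. The sketch below reproduces that calculation with pandas only. The helper `required_covariate_end` is a hypothetical name, and the rule it encodes (covariates must reach `n` steps past the last covariate timestamp) is inferred from the error message above rather than taken from the darts API.

```python
import pandas as pd

def required_covariate_end(last_timestamp, n, freq="15min"):
    """Timestamp the future covariates must reach for an n-step forecast (inferred rule)."""
    return pd.Timestamp(last_timestamp) + n * pd.Timedelta(freq)

# Values taken from the error message: the covariates end at midnight on 30 September.
covariate_end = pd.Timestamp("2023-09-30 00:00:00")
needed = required_covariate_end(covariate_end, n=10)

print(needed)                   # the end time darts asks for
print(covariate_end >= needed)  # False: the covariates are 10 steps too short
```

Since the covariates here are lags and rolling means of the target itself, they exist only up to the end of the target series. Future covariates in darts are meant to be values known ahead of time, so extending these columns 10 steps forward would require values that are not yet observed.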




