# Toy Models

In this notebook, we will explore Transformer Architectures to perform classification of price movement on stock data. We will preprocess the stock data to contain the Times in the form of sines and cosines in order to feed additional data to our model. We will follow the approach outlined in the [paper](https://arxiv.org/pdf/2010.02803.pdf), where the features are projected into high dimensional space and a time/sequence representation is learned by our model

### Library Import

In [1]:
import os
import sys
import numpy as np
import pandas as pd
import pandas_ta as ta
import tensorflow as tf
import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (20, 10)
%matplotlib inline

### Local Imports

In [2]:
from window_generator import WindowGenerator

In [3]:
# for python scripts use: "os.path.dirname(__file__)" instead of "os.path.abspath('')"
sys.path.append(
    os.path.abspath(os.path.join(os.path.abspath(''), os.path.pardir)))

from data_clean import get_trading_times

#### Ensure that GPU is available

In [4]:
tf.config.list_physical_devices('GPU')  

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

### Get the Data

In [72]:
data_path = r'..\data\raw\AAPL_15min.csv'
df = pd.read_csv(data_path, index_col=0, 
                 parse_dates=True, infer_datetime_format=True)

# df = get_trading_times(df)
df = df.dropna()


# add days, hours, and minutes to the dataset
dayofweek = df.index.dayofweek
hour = df.index.hour
minute = df.index.minute

# encode the days, hours, and minutes with sin and cos functions
days_in_week = 7
hours_in_day = 24
minutes_in_hour = 60

df['sin_day'] = np.sin(2*np.pi*dayofweek/days_in_week)
df['cos_day'] = np.cos(2*np.pi*dayofweek/days_in_week)
df['sin_hour'] = np.sin(2*np.pi*hour/hours_in_day)
df['cos_hour'] = np.cos(2*np.pi*hour/hours_in_day)
df['sin_minute'] = np.sin(2*np.pi*minute/minutes_in_hour)
df['cos_minute'] = np.cos(2*np.pi*minute/minutes_in_hour)


### Add target columns
We will add a column for price change at each interval, this will be our regression target variable. We will also add another column that quantifys the magnitude of the price change, this will be out target variable for classification.

In [73]:
df['price_diff'] = df['close'].diff()

thresh = 0.07 # 0.25 # dollars
df['price_change'] = 1 # price stays the same
df['price_change'][df['price_diff'] < -thresh] = 0 # downward price movement
df['price_change'][df['price_diff'] > thresh] = 2 # upward prive movement

# # possibly predict two classes of price movements
# # on s scale from 0-4
# thresh1 = 0.03
# thresh2 = 0.15

# df['price_change'] = 2 # price stays the same

# df['price_change'][(df['price_diff'] < -thresh1) 
#                    & (df['price_diff'] >= -thresh2)] = 1 # downward price movement
# df['price_change'][df['price_diff'] < -thresh2] = 0 # large downward price movement

# df['price_change'][(df['price_diff'] > thresh1) 
#                    & (df['price_diff'] <= thresh2)] = 3 # upward price movement
# df['price_change'][df['price_diff'] > thresh2] = 4 # large upward prive movement

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_change'][df['price_diff'] < -thresh] = 0 # downward price movement
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_change'][df['price_diff'] > thresh] = 2 # upward prive movement


In [74]:
df = df.dropna()
df.head()

Unnamed: 0_level_0,open,high,low,close,volume,sin_day,cos_day,sin_hour,cos_hour,sin_minute,cos_minute,price_diff,price_change
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2020-10-01 04:30:00,115.634512,115.792604,115.407254,115.407254,13550.0,0.433884,-0.900969,0.866025,0.5,5.665539e-16,-1.0,-0.207496,0
2020-10-01 04:45:00,115.367731,115.367731,115.120712,115.308447,12857.0,0.433884,-0.900969,0.866025,0.5,-1.0,-1.83697e-16,-0.098808,0
2020-10-01 05:00:00,115.308447,115.397374,115.298566,115.318327,10079.0,0.433884,-0.900969,0.965926,0.258819,0.0,1.0,0.009881,1
2020-10-01 05:15:00,115.417135,115.604869,115.377612,115.604869,3534.0,0.433884,-0.900969,0.965926,0.258819,1.0,2.832769e-16,0.286542,2
2020-10-01 05:30:00,115.604869,115.703677,115.555466,115.703677,7688.0,0.433884,-0.900969,0.965926,0.258819,5.665539e-16,-1.0,0.098808,2


In [75]:
print(f'Number of Downward Price Movements: {np.sum(df.price_change == 0)}')
print(f'Number of no/small changes in price: {np.sum(df.price_change == 1)}')
print(f'Number of Upward Price Movements: {np.sum(df.price_change == 2)}')


# print(f'Large Downward Price Movements: {np.sum(df.price_change == 0)}')
# print(f'Downward Price Movements: {np.sum(df.price_change == 1)}')
# print(f'Small Price Movements: {np.sum(df.price_change == 2)}')
# print(f'Upward Price Movements: {np.sum(df.price_change == 3)}')
# print(f'Large Upward Price Movements: {np.sum(df.price_change == 4)}')

Number of Downward Price Movements: 9981
Number of no/small changes in price: 11291
Number of Upward Price Movements: 10350


## Check data

In [76]:
df.describe()

Unnamed: 0,open,high,low,close,volume,sin_day,cos_day,sin_hour,cos_hour,sin_minute,cos_minute,price_diff,price_change
count,31622.0,31622.0,31622.0,31622.0,31622.0,31622.0,31622.0,31622.0,31622.0,31622.0,31622.0,31622.0,31622.0
mean,144.179509,144.392513,143.960073,144.178949,1345607.0,0.366732,-0.089911,0.02728801,-0.4121733,0.000284612,9.487066e-05,0.001293,1.011669
std,18.370809,18.394063,18.344012,18.370677,2021395.0,0.51386,0.770323,0.7743114,0.4794128,0.7071179,0.707118,0.350126,0.801762
min,106.134163,106.593618,106.040296,106.131989,103.0,-0.433884,-0.900969,-1.0,-1.0,-1.0,-1.0,-5.529062,0.0
25%,128.513797,128.685394,128.347936,128.512853,18027.25,0.0,-0.900969,-0.8660254,-0.8660254,0.0,-1.83697e-16,-0.118774,0.0
50%,144.840287,144.998089,144.690389,144.840287,156044.5,0.433884,-0.222521,1.224647e-16,-0.5,5.665539e-16,2.832769e-16,0.0,1.0
75%,159.175568,159.494582,158.866376,159.175493,2165501.0,0.781831,0.62349,0.8660254,-1.83697e-16,1.0,1.0,0.119833,2.0
max,182.594933,182.624809,182.475427,182.604892,26337530.0,0.974928,1.0,1.0,0.5,1.0,1.0,5.965332,2.0


We see that the stock prices are widelt different from the trading volume, we will take the log of the data to get them into the same neihghborhood

In [77]:
# df.iloc[:, :5] = np.log(df.iloc[:, :5])

# or maybe just get the log volumn?
df['volume'] = np.log(df['volume'])

In [78]:
df.describe()

Unnamed: 0,open,high,low,close,volume,sin_day,cos_day,sin_hour,cos_hour,sin_minute,cos_minute,price_diff,price_change
count,31622.0,31622.0,31622.0,31622.0,31622.0,31622.0,31622.0,31622.0,31622.0,31622.0,31622.0,31622.0,31622.0
mean,144.179509,144.392513,143.960073,144.178949,12.100037,0.366732,-0.089911,0.02728801,-0.4121733,0.000284612,9.487066e-05,0.001293,1.011669
std,18.370809,18.394063,18.344012,18.370677,2.593724,0.51386,0.770323,0.7743114,0.4794128,0.7071179,0.707118,0.350126,0.801762
min,106.134163,106.593618,106.040296,106.131989,4.634729,-0.433884,-0.900969,-1.0,-1.0,-1.0,-1.0,-5.529062,0.0
25%,128.513797,128.685394,128.347936,128.512853,9.79964,0.0,-0.900969,-0.8660254,-0.8660254,0.0,-1.83697e-16,-0.118774,0.0
50%,144.840287,144.998089,144.690389,144.840287,11.957896,0.433884,-0.222521,1.224647e-16,-0.5,5.665539e-16,2.832769e-16,0.0,1.0
75%,159.175568,159.494582,158.866376,159.175493,14.588162,0.781831,0.62349,0.8660254,-1.83697e-16,1.0,1.0,0.119833,2.0
max,182.594933,182.624809,182.475427,182.604892,17.086506,0.974928,1.0,1.0,0.5,1.0,1.0,5.965332,2.0


In [47]:
df.head()

Unnamed: 0_level_0,open,high,low,close,volume,price_diff,price_change,sin_day,cos_day,sin_hour,cos_hour,sin_minute,cos_minute
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2020-10-01 04:30:00,115.634512,115.792604,115.407254,115.407254,13550.0,-0.207496,0,0.433884,-0.900969,0.866025,0.5,5.665539e-16,-1.0
2020-10-01 04:45:00,115.367731,115.367731,115.120712,115.308447,12857.0,-0.098808,0,0.433884,-0.900969,0.866025,0.5,-1.0,-1.83697e-16
2020-10-01 05:00:00,115.308447,115.397374,115.298566,115.318327,10079.0,0.009881,1,0.433884,-0.900969,0.965926,0.258819,0.0,1.0
2020-10-01 05:15:00,115.417135,115.604869,115.377612,115.604869,3534.0,0.286542,2,0.433884,-0.900969,0.965926,0.258819,1.0,2.832769e-16
2020-10-01 05:30:00,115.604869,115.703677,115.555466,115.703677,7688.0,0.098808,2,0.433884,-0.900969,0.965926,0.258819,5.665539e-16,-1.0


## Compute Technical Indicators

In this portion we will compute several Technical Indicators that will help feed the model more useful information.

We will compute the:
- [Awsome Oscillator](https://www.ifcm.co.uk/ntx-indicators/awesome-oscillator)
- RSI 
- SMA
- EMA


In [84]:
# momentum indicators
awsome_oscillator = ta.momentum.ao(df.high, df.low, fast=5, slow=34)

rsi_14 = ta.momentum.rsi(df.close, length=14)
rsi_24 = ta.momentum.rsi(df.close, length=24)

stoch_rsi_14 = ta.momentum.rsi(df.close, rsi_length=14, length=14)
stoch_rsi_24 = ta.momentum.rsi(df.close, rsi_length=24, length=24)

tsi = ta.momentum.tsi(df.close)

# ema_10 = ta.ema(df.close, length=10)
ema_20 = ta.ema(df.close, length=20)
# ema_30 = ta.ema(df.close, length=30)

# volume indicators
acc_dist = ta.volume.ad(df.high, df.low, df.close, df.open)

In [None]:
help(ta.ema)

In [None]:
dir(ta.trend)

Place everything in the data frame

In [85]:
indicators = [
    awsome_oscillator,
    rsi_14, 
    rsi_24, 
    stoch_rsi_14, 
    stoch_rsi_24,
    tsi, 
    # ema_10, 
    ema_20, 
    # ema_30, 
    acc_dist 
]

df = pd.concat([df, pd.concat(indicators, axis=1)], axis=1)

df = df.dropna()
df.head()

Unnamed: 0_level_0,open,high,low,close,volume,sin_day,cos_day,sin_hour,cos_hour,sin_minute,...,price_change,AO_5_34,RSI_14,RSI_24,RSI_14,RSI_24,TSI_13_25_13,TSIs_13_25_13,EMA_20,AD
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2020-10-01 12:45:00,114.863813,115.061428,114.626674,114.794548,2690445.0,0.433884,-0.900969,1.224647e-16,-1.0,-1.0,...,0,-0.702274,37.401455,40.240973,37.401455,40.240973,-8.961924,-4.83003,115.317195,420.912656
2020-10-01 13:00:00,114.789707,114.972501,114.685959,114.942859,2110737.0,0.433884,-0.900969,-0.258819,-0.965926,0.0,...,2,-0.783354,40.976625,42.61357,40.976625,42.61357,-9.05096,-5.43302,115.281544,511.952769
2020-10-01 13:15:00,114.942859,115.160235,114.893455,115.051547,1936852.0,0.433884,-0.900969,-0.258819,-0.965926,1.0,...,2,-0.78782,43.522318,44.304539,43.522318,44.304539,-8.465868,-5.866284,115.259639,533.238483
2020-10-01 13:30:00,115.051547,115.160235,114.83417,114.993745,2435041.0,0.433884,-0.900969,-0.258819,-0.965926,5.665539e-16,...,1,-0.72472,42.473151,43.591724,42.473151,43.591724,-8.286458,-6.212023,115.234316,530.797996
2020-10-01 13:45:00,115.003724,115.377612,114.954222,115.272481,2259626.0,0.433884,-0.900969,-0.258819,-0.965926,-1.0,...,2,-0.631017,48.873587,47.816401,48.873587,47.816401,-6.415967,-6.241158,115.237951,588.689019


### Get Standardized train, valid, and test sets

Split into train, valid, and test sets. And then standardize with training mean and standard deviation

In [79]:
train_df = df.loc['2020-10-01':'2021-10-01']
valid_df = df.loc['2021-10-02':'2022-05-01']
test_df = df.loc['2022-05-02':]

train_mean = train_df.mean()
train_std = train_df.std()

# ensure that target column is not standardized
train_mean.price_change = 0
train_std.price_change = 1

train_df = (train_df - train_mean) / train_std
valid_df = (valid_df - train_mean) / train_std
test_df = (test_df - train_mean) / train_std

# (train_df * train_std + train_mean)

print(train_df.shape)
print(valid_df.shape)
print(test_df.shape)

(16112, 13)
(9243, 13)
(6267, 13)


In [80]:
print(f'Number of Downward Price Movements: {np.sum(train_df.price_change == 0)}')
print(f'Number of no/small changes in price: {np.sum(train_df.price_change == 1)}')
print(f'Number of Upward Price Movements: {np.sum(train_df.price_change == 2)}')

Number of Downward Price Movements: 4759
Number of no/small changes in price: 6476
Number of Upward Price Movements: 4877


### Get Data Generator for each time step

In [81]:
data_gen = WindowGenerator(
                input_width=32, label_width=1, shift=1, 
                train_df=train_df, valid_df=valid_df, test_df=test_df,
                remove_labels_from_inputs=True, batch_size=32,
                label_columns=['price_change'])

In [82]:
for inputs, targets in data_gen.train.take(1):
    print(f'Inputs shape (batch, time, features): {inputs.shape}')
    print(f'Targets shape (batch, time, features): {targets.shape}')

Inputs shape (batch, time, features): (32, 32, 12)
Targets shape (batch, time, features): (32, 1, 1)


## **Start Training Models**

First we will define a helper function to streamline this process

In [32]:
def compile_and_fit(model, window, lr=1e-4, max_epochs=100, patience=2):
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                      patience=patience,
                                                      mode='min')

    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  metrics=['accuracy'])

    history = model.fit(window.train, epochs=max_epochs,
                        validation_data=window.valid,
                        callbacks=[early_stopping])
    return history

Track the performance of several models

In [33]:
val_performance = {}
performance = {}

We will also define a learning rate scheduler

In [83]:
def lr_scheduler(epoch, lr, warmup_epochs=15, decay_epochs=100, initial_lr=1e-6, base_lr=1e-3, min_lr=5e-5):
    if epoch <= warmup_epochs:
        pct = epoch / warmup_epochs
        return ((base_lr - initial_lr) * pct) + initial_lr

    if epoch > warmup_epochs and epoch < warmup_epochs+decay_epochs:
        pct = 1 - ((epoch - warmup_epochs) / decay_epochs)
        return ((base_lr - min_lr) * pct) + min_lr

    return min_lr

Setup the Transformer Encoder

Build model with Keras classes

Things to try:
- Try to replace Layer Normalization with Batch Normalization and observe the results

Position Encoding Code from: https://www.tensorflow.org/text/tutorials/transformer. Instead of using an embedding layer for a Word based model, we simply project the input into higher dimensional space using a single linear layer with no activation.

In [84]:
def positional_encoding(length, depth):
    depth = depth/2

    positions = np.arange(length)[:, np.newaxis]     # (seq, 1)
    depths = np.arange(depth)[np.newaxis, :]/depth   # (1, depth)

    angle_rates = 1 / (10000**depths)         # (1, depth)
    angle_rads = positions * angle_rates      # (pos, depth)

    pos_encoding = np.concatenate(
        [np.sin(angle_rads), np.cos(angle_rads)],
        axis=-1) 

    return tf.cast(pos_encoding, dtype=tf.float32)


class PositionalEmbedding(layers.Layer):
    def __init__(self, d_model, ff_dim):
        super().__init__()
        self.d_model = d_model
        self.ff_dim = ff_dim


    # def compute_mask(self, *args, **kwargs):
    #     return self.embedding.compute_mask(*args, **kwargs)


    def build(self, input_shape):
        # self.embedding = tf.keras.layers.Embedding(vocab_size, d_model, mask_zero=True) 
        self.embedding = layers.Dense(self.d_model)
        self.pos_encoding = positional_encoding(length=self.ff_dim, depth=self.d_model)


    def call(self, x):
        length = tf.shape(x)[1]
        x = self.embedding(x)
        # This factor sets the relative scale of the embedding and positonal_encoding.
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x = x + self.pos_encoding[tf.newaxis, :length, :]
        return x


In [85]:
from tensorflow_addons.layers import MultiHeadAttention

class TransformerEncoder(tf.keras.layers.Layer):

    def __init__(self, n_heads, d_model, ff_dim, dropout=0):
        super().__init__()
        
        self.n_heads = n_heads
        self.d_model = d_model
        self.ff_dim = ff_dim
        self.dropout = dropout

        self.attn_heads = list()


    def build(self, input_shape):
        
        # attention portion
        self.attn_multi = MultiHeadAttention(num_heads=self.n_heads, 
                                             head_size=self.d_model, 
                                             dropout=self.dropout)
        self.attn_dropout = layers.Dropout(self.dropout)
        self.attn_norm = layers.LayerNormalization(epsilon=1e-6)

        # feedforward portion
        self.ff_conv1 = layers.Conv1D(filters=self.ff_dim, 
                                      kernel_size=1, 
                                      activation='relu')
        self.ff_dropout = layers.Dropout(self.dropout)
        self.ff_conv2 = layers.Conv1D(filters=input_shape[-1],
                                      kernel_size=1)
        self.ff_norm = layers.LayerNormalization(epsilon=1e-6)


    def call(self, inputs):
        # attention portion
        x = self.attn_multi([inputs, inputs])
        x = self.attn_dropout(x)
        x = self.attn_norm(x)

        # get first residual
        res = x + inputs
        
        # feedforward portion
        x = self.ff_conv1(res)
        x = self.ff_dropout(x)
        x = self.ff_conv2(x)
        x = self.ff_norm(x)
        
        # return residual
        return res + x
    
    # Needed for saving and loading model with custom layer
    def get_config(self): 
        config = super().get_config().copy()
        config.update({'d_k': self.d_k,
                       'd_v': self.d_v,
                       'n_heads': self.n_heads,
                       'ff_dim': self.ff_dim,
                       'attn_heads': self.attn_heads,
                       'dropout': self.dropout_rate})
        return config          


In [86]:
class TransformerModel(keras.Model):

    def __init__(self, 
            n_heads,
            d_model,
            ff_dim,
            num_transformer_blocks,
            mlp_units,
            n_outputs=3,
            dropout=0.1,
            mlp_dropout=0.1):
            
        super().__init__()
        
        self.n_heads = n_heads
        self.d_model = d_model
        self.ff_dim = ff_dim
        self.num_transformer_blocks = num_transformer_blocks
        self.mlp_units = mlp_units
        self.n_outputs = n_outputs
        self.dropout = dropout
        self.mlp_dropout = mlp_dropout

        
         
    def build(self, input_shape):

        # get embedding layer that projects inputs inot high dimensional space
        # self.embed = layers.Dense(self.d_model)

        # get learnable time layer
        # self.time_layer = layers.Layer(tf.random.uniform((input_shape[1], self.d_model), -0.2, 0.2))
        # self.time_layer = tf.Variable(
        #     initial_value=tf.random.uniform((input_shape[1], self.d_model), -0.2, 0.2)
        #     )
        
        # get positional embedding
        self.positional_embedding = PositionalEmbedding(self.d_model, self.ff_dim)

        # get transformer encoders
        self.encoders = [TransformerEncoder(self.n_heads, self.d_model, self.ff_dim, self.dropout) 
                         for _ in range(self.num_transformer_blocks)]

        self.avg_pool = layers.GlobalAveragePooling1D(data_format="channels_first")

        # get MLP portion of network
        self.mlp_layers = []
        for dim in self.mlp_units:
            self.mlp_layers.append(layers.Dense(dim, activation="relu"))
            self.mlp_layers.append(layers.Dropout(self.mlp_dropout))

        # output layer 
        self.mlp_output = layers.Dense(self.n_outputs, activation='softmax')


    def call(self, x):

        # project input data into high dimensional space
        # x = self.embed(x)

        # inject time information ??
        # x = x + self.time_layer(x)


        # Project Input to high Dimensional Space and Encode Position Information
        x = self.positional_embedding(x)
        
        # Encoder Portion
        for encoder in self.encoders:
            x = encoder(x)

        # Average Pooling
        x = self.avg_pool(x)

        # MLP portion for classification
        for mlp_layer in self.mlp_layers:
            x = mlp_layer(x)

        x = self.mlp_output(x)

        return x

In [87]:
transformer_model = TransformerModel(
            n_heads=2,
            d_model=512,
            ff_dim=256,
            num_transformer_blocks=2,
            mlp_units=[256],
            n_outputs=3,
            dropout=0.1,
            mlp_dropout=0.1)

In [88]:
compile_and_fit(transformer_model, data_gen, lr=1e-4,
                patience=10, max_epochs=10)

val_performance['transformer_3'] = transformer_model.evaluate(data_gen.valid)
performance['transformer_3'] = transformer_model.evaluate(data_gen.test, verbose=0)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [90]:
val_performance

{'transformer_0': [1.077562928199768, 0.43556615710258484],
 'transformer_1': [1.0587785243988037, 0.4224297106266022],
 'transformer_2': [1.076261281967163, 0.4237324893474579],
 'transformer_3': [1.089512586593628, 0.408967524766922]}

In [106]:
compile_and_fit(transformer_model, data_gen, lr=1e-4,
                patience=5, max_epochs=10)

val_performance['transformer_32_32'] = transformer_model.evaluate(data_gen.valid)
performance['transformer_32_32'] = transformer_model.evaluate(data_gen.test, verbose=0)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
val_performance

In [107]:
val_performance

{'transformer_2': [1.0490920543670654, 0.4089961647987366],
 'transformer_64': [1.0982391834259033, 0.418564110994339],
 'transformer_32': [1.0950852632522583, 0.4148600101470947],
 'transformer_32_32': [1.0332379341125488, 0.44132015109062195]}

In [47]:
val_performance

{'transformer_0': [1.0700632333755493, 0.41823726892471313],
 'transformer_1': [1.0255502462387085, 0.4451465308666229],
 'transformer_2': [1.0238858461380005, 0.44533708691596985]}

In [58]:
data_gen.get_position_encoding().shape
p = data_gen.get_position_encoding()
np.repeat(p[None, :, :], 10, axis=0).shape

(10, 128, 13)

In [48]:
val_performance_2 = {}
performance_2 = {}

In [117]:
transformer_model_2 = TransformerModel(
            n_heads=4,
            d_model=512,
            ff_dim=256,
            num_transformer_blocks=2,
            mlp_units=[256],
            n_outputs=3,
            dropout=0.1,
            mlp_dropout=0.1)

In [118]:
compile_and_fit(transformer_model_2, data_gen, lr=1e-4,
                patience=10, max_epochs=10)

val_performance_2['transformer_32_32'] = transformer_model_2.evaluate(data_gen.valid)
performance_2['transformer_32_32'] = transformer_model_2.evaluate(data_gen.test, verbose=0)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
 27/503 [>.............................] - ETA: 1:10 - loss: 1.0838 - accuracy: 0.4306

KeyboardInterrupt: 

In [119]:
val_performance_2['transformer_32_32'] = transformer_model_2.evaluate(data_gen.valid)
performance_2['transformer_32_32'] = transformer_model_2.evaluate(data_gen.test, verbose=0)



In [120]:
val_performance_2

{'transformer_3': [1.0490920543670654, 0.4089961647987366],
 'transformer_32_32': [1.0537605285644531, 0.42481815814971924]}

### Notes

Projecting the nputs into high dimensional space appears to increase performance by a significant margin.

Adding in static position encodings for each input also appears to improve performance, but not by a large margin.

NOTE: Before standardization, the input features are on different scales, i.e. open/close: 100's time sinusoids: 1's., volume: 100's, Indicators 100-500's, possibly need to transform this data before standardization? Let's experiment with this.


It seems like 32 may be the best overall sequence length as it consistently performs better than 64 and and 128. With a simple transformer longer sequences seem to overfit and decrease performance after just a few epochs. However, using more complex model with more Attention Heads appears to allow the model to better understand the longer sequence, but they still overfit and decrease performance after around 10 epochs. It's possible that the learning rate may need to be scheduled for the models to train well on longer seuqences.

It seems like adding multiple heads doesn't really add anything. The base model for feature engineering will be 'transformer_model'. This isn't too complex, but it also isn't extremely simple. We can see which combinations of features provide the best information.

Code for setting up learning rate scheduler in keras

In [None]:
 early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                   patience=10,
                                                   mode='min')
 
 callbacks = [tf.keras.callbacks.LearningRateScheduler(lr_scheduler),
              early_stopping]