# A transformer with both encoder and decoder. 

### Intuitively, encoders "understand and interpret", while decoders "write and describe". In Transformer_with_frozen_conv_1.ipynb, we used only an encoder: our goal is timeseries regression (producing a single target value), not timeseries prediction (producing a new timeseries), so there was no need for a decoder to "create a new timeseries in sequence". However, our target value is in fact the sum of a new timeseries, so we now investigate using autoregression with a decoder to generate that timeseries and then sum its time steps. It is unclear whether this will help at all. 

### As mentioned, we will use autoregression with a decoder to create a new timeseries. Note that we will NOT implement teacher forcing with a masked decoder for now. Teacher forcing is a powerful tool, but the context here is different: we do not have another timeseries to use as ground truth to train toward, so we have "nothing to hide" with the masking. Instead, we train on the loss between the actual target (future RV) and the timeseries created by autoregression. This might not even be possible: autoregression is normally run under a no_grad() context, and I foresee a memory explosion without it, since autograd keeps the computation graph of every generation step, which is HUGE for autoregression. But we will see. 
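To make the memory concern concrete, here is a minimal sketch of an autoregressive rollout trained against a scalar target. It is not the model from `proj_mod.training`: the toy sizes, the `rollout` helper, and the `to_value` / `to_token` layers are all hypothetical stand-ins built from stock `torch.nn` modules. It only illustrates that, with grad enabled, the generated series carries a graph through every step, while under `no_grad()` it carries none.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration (not the notebook's settings).
emb_dim, n_steps, batch = 8, 5, 4

decoder_layer = nn.TransformerDecoderLayer(
    d_model=emb_dim, nhead=2, dim_feedforward=16, dropout=0.0, batch_first=True
)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=1)
to_value = nn.Linear(emb_dim, 1)  # hypothetical head: embedding -> one time step
to_token = nn.Linear(1, emb_dim)  # hypothetical embedder: time step -> next token


def rollout(memory, n_steps):
    """Generate n_steps values one at a time, feeding each value back in."""
    tokens = torch.zeros(memory.size(0), 1, emb_dim)  # start token
    values = []
    for _ in range(n_steps):
        hidden = decoder(tokens, memory)      # re-reads the whole prefix
        next_value = to_value(hidden[:, -1])  # shape (batch, 1)
        values.append(next_value)
        next_token = to_token(next_value).unsqueeze(1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return torch.cat(values, dim=1)           # shape (batch, n_steps)


memory = torch.randn(batch, 10, emb_dim)  # stand-in for the encoder output
target = torch.rand(batch)                # stand-in for the scalar target

# Grad enabled: every step stays in the graph until backward() runs.
series = rollout(memory, n_steps)
loss = ((series.sum(dim=1) - target) ** 2).mean()
loss.backward()
grad_enabled_has_graph = series.grad_fn is not None

# no_grad(): nothing is recorded, so memory stays flat across steps.
with torch.no_grad():
    series_ng = rollout(memory, n_steps)
no_grad_has_graph = series_ng.grad_fn is not None

print(grad_enabled_has_graph, no_grad_has_graph)
```

In the grad-enabled case the activations of all steps must be kept alive for the backward pass, which is where the memory growth would come from.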

## Import and preparations

In [1]:
import sys, importlib
import torch 
import torch.nn as nn
import numpy as np
import pandas as pd
import copy
import time

sys.path.append("../")
from proj_mod import training, data_processing, visualization
importlib.reload(training);
importlib.reload(data_processing);
importlib.reload(visualization);

In [2]:
#Only run this cell if needed. AMD gpus might need this. 
from dotenv import load_dotenv
import os

load_dotenv("../dotenv_env/deep_learning.env")

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True" # May reduce allocator memory fragmentation. 

print(os.environ.get("HSA_OVERRIDE_GFX_VERSION"))

10.3.0


In [3]:
device=(torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu"))
print(f"Using device {device}")

Using device cuda


## Data preparations 

### Load time id order

In [4]:
list_time=np.load("../processed_data/recovered_time_id_order.npy")

### Load timeseries 

In [5]:
df_RV_ts=pd.read_parquet("../processed_data/book_RV_ts_60_si.parquet")

### Load target 

In [6]:
df_target=pd.read_csv("../raw_data/kaggle_ORVP/train.csv")
df_target["row_id"]=df_target["stock_id"].astype(int).astype(str)+"-"+df_target["time_id"].astype(int).astype(str)
df_target

| | stock_id | time_id | target | row_id |
|---|---|---|---|---|
| 0 | 0 | 5 | 0.004136 | 0-5 |
| 1 | 0 | 11 | 0.001445 | 0-11 |
| 2 | 0 | 16 | 0.002168 | 0-16 |
| 3 | 0 | 31 | 0.002195 | 0-31 |
| 4 | 0 | 62 | 0.001747 | 0-62 |
| ... | ... | ... | ... | ... |
| 428927 | 126 | 32751 | 0.003461 | 126-32751 |
| 428928 | 126 | 32753 | 0.003113 | 126-32753 |
| 428929 | 126 | 32758 | 0.004070 | 126-32758 |
| 428930 | 126 | 32763 | 0.003357 | 126-32763 |


### Create datasets

In [7]:
time_split_list=data_processing.time_cross_val_split(list_time=list_time,n_split=1,percent_val_size=10,list_output=True)
train_time_id,test_time_id=time_split_list[0][0],time_split_list[0][1]

train_dataset=training.RVdataset(time_id_list=train_time_id,ts_features=["sub_int_RV"],tab_features=["emb_id"],df_ts_feat=df_RV_ts,df_target=df_target)
test_dataset=training.RVdataset(time_id_list=test_time_id,ts_features=["sub_int_RV"],tab_features=["emb_id"],df_ts_feat=df_RV_ts,df_target=df_target)

In fold 0 :

Train set end at 8117 .

Test set start at 15516 end at 10890 .



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_tab_copy["sub_int_num"]=np.nan


## The model

### Create the dataloaders

In [8]:
train_loader=torch.utils.data.DataLoader(dataset=train_dataset,batch_size=512,shuffle=True, num_workers=4, pin_memory=True)
test_loader=torch.utils.data.DataLoader(dataset=test_dataset,batch_size=512,shuffle=True, num_workers =4, pin_memory=True)

### Create components needed

In [9]:
ts_emb_dim=32
n_diff=2
ts_dim=n_diff+1

pos_embedder=training.pos_emb_cross_attn(length=60,ts_dim=ts_dim,emb_dim=ts_emb_dim,dropout=0.2,num_heads=4,keep_mag=True).to(device=device)

ts_encoder_ff_layer=[
    nn.Linear(in_features=ts_emb_dim,out_features=64),
    nn.ReLU(),
    nn.Linear(in_features=64,out_features=ts_emb_dim)
]

ts_decoder_ff_layer=[
    nn.Linear(in_features=ts_emb_dim,out_features=64),
    nn.ReLU(),
    nn.Linear(in_features=64,out_features=ts_emb_dim)
]

output_ff=nn.Sequential(
    nn.Linear(in_features=ts_emb_dim,out_features=1)
).to(device=device)



### Create model

In [None]:
trans_encoder_decoder_model=training.encoder_decoder_autoregressionOnly(
    pos_emb_model=pos_embedder,
    output_feedforward=output_ff,
    encoder_dropout=0.2,
    decoder_dropout=0.2,
    encoder_feedforward_list=ts_encoder_ff_layer,
    decoder_feedforward_list=ts_decoder_ff_layer,
    n_diff=n_diff,
    encoder_layer_num=2,
    decoder_layer_num=2,
    input_scaler=10000,
    ts_emb_dim=ts_emb_dim,
    encoder_num_heads=4,
    decoder_num_heads=4,
    encoder_keep_mag=True,
    decoder_keep_mag=True,
).to(device=device)

In [11]:
import torch.optim as optim

optimizer = optim.AdamW(trans_encoder_decoder_model.parameters(), lr=1e-3)

scheduler=optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,mode="min",factor=0.5,patience=5,min_lr=1e-7)

# Loss tracking
train_loss = []
val_loss = []
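For reference, a minimal sketch of how a `ReduceLROnPlateau` scheduler with these settings reacts to a validation loss that stops improving. The placeholder model and the constant loss are made up for illustration; how `training.reg_training_loop_rmspe` actually steps the scheduler is not shown in this notebook, so treat this only as the standard usage pattern.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(1, 1)  # placeholder model, only here to own parameters
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5, min_lr=1e-7
)

lrs = []
for epoch in range(7):
    fake_val_loss = 1.0             # a validation loss that never improves
    scheduler.step(fake_val_loss)   # this scheduler is stepped with the metric
    lrs.append(optimizer.param_groups[0]["lr"])

# The rate is cut once the count of non-improving epochs exceeds `patience`.
print(lrs)
```

With `patience=5`, the first call sets the baseline, and the learning rate is halved only after more than five further epochs without improvement.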

In [12]:
from torchinfo import summary
summary(trans_encoder_decoder_model)

Layer (type:depth-idx)                                       Param #
encoder_decoder_autoregression                               --
├─frozen_diff_conv: 1-1                                      --
│    └─Conv1d: 2-1                                           (2)
├─pos_emb_cross_attn: 1-2                                    --
│    └─Linear: 2-2                                           128
│    └─Embedding: 2-3                                        1,920
│    └─MultiheadAttention: 2-4                               3,168
│    │    └─NonDynamicallyQuantizableLinear: 3-1             1,056
│    └─LayerNorm: 2-5                                        64
├─ModuleList: 1-3                                            --
│    └─ts_encoder: 2-6                                       --
│    │    └─MultiheadAttention: 3-2                          4,224
│    │    └─LayerNorm: 3-3                                   64
│    │    └─ModuleList: 3-4                                  4,192
│    │    └─LayerN

### Training loop

### As of now, the model has a memory problem: it uses all 12 GB of my AMD GPU, and I need to dig deeper to fix this. From my reading, this is more or less expected: with grad enabled, autograd keeps the computation graph of every step, and autoregression can build a HUGE graph. The only workaround I can see for now is to train with teacher forcing and run autoregression only under a no_grad() context. But we will see. 
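A rough back-of-the-envelope estimate shows why the graph can get this large. The batch size, head count, and decoder depth below come from the cells above. The rest are assumptions: that the rollout generates 60 steps, that each step re-runs self-attention over the whole prefix without a key/value cache, and that the attention weights are materialised in float32 and kept for the backward pass. The real model in `proj_mod.training` may differ on any of these.

```python
batch_size = 512      # from the DataLoader
num_heads = 4         # decoder_num_heads
decoder_layers = 2    # decoder_layer_num
n_steps = 60          # ASSUMPTION: rollout length equals the input length
bytes_per_float = 4   # ASSUMPTION: float32 activations

# Step t runs self-attention over a prefix of t tokens: t * t scores per head.
scores_per_sample_head = sum(t * t for t in range(1, n_steps + 1))

total_scores = scores_per_sample_head * batch_size * num_heads * decoder_layers
gib = total_scores * bytes_per_float / 2**30

print(f"{scores_per_sample_head} scores per sample and head")
print(f"about {gib:.2f} GiB for decoder self-attention weights alone")
```

Under these assumptions the self-attention weights alone come to roughly 1.1 GiB, before counting cross-attention, feed-forward activations, dropout masks, or the encoder, so exhausting 12 GB is plausible.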

In [13]:
training.reg_training_loop_rmspe(
    optimizer=optimizer,
    model=trans_encoder_decoder_model,
    train_loader=train_loader,
    val_loader=test_loader,
    ot_steps=20,
    report_interval=5,
    n_epochs=200,
    list_train_loss=train_loss,
    list_val_loss=val_loss,
    device=device,
    eps=1e-8,
    scheduler=scheduler)

OutOfMemoryError: HIP out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 11.98 GiB of which 0 bytes is free. Of the allocated memory 10.77 GiB is allocated by PyTorch, and 886.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
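The loop's name and its `eps` argument suggest that it optimises the root mean squared percentage error (RMSPE). The implementation in `proj_mod.training` is not shown here, so the following is only a sketch of the usual definition; the function name `rmspe_loss` and the placement of `eps` in the denominator are assumptions.

```python
import torch


def rmspe_loss(pred, target, eps=1e-8):
    """Root mean squared percentage error.

    eps guards against division by zero; where the project code puts it
    is an assumption.
    """
    rel_err = (pred - target) / (target + eps)
    return torch.sqrt(torch.mean(rel_err ** 2))


# Two made-up predictions, each off by 10% of its target.
pred = torch.tensor([0.0044, 0.0018])
target = torch.tensor([0.0040, 0.0020])
loss = rmspe_loss(pred, target)
print(loss.item())
```

Because the error is relative to the target, small targets such as the volatilities in `train.csv` are weighted as heavily as large ones.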