# Data Preprocessing
We use the opening price of the S&P 500 index (^GSPC). The training data is preprocessed in the following steps:
- Download the data (not shown in this notebook).
- Convert it into a PyTorch tensor.
- Apply a rolling window.

The last two steps are handled by the ``get_stock_price`` function. After that, we
- sample *time* indices,
- apply augmentation(s), and 
- transform the data into signatures.

In [2]:
import torch
torch.cuda.is_available()

True

In [3]:
import os
import pandas as pd
import signatory
from lib.datasets import get_stock_price, sample_indices, train_test_split
from lib.aug import augment_path_and_compute_signatures, apply_augmentations, parse_augmentations

In [4]:
path = os.path.join("datasets", "stock", "^GSPC_1d.csv")
df = pd.read_csv(path)
print(f'Original data: {os.path.basename(path)}, shape {df.shape}')

Original data: ^GSPC_1d.csv, shape (1251, 7)


In [5]:
data_config = {
    "ticker" : "^GSPC",
    "interval" : "1d",
    "column" : 1,  
    "window_size" : 20,
    "dir" : "datasets",
    "subdir" : "stock"
}

In [6]:
sig_config = {
    "augmentations": [
        {"name": "AddTime"},
        {"name": "LeadLag"},
    ],
    "device" : "cuda:0",
    "depth" : 5,
}
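
The config above hard-codes ``"cuda:0"``, which only works on a machine with a CUDA-capable GPU. A minimal sketch of a CPU fallback, assuming it is acceptable to run the preprocessing on the CPU; the helper name ``pick_device`` is hypothetical and not part of ``lib``:

```python
import torch

def pick_device(preferred: str = "cuda:0") -> str:
    """Return `preferred` unless it is a CUDA device and CUDA is unavailable; then return "cpu"."""
    if preferred.startswith("cuda") and not torch.cuda.is_available():
        return "cpu"
    return preferred

# Hypothetical usage: the result could replace the hard-coded "device" entry
device = pick_device("cuda:0")
print(device)
```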

Convert the data into a PyTorch tensor and apply the rolling window.

In [7]:
tensor_data = get_stock_price(data_config)

Rolled data for training, shape torch.Size([1232, 20, 1])
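
The shape above is consistent with a stride-1 rolling window: 1251 rows with ``window_size = 20`` give 1251 - 20 + 1 = 1232 windows. A minimal sketch of such a window using ``torch.Tensor.unfold``; it illustrates the idea and is not necessarily how ``get_stock_price`` implements it (``rolling_window`` is a hypothetical name):

```python
import torch

def rolling_window(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Slice a [T, C] series into overlapping windows: [T - window_size + 1, window_size, C]."""
    # unfold returns [num_windows, C, window_size]; move the window axis before the channel axis
    return x.unfold(0, window_size, 1).permute(0, 2, 1).contiguous()

# Stand-in for the price column: 1251 rows, one channel
prices = torch.arange(1251, dtype=torch.float32).unsqueeze(-1)
windows = rolling_window(prices, 20)
print(windows.shape)
```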


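Sample time indices and use them to split the rolled data into training and test sets. The number of sampled indices is the batch size; with ``train_test_ratio=0.8``, 985 of the 1232 windows go to the training set.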

In [8]:
x_real_train, x_real_test = train_test_split(tensor_data, train_test_ratio=0.8, device=sig_config["device"])
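
A minimal sketch of a split by sampled indices, assuming the indices are drawn without replacement and the remaining windows form the test set. The function name and the ``seed`` parameter are hypothetical; the actual ``train_test_split`` in ``lib.datasets`` may differ:

```python
import torch

def split_by_sampled_indices(x: torch.Tensor, train_test_ratio: float, seed: int = 0):
    """Randomly assign a fraction `train_test_ratio` of the windows to the training set."""
    n = x.shape[0]
    n_train = int(n * train_test_ratio)
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(n, generator=generator)  # random order of the window indices
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    return x[train_idx], x[test_idx], train_idx, test_idx

windows = torch.randn(1232, 20, 1)  # stand-in for the rolled data
x_train, x_test, train_idx, test_idx = split_by_sampled_indices(windows, 0.8)
print(x_train.shape, x_test.shape)
```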

Apply augmentations.

In [9]:
if sig_config["augmentations"] is not None:
    sig_config["augmentations"] = parse_augmentations(sig_config.get('augmentations'))
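
One way such a config list could be turned into objects is a name-to-class registry. The sketch below is an assumption about the general pattern, not the code in ``lib.aug``; ``REGISTRY``, ``parse_config`` and the empty stand-in classes are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class AddTime:
    """Stand-in class; the real augmentation logic is omitted here."""

@dataclass
class LeadLag:
    """Stand-in class; the real augmentation logic is omitted here."""

# Hypothetical registry mapping config names to augmentation classes
REGISTRY = {"AddTime": AddTime, "LeadLag": LeadLag}

def parse_config(configs):
    """Instantiate one augmentation per config dict; keys other than "name" become keyword arguments."""
    parsed = []
    for cfg in configs:
        kwargs = {k: v for k, v in cfg.items() if k != "name"}
        parsed.append(REGISTRY[cfg["name"]](**kwargs))
    return parsed

augmentations = parse_config([{"name": "AddTime"}, {"name": "LeadLag"}])
print(augmentations)
```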

In [10]:
y = x_real_train
print("Before augmentation shape:", y.shape)
# Fall back to the unaugmented paths so that y_aug is always defined
y_aug = y
if sig_config["augmentations"] is not None:
    # apply_augmentations prints the tensor shape after each augmentation
    y_aug = apply_augmentations(y, sig_config["augmentations"])
print("After augmentation shape:", y_aug.shape)

Before augmentation shape: torch.Size([985, 20, 1])
torch.Size([985, 20, 2])
torch.Size([985, 39, 4])
After augmentation shape: torch.Size([985, 39, 4])
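
The shapes above match the usual definitions of the two augmentations: a time channel adds one feature, and a lead-lag transform turns ``[N, L, C]`` into ``[N, 2L - 1, 2C]``. A minimal sketch under that assumption; ``add_time`` and ``lead_lag`` are illustrative names and the implementation in ``lib.aug`` may differ:

```python
import torch

def add_time(x: torch.Tensor) -> torch.Tensor:
    """Prepend a time channel running from 0 to 1: [N, L, C] -> [N, L, C + 1]."""
    n, length, _ = x.shape
    t = torch.linspace(0, 1, length, device=x.device, dtype=x.dtype)
    t = t.reshape(1, -1, 1).repeat(n, 1, 1)
    return torch.cat([t, x], dim=-1)

def lead_lag(x: torch.Tensor) -> torch.Tensor:
    """Pair each path with a one-step-shifted copy: [N, L, C] -> [N, 2L - 1, 2C]."""
    x_rep = torch.repeat_interleave(x, repeats=2, dim=1)    # every time step twice: [N, 2L, C]
    return torch.cat([x_rep[:, :-1], x_rep[:, 1:]], dim=2)  # lag channels first, then lead channels

y = torch.randn(985, 20, 1)  # stand-in for x_real_train
y_time = add_time(y)
y_ll = lead_lag(y_time)
print(y_time.shape, y_ll.shape)
```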


Convert the augmented data into signatures. The result has shape ``[985, 1364]``, where
- 985 is the number of signatures (one per training path), and
- 1364 is the length of each signature: for a 4-channel path truncated at depth 5, ``4 + 4**2 + 4**3 + 4**4 + 4**5 = 1364``. This is also the input dimension of the VAE.

In [11]:
y_aug_sig = signatory.signature(y_aug,sig_config["depth"])
print("y_aug_sig shape:",y_aug_sig.shape)

y_aug_sig shape: torch.Size([985, 1364])
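
The width of the output can be checked independently of ``signatory``: level ``k`` of the signature of a ``d``-channel path has ``d**k`` terms, and the constant level-0 term is excluded. A small sketch of that count (``signature_size`` is a hypothetical helper name):

```python
def signature_size(channels: int, depth: int) -> int:
    """Number of terms in a depth-truncated signature, excluding the constant level-0 term."""
    return sum(channels ** k for k in range(1, depth + 1))

# 4 channels after AddTime and LeadLag, depth 5 as in sig_config
print(signature_size(4, 5))  # 1364
```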
