# Data Preprocessing
We use the opening price of the S&P 500 index (^GSPC). Preprocessing the training data involves the following steps:
- Download the data (not shown in this notebook).
- Convert it into a PyTorch tensor.
- Apply a rolling window.

The last two steps are handled by the ``get_stock_price`` function. After splitting the data into training and test sets, we
- apply augmentation(s),
- sample *time* indices, and
- transform the data into signatures.

In [1]:
import torch
torch.cuda.is_available()

True

In [2]:
import os
import pandas as pd
import signatory
from lib.datasets import get_stock_price, sample_indices, train_test_split
from lib.aug import augment_path_and_compute_signatures, apply_augmentations, parse_augmentations

In [3]:
path = os.path.join("datasets", "stock", "^GSPC_1d.csv")
df = pd.read_csv(path)
print(f'Original data: {os.path.basename(path)}, shape {df.shape}')

Original data: ^GSPC_1d.csv, shape (1251, 7)


In [4]:
data_config = {
    "ticker" : "^GSPC",
    "interval" : "1d",
    "column" : 1,  
    "window_size" : 20,
    "dir" : "datasets",
    "subdir" : "stock"
}

In [5]:
sig_config = {
    "augmentations": [
        {"name": "AddTime"},
        {"name": "LeadLag"},
    ],
    "device" : "cuda:0",
    "depth" : 5,
}
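
As a quick sanity check on this configuration, the number of terms in a truncated signature grows geometrically with the depth. The helper below is a hypothetical illustration and is not part of ``lib``: for a path with d channels and truncation depth N it counts d + d^2 + ... + d^N terms, excluding the constant level-0 term. We assume 4 channels, which is what AddTime followed by LeadLag produces from a one-channel path.

```python
def n_signature_terms(channels, depth):
    """Number of terms in a signature truncated at `depth` (level 0 excluded)."""
    return sum(channels ** k for k in range(1, depth + 1))

# 4 channels after AddTime + LeadLag, depth 5 as in sig_config
n_terms = n_signature_terms(4, 5)
print(n_terms)  # 1364
```

Each extra level multiplies the size of the top level by the number of channels, so the depth is the main driver of memory use.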

Convert the data into a PyTorch tensor and apply the rolling window.

In [6]:
tensor_data = get_stock_price(data_config)

Rolled data for training, shape torch.Size([1232, 20, 1])
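
The shape above is consistent with a stride-1 rolling window: 1251 rows with a window size of 20 give 1251 - 20 + 1 = 1232 windows. The sketch below illustrates this in plain Python; the stride of 1 and the function name ``rolling_window`` are assumptions, since the implementation of ``get_stock_price`` is not shown in this notebook.

```python
def rolling_window(series, window_size):
    """Return all contiguous windows of length `window_size` (stride 1)."""
    n_windows = len(series) - window_size + 1
    return [series[i:i + window_size] for i in range(n_windows)]

# Placeholder values standing in for 1251 daily opening prices
prices = [float(i) for i in range(1251)]
windows = rolling_window(prices, 20)
print(len(windows), len(windows[0]))  # 1232 20
```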


Split the data into training and test sets.

In [7]:
x_real_train, x_real_test = train_test_split(tensor_data, train_test_ratio=0.8, device='cuda')
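
The training set size reported later (985 windows) matches truncating 1232 × 0.8 = 985.6 to an integer. The following sketch shows only that arithmetic; ``split_sizes`` is a hypothetical helper, and whether ``train_test_split`` shuffles the windows before splitting is not visible from this notebook.

```python
def split_sizes(n_samples, train_test_ratio):
    """Sizes of the two parts when the training size is truncated to an integer."""
    n_train = int(n_samples * train_test_ratio)
    return n_train, n_samples - n_train

n_train, n_test = split_sizes(1232, 0.8)
print(n_train, n_test)  # 985 247
```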

Apply augmentations.

In [8]:
if sig_config["augmentations"] is not None:
    sig_config["augmentations"] = parse_augmentations(sig_config.get('augmentations'))

In [9]:
y = x_real_train
print("Before augmentation shape:", y.shape)
if sig_config["augmentations"] is not None:
    # apply_augmentations prints the tensor shape after each augmentation
    y_aug = apply_augmentations(y, sig_config["augmentations"])
else:
    # No augmentations configured: use the training paths unchanged
    y_aug = y
print("After augmentation shape:", y_aug.shape)

Before augmentation shape: torch.Size([985, 20, 1])
torch.Size([985, 20, 2])
torch.Size([985, 39, 4])
After augmentation shape: torch.Size([985, 39, 4])
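
The intermediate shapes can be reproduced by hand. AddTime adds a time channel (1 → 2 channels), and a lead-lag transform doubles the channels while stretching the length from L to 2L - 1 (20 → 39 steps, 2 → 4 channels). The sketch below works on a single path in plain Python and uses one common lead-lag construction; the function names, the uniform time grid on [0, 1], and the channel ordering are assumptions and may differ from ``lib.aug``.

```python
def add_time(path):
    """Prepend a time channel running uniformly from 0 to 1."""
    n = len(path)
    return [[i / (n - 1)] + list(point) for i, point in enumerate(path)]

def lead_lag(path):
    """Lead-lag transform: length L -> 2L - 1, channels d -> 2d."""
    repeated = [point for point in path for _ in range(2)]  # each point twice
    lag, lead = repeated[:-1], repeated[1:]
    return [list(a) + list(b) for a, b in zip(lag, lead)]

path = [[float(v)] for v in range(20)]  # one window: 20 steps, 1 channel
with_time = add_time(path)
augmented = lead_lag(with_time)
print(len(with_time), len(with_time[0]))  # 20 2
print(len(augmented), len(augmented[0]))  # 39 4
```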


Sample time indices. The number of sampled indices equals the batch size.

In [10]:
batch_size = 128
data_size = y_aug.shape[0]
time_indices = sample_indices(data_size, batch_size, 'cpu')
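
A minimal sketch of this sampling step, using only the standard library. It assumes the indices are drawn uniformly without replacement; ``sample_indices_sketch`` and its ``seed`` argument are hypothetical, and the actual ``sample_indices`` in ``lib.datasets`` may behave differently.

```python
import random

def sample_indices_sketch(data_size, batch_size, seed=None):
    """Draw `batch_size` distinct indices from range(data_size)."""
    rng = random.Random(seed)
    return rng.sample(range(data_size), batch_size)

indices = sample_indices_sketch(985, 128, seed=0)
print(len(indices), len(set(indices)))  # 128 128
```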

Select the augmented paths at ``time_indices`` to form a batch.

In [11]:
sample = y_aug[time_indices]
print(sample.shape)

torch.Size([128, 39, 4])


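Now we give a trivial example of MMD: the MMD between a sample and itself. In theory this is exactly zero. The estimate printed below is far from zero in absolute terms, so it can only be regarded as negligible relative to the magnitude of the kernel values, which are not shown here.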

In [12]:
from lib.mmd import mmd_loss,SignatureKernel
kernel = SignatureKernel(4,None)
sample1 = sample
sample2 = sample
mmd = mmd_loss(sample1,sample2,kernel)
print(mmd)

tensor(-3.3445e+17)
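
For comparison, here is a minimal sketch of a biased estimate of the squared MMD. It is not ``lib.mmd.mmd_loss``: a Gaussian kernel on scalars is swapped in for the signature kernel, and all function names are hypothetical. With this biased estimator, comparing a sample with itself gives exactly zero, while a shifted sample gives a positive value.

```python
import math

def rbf(x, y, sigma=1.0):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def mean_kernel(xs, ys, kernel):
    """Average of kernel(x, y) over all pairs (x, y)."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd2_biased(xs, ys, kernel=rbf):
    """Biased estimate of MMD^2: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (mean_kernel(xs, xs, kernel)
            + mean_kernel(ys, ys, kernel)
            - 2 * mean_kernel(xs, ys, kernel))

same = [0.1, 0.4, 0.9, 1.3]
shifted = [x + 2.0 for x in same]
mmd_same = mmd2_biased(same, same)
mmd_shifted = mmd2_biased(same, shifted)
print(mmd_same)         # 0.0
print(mmd_shifted > 0)  # True
```

Unbiased estimators drop the diagonal kernel terms, so they need not return exactly zero for identical samples and can be negative.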
