### Tutorial 09: Sequential Datasets


A sequential dataset contains data whose elements are ordered, often by time. Examples include:

- Time series data (e.g., stock prices, weather patterns)
- Natural language (e.g., sentences or paragraphs in text data)
- Video frames (e.g., frames in a video that need to be processed in sequence)

In these cases, the order of the data matters because earlier data points influence how later ones are interpreted. For instance, in a language model, the words preceding a given word help predict that word, and in time series forecasting, past values inform the prediction of future values.
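To make the language-model case concrete, here is a minimal sketch in plain Python. The sentence and the helper name `next_word_pairs` are hypothetical, chosen only for illustration; a real language model would work on token IDs rather than raw words.

```python
# Hypothetical toy sentence, used only to illustrate ordered context.
sentence = "the cat sat on the mat"
tokens = sentence.split()

def next_word_pairs(tokens):
    """Return (context, target) pairs: every word together with the words before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_word_pairs(tokens)
for context, target in pairs:
    # The context is always the ordered prefix that precedes the target word.
    print(f"{' '.join(context):<20} -> {target}")
```

Shuffling `tokens` before building the pairs would destroy exactly the information the model relies on, which is why sequential data is handled differently from independent samples.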

---

### Training Data in Sequential Datasets

When working with sequential data, it’s essential to structure the training data properly to respect the temporal order. In PyTorch, this can be achieved by creating a custom Dataset class that handles the specific needs of sequential data. For instance, we might want to split a sequence into overlapping chunks where each chunk is used to predict the next time step or sequence of time steps.

The following ideas recur when preparing such data:

- **Data Transformation**: In a sequential dataset, each sequence might be divided into fixed-length subsequences (e.g., sliding windows). Each subsequence is treated as a sample in the dataset.

- **Input-Output Pairs**: The data might be structured as input-output pairs, where the input consists of several previous time steps, and the output is the subsequent time step(s).

- **Sliding Window**: A sliding window approach allows us to generate sequences of data that can be fed into the model. This is particularly useful in time series data where we use a sliding window to predict future values based on past observations.
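The three ideas above can be sketched without any framework. This is a minimal illustration on a list of integers, not a definitive implementation; the function name `sliding_windows` and the toy series are assumptions made for the example.

```python
def sliding_windows(series, seq_len, label_len):
    """Split a sequence into overlapping (input, target) pairs.

    Each input holds `seq_len` consecutive steps and each target holds
    the `label_len` steps that immediately follow it.
    """
    window = seq_len + label_len
    pairs = []
    # The window start moves one step at a time; stop once a full window no longer fits.
    for start in range(len(series) - window + 1):
        x = series[start:start + seq_len]
        y = series[start + seq_len:start + window]
        pairs.append((x, y))
    return pairs

# Toy series 0..7 so that the temporal order is easy to follow by eye.
series = list(range(8))
pairs = sliding_windows(series, seq_len=3, label_len=2)
for x, y in pairs:
    print(x, "->", y)
```

With 8 steps and a window of 3 + 2 = 5, the series yields 8 - 5 + 1 = 4 overlapping samples, the first being `[0, 1, 2] -> [3, 4]` and the last `[3, 4, 5] -> [6, 7]`.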


import torch
from torch.utils.data import Dataset, DataLoader

class SequentialDataset(Dataset):
    def __init__(self, data, seq_len, label_len):
        self.data = data
        self.seq_len = seq_len
        self.label_len = label_len
    
    def __len__(self):
        # Number of complete (input, target) windows; the +1 keeps the last full window
        return len(self.data) - self.seq_len - self.label_len + 1
    
    def __getitem__(self, index):
        # Create sequences from the data
        seq_x = self.data[index:index + self.seq_len]
        seq_y = self.data[index + self.seq_len:index + self.seq_len + self.label_len]
        
        return seq_x, seq_y



import numpy as np

data = torch.randn(100, 1)  # 100 time steps, 1 feature per step
dataset = SequentialDataset(data, seq_len=10, label_len=5)
print(data.shape)

# First sample: 10 input steps, followed by the next 5 steps as the target
seq_x, seq_y = dataset[0]
print(f"input sample: {np.round(seq_x.flatten().numpy(), 2)}")
print(f"output sample: {np.round(seq_y.flatten().numpy(), 2)}")

Output:

torch.Size([100, 1])
input sample: [ 1.07  0.1   0.41 -1.83 -0.64 -1.71  0.14 -0.1  -0.34 -0.46]
output sample: [-0.27  0.12  0.75 -2.17 -0.62]
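The cell above imports `DataLoader` but does not use it. As a minimal sketch of the likely next step, the block below batches windowed samples for training. `WindowDataset` is a hypothetical stand-in that mirrors the `SequentialDataset` class from this tutorial so the block runs on its own; the batch size of 16 is an arbitrary assumption.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class WindowDataset(Dataset):
    """Hypothetical stand-in mirroring the tutorial's SequentialDataset."""

    def __init__(self, data, seq_len, label_len):
        self.data = data
        self.seq_len = seq_len
        self.label_len = label_len

    def __len__(self):
        # Assumption: count every complete (input, target) window.
        return len(self.data) - self.seq_len - self.label_len + 1

    def __getitem__(self, index):
        split = index + self.seq_len
        seq_x = self.data[index:split]                   # past steps (input)
        seq_y = self.data[split:split + self.label_len]  # following steps (target)
        return seq_x, seq_y

data = torch.randn(100, 1)  # 100 time steps, 1 feature per step
dataset = WindowDataset(data, seq_len=10, label_len=5)

# shuffle=True reorders whole windows between batches;
# the time steps inside each window stay in their original order.
loader = DataLoader(dataset, batch_size=16, shuffle=True)

batch_x, batch_y = next(iter(loader))
print(batch_x.shape)  # torch.Size([16, 10, 1])
print(batch_y.shape)  # torch.Size([16, 5, 1])
```

The default collate function stacks the samples along a new leading dimension, so each batch has the layout (batch, time steps, features). Whether shuffling windows is appropriate depends on the model; a model that carries state across consecutive windows would typically need `shuffle=False`.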
