## Demo of the Dataloader for the Seismic Dataset ##

This notebook shows how to convert an HDF5 file into our custom seismic dataset, ready to be used for model training, validation, and testing.

1. First, import the customDataset module, which includes utility functions for loading the dataset from the HDF5 file, windowing, and labelling the data.
2. Next, import the printCustom function, which prints in different modes (info, success, error).
3. Finally, import torch, along with DataLoader and random_split from the torch utilities.

In [1]:
import customDataset
from logUtils import printCustom
import torch
from torch.utils.data import DataLoader, random_split

#### Load the dataset from the hdf5 file
- Parameters:
   - file_path (str): path to the hdf5 file
   - seconds (int): length of each record, given in samples despite the name (options: 1500 (15 sec), 3000 (30 sec), 6000 (60 sec), 10000 (100 sec))
   - window_size (int): size of the window in samples
   - hopping_size (int): size of the hop in samples
   - verbose (bool): print information about the dataset
- Returns:
    - dataset (SeismicDataset): seismic dataset
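
How `window_size` and `hopping_size` interact can be sketched with a few lines of plain Python. This is only an illustration: the actual windowing code inside `customDataset` is not shown in this notebook, and the helper name `window_starts` is hypothetical. The sketch assumes that windows start at index 0, advance by `hopping_size`, and that only full windows are kept.

```python
def window_starts(num_points, window_size, hopping_size):
    """Return the start index of every full window that fits in the record.

    Hypothetical helper for illustration; not part of customDataset.
    """
    if window_size <= 0 or hopping_size <= 0:
        raise ValueError("window_size and hopping_size must be positive")
    if window_size > num_points:
        return []  # the record is shorter than a single window
    # The last valid start is num_points - window_size; range() excludes its stop.
    return list(range(0, num_points - window_size + 1, hopping_size))


starts = window_starts(num_points=1500, window_size=900, hopping_size=300)
print(starts)       # [0, 300, 600]
print(len(starts))  # 3
```

Under these assumptions, a 1500-point record with `window_size=900` and `hopping_size=300` yields three overlapping windows, each sharing 600 points with its neighbour.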

In [2]:

dataset = customDataset.get_dataset(
    file_path="/Users/nevinsehbal/Documents/workspace/ps-paper-convlstm/gaia-ps-detection/afad_to_hdf5/afad_to_hdf5.hdf5",
    seconds=1500, window_size=900, hopping_size=300, verbose=True)

printCustom("info", "First sample in the dataset:")
# Index the dataset directly instead of looping and breaking after one item
data, label = dataset[0]
printCustom("info", "Data shape: " + str(data.shape))
printCustom("info", "Label shape: " + str(label.shape))
printCustom("info", "Label: " + str(label))


---------------- Dataset information: ----------------
[I] Sample rate: 100 Hz
[I] Chosen window size:900
[I] Hopping size:300
[I] Each sample is window_size/sample rate seconds long. Each sample has 3 channels.
[I] Dataset values shape is: [num_samples, num_window, num_channel, (height) 1, (width) num_sample_points]
[I] Dataset value shape: torch.Size([1575, 3, 3, 1, 900])
[I] Dataset labels shape is: [num_samples, 4], where the 4 values are [p_idx, s_idx, p_confidence, s_confidence]
[I] Dataset label shape:torch.Size([4725, 4])
[I] One sample shape: torch.Size([3, 3, 1, 900])
[I] One label shape: torch.Size([4])
[I] An example sample's label: tensor([4.7900e+02, 1.0180e+03, 1.0000e+00, 0.0000e+00])
[I] ----------------------------------------------------
[I] First sample in the dataset:
[I] Data shape: torch.Size([3, 3, 1, 900])
[I] Label shape: 

#### Randomly split the dataset into train, validation and test sets
For this:
1. First, define TRAIN_RATIO and VALIDATION_RATIO. The test ratio is implicitly 1 - (TRAIN_RATIO + VALIDATION_RATIO).
2. Then, set the seed manually for reproducibility.
3. Next, define TRAIN_BATCH_SIZE and TEST_BATCH_SIZE. The validation batch size is the same as the train batch size.
4. Finally, randomly split the dataset into train, validation, and test sets.

In [3]:
TRAIN_RATIO, VALIDATION_RATIO = 0.8, 0.1
# Check that the ratios sum to at most 1; otherwise the assert raises an AssertionError
assert TRAIN_RATIO + VALIDATION_RATIO <= 1, "The sum of the ratios should be less than or equal to 1"
train_size = int(TRAIN_RATIO * len(dataset))
val_size = int(VALIDATION_RATIO * len(dataset))
test_size = len(dataset) - train_size - val_size

# Set the seed for reproducibility
torch.manual_seed(0)
# Define the batch sizes used by the dataloaders
TRAIN_BATCH_SIZE = 16
TEST_BATCH_SIZE = 1

# random_split (imported above) splits the dataset randomly into the three subsets
train_dataset, val_dataset, test_dataset = random_split(dataset, [train_size, val_size, test_size])

#### Using the torch DataLoader class, create train, validation, and test dataloaders with SeismicDataset

In [4]:
# Create DataLoaders
train_dataloader = DataLoader(train_dataset, batch_size=TRAIN_BATCH_SIZE, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=TRAIN_BATCH_SIZE, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=TEST_BATCH_SIZE, shuffle=False)

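printCustom("success", "Dataloaders are created.")
# Note: len(dataloader) is the number of batches, so each product below rounds the
# split size up to a whole number of batches; len(train_dataset) etc. give exact counts.
printCustom("success", "Train dataloader size:" + str(len(train_dataloader) * TRAIN_BATCH_SIZE))
printCustom("success", "Validation dataloader size:" + str(len(val_dataloader) * TRAIN_BATCH_SIZE))
printCustom("success", "Test dataloader size:" + str(len(test_dataloader) * TEST_BATCH_SIZE))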

[S] Dataloaders are created.
[S] Train dataloader size:1264
[S] Validation dataloader size:160
[S] Test dataloader size:158
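
The printed sizes are batch counts multiplied by the batch size, so they can exceed the true number of samples in a split. With the default `drop_last=False`, `len()` of a DataLoader is the sample count divided by the batch size, rounded up. The sketch below reproduces the arithmetic in plain Python. It assumes `len(dataset)` is 1575 (the first dimension of the dataset shape printed earlier); the helper names `split_sizes` and `num_batches` are hypothetical.

```python
import math


def split_sizes(n, train_ratio, val_ratio):
    """Mirror the notebook's split: truncate train/val, give the remainder to test."""
    train = int(train_ratio * n)
    val = int(val_ratio * n)
    return train, val, n - train - val


def num_batches(n, batch_size):
    """Number of batches when the last, possibly partial, batch is kept."""
    return math.ceil(n / batch_size)


n = 1575  # assumed len(dataset), taken from the printed dataset shape
train, val, test = split_sizes(n, train_ratio=0.8, val_ratio=0.1)
print(train, val, test)  # 1260 157 158

# Batch count times batch size, as printed by the notebook cell
print(num_batches(train, 16) * 16)  # 1264
print(num_batches(val, 16) * 16)    # 160
print(num_batches(test, 1) * 1)     # 158
```

Under that assumption, the train split holds 1260 samples in 79 batches and the validation split holds 157 samples in 10 batches, which is why the cell reports 1264 and 160.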
