# Data processing

Experimenting with the API for creating DataLoaders for the train, val and test sets!

**Plan:**

- Create datasets for the $e$ / $\gamma$ / $\pi$ files
- Concatenate the datasets
- Then use a `DataLoader` to make the train, val, and test sets.
  **Important:** Be careful when passing the index lists so that each split samples equally from all 3 classes.
- Use a 60:10:30 split for the train:val:test sets.

In [27]:
import numpy as np
import h5py

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils
from torch.utils.data.sampler import SubsetRandomSampler

#### Step 1: Create a Dataset class

The PyTorch HW set from class used the predefined CIFAR10 dataset, which came pre-loaded, but here I need to write my own class that subclasses `Dataset`.

I'm following along with PyTorch's [data loading tutorial](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class).

In [54]:
class emShowersDatasetFlat(Dataset):
    """EM showers dataset"""
    
    def __init__(self, relPath='../data', N=100000):
        """
        
        Loads the images for all three particle types and returns each
        example as a tuple (layer0, layer1, layer2, y), where the truth
        label y is:
            0 (gamma), 1 (pi-plus), 2 (positron)
        
        Args:
            relPath: The relative path to where the hdf5 files live
            N: The number of images we have for each particle class
        

        """
        
        # Build the file paths from relPath instead of hard-coding '../data'
        d_gamma  = h5py.File(relPath + '/gamma.hdf5', 'r')
        d_piplus = h5py.File(relPath + '/piplus.hdf5', 'r')
        d_eplus  = h5py.File(relPath + '/eplus.hdf5', 'r')
        
        layer0 = np.vstack((d_gamma['layer_0'][:], d_piplus['layer_0'][:], d_eplus['layer_0'][:]))
        layer1 = np.vstack((d_gamma['layer_1'][:], d_piplus['layer_1'][:], d_eplus['layer_1'][:]))
        layer2 = np.vstack((d_gamma['layer_2'][:], d_piplus['layer_2'][:], d_eplus['layer_2'][:]))
        
        # Test to make sure that all of the datasets are the same length
        assert layer0.shape[0] == layer1.shape[0] == layer2.shape[0]
        
        self.layer0 = torch.from_numpy(layer0)
        self.layer1 = torch.from_numpy(layer1)
        self.layer2 = torch.from_numpy(layer2)
        
        # Get the y labels
        self.y = torch.from_numpy(np.concatenate((np.zeros(N), np.ones(N), 2*np.ones(N))))
        
    def __len__(self):
        return self.layer0.shape[0] 

    def __getitem__(self, idx):
        
        return self.layer0[idx], self.layer1[idx], self.layer2[idx], self.y[idx]


#### Step 2: Create a DataLoader

The only subtlety here is the train / val / test split: each split needs equal proportions of all three classes, so the index lists passed to `SubsetRandomSampler` must draw the same number of events from each class.

In [55]:
dset = emShowersDatasetFlat()

In [56]:
N = 100000 # 100k events / particle
nClasses = 3

trainFrac = .6
valFrac = .1
testFrac = .3

# Number of train / val events per class; the test set gets the remainder
nTrain = int(round(trainFrac*N))
nVal = int(round(valFrac*N))

idxTrain = []
idxVal = []
idxTest = []

for i in range(nClasses):
    
    # Class i occupies indices [i*N, (i+1)*N) in the concatenated dataset
    idxTrain += list(range(i*N, i*N + nTrain))
    idxVal += list(range(i*N + nTrain, i*N + nTrain + nVal))
    idxTest += list(range(i*N + nTrain + nVal, (i+1)*N))
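
Since the whole point of building the index lists by hand is to keep the classes balanced, it may be worth checking that directly. Below is a minimal sketch, assuming the classes sit in contiguous blocks of equal size as in the concatenated dataset. The helper names `stratified_indices` and `class_counts` are made up for illustration, and a small block size stands in for the real 100k events per class.

```python
from collections import Counter

def stratified_indices(n_per_class, n_classes, train_frac, val_frac):
    """Contiguous per-class split; the test set takes whatever is left over."""
    n_train = int(round(train_frac * n_per_class))
    n_val = int(round(val_frac * n_per_class))
    idx_train, idx_val, idx_test = [], [], []
    for i in range(n_classes):
        start = i * n_per_class
        idx_train += list(range(start, start + n_train))
        idx_val += list(range(start + n_train, start + n_train + n_val))
        idx_test += list(range(start + n_train + n_val, start + n_per_class))
    return idx_train, idx_val, idx_test

def class_counts(indices, n_per_class):
    """Count how many indices fall inside each class block."""
    return Counter(j // n_per_class for j in indices)

# Small stand-in for the real block size so the check runs instantly
n_per_class, n_classes = 1000, 3
idx_train, idx_val, idx_test = stratified_indices(n_per_class, n_classes, 0.6, 0.1)

# No index may appear in more than one split, and none may be dropped
all_idx = idx_train + idx_val + idx_test
assert len(all_idx) == len(set(all_idx)) == n_classes * n_per_class

for name, idx in [("train", idx_train), ("val", idx_val), ("test", idx_test)]:
    print(name, dict(class_counts(idx, n_per_class)))
```

Each split should report the same count for classes 0, 1 and 2. If the counts differ, or the uniqueness assertion fails, the index arithmetic is off.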
    

In [57]:
batch_size=64

loader_train = DataLoader(dset, batch_size=batch_size, sampler=SubsetRandomSampler(idxTrain))
loader_val = DataLoader(dset, batch_size=batch_size, sampler=SubsetRandomSampler(idxVal))
loader_test = DataLoader(dset, batch_size=batch_size, sampler=SubsetRandomSampler(idxTest))

Make sure you can successfully iterate over the training set.

In [58]:
for layer0, layer1, layer2, y in loader_train:
    print(layer0.shape)
    print(layer1.shape)
    print(layer2.shape)
    print(y.shape)
    
    break

torch.Size([64, 3, 96])
torch.Size([64, 12, 12])
torch.Size([64, 12, 6])
torch.Size([64])
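
The shapes above only show that batching works. As a further check, the sketch below confirms that one pass over a loader built with `SubsetRandomSampler` visits every listed index exactly once, so the class balance of the index list carries over to the epoch as a whole. Individual batches are balanced only on average. It uses a toy `TensorDataset` in place of the real showers, and the sizes are arbitrary.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.sampler import SubsetRandomSampler

n_per_class, n_classes = 20, 3

# Toy dataset: the "image" is just its own index, the label is its class block
x = torch.arange(n_per_class * n_classes)
y = torch.arange(n_classes).repeat_interleave(n_per_class)
toy = TensorDataset(x, y)

# Take the first 12 examples of each class, mimicking a per-class train split
idx = [i * n_per_class + j for i in range(n_classes) for j in range(12)]
loader = DataLoader(toy, batch_size=8, sampler=SubsetRandomSampler(idx))

seen = []
counts = torch.zeros(n_classes, dtype=torch.long)
for xb, yb in loader:
    seen += xb.tolist()
    counts += torch.bincount(yb, minlength=n_classes)

# Per-class totals over the whole epoch
print(counts.tolist())
```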


#### Step 3: Subtract the mean image and divide by the standard deviation for each layer

This was the only preprocessing step that we did for the CIFAR10 dataset, so I thought it might be useful here as well!
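
One possible way to do this is sketched below with plain NumPy on synthetic data. It assumes the mean and standard deviation are taken per pixel over the training events only, and then reused for the val and test sets. A single scalar standard deviation per layer would be an equally reasonable reading. The helper names `fit_normalizer` and `normalize` and the `eps` guard are illustrative choices, not something fixed by the notebook.

```python
import numpy as np

def fit_normalizer(train_images, eps=1e-8):
    """Per-pixel mean and standard deviation over the training examples."""
    mean = train_images.mean(axis=0)
    std = train_images.std(axis=0)
    # Pixels that never fire have zero spread; eps avoids dividing by zero
    return mean, std + eps

def normalize(images, mean, std):
    """Subtract the mean image and divide by the per-pixel spread."""
    return (images - mean) / std

rng = np.random.default_rng(0)

# Synthetic stand-in with the same per-event shape as layer 0: (3, 96)
fake_layer0 = rng.exponential(scale=5.0, size=(500, 3, 96))
train, val = fake_layer0[:400], fake_layer0[400:]

mean, std = fit_normalizer(train)
train_norm = normalize(train, mean, std)
val_norm = normalize(val, mean, std)  # reuse the training statistics

print(mean.shape, std.shape)
print(np.abs(train_norm.mean(axis=0)).max())
```

The same two calls would be repeated for layers 1 and 2, each with its own statistics. Fitting on the training events alone keeps information from the val and test sets out of the preprocessing.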
