Recommended materials
====

1. PyTorch Official Tutorial \[[Link](https://pytorch.org/tutorials/)\]
2. DeepLearning Zero to All \[[English](https://www.youtube.com/playlist?list=PLlMkM4tgfjnJ3I-dbhO9JTw7gNty6o_2m)\] \[[Korean](https://www.youtube.com/playlist?list=PLQ28Nx3M4JrhkqBVIXg-i5_CVVoS1UzAv)\]
3. Neural Network Programming - Deep Learning with PyTorch \[[English](https://www.youtube.com/playlist?list=PLZbbT5o_s2xrfNyHZsM6ufI0iZENK9xgG)\]

Data Loading and Pre-processing
====

## Common Data (Processing) Pipeline

1. download or prepare data
2. make dataset using `torch.utils.data.Dataset`
3. (optional) construct data preprocessing code using `torchvision.transforms` if necessary
4. initialize dataloader using `torch.utils.data.DataLoader`
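
As a rough orientation, the four steps above can be sketched end to end with in-memory data. This is a minimal sketch, not the code used in the rest of this notebook: it assumes the built-in `torch.utils.data.TensorDataset` as a shortcut for step 2, skips the optional step 3, and the sample count and batch size are arbitrary illustrative values.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# 1. prepare data (kept in memory here instead of being downloaded or saved)
x = torch.rand(8)
y = 2 * x

# 2. make a dataset; TensorDataset pairs x[i] with y[i]
dataset = TensorDataset(x, y)

# 3. (optional) pre-processing is skipped in this sketch

# 4. initialize a dataloader that yields shuffled batches of 4 samples
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_shapes = []
for inputs, labels in dataloader:
  batch_shapes.append((tuple(inputs.shape), tuple(labels.shape)))
  print(inputs.shape, labels.shape)
```

The sections below build the same pipeline step by step with a custom `Dataset` class and data stored on disk.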

## Preparing data

Let's say our task is to learn a regressor that takes an input **x** in the range [0, 1] and predicts **y=2x**.  
Then we need to prepare data that consists of pairs of an input **x** and its corresponding label **y**.  
We will now generate this data and save it.

In [None]:
# mount Google Drive
from google.colab import drive

drive.mount('/gdrive')

In [None]:
import os
import torch

# make a directory to save our custom data
gdrive_root = '/gdrive/My Drive'
custom_data_dir = os.path.join(gdrive_root, 'my_data')
os.makedirs(custom_data_dir, exist_ok=True)

In [None]:
num_samples = 10000

# generate x using torch.rand(), which samples uniformly from [0, 1)
x = torch.rand(num_samples)

# generate y by multiplying x by 2
y = 2*x

# check data
for i in range(5):
  print(x[i:i+1], y[i:i+1])

# aggregate x and y
data = {'inputs':x, 'labels':y}

In [None]:
# now save data into the data directory
data_path = os.path.join(custom_data_dir, 'data.pt')
torch.save(data, data_path)

# check that it was saved successfully by reloading it from disk
data_ = torch.load(data_path)
x_ = data_['inputs']
y_ = data_['labels']
for i in range(5):
  print(x_[i:i+1], y_[i:i+1])

## Make Dataset

We will use `torch.utils.data.Dataset` to make our custom dataset container.  
In this step, you mainly need to do three things:
1. Define the `__init__` function. This function should receive the path of the data, load the data from that path, and keep it as attributes.
2. Define the `__len__` function. This function should return the number of samples.
3. Define the `__getitem__` function. This function should receive `idx` as an argument and return the sample specified by `idx`.

In [None]:
from torch.utils.data import Dataset

class CustomDataset(Dataset):
  def __init__(self, root):
    data = torch.load(root)
    self.x = data['inputs']
    self.y = data['labels']
    self.num_samples = len(self.x)
    
  def __len__(self):
    return self.num_samples
  
  def __getitem__(self, idx):
    x = self.x[idx]
    y = self.y[idx]
    
    return x, y

In [None]:
# initialize dataset
dataset = CustomDataset(data_path)

# fetch a single sample; indexing calls __getitem__ under the hood
print(dataset[0])

# check the number of samples
print(len(dataset))

## Make DataLoader

Now we are going to make a dataloader using `torch.utils.data.DataLoader`, which is useful when we want to load data quickly or draw random batches of data.

In [None]:
from torch.utils.data import DataLoader

batch_size = 4

# set `shuffle=True` to inject randomness into the data-loading process
# set `num_workers` to a value greater than 0 to load data in worker subprocesses (0 loads in the main process)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=1)
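
To make the role of `shuffle` and `batch_size` more concrete, here is a simplified sketch of what random batching amounts to conceptually. It is not how `DataLoader` is actually implemented (there are no workers, samplers, or collate functions here); it only assumes that shuffling corresponds to visiting indices in a random order and that a batch is a stack of consecutive samples in that order. The sample count of 10 is an arbitrary illustrative value.

```python
import torch

torch.manual_seed(0)

x = torch.rand(10)
y = 2 * x
batch_size = 4

# a random permutation of the sample indices plays the role of shuffling
perm = torch.randperm(len(x)).tolist()

batches = []
for start in range(0, len(x), batch_size):
  idx = perm[start:start + batch_size]
  # stack the individual samples into one batch tensor each
  inputs = torch.stack([x[i] for i in idx])
  labels = torch.stack([y[i] for i in idx])
  batches.append((inputs, labels))

for inputs, labels in batches:
  print(inputs.size(0), inputs, labels)
```

With 10 samples and a batch size of 4, the last batch holds only the 2 remaining samples, which matches the default behaviour of `DataLoader` when `drop_last` is left at `False`.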

In [None]:
# load data
for i, (inputs, labels) in enumerate(dataloader):
  print('[{}] batch_size:{} x:{} y:{}'.format(i, inputs.size(0), inputs, labels))
  
  # and then the training process goes like...
  
  # feed data into a neural network and get outputs
  # outputs = my_network(inputs)
  
  # calculate loss
  # loss = calc_loss(outputs, labels)
  
  # backpropagate loss
  # optimizer.zero_grad()
  # loss.backward()
  
  # update weights
  # optimizer.step()
  
  if i == 3:
    break
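
The commented-out training steps above can be filled in as follows. This is a minimal sketch under stated assumptions: `my_network` is taken to be a single `nn.Linear(1, 1)` layer, `calc_loss` is mean squared error, the optimizer is plain SGD with a learning rate of 0.1, and the data is regenerated in memory (1000 samples, wrapped in `TensorDataset`) so the example runs on its own. None of these choices come from the original notebook.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# in-memory stand-in for the saved data: y = 2x
x = torch.rand(1000)
y = 2 * x
dataloader = DataLoader(TensorDataset(x, y), batch_size=4, shuffle=True)

# assumed model, loss, and optimizer (illustrative choices)
my_network = nn.Linear(1, 1)
calc_loss = nn.MSELoss()
optimizer = torch.optim.SGD(my_network.parameters(), lr=0.1)

for epoch in range(5):
  for inputs, labels in dataloader:
    # nn.Linear expects shape (batch, features), so add a feature dimension
    outputs = my_network(inputs.unsqueeze(1))

    # calculate loss
    loss = calc_loss(outputs, labels.unsqueeze(1))

    # clear old gradients, then backpropagate the loss
    optimizer.zero_grad()
    loss.backward()

    # update weights
    optimizer.step()

weight = my_network.weight.item()
bias = my_network.bias.item()
print(weight, bias)
```

Since the data follows **y=2x** exactly, the learned weight should end up close to 2 and the bias close to 0.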

## Data pre-processing

Often we have to pre-process data before feeding it into the neural network.  
Let's consider the situation where we now need to learn the function **y=3x** instead of **y=2x**.  
Instead of generating and saving data from scratch as we did above, we can get the desired data by adding a pre-processing step to our dataset container.  
In particular, we just need to divide the stored label **y** (which equals 2x) by 2 and multiply it by 3 to get **y=3x**, and then return **x, y** from our dataset container.

In [None]:
# add one more pre-processing step in our dataset container
from torch.utils.data import Dataset

class CustomDataset(Dataset):
  def __init__(self, root, transform=None):
    data = torch.load(root)
    self.x = data['inputs']
    self.y = data['labels']
    self.transform = transform
    self.num_samples = len(self.x)
    
  def __len__(self):
    return self.num_samples
  
  def __getitem__(self, idx):
    x = self.x[idx]
    y = self.y[idx]
    
    # add one more pre-processing step here
    if self.transform is not None:
      y = self.transform(y)
    
    return x, y

In [None]:
from torch.utils.data import DataLoader
from torchvision import transforms

# make data pre-processor using torchvision.transforms
# the stored label is y=2x, so y/2*3 gives 3x
my_transform = transforms.Lambda(lambda y: y/2*3)

batch_size = 4
dataset = CustomDataset(data_path, transform=my_transform)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# load data
for i, (inputs, labels) in enumerate(dataloader):
  print('[{}] batch_size:{} x:{} y:{}'.format(i, inputs.size(0), inputs, labels))
  if i == 3:
    break
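
One way to confirm that the pre-processing step does what we intend is to check every batch against **y=3x**. The sketch below is self-contained and rests on a few assumptions: `InMemoryDataset` and `scale_label` are hypothetical names introduced only for this example, the data is kept in memory rather than loaded from `data_path`, and a plain Python function is used as the transform. The last point works because `__getitem__` simply calls `self.transform(y)`, so any callable will do.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class InMemoryDataset(Dataset):
  """Hypothetical variant of CustomDataset that takes tensors instead of a file path."""
  def __init__(self, x, y, transform=None):
    self.x = x
    self.y = y
    self.transform = transform

  def __len__(self):
    return len(self.x)

  def __getitem__(self, idx):
    x = self.x[idx]
    y = self.y[idx]
    # apply the pre-processing step to the label only
    if self.transform is not None:
      y = self.transform(y)
    return x, y

def scale_label(y):
  # the stored label is 2x, so y/2*3 gives 3x
  return y / 2 * 3

torch.manual_seed(0)
x = torch.rand(16)
dataset = InMemoryDataset(x, 2 * x, transform=scale_label)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

# every batch should satisfy labels == 3 * inputs
all_match = all(torch.allclose(labels, 3 * inputs) for inputs, labels in dataloader)
print(all_match)
```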