# DATALOADER

1. Create the dataset as a class. To initialize it, the class needs the items that make up the data, and it must be able to report the overall length of the dataset.
    - The <code>__init__</code> method is required to define any class; here its arguments are the data items.
    - To set the length of the dataset, define the <code>__len__</code> method of the class.
    - To set the items of the dataset, define the <code>__getitem__</code> method of the class. It takes an index that identifies which data item to return.


In [1]:
import numpy as np

# sample data
n_samples = 28
x_data = np.random.randn(n_samples, 1).flatten()
eps = np.random.normal(0.0, 1.0, n_samples)
y_data = 3.*np.sin(2.*x_data) + eps

In [2]:
import torch
from torch.utils.data import Dataset, DataLoader

class sampleDataset(Dataset):
    def __init__(self, x, y):
        # __init__ method: the arguments x and y hold the dataset's items
        self.x = x
        self.y = y
        
    def __len__(self):
        # total length of the dataset
        return len(self.x)
    
    def __getitem__(self, idx):
        # __getitem__ method: the values returned here are what the DataLoader receives for item idx
        x_data = self.x
        y_data = self.y
        return x_data[idx], y_data[idx]

In [3]:
sample_dataset = sampleDataset(x_data, y_data)
sample_dataset

<__main__.sampleDataset at 0x1f0e1146470>
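
As a quick check, the two methods defined above are what Python's built-in `len()` and square-bracket indexing call. The sketch below rebuilds a small dataset so that it runs on its own; the name `ToyDataset` and the toy values are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    """Hypothetical stand-in for sampleDataset with the same three methods."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

x_toy = np.arange(5, dtype=np.float64)   # 0., 1., 2., 3., 4.
y_toy = 2.0 * x_toy                      # 0., 2., 4., 6., 8.
toy_dataset = ToyDataset(x_toy, y_toy)

n_items = len(toy_dataset)     # calls ToyDataset.__len__
third_item = toy_dataset[2]    # calls ToyDataset.__getitem__(2)
print(n_items, third_item)
```

Each item is an (x, y) pair, which is why a batch from the DataLoader can later be unpacked as `batch[0]` and `batch[1]`.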

2. After making the dataset, use DataLoader to load the items of the dataset defined in the previous section. Among other options, we can choose:
    - batch size: If the total dataset length is 10,000 and batch_size is 1,000, then we have 10 batches in total, each containing 1,000 data items.
    - sampler: For undersampling, oversampling, or any other sampling with ratios.
    - shuffle: Whether to reshuffle the order of the data items each time the dataset is loaded.

If we divide the dataset into mini-batches, each mini-batch contains the items returned by the dataset. In this example, they are x_data and y_data.
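
The options listed above can be tried with a short sketch. It assumes a toy dataset wrapped in PyTorch's built-in `TensorDataset` rather than the notebook's `sampleDataset`, and a batch size of 5 chosen on purpose so that 28 items do not divide evenly; `drop_last` is one of the options covered by "etc." in the list.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

n_samples = 28
x = torch.arange(n_samples, dtype=torch.float64)
y = 2.0 * x
dataset = TensorDataset(x, y)  # built-in Dataset that wraps tensors (assumed stand-in)

batch_size = 5  # 28 is not divisible by 5, so the last batch is short

# shuffle=True draws a new random order every time the loader is iterated
loader_keep = DataLoader(dataset, batch_size=batch_size, shuffle=True)
loader_drop = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)

n_keep = len(loader_keep)  # ceil(28 / 5) batches, the last one is incomplete
n_drop = len(loader_drop)  # floor(28 / 5) batches, the incomplete one is dropped
batch_sizes = [len(xb) for xb, yb in loader_keep]

assert n_keep == math.ceil(n_samples / batch_size)
assert n_drop == n_samples // batch_size
print(n_keep, n_drop, batch_sizes)
```

Note that `len(loader)` reports the number of batches directly, which is safer than computing `n_samples // batch_size` by hand when the division is not exact.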

In [4]:
batch_size = 7
data_4batch = DataLoader(dataset=sample_dataset, batch_size=batch_size)
print("Total {} batch and each batch contains {} number of data".format(n_samples//batch_size, batch_size))
for idx, batch in enumerate(data_4batch):
    print("In batch {} there are".format((idx+1), ))
    print("x data: {}".format(batch[0]))
    print("y_data: {}".format(batch[1]))

Total 4 batch and each batch contains 7 number of data
In batch 1 there are
x data: tensor([-1.2445,  2.0414, -0.7364, -0.3928, -1.1348,  0.5417,  1.8872],
       dtype=torch.float64)
y_data: tensor([-1.3224, -3.4081, -3.3793, -1.6284, -3.1557,  3.3854, -1.9351],
       dtype=torch.float64)
In batch 2 there are
x data: tensor([ 0.7863,  1.1698,  1.3797, -0.9800, -0.7384,  0.7144, -0.8483],
       dtype=torch.float64)
y_data: tensor([ 3.2183,  2.1055,  1.0005, -2.9727, -3.2439,  3.2780, -4.1728],
       dtype=torch.float64)
In batch 3 there are
x data: tensor([-0.9447, -0.7057,  0.0877, -0.2223, -0.4308, -1.1246, -1.0773],
       dtype=torch.float64)
y_data: tensor([-3.1492, -4.4241,  0.8666, -2.1749, -2.6562, -0.6330, -2.4063],
       dtype=torch.float64)
In batch 4 there are
x data: tensor([ 1.2376,  1.1711,  0.2226, -1.1322,  0.4147, -1.8268,  0.1072],
       dtype=torch.float64)
y_data: tensor([ 2.0179,  2.5381, -0.6607, -2.9606,  2.0488, -0.2267,  1.7169],
       dtype=torch.float64)

In [5]:
batch_size = 4
data_7batch = DataLoader(dataset=sample_dataset, batch_size=batch_size)
print("Total {} batch and each batch contains {} number of data".format(n_samples//batch_size, batch_size))
for idx, batch in enumerate(data_7batch):
    print("In batch {} there are".format((idx+1), ))
    print("x data: {}".format(batch[0]))
    print("y_data: {}".format(batch[1]))

Total 7 batch and each batch contains 4 number of data
In batch 1 there are
x data: tensor([-1.2445,  2.0414, -0.7364, -0.3928], dtype=torch.float64)
y_data: tensor([-1.3224, -3.4081, -3.3793, -1.6284], dtype=torch.float64)
In batch 2 there are
x data: tensor([-1.1348,  0.5417,  1.8872,  0.7863], dtype=torch.float64)
y_data: tensor([-3.1557,  3.3854, -1.9351,  3.2183], dtype=torch.float64)
In batch 3 there are
x data: tensor([ 1.1698,  1.3797, -0.9800, -0.7384], dtype=torch.float64)
y_data: tensor([ 2.1055,  1.0005, -2.9727, -3.2439], dtype=torch.float64)
In batch 4 there are
x data: tensor([ 0.7144, -0.8483, -0.9447, -0.7057], dtype=torch.float64)
y_data: tensor([ 3.2780, -4.1728, -3.1492, -4.4241], dtype=torch.float64)
In batch 5 there are
x data: tensor([ 0.0877, -0.2223, -0.4308, -1.1246], dtype=torch.float64)
y_data: tensor([ 0.8666, -2.1749, -2.6562, -0.6330], dtype=torch.float64)
In batch 6 there are
x data: tensor([-1.0773,  1.2376,  1.1711,  0.2226], dtype=torch.float64)
y_dat

## Batch

In deep learning, **batch** size is an important concept because of how models are trained. Most training uses **Gradient Descent**, which updates the learnable parameters in the direction opposite to the gradient of the loss function in order to decrease the model's loss.

Suppose we have a training set with 100,000 data items. With plain gradient descent, we have to evaluate the loss function on all 100,000 items at every iteration, which is an enormous amount of computation. Researchers therefore proposed **Stochastic Gradient Descent**, which uses only one example to compute the loss at each iteration. However, training on 1 item out of 100,000 at a time tends to work poorly, because a single example gives a very noisy estimate of the gradient and each update reflects too little of the dataset.

An improved version is **Mini-Batch Stochastic Gradient Descent (SGD)**, which is the method most commonly used in deep learning. In mini-batch SGD, we divide the whole dataset into mini-batches, and the model computes gradients on only one mini-batch at each iteration. A larger batch size gives a more accurate gradient but increases the computation per iteration. If we use multiple GPUs or TPUs, however, their parallel computation can reduce the training time.
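
To make the mini-batch update concrete, here is a minimal sketch written without an optimizer class. The toy problem (fitting a single weight `w` so that `w * x` approximates `3x`), the learning rate, and the batch size are all assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy problem: y = 3x + noise, with a single learnable weight w
n_samples, batch_size, lr = 1000, 50, 0.1
x = torch.randn(n_samples)
y = 3.0 * x + 0.1 * torch.randn(n_samples)
w = torch.zeros(1, requires_grad=True)

for epoch in range(5):
    perm = torch.randperm(n_samples)                 # reshuffle once per epoch
    for start in range(0, n_samples, batch_size):
        idx = perm[start:start + batch_size]         # indices of one mini-batch
        loss = ((w * x[idx] - y[idx]) ** 2).mean()   # loss on the mini-batch only
        loss.backward()                              # gradient from 50 items, not 1000
        with torch.no_grad():
            w -= lr * w.grad                         # step against the gradient
        w.grad.zero_()                               # reset before the next mini-batch

print(w.item())
```

Setting `batch_size = n_samples` in this sketch gives plain gradient descent, and `batch_size = 1` gives the one-example variant described above.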

# MODEL

Constructing a model with PyTorch is quite easy. We can use many building blocks such as <code>nn.Sequential</code> or <code>nn.ModuleList</code>. The tough part is connecting the dataloader and the model without any conflicts. For example, let's make a simple MLP model.

In [6]:
import torch.nn as nn
torch.set_default_dtype(torch.double)

def MLP(hidden_size):
    layer_list = list()
    layer_list.append(nn.Linear(in_features=1, out_features=hidden_size))
    layer_list.append(nn.ReLU())
    layer_list.append(nn.Linear(in_features=hidden_size, out_features=1))
    model = nn.Sequential(*layer_list)
    
    return model

In [7]:
model = MLP(200)
model

Sequential(
  (0): Linear(in_features=1, out_features=200, bias=True)
  (1): ReLU()
  (2): Linear(in_features=200, out_features=1, bias=True)
)

In [8]:
optimizer = torch.optim.SGD(model.parameters(), 0.005)
criteria = nn.MSELoss()

In [9]:
# dataset with 7 batches, each contains 4 data items
for i, batch in enumerate(data_7batch):
    x_batch = (batch[0].clone().detach().requires_grad_(True))
    y_batch = (batch[1].clone().detach().requires_grad_(True))
    
    x_batch_reform = x_batch.view(-1, 1)
    y_batch_reform = y_batch.view(-1, 1)
    
    print(x_batch.shape, y_batch.shape)
    print(x_batch_reform.shape, y_batch_reform.shape)
    break

torch.Size([4]) torch.Size([4])
torch.Size([4, 1]) torch.Size([4, 1])


Which of the two data forms, x_batch or x_batch_reform, can be used to train our model?

x_batch has size [4] and x_batch_reform has size [4, 1]. Our model starts with an <code>nn.Linear</code> layer with <code>in_features=1</code>.
$$H = XW+B, \{X \in \mathbb{R}^{N \times 1}, W\in \mathbb{R}^{1\times 200}, B \in \mathbb{R}^{200}, H \in \mathbb{R}^{N \times 200}\}$$
The value $N$ in the above equation is the batch size. This means the input must be a tensor of size **batch size $\times$ in_features**. Likewise, out_features of the model's last layer should equal the number of values we predict. For example, to predict two values, set out_features of the last layer to two; to predict three values, set it to three.
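
The shape rule can be checked on a single layer in isolation. This is a small sketch with random input values; the variable names `layer`, `x_flat`, and `x_col` are chosen here for illustration.

```python
import torch
import torch.nn as nn

layer = nn.Linear(in_features=1, out_features=200)

x_flat = torch.randn(4)        # shape [4]: no feature dimension
x_col = x_flat.view(-1, 1)     # shape [4, 1]: batch size x in_features

h = layer(x_col)
print(x_col.shape, h.shape)    # input of N rows gives output of N rows

# PyTorch stores the weight as (out_features, in_features) and applies x @ W^T + b,
# so layer.weight is the transpose of W in the equation above.
print(layer.weight.shape, layer.bias.shape)
```

<code>x_flat.unsqueeze(1)</code> produces the same [4, 1] shape as <code>view(-1, 1)</code> and may read more clearly when only one dimension is being added.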

The code below shows a success and a failure when running the model with input data of dimension [4, 1] and [4], respectively.

In [10]:
# success: in_features of the first layer of the MLP model matches the last dimension of our input data
# dataset with 7 batches, each contains 4 data items
for i, batch in enumerate(data_7batch):
    x_batch_reform = (batch[0].clone().detach().requires_grad_(True)).view(-1, 1)
    y_batch_reform = (batch[1].clone().detach().requires_grad_(True)).view(-1, 1)
    y_pred = model(x_batch_reform)

    print(x_batch_reform.shape, y_batch_reform.shape)
    break

torch.Size([4, 1]) torch.Size([4, 1])


In [11]:
# failure: the input has shape [4] with no feature dimension, so it does not match in_features=1
# dataset with 7 batches, each contains 4 data items
for i, batch in enumerate(data_7batch):
    x_batch = (batch[0].clone().detach().requires_grad_(True))
    y_batch = (batch[1].clone().detach().requires_grad_(True))
    y_pred = model(x_batch)

    print(x_batch.shape, y_batch.shape)
    break

RuntimeError: size mismatch, m1: [1 x 4], m2: [1 x 200] at ..\aten\src\TH/generic/THTensorMath.cpp:961
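
With the input shape settled, the dataloader, model, optimizer, and loss can be combined into a training loop. The following is a sketch under several assumptions rather than the notebook's own code: it uses `TensorDataset` in place of `sampleDataset`, reshapes the data once up front instead of per batch, fixes random seeds, converts the model with `.double()` instead of changing the global default dtype, and trains for an arbitrary 20 epochs.

```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
np.random.seed(0)

# Same kind of sample data as in the notebook
n_samples = 28
x_data = np.random.randn(n_samples)
y_data = 3.0 * np.sin(2.0 * x_data) + np.random.normal(0.0, 1.0, n_samples)

# Reshape once to [n_samples, 1] so every batch already has shape [batch_size, 1]
x_tensor = torch.from_numpy(x_data).view(-1, 1)
y_tensor = torch.from_numpy(y_data).view(-1, 1)
loader = DataLoader(TensorDataset(x_tensor, y_tensor), batch_size=4, shuffle=True)

# float64 model to match the float64 tensors created from NumPy arrays
model = nn.Sequential(nn.Linear(1, 200), nn.ReLU(), nn.Linear(200, 1)).double()
optimizer = torch.optim.SGD(model.parameters(), lr=0.005)
criteria = nn.MSELoss()

epoch_losses = []
for epoch in range(20):                    # 20 epochs is an arbitrary choice
    running = 0.0
    for x_batch, y_batch in loader:
        optimizer.zero_grad()              # clear gradients from the previous step
        y_pred = model(x_batch)            # [4, 1] in, [4, 1] out
        loss = criteria(y_pred, y_batch)   # mean squared error on this mini-batch
        loss.backward()                    # gradients of the mini-batch loss
        optimizer.step()                   # one SGD update
        running += loss.item()
    epoch_losses.append(running / len(loader))

print(epoch_losses[0], epoch_losses[-1])
```

The inputs do not need <code>requires_grad_(True)</code> here, since gradients are only required for the model parameters that the optimizer updates.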