# 8. DataLoader

### Manual data feed

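In the previous examples, all of the data is read at once, and every training iteration that updates the weights uses the entire dataset. This works without trouble when the dataset is small, but it can cause problems (for example, running out of memory) when the dataset becomes too large.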

In [None]:
...

xy = np.loadtxt('./data/diabetes.csv.gz', delimiter=',', dtype=np.float32)
x_data = Variable(torch.from_numpy(xy[:,0:-1]))
y_data = Variable(torch.from_numpy(xy[:,[-1]]))

...   # Process all data at once

for epoch in range(100):
    y_pred = model(x_data)    # Feed all data at once
    loss = criterion(y_pred, y_data)
    print(epoch, loss.item())    # loss.data[0] fails on PyTorch >= 0.4 (0-dim tensor)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


### Batch

In [None]:
# Training cycle
for epoch in range(training_epochs):
    # Loop over all batches
    for i in range(total_batch):
        batch_xs, batch_ys ...

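The batch loop above can be sketched by slicing the tensors by hand. This is a minimal illustration, not the tutorial's own code: the synthetic `x_data`/`y_data` tensors, the batch size of 32, and the per-epoch reshuffling with `torch.randperm` are all assumptions made for the example.

```python
import torch

# Synthetic stand-in for the real data (assumption): 100 samples, 8 features.
torch.manual_seed(0)
x_data = torch.randn(100, 8)
y_data = torch.randint(0, 2, (100, 1)).float()

batch_size = 32
n_samples = x_data.shape[0]
# Ceiling division so the final, smaller batch is not dropped.
total_batch = (n_samples + batch_size - 1) // batch_size

batch_sizes = []
for epoch in range(2):
    # Reshuffle the sample order at the start of every epoch.
    perm = torch.randperm(n_samples)
    for i in range(total_batch):
        idx = perm[i * batch_size:(i + 1) * batch_size]
        batch_xs, batch_ys = x_data[idx], y_data[idx]
        # A real loop would run the forward/backward pass on this batch here.
        if epoch == 0:
            batch_sizes.append(batch_xs.shape[0])

print(total_batch, batch_sizes)  # 4 [32, 32, 32, 4]
```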
In the neural network terminology:

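- **epoch** = one forward pass and one backward pass over all of the training examples
- **batch size** = the number of training examples in one forward/backward pass. The larger the batch size, the more memory is required.
- number of **iterations** = the number of passes, where each pass uses [batch size] examples. To be clear, one pass = one forward pass + one backward pass (the forward pass and the backward pass are not counted as two separate passes).

Example: with 1000 training examples and a batch size of 500, it takes 2 iterations to complete 1 epoch.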

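The arithmetic in the example can be checked with a few lines of Python. The helper name `iterations_per_epoch` and its `drop_last` flag are hypothetical, introduced only for this sketch; the flag mirrors the idea of discarding an incomplete final batch.

```python
import math


def iterations_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of forward/backward passes needed to cover the data once."""
    if drop_last:
        # Discard the incomplete final batch, if any.
        return n_samples // batch_size
    # Otherwise the leftover examples form one extra, smaller batch.
    return math.ceil(n_samples / batch_size)


print(iterations_per_epoch(1000, 500))                  # 2
print(iterations_per_epoch(1000, 32))                   # 32
print(iterations_per_epoch(1000, 32, drop_last=True))   # 31
```

When the batch size does not divide the dataset size evenly, the last iteration of each epoch simply sees fewer examples.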
### DataLoader

#### Iterating over batches using DataLoader

```
for epoch in range(2):
    for i, data in enumerate(train_loader, 0):
        # get the inputs
        inputs, labels = data

        # wrap them in Variable
        inputs, labels = Variable(inputs), Variable(labels)

        # Run your training process
        print(epoch, i, "inputs", inputs.data, "labels", labels.data)
```

#### Custom DataLoader

```
class DiabetesDataset(Dataset):
    """Diabetes dataset (skeleton; fill in the three methods)."""

    def __init__(self):
        # (1) Download, read data, etc.
        pass

    def __getitem__(self, index):
        # (2) Return one item at the given index
        return

    def __len__(self):
        # (3) Return the data length
        return

dataset = DiabetesDataset()
train_loader = DataLoader(dataset=dataset,
                          batch_size=32,
                          shuffle=True,
                          num_workers=2)
```

In [None]:
# References
# https://github.com/yunjey/pytorch-tutorial/blob/master/tutorials/01-basics/pytorch_basics/main.py
# http://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class
import torch
import numpy as np
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader

class DiabetesDataset(Dataset):
    """ Diabetes dataset."""
    # Initialize your data, download, etc.
    def __init__(self):
        xy = np.loadtxt('./data/diabetes.csv.gz',
                        delimiter=',', dtype=np.float32)
        self.len = xy.shape[0]
        self.x_data = torch.from_numpy(xy[:, 0:-1])
        self.y_data = torch.from_numpy(xy[:, [-1]])

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.len


dataset = DiabetesDataset()
train_loader = DataLoader(dataset=dataset,
                          batch_size=32,
                          shuffle=True,
                          num_workers=2)

for epoch in range(2):
    for i, data in enumerate(train_loader, 0):
        # get the inputs
        inputs, labels = data

        # wrap them in Variable
        inputs, labels = Variable(inputs), Variable(labels)

        # Run your training process            
        print(epoch, i, "inputs", inputs.data, "labels", labels.data)

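To see what the loader yields without needing `diabetes.csv.gz`, the following sketch swaps in synthetic tensors wrapped in `TensorDataset`. The sample count of 100 and `num_workers=0` are assumptions chosen so that the example runs anywhere; the batch size of 32 matches the cell above.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in data (assumption): 100 samples, 8 features, binary labels.
torch.manual_seed(0)
features = torch.randn(100, 8)
labels = torch.randint(0, 2, (100, 1)).float()

dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

# Each iteration yields one (inputs, labels) pair of stacked tensors.
shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]

print(len(dataset), len(loader))  # 100 4
for s in shapes:
    print(s)  # three batches of 32 rows, then a final batch of 4 rows
```

`len(loader)` is the number of iterations per epoch, and the final batch holds only the leftover samples unless `drop_last=True` is passed to `DataLoader`.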
### Classifier using the custom DataLoader

In [None]:
import torch
import numpy as np
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader

class DiabetesDataset(Dataset):
    # Initialize your data, download, etc.
    def __init__(self):
        xy = np.loadtxt('./data/diabetes.csv.gz', delimiter=',', dtype=np.float32)
        self.len = xy.shape[0]
        self.x_data = torch.from_numpy(xy[:,0:-1])
        self.y_data = torch.from_numpy(xy[:,[-1]])
        
    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]
    
    def __len__(self):
        return self.len
    
dataset = DiabetesDataset()
train_loader = DataLoader(dataset=dataset, batch_size=32, 
                          shuffle=True, num_workers=2)

class Model(torch.nn.Module):
    def __init__(self):
        """
        In the constructor we instantiate three nn.Linear modules
        """
        super(Model, self).__init__()
        self.l1 = torch.nn.Linear(8, 6)
        self.l2 = torch.nn.Linear(6, 4)
        self.l3 = torch.nn.Linear(4, 1)

        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        """
        In the forward function we accept a Variable of input data and we must return
        a Variable of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Variables.
        """
        out1 = self.sigmoid(self.l1(x))
        out2 = self.sigmoid(self.l2(out1))
        y_pred = self.sigmoid(self.l3(out2))
        return y_pred

# our model
model = Model()

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the three
# nn.Linear modules which are members of the model.
criterion = torch.nn.BCELoss(reduction='mean')  # replaces the deprecated size_average=True
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# Training loop
for epoch in range(100):
    for i, data in enumerate(train_loader, 0):
        # get the inputs
        inputs, labels = data

        # wrap them in Variable
        inputs, labels = Variable(inputs), Variable(labels)

        # forward pass: compute predicted y by passing x to the model
        y_pred = model(inputs)

        # Compute and print loss
        loss = criterion(y_pred, labels)
        if epoch % 20 == 0:
            print(epoch, np.round(loss.item(), 5))

        # Zero gradients, perform a backward pass, and update the weights.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

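Since PyTorch 0.4, `Variable` has been merged into `Tensor`, so the wrapping step is no longer needed. The sketch below shows the same kind of loop in that style. It is not the cell above verbatim: it assumes synthetic data in place of `diabetes.csv.gz`, uses `torch.nn.Sequential` instead of the `Model` class, and reports the mean loss per epoch rather than the loss of every batch.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
# Synthetic stand-in data (assumption): the label depends on the first feature.
x = torch.randn(256, 8)
y = (x[:, :1] > 0).float()
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

# Same 8-6-4-1 sigmoid layout as the Model class, written with Sequential.
model = torch.nn.Sequential(
    torch.nn.Linear(8, 6), torch.nn.Sigmoid(),
    torch.nn.Linear(6, 4), torch.nn.Sigmoid(),
    torch.nn.Linear(4, 1), torch.nn.Sigmoid(),
)
criterion = torch.nn.BCELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

epoch_losses = []
for epoch in range(5):
    running = 0.0
    for inputs, labels in loader:       # tensors are used directly, no Variable
        y_pred = model(inputs)
        loss = criterion(y_pred, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Weight each batch loss by its size so the epoch mean is exact.
        running += loss.item() * inputs.size(0)
    epoch_losses.append(running / len(loader.dataset))
    print(epoch, round(epoch_losses[-1], 5))
```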
### Dataset loaders

- MNIST and FashionMNIST
- COCO (Captioning and Detection)
- LSUN Classification
- ImageFolder
- Imagenet-12
- CIFAR10 and CIFAR100
- STL10
- SVHN
- PhotoTour

### Exercise 8-1:
- Check out existing datasets (torchvision)
- Build a DataLoader for
  - Titanic dataset : https://www.kaggle.com/c/titanic/download/train.csv
- Build a classifier using the DataLoader