## PyTorch Tutorial 09 - Dataset and DataLoader - Batch Training

#### In batch training we define:

1. EPOCH = one forward and backward pass over all training samples

2. BATCH_SIZE = number of training samples in one forward & backward pass

3. Number of iterations = number of passes per epoch, each pass using BATCH_SIZE samples

e.g. 100 samples, BATCH_SIZE = 20 --> 100/20 = 5 iterations for each EPOCH
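The arithmetic above can be sketched in plain Python. The helper name `iterations_per_epoch` is hypothetical and only for illustration; the `drop_last` flag mirrors the DataLoader option of the same name, which discards a final incomplete batch.

```python
import math


def iterations_per_epoch(n_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Return how many batches are needed to see every sample once."""
    if drop_last:
        # An incomplete final batch is discarded
        return n_samples // batch_size
    # An incomplete final batch still counts as one iteration
    return math.ceil(n_samples / batch_size)


print(iterations_per_epoch(100, 20))                  # 5
print(iterations_per_epoch(178, 4))                   # 45
print(iterations_per_epoch(178, 4, drop_last=True))   # 44
```

When the sample count is not a multiple of the batch size, the last iteration of each epoch simply processes fewer samples.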

In [1]:
import torch
import torchvision
from torch.utils.data import Dataset, DataLoader

import numpy as np
import math



### Creating Dataset and DataLoader

Together, the Dataset and DataLoader classes make it easier to load data in batches for training.

Here we subclass Dataset in our own class and implement three methods:
- \__init__ (loads the data),
- \__getitem__ (returns one sample by index), and
- \__len__ (returns the number of samples)


In [2]:
# Creating a Dataset class for loading data
class WineDataset(Dataset):

    def __init__(self):
        
        # super(WineDataset, self).__init__() # calling this crashes kernel

        # Load the CSV as float32, skipping the header row
        xy = np.loadtxt("./data/wine/wine.csv", delimiter=",", dtype=np.float32, skiprows=1)
        self.x = torch.from_numpy(xy[:, 1:])   # features: every column after the first
        self.y = torch.from_numpy(xy[:, [0]])  # labels: first column, kept 2-D with shape (n_samples, 1)
        self.n_samples = xy.shape[0]

    def __getitem__(self, index):
        return self.x[index], self.y[index] # (features, label) pair for one sample

    def __len__(self):
        return self.n_samples # total number of samples
    
data = WineDataset() # create WineDataset object

In [3]:
data[0] # first sample as a (features, label) tuple

(tensor([1.4230e+01, 1.7100e+00, 2.4300e+00, 1.5600e+01, 1.2700e+02, 2.8000e+00,
         3.0600e+00, 2.8000e-01, 2.2900e+00, 5.6400e+00, 1.0400e+00, 3.9200e+00,
         1.0650e+03]),
 tensor([1.]))
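The label tensor above has shape `[1]` rather than being a scalar because `__init__` slices with `xy[:, [0]]`. As a minimal sketch of that slicing, the snippet below loads a tiny in-memory CSV with NumPy. The column names and the three rows are made-up stand-ins, not the real wine file.

```python
import io

import numpy as np

# Hypothetical stand-in for wine.csv: a header row, then label + 3 features per row
csv_text = (
    "label,f1,f2,f3\n"
    "1,14.23,1.71,2.43\n"
    "2,12.37,0.94,1.36\n"
    "3,12.86,1.35,2.32\n"
)

# np.loadtxt accepts a file-like object; skiprows=1 drops the header
xy = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=np.float32, skiprows=1)

features = xy[:, 1:]    # every column after the first
labels_2d = xy[:, [0]]  # indexing with a list keeps the column axis
labels_1d = xy[:, 0]    # indexing with an int drops the column axis

print(xy.shape)         # (3, 4)
print(features.shape)   # (3, 3)
print(labels_2d.shape)  # (3, 1)
print(labels_1d.shape)  # (3,)
```

Keeping the labels 2-D means each sample's label is a length-1 tensor, which is why batches of labels later come out with shape `[batch_size, 1]`.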

In [13]:
# Next we create a DataLoader, which serves the dataset in shuffled batches for training
BATCH_SIZE = 4


# num_workers is left at its default of 0 (setting it crashed the kernel here)
dataloader = DataLoader(dataset=data, batch_size=BATCH_SIZE, shuffle=True)

In [14]:
dataiter = iter(dataloader)

In [15]:
sample = next(dataiter) # fetch one batch from the iterator
sample

[tensor([[1.1820e+01, 1.4700e+00, 1.9900e+00, 2.0800e+01, 8.6000e+01, 1.9800e+00,
          1.6000e+00, 3.0000e-01, 1.5300e+00, 1.9500e+00, 9.5000e-01, 3.3300e+00,
          4.9500e+02],
         [1.2290e+01, 1.6100e+00, 2.2100e+00, 2.0400e+01, 1.0300e+02, 1.1000e+00,
          1.0200e+00, 3.7000e-01, 1.4600e+00, 3.0500e+00, 9.0600e-01, 1.8200e+00,
          8.7000e+02],
         [1.4370e+01, 1.9500e+00, 2.5000e+00, 1.6800e+01, 1.1300e+02, 3.8500e+00,
          3.4900e+00, 2.4000e-01, 2.1800e+00, 7.8000e+00, 8.6000e-01, 3.4500e+00,
          1.4800e+03],
         [1.1660e+01, 1.8800e+00, 1.9200e+00, 1.6000e+01, 9.7000e+01, 1.6100e+00,
          1.5700e+00, 3.4000e-01, 1.1500e+00, 3.8000e+00, 1.2300e+00, 2.1400e+00,
          4.2800e+02]]),
 tensor([[2.],
         [2.],
         [1.],
         [2.]])]
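The batch above stacks four individual samples along a new leading dimension. The sketch below reproduces that behaviour on a small synthetic dataset so it can be run without the wine CSV. The tensors and the use of `TensorDataset` are assumptions for illustration, and shuffling is turned off to keep the result deterministic.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data: 10 samples, 3 features each, one label per sample
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10, dtype=torch.float32).unsqueeze(1)

toy_data = TensorDataset(features, labels)
loader = DataLoader(toy_data, batch_size=4, shuffle=False)

# One batch: per-sample tensors are stacked along dimension 0
batch_features, batch_labels = next(iter(loader))
print(batch_features.shape)  # torch.Size([4, 3])
print(batch_labels.shape)    # torch.Size([4, 1])

# Iterating the loader once covers the whole dataset; the last batch is smaller
batch_sizes = [x.shape[0] for x, _ in loader]
print(batch_sizes)           # [4, 4, 2]
```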

In [16]:
# Training setup
NUM_EPOCHS = 2
TOTAL_SAMPLES = len(data)
N_ITERATIONS = math.ceil(TOTAL_SAMPLES/BATCH_SIZE) # iterations per epoch; the last batch may be smaller

print(TOTAL_SAMPLES, N_ITERATIONS)

178 45
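The count of 45 comes from 44 full batches of 4 samples plus one final batch holding the remaining 2. The pure-Python sketch below imitates how a shuffled loader might split sample indices into batches. The generator `batch_indices` is a hypothetical simplification, not the actual DataLoader implementation.

```python
import random


def batch_indices(n_samples, batch_size, shuffle=True, seed=0):
    """Yield lists of sample indices, one list per batch."""
    indices = list(range(n_samples))
    if shuffle:
        # Seeded generator so the sketch is reproducible
        random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]


batches = list(batch_indices(178, 4))
sizes = [len(b) for b in batches]

print(len(batches))  # 45
print(sizes[0])      # 4
print(sizes[-1])     # 2
```

Every index appears in exactly one batch, so one full pass over the batches corresponds to one epoch.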


In [17]:
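for epoch in range(NUM_EPOCHS):
    for i, (inputs, labels) in enumerate(dataloader):
        # a real training step (forward pass, backward pass, weight update) would go here
        if (i+1) % 5 == 0: # log every 5th step
            print("EPOCH {}/{}, STEP {}/{}, INPUTS {}".format(epoch+1, NUM_EPOCHS, i+1, N_ITERATIONS, inputs.shape))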

EPOCH 1/2, STEP 5/45, INPUTS torch.Size([4, 13])
EPOCH 1/2, STEP 10/45, INPUTS torch.Size([4, 13])
EPOCH 1/2, STEP 15/45, INPUTS torch.Size([4, 13])
EPOCH 1/2, STEP 20/45, INPUTS torch.Size([4, 13])
EPOCH 1/2, STEP 25/45, INPUTS torch.Size([4, 13])
EPOCH 1/2, STEP 30/45, INPUTS torch.Size([4, 13])
EPOCH 1/2, STEP 35/45, INPUTS torch.Size([4, 13])
EPOCH 1/2, STEP 40/45, INPUTS torch.Size([4, 13])
EPOCH 1/2, STEP 45/45, INPUTS torch.Size([2, 13])
EPOCH 2/2, STEP 5/45, INPUTS torch.Size([4, 13])
EPOCH 2/2, STEP 10/45, INPUTS torch.Size([4, 13])
EPOCH 2/2, STEP 15/45, INPUTS torch.Size([4, 13])
EPOCH 2/2, STEP 20/45, INPUTS torch.Size([4, 13])
EPOCH 2/2, STEP 25/45, INPUTS torch.Size([4, 13])
EPOCH 2/2, STEP 30/45, INPUTS torch.Size([4, 13])
EPOCH 2/2, STEP 35/45, INPUTS torch.Size([4, 13])
EPOCH 2/2, STEP 40/45, INPUTS torch.Size([4, 13])
EPOCH 2/2, STEP 45/45, INPUTS torch.Size([2, 13])
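The loop above only inspects batch shapes. As a rough sketch of where the forward and backward passes would fit, the example below trains a single linear layer on synthetic data. The model, loss, optimizer, learning rate, and data are all assumptions chosen for illustration and are not part of the original notebook.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression data: the target is the sum of the 3 features
features = torch.randn(10, 3)
targets = features.sum(dim=1, keepdim=True)
loader = DataLoader(TensorDataset(features, targets), batch_size=4, shuffle=True)

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

num_epochs = 2
steps_taken = 0
for epoch in range(num_epochs):
    for inputs, labels in loader:
        outputs = model(inputs)            # forward pass on one batch
        loss = criterion(outputs, labels)
        optimizer.zero_grad()              # clear gradients from the previous step
        loss.backward()                    # backward pass
        optimizer.step()                   # weight update
        steps_taken += 1

print(steps_taken)  # 6 (3 batches per epoch x 2 epochs)
```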


In [20]:
# torchvision ships several well-known datasets that can be downloaded directly, e.g. MNIST
torchvision.datasets.MNIST(root = "./data", transform=torchvision.transforms.ToTensor(), download=True)
# others include FashionMNIST, CIFAR10, etc.


Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 9912422/9912422 [00:49<00:00, 199955.12it/s]


Extracting ./data/MNIST/raw/train-images-idx3-ubyte.gz to ./data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to ./data/MNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 28881/28881 [00:00<00:00, 60812.22it/s]


Extracting ./data/MNIST/raw/train-labels-idx1-ubyte.gz to ./data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw/t10k-images-idx3-ubyte.gz


100%|██████████| 1648877/1648877 [00:02<00:00, 658063.88it/s]


Extracting ./data/MNIST/raw/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 4542/4542 [00:00<00:00, 1376085.58it/s]

Extracting ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw






Dataset MNIST
    Number of datapoints: 60000
    Root location: ./data
    Split: Train
    StandardTransform
Transform: ToTensor()