# Hangar Tutorial (2/2): Training a Model using the Data in Hangar
#### Now let's build some models with the data we put in Hangar

In [1]:
import hangar
from hangar import Repository
from hangar import make_torch_dataset

import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch import optim

In [2]:
hangar.__version__

'0.5.1'

Let's continue from where we left off.

All the MNIST data has now been added to the Hangar repo. Let's load the repo and see what we have.

In [3]:
repo = Repository('./')

In [4]:
repo.summary()

Summary of Contents Contained in Data Repository 
 
| Repository Info 
|----------------- 
|  Base Directory: /home/jjmachan/jjmachan/hangar_tutorial 
|  Disk Usage: 105.88 MB 
 
| Commit Details 
------------------- 
|  Commit: a=39a36c4fa931e82172f03edd8ccae56bf086129b 
|  Created: Fri May  1 18:23:19 2020 
|  By: jjmachan 
|  Email: jjmachan@g.com 
|  Message: added all the mnist datasets 
 
| DataSets 
|----------------- 
|  Number of Named Columns: 6 
|
|  * Column Name: ColumnSchemaKey(column="mnist_test_images", layout="flat") 
|    Num Data Pieces: 10000 
|    Details: 
|    - column_layout: flat 
|    - column_type: ndarray 
|    - schema_hasher_tcode: 1 
|    - data_hasher_tcode: 0 
|    - schema_type: fixed_shape 
|    - shape: (784,) 
|    - dtype: float32 
|    - backend: 00 
|    - backend_options: {'complib': 'blosc:lz4hc', 'complevel': 5, 'shuffle': 'byte'} 
|
|  * Column Name: ColumnSchemaKey(column="mnist_test_labels", layout="flat") 
|    Num Data Pieces: 10000 
|   

The summary shows each column and how it is stored internally in Hangar, along with its size, shape, and dtype information.

To access the data, let's create a read-only checkout of the master branch.

In [5]:
# Create a Read checkout
co = repo.checkout(branch='master')

 * Checking out BRANCH: master with current HEAD: a=39a36c4fa931e82172f03edd8ccae56bf086129b


## Create Dataloaders

Hangar provides two dataloaders that feed the data stored in a Hangar repository directly into training: `make_tf_dataset` for TensorFlow and `make_torch_dataset` for PyTorch. Both take a list of columns and return a dataset in which each index yields the corresponding sample from every column.

`make_torch_dataset` returns a PyTorch `Dataset`, which means we can use the `DataLoader` provided by PyTorch. That makes loading the data, splitting it into batches, and so on very simple.

In [7]:
# Create the train, test and val datasets using
# make_torch_dataset from Hangar. It takes the
# columns and creates a torch dataset out of them.

train_dataset = make_torch_dataset((co['mnist_training_images'], co['mnist_training_labels']))
test_dataset = make_torch_dataset((co['mnist_test_images'], co['mnist_test_labels']))
val_dataset = make_torch_dataset((co['mnist_validation_images'], co['mnist_validation_labels']))

One thing to note: in the version used here (0.5.1), Hangar does not seem to support multiple workers for the dataloaders.

In [8]:
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
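As a quick sanity check on the loaders above, the number of batches per epoch follows from the dataset size and `batch_size`. The sketch below is plain arithmetic, not Hangar or PyTorch code: `num_batches` is a hypothetical helper, it assumes the `DataLoader` default of keeping the final partial batch (`drop_last=False`), and it assumes the Hangar training column holds the same 50000 samples as the pickle file used later in this tutorial. The test column size (10000) comes from the repo summary.

```python
def num_batches(num_samples, batch_size):
    """Return (number of batches, size of the last batch).

    Assumes the final partial batch is kept rather than dropped.
    """
    full, remainder = divmod(num_samples, batch_size)
    if remainder == 0:
        return full, (batch_size if full else 0)
    return full + 1, remainder


# Sizes taken from this tutorial; the training size is an assumption.
train_batches = num_batches(50000, 32)
test_batches = num_batches(10000, 32)
print("train:", train_batches)
print("test: ", test_batches)
```

Neither size is a multiple of 32, so the last batch of each loader is smaller than the others. That matters later for any per-batch statistic that divides by `batch_size`.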

## The model

A simple neural network with 3 fully connected layers that outputs the logits. It takes the input shape (784 in this case) and the output shape (10 in this case).

I won't explain the details of the model or the training loop here, but if all this seems new to you, I highly recommend checking out [the PyTorch 60-minute blitz tutorial](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html).

In [9]:
class net(nn.Module):
    def __init__(self, inShape, outShape):
        super().__init__()
        self.fc1 = nn.Linear(inShape, 500)
        self.fc2 = nn.Linear(500, 200)
        self.fc3 = nn.Linear(200, outShape)
    
    def forward(self, input):
        out = F.relu(self.fc1(input))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        
        return out

# initialize the model
model = net(784, 10)
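To get a feel for the size of this network, the parameter count can be worked out by hand from the layer shapes. This is a minimal sketch in plain Python, assuming each `nn.Linear` layer keeps its default bias term; `linear_params` is a hypothetical helper introduced only for this illustration.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer (bias assumed on)."""
    return in_features * out_features + out_features


# Layer widths of the model above: 784 -> 500 -> 200 -> 10
layer_sizes = [784, 500, 200, 10]
per_layer = [
    linear_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(per_layer)
print(per_layer, total_params)
```

In the notebook, the same total should come out of `sum(p.numel() for p in model.parameters())`.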

## Training 

In [10]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
epochs = 10

for epoch in range(epochs):
    total_loss_test = 0
    total_loss_train = 0
    accuracy = 0
    for img, label in train_loader:
        label = label.view(-1)
        optimizer.zero_grad()

        out = model(img)
        loss = criterion(out, label)
        loss.backward()
        optimizer.step()
        total_loss_train += loss.item()
        
    for img, label in test_loader:
        label = label.view(-1)
        with torch.no_grad():
            # Test loss
            out = model(img)
            loss = criterion(out, label)
            total_loss_test += loss.item()
            
            # Accuracy
            _, indx = out.topk(1)
            correct = (indx.view(-1) == label).sum().item()
            acc = correct/batch_size
            accuracy += acc
    
    # Print losses for each epoch
    train_loss = total_loss_train/len(train_loader)
    test_loss = total_loss_test/len(test_loader)
    accuracy = accuracy/len(test_loader)
    print(f'[EPOCH {epoch}/{epochs}] Train Loss: {train_loss}')
    print(f'Test Loss: {test_loss} Accuracy: {accuracy}')

[EPOCH 0/10] Train Loss: 1.2083537247954312
Test Loss: 0.47797256113050846 Accuracy: 0.865814696485623
[EPOCH 1/10] Train Loss: 0.3944594695549208
Test Loss: 0.3467818654228609 Accuracy: 0.897064696485623
[EPOCH 2/10] Train Loss: 0.31767420198765994
Test Loss: 0.2960374093682954 Accuracy: 0.9107428115015974
[EPOCH 3/10] Train Loss: 0.27709063706813486
Test Loss: 0.2613061714274719 Accuracy: 0.9223242811501597
[EPOCH 4/10] Train Loss: 0.24662601495887404
Test Loss: 0.234231408689909 Accuracy: 0.9306110223642172
[EPOCH 5/10] Train Loss: 0.22161395768786918
Test Loss: 0.21181030162928488 Accuracy: 0.9365015974440895
[EPOCH 6/10] Train Loss: 0.20021527176466666
Test Loss: 0.19286749035137862 Accuracy: 0.9421924920127795
[EPOCH 7/10] Train Loss: 0.18172580767267993
Test Loss: 0.1767021114335428 Accuracy: 0.946685303514377
[EPOCH 8/10] Train Loss: 0.16573666792806246
Test Loss: 0.16299303889887545 Accuracy: 0.9507787539936102
[EPOCH 9/10] Train Loss: 0.1518441192202165
Test Loss: 0.151279971
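A note on the accuracy figure: the loop above divides each batch's correct count by the nominal `batch_size` and then averages over the batches. With 10000 test samples and a batch size of 32, the last batch holds only 16 samples, so this slightly underestimates the true accuracy. The sketch below illustrates the effect in plain Python with a hypothetical perfect model; the helper names are made up for this example and are not part of the training code.

```python
import math


def batch_sizes(num_samples, batch_size):
    """Sizes of the batches when the final partial batch is kept."""
    full, remainder = divmod(num_samples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


def mean_of_batch_ratios(correct_per_batch, batch_size):
    """Metric as computed in the loop above: correct / batch_size, averaged over batches."""
    ratios = [correct / batch_size for correct in correct_per_batch]
    return sum(ratios) / len(ratios)


def exact_accuracy(correct_per_batch, num_samples):
    """Total correct predictions divided by the true number of samples."""
    return sum(correct_per_batch) / num_samples


# Hypothetical perfect model: every sample in every batch is correct.
sizes = batch_sizes(10000, 32)
reported = mean_of_batch_ratios(sizes, 32)
exact = exact_accuracy(sizes, 10000)
print(f"reported: {reported:.4f}  exact: {exact:.4f}")
```

The reported value is effectively the correct count divided by 313 * 32 = 10016 instead of 10000. Read that way, the 0.8658 printed for the first epoch corresponds to 8672 correct predictions, which is an accuracy of 0.8672. Accumulating the correct count and dividing by `len(test_dataset)` would remove the bias.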

And voilà!

We have trained a simple neural network with the data stored in Hangar. You now have all the tools to train models on your own data stored in Hangar.

Cheers

## Compare DataLoader Speeds

This is an auxiliary section that compares the speed of the Hangar dataloader with a simple dataloader we have written ourselves.

Let's see whether using the dataloaders from Hangar gives any speedup.

In [10]:
from torch.utils.data import Dataset
import gzip
import pickle

Our dataset unzips the mnist data file and loads the data directly into memory.

In [11]:
class mnist_dataset(Dataset):
    
    def __init__(self, mnist_file_path, split='train'):
        assert split in ('train', 'test', 'val')
        self.split = split
        
        # Load data to memory
        with gzip.open(mnist_file_path, 'rb') as f:
            train_set, val_set, test_set = pickle.load(f, encoding='bytes')
        if split == 'train':
            self.imgs = train_set[0]
            self.labels = train_set[1]
        elif split == 'test':
            self.imgs = test_set[0]
            self.labels = test_set[1]
        elif split == 'val':
            self.imgs = val_set[0]
            self.labels = val_set[1]
            
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return self.imgs[idx], self.labels[idx]

In [12]:
dataset = mnist_dataset('./mnist.pkl.gz', 'train')
len(dataset)

50000
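The split handling in the dataset above can also be written as a dictionary lookup. The following is a minimal sketch under stated assumptions: `load_split` is a hypothetical helper, and the tiny stand-in file only mimics the `(train, val, test)` layout that the class above unpacks, so the example runs without the real `mnist.pkl.gz`.

```python
import gzip
import os
import pickle
import tempfile


def load_split(path, split="train"):
    """Load (images, labels) for one split from a gzipped pickle.

    Assumes the pickle holds a (train, val, test) tuple of (images, labels) pairs.
    """
    with gzip.open(path, "rb") as f:
        train_set, val_set, test_set = pickle.load(f, encoding="bytes")
    splits = {"train": train_set, "val": val_set, "test": test_set}
    if split not in splits:
        raise ValueError(f"unknown split {split!r}, expected one of {sorted(splits)}")
    images, labels = splits[split]
    return images, labels


# Stand-in data (not real MNIST): 4-value "images" with integer labels.
fake_data = (
    ([[0.0] * 4, [1.0] * 4, [0.5] * 4], [0, 1, 2]),  # train
    ([[0.2] * 4], [3]),                              # val
    ([[0.3] * 4, [0.4] * 4], [4, 5]),                # test
)

with tempfile.TemporaryDirectory() as tmp_dir:
    fake_path = os.path.join(tmp_dir, "fake_mnist.pkl.gz")
    with gzip.open(fake_path, "wb") as f:
        pickle.dump(fake_data, f)
    train_imgs, train_labels = load_split(fake_path, "train")
    val_imgs, val_labels = load_split(fake_path, "val")
    test_imgs, test_labels = load_split(fake_path, "test")

print(len(train_labels), len(val_labels), len(test_labels))
```

A lookup table keeps the three branches in one place and turns an unknown split name into an explicit error instead of a failed `assert`.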

In [13]:
dataloaders = DataLoader(dataset, batch_size=32)

In our handwritten case, each timed run loads the data file into memory with `mnist_dataset`, creates the dataloader, and iterates through all of the training data.

In [44]:
%%timeit
dataset = mnist_dataset('./mnist.pkl.gz', 'train')
# print the memory location to check whether the dataset is
# actually loaded to a new location in memory on each run.
print(hex(id(dataset)))
dataloaders = DataLoader(dataset, batch_size=32)
for img, label in dataloaders:
    pass

0x7ff66c6430d0
0x7ff66c64fc90
0x7ff66c6439d0
0x7ff66c64f710
0x7ff66c6434d0
0x7ff66c64f250
0x7ff66c643cd0
0x7ff66c64fa10
1.16 s ± 63.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
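`%%timeit` is an IPython cell magic, so a cell like the one above only runs inside a notebook. Outside a notebook, a rough equivalent can be put together with the standard library's `timeit.repeat`. This is a minimal sketch: `time_block` and `iterate_all` are hypothetical names, the workload is a stand-in loop rather than the real dataloader, and using the population standard deviation is a choice made here.

```python
import statistics
import timeit


def time_block(func, repeat=7, number=1):
    """Time func: `repeat` measurements of `number` calls each.

    Returns (mean, standard deviation) in seconds per call.
    """
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.pstdev(per_call)


def iterate_all(iterable=range(100_000)):
    """Stand-in workload: walk through an iterable and discard the items."""
    for _ in iterable:
        pass


mean_s, std_s = time_block(iterate_all)
print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e3:.3f} ms per loop "
      f"(mean ± std. dev. of 7 runs, 1 loop each)")
```

For the benchmark in this section, `func` would be a function that builds `mnist_dataset`, wraps it in a `DataLoader`, and iterates over it once.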


In [39]:
%%timeit 
for img, label in train_loader:
    pass

7.21 s ± 87.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


This is a bummer: the Hangar dataloader is roughly 6x slower than our simple custom dataloader (7.21 s versus 1.16 s for one pass over the training data).

We also tried feeding the data manually by indexing the Hangar columns directly (`__getitem__`), but that gives similar results.

In [None]:
img_col = co['mnist_training_images']
label_col = co['mnist_training_labels']

In [31]:
%%timeit
for i in range(len(img_col)):
    img = torch.from_numpy(img_col[i])
    label = torch.from_numpy(label_col[i])

7.7 s ± 192 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [33]:
%%timeit
with img_col, label_col:
    for i in range(len(img_col)):
        img = torch.from_numpy(img_col[i])
        label = torch.from_numpy(label_col[i])

7.17 s ± 48.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
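To put the four measurements side by side, the sketch below computes the slowdown relative to the custom in-memory dataset and the average time per sample. The timings are the mean values printed by `%%timeit` in this section; the sample count of 50000 for the Hangar training column is an assumption carried over from the pickle file's training split, and the labels are just descriptive names chosen for this example.

```python
# Mean wall-clock time per full pass, copied from the %%timeit outputs above.
timings_s = {
    "custom in-memory Dataset + DataLoader": 1.16,
    "Hangar make_torch_dataset + DataLoader": 7.21,
    "Hangar column indexing": 7.7,
    "Hangar column indexing inside `with`": 7.17,
}
NUM_SAMPLES = 50000  # assumption: same size as the pickle training split

baseline_s = timings_s["custom in-memory Dataset + DataLoader"]
slowdowns = {name: secs / baseline_s for name, secs in timings_s.items()}
per_sample_us = {name: secs / NUM_SAMPLES * 1e6 for name, secs in timings_s.items()}

for name, secs in timings_s.items():
    print(f"{name:<40} {secs:5.2f} s  "
          f"{slowdowns[name]:4.1f}x  {per_sample_us[name]:6.1f} us/sample")
```

All three Hangar variants land between about 6x and 7x the baseline, and wrapping the column access in `with` trims the manual loop from 7.7 s to 7.17 s. Keep in mind that the baseline timing also includes unzipping and unpickling the file on every run, so the two approaches are not measured over identical work.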
