# Hangar Tutorial (1/2): Adding your data to Hangar
#### A step-by-step guide to setting up a Hangar repo and adding your data to it.

This is the notebook that accompanies the tutorial blog post on [Hangar](hanga). In this notebook we will go through how you can add your data to Hangar for versioning.

In [1]:
import hangar
from hangar import Repository

import numpy as np
import pickle
import gzip
import matplotlib.pyplot as plt

from tqdm import tqdm

In [2]:
hangar.__version__

'0.5.1'

We will be using the MNIST dataset for this tutorial. Download MNIST from [here](https://github.com/mnielsen/neural-networks-and-deep-learning/raw/master/data/mnist.pkl.gz) and save it to a known path. A copy is also included in this GitHub repo.

## Initialize the Repo

As explained in the blog, each dataset is contained in a **Repository**. You initialize a repo with the *init()* method.

In [4]:
repo = Repository('./')
repo.init(user_name='jjmachan', user_email='jjmachan@g.com', remove_old=True)
repo

Hangar Repo initialized at: /home/jjmachan/jjmachan/hangar_tutorial/.hangar


Hangar Repository               
    Repository Path  : /home/jjmachan/jjmachan/hangar_tutorial               
    Writer-Lock Free : True


*remote* is used to access remote copies of the Hangar repository that you may have or create. One really cool feature of Hangar is that it allows partial downloads from remote repositories. This is really handy when working with huge datasets in your dev environment.

In [5]:
# list all remote repos
repo.remote.list_all()

[]

In [20]:
# Write-enabled checkout, needed to write data to the repo
co = repo.checkout(write=True)
co

Hangar WriterCheckout                
    Writer       : True                
    Base Branch  : master                
    Num Columns  : 6


**Note**: You can only have one write-enabled checkout at a time, so if you don't close your checkout before creating a new one, Hangar will raise an error.

In [7]:
co2 = repo.checkout(write=True)

PermissionError: Cannot acquire the writer lock. Only one instance of a writer checkout can be active at a time. If the last checkout of this repository did not properly close, or a crash occurred, the lock must be manually freed before another writer can be instantiated.
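
The single-writer rule can be pictured with a toy model. The sketch below is not Hangar's implementation: `ToyRepo` and `ToyCheckout` are hypothetical names invented here for illustration. It only shows why a second writer fails while the first is still open, and why wrapping your writes in `try`/`finally` keeps the lock from being left behind.

```python
class ToyRepo:
    """Hypothetical stand-in for a repository that allows one writer at a time."""

    def __init__(self):
        self.writer_lock_free = True

    def checkout(self, write=False):
        if write:
            if not self.writer_lock_free:
                raise PermissionError('Cannot acquire the writer lock.')
            self.writer_lock_free = False
        return ToyCheckout(self, write)


class ToyCheckout:
    """Hypothetical checkout that hands the lock back when it is closed."""

    def __init__(self, repo, write):
        self._repo = repo
        self._write = write
        self._closed = False

    def close(self):
        if self._write and not self._closed:
            self._repo.writer_lock_free = True
        self._closed = True


toy_repo = ToyRepo()
first = toy_repo.checkout(write=True)

try:
    toy_repo.checkout(write=True)   # a second writer while the first is open
    second_writer_failed = False
except PermissionError:
    second_writer_failed = True

# try/finally releases the lock even if a write raises an exception
try:
    pass  # ... write data here ...
finally:
    first.close()

third = toy_repo.checkout(write=True)  # succeeds now that the lock is free
third.close()
```

The same `try`/`finally` shape around a real `repo.checkout(write=True)` and `co.close()` is one way to avoid running into the error shown above.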

## Columns
These are the structures used to store the data as numpy arrays. Both numeric and string data can be stored.

In [8]:
co.columns

Hangar Columns                
    Writeable         : True                
    Number of Columns : 0                
    Column Names / Partial Remote References:                
      - 

In [9]:
# Load the dataset
with gzip.open('./mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='bytes')

# sample image and label for creating columns
sample_trimg = train_set[0][0]
sample_trlabel = np.array([train_set[1][0]])

# training images
trimgs = train_set[0]
trlabels = train_set[1]

# collect the three splits in a list
data = [train_set, valid_set, test_set]

In [10]:
sample_trimg.shape, sample_trimg.dtype

((784,), dtype('float32'))
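
If the indexing in the loading cell looks opaque, this small sketch may help. It builds a tiny synthetic stand-in with the same layout the tutorial relies on (three splits, each an `(images, labels)` pair), round-trips it through `gzip` and `pickle`, and pulls out one sample. The file name, the `make_split` helper, and the `toy_*` names are made up for illustration, and the arrays are zeros rather than real MNIST digits.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np


def make_split(n):
    """Hypothetical helper: n fake flattened images plus n integer labels."""
    images = np.zeros((n, 784), dtype=np.float32)
    labels = np.arange(n, dtype=np.int64) % 10
    return images, labels


# same layout as the tutorial's pickle: (train, valid, test)
fake_mnist = (make_split(5), make_split(2), make_split(3))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'fake_mnist.pkl.gz')  # made-up file name
    with gzip.open(path, 'wb') as f:
        pickle.dump(fake_mnist, f)
    with gzip.open(path, 'rb') as f:
        toy_train, toy_valid, toy_test = pickle.load(f)

images, labels = toy_train            # index 0 holds images, index 1 holds labels
first_image = images[0]               # same as toy_train[0][0]
first_label = np.array([labels[0]])   # wrap the scalar label in a 1-element array

print(images.shape, first_image.shape, first_label.shape, first_label.dtype)
```

Wrapping the label in `np.array([...])` is what turns a bare integer into an array with shape `(1,)`, which is the form the tutorial uses for the label sample.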

Now let's create columns using *add_ndarray_column()*. It takes a *name*, plus a *shape* and *dtype*, which Hangar can infer if you provide a sample through the *prototype=* argument.

In [11]:
# Train
co.add_ndarray_column(name='mnist_training_images', prototype=sample_trimg)
co.add_ndarray_column(name='mnist_training_labels', prototype=sample_trlabel)


Hangar FlatSampleWriter                 
    Column Name              : mnist_training_labels                
    Writeable                : True                
    Column Type              : ndarray                
    Column Layout            : flat                
    Schema Type              : fixed_shape                
    DType                    : int64                
    Shape                    : (1,)                
    Number of Samples        : 0                
    Partial Remote Data Refs : False
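
To make the role of the prototype concrete, here is a minimal numpy-only sketch of the two facts a sample array carries: its shape and its dtype. `schema_from_prototype` and `fits_schema` are hypothetical helpers written for this illustration, not Hangar functions, and the matching check is only a rough picture of what a `fixed_shape` schema like the one printed above implies.

```python
import numpy as np

# stand-ins shaped like the tutorial's sample image and sample label
sample_img = np.zeros(784, dtype=np.float32)
sample_label = np.array([5], dtype=np.int64)


def schema_from_prototype(prototype):
    """Hypothetical helper: read shape and dtype off a sample array."""
    return prototype.shape, prototype.dtype


def fits_schema(sample, shape, dtype):
    """Hypothetical check: does a sample match a fixed shape and dtype?"""
    return sample.shape == shape and sample.dtype == dtype


img_shape, img_dtype = schema_from_prototype(sample_img)
label_shape, label_dtype = schema_from_prototype(sample_label)

print(img_shape, img_dtype)
print(label_shape, label_dtype)

# a float32 vector of length 784 matches the image schema, a float64 one does not
print(fits_schema(np.ones(784, dtype=np.float32), img_shape, img_dtype))
print(fits_schema(np.ones(784, dtype=np.float64), img_shape, img_dtype))
```

Passing a prototype saves you from typing the shape and dtype by hand and keeps the column definition in step with the data you are about to add.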


In [12]:
# Val
co.add_ndarray_column(name='mnist_validation_images', prototype=sample_trimg)
co.add_ndarray_column(name='mnist_validation_labels', prototype=sample_trlabel)

# Test
co.add_ndarray_column(name='mnist_test_images', prototype=sample_trimg)
co.add_ndarray_column(name='mnist_test_labels', prototype=sample_trlabel)

Hangar FlatSampleWriter                 
    Column Name              : mnist_test_labels                
    Writeable                : True                
    Column Type              : ndarray                
    Column Layout            : flat                
    Schema Type              : fixed_shape                
    DType                    : int64                
    Shape                    : (1,)                
    Number of Samples        : 0                
    Partial Remote Data Refs : False


In [13]:
# list all the columns created
column_list = [('train', 'mnist_training_images', 'mnist_training_labels'), ('val', 'mnist_validation_images', 'mnist_validation_labels'), ('test', 'mnist_test_images', 'mnist_test_labels')]
for _, imgs, labels in column_list:
    print(co.columns[imgs], co.columns[labels], '\n')

FlatSampleWriter(repo_pth=/home/jjmachan/jjmachan/hangar_tutorial/.hangar, aset_name=mnist_training_images, ['column_layout=flat, ', 'column_type=ndarray, ', 'schema_hasher_tcode=1, ', 'data_hasher_tcode=0, ', 'schema_type=fixed_shape, ', 'shape=(784,), ', 'dtype=float32, ', 'backend=00, ', "backend_options={'complib': 'blosc:lz4hc', 'complevel': 5, 'shuffle': 'byte'}, "], mode=a) FlatSampleWriter(repo_pth=/home/jjmachan/jjmachan/hangar_tutorial/.hangar, aset_name=mnist_training_labels, ['column_layout=flat, ', 'column_type=ndarray, ', 'schema_hasher_tcode=1, ', 'data_hasher_tcode=0, ', 'schema_type=fixed_shape, ', 'shape=(1,), ', 'dtype=int64, ', 'backend=10, ', 'backend_options={}, '], mode=a) 

FlatSampleWriter(repo_pth=/home/jjmachan/jjmachan/hangar_tutorial/.hangar, aset_name=mnist_validation_images, ['column_layout=flat, ', 'column_type=ndarray, ', 'schema_hasher_tcode=1, ', 'data_hasher_tcode=0, ', 'schema_type=fixed_shape, ', 'shape=(784,), ', 'dtype=float32, ', 'backend=00, ',

In [23]:
# actually adding data
for i, (split, imgs, labels) in enumerate(column_list):
    print(f'Adding {split} data')
    img_col , label_col = co.columns[imgs], co.columns[labels]
    
    # use the columns as context managers;
    # drop the with-block to see the time difference.
    with img_col, label_col:
        for idx, image in enumerate(data[i][0]):
            
            # column[idx] = value invokes __setitem__ and
            # adds the assigned data to hangar under the
            # corresponding idx (index)
            img_col[idx] = image
            label_col[idx] = np.array([data[i][1][idx]])

Adding train data
Adding val data
Adding test data
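
The loop above leans on two Python protocols: `column[idx] = value` calls `__setitem__`, and `with column:` calls `__enter__` and `__exit__`. The toy class below is a sketch of how those fit together and why the `with` block can save time. `ToyColumn` is a hypothetical class, not Hangar code, and it assumes that the expensive part is opening the backing store, which here is just a counter.

```python
class ToyColumn:
    """Hypothetical column that counts how often its backing store is opened."""

    def __init__(self):
        self.samples = {}
        self.open_count = 0
        self._is_open = False

    def _open(self):
        self.open_count += 1
        self._is_open = True

    def _close(self):
        self._is_open = False

    def __enter__(self):
        self._open()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._close()
        return False  # do not swallow exceptions

    def __setitem__(self, key, value):
        # column[key] = value lands here
        if self._is_open:
            self.samples[key] = value
        else:
            # no enclosing with-block: open and close around this one write
            self._open()
            self.samples[key] = value
            self._close()


slow = ToyColumn()
for idx in range(100):
    slow[idx] = idx * 2      # opens the store on every assignment

fast = ToyColumn()
with fast:
    for idx in range(100):
        fast[idx] = idx * 2  # store is opened once for the whole loop

print(slow.open_count, fast.open_count)
```

Both columns end up holding the same samples. Only the number of open and close cycles differs, which is the kind of overhead the `with` block in the tutorial is meant to avoid.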


In [15]:
# Details of the new column
# Note the number of samples has increased to 50000
co.columns['mnist_training_images']

Hangar FlatSampleWriter                 
    Column Name              : mnist_training_images                
    Writeable                : True                
    Column Type              : ndarray                
    Column Layout            : flat                
    Schema Type              : fixed_shape                
    DType                    : float32                
    Shape                    : (784,)                
    Number of Samples        : 50000                
    Partial Remote Data Refs : False


In [16]:
# Save the changes from this checkout to disk
co.commit('added all the mnist datasets')

'a=39a36c4fa931e82172f03edd8ccae56bf086129b'

In [17]:
# see the commits using the logs
repo.log()

* a=39a36c4fa931e82172f03edd8ccae56bf086129b (master) : added all the mnist datasets


In [18]:
# You can see the whole summary of the repo using this function.
repo.summary()

Summary of Contents Contained in Data Repository 
 
| Repository Info 
|----------------- 
|  Base Directory: /home/jjmachan/jjmachan/hangar_tutorial 
|  Disk Usage: 105.88 MB 
 
| Commit Details 
------------------- 
|  Commit: a=39a36c4fa931e82172f03edd8ccae56bf086129b 
|  Created: Fri May  1 18:23:19 2020 
|  By: jjmachan 
|  Email: jjmachan@g.com 
|  Message: added all the mnist datasets 
 
| DataSets 
|----------------- 
|  Number of Named Columns: 6 
|
|  * Column Name: ColumnSchemaKey(column="mnist_test_images", layout="flat") 
|    Num Data Pieces: 10000 
|    Details: 
|    - column_layout: flat 
|    - column_type: ndarray 
|    - schema_hasher_tcode: 1 
|    - data_hasher_tcode: 0 
|    - schema_type: fixed_shape 
|    - shape: (784,) 
|    - dtype: float32 
|    - backend: 00 
|    - backend_options: {'complib': 'blosc:lz4hc', 'complevel': 5, 'shuffle': 'byte'} 
|
|  * Column Name: ColumnSchemaKey(column="mnist_test_labels", layout="flat") 
|    Num Data Pieces: 10000 
|   

### Note: 
Always remember to close the checkout after you have written the data and committed it!

In [19]:
co.close()