# Deep Learning Workflow
Managing a deep learning workflow is hard. The problems include handling scale, using resources efficiently, and version controlling the data. With Hangar, we can keep the data in check: it lives in the data store as tensors rather than as a blob, and it is versioned there. PyTorch's flexibility lets us prototype faster and iterate more smoothly. The model prototype can then be pushed as optimized TorchScript to RedisAI, a highly optimized production runtime, and serving can be scaled out to a multi-node Redis cluster or made highly available with Redis Sentinel.

This tutorial is divided into three parts:
1. Hangar
2. PyTorch, TensorFlow, or any other framework that has an ONNX export
3. RedisAI

## Hangar
This tutorial will review the first steps of working with a Hangar repository.
To fit with the beginner's theme, we will use the MNIST dataset.

In [5]:
!pip install "grpcio>=1.21.1"
!pip install git+https://github.com/tensorwerk/hangar-py
!pip install matplotlib
!pip install git+https://github.com/redisai/redisai-py@onnxruntime
!pip install https://download.pytorch.org/whl/cpu/torch-1.1.0-cp37-cp37m-linux_x86_64.whl
!pip install tqdm

In [6]:
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

from hangar import Repository

### Creating & Interacting with a Hangar Repository

Hangar is designed to "just make sense" in every operation you have to perform. As such, there is a single interface which all interaction begins with:
the `Repository` object. 

Whether or not a Hangar repository already exists at the path you specify, just tell Hangar where it should live!

#### Initializing a repository

The first time you want to work with a new repository, the `init()` method must be called. This is where you provide hangar with your name and email address (to be used in the commit log), as well as implicitly confirming that you do want to create the underlying data files hangar uses on disk. 

In [7]:
!mkdir -p myhangarrepo
repo = Repository(path='myhangarrepo')
repo.init(user_name='Sherin Thomas', user_email='sherin@gmail.com', remove_old=True)

Hangar Repo initialized at: myhangarrepo/__hangar


'myhangarrepo/__hangar'

In [None]:
repo

In [None]:
repo.writer_lock_held

In [None]:
repo.repo_path

In [None]:
# data link
# https://drive.google.com/drive/folders/1zYdhNN4s5QnqGHRN632hXvfCt4OxsF0l?usp=sharing

In [None]:
datapath = "mnist_data"
import os
os.listdir(datapath)


In [None]:
target = np.load(os.path.join(datapath, 'target1.npy'))
data = np.load(os.path.join(datapath, 'data1.npy'))

In [None]:
data.shape, target.shape

### Checking out the repo for writing

A repository can be checked out in two modes: 

1) **write-enabled**: applies all operations to the staging area's current state. Only one write-enabled checkout can be active at a time. It must be closed after its last use, or manual intervention will be needed to remove the writer lock. 
    
2) **read-only**: checkout a commit or branch to view repository state as it existed at that point in time. 

In [None]:
co = repo.checkout(write=True)

In [8]:
co.datasets

In [9]:
co.metadata

#### Before data can be added to a repository, a dataset must be initialized. 

A Dataset is a named grouping of data samples where each sample shares a number of similar attributes and array properties:

https://hangar-py.readthedocs.io/en/latest/concepts.html#how-hangar-thinks-about-data

In [None]:
data_dset = co.datasets.init_dataset('mnist_data', shape=(28, 28), dtype='uint8')

### Interaction 

When a dataset is initialized, a dataset accessor object is returned. Depending on your use case, however, this may or may not be the most convenient way to access a dataset.

In general, we have implemented a full `dict` mapping interface on top of all objects. To access the `'mnist_data'` dataset you can just use dict-style access like the following (note: if operating in IPython/Jupyter, the dataset keys will autocomplete for you).

In [None]:
co.datasets['mnist_data'] == data_dset

In [None]:
target_dset = co.datasets.init_dataset('mnist_target', prototype=target[0])

In [None]:
co.commit('datasets init')
co.close()

In [None]:
co = repo.checkout(write=True)
data_dset = co.datasets['mnist_data']
target_dset = co.datasets['mnist_target']

#### Performance

Once you've completed an interactive exploration, be sure to use the context manager form of the `.add` and `.get` methods! 

In order to make sure that all your data is always safe in Hangar, the backend diligently ensures that all contexts are opened and closed appropriately. 

When you use the context manager form of a dataset object, we can offload a significant amount of work to the Python runtime, and dramatically increase read and write speeds.

In most datasets we've tested, the context manager form increased throughput by roughly 250% for writes and 300% for reads compared with the naked form!
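
To illustrate why the context manager form can be faster, here is a minimal standard-library sketch. `ToyStore` is a hypothetical stand-in, not Hangar's implementation: it only counts how often a backend would have to be opened when each write sets up its own context versus when one `with` block wraps all writes.

```python
class ToyStore:
    """Hypothetical stand-in for a storage backend; not Hangar's real code."""

    def __init__(self):
        self.data = {}
        self.open_calls = 0  # how many times the backend had to be opened
        self._depth = 0      # greater than 0 while inside a `with` block

    def __enter__(self):
        if self._depth == 0:
            self.open_calls += 1  # a real backend would open file handles here
        self._depth += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._depth -= 1  # a real backend would flush and close at depth 0
        return False      # never swallow exceptions

    def __setitem__(self, key, value):
        if self._depth > 0:
            # Already inside a context: reuse the open backend.
            self.data[key] = value
        else:
            # "Naked" form: every single write opens and closes the backend.
            with self:
                self.data[key] = value


naked = ToyStore()
for i in range(1000):
    naked[str(i)] = i

managed = ToyStore()
with managed:
    for i in range(1000):
        managed[str(i)] = i

print(naked.open_calls, managed.open_calls)  # 1000 1
```

Both stores end up with the same contents; only the number of open/close cycles differs.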

In [None]:
with data_dset, target_dset:  # You don't really need this CM if you are not worried about perf
  for i in tqdm(range(len(data))):
    sample_name = str(i)
    data_dset[sample_name] = data[i]
    target_dset[sample_name] = np.array(target[i])
co.commit('dataset curation: stage 1')
co.close()

In [None]:
co = repo.checkout()
dset = co.datasets['mnist_data']

In [None]:
'1' in dset

In [None]:
dset.keys()

In [None]:
next(dset.values()).shape

In [None]:
dset

In [None]:
for key, value in dset.items():
    print(key)
    plt.imshow(value)
    break

In [None]:
repo.log()

In [None]:
del dset['1']

### Metadata

In [None]:
co = repo.checkout(write=True)
co.metadata['DataSource'] = "Sherin"
co.commit("Added source")
co.close()

### Safety from Python "oddities" is built into Hangar's very essence.

- Unknown to the user, Hangar does not actually allow `dataset` or `metadata` objects to be directly referenced in application code.
- What you actually get is a `weakref ObjectProxy`. Though semantically identical, only Hangar keeps strong references to its accessors.
- When a Hangar object no longer has permissions to act, the `ObjectProxy` "self destructs".
- Any introspection/call/modification by application code immediately raises an exception to let you know you're dealing with something which is out of date! 
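
The general mechanism can be sketched with Python's standard `weakref` module. The `Accessor` and `Checkout` classes below are hypothetical toys, not Hangar's classes, and the sketch assumes CPython, where an object is freed as soon as its last strong reference is dropped.

```python
import weakref


class Accessor:
    """Hypothetical accessor object owned by a checkout."""

    def __init__(self, name):
        self.name = name


class Checkout:
    """Hypothetical owner that holds the only strong reference."""

    def __init__(self):
        self._accessor = Accessor("mnist_data")

    def get(self):
        # Callers only ever receive a weak proxy, never the object itself.
        return weakref.proxy(self._accessor)

    def close(self):
        # Dropping the only strong reference invalidates every proxy.
        self._accessor = None


toy_co = Checkout()
proxy = toy_co.get()
name_before = proxy.name  # works while the checkout is open

toy_co.close()
try:
    proxy.name
    stale_access_failed = False
except ReferenceError:
    # The proxy's referent is gone, so any access raises.
    stale_access_failed = True

print(name_before, stale_access_failed)
```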

In [None]:
co = repo.checkout(write=True)
data_dset = co.datasets['mnist_data']
co.close()
data_dset['1']

### What you put in is what you get out
- All data is hashed by cryptographically secure hash algorithms (blake2b with a 20-byte digest length).
- A commit is entirely self-sufficient, and its hash depends on the hashes of previous references.
- For performance reasons, the data hash is only calculated / verified when:
    - a sample is added to a dataset
    - data is fetched from a remote repo
    - data is sent to a remote repo
- During regular read access, data integrity is ensured via fletcher32 / crc32 checksums.
- Backend store utilities provide well validated, trusted, and performant implementations.
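
As a rough illustration of these ideas, the sketch below uses only the standard library. The helpers `data_hash` and `commit_hash` are hypothetical, and the exact bytes Hangar feeds into its hashes are not shown in this tutorial, so treat the layout as an assumption. It shows a blake2b digest of 20 bytes, a commit hash that changes when its parent changes, and a cheap crc32 checksum of the kind used for read-time integrity checks.

```python
import hashlib
import zlib


def data_hash(raw: bytes) -> str:
    # blake2b with a 20-byte digest gives 40 hex characters.
    return hashlib.blake2b(raw, digest_size=20).hexdigest()


def commit_hash(parent_hash: str, data_hashes: list, message: str) -> str:
    # Hypothetical layout: mix the parent hash, data hashes, and message.
    h = hashlib.blake2b(digest_size=20)
    h.update(parent_hash.encode())
    for digest in sorted(data_hashes):
        h.update(digest.encode())
    h.update(message.encode())
    return h.hexdigest()


sample = bytes(range(16))
sample_digest = data_hash(sample)

root = commit_hash("", [sample_digest], "first commit")
child_a = commit_hash(root, [sample_digest], "second commit")
child_b = commit_hash("different parent", [sample_digest], "second commit")

# A cheap checksum, suitable for detecting accidental corruption on read.
checksum = zlib.crc32(sample)

print(len(sample_digest), child_a != child_b)
```

Because the parent hash is part of the input, the same data and message under a different history produce a different commit hash.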

## Branching & Merging
- Time travel through the historical evolution of a dataset
- Zero-cost Branching to enable exploratory analysis and collaboration
- Cheap Merging to build datasets over time (with multiple collaborators)
- Completely abstracted organization and management of data files on disk
- Ability to only retrieve a small portion of the data (as needed) while still maintaining complete historical record
- Ability to push and pull changes directly to collaborators or a central server (i.e., a truly distributed version control system)
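
One way to picture "zero-cost" branching is a branch as nothing more than a name pointing at a commit. The toy below is a conceptual sketch under that assumption, not Hangar's storage format; the `commit`, `create_branch`, and `log` helpers and the branch name `master` are illustrative only.

```python
commits = {}   # commit id -> (parent id, message)
branches = {}  # branch name -> id of the commit at its head


def commit(branch, message):
    # A new commit records its parent and moves the branch head forward.
    parent = branches.get(branch)
    commit_id = f"c{len(commits)}"
    commits[commit_id] = (parent, message)
    branches[branch] = commit_id
    return commit_id


def create_branch(name, base):
    # Creating a branch copies one pointer; no data is duplicated.
    branches[name] = branches[base]


def log(branch):
    # Walk parent links from the branch head back to the first commit.
    messages, current = [], branches.get(branch)
    while current is not None:
        parent, message = commits[current]
        messages.append(message)
        current = parent
    return messages


commit("master", "datasets init")
commit("master", "dataset curation: stage 1")
create_branch("stage2", "master")
commit("stage2", "Data curation: stage2")

print(log("master"))
print(log("stage2"))
```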

In [None]:
repo.create_branch('stage2')

In [None]:
co2 = repo.checkout(branch_name='stage2', write=True)

In [None]:
target = np.load(os.path.join(datapath, 'target2.npy'))
data = np.load(os.path.join(datapath, 'data2.npy'))
target.shape, data.shape

In [None]:
with co2.datasets['mnist_data'] as ddset, co2.datasets['mnist_target'] as tdset:
    current_index = len(ddset)
    for i in tqdm(range(len(data))):
        sample_name = str(current_index + i)
        ddset[sample_name] = data[i]
        tdset[sample_name] = np.array(target[i])
co2.metadata['DataSource'] = "Somebody else"
co2.commit('Data curation: stage2')
co2.close()

In [None]:
repo.list_branch_names()

In [None]:
repo.log(branch_name='stage2')

In [None]:
co = repo.checkout(write=True)

In [None]:
# Dummy commit so that the two branches diverge
co.metadata['RandomeKey'] = "RandomValue"
co.commit("Dummy metadata")

In [None]:
# It's not a good idea to run this now
# from pprint import pprint
# pprint(co.diff.branch("stage2"))

In [None]:
co.merge("Merging stage2", dev_branch='stage2')

In [None]:
co.close()

In [None]:
repo.log()

In [None]:
repo._details()

In [None]:
repo.summary()

### Security Disclosure

Hangar is an early-stage product, and none of the core developers have any significant cryptography or security background/experience. While efforts have been made to secure application data, we are not comfortable calling Hangar a `cryptographically secure utility` until a formal security and design review by domain experts has been performed. 

We are actively looking for help in this area. If you are interested in contributing, please let us know!

### What's pending
- Remote Hangar Repository
- Import & Export

## PyTorch
- Dynamic Graph
- torch.nn
- Datasets & Dataloaders
- Training
    - Autograd
    - Optimization
- Validation
- Serializing

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

### Dynamic Graph

### torch.nn

In [None]:
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
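
The `4*4*50` in `fc1` follows from the layer sizes. As a quick check, here is the shape arithmetic for a 28x28 MNIST image in plain Python, using the standard output-size formula for unpadded convolution and pooling; the helper names `conv_out` and `pool_out` are made up for this sketch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    # Output side length of a convolution along one spatial dimension.
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride):
    # Output side length of max pooling without padding.
    return (size - kernel) // stride + 1


side = 28
side = conv_out(side, kernel=5)          # conv1: 28 -> 24
side = pool_out(side, kernel=2, stride=2)  # pool:  24 -> 12
side = conv_out(side, kernel=5)          # conv2: 12 -> 8
side = pool_out(side, kernel=2, stride=2)  # pool:  8 -> 4

flat_features = side * side * 50  # 50 output channels from conv2
print(side, flat_features)  # 4 800
```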

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = Net().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

In [None]:
model.train()

### Datasets & Dataloaders

In [None]:
from torch.utils.data import Dataset, DataLoader

In [None]:
def normalize(img):
    img = np.asarray(img, dtype=np.float32)
    img /= 255.0
    return img

In [None]:
# Dataset and DataLoader are imported above, but we are not using them for now
class HangarDataset:
    """
    Dataset-like wrapper that gives access to Hangar datasets
    """

    def __init__(self, data, target):
        if len(data) != len(target):
            raise RuntimeError("Length of data and target does not match")
        self.data = data
        self.target = target

    def __len__(self):
        return len(self.data)

    def __getitem__(self, key):
        """
        Since the sample names in our Hangar repository are str(index),
        we can use str(key) to figure out the sample name
        """
        sample_name = str(key)
        normalized_img = normalize(self.data[sample_name].reshape(1, 1, 28, 28))
        target = self.target[sample_name].reshape(1)
        return normalized_img, target

In [None]:
co = repo.checkout()
ddset = co.datasets['mnist_data']
tdset = co.datasets['mnist_target']
hangar_dset = HangarDataset(ddset, tdset)

### Training

In [None]:
# Training one sample at a time (no batching) is not good practice,
# and neither is iterating without shuffling; this loop only keeps things simple
for i in tqdm(range(len(hangar_dset))):
    data, target = hangar_dset[i]
    data = torch.from_numpy(data).to(device)
    target = torch.from_numpy(target).to(device)
    optimizer.zero_grad()
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()
    # Stop early once sample 5000 has been processed
    if i and i % 5000 == 0:
        break
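
The comments at the top of the loop point out that it neither batches nor shuffles. The sketch below shows, with the standard library only, what shuffled mini-batch index selection looks like; `shuffled_batches` is a hypothetical helper written for this note. In practice, PyTorch's `DataLoader` offers `batch_size` and `shuffle` options for the same purpose.

```python
import random


def shuffled_batches(num_samples, batch_size, seed=None):
    # Shuffle all sample indices once, then cut them into consecutive batches.
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]


batches = list(shuffled_batches(num_samples=10, batch_size=4, seed=0))
for batch in batches:
    # Each index could be turned into a sample name with str(index).
    print(batch)
```

Every index appears exactly once per pass, and the last batch is smaller when the sample count is not a multiple of the batch size.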
    

In [None]:
#TODO Testing & Validation

## RedisAI

#### What is Redis

In [None]:
import redis
con = redis.Redis()

In [None]:
con.set('foo', 'bar')

In [None]:
con.get('foo')

### Saving the model

In [None]:
import redisai as rai
traced_model = torch.jit.trace(model, data)
rai.save_model(traced_model, 'mnist.pt')
del traced_model
del model

### Loading the model & tensors

In [None]:
model = rai.load_model('mnist.pt')
model

In [None]:
data.shape

In [None]:
target

In [None]:
# .cpu() is needed when the tensor lives on the GPU; it is a no-op on CPU
tensor = rai.BlobTensor.from_numpy(data.cpu().numpy())

### Interacting with Redis & RedisAI server

In [None]:
con = rai.Client(host='localhost', port=6379)

In [None]:
con.tensorset('input', tensor)

In [None]:
con.modelset('model', rai.Backend.torch, rai.Device.cpu, model)

In [None]:
con.modelrun('model', 'input', 'output')
# if you have more input and output tensors?

In [None]:
# output = con.tensorget('output')
output = con.tensorget('output', as_type=rai.BlobTensor)
output.to_numpy()

In [None]:
output.to_numpy().argmax()
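
Since `Net.forward` ends with `log_softmax`, the `argmax` of the output is the predicted digit. A small standard-library sketch of that relationship follows; the logits are made up for illustration and the helper functions are not part of RedisAI or PyTorch.

```python
import math


def log_softmax(logits):
    # Subtract the max first for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def argmax(values):
    # Index of the largest value.
    return max(range(len(values)), key=values.__getitem__)


logits = [0.1, 2.5, -1.0, 0.3]  # made-up scores for four classes
log_probs = log_softmax(logits)
predicted = argmax(log_probs)

# log_softmax keeps the ordering, so the argmax matches the raw logits.
print(predicted, argmax(logits))
```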

### SCRIPTing

In [None]:
script = """
def first_script(arr1, arr2):
    return (arr1 / 2) @ arr2

"""

In [None]:
con.scriptset('script', rai.Device.cpu, script)

In [None]:
arr1 = rai.Tensor(value=[8, 8, 8, 8, 8, 8], shape=(3, 2), dtype=rai.DType.int32)
np_arr = np.array((2, 2), dtype=np.int32).reshape(2, 1)
arr2 = rai.BlobTensor.from_numpy(np_arr)
con.tensorset('dummyarr1', arr1)
con.tensorset('dummyarr2', arr2)

In [None]:
con.scriptrun('script', 'first_script', ['dummyarr1', 'dummyarr2'], 'scriptout')

In [None]:
con.tensorget('scriptout')

In [None]:
con.tensorget('scriptout').value
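
To sanity-check what `first_script` should return for these inputs, here is the same arithmetic in plain Python with nested lists standing in for the tensors. `first_script_py` is a hypothetical helper for this check, not something RedisAI provides.

```python
def first_script_py(arr1, arr2):
    # Halve every element, then do an ordinary matrix multiplication.
    halved = [[x / 2 for x in row] for row in arr1]
    return [
        [sum(a * b for a, b in zip(row, col)) for col in zip(*arr2)]
        for row in halved
    ]


arr1 = [[8, 8], [8, 8], [8, 8]]  # shape (3, 2), as in the tensor above
arr2 = [[2], [2]]                # shape (2, 1)

result = first_script_py(arr1, arr2)
print(result)  # [[16.0], [16.0], [16.0]]
```

Each row is `[4, 4]` after halving, and its dot product with `[2, 2]` is 16, giving a result of shape (3, 1).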

#### What's pending
- [Replication & failover](https://github.com/RedisAI/redisai-examples/tree/master/sentinel)
- [Other clients](https://github.com/RedisAI/redisai-examples)
- [Other backends](https://github.com/RedisAI/redisai-examples/tree/master/python_client)
- Keep data local
- [RedisEdge](https://github.com/RedisGears/EdgeRealtimeVideoAnalytics)

## Links
- [Hangar](https://github.com/tensorwerk/hangar-py)
- [PyTorch](https://pytorch.org)
- [RedisAI](https://github.com/RedisAI/RedisAI)
- [This example](https://github.com/pytorch/examples/blob/master/mnist/main.py)