# Preparation

In [None]:
!pip install d2l

In [None]:
!pip install matplotlib==3.1.3

# Compilers and Interpreters

*Imperative programming*: uses statements to change a program's state, executing each one as it is encountered.

In [1]:
def add(a, b):
    return a + b

def fancy_func(a, b, c, d):
    e = add(a, b)
    f = add(c, d)
    g = add(e, f)
    return g

print(fancy_func(1, 2, 3, 4))

10


*Symbolic programming:* computation is usually performed only once the process has been fully defined.

Usual steps:
- Define the operations to be executed
- Compile the operations into an executable program
- Provide the required inputs and call the compiled program for execution

This allows a significant amount of optimization, because the whole program is known before it runs.

The Python interpreter can also be skipped, which removes a performance bottleneck that can become significant when multiple fast GPUs are paired with a single Python thread on a CPU.

In [4]:
def add_():
    return '''
def add(a, b):
    return a + b
'''

def fancy_func_():
    return '''
def fancy_func(a, b, c, d):
    e = add(a, b)
    f = add(c, d)
    g = add(e, f)
    return g
'''

def evoke_():
    return add_() + fancy_func_() + 'print(fancy_func(1, 2, 3, 4))'

prog = evoke_()
print(prog)
y = compile(prog, '', 'exec')
exec(y)


def add(a, b):
    return a + b

def fancy_func(a, b, c, d):
    e = add(a, b)
    f = add(c, d)
    g = add(e, f)
    return g
print(fancy_func(1, 2, 3, 4))
10


Differences between *imperative* (interpreted) programming and *symbolic* programming:
- Imperative programming is easier to write and debug, because it is easy to obtain and print all relevant intermediate variable values.
- Symbolic programming is more efficient and easier to port. The code is easier to optimize during compilation, and the program can be exported into a format independent of Python. This allows it to run in a non-Python environment, avoiding potential performance issues related to the Python interpreter.

Hybrid programming

- TensorFlow: symbolic
- PyTorch: imperative and uses dynamic computation graphs

Hybridizing the Sequential Class

In [5]:
import torch
from torch import nn
from d2l import torch as d2l

# Factory for networks
def get_net():
    net = nn.Sequential(nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 2))
    return net

x = torch.randn(size=(1, 512))
net = get_net()
net(x)

tensor([[ 0.0358, -0.0877]], grad_fn=<AddmmBackward0>)

With torch.jit.script, we are able to compile and optimize the computation in the MLP. 

In [6]:
net = torch.jit.script(net)
net(x)

tensor([[ 0.0358, -0.0877]], grad_fn=<AddmmBackward0>)

Benchmark performance

In [7]:
class Benchmark:
    """For measuring running time."""
    def __init__(self, description='Done'):
        self.description = description

    def __enter__(self):
        self.timer = d2l.Timer()
        return self

    def __exit__(self, *args):
        print(f'{self.description}: {self.timer.stop():.4f} sec')

In [8]:
net = get_net()
with Benchmark('Without torchscript'):
    for i in range(1000): net(x)

net = torch.jit.script(net)
with Benchmark('With torchscript'):
    for i in range(1000): net(x)

Without torchscript: 0.0939 sec
With torchscript: 0.0903 sec


Serialization

One benefit of compiling a model is that we can serialize (save) the model and its parameters to disk. This stores the model in a form that is independent of the front-end language, so we can deploy trained models to other devices and use them from other front-end programming languages. At the same time, the code is often faster than what can be achieved in imperative programming.

In [9]:
net.save('my_mlp')
!ls -lh my_mlp*

-rw-r--r-- 1 root root 651K Jan 14 17:35 my_mlp
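
A saved TorchScript module can be read back with `torch.jit.load`, without calling the Python factory that built it. The following is a minimal sketch of that round trip; the temporary directory and the check via `torch.allclose` are illustrative choices, not part of the original notebook.

```python
import os
import tempfile

import torch
from torch import nn

# Small MLP mirroring the one used above.
net = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                    nn.Linear(256, 128), nn.ReLU(),
                    nn.Linear(128, 2))
scripted = torch.jit.script(net)

x = torch.randn(size=(1, 512))
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'my_mlp')  # illustrative path
    scripted.save(path)
    # Reload from disk: the Python definition of the network is not needed.
    restored = torch.jit.load(path)

# The restored module should reproduce the original outputs.
outputs_match = torch.allclose(scripted(x), restored(x))
print(outputs_match)
```

The same file can, in principle, be loaded from a non-Python runtime, which is the portability benefit described above.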


# Asynchronous Computation

Today's computers are highly parallel systems, consisting of multiple CPU cores (often multiple threads per core), multiple processing elements per GPU, and often multiple GPUs per device.

Deep learning frameworks such as MXNet and TensorFlow adopt an asynchronous programming model to improve performance, while PyTorch uses Python's own scheduler, leading to a different performance trade-off.

For PyTorch, by default, GPU operations are asynchronous. When you call a function that uses the GPU, the operations are enqueued to the particular device, but not necessarily executed until later. This allows us to execute more computations in parallel, including operations on the CPU or other GPUs.
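
The enqueue-now, execute-later behavior can be imitated with a plain worker thread. This is only an analogy built from the standard library, not how PyTorch is implemented; `ToyDevice`, `launch`, and `slow_square` are hypothetical names introduced for illustration.

```python
import queue
import threading
import time


class ToyDevice:
    """Hypothetical stand-in for a device with an asynchronous work queue."""

    def __init__(self):
        self.tasks = queue.Queue()
        self.results = []
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # Backend loop: execute tasks in the order they were enqueued.
        while True:
            fn = self.tasks.get()
            self.results.append(fn())
            self.tasks.task_done()

    def launch(self, fn):
        # Enqueue only; returns before the work has run.
        self.tasks.put(fn)

    def synchronize(self):
        # Block until every enqueued task has finished.
        self.tasks.join()


def slow_square(v):
    time.sleep(0.05)  # pretend this is an expensive kernel
    return v * v


toy = ToyDevice()
start = time.perf_counter()
for v in range(4):
    toy.launch(lambda v=v: slow_square(v))
enqueue_time = time.perf_counter() - start

toy.synchronize()
total_time = time.perf_counter() - start
print(f'enqueue: {enqueue_time:.4f} sec, total: {total_time:.4f} sec')
```

Timing only the enqueue loop badly underestimates the real cost, which is why a benchmark of asynchronous work has to synchronize before stopping the clock.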

Understanding this lets us develop more efficient programs by proactively reducing computational requirements and mutual dependencies, which reduces memory overhead and increases processor utilization.

In [10]:
import os
import subprocess
import numpy
import torch
from torch import nn
from d2l import torch as d2l

Asynchrony via Backend

Generate a random matrix and multiply it by itself. The PyTorch tensor is placed on a GPU when one is available.

In [11]:
# Warmup for GPU computation
device = d2l.try_gpu()
a = torch.randn(size=(1000, 1000), device=device)
b = torch.mm(a, a)

with d2l.Benchmark('numpy'):
    for _ in range(10):
        a = numpy.random.normal(size=(1000, 1000))
        b = numpy.dot(a, a)

with d2l.Benchmark('torch'):
    for _ in range(10):
        a = torch.randn(size=(1000, 1000), device=device)
        b = torch.mm(a, a)

numpy: 1.3420 sec
torch: 0.4412 sec


In [None]:
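with d2l.Benchmark():
    for _ in range(10):
        a = torch.randn(size=(1000, 1000), device=device)
        b = torch.mm(a, a)
    # Wait for all queued GPU work so the timing covers the real computation.
    # Skipped on CPU, where operations already run synchronously.
    if device.type == 'cuda':
        torch.cuda.synchronize(device)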

Dependency graph

In [13]:
x = torch.ones((1, 2), device=device)
y = torch.ones((1, 2), device=device)
z = x * y + 2
z

tensor([[3., 3.]])

**Summary**:
- Deep learning frameworks may decouple the Python frontend from an execution backend. This allows for fast asynchronous insertion of commands into the backend and the associated parallelism.
- Asynchrony leads to a rather responsive frontend. However, take care not to overfill the task queue, since this may lead to excessive memory consumption. It is recommended to synchronize for each minibatch to keep frontend and backend approximately synchronized.
- Chip vendors offer sophisticated performance analysis tools that give much more fine-grained insight into the efficiency of deep learning.

# Automatic Parallelism

Deep learning frameworks automatically construct computational graphs at the backend. Using a computational graph, the system is aware of all the dependencies and can selectively execute multiple non-interdependent tasks in parallel to improve speed.
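
To make the idea concrete, the sketch below groups the nodes of a small dependency graph into "waves" of tasks that have no unmet dependencies and could therefore run in parallel. It uses the standard library's `graphlib` (Python 3.9+) as a stand-in for a framework's scheduler; the graph itself is a hypothetical example.

```python
from graphlib import TopologicalSorter  # available from Python 3.9

# Hypothetical graph: each key depends on the nodes in its value set.
# a and b are independent inputs, c and d each use one of them, e needs both.
graph = {
    'a': set(),
    'b': set(),
    'c': {'a'},
    'd': {'b'},
    'e': {'c', 'd'},
}

sorter = TopologicalSorter(graph)
sorter.prepare()

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose dependencies are all met
    waves.append(ready)                 # these could be dispatched in parallel
    sorter.done(*ready)                 # mark them finished to unlock successors

print(waves)
```

Tasks inside one wave do not depend on each other, so a backend is free to place them on different devices or streams.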

# Hardware

**Summary**:
- Devices have overheads for operations. Hence, it is important to aim for a small number of large transfers rather than many small ones. This applies to RAM, SSDs, networks and GPUs.
- Vectorization is key for performance. Make sure you are aware of the specific abilities of your accelerators. 
- Numerical overflow due to small data types can be a problem during training (and to a lesser extent during inference).
- Aliasing can significantly degrade performance.
- Match your algorithm to the hardware. Great speedup can be achieved when fitting the parameters into caches.
- It is recommended to sketch out the performance of a novel algorithm on paper before verifying the experimental results. Discrepancies of an order of magnitude or more are reasons for concern.
- Use profilers to debug performance bottlenecks.
- Training and inference hardware have different sweet spots in terms of price and performance.
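
The point about small data types can be checked without any accelerator. The sketch below round-trips Python floats through IEEE 754 half precision using the standard library's `struct` module; `to_float16` is a hypothetical helper. Note that `struct` raises `OverflowError` for values that do not fit, whereas array libraries typically produce `inf` instead.

```python
import struct


def to_float16(value):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack('<e', struct.pack('<e', value))[0]


largest = to_float16(65504.0)  # largest finite half-precision value
rounded = to_float16(4097.0)   # representable values are 4 apart in [4096, 8192)

try:
    to_float16(70000.0)        # beyond the half-precision range
    overflowed = False
except OverflowError:
    overflowed = True

print(largest, rounded, overflowed)
```

Both effects matter in training: large activations or gradients can overflow, and small updates added to large values can be rounded away.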

# Training on Multiple GPUs

By and large, data parallelism is the most convenient way to proceed.

**Summary**:
- There are multiple ways to split deep network training over multiple GPUs. We could split them between layers, across layers, or across data. The former two require tightly choreographed data transfers. Data parallelism is the simplest strategy.
- Data parallel training is straightforward. However, to stay efficient it requires a larger effective minibatch size, since each GPU needs enough data per step.
- In data parallelism, data are split across multiple GPUs. Each GPU executes its own forward and backward operation; the gradients are then aggregated and the results are broadcast back to the GPUs.
- We may use slightly increased learning rates for larger minibatches.
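
The data-parallel recipe in the summary can be traced on a one-parameter linear model in plain Python. This is a sketch of the arithmetic only: the "devices" are just list slices, and `gradient`, `num_devices`, and the toy data are assumptions made for illustration.

```python
def gradient(w, xs, ys):
    """Gradient of the summed loss 0.5 * (w * x - y) ** 2 with respect to w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
num_devices = 2  # hypothetical number of GPUs

# Split the minibatch into equally sized shards, one per device.
shard = len(xs) // num_devices
shards = [(xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
          for i in range(num_devices)]

# Each device computes a local gradient on its own shard.
local_grads = [gradient(w, sx, sy) for sx, sy in shards]

# Aggregate the local gradients; every device then applies the same update.
total_grad = sum(local_grads)
lr = 0.01
w_new = w - lr * total_grad / len(xs)

print(local_grads, total_grad, w_new)
```

Because the loss is a sum over examples, the aggregated gradient equals the gradient computed on the full minibatch, so the update matches single-device training on the same data.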

# Parameter Servers

**Summary**:
- Synchronization needs to be highly adaptive to the specific network infrastructure and connectivity within a server. This can make a significant difference to the time it takes to synchronize.
- Ring-synchronization can be optimal for p3 and DGX-2 servers. For others possibly not so much.
- A hierarchical synchronization strategy works well when adding multiple parameter servers for increased bandwidth.
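
Ring synchronization can be simulated in a few lines. The sketch below is a single-process model of a sum ring allreduce, assuming each of the n workers holds a gradient split into n chunks (one number per chunk here); it is not the implementation used by any particular communication library, and `ring_allreduce` is a hypothetical name.

```python
def ring_allreduce(buffers):
    """Simulate a sum ring allreduce over n workers, each holding n chunks."""
    n = len(buffers)
    data = [list(b) for b in buffers]  # copy so the inputs stay untouched

    # Phase 1, reduce-scatter: after n - 1 steps, worker i holds the
    # complete sum of chunk (i + 1) % n.
    for step in range(n - 1):
        # Collect all messages first so that the sends happen "simultaneously".
        msgs = [((i + 1) % n, (i - step) % n, data[i][(i - step) % n])
                for i in range(n)]
        for dst, chunk, value in msgs:
            data[dst][chunk] += value

    # Phase 2, all-gather: pass the completed chunks around the ring.
    for step in range(n - 1):
        msgs = [((i + 1) % n, (i + 1 - step) % n, data[i][(i + 1 - step) % n])
                for i in range(n)]
        for dst, chunk, value in msgs:
            data[dst][chunk] = value
    return data


# Three hypothetical workers, each with a gradient split into three chunks.
grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
result = ring_allreduce(grads)
print(result)
```

In every step each worker sends exactly one chunk to its neighbor, so all links of the ring are busy at once, which is what makes the scheme attractive when the hardware really is connected as a ring.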