In [24]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="2,3"

* **Day 2: Improve performance**
    - TF execution modes
    - GPU support
    - Improve performance with input data pipelines
    - Parallel execution across multiple devices
    - Distribution strategies

## Eager Execution
TensorFlow 2 executes operations in eager mode by default: an imperative programming environment that evaluates operations immediately, without building graphs.

Eager execution has several advantages:
* It is easy to use: it looks like NumPy (and the two are nicely integrated).
* You can always check what is going on by printing intermediate values.
* You can operate on tensors using Python control flow.

In [2]:
import tensorflow as tf
print("Eager execution:", tf.executing_eagerly())

Eager execution: True


In [3]:
import numpy as np

x_np = np.arange(6, dtype=np.float32).reshape((2,3))
x_tf = tf.constant(x_np)
print("x_tf:",x_tf)

y_tf = tf.math.exp(x_tf)
print("y_tf:",y_tf)

z_tf = tf.math.exp(x_np)
print("z_tf == y_tf ?", np.all(z_tf == y_tf))

x_tf: tf.Tensor(
[[0. 1. 2.]
 [3. 4. 5.]], shape=(2, 3), dtype=float32)
y_tf: tf.Tensor(
[[  1.          2.7182817   7.389056 ]
 [ 20.085537   54.59815   148.41316  ]], shape=(2, 3), dtype=float32)
z_tf == y_tf ? True


We can use Python control flow even while recording operations with `tf.GradientTape`.

In [4]:
def cond_op(x):
    if tf.reduce_any(x > 1): # Conditional
        return tf.reduce_max(x, axis=0) # First axis reduced
    else:
        return tf.reduce_mean(x, axis=-1) # Last axis reduced

x = tf.constant([[.3, .5]])

with tf.GradientTape() as tape:
    tape.watch(x)
    y = cond_op(x)

print('x:\t', x)
print('y:\t', y)
print('dy/dx:\t', tape.gradient(y, x))

x:	 tf.Tensor([[0.3 0.5]], shape=(1, 2), dtype=float32)
y:	 tf.Tensor([0.4], shape=(1,), dtype=float32)
dy/dx:	 tf.Tensor([[0.5 0.5]], shape=(1, 2), dtype=float32)


Eager execution makes development and debugging more interactive, but this can come at the expense of performance and deployability.

## Graph Mode

<img style="float: left;" src="https://miro.medium.com/max/504/1*SmfhKWHXHVEMg8KqNaj-uw.gif">
In contrast to the imperative environment offered by eager execution, there is graph mode. In this mode, computations are represented as symbolic graphs in which each node corresponds to an operation and data flows along the arrows.


The main benefits of graph mode can be summarized as:
- **Performance**: Computations and memory usage can be optimized.
- **Portability**: The dataflow graph is a language-independent representation of the code in your model.

### tf.function
TensorFlow 2 makes it easy to convert a Python program to a symbolic graph. With the `tf.function` API, a function can be compiled into a callable TensorFlow graph (usually called a *TensorFlow Function*).

**Note:** Keras APIs are already integrated with `tf.function`. When you call the `fit()` method on a model the execution is run by default in graph mode.

In [5]:
def simple_fn(x):
    y = x * 2
    return y

simple_fn

<function __main__.simple_fn(x)>

In [6]:
tf_simple_fn = tf.function(simple_fn)
tf_simple_fn

<tensorflow.python.eager.def_function.Function at 0x7f48098007f0>

Alternatively you can decorate the function definition:

In [7]:
@tf.function
def tf_other_fn(x):
    return x**2 + 1

tf_other_fn

<tensorflow.python.eager.def_function.Function at 0x7f48098006a0>

TF functions can be used exactly like Python functions: you can call them eagerly, you can compute gradients, and you can even call them with arguments of different types and shapes.

In [8]:
x = 4
print(tf_simple_fn(x))

x = tf.constant(5)
print(tf_simple_fn(x))

x = tf.constant([.3, -1.2, 1], dtype=tf.float64)
print(tf_simple_fn(x))

tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor([ 0.6 -2.4  2. ], shape=(3,), dtype=float64)


polymorphism.. WAT?

TensorFlow graphs are statically typed. Thus, every time a TF function is called with a new *input signature* (combination of argument shapes and types), a new *concrete function* (a graph with a single signature) is created. Given an input, we can get the concrete function corresponding to its input signature with the `get_concrete_function()` method.

In [9]:
x = 4
print(tf_simple_fn.get_concrete_function(x))

x = 3
print(tf_simple_fn.get_concrete_function(x))

x = tf.constant(5)
print(tf_simple_fn.get_concrete_function(x))

<tensorflow.python.eager.function.ConcreteFunction object at 0x7f480983ca00>
<tensorflow.python.eager.function.ConcreteFunction object at 0x7f4809896d00>
<tensorflow.python.eager.function.ConcreteFunction object at 0x7f48b7f3a250>


**Note:** This holds for Tensor arguments only. Python numeric arguments are treated as constants, so every new Python value generates a new graph! To avoid this, pass numeric arguments as Tensors whenever possible.
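The retracing behaviour described in the note can be observed with a small counter. This is a minimal sketch: `trace_log` and `double` are hypothetical names, and the counter relies on the fact that plain Python code inside a `tf.function` runs only while tracing.

```python
import tensorflow as tf

trace_log = []  # filled by a Python side effect, so only while tracing

@tf.function
def double(x):
    trace_log.append(x)  # runs once per trace, not once per call
    return x * 2

for value in (1, 2, 3):
    double(value)                # Python ints: one trace per distinct value
python_traces = len(trace_log)

for value in (1, 2, 3):
    double(tf.constant(value))   # int32 scalars: one shared trace
tensor_traces = len(trace_log) - python_traces

print("traces for Python ints:  ", python_traces)
print("traces for int32 tensors:", tensor_traces)
```

Three Python values trigger three traces, while three scalar tensors with the same shape and dtype share a single one.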

The underlying graph of a concrete function can be accessed via the `graph` attribute. Each graph contains a set of `tf.Operation` objects, which represent units of computation (the nodes in the graph). The `get_operations()` method returns the list of operations constituting a graph:

In [10]:
x = tf.constant([3.])
g = tf_simple_fn.get_concrete_function(x).graph
g.get_operations()

[<tf.Operation 'x' type=Placeholder>,
 <tf.Operation 'mul/y' type=Const>,
 <tf.Operation 'mul' type=Mul>,
 <tf.Operation 'Identity' type=Identity>]

In [11]:
print(g.get_operations()[1])

name: "mul/y"
op: "Const"
attr {
  key: "dtype"
  value {
    type: DT_FLOAT
  }
}
attr {
  key: "value"
  value {
    tensor {
      dtype: DT_FLOAT
      tensor_shape {
      }
      float_val: 2.0
    }
  }
}



In [12]:
g = tf_simple_fn.get_concrete_function(5).graph
g.get_operations()

[<tf.Operation 'Const' type=Const>, <tf.Operation 'Identity' type=Identity>]
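One possible way to keep the number of graphs under control is to declare the accepted signature up front with the `input_signature` argument of `tf.function`. The sketch below is illustrative only (`scale` and `traces` are hypothetical names): a `None` dimension lets vectors of any length share one graph.

```python
import tensorflow as tf

traces = []  # one entry per trace

# Accept any 1-D float32 tensor; None leaves the length unspecified.
@tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.float32)])
def scale(x):
    traces.append(x.shape)  # Python side effect: runs only while tracing
    return x * 2.0

short = scale(tf.constant([1.0, 2.0]))
longer = scale(tf.constant([1.0, 2.0, 3.0]))  # different length, same graph

print("number of traces:", len(traces))
print(longer.numpy())
```

Arguments that do not match the declared signature are rejected instead of silently triggering a new trace.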

### Tracing

When the argument has a new signature, a *symbolic tensor* (a tensor with name, shape and type but without values) is passed to the function and every `tf` operation encountered adds a `tf.Operation` node to a new graph.

In particular, this means that every call to an external library (even NumPy or the Python standard library) will be executed **only during tracing** and will not be added to the graph!

In [13]:
@tf.function
def print_fn(x):
    print("Tracing! Argument:", x)
    tf.print("Executing! Argument:", x)

print_fn(tf.constant([3], dtype=tf.float32))
print_fn(tf.constant([1.5], dtype=tf.float32))

Tracing! Argument: Tensor("x:0", shape=(1,), dtype=float32)
Executing! Argument: [3]
Executing! Argument: [1.5]


In [14]:
print_fn(1)
print_fn(2)

print_fn(tf.constant([3]))
print_fn(tf.constant([-2]))

Tracing! Argument: 1
Executing! Argument: 1
Tracing! Argument: 2
Executing! Argument: 2
Tracing! Argument: Tensor("x:0", shape=(1,), dtype=int32)
Executing! Argument: [3]
Executing! Argument: [-2]


The different behaviour of `print` and `tf.print` turns out to be very helpful when tracking down issues that only appear within `tf.function`. But remember to be very careful when calling an external library.

In [15]:
@tf.function
def my_rand():
    x = np.random.randint(1000)
    print("Tracing - Sampled:", x)
    tf.print("Executing - Sampled:", x)
    
my_rand()
my_rand()

Tracing - Sampled: 666
Executing - Sampled: 666
Executing - Sampled: 666
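A possible remedy is to sample with a TensorFlow op, which becomes part of the graph and is evaluated on every call. The sketch below contrasts the two behaviours; `frozen_rand` and `graph_rand` are hypothetical names.

```python
import numpy as np
import tensorflow as tf

@tf.function
def frozen_rand():
    # NumPy runs only while tracing: the sample is stored in the graph as a constant.
    return tf.constant(np.random.randint(1000))

@tf.function
def graph_rand():
    # A TensorFlow op is a graph node and is re-evaluated at every execution.
    return tf.random.uniform([], minval=0, maxval=1000, dtype=tf.int32)

frozen = [int(frozen_rand()) for _ in range(5)]
fresh = [int(graph_rand()) for _ in range(5)]

print("NumPy inside tf.function:", frozen)  # the same value five times
print("tf.random.uniform:       ", fresh)   # sampled again on each call
```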


### AutoGraph
AutoGraph is another component that comes into play with `tf.function`. It performs a first conversion of plain Python into TensorFlow code:

- `for`/`while` -> `tf.while_loop` (break and continue are supported)
- `if` -> `tf.cond`

This makes it possible to trace data-dependent control flow in the code and have it operate dynamically at execution time.

In [16]:
@tf.function(autograph=True) # default: autograph=True
def tf_cond_fn(x):
    if tf.shape(x)[0] > 1: # Conditional
        print("Tracing - True - Argument:", x)
        return tf.reduce_max(x, axis=0)
    else:
        print("Tracing - False  - Argument:", x)
        return tf.reduce_mean(x, axis=-1)

tf_cond_fn(tf.constant([[0.3], [0.5]]))

Tracing - True - Argument: Tensor("x:0", shape=(2, 1), dtype=float32)
Tracing - False  - Argument: Tensor("x:0", shape=(2, 1), dtype=float32)


<tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.5], dtype=float32)>
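Loops are converted in the same way. The following sketch (the function name `sum_even_squares` is made up for illustration) loops over a tensor-dependent range, so the number of iterations is decided at execution time and the same graph serves different values of `n`.

```python
import tensorflow as tf

@tf.function
def sum_even_squares(n):
    total = tf.constant(0)
    for i in tf.range(n):     # tensor-dependent loop, converted by AutoGraph
        if i % 2 == 0:        # tensor-dependent condition, converted as well
            total += i * i
    return total

print(int(sum_even_squares(tf.constant(5))))  # 0 + 4 + 16 = 20
print(int(sum_even_squares(tf.constant(7))))  # 0 + 4 + 16 + 36 = 56

# Inspect the traced graph: the Python loop appears as a loop operation.
graph = sum_even_squares.get_concrete_function(tf.constant(0)).graph
loop_ops = [op.type for op in graph.get_operations() if "While" in op.type]
print(loop_ops)
```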

### Variables
Creating a new `Variable` in a function works in eager mode but not in graph mode!
>Due to tracing semantics, `tf.function` will reuse the same variable each call, but eager mode will create a new variable with each call. To guard against this mistake, tf.function will raise an error if it detects dangerous variable creation behavior.

In [17]:
@tf.function
def tf_new_var(x):
    v = tf.Variable(0.)
    v.assign_add(x)
    return v

# tf_new_var(1)  # uncommenting this call raises a ValueError
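A common way around this limitation is to create the variable once, outside the traced function, and only update it inside. The sketch below uses a `tf.Module` subclass for this; the class name `Accumulator` is hypothetical.

```python
import tensorflow as tf

class Accumulator(tf.Module):
    def __init__(self):
        super().__init__()
        # Created once, in eager mode, before any tracing happens.
        self.total = tf.Variable(0.0)

    @tf.function
    def add(self, x):
        # Inside the graph the existing variable is only updated, never created.
        self.total.assign_add(x)
        return self.total.read_value()

acc = Accumulator()
acc.add(tf.constant(1.0))
acc.add(tf.constant(2.5))
print(acc.total.numpy())  # 3.5
```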

### Summary
- Debug in eager mode: it is easier.
- Then decorate with `@tf.function` to get performant and portable models.

To convert a Python function to a TF function, remember:
1. Be careful when using external libraries. Try to use `tf` methods everywhere!
2. Don't rely on Python side effects like object mutation or list appends.
3. Variable creation in a TF function is allowed only on the first call. Avoid it!
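To debug a function that is already decorated, one option is `tf.config.run_functions_eagerly`, assuming TensorFlow 2.3 or later (earlier releases expose it under an experimental name). The sketch below also illustrates point 2: the list append runs on every call in eager mode, but only once, during tracing, in graph mode. The names `calls` and `add_one` are hypothetical.

```python
import tensorflow as tf

calls = []

@tf.function
def add_one(x):
    calls.append(1)  # Python side effect
    return x + 1

tf.config.run_functions_eagerly(True)   # run the Python body directly
add_one(tf.constant(1))
add_one(tf.constant(2))
eager_calls = len(calls)

tf.config.run_functions_eagerly(False)  # back to graph mode
add_one(tf.constant(1))
add_one(tf.constant(2))
graph_calls = len(calls) - eager_calls

print("appends in eager mode:", eager_calls)  # 2, one per call
print("appends in graph mode:", graph_calls)  # 1, only while tracing
```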

# GPU support

Since version 2.1, TensorFlow has a single installation procedure for both CPU-only and GPU machines. Nevertheless, GPU support has both hardware requirements (NVIDIA GPUs only) and software requirements (NVIDIA drivers, CUDA 10.1, cuDNN 7.6).

If the above requirements are satisfied, TensorFlow code and `tf.keras` models will transparently run on a single GPU with **no code changes**.

You can check the available devices via `tf.config.experimental.get_visible_devices()`.

In [19]:
tf.config.experimental.get_visible_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]

By default, every TensorFlow operation and every tensor that has a GPU implementation (usually referred to as a GPU kernel) will be placed on the first GPU ("GPU:0").

In [20]:
x = tf.Variable([[3., 2.], [1., 7.]])  # 2x2 float32 variable
x.device

'/job:localhost/replica:0/task:0/device:GPU:0'

If there are no GPU kernels for the variable or operation, it will be placed on the CPU. This is the case, for example, for integer tensors:

In [21]:
x = tf.Variable([[3, 2], [1, 7]])  # 2x2 int32 variable
x.device

'/job:localhost/replica:0/task:0/device:CPU:0'

To find out which devices your operations and tensors are assigned to, you can use `tf.debugging.set_log_device_placement(True)` as the first statement of your program. It will print placement logging for any Tensor allocations or operations.
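As a minimal sketch of placement logging, the snippet below enables it, runs one operation, and reads the device back from the result. The log lines themselves depend on the machine, so none are reproduced here.

```python
import tensorflow as tf

tf.debugging.set_log_device_placement(True)   # log the device of every operation

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)     # the placement is logged when the operation runs
print(b.device)         # the device that holds the result

tf.debugging.set_log_device_placement(False)  # switch the logging off again
```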

However, when possible, the default behaviour can be overridden by explicitly placing tensors and operations on specific devices using the `tf.device` context manager.

In [22]:
with tf.device('/CPU:0'):
    x = tf.Variable([[3., 2.], [1., 7.]])  # 2x2 float32 variable
x.device

'/job:localhost/replica:0/task:0/device:CPU:0'

In [23]:
with tf.device('/GPU:0'):
    x = tf.Variable([[3, 2], [1, 7]])  # 2x2 int32 variable
x.device

'/job:localhost/replica:0/task:0/device:CPU:0'

As you can see, in the last case our request was not satisfied: the variable was silently placed on the CPU. If you would rather get an error, call `tf.config.set_soft_device_placement(False)` at the very beginning of your code.

By default, at the first computation, TensorFlow grabs nearly all of the memory of all GPUs visible to the process. This means that if you try to run a second TensorFlow program on the same machine, you will most probably get an error. There are a couple of things you can do to avoid this problem:

- If you have multiple GPUs available, you can limit TensorFlow to a specific set of GPUs by:
    1. setting the `CUDA_VISIBLE_DEVICES` environment variable (before any computation).
    2. using the `tf.config.experimental.set_visible_devices()` method.
    
- To control the memory allocated by a TensorFlow program:
    1. with `tf.config.experimental.set_memory_growth` the process attempts to allocate only as much GPU memory as needed.
    2. You can create a virtual device with `tf.config.set_logical_device_configuration` and set its memory limit.
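The options above can be combined as in the following sketch, which keeps only the first GPU visible and enables memory growth on it. It is written to run on a machine without GPUs too, in which case it changes nothing.

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
try:
    if gpus:
        # Hide every GPU but the first one from this process.
        tf.config.set_visible_devices(gpus[:1], "GPU")
        # Allocate GPU memory on demand instead of grabbing it all up front.
        tf.config.experimental.set_memory_growth(gpus[0], True)
except RuntimeError as err:
    # Both settings must be applied before the GPUs are initialized.
    print("Too late to reconfigure the devices:", err)

visible = tf.config.get_visible_devices("GPU")
print(len(gpus), "physical GPU(s),", len(visible), "visible")
```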
    
You can even simulate multiple GPUs by creating multiple logical devices from a single physical GPU. Check the [docs](https://www.tensorflow.org/guide/gpu)!
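A minimal sketch of this idea follows. It splits the first GPU into two logical devices of 1024 MB each (an arbitrary limit chosen for illustration); as an assumption to keep the example runnable everywhere, it falls back to splitting the CPU when no GPU is present. The configuration must happen before TensorFlow initializes its devices.

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    device, kind = gpus[0], "GPU"
    # Two logical GPUs, each limited to 1024 MB of the physical memory.
    parts = [tf.config.LogicalDeviceConfiguration(memory_limit=1024) for _ in range(2)]
else:
    device, kind = tf.config.list_physical_devices("CPU")[0], "CPU"
    # Fallback for machines without a GPU: two logical CPUs.
    parts = [tf.config.LogicalDeviceConfiguration() for _ in range(2)]

tf.config.set_logical_device_configuration(device, parts)

logical = tf.config.list_logical_devices(kind)
print(logical)
```

The resulting logical devices can be used to try out multi-device code, for example a distribution strategy, on a single physical device.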

# tf.data
When training models on large datasets, as is often the case in deep learning, the bottleneck shifts from the computations required by the model to the delivery of data to the computing devices. Especially when working with accelerators, it is crucial to have the next batch ready before the current training step has finished, to avoid idle time.

For this purpose the TensorFlow module `tf.data` provides a set of APIs that enables you to build complex input pipelines from simple, reusable pieces. In particular, it makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations in a flexible and efficient way.

## Dataset
The main abstraction in `tf.data` is `Dataset`. This API is used to represent a potentially large sequence of elements, in which each element consists of one or more components. All elements, however, must have the same (nested) structure.

`Dataset` usage follows a common pattern:
1. Create a source dataset from your input data.
2. Apply dataset transformations to preprocess the data.
3. Iterate over the dataset and process the elements.
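The three steps above can be sketched with a toy in-memory source; the data and the squaring transformation are arbitrary choices for illustration.

```python
import numpy as np
import tensorflow as tf

# 1. Create a source dataset from in-memory data.
ds = tf.data.Dataset.from_tensor_slices(np.arange(6))

# 2. Apply transformations: square each element, then group into batches of 4.
ds = ds.map(lambda x: x * x).batch(4)

# 3. Iterate over the dataset and process the elements.
batches = [batch.numpy().tolist() for batch in ds]
print(batches)  # [[0, 1, 4, 9], [16, 25]]
```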



There are different ways to create a dataset from source:
- If the input data fits in memory, `Dataset.from_tensor_slices` creates datasets from array-like objects

In [25]:
dataset = tf.data.Dataset.from_tensor_slices(np.arange(20).reshape(-1,2))
dataset # tensors are sliced along their first dimension

<TensorSliceDataset shapes: (2,), types: tf.int64>

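- `Dataset.list_files` creates a dataset of all files matching a pattern. By default the file names are returned in shuffled order; pass `shuffle=False` for a deterministic order.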

In [26]:
dataset = tf.data.Dataset.list_files('./*')
dataset

<ShuffleDataset shapes: (), types: tf.string>

- To process lines from files, use `TextLineDataset`.
- If the files are written in `TFRecord` format, use `TFRecordDataset`.
- As always.. Much more in the [docs](https://www.tensorflow.org/api_docs/python/tf/data/Dataset)!
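As a small illustration of a file-based source, the sketch below writes a temporary text file (its name and content are made up) and reads it back line by line with `TextLineDataset`.

```python
import os
import tempfile

import tensorflow as tf

# Write a small text file to read back (hypothetical content).
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

# Each element of the dataset is one line of the file, as a string tensor.
ds = tf.data.TextLineDataset(path)
lines = [line.decode() for line in ds.as_numpy_iterator()]
print(lines)  # ['first line', 'second line', 'third line']
```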

### Iterable object
In particular, `Dataset` objects are Python iterables. This makes it possible to consume their elements using a `for` loop:

In [27]:
dataset = tf.data.Dataset.list_files('./*')
for tensor in dataset:
    print(tensor)

tf.Tensor(b'./slides.ipynb', shape=(), dtype=string)
tf.Tensor(b'./3.Dataset_basics.ipynb', shape=(), dtype=string)
tf.Tensor(b'./README.md', shape=(), dtype=string)
tf.Tensor(b'./4.Dataset_performance.ipynb', shape=(), dtype=string)
tf.Tensor(b'./5.Dataset_images.ipynb', shape=(), dtype=string)
tf.Tensor(b'./__pycache__', shape=(), dtype=string)
tf.Tensor(b'./Untitled.ipynb', shape=(), dtype=string)
tf.Tensor(b'./utils.py', shape=(), dtype=string)
tf.Tensor(b'./.ipynb_checkpoints', shape=(), dtype=string)
tf.Tensor(b'./6.ResNet_distributed.ipynb', shape=(), dtype=string)
tf.Tensor(b'./model_toy.py', shape=(), dtype=string)
tf.Tensor(b'./1.Eager_Graph_benchmarks.ipynb', shape=(), dtype=string)
tf.Tensor(b'./0.ResNet_custom_training.ipynb', shape=(), dtype=string)
tf.Tensor(b'./2.GPU_support.ipynb', shape=(), dtype=string)
tf.Tensor(b'./model_ResNetv2.py', shape=(), dtype=string)


Equivalently, we can create an iterator via the Python built-in `iter` and get the next element by calling `next`:

In [28]:
ds_iterator = iter(dataset)
next(ds_iterator)

<tf.Tensor: shape=(), dtype=string, numpy=b'./model_toy.py'>

An alternative way to inspect the content of the dataset is the `as_numpy_iterator()` method. It returns an iterator which converts all elements of the dataset to numpy.

In [29]:
list(dataset.as_numpy_iterator())

[b'./2.GPU_support.ipynb',
 b'./.ipynb_checkpoints',
 b'./model_toy.py',
 b'./4.Dataset_performance.ipynb',
 b'./Untitled.ipynb',
 b'./model_ResNetv2.py',
 b'./slides.ipynb',
 b'./3.Dataset_basics.ipynb',
 b'./README.md',
 b'./0.ResNet_custom_training.ipynb',
 b'./6.ResNet_distributed.ipynb',
 b'./__pycache__',
 b'./utils.py',
 b'./1.Eager_Graph_benchmarks.ipynb',
 b'./5.Dataset_images.ipynb']

### Transformations
Another way to construct a dataset is to apply data transformations to one or more existing `Dataset` objects. There are many possible transformations; the most common are:

- `map()` applies a function to each element in the dataset.
- `filter()` filters a dataset according to a predicate.
- `batch()` combines consecutive elements of this dataset into batches.
- `repeat()` repeats a dataset, possibly indefinitely.
- `shuffle()` randomly shuffles the elements.
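A short sketch of how these transformations chain together, using `Dataset.range` as a toy source:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

evens = ds.filter(lambda x: x % 2 == 0)   # keep 0, 2, 4, 6, 8
batched = evens.repeat(2).batch(3)        # two passes, batches of up to 3 elements
batches = [b.tolist() for b in batched.as_numpy_iterator()]
print(batches)  # [[0, 2, 4], [6, 8, 0], [2, 4, 6], [8]]

# shuffle() changes the order of the elements but not the content.
shuffled = list(ds.shuffle(buffer_size=10).as_numpy_iterator())
print(sorted(shuffled) == list(range(10)))  # True
```

Note that the order of the transformations matters: batching after `repeat()` lets a batch span two passes over the data, as the second batch shows.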

We will see some of those in action in the notebook [3.Dataset_basics](./3.Dataset_basics.ipynb), while the notebook [4.Dataset_performance.ipynb](4.Dataset_performance.ipynb) discusses performance improvements.

# Multiple devices

There are several reasons to use multiple devices in Deep Learning:

- To run the data preprocessing on the CPU and the compute-intensive part on a GPU.
- To train the same model simultaneously on different devices to tune its hyper-parameters.
- To train different models on different devices and then collect the results in an ensemble model.
- To speed-up the training of a model.


## Training a model on multiple devices

There are two main approaches to distributed training:

- **Model parallelism**: The model is split into chunks and each chunk is assigned to a different device.
- **Data parallelism**: The model is replicated across the devices and each device processes a chunk of data.
![decomposition_dimensions](https://cdn.filestackcontent.com/a5diV7cSQ9iM1m8OvMRu)

### Model parallelism
The efficiency of model parallelism depends **strongly** on the network architecture! It can be performed on two dimensions:
- **Layer parallelism**: Each device holds a consecutive block of layers. In the picture, each colour represents a different GPU.
![layer_parallelism](https://secureservercdn.net/198.12.145.239/a7b.fcb.myftpupload.com/wp-content/uploads/2020/03/Picture1s-9_1.svg)

- **Vertical Split**: Layers are split on multiple devices.
![vertical_parallelism](https://docs.chainer.org/en/stable/_images/parallelism.png)
- **Hybrid methods**:
![](https://fananymi.files.wordpress.com/2015/03/dlbig.png)
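Layer parallelism can be sketched by hand with `tf.device`. The toy two-layer network below is hypothetical, and the code assumes that running both blocks on the same device is an acceptable fallback when fewer than two GPUs are available.

```python
import tensorflow as tf

gpus = tf.config.list_logical_devices("GPU")
dev_a = gpus[0].name if len(gpus) > 0 else "/CPU:0"
dev_b = gpus[1].name if len(gpus) > 1 else dev_a  # fallback: share one device

# Each block of layers keeps its weights on its own device.
with tf.device(dev_a):
    w1 = tf.Variable(tf.random.normal([8, 16]))
with tf.device(dev_b):
    w2 = tf.Variable(tf.random.normal([16, 4]))

def forward(x):
    with tf.device(dev_a):
        h = tf.nn.relu(tf.matmul(x, w1))  # first block of layers
    with tf.device(dev_b):
        return tf.matmul(h, w2)           # second block: h is copied across devices

y = forward(tf.random.normal([32, 8]))
print(y.shape, "| devices:", dev_a, dev_b)
```

The second device has to wait for the activations of the first one, which is one reason why the efficiency depends so much on the architecture.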

### Data parallelism

Model parallelism can be unavoidable when the model is really big, but when the model fits in the memory of each device there is little point in splitting it. A different approach is to split the data: each device has its own copy of the model (called a **replica**), receives a minibatch, and computes the weight updates via backpropagation. The update can be performed in two ways:

- **Collective AllReduce** - Each device receives the average update and performs a **synchronous update**.

- Communicating with a **parameter server** - A special worker is designated to receive the updates and send the new weights back to the devices.

With a parameter server there is no need to wait for all devices, and the weight updates can be performed **asynchronously**. In particular, this means that the replicas may hold different weights at the same time!
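The synchronous case can be simulated in plain NumPy. The sketch below assumes a linear model with a mean squared error loss and two replicas receiving equally sized chunks of the batch: averaging the local gradients (the result of the all-reduce) gives the gradient of the whole batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # global minibatch of 8 samples
y = rng.normal(size=8)
w = np.zeros(3)               # every replica starts from the same weights

def gradient(X_part, y_part, w):
    # Gradient of the mean squared error on one chunk of data.
    return 2.0 * X_part.T @ (X_part @ w - y_part) / len(y_part)

# Each of the two replicas computes a gradient on its own half of the batch.
shards = zip(np.split(X, 2), np.split(y, 2))
local_grads = [gradient(X_s, y_s, w) for X_s, y_s in shards]

# All-reduce: every replica ends up with the same averaged gradient.
avg_grad = np.mean(local_grads, axis=0)
print(np.allclose(avg_grad, gradient(X, y, w)))  # True
```

Since all replicas apply the same averaged gradient, their weights stay identical after every step.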

## Distribution strategies
`tf.distribute.Strategy` is an API designed to distribute training across multiple devices. Although it doesn't cover all use cases, it is really easy to use and makes it simple to switch between strategies.

It can be used to distribute the execution of Keras APIs and custom training loops. Moreover, it works in both eager and graph mode (although it works best when used with `tf.function`), and it does not require massive code changes:

In [30]:
from model_toy import get_toy_ResNet

N_train = 60000
x_train = tf.random.uniform([N_train,32,32,3])
y_train = tf.random.uniform([N_train,1], minval=0, maxval=9, dtype=tf.int32)

mirrored_strategy = tf.distribute.MirroredStrategy()

with mirrored_strategy.scope():
    dist_model = get_toy_ResNet()
    dist_model.compile(loss='sparse_categorical_crossentropy', optimizer="RMSProp")
    
history = dist_model.fit(x=x_train, y=y_train, batch_size=32, epochs=2, verbose=1)

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
Epoch 1/2
INFO:tensorflow:batch_all_reduce: 27 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 27 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
Epoch 2/2


In [31]:
central_storage_strategy = tf.distribute.experimental.CentralStorageStrategy()

with central_storage_strategy.scope():
    dist_model = get_toy_ResNet()
    dist_model.compile(loss='sparse_categorical_crossentropy', optimizer="RMSProp")
    
history = dist_model.fit(x=x_train, y=y_train, batch_size=32, epochs=2, verbose=1)

INFO:tensorflow:ParameterServerStrategy (CentralStorageStrategy if you are using a single machine) with compute_devices = ['/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1'], variable_device = '/device:CPU:0'
Epoch 1/2
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/job:localhost/replica
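The cells above rely on the Keras `fit()` API. The building blocks of a custom loop can be sketched as follows, assuming TensorFlow 2.2 or later for `strategy.run`. This is not a training step: each replica only sums its share of a toy batch, and the partial results are then combined across replicas.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # all visible GPUs, or the CPU if there are none
print("replicas:", strategy.num_replicas_in_sync)

# One global batch of 8 elements, split among the replicas.
dataset = tf.data.Dataset.range(8).batch(8)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

@tf.function
def distributed_sum(batch):
    # Run the computation on every replica, each with its own share of the batch.
    per_replica = strategy.run(tf.reduce_sum, args=(batch,))
    # Combine the per-replica results into a single value.
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)

totals = [int(distributed_sum(batch)) for batch in dist_dataset]
print(totals)  # [28], regardless of the number of replicas
```

In a real training loop the function passed to `strategy.run` would compute the loss and the gradients on the local share of the batch.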

# Exercise

In the file [model_ResNetv2.py](model_ResNetv2.py) you can find an accurate implementation of ResNet v2 (from [here](https://keras.io/examples/cifar10_resnet/)). Create a `tf.data.Dataset` containing CIFAR10, apply some basic image preprocessing to it (avoid rotation!) and train a ResNet56_v2 model on it for at least 30 epochs. Distribute the training over multiple GPUs (try with at least 2) and test different batch sizes. How does the time per epoch change? What about the validation accuracy?