In [1]:
from matplotlib import pyplot as plt
import pandas as pd
import random
import time
import math
import d2l
import os

from mxnet import autograd, np, npx, gluon, init
from mxnet.gluon import loss as gloss
from mxnet.gluon import nn
npx.set_np()

#  05. Deep Learning Computation
Alongside giant datasets and powerful hardware, great software tools have played an indispensable role in the rapid progress of deep learning. Starting with the pathbreaking `Theano` library released in 2007, flexible open-source tools have enabled researchers to rapidly prototype models, avoiding repetitive work when recycling standard components while still maintaining the ability to make low-level modifications. Over time, deep learning's libraries have evolved to offer increasingly coarse abstractions. Just as semiconductor designers went from specifying transistors to logical circuits to writing code, neural network researchers have moved from thinking about the behavior of individual artificial neurons to conceiving of networks in terms of whole layers, and now often design architectures with far coarser blocks in mind.

So far, we have introduced some basic machine learning concepts, ramping up to fully-functional deep learning models. In the last chapter, we implemented each component of a multilayer perceptron from scratch and even showed how to leverage `MXNet`'s `Gluon` library to roll out the same models effortlessly. To get you that far that fast, we called upon the libraries, but skipped over more advanced details about how they work. In this chapter, we will peel back the curtain, digging deeper into the key components of deep learning computation, namely model construction, parameter access and initialization, designing custom layers and blocks, reading and writing models to disk, and leveraging GPUs to achieve dramatic speedups. These insights will move you from end user to power user, giving you the tools needed to reap the benefits of a mature deep learning library while retaining the flexibility to implement more complex models, including those you invent yourself! While this chapter does not introduce any new models or datasets, the advanced modeling chapters that follow rely heavily on these techniques.

## 5.1 Layers and Blocks
When we first introduced neural networks, we focused on linear models with a single output. Here, the entire model consists of just a single neuron. Note that a single neuron:
1. takes some set of inputs
2. generates a corresponding (scalar) output
3. has a set of associated parameters that can be updated to optimize some objective function of interest 

Then, once we started thinking about networks with multiple outputs, we leveraged vectorized arithmetic to characterize an entire layer of neurons. Just like individual neurons, layers 
1. take a set of inputs
2. generate corresponding outputs
3. are described by a set of tunable parameters
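
To make the parallel concrete, here is a minimal plain-Python sketch that uses no framework. The names `neuron` and `layer` are illustrative only and do not belong to any library. Under these assumptions, a layer is simply several neurons that read the same inputs.

```python
def neuron(inputs, weights, bias):
    """A single neuron: weighted sum of the inputs plus a bias (scalar output)."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def layer(inputs, weight_rows, biases):
    """A layer: one neuron per row of weights, all reading the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


inputs = [1.0, 2.0]
single = neuron(inputs, weights=[0.5, -1.0], bias=0.25)
outputs = layer(inputs,
                weight_rows=[[0.5, -1.0], [2.0, 0.0]],
                biases=[0.25, -1.0])
print(single)   # scalar output of one neuron
print(outputs)  # one output per neuron in the layer
```

Frameworks replace the Python loop with vectorized matrix arithmetic, but the three ingredients (inputs, outputs, parameters) stay the same.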

When we worked through softmax regression, a single layer was itself the model. However, even when we subsequently introduced multilayer perceptrons, we could still think of the model as retaining this same basic structure.

Interestingly, for multilayer perceptrons, both the entire model and its constituent layers share this structure. The (entire) model takes in raw inputs (the features), generates outputs (the predictions), and possesses parameters (the combined parameters from all constituent layers). Likewise, each individual layer ingests inputs (supplied by the previous layer), generates outputs (the inputs to the subsequent layer), and possesses a set of tunable parameters that are updated according to the signal that flows backwards from the subsequent layer.

While you might think that neurons, layers, and models give us enough abstractions to go about our business, it turns out that we often find it convenient to speak about components that are larger than an individual layer but smaller than the entire model. For example, the `ResNet-152` architecture, which is wildly popular in computer vision, possesses hundreds of layers. These layers consist of repeating patterns of groups of layers. Implementing such a network one layer at a time can grow tedious. This concern is not just hypothetical---such design patterns are common in practice. The `ResNet` architecture mentioned above won the 2015 ImageNet and COCO computer vision competitions for both recognition and detection (`He.Zhang.Ren.ea.2016`) and remains a go-to architecture for many vision tasks. Similar architectures in which layers are arranged in various repeating patterns are now ubiquitous in other domains, including natural language processing and speech.

To implement these complex networks, we introduce the concept of a neural network block. A block could describe a single layer, a component consisting of multiple layers, or the entire model itself! One benefit of working with the block abstraction is that blocks can be combined into larger artifacts, often recursively (see the illustration in `Fig. 5.1.1`).

<img src="images/05_01.png" style="width:600px;"/>

By defining code to generate `Blocks` of arbitrary complexity on demand, we can write surprisingly compact code and still implement complex neural networks.

From a software standpoint, a `Block` is a `class`. Any subclass of Block must define a `forward` method that transforms its input into output and must store any necessary parameters. Note that some Blocks do not require any parameters at all! Finally, a Block must possess a `backward` method for calculating gradients. Fortunately, thanks to some behind-the-scenes magic supplied by the `autograd` package (introduced in `Chapter 2`), when defining our own Block we only need to worry about parameters and the `forward` method.

To begin, we revisit the Blocks that we used to implement multilayer perceptrons (`Section 4.3`). The following code generates a network with one fully-connected hidden layer with 256 units and ReLU activation, followed by a fully-connected *output layer* with 10 units (no activation function). 

In [2]:
x = np.random.uniform(size=(2, 20))

net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'))
net.add(nn.Dense(10))
net.initialize()
net(x)

array([[ 0.06240272, -0.03268593,  0.02582653,  0.02254182, -0.03728798,
        -0.04253786,  0.00540613, -0.01364186, -0.09915452, -0.02272738],
       [ 0.02816677, -0.03341204,  0.03565666,  0.02506382, -0.04136416,
        -0.04941845,  0.01738528,  0.01081961, -0.09932579, -0.01176298]])

In this example, we constructed our model by instantiating an `nn.Sequential` and assigning the returned object to the `net` variable. Next, we repeatedly called its `add` method, appending layers in the order in which they should be executed. In short, `nn.Sequential` defines a special kind of Block that maintains an ordered list of constituent Blocks. The `add` method simply appends each successive Block to that list. Note that each layer is an instance of the `Dense` class, which is itself a subclass of Block. The `forward` method is also remarkably simple: it chains the Blocks in the list together, passing the output of each as the input to the next. Note that until now, we have been invoking our models via the construction `net(x)` to obtain their outputs. This is essentially shorthand for `net.forward(x)`, a slick Python trick achieved via the Block class's `__call__` method.
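
The `__call__` trick can be illustrated without any framework. The sketch below is a hypothetical stand-in (`CallableBlock` and `Scale` are names invented here), not Gluon's implementation; the real `__call__` may perform additional bookkeeping around the call to `forward`.

```python
class CallableBlock:
    """Hypothetical stand-in for a Block base class (not Gluon's implementation)."""

    def __call__(self, *args):
        # Calling the object like a function delegates to forward
        return self.forward(*args)

    def forward(self, *args):
        raise NotImplementedError


class Scale(CallableBlock):
    """Toy block that multiplies every input entry by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def forward(self, x):
        return [self.factor * v for v in x]


double = Scale(2)
# Both spellings run the same forward computation
print(double([1, 2, 3]))
print(double.forward([1, 2, 3]))
```

Subclasses only supply `forward`; the calling convention is inherited from the base class.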

### 5.1.1 A Custom Block
Perhaps the easiest way to develop intuition about how a block works is to implement one ourselves. Before we implement our own custom block, we briefly summarize the basic functionality that each block must provide:
1. Ingest input data as arguments to its forward method.
2. Generate an output by having forward return a value. Note that the output may have a different shape from the input. For example, the first Dense layer in our model above ingests an input of arbitrary dimension but returns an output of dimension 256.
3. Calculate the gradient of its output with respect to its input, which can be accessed via its backward method. Typically this happens automatically.
4. Store and provide access to those parameters necessary to execute the forward computation.
5. Initialize these parameters as needed.

In the following snippet, we code up a block from scratch corresponding to a multilayer perceptron with one hidden layer of 256 hidden units and a 10-dimensional output layer. Note that the `MLP` class below inherits from `nn.Block`, the class that represents a block. We will rely heavily on the parent class's methods, supplying only our own `__init__` and `forward` methods.

In [3]:
class MLP(nn.Block):
    # Declare a layer with model parameters. Here, we declare two fully connected layers
    def __init__(self, **kwargs):
        # Call the constructor of the MLP parent class Block to perform the necessary initialization. 
        # In this way, other function parameters can also be specified when constructing an instance, 
        # such as the model parameter, params, described in the following sections
        super().__init__(**kwargs)
        self.hidden = nn.Dense(256, activation='relu')  # Hidden layer
        self.output = nn.Dense(10)  # Output layer

    # Define the forward computation of the model, that is, how to return the
    # required model output based on the input x
    def forward(self, x):
        return self.output(self.hidden(x))

To begin, let us focus on the forward method. Note that it takes $x$ as input, calculates the hidden representation (`self.hidden(x)`) with the activation function applied, and outputs its logits (`self.output()`). In this MLP implementation, both layers are instance variables. To see why this is reasonable, imagine instantiating two MLPs, `net1` and `net2`, and training them on different data. Naturally, we would expect them to represent two different learned models.

We instantiate the MLP's layers in the `__init__` method (the constructor) and subsequently invoke these layers on each call to the `forward` method. Note a few key details. First, our customized `__init__` method invokes the parent class's `__init__` method via `super().__init__()`, sparing us the pain of restating boilerplate code applicable to most Blocks. We then instantiate our two fully-connected layers, assigning them to `self.hidden` and `self.output`. Note that unless we implement a new operator, we need not worry about backpropagation (the `backward` method) or parameter initialization. The system will generate these methods automatically. Let us try this out:

In [4]:
net = MLP()
net.initialize()
net(x)

array([[-0.03989594, -0.1041471 ,  0.06799038,  0.05245074,  0.02526059,
        -0.00640342,  0.04182098, -0.01665319, -0.02067346, -0.07863817],
       [-0.03612847, -0.07210436,  0.09159479,  0.07890771,  0.02494172,
        -0.01028665,  0.01732428, -0.02843242,  0.03772651, -0.06671704]])

A key virtue of the block abstraction is its versatility. We can subclass the block class to create layers (such as the fully-connected layer class), entire models (such as the MLP above), or various components of intermediate complexity. We exploit this versatility throughout the following chapters, especially when addressing convolutional neural networks.

### 5.1.2 The Sequential Block
We can now take a closer look at how the `Sequential` class works. Recall that Sequential was designed to daisy-chain other blocks together. To build our own simplified `MySequential`, we just need to define two key methods:
+ A method to append blocks one by one to a list.
+ A forward method to pass an input through the chain of Blocks (in the same order as they were appended).

The following `MySequential` class delivers the same functionality as the default `Sequential` class:

In [5]:
class MySequential(nn.Block):
    def add(self, block):
        # Here, block is an instance of a Block subclass, and we assume it has a unique name. 
        # We save it in the member variable _children of the Block class, and its type is OrderedDict. 
        # When the MySequential instance calls the initialize function, the system automatically
        # initializes all members of _children
        self._children[block.name] = block

    def forward(self, x):
        # OrderedDict guarantees that members will be traversed in the order they were added
        for block in self._children.values():
            x = block(x)
        return x

The `add` method adds a single Block to the ordered dictionary `_children`. You might wonder why every `Gluon` Block possesses a `_children` attribute and why we used it rather than just defining a Python list ourselves. In short, the chief advantage of `_children` is that during our Block's parameter initialization, `Gluon` knows to look in the `_children` dictionary to find sub-Blocks whose parameters also need to be initialized.
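
The role of such a registry can be sketched in plain Python. The class `RegistryBlock` below is a hypothetical miniature written for illustration, not Gluon's implementation; it only assumes that initialization walks the registered children recursively, as described above.

```python
from collections import OrderedDict


class RegistryBlock:
    """Hypothetical miniature block; not Gluon's implementation."""

    def __init__(self, name):
        self.name = name
        self._children = OrderedDict()  # registered sub-blocks, in insertion order
        self.initialized = False

    def add(self, block):
        self._children[block.name] = block

    def initialize(self):
        # Initialize this block, then every registered child, recursively
        self.initialized = True
        for child in self._children.values():
            child.initialize()


root = RegistryBlock('root')
registered = RegistryBlock('registered')
root.add(registered)

hidden = RegistryBlock('hidden')
root.plain_list = [hidden]  # kept in an ordinary list, so never registered

root.initialize()
print(registered.initialized)  # True
print(hidden.initialized)      # False
```

The block stored in the ordinary list is invisible to `initialize`, which is the kind of problem the registry avoids.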

When the `forward` method of our `MySequential` is invoked, the added blocks are executed in the order in which they were added. We can now reimplement an MLP using our `MySequential` class.

In [6]:
net = MySequential()
net.add(nn.Dense(256, activation='relu'))
net.add(nn.Dense(10))
net.initialize()
net(x)

array([[-0.07645682, -0.01130233,  0.04952145, -0.04651389, -0.04131573,
        -0.05884133, -0.0621381 ,  0.01311472, -0.01379425, -0.02514282],
       [-0.05124625,  0.00711231, -0.00155935, -0.07555379, -0.06675334,
        -0.01762914,  0.00589084,  0.01447191, -0.04330775,  0.03317726]])

Note that this use of `MySequential` is identical to the code we previously wrote for the `nn.Sequential` class (as described in `Section 4.3`).

### 5.1.3 Executing Code in the forward Method
The `nn.Sequential` class makes model construction easy, allowing us to assemble new architectures without having to define our own class. However, not all architectures are simple daisy chains. When greater flexibility is required, we will want to define our own blocks. For example, we might want to execute Python's control flow within the forward method. Moreover we might want to perform arbitrary mathematical operations, not simply relying on predefined neural network layers.

You might have noticed that until now, all of the operations in our networks have acted upon our network's activations and its parameters. Sometimes, however, we might want to incorporate terms that are neither the result of previous layers nor updatable parameters. We call these *constant* parameters. Say for example that we want a layer that calculates the function $f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}$, where $\mathbf{x}$ is the input, $\mathbf{w}$ is our parameter, and $c$ is some specified constant that is not updated during optimization.
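
As a framework-free illustration of this function, the sketch below evaluates $f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}$ and takes one hand-written gradient step. The learning rate `lr` and all numerical values are arbitrary assumptions chosen for the example; the point is only that $\mathbf{w}$ is updated while $c$ is left untouched.

```python
def f(x, w, c):
    """f(x, w) = c * w^T x"""
    return c * sum(wi * xi for wi, xi in zip(w, x))


def grad_w(x, c):
    """Gradient of f with respect to w, which is c * x."""
    return [c * xi for xi in x]


c = 3.0          # constant: fixed for the whole of training
w = [1.0, -2.0]  # trainable parameter
x = [0.5, 0.25]
lr = 0.1         # hypothetical learning rate

before = f(x, w, c)
# One descent step on f itself, purely for illustration: w changes, c does not
w = [wi - lr * gi for wi, gi in zip(w, grad_w(x, c))]
after = f(x, w, c)
print(before, after, c)
```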

Declaring constants explicitly (via `get_constant`) makes this clear and helps `Gluon` to speed up execution. In the following code, we will implement a model that could not easily be assembled using only predefined layers and `Sequential`.

In [7]:
class FixedHiddenMLP(nn.Block):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Constant parameters created with get_constant are not updated
        # during training
        self.rand_weight = self.params.get_constant('rand_weight', np.random.uniform(size=(20, 20)))
        self.dense = nn.Dense(20, activation='relu')

    def forward(self, x):
        x = self.dense(x)
        # Use the constant parameters created, as well as the relu and dot functions
        x = npx.relu(np.dot(x, self.rand_weight.data()) + 1)
        # Reuse the fully connected layer. This is equivalent to sharing parameters between two
        # fully connected layers
        x = self.dense(x)
        # Control flow: keep halving x while the sum of its absolute values exceeds 1
        while np.abs(x).sum() > 1:
            x /= 2
        return x.sum()

In this `FixedHiddenMLP` model, we implement a hidden layer whose weights (`self.rand_weight`) are initialized randomly at instantiation and are thereafter constant. This weight is not a model parameter and thus it is never updated by backpropagation. The network then passes the output of this fixed layer through a `Dense` layer.

Note that before returning output, our model did something unusual. We ran a while loop that tests the condition `np.abs(x).sum() > 1` and divides our output vector by $2$ until the condition no longer holds, that is, until the sum of absolute values is at most $1$. Finally, we returned the sum of the entries in $x$. To our knowledge, no standard neural network performs this operation. Note that this particular operation may not be useful in any real world task. Our point is only to show you how to integrate arbitrary code into the flow of your neural network computations.
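
The loop itself is ordinary Python and can be tried in isolation. The sketch below reproduces only the halving logic on a plain list; the helper name `halve_until_small` is introduced here for illustration.

```python
def halve_until_small(values):
    """Divide all entries by 2 until the sum of absolute values is at most 1."""
    values = list(values)
    while sum(abs(v) for v in values) > 1:
        values = [v / 2 for v in values]
    return sum(values)


result = halve_until_small([3.0, -1.0, 2.0])
print(result)
```

The number of loop iterations depends on the input values, so the amount of computation is decided at run time rather than fixed by the architecture.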

In [8]:
net = FixedHiddenMLP()
net.initialize()
net(x)

array(0.52637565)

We can mix and match various ways of assembling blocks together. In the following example, we nest blocks in some creative ways.

In [9]:
class NestMLP(nn.Block):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.net = nn.Sequential()
        self.net.add(nn.Dense(64, activation='relu'),
                     nn.Dense(32, activation='relu'))
        self.dense = nn.Dense(16, activation='relu')

    def forward(self, x):
        return self.dense(self.net(x))

chimera = nn.Sequential()
chimera.add(NestMLP(), nn.Dense(20), FixedHiddenMLP())

chimera.initialize()
chimera(x)
chimera.summary

<bound method Block.summary of Sequential(
  (0): NestMLP(
    (net): Sequential(
      (0): Dense(20 -> 64, Activation(relu))
      (1): Dense(64 -> 32, Activation(relu))
    )
    (dense): Dense(32 -> 16, Activation(relu))
  )
  (1): Dense(16 -> 20, linear)
  (2): FixedHiddenMLP(
    (dense): Dense(20 -> 20, Activation(relu))
  )
)>

### 5.1.4 Compilation
The avid reader might start to worry about the efficiency of some of these operations. After all, we have lots of dictionary lookups, code execution, and lots of other Pythonic things taking place in what is supposed to be a high-performance deep learning library. The problems of Python's Global Interpreter Lock are well known. In the context of deep learning, we worry that our extremely fast GPU(s) might have to wait until a puny CPU runs Python code before it gets another job to run. The best way to speed up Python is by avoiding it altogether. One way that `Gluon` does this is by allowing for hybridization (`Section 12.1`). Here, the Python interpreter executes a Block the first time it is invoked. The `Gluon` runtime records what is happening and, the next time around, short-circuits calls to Python. This can accelerate things considerably in some cases, but care needs to be taken when control flow (as above) leads down different branches on different passes through the net. We recommend that the interested reader check out the hybridization section (`Section 12.1`) to learn about compilation after finishing the current chapter.
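
The caveat about control flow can be made tangible with a toy record-and-replay wrapper. This is a deliberately simplified, hypothetical illustration (`ToyTraced`, `halve`, and `shrink` are invented names); it is not how `Gluon` implements hybridization, only a sketch of why replaying a recording can disagree with rerunning data-dependent Python code.

```python
class ToyTraced:
    """Toy record-and-replay wrapper; not Gluon's hybridization mechanism."""

    def __init__(self, fn):
        self.fn = fn
        self.tape = None  # operations recorded on the first call

    def __call__(self, x):
        if self.tape is None:
            # First call: run the Python code and record each operation
            self.tape = []
            return self.fn(x, self.tape)
        # Later calls: replay the recorded operations, skipping the Python code
        for op in self.tape:
            x = op(x)
        return x


def halve(v):
    return v / 2


def shrink(x, tape):
    # Data-dependent control flow: the number of iterations depends on x
    while abs(x) > 1:
        tape.append(halve)
        x = halve(x)
    return x


traced = ToyTraced(shrink)
first = traced(8.0)        # runs the loop and records three halvings
second = traced(64.0)      # replays exactly those three halvings
eager = shrink(64.0, [])   # plain Python runs the loop to completion
print(first, second, eager)
```

In this toy, the replayed result for the second input differs from the eager one because the recording froze the loop length observed on the first input.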

##### Summary
+ Layers are blocks.
+ Many layers can comprise a block.
+ Many blocks can comprise a block.
+ A block can contain code.
+ Blocks take care of lots of housekeeping, including parameter initialization and backpropagation.
+ Sequential concatenations of layers and blocks are handled by the Sequential Block.

##### Exercises
1. What kinds of problems will occur if you change MySequential to store blocks in a Python list?
2. Implement a block that takes two blocks as arguments, say net1 and net2, and returns the concatenated output of both networks in the forward pass (this is also called a parallel block).
3. Assume that you want to concatenate multiple instances of the same network. Implement a factory function that generates multiple instances of the same block and build a larger network from it.


## 5.2 Parameter Management
Once we have chosen an architecture and set our hyperparameters, we proceed to the training loop, where our goal is to find parameter values that minimize our objective function. After training, we will need these parameters in order to make future predictions. Additionally, we will sometimes wish to extract the parameters either to reuse them in some other context, to save our model to disk so that it may be executed in other software, or for examination in the hopes of gaining scientific understanding.

Most of the time, we will be able to ignore the nitty-gritty details of how parameters are declared and manipulated, relying on the framework to do the heavy lifting. However, when we move away from stacked architectures with standard layers, we will sometimes need to get into the weeds of declaring and manipulating parameters. In this section, we cover the following:
+ Accessing parameters for debugging, diagnostics, and visualizations.
+ Parameter initialization.
+ Sharing parameters across different model components.

We start by focusing on an MLP with one hidden layer.

In [10]:
net = nn.Sequential()
net.add(nn.Dense(8, activation='relu'))
net.add(nn.Dense(1))
net.initialize()  # Use the default initialization method

x = np.random.uniform(size=(2, 4))
net(x)  # Forward computation
net.summary

<bound method Block.summary of Sequential(
  (0): Dense(4 -> 8, Activation(relu))
  (1): Dense(8 -> 1, linear)
)>

### 5.2.1 Parameter Access
Let us start with how to access parameters from the models that you already know. When a model is defined via the Sequential class, we can first access any layer by indexing into the model as though it were a list. Each layer's parameters are conveniently located in its `params` attribute. We can inspect the parameters of the net defined above as a dictionary.

In [11]:
net[0].params, net[1].params

(dense12_ (
   Parameter dense12_weight (shape=(8, 4), dtype=float32)
   Parameter dense12_bias (shape=(8,), dtype=float32)
 ),
 dense13_ (
   Parameter dense13_weight (shape=(1, 8), dtype=float32)
   Parameter dense13_bias (shape=(1,), dtype=float32)
 ))

The output tells us a few important things. First, each fully-connected layer contains two parameters, `weight` and `bias` (here prefixed with the layer name, as in `dense12_weight`), corresponding to that layer's weights and biases, respectively. Both are stored as single precision floats. Note that the names of the parameters allow us to uniquely identify each layer's parameters, even in a network containing hundreds of layers.

##### Targeted Parameters
Note that each parameter is represented as an instance of the `Parameter` class. To do anything useful with the parameters, we first need to access the underlying numerical values. There are several ways to do this. Some are simpler while others are more general. To begin, given a layer, we can access one of its parameters via the `bias` or `weight` attribute, which returns a `Parameter` instance, and then access that parameter's value via its `data` method. The following code extracts the bias from the second neural network layer.

In [12]:
type(net[1].bias), net[1].bias, net[1].bias.data()

(mxnet.gluon.parameter.Parameter,
 Parameter dense13_bias (shape=(1,), dtype=float32),
 array([0.]))

Parameters are complex objects, containing data, gradients, and additional information. That's why we need to request the data explicitly.

In addition to data, each Parameter also provides a `grad` method for accessing the gradient. Because we have not invoked backpropagation for this network yet, it is in its initial state.

In [13]:
net[0].weight.grad()

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
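
The separation between a parameter object and the values it holds can be pictured with a toy class. `ToyParameter` below is a hypothetical miniature written for illustration and is not Gluon's `Parameter` class; it only mirrors the idea that values and gradients are requested explicitly through `data()` and `grad()`.

```python
class ToyParameter:
    """Hypothetical miniature of a parameter object; not Gluon's Parameter class."""

    def __init__(self, name, shape):
        self.name = name
        self.shape = shape
        size = 1
        for dim in shape:
            size *= dim
        self._data = [0.0] * size  # flattened values
        self._grad = [0.0] * size  # one gradient entry per value

    def data(self):
        return self._data

    def grad(self):
        return self._grad

    def __repr__(self):
        return f'ToyParameter {self.name} (shape={self.shape})'


bias = ToyParameter('toy_bias', (1,))
print(bias)         # the wrapper object: name and shape
print(bias.data())  # the numerical values
print(bias.grad())  # the gradient, still at its initial zeros
```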

##### All Parameters at Once
When we need to perform operations on all parameters, accessing them one-by-one can grow tedious. The situation can grow especially unwieldy when we work with more complex blocks (e.g., nested Blocks), since we would need to recurse through the entire tree to extract each sub-block's parameters. Each Block therefore provides a `collect_params` method that returns its parameters, including those of nested sub-Blocks, in a single dictionary.
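
To see what such a traversal involves, here is a toy depth-first collection in plain Python. `TreeBlock` and the parameter names are hypothetical and serve only as an illustration; this is not Gluon's implementation.

```python
from collections import OrderedDict


class TreeBlock:
    """Hypothetical miniature block; not Gluon's implementation."""

    def __init__(self, params=(), children=()):
        self.params = list(params)      # names of parameters owned directly
        self.children = list(children)  # nested sub-blocks

    def collect_params(self):
        # Own parameters first, then those of each child, depth first
        collected = OrderedDict((name, self) for name in self.params)
        for child in self.children:
            collected.update(child.collect_params())
        return collected


inner = TreeBlock(children=[TreeBlock(['a_weight', 'a_bias']),
                            TreeBlock(['b_weight', 'b_bias'])])
outer = TreeBlock(children=[inner, TreeBlock(['c_weight', 'c_bias'])])
print(list(outer.collect_params()))
```

A single call on the outermost block returns every parameter name in the tree, in the order in which the sub-blocks were nested.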

In [14]:
# parameters only for the first layer
print(net[0].collect_params())
# parameters of the entire network
print(net.collect_params())

dense12_ (
  Parameter dense12_weight (shape=(8, 4), dtype=float32)
  Parameter dense12_bias (shape=(8,), dtype=float32)
)
sequential3_ (
  Parameter dense12_weight (shape=(8, 4), dtype=float32)
  Parameter dense12_bias (shape=(8,), dtype=float32)
  Parameter dense13_weight (shape=(1, 8), dtype=float32)
  Parameter dense13_bias (shape=(1,), dtype=float32)
)


This provides us with another way of accessing the parameters of the network:

In [16]:
net.collect_params()['dense12_bias'].data()

array([0., 0., 0., 0., 0., 0., 0., 0.])

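Throughout the book we encounter Blocks that name their sub-Blocks in various ways. Sequential simply numbers them. We can exploit this naming convention by leveraging one clever feature of `collect_params`: it allows us to filter the parameters returned by using regular expressions. In the cell below, the first pattern selects every weight, while the second, `dense0.*`, matches nothing because the layers in this session are named `dense12` and `dense13`.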

In [18]:
net.collect_params('.*weight'), net.collect_params('dense0.*')

(sequential3_ (
   Parameter dense12_weight (shape=(8, 4), dtype=float32)
   Parameter dense13_weight (shape=(1, 8), dtype=float32)
 ),
 sequential3_ (
 
 ))
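
The filtering behavior shown above can be mimicked with Python's `re` module. The sketch assumes that a pattern must match from the beginning of the parameter name (the semantics of `re.match`), which is consistent with the two results above; the helper `select` is a name introduced here, not part of `Gluon`.

```python
import re

names = ['dense12_weight', 'dense12_bias', 'dense13_weight', 'dense13_bias']


def select(pattern, names):
    """Keep the names that the pattern matches from the start (re.match semantics)."""
    regex = re.compile(pattern)
    return [name for name in names if regex.match(name)]


print(select('.*weight', names))   # every weight
print(select('dense0.*', names))   # no layer is called dense0 here
print(select('dense12.*', names))  # both parameters of one layer
```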

##### Collecting Parameters from Nested Blocks
Let us see how the parameter naming conventions work if we nest multiple blocks inside each other. For that we first define a function that produces blocks (a block factory, so to speak) and then combine these inside yet larger blocks.

In [24]:
def block1():
    net = nn.Sequential()
    net.add(nn.Dense(32, activation='relu'))
    net.add(nn.Dense(16, activation='relu'))
    return net

def block2():
    net = nn.Sequential()
    for i in range(4):
        net.add(block1())
    return net

rgnet = nn.Sequential()
rgnet.add(block2())
rgnet.add(nn.Dense(10))
rgnet.initialize()
rgnet(x)

array([[ 9.92952165e-10,  5.80053006e-10, -1.89483429e-09,
         1.12776655e-09,  1.68435921e-09, -8.53175031e-10,
         3.54924562e-10, -1.07297127e-09,  1.25932820e-09,
         8.22724000e-10],
       [ 2.87912662e-11,  5.09265789e-11, -2.83627760e-10,
         1.27787614e-10,  8.49983195e-10, -2.54666704e-10,
        -1.06637130e-10,  3.06466276e-11,  4.25334018e-10,
         1.09359036e-10]])

Now that we have designed the network, let us see how it is organized.

In [23]:
rgnet.collect_params

<bound method Block.collect_params of Sequential(
  (0): Sequential(
    (0): Sequential(
      (0): Dense(-1 -> 32, Activation(relu))
      (1): Dense(-1 -> 16, Activation(relu))
    )
    (1): Sequential(
      (0): Dense(-1 -> 32, Activation(relu))
      (1): Dense(-1 -> 16, Activation(relu))
    )
    (2): Sequential(
      (0): Dense(-1 -> 32, Activation(relu))
      (1): Dense(-1 -> 16, Activation(relu))
    )
    (3): Sequential(
      (0): Dense(-1 -> 32, Activation(relu))
      (1): Dense(-1 -> 16, Activation(relu))
    )
  )
  (1): Dense(-1 -> 10, linear)
)>

In [25]:
rgnet.collect_params()

sequential16_ (
  Parameter dense32_weight (shape=(32, 4), dtype=float32)
  Parameter dense32_bias (shape=(32,), dtype=float32)
  Parameter dense33_weight (shape=(16, 32), dtype=float32)
  Parameter dense33_bias (shape=(16,), dtype=float32)
  Parameter dense34_weight (shape=(32, 16), dtype=float32)
  Parameter dense34_bias (shape=(32,), dtype=float32)
  Parameter dense35_weight (shape=(16, 32), dtype=float32)
  Parameter dense35_bias (shape=(16,), dtype=float32)
  Parameter dense36_weight (shape=(32, 16), dtype=float32)
  Parameter dense36_bias (shape=(32,), dtype=float32)
  Parameter dense37_weight (shape=(16, 32), dtype=float32)
  Parameter dense37_bias (shape=(16,), dtype=float32)
  Parameter dense38_weight (shape=(32, 16), dtype=float32)
  Parameter dense38_bias (shape=(32,), dtype=float32)
  Parameter dense39_weight (shape=(16, 32), dtype=float32)
  Parameter dense39_bias (shape=(16,), dtype=float32)
  Parameter dense40_weight (shape=(10, 16), dtype=float32)
  Parameter dense40_bi

Since the layers are hierarchically nested, we can also access them as though indexing through nested lists. For instance, we can access the first major block, within it the second subblock, and within that the bias of the first layer, as follows:

In [26]:
rgnet[0][1][0].bias.data()

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

### 5.2.2 Parameter Initialization
Now that we know how to access the parameters, let us look at how to initialize them properly. We discussed the need for initialization in `Section 4.8`.

By default, `MXNet` initializes weight matrices uniformly by drawing from $U[-0.07, 0.07]$ and the bias parameters are all set to $0$. However, we will often want to initialize our weights according to various other protocols. `MXNet`'s init module provides a variety of preset initialization methods. If we want to create a custom initializer, we need to do some extra work. 

##### Built-in Initialization
Let us begin by calling on built-in initializers. The code below initializes all weight parameters as Gaussian random variables with standard deviation $0.01$, while bias parameters are set to $0$.

In [27]:
# force_reinit ensures that variables are freshly initialized 
# even if they were already initialized previously
net.initialize(init=init.Normal(sigma=0.01), force_reinit=True)
net[0].weight.data()[0]

array([-0.01606865, -0.00637418, -0.00194022,  0.01938055])

We can also initialize all parameters to a given constant value (say, $1$).

In [28]:
net.initialize(init=init.Constant(1), force_reinit=True)
net[0].weight.data()[0]

array([1., 1., 1., 1.])

We can also apply different initializers for certain Blocks. For example, below we initialize the first layer with the `Xavier` initializer and initialize the second layer to a constant value of 42.

In [29]:
net[0].weight.initialize(init=init.Xavier(), force_reinit=True)
net[1].initialize(init=init.Constant(42), force_reinit=True)
print(net[0].weight.data()[0])
print(net[1].weight.data())

[-0.3839189  -0.59246373 -0.38836983 -0.1149953 ]
[[42. 42. 42. 42. 42. 42. 42. 42.]]
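
For intuition about the scale of the Xavier-initialized values, the sketch below samples from the commonly used uniform Xavier (Glorot) rule, $U[-a, a]$ with $a = \sqrt{6 / (n_\text{in} + n_\text{out})}$. That this matches the default settings of `init.Xavier` is an assumption on our part and should be checked against the MXNet documentation; the helper `xavier_uniform` is a name introduced here.

```python
import math
import random


def xavier_uniform(fan_out, fan_in, rng):
    """Sample a fan_out x fan_in matrix from U[-a, a], a = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    weights = [[rng.uniform(-bound, bound) for _ in range(fan_in)]
               for _ in range(fan_out)]
    return weights, bound


rng = random.Random(0)                      # fixed seed, for reproducibility only
weights, bound = xavier_uniform(8, 4, rng)  # same shape as the first layer above
print(round(bound, 4))
print(all(abs(w) <= bound for row in weights for w in row))
```

For an 8-by-4 weight matrix the bound is roughly 0.71, and the first-row values printed above all lie within that range.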


##### Custom Initialization
Sometimes, the initialization methods we need are not provided in the init module. In the example below, we define an initializer for the following strange distribution:
$$ \begin{aligned} w \sim \begin{cases} U[5, 10] & \text{ with probability } \frac{1}{4} \\ 0 & \text{ with probability } \frac{1}{2} \\ U[-10, -5] & \text{ with probability } \frac{1}{4} \end{cases} \end{aligned} $$

Here we define a subclass of `Initializer`. Usually, we only need to implement the `_init_weight` function which takes a tensor argument (data) and assigns to it the desired initialized values.

In [30]:
class MyInit(init.Initializer):
    def _init_weight(self, name, data):
        print('Init', name, data.shape)
        data[:] = np.random.uniform(-10, 10, data.shape)
        data *= np.abs(data) >= 5

net.initialize(MyInit(), force_reinit=True)
net[0].weight.data()[0:2]

Init dense12_weight (8, 4)
Init dense13_weight (1, 8)


array([[-7.2428675,  8.986643 ,  8.441984 , -6.104151 ],
       [ 7.513895 , -0.       , -0.       ,  8.672056 ]])
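
As a sanity check that drawing from $U[-10, 10]$ and zeroing out values of magnitude below $5$ yields the distribution stated earlier, we can estimate the three probabilities empirically. This plain-Python sketch mirrors the logic of `MyInit` with the standard `random` module; `sample_weight` and the sample size are choices made here for illustration.

```python
import random


def sample_weight(rng):
    """Draw from U[-10, 10] and zero out values with magnitude below 5."""
    value = rng.uniform(-10, 10)
    return value if abs(value) >= 5 else 0.0


rng = random.Random(0)  # fixed seed, for reproducibility only
samples = [sample_weight(rng) for _ in range(100_000)]

zero_frac = sum(s == 0.0 for s in samples) / len(samples)
pos_frac = sum(s >= 5 for s in samples) / len(samples)
neg_frac = sum(s <= -5 for s in samples) / len(samples)
print(zero_frac, pos_frac, neg_frac)  # expected to be close to 1/2, 1/4, 1/4
```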

Note that we always have the option of setting parameters directly by calling data to access the underlying data.

A note for advanced users: if you want to adjust parameters within an `autograd` scope, you need to use `set_data` to avoid confusing the automatic differentiation mechanics.

In [31]:
net[0].weight.data()[:] += 1
net[0].weight.data()[0, 0] = 42
net[0].weight.data()[0]

array([42.      ,  9.986643,  9.441984, -5.104151])

### 5.2.3 Tied Parameters
Often, we want to share parameters across multiple layers. Later we will see that when learning word embeddings, it might be sensible to use the same parameters both for encoding and decoding words. We already saw one such case in `Section 5.1`, where `FixedHiddenMLP` applied the same `Dense` layer twice. Let us see how to do this a bit more elegantly. In the following, we allocate a dense layer and then use its parameters specifically to set those of another layer.

In [32]:
net = nn.Sequential()
# We need to give the shared layer a name such that we can reference its parameters
shared = nn.Dense(8, activation='relu')
net.add(nn.Dense(8, activation='relu'),
        shared,
        nn.Dense(8, activation='relu', params=shared.params),
        nn.Dense(10))
net.initialize()

x = np.random.uniform(size=(2, 20))
net(x)

array([[ 6.5927663e-05,  2.9150839e-04,  1.8158997e-04, -3.1138291e-06,
         1.1306122e-04, -2.1609252e-05, -1.0594767e-04,  3.5190111e-05,
        -2.3551464e-04,  5.0346847e-05],
       [ 6.2714724e-05,  3.4918476e-04,  1.7671600e-04,  2.2592618e-05,
         9.4192124e-05, -5.6604629e-05, -7.5541138e-05,  3.4781366e-05,
        -2.4959963e-04,  6.2697138e-05]])

In [33]:
# Check whether the parameters are the same
print(net[1].weight.data()[0] == net[2].weight.data()[0])
net[1].weight.data()[0, 0] = 100
# Make sure that they are actually the same object rather than just having the
# same value
print(net[1].weight.data()[0] == net[2].weight.data()[0])

[ True  True  True  True  True  True  True  True]
[ True  True  True  True  True  True  True  True]


This example shows that the parameters of the second and third layer are tied. They are not just equal, they are represented by the same exact tensor. Thus, if we change one of the parameters, the other one changes, too. You might wonder, when parameters are tied what happens to the gradients? Since the model parameters contain gradients, the gradients of the second hidden layer and the third hidden layer are added together during backpropagation.
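
To make the claim about summed gradients concrete, here is a minimal sketch in plain Python rather than Gluon. It assumes a toy scalar "network" in which the same weight `w` is used twice; the function `forward` and the numbers are made up for illustration. The total derivative is the sum of the contributions from both uses, which a finite-difference check confirms.

```python
# Toy scalar "network" in which two layers share the same weight w:
#   h = w * x   (first use of w)
#   y = w * h   (second use of w)
def forward(w, x):
    h = w * x
    return w * h

w, x = 3.0, 2.0
h = w * x

grad_second_use = h      # dy/dw through y = w * h, treating h as a constant
grad_first_use = w * x   # (dy/dh) * (dh/dw) = w * x
grad_total = grad_first_use + grad_second_use  # contributions are summed

# Finite-difference check of the total derivative dy/dw
eps = 1e-6
grad_numeric = (forward(w + eps, x) - forward(w - eps, x)) / (2 * eps)
print(grad_total, grad_numeric)
```

Since $y = w^2 x$, the exact derivative is $2wx$, matching the sum of the two per-use contributions.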

##### Summary
+ We have several ways to access, initialize, and tie model parameters.
+ We can use custom initialization.

##### Exercises
1. Use the `FancyMLP` defined in `Section 5.1` and access the parameters of the various layers.
2. Look at the `init` module documentation to explore different initializers.
3. Construct a multilayer perceptron containing a shared parameter layer and train it. During the training process, observe the model parameters and gradients of each layer.
4. Why is sharing parameters a good idea?


## 5.3 Deferred Initialization
So far, it might seem that we got away with being sloppy in setting up our networks. Specifically, we did the following unintuitive things, which might not seem like they should work:
+ We defined the network architectures without specifying the input dimensionality.
+ We added layers without specifying the output dimension of the previous layer.
+ We even "initialized" these parameters before providing enough information to determine how many parameters our models should contain.

You might be surprised that our code runs at all. After all, there is no way `MXNet` could tell what the input dimensionality of a network would be. The trick here is that `MXNet` defers initialization, waiting until the first time we pass data through the model, to infer the sizes of each layer on the fly.

Later on, when working with convolutional neural networks, this technique will become even more convenient since the input dimensionality (i.e., the resolution of an image) will affect the dimensionality of each subsequent layer. Hence, the ability to set parameters without the need to know, at the time of writing the code, what the dimensionality is can greatly simplify the task of specifying and subsequently modifying our models. Next, we go deeper into the mechanics of initialization.

### 5.3.1 Instantiating a Network
To begin, let us instantiate an MLP.

In [34]:
def getnet():
    net = nn.Sequential()
    net.add(nn.Dense(256, activation='relu'))
    net.add(nn.Dense(10))
    return net

net = getnet()

At this point, the network cannot possibly know the dimensions of the input layer's weights because the input dimension remains unknown. Consequently `MXNet` has not yet initialized any parameters. We confirm by attempting to access the parameters below.

In [35]:
net.collect_params, net.collect_params()

(<bound method Block.collect_params of Sequential(
   (0): Dense(-1 -> 256, Activation(relu))
   (1): Dense(-1 -> 10, linear)
 )>,
 sequential23_ (
   Parameter dense45_weight (shape=(256, -1), dtype=float32)
   Parameter dense45_bias (shape=(256,), dtype=float32)
   Parameter dense46_weight (shape=(10, -1), dtype=float32)
   Parameter dense46_bias (shape=(10,), dtype=float32)
 ))

Note that while the `Parameter` objects exist, the input dimension to each layer is listed as $-1$. `MXNet` uses the special value $-1$ to indicate that a parameter's dimension remains unknown. At this point, attempts to access `net[0].weight.data()` would trigger a runtime error stating that the network must be initialized before the parameters can be accessed. Now let us see what happens when we attempt to initialize parameters via the `initialize` method.

In [36]:
net.initialize()
net.collect_params()

sequential23_ (
  Parameter dense45_weight (shape=(256, -1), dtype=float32)
  Parameter dense45_bias (shape=(256,), dtype=float32)
  Parameter dense46_weight (shape=(10, -1), dtype=float32)
  Parameter dense46_bias (shape=(10,), dtype=float32)
)

As we can see, nothing has changed. When input dimensions are unknown, calls to `initialize` do not truly initialize the parameters. Instead, this call registers with `MXNet` that we wish to initialize the parameters (and, optionally, according to which distribution). Only once we pass data through the network will `MXNet` finally initialize the parameters, and only then will we see a difference.

In [37]:
x = np.random.uniform(size=(2, 20))
net(x)  # Forward computation

net.collect_params()

sequential23_ (
  Parameter dense45_weight (shape=(256, 20), dtype=float32)
  Parameter dense45_bias (shape=(256,), dtype=float32)
  Parameter dense46_weight (shape=(10, 256), dtype=float32)
  Parameter dense46_bias (shape=(10,), dtype=float32)
)

As soon as we know the input dimensionality, $\mathbf{x} \in \mathbb{R}^{20}$, `MXNet` can identify the shape of the first layer's weight matrix, i.e., $\mathbf{W}_1 \in \mathbb{R}^{256 \times 20}$. Having recognized the first layer shape, `MXNet` proceeds to the second layer, whose dimensionality is $10 \times 256$ and so on through the computational graph until all shapes are known. Note that in this case, only the first layer requires deferred initialization, but `MXNet` initializes sequentially. Once all parameter shapes are known, `MXNet` can finally initialize the parameters.
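
The shape propagation described above can be sketched in a few lines of plain Python. The helper `infer_dense_shapes` is hypothetical (it is not part of `MXNet`); it only mimics how the output width of one dense layer becomes the input width of the next, using the `(units, in_units)` weight layout seen in the output above.

```python
def infer_dense_shapes(input_dim, units_per_layer):
    """Return the (units, in_units) weight shape of each dense layer, in order."""
    shapes = []
    in_units = input_dim
    for units in units_per_layer:
        shapes.append((units, in_units))
        in_units = units  # this layer's output feeds the next layer
    return shapes

# Same setting as above: 20-dimensional input, layers with 256 and 10 units
shapes = infer_dense_shapes(20, [256, 10])
print(shapes)  # [(256, 20), (10, 256)]
```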

### 5.3.2 Deferred Initialization in Practice
Now that we know how it works in theory, let us see when the initialization is actually triggered. In order to do so, we mock up an initializer which does nothing but report a debug message stating when it was invoked and with which parameters.

In [38]:
class MyInit(init.Initializer):
    def _init_weight(self, name, data):
        print('Init', name, data.shape)
        # The actual initialization logic is omitted here

net = getnet()
net.initialize(init=MyInit())

Note that, although `MyInit` prints information about the model parameters when it is called, the `initialize` call above prints nothing.

Therefore, no real parameter initialization takes place when calling the `initialize` function. Next, we define the input and perform a forward calculation.

In [39]:
x = np.random.uniform(size=(2, 20))
y = net(x)

Init dense47_weight (256, 20)
Init dense48_weight (10, 256)


At this time, information on the model parameters is printed. When performing a forward calculation based on the input $x$, the system can automatically infer the shape of the weight parameters of all layers based on the shape of the input. Once the system has created these parameters, it calls the `MyInit` instance to initialize them before proceeding to the forward calculation.

This initialization will only be called when completing the initial forward calculation. After that, we will not re-initialize when we run the forward calculation `net(x)`, so the output of the `MyInit` instance will not be generated again.

In [40]:
y = net(x)

As mentioned at the beginning of this section, deferred initialization can be a source of confusion. Before the first forward calculation, we were unable to directly manipulate the model parameters. For example, we could not use the `data` and `set_data` functions to get and modify the parameters. Therefore, we often force initialization by sending a sample observation through the network.
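
The behavior observed in this section can be imitated with a toy layer written in plain Python. The class `LazyDense` below is a hypothetical stand-in, not Gluon's implementation: it allocates its weights only when the first input arrives, and it counts how often the initializer runs. The uniform range used for the weights is an arbitrary choice.

```python
import random

class LazyDense:
    """Toy dense layer (hypothetical, not Gluon) with deferred initialization."""

    def __init__(self, units):
        self.units = units
        self.weight = None   # shape unknown until the first input arrives
        self.init_calls = 0  # how many times the initializer has run

    def _init_weight(self, in_units):
        self.init_calls += 1
        self.weight = [[random.uniform(-0.1, 0.1) for _ in range(in_units)]
                       for _ in range(self.units)]

    def __call__(self, x):
        if self.weight is None:  # first forward pass: infer shape, then initialize
            self._init_weight(len(x))
        return [sum(w_ij * x_j for w_ij, x_j in zip(row, x))
                for row in self.weight]

layer = LazyDense(4)
assert layer.weight is None  # nothing to access before the first forward pass
y1 = layer([1.0, 2.0, 3.0])  # triggers initialization with in_units = 3
y2 = layer([1.0, 2.0, 3.0])  # reuses the existing weights
print(layer.init_calls, len(layer.weight), len(layer.weight[0]))  # 1 4 3
```

As with `net(x)` above, the second call does not initialize anything again.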

### 5.3.3 Forced Initialization
Deferred initialization does not occur if the system knows the shape of all parameters when we call the initialize function. This can occur in two cases:
+ We have already seen some data and we just want to reset the parameters.
+ We specified all input and output dimensions of the network when defining it.

Forced reinitialization works as illustrated below.

In [41]:
net.initialize(init=MyInit(), force_reinit=True)

Init dense47_weight (256, 20)
Init dense48_weight (10, 256)


The second case requires that we specify all parameters when creating each layer. For instance, for dense layers we must specify `in_units` at the time that the layer is instantiated.

In [42]:
net = nn.Sequential()
net.add(nn.Dense(256, in_units=20, activation='relu'))
net.add(nn.Dense(10, in_units=256))

net.initialize(init=MyInit())

Init dense49_weight (256, 20)
Init dense50_weight (10, 256)


##### Summary
+ Deferred initialization can be convenient, allowing Gluon to infer parameter shapes automatically, making it easy to modify architectures and eliminating one common source of errors.
+ We do not need deferred initialization when we specify all variables explicitly.
+ We can forcibly re-initialize a network's parameters by invoking `initialize` with the `force_reinit=True` flag.

##### Exercises
1. What happens if you specify the input dimensions to the first layer but not to subsequent layers? Do you get immediate initialization?
2. What happens if you specify mismatching dimensions?
3. What would you need to do if you have input of varying dimensionality? Hint - look at parameter tying.


## 5.4 Custom Layers
One factor behind deep learning's success is the availability of a wide range of layers that can be composed in creative ways to design architectures suitable for a wide variety of tasks. For instance, researchers have invented layers specifically for handling images, text, looping over sequential data, performing dynamic programming, etc. Sooner or later you will encounter (or invent) a layer that does not exist yet in the framework. In these cases, you must build a custom layer. In this section, we show you how.

### 5.4.1 Layers without Parameters
To start, we construct a custom layer (a block) that does not have any parameters of its own. This should look familiar if you recall our introduction to blocks in `Section 5.1`. The following `CenteredLayer` class simply subtracts the mean from its input. To build it, we only need to inherit from the `Block` class and implement the `forward` method.

In [43]:
class CenteredLayer(nn.Block):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def forward(self, x):
        return x - x.mean()

Let us verify that our layer works as intended by feeding some data through it.

In [44]:
layer = CenteredLayer()
layer(np.array([1, 2, 3, 4, 5]))

array([-2., -1.,  0.,  1.,  2.])

We can now incorporate our layer as a component in constructing more complex models.

In [46]:
net = nn.Sequential()
net.add(nn.Dense(128), CenteredLayer())
net.initialize()
net.summary

<bound method Block.summary of Sequential(
  (0): Dense(-1 -> 128, linear)
  (1): CenteredLayer(
  
  )
)>

As an extra sanity check, we can send random data through the network and check that the mean is in fact 0. Because we are dealing with floating point numbers, we may still see a very small nonzero number due to quantization.

In [47]:
y = net(np.random.uniform(size=(4, 8)))
y.mean()

array(6.330083e-10)
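
The same effect can be reproduced outside of `MXNet`. The sketch below uses plain NumPy (imported as `onp` to avoid shadowing MXNet's `np`) with single-precision floats; the seed and array shape are arbitrary. The mean of the centered array is tiny, but it need not be exactly zero.

```python
import numpy as onp  # plain NumPy; `onp` avoids shadowing MXNet's `np`

rng = onp.random.default_rng(0)                    # arbitrary seed
a = rng.uniform(size=(4, 8)).astype(onp.float32)   # single precision, as in the text
centered = a - a.mean()
residual = float(centered.mean())
print(residual)  # tiny, but not necessarily exactly 0
```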

### 5.4.2 Layers with Parameters
Now that we know how to define simple layers, let us move on to defining layers with parameters that can be adjusted through training. To automate some of the routine work, the `Parameter` class provides some basic housekeeping functionality. In particular, it governs access, initialization, sharing, saving, and loading of model parameters. This way, among other benefits, we will not need to write custom serialization routines for every custom layer.

The `Block` class contains a `params` variable of the `ParameterDict` type. This dictionary maps strings representing parameter names to model parameters (of the `Parameter` type). The `ParameterDict` also supplies a `get` function that makes it easy to generate a new parameter with a specified name and shape.

In [48]:
params = gluon.ParameterDict()
params.get('param2', shape=(2, 3))
params

(
  Parameter param2 (shape=(2, 3), dtype=<class 'numpy.float32'>)
)

We now have all the basic ingredients that we need to implement our own version of `Gluon`'s `Dense` layer. Recall that this layer requires two parameters, one to represent the `weight` and another for the `bias`. In this implementation, we bake in the ReLU activation as a default. In the `__init__` function, `in_units` and `units` denote the number of inputs and outputs, respectively. 

In [49]:
class MyDense(nn.Block):
    # units: the number of outputs in this layer
    # in_units: the number of inputs in this layer
    def __init__(self, units, in_units, **kwargs):
        super(MyDense, self).__init__(**kwargs)
        self.weight = self.params.get('weight', shape=(in_units, units))
        self.bias = self.params.get('bias', shape=(units,))

    def forward(self, x):
        linear = np.dot(x, self.weight.data(ctx=x.ctx)) + self.bias.data(ctx=x.ctx)
        return npx.relu(linear)

Next, we instantiate the MyDense class and access its model parameters.

In [50]:
dense = MyDense(units=3, in_units=5)
dense.params

mydense0_ (
  Parameter mydense0_weight (shape=(5, 3), dtype=<class 'numpy.float32'>)
  Parameter mydense0_bias (shape=(3,), dtype=<class 'numpy.float32'>)
)

We can directly carry out forward calculations using custom layers.

In [51]:
dense.initialize()
dense(np.random.uniform(size=(2, 5)))

array([[0.        , 0.1600183 , 0.        ],
       [0.03217274, 0.16958053, 0.        ]])

We can also construct models using custom layers. Once defined, a custom layer can be used just like the built-in dense layer.

In [52]:
net = nn.Sequential()
net.add(MyDense(8, in_units=64),
        MyDense(1, in_units=8))
net.initialize()
net(np.random.uniform(size=(2, 64)))

array([[0.00850829],
       [0.00838567]])
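
For reference, the computation carried out by `MyDense.forward` can be written out in plain NumPy. This is only a sketch with made-up numbers: `dense_relu` is a hypothetical helper, NumPy is imported as `onp` to avoid shadowing MXNet's `np`, and the weight uses the same `(in_units, units)` layout as `MyDense`.

```python
import numpy as onp  # plain NumPy; `onp` avoids shadowing MXNet's `np`

def dense_relu(x, weight, bias):
    """Compute relu(x @ weight + bias), with weight of shape (in_units, units)."""
    return onp.maximum(x @ weight + bias, 0)

x = onp.array([[1.0, -2.0]])              # one example, in_units = 2
weight = onp.array([[1.0, 0.0, -1.0],
                    [0.5, 1.0, 2.0]])     # shape (in_units=2, units=3)
bias = onp.array([0.5, 0.5, 0.5])

out = dense_relu(x, weight, bias)
# Pre-activations are [0.5, -1.5, -4.5]; ReLU clips the negative ones to 0
print(out)
```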

##### Summary
+ We can design custom layers via the Block class. This allows us to define flexible new layers that behave differently from any existing layers in the library.
+ Once defined, custom layers can be invoked in arbitrary contexts and architectures.
+ Blocks can have local parameters, which are stored in a ParameterDict object in each Block's params attribute.

##### Exercises
1. Design a layer that learns an affine transform of the data.
2. Design a layer that takes an input and computes a tensor reduction, i.e., it returns $y_k = \displaystyle\sum_{i, j} W_{ijk} x_i x_j$.
3. Design a layer that returns the leading half of the Fourier coefficients of the data.


## 5.5 File I/O
So far we discussed how to process data and how to build, train, and test deep learning models. However, at some point, we will hopefully be happy enough with the learned models that we will want to save the results for later use in various contexts (perhaps even to make predictions in deployment). Additionally, when running a long training process, the best practice is to periodically save intermediate results (checkpointing) to ensure that we do not lose several days' worth of computation if we trip over the power cord of our server. Thus it is time we learned how to load and store both individual weight vectors and entire models. This section addresses both issues.

### 5.5.1 Loading and Saving Tensors
For individual tensors, we can directly invoke the `load` and `save` functions to read and write them respectively. Both functions require that we supply a name, and `save` requires as input the variable to be saved.

In [54]:
# Make sure the output directory exists before writing to it
os.makedirs('./temp', exist_ok=True)
x = np.arange(4)
npx.save('./temp/x-file', x)

We can now read this data from the stored file back into memory.

In [55]:
x2 = npx.load('./temp/x-file')
x2

[array([0., 1., 2., 3.])]

We can store a list of tensors and read them back into memory.

In [56]:
y = np.zeros(4)
npx.save('./temp/x-files', [x, y])
x2, y2 = npx.load('./temp/x-files')
(x2, y2)

(array([0., 1., 2., 3.]), array([0., 0., 0., 0.]))

We can even write and read a dictionary that maps from strings to tensors. This is convenient when we want to read or write all the weights in a model.

In [57]:
mydict = {'x': x, 'y': y}
npx.save('./temp/mydict', mydict)
mydict2 = npx.load('./temp/mydict')
mydict2

{'x': array([0., 1., 2., 3.]), 'y': array([0., 0., 0., 0.])}

### 5.5.2 Model Parameters
Saving individual weight vectors (or other tensors) is useful, but it gets very tedious if we want to save (and later load) an entire model. After all, we might have hundreds of parameter groups sprinkled throughout. For this reason the framework provides built-in functionality to load and save entire networks. 

An important detail to note is that this saves model parameters and not the entire model. For example, if we have a 3-layer MLP, we need to specify the architecture separately. The reason for this is that the models themselves can contain arbitrary code, hence they cannot be serialized as naturally. Thus, in order to reinstate a model, we need to generate the architecture in code and then load the parameters from disk. Let us start with our familiar MLP.

In [58]:
class MLP(nn.Block):
    def __init__(self, **kwargs):
        super(MLP, self).__init__(**kwargs)
        self.hidden = nn.Dense(256, activation='relu')
        self.output = nn.Dense(10)

    def forward(self, x):
        return self.output(self.hidden(x))

net = MLP()
net.initialize()
x = np.random.uniform(size=(2, 20))
y = net(x)

Next, we store the parameters of the model as a file with the name `mlp.params`.

In [59]:
net.save_parameters('./temp/mlp.params')

To recover the model, we instantiate a clone of the original MLP model. Instead of randomly initializing the model parameters, we read the parameters stored in the file directly.

In [60]:
clone = MLP()
clone.load_parameters('./temp/mlp.params')

Since both instances have the same model parameters, the computation result of the same input $x$ should be the same. Let us verify this.

In [61]:
yclone = clone(x)
yclone == y

array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True]])
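
The split between "architecture in code" and "parameters on disk" is not specific to Gluon. The sketch below illustrates the same idea with plain NumPy (imported as `onp`) and `onp.savez`; the helpers `build_mlp_params` and `forward`, the parameter names, and the file name are all hypothetical stand-ins, not the format that `save_parameters` uses.

```python
import os
import tempfile
import numpy as onp  # plain NumPy; `onp` avoids shadowing MXNet's `np`

def build_mlp_params(rng):
    """Hypothetical stand-in for defining the architecture in code."""
    return {'hidden_weight': rng.normal(size=(256, 20)),
            'hidden_bias': onp.zeros(256),
            'output_weight': rng.normal(size=(10, 256)),
            'output_bias': onp.zeros(10)}

def forward(params, x):
    h = onp.maximum(x @ params['hidden_weight'].T + params['hidden_bias'], 0)
    return h @ params['output_weight'].T + params['output_bias']

original = build_mlp_params(onp.random.default_rng(0))
path = os.path.join(tempfile.mkdtemp(), 'mlp_params.npz')
onp.savez(path, **original)  # only the arrays are written, not the code

# Rebuild the same architecture (with different random values), then load
clone = build_mlp_params(onp.random.default_rng(1))
with onp.load(path) as stored:
    for name in clone:
        clone[name] = stored[name]  # overwrite with the saved values

x = onp.random.default_rng(2).uniform(size=(2, 20))
same = onp.array_equal(forward(original, x), forward(clone, x))
print(same)
```

As in the Gluon example, the clone reproduces the original outputs only because the code that defines the architecture was available when loading.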

##### Summary
+ The save and load functions can be used to perform File I/O for tensor objects.
+ We can save and load the entire sets of parameters for a network via a parameter dictionary.
+ Saving the architecture has to be done in code rather than in parameters.

##### Exercises
1. Even if there is no need to deploy trained models to a different device, what are the practical benefits of storing model parameters?
2. Assume that we want to reuse only parts of a network to be incorporated into a network of a different architecture. How would you go about using, say, the first two layers from a previous network in a new network?
3. How would you go about saving network architecture and parameters? What restrictions would you impose on the architecture?


## 5.6 GPUs
In the introduction, we discussed the rapid growth of computation over the past two decades. In a nutshell, GPU performance has increased by a factor of 1000 every decade since 2000. This offers great opportunity but it also suggests a significant need to provide such performance.

|Decade|Dataset|Memory|Floating Point Calculations per Second|
|:--|:-|:-|:-|
|1970|100 (Iris)|1 KB|100 KF (Intel 8080)|
|1980|1 K (House prices in Boston)|100 KB|1 MF (Intel 80186)|
|1990|10 K (optical character recognition)|10 MB|10 MF (Intel 80486)|
|2000|10 M (web pages)|100 MB|1 GF (Intel Core)|
|2010|10 G (advertising)|1 GB|1 TF (NVIDIA C2050)|
|2020|1 T (social network)|100 GB|1 PF (NVIDIA DGX-2)|

In this section, we begin to discuss how to harness this compute performance for your research, first by using a single GPU and, at a later point, by using multiple GPUs and multiple servers (with multiple GPUs).

Specifically, we will discuss how to use a single NVIDIA GPU for calculations. First, make sure you have at least one NVIDIA GPU installed. Then, download CUDA and follow the prompts to set the appropriate path. Once these preparations are complete, the `nvidia-smi` command can be used to view the graphics card information.

You might have noticed that `MXNet` tensors look almost identical to `NumPy` arrays. But there are a few crucial differences. One of the key features that distinguishes `MXNet` from `NumPy` is its support for diverse hardware devices.

In `MXNet`, every array has a context. So far, by default, all variables and associated computation have been assigned to the CPU. Typically, other contexts might be various GPUs. Things can get even hairier when we deploy jobs across multiple servers. By assigning arrays to contexts intelligently, we can minimize the time spent transferring data between devices. For example, when training neural networks on a server with a GPU, we typically prefer for the model's parameters to live on the GPU.

Next, we need to confirm that the GPU version of `MXNet` is installed. If a CPU version of `MXNet` is already installed, we need to uninstall it first. For example, use the `pip uninstall mxnet` command, then install the corresponding `MXNet` version according to your CUDA version. Assuming you have CUDA 9.0 installed, you can install the MXNet version that supports CUDA 9.0 via `pip install mxnet-cu90`. To run the programs in this section, you need at least two GPUs. 

Note that this might be extravagant for most desktop computers but it is easily available in the cloud, e.g., by using the AWS EC2 multi-GPU instances. Almost all other sections do not require multiple GPUs. Instead, this is simply to illustrate how data flows between different devices.

### 5.6.1 Computing Devices
We can specify devices, such as CPUs and GPUs, for storage and calculation. By default, tensors are created in main memory and the CPU is used to perform calculations on them.

In `MXNet`, the CPU and GPU can be indicated by `cpu()` and `gpu()`. It should be noted that `cpu()` (with or without an integer in the parentheses) means all physical CPUs and memory. This means that `MXNet`'s calculations will try to use all CPU cores. However, `gpu()` only represents one card and the corresponding memory. If there are multiple GPUs, we use `gpu(i)` to represent the $i^\mathrm{th}$ GPU ($i$ starts from 0). Also, `gpu(0)` and `gpu()` are equivalent.

In [62]:
npx.cpu(), npx.gpu(), npx.gpu(1)

(cpu(0), gpu(0), gpu(1))

We can query the number of available GPUs.

In [63]:
npx.num_gpus()

0

Now we define two convenient functions that allow us to run code even if the requested GPUs do not exist.

In [64]:
def try_gpu(i=0): 
    """Return gpu(i) if exists, otherwise return cpu()."""
    return npx.gpu(i) if npx.num_gpus() >= i + 1 else npx.cpu()

def try_all_gpus():
    """Return all available GPUs, or [cpu(),] if no GPU exists."""
    ctxes = [npx.gpu(i) for i in range(npx.num_gpus())]
    return ctxes if ctxes else [npx.cpu()]

try_gpu(), try_gpu(3), try_all_gpus()

(cpu(0), cpu(0), [cpu(0)])

### 5.6.2 Tensors and GPUs
By default, tensors are created on the CPU. We can query the device where the tensor is located.

In [65]:
x = np.array([1, 2, 3])
x.ctx

cpu(0)

It is important to note that whenever we want to operate on multiple terms, they need to be in the same context. For instance, if we sum two tensors, we need to make sure that both arguments live on the same device---otherwise the framework would not know where to store the result or even how to decide where to perform the computation.

##### Storage on the GPU
There are several ways to store a tensor on the GPU. For example, we can specify a storage device when creating a tensor. Next, we create the tensor variable `x` on the first GPU. When a GPU is available, printing `x` also shows the device information (the outputs shown here were produced on a machine without GPUs, so `try_gpu` fell back to the CPU). A tensor created on a GPU only consumes the memory of this GPU. We can use the `nvidia-smi` command to view GPU memory usage. In general, we need to make sure we do not create data that exceeds the GPU memory limit.

In [66]:
x = np.ones((2, 3), ctx=try_gpu())
x

array([[1., 1., 1.],
       [1., 1., 1.]])

Assuming you have at least two GPUs, the following code will create a random array on the second GPU.

In [67]:
y = np.random.uniform(size=(2, 3), ctx=try_gpu(1))
y

array([[0.77861947, 0.17202786, 0.0139584 ],
       [0.5919036 , 0.97415286, 0.9939629 ]])

##### Copying
If we want to compute $\mathbf{x} + \mathbf{y}$, we need to decide where to perform this operation. For instance, as shown in `Fig. 5.6.1`, we can transfer $\mathbf{x}$ to the second GPU and perform the operation there. Do not simply add $\mathbf{x} + \mathbf{y}$, since this will result in an exception. The runtime engine would not know what to do: it cannot find the data on the same device, so it fails.

<img src="images/05_02.png" style="width:400px;"/>

`copyto` copies the data to another device such that we can add them. Since $\mathbf{y}$ lives on the second GPU, we need to move $\mathbf{x}$ there before we can add the two.

In [68]:
z = x.copyto(try_gpu(1))
print(x)
print(z)

[[1. 1. 1.]
 [1. 1. 1.]]
[[1. 1. 1.]
 [1. 1. 1.]]


Now that the data is on the same GPU (both $\mathbf{z}$ and $\mathbf{y}$ are), we can add them up.

Imagine that your variable $\mathbf{z}$ already lives on your second GPU. What happens if we still call `z.copyto(gpu(1))`? It will make a copy and allocate new memory, even though that variable already lives on the desired device! There are times when, depending on the environment our code is running in, two variables may already live on the same device. So we only want to make a copy if the variables currently live in different contexts. In these cases, we can call `as_in_ctx()`. If the variable already lives in the specified context, then this is a no-op. Unless you specifically want to make a copy, `as_in_ctx()` is the method of choice.

In [69]:
z.as_in_ctx(try_gpu(1)) is z

True
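
The difference between the two methods can be summarized with a small toy class in plain Python. `ToyTensor` is a hypothetical stand-in, not `MXNet` code, and the device names are plain strings; it only mirrors the semantics described above, namely that `copyto` always allocates while `as_in_ctx` returns the same object when no move is needed.

```python
class ToyTensor:
    """Hypothetical stand-in for a tensor that remembers its device."""

    def __init__(self, values, ctx):
        self.values = list(values)
        self.ctx = ctx

    def copyto(self, ctx):
        # Always allocates a new object, even if ctx == self.ctx
        return ToyTensor(self.values, ctx)

    def as_in_ctx(self, ctx):
        # No-op when the tensor already lives in the requested context
        return self if ctx == self.ctx else self.copyto(ctx)

z = ToyTensor([1.0, 1.0, 1.0], ctx='gpu(1)')
print(z.copyto('gpu(1)') is z)     # False: a fresh copy was made
print(z.as_in_ctx('gpu(1)') is z)  # True: the same object is returned
print(z.as_in_ctx('gpu(0)') is z)  # False: a copy on the other device
```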

##### Side Notes
People use GPUs to do machine learning because they expect them to be fast. But transferring variables between contexts is slow. So we want you to be $100\%$ certain that you want to do something slow before we let you do it. If the framework just did the copy automatically without crashing then you might not realize that you had written some slow code.

Also, transferring data between devices (CPU, GPUs, other machines) is something that is much slower than computation. It also makes parallelization a lot more difficult, since we have to wait for data to be sent (or rather to be received) before we can proceed with more operations. This is why copy operations should be taken with great care. As a rule of thumb, many small operations are much worse than one big operation. Moreover, several operations at a time are much better than many single operations interspersed in the code (unless you know what you are doing). This is the case since such operations can block if one device has to wait for the other before it can do something else. It is a bit like ordering your coffee in a queue rather than pre-ordering it by phone and finding out that it is ready when you are.
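
The rule of thumb about many small operations can be illustrated with a back-of-the-envelope cost model. The sketch below assumes that each transfer costs a fixed latency plus a term proportional to its size; the latency and bandwidth values are illustrative assumptions, not measurements of any particular hardware.

```python
def transfer_time(num_bytes, latency_s, bandwidth_bytes_per_s):
    """Time for one transfer under a simple latency + size / bandwidth model."""
    return latency_s + num_bytes / bandwidth_bytes_per_s

LATENCY = 10e-6    # assumed fixed cost per transfer: 10 microseconds
BANDWIDTH = 10e9   # assumed bandwidth: 10 GB/s

# Moving 1000 float32 scalars one at a time vs. in a single transfer
many_small = 1000 * transfer_time(4, LATENCY, BANDWIDTH)
one_big = transfer_time(4 * 1000, LATENCY, BANDWIDTH)
print(many_small, one_big, many_small / one_big)
```

Under these assumptions the fixed per-transfer cost dominates, so batching the values into one transfer is far cheaper than sending them individually.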

Last, when we print tensors or convert tensors to the `NumPy` format, if the data is not in main memory, the framework will copy it to the main memory first, resulting in additional transmission overhead. Even worse, it is now subject to the dreaded `Global Interpreter Lock` that makes everything wait for Python to complete.

### 5.6.3 Neural Networks and GPUs
Similarly, a neural network model can specify devices. The following code puts the model parameters on the GPU. We will see many more examples of how to run models on GPUs in the following chapters, simply because the models will become somewhat more compute intensive.

In [70]:
net = nn.Sequential()
net.add(nn.Dense(1))
net.initialize(ctx=try_gpu())

When the input is a tensor on the GPU, `Gluon` will calculate the result on the same GPU.

In [71]:
net(x)

array([[0.06193921],
       [0.06193921]])

Let us confirm that the model parameters are stored on the same GPU.

In [72]:
net[0].weight.data().ctx

cpu(0)

In short, as long as all data and parameters are on the same device, we can learn models efficiently. In the following we will see several such examples.

##### Summary
+ We can specify devices for storage and calculation, such as CPU or GPU. By default, data are created in main memory and the CPU is used for calculations.
+ The framework requires all input data for calculation to be on the same device, be it CPU or the same GPU.
+ You can lose significant performance by moving data without care. A typical mistake is as follows: computing the loss for every minibatch on the GPU and reporting it back to the user on the command line (or logging it in a NumPy array) will trigger a global interpreter lock which stalls all GPUs. It is much better to allocate memory for logging inside the GPU and only move larger logs.

##### Exercises
1. Try a larger computation task, such as the multiplication of large matrices, and see the difference in speed between the CPU and GPU. What about a task with a small amount of calculations?
2. How should we read and write model parameters on the GPU?
3. Measure the time it takes to compute 1000 matrix-matrix multiplications of $100 \times 100$ matrices and log the matrix norm $\mathrm{tr} M M^\top$ one result at a time vs. keeping a log on the GPU and transferring only the final result.
4. Measure how much time it takes to perform two matrix-matrix multiplications on two GPUs at the same time vs. in sequence on one GPU (hint: you should see almost linear scaling).