# Music machine learning - Pytorch

### Author: Philippe Esling (esling@ircam.fr)

In this course, we will cover:
1. A [quick introduction](#intro) to Pytorch
2. An implementation of [advanced models](#models)
3. A quick proposal for [attention](#attention) layers

## Introduction to Pytorch

`Pytorch` is a Python-based scientific computing package targeted at deep learning, which offers great flexibility and ease of use for GPU computation. `Pytorch` is built around the concept of the `Tensor`, which is very similar to `numpy.ndarray` but can run seamlessly on a GPU.

Here are some examples of different ways to create a `Tensor`:

In [None]:
import torch
# Create an uninitialized 5 x 3 Tensor (its contents are arbitrary, not zeros)
x = torch.empty(5, 3)
# Create a 64 x 3 x 32 x 32 random Tensor
x = torch.rand(64, 3, 32, 32)
# Create a Tensor of zeros with _long_ type
x = torch.zeros(10, 10, dtype=torch.long)
# Construct a Tensor from the data
x = torch.tensor([5.5, 3])

You can also create a tensor based on an existing tensor. These methods
reuse properties of the input tensor, e.g. dtype, unless
new values are provided by the user.



In [None]:
x = x.new_ones(8, 2, dtype=torch.double)      # new_* methods take in sizes
x = torch.randn_like(x, dtype=torch.float)    # override dtype!
print(x.size())
print(x.shape[0])
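
As a small sketch of the point above (the variable names `base`, `a`, `b` and `c` are only illustrative), the following shows which properties are inherited from an existing tensor and which can be overridden:

```python
import torch

# A reference tensor with a non-default dtype
base = torch.zeros(2, 3, dtype=torch.double)

# new_* methods take an explicit size but inherit the dtype (and device)
a = base.new_ones(4, 4)
print(a.shape, a.dtype)

# *_like functions inherit the size as well as the dtype
b = torch.randn_like(base)
print(b.shape, b.dtype)

# Any inherited property can still be overridden explicitly
c = torch.randn_like(base, dtype=torch.float)
print(c.shape, c.dtype)
```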

#### Arithmetic operations

Tensors support a large library of arithmetic operations, available both as operators and as functions:

In [None]:
x = torch.rand(8, 2)
y = torch.rand(8, 2)
z = torch.rand(2, 4)
# Equivalent additions
print(x + y)
print(torch.add(x, y))
# Add in place
x.add_(y)
print(x)
# Write the result into a preallocated Tensor of matching shape
result = torch.empty(8, 2)
torch.add(x, y, out=result)
print(result)
# Element-wise multiplication
print(x * y)
# Matrix product
print(x @ z)
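
To make the difference between these operations concrete, here is a short self-contained sketch that checks the resulting shapes and the effect of an in-place call. The names `elementwise`, `product` and `before` are introduced only for this illustration.

```python
import torch

x = torch.rand(8, 2)
y = torch.rand(8, 2)
z = torch.rand(2, 4)

# Element-wise operations keep the shape of their operands
elementwise = x * y
print(elementwise.shape)

# The matrix product contracts the inner dimension: (8, 2) @ (2, 4) -> (8, 4)
product = x @ z
print(product.shape)

# Methods with a trailing underscore modify the tensor itself
before = x.clone()
x.add_(y)
print(torch.allclose(x, before + y))
```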

#### Slicing and resizing

You can slice tensors using the usual Python indexing operators. To resize or reshape a tensor, you can use ``Tensor.view`` or ``torch.reshape``.

In [None]:
print(x[:, 1])
x = torch.randn(4, 4)
y = x.view(16)
z = x.view(-1, 8)  # the size -1 is inferred from other dimensions
print(x.size(), y.size(), z.size())
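
The two reshaping functions are not fully interchangeable. The sketch below (with illustrative variable names) suggests the practical difference: a view always shares memory with the original tensor and therefore requires a compatible memory layout, whereas `reshape` falls back to a copy when a view is not possible.

```python
import torch

x = torch.arange(16.0).view(4, 4)

# A view shares its storage with the original tensor
y = x.view(16)
y[0] = 100.0
print(x[0, 0].item())

# A transposed tensor is no longer contiguous in memory
t = x.t()
print(t.is_contiguous())

# view() refuses to flatten it, since that would require a copy
view_failed = False
try:
    t.view(16)
except RuntimeError as err:
    view_failed = True
    print("view failed:", err)

# reshape() succeeds, copying the data when needed
r = t.reshape(16)
print(r.shape)
```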

If you have a one-element tensor, use ``.item()`` to get the value as a
Python number.



In [None]:
x = torch.randn(1)
print(x)
print(x.item())

Tensors have more than 100 operations, including transposing, indexing, slicing, mathematical operations, linear algebra, random numbers, which are all described at [https://pytorch.org/docs/torch](https://pytorch.org/docs/torch)

#### Numpy bridge

Converting a Torch Tensor to a Numpy array and vice versa is extremely simple. Note that the Pytorch Tensor and Numpy array **will share their underlying memory locations** (if the Tensor is on CPU), and changing one will change the other.

In [None]:
a = torch.ones(5)
b = a.numpy()
a.add_(1)
print(a)
print(b)
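
The conversion in the other direction works the same way. As a minimal sketch, `torch.from_numpy` produces a tensor that shares memory with the array, whereas `torch.tensor` makes an independent copy:

```python
import numpy as np
import torch

a = np.ones(5)

# from_numpy shares memory with the NumPy array
b = torch.from_numpy(a)
np.add(a, 1, out=a)   # modify the array in place
print(a)
print(b)              # the tensor sees the change

# torch.tensor copies the data, so later changes are not reflected
c = torch.tensor(a)
np.add(a, 1, out=a)
print(c)
```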

#### Going GPU

Tensors can be moved onto any device using the ``.to`` method.

In [None]:
# let us run this cell only if CUDA is available
# We will use ``torch.device`` objects to move tensors in and out of GPU
if torch.cuda.is_available():
    device = torch.device("cuda")          # a CUDA device object
    y = torch.ones_like(x, device=device)  # directly create a tensor on GPU
    x = x.to(device)                       # or just use strings ``.to("cuda")``
    z = x + y
    print(z)
    print(z.to("cpu", torch.double))       # ``.to`` can also change dtype together!
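
A common way to write code that runs with or without a GPU is to select the device once and pass it around. This is only a sketch of that pattern, not a requirement of the library:

```python
import torch

# Pick the GPU if there is one, and fall back to the CPU otherwise
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create tensors directly on the selected device
x = torch.randn(4, 4, device=device)
y = torch.ones_like(x)   # *_like functions inherit the device of x

# Operations run on the device holding their operands
z = x + y
print(z.device)

# Move the result back to the CPU before converting to NumPy
z_cpu = z.cpu()
arr = z_cpu.numpy()
print(arr.shape)
```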

## Computation Graphs

The concept of a computation graph is essential to efficient deep learning programming, because it spares you from writing the backpropagation gradients yourself. A computation graph is simply a specification of how your data is combined to give you the output (the forward pass). Since the graph fully specifies which parameters were involved in which operations, it contains enough information to compute derivatives.

The ``requires_grad`` flag lets you specify which variables need differentiation in all these operations. If ``requires_grad=True``, the Tensor object keeps track of how it was created.

In [None]:
# Tensor factory methods have a ``requires_grad`` flag
x = torch.tensor([1., 2., 3], requires_grad=True)
# With requires_grad=True, we can still do all the operations 
y = torch.tensor([4., 5., 6], requires_grad=True)
z = x + y
print(z)
# But z now knows something extra.
print(z.grad_fn)

Therefore, `z` knows that it is the direct result of an addition. Furthermore, if we keep following `z.grad_fn`, we can even trace our way back to both `x` and `y`. But how does that help us compute a gradient?

In [None]:
# Let's sum up all the entries in z
s = z.sum()
print(s)
print(s.grad_fn)
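
The chain of `grad_fn` references can also be followed by hand. The sketch below walks the graph backwards from `s` through the `next_functions` attribute of each node; the names `add_node` and `leaves` are introduced only for this illustration.

```python
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = torch.tensor([4., 5., 6.], requires_grad=True)
z = x + y
s = z.sum()

# s was produced by a sum ...
print(s.grad_fn)

# ... whose input was produced by an addition ...
add_node = s.grad_fn.next_functions[0][0]
print(add_node)

# ... whose two inputs are the leaf tensors x and y.
# Leaves are represented by AccumulateGrad nodes holding the tensor itself.
leaves = [node.variable for node, _ in add_node.next_functions]
print(leaves)
```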

So now, what is the derivative of this sum with respect to the first
component of x? In math, we want

\begin{align}\frac{\partial s}{\partial x_0}\end{align}



Well, s knows that it was created as a sum of the tensor z. z knows
that it was the sum x + y. So

\begin{align}s = \overbrace{x_0 + y_0}^\text{$z_0$} + \overbrace{x_1 + y_1}^\text{$z_1$} + \overbrace{x_2 + y_2}^\text{$z_2$}\end{align}

And so s contains enough information to determine that the derivative we want is 1. We can have Pytorch compute the gradient, and see that we were right:

**Note** : If you run this block multiple times, the gradient will increment. That is because Pytorch *accumulates* the gradient into the .grad property, since for many models this is very convenient.

In [None]:
# calling .backward() on any variable will run backprop, starting from it.
s.backward()
print(x.grad)
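
The accumulation behaviour mentioned in the note can be observed directly. In this sketch, the graph is rebuilt before each backward pass, and `zero_()` is used to reset the gradient between passes (the names `first` and `second` are illustrative):

```python
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = torch.tensor([4., 5., 6.], requires_grad=True)

# First backward pass: ds/dx_i = 1 for every component
(x + y).sum().backward()
first = x.grad.clone()
print(first)

# A second pass adds its gradient to the existing one
(x + y).sum().backward()
second = x.grad.clone()
print(second)

# Resetting the gradient gives a clean result again
x.grad.zero_()
(x + y).sum().backward()
print(x.grad)
```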

Understanding what is going on in the block below is crucial for being a
successful programmer in deep learning.




In [None]:
x = torch.randn(2, 2)
y = torch.randn(2, 2)
# By default, user created Tensors have ``requires_grad=False``
print(x.requires_grad, y.requires_grad)
z = x + y
# So you can't backprop through z
print(z.grad_fn)
# ``.requires_grad_( ... )`` changes an existing Tensor's ``requires_grad``
x = x.requires_grad_()
y = y.requires_grad_()
# z contains enough information to compute gradients, as we saw above
z = x + y
print(z.grad_fn)
# If any input to an operation has ``requires_grad=True``, so will the output
print(z.requires_grad)
# Now z has the computation history, which we can **detach**
new_z = z.detach()
# Which means that we have no gradient attached anymore
print(new_z.grad_fn)

You can also stop autograd from tracking history on Tensors
with ``.requires_grad=True`` by wrapping the code block in
``with torch.no_grad():``



In [None]:
print(x.requires_grad)
print((x ** 2).requires_grad)
with torch.no_grad():
    print((x ** 2).requires_grad)
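
One typical use of ``torch.no_grad()`` is updating parameters by hand, since the update itself should not become part of the graph. The following is a minimal sketch of a single gradient descent step, assuming a toy loss and a step size of 0.1 chosen purely for illustration:

```python
import torch

w = torch.randn(3, requires_grad=True)   # a toy "parameter"
x = torch.randn(3)                        # some input data

# Toy loss and its gradient with respect to w
loss = (w * x).sum() ** 2
loss.backward()

w_before = w.detach().clone()
grad = w.grad.clone()

# The update is not tracked, so w stays a leaf of the graph
with torch.no_grad():
    w -= 0.1 * w.grad
    w.grad.zero_()

print(w.requires_grad)
print(w.grad_fn)
```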

## Defining networks 

Here, we briefly recall that in `PyTorch`, the `nn` package provides higher-level abstractions over raw computational graphs that are useful for building neural networks. The `nn` package defines a set of `Modules`, which are roughly equivalent to neural network layers. A `Module` receives input `Tensors` and computes output `Tensors`, but may also hold internal state such as `Tensors` containing learnable parameters. In the following example, we use the `nn` package to show how easy it is to instantiate a three-layer network

In [None]:
import torch
import torch.nn as nn
# Define the input dimensions
in_size = 1000
# Number of neurons in a layer
hidden_size = 100
# Output (target) dimension
output_size = 10
# Use the nn package to define our model as a sequence of layers
model = nn.Sequential(
    nn.Linear(in_size, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, hidden_size),
    nn.Tanh(),
    nn.Linear(hidden_size, output_size),
    # Normalize over the output dimension (dim 0 is the batch)
    nn.Softmax(dim=1)
)
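
To check that such a model behaves as expected, we can push a random batch through it. This sketch redefines the same network so that it runs on its own; the batch size of 64 is an arbitrary choice.

```python
import torch
import torch.nn as nn

in_size, hidden_size, output_size = 1000, 100, 10
model = nn.Sequential(
    nn.Linear(in_size, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, hidden_size),
    nn.Tanh(),
    nn.Linear(hidden_size, output_size),
    nn.Softmax(dim=1),
)

# A random batch of 64 input vectors
batch = torch.randn(64, in_size)
out = model(batch)
print(out.shape)

# Thanks to the softmax, each output row sums to 1
print(out.sum(dim=1))

# Count the learnable parameters (weights and biases of the linear layers)
n_params = sum(p.numel() for p in model.parameters())
print(n_params)
```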

As we have seen in the slides, we can just as easily mix pre-defined modules with arithmetic operations. Here, we define a *residual* block, and then combine several of them into a more complex network.

In [None]:
class ResBlock(nn.Module):
    def __init__(self, dim, dim_res=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(dim, dim_res, 3, 1, 1),
            nn.ReLU(True),
            nn.Conv2d(dim_res, dim, 1),
            nn.ReLU(True)
        )

    def forward(self, x):
        return x + self.block(x)

model = nn.Sequential(
    ResBlock(64, 32),
    ResBlock(64, 32),
)
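
Because the block adds its input to its output, both must have the same shape. The sketch below repeats the block definition so that it runs on its own, and verifies this property on a random input whose size (2 images, 64 channels, 16 x 16 pixels) is an arbitrary assumption.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # Same block as above, repeated so this example is self-contained
    def __init__(self, dim, dim_res=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(dim, dim_res, 3, 1, 1),
            nn.ReLU(True),
            nn.Conv2d(dim_res, dim, 1),
            nn.ReLU(True)
        )

    def forward(self, x):
        return x + self.block(x)

model = nn.Sequential(
    ResBlock(64, 32),
    ResBlock(64, 32),
)

# (batch, channels, height, width)
x = torch.randn(2, 64, 16, 16)
out = model(x)

# Padding of 1 with a 3 x 3 kernel preserves the spatial size,
# and the 1 x 1 convolution restores the channel count
print(x.shape, out.shape)
```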

### Defining our own layers

In the following, we re-implement a simple *attention* layer, the mechanism at the heart of the famous `Transformer` models.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLayer(nn.Module):
    def __init__(self, n_hidden):
        super().__init__()
        self.mlp = nn.Linear(n_hidden, n_hidden)
        # Learnable context vector used to score each timestep
        self.u_w = nn.Parameter(torch.rand(n_hidden))

    def forward(self, X):
        # X is assumed to have shape (batch, time, n_hidden)
        # get the hidden representation of the sequence
        u_it = torch.tanh(self.mlp(X))
        # get attention weights for each timestep: (batch, time)
        alpha = F.softmax(torch.matmul(u_it, self.u_w), dim=1)
        # get the weighted sum of the sequence over time: (batch, n_hidden)
        out = torch.sum(alpha.unsqueeze(-1) * X, dim=1)
        return out, alpha
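
To see what such a layer computes, it can help to follow the tensor shapes step by step. The sketch below reproduces the same operations outside of a module, assuming an input of shape (batch, time, hidden) and reading the final step as a weighted sum over time; the sizes 4, 10 and 16 are arbitrary.

```python
import torch
import torch.nn as nn

batch, steps, n_hidden = 4, 10, 16
X = torch.randn(batch, steps, n_hidden)

mlp = nn.Linear(n_hidden, n_hidden)
u_w = torch.rand(n_hidden)   # context vector scoring each timestep

# Hidden representation of every timestep: (4, 10, 16)
u_it = torch.tanh(mlp(X))

# One scalar score per timestep: (4, 10)
scores = torch.matmul(u_it, u_w)

# Normalize the scores over time so that they sum to 1: (4, 10)
alpha = torch.softmax(scores, dim=1)

# Weight each timestep and sum over time: (4, 16)
out = torch.sum(alpha.unsqueeze(-1) * X, dim=1)

print(u_it.shape, scores.shape, alpha.shape, out.shape)
```

The output keeps one vector per sequence, so the layer acts as a learned pooling over the time dimension.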
