**Deep Learning through Operations**

In this notebook, we'll take a look at the operations necessary to train neural networks. We'll implement everything in PyTorch and explicate each implementation as we go along. We'll start with CPU implementations; these ops generally have corresponding implementations for GPUs and other devices.

First we need to import PyTorch:

In [2]:
import torch

Now let's define an input tensor `x` and a ground-truth output tensor `y`; we'll define the model output tensor `Y` in the next step. We arbitrarily set `x` equal to a randomly sampled floating point number in the interval `[0, 1)` and `y` equal to `1`:

In [158]:
x = torch.rand(1)
print(x)
y = torch.tensor([1])
print(y)

tensor([0.9971])
tensor([1])
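
Because `x` is sampled at random, the values printed in this notebook will differ from run to run. A minimal sketch of how one might make the draws repeatable, assuming we seed PyTorch's global generator (the seed `0` and the names `first` and `second` are arbitrary choices for illustration, not part of the notebook):

```python
import torch

# Seed the global generator so repeated runs draw the same "random" value.
torch.manual_seed(0)
first = torch.rand(1)

# Re-seeding with the same value restarts the sequence.
torch.manual_seed(0)
second = torch.rand(1)

print(torch.equal(first, second))    # same seed, same draw
print(0.0 <= first.item() < 1.0)     # torch.rand samples from [0, 1)
```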


Now let's set the model output `Y` to the input `x` plus a constant `b`. As with `x`, we arbitrarily set `b` equal to a randomly sampled floating point number in the interval `[0, 1)`:

In [159]:
b = torch.rand(1)
print(b)
Y = x.add(b)
print(Y)

tensor([0.8354])
tensor([1.8325])


Let's pause and examine the `tensor` objects we've created. `tensor` is defined in `pytorch/c10/core`. In this directory, we see a bunch of `.h` and `.cpp` files that begin with `Tensor`. Header files define `tensor` attributes:

`TensorImpl.h`: implementation <br/>
`TensorOptions.h`: options <br/>
`TensorTypeId.h`: type IDs (e.g. for CPU, CUDA, XLA) <br/>
`TensorTypeSet.h`: "a representation of a set of TensorTypeIds"

Each of these has functions defined in a corresponding C++ file. From Python, we can read the attributes `dtype`, `device`, and `layout` on each tensor:

In [160]:
print(x.dtype, y.dtype, b.dtype, Y.dtype, sep=', ')
print(x.device, y.device, b.device, Y.device, sep=', ')
print(x.layout, y.layout, b.layout, Y.layout, sep=', ')

torch.float32, torch.int64, torch.float32, torch.float32
cpu, cpu, cpu, cpu
torch.strided, torch.strided, torch.strided, torch.strided
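
Note that `y` is `int64` while the other tensors are `float32`. As a rough illustration of how these attributes interact, the sketch below sets a `dtype` explicitly and mixes dtypes in one op. It assumes a recent PyTorch version with type promotion and the default dtype left at `float32`; the variable names are made up for this example:

```python
import torch

x_f32 = torch.rand(1)                             # float32 by default
y_i64 = torch.tensor([1])                         # Python ints become int64
y_f64 = torch.tensor([1], dtype=torch.float64)    # dtype set explicitly

# Adding a float32 tensor and an int64 tensor promotes the result to float32.
mixed = x_f32.add(y_i64)
print(mixed.dtype)

# .float() casts to float32; we use it further down for one of the losses.
print(y_i64.float().dtype)
print(y_f64.dtype, y_f64.device, y_f64.layout)
```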


We can apply unary operations to each tensor. In `pytorch/aten/src/ATen/native`, in `UnaryOps.h`, `DECLARE_DISPATCH` defines a `_stub` for each unary op within the `at::native` namespace. In the same directory, `UnaryOps.cpp` implements each of these operations. Some of these implementations are explicitly defined in this file; some are defined elsewhere; all have an `IMPLEMENT_UNARY_OP_VEC` and `DEFINE_DISPATCH` in this file. Let's apply these unary ops to our input `x`. Some ops (e.g. `angle`, `real`) will not work on `x`; we omit these.

In [161]:
print(x.abs())
print(x.acos())
print(x.asin())
print(x.atan())
print(x.cos())
print(x.cosh())
print(x.erf())
print(x.erfc())
print(x.erfinv())
print(x.exp())
print(x.frac())
print(x.log1p())
print(x.log2())
print(x.reciprocal())
print(x.sigmoid())
print(x.sin())
print(x.sinh())
print(x.sqrt())
print(x.tan())
print(x.tanh())
print(x.lgamma())

tensor([0.9971])
tensor([0.0757])
tensor([1.4951])
tensor([0.7840])
tensor([0.5427])
tensor([1.5397])
tensor([0.8415])
tensor([0.1585])
tensor([2.1086])
tensor([2.7105])
tensor([0.9971])
tensor([0.6917])
tensor([-0.0041])
tensor([1.0029])
tensor([0.7305])
tensor([0.8399])
tensor([1.1708])
tensor([0.9986])
tensor([1.5476])
tensor([0.7604])
tensor([0.0017])
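
A few of these outputs can be sanity-checked against one another. The sketch below is not part of the original run; it draws a fresh `x` and tests some standard identities, which also explains why `x.frac()` printed the same value as `x` above:

```python
import torch

x = torch.rand(1)

# The method form and the function form call the same op.
same_call = torch.equal(x.exp(), torch.exp(x))

# erf and erfc are complements: erf(x) + erfc(x) = 1.
complement = torch.allclose(x.erf() + x.erfc(), torch.ones(1))

# sigmoid(x) = 1 / (1 + exp(-x)).
sigmoid_ok = torch.allclose(x.sigmoid(), 1 / (1 + (-x).exp()))

# frac drops the integer part, so values in [0, 1) come back unchanged.
frac_ok = torch.equal(x.frac(), x)

print(same_call, complement, sigmoid_ok, frac_ok)
```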


We can also apply binary operations to multiple tensors. Binary ops are defined (natively) in `pytorch/aten/src/ATen/native`, in `BinaryOps.h` and `BinaryOps.cpp`. Some unary and binary ops are still defined in the legacy `pytorch/aten/src/TH` directory, in the (awkwardly titled) files `THTensorMath.cpp`, `THTensorMoreMath.cpp`, and `THTensorEvenMoreMath.cpp`. We won't worry about these for now.

Let's call our (native) binary ops on `x` and `b`. We can do this either by calling the binary op as a method of `x` and passing `b` as the argument, or by passing `x` and `b` as arguments to the corresponding `torch` function. For instance, we could call `x.add(b)` or `torch.add(x, b)`. Let's stick with the former since it's more concise:

In [162]:
print(x.add(b))
print(x.sub(b))
print(x.mul(b))
print(x.div(b))
print(x.atan2(b))

tensor([1.8325])
tensor([0.1618])
tensor([0.8330])
tensor([1.1937])
tensor([0.8735])
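
To confirm that the method form, the function form, and the Python operators agree, here is a small check. It hard-codes `x` and `b` to the rounded values printed above so the result is repeatable; the names `pairs` and `all_match` are just for this sketch:

```python
import torch

# Rounded copies of the x and b printed earlier in the notebook.
x = torch.tensor([0.9971])
b = torch.tensor([0.8354])

# (method form, function form, operator form) for each binary op.
pairs = [
    (x.add(b), torch.add(x, b), x + b),
    (x.sub(b), torch.sub(x, b), x - b),
    (x.mul(b), torch.mul(x, b), x * b),
    (x.div(b), torch.div(x, b), x / b),
]
all_match = all(torch.equal(m, f) and torch.equal(m, o) for m, f, o in pairs)
print(all_match)

# atan2 has no operator, only the method and function forms.
print(torch.equal(x.atan2(b), torch.atan2(x, b)))
```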


Unlike unary ops, many binary ops in PyTorch are still defined in the legacy `TH` directory. However, `add`, the operation we used to add `b` to `x` and produce our model output `Y`, is among the ops defined natively in `BinaryOps.cpp`.

To understand how PyTorch computes `x.add(b)`, we have to navigate to `pytorch/aten/src/ATen/native/cpu`. This directory contains (among other "kernels", including `UnaryOpsKernel.cpp`) our `BinaryOpsKernel.cpp` file. Here we see the `void` function `add_kernel` calls *another* function `vec256::fmadd`, defined in `pytorch/aten/src/ATen/cpu/vec256`, in `vec256_base.h`, which returns `a * b + c` for `const` inputs `a`, `b`, and `c`. Each of these inputs references `typename T`, defined as a generic template in `vec256_base.h`, and defined for other data types in (e.g. for floating point vectors) `vec256_float.h` and `vec256_double.h`. For our simple example (adding two `float32` vectors of length 1), `x.add(b)` is equivalent to `x + b`.
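
The `a * b + c` shape of `fmadd` hints that `add` carries a scaling factor. At the Python level this appears as the keyword argument `alpha`, which defaults to `1`. The sketch below only exercises that public argument; how `alpha` is wired into `fmadd` inside the kernel is our reading of the code, not something this example demonstrates:

```python
import torch

# Rounded copies of the x and b printed earlier in the notebook.
x = torch.tensor([0.9971])
b = torch.tensor([0.8354])

default_add = x.add(b)             # alpha defaults to 1: x + 1 * b
scaled_add = x.add(b, alpha=2)     # x + 2 * b

print(torch.allclose(default_add, x + b))
print(torch.allclose(scaled_add, x + 2 * b))
```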

What if we want to measure the loss between our target output `y` and our (predicted) model output `Y`? PyTorch has several loss functions defined in `Loss.cpp`. We call these as functions of `torch` and pass `Y` and `y` as arguments. For each of these, we pass our model output `Y` first and our *target* output `y` second. Some of these loss functions require more than two inputs; we omit these here. Another (`binary_cross_entropy_with_logits`) requires us to cast our target to a `float`; we do that.

In [163]:
print(torch.hinge_embedding_loss(Y, y))
print(torch.binary_cross_entropy_with_logits(Y, y.float()))

tensor([1.8325])
tensor([0.1484])


Let's look at how `binary_cross_entropy_with_logits` computes its result in `Loss.cpp`. Since we don't pass the optional arguments `weight` and `pos_weight`, `loss` is defined as `(1 - target).mul_(input).add_(max_val).add_((-max_val).exp_().add_((-input -max_val).exp_()).log_())`. The function then calls `apply_loss_reduction` (defined in the same `Loss.cpp` file) on `loss` and `reduction`, which we also leave at its default rather than passing it explicitly.
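
We can reproduce that expression with out-of-place ops and compare it to the library result. This is a sketch under one assumption: that `max_val` is `max(-input, 0)`, the usual device for keeping `exp` from overflowing. The names `inp`, `manual`, and `reference` are ours, and we use the public `torch.nn.functional` entry point with `reduction='none'` so that no reduction is applied:

```python
import torch
import torch.nn.functional as F

inp = torch.tensor([1.8325])     # rounded copy of the model output Y printed above
target = torch.tensor([1.0])

# Assumption: max_val = max(-input, 0), so no exponent below is positive.
max_val = (-inp).clamp(min=0)

# Out-of-place version of the expression quoted from Loss.cpp.
manual = (1 - target) * inp + max_val + ((-max_val).exp() + (-inp - max_val).exp()).log()

reference = F.binary_cross_entropy_with_logits(inp, target, reduction='none')
print(manual, reference)
```

With these inputs both values should round to the `0.1484` printed above.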

To update our *bias* parameter `b`, we first need to compute our *gradient*. This is simply the partial derivative of our loss function with respect to `b`.

To illustrate this, let's re-initialize `b` with the argument `requires_grad=True`. This tells PyTorch to track operations on `b` and to store its gradient in `b.grad` when we backpropagate. We don't need to specify `requires_grad=True` for `Y`. PyTorch will automatically define a `grad_fn` equal to `<AddBackward0>` since `Y = x.add(b)`. This `grad_fn` depends on the operation that produced the tensor. For instance, if we define `Y = x.mul(b)`, our `grad_fn` becomes `<MulBackward0>`.

In [170]:
x = torch.rand(1)
b = torch.rand(1, requires_grad=True)
Y = x.add(b)
print(x)
print(b)
print(Y)

tensor([0.2604])
tensor([0.9283], requires_grad=True)
tensor([1.1887], grad_fn=<AddBackward0>)
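
The claim about `grad_fn` is easy to verify. The sketch below, with variable names of our own choosing, builds one tensor with `add` and one with `mul`, and also shows that a result gets no `grad_fn` when none of its inputs requires a gradient:

```python
import torch

x = torch.rand(1)
b = torch.rand(1, requires_grad=True)

added = x.add(b)
multiplied = x.mul(b)

# grad_fn records the op that produced each tensor.
print(type(added.grad_fn).__name__)
print(type(multiplied.grad_fn).__name__)

# requires_grad propagates from b to the result, even though x does not track gradients.
print(x.requires_grad, added.requires_grad)

# With no tracked inputs there is nothing to backpropagate through.
untracked = x.add(torch.rand(1))
print(untracked.grad_fn)
```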


If we check our gradient, we'll find that it hasn't been computed yet:

In [171]:
print(b.grad)

None


Now let's compute our loss using `binary_cross_entropy_with_logits`:

In [172]:
loss = torch.binary_cross_entropy_with_logits(Y, y.float())
print(loss)

tensor([0.2659], grad_fn=<BinaryCrossEntropyWithLogitsBackward>)


We compute gradients by calling `backward` on `loss`:

In [173]:
loss.backward()

Now we see that our gradient has been populated:

In [174]:
print(b.grad)

tensor([-0.2335])
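
We can check this number by hand. For binary cross entropy on logits, the derivative of the loss with respect to the logit `Y` is `sigmoid(Y) - y`, and since `Y = x + b`, the derivative of `Y` with respect to `b` is `1`. The sketch below hard-codes the rounded `x` and `b` printed above and uses the public `torch.nn.functional` entry point; `analytic` is a name we introduce for the hand-derived value:

```python
import torch
import torch.nn.functional as F

# Rounded copies of the values printed above.
x = torch.tensor([0.2604])
b = torch.tensor([0.9283], requires_grad=True)
y = torch.tensor([1.0])

Y = x.add(b)
loss = F.binary_cross_entropy_with_logits(Y, y)
loss.backward()

# dloss/db = (sigmoid(Y) - y) * dY/db, and dY/db = 1.
analytic = (Y.sigmoid() - y).detach()
print(b.grad, analytic)
```

Both values should round to the `-0.2335` printed above.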


We can then use this gradient to update our value of `b`, most simply by scaling the gradient by a *learning rate* and subtracting this value from the prior value of `b`.
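
A minimal sketch of that update, assuming a learning rate of `0.1` (an arbitrary value we picked for illustration) and the rounded `x` and `b` printed above. We wrap the update in `torch.no_grad()` so the update itself isn't recorded for backpropagation, then reset the gradient so it doesn't accumulate into the next backward pass:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([0.2604])
b = torch.tensor([0.9283], requires_grad=True)
y = torch.tensor([1.0])
learning_rate = 0.1    # hypothetical value, chosen only for this example

loss_before = F.binary_cross_entropy_with_logits(x.add(b), y)
loss_before.backward()

# Step against the gradient without recording the update in the graph.
with torch.no_grad():
    b -= learning_rate * b.grad

# Gradients accumulate across backward calls, so clear them after the step.
b.grad.zero_()

loss_after = F.binary_cross_entropy_with_logits(x.add(b), y)
print(b)
print(loss_before.item(), loss_after.item())
```

Since the gradient is negative, the step increases `b`, and the loss after the update should be slightly lower than before.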

Now let's make our function *slightly* more complex. We'll add a *weight* `w` such that `Y = w * x + b`. We'll also initialize `w` by randomly sampling over the interval `[0, 1)`. Again, we'll assume that our target value `y` is `1`.

In [175]:
x = torch.rand(1)
w = torch.rand(1)
b = torch.rand(1, requires_grad=True)
Y = x.mul(w).add(b)
print(x)
print(w)
print(b)
print(Y)

tensor([0.2901])
tensor([0.5289])
tensor([0.7278], requires_grad=True)
tensor([0.8812], grad_fn=<AddBackward0>)
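
Note that the cell above only tracks gradients for `b`. If we also wanted a gradient for `w`, one option would be to create it with `requires_grad=True` as well. The sketch below does this with the rounded values printed above (a variation on the cell, not a rerun of it) and compares the results against the chain rule, where the derivative of `Y` with respect to `w` is `x`:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([0.2901])
w = torch.tensor([0.5289], requires_grad=True)    # unlike the cell above, track w too
b = torch.tensor([0.7278], requires_grad=True)
y = torch.tensor([1.0])

Y = x.mul(w).add(b)
loss = F.binary_cross_entropy_with_logits(Y, y)
loss.backward()

# dloss/dY = sigmoid(Y) - y; then dY/dw = x and dY/db = 1.
residual = (Y.sigmoid() - y).detach()
print(w.grad, residual * x)
print(b.grad, residual)
```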
