# Automatic Differentiation
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/eleni-vasilaki/rl-notes/blob/main/notebooks/07_deltarule.ipynb)
## Evaluating partial derivatives efficiently

Gradient-based optimisation is ubiquitous in machine learning. Driven by the backpropagation algorithm, neural networks are able to minimise loss functions by gradually stepping parameters in the direction that reduces the overall error of the network. To calculate the direction in which we should change our parameters, we first need to consider how the loss changes in response to a change in parameters. This is quantified by the partial derivative of the loss, $L$, with respect to a given parameter, $\theta$: $\frac{\partial L}{\partial \theta}$.

Since neural networks feature many parameters (reportedly on the order of 1.8 trillion for GPT-4), calculating these gradients efficiently is crucial for optimising implementations. At the heart of this process is automatic differentiation, an algorithmic approach for calculating the exact (to floating-point precision) partial derivatives of a function's outputs with respect to its inputs or parameters.

This process requires no symbolic representation of derivatives, nor does it rely on numerical approximation. Instead, it propagates gradient information through a computational graph consisting of intermediate variables (the intermediate values a program computes while executing a function) and elementary mathematical operations (plus, minus, divide, etc.) or functions (sin, cos, exp) with known derivatives. Compound operations are handled sequentially via the chain rule, and the gradient with respect to a specific input of the function is carried as a numerical value: \\
$\frac{\partial f(g(x))}{\partial x} = f'(g(x))\cdot g'(x)$
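
To make the contrast with numerical estimation concrete, here is a minimal sketch. The choice of $f = \sin$ and $g(x) = x^2$, and the helper names `chain_rule_derivative` and `central_difference`, are assumptions made purely for illustration. It evaluates the chain-rule expression directly and compares it with a finite-difference estimate, which is only approximate and depends on the step size `h`.

```python
import math

def g(x):
    return x ** 2

def f_of_g(x):
    # The composed function f(g(x)) with f = sin
    return math.sin(g(x))

def chain_rule_derivative(x):
    # f'(g(x)) * g'(x), with f' = cos and g'(x) = 2x
    return math.cos(g(x)) * (2 * x)

def central_difference(func, x, h=1e-5):
    # Numerical estimate of the derivative; accuracy depends on h
    return (func(x + h) - func(x - h)) / (2 * h)

x = 1.5
exact = chain_rule_derivative(x)
approx = central_difference(f_of_g, x)
print(f"chain rule: {exact}, finite difference: {approx}")
print(f"absolute difference: {abs(exact - approx)}")
```

The two values agree closely but not exactly; automatic differentiation applies rules like `chain_rule_derivative` at every node, so it avoids the step-size error of the finite-difference estimate.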

In the lecture, we covered how a computational graph can be constructed by breaking down a mathematical function in the same way a computer would process it: single mathematical operations on a (pair of) variable(s). Below, we will outline the process for generating a computational graph.

## Building Computational Graphs

Consider the function:\
\
$f(x_0,x_1) = (\frac{x_0^2}{x_1})\cdot \mathrm{cos}(x_0) + e^{\frac{x_0}{x_1}}$ \\
\
We can break this function down into a series of operations, represented by intermediate variables, $v_i$, each of which applies either an elementary function to one variable or an elementary operation to a pair of variables.

First we will start with mapping each of our inputs to an intermediate variable: \\
\
$v_0 = x_0$ \\
$v_1 = x_1$ \\
\
From here, we can see that the fraction $\frac{x_0}{x_1}$ appears twice (in one case multiplied again by $x_0$). We can then express these new intermediate variables in terms of the previous intermediate variables: \\
\
$v_2 = \frac{v_0}{v_1}$ \\
$v_3 = v_0 \cdot v_2$ \\
\
No repeated terms remain, so we step through the function according to order of mathematical operations (BIDMAS/BODMAS/PEMDAS): \\
\
$v_4 = e^{v_2}$ \\
$v_5 = \mathrm{cos}(v_0)$ \\
$v_6 = v_3 \cdot v_5$ \\
$v_7 = v_4 + v_6$ \\
\
We have arrived at the end of the function, with output equal to $v_7$. \\

# Graphical Expression
To convey this information visually, we can draw a graph of how these intermediate variables relate to one another. Nodes on the graph represent intermediate variables, and edges denote dependence between variables. An arrow pointing into an intermediate variable means that this variable takes the connected node as an input. An arrow pointing out of an intermediate variable shows that another intermediate variable depends upon it:
\
<img src='https://drive.google.com/uc?id=1SycSLZ8kvsyQgo9_--S_PmvOpH5W-ZNm'>



By looking at the intermediate variables we generated above for the given function, we can use this notation to plot a computational graph which represents our function:
\
$f(x_0,x_1) = (\frac{x_0^2}{x_1})\cdot \mathrm{cos}(x_0) + e^{\frac{x_0}{x_1}}$ \\
<img src='https://drive.google.com/uc?id=1VvOmzRom-v20HDUKyJJUWBlSm8vfdgHQ'>
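
The same graph can also be written down as a data structure. The sketch below is one possible encoding, assumed here for illustration rather than taken from any library: each node name maps to an operation label and the list of nodes that point into it. The names `graph`, `ops`, and `evaluate` are hypothetical.

```python
import math

# Each node maps to (operation, parent nodes with arrows pointing into it)
graph = {
    "v0": ("input", []),
    "v1": ("input", []),
    "v2": ("div", ["v0", "v1"]),
    "v3": ("mul", ["v0", "v2"]),
    "v4": ("exp", ["v2"]),
    "v5": ("cos", ["v0"]),
    "v6": ("mul", ["v3", "v5"]),
    "v7": ("add", ["v4", "v6"]),
}

# Elementary operations and functions appearing in the graph
ops = {
    "div": lambda a, b: a / b,
    "mul": lambda a, b: a * b,
    "add": lambda a, b: a + b,
    "exp": math.exp,
    "cos": math.cos,
}

def evaluate(graph, inputs):
    # Nodes are listed so that parents always come before their children
    values = dict(inputs)
    for name, (op, parents) in graph.items():
        if op != "input":
            values[name] = ops[op](*(values[p] for p in parents))
    return values

values = evaluate(graph, {"v0": 1.0, "v1": 1.0})
print(values["v7"])

# Nodes with an arrow coming from v0
children_of_v0 = [n for n, (_, parents) in graph.items() if "v0" in parents]
print(children_of_v0)
```

Listing the children of a node, as in the last lines, is the information the backward pass of reverse-mode differentiation needs later on.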






# Exercise:
Write a Python function that represents the equation
$y=\mathrm{sin}(x_0) + x_1\mathrm{cos}(x_0) + \frac{x_1}{x_2}$
as a series of operations on intermediate variables.

In [None]:
import numpy as np

def function(x0, x1, x2):
  # your code here
  pass

<details>
<summary>Show Solution</summary>

```python
# Solution:
import numpy as np

def function(x0, x1, x2):
  v0 = x0
  v1 = x1
  v2 = x2
  v3 = np.sin(v0)
  v4 = np.cos(v0)
  v5 = v1 * v4
  v6 = v1/v2
  v7 = v3+v5
  v8 = v6 + v7
  return v8
  ```

</details>

# Exercise:

Draw a computational graph that represents the function you have made.

<details>
<summary>Show Solution</summary>

Solution:

<img src='https://drive.google.com/uc?id=1XiDJ1Y_t7JO6h-k53wswnR-T5end9tS2'>

</details>

# Forward-Mode Automatic Differentiation
Forward-mode automatic differentiation refers to an automatic differentiation process that propagates gradient information forwards starting at the inputs through the computational graph. It is performed with respect to a single input variable at a time, and computes both the value of each of the intermediate variables, as well as the partial derivative of the intermediate variables with respect to the chosen input at each node of the graph. \\

This is often achieved through 'operator overloading', where the algorithmic implementation of the function tracks both the value of each intermediate variable $v_i$ and its partial derivative with respect to the chosen input $x_j$, $\frac{\partial v_i}{\partial x_j}$, termed the 'seed'. \\
\
Operations on the derivatives (seeds) are performed according to the differential rules for multiplication, division, and compounded operations: the product, quotient, and chain rules respectively: \\
$\frac{\partial (uv)}{\partial x} = u \frac{\partial v}{\partial x}+ v \frac{\partial u}{\partial x}$ \\
$\frac{\partial (u/v)}{\partial x} = \frac{v \frac{\partial u}{\partial x}- u \frac{\partial v}{\partial x}}{v^2}$ \\
$\frac{\partial f(g(x))}{\partial x} = f'(g(x))\cdot g'(x)$ \\

Considering the function from earlier: \\

$f(x_0,x_1) = (\frac{x_0^2}{x_1})\cdot \mathrm{cos}(x_0) + e^{\frac{x_0}{x_1}}$
\
we can write it in programmatic form:

In [None]:
import numpy as np
def function1(x0, x1):
  v0 = x0
  v1 = x1
  v2 = v0/v1
  v3 = v0 * v2
  v4 = np.exp(v2)
  v5 = np.cos(v0)
  v6 = v3 * v5
  v7 = v4 + v6
  y = v7
  return y

We can add additional terms below each of the intermediate variables to track the partial derivatives as we pass through the function. \\
While this looks clumsy as a standard Python function, in packages that handle automatic differentiation, such as PyTorch, variables are stored as instances of a 'tensor' class, and the gradient information of each variable is an attribute of that instance, so it can be accessed when needed by other functions. The function below returns the function output y as well as the partial derivative of y with respect to x0:

In [None]:
import numpy as np
def function1_dx0(x0, x1):
  v0 = x0
  dv0dx0 = 1 # The derivative of a variable with respect to itself is 1
  v1 = x1
  dv1dx0 = 0 # v1 does not depend on x0, derivative is 0
  v2 = v0/v1
  dv2dx0 = (v1*dv0dx0-v0*dv1dx0)/v1**2 # Quotient rule, u = v0, par u = dv0dx0, v = v1, par v = dv1dx0
  v3 = v0 * v2
  dv3dx0 = v0 * dv2dx0 + v2 * dv0dx0 # Product rule, u = v0, par u = dv0dx0, v = v2, par v = dv2dx0
  v4 = np.exp(v2)
  dv4dx0 = dv2dx0 * np.exp(v2) # Chain rule, f = exp(), par f = exp(), g = v2, par g  = dv2dx0
  v5 = np.cos(v0)
  dv5dx0 = dv0dx0 * - np.sin(v0) # Chain rule, f = cos(), par f = -sin(), g = v0, par g = dv0dx0
  v6 = v3 * v5
  dv6dx0 = v5 * dv3dx0 + v3 * dv5dx0 # Product rule, u = v5, par u = dv5dx0, v = v3, par v = dv3dx0
  v7 = v4 + v6
  dv7dx0 = dv4dx0 + dv6dx0 # Addition of gradients is linear
  y = v7
  dydx0 = dv7dx0
  return y, dydx0

Timing the automatic differentiation function against a hand-derived symbolic expression for the same derivative, we can see in the run below that automatic differentiation offers a speedup. The difference is relatively modest for the function shown and will vary between machines and runs, but the process scales better for more complex functions with more compounded operations.

In [None]:
def symbolic_df1dx(x0, x1):
  return ((x0)**2/x1)*np.cos(x0)+np.exp(x0/x1), ((x0**2)*(-np.sin(x0))+np.exp(x0/x1)+2*x0*np.cos(x0))/x1
import timeit
time_automatic = timeit.timeit(lambda: function1_dx0(1,1), number=100000)
time_symbolic = timeit.timeit(lambda: symbolic_df1dx(1,1), number=100000)
print(f'Time taken by automatic differentiation: {time_automatic}')
print(f'Time taken by symbolic differentiation: {time_symbolic}')

print('Automatic: y='+str(function1_dx0(1,1)[0])+', dy/dx0 = '+str(function1_dx0(1,1)[1]))
print('Symbolic: y='+str(symbolic_df1dx(1,1)[0])+', dy/dx0 = '+str(symbolic_df1dx(1,1)[1]))

Time taken by automatic differentiation: 0.588525680001112
Time taken by symbolic differentiation: 0.6741398379999737
Automatic: y=3.258584134327185, dy/dx0 = 2.9574154553874283
Symbolic: y=3.258584134327185, dy/dx0 = 2.9574154553874283
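
The bookkeeping in `function1_dx0` can be automated with the operator overloading mentioned earlier. The sketch below is a minimal illustration under stated assumptions, not how any particular package implements it: the class name `Dual` and the helpers `dual_exp` and `dual_cos` are hypothetical, and only the operations needed for this function are supported.

```python
import math

class Dual:
    """A value paired with its derivative with respect to one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        # Addition of derivatives is linear
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule
        return Dual(self.value * other.value,
                    self.value * other.deriv + other.value * self.deriv)

    def __truediv__(self, other):
        # Quotient rule
        return Dual(self.value / other.value,
                    (other.value * self.deriv - self.value * other.deriv)
                    / other.value ** 2)

def dual_exp(a):
    # Chain rule with f = exp
    return Dual(math.exp(a.value), math.exp(a.value) * a.deriv)

def dual_cos(a):
    # Chain rule with f = cos
    return Dual(math.cos(a.value), -math.sin(a.value) * a.deriv)

def f(x0, x1):
    # Written exactly like the original formula; derivatives ride along
    return (x0 * x0 / x1) * dual_cos(x0) + dual_exp(x0 / x1)

# Differentiate with respect to x0: x0 carries derivative 1, x1 carries 0
result = f(Dual(1.0, 1.0), Dual(1.0, 0.0))
print(result.value, result.deriv)
```

The function body no longer mentions derivatives at all; each overloaded operator applies the matching rule, which is the idea that tensor classes in autodiff packages build on.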


# Exercise
Write a similar forward-mode function for the following equation:
$y=\mathrm{sin}(x_0) + x_1\mathrm{cos}(x_0) + \frac{x_1}{x_2}$

In [None]:
def function2_dx0(x0,x1,x2):
  # your code here
  return y, dydx0

<details>
<summary>Show Solution</summary>

```python
# Solution
def function2_dx0(x0, x1, x2):
  v0 = x0
  dv0dx0 = 1 # The derivative of a variable with respect to itself is 1
  v1 = x1
  dv1dx0 = 0 # v1 does not depend upon x0
  v2 = x2
  dv2dx0 = 0 # v2 does not depend upon x0
  v3 = np.sin(v0)
  dv3dx0 = dv0dx0 * np.cos(v0) # Chain rule, f(g(x)): f = sin(), g = v0, par f = cos(), par v0 = dv0dx0
  v4 = np.cos(v0)
  dv4dx0 = dv0dx0 * - np.sin(v0) # Chain rule, f(g(x)): f = cos(), g = v0, par f = -sin(), par v0 = dv0dx0
  v5 = v1 * v4
  dv5dx0 = v1 * dv4dx0 + v4 * dv1dx0 # Product rule, u = v1, v = v4, par u = dv1dx0, par v = dv4dx0
  v6 = v1/v2
  dv6dx0 = (v2*dv1dx0-v1*dv2dx0)/v2**2 # Quotient rule, u = v1, v = v2, par u = dv1dx0, par v = dv2dx0
  v7 = v3+v5
  dv7dx0 = dv3dx0 + dv5dx0 # Addition/subtraction of gradients is linear
  v8 = v6 + v7
  dv8dx0 = dv6dx0 + dv7dx0 # Addition/subtraction of gradients is linear
  y = v8
  dydx0 = dv8dx0
  return y, dydx0
```

</details>

Forward-mode automatic differentiation is relatively simple conceptually, and is useful for cases where the number of outputs is larger than the number of inputs, because a separate pass is required for each input with respect to which we would like partial derivatives. This means it is well suited to applications such as sensitivity analysis, where we would like to see how sensitive each output is to a particular input variable. At each intermediate variable, we calculate the partial derivative of that variable with respect to the chosen *input*. \\
However, in cases such as machine learning, where the number of parameters vastly outweighs the number of outputs, and we would like partial derivatives of the *output* with respect to every parameter, a different approach must be taken. Here, it makes much more sense to propagate derivative information backwards through the graph. This way, the information stored on each intermediate variable is the partial derivative of the *output* with respect to that intermediate variable, which is exactly what we need for our parameter updates. Propagating derivatives in this way is called reverse-mode automatic differentiation, and the backpropagation algorithm covered last week is itself a specific case of reverse-mode automatic differentiation.
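
To illustrate why forward mode needs one pass per input, here is a sketch of the earlier two-input function in which the starting derivatives of the inputs are passed in as arguments. The function name `function1_forward` and the parameters `seed0` and `seed1` are assumptions introduced for this example.

```python
import math

def function1_forward(x0, x1, seed0, seed1):
    # seed0 and seed1 are the derivatives of x0 and x1 with respect to
    # the input we are differentiating against (1 for that input, 0 otherwise)
    v0, dv0 = x0, seed0
    v1, dv1 = x1, seed1
    v2 = v0 / v1
    dv2 = (v1 * dv0 - v0 * dv1) / v1 ** 2   # Quotient rule
    v3 = v0 * v2
    dv3 = v0 * dv2 + v2 * dv0               # Product rule
    v4 = math.exp(v2)
    dv4 = v4 * dv2                          # Chain rule
    v5 = math.cos(v0)
    dv5 = -math.sin(v0) * dv0               # Chain rule
    v6 = v3 * v5
    dv6 = v3 * dv5 + v5 * dv3               # Product rule
    v7 = v4 + v6
    dv7 = dv4 + dv6
    return v7, dv7

x0, x1 = 1.8, 2.4
# Two inputs, so two complete forward passes are needed for the full gradient
y, dy_dx0 = function1_forward(x0, x1, 1.0, 0.0)
_, dy_dx1 = function1_forward(x0, x1, 0.0, 1.0)
print(y, dy_dx0, dy_dx1)
```

With millions of parameters this would mean millions of passes, which is the cost that reverse mode avoids.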

# Reverse-Mode Automatic Differentiation
This process involves making a forward pass through the computational graph to calculate the values of all of the intermediate variables, before making a backward pass through the graph to calculate the partial derivative of the function's output with respect to each intermediate variable. Similarly to backpropagation, the derivative for a given intermediate variable depends on everything ahead of it in the computational graph. To prevent expression swell, the total derivative of the output with respect to an intermediate variable, accumulated over all forward pathways, is held in a numerical value called the 'adjoint' of the intermediate variable, denoted by $\bar{v}_i$, where: \
$\bar{v}_i = \frac{\partial y}{\partial v_i} = \sum\limits_{j}\bar{v}_j\frac{\partial v_j}{\partial v_i}$ for all intermediate variables $v_j$ connected to $v_i$ by an outward arrow.
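
As a worked instance of this sum, using the intermediate variables defined earlier for $f(x_0,x_1)$: $v_2$ has outward arrows to both $v_3 = v_0 \cdot v_2$ and $v_4 = e^{v_2}$, so its adjoint collects one term from each pathway: \\
$\bar{v}_2 = \bar{v}_3\frac{\partial v_3}{\partial v_2} + \bar{v}_4\frac{\partial v_4}{\partial v_2} = \bar{v}_3 \cdot v_0 + \bar{v}_4 \cdot e^{v_2}$ \\
The adjoints $\bar{v}_3$ and $\bar{v}_4$ must already be known when this is evaluated, which is why the backward pass visits nodes from the output towards the inputs.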

Taking the functional form of $f(x_0,x_1) = (\frac{x_0^2}{x_1})\cdot \mathrm{cos}(x_0) + e^{\frac{x_0}{x_1}}$, and its computational graph:

<img src='https://drive.google.com/uc?id=1VvOmzRom-v20HDUKyJJUWBlSm8vfdgHQ'>

we can separate these two passes through the network:

In [None]:
def function1_reverse(x0, x1):
  # Forward Pass
  v0 = x0
  v1 = x1
  v2 = v0/v1
  v3 = v0 * v2
  v4 = np.exp(v2)
  v5 = np.cos(v0)
  v6 = v3 * v5
  v7 = v4 + v6
  y = v7
  # Backward Pass
  vbar7 = 1 # Equal to y, so partial derivative is 1
  # Considering v6: One pathway ahead, via v7
  dv7dv6 = 1 # v4 treated as constant, v6 differentiates to 1
  vbar6 = vbar7 * dv7dv6
  # Considering v5: One pathway ahead, via v6
  dv6dv5 = v3 # v3 treated as constant, dv6/dv5=v3
  vbar5 = vbar6 * dv6dv5
  # Considering v4: One pathway ahead, via v7
  dv7dv4 = 1 # v6 treated as constant, dv7/dv4=1
  vbar4 = vbar7 * dv7dv4
  # Considering v3: One pathway ahead, via v6
  dv6dv3 = v5 # v5 treated as constant, dv6/dv3=v5
  vbar3 = vbar6 * dv6dv3
  # Considering v2: Two pathways ahead, via v3 and v4
  dv3dv2 = v0 # v0 treated as constant, dv3/dv2=v0
  dv4dv2 = np.exp(v2) # d e^x/dx = e^x
  vbar2 = vbar3 * dv3dv2 + vbar4 * dv4dv2 # Sum the two terms
  # Considering v1: One pathway ahead, via v2
  dv2dv1 = -v0/v1**2 # v0 treated as constant, dv2/dv1=-v0/(v1^2)
  vbar1 = vbar2 * dv2dv1
  # Considering v0: Three pathways ahead, via v2, v3, and v5
  dv2dv0 = 1/v1 # v1 treated as constant, dv2/dv0=1/v1
  dv3dv0 = v2 # v2 treated as constant, dv3/dv0=v2
  dv5dv0 = -np.sin(v0) # d cos(x)/dx=-sin(x)
  vbar0 = vbar2 * dv2dv0 + vbar3 * dv3dv0 + vbar5 * dv5dv0 # Sum the terms
  return y, vbar0, vbar1, vbar2, vbar3, vbar4, vbar5, vbar6, vbar7

Whilst the procedure is a bit more complex, we can see that the above function returns the partial derivative of the output with respect to every intermediate variable. This is especially useful in the context of neural networks, where each parameter of the network features as a variable on the computational graph, and hence we get the full derivative information for a given output with one backward pass through the graph. The difference in evaluation time between the hand-derived symbolic expressions and automatic differentiation now becomes more stark. Again, the gap widens as more operations are composed, as is the case with deep neural networks.

In [None]:
def symbolic_f1_full(x0, x1):
  y = (x0**2)/x1 * np.cos(x0) + np.exp(x0/x1)
  vbar0 = (-np.sin(x0)*x0**2 + np.exp(x0/x1)+2*x0*np.cos(x0))/x1
  vbar1 = -(x0*(np.exp(x0/x1)+x0*np.cos(x0)))/x1**2
  vbar2 = np.exp(x0/x1) + x0*np.cos(x0)
  vbar3 = np.cos(x0)
  vbar4 = 1
  vbar5 = x0**2/x1
  vbar6 = 1
  vbar7 = 1
  return y, vbar0, vbar1, vbar2, vbar3, vbar4, vbar5, vbar6, vbar7


import timeit
time_automatic = timeit.timeit(lambda: function1_reverse(1,1), number=100000)
time_symbolic = timeit.timeit(lambda: symbolic_f1_full(1,1), number=100000)
print(f'Time taken by automatic differentiation: {time_automatic}')
print(f'Time taken by symbolic differentiation: {time_symbolic}')

print('Automatic: '+str(function1_reverse(1.8,2.4)))
print('Symbolic: '+str(symbolic_f1_full(1.8,2.4)))

Time taken by automatic differentiation: 0.6372098439987894
Time taken by symbolic differentiation: 1.3550646929998038
Automatic: (np.float64(1.810277188777007), np.float64(-0.7734141034699129), np.float64(-0.5337613269265994), np.float64(1.708036246165118), np.float64(-0.2272020946930871), 1, 1.35, 1, 1)
Symbolic: (np.float64(1.810277188777007), np.float64(-0.7734141034699131), np.float64(-0.5337613269265994), np.float64(1.708036246165118), np.float64(-0.2272020946930871), 1, 1.35, 1, 1)
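
The hand-written backward pass can also be generated automatically by recording the graph during the forward pass. The following is a minimal sketch of that idea under stated assumptions: the class `Var`, the helpers `exp` and `cos`, and the function `backward` are hypothetical names for this example, and real packages are considerably more elaborate.

```python
import math

class Var:
    """A scalar node that records its parents and local partial derivatives."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent node, d(self)/d(parent))
        self.adjoint = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def __truediv__(self, other):
        return Var(self.value / other.value,
                   ((self, 1.0 / other.value),
                    (other, -self.value / other.value ** 2)))

def exp(a):
    out = math.exp(a.value)
    return Var(out, ((a, out),))

def cos(a):
    return Var(math.cos(a.value), ((a, -math.sin(a.value)),))

def backward(output):
    # Order the nodes so that each one appears after everything it depends on
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.adjoint = 1.0
    # Walk from the output back to the inputs, summing over outward pathways
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.adjoint += node.adjoint * local_grad

x0, x1 = Var(1.8), Var(2.4)
v2 = x0 / x1
y = (x0 * v2) * cos(x0) + exp(v2)
backward(y)
print(y.value, x0.adjoint, x1.adjoint)
```

A single call to `backward` fills in the adjoint of every node, so both input derivatives come from one backward pass, in contrast to the one-pass-per-input cost of forward mode.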


# Exercise:
Compose the backward pass of a function to perform reverse-mode automatic differentiation on:
$y=\mathrm{sin}(x_0) + x_1\mathrm{cos}(x_0) + \frac{x_1}{x_2}$
<img src='https://drive.google.com/uc?id=1XiDJ1Y_t7JO6h-k53wswnR-T5end9tS2'>

In [None]:
def function(x0, x1, x2):
  # Forward Pass
  v0 = x0
  v1 = x1
  v2 = x2
  v3 = np.sin(v0)
  v4 = np.cos(v0)
  v5 = v1 * v4
  v6 = v1/v2
  v7 = v3+v5
  v8 = v6 + v7

  # Fill in the backward pass
  return v8, vbar0, vbar1, vbar2, vbar3, vbar4, vbar5, vbar6, vbar7, vbar8

<details>
<summary>Show Solution</summary>

```python
# Solution:
def function(x0, x1, x2):
  # Forward Pass
  v0 = x0
  v1 = x1
  v2 = x2
  v3 = np.sin(v0)
  v4 = np.cos(v0)
  v5 = v1 * v4
  v6 = v1/v2
  v7 = v3+v5
  v8 = v6 + v7

  # Backward Pass
  vbar8 = 1
  # Considering v7, one pathway via v8
  dv8dv7 = 1
  vbar7 = vbar8 * dv8dv7
  # Considering v6, one pathway via v8
  dv8dv6 = 1
  vbar6 = vbar8 * dv8dv6
  # Considering v5, one pathway via v7
  dv7dv5 = 1
  vbar5 = vbar7 * dv7dv5
  # Considering v4, one pathway via v5
  dv5dv4 = v1
  vbar4 = vbar5 * dv5dv4
  # Considering v3, one pathway via v7
  dv7dv3 = 1
  vbar3 = vbar7 * dv7dv3
  # Considering v2, one pathway via v6
  dv6dv2 = -1*v1/v2**2
  vbar2 = vbar6 * dv6dv2
  # Considering v1, two pathways via v5 and v6
  dv5dv1 = v4
  dv6dv1 = 1/v2
  vbar1 = vbar5 * dv5dv1 + vbar6 * dv6dv1
  # Considering v0, two pathways via v3 and v4
  dv3dv0 = np.cos(v0)
  dv4dv0 = -np.sin(v0)
  vbar0 = vbar3*dv3dv0 + vbar4*dv4dv0
  return v8, vbar0, vbar1, vbar2, vbar3, vbar4, vbar5, vbar6, vbar7, vbar8
```

</details>

# Packages with automatic differentiation
While completing the exercises, you will have found that writing procedures for automatic differentiation by hand is quite cumbersome. Thankfully, modern ML packages such as PyTorch, TensorFlow, and Keras all come with tools that will perform automatic differentiation on functions/programs without the need to write out all the terms by hand. In this section, we give a brief introduction to the autodiff functionality of PyTorch.

First, we will create a class to make a small multi-layer perceptron (Note: PyTorch has tools that will help generate neural networks more simply, but this lower-level approach gives us greater flexibility):

In [34]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# A class for a simple multilayer perceptron. NetShape is a list of integers giving the number of
# units in each layer of the network (input, hidden layers, output), and act_fn should be a torch
# function which provides the activation function for the hidden layers of the network
class MLP(nn.Module):
    def __init__(self, NetShape, act_fn=nn.Tanh()):
        super(MLP, self).__init__()
        # Find number of layers in the network
        self.Nlayers = len(NetShape)-1
        # Initialise lists to store weights and biases
        self.weights = []
        self.biases = []
        # For each layer in the network, add parameters for weights and biases, and mark that gradients
        # should be tracked for these parameters
        for layer in range(self.Nlayers):
            self.weights.append(torch.rand((NetShape[layer], NetShape[layer+1]), requires_grad=True))
            self.biases.append(torch.zeros((NetShape[layer+1]), requires_grad=True))
        # Define an activation function for nodes in the network
        self.act_fn = act_fn

    # Define a forward pass through our model
    def forward(self, x):
        # Pass through each layer of the model
        for layer in range(self.Nlayers):
            # Combine inputs/activations with weights, add biases, then perform activation function if not the output layer
            if layer < self.Nlayers-1:
                x = self.act_fn(torch.matmul(x, self.weights[layer])+self.biases[layer])
            else:
                x = torch.matmul(x, self.weights[layer])+self.biases[layer]
        return x

Next, we will define a toy function for our neural network to learn to replicate:

In [35]:
# Define a simple function to fit
f = lambda x: torch.sqrt((x[:, [0]]/x[:, [1]])) * torch.sin(x[:, [0]]*x[:, [1]]) * torch.exp(x[:, [2]])

# Generate a random input set
inputs = torch.rand((10000,3))

# Generate targets from function
targets = f(inputs)


Now, we will initialise our model, training parameters, and a simple function to sample minibatches of data from input/output pairs:

In [36]:
# Initialise a small model
model = MLP([3, 50, 50, 1], act_fn=nn.Sigmoid())

# Generate an optimiser to help with our parameter updates
# Define learning rate
eta = 1e-2
# Create optimiser for the parameters within the model
optimiser = optim.Adam(params=model.weights+model.biases, lr=eta)

# Define a function to sample minibatches of our training data
def gen_samples(Nbatch, inputs, outputs):
    # Create random indices to sample
    k = torch.randint(0, inputs.shape[0], (Nbatch,))
    return inputs[k], outputs[k]

# Define a loss function
lossfn = nn.MSELoss()


Finally, we will create a simple training procedure which will perform automatic differentiation, and update our parameters according to our loss function:

In [37]:
# Create a simple training procedure
for i in range(10000):
    # Randomly sample inputs and targets for 100 training points
    x_in, y_in = gen_samples(100, inputs, targets)
    # Reset gradient calculation in our optimiser
    optimiser.zero_grad()
    # Pass forward through our model
    prediction = model.forward(x_in)
    # Calculate loss
    loss = lossfn(prediction, y_in)
    # Perform automatic differentiation: compute the gradient of the loss with respect to
    # every tensor that requires gradients (here, the weights and biases)
    loss.backward()
    # Use these gradients to step our parameters
    optimiser.step()
    # Print the loss of our neural network periodically
    if i%1000 == 0:
        print(loss.item())

578.545166015625
0.0071814716793596745
0.001309192506596446
0.0014369856799021363
0.000721322896424681
0.00040197873022407293
0.00021226868557278067
0.00028325655148364604
0.0006178399198688567
0.0001817882584873587


Within this training loop, we can access the partial derivatives of the loss with respect to our parameters prior to stepping the optimiser:

In [39]:
# Randomly sample inputs and targets for 100 training points
x_in, y_in = gen_samples(100, inputs, targets)
# Reset gradient calculation in our optimiser
optimiser.zero_grad()
# Pass forward through our model
prediction = model.forward(x_in)
# Calculate loss
loss = lossfn(prediction, y_in)
# Perform automatic differentiation: compute the gradient of the loss with respect to
# every tensor that requires gradients (here, the weights and biases)
loss.backward()
# Considering the gradients for the first hidden layer weights:
layer1_weights = model.weights[0]
gradients = layer1_weights.grad
print(gradients)


tensor([[ 1.7136e-04, -2.6757e-05, -6.8002e-06, -1.2225e-05, -2.0450e-05,
          2.8915e-05,  1.4314e-04, -8.6720e-05, -9.0563e-05, -8.3435e-05,
         -3.5260e-05,  1.7535e-04, -5.8361e-05,  1.3866e-04,  3.7111e-05,
         -6.2662e-05, -3.0924e-05,  1.6691e-04, -9.4369e-05, -1.0459e-04,
          5.7064e-05,  1.7134e-05, -2.7458e-05, -1.7193e-04,  1.4677e-05,
          9.6307e-05,  3.1639e-04, -7.0145e-05,  2.4450e-05,  3.8485e-05,
         -1.5795e-04, -6.4853e-06, -6.7634e-05,  1.7198e-04, -5.0767e-05,
          1.2763e-05, -8.1301e-05, -7.9354e-05, -2.9504e-05,  2.5092e-05,
         -1.7073e-04, -9.0513e-06, -2.6975e-05, -1.5169e-05, -4.7157e-05,
         -7.6108e-05,  3.7039e-05,  7.9694e-05,  1.5387e-04,  2.3383e-04],
        [ 3.9263e-04, -3.6777e-05,  5.9635e-05, -1.3911e-05,  6.7872e-06,
          1.0095e-04,  2.9050e-04, -1.3036e-04, -1.1155e-04, -2.1569e-04,
         -4.8599e-05,  2.4719e-04, -3.5942e-05,  2.5647e-04,  1.0344e-04,
         -1.6315e-04,  3.5849e-06,  2

Here, we can see how simple it is to use the built-in functions within PyTorch to build a neural network and perform parameter updates according to gradients. The entire process of building the computational graph and propagating derivatives backwards through it is handled internally by PyTorch. If we would like to see those gradients, we can simply read the `grad` attribute of the relevant tensor.
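
As a final check, PyTorch can differentiate the two-input function from the hand-worked sections directly. This is a small sketch assuming the evaluation point (1.8, 2.4) used in the reverse-mode comparison; 64-bit floats are requested so that the printed values are comparable with the NumPy results.

```python
import torch

# Inputs are leaf tensors marked as requiring gradients
x0 = torch.tensor(1.8, dtype=torch.float64, requires_grad=True)
x1 = torch.tensor(2.4, dtype=torch.float64, requires_grad=True)

# Forward pass: the computational graph is recorded as the operations run
y = (x0 ** 2 / x1) * torch.cos(x0) + torch.exp(x0 / x1)

# Backward pass: fills in .grad on each leaf tensor
y.backward()
print(y.item(), x0.grad.item(), x1.grad.item())
```

The two gradients should agree with `vbar0` and `vbar1` from the hand-written reverse-mode function, up to floating-point rounding.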