In [1]:
import torch
torch.__version__

'1.0.1.post2'

# Use PyTorch to calculate the gradient value

PyTorch's Autograd module implements the differentiation needed for backpropagation in deep learning algorithms. For every operation on tensors (the Tensor class), Autograd can provide the derivative automatically, which removes the tedious work of computing derivatives by hand.

In versions prior to 0.4, PyTorch used the Variable class to compute gradients automatically. The Variable class had three main attributes:
data: the Tensor wrapped by the Variable; grad: the gradient corresponding to data, itself a Variable rather than a Tensor, with the same shape as data; grad_fn: a reference to the Function object used to compute the gradient with respect to the input during backpropagation.


Since 0.4, Variable has been officially merged into the Tensor class, so the automatic differentiation that used to require wrapping a tensor in a Variable is now built into Tensor. Variable(tensor) can still be written for backward compatibility, but the wrapping does nothing.

New code should therefore use the Tensor class directly, because the official documentation marks Variable as deprecated.

To use autograd through the Tensor class itself, you only need to set .requires_grad=True.

The grad and grad_fn attributes in the Variable class have been integrated into the Tensor class
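As a minimal sketch of the point above, the snippet below turns gradient tracking on for plain tensors and reads the `grad` and `grad_fn` attributes directly from them. The names `t` and `w` are illustrative only and do not appear elsewhere in this notebook.

```python
import torch

# A tensor created by the user does not track gradients by default.
t = torch.ones(2, 2)
print(t.requires_grad)

# Turn tracking on in place, or pass requires_grad=True at creation time.
t.requires_grad_(True)
w = torch.ones(2, 2, requires_grad=True)

# grad and grad_fn live directly on the tensor; both are None until
# an operation produces the tensor or a backward pass fills in grad.
print(t.requires_grad, w.requires_grad)
print(w.grad, w.grad_fn)
```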

## Autograd

When a tensor is created with the requires_grad flag set to True, PyTorch knows that the tensor needs automatic differentiation. PyTorch then records every operation applied to the tensor so that gradients can be computed automatically.

In [2]:
x = torch.rand(5, 5, requires_grad=True)
x

tensor([[0.0403, 0.5633, 0.2561, 0.4064, 0.9596],
        [0.6928, 0.1832, 0.5380, 0.6386, 0.8710],
        [0.5332, 0.8216, 0.8139, 0.1925, 0.4993],
        [0.2650, 0.6230, 0.5945, 0.3230, 0.0752],
        [0.0919, 0.4770, 0.4622, 0.6185, 0.2761]], requires_grad=True)

In [3]:
y = torch.rand(5, 5, requires_grad=True)
y

tensor([[0.2269, 0.7673, 0.8179, 0.5558, 0.0493],
        [0.7762, 0.9242, 0.2872, 0.0035, 0.4197],
        [0.4322, 0.5281, 0.9001, 0.7276, 0.3218],
        [0.5123, 0.6567, 0.9465, 0.0475, 0.9172],
        [0.9899, 0.9284, 0.5303, 0.1718, 0.3937]], requires_grad=True)

PyTorch will automatically track and record all operations on the tensor. When the calculation is completed, the .backward() method is called to automatically calculate the gradient and save the calculation result to the grad attribute.

In [4]:
z=torch.sum(x+y)
z

tensor(25.6487, grad_fn=<SumBackward0>)

After an operation is applied, the resulting tensor's grad_fn is set to a new value: it references the Function object that created this tensor.
Tensors and Functions are linked together into a directed acyclic graph, which records and encodes the complete computation history. Every tensor has a .grad_fn attribute. If the tensor was created manually by the user, its grad_fn is None.
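The difference between a user-created tensor and a computed one can be checked directly. This is a small sketch with illustrative names (`a`, `b`); the exact class name of the backward node may vary between PyTorch versions.

```python
import torch

a = torch.rand(2, 2, requires_grad=True)   # created by the user
b = (a * 2).sum()                           # produced by operations on a

print(a.grad_fn)                 # None: no operation produced a
print(type(b.grad_fn).__name__)  # name of the backward node for sum()
```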

Let's call the backpropagation function to calculate its gradient

## Simple automatic derivative

In [5]:
z.backward()
print(x.grad,y.grad)


tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]]) tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])


If the tensor is a scalar (that is, it contains a single element), you do not need to pass any argument to backward(). If it has more elements, you need to pass a gradient argument: a tensor whose shape matches the tensor being differentiated.
The `z.backward()` call above is shorthand for `z.backward(torch.tensor(1.))`.
The scalar case is common in practice. For example, in single-label image classification the quantity that is backpropagated is typically a single scalar value.
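The two calling conventions can be compared side by side. The sketch below uses a small hand-written tensor (an assumption for illustration) and resets `x.grad` between the two passes, because PyTorch accumulates gradients into `grad` rather than overwriting them.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Scalar output: backward() needs no argument.
s = (x * 2).sum()
s.backward()
grad_scalar = x.grad.clone()

# Gradients accumulate in x.grad, so reset it before the second pass.
x.grad.zero_()

# Non-scalar output: pass a tensor with the same shape as the output.
v = x * 2
v.backward(torch.ones_like(v))
grad_vector = x.grad.clone()

print(grad_scalar, grad_vector)
```

Passing a tensor of ones gives the same result as summing the output first and then calling `backward()` with no argument.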

## Complex automatic derivative

In [6]:
x = torch.rand(5, 5, requires_grad=True)
y = torch.rand(5, 5, requires_grad=True)
z = x**2+y**3
z

tensor([[3.3891e-01, 4.9468e-01, 8.0797e-02, 2.5656e-01, 2.9529e-01],
        [7.1946e-01, 1.6977e-02, 1.7965e-01, 3.2656e-01, 1.7665e-01],
        [3.1353e-01, 2.2096e-01, 1.2251e+00, 5.5087e-01, 5.9572e-02],
        [1.3015e+00, 3.8029e-01, 1.1103e+00, 4.0392e-01, 2.2055e-01],
        [8.8726e-02, 6.9701e-01, 8.0164e-01, 9.7221e-01, 4.2239e-04]],
       grad_fn=<AddBackward0>)

In [7]:
# z is not a scalar, so backward() needs a tensor of the same shape as its argument; ones_like(x) builds a tensor of ones shaped like x
z.backward(torch.ones_like(x))
print(x.grad)

tensor([[0.2087, 1.3554, 0.5560, 1.0009, 0.9931],
        [1.2655, 0.1223, 0.8008, 1.1127, 0.7261],
        [1.1052, 0.2579, 1.8006, 0.1544, 0.3646],
        [1.8855, 1.2296, 1.9061, 0.9313, 0.0648],
        [0.5952, 1.6190, 0.8430, 1.9213, 0.0322]])
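The result can be checked against the analytic derivatives. For `z = x**2 + y**3`, each element satisfies

$$\frac{\partial z}{\partial x} = 2x, \qquad \frac{\partial z}{\partial y} = 3y^2 .$$

The following sketch repeats the computation with fresh random tensors and compares the gradients from autograd with these formulas.

```python
import torch

x = torch.rand(5, 5, requires_grad=True)
y = torch.rand(5, 5, requires_grad=True)
z = x**2 + y**3
z.backward(torch.ones_like(z))

# Compare autograd's result with the closed-form derivatives.
# detach() gives a view of the values that is not tracked by autograd.
x_matches = torch.allclose(x.grad, 2 * x.detach())
y_matches = torch.allclose(y.grad, 3 * y.detach() ** 2)
print(x_matches, y_matches)
```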


We can use the `with torch.no_grad()` context manager to temporarily disable gradient tracking for tensors that have requires_grad=True. This is often used when computing accuracy on the test set, for example:

In [8]:
with torch.no_grad():
    print((x +y*2).requires_grad)

False


Inside a `torch.no_grad()` block, operations are not recorded in the history. Skipping that bookkeeping reduces memory usage and speeds up computation slightly.
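A rough sketch of the test-set use case is shown below. The model, inputs, and labels are hypothetical stand-ins (an untrained `torch.nn.Linear` and random data), chosen only so the snippet runs on its own.

```python
import torch

# Hypothetical stand-ins for a trained model and one test batch.
model = torch.nn.Linear(4, 3)
inputs = torch.rand(8, 4)
labels = torch.randint(0, 3, (8,))

with torch.no_grad():
    logits = model(inputs)               # no graph is recorded here
    predictions = logits.argmax(dim=1)   # predicted class per sample
    accuracy = (predictions == labels).float().mean().item()

print(logits.requires_grad)
print(accuracy)
```

Even though the model's parameters have `requires_grad=True`, the output computed inside the block does not track gradients.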

## Autograd process analysis

To illustrate how PyTorch's automatic differentiation works, let's try to analyze how PyTorch implements it. Although PyTorch's Tensor and TensorBase are implemented in C++, some Python facilities can be used to inspect the attributes and state of these objects from Python.
Python's `dir()` returns a list of an object's attributes and methods. `z` is a Tensor, so let's see which members it has.

In [9]:
dir(z)

['__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_priority__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__div__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__idiv__',
 '__ilshift__',
 '__imul__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__irshift__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__long__',
 '__lshift__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pow__',
 '__radd__',
 '__rdiv__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rfloordiv__',
 '__rmul__',
 '__rpow__',
 '__rshift__',
 '__rsub__',
 '__rtruediv__',
 '__setattr__',
 '__setitem__',
 '__setstate_

That is a long list. Setting aside Python's special methods (names that start and end with __) and private methods (names that start with _), let's look at a few of the main attributes:
`.is_leaf`: records whether the tensor is a leaf node. This attribute tells us which kind of variable it is.
The "graph leaves" and "leaf variables" mentioned in the official documentation refer to variables such as `x` and `y` that are created manually rather than computed. These are called creation variables.
A result obtained through computation, like `z`, is called a result variable.

`.is_leaf` tells us whether a variable is a creation variable or a result variable.

In [10]:
print("x.is_leaf="+str(x.is_leaf))
print("z.is_leaf="+str(z.is_leaf))

x.is_leaf=True
z.is_leaf=False


`x` was created manually rather than computed, so it is a leaf node, that is, a creation variable. `z` was obtained through a series of computations on `x` and `y`, so it is not a leaf node; it is a result variable.

Why does executing `z.backward()` update `x.grad` and `y.grad`?
The `.grad_fn` attribute records the operations that make this possible. Although the `.backward()` method is also implemented in C++, it can be explored in a simple way from Python.

`grad_fn`: records and encodes the complete computation history

In [11]:
z.grad_fn

<AddBackward0 at 0x120840a90>

`grad_fn` is an object of type `AddBackward0`. This class is also written in C++, but the name suggests that it is the backward function of addition (ADD). Let's take a look at what is in it.

In [12]:
dir(z.grad_fn)

['__call__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_register_hook_dict',
 'metadata',
 'name',
 'next_functions',
 'register_hook',
 'requires_grad']

`next_functions` is the core of `grad_fn`.

In [13]:
z.grad_fn.next_functions

((<PowBackward0 at 0x1208409b0>, 0), (<PowBackward0 at 0x1208408d0>, 0))

`next_functions` is a tuple of tuples, each holding a PowBackward0 object and an int.

Why are there two tuples?
Because our operation is `z = x**2+y**3`. The `AddBackward0` we just saw is the addition, and the operations that came before it are the two powers, each recorded as a `PowBackward0`. The first element of the tuple is the operation record related to `x`.

In [14]:
xg = z.grad_fn.next_functions[0][0]
dir(xg)

['__call__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_register_hook_dict',
 'metadata',
 'name',
 'next_functions',
 'register_hook',
 'requires_grad']

Keep digging

In [15]:
x_leaf=xg.next_functions[0][0]
type(x_leaf)

AccumulateGrad

In PyTorch's backward graph, the `AccumulateGrad` type represents a leaf node, which is an end node of the computation graph. An `AccumulateGrad` object has a `.variable` attribute that points to the leaf tensor.

In [16]:
x_leaf.variable

tensor([[0.1044, 0.6777, 0.2780, 0.5005, 0.4966],
        [0.6328, 0.0611, 0.4004, 0.5564, 0.3631],
        [0.5526, 0.1290, 0.9003, 0.0772, 0.1823],
        [0.9428, 0.6148, 0.9530, 0.4657, 0.0324],
        [0.2976, 0.8095, 0.4215, 0.9606, 0.0161]], requires_grad=True)

This `.variable` attribute is the variable `x` that we created.

In [17]:
print("x_leaf.variable id:"+str(id(x_leaf.variable)))
print("id of x:"+str(id(x)))

x_leaf.variable id:4840553424
id of x:4840553424


In [18]:
assert(id(x_leaf.variable)==id(x))

So the whole procedure is very clear:

1. Executing z.backward() calls the grad_fn attribute of z to perform the derivative computation.
2. That computation traverses the next_functions of grad_fn, takes each Function it finds, and performs its derivative computation in turn. This is a recursive process that continues until it reaches a leaf node (AccumulateGrad).
3. Once a result has been computed, it is saved to the grad attribute of the object (x or y) referenced by the corresponding variable attribute.
4. The differentiation is then complete, and the grad attributes of all leaf nodes have been updated accordingly.

Finally, once z.backward() has finished, the grad values of x and y have been updated.
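The traversal described in the steps above can be imitated from Python. The helper below, `collect`, is a hypothetical name written for this sketch and is not part of PyTorch; it only follows `next_functions` recursively and records the type name of each node, without computing any gradients.

```python
import torch

def collect(fn, depth=0):
    """Return (depth, node name) pairs for the backward graph rooted at fn."""
    if fn is None:  # inputs that need no gradient show up as None
        return []
    nodes = [(depth, type(fn).__name__)]
    for next_fn, _ in fn.next_functions:
        nodes += collect(next_fn, depth + 1)
    return nodes

x = torch.rand(2, 2, requires_grad=True)
y = torch.rand(2, 2, requires_grad=True)
z = x**2 + y**3

nodes = collect(z.grad_fn)
for depth, name in nodes:
    print("  " * depth + name)
```

The printed tree should start with the addition node, followed by one power node per operand, each ending in an `AccumulateGrad` leaf.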

## Extend Autograd
If you need to customize autograd with new operations, you need to extend the Function class, because autograd uses Function to compute results and gradients and to encode the operation history.
The most important methods of the Function class are `forward()` and `backward()`, which represent forward and backward propagation respectively.





A custom Function involves the following three methods:

    __init__ (optional): if the operation needs additional parameters, define a constructor for the Function; otherwise it can be omitted. In the static-method style used below, extra parameters are passed to forward() as arguments instead.

    forward(): the computation performed during forward propagation.

    backward(): the gradient computation performed during backpropagation. It receives one argument per value returned by forward(), each being the gradient with respect to that output, and it must return one gradient per input of forward().


In [19]:
# Import Function so that it can be subclassed
from torch.autograd.function import Function

In [20]:
# Define an operation that multiplies a tensor by a constant
# forward and backward must be static methods, so @staticmethod is required
class MulConstant(Function):
    @staticmethod
    def forward(ctx, tensor, constant):
        # ctx stores information for later, much like self; its attributes can be read in backward
        ctx.constant=constant
        return tensor *constant
    @staticmethod
    def backward(ctx, grad_output):
        # backward must return one value per input of forward.
        # The first input is a 3x3 tensor, the second is a constant.
        # The constant is not a tensor, so its gradient must be None.
        return grad_output, None

After defining our new operation, let’s test it

In [21]:
a=torch.rand(3,3,requires_grad=True)
b=MulConstant.apply(a,5)
print("a:"+str(a))
print("b:"+str(b)) # each element of b is the corresponding element of a multiplied by 5

a:tensor([[0.0118, 0.1434, 0.8669],
        [0.1817, 0.8904, 0.5852],
        [0.7364, 0.5234, 0.9677]], requires_grad=True)
b:tensor([[0.0588, 0.7169, 4.3347],
        [0.9084, 4.4520, 2.9259],
        [3.6820, 2.6171, 4.8386]], grad_fn=<MulConstantBackward>)


Now run backpropagation. The result `b` is not a scalar, so the `backward` method needs a gradient argument.

In [22]:
b.backward(torch.ones_like(a))

In [23]:
a.grad

tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])

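The gradient is all ones because this `backward` passes `grad_output` through unchanged. Mathematically, the derivative of `tensor * constant` with respect to `tensor` is `constant`, so a `backward` that follows the chain rule would return `grad_output * ctx.constant` and give a gradient of 5 here.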
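For comparison, here is a sketch of a variant whose `backward` scales the incoming gradient by the constant. The class name `ScaledMulConstant` is made up for this example; it is not part of PyTorch or of the code above.

```python
import torch
from torch.autograd import Function

class ScaledMulConstant(Function):
    """Hypothetical variant of MulConstant whose backward applies the chain rule."""

    @staticmethod
    def forward(ctx, tensor, constant):
        # Keep the constant so that backward can use it.
        ctx.constant = constant
        return tensor * constant

    @staticmethod
    def backward(ctx, grad_output):
        # d(tensor * constant) / d(tensor) = constant.
        # The constant itself is not a tensor, so its gradient is None.
        return grad_output * ctx.constant, None

a = torch.rand(3, 3, requires_grad=True)
b = ScaledMulConstant.apply(a, 5)
b.backward(torch.ones_like(a))
print(a.grad)  # every element is 5
```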