In [1]:
import torch
import torchvision
import torchvision.transforms as transforms
import torchaudio
import numpy as np


print("PyTorch version    :", torch.__version__)
print("Torchvision version:", torchvision.__version__)
print("Torchaudio version :", torchaudio.__version__)
print("CUDA version       :", torch.version.cuda)
print("cuDNN version      :", torch.backends.cudnn.version())
print("NumPy version      :", np.__version__)


if hasattr(torch, 'accelerator'):
    print("torch.accelerator is available\n")
    device = torch.accelerator.current_accelerator() if torch.accelerator.is_available() else "cpu"
else:
    print("torch.accelerator is NOT available\n")
    if torch.backends.mps.is_available():
        device = "mps"
    elif torch.cuda.is_available():
        device = "cuda"
    else:
        device = "cpu"

print("Selected device    :", device)

PyTorch version    : 2.0.1+cu117
Torchvision version: 0.15.2+cu117
Torchaudio version : 2.0.2+cu117
CUDA version       : 11.7
cuDNN version      : 8500
NumPy version      : 1.24.4
torch.accelerator is NOT available

Selected device    : cuda


In [2]:


a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

In [3]:
print(a)

tensor([2., 3.], requires_grad=True)


In [4]:
print(b)

tensor([6., 4.], requires_grad=True)


In [5]:
# Q = 3a^3 - b^2, computed element-wise (Q has the same shape as a and b)
Q = 3*a**3 - b**2


In [6]:
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)

In [7]:
a.grad

tensor([36., 81.])

In [8]:
b.grad

tensor([-12.,  -8.])

In [9]:
# check if collected gradients are correct
print(9*a**2 == a.grad)
print(-2*b == b.grad)

tensor([True, True])
tensor([True, True])
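
The expected values used in the check above come from differentiating `Q` element-wise. A short derivation, using the same `a` and `b` as in the cells above:

$$
Q_i = 3a_i^3 - b_i^2
\quad\Longrightarrow\quad
\frac{\partial Q_i}{\partial a_i} = 9a_i^2,
\qquad
\frac{\partial Q_i}{\partial b_i} = -2b_i
$$

With $a = (2, 3)$ and $b = (6, 4)$ this gives $9a^2 = (36, 81)$ and $-2b = (-12, -8)$, which matches the printed `a.grad` and `b.grad`.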


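Key terms: differentiation, Jacobian matrix, calculus.

For a function that maps $\mathbb{R}^n$ to $\mathbb{R}^m$, the Jacobian matrix is an $m \times n$ matrix whose entries are the partial derivatives of each component function with respect to each input variable.

Mathematically, if you have a vector-valued function $\vec{y} = f(\vec{x})$, then the gradient of $\vec{y}$ with respect to $\vec{x}$ is a Jacobian matrix $J$:

$$
J =
\begin{pmatrix}
\dfrac{\partial \vec{y}}{\partial x_1} & \dfrac{\partial \vec{y}}{\partial x_2} & \cdots & \dfrac{\partial \vec{y}}{\partial x_n}
\end{pmatrix}
=
\begin{pmatrix}
\dfrac{\partial y_1}{\partial x_1} & \dfrac{\partial y_1}{\partial x_2} & \cdots & \dfrac{\partial y_1}{\partial x_n} \\
\dfrac{\partial y_2}{\partial x_1} & \dfrac{\partial y_2}{\partial x_2} & \cdots & \dfrac{\partial y_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial y_m}{\partial x_1} & \dfrac{\partial y_m}{\partial x_2} & \cdots & \dfrac{\partial y_m}{\partial x_n}
\end{pmatrix}
$$

Generally speaking, `torch.autograd` is an engine for computing vector-Jacobian products. That is, given any vector $\vec{v}$, it computes the product $J^{T} \cdot \vec{v}$.

In the context of PyTorch and autograd, a scalar function is a function that takes one or more inputs (which can be tensors of any shape) and produces a single scalar output, i.e., a tensor with just one element (typically a 0-dimensional tensor of shape `torch.Size([])`).

If $\vec{v}$ happens to be the gradient of a scalar function $l = g(\vec{y})$:

$$
\vec{v} =
\begin{pmatrix}
\dfrac{\partial l}{\partial y_1} & \dfrac{\partial l}{\partial y_2} & \cdots & \dfrac{\partial l}{\partial y_m}
\end{pmatrix}^{T}
$$

then by the chain rule, the vector-Jacobian product is the gradient of $l$ with respect to $\vec{x}$:

$$
J^{T} \cdot \vec{v} =
\begin{pmatrix}
\dfrac{\partial y_1}{\partial x_1} & \dfrac{\partial y_2}{\partial x_1} & \cdots & \dfrac{\partial y_m}{\partial x_1} \\
\dfrac{\partial y_1}{\partial x_2} & \dfrac{\partial y_2}{\partial x_2} & \cdots & \dfrac{\partial y_m}{\partial x_2} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial y_1}{\partial x_n} & \dfrac{\partial y_2}{\partial x_n} & \cdots & \dfrac{\partial y_m}{\partial x_n}
\end{pmatrix}
\begin{pmatrix}
\dfrac{\partial l}{\partial y_1} \\
\dfrac{\partial l}{\partial y_2} \\
\vdots \\
\dfrac{\partial l}{\partial y_m}
\end{pmatrix}
=
\begin{pmatrix}
\dfrac{\partial l}{\partial x_1} \\
\dfrac{\partial l}{\partial x_2} \\
\vdots \\
\dfrac{\partial l}{\partial x_n}
\end{pmatrix}
$$

This property of the vector-Jacobian product is what the example above uses; `external_grad` represents $\vec{v}$.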
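
To make the vector-Jacobian product concrete, here is a minimal sketch that builds the Jacobians of the earlier `Q = 3*a**3 - b**2` example explicitly and compares $J^{T} \cdot \vec{v}$ with what `backward` accumulates. The helper name `f` and the use of `torch.autograd.functional.jacobian` are choices made for this illustration; they are not part of the original cells.

```python
import torch
from torch.autograd.functional import jacobian

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

def f(a, b):
    # Same vector-valued function as in the earlier cells (name `f` is illustrative)
    return 3 * a**3 - b**2

v = torch.tensor([1., 1.])  # plays the role of external_grad

# Reverse-mode pass: accumulates J^T v into a.grad and b.grad
f(a, b).backward(gradient=v)

# Explicit Jacobians of f with respect to each input (each is 2 x 2 here)
J_a, J_b = jacobian(f, (a.detach(), b.detach()))

# Form the vector-Jacobian products by hand
vjp_a = J_a.T @ v
vjp_b = J_b.T @ v

print(torch.allclose(vjp_a, a.grad))
print(torch.allclose(vjp_b, b.grad))
```

Materializing the full Jacobian is only practical for tiny examples like this one; `backward` never builds it and computes the product directly.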

# Why scalar functions matter in autograd

PyTorch’s autograd is designed to compute gradients using reverse-mode automatic
differentiation, which is especially efficient when the output is a scalar
(typical in loss functions).

In deep learning, the loss is a scalar function of the model parameters. When you call `loss.backward()`, PyTorch computes the gradient of the scalar loss with respect to every leaf tensor in the graph that has `requires_grad=True` and accumulates it in that tensor's `.grad` attribute.

A scalar function outputs a single number (a 0-dimensional tensor), which is why it is the typical form of a loss function.

`Tensor.backward()` can be called without arguments only on a single-element output; for a non-scalar output you must supply a `gradient` argument of the same shape, as with `external_grad` above.
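
The following sketch illustrates the scalar requirement on a small tensor. It assumes the same `a` values as in the earlier cells; the variable names `error_message`, `grad_explicit`, and `grad_sum` are introduced here for illustration only.

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)

# Calling backward() with no arguments on a non-scalar output raises an error
error_message = None
try:
    (3 * a**3).backward()
except RuntimeError as err:
    error_message = str(err)
    print("backward() failed:", error_message)

# Option 1: pass an explicit gradient with the same shape as the output
Q = 3 * a**3
Q.backward(gradient=torch.ones_like(Q))
grad_explicit = a.grad.clone()

# Option 2: reduce the output to a scalar first, then call backward()
a.grad = None  # reset, because gradients accumulate across backward calls
Q = 3 * a**3
Q.sum().backward()
grad_sum = a.grad.clone()

print(torch.equal(grad_explicit, grad_sum))
```

Summing the output and passing a vector of ones give the same result because the gradient of `Q.sum()` with respect to `Q` is a vector of ones.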

# Computational Graph

Conceptually, autograd keeps a record of data (tensors) and all executed operations
(along with the resulting new tensors) in a directed acyclic graph (DAG)
consisting of Function objects. In this DAG, leaves are the input tensors and
roots are the output tensors. By tracing this graph from roots to leaves,
autograd can automatically compute the gradients using the chain rule.
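
The graph can be inspected from Python. The sketch below walks backward from the root `Q` through `grad_fn` and `next_functions`; the helper `walk` and the list `names` are hypothetical names written for this illustration, not an autograd API.

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)
Q = 3 * a**3 - b**2

# Leaves are user-created tensors: they have no grad_fn
print(a.is_leaf, a.grad_fn)
# The root was produced by an operation, so it carries a grad_fn node
print(Q.is_leaf, Q.grad_fn)

names = []

def walk(fn, depth=0):
    """Recursively print the backward graph starting at node `fn`."""
    if fn is None:  # inputs that do not require grad (e.g. the constant 3)
        return
    names.append(type(fn).__name__)
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1)

walk(Q.grad_fn)
```

Each printed node is one recorded operation; the nodes at the deepest level are the ones that accumulate gradients into the leaf tensors `a` and `b`.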

In a NN, parameters that don’t compute gradients are usually called frozen
parameters. It is useful to “freeze” part of your model if you know in advance
that you won’t need the gradients of those parameters (this offers some
performance benefits by reducing autograd computations).
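
As a small illustration of how `requires_grad` determines what autograd tracks, the sketch below shows that an operation's output requires gradients only if at least one of its inputs does. The tensor and layer names are arbitrary choices for this example.

```python
import torch
from torch import nn

x = torch.rand(5, 5)
y = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True)

out_untracked = x + y  # neither input requires grad
out_tracked = x + z    # one input requires grad

print(out_untracked.requires_grad)  # False
print(out_tracked.requires_grad)    # True

# Freezing a layer: with every parameter frozen and a plain input tensor,
# the layer's output is no longer part of any autograd graph
layer = nn.Linear(5, 3)
for param in layer.parameters():
    param.requires_grad = False

frozen_out = layer(x)
print(frozen_out.requires_grad)     # False
```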

# Finetuning
In finetuning, we freeze most of the model and typically only modify the
classifier layers to make predictions on new labels. Let’s walk through a small
example to demonstrate this. We load a pretrained resnet18 model, freeze all
the parameters, and then replace the final fully connected layer `model.fc`
with a new, unfrozen linear layer that outputs 10 classes.


In [11]:

import torch
from torch import nn, optim
from torchvision.models import resnet18, ResNet18_Weights
model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Freeze all the parameters in the network
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head; a newly created layer has
# requires_grad=True by default, so it is the only unfrozen part
model.fc = nn.Linear(512, 10)

# All parameters are passed to the optimizer, but only model.fc receives
# gradients, so only the classifier is updated
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
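
To check that only the classifier is updated, here is a minimal sketch of one training step. To stay self-contained and avoid downloading pretrained weights, it uses a small stand-in network (`TinyNet`, a hypothetical class) instead of resnet18, plus random data and labels; the freezing and optimizer pattern is the same as in the cell above.

```python
import torch
from torch import nn, optim

class TinyNet(nn.Module):
    """Hypothetical stand-in for resnet18: a backbone plus an `fc` head."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 8 * 8, 512), nn.ReLU()
        )
        self.fc = nn.Linear(512, 1000)

    def forward(self, x):
        return self.fc(self.backbone(x))

torch.manual_seed(0)
model = TinyNet()

# Freeze everything, then swap in a fresh (unfrozen) classifier head
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(512, 10)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['fc.weight', 'fc.bias']

optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# Random batch, purely illustrative
data = torch.rand(4, 3, 8, 8)
labels = torch.randint(0, 10, (4,))

backbone_before = model.backbone[1].weight.clone()
fc_before = model.fc.weight.clone()

loss = nn.functional.cross_entropy(model(data), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Frozen weights are unchanged; the classifier weights have moved
print(torch.equal(backbone_before, model.backbone[1].weight))
print(torch.equal(fc_before, model.fc.weight))
```

Passing only `model.fc.parameters()` to the optimizer would behave the same way here; passing all parameters works because the optimizer skips parameters whose `.grad` is `None`.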



[Autograd](https://www.youtube.com/watch?v=MswxJw-8PvE)