In [1]:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np


In [22]:
N = nn.Sequential(nn.Linear(1, 10), nn.Sigmoid(), nn.Linear(10,1, bias=False))

## How To Recover weights of the neural network?

In [23]:
N

Sequential(
  (0): Linear(in_features=1, out_features=10, bias=True)
  (1): Sigmoid()
  (2): Linear(in_features=10, out_features=1, bias=False)
)

In [24]:
first_layer = N[0]
second_layer = N[2]

In [25]:
first_layer.weight  # N[0].weight

Parameter containing:
tensor([[ 0.1753],
        [ 0.2632],
        [ 0.3580],
        [ 0.4935],
        [ 0.7921],
        [ 0.5748],
        [ 0.7542],
        [ 0.5756],
        [ 0.1214],
        [-0.8317]], requires_grad=True)

In [26]:
first_layer.bias  # N[0].bias

Parameter containing:
tensor([-0.2180, -0.0862, -0.8238,  0.0953, -0.9646, -0.7808,  0.7816,  0.2438,
        -0.3960, -0.6395], requires_grad=True)

In [27]:
second_layer.weight

Parameter containing:
tensor([[-0.2259, -0.0933, -0.2676,  0.0397, -0.0897,  0.1744,  0.0865, -0.1039,
         -0.1030,  0.1847]], requires_grad=True)
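
The parameters can also be listed in one pass instead of layer by layer. This is a minimal sketch, assuming a freshly built network with the same architecture as `N`; the names `net`, `shapes` and `weights` are only illustrative, and the values differ from the ones printed above because initialization is random.

```python
import torch
import torch.nn as nn

# Same architecture as N above; weights are randomly initialised, so values differ.
net = nn.Sequential(nn.Linear(1, 10), nn.Sigmoid(), nn.Linear(10, 1, bias=False))

# named_parameters() yields (name, Parameter) pairs for every trainable tensor.
shapes = {name: tuple(p.shape) for name, p in net.named_parameters()}
for name, shape in shapes.items():
    print(name, shape)

# state_dict() exposes the same tensors keyed by name; clone() gives independent copies.
weights = {k: v.detach().clone() for k, v in net.state_dict().items()}
```

The keys combine the layer index inside the `Sequential` with the parameter name, e.g. `0.weight` for `N[0].weight`.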

## Concept of autograd and requires_grad = True in Pytorch

Setting requires_grad = True tells autograd to record the history of operations applied to a tensor.

That recorded history is what allows PyTorch to compute derivatives w.r.t. the tensor.


https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html Complete guide.

https://jovian.ai/forum/t/what-is-the-use-of-requires-grad-in-tensors/17718 In short, very useful.

https://pytorch.org/docs/stable/notes/autograd.html#:~:text=Setting%20requires_grad&text=Parameter%60%60%2C%20that%20allows%20for,its%20input%20tensors%20require%20grad.
Hard to understand.

https://pytorch.org/docs/stable/generated/torch.Tensor.requires_grad.html

If you set requires_grad to True on a tensor, PyTorch tracks the operations applied to it and can compute gradients w.r.t. that tensor.
During backpropagation you need the gradients of the loss function w.r.t. the weights.
These are computed by the .backward() method: calling it on a tensor (typically the loss) computes the gradient of that tensor w.r.t. every leaf tensor with requires_grad set to True that was used to produce it, and stores the result in each one's .grad attribute.

### Example1:

In [34]:
x = torch.tensor(2.0, requires_grad = True)

z = x ** 3 # z=x^3 ---> Dz/Dx = 3x**2

z.backward()  # Computes the gradient and stores it in x.grad
# print(x.grad) # This is dz/dx. If z.backward() is commented out, x.grad is None.

In [35]:
x

tensor(2., requires_grad=True)

In [37]:
x.grad    # Dz/Dx = 3x^2

tensor(12.)

Notice that above we differentiated a generic function z; in practice, the tensor we call .backward() on is the loss function.

### Example2:

In [40]:
x = torch.tensor(10.0)
w  = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(2.0, requires_grad=True)

y = w*x + b  # y is 3*10 + 2 = 32
# Notice that in this example and the one above, y is not a Python function but a PyTorch tensor!
y.backward()

print(w.grad) # Derivative of y w.r.t. w, which equals x = 10
print(b.grad) # Derivative of y w.r.t. b, which is 1
print(x.grad) # None, because requires_grad was not set to True for x

tensor(10.)
tensor(1.)
None


https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html#:~:text=Computes%20the%20gradient%20of%20current,function%20additionally%20requires%20specifying%20gradient%20.

As far as I understand, PyTorch builds a computation graph as operations are applied to tensors, for example as an input passes through a neural network. Calling .backward() on the output of the graph computes the derivatives of that output w.r.t. every tensor in the graph that has requires_grad=True. Note that .backward() is a method of any tensor, and the output is itself a tensor.

### One more note: when .backward() is called without arguments, the output tensor must be a scalar. The examples above satisfy this.
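
For a non-scalar output there are two common workarounds, sketched below under the assumption of a small hand-made tensor `x` (not part of the notebook): pass a `gradient` tensor to `.backward()`, or reduce the output to a scalar first.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2            # non-scalar output, shape (3,)

# y.backward() alone would raise a RuntimeError because y is not a scalar.
# A `gradient` tensor of the same shape supplies the weights of the
# vector-Jacobian product; ones means "sum the outputs, then differentiate".
y.backward(gradient=torch.ones_like(y))
grad_via_vector = x.grad.clone()

# Equivalent route: reduce to a scalar first.
x.grad = None                     # reset, otherwise gradients accumulate
(x ** 2).sum().backward()
grad_via_sum = x.grad.clone()

print(grad_via_vector, grad_via_sum)
```

Both routes give dy_i/dx_i = 2 * x_i for each entry.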

## How To Define A Loss Function in Pytorch

In [41]:
# Very general example of loss function
def my_loss(output, target):
    '''
    output: a PyTorch tensor, the output of the neural network.
    target: a PyTorch tensor with the known values, used when the model is supervised.
            For the ODE problem we won't use it, since that problem is unsupervised, i.e., we
            don't know the solution function.'''
    loss = torch.mean((output - target)**2)
    return loss
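
As a sanity check, `my_loss` should agree with the built-in `nn.MSELoss`, whose default reduction is the mean. The sketch below uses small hand-picked tensors (an assumption made for illustration, not data from the notebook) so the numbers can be verified by hand.

```python
import torch
import torch.nn as nn

def my_loss(output, target):
    return torch.mean((output - target) ** 2)

output = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
target = torch.tensor([[0.0, 2.0], [3.0, 6.0]])

custom = my_loss(output, target)          # (1 + 0 + 0 + 4) / 4 = 1.25
builtin = nn.MSELoss()(output, target)    # default reduction is 'mean'

custom.backward()
# The derivative of the mean squared error w.r.t. output_i is 2 * (output_i - target_i) / n.
expected_grad = 2 * (output.detach() - target) / output.numel()
print(custom.item(), builtin.item(), output.grad)
```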

In [10]:
# Example of code:

mae_loss = torch.nn.L1Loss() # Mean absolute error (L1) loss. For mean squared error use torch.nn.MSELoss()

input_1 = torch.randn(2, 3, requires_grad=True)
print("input_1 =", input_1)

target = torch.randn(2, 3)
print("target =", target)

output = mae_loss(input_1, target)
output.backward()
print("output=", output)

input_1 = tensor([[-0.3416,  1.9846, -0.0480],
        [-0.5596, -0.3991,  0.0959]], requires_grad=True)
target = tensor([[-0.4779,  2.3243,  0.7750],
        [-2.8559,  0.8496, -1.2165]])
output= tensor(1.0261, grad_fn=<L1LossBackward>)


In [11]:
mae_loss

L1Loss()

In [12]:
input_1.grad # Gradient of the mean absolute error f = (|x1 - t1| + ... + |x6 - t6|) / n, with n = 6
             # df/dxi = sign(xi - ti) / n = ±1/6 ≈ ±0.1667; the sign comes from the absolute value

tensor([[ 0.1667, -0.1667, -0.1667],
        [ 0.1667, -0.1667,  0.1667]])
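
The pattern of ±0.1667 can be checked against the closed form sign(x_i - t_i) / n. This is a small sketch with freshly drawn random tensors (the names `inp` and `tgt` and the seed are assumptions for illustration, so the signs will not match the output above).

```python
import torch

torch.manual_seed(0)
inp = torch.randn(2, 3, requires_grad=True)
tgt = torch.randn(2, 3)

loss = torch.nn.L1Loss()(inp, tgt)   # mean of |inp - tgt| over the 6 entries
loss.backward()

# Closed form: each entry contributes sign(inp_i - tgt_i) / n.
expected = torch.sign(inp.detach() - tgt) / inp.numel()
print(inp.grad)
print(expected)
```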

## more on torch.autograd.grad: computing gradients/derivatives

### Example 1

In [13]:
x = torch.Tensor(np.linspace(0,2,5)[:,None])
x.requires_grad = True
print(x)

tensor([[0.0000],
        [0.5000],
        [1.0000],
        [1.5000],
        [2.0000]], requires_grad=True)


In [14]:
y_x = lambda x: x**3

y_x_outputs = y_x(x)
y_x_outputs

tensor([[0.0000],
        [0.1250],
        [1.0000],
        [3.3750],
        [8.0000]], grad_fn=<PowBackward0>)

In [15]:
# y_der_x = torch.autograd.grad(y_x_outputs, x)   raises an error because the outputs are not a scalar,
#                                                 so we also pass grad_outputs = torch.ones_like(y_x_outputs)

# https://stackoverflow.com/questions/54754153/autograd-grad-for-tensor-in-pytorch#
# for more info see the link above.

In [16]:
y_der_x = torch.autograd.grad(outputs=y_x_outputs, inputs=x, grad_outputs=torch.ones_like(y_x_outputs))

# This is very similar to calling .backward() on the outputs, with two differences:
# 1) Called without arguments, .backward() must be applied to a scalar, so we would first have to
#    reduce the outputs to a scalar. Passing grad_outputs = ones is equivalent to summing them.
# 2) autograd.grad returns the derivatives instead of storing them in x.grad. We use it here because
#    this is an intermediate step: these derivatives only serve to build the loss function, while the
#    final .backward() is reserved for the neural network and the weight updates.

y_der_x    # a tuple with one element

(tensor([[ 0.0000],
         [ 0.7500],
         [ 3.0000],
         [ 6.7500],
         [12.0000]]),)

In [17]:
y_der_x[0]

tensor([[ 0.0000],
        [ 0.7500],
        [ 3.0000],
        [ 6.7500],
        [12.0000]])
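
A quick check of Example 1 against the analytic derivative 3x², sketched with a rebuilt copy of the same input. It also illustrates that `torch.autograd.grad` returns the result rather than writing it to `x.grad`.

```python
import numpy as np
import torch

x = torch.tensor(np.linspace(0, 2, 5)[:, None], dtype=torch.float32, requires_grad=True)
y = x ** 3

# autograd.grad returns a tuple with one entry per input tensor.
(dy_dx,) = torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y))

analytic = 3 * x.detach() ** 2
print(dy_dx.flatten())
print(analytic.flatten())
print(x.grad)   # autograd.grad does not populate .grad
```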

### Example 2

In [18]:
x = np.linspace(0, 2, 100)[:, None]

input_tensor = torch.Tensor(x)
input_tensor.requires_grad = True  # Changing its attribute.

N = nn.Sequential(nn.Linear(1, 5), nn.Sigmoid(), nn.Linear(5,1, bias=False))
Psi_t = lambda x: 3.2 + x * N(x)  # A function, depending on the neural network 

outputs = Psi_t(input_tensor) 

output = torch.mean(outputs)   # Scalar output! It is a PyTorch tensor holding a single value.
                               # Needed because .backward() without arguments applies only to a scalar.
                               # That is, outputs.backward() raises an error.

In [19]:
# output.backward()  # Let's compute the derivatives of output with respect to the tensors with requires_grad=True
# print(input_tensor.grad) # Derivatives of output w.r.t. input_tensor

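Notice that running output.backward() a second time gives an error, and the message suggests passing retain_graph=True. What does this mean? By default PyTorch frees the intermediate values stored in the graph as soon as a backward pass finishes, so the same graph cannot be traversed twice; retain_graph=True keeps them so that backward can be called again.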
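
The behaviour can be reproduced on a tiny graph. The sketch below assumes a scalar function x³ (chosen for illustration, not the network above) and shows a retained graph being reused, gradients accumulating across calls, and the error once the graph has been freed.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
out = x ** 3

out.backward(retain_graph=True)   # keep the graph's saved tensors alive
first = x.grad.clone()            # 3 * 2^2 = 12

out.backward()                    # works because the graph was retained
second = x.grad.clone()           # gradients accumulate: 12 + 12 = 24

try:
    out.backward()                # the previous call freed the graph
    raised = False
except RuntimeError:
    raised = True

print(first, second, raised)
```

The accumulation in `x.grad` is also why training loops call `optimizer.zero_grad()` before each backward pass.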

In [20]:
Psi_t_x = torch.autograd.grad(outputs, input_tensor, grad_outputs=torch.ones_like(outputs),
                    create_graph=True)[0] 

# create_graph (bool, optional) – If True, graph of the derivative will be constructed, 
# allowing to compute higher order derivative products. Default: False.
# We need it in particular when we want to compute second and higher derivatives.
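
A second derivative obtained this way, sketched on y = x³ instead of `Psi_t` so the result can be compared with the analytic values 3x² and 6x (the choice of function is an assumption made for illustration).

```python
import torch

x = torch.linspace(0, 2, 5).reshape(-1, 1).requires_grad_(True)
y = x ** 3

# First derivative; create_graph=True keeps dy_dx itself differentiable.
dy_dx = torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y), create_graph=True)[0]

# Second derivative: differentiate dy_dx once more w.r.t. x.
d2y_dx2 = torch.autograd.grad(dy_dx, x, grad_outputs=torch.ones_like(dy_dx))[0]

print(dy_dx.detach().flatten())   # compare with 3 * x^2
print(d2y_dx2.flatten())          # compare with 6 * x
```

Without `create_graph=True` the first result would carry no graph, and the second call would fail.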

In [21]:
# Psi_t_x

## Optimizer

In theory, an optimizer is a variation of the classical gradient descent you already know, designed to 
converge to a minimum faster. 
Classical gradient descent uses the update

$$w_{t+1} = w_t - h \cdot \frac{\partial E}{\partial w} $$

where $h$ is the learning rate and $E$ is the error (loss). Optimizers modify this update so that it converges faster. For a very quick example, see
https://www.geeksforgeeks.org/intuition-of-adam-optimizer/#:~:text=Adam%20optimizer%20involves%20a%20combination,minima%20in%20a%20faster%20pace.
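
One step of the classical update can be written by hand and compared with `torch.optim.SGD`. In this sketch the error function E(w) = (w - 3)² and the step size h = 0.1 are hypothetical choices made only for illustration.

```python
import torch

# Hypothetical error function E(w) = (w - 3)^2, chosen only for illustration.
h = 0.1
w = torch.tensor(0.0, requires_grad=True)

E = (w - 3) ** 2
E.backward()                       # dE/dw = 2 * (w - 3) = -6 at w = 0

# Manual update w_{t+1} = w_t - h * dE/dw, outside of autograd tracking.
with torch.no_grad():
    w_manual = w - h * w.grad      # 0 - 0.1 * (-6) = 0.6

# The same step taken by plain SGD with lr = h.
opt = torch.optim.SGD([w], lr=h)
opt.step()

print(w_manual.item(), w.item())
```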

-------------
From a code point of view, in PyTorch:

https://pytorch.org/docs/stable/optim.html#:~:text=optim-,torch.,easily%20integrated%20in%20the%20future.


To use torch.optim you have to construct an optimizer object, that will hold the current state and will update the parameters based on the computed gradients.

In [22]:
# General optimizer
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# model.parameters() are the parameters to optimize, i.e. the ones the optimizer will update.
# lr = learning rate

optimizer = torch.optim.LBFGS(N.parameters())

The optimizer has the method .step(), which updates the weights of the model once (one optimization step, not a full epoch).
There are 2 ways to call it:

### Method 1
used by most optimizers (e.g. SGD, Adam).

In [23]:
for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()              # After computing the gradients we apply the optimizer.
    optimizer.step()

NameError: name 'dataset' is not defined
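
The loop above is a template, which is why it fails without `dataset`, `model` and `loss_fn`. A runnable sketch follows, assuming hypothetical toy data y = 2x + 1, a single linear layer, and a list holding one full batch in place of a real dataset.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy data: y = 2x + 1 on 20 points.
xs = torch.linspace(-1, 1, 20).reshape(-1, 1)
ys = 2 * xs + 1
dataset = [(xs, ys)]               # a single full batch stands in for a real dataset

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(200):
    for inputs, target in dataset:
        optimizer.zero_grad()      # clear gradients left from the previous step
        output = model(inputs)
        loss = loss_fn(output, target)
        loss.backward()            # fill .grad for every parameter
        optimizer.step()           # update the parameters from .grad
        losses.append(loss.item())

print(losses[0], losses[-1])
print(model.weight.item(), model.bias.item())   # should approach 2 and 1
```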

### Method 2
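used by optimizers such as LBFGS, which need to re-evaluate the function several times per step, so the computation is passed to them as a closure.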

In [None]:
for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)   # must be recomputed inside the closure: the optimizer may call it several times
        loss = loss_fn(output, target)
        loss.backward()         # After computing the gradients we apply the optimizer.
        return loss             
    optimizer.step(closure)     # Here updates the weights.
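
A runnable version of the closure pattern with LBFGS, sketched on the same kind of hypothetical toy problem (y = 2x + 1, one linear layer). The data, the model and the number of steps are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy data: y = 2x + 1 on 20 points.
xs = torch.linspace(-1, 1, 20).reshape(-1, 1)
ys = 2 * xs + 1

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0)

def closure():
    # Called by the optimizer, possibly several times within one step().
    optimizer.zero_grad()
    loss = loss_fn(model(xs), ys)   # forward pass recomputed on every call
    loss.backward()
    return loss

initial_loss = loss_fn(model(xs), ys).item()
for _ in range(10):
    optimizer.step(closure)
final_loss = loss_fn(model(xs), ys).item()

print(initial_loss, final_loss)
```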

Under the hood: the optimizer never sees the loss function itself. Calling loss.backward() stores, 
in the .grad attribute of each parameter, the derivative of the loss function with respect to that parameter. 
When we created the optimizer we gave it N.parameters(), the parameters of the neural network, 
so .step() reads those .grad values and changes the parameters automatically.
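
The link between the optimizer and the gradients is the `.grad` attribute of each parameter. In this sketch the gradient is written by hand on a hypothetical parameter `p`, with no loss and no `.backward()`, and `step()` still applies the update.

```python
import torch

p = torch.tensor([1.0, 2.0], requires_grad=True)
opt = torch.optim.SGD([p], lr=0.5)

# No loss and no backward(): the gradient is set manually.
p.grad = torch.tensor([2.0, -2.0])
opt.step()                          # p <- p - lr * p.grad
after_step = p.detach().clone()     # [1 - 1, 2 + 1] = [0, 3]

opt.zero_grad()                     # clears p.grad before the next backward pass
print(after_step, p.grad)
```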
    