<div style="text-align: center; font-size: 32px; font-weight: bold;">
    Gradient Calculation with Autograd
</div>

- Let's look at the autograd package in PyTorch and see how to calculate gradients with it. Gradients are essential for model optimization.

- To calculate the gradient of some function w.r.t. `x`, we have to create the tensor with the argument `requires_grad=True`.

```
# Create a tensor that tracks gradients
import torch
x = torch.randn(3, requires_grad=True)
print(x)
# >>> tensor([-0.3172, 0.3467, 0.6648], requires_grad=True)
```
- Whenever we do operations with this tensor, PyTorch creates a computational graph (CG) for us. Let's do an operation:
$$y = x + 2$$
- This creates a CG like the one in the image below. Each operation is a node with inputs and an output. Here the operation is addition, the inputs are $x$ and $2$, and the output is $y$. PyTorch builds this computational graph so that it can later use backpropagation to calculate the gradients.

<center><img src='./images/CG_addition.PNG' width=350px></center> 

- With this graph, a technique called __backpropagation__ is used to calculate the gradients.
- First we do the forward pass, where we apply the operation and calculate the output $y$. Since we specified that $x$ requires a gradient, PyTorch automatically creates and stores a function for us, and this function is then used in backpropagation to get the gradients.
- Here $y$ has an attribute `grad_fn`, which points to a gradient function. In this case it is called `AddBackward0`.
With this function we can calculate the gradients in the backward pass, i.e., $dy/dx$.

```
y = x + 2
print(y)
# >>> tensor([1.6828, 2.3467, 2.6648], grad_fn=<AddBackward0>)
```
- We can see the `AddBackward0` function when we print $y$, because our operation is an addition.
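
To make the `grad_fn` attribute more concrete, the short sketch below inspects it on a tiny graph. The fixed input values are an assumption chosen here so that the run is reproducible; they do not come from the cells in this notebook.

```python
import torch

# Fixed values (chosen for illustration) instead of random ones
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x + 2

print(x.grad_fn)         # None: x was created by the user, not by an operation
print(x.is_leaf)         # True: x is a leaf of the computational graph
print(y.requires_grad)   # True: inherited from x
print(y.grad_fn.name())  # AddBackward0
```

Only tensors produced by an operation on a gradient-tracking tensor carry a `grad_fn`; leaf tensors such as `x` store the final gradient in `.grad` instead.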

In [None]:
# Create a tensor; by default it does not track gradients.
import torch
x = torch.randn(3)
print(x)

# If we later need the gradient of some function w.r.t. x (e.g. for the optimization step),
# we pass the additional argument requires_grad=True
x = torch.randn(3, requires_grad=True)
print(x)

In [None]:
# Whenever we do a computation with this tensor, PyTorch creates a computational graph.
# Let's do an operation
y = x + 2
print(y)

In [None]:
# Let's do some more operations with our tensors.
# Printed values are from example runs and depend on the random x.
z = y*y*2 # z also has a grad_fn attribute
print(z)
# >>> tensor([7.2041, 0.3141, 7.1014], grad_fn=<MulBackward0>)

z = z.mean() # mean() must be called; the result is a scalar that also has a grad_fn
print(z)
# >>> tensor(8.9153, grad_fn=<MeanBackward0>)

# To calculate the gradients w.r.t. x, the only thing we have to do is call `.backward()`
z.backward() # dz/dx
# x then has a `.grad` attribute where the gradients are stored.
print(x.grad)
# >>> tensor([0.0160, 3.3650, 4.5153])

# NOTE: backward() computes a vector-Jacobian product. If the output is a scalar, we don't need to
# pass any vector argument. If the output is not a scalar, we have to pass a vector of the same shape.

In the background, autograd computes a vector-Jacobian product to get the gradients, which looks like this:

#### Vector-Jacobian Product
$$
J^{T}\cdot v = 
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1}\\
\vdots & \ddots & \vdots\\
\frac{\partial y_1}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n}\\
\end{bmatrix}
\begin{bmatrix}
\frac{\partial l}{\partial y_1} \\
\vdots \\
\frac{\partial l}{\partial y_m} \\
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial l}{\partial x_1} \\
\vdots \\
\frac{\partial l}{\partial x_n} \\
\end{bmatrix}
$$

- Here the Jacobian matrix of partial derivatives, written in transposed form (the first matrix in the product), is multiplied by the gradient vector $v$ of the loss w.r.t. $y$. The result is the final gradient we are interested in. This is the __chain rule__.
- Note that `backward()` always computes this product with a vector $v$. If the output is a scalar, we don't need to pass any vector argument. If the output is not a scalar, we have to pass a vector of the same shape as the output.
```
z = y*y*2 # z is not a scalar here
print(z)

z.backward() # raises a RuntimeError because z is not a scalar
```

#### <span style="color: red;">This will give an error</span>

In [None]:
import torch
x = torch.ones(5, requires_grad=True)
y = x + 2
z = y*y*2 # z also has a grad_fn attribute
print(z)
z.backward() # dz/dx
# RuntimeError: grad can be implicitly created only for scalar outputs


In [None]:
# So we have to create a vector with the same shape as z (5 elements here; the values are arbitrary weights)
v = torch.tensor([0.1, 1.0, 0.001, 0.1, 1.0], dtype=torch.float32)
z.backward(v) # vector-Jacobian product
print(x.grad)
# Remember that in theory this is a vector-Jacobian product. Most of the time the
# last operation creates a scalar value, so we can call backward() without an argument. If the output is a vector, we have to pass a vector argument.

### Try the above with a different size, e.g. `x = torch.randn(3, requires_grad=True)`; `v` must then have 3 elements
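
As a rough check of the vector-Jacobian idea, the sketch below compares the result of `z.backward(v)` with an explicitly built Jacobian. The helper name `f` and the values in `v` are assumptions made for illustration; building the full Jacobian with `torch.autograd.functional.jacobian` is done here only for comparison and is not what autograd does internally.

```python
import torch

x = torch.ones(5, requires_grad=True)

def f(t):
    # Same computation as in the cells above: z = 2 * (t + 2)^2, element-wise
    y = t + 2
    return y * y * 2

z = f(x)
# Illustrative weights; v must have the same shape as z
v = torch.tensor([0.1, 1.0, 0.001, 0.1, 1.0])
z.backward(v)

# Full Jacobian dz_i/dx_j, built explicitly only for comparison
J = torch.autograd.functional.jacobian(f, x.detach())
expected = J.T @ v

print(x.grad)
print(expected)
print(torch.allclose(x.grad, expected))  # True
```

Because the function is element-wise, the Jacobian is diagonal with entries $4(x_i + 2) = 12$, so the gradient is simply $12 \cdot v$.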

### Prevent PyTorch from tracking gradient history and creating `grad_fn` attributes
- How to stop autograd from tracking history
- How to zero (empty) gradients
- Example: in our training loop, when we update the weights, the update operation should not be part of the gradient computation.
- So we need to stop PyTorch from creating the `grad_fn` attribute.
- There are three ways to do that:
```
x.requires_grad_(False)  # in-place: x no longer requires gradients
y = x.detach()           # creates a new tensor that doesn't require gradients
with torch.no_grad():    # operations inside the with block are not tracked
    y = x + 2
```

In [None]:
import torch
x = torch.ones(5, requires_grad=True)
print(x)

In [None]:
# First option
x.requires_grad_(False) # remember: a function with a trailing underscore modifies the variable in place
print(x) # x doesn't have the `requires_grad` attribute set anymore

In [None]:
# Second option: this creates a new tensor with the same values that doesn't require gradients
y = x.detach()
print(y)

In [None]:
# Third option: wrap in a `with` statement
with torch.no_grad():
    y = x + 2
    print(y)

### NOTE: 

Whenever we call the __backward function__, the gradient for the tensor is accumulated in the `.grad` attribute, so the values are summed up. We must be very careful here. Let's create a dummy training example.

In [None]:
weights = torch.ones(4, requires_grad=True)

# Training loop
# First let's do only one epoch
for epoch in range(1):
    model_output = (weights*3).sum() # a dummy operation which simulates some model output
    # calculate the gradients
    model_output.backward()
    # print the accumulated gradients
    print(weights.grad)

In [None]:
# If we run more iterations, we can see that the gradients are accumulated.
for epoch in range(2):
    model_output = (weights*3).sum()
    model_output.backward()
    print(weights.grad)

# Similarly, run it for 3 epochs: we can see that the gradients are summed up
for epoch in range(3):
    model_output = (weights*3).sum()
    model_output.backward()
    print(weights.grad)

### Empty the gradients
We can see that all the values are summed up, so the gradients are not correct. Before we do the next iteration of the optimization step, we must __empty the gradients__.

In [None]:
for epoch in range(2):
    model_output = (weights*3).sum()
    model_output.backward()
    
    print(weights.grad)
    
    weights.grad.zero_() # This is important: reset the gradients to zero
    
    print(weights.grad)

In [None]:
# Later we will work with PyTorch's built-in optimizers from `torch.optim`, where we have to do the same.
import torch
weights = torch.ones(4, requires_grad=True)
optimizer = torch.optim.SGD([weights], lr=0.01) # parameters are passed as an iterable
# With this optimizer we can do the optimization step
optimizer.step()
# Before doing the next step, empty the gradients
optimizer.zero_grad()
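
The order of the calls in a full loop can be sketched as follows. This is a minimal illustration that reuses the dummy model output from the cells above as a stand-in for a loss; the number of epochs and the learning rate are arbitrary choices.

```python
import torch

weights = torch.ones(4, requires_grad=True)
# SGD expects an iterable of parameters, hence the list
optimizer = torch.optim.SGD([weights], lr=0.01)

for epoch in range(3):
    model_output = (weights * 3).sum()  # dummy "loss", as above
    model_output.backward()             # gradient of each weight is 3
    optimizer.step()                    # weights <- weights - lr * grad
    optimizer.zero_grad()               # reset before the next iteration

print(weights)
```

Since the gradient is reset after every step, each epoch subtracts exactly $0.01 \cdot 3 = 0.03$, so every weight ends up close to $1 - 3 \cdot 0.03 = 0.91$.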

### Summary
- For now, the thing to remember is that whenever we want to calculate gradients, we have to set the argument `requires_grad=True`.
`weights = torch.ones(4, requires_grad=True)` \
- Then calculate the gradients by calling the backward function. \
`z.backward()` \
- Before the next iteration of our optimization loop, we must empty the gradients. \
`weights.grad.zero_()`

<div style="text-align: center; font-size: 32px; font-weight: bold;">
    Backpropagation - Theory With Example
</div>

Backpropagation algorithm to calculate gradients



### Chain Rule
<center><img src='./images/chainRule.PNG' width=350px></center> 


Let's say we have two operations, or two functions. We take an input $x$ and apply a function $a(x)$, which gives the output $y$. This is passed on to a function $b(y)$, which gives the final output $z$. We want to minimize $z$, so we want to know the derivative of $z$ w.r.t. $x$: $\frac{dz}{dx} = ?$ We can get it with the chain rule: we first compute the local derivatives $\frac{\partial z}{\partial y}$ and $\frac{\partial y}{\partial x}$ and then multiply them:
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x}$$
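
A small numeric sketch of the chain rule, in plain Python. The concrete functions $a(x) = 3x$ and $b(y) = y^2$ are hypothetical choices made for illustration; the notebook itself leaves $a$ and $b$ abstract. The result is checked against a central finite difference.

```python
# Hypothetical functions chosen for illustration: a(x) = 3x, b(y) = y^2
def a(x):
    return 3.0 * x

def b(y):
    return y ** 2

x = 2.0
y = a(x)               # forward: y = 6
z = b(y)               # forward: z = 36

dz_dy = 2.0 * y        # local gradient of b at y
dy_dx = 3.0            # local gradient of a at x
dz_dx = dz_dy * dy_dx  # chain rule: 12 * 3 = 36

# Central finite difference as an independent check
h = 1e-6
numeric = (b(a(x + h)) - b(a(x - h))) / (2 * h)

print(dz_dx)                        # 36.0
print(abs(dz_dx - numeric) < 1e-4)  # True
```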

### Computational Graph (CG)
- The next concept is the CG. For every operation we do with tensors, PyTorch creates a CG. At each node we apply some operation/function to some inputs and get an output.
#### Example
- Here we multiply $x$ and $y$, so the node holds the multiplication operator.
- For these nodes we can calculate so-called __local gradients__, and we can use them in the __chain rule__ to get the final gradient.
- Local gradients are easy to compute because we know the function at each node.

<center><img src='./images/CG_explained.PNG' width=450px></center> 

#### Why do we need local gradients?
- Because the graph has more operations. At the end we calculate the loss function, which we want to minimize.
- Let's assume we know the gradient of the loss at position $z$, i.e., $\frac{\partial loss}{\partial z}$. Then we can obtain the final gradient we want with the chain rule:
$$\frac{\partial loss}{\partial x} =\frac{\partial loss}{\partial z} \cdot \frac{\partial z}{\partial x}$$
where $\frac{\partial z}{\partial x}$ is the local gradient.

The whole concept consists of three steps:
1. Forward Pass: Apply all the functions and Compute Loss
2. Compute Local Gradients: At each node
3. Backward Pass: Compute the gradient of the loss w.r.t. weights $dLoss/dWeight$ using chain rule.

## Linear Regression
- In regression we want to predict continuous values, whereas in classification we want to predict discrete values like 0 or 1.
- Approximation:
$$\hat{y} = wx + b$$

where $w$ is the slope and $b$ is the intercept, i.e. the shift along the $y$-axis in the 2D case. We have to come up with an algorithm to find $w$ and $b$. For that we have to define a cost function. In linear regression it is the mean squared error (MSE):

$$MSE = J(w,b) = \frac{1}{N} \sum_{i=1}^{N} (y_i - (wx_i + b))^2$$

- We want to minimize this error. To find the minimum we have to find the derivative, or gradient, so we calculate the gradient w.r.t. $w$ and $b$:
$$J'(w,b) = \begin{bmatrix} \frac{\partial J}{\partial w} \\ \frac{\partial J}{\partial b} \end{bmatrix} = \begin{bmatrix} \frac{1}{N} \sum_{i=1}^{N} -2x_i (y_i - (wx_i + b)) \\ \frac{1}{N} \sum_{i=1}^{N} -2 (y_i - (wx_i + b)) \end{bmatrix}$$
- In the simplest case we model our output as a linear combination of some weights and inputs, so $\hat{y} = w \cdot x$, and formulate the loss function. Let's assume it is the squared error:
- $$Loss = (\text{predicted } y - \text{actual } y)^2$$
- $$Loss = (\hat{y} - y)^2$$
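
As a sanity check, the gradients of the MSE can be compared with what autograd returns. Differentiating the MSE above by hand gives $\frac{\partial J}{\partial w} = \frac{1}{N}\sum_i -2x_i(y_i - (wx_i + b))$ and $\frac{\partial J}{\partial b} = \frac{1}{N}\sum_i -2(y_i - (wx_i + b))$. The toy data and the initial values of `w` and `b` below are made up for illustration.

```python
import torch

# Toy data, made up for illustration (it follows y = 2x + 1)
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
y = torch.tensor([3.0, 5.0, 7.0, 9.0])

w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# MSE as defined above, differentiated by autograd
loss = ((y - (w * x + b)) ** 2).mean()
loss.backward()

# Hand-derived gradients of the MSE
with torch.no_grad():
    residual = y - (w * x + b)
    dJ_dw = (-2 * x * residual).mean()
    dJ_db = (-2 * residual).mean()

print(w.grad, dJ_dw)  # both -27.5
print(b.grad, dJ_db)  # both -9.5
```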

## Example: Linear Regression
- We model our output with a linear combination of some weights and some inputs. 
- Approximation:
$$\hat{y} = wx$$
where $w$ is the slope. We have to come up with an algorithm to find $w$. For that we have to define a cost function. In linear regression it is the mean squared error, but for simplicity we just take the squared error of a single sample. Otherwise we would need another operation to compute the mean.

- We formulate our loss function as the squared difference between the predicted and the actual value:
$$loss = (\hat{y} - y)^2 = (wx - y)^2$$

- Now we want to minimize our loss function. For that we need the derivative of the loss w.r.t. the weights:
$$Minimize \quad Loss \rightarrow \frac{\partial {Loss}}{\partial w}$$

#### How to get the derivative of the loss w.r.t. the weights?
We apply three steps:
1. Forward Pass: Apply all the functions and Compute Loss
2. Compute Local Gradients: At each node
3. Backward Pass: Compute the gradient of the loss w.r.t. weights $dLoss/dWeight$ using chain rule.

<center><img src='./images/linear_regression_example.PNG' width=700px></center>

Example: let $x$ and $y$ be our training sample.
#### (1) Forward Pass, $x=1, y=2$
- We initialize our weight, $w=1$.
- Do the forward pass. At the first node we multiply $x$ and $w$: $\hat{y}=1\cdot 1 = 1$
- At the next node we do a subtraction: $S = \hat{y} - y = 1-2=-1$
- At the very end we square $S$: $Loss = (-1)^2 = 1$

####  (2) Calculate local gradients
- At the last node: $\frac{\partial loss}{\partial S} = \frac{\partial S^2}{\partial S} = 2S$
- At the second-to-last node: $\frac{\partial S}{\partial \hat{y}}=\frac{\partial (\hat{y}-y)}{\partial \hat{y}} = 1$
- At the first node: $\frac{\partial \hat{y}}{\partial w}=\frac{\partial (wx)}{\partial w} = x$
- We don't need the derivatives w.r.t. $x$ and $y$, because these are fixed values. We are only interested in the parameters we want to update.

####  (3) Backward pass
Now we use our local gradients to calculate the final gradient of the loss w.r.t. $w$. Going backwards through the graph, we apply the chain rule at each node:
- $\frac{\partial loss}{\partial \hat{y}} =\frac{\partial loss}{\partial S} \cdot \frac{\partial S}{\partial \hat{y}} = 2\cdot S \cdot 1 = -2$
- $\frac{\partial loss}{\partial w} =\frac{\partial loss}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w} = -2 \cdot x = -2$

In [None]:
# To minimize the loss we apply the three steps.
import torch
x = torch.tensor(1.0)
y = torch.tensor(2.0)

# This is the parameter we want to optimize -> requires_grad=True
w = torch.tensor(1.0, requires_grad=True)

# forward pass to compute loss
y_predicted = w * x
loss = (y_predicted - y)**2
print(loss)

# backward pass to compute the gradient dLoss/dw
loss.backward()
print(w.grad) # it should be -2 in the beginning

# update weights: Next forward and backward pass...

# continue optimizing:
# update weights, this operation should not be part of the computational graph
with torch.no_grad():
    w -= 0.01 * w.grad
# don't forget to zero the gradients
w.grad.zero_()

# next forward and backward pass...
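
Repeating the forward pass, backward pass, and update in a loop could look like the sketch below. The learning rate of 0.1 and the 50 iterations are assumptions picked so that the run converges quickly; they are not values from the cell above, which uses 0.01.

```python
import torch

x = torch.tensor(1.0)
y = torch.tensor(2.0)
w = torch.tensor(1.0, requires_grad=True)

learning_rate = 0.1  # assumption: larger than the 0.01 above, to converge in fewer steps

for step in range(50):
    loss = (w * x - y) ** 2   # forward pass
    loss.backward()           # backward pass: fills w.grad
    with torch.no_grad():     # the update is not part of the graph
        w -= learning_rate * w.grad
    w.grad.zero_()            # reset the accumulated gradient

print(w)  # approaches 2.0, where the loss is zero
```

With $x = 1$ and $y = 2$ the loss $(wx - y)^2$ is minimal at $w = 2$, which is the value the loop moves towards.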


## 🎉 **Thank You!** 🙌  
### 🚀 Happy Coding & Keep Learning! 💡

## <span style="color: yellow;">We will see the Backpropagation with detailed theory and example in next notebook</span>