# **ICT303 - Advanced Machine Learning and Artificial Intelligence**
# **Lecture 2 - Preliminaries**

This notebook illustrates some examples of the concepts seen in Lecture 2 of ICT303. Most of the examples have been adapted from https://d2l.ai/chapter_preliminaries/index.html.

### **1. Automatic Differentiation**

Here we will illustrate how to use automatic differentiation in PyTorch.

#### **1.1. A simple example**

Consider the function $y=2\text{x}^\top\text{x}$ where $\text{x} = (x_1, \dots, x_n)^\top$, i.e., a column vector of $n$ variables. To start, let's assign $\text{x}$ an initial value:


In [None]:
import torch

x = torch.arange(4.0)  # creates a vector x of 4 elements with values 0, 1, 2, 3
x

tensor([0., 1., 2., 3.])

Next, we tell PyTorch that we will need the gradient with respect to the variable $\text{x}$.

In [None]:
x.requires_grad_(True)  # Alternatively, you can use `x = torch.arange(4.0, requires_grad=True)`
x.grad                  # The default value is None

We now calculate the function and assign the result to $\text{y}$.

In [None]:
y = 2 * torch.dot(x, x)
y

When you run the code above, it will display the value of $\text{y}$ as $28$ together with the name of the function that will be used to compute the gradient, i.e., `tensor(28., grad_fn=<MulBackward0>)`.

We can now take the gradient of $\text{y}$ with respect to $\text{x}$ by calling its **backward** method, and can access the gradient via $\text{x}$’s **grad** attribute:


In [None]:
y.backward()    # Computes the gradient of y

# Gradient with respect to x
x.grad

We can verify that the gradient computed with autograd is correct. We already know that the gradient of the function $y=2\text{x}^\top\text{x}$ with respect to $\text{x}$ should be $4\text{x}$. Thus, we can check that the automatically computed gradient and the expected result are identical:

In [None]:
x.grad == 4 * x  # element-wise comparison of the autograd gradient (x.grad) with the manually derived gradient (4 * x)
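
The comparison above returns one boolean per element. If a single yes/no answer is preferred, one option is `torch.equal`, as in the minimal sketch below. The name `grads_match` is ours, introduced only for illustration.

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = 2 * torch.dot(x, x)
y.backward()

# torch.equal returns a single Python bool: True only if both tensors
# have the same shape and the same values.
grads_match = torch.equal(x.grad, 4 * x.detach())
print(grads_match)
```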

Now let’s calculate another function of $\text{x}$ and take its gradient. Note that PyTorch does not automatically reset the gradient buffer when we compute a new gradient. Instead, the new gradient is added to the gradient already stored. This behavior comes in handy when we want to optimize the sum of multiple objective functions. To reset the gradient buffer, we can call `x.grad.zero_()` as follows:

In [None]:
x.grad.zero_()  # Reset the gradient
y = x.sum()     # Sums the elements of the vector x
y.backward()    # Computes gradient with respect to x
x.grad
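
To see the accumulation behaviour in isolation, the sketch below calls `backward` twice without resetting and then once more after a reset. The variable names `first`, `accumulated` and `after_reset` are ours and are only there to keep copies of the buffer at each step.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# First backward pass: the gradient of x.sum() is a vector of ones.
x.sum().backward()
first = x.grad.clone()

# Second backward pass without a reset: the new gradient is added
# to the one already stored in x.grad.
x.sum().backward()
accumulated = x.grad.clone()

# Reset the buffer in place, then compute the gradient again.
x.grad.zero_()
x.sum().backward()
after_reset = x.grad.clone()

print(first, accumulated, after_reset)
```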

#### **1.2. Non-scalar variables**
If interested (not needed at this stage), please see Section 2.5.4 of https://d2l.ai/chapter_preliminaries/autograd.html.
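
As a brief taste, here is a minimal sketch that assumes the element-wise function $\text{y} = \text{x} \odot \text{x}$ purely for illustration. Calling `backward` on a non-scalar tensor requires a `gradient` argument with the same shape as the output. Passing a vector of ones gives the same result as differentiating `y.sum()`.

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = x * x  # vector-valued output, one entry per element of x

# backward() on a non-scalar tensor needs a `gradient` argument.
# A vector of ones is equivalent to calling y.sum().backward().
y.backward(gradient=torch.ones_like(y))
print(x.grad)  # the derivative of x_i^2 is 2 * x_i
```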

#### **1.3. Gradients and Python Control Flow**
So far we reviewed cases where the path from input to output was well-defined via a function such as $\text{z} = \text{x}^3$. Programming offers us a lot more freedom in how we compute results. For instance, we can make them depend on auxiliary variables or condition choices on intermediate results. 

One benefit of using automatic differentiation is that even if building the computational graph of a function required passing through a maze of Python control flow (e.g., conditionals, loops, and arbitrary function calls), we can still calculate the gradient of the resulting variable. 

To illustrate this, consider the following code snippet where the number of iterations of the while loop and the evaluation of the if statement both depend on the value of the input $a$.

In [None]:
def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

Below, we call this function, passing in a random value as input. Since the input is a random variable, we do not know what form the computational graph will take. However, whenever we execute $f(a)$ on a specific input, we realize a specific computational graph and can subsequently run backward.

In [None]:
a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()

Now we can access the gradient via the `grad` attribute of `a`:

In [None]:
a.grad
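
One way to sanity-check this gradient: for any given input, `f` only multiplies `a` by constants, so $f(a) = k \cdot a$ for some $k$ that depends on the path taken, and the gradient should equal $k = f(a)/a$. The sketch below repeats the definition of `f` so that it is self-contained. It uses `torch.allclose` rather than `==` to tolerate floating-point rounding, and the name `matches` is ours.

```python
import torch

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()

# f(a) = k * a for the realized control flow, so the gradient is k = d / a.
matches = torch.allclose(a.grad, (d / a).detach())
print(matches)
```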

Dynamic control flow is very common in deep learning. For instance, when processing text, the computational graph depends on the length of the input. In these cases, automatic differentiation becomes vital for statistical modeling since it is impossible to compute the gradient a priori.

#### **1.4. Some exercises**

These exercises are from Section 2.5.6 of the textbook.

***Question 1.*** After running the function for backpropagation, immediately run it again and see what happens. Why?
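
A possible way to explore Question 1 is sketched below, reusing the function from Section 1.1. It catches the error raised by the second call, then repeats the experiment with `retain_graph=True`, which asks PyTorch to keep the intermediate buffers after the first backward pass. The name `second_call_failed` is ours.

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = 2 * torch.dot(x, x)
y.backward()

# The buffers saved for the backward pass are freed after the first call,
# so a second call on the same graph raises a RuntimeError.
try:
    y.backward()
    second_call_failed = False
except RuntimeError as err:
    second_call_failed = True
    print(err)

# With retain_graph=True the graph is kept, so a second call works
# and its gradient is added to the one already stored.
x.grad.zero_()
y = 2 * torch.dot(x, x)
y.backward(retain_graph=True)
y.backward()
print(x.grad)  # 4x + 4x = 8x
```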

***Question 2.*** Let $f(x) = \sin(x)$. Plot the graph of $f$ and of its derivative. Do not exploit the fact that the derivative is $\cos(x)$; instead, use automatic differentiation to obtain it.

In [None]:
import math

import matplotlib.pyplot as plt
import torch

# Reference: https://pytorch.org/tutorials/beginner/introyt/autogradyt_tutorial.html

x = torch.linspace(-math.pi, math.pi, steps=25, requires_grad=True)
y = torch.sin(x)

# Plot y = sin(x).
# detach() returns a tensor that shares storage with the original but does not require grad.
# It is detached from the computational graph, so no gradient is backpropagated through it.
plt.plot(x.detach(), y.detach())

print(y)


Finally, let’s compute a single-element output. When you call `.backward()` on a tensor with no arguments, it expects the calling tensor to contain only a single element, as is the case when computing a loss function. Since `out` below is the sum of the entries of `y`, its gradient with respect to each $x_i$ is exactly $f'(x_i)$. You can then print the gradient with respect to $x$ and plot its graph.

In [None]:
out = y.sum()
out.backward()

print(x.grad)
plt.plot(x.detach(), x.grad.detach())
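
As an optional check (not part of the exercise itself, which asks us not to rely on the closed form), the sketch below compares the autograd result with $\cos(x)$ on the same grid. The name `max_error` is ours.

```python
import math
import torch

x = torch.linspace(-math.pi, math.pi, steps=25, requires_grad=True)
y = torch.sin(x)
y.sum().backward()

# Largest absolute difference between the autograd gradient and cos(x).
max_error = (x.grad - torch.cos(x.detach())).abs().max().item()
print(max_error)
```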
