# Lecture 8 - PyTorch

This is the final lecture. We will start with a brief introduction to deep learning, then look at the basics of using PyTorch to implement some simple deep learning models.

1. Homework due today

# Basic Section (Start)
What is a neural network:
1. [My recommended youtube video with great visual helpers](https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=4)
2.  [Previous Years Instructor's Recording](https://stanford-pilot.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=c79271d9-3694-49d3-8f93-afb10161e47e)
# Basic Section (End)

## Deep Learning Libraries

There are many deep learning libraries available. The most common ones for Python are

- TensorFlow, Keras
- PyTorch

Working with TensorFlow has traditionally required going into a lot of detail about the construction of the computation graph, whereas Keras is a higher-level interface for TensorFlow. TensorFlow is very popular in industry and well suited to production code.

PyTorch can be used as a low-level interface yet is much more user-friendly than TensorFlow, and it also offers a higher-level interface. PyTorch is more popular in the research community.

## Main features that any deep learning library should provide

No matter what library or language you use, the main features provided by a deep learning library are
1. Use the GPU to speed up computation
2. Ability to do automatic differentiation
3. Useful library functions for common architectures and optimization algorithms

### PyTorch
We will look at all of the above in PyTorch.
The best way to think about PyTorch is that it's NumPy + GPU + autograd.

You can install it with

```conda install pytorch```.

Alternatively (and recommended), run this notebook in Google Colab, which provides an environment with all of the PyTorch dependencies plus a GPU free of charge.

In [None]:
import torch
import numpy as np

The PyTorch equivalent of a NumPy array is called a tensor; it is just a multidimensional array.

In [None]:
torch.tensor([2,3,4,5])

In [None]:
torch.zeros((5,5))

In [None]:
x = torch.ones((5,5))
print(type(x))
x

In [None]:
2*x + 5

In [None]:
torch.randn(5,5)

In [None]:
x = torch.rand(25)
x

In [None]:
x=x.reshape(-1,5)
x

In [None]:
x.shape

In [None]:
print(torch.arange(10))
print(torch.eye(5))
print(torch.linspace(0,1,10))

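Some functions are a bit different. For example, `np.random.rand` takes the dimensions as separate arguments and `np.ones` takes the shape as a single argument (an int or a tuple), whereas `torch.rand` and `torch.ones` accept either form.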

In [None]:
A = np.random.rand(5,5)
x = np.ones((5,1))
A@x

In [None]:
A_ = torch.rand((5,5))
x_ = torch.rand(5,1)
A_@x_
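
Another small difference worth knowing, shown here as a sketch rather than an exhaustive list: reductions name their argument `axis` in NumPy and `dim` in PyTorch. The array names below are made up for this illustration.

```python
import numpy as np
import torch

a_np = np.arange(6, dtype=np.float32).reshape(2, 3)        # [[0, 1, 2], [3, 4, 5]]
a_t = torch.arange(6, dtype=torch.float32).reshape(2, 3)   # same values as a tensor

# Reductions: NumPy calls the argument `axis`, PyTorch calls it `dim`.
col_sums_np = a_np.sum(axis=0)
col_sums_t = a_t.sum(dim=0)

# Random arrays: torch.rand accepts a tuple or separate ints,
# np.random.rand accepts separate ints only.
r_t = torch.rand((2, 3))
r_np = np.random.rand(2, 3)

print(col_sums_np, col_sums_t)
print(r_t.shape, r_np.shape)
```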

In [None]:
A = torch.ones((5,5))
x = torch.ones(5,1)
A@x

In [None]:
?np.ones

In [None]:
?torch.ones

You can convert a tensor (on the CPU) to a NumPy array that shares its memory with the PyTorch tensor. This lets you use libraries that are compatible with NumPy but not with PyTorch.

In [None]:
x = torch.ones(5,5)
x

In [None]:
xn = x.numpy()
print(type(xn))
xn

In [None]:
# The array and the tensor share memory, so changing the NumPy array also changes the tensor
xn[4,2]=10
xn

In [None]:
x
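
If the shared memory is not wanted, one option is to copy the tensor first. The sketch below uses made-up names (`t`, `shared`, `copied`, `arr`, `t2`) and also shows that the reverse conversion with `torch.from_numpy` shares memory in the same way.

```python
import numpy as np
import torch

t = torch.ones(2, 2)
shared = t.numpy()          # view on the same memory (CPU tensors only)
copied = t.clone().numpy()  # clone() first to get an independent copy

shared[0, 0] = 10.0         # also changes t, but not the copy
print(t[0, 0].item(), copied[0, 0])

# The reverse direction: torch.from_numpy also shares memory.
arr = np.zeros(3)
t2 = torch.from_numpy(arr)
arr[1] = 7.0                # visible through t2 as well
print(t2)
```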

### Using the GPU

The GPU (Graphics Processing Unit) is a separate processing unit specialized for the bulk computations required to render high-quality graphics. It consists mainly of a large number of processor cores that are individually slower than a CPU core, but because of their sheer number (in the thousands) they can churn through parallel computations very quickly.

In [None]:
import torch
# CUDA is a parallel computing platform and application programming interface
# that allows software to use certain types of graphics processing units
# for general-purpose processing.
# To learn CUDA programming itself: CME213 (C++) is recommended.
torch.cuda.is_available()

Installing the GPU drivers and the CUDA toolkit can be quite messy, so if you just want to experiment with GPUs and deep learning libraries, you can use [Google Colaboratory](https://colab.research.google.com/).

In [None]:
gpu = torch.device("cuda")
cpu = torch.device("cpu")

In [None]:
A = torch.rand(100,100)
B = torch.rand(100,100)

In [None]:
A@B

In [None]:
A_gpu = A.to(gpu)
B_gpu = B.to(gpu)

In [None]:
A_gpu@B_gpu

In [None]:
A@B_gpu  # this won't work: the operands live on different devices (raises a RuntimeError)

In [None]:
C_gpu = A_gpu@B_gpu
C = C_gpu.to(cpu)
C
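
A common way to keep the same notebook runnable with or without a GPU is to pick the device once and use it everywhere. The following is a minimal sketch of that pattern; the names `device`, `P`, `Q`, `R`, and `R_host` are chosen only for this example.

```python
import torch

# Fall back to the CPU when no CUDA device is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

P = torch.rand(100, 100, device=device)  # create directly on the device
Q = torch.rand(100, 100).to(device)      # or move an existing tensor

R = P @ Q             # both operands live on the same device
R_host = R.cpu()      # bring the result back for NumPy, plotting, ...
print(R.device, R_host.device)
```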

## Speedup from GPU
`%%timeit` is a Jupyter Notebook cell magic that measures the execution time of a code cell. When placed on the first line of a cell, it runs the code in the cell multiple times and reports the mean execution time together with its standard deviation.

In [None]:
%%timeit
A = torch.rand(3000,3000)
B = torch.rand(3000,3000)
for i in range(5):
    B=torch.mm(A,B)

In [None]:
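%%timeit
A = torch.rand(3000,3000, device = gpu)
B = torch.rand(3000,3000, device = gpu)
for i in range(5):
    B=torch.mm(A,B)
torch.cuda.synchronize()  # CUDA calls are asynchronous: wait for the GPU to finish before the timer stops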
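
Outside a notebook, the same comparison can be made by hand. CUDA kernels are launched asynchronously, so a timer can stop before the GPU has actually finished; the sketch below guards against that with `torch.cuda.synchronize()`. The helper `time_matmul` and its default sizes are hypothetical, chosen only for this illustration.

```python
import time
import torch

def time_matmul(device, n=500, repeats=5):
    """Return the wall-clock seconds for `repeats` n-by-n matrix products."""
    A = torch.rand(n, n, device=device)
    B = torch.rand(n, n, device=device)
    if device.type == "cuda":
        torch.cuda.synchronize()   # finish setup work before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        B = torch.mm(A, B)
    if device.type == "cuda":
        torch.cuda.synchronize()   # wait until the queued kernels have finished
    return time.perf_counter() - start

print("cpu :", time_matmul(torch.device("cpu")))
if torch.cuda.is_available():
    print("cuda:", time_matmul(torch.device("cuda")))
```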

## Automatic Differentiation

PyTorch builds a dynamic computation graph as operations are executed and uses it to compute the gradients of the parameters.
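
"Dynamic" means the graph is recorded while the Python code runs, so ordinary control flow can change which operations end up in it. A small sketch with a made-up function `f` (not part of the lecture code) illustrates this:

```python
import torch

def f(x):
    # Ordinary Python control flow decides which operations get recorded.
    if x.item() > 0:
        return x ** 2      # derivative: 2x
    return -3 * x          # derivative: -3

x_pos = torch.tensor(2.0, requires_grad=True)
x_neg = torch.tensor(-1.0, requires_grad=True)

y_pos = f(x_pos)
y_neg = f(x_neg)
print(y_pos.grad_fn, y_neg.grad_fn)   # each result remembers the operation that produced it

y_pos.backward()
y_neg.backward()
print(x_pos.grad, x_neg.grad)         # 2 * 2 = 4 for the first branch, -3 for the second
```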

In [None]:
x = torch.tensor([2.0])
w = torch.tensor([5.0], requires_grad = True)
b = torch.tensor([2.0], requires_grad = True)

In [None]:
y = w*x + b #12 = 5*2 + 2
y

Define an error (loss) for your function.

In [None]:
loss = torch.norm( y - 13)
loss

In [None]:
w.grad  # still None: backward() has not been called yet

Calling `x.backward()` on a tensor `x` makes PyTorch compute the gradients of `x` with respect to every tensor that was used to compute it and has the `requires_grad` flag set to `True`. The computed gradients are stored in the `.grad` property of those tensors.

In [None]:
loss.backward()

In [None]:
w.grad

In [None]:
b.grad
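
These values can be checked by hand. With a single element, the loss is $|wx + b - 13|$, so its derivative with respect to $w$ is $\operatorname{sign}(wx + b - 13)\,x$ and with respect to $b$ is $\operatorname{sign}(wx + b - 13)$. The sketch below repeats the computation under separate names (`x0`, `w0`, `b0`, all chosen for this example) so that it does not disturb the notebook's variables.

```python
import torch

x0 = torch.tensor([2.0])
w0 = torch.tensor([5.0], requires_grad=True)
b0 = torch.tensor([2.0], requires_grad=True)

y0 = w0 * x0 + b0               # 5*2 + 2 = 12
loss0 = torch.norm(y0 - 13)     # |12 - 13| = 1
loss0.backward()

# Hand-derived gradients of |w*x + b - 13|
sign = torch.sign(y0.detach() - 13)   # -1 here, because y0 < 13
expected_dw = sign * x0               # -2
expected_db = sign                    # -1

print(w0.grad, expected_dw)
print(b0.grad, expected_db)
```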

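We can tell PyTorch to skip gradient tracking, which helps to reduce memory usage and speed up computations. It is also needed for the in-place parameter updates below, because autograd does not allow in-place changes to a leaf tensor that requires gradients while tracking is on.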

In [None]:
# Operations inside torch.no_grad() are not recorded in the computation graph,
# which reduces memory consumption for computations that would otherwise be tracked.
with torch.no_grad(): # the updates to w and b themselves must not be tracked by autograd
    w -= 0.01 * w.grad
    b -= 0.3 * b.grad

In [None]:
w,b

In [None]:
w.grad, b.grad

In [None]:
# Reset the gradients: backward() adds to .grad, so without this the gradients
# would accumulate and lead to incorrect updates and slower convergence.
w.grad.zero_()
b.grad.zero_()

w.grad, b.grad
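
The accumulation behaviour is easy to see in isolation. In this minimal sketch (the names `w_demo` and `x_demo` are made up for the example), calling `backward()` twice without zeroing doubles the stored gradient:

```python
import torch

w_demo = torch.tensor([5.0], requires_grad=True)
x_demo = torch.tensor([2.0])

(w_demo * x_demo).sum().backward()
first = w_demo.grad.clone()      # d(w*x)/dw = x = 2

(w_demo * x_demo).sum().backward()    # no zeroing in between
second = w_demo.grad.clone()     # 2 + 2 = 4: the new gradient was added

w_demo.grad.zero_()
(w_demo * x_demo).sum().backward()
third = w_demo.grad.clone()      # back to 2 after resetting

print(first, second, third)
```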

In [None]:
y = w*x + b

In [None]:
y

In [None]:
loss = torch.norm( y - 13)
loss

In [None]:
loss.backward()
w.grad, b.grad

### Making it more compact

In [None]:
def model_fn(x,w,b):
    return w*x + b

In [None]:
def loss_fn(y,yt):
    return torch.norm(y-yt)

In [None]:
w = torch.tensor([5.0], requires_grad = True)
b = torch.tensor([2.0], requires_grad = True)

In [None]:
x = torch.tensor([2.0])
yt = torch.tensor([13.0])

In [None]:
y = model_fn(x,w,b)
loss = loss_fn(y,yt)
loss.backward()
with torch.no_grad():
    w -= 0.05 * w.grad
    b -= 0.05 * b.grad
w.grad.zero_()
b.grad.zero_()

print( f" w = {w}\n b = {b}\n y = {y}\n loss = {loss}")
# note that `loss` is the loss for the previous w, b values (before this update)

### Slightly more complicated problem

1. Forward propagation to get the prediction `y`
2. Get the loss with `loss_fn(y, yt)`, where `yt` is the target
3. Backward propagation to get the gradients
4. Update the parameters
5. Repeat steps 1 to 4 until convergence

In [None]:
import matplotlib.pyplot as plt

In [None]:
def model_fn(x,w,b):
    return w@x + b

In [None]:
def loss_fn(y,yt):
    return torch.norm(y-yt)

In [None]:
w = torch.rand((5,5), requires_grad = True)
b = torch.ones((5,1), requires_grad = True)

In [None]:
x = torch.randn(5,100)
yt = torch.randn(1,100)  # shape (1, 100): broadcasts against the (5, 100) model output in loss_fn
losses = []

In [None]:
for i in range(50):
  # 1. Forward
  y = model_fn(x,w,b)
  # 2. Get loss
  loss = loss_fn(y,yt)
  # 3. backward
  loss.backward()
  # 4. Update
  with torch.no_grad():
      w -= 0.05 * w.grad
      b -= 0.05 * b.grad
  w.grad.zero_()
  b.grad.zero_()

  losses+=[loss.item()]
  print( f"loss = {loss}")
# plot once, after the loop has finished
plt.plot(losses);

## Using Library functions

In [None]:
model = torch.nn.Sequential(
    # Build the network as a sequence of layers, each of the form F(WX + B)
    # First layer: ReLU(WX + B)
      # ReLU is important because it introduces a non-linear transformation
    torch.nn.Linear(5, 5),
    torch.nn.ReLU(),

    # Output layer
    torch.nn.Linear(5, 1),
)

In [None]:
list(model.parameters())
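
To make the parameter list easier to read, the shapes can be printed by name and counted. The sketch below rebuilds the same architecture under a separate name, `net` (chosen for this example so the notebook's `model` is left untouched).

```python
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(5, 5),   # weight 5x5 + bias 5 = 30 parameters
    torch.nn.ReLU(),         # no parameters
    torch.nn.Linear(5, 1),   # weight 1x5 + bias 1 = 6 parameters
)

for name, p in net.named_parameters():
    print(name, tuple(p.shape))

n_params = sum(p.numel() for p in net.parameters())
print(n_params)                  # 30 + 6 = 36

out = net(torch.randn(100, 5))   # a batch of 100 samples with 5 features each
print(out.shape)
```

Note that `torch.nn.Linear` expects the batch along the first dimension, so the input has shape `(100, 5)` rather than `(5, 100)`.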

In [None]:
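# squared-error loss; with reduction='sum' this is sum((y_pred - y_true)^2),
# whereas the default reduction='mean' gives the MSE = (1/n) * sum((y_pred - y_true)^2)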
loss_fn = torch.nn.MSELoss(reduction='sum')
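
The effect of the `reduction` argument can be seen on a tiny hand-checkable example. The tensors `pred` and `target` below are made up for this illustration.

```python
import torch

pred = torch.tensor([[1.0], [2.0], [4.0]])
target = torch.tensor([[0.0], [2.0], [2.0]])   # squared errors: 1, 0, 4

sse = torch.nn.MSELoss(reduction="sum")(pred, target)    # 1 + 0 + 4 = 5
mse = torch.nn.MSELoss(reduction="mean")(pred, target)   # (1 + 0 + 4) / 3

print(sse.item(), mse.item())
```

The two differ only by the constant factor `n`, which matters in practice because it rescales the gradients and therefore the effective learning rate.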

In [None]:
x = torch.randn(100,5)
yt = torch.randn(100,1)
losses = []

Using the optim package

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.03)

`optimizer.step()` updates the model parameters using the gradients computed during backpropagation. The optimizer does not compute the gradients itself; `loss.backward()` does that, and `optimizer.step()` then applies the update rule (here Adam) to each parameter.

`optimizer.zero_grad()` resets the gradients of all the model parameters before the gradients for the next batch of data are computed. If we don't clear them, the gradients will accumulate across batches, leading to incorrect updates and slower convergence.
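
To see what `optimizer.step()` does, it helps to compare it with the manual update used earlier. The sketch below swaps in plain `torch.optim.SGD` instead of Adam, because its update rule is simply `p -= lr * p.grad`; the names `p_manual`, `p_optim`, and `demo_loss` are made up for this example.

```python
import torch

p_manual = torch.tensor([1.0, -2.0], requires_grad=True)
p_optim = torch.tensor([1.0, -2.0], requires_grad=True)
sgd = torch.optim.SGD([p_optim], lr=0.1)

def demo_loss(p):
    return (p ** 2).sum()   # gradient is 2p

# Manual update, as in the earlier cells
demo_loss(p_manual).backward()
with torch.no_grad():
    p_manual -= 0.1 * p_manual.grad
p_manual.grad.zero_()

# Same update through the optimizer
demo_loss(p_optim).backward()
sgd.step()         # p <- p - lr * p.grad
sgd.zero_grad()    # clear the gradients for the next iteration

print(p_manual, p_optim)   # both are [1, -2] - 0.1 * [2, -4] = [0.8, -1.6]
```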

In [None]:
for i in range(1000):
    y = model(x)
    loss = loss_fn(y,yt)
    loss.backward()

    # take an optimization step, i.e. update the parameters
    optimizer.step()
    optimizer.zero_grad()

    losses+=[loss.item()]
    print( f"loss = {loss}")
plt.plot(losses);

## MNIST Example

In [None]:
from torchvision.datasets import MNIST

In [None]:
data = MNIST(".",download=True)

In [None]:
len(data)

In [None]:
import numpy as np
img,y = data[np.random.randint(1,60000)]
print(y)
img

In [None]:
img,y = data[2]
img

In [None]:
print(data.data[2].shape)  # `train_data` is a deprecated alias of `data`

In [None]:
print(data.targets[2])  # `train_labels` is a deprecated alias of `targets`

### MNIST Training

In [None]:
model = torch.nn.Sequential( # 28*28 = 784
    torch.nn.Linear(784,  100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10),
    # No LogSoftmax here: CrossEntropyLoss below applies log-softmax itself
    # and expects raw (unnormalized) logits.
)

In [None]:
loss_fn = torch.nn.CrossEntropyLoss()
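
`torch.nn.CrossEntropyLoss` takes raw logits because it applies log-softmax internally before the negative log-likelihood. The sketch below checks this equivalence on made-up random scores; the names `logits`, `labels`, and the sizes are assumptions for illustration only.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(4, 10)            # 4 samples, 10 classes, raw scores
labels = torch.tensor([3, 0, 9, 1])    # class indices

ce = torch.nn.CrossEntropyLoss()(logits, labels)

log_probs = torch.nn.LogSoftmax(dim=1)(logits)
nll = torch.nn.NLLLoss()(log_probs, labels)

# Log-softmax is idempotent, so applying it twice changes nothing.
ce_twice = torch.nn.CrossEntropyLoss()(log_probs, labels)

print(ce.item(), nll.item(), ce_twice.item())
```

Feeding log-probabilities into `CrossEntropyLoss` therefore gives the same value, which means an extra `LogSoftmax` layer in front of it is redundant rather than harmful.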

In [None]:
sample = np.random.choice(range(len(data.data)),1000)  # 1000 random image indices
x = data.data[sample].reshape(1000,-1).float()/255  # flatten to 784 features, scale to [0, 1]
yt = data.targets[sample]

In [None]:
x.shape,yt.shape

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.03)
losses = []

In [None]:
for i in range(100):

    sample = np.random.choice(range(len(data.data)),1000)  # draw a fresh random mini-batch
    x = data.data[sample].reshape(1000,-1).float()/255
    yt = data.targets[sample]

    y = model(x)

    # input (Tensor) – Predicted unnormalized logits
    # target (Tensor) – Ground truth class indices or class probabilities
    loss = loss_fn(y,yt)

    loss.backward()

    optimizer.step()
    optimizer.zero_grad()

    losses+=[loss.item()]
    #print( f"loss = {loss}")
plt.plot(losses);

In [None]:
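# Note: these are the last 1000 training images, some of which may have been sampled
# during training; for a true held-out set, load MNIST(".", train=False, download=True).
x_test = data.data[-1000:].reshape(1000,-1).float()/255
y_test = data.targets[-1000:]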

In [None]:
with torch.no_grad():
    y_pred = model(x_test)

In [None]:
y_test[4]

In [None]:
y_pred.argmax(dim=1)[4]

In [None]:
print("Accuracy = ", (y_pred.argmax(dim=1) == y_test).sum().float().item()/1000.0)

## Course Conclusion

1. By now you should have a sufficient introduction to the various ways one can use Python for scientific computing. The best way to learn more is to start using Python for whatever project you are working on. Only practice will make you comfortable with it.

Recommended Project Source: Kaggle

Recommended ML/DL Courses: CS229, 230, 231N, 224 series, 238, 246

2. I would appreciate you taking the time to submit the course feedback. It means a lot to me and helps improve this course in the future.

