<a href="https://colab.research.google.com/github/marianqian/Intro-to-ML-and-DL-Using-fast.ai/blob/master/notebooks/Lesson_3_PyTorch_and_NumPy_Neural_Network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to the AI Academy! This is the third lesson, which introduces PyTorch, a Python library for creating neural networks. The essentials of PyTorch are very similar to NumPy: PyTorch uses tensors, which are very similar to NumPy arrays. 

You should have already read and understood all the content from the Lesson 2 document, Deep Learning. If you would like a review on neural network structure and how neural networks are trained, watch the following videos. 

*   https://www.youtube.com/watch?v=bxe2T-V8XRs 
*   https://www.youtube.com/watch?v=aircAruvnKk

NOTE: Educational use and distribution is permitted, but credit and attribution to AIM Academy is required. 




#Learning Objectives

* Learn how tensors relate to NumPy arrays
* Learn about `torch.nn` modules and the components that make up the code for training a neural network
* Learn how autograd and gradient descent are used 




#What is the difference between using PyTorch and NumPy?
PyTorch uses tensors, which are essentially arrays that can have any number of dimensions. PyTorch can also run on GPUs (graphics processing units), which allow the neural network to train faster. The GPU is often not in your own computer; it can sit in a remote machine that runs your code and that you connect to over the internet.

PyTorch includes automatic differentiation, which gives us the gradients for updating the weights and biases in the neural network during backpropagation. Automatic differentiation makes finding what the gradients are much easier; if we used NumPy, we would have to define the derivatives (slopes of functions) manually, but PyTorch automatically gives us the gradients by calling a method. 
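
As a minimal sketch of that difference, the example below differentiates a one-weight function both by hand and with PyTorch. The function `f` and the starting value of `w` are made up for illustration and are not part of the lesson's network.

```python
import torch

# Hypothetical one-weight example: f(w) = (3w - 6)^2
w = torch.tensor(1.0, requires_grad=True)
f = (3 * w - 6) ** 2

# Manual derivative (what we would have to write ourselves with NumPy):
# df/dw = 2 * (3w - 6) * 3, evaluated at w = 1
manual_grad = 2 * (3 * 1.0 - 6) * 3

# Automatic differentiation: PyTorch fills in w.grad for us
f.backward()
auto_grad = w.grad.item()

print(manual_grad, auto_grad)
```

Both numbers should agree; the manual line has to be rewritten whenever `f` changes, while the `backward()` call does not.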

Watch this [Khan Academy video](https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/gradient-and-directional-derivatives/v/gradient) to learn more about gradients. If you do not know how to take derivatives, explore the videos located on the Introduction to Derivatives course on Khan Academy linked [here](https://www.khanacademy.org/math/calculus-all-old/taking-derivatives-calc). The following Khan Academy links are also useful to understanding derivatives and gradients:
* [Multivariable functions](https://www.khanacademy.org/math/multivariable-calculus/thinking-about-multivariable-function/ways-to-represent-multivariable-functions/a/multivariable-functions)
* [Introduction to partial derivatives](https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/introduction-to-partial-derivatives)
* [Gradients](https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/the-gradient) 

In [0]:
import torch

How tensors are used and how to convert between NumPy arrays and tensors 
Notice how creating NumPy arrays and creating PyTorch tensors behave similarly. When `t` and `tt` are added together, the smaller tensor `tt` is broadcast across every row of `t`. 

In [0]:
t = torch.zeros(3, 4)
print("Tensor t: \n", t)

tt = torch.tensor([3, 6, 7, 8])
print("Tensor tt: \n", tt)

print("Tensor t + tt: \n", t + tt)

Tensor t: 
 tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])
Tensor tt: 
 tensor([3, 6, 7, 8])
Tensor t + tt: 
 tensor([[3., 6., 7., 8.],
        [3., 6., 7., 8.],
        [3., 6., 7., 8.]])
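
For comparison, here is a sketch of the same addition written in NumPy next to the PyTorch version. The names `a` and `b` are chosen only for this example. One small difference to notice is that `np.zeros` takes the shape as a tuple.

```python
import numpy as np
import torch

# NumPy version of the cell above
a = np.zeros((3, 4))
b = np.array([3, 6, 7, 8])
np_sum = a + b  # b is broadcast across each of the 3 rows

# PyTorch version
t = torch.zeros(3, 4)
tt = torch.tensor([3, 6, 7, 8])
torch_sum = t + tt

# Both results have shape (3, 4) and hold the same values
same = np.array_equal(np_sum, torch_sum.numpy())
print(np_sum.shape, tuple(torch_sum.shape), same)
```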


Here, we are changing the PyTorch tensor to a NumPy array. 

In [0]:
n = t.numpy()
print("Numpy array n: \n", n)

Numpy array n: 
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
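
One detail worth knowing: for a tensor stored on the CPU, `.numpy()` does not copy the data, so the tensor and the array share the same memory. The sketch below illustrates this; the name `n_copy` is introduced only for this example.

```python
import torch

t = torch.zeros(3, 4)
n = t.numpy()      # n shares memory with t (CPU tensors only)

t.add_(1)          # in-place update of the tensor
print(n)           # the NumPy array reflects the change as well

n_copy = t.numpy().copy()  # .copy() gives an independent array
t.add_(1)
print(n[0, 0], n_copy[0, 0])
```

If you need the array to stay unchanged while the tensor keeps training, taking a copy as in the last lines is one way to do it.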


Here, we are creating a NumPy array and changing it to a PyTorch tensor. Notice how the tensor keeps the array's data type: NumPy creates 64-bit floats by default, so the tensor's `dtype` is `torch.float64`. 

In [0]:
import numpy as np
nn = np.ones(3)
print("Numpy array nn: \n", nn)

nt = torch.from_numpy(nn)
print("Tensor nt: \n", nt)

Numpy array nn: 
 [1. 1. 1.]
Tensor nt: 
 tensor([1., 1., 1.], dtype=torch.float64)
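
The sketch below shows how the data type carries over from NumPy, and one way to convert to 32-bit floats, which is the default type PyTorch uses for layer weights. The array names are chosen only for this example.

```python
import numpy as np
import torch

float_array = np.ones(3)                          # NumPy default: float64
int_array = np.array([1, 2, 3], dtype=np.int64)

from_float = torch.from_numpy(float_array)
from_int = torch.from_numpy(int_array)
print(from_float.dtype, from_int.dtype)

# Cast to float32 when the tensor will be fed to float32 layers
as_float32 = from_float.float()
print(as_float32.dtype)
```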


#Running neural networks on GPUs
We mentioned before that neural networks are often run on GPUs, which speed up the training process. With PyTorch, using the GPU is simple, as shown in the example below. CUDA is a platform developed by NVIDIA that allows for parallel computing on GPUs. Because CUDA is only available on Google Colab when a GPU runtime is selected, the code below is shown as text only; on a machine without a GPU, the body of the `if` statement is simply skipped. To learn more about CUDA and its uses, go to this [link](https://www.infoworld.com/article/3299703/what-is-cuda-parallel-programming-for-gpus.html). 


**The code below is fully credited to the PyTorch team and can be found at the documentation site, pytorch.org. The code below can be directly found at https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py.**
```
#CODE BELOW IS FULLY CREDITED TO PYTORCH (pytorch.org)
#USED ONLY FOR EDUCATIONAL PURPOSES UNDER FAIR USE

x = torch.empty(5, 3)

# let us run this cell only if CUDA is available
# We will use ``torch.device`` objects to move tensors in and out of GPU
if torch.cuda.is_available():
    device = torch.device("cuda")          # creates a CUDA device object
    y = torch.ones_like(x, device=device)  # directly create a tensor called y on GPU, where device specifies the CUDA GPU 
    x = x.to(device)                       # or just use strings ``.to("cuda")`` to move an already created tensor to the GPU
```
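
A common pattern, sketched here as one possible approach rather than the only one, is to choose the device once and fall back to the CPU when no GPU is present. That way the same code runs in both situations.

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.ones(5, 3, device=device)   # create the tensor on the chosen device
y = (x * 2).to("cpu")                 # move the result back to the CPU

print(device, y.shape)
```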



#NumPy Implementation of neural network
Here is a NumPy implementation of a neural network with only two layers. Because NumPy does not know how to compute gradients, the gradients (the derivatives of the loss with respect to the weights and biases) must be calculated manually: we have to write that code ourselves. 

Below, a two-layer network is created, with randomly initialized weights for both the first layer and the output layer. To keep this network as simple as possible, biases are not included. 

The network takes in 64 (`N`) different pairs of x and y data. Each x has 1000 (`D_in`) features and each y has 10 (`D_out`) features. In other words, the neural network takes 1000 values as input in order to make **ONE PREDICTION**, which consists of 10 output values. The neural network makes 64 (`N`) of these predictions at once. 


**The code below is fully credited to the PyTorch team and can be found at the documentation site, pytorch.org. The code below can be directly found at https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#warm-up-numpy.**

In [0]:
#CODE BELOW IS FULLY CREDITED TO PYTORCH (pytorch.org)
#USED ONLY FOR EDUCATIONAL PURPOSES UNDER FAIR USE

import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N = 64
D_in = 1000
H = 100
D_out = 10

* Here, the input array `x` is created with dimensions of 64 by 1000 (`N x D_in`), and the array of target values `y` is created with dimensions of 64 by 10 (`N x D_out`). 

* Two layers are created, `w1` and `w2`, and we will be changing the values of these arrays since these arrays are the actual **weights**. 

* The first layer, or weight array/matrix `w1`, has dimensions of 1000 by 100 (`D_in x H`). The hidden size `H` (100 here) is a free choice, because it only sets the size of the in-between result that sits between `w1` and `w2`. The output of `w1` will be the input for `w2`, so the weight array/matrix `w2` has dimensions 100 by 10 (`H x D_out`). 

* The results of `w2` are the predictions of the neural network, 10 results for each prediction. Notice how it matches the dimensions of the target array `y`, which for one prediction, has 10 features. 

**Final network:**

* **input** `x` (64 x 1000) -> **weight** `w1` (1000 x 100) -> **weight** `w2` (100 x 10) -> **predictions** (64 x 10)

These arrays will be **matrix multiplied** together. Watch this [video](https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/dot-cross-products/v/vector-dot-product-and-vector-length) about the dot product, which is the building block of matrix multiplication: each entry of a matrix product is the dot product of a row of the first matrix and a column of the second.
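
To see how the shapes fit together, the sketch below multiplies arrays with the lesson's dimensions and prints the resulting shapes. The ReLU step is left out here on purpose, since it does not change any shapes.

```python
import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10
x = np.random.randn(N, D_in)
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

h = x.dot(w1)        # (64, 1000) x (1000, 100) -> (64, 100)
y_pred = h.dot(w2)   # (64, 100)  x (100, 10)   -> (64, 10)
print(h.shape, y_pred.shape)

# Each entry of h is the dot product of one row of x and one column of w1
entry = np.dot(x[0], w1[:, 0])
print(np.isclose(h[0, 0], entry))
```

The inner dimensions must match (1000 with 1000, 100 with 100), and the outer dimensions give the shape of the result.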



In [0]:
#CODE BELOW IS FULLY CREDITED TO PYTORCH (pytorch.org)
#USED ONLY FOR EDUCATIONAL PURPOSES UNDER FAIR USE

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

Here, the neural network will train for 500 iterations, as seen in the for-loop. 

The `learning_rate`, which controls how much the weight matrices `w1` and `w2` change at each step, is set to 1e-6, or 1 × 10^-6. 

The forward pass is when the neural network uses its current weights to make a prediction. After the predictions are made, the neural network calculates its loss (how wrong its predictions are) and changes its weights through backpropagation. ***For a more in-depth review, go back and read over Lesson 2's Deep Learning document***. 

* Remember that in NumPy, the `dot()` method performs a matrix multiplication.
* After `x` is matrix multiplied with `w1`, the in-between matrix `h` is created. 
* The ReLU activation function is applied to `h`, which turns any negative numbers in `h` into 0 and leaves positive numbers unchanged. 
* `np.maximum(h, 0)` is a way to apply the ReLU activation function: for each position, it takes the larger of the element of `h` at that position and 0. If that element is negative, the corresponding element of the new `h_relu` array is 0, because 0 is greater than every negative number.
* Finally, the `h_relu` array is matrix multiplied with the second weight matrix `w2`, resulting in `y_pred`, the final predictions. 


```
#CODE BELOW IS FULLY CREDITED TO PYTORCH (pytorch.org)
#USED ONLY FOR EDUCATIONAL PURPOSES UNDER FAIR USE

    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
```
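
As a small illustration of the ReLU step, the sketch below applies `np.maximum` to a made-up 2 x 2 array instead of the full 64 x 100 matrix.

```python
import numpy as np

# Made-up values: two negative entries and two positive entries
h = np.array([[-2.0, 0.5],
              [3.0, -0.1]])

h_relu = np.maximum(h, 0)   # negatives become 0, positives are kept
print(h_relu)
```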



Now that we have `y_pred`, the neural network computes the loss. The loss used here is the sum of squared errors, which is Mean Squared Error without the final division by the number of values. For each prediction in `y_pred`, we measure how far away, or how "wrong", it is compared to the actual value in `y` by taking `y_pred - y` and squaring the difference. The total loss is all of the squared differences added together. 


```
#CODE BELOW IS FULLY CREDITED TO PYTORCH (pytorch.org)
#USED ONLY FOR EDUCATIONAL PURPOSES UNDER FAIR USE

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    if t % 100 == 99:
        print("Iteration: {}, Loss: {}".format(t, loss.item()))
```
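
Here is the same loss calculation on a tiny made-up example, small enough to check by hand. The values in `y_pred` and `y` are invented for illustration.

```python
import numpy as np

y_pred = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
y = np.array([[1.0, 1.0],
              [1.0, 1.0]])

squared_errors = np.square(y_pred - y)   # differences 0, 1, 2, 3 squared
sum_loss = squared_errors.sum()          # 0 + 1 + 4 + 9
mean_loss = squared_errors.mean()        # the sum divided by 4 values
print(sum_loss, mean_loss)
```

The lesson's code uses the summed version; the mean differs only by a constant factor.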



Finally, the weight matrices are updated during backpropagation. We won't go into too much detail about how the gradients are calculated manually, but remember that a gradient plays the same role as the slope of a line: it tells us how the loss changes when a weight changes, and therefore in which direction to move each weight to decrease the loss. The final gradients for `w1` and `w2` are `grad_w1` and `grad_w2`. 

```
#CODE BELOW IS FULLY CREDITED TO PYTORCH (pytorch.org)
#USED ONLY FOR EDUCATIONAL PURPOSES UNDER FAIR USE

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
```
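
One way to build confidence in hand-written gradients is to compare them with a numerical slope: nudge one weight up and down by a tiny amount and see how the loss changes. The sketch below does this on a much smaller network than the lesson's. The sizes, the random seed, and the helper names `loss_of` and `numeric_grad` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5))    # tiny sizes chosen only for the check
y = rng.standard_normal((4, 2))
w1 = rng.standard_normal((5, 3))
w2 = rng.standard_normal((3, 2))

def loss_of(w1, w2):
    """Forward pass and summed squared error, as in the lesson."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

# Analytic gradients, using the same formulas as the lesson
h = x.dot(w1)
h_relu = np.maximum(h, 0)
y_pred = h_relu.dot(w2)
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

def numeric_grad(which, i, j, eps=1e-5):
    """Central-difference slope of the loss for one weight entry."""
    plus = [w1.copy(), w2.copy()]
    minus = [w1.copy(), w2.copy()]
    plus[which][i, j] += eps
    minus[which][i, j] -= eps
    return (loss_of(*plus) - loss_of(*minus)) / (2 * eps)

w1_ok = np.isclose(numeric_grad(0, 0, 0), grad_w1[0, 0], rtol=1e-4, atol=1e-6)
w2_ok = np.isclose(numeric_grad(1, 0, 0), grad_w2[0, 0], rtol=1e-4, atol=1e-6)
print(w1_ok, w2_ok)
```

This only checks one entry of each weight matrix, which is enough for a quick sanity check but not a full proof.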

`w1` and `w2` are updated by **subtracting the learning rate * gradient.** Remember that the learning rate controls how much the weight matrices `w1` and `w2` change. It's important that the learning rate is neither too large (the loss can grow instead of shrinking) nor too small (training becomes very slow). 

```
#CODE BELOW IS FULLY CREDITED TO PYTORCH (pytorch.org)
#USED ONLY FOR EDUCATIONAL PURPOSES UNDER FAIR USE

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
```
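
To get a feel for the effect of the learning rate, the sketch below runs the same update rule on a deliberately simple one-weight problem, f(w) = w², whose minimum is at w = 0. The function, the helper name `final_distance`, and the three learning rates are made up for illustration and are not tuned for the lesson's network.

```python
def final_distance(learning_rate, steps=20):
    """Run gradient descent on f(w) = w**2 from w = 1 and return |w|."""
    w = 1.0
    for _ in range(steps):
        grad = 2 * w                 # derivative of w**2
        w -= learning_rate * grad    # same update rule as for w1 and w2
    return abs(w)

too_small = final_distance(1e-4)   # barely moves away from the start
reasonable = final_distance(0.1)   # shrinks toward the minimum at 0
too_large = final_distance(1.1)    # overshoots and grows at every step
print(too_small, reasonable, too_large)
```
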
Try running the code below. Notice that as the iterations continue, the loss decreases to almost 0. That means the model is lowering its loss by updating the weights with the gradients. The loss is printed every 100 iterations.


In [0]:
#CODE BELOW IS FULLY CREDITED TO PYTORCH (pytorch.org)
#USED ONLY FOR EDUCATIONAL PURPOSES UNDER FAIR USE

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    if t % 100 == 99:
        print("Iteration: {}, Loss: {}".format(t, loss.item()))

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

#PyTorch implementation of neural network
Below, a PyTorch implementation of a neural network is used. Notice how tensors are used instead of NumPy arrays. Tensors are arrays that can have any number of dimensions, and they can also calculate gradients without requiring us to code the gradient calculation manually.

Exactly as in the NumPy neural network above, the input array `x` and target value array `y` are created with the following dimensions. 

**input array** `x` (64 x 1000) `N x D_in`

**target value array** `y` (64 x 10) `N x D_out`

**The code below is fully credited to the PyTorch team and can be found at the documentation site, pytorch.org. The code below can be directly found at https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_optim.html#sphx-glr-beginner-examples-nn-two-layer-net-optim-py.**

In [0]:
#CODE BELOW IS FULLY CREDITED TO PYTORCH (pytorch.org)
#USED ONLY FOR EDUCATIONAL PURPOSES UNDER FAIR USE

# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N = 64
D_in = 1000
H = 100
D_out = 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)


* Here, the network is built from layer objects rather than plain arrays. The `nn` package is included in the PyTorch library and defines layers as `Modules`. 

* The model contains two linear layers, each of which takes an input Tensor and outputs a Tensor after applying a matrix multiplication with its own weight matrix and adding its own bias. Notice how the dimensions passed to the two `nn.Linear` layers are the same as those of `w1` and `w2` in the NumPy neural network. In between the two `nn.Linear` layers, an `nn.ReLU` activation function is applied to the in-between output matrix. `nn.Sequential` is the container that applies these layers in order. 

* The `nn` package also includes how to compute the loss. The `loss_fn` function, defined by `nn.MSELoss`, will allow us to pass in the final predicted y-values and the actual target y-values and give us the total loss. 
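
As a small check of what `nn.MSELoss` computes, the sketch below evaluates it on made-up 2 x 2 tensors with both `reduction='sum'` (used in this lesson) and `reduction='mean'`. The tensor values are invented for illustration.

```python
import torch

y_pred = torch.tensor([[1.0, 2.0],
                       [3.0, 4.0]])
y = torch.ones(2, 2)

# Squared differences are 0, 1, 4 and 9
sum_loss = torch.nn.MSELoss(reduction='sum')(y_pred, y)    # adds them up
mean_loss = torch.nn.MSELoss(reduction='mean')(y_pred, y)  # averages over 4 elements
print(sum_loss.item(), mean_loss.item())
```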

In [0]:
#CODE BELOW IS FULLY CREDITED TO PYTORCH (pytorch.org)
#USED ONLY FOR EDUCATIONAL PURPOSES UNDER FAIR USE

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')
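
To see what the model actually holds, the sketch below rebuilds the same `nn.Sequential` and lists the shape of every learnable parameter. Two things stand out compared with the NumPy version: each `nn.Linear` layer also has a bias, and PyTorch stores each weight matrix as (outputs x inputs), which is the transpose of `w1` and `w2`.

```python
import torch

D_in, H, D_out = 1000, 100, 10
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# named_parameters() yields a name and a tensor for every learnable parameter
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
print(shapes)
```

The names `0` and `2` are the positions of the two linear layers inside the sequence; position `1` is the ReLU, which has no parameters.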

PyTorch also includes an `optim` package, which updates the weights of the model from their gradients so that we do not have to write the update ourselves. **Optimizers are different ways to "optimize", that is, to make the algorithm for decreasing the loss as efficient, or as fast, as possible.** Read this link for more information about [optimizers](https://algorithmia.com/blog/introduction-to-optimizers). 

The `optim.SGD` optimizer applies **stochastic gradient descent**: each update uses the gradient of the loss on a batch of data. Here the batch is all 64 elements passed into the neural network, and because `loss_fn` uses `reduction='sum'`, the total loss is the sum of the squared errors over those 64 elements.

In [0]:
#CODE BELOW IS FULLY CREDITED TO PYTORCH (pytorch.org)
#USED ONLY FOR EDUCATIONAL PURPOSES UNDER FAIR USE

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use SGD; the optim package contains many other
# optimization algorithms. The first argument to the SGD constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
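
For plain SGD (no momentum or other options), one optimizer step is the same update we wrote by hand in NumPy: each parameter minus the learning rate times its gradient. The sketch below checks this on a tiny made-up layer; the sizes, seed, and learning rate are assumptions for the example only.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)

x = torch.randn(4, 3)
y = torch.randn(4, 1)
loss = torch.nn.MSELoss(reduction='sum')(layer(x), y)

optimizer.zero_grad()
loss.backward()

# What the manual update would produce: parameter - learning_rate * gradient
expected = [p.detach().clone() - 0.1 * p.grad for p in layer.parameters()]

optimizer.step()

matches = all(
    torch.allclose(p.detach(), e) for p, e in zip(layer.parameters(), expected)
)
print(matches)
```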


Similar to the previous NumPy neural network, this loop is where the model trains for 500 iterations. 

* `y_pred` is calculated by calling the model that we created in previous code blocks, and the input array `x` is passed into the model. 

* Instead of calculating the loss manually, `y_pred` and `y` are passed into `loss_fn` which was also created in the code blocks above. (The loss is printed for every 100 iterations.)

* When `optimizer.zero_grad()` is called, the **gradients are reset.** 
  * By default, PyTorch adds each newly computed gradient to whatever is already stored. The gradients must be **based only on the current iteration**, not summed over several iterations (which is what happens if `optimizer.zero_grad()` is not called). 

* When `loss.backward()` is called, PyTorch **calculates the gradient of the loss with respect to every Tensor that was used to compute it and has `requires_grad=True`.** 
  * The `autograd` package is what PyTorch uses to calculate these gradients easily through automatic differentiation. 
  * Notice how this only takes 1 line, as compared to 6 lines when using the NumPy library. 

* When `loss.backward()` is called, the gradients are **calculated by working backward through the calculations that the weight matrices inside the model applied during the forward pass.** 
  * All the parameter Tensors inside the model have `requires_grad=True`, a setting that tells PyTorch to track them so that gradients can be calculated automatically. 

* Where are the gradients stored? 
  * The weights themselves can be accessed through the model by calling `model.parameters()`, and the gradient of each parameter is stored in its `.grad` attribute. 
  * To update the model by hand, we would multiply the learning rate by the gradients and subtract the result from the parameters. 
  * Because we used an optimizer, **all of this is done by calling `optimizer.step()`, which removes the need to manually update the weights.** 
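
The sketch below shows why the reset matters, using a single made-up tensor `w` instead of a full model. Here the gradient is cleared by hand with `w.grad.zero_()`, which plays the role that `optimizer.zero_grad()` plays in the training loop.

```python
import torch

w = torch.ones(2, requires_grad=True)

(3 * w).sum().backward()
first = w.grad.clone()         # gradient of 3*w1 + 3*w2 is [3, 3]

(3 * w).sum().backward()       # no reset: the new gradient is added on top
accumulated = w.grad.clone()   # [6, 6]

w.grad.zero_()                 # reset by hand before the next backward pass
(3 * w).sum().backward()
after_reset = w.grad.clone()   # back to [3, 3]

print(first, accumulated, after_reset)
```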

In [0]:
#CODE BELOW IS FULLY CREDITED TO PYTORCH (pytorch.org)
#USED ONLY FOR EDUCATIONAL PURPOSES UNDER FAIR USE

for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print("Iteration: {}, Loss: {}".format(t, loss.item()))

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers( i.e, not overwritten) whenever .backward()
    # is called. Checkout docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call param.grad for param in model.parameters() will be Tensors holding the gradient
    # of the loss with respect to the model parameters. 
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

Iteration: 99, Loss: 2.3015968799591064
Iteration: 199, Loss: 0.03412647545337677
Iteration: 299, Loss: 0.001003907178528607
Iteration: 399, Loss: 4.040964267915115e-05
Iteration: 499, Loss: 2.016114422076498e-06


Notice that if you keep running the above code cell, the same model (the same weights for each layer) is kept, so the loss continues to decrease. 

Once you have gone through this notebook once, go through it again to make sure you understand why the code works the way it does. If you have any questions, do not be afraid to ask. 

#Using .backward()

In [0]:
import torch
x = torch.ones(2, 2, requires_grad=True)
print(x)

y = x * x
z = y.mean()
z.backward() # computes the gradients 
print(x.grad) # where the gradient is being stored 

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
tensor([[0.5000, 0.5000],
        [0.5000, 0.5000]])
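
Where does 0.5 come from? `z` is the mean of the four squared entries, so z = (x₁² + x₂² + x₃² + x₄²) / 4, and the derivative with respect to each entry is 2·xᵢ / 4 = xᵢ / 2. With every entry equal to 1, that is 0.5. The sketch below compares the hand-derived formula with what `backward()` stores; the name `manual` is introduced only for this example.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
z = (x * x).mean()      # z = (1/4) * sum of the squared entries
z.backward()

# By hand: dz/dx_i = 2 * x_i / 4 = x_i / 2
manual = x.detach() / 2
print(torch.equal(x.grad, manual))
```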
