# Introduction
[PyTorch] is one of the two most popular deep learning frameworks in Python, alongside TensorFlow. Here are some key points when comparing the two:
- In terms of abstraction level, PyTorch falls somewhere between TensorFlow and Keras. It has no fit-and-predict interface, so the training loop must be written by hand.
- PyTorch is preferred by the research community because it is easier to customize; newly published architectures are often written in PyTorch.
- TensorFlow/Keras is often considered better suited to production thanks to its high-level interface and large deployment ecosystem.

[PyTorch]: https://github.com/pytorch/pytorch

# 1. Data manipulation

## 1.1. Computation

In [1]:
import numpy as np
import pandas as pd
import janitor
import torch
import torch.nn as nn
import torch.nn.functional as F

### Tensor
In PyTorch, we work most of the time with
<code style='font-size:13px'><a href=https://pytorch.org/docs/stable/tensors.html>Tensor</a></code>
whose operations are very much like those of NumPy arrays. Being central to PyTorch, tensor operations are provided directly in the [mother package]. One thing to note is that many operations, matrix multiplication for example, require their tensors to share the same data type. When a data type error occurs, call the <code style='font-size:13px'>double()</code> method to convert a tensor to 64-bit float (or <code style='font-size:13px'>float()</code> for 32-bit float).

[mother package]: https://pytorch.org/docs/stable/torch.html
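
As a minimal sketch of the data type issue described above, a matrix product between a 32-bit and a 64-bit tensor is expected to raise a `RuntimeError`, and converting one operand resolves it. The tensor `b` and the `try`/`except` wrapper are illustrative assumptions, not part of the original notebook.

```python
import torch

a = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])              # float32 by default
b = torch.ones(3, 2, dtype=torch.float64)     # hypothetical float64 operand

# Mixing dtypes in a matrix product is expected to fail
try:
    torch.matmul(a, b)
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

# Converting `a` to float64 makes the two operands compatible
c = torch.matmul(a.double(), b)
print(mismatch_raised, c.dtype)
```

Converting with `a.double()` returns a new tensor and leaves `a` itself unchanged.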

In [40]:
a = torch.tensor([
    [1., 2., 3.],
    [4., 5., 6.]
])

In [41]:
torch.rand_like(a).double()

tensor([[0.1039, 0.6032, 0.2776],
        [0.3127, 0.8976, 0.3040]], dtype=torch.float64)

### Autograd
PyTorch provides automatic differentiation via the sub-module
<code style='font-size:13px'><a href=https://pytorch.org/docs/stable/autograd.html>autograd</a></code>,
with functions implemented as
<code style='font-size:13px'>Tensor</code>
methods. Being a mathematical module, it distinguishes two types of tensor, *constant* and *variable*, indicated via the flag <code style='font-size:13px'>requires_grad</code>. All tensors are constants by default, and become variables when this flag is enabled. Autograd is designed to work on a computational graph, where:
- The *forward* pass takes tensors as inputs and, for a plain <code style='font-size:13px'>backward()</code> call, must end in a scalar output. This pass is done using normal tensor operations.
- During the *backward* pass (triggered by calling the <code style='font-size:13px'>backward()</code> method on the output), PyTorch computes and accumulates partial derivatives for the leaf nodes. These gradients can be accessed via the <code style='font-size:13px'>grad</code> attribute of the input variables.

In [2]:
def f(x1, x2):
    y = (x1 - 0.3)**2 + (x2 - 0.7)**2 + 1
    return y.mean()

In [3]:
x1 = torch.rand(8, requires_grad=True)
x2 = torch.rand(8, requires_grad=True)

In [4]:
y = f(x1, x2)
y.backward()
x1.grad

tensor([ 0.0980,  0.0163,  0.0375,  0.0945, -0.0603, -0.0421,  0.0848,  0.0035])
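
As a sanity check on the cell above, here is a minimal sketch that compares the gradient computed by autograd with the analytic derivative $\partial\bar{y}/\partial x_1 = 2(x_1 - 0.3)/8$, and shows that a second backward pass accumulates into `grad` rather than overwriting it. The seed and the helper names `analytic`, `matches` and `accumulated` are assumptions made for illustration.

```python
import torch

def f(x1, x2):
    y = (x1 - 0.3)**2 + (x2 - 0.7)**2 + 1
    return y.mean()

torch.manual_seed(0)                      # arbitrary seed, for reproducibility only
x1 = torch.rand(8, requires_grad=True)
x2 = torch.rand(8, requires_grad=True)

# First backward pass: compare with the analytic gradient of the mean
f(x1, x2).backward()
analytic = 2 * (x1.detach() - 0.3) / 8
matches = torch.allclose(x1.grad, analytic)

# Second backward pass: gradients are added to the existing ones
first_grad = x1.grad.clone()
f(x1, x2).backward()
accumulated = torch.allclose(x1.grad, 2 * first_grad)

# Reset the accumulated gradient in place before any further pass
x1.grad.zero_()
print(matches, accumulated)
```

This accumulation is the reason gradients have to be reset between iterations.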

### Optimization
The sub-module <code style='font-size:13px'><a href=https://pytorch.org/docs/stable/optim.html>optim</a></code>
implements various optimization algorithms from the basic SGD to the standard Adam. We are going to demonstrate
<code style='font-size:13px'>autograd</code> and <code style='font-size:13px'>optim</code>
working together on a simple problem: finding $\min\bar{\mathbf{y}}$ where $\mathbf{y}=(\mathbf{x}_1-0.3)^2+(\mathbf{x}_2-0.7)^2+1$. Recall that all gradient descent algorithms share three components:
- (1) configurations such as learning rate $\eta$, momentum $\gamma$ and weight decay $\rho$
- (2) the variable $x_t$ and its gradient $g_t$ at each iteration $t$
- (3) intermediate variables such as exponential moving averages $m$ and $v$

PyTorch exposes its optimization algorithms through a low-level interface. During initialization, we set component (1) and register component (2) with our optimizer. Then, when we call the
<code style='font-size:13px'>step()</code>
method, PyTorch computes everything needed in component (3) and updates $x$ in place.

The registration step makes sense once we realize that gradient descent only cares about $x$ and never touches $y$. The division of labor is clear: Autograd traverses the computational graph to update $g_t$, then Optim uses $g_t$ to perform a gradient descent step and hands the updated $x_t$ back to Autograd. Each sub-module is responsible for one specific task, and the two alternate, each passing its output to the other. Learning optimization in PyTorch this way also deepens one's understanding of gradient descent.
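
To make this division of labor concrete, here is a minimal sketch that runs plain gradient descent twice on the problem above: once by hand, updating $x$ directly from its gradient, and once through `torch.optim.SGD`. The learning rate of 0.3, the 50 iterations, the seed and the copies `z1`, `z2` are assumptions chosen for illustration; plain SGD needs no intermediate variables, so component (3) is empty here.

```python
import torch

def f(x1, x2):
    y = (x1 - 0.3)**2 + (x2 - 0.7)**2 + 1
    return y.mean()

torch.manual_seed(0)
x1 = torch.rand(8, requires_grad=True)
x2 = torch.rand(8, requires_grad=True)
# Independent copies with the same starting values, to be updated by the optimizer
z1 = x1.detach().clone().requires_grad_(True)
z2 = x2.detach().clone().requires_grad_(True)

lr = 0.3                                          # component (1)
optimizer = torch.optim.SGD([z1, z2], lr=lr)      # registers component (2)

for _ in range(50):
    # By hand: reset gradients, run autograd, then update x in place
    for p in (x1, x2):
        p.grad = None
    f(x1, x2).backward()
    with torch.no_grad():
        for p in (x1, x2):
            p -= lr * p.grad

    # Same iteration delegated to the optimizer
    optimizer.zero_grad()
    f(z1, z2).backward()
    optimizer.step()

print(torch.allclose(x1, z1, atol=1e-6), f(x1, x2).item())
```

Both versions should end at the same point, close to $x_1 = 0.3$ and $x_2 = 0.7$, where $\bar{y}$ reaches its minimum of 1.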

In [5]:
def f(x1, x2):
    y = (x1 - 0.3)**2 + (x2 - 0.7)**2 + 1
    return y.mean()

In [31]:
x1 = torch.rand(8, requires_grad=True)
x2 = torch.rand(8, requires_grad=True)
params = (x1, x2)
optimizer = torch.optim.Adam(params)

In [32]:
optimizer.state_dict()['state']

{}

In [30]:
optimizer.zero_grad()
y = f(x1, x2)
y.backward()
optimizer.step()
optimizer.state_dict()['state']

{0: {'step': tensor(1.),
  'exp_avg': tensor([0.0128, 0.0034, 0.0069, 0.0109, 0.0157, 0.0140, 0.0157, 0.0117]),
  'exp_avg_sq': tensor([1.6259e-05, 1.1660e-06, 4.7038e-06, 1.1785e-05, 2.4588e-05, 1.9502e-05,
          2.4645e-05, 1.3725e-05])},
 1: {'step': tensor(1.),
  'exp_avg': tensor([ 0.0041, -0.0049, -0.0042, -0.0075,  0.0022, -0.0111,  0.0006, -0.0064]),
  'exp_avg_sq': tensor([1.6493e-06, 2.4000e-06, 1.7603e-06, 5.6875e-06, 4.8614e-07, 1.2309e-05,
          3.5359e-08, 4.0557e-06])}}

In [20]:
x1.grad

tensor([ 0.0971,  0.0389,  0.0405, -0.0444,  0.1366,  0.0634, -0.0026,  0.0251])

In [18]:
optimizer.zero_grad()
y = f(x1, x2)
y.backward()
optimizer.step()
optimizer.state_dict()['state']

{0: {'momentum_buffer': tensor([ 0.1916,  0.0767,  0.0799, -0.0876,  0.2695,  0.1250, -0.0051,  0.0495])},
 1: {'momentum_buffer': tensor([ 0.0007, -0.2121,  0.0425,  0.0691, -0.0214, -0.0744,  0.0686, -0.1629])}}

In [21]:
optimizer.zero_grad()
y = f(x1, x2)
y.backward()
optimizer.step()
optimizer.state_dict()['state']

{0: {'momentum_buffer': tensor([ 0.2552,  0.1022,  0.1065, -0.1167,  0.3590,  0.1665, -0.0068,  0.0659])},
 1: {'momentum_buffer': tensor([ 0.0009, -0.2826,  0.0566,  0.0921, -0.0285, -0.0991,  0.0913, -0.2170])}}

In [25]:
x1.grad

tensor([ 0.0828,  0.0331,  0.0345, -0.0378,  0.1164,  0.0540, -0.0022,  0.0214])

In [27]:
0.2552 - 0.9*0.1916

0.08275999999999997
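
The arithmetic above appears to check the momentum recurrence used by SGD, $b_t = \gamma b_{t-1} + g_t$ with $\gamma = 0.9$: subtracting $0.9$ times the previous buffer from the current one recovers the current gradient. Here is a minimal sketch of the same check done on whole tensors, assuming the default dampening of 0 and a toy objective chosen purely for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.rand(8, requires_grad=True)
optimizer = torch.optim.SGD([x], lr=0.1, momentum=0.9)

def step():
    """Run one iteration and return copies of the momentum buffer and gradient."""
    optimizer.zero_grad()
    ((x - 0.3) ** 2).mean().backward()
    optimizer.step()
    return optimizer.state[x]['momentum_buffer'].clone(), x.grad.clone()

buf1, grad1 = step()
buf2, grad2 = step()

# On the first step the buffer is just the gradient
first_step_ok = torch.allclose(buf1, grad1)
# Afterwards it follows b_t = 0.9 * b_{t-1} + g_t
recurrence_holds = torch.allclose(buf2, 0.9 * buf1 + grad2)
print(first_step_ok, recurrence_holds)
```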

In [7]:
nIter = 50
for _ in range(nIter):
    optimizer.zero_grad()
    y = f(x1, x2)
    y.backward()
    optimizer.step()

In [8]:
y

tensor(1.0001, grad_fn=<MeanBackward0>)

In [9]:
x1

tensor([0.3092, 0.2988, 0.3117, 0.2961, 0.3081, 0.2995, 0.3086, 0.3084],
       requires_grad=True)

In [10]:
x2

tensor([0.6960, 0.7005, 0.6860, 0.6932, 0.6972, 0.6889, 0.7025, 0.7025],
       requires_grad=True)

In [11]:
optimizer.state_dict()

{'state': {0: {'momentum_buffer': None}, 1: {'momentum_buffer': None}},
 'param_groups': [{'lr': 0.3,
   'momentum': 0,
   'dampening': 0,
   'weight_decay': 0,
   'nesterov': False,
   'maximize': False,
   'foreach': None,
   'params': [0, 1]}]}

In [101]:
# constants
df = pd.read_csv('data/boston.csv')
x = torch.tensor(df.drop(columns='price').values, dtype=torch.float32)
yTrue = torch.tensor(df.price, dtype=torch.float32).reshape(-1, 1)
x.shape

torch.Size([506, 13])

In [118]:
# variables
w = torch.rand(13, 1, requires_grad=True)
b = torch.rand(1, requires_grad=True)

In [8]:
# forward pass
yPred = torch.matmul(x, w) + b
loss = F.mse_loss(yPred, yTrue)

In [5]:
# backward pass
loss.backward()
w.grad

tensor([[4.0881e+03],
        [8.9216e+03],
        [1.1059e+04],
        [6.2815e+01],
        [5.2048e+02],
        [5.6970e+03],
        [6.5683e+04],
        [3.2624e+03],
        [9.9968e+03],
        [4.0062e+05],
        [1.7052e+04],
        [3.2636e+05],
        [1.2301e+04]])

In [119]:
params = w, b
optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9)

In [122]:
optimizer.zero_grad()
yPred = torch.matmul(x, w) + b
loss = F.mse_loss(yPred, yTrue)

loss.backward()
optimizer.step()

loss

tensor(4.0422e+24, grad_fn=<MseLossBackward0>)

In [34]:
b.grad

tensor([-38109960.])

In [None]:
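nIter = 100
for epoch in range(nIter):
    optimizer.zero_grad()
    yPred = torch.matmul(x, w) + b  # forward pass
    loss = F.mse_loss(yPred, yTrue)
    loss.backward()  # backward pass
    optimizer.step()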

In [37]:
from torch.utils.data import Dataset, DataLoader

In [43]:
len(y)

506

In [50]:
x[2]

tensor([2.7290e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00, 4.6900e-01, 7.1850e+00,
        6.1100e+01, 4.9671e+00, 2.0000e+00, 2.4200e+02, 1.7800e+01, 3.9283e+02,
        4.0300e+00], dtype=torch.float64)

In [51]:
y[2]

tensor([34.7000], dtype=torch.float64)

In [52]:
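class TabularData(Dataset):
    def __init__(self, df, labelName):
        # store features and label as float tensors so samples can be indexed by position
        self.features = torch.tensor(df.drop(columns=labelName).values, dtype=torch.float32)
        self.label = torch.tensor(df[labelName].values, dtype=torch.float32).reshape(-1, 1)
    
    def __len__(self):
        return len(self.label)
    
    def __getitem__(self, idx):
        # return the (features, label) pair of a single sample
        return self.features[idx], self.label[idx]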

In [55]:
TabularData(df, 'price')

<__main__.TabularData at 0x1a72763b550>
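
A dataset object like this one is normally consumed through a `DataLoader`, which batches the samples returned by `__getitem__()`. Here is a minimal sketch, assuming a small made-up DataFrame `toy` in place of the Boston data and a version of `TabularData` that stores its columns as tensors.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class TabularData(Dataset):
    def __init__(self, df, labelName):
        # Convert to tensors up front so that indexing by position returns one sample
        self.features = torch.tensor(df.drop(columns=labelName).values, dtype=torch.float32)
        self.label = torch.tensor(df[labelName].values, dtype=torch.float32).reshape(-1, 1)

    def __len__(self):
        return len(self.label)

    def __getitem__(self, idx):
        return self.features[idx], self.label[idx]

# Hypothetical stand-in for the real data: 10 rows, 2 features, 1 label
toy = pd.DataFrame({'a': range(10), 'b': range(10, 20), 'price': range(20, 30)})

loader = DataLoader(TabularData(toy, 'price'), batch_size=4, shuffle=False)
batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in loader]
print(batch_shapes)
```

With 10 rows and a batch size of 4, the loader yields two full batches and one final batch of 2 samples.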

In [66]:
size = 7
x = torch.rand(size, requires_grad=True)
# y = (x-0.5)**2

optimizer = torch.optim.SGD([x], lr=0.1)

In [85]:
optimizer.zero_grad()
y = ((x-0.5)**2).sum()
y.backward()
optimizer.step()
y

tensor(8.1594e-05, grad_fn=<SumBackward0>)

## (w) Data loader

## (w) Gradient accumulation

## (w) Language processing
<code style='font-size:13px'><a href=https://pytorch.org/text/stable/index.html>torchtext</a></code>

## (w) Image processing
<code style='font-size:13px'><a href=https://pytorch.org/vision/stable/index.html>torchvision</a></code>

# (w) Neural networks

PyTorch has two APIs for creating layers,
<code style='font-size:13px'><a href=https://pytorch.org/docs/stable/nn.html>nn</a></code>
(abbreviated for *neural network*) and
<code style='font-size:13px'><a href=https://pytorch.org/docs/stable/nn.functional.html>nn.functional</a></code>.
The first module provides an object interface (layers are objects that own their trainable parameters), while the second provides a function interface (stateless and simpler to call). So, a common practice is to use the object interface for layers with trainable parameters, such as recurrent or convolutional layers, and the function interface for loss functions and activation functions.
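
Here is a minimal sketch of the difference between the two interfaces, with layer sizes and input chosen arbitrarily for illustration: an `nn.Linear` object carries its own weight and bias, whereas `F.relu` and `F.linear` are plain functions that only act on the tensors passed to them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(4, 3)

# Object interface: the layer owns a weight of shape (2, 3) and a bias of shape (2,)
linear = nn.Linear(3, 2)
param_shapes = [tuple(p.shape) for p in linear.parameters()]

# Function interface: stateless, nothing to register or train
h = F.relu(linear(x))

# The object form of ReLU gives the same result as the function form
same = torch.equal(h, nn.ReLU()(linear(x)))

# The function form of a linear layer needs the parameters passed in explicitly
manual = F.linear(x, linear.weight, linear.bias)
print(param_shapes, same)
```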

PyTorch has two APIs for creating models, where the recommended one is
<code style='font-size:13px'><a href=https://pytorch.org/docs/stable/generated/torch.nn.Module.html>nn.Module</a></code>,
comparable to model subclassing in Keras. To create a model, we inherit from this class, define the building blocks inside the <code style='font-size:13px'>\_\_init__()</code> method and describe the neural network architecture in the <code style='font-size:13px'>forward()</code> method. We don't need to specify the backward pass, as the submodule
<code style='font-size:13px'><a href=https://pytorch.org/docs/stable/autograd.html>autograd</a></code>
will handle it for us. The second API,
<code style='font-size:13px'><a href=https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html>nn.Sequential</a></code>,
is good for simple architectures as well as for small blocks within large networks, the inception block of GoogLeNet for example.
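
As a minimal sketch of the second API, a small fully connected network can be written as an `nn.Sequential` container, with no class definition and no explicit forward method. The sizes `D_in`, `H` and `D_out` are arbitrary values assumed for illustration.

```python
import torch
from torch import nn

D_in, H, D_out = 5, 8, 1   # hypothetical layer sizes

# Layers are applied in the order they are listed
model = nn.Sequential(
    nn.Linear(D_in, H),
    nn.ReLU(),
    nn.Linear(H, D_out),
)

x = torch.rand(16, D_in)
y_pred = model(x)

# Trainable parameters: (5*8 + 8) + (8*1 + 1)
n_params = sum(p.numel() for p in model.parameters())
print(tuple(y_pred.shape), n_params)
```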

In [1]:
import torch
from torch import nn
from torch.autograd import Variable

In [2]:
torch.cuda.is_available()

False

In [56]:
x = torch.tensor([[1., 2., 3., 4., 5.]], requires_grad=True)
y = torch.sum(x**2)
y.backward() 
x.grad

tensor([[ 2.,  4.,  6.,  8., 10.]])

In [57]:
class TwoLayerNet(nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.

        D_in: input dimension
        H: dimension of hidden layer
        D_out: output dimension
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = nn.Linear(D_in, H) 
        self.linear2 = nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must 
        return a Tensor of output data. We can use Modules defined in the 
        constructor as well as arbitrary operators on Tensors.
        """
        h_relu = nn.functional.relu(self.linear1(x))
        y_pred = self.linear2(h_relu)
        return y_pred
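
Here is a minimal sketch of how a model like this might be trained, combining `nn.Module` with the optimizer workflow from the Optimization section. The synthetic data, the layer sizes, the Adam learning rate and the 200 iterations are all assumptions made for illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

class TwoLayerNet(nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = nn.Linear(D_in, H)
        self.linear2 = nn.Linear(H, D_out)

    def forward(self, x):
        return self.linear2(F.relu(self.linear1(x)))

torch.manual_seed(0)
x = torch.rand(64, 3)
y_true = x.sum(dim=1, keepdim=True)      # synthetic target: sum of the features

model = TwoLayerNet(D_in=3, H=16, D_out=1)
# model.parameters() registers every weight and bias with the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

losses = []
for _ in range(200):
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y_true)  # forward pass
    loss.backward()                      # backward pass handled by autograd
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Calling `model(x)` rather than `model.forward(x)` is the usual convention, since it goes through `nn.Module`'s call mechanism.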

In [72]:
class Rectangle:
    def __init__(self, length, width):
        self.length = length
        self.width = width

    def area(self):
        return self.length * self.width

    def perimeter(self):
        return 2 * self.length + 2 * self.width

class Square(Rectangle):
    def __init__(self, length):
        super().__init__(length, length)
        
class Cube(Square):
    def surface_area(self):
        face_area = super().area()
        return face_area * 6

    def volume(self):
        face_area = super().area()
        return face_area * self.length

# References
- *pytorch.org - [Autograd mechanics](https://pytorch.org/docs/stable/notes/autograd.html)*
- *pytorch.org - [Automatic differentiation with Torch.Autograd](https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html)*
- *pytorch.org - [Deep Learning with PyTorch: A 60-minute blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)*
- *towardsdatascience.com - [Understanding PyTorch with an example: a step-by-step tutorial](https://towardsdatascience.com/understanding-pytorch-with-an-example-a-step-by-step-tutorial-81fc5f8c4e8e)*
- *towardsdatascience.com - [PyTorch vs TensorFlow - spotting the difference](https://towardsdatascience.com/pytorch-vs-tensorflow-spotting-the-difference-25c75777377b)*
- *blog.paperspace.com - [PyTorch 101 advanced](https://blog.paperspace.com/pytorch-101-advanced/)*
- *poloclub.github.io - [CNN explainer](https://poloclub.github.io/cnn-explainer/)*
- https://cs230.stanford.edu/blog/pytorch/

In [None]:
conda install pytorch torchvision torchaudio cpuonly -c pytorch

In [None]:
!pip3 install torch torchvision torchaudio --user

---
*&#9829; By Quang Hung x Thuy Linh &#9829;*