# PyTorch
Source:

https://towardsdatascience.com/understanding-pytorch-with-an-example-a-step-by-step-tutorial-81fc5f8c4e8e

autograd, dynamic computation graph, model classes

It is worth mentioning that, if we use all points in the training set (N) to compute the loss, we are performing a batch gradient descent. If we were to use a single point at each time, it would be a stochastic gradient descent. Anything else (n) in-between 1 and N characterizes a mini-batch gradient descent.
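To make the three variants concrete, here is a minimal NumPy sketch. The helper `iterate_batches` and the toy data are assumptions made for illustration and are not part of the tutorial; the only thing that changes between the three cases is the batch size.

```python
import numpy as np

def iterate_batches(x, y, batch_size, rng):
    """Yield shuffled (x, y) batches of at most `batch_size` points."""
    idx = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]
        yield x[batch], y[batch]

rng = np.random.default_rng(0)
x = rng.random((100, 1))
y = 1 + 2 * x
N = len(x)

# Number of parameter updates per epoch for each variant
updates_batch = sum(1 for _ in iterate_batches(x, y, N, rng))        # batch: n = N
updates_mini = sum(1 for _ in iterate_batches(x, y, 16, rng))        # mini-batch: 1 < n < N
updates_stochastic = sum(1 for _ in iterate_batches(x, y, 1, rng))   # stochastic: n = 1

print(updates_batch, updates_mini, updates_stochastic)  # prints: 1 7 100
```

With 100 points, batch gradient descent makes one update per epoch, a mini-batch size of 16 makes seven, and the stochastic variant makes one hundred.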

> In PyTorch, every method that ends with an underscore (_) makes changes in-place, meaning, they will modify the underlying variable.

In other words, any PyTorch function whose name ends with `_` changes the variable it is called on.
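A small sketch of the difference, using `add` and `add_` as the example pair (the variable names are illustrative only):

```python
import torch

t = torch.ones(3)

u = t.add(1)             # out-of-place: returns a new tensor, t is untouched
t_before = t.clone()     # snapshot of t before the in-place call

v = t.add_(1)            # in-place: modifies t itself

print(t_before)  # still all ones
print(t)         # now all twos
print(v.data_ptr() == t.data_ptr())  # v shares t's memory
```

The same pattern shows up later in these notes with `requires_grad_()` and `zero_()`.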


#### Datatype
When creating a tensor, tell PyTorch that it is a variable to be differentiated during backpropagation by setting `requires_grad=True`.

#### Autograd

Build the loss expression with ordinary arithmetic (addition, subtraction, multiplication, division).
Then call `loss.backward()` to tell PyTorch that this is the end point from which autograd should work backwards and compute the gradients automatically.
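A minimal sketch of these two steps on a single data point. The tensor names and values are made up for illustration; only `w` is marked with `requires_grad=True`, so only `w` receives a gradient.

```python
import torch

w = torch.tensor([3.0], requires_grad=True)  # parameter: gradient wanted
x = torch.tensor([2.0])                      # plain data: no gradient
y = torch.tensor([10.0])

# Build the loss with ordinary arithmetic
loss = ((y - w * x) ** 2).mean()

# Work backwards from the loss; the result lands in w.grad
loss.backward()

# d(loss)/dw = -2 * x * (y - w * x) = -2 * 2 * (10 - 6) = -16
print(w.grad)
print(x.grad)  # None, because x does not require gradients
```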

## Data Generation

In [1]:
import numpy as np
# Data Generation
np.random.seed(42)
x = np.random.rand(100, 1)
y = 1 + 2 * x + .1 * np.random.randn(100, 1)

# Shuffles the indices
idx = np.arange(100)
np.random.shuffle(idx)

# Uses first 80 random indices for train
train_idx = idx[:80]
# Uses the remaining indices for validation
val_idx = idx[80:]

# Generates train and validation sets
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]

In [2]:
len(x_train)
len(y_train)

len(x_val)

20

In [3]:
x_train.shape

(80, 1)

## Gradient Descent

## Linear Regression in Numpy

In [4]:
import numpy as np
# Initializes parameters "a" and "b" randomly
np.random.seed(42)
a = np.random.randn(1)
b = np.random.randn(1)

print(a, b)

# Sets learning rate
lr = 1e-1
# Defines number of epochs
n_epochs = 1000

for epoch in range(n_epochs):
    # Computes our model's predicted output
    yhat = a + b * x_train
    
    # How wrong is our model? That's the error! 
    error = (y_train - yhat)
    # It is a regression, so it computes mean squared error (MSE)
    loss = (error ** 2).mean()
    
    # Computes gradients for both "a" and "b" parameters
    a_grad = -2 * error.mean()
    b_grad = -2 * (x_train * error).mean()
    
    # Updates parameters using gradients and the learning rate
    a = a - lr * a_grad
    b = b - lr * b_grad
    
    print(a, b)
    
print(a, b)

[0.49671415] [-0.1382643]
[0.80119529] [0.04511107]
[1.02745273] [0.1880898]
[1.19494837] [0.30062446]
[1.31831128] [0.39019766]
[1.40853768] [0.46243529]
[1.47389293] [0.52156753]
[1.52058962] [0.57077536]
[1.55329724] [0.61245104]
[1.57552532] [0.64839396]
[1.58991148] [0.67995779]
[1.59843788] [0.70816115]
[1.60259402] [0.7337708]
[1.60349903] [0.75736412]
[1.60199366] [0.77937615]
[1.59870941] [0.8001349]
[1.59412047] [0.81988791]
[1.58858283] [0.83882224]
[1.58236358] [0.85707942]
[1.57566302] [0.8747668]
[1.56863126] [0.89196598]
[1.56138068] [0.90873921]
[1.55399527] [0.92513416]
[1.54653776] [0.94118756]
[1.53905484] [0.95692788]
[1.53158117] [0.97237736]
[1.52414239] [0.98755358]
[1.51675733] [1.00247055]
[1.50943976] [1.01713964]
[1.50219959] [1.03157019]
[1.49504388] [1.04577]
[1.48797756] [1.05974573]
[1.4810039] [1.07350312]
[1.47412502] [1.08704728]
[1.4673421] [1.10038276]
[1.46065567] [1.11351372]
[1.45406576] [1.12644402]
[1.44757202] [1.13917725]
...
[1.02530875] [1.96550502]
[1.02528204] [1.96555728]
[1.02525574] [1.96560875]
[1.02522983] [1.96565945]
[1.02520431] [1.96570938]
[1.02517918] [1.96575855]
[1.02515443] [1.96580698]
[1.02513005] [1.96585468]
[1.02510604] [1.96590166]
[1.02508239] [1.96594793]
[1.0250591] [1.9659935]
[1.02503617] [1.96603839]
[1.02501357] [1.96608259]
[1.02499132] [1.96612613]
[1.02496941] [1.96616901]
[1.02494783] [1.96621124]
[1.02492657] [1.96625283]
[1.02490564] [1.9662938]
[1.02488502] [1.96633414]
[1.02486471] [1.96637388]
[1.02484471] [1.96641302]
[1.02482501] [1.96645156]
[1.02480561] [1.96648952]
[1.0247865] [1.96652691]
[1.02476768] [1.96656374]
[1.02474914] [1.9666]
[1.02473089] [1.96663572]
[1.02471291] [1.96667091]
[1.0246952] [1.96670555]
[1.02467776] [1.96673968]
[1.02466058] [1.96677329]
[1.02464367] [1.96680639]
[1.02462701] [1.96683899]
[1.0246106] [1.9668711]
[1.02459443] [1.96690273]
[1.02457852] [1.96693388]
[1.02456284] [1.96696455]
[1.0245474] [1.96699476]
...
[1.0235517] [1.96894304]
[1.02355154] [1.96894337]
[1.02355138] [1.96894369]
[1.02355122] [1.968944]
[1.02355106] [1.96894431]
[1.0235509] [1.96894461]
[1.02355075] [1.96894491]
[1.0235506] [1.96894521]
[1.02355045] [1.9689455]
[1.0235503] [1.96894579]
[1.02355016] [1.96894607]
[1.02355002] [1.96894635]
[1.02354988] [1.96894662]
[1.02354974] [1.96894689]
[1.0235496] [1.96894716]
[1.02354947] [1.96894742]
[1.02354934] [1.96894768]
[1.02354921] [1.96894793]
[1.02354908] [1.96894818]
[1.02354895] [1.96894843]
[1.02354883] [1.96894867]
[1.02354871] [1.96894891]
[1.02354859] [1.96894914]
[1.02354847] [1.96894937]
[1.02354835] [1.9689496]
[1.02354824] [1.96894983]
[1.02354813] [1.96895005]
[1.02354801] [1.96895027]
[1.0235479] [1.96895048]
[1.0235478] [1.96895069]
[1.02354769] [1.9689509]
[1.02354759] [1.9689511]
[1.02354748] [1.96895131]
[1.02354738] [1.96895151]
[1.02354728] [1.9689517]
[1.02354718] [1.96895189]
[1.02354708] [1.96895208]
[1.02354699] [1.96895227]
[1.0235469] [1.96895246]
...

Note: backpropagation computes the partial derivative of the loss with respect to each weight. Gradient descent then uses those derivatives to update the weights so that the forward pass produces outputs closer to the target values.
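For the linear model in the NumPy loop, the two gradients follow from the chain rule. Writing $\hat{y}_i = a + b x_i$ for the prediction and using the mean squared error as the loss:

$$
L = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \qquad
\frac{\partial L}{\partial a} = -\frac{2}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i), \qquad
\frac{\partial L}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N} x_i \,(y_i - \hat{y}_i)
$$

These correspond to `a_grad = -2 * error.mean()` and `b_grad = -2 * (x_train * error).mean()`, with `error = y_train - yhat`. Each epoch then moves the parameters against the gradient: `a = a - lr * a_grad` and `b = b - lr * b_grad`.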

In [5]:
# Sanity Check: do we get the same results as our gradient descent?
from sklearn.linear_model import LinearRegression
linr = LinearRegression()
linr.fit(x_train, y_train)
print(linr.intercept_, linr.coef_[0])

[1.02354075] [1.96896447]


## PyTorch

In [7]:
import torch
import torch.optim as optim
import torch.nn as nn
from torchviz import make_dot

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Our data was in Numpy arrays, but we need to transform them into PyTorch's Tensors
# and then we send them to the chosen device
x_train_tensor = torch.from_numpy(x_train).float().to(device)
y_train_tensor = torch.from_numpy(y_train).float().to(device)

# Here we can see the difference - notice that .type() is more useful
# since it also tells us WHERE the tensor is (device)
print(type(x_train), type(x_train_tensor), x_train_tensor.type())

<class 'numpy.ndarray'> <class 'torch.Tensor'> torch.FloatTensor


## Creating Parameters

In [8]:
# FIRST
# Initializes parameters "a" and "b" randomly, ALMOST as we did in Numpy
# since we want to apply gradient descent on these parameters, we need
# to set REQUIRES_GRAD = TRUE
a = torch.randn(1, requires_grad=True, dtype=torch.float)
b = torch.randn(1, requires_grad=True, dtype=torch.float)
print(a, b)

# SECOND
# But what if we want to run it on a GPU? We could just send them to device, right?
a = torch.randn(1, requires_grad=True, dtype=torch.float).to(device)
b = torch.randn(1, requires_grad=True, dtype=torch.float).to(device)
print(a, b)
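# Sorry, but NO! On a GPU, .to(device) returns a new (non-leaf) tensor, so the gradient
# never shows up in a.grad or b.grad. On a CPU, .to(device) is a no-op, which is why the
# output below still shows requires_grad=True.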

# THIRD
# We can either create regular tensors and send them to the device (as we did with our data)
a = torch.randn(1, dtype=torch.float).to(device)
b = torch.randn(1, dtype=torch.float).to(device)
# and THEN set them as requiring gradients...
a.requires_grad_()
b.requires_grad_()
print(a, b)

tensor([0.2706], requires_grad=True) tensor([-0.4380], requires_grad=True)
tensor([0.3330], requires_grad=True) tensor([0.3203], requires_grad=True)
tensor([-1.2678], requires_grad=True) tensor([0.1246], requires_grad=True)


In [13]:
# We can specify the device at the moment of creation - RECOMMENDED!
torch.manual_seed(42)
a = torch.randn((2,2), requires_grad=True, dtype=torch.float, device=device)
b = torch.randn([2,2], requires_grad=True, dtype=torch.float, device=device)
print(a, b)

tensor([[0.3367, 0.1288],
        [0.2345, 0.2303]], requires_grad=True) tensor([[-1.1229, -0.1863],
        [ 2.2082, -0.6380]], requires_grad=True)


## Autograd

In [14]:
lr = 1e-1
n_epochs = 1000

torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

for epoch in range(n_epochs):
    yhat = a + b * x_train_tensor
    error = y_train_tensor - yhat
    loss = (error ** 2).mean()

    # No more manual computation of gradients! 
    # a_grad = -2 * error.mean()
    # b_grad = -2 * (x_train_tensor * error).mean()
    
    # We just tell PyTorch to work its way BACKWARDS from the specified loss!
    loss.backward()
    # Let's check the computed gradients...
    print(a.grad)
    print(b.grad)
    
    # What about UPDATING the parameters? Not so fast...
    
    # FIRST ATTEMPT
    # AttributeError: 'NoneType' object has no attribute 'zero_'
    # a = a - lr * a.grad
    # b = b - lr * b.grad
    # print(a)

    # SECOND ATTEMPT
    # RuntimeError: a leaf Variable that requires grad has been used in an in-place operation.
    # a -= lr * a.grad
    # b -= lr * b.grad        
    
    # THIRD ATTEMPT
    # We need to use NO_GRAD to keep the update out of the gradient computation
    # Why is that? It boils down to the DYNAMIC GRAPH that PyTorch uses...
    with torch.no_grad():
        a -= lr * a.grad
        b -= lr * b.grad
    
    # PyTorch is "clingy" to its computed gradients, we need to tell it to let it go...
    a.grad.zero_()
    b.grad.zero_()
    
print(a, b)

tensor([-3.1125])
tensor([-1.8156])
tensor([-2.3184])
tensor([-1.4064])
tensor([-1.7219])
tensor([-1.0982])
tensor([-1.2737])
tensor([-0.8659])
tensor([-0.9372])
tensor([-0.6906])
tensor([-0.6845])
tensor([-0.5583])
tensor([-0.4948])
tensor([-0.4582])
tensor([-0.3526])
tensor([-0.3824])
tensor([-0.2459])
tensor([-0.3248])
tensor([-0.1660])
tensor([-0.2810])
tensor([-0.1063])
tensor([-0.2475])
tensor([-0.0616])
tensor([-0.2218])
tensor([-0.0283])
tensor([-0.2019])
tensor([-0.0036])
tensor([-0.1864])
tensor([0.0147])
tensor([-0.1743])
tensor([0.0283])
tensor([-0.1646])
tensor([0.0382])
tensor([-0.1568])
tensor([0.0453])
tensor([-0.1505])
tensor([0.0505])
tensor([-0.1452])
tensor([0.0541])
tensor([-0.1408])
tensor([0.0566])
tensor([-0.1370])
tensor([0.0582])
tensor([-0.1337])
tensor([0.0592])
tensor([-0.1307])
tensor([0.0597])
tensor([-0.1280])
tensor([0.0599])
tensor([-0.1255])
tensor([0.0598])
tensor([-0.1232])
tensor([0.0594])
tensor([-0.1211])
tensor([0.0590])
tensor([-0.1190])
...
tensor([-0.0025])
tensor([0.0013])
tensor([-0.0025])
tensor([0.0012])
tensor([-0.0024])
tensor([0.0012])
tensor([-0.0024])
tensor([0.0012])
tensor([-0.0024])
tensor([0.0012])
tensor([-0.0023])
tensor([0.0012])
tensor([-0.0023])
tensor([0.0012])
tensor([-0.0023])
tensor([0.0011])
tensor([-0.0022])
tensor([0.0011])
tensor([-0.0022])
tensor([0.0011])
tensor([-0.0022])
tensor([0.0011])
tensor([-0.0021])
tensor([0.0011])
tensor([-0.0021])
tensor([0.0011])
tensor([-0.0021])
tensor([0.0010])
tensor([-0.0020])
tensor([0.0010])
tensor([-0.0020])
tensor([0.0010])
tensor([-0.0020])
tensor([0.0010])
tensor([-0.0019])
tensor([0.0010])
tensor([-0.0019])
tensor([0.0010])
tensor([-0.0019])
tensor([0.0009])
tensor([-0.0019])
tensor([0.0009])
tensor([-0.0018])
tensor([0.0009])
tensor([-0.0018])
tensor([0.0009])
tensor([-0.0018])
tensor([0.0009])
tensor([-0.0017])
tensor([0.0009])
tensor([-0.0017])
tensor([0.0009])
tensor([-0.0017])
tensor([0.0009])
tensor([-0.0017])
tensor([0.0008])
tensor([-0.0016])
...
tensor([3.8651e-05])
tensor([-7.5801e-05])
tensor([3.8224e-05])
tensor([-7.4577e-05])
tensor([3.7707e-05])
tensor([-7.3411e-05])
tensor([3.7103e-05])
tensor([-7.2301e-05])
tensor([3.6565e-05])
tensor([-7.1188e-05])
tensor([3.5924e-05])
tensor([-7.0153e-05])
tensor([3.5397e-05])
tensor([-6.9097e-05])
tensor([3.4793e-05])
tensor([-6.8070e-05])
tensor([3.4280e-05])
tensor([-6.7045e-05])
tensor([3.3689e-05])
tensor([-6.6071e-05])
tensor([3.3202e-05])
tensor([-6.5076e-05])
tensor([3.2708e-05])
tensor([-6.4070e-05])
tensor([3.2351e-05])
tensor([-6.3039e-05])
tensor([3.1894e-05])
tensor([-6.2071e-05])
tensor([3.1338e-05])
tensor([-6.1164e-05])
tensor([3.0884e-05])
tensor([-6.0245e-05])
tensor([3.0420e-05])
tensor([-5.9323e-05])
tensor([2.9829e-05])
tensor([-5.8492e-05])
tensor([2.9391e-05])
tensor([-5.7597e-05])
tensor([2.8837e-05])
tensor([-5.6787e-05])
tensor([2.8545e-05])
tensor([-5.5844e-05])
tensor([2.8068e-05])
tensor([-5.5038e-05])
tensor([2.7576e-05])
tensor([-5.4227e-05])
...
tensor([1.6052e-06])
tensor([-2.8973e-06])
tensor([1.5740e-06])
tensor([-2.8721e-06])
tensor([1.6280e-06])
tensor([-2.7748e-06])
tensor([1.6018e-06])
tensor([-2.7442e-06])
tensor([1.5683e-06])
tensor([-2.7200e-06])
tensor([1.5324e-06])
tensor([-2.7094e-06])
tensor([1.5143e-06])
tensor([-2.6750e-06])
tensor([1.5338e-06])
tensor([-2.6159e-06])
tensor([1.5271e-06])
tensor([-2.5660e-06])
tensor([1.5131e-06])
tensor([-2.5269e-06])
tensor([1.4782e-06])
tensor([-2.5077e-06])
tensor([1.5043e-06])
tensor([-2.4402e-06])
tensor([1.4775e-06])
tensor([-2.4112e-06])
tensor([1.4729e-06])
tensor([-2.3637e-06])
tensor([1.4377e-06])
tensor([-2.3390e-06])
tensor([1.4158e-06])
tensor([-2.3076e-06])
tensor([1.4075e-06])
tensor([-2.2663e-06])
tensor([1.3832e-06])
tensor([-2.2354e-06])
tensor([1.3728e-06])
tensor([-2.1949e-06])
tensor([1.3587e-06])
tensor([-2.1585e-06])
tensor([1.3668e-06])
tensor([-2.1037e-06])
tensor([1.3468e-06])
tensor([-2.0684e-06])
tensor([1.3252e-06])
tensor([-2.0346e-06])
...
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([5.4640e-07])
tensor([-5.7230e-07])
tensor([1.0235], requires_grad=True) tensor([1.9690], requires_grad=True)


## Dynamic Computation Graph

In [15]:
torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

yhat = a + b * x_train_tensor
error = y_train_tensor - yhat
loss = (error ** 2).mean()
loss

tensor(2.7475, grad_fn=<MeanBackward0>)

## Optimizer

In [19]:
torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
print(a, b)

lr = 1e-1
n_epochs = 1000

# Defines a SGD optimizer to update the parameters
optimizer = optim.SGD([a, b], lr=lr)

for epoch in range(n_epochs):
    yhat = a + b * x_train_tensor
    error = y_train_tensor - yhat
    loss = (error ** 2).mean()

    loss.backward()    
    
    # No more manual update!
    # with torch.no_grad():
    #     a -= lr * a.grad
    #     b -= lr * b.grad
    optimizer.step()
    
    # No more telling PyTorch to let gradients go!
    # a.grad.zero_()
    # b.grad.zero_()
    optimizer.zero_grad()
    
print(a, b)

tensor([0.3367], requires_grad=True) tensor([0.1288], requires_grad=True)
tensor([1.0235], requires_grad=True) tensor([1.9690], requires_grad=True)


## Loss

In [20]:
torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
print(a, b)

lr = 1e-1
n_epochs = 1000

# Defines a MSE loss function
loss_fn = nn.MSELoss(reduction='mean')

optimizer = optim.SGD([a, b], lr=lr)

for epoch in range(n_epochs):
    yhat = a + b * x_train_tensor
    
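    # No more manual loss!
    # error = y_train_tensor - yhat
    # loss = (error ** 2).mean()
    # (nn.MSELoss is documented as loss_fn(input, target); MSE is symmetric,
    # so the argument order used here gives the same value)
    loss = loss_fn(y_train_tensor, yhat)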

    loss.backward()    
    optimizer.step()
    optimizer.zero_grad()
    
print(a, b)

tensor([0.3367], requires_grad=True) tensor([0.1288], requires_grad=True)
tensor([1.0235], requires_grad=True) tensor([1.9690], requires_grad=True)


Dropout is applied only during training, not during evaluation.
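The following sketch illustrates this with a standalone `nn.Dropout` layer; the layer and the all-ones input are chosen for illustration and are not part of the regression example in these notes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()           # training mode: elements are zeroed at random
out_train = drop(x)    # survivors are scaled by 1 / (1 - p) = 2

drop.eval()            # evaluation mode: dropout passes the input through
out_eval = drop(x)

print(out_train.unique())         # only zeros and twos
print(torch.equal(out_eval, x))   # True
```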

## Model

In [52]:
class ManualLinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        # To make "a" and "b" real parameters of the model, we need to wrap them with nn.Parameter
        self.a = nn.Parameter(torch.randn(1, requires_grad=True, dtype=torch.float))
        self.b = nn.Parameter(torch.randn(1, requires_grad=True, dtype=torch.float))
        
    def forward(self, x):
        # Computes the outputs / predictions
        return self.a + self.b * x

In [53]:
torch.manual_seed(42)

# Now we can create a model and send it at once to the device
model = ManualLinearRegression().to(device)
# We can also inspect its parameters using its state_dict
print(model.state_dict())

lr = 1e-1
n_epochs = 1000

loss_fn = nn.MSELoss(reduction='mean')
optimizer = optim.SGD(model.parameters(), lr=lr)

for epoch in range(n_epochs):
    # What is this?!?
    model.train()

    # No more manual prediction!
    # yhat = a + b * x_train_tensor
    yhat = model(x_train_tensor)
    
    loss = loss_fn(y_train_tensor, yhat)
    loss.backward()    
    optimizer.step()
    optimizer.zero_grad()
    
print(model.state_dict())

OrderedDict([('a', tensor([0.3367])), ('b', tensor([0.1288]))])
OrderedDict([('a', tensor([1.0235])), ('b', tensor([1.9690]))])


In [50]:
for epoch in range(n_epochs):
    # What is this?!?
    print("1 model.train()")
    model.train()

    # No more manual prediction!
    # yhat = a + b * x_train_tensor
    print("2 yhat = model(x_train_tensor)")
    yhat = model(x_train_tensor)
    
    print("2.4 loss = loss_fn(y_train_tensor, yhat)")
    loss = loss_fn(y_train_tensor, yhat)
    print("3 loss.backward()")
    loss.backward()
    print("4 optimizer.step()")
    optimizer.step()
    print("5 optimizer.zero_grad()")
    optimizer.zero_grad()
    
print(model.state_dict())

1 model.train()
2 yhat = model(x_train_tensor)
forward
2.4 loss = loss_fn(y_train_tensor, yhat)
3 loss.backward()
4 optimizer.step()
5 optimizer.zero_grad()
... (the same seven lines repeat for every epoch; the `forward` line appears to come from a print inside the model's `forward` method in the version that was run)

In [23]:
class LayerLinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        # Instead of our custom parameters, we use a Linear layer with single input and single output
        self.linear = nn.Linear(1, 1)
                
    def forward(self, x):
        # Now it only takes a call to the layer to make predictions
        return self.linear(x)

In [29]:
torch.manual_seed(42)

# Now we can create a model and send it at once to the device
model = LayerLinearRegression().to(device)
# We can also inspect its parameters using its state_dict
print(model.state_dict())

lr = 1e-1
n_epochs = 1000

loss_fn = nn.MSELoss(reduction='mean')
optimizer = optim.SGD(model.parameters(), lr=lr)

for epoch in range(n_epochs):
    # What is this?!?
    model.train()

    # No more manual prediction!
    # yhat = a + b * x_train_tensor
    yhat = model(x_train_tensor)
    
    loss = loss_fn(y_train_tensor, yhat)
    loss.backward()    
    optimizer.step()
    optimizer.zero_grad()
    
print(model.state_dict())

OrderedDict([('linear.weight', tensor([[0.7645]])), ('linear.bias', tensor([0.8300]))])
OrderedDict([('linear.weight', tensor([[1.9690]])), ('linear.bias', tensor([1.0235]))])


In [32]:
[*LayerLinearRegression().parameters()]

[Parameter containing:
 tensor([[-0.2191]], requires_grad=True),
 Parameter containing:
 tensor([0.2018], requires_grad=True)]

`model.train()` tells PyTorch that the model is in training mode rather than evaluation mode.
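The mode is stored in each module's `training` attribute, and calling `train()` or `eval()` on a model sets it for every submodule as well. A short sketch, using a throwaway `nn.Sequential` model that is assumed here only for illustration:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(1, 1), nn.Dropout(p=0.5))

model.eval()
flags_eval = [m.training for m in model.modules()]    # container + 2 layers

model.train()
flags_train = [m.training for m in model.modules()]

print(flags_eval)   # [False, False, False]
print(flags_train)  # [True, True, True]
```

For the linear regression models in these notes the flag makes no difference to the output, since they contain no mode-dependent layers; it matters once layers such as dropout are added.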

## Training Step

In [34]:
def make_train_step(model, loss_fn, optimizer):
    # Builds function that performs a step in the train loop
    def train_step(x, y):
        # Sets model to TRAIN mode
        model.train()
        # Makes predictions
        yhat = model(x)
        # Computes loss
        loss = loss_fn(y, yhat)
        # Computes gradients
        loss.backward()
        # Updates parameters and zeroes gradients
        optimizer.step()
        optimizer.zero_grad()
        # Returns the loss
        return loss.item()
    
    # Returns the function that will be called inside the train loop
    return train_step

# Creates the train_step function for our model, loss function and optimizer
train_step = make_train_step(model, loss_fn, optimizer)
losses = []

# For each epoch...
for epoch in range(n_epochs):
    # Performs one train step and returns the corresponding loss
    loss = train_step(x_train_tensor, y_train_tensor)
    losses.append(loss)
    
# Checks model's parameters
print(model.state_dict())

OrderedDict([('linear.weight', tensor([[1.9690]])), ('linear.bias', tensor([1.0235]))])
OrderedDict([('linear.weight', tensor([[1.9690]])), ('linear.bias', tensor([1.0235]))])


## Dataset

In [39]:
from torch.utils.data import Dataset, TensorDataset

class CustomDataset(Dataset):
    def __init__(self, x_tensor, y_tensor):
        self.x = x_tensor
        self.y = y_tensor
        
    def __getitem__(self, index):
        return (self.x[index], self.y[index])

    def __len__(self):
        return len(self.x)

# Wait, is this a CPU tensor now? Why? Where is .to(device)?
x_train_tensor = torch.from_numpy(x_train).float()
y_train_tensor = torch.from_numpy(y_train).float()

train_data = CustomDataset(x_train_tensor, y_train_tensor)
print(train_data[0])

train_data = TensorDataset(x_train_tensor, y_train_tensor)
print(train_data[0])

(tensor([0.7713]), tensor([2.4745]))
(tensor([0.7713]), tensor([2.4745]))


## DataLoader

In [41]:
from torch.utils.data import DataLoader

train_loader = DataLoader(dataset=train_data, batch_size=16, shuffle=True)

In [42]:
losses = []
train_step = make_train_step(model, loss_fn, optimizer)

for epoch in range(n_epochs):
    for x_batch, y_batch in train_loader:
        # the dataset "lives" in the CPU, so do our mini-batches
        # therefore, we need to send those mini-batches to the
        # device where the model "lives"
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)
        
        loss = train_step(x_batch, y_batch)
        losses.append(loss)
        
print(model.state_dict())

OrderedDict([('linear.weight', tensor([[1.9698]])), ('linear.bias', tensor([1.0255]))])


##### Random Split

In [43]:
from torch.utils.data.dataset import random_split

x_tensor = torch.from_numpy(x).float()
y_tensor = torch.from_numpy(y).float()

dataset = TensorDataset(x_tensor, y_tensor)

train_dataset, val_dataset = random_split(dataset, [80, 20])

train_loader = DataLoader(dataset=train_dataset, batch_size=16)
val_loader = DataLoader(dataset=val_dataset, batch_size=20)

## Evaluation

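`torch.no_grad()`: tells PyTorch to run only the forward pass inside the block, without recording operations for backpropagation.

`eval()`: puts the model in evaluation mode (the opposite of `.train()`), which changes how layers such as dropout and batch normalization behave. It does not turn off gradient tracking by itself, which is why it is combined with `torch.no_grad()` during validation.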
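
The difference between the two switches can be checked on a toy model. The sketch below is only an illustration: the `net` model and its layer sizes are assumptions and are not part of the notebook's regression model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model, used only to illustrate the two switches
net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(2, 4)

# eval() changes layer behaviour (dropout is disabled),
# but autograd still records the forward pass
net.eval()
out_eval = net(x)
eval_tracks_grad = out_eval.requires_grad            # graph is still built
eval_is_deterministic = torch.equal(net(x), net(x))  # dropout is off

# no_grad() stops graph recording, but the model stays in train mode
net.train()
with torch.no_grad():
    out_no_grad = net(x)
no_grad_tracks_grad = out_no_grad.requires_grad      # no graph recorded
still_in_train_mode = net.training                   # dropout remains active

print(eval_tracks_grad, eval_is_deterministic)
print(no_grad_tracks_grad, still_in_train_mode)
```

In a validation loop both are normally used together: `model.eval()` for correct layer behaviour and `torch.no_grad()` to save memory and computation.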

In [44]:
losses = []
val_losses = []
train_step = make_train_step(model, loss_fn, optimizer)

for epoch in range(n_epochs):
    for x_batch, y_batch in train_loader:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        loss = train_step(x_batch, y_batch)
        losses.append(loss)
        
    with torch.no_grad():
        for x_val, y_val in val_loader:
            x_val = x_val.to(device)
            y_val = y_val.to(device)
            
            model.eval()

            yhat = model(x_val)
            val_loss = loss_fn(y_val, yhat)
            val_losses.append(val_loss.item())

print(model.state_dict())

OrderedDict([('linear.weight', tensor([[1.9559]])), ('linear.bias', tensor([1.0289]))])


In [57]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchsummary import summary

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x, y):
        y = F.relu(F.max_pool2d(self.conv1(y), 2))
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # PyTorch v0.4.0
model = Net().to(device)

In [62]:
summary(model, input_size=((1, 28, 28), (1, 28, 28)))

TypeError: rand(): argument 'size' must be tuple of ints, but found element of type tuple at pos 2
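
A likely cause of this error, judging from the `summary_string` source reproduced further down in this notebook, is that a tuple `input_size` is wrapped as a single shape, so a tuple of tuples ends up being passed to `torch.rand` as nested sizes. The sketch below mirrors only that wrapping logic; `build_inputs` is a hypothetical helper name, not part of `torchsummary`.

```python
import torch

def build_inputs(input_size):
    # Same wrapping rule as in summary_string: a tuple means "one input shape"
    if isinstance(input_size, tuple):
        input_size = [input_size]
    # Batch size of 2, one random tensor per input shape
    return [torch.rand(2, *in_size) for in_size in input_size]

# A list of shape tuples yields one tensor per model input
inputs = build_inputs([(1, 28, 28), (1, 28, 28)])
shapes = [tuple(t.shape) for t in inputs]
print(shapes)

# A tuple of tuples is treated as a single shape and fails inside torch.rand
try:
    build_inputs(((1, 28, 28), (1, 28, 28)))
    raised = False
except TypeError:
    raised = True
print(raised)
```

Under that assumption, passing the shapes as a list, `input_size=[(1, 28, 28), (1, 28, 28)]`, should avoid the error for a model with two inputs.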

In [None]:
import torch
import torch.nn as nn
from torch.autograd import Variable

from collections import OrderedDict
import numpy as np


def summary(model, input_size, batch_size=-1, device=torch.device('cuda:0'), dtypes=None):
    result, params_info = summary_string(
        model, input_size, batch_size, device, dtypes)
    print(result)

    return params_info


def summary_string(model, input_size, batch_size=-1, device=torch.device('cuda:0'), dtypes=None):
    if dtypes is None:
        dtypes = [torch.FloatTensor]*len(input_size)

    summary_str = ''

    def register_hook(module):
        def hook(module, input, output):
            class_name = str(module.__class__).split(".")[-1].split("'")[0]
            module_idx = len(summary)

            m_key = "%s-%i" % (class_name, module_idx + 1)
            summary[m_key] = OrderedDict()
            summary[m_key]["input_shape"] = list(input[0].size())
            summary[m_key]["input_shape"][0] = batch_size
            if isinstance(output, (list, tuple)):
                summary[m_key]["output_shape"] = [
                    [-1] + list(o.size())[1:] for o in output
                ]
            else:
                summary[m_key]["output_shape"] = list(output.size())
                summary[m_key]["output_shape"][0] = batch_size

            params = 0
            if hasattr(module, "weight") and hasattr(module.weight, "size"):
                params += torch.prod(torch.LongTensor(list(module.weight.size())))
                summary[m_key]["trainable"] = module.weight.requires_grad
            if hasattr(module, "bias") and hasattr(module.bias, "size"):
                params += torch.prod(torch.LongTensor(list(module.bias.size())))
            summary[m_key]["nb_params"] = params

        if (
            not isinstance(module, nn.Sequential)
            and not isinstance(module, nn.ModuleList)
        ):
            hooks.append(module.register_forward_hook(hook))

    # multiple inputs to the network
    if isinstance(input_size, tuple):
        input_size = [input_size]

    # batch_size of 2 for batchnorm
    x = [torch.rand(2, *in_size).type(dtype).to(device=device)
         for in_size, dtype in zip(input_size, dtypes)]

    # create properties
    summary = OrderedDict()
    hooks = []

    # register hook
    model.apply(register_hook)

    # make a forward pass
    # print(x.shape)
    model(*x)

    # remove these hooks
    for h in hooks:
        h.remove()

    summary_str += "----------------------------------------------------------------" + "\n"
    line_new = "{:>20}  {:>25} {:>15}".format(
        "Layer (type)", "Output Shape", "Param #")
    summary_str += line_new + "\n"
    summary_str += "================================================================" + "\n"
    total_params = 0
    total_output = 0
    trainable_params = 0
    for layer in summary:
        # input_shape, output_shape, trainable, nb_params
        line_new = "{:>20}  {:>25} {:>15}".format(
            layer,
            str(summary[layer]["output_shape"]),
            "{0:,}".format(summary[layer]["nb_params"]),
        )
        total_params += summary[layer]["nb_params"]

        total_output += np.prod(summary[layer]["output_shape"])
        if "trainable" in summary[layer]:
            if summary[layer]["trainable"] == True:
                trainable_params += summary[layer]["nb_params"]
        summary_str += line_new + "\n"

    # assume 4 bytes/number (float on cuda).
    total_input_size = abs(np.prod(sum(input_size, ()))
                           * batch_size * 4. / (1024 ** 2.))
    total_output_size = abs(2. * total_output * 4. /
                            (1024 ** 2.))  # x2 for gradients
    total_params_size = abs(total_params * 4. / (1024 ** 2.))
    total_size = total_params_size + total_output_size + total_input_size

    summary_str += "================================================================" + "\n"
    summary_str += "Total params: {0:,}".format(total_params) + "\n"
    summary_str += "Trainable params: {0:,}".format(trainable_params) + "\n"
    summary_str += "Non-trainable params: {0:,}".format(total_params -
                                                        trainable_params) + "\n"
    summary_str += "----------------------------------------------------------------" + "\n"
    summary_str += "Input size (MB): %0.2f" % total_input_size + "\n"
    summary_str += "Forward/backward pass size (MB): %0.2f" % total_output_size + "\n"
    summary_str += "Params size (MB): %0.2f" % total_params_size + "\n"
    summary_str += "Estimated Total Size (MB): %0.2f" % total_size + "\n"
    summary_str += "----------------------------------------------------------------" + "\n"
    # return summary
    return summary_str, (total_params, trainable_params)

## Additional notes on PyTorch

#### Sparse structures in PyTorch
https://towardsdatascience.com/sparse-matrices-in-pytorch-be8ecaccae6

A sparse structure does not store the whole matrix. It stores only the indices and the values of the nonzero elements.

=> This structure suits arrays in which most elements are 0, such as a one-hot encoding (a vector whose element at the encoded position is 1 and all others are 0).
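
As a minimal sketch of this idea, the block below builds a one-hot matrix in COO format with `torch.sparse_coo_tensor`. The class ids and the number of classes are made-up illustrative values.

```python
import torch

# Illustrative data: one-hot rows for class ids 2, 0, 3 over 5 classes
class_ids = torch.tensor([2, 0, 3])
n_rows, n_classes = len(class_ids), 5

# COO format: a 2 x nnz index matrix (row ids, column ids) plus the values
indices = torch.stack([torch.arange(n_rows), class_ids])
values = torch.ones(n_rows)
one_hot_sparse = torch.sparse_coo_tensor(indices, values, size=(n_rows, n_classes))

# Only 3 values are stored instead of 15
stored = one_hot_sparse.coalesce().values().numel()

# Converting back to dense recovers the usual one-hot matrix
one_hot_dense = one_hot_sparse.to_dense()
print(stored)
print(one_hot_dense)
```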


#### What is ctx?

In PyTorch, `ctx` plays a role similar to `self` in a class: it is a context object used to pass information from `forward()` to `backward()` (the backward gradient computation).
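
A small custom `torch.autograd.Function` shows how this works in practice. The `Square` op below is a hypothetical example written for this note, not something taken from the tutorial.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical custom op y = x^2, written only to show how ctx is used."""

    @staticmethod
    def forward(ctx, x):
        # Stash the input so backward() can read it later
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        # Retrieve what forward() saved; dy/dx = 2x
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x

x = torch.tensor([3.0], requires_grad=True)
y = Square.apply(x)
y.sum().backward()
print(x.grad)  # tensor([6.])
```

Both methods are static, so there is no `self`; `ctx` is the only place where `forward()` can leave data for `backward()`.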