A very good tutorial on building a CNN that also gives a good overview of PyTorch (worth revisiting regularly): https://lelon.io/blog/pytorch-baby-steps#one <br>

Most common neural net mistakes: (by a legend) <br>
1) You didn't try to overfit a single batch first. <br>
2) You forgot to toggle train/eval mode for the net. <br>
3) You forgot to .zero_grad() (in PyTorch) before .backward(). <br>
4) You passed softmaxed outputs to a loss that expects raw logits. <br>
5) You didn't use bias=False for your Linear/Conv2d layer when using BatchNorm, or conversely forgot to include it for the output layer. (This one won't make you fail silently, but the extra biases are spurious parameters.) <br>
6) You thought view() and permute() are the same thing (and used view() incorrectly).
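
Mistakes 3, 4 and 6 are easy to reproduce in a few lines. The sketch below is only an illustration: the tensors and values are made up for the demonstration and do not come from this notebook.

```python
import torch
import torch.nn.functional as F

# Mistake 3: gradients accumulate across backward() calls unless they are reset.
w = torch.tensor(1.0, requires_grad=True)
(3 * w).backward()
first_grad = w.grad.item()          # 3.0
(3 * w).backward()                  # no zero_grad() in between
accumulated_grad = w.grad.item()    # 6.0, not 3.0

# Mistake 4: cross_entropy applies log-softmax itself, so it expects raw logits.
logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])
loss_from_logits = F.cross_entropy(logits, target)
loss_from_probs = F.cross_entropy(torch.softmax(logits, dim=1), target)  # softmax applied twice

# Mistake 6: view() reinterprets the existing memory order, permute() swaps axes.
t = torch.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
viewed = t.view(3, 2)               # [[0, 1], [2, 3], [4, 5]]
permuted = t.permute(1, 0)          # [[0, 3], [1, 4], [2, 5]] (the transpose)

print(first_grad, accumulated_grad)
print(loss_from_logits.item(), loss_from_probs.item())
print(viewed.tolist(), permuted.tolist())
```

Both `viewed` and `permuted` have shape (3, 2), which is why the mix-up tends to go unnoticed: the shapes agree but the element layout does not.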

In this notebook we try to fit a third-order polynomial to y = sin(x) using <br>
1) numpy <br>
2) torch <br>

Learning: <br>
Calling backward() on a tensor (typically the loss) computes the gradient of that tensor with respect to every leaf tensor in its computational graph that has requires_grad=True, and accumulates each gradient in that tensor's .grad attribute.
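
A minimal sketch of that behaviour, using a made-up scalar function (the function and the value 2.0 are chosen only for illustration):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x      # dy/dx = 2x + 3
y.backward()            # fills x.grad with dy/dx evaluated at x = 2
print(x.grad)           # tensor(7.)
```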

Full tutorial:
https://pytorch.org/tutorials/beginner/pytorch_with_examples.html

In [3]:
import numpy as np
import math

In [4]:
# linspace returns evenly spaced points: here 2000 values from -pi to pi
x = np.linspace(-math.pi, math.pi, 2000)
y = np.sin(x)

In [6]:
y

array([-1.22464680e-16, -3.14315906e-03, -6.28628707e-03, ...,
        6.28628707e-03,  3.14315906e-03,  1.22464680e-16])

Each parameter gets its own grad variable. The gradients are intuitive and easy to derive: they follow from the chain rule of differentiation (see below). <br>

dE/dy is the rate of change of the error with respect to y, where y is the prediction (y_pred in the code): how does the error change as y changes? <br>
dE/da = dE/dy * dy/da. We have equations for E and y, so we can substitute them to get how the error changes with respect to a, and then use that to update a. We take a very small step against the gradient, and the step size is governed by the learning rate "lr".
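
One way to convince yourself that the chain-rule gradients are right is to compare them with central finite differences. The sketch below assumes the same sum-of-squares error and cubic model used in this notebook; the helper names (`loss_fn`, `analytic_grads`, `numeric_grads`), the parameter values, and the step `h` are illustrative choices, not part of the original code.

```python
import numpy as np

def loss_fn(params, x, y):
    a, b, c, d = params
    y_pred = a + b * x + c * x**2 + d * x**3
    return np.square(y_pred - y).sum()

def analytic_grads(params, x, y):
    a, b, c, d = params
    y_pred = a + b * x + c * x**2 + d * x**3
    dE_dy = 2.0 * (y_pred - y)  # dE/dy_pred
    # dy_pred/da = 1, dy_pred/db = x, dy_pred/dc = x^2, dy_pred/dd = x^3
    return np.array([
        dE_dy.sum(),
        (dE_dy * x).sum(),
        (dE_dy * x**2).sum(),
        (dE_dy * x**3).sum(),
    ])

def numeric_grads(params, x, y, h=1e-5):
    # Central difference: (E(p + h) - E(p - h)) / (2h), one parameter at a time.
    grads = np.zeros_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = h
        grads[i] = (loss_fn(params + step, x, y) - loss_fn(params - step, x, y)) / (2 * h)
    return grads

x = np.linspace(-np.pi, np.pi, 50)
y = np.sin(x)
params = np.array([0.1, -0.2, 0.3, 0.05])  # arbitrary test point

ga = analytic_grads(params, x, y)
gn = numeric_grads(params, x, y)
print("max abs difference:", np.abs(ga - gn).max())
```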

In [8]:
lr = 1e-6

a = np.random.randn()
b = np.random.randn()
c = np.random.randn()
d = np.random.randn()

# Derivation for the gradients in NLP notebook
for t in range(10000):
    y_pred = a + b*x + c* x**2 + d* x**3
    
    loss = np.square(y_pred - y).sum()
    
    y_grad = 2*(y_pred-y)
    a_grad = (y_grad).sum()
    b_grad = (y_grad*x).sum()
    c_grad = (y_grad* x**2).sum()
    d_grad = (y_grad* x**3).sum()
    if t % 100 == 0:
        print(t, loss)
#         print("y_grad", y_grad)
#         print("a_grad", a_grad)
#         print("b_grad", b_grad)
#         print("c_grad", c_grad)
#         print("d_grad", d_grad)
#         print("-"*15)
    
    a -= lr * a_grad
    b -= lr * b_grad
    c -= lr * c_grad
    d -= lr * d_grad

print(f'Result: y = {a} + {b} x + {c} x^2 + {d} x^3')

0 8895.945760091989
100 1900.6849659161203
200 1259.8572287977734
300 836.0975691162881
400 555.8776441417838
500 370.5759669605484
600 248.04081606146443
700 167.0113267075772
800 113.42837157674413
900 77.99508400398676
1000 54.563721107463195
1100 39.06895383439867
1200 28.822488630847474
1300 22.046624461899018
1400 17.565809229812544
1500 14.602675692063205
1600 12.643166141598478
1700 11.34734322638857
1800 10.490411572644012
1900 9.923716898079604
2000 9.54895566153817
2100 9.301120477235347
2200 9.137222263512351
2300 9.028832405771642
2400 8.957151008721222
2500 8.90974559932911
2600 8.87839446772133
2700 8.857660486965727
2800 8.843947988205992
2900 8.834879074658561
3000 8.828881177645595
3100 8.824914305677705
3200 8.822290672108778
3300 8.82055541274795
3400 8.819407702477369
3500 8.818648588026
3600 8.818146488437835
3700 8.817814379460085
3800 8.817594704741946
3900 8.81744939696562
4000 8.817353278329477
4100 8.817289695917484
4200 8.817247635091892
4300 8.8172198103738

A PyTorch Tensor is conceptually identical to a numpy array. A Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Behind the scenes, Tensors can keep track of a computational graph and gradients, but they’re also useful as a generic tool for scientific computing.
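
As a small illustration of how close the two are, a numpy array can be wrapped as a Tensor and converted back. This is a sketch with made-up values; note that `torch.from_numpy` shares memory with the source array on CPU.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(arr)     # shares memory with arr (CPU only), dtype float64
arr[0] = 10.0                 # the change is visible through the tensor
back = (t * 2).numpy()        # convert a tensor back to a numpy array

print(t)
print(back)
```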

In [2]:
# Using torch

import torch
import math


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# Create input and output data (evenly spaced x, y = sin(x))
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Randomly initialize weights
a = torch.randn((), device=device, dtype=dtype)
b = torch.randn((), device=device, dtype=dtype)
c = torch.randn((), device=device, dtype=dtype)
d = torch.randn((), device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 0:
        print(t, loss)

    # Backprop to compute gradients of the loss with respect to a, b, c, d
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()

    # Update weights using gradient descent
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d


print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')

0 85201.8984375
100 3452.5068359375
200 2305.0283203125
300 1540.684814453125
400 1031.31640625
500 691.7040405273438
600 465.1593322753906
700 313.9584045410156
800 212.9871063232422
900 145.5195770263672
1000 100.41130065917969
1100 70.23267364501953
1200 50.02892303466797
1300 36.493526458740234
1400 27.41893768310547
1500 21.330392837524414
1600 17.242149353027344
1700 14.49479866027832
1800 12.646946907043457
1900 11.402992248535156
Result: y = -0.025448594242334366 + 0.8233986496925354 x + 0.004390306305140257 x^2 + -0.08858775347471237 x^3


This Autograd feature of PyTorch is intriguing! When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. All you have to do is: <br>

1) Define variables (that are to be used in the computation of loss) with "requires_grad=True". <br>
2) Write the loss equation. <br>
3) Call loss.backward(). This call will compute the gradient of loss with respect to all Tensors with requires_grad=True. (After this call a.grad, b.grad, c.grad and d.grad will be Tensors holding the gradient of the loss with respect to a, b, c, d respectively.) <br>

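The torch.no_grad() context manager temporarily disables gradient tracking: operations inside the block are not recorded in the graph, and their results have requires_grad=False. The flags on existing tensors such as a, b, c, d are left unchanged. Visit this wonderful tutorial on Autograd: https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#gradients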
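
A quick way to see what torch.no_grad() actually affects is to inspect the requires_grad flag inside and outside the block. The sketch below uses a made-up scalar tensor: results computed inside the block are not tracked, while the flag on the original tensor stays as it was.

```python
import torch

a = torch.tensor(1.0, requires_grad=True)

with torch.no_grad():
    inside = a * 2            # computed without recording a graph
outside = a * 2               # computed with tracking enabled

print(a.requires_grad)        # True
print(inside.requires_grad)   # False
print(outside.requires_grad)  # True
```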

In [10]:
# Using Autograd....

import torch
import math

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# Create Tensors to hold input and outputs.
# By default, requires_grad=False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Create random Tensors for weights. For a third order polynomial, we need
# 4 weights: y = a + b x + c x^2 + d x^3
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
a = torch.randn((), device=device, dtype=dtype, requires_grad=True)
b = torch.randn((), device=device, dtype=dtype, requires_grad=True)
c = torch.randn((), device=device, dtype=dtype, requires_grad=True)
d = torch.randn((), device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y using operations on Tensors.
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss using operations on Tensors.
    # Now loss is a 0-dimensional Tensor (shape ())
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call a.grad, b.grad, c.grad and d.grad will be Tensors holding
    # the gradient of the loss with respect to a, b, c, d respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    with torch.no_grad():
        a -= learning_rate * a.grad
        b -= learning_rate * b.grad
        c -= learning_rate * c.grad
        d -= learning_rate * d.grad

        # Manually zero the gradients after updating weights
        a.grad = None
        b.grad = None
        c.grad = None
        d.grad = None

print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')

99 259.9820251464844
199 186.5506134033203
299 134.59092712402344
399 97.82147979736328
499 71.80137634277344
599 53.38817596435547
699 40.35801696777344
799 31.137165069580078
899 24.612014770507812
999 19.99445152282715
1099 16.726818084716797
1199 14.414458274841309
1299 12.778117179870605
1399 11.620148658752441
1499 10.800712585449219
1599 10.220829963684082
1699 9.810477256774902
1799 9.520086288452148
1899 9.314591407775879
1999 9.169170379638672
Result: y = 0.01985906809568405 + 0.8566898703575134 x + -0.003426017239689827 x^2 + -0.09332313388586044 x^3


Below is example code that defines a custom autograd function. Under the hood, each primitive autograd operator is really two functions that operate on Tensors. <br> 

1) The forward function computes output Tensors from input Tensors. <br> 
2) The backward function receives the gradient of the output Tensors with respect to some scalar value, and computes the gradient of the input Tensors with respect to that same scalar value.
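
A hand-written backward is easy to get wrong, so it can help to compare it against finite differences with `torch.autograd.gradcheck`. The sketch below uses a hypothetical `Cube` function (not part of this notebook) purely to show the forward/backward pair and the check; gradcheck expects double-precision inputs.

```python
import torch

class Cube(torch.autograd.Function):
    """Hypothetical example op: f(x) = x^3 with a hand-written backward."""

    @staticmethod
    def forward(ctx, x):
        # Stash the input; backward needs it to evaluate df/dx.
        ctx.save_for_backward(x)
        return x ** 3

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Chain rule: dL/dx = dL/df * df/dx = grad_output * 3x^2
        return grad_output * 3 * x ** 2

# gradcheck compares the analytic backward against numerical finite differences.
x = torch.randn(5, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(Cube.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```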

In [15]:
import torch
import math

# P3 here is the degree-3 Legendre polynomial: P3(x) = 0.5 * (5x^3 - 3x)
class LegrangePoly(torch.autograd.Function):

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache arbitrary
        objects for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return 0.5 * (5 * input ** 3 - 3 * input)
    
    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        return grad_output * 1.5 * (5 * input ** 2 - 1)
    
dtype = torch.float
device = torch.device("cpu")

x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Create random Tensors for weights. For this example, we need
# 4 weights: y = a + b * P3(c + d * x), these weights need to be initialized
# not too far from the correct result to ensure convergence.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
a = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
b = torch.full((), -1.5, device=device, dtype=dtype, requires_grad=True)
c = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
d = torch.full((), 0.3, device=device, dtype=dtype, requires_grad=True)

lr = 5e-6

for t in range(5000):
    # To apply our Function, we use Function.apply method. We alias this as 'P3'.
    P3 = LegrangePoly.apply
    
    # Forward pass: compute predicted y using operations; we compute
    # P3 using our custom autograd operation.
    y_pred = a + b * P3(c + d * x)
    
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 0:
        print(t, loss.item())
    
    loss.backward()
    
    with torch.no_grad():
        a -= lr * a.grad
        b -= lr * b.grad
        c -= lr * c.grad
        d -= lr * d.grad
        
        a.grad = None
        b.grad = None
        c.grad = None
        d.grad = None

print(f'Result: y = {a.item()} + {b.item()} * P3({c.item()} + {d.item()} x)')

0 356.0281677246094
100 79.79436492919922
200 56.90309143066406
300 41.41451644897461
400 30.92487335205078
500 23.815921783447266
600 18.995601654052734
700 15.725776672363281
800 13.507123947143555
900 12.001437187194824
1000 10.979339599609375
1100 10.285469055175781
1200 9.814346313476562
1300 9.494412422180176
1400 9.277144432067871
1500 9.129594802856445
1600 9.029376983642578
1700 8.961315155029297
1800 8.915084838867188
1900 8.883682250976562
2000 8.862346649169922
2100 8.847857475280762
2200 8.83801555633545
2300 8.831326484680176
2400 8.826786041259766
2500 8.823702812194824
2600 8.821605682373047
2700 8.820181846618652
2800 8.819214820861816
2900 8.818557739257812
3000 8.81811237335205
3100 8.81781005859375
3200 8.817604064941406
3300 8.817463874816895
3400 8.817367553710938
3500 8.817303657531738
3600 8.8172607421875
3700 8.817232131958008
3800 8.81721019744873
3900 8.817197799682617
4000 8.81718635559082
4100 8.817181587219238
4200 8.817177772521973
4300 8.817174911499023


In [22]:
torch.tensor([1, 2, 3]).unsqueeze(-1)

tensor([[1],
        [2],
        [3]])

In [23]:
x = torch.tensor([1, 2, 3]).unsqueeze(-1)

p = torch.tensor([1, 2, 3])
xx = x.unsqueeze(-1).pow(2)
xx

tensor([[[1]],

        [[4]],

        [[9]]])

In [21]:
xx

tensor([[[ 1,  1,  1]],

        [[ 2,  4,  8]],

        [[ 3,  9, 27]]])

The nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks.
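
To make "internal state" concrete, the sketch below builds the same kind of small model used in this notebook and lists the learnable Tensors it holds. The random batch of 5 rows is an arbitrary stand-in for real input.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(3, 1),     # holds a (1, 3) weight and a (1,) bias
    torch.nn.Flatten(0, 1),    # no parameters, only reshapes the output
)

# Learnable state, keyed by "<index in Sequential>.<parameter name>".
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
print(shapes)      # {'0.weight': (1, 3), '0.bias': (1,)}

batch = torch.randn(5, 3)      # 5 samples with 3 features each
out = model(batch)
print(out.shape)   # torch.Size([5])
```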

In [25]:
import torch
import math


# Create Tensors to hold input and outputs.
x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)


p = torch.tensor([1, 2, 3])
xx = x.unsqueeze(-1).pow(p)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. The Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
# The Flatten layer flattens the output of the linear layer to a 1D tensor,
# to match the shape of `y`.
model = torch.nn.Sequential(
    torch.nn.Linear(3,1),
    torch.nn.Flatten(0,1)
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

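# Step size for the manual update below (same value as in the autograd cell above)
learning_rate = 1e-6
for t in range(5000):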
    
    # Passing 2000x3 input
    y_pred = model(xx)
    
    loss = loss_fn(y_pred, y)
    if t % 100 == 0:
        print(t, loss.item())
        
        
    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

# You can access the first layer of `model` like accessing the first item of a list
linear_layer = model[0]

# For linear layer, its parameters are stored as `weight` and `bias`.
print(f'Result: y = {linear_layer.bias.item()} + {linear_layer.weight[:, 0].item()} x + {linear_layer.weight[:, 1].item()} x^2 + {linear_layer.weight[:, 2].item()} x^3')

0 35012.96875
100 569.6470947265625
200 389.2541198730469
300 267.16632080078125
400 184.45217895507812
500 128.35369873046875
600 90.26519012451172
700 64.37628936767578
800 46.75971603393555
900 34.75875473022461
1000 26.573970794677734
1100 20.985580444335938
1200 17.165512084960938
1300 14.551153182983398
1400 12.75997257232666
1500 11.531349182128906
1600 10.687653541564941
1700 10.107610702514648
1800 9.708394050598145
1900 9.433326721191406
2000 9.243589401245117
2100 9.11257553100586
2200 9.022014617919922
2300 8.95935344696045
2400 8.915950775146484
2500 8.885856628417969
2600 8.864974021911621
2700 8.850467681884766
2800 8.840380668640137
2900 8.833362579345703
3000 8.828474044799805
3100 8.825067520141602
3200 8.822690963745117
3300 8.82103157043457
3400 8.819872856140137
3500 8.819062232971191
3600 8.818496704101562
3700 8.818098068237305
3800 8.817821502685547
3900 8.817626953125
4000 8.817489624023438
4100 8.81739330291748
4200 8.817325592041016
4300 8.817279815673828
440

Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters with torch.no_grad(). This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like AdaGrad, RMSProp, Adam, etc.
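
For plain gradient descent, an optimizer does the same thing as the manual update. The sketch below compares one manual step with one `torch.optim.SGD` step on a made-up one-parameter problem; the loss, starting value, and learning rate are arbitrary illustrative choices.

```python
import torch

lr = 0.1
w_manual = torch.tensor(1.0, requires_grad=True)
w_optim = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w_optim], lr=lr)

# One manual gradient-descent step on loss = (w - 3)^2.
loss = (w_manual - 3.0) ** 2
loss.backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad
w_manual.grad = None

# The same step, delegated to the optimizer.
optimizer.zero_grad()
loss = (w_optim - 3.0) ** 2
loss.backward()
optimizer.step()

print(w_manual.item(), w_optim.item())
```

With more elaborate optimizers only the constructor changes; the zero_grad / backward / step pattern in the loop stays the same.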

In [27]:
import torch
import math


# Create Tensors to hold input and outputs.
x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)


p = torch.tensor([1, 2, 3])
xx = x.unsqueeze(-1).pow(p)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. The Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
# The Flatten layer flattens the output of the linear layer to a 1D tensor,
# to match the shape of `y`.
model = torch.nn.Sequential(
    torch.nn.Linear(3,1),
    torch.nn.Flatten(0,1)
)

loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use RMSprop; the optim package contains many other
# optimization algorithms. The first argument to the RMSprop constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-3
optimizer = torch.optim.RMSprop(model.parameters(), lr=learning_rate)

for t in range(2000):
    y_pred = model(xx)
    loss = loss_fn(y_pred, y)
    if t % 100 == 0:
        print(t, loss.item())
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

linear_layer = model[0]
print(f'Result: y = {linear_layer.bias.item()} + {linear_layer.weight[:, 0].item()} x + {linear_layer.weight[:, 1].item()} x^2 + {linear_layer.weight[:, 2].item()} x^3')

0 22764.03125
100 3713.03662109375
200 1920.541259765625
300 1476.4471435546875
400 1191.3753662109375
500 947.5926513671875
600 735.4619140625
700 553.8648071289062
800 403.3292236328125
900 282.95233154296875
1000 190.40501403808594
1100 122.24386596679688
1200 74.36962890625
1300 42.0420036315918
1400 22.90239143371582
1500 13.074748992919922
1600 9.662647247314453
1700 8.929827690124512
1800 8.90841293334961
1900 8.930431365966797
Result: y = 0.0004279618733562529 + 0.8572030663490295 x + 0.00043960666516795754 x^2 + -0.09280064702033997 x^3
