### The most impressive thing the framework does is compute the entire backward pass for us.

In [1]:
import torch

To have autograd track gradients for a tensor, we can set `requires_grad` manually, in either of two ways:

In [3]:
# Method 1
x = torch.randn(3,4,requires_grad=True)
x

tensor([[-0.4331,  0.3370,  1.5373, -0.3957],
        [-0.7298,  0.9322, -0.0695,  1.0893],
        [ 2.6122, -0.8767,  0.9718,  1.1008]], requires_grad=True)

In [5]:
# Method 2
x = torch.randn(3,4)
x.requires_grad=True
x

tensor([[-0.6246, -3.0934, -0.6821,  0.0772],
        [-0.0456,  1.2092, -0.1928,  0.3894],
        [ 2.0155, -1.1630, -0.6860,  1.1175]], requires_grad=True)

In [9]:
b = torch.randn(3,4,requires_grad=True)

In [10]:
t = x + b
t

tensor([[ 1.6527, -4.3626, -2.0832, -0.9918],
        [-0.0331,  1.7364,  0.6924,  0.1056],
        [ 2.9148,  0.5487, -0.2787,  3.5601]], grad_fn=<AddBackward0>)

In [11]:
y = t.sum()
y

tensor(3.4613, grad_fn=<SumBackward0>)

In [12]:
y.backward()

In [13]:
b.grad

tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])

Although `requires_grad` was never set explicitly for `t`, it is `True` automatically: `t` is computed from tensors that require gradients, so autograd tracks it in order to propagate `backward()` through it.

In [14]:
x.requires_grad, b.requires_grad, t.requires_grad

(True, True, True)
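The propagation rule can be checked in isolation. The following is a minimal sketch; the names `a`, `c`, `s`, `p`, and `q` are illustrative and do not appear in the cells above.

```python
import torch

# One tensor that tracks gradients and one that does not.
a = torch.ones(2, 2, requires_grad=True)
c = torch.ones(2, 2)

# A result computed from at least one tracked tensor is tracked as well.
s = a + c
# A result computed only from untracked tensors is not.
p = c * 2

# Inside torch.no_grad(), tracking is switched off even for tracked inputs.
with torch.no_grad():
    q = a + c

flags = (s.requires_grad, p.requires_grad, q.requires_grad)
print(flags)  # (True, False, False)
```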

### Let's look at an example:

![title](./img/2.png)

In [15]:
#the flow of calculation
x = torch.rand(1)
b = torch.rand(1, requires_grad = True)
w = torch.rand(1, requires_grad = True)
y = w * x 
z = y + b 

In [16]:
x.requires_grad, b.requires_grad, w.requires_grad, y.requires_grad # note that y also requires grad

(False, True, True, True)

In [59]:
# leaf vs. non-leaf tensors:
x.is_leaf, w.is_leaf, b.is_leaf, y.is_leaf, z.is_leaf

(True, True, True, False, False)
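For this graph the chain rule gives dz/dw = x and dz/db = 1. The sketch below checks that with fixed values instead of `torch.rand`, an assumption made only so the result is reproducible.

```python
import torch

x = torch.tensor([0.5])                      # input, no gradient needed
w = torch.tensor([2.0], requires_grad=True)  # leaf parameter
b = torch.tensor([1.0], requires_grad=True)  # leaf parameter

y = w * x   # intermediate (non-leaf) node, created by an operation
z = y + b   # output node

z.backward()

# Leaves created by the user have no grad_fn; results of operations do.
leaf_has_no_fn = w.grad_fn is None
nonleaf_has_fn = y.grad_fn is not None

# dz/dw equals x and dz/db equals 1.
print(w.grad, b.grad)
```

Only the leaf tensors `w` and `b` receive a populated `.grad`; by default, intermediate tensors such as `y` do not keep theirs.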

### Backpropagation calculation

Backpropagation is a key algorithm in training neural networks. It computes the gradient of the error with respect to the network's weights; an optimizer such as gradient descent then uses those gradients to minimize the cost function.

In [30]:
z.backward(retain_graph=True) # If the gradients are not cleared, they will accumulate

When training a model, it is crucial to zero out the gradients at the start of each optimization step. By default, every call to `backward()` adds to the values already stored in `.grad` instead of overwriting them. Because dz/db is always 1 here, a `b.grad` of 3 means that `backward()` has been called three times without clearing.

If the gradients are not cleared, the parameter updates are computed from stale, summed gradients, which can lead to incorrect training results.

In [31]:
w.grad

tensor([1.3093])

In [32]:
b.grad

tensor([3.])

In [33]:
z.backward(retain_graph=True) # If the gradients are not cleared, they will accumulate
b.grad

tensor([4.])
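The accumulation and the effect of clearing can be reproduced with fixed numbers. This is a minimal sketch: the values 2.0 and 3.0 are arbitrary, and `accumulated` and `fresh` are illustrative names.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([3.0])

# Two backward passes without clearing: the gradients add up.
(w * x).sum().backward()
(w * x).sum().backward()
accumulated = w.grad.clone()   # 3 + 3 = 6

# Clear the buffer in place, then run one more backward pass.
w.grad.zero_()
(w * x).sum().backward()
fresh = w.grad.clone()         # 3

print(accumulated.item(), fresh.item())  # 6.0 3.0
```

Each call here builds a new graph, so `retain_graph=True` is not needed; it is only required when calling `backward()` repeatedly on the same graph, as in the cells above.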

### Let's try linear regression as an experiment

Linear regression is a basic and commonly used type of predictive analysis, especially useful in establishing a relationship between two or more variables.

Construct a set of input data `x_train` and its corresponding labels `y_train`.

The `numpy.reshape` function reorganizes an array into the specified dimensions without changing its data.

Here, `reshape(-1, 1)` turns `x_train` into a single column with as many rows as needed to hold the original data.

Passing -1 for a dimension tells NumPy to infer that dimension's length from the array's size and the other dimensions.
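A small sketch of how -1 is resolved, using a six-element array chosen only for illustration:

```python
import numpy as np

a = np.arange(6, dtype=np.float32)   # shape (6,)

col = a.reshape(-1, 1)   # one column, row count inferred
two = a.reshape(2, -1)   # two rows, column count inferred
flat = col.reshape(-1)   # back to one dimension

print(col.shape, two.shape, flat.shape)  # (6, 1) (2, 3) (6,)
```

At most one dimension may be -1, since NumPy can infer only one unknown length.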

In [35]:
import numpy as np
x_values = [i for i in range(11)]
x_train = np.array(x_values, dtype=np.float32)
x_train = x_train.reshape(-1, 1)
print(x_train,'\n')
print('shape',x_train.shape)

[[ 0.]
 [ 1.]
 [ 2.]
 [ 3.]
 [ 4.]
 [ 5.]
 [ 6.]
 [ 7.]
 [ 8.]
 [ 9.]
 [10.]] 

shape (11, 1)


In [36]:
y_values = [2*i + 1 for i in x_values]
y_train = np.array(y_values, dtype=np.float32)
y_train = y_train.reshape(-1, 1)
print(y_train,'\n')
print('shape',y_train.shape)

[[ 1.]
 [ 3.]
 [ 5.]
 [ 7.]
 [ 9.]
 [11.]
 [13.]
 [15.]
 [17.]
 [19.]
 [21.]] 

shape (11, 1)


In [37]:
import torch
import torch.nn as nn

Linear Regression Model

In fact, linear regression is just a fully connected layer without an activation function.

A fully connected layer in a neural network is a layer where each input node is connected to each output node. In the context of a linear regression model, this would mean that all input feature variables contribute linearly to the output predictions. As mentioned, we do not apply an activation function in linear regression, as the output is a linear combination of input features.
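To make the "linear combination" concrete, the sketch below compares `nn.Linear` with the explicit formula `x @ W.T + b`. The seed and the sample inputs are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only so the run is repeatable
layer = nn.Linear(in_features=1, out_features=1)
x = torch.tensor([[0.0], [1.0], [2.0]])

# nn.Linear stores its weight with shape (out_features, in_features).
manual = x @ layer.weight.T + layer.bias
same = torch.allclose(layer(x), manual)

print(same)  # True
```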

In [38]:
class LinearRegressionModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LinearRegressionModel, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)  

    def forward(self, x):
        out = self.linear(x)
        return out

In [39]:
input_dim = 1
output_dim = 1

model = LinearRegressionModel(input_dim, output_dim)

In [40]:
model

LinearRegressionModel(
  (linear): Linear(in_features=1, out_features=1, bias=True)
)

Specify the parameters and loss function.

In [41]:
epochs = 1000
learning_rate = 0.01
# Stochastic Gradient Descent (SGD) is the optimizer, using the learning rate defined above
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
# MSELoss computes the mean squared error (MSE) between the predictions and the targets
criterion = nn.MSELoss()
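What these two objects compute can be sketched by hand on toy numbers. The tensors `pred`, `target`, and `p` below are hypothetical and unrelated to the regression data.

```python
import torch
import torch.nn as nn

pred = torch.tensor([[1.0], [2.0], [4.0]])
target = torch.tensor([[1.0], [3.0], [2.0]])

# With the default reduction="mean", MSELoss averages the squared errors.
loss_builtin = nn.MSELoss()(pred, target)
loss_manual = ((pred - target) ** 2).mean()   # (0 + 1 + 4) / 3

# One plain SGD step performs p <- p - lr * grad.
p = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([p], lr=0.1)
(p ** 2).sum().backward()   # gradient is 2 * p = 2
opt.step()                  # 1 - 0.1 * 2 = 0.8

print(loss_builtin.item(), p.item())
```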

Train the model

In [42]:
# Remember to convert the ndarrays into tensors (done once, outside the loop)
inputs = torch.from_numpy(x_train)
labels = torch.from_numpy(y_train)

for epoch in range(1, epochs + 1):

    # At each iteration, the gradient must be cleared to zero
    optimizer.zero_grad() 

    # Forward propagation
    outputs = model(inputs)

    # Calculate the loss
    loss = criterion(outputs, labels)

    # Back propagation
    loss.backward()

    # Update weight parameters
    optimizer.step()
    if epoch % 50 == 0:
        print('epoch {}, loss {}'.format(epoch, loss.item()))

epoch 50, loss 0.10558705031871796
epoch 100, loss 0.06022291257977486
epoch 150, loss 0.034348901361227036
epoch 200, loss 0.01959127001464367
epoch 250, loss 0.011174176819622517
epoch 300, loss 0.006373362150043249
epoch 350, loss 0.003635118715465069
epoch 400, loss 0.002073340117931366
epoch 450, loss 0.0011825737310573459
epoch 500, loss 0.0006744735292159021
epoch 550, loss 0.00038468989077955484
epoch 600, loss 0.00021941652812529355
epoch 650, loss 0.00012514815898612142
epoch 700, loss 7.137949432944879e-05
epoch 750, loss 4.071067814948037e-05
epoch 800, loss 2.3219095965032466e-05
epoch 850, loss 1.324387631029822e-05
epoch 900, loss 7.553593150078086e-06
epoch 950, loss 4.3082504816993605e-06
epoch 1000, loss 2.457582922943402e-06
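To see roughly what `optimizer.zero_grad()`, `loss.backward()`, and `optimizer.step()` do together, the same fit can be sketched with raw tensors and no `nn.Module`. This is an illustrative full-batch version, not the notebook's method: the zero initialization and the names `w`, `b`, and `lr` are assumptions.

```python
import torch

# Same data as above: y = 2x + 1 for x = 0..10.
x = torch.arange(11, dtype=torch.float32).reshape(-1, 1)
y = 2 * x + 1

w = torch.zeros(1, 1, requires_grad=True)   # weight, starts at 0
b = torch.zeros(1, requires_grad=True)      # bias, starts at 0
lr = 0.01

for _ in range(1000):
    loss = ((x @ w + b - y) ** 2).mean()    # forward pass and MSE
    loss.backward()                         # fills w.grad and b.grad
    with torch.no_grad():                   # the update itself is not tracked
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()                          # clear for the next iteration
    b.grad.zero_()

print(w.item(), b.item())   # close to 2 and 1
```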


### Test the model's predictions

In [43]:
with torch.no_grad():  # no gradient tracking is needed for inference
    predicted = model(torch.from_numpy(x_train)).numpy()
predicted

array([[ 0.99708414],
       [ 2.997504  ],
       [ 4.997924  ],
       [ 6.998344  ],
       [ 8.998764  ],
       [10.999184  ],
       [12.999603  ],
       [15.000023  ],
       [17.000443  ],
       [19.000862  ],
       [21.001282  ]], dtype=float32)

### Model saving and loading

In [44]:
torch.save(model.state_dict(), 'model.pkl')

In [45]:
model.load_state_dict(torch.load('model.pkl'))

<All keys matched successfully>
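A round trip can confirm that the restored parameters reproduce the original outputs. The sketch below uses a bare `nn.Linear` and a temporary directory so that it leaves no file behind; `src`, `dst`, and the `map_location` argument are choices made for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

src = nn.Linear(1, 1)
dst = nn.Linear(1, 1)   # same architecture, different random weights

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    torch.save(src.state_dict(), path)
    # map_location="cpu" keeps the load working on machines without a GPU.
    dst.load_state_dict(torch.load(path, map_location="cpu"))

x = torch.tensor([[3.0]])
match = torch.equal(src(x), dst(x))
print(match)  # True
```

A `state_dict` holds only the parameters, so the model class has to be constructed first, and the weights are then loaded into it.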

### Training using GPU

You just need to move the data and the model to the CUDA device.

Training on a GPU can significantly shorten training time, especially in deep learning, because GPUs are designed to perform matrix operations efficiently, and those operations dominate deep learning workloads. CUDA is NVIDIA's parallel computing platform and programming model for general-purpose computing on its GPUs. By selecting a CUDA device, you move the model and the data onto the GPU when one is available.

In [46]:
import torch
import torch.nn as nn
import numpy as np


class LinearRegressionModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LinearRegressionModel, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)  

    def forward(self, x):
        out = self.linear(x)
        return out

input_dim = 1
output_dim = 1

model = LinearRegressionModel(input_dim, output_dim)

# Use the GPU if one is available; otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)


criterion = nn.MSELoss()


learning_rate = 0.01

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

epochs = 1000
inputs = torch.from_numpy(x_train).to(device)
labels = torch.from_numpy(y_train).to(device)

for epoch in range(1, epochs + 1):

    optimizer.zero_grad() 

    outputs = model(inputs)

    loss = criterion(outputs, labels)

    loss.backward()

    optimizer.step()

    if epoch % 50 == 0:
        print('epoch {}, loss {}'.format(epoch, loss.item()))

epoch 50, loss 0.2915838360786438
epoch 100, loss 0.16630853712558746
epoch 150, loss 0.09485612064599991
epoch 200, loss 0.05410239100456238
epoch 250, loss 0.03085799515247345
epoch 300, loss 0.017600249499082565
epoch 350, loss 0.010038497857749462
epoch 400, loss 0.005725600756704807
epoch 450, loss 0.003265665378421545
epoch 500, loss 0.0018626069650053978
epoch 550, loss 0.0010623658308759332
epoch 600, loss 0.0006059351726435125
epoch 650, loss 0.00034560466883704066
epoch 700, loss 0.00019711993809323758
epoch 750, loss 0.00011243174958508462
epoch 800, loss 6.412637594621629e-05
epoch 850, loss 3.657592969830148e-05
epoch 900, loss 2.0861582015641034e-05
epoch 950, loss 1.1898667253262829e-05
epoch 1000, loss 6.78708011037088e-06
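After training on a GPU, the outputs live on that device and have to be moved back to the CPU before they can become NumPy arrays. The following is a minimal sketch with an untrained `nn.Linear` standing in for the model; it also runs on a CPU-only machine.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# For modules, .to(device) moves the parameters in place and returns the module.
model = nn.Linear(1, 1).to(device)
# For tensors, .to(device) returns a new tensor on the target device.
inputs = torch.ones(4, 1).to(device)

with torch.no_grad():
    outputs = model(inputs)

# NumPy only works with CPU memory, so copy the result back first.
result = outputs.cpu().numpy()
print(result.shape)  # (4, 1)
```

The model and its inputs must be on the same device; mixing a CUDA model with CPU tensors raises a runtime error.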
