<a href="https://colab.research.google.com/github/kimsooyoung/pytorch_til/blob/main/youtube_series/4_building_models_with_pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## `torch.nn.Module` and `torch.nn.Parameter`

One important behavior of `torch.nn.Module` is registering parameters. If a particular Module subclass has learning weights, these weights are expressed as instances of `torch.nn.Parameter`. The Parameter class is a subclass of `torch.Tensor`, with the special behavior that when a Parameter is assigned as an attribute of a Module, it is added to the list of that module’s parameters. These parameters may be accessed through the `parameters()` method on the Module class.
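
To make the registration behavior concrete, here is a minimal sketch. The `Scaler` class and its attribute names are hypothetical, invented only for this illustration; it compares an attribute wrapped in `torch.nn.Parameter` with a plain tensor attribute.

```python
import torch

class Scaler(torch.nn.Module):
    """Hypothetical module used only to illustrate parameter registration."""

    def __init__(self):
        super().__init__()
        # Wrapped in Parameter: registered with the module.
        self.scale = torch.nn.Parameter(torch.ones(3))
        # Plain tensor attribute: not registered, so parameters() skips it.
        self.offset = torch.zeros(3)

    def forward(self, x):
        return x * self.scale + self.offset

scaler = Scaler()
param_names = [name for name, _ in scaler.named_parameters()]
print(param_names)  # ['scale']
```

Only `scale` shows up, which is why an optimizer built from `scaler.parameters()` would never update `offset`.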


As an example, here’s a very simple model with two linear layers and an activation function. We’ll create an instance of it and ask it to report on its parameters:



In [None]:
import torch

class TinyModel(torch.nn.Module):

    def __init__(self):
        super(TinyModel, self).__init__()

        self.linear1 = torch.nn.Linear(100, 200)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(200, 10)
        # No dim is given here, so PyTorch warns and infers one; passing dim=1 explicitly is safer.
        self.softmax = torch.nn.Softmax()

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x

tinymodel = TinyModel()

print('The model:')
print(tinymodel)

print('\n\nJust one layer:')
print(tinymodel.linear2)

print('\n\nModel params:')
for param in tinymodel.parameters():
    print(param)

print('\n\nLayer params:')
for param in tinymodel.linear2.parameters():
    print(param)

The model:
TinyModel(
  (linear1): Linear(in_features=100, out_features=200, bias=True)
  (activation): ReLU()
  (linear2): Linear(in_features=200, out_features=10, bias=True)
  (softmax): Softmax(dim=None)
)


Just one layer:
Linear(in_features=200, out_features=10, bias=True)


Model params:
Parameter containing:
tensor([[ 0.0422, -0.0064, -0.0296,  ..., -0.0506, -0.0024,  0.0316],
        [ 0.0690,  0.0997, -0.0809,  ..., -0.0023,  0.0551, -0.0145],
        [-0.0646,  0.0662,  0.0890,  ..., -0.0688,  0.0827,  0.0901],
        ...,
        [ 0.0178,  0.0048, -0.0026,  ...,  0.0307,  0.0820, -0.0692],
        [ 0.0379, -0.0731, -0.0311,  ..., -0.0770, -0.0619,  0.0920],
        [-0.0068, -0.0298, -0.0436,  ...,  0.0265,  0.0568, -0.0203]],
       requires_grad=True)
Parameter containing:
tensor([ 6.3280e-02,  9.3010e-02, -7.3512e-02,  9.1296e-03, -1.3778e-02,
         1.0001e-02, -7.3201e-02, -5.1876e-02, -4.6833e-02, -5.9591e-03,
         9.2902e-02, -1.6865e-02,  4.9538e-02, -2.2501

> This shows the fundamental structure of a PyTorch model: there is an `__init__()` method that defines the layers and other components of a model, and a `forward()` method where the computation gets done. Note that we can print the model, or any of its submodules, to learn about its structure.



# Common Layer Types

## Linear Layers

The most basic type of neural network layer is a linear or fully connected layer. This is a layer where every input influences every output of the layer to a degree specified by the layer’s weights. If a layer has m inputs and n outputs, it has m x n weights; PyTorch stores them as an n x m matrix, with shape `(out_features, in_features)`. For example:



In [None]:
lin = torch.nn.Linear(3, 2)
x = torch.rand(1, 3)
print('Input : ')
print(x)

print('\n\nWeight and Bias parameters:')
for param in lin.parameters():
    print(param)

y = lin(x)
print('\n\nOutput:')
print(y)

Input : 
tensor([[0.2064, 0.6039, 0.0304]])


Weight and Bias parameters:
Parameter containing:
tensor([[-0.2475,  0.0194, -0.5053],
        [-0.4036, -0.1715, -0.2951]], requires_grad=True)
Parameter containing:
tensor([0.0609, 0.4465], requires_grad=True)


Output:
tensor([[0.0062, 0.2507]], grad_fn=<AddmmBackward0>)


- If you multiply x by the transpose of the linear layer’s weight matrix and add the biases, you’ll find that you get the output vector y.

- One other important feature to note: when we printed the layer’s parameters with `lin.parameters()`, each one reported itself as a Parameter (which is a subclass of Tensor) and let us know that it’s tracking gradients with autograd (`requires_grad=True`). This default behavior of Parameter differs from Tensor, which does not track gradients unless asked to.

- Linear layers are used widely in deep learning models. One of the most common places you’ll see them is in classifier models, which will usually have one or more linear layers at the end, where the last layer will have n outputs, where n is the number of classes the classifier addresses.
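
The matrix multiplication mentioned in the first bullet can be checked directly. This is a small sketch; the seed and the variable names `manual_y` and `layer_y` are arbitrary choices for illustration. Note the transpose: `torch.nn.Linear` stores its weight with shape `(out_features, in_features)`.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability
lin = torch.nn.Linear(3, 2)
x = torch.rand(1, 3)

# The weight has shape (out_features, in_features), hence the transpose.
manual_y = x @ lin.weight.T + lin.bias
layer_y = lin(x)

print(lin.weight.shape)
print(torch.allclose(manual_y, layer_y, atol=1e-6))
```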

## Convolutional Layers

Convolutional layers are built to handle data with a high degree of spatial correlation. They are very commonly used in computer vision, where they detect close groupings of features which they compose into higher-level features. They pop up in other contexts too - for example, in NLP applications, where a word’s immediate context (that is, the other words nearby in the sequence) can affect the meaning of a sentence.

> We saw convolutional layers in action in LeNet5 in an earlier video:

In [None]:
import torch.nn.functional as F  # relu, max_pool2d and log_softmax live in torch.nn.functional


class LeNet(torch.nn.Module):

    def __init__(self):
        super(LeNet, self).__init__()
        # 1 input image channel (black & white), 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = torch.nn.Conv2d(1, 6, 5)
        self.conv2 = torch.nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = torch.nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = torch.nn.Linear(120, 84)
        self.fc3 = torch.nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square, you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features


Let’s break down what’s happening in the convolutional layers of this model. Starting with conv1:

- LeNet5 is meant to take in a 1x32x32 black & white image. The first argument to a convolutional layer’s constructor is the number of input channels. Here, it is 1. If we were building this model to look at 3 color channels, it would be 3.

- A convolutional layer is like a window that scans over the image, looking for a pattern it recognizes. These patterns are called features, and one of the parameters of a convolutional layer is the number of features we would like it to learn. This is the second argument to the constructor: the number of output features. Here, we’re asking our layer to learn 6 features.

- Just above, I likened the convolutional layer to a window - but how big is the window? The third argument is the window or kernel size. Here, the “5” means we’ve chosen a 5x5 kernel. (If you want a kernel with height different from width, you can specify a tuple for this argument - e.g., (3, 5) to get a 3x5 convolution kernel.)
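
To see how these constructor arguments affect tensor shapes, the sketch below pushes a random 1x32x32 input through two stand-alone convolution layers built with the same arguments as `conv1` and `conv2` in the model above. The random input and the `shapes` list are only for illustration.

```python
import torch
import torch.nn.functional as F

conv1 = torch.nn.Conv2d(1, 6, 5)   # 1 input channel, 6 features, 5x5 kernel
conv2 = torch.nn.Conv2d(6, 16, 3)  # 6 input channels, 16 features, 3x3 kernel

x = torch.rand(1, 1, 32, 32)       # a batch of one 1x32x32 image
shapes = [tuple(x.shape)]

x = F.max_pool2d(F.relu(conv1(x)), 2)
shapes.append(tuple(x.shape))      # 32 - 5 + 1 = 28, halved by pooling to 14

x = F.max_pool2d(F.relu(conv2(x)), 2)
shapes.append(tuple(x.shape))      # 14 - 3 + 1 = 12, halved by pooling to 6

for s in shapes:
    print(s)
```

The final 16x6x6 shape is where the `16 * 6 * 6` input size of `fc1` comes from.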

## Recurrent Layers

Recurrent neural networks (or RNNs) are used for sequential data - anything from time-series measurements from a scientific instrument to natural language sentences to DNA nucleotides. An RNN does this by maintaining a hidden state that acts as a sort of memory for what it has seen in the sequence so far.



In [None]:
class LSTMTagger(torch.nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = torch.nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = torch.nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = torch.nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

The constructor has four arguments:

- `vocab_size` is the number of words in the input vocabulary. Each word is a one-hot vector (or unit vector) in a vocab_size-dimensional space.

- `tagset_size` is the number of tags in the output set.

- `embedding_dim` is the size of the embedding space for the vocabulary. An embedding maps a vocabulary onto a low-dimensional space, where words with similar meanings are close together in the space.

- `hidden_dim` is the size of the LSTM’s memory.
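
To show how the four sizes fit together, here is a rough sketch that chains the same three layer types by hand and prints the intermediate shapes. All sizes and the word indices in `sentence` are arbitrary values picked for illustration, not values from the original model.

```python
import torch
import torch.nn.functional as F

# Arbitrary illustrative sizes.
vocab_size, embedding_dim, hidden_dim, tagset_size = 20, 6, 8, 3

embedding = torch.nn.Embedding(vocab_size, embedding_dim)
lstm = torch.nn.LSTM(embedding_dim, hidden_dim)
hidden2tag = torch.nn.Linear(hidden_dim, tagset_size)

sentence = torch.tensor([4, 11, 0, 7, 19])              # five word indices
embeds = embedding(sentence)                            # (5, embedding_dim)
lstm_out, _ = lstm(embeds.view(len(sentence), 1, -1))   # (5, 1, hidden_dim)
tag_space = hidden2tag(lstm_out.view(len(sentence), -1))
tag_scores = F.log_softmax(tag_space, dim=1)            # (5, tagset_size)

print(embeds.shape, lstm_out.shape, tag_scores.shape)
```

Each row of `tag_scores` holds log-probabilities over the tag set for one word of the sentence.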

## Transformers

PyTorch has a Transformer class that allows you to define the overall parameters of a transformer model - the number of attention heads, the number of encoder & decoder layers, dropout and activation functions, etc.

For details, check out the documentation on transformer classes, and the relevant [tutorial](https://pytorch.org/tutorials/beginner/transformer_tutorial.html) on pytorch.org.
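
As a minimal sketch of what setting those overall parameters looks like, the snippet below builds a deliberately tiny `torch.nn.Transformer` and runs random tensors through it. The sizes are arbitrary and far smaller than any practical model; with the default `batch_first=False`, inputs are laid out as (sequence length, batch, `d_model`).

```python
import torch

# Deliberately tiny, arbitrary sizes so the example runs quickly.
model = torch.nn.Transformer(
    d_model=16,             # embedding size of each token
    nhead=4,                # number of attention heads
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=32,
    dropout=0.1,
)

src = torch.rand(10, 2, 16)  # (source length, batch, d_model)
tgt = torch.rand(7, 2, 16)   # (target length, batch, d_model)
out = model(src, tgt)
print(out.shape)
```

The output has the same shape as `tgt`: one `d_model`-sized vector per target position.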

## Other Layers and Functions

### Data Manipulation Layers

There are other layer types that perform important functions in models, but don’t participate in the learning process themselves. **Max pooling** (and its twin, min pooling) reduces a tensor by combining cells and assigning the maximum value of the input cells to the output cell, as we saw in the `max_pool2d` calls in LeNet above. For example:



In [None]:
my_tensor = torch.rand(1, 6, 6)
print(my_tensor)

maxpool_layer = torch.nn.MaxPool2d(3)
print(maxpool_layer(my_tensor))

tensor([[[0.5747, 0.6558, 0.6151, 0.5992, 0.7093, 0.6753],
         [0.0551, 0.7878, 0.7702, 0.7950, 0.4861, 0.2960],
         [0.2916, 0.8792, 0.8282, 0.3240, 0.9235, 0.2314],
         [0.2363, 0.2916, 0.0016, 0.1204, 0.4170, 0.2871],
         [0.0452, 0.8503, 0.6008, 0.2773, 0.8376, 0.4313],
         [0.5447, 0.0856, 0.5336, 0.0569, 0.5850, 0.2095]]])
tensor([[[0.8792, 0.9235],
         [0.8503, 0.8376]]])
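
One way to convince yourself of what the pooling layer computes is to reproduce it by hand. The sketch below assumes a 6x6 input and a 3x3 window, as in the cell above; it splits each axis into two blocks of three and takes the maximum within each block. The reshaping trick is just one possible way to do this.

```python
import torch

my_tensor = torch.rand(1, 6, 6)
pooled = torch.nn.MaxPool2d(3)(my_tensor)

# Split each 6-long axis into 2 blocks of 3, then take the max within each block.
manual = my_tensor.view(1, 2, 3, 2, 3).amax(dim=(2, 4))

print(pooled.shape)
print(torch.equal(pooled, manual))
```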


**Normalization layers** re-center and normalize the output of one layer before feeding it to another. Centering and scaling the intermediate tensors has a number of beneficial effects, such as letting you use higher learning rates without exploding/vanishing gradients.

This is beneficial because many activation functions (discussed below) have their strongest gradients near 0, but sometimes suffer from vanishing or exploding gradients for inputs that drive them far away from zero. Keeping the data centered around the area of steepest gradient will tend to mean faster, better learning and higher feasible learning rates.




In [None]:
my_tensor = torch.rand(1, 4, 4) * 20 + 5
print(my_tensor)

print(my_tensor.mean())

norm_layer = torch.nn.BatchNorm1d(4)
normed_tensor = norm_layer(my_tensor)
print(normed_tensor)

print(normed_tensor.mean())

tensor([[[ 5.5855,  7.4518, 11.1203, 13.6877],
         [23.0146, 18.0233,  9.5671, 17.6425],
         [23.0110,  6.1079, 16.2284, 10.0754],
         [17.3357, 23.6466,  6.8381,  7.0104]]])
tensor(13.5216)
tensor([[[-1.2307, -0.6381,  0.5268,  1.3420],
         [ 1.2354,  0.1995, -1.5555,  0.1205],
         [ 1.4308, -1.2109,  0.3708, -0.5908],
         [ 0.5080,  1.3918, -0.9620, -0.9378]]],
       grad_fn=<NativeBatchNormBackward0>)
tensor(-2.9802e-08, grad_fn=<MeanBackward0>)
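
The normalization above can be reproduced by hand, which shows what the layer does in training mode. This sketch assumes a freshly constructed `BatchNorm1d`, whose learnable scale and shift start at 1 and 0; the names `mean`, `var`, and `manual` are illustrative.

```python
import torch

x = torch.rand(1, 4, 4) * 20 + 5       # (batch, channels, length)
norm_layer = torch.nn.BatchNorm1d(4)   # a freshly built module is in training mode
normed = norm_layer(x)

# Per-channel batch statistics: reduce over the batch and length dimensions.
mean = x.mean(dim=(0, 2), keepdim=True)
var = ((x - mean) ** 2).mean(dim=(0, 2), keepdim=True)  # biased variance
manual = (x - mean) / torch.sqrt(var + norm_layer.eps)

print(normed.mean(dim=(0, 2)))         # each channel is centered near 0
print(torch.allclose(normed, manual, atol=1e-4))
```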


**Dropout layers** are a tool for encouraging sparse representations in your model - that is, pushing it to do inference with less data.


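Dropout layers work by randomly setting parts of the input tensor to zero during training; they are turned off for inference, when the module is in evaluation mode. The values that survive are scaled up by 1/(1 - p), which is why some outputs below are greater than 1 even though the input came from `torch.rand`. This forces the model to learn against this masked or reduced dataset. For example: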

In [None]:
my_tensor = torch.rand(1, 4, 4)

dropout = torch.nn.Dropout(p=0.4)
print(dropout(my_tensor))
print(dropout(my_tensor))

tensor([[[0.0000, 0.0000, 0.0000, 1.5199],
         [0.4390, 1.5375, 0.4332, 1.1481],
         [1.5698, 0.3263, 0.0000, 0.7532],
         [1.1801, 0.0000, 1.3801, 1.1654]]])
tensor([[[0.5754, 0.0000, 0.0000, 0.0000],
         [0.4390, 0.0000, 0.0000, 1.1481],
         [1.5698, 0.3263, 0.0000, 0.0000],
         [1.1801, 0.0000, 1.3801, 1.1654]]])


Above, you can see the effect of dropout on a sample tensor. You can use the optional `p` argument to set the probability of an individual element being zeroed; if you don’t, it defaults to 0.5.
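
The difference between training and evaluation mode is easy to observe directly. The sketch below uses a tensor of ones (an arbitrary choice that makes the scaling visible) and an arbitrary seed.

```python
import torch

torch.manual_seed(0)       # arbitrary seed, only for repeatability
dropout = torch.nn.Dropout(p=0.4)
x = torch.ones(1000)

dropout.train()            # training mode: elements are zeroed at random
train_out = dropout(x)
kept = train_out[train_out != 0]
dropped_fraction = (train_out == 0).float().mean().item()

dropout.eval()             # evaluation mode: dropout passes input through unchanged
eval_out = dropout(x)

print(kept.unique())       # survivors are scaled by 1 / (1 - p)
print(dropped_fraction)    # roughly p
print(torch.equal(eval_out, x))
```

The scaling keeps the expected value of each element the same in both modes, so nothing needs to be rescaled at inference time.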

## Activation Functions

`torch.nn` has modules encapsulating all of the major activation functions, including ReLU and its many variants, Tanh, Hardtanh, Sigmoid, and more. It also includes other functions, such as Softmax, that are most useful at the output stage of a model.
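
As a quick illustration, the sketch below applies a few of these activation modules to the same small tensor. The input values are arbitrary, picked to show behavior on both sides of zero.

```python
import torch

x = torch.tensor([[-2.0, -0.5, 0.0, 0.5, 2.0]])

relu_out = torch.nn.ReLU()(x)              # negatives become 0
tanh_out = torch.nn.Tanh()(x)              # squashed into (-1, 1)
hardtanh_out = torch.nn.Hardtanh()(x)      # clamped to [-1, 1] by default
sigmoid_out = torch.nn.Sigmoid()(x)        # squashed into (0, 1)
softmax_out = torch.nn.Softmax(dim=1)(x)   # each row sums to 1

for name, out in [('ReLU', relu_out), ('Tanh', tanh_out),
                  ('Hardtanh', hardtanh_out), ('Sigmoid', sigmoid_out),
                  ('Softmax', softmax_out)]:
    print(name, out)
```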



## Loss Functions

Loss functions tell us how far a model’s prediction is from the correct answer. PyTorch contains a variety of loss functions, including the common MSE (mean squared error, the mean of the squared L2 distance), Cross Entropy Loss and Negative Log Likelihood Loss (both useful for classifiers), and others.
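
A short sketch of these losses in use follows. The predictions, targets, and logits are made-up numbers for illustration. It also shows that `CrossEntropyLoss` on raw logits matches `NLLLoss` applied to log-softmax outputs.

```python
import torch
import torch.nn.functional as F

# Regression: MSELoss is the mean of the squared differences.
pred = torch.tensor([1.0, 2.0, 4.0])
target = torch.tensor([1.0, 3.0, 2.0])
mse = torch.nn.MSELoss()(pred, target)
print(mse)  # (0 + 1 + 4) / 3

# Classification: CrossEntropyLoss takes raw logits and class indices.
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2])
ce = torch.nn.CrossEntropyLoss()(logits, labels)
nll = torch.nn.NLLLoss()(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(ce, nll))
```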

