
#`torch.nn.Module` and `torch.nn.Parameter`

Except for `Parameter`, the classes we discuss here are all subclasses of `torch.nn.Module`. This is the PyTorch base class meant to encapsulate behaviors specific to PyTorch models and their components.

If a particular `Module` subclass has learnable weights, these weights are expressed as instances of `torch.nn.Parameter`. The `Parameter` class is a subclass of `torch.Tensor` with one special behavior: when a `Parameter` is assigned as an attribute of a `Module`, it is added to the list of that module's parameters. These parameters may be accessed through the `parameters()` method on the `Module` class.
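
The registration behavior described above can be checked directly. The sketch below is only an illustration: the `Demo` class and its `scale` and `offset` attributes are hypothetical names, not part of the original notebook.

```python
import torch


class Demo(torch.nn.Module):
    """Hypothetical module used only to illustrate parameter registration."""

    def __init__(self):
        super().__init__()
        # Wrapped in Parameter: registered with the module automatically.
        self.scale = torch.nn.Parameter(torch.ones(3))
        # Plain tensor attribute: NOT registered as a parameter.
        self.offset = torch.zeros(3)


demo = Demo()
param_names = [name for name, _ in demo.named_parameters()]
print(param_names)               # only 'scale' is listed
print(demo.scale.requires_grad)  # Parameters track gradients by default
```

Only the attribute wrapped in `Parameter` shows up, which is why an optimizer built from `parameters()` would never update the plain tensor.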

Here, we define a `Module` subclass, create an instance of it, and ask it to report on its parameters:

In [None]:
import torch

In [None]:
class TinyModel(torch.nn.Module):
    def __init__(self):
        super(TinyModel, self).__init__()

        self.linear1 = torch.nn.Linear(100, 200)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(200, 10)
        self.softmax = torch.nn.Softmax(dim=1)  # normalize across the class dimension

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x

In [None]:
tinymodel = TinyModel()

print('The model:')
print(tinymodel)
print('\n\nJust one layer:')
print(tinymodel.linear2)
print('\n\nModel params:')
for param in tinymodel.parameters():
    print(param)

print('\n\nLayer params:')
for param in tinymodel.linear2.parameters():
    print(param)

This shows the fundamental structure of a PyTorch model: there is an `__init__()` method that defines the layers and other components of a model, and a `forward()` method where the computation gets done.
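
As a rough illustration of how the two methods work together, the sketch below repeats a compact version of `TinyModel` so that it runs on its own, then pushes a random batch through it. The batch size of 4 is an arbitrary assumption.

```python
import torch


class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(100, 200)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(200, 10)
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, x):
        x = self.activation(self.linear1(x))
        return self.softmax(self.linear2(x))


model = TinyModel()
batch = torch.rand(4, 100)   # 4 samples, 100 features each
out = model(batch)           # calling the module runs forward()
n_params = sum(p.numel() for p in model.parameters())
print(out.shape)             # one row of 10 class scores per sample
print(n_params)              # weights and biases of both linear layers
```

Note that the model is called as `model(batch)` rather than `model.forward(batch)`; calling the module is the usual way to run the forward pass.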

#Common Layer Types

##Linear Layers
The most basic layer type is the linear, or fully connected, layer: every input influences every output to a degree set by the layer's weights.

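If a layer has *m* inputs and *n* outputs, its weights form a matrix with *m* x *n* entries. PyTorch stores this matrix with shape [n x m] (`out_features` by `in_features`) and adds a bias vector of length *n*: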

In [None]:
lin = torch.nn.Linear(3,2)
x = torch.rand(1,3)
print('Input:')
print(x)

print('\n\nWeight and Bias parameters:')
for param in lin.parameters():
    print(param)

y = lin(x)
print('\n\nOutput:')
print(y)

When we check the weights with `lin.weight`, the tensor reports itself as a `Parameter` and shows that it is tracking gradients with autograd.

In [None]:
lin.weight

In [None]:
lin.bias
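
To make the layer's arithmetic concrete, the following sketch reproduces the output of a `Linear(3, 2)` layer by hand. It assumes nothing beyond the layer shown above; the variable names `manual` and `same` are introduced here for illustration.

```python
import torch

lin = torch.nn.Linear(3, 2)   # 3 inputs, 2 outputs
x = torch.rand(1, 3)

print(lin.weight.shape)       # stored as (out_features, in_features)
print(lin.bias.shape)

# The layer computes y = x @ W^T + b; reproduce it by hand.
manual = x @ lin.weight.T + lin.bias
same = torch.allclose(lin(x), manual, atol=1e-6)
print(same)
```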

##Convolutional Layers
Conv layers are built to handle data with a high degree of spatial correlation.

In [None]:
import torch.nn.functional as F

In [None]:
class LeNet(torch.nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()

        # 1 input image channel (black&white), 6 output channels, 5x5 square convolution kernel
        self.conv1 = torch.nn.Conv2d(1, 6, 5)
        self.conv2 = torch.nn.Conv2d(6, 16, 3)

        # an affine operation y = Wx + b
        self.fc1 = torch.nn.Linear(16*6*6, 120) # 6*6 from image dimension
        self.fc2 = torch.nn.Linear(120, 84)
        self.fc3 = torch.nn.Linear(84, 10)

    def forward(self, x):
        # Maxpooling over (2,2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2,2))
        # If the window is square, a single number is enough
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:] # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

* LeNet5 takes in a 1x32x32 black & white image.
* A Conv layer is like a window that scans over the image, looking for a pattern it recognizes.
* The 3rd argument of `Conv2d` is the window or kernel size.

The output of a conv layer is an activation map - a spatial representation of the presence of features in the input tensor.
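
The `16*6*6` input size of the first linear layer follows from how each stage shrinks the image. The sketch below traces the shapes for a 1x32x32 input using the same layer settings as `LeNet`; the intermediate names (`a1`, `p1`, and so on) are made up for this illustration.

```python
import torch
import torch.nn.functional as F

image = torch.rand(1, 1, 32, 32)   # (batch, channels, height, width)

conv1 = torch.nn.Conv2d(1, 6, 5)   # 5x5 kernel, no padding
conv2 = torch.nn.Conv2d(6, 16, 3)  # 3x3 kernel, no padding

a1 = conv1(image)                  # 32 - 5 + 1 = 28
p1 = F.max_pool2d(F.relu(a1), 2)   # 28 / 2 = 14
a2 = conv2(p1)                     # 14 - 3 + 1 = 12
p2 = F.max_pool2d(F.relu(a2), 2)   # 12 / 2 = 6

for name, t in [("conv1", a1), ("pool1", p1), ("conv2", a2), ("pool2", p2)]:
    print(name, tuple(t.shape))
```

Each of the 6 (then 16) output channels is one activation map, so the final feature tensor flattens to 16 * 6 * 6 = 576 values per image.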

##Recurrent Layers
RNNs are used for sequential data - anything from time-series measurements from a scientific instrument to natural language sentences to DNA nucleotides. An RNN does this by maintaining a hidden state that acts as a sort of memory for what it has seen in the sequence so far.

The internal structure of an RNN layer - or its variants, the LSTM and GRU - is not covered here. The following LSTM-based tagger shows what one looks like in use:

In [None]:
class LSTMTagger(torch.nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = torch.nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states with dim of hidden_dim
        self.lstm = torch.nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = torch.nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

The constructor has four arguments:
* `vocab_size` is the number of words in the input vocabulary. Each word is a one-hot vector in a `vocab_size`-dimensional space.
* `tagset_size` is the number of tags in the output set.
* `embedding_dim` is the size of the embedding space for the vocabulary. An embedding maps a vocabulary onto a low-dimensional space, where words with similar meanings are close together in the space.
* `hidden_dim` is the size of the LSTM's memory (its hidden state).

The input will be a sentence with the words represented as indices of one-hot vectors. The embedding layer will then map these down to an `embedding_dim`-dimensional space. The LSTM takes this sequence of embeddings and iterates over it, producing an output vector of length `hidden_dim` for each word. The final linear layer acts as a classifier; applying `log_softmax()` to the output of the final layer converts it into a normalized set of estimated log-probabilities that a given word maps to a given tag.
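
The data flow described above can be traced one step at a time. This is a minimal sketch with made-up sizes and a made-up five-word sentence of indices; it mirrors the layers of `LSTMTagger` without wrapping them in a class.

```python
import torch
import torch.nn.functional as F

# Arbitrary sizes chosen for illustration only.
vocab_size, embedding_dim, hidden_dim, tagset_size = 10, 6, 8, 3

embedding = torch.nn.Embedding(vocab_size, embedding_dim)
lstm = torch.nn.LSTM(embedding_dim, hidden_dim)
hidden2tag = torch.nn.Linear(hidden_dim, tagset_size)

sentence = torch.tensor([1, 4, 2, 7, 0])                # five word indices
embeds = embedding(sentence)                            # (5, embedding_dim)
lstm_out, _ = lstm(embeds.view(len(sentence), 1, -1))   # (5, 1, hidden_dim)
tag_space = hidden2tag(lstm_out.view(len(sentence), -1))
tag_scores = F.log_softmax(tag_space, dim=1)            # (5, tagset_size)

print(embeds.shape, lstm_out.shape, tag_scores.shape)
print(tag_scores.exp().sum(dim=1))   # exponentiated rows sum to about 1
```

Exponentiating the log-softmax output recovers ordinary probabilities, which is why each row sums to one.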

#Transformer
PyTorch has a `Transformer` class that allows us to define the overall parameters of a transformer model - the number of attention heads, the number of encoder & decoder layers, dropout and activation functions, etc.

Alongside `torch.nn.Transformer`, `torch.nn` also has classes to encapsulate the individual components (`TransformerEncoder`, `TransformerDecoder`) and subcomponents (`TransformerEncoderLayer`, `TransformerDecoderLayer`).
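
As a small sketch of what configuring such a model might look like, the example below builds a deliberately tiny `torch.nn.Transformer` and runs random tensors through it. All sizes are arbitrary assumptions chosen to keep the example fast, and the inputs stand in for already-embedded sequences.

```python
import torch

# Small, arbitrary sizes chosen so the example runs quickly.
model = torch.nn.Transformer(
    d_model=32,              # embedding width; must be divisible by nhead
    nhead=4,                 # number of attention heads
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=64,
    dropout=0.1,
)

src = torch.rand(10, 2, 32)  # (source length, batch, d_model)
tgt = torch.rand(7, 2, 32)   # (target length, batch, d_model)
out = model(src, tgt)        # same shape as tgt
print(out.shape)
print(type(model.encoder).__name__, type(model.decoder).__name__)
```

A real model would add token embeddings, positional encodings, and attention masks around this core; those are left out here.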

[Official documentation about Transformer class](https://pytorch.org/docs/stable/nn.html#transformer-layers)

[Official Transformer Tutorial](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)

#Other Layers and Functions

##Data Manipulation Layers
There are layer types that perform important functions in models but do not participate in the learning process.

**Max pooling** reduces a tensor by combining cells and assigning the maximum value of the input cells to the output cell. (Min pooling works the same way, using the minimum.)

In [None]:
my_tensor = torch.rand(1, 6, 6)
print(my_tensor)

maxpool_layer = torch.nn.MaxPool2d(3)
print(maxpool_layer(my_tensor))
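
With random inputs it can be hard to see which cell won each window. The sketch below uses a grid of consecutive values instead, so every window's maximum is its bottom-right cell; the `grid` and `pooled` names are introduced here for illustration.

```python
import torch

# Values 0..35 laid out row by row, so each window's maximum is easy to spot.
grid = torch.arange(36, dtype=torch.float32).view(1, 6, 6)
pooled = torch.nn.MaxPool2d(3)(grid)   # 3x3 windows, stride defaults to 3

print(grid)
print(pooled)                          # shape (1, 2, 2)
```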

**Normalization layers** re-center and normalize the output of one layer before feeding it to another. Centering and scaling the intermediate tensors has a number of beneficial effects, such as letting us use higher learning rates without exploding/vanishing gradients:

In [None]:
my_tensor = torch.rand(1,4,4) * 20 + 5
print('My tensor:')
print(my_tensor)

print('\nMean of my tensor:')
print(my_tensor.mean())

norm_layer = torch.nn.BatchNorm1d(4)
normed_tensor = norm_layer(my_tensor)
print('\n\nNormalized tensor:')
print(normed_tensor)
print('\nMean of normalized tensor:')
print(normed_tensor.mean())

The normalization layer is beneficial because many activation functions have their strongest gradients near 0, but sometimes suffer from vanishing or exploding gradients for inputs that drive them far away from zero. Keeping the data centered around the area of steepest gradient will tend to mean faster, better learning and higher feasible learning rates.
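
The re-centering can also be checked per feature rather than over the whole tensor. The sketch below assumes a batch of 8 samples with 4 features (both numbers are arbitrary) and a freshly constructed `BatchNorm1d`, which starts in training mode with a scale of 1 and a shift of 0.

```python
import torch

torch.manual_seed(0)
batch = torch.rand(8, 4) * 20 + 5   # 8 samples, 4 features, far from zero

norm = torch.nn.BatchNorm1d(4)      # a new module starts in training mode
normed = norm(batch)

col_mean = normed.mean(dim=0)
col_var = normed.var(dim=0, unbiased=False)
print(batch.mean(dim=0))            # original per-feature means
print(col_mean)                     # close to 0
print(col_var)                      # close to 1
```

In evaluation mode the layer would use its running statistics instead of the statistics of the current batch, so the result would differ.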

**Dropout layers** are a tool for encouraging sparse representations in a model - that is, pushing it to do inference with less data.

Dropout layers work by randomly setting parts of the input tensor to zero during training - they are always turned off for inference. This forces the model to learn against this masked or reduced dataset:

In [None]:
my_tensor = torch.rand(1, 4, 4)

dropout = torch.nn.Dropout(p=0.4)
print(dropout(my_tensor))
print(dropout(my_tensor))
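
The difference between training and inference can be shown by switching modes explicitly. This sketch uses a tensor of ones so that the effect is easy to read; in training mode the surviving values are scaled by 1 / (1 - p) so that the expected sum stays the same.

```python
import torch

torch.manual_seed(0)
x = torch.ones(1000)
p = 0.4
dropout = torch.nn.Dropout(p=p)

dropout.train()                       # training mode: mask and rescale
train_out = dropout(x)
zero_fraction = (train_out == 0).float().mean().item()
kept_value = train_out.max().item()   # survivors are scaled by 1 / (1 - p)

dropout.eval()                        # inference mode: identity
eval_out = dropout(x)

print(zero_fraction, kept_value)
print(torch.equal(eval_out, x))
```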

##Activation Functions
Inserting non-linear activation functions between layers is what allows a deep learning model to simulate any function, rather than just linear ones.
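
One way to see why the non-linearity matters: two linear layers stacked without an activation are equivalent to a single linear layer. The sketch below checks this numerically; the layer sizes and the names `W`, `b`, `stacked`, and `collapsed` are assumptions made for the illustration.

```python
import torch

torch.manual_seed(0)
lin1 = torch.nn.Linear(4, 5)
lin2 = torch.nn.Linear(5, 3)
x = torch.rand(2, 4)

# Two stacked linear layers with nothing in between...
stacked = lin2(lin1(x))

# ...equal one affine map with combined weight and bias.
W = lin2.weight @ lin1.weight             # shape (3, 4)
b = lin2.weight @ lin1.bias + lin2.bias   # shape (3,)
collapsed = x @ W.T + b

# With a ReLU in between, the result generally differs.
with_relu = lin2(torch.relu(lin1(x)))

print(torch.allclose(stacked, collapsed, atol=1e-5))
print(torch.allclose(stacked, with_relu, atol=1e-5))
```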

`torch.nn` has modules encapsulating all of the major activation functions, including ReLU and its many variants, Tanh, Hardtanh, sigmoid, and more.
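
A quick way to compare a few of these is to apply each one to the same small tensor. The selection below is only a sample, and the input values are arbitrary.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

activations = {
    "ReLU": torch.nn.ReLU(),
    "LeakyReLU": torch.nn.LeakyReLU(negative_slope=0.01),
    "Tanh": torch.nn.Tanh(),
    "Hardtanh": torch.nn.Hardtanh(),   # clamps to [-1, 1] by default
    "Sigmoid": torch.nn.Sigmoid(),
}

results = {name: fn(x) for name, fn in activations.items()}
for name, y in results.items():
    print(f"{name:10s}", y)
```

These modules hold no learnable parameters, so they can be reused freely between layers.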

##Loss Functions
Loss functions tell us how far a model's prediction is from the correct answer.
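
As a brief illustration, the sketch below evaluates two common loss modules from `torch.nn` on hand-picked values where the result can be worked out on paper. The inputs are assumptions chosen for easy arithmetic.

```python
import math
import torch

# Regression: mean squared error between predictions and targets.
pred = torch.tensor([1.0, 2.0])
target = torch.tensor([1.0, 4.0])
mse = torch.nn.MSELoss()(pred, target)            # (0^2 + 2^2) / 2 = 2.0

# Classification: cross-entropy on raw scores (logits) and a class index.
logits = torch.zeros(1, 4)                        # uniform scores over 4 classes
label = torch.tensor([2])
ce = torch.nn.CrossEntropyLoss()(logits, label)   # -log(1/4) = log(4)

print(mse.item(), ce.item(), math.log(4))
```

`CrossEntropyLoss` applies log-softmax internally, so it expects raw scores rather than the output of a softmax layer.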