# PyTorch Reference Layers

A reference made for personal use on common layers in PyTorch. Not meant to be comprehensive. 

Currently:

1. Linear
2. Embeddings
3. Dropout
4. Transformers

Created by Josiah Davis.

In [1]:
import platform; print("Platform", platform.platform())
import sys; print("Python", sys.version)
import torch; print("PyTorch", torch.__version__)
import torch.nn as nn
import torch.nn.functional as F

Platform Linux-4.15.0-1060-aws-x86_64-with-debian-buster-sid
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0]
PyTorch 1.3.1


## 1. Linear

Handles the core matrix multiply of the form $y = xA^T + b$, where $x$ is the data, $A$ is the learned weight parameter and $b$ is the bias term.

### [`nn.Linear(in_features, out_features)`](https://pytorch.org/docs/stable/nn.html#linear)

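- **Input**: Typically 2D of the form [observations, input_features] (e.g., [10, 8]). Extra leading dimensions are also accepted; only the last dimension must equal `in_features`.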
- **Arguments**: First arg is number of columns in the input called `in_features` (e.g., 8), second arg is number of columns in output called `out_features` (e.g., 16).
- **Output**: 2D of the form [observations, output_features] (e.g., [10, 16])
- **Stores**: The layer stores two parameters: the bias with shape=`out_features` and the weight with shape=[`out_features`, `in_features`]. The weight matrix is transposed before being multiplied by the input matrix.

In [2]:
x = torch.rand(10, 8) # [observations, in_features]
lin = nn.Linear(8, 16)
y = lin(x)
y.shape

torch.Size([10, 16])

In [3]:
lin.bias.shape # [out_features]

torch.Size([16])

In [4]:
lin.weight.shape # [out_features, in_features]

torch.Size([16, 8])
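
To connect the stored shapes back to $y = xA^T + b$, here is a minimal sketch that recomputes the layer output by hand from `lin.weight` and `lin.bias`. The name `y_manual` is only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(10, 8)        # [observations, in_features]
lin = nn.Linear(8, 16)

# Manual version of y = x A^T + b using the stored parameters.
# weight is [out_features, in_features], so it is transposed first.
y_manual = x @ lin.weight.t() + lin.bias
y_layer = lin(x)

print(y_manual.shape)
print(torch.allclose(y_layer, y_manual, atol=1e-6))
```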

By default, the weights and bias are drawn from a uniform distribution on $\left[-\sqrt{\frac{1}{in\_features}}, \sqrt{\frac{1}{in\_features}}\right]$ ([source code](https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Linear)).

For example, we had 8 input features.

In [5]:
import numpy as np
print(np.sqrt(1/8))

0.3535533905932738


In [6]:
vals = lin.weight.detach().numpy()
counts, cutoffs = np.histogram(vals)
print(cutoffs)

[-0.34973994 -0.2796788  -0.20961763 -0.13955648 -0.06949533  0.00056583
  0.07062698  0.14068814  0.21074928  0.28081045  0.3508716 ]


Initialization can be changed as desired. For example:

In [7]:
class CustomLin(nn.Module):
    
    def __init__(self):
        super(CustomLin, self).__init__()
        self.lin1 = nn.Linear(8, 16)
        self.lin2 = nn.Linear(16, 6)
        self._init_layers()
        
    def _init_layers(self):
        init_1 = np.sqrt(6) / np.sqrt(8 + 16)
        self.lin1.weight.data.uniform_(-init_1, init_1)
        self.lin1.bias.data.zero_()
        
        self.lin2.weight.data.normal_(mean=0, std=np.sqrt(6 / 16))
        self.lin2.bias.data.zero_()
        
    def forward(self, x):
        return self.lin2(self.lin1(x))

In [8]:
cl = CustomLin()

In [9]:
print(np.sqrt(6) / np.sqrt(8 + 16))
counts, cutoffs = np.histogram(cl.lin1.weight.detach().numpy())
print(cutoffs)

0.5
[-0.49987888 -0.40455016 -0.30922145 -0.21389274 -0.11856403 -0.02323532
  0.07209339  0.1674221   0.2627508   0.35807952  0.45340824]


In [10]:
print(np.sqrt(6 / 16))
wts = cl.lin2.weight.detach().numpy()
wts.mean(), wts.std()

0.6123724356957945


(-0.071864285, 0.605536)
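
As an alternative to writing into `.data` directly, the same scheme can probably be expressed with the helpers in `torch.nn.init`. This is a sketch under the assumption that `lin1` should use the Glorot/Xavier uniform bound $\sqrt{6 / (fan\_in + fan\_out)}$, which is the value computed by hand above, and that `lin2` keeps the same normal standard deviation.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
lin1 = nn.Linear(8, 16)
lin2 = nn.Linear(16, 6)

# Glorot/Xavier uniform: bound = sqrt(6 / (fan_in + fan_out)) = sqrt(6 / 24) = 0.5
nn.init.xavier_uniform_(lin1.weight)
nn.init.zeros_(lin1.bias)

# Normal initialization with the same std as the hand-written version
nn.init.normal_(lin2.weight, mean=0.0, std=math.sqrt(6 / 16))
nn.init.zeros_(lin2.bias)

print(lin1.weight.abs().max().item())   # should not exceed the bound of 0.5
print(lin2.weight.std().item())         # roughly sqrt(6 / 16), up to sampling noise
```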

For learning about some of the history and development of weight initialization, consult:
- [Understanding the difficulty of training deep feedforward neural networks (2010)](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) by Xavier Glorot and Yoshua Bengio
- [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (2015)](https://arxiv.org/pdf/1502.01852.pdf) by Kaiming He, Xiangyu Zhang, Shaoqing Ren,  Jian Sun

## 2. Embeddings

### [`nn.Embedding(num_embeddings, embedding_dim)`](https://pytorch.org/docs/stable/nn.html#embedding)

Simple lookup of any categorical variable (often NLP vocabulary) mapping to dense floating point representations.

#### Shape

- Input: Could be 1D or 2D depending on the application.
- Output: The input shape with `embedding_dim` appended. If the input is 1D, the output will be 2D; if the input is 2D, the output will be 3D.

#### Parameters

- **num_embeddings** (int) – how large is the vocab in dictionary or how many categories are there? (e.g., 25)
- **embedding_dim** (int) – the number of features (e.g., 5)

1D input example (e.g., a single column in tabular data)

In [11]:
x = torch.tensor([15, 20, 7])
print(f'input shape: {x.shape}')
emb = nn.Embedding(num_embeddings=25, embedding_dim=5)
y = emb(x)
print(f'output shape: {y.shape}')

input shape: torch.Size([3])
output shape: torch.Size([3, 5])


2D input example (e.g., text for language modelling with sequence on the rows and "text chunk" on the columns)

In [12]:
x = torch.tensor([[15, 20, 7],[23, 10, 6]])
print(f'input shape: {x.shape}')
emb = nn.Embedding(num_embeddings=25, embedding_dim=5)
y = emb(x)
print(f'output shape: {y.shape}')

input shape: torch.Size([2, 3])
output shape: torch.Size([2, 3, 5])
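
Since the layer is a simple lookup, its output should match indexing the weight matrix by row. A minimal sketch of that check (the name `y_indexed` is only for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(num_embeddings=25, embedding_dim=5)
x = torch.tensor([[15, 20, 7], [23, 10, 6]])

y = emb(x)                   # [2, 3, 5]
y_indexed = emb.weight[x]    # row lookup into the [25, 5] weight matrix

print(torch.equal(y, y_indexed))
```

Each integer in the input selects one row of `emb.weight`, which is why the output gains a trailing dimension of size `embedding_dim`.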


In [13]:
emb.weight

Parameter containing:
tensor([[-0.9857,  0.6738, -1.3308,  0.6104,  0.4368],
        [ 1.1923,  0.2208,  1.3420,  0.2724, -0.3383],
        [ 1.1209, -1.0355, -1.0666, -0.6748,  1.8354],
        [ 0.1832, -0.7247,  0.3047,  0.5994, -0.1622],
        [ 1.2293, -0.9944,  0.0545,  0.2088,  0.8442],
        [-0.2275,  1.0348, -0.1023, -1.4188, -0.1926],
        [-1.1492, -1.2622, -0.7481,  0.8993,  1.0637],
        [-0.1893, -0.4812, -0.8839, -0.3168,  1.1439],
        [ 0.9222,  0.3558, -0.7736,  0.4515, -1.9237],
        [ 0.2411,  1.3968, -1.2450, -0.0553, -0.5514],
        [-1.0053,  0.9058,  0.9180,  0.4477,  0.0537],
        [-0.2617,  0.6926, -0.5694, -0.6246,  2.1864],
        [ 0.4094,  0.1337, -0.5880, -0.7409,  0.5854],
        [-1.1285, -0.4147, -0.2701, -0.1995,  0.0327],
        [ 1.5854, -0.3498, -0.4903, -0.0224,  0.9896],
        [-0.8889,  0.2809, -2.2613,  0.2511,  0.6522],
        [-0.5991,  1.0116,  0.0659, -0.5952,  1.8904],
        [ 1.0922,  0.3825,  0.2867,  0.6715

## 3. Dropout

### [`nn.Dropout(p)`](https://pytorch.org/docs/master/nn.html#dropout)

Randomly zeroes out elements of the input for regularization. Dropout is disabled when `model.eval()` is set. The elements that are not zeroed out are scaled by $\frac{1}{1-p}$, so the expected value of each output element matches its input.

**Input:** Can be any shape

**Output:** Same shape as input

**See Also:** `nn.Dropout2d` and `nn.Dropout3d` for zero-ing entire channels at a time.

**Paper**: Improving neural networks by preventing co-adaptation of feature detectors (2012) ([arxiv](https://arxiv.org/pdf/1207.0580.pdf))

In [13]:
x = torch.randn(2, 4)
print(f'input shape: {x.shape}')
do = nn.Dropout(p=0.6)
y = do(x)
print(f'output shape: {y.shape}')
print(f'output:\n{y}')

input shape: torch.Size([2, 4])
output shape: torch.Size([2, 4])
output:
tensor([[ 0.0000, -0.0000,  0.0000, -0.0000],
        [-2.6023, -0.2766,  0.0000, -0.4595]])
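
The $\frac{1}{1-p}$ scaling is easier to see on an input of all ones. The following is a small sketch, assuming `p=0.6` as above; the input length of 10,000 is an arbitrary choice to make the mean stable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.6
do = nn.Dropout(p=p)         # a new module starts in training mode

x = torch.ones(10000)
y = do(x)

kept = y[y != 0]             # elements that survived dropout
print(kept.unique())         # a single value equal to 1 / (1 - p)
print(y.mean())              # close to 1, the mean of the input
```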


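Dropout is active only in training mode. A newly constructed module starts in training mode, which is why the standalone layer above zeroed values without an explicit call to `train()`.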

In [14]:
class DOModel(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.do = nn.Dropout(0.6)
        
    def forward(self, x):
        return self.do(x)

In [15]:
x = torch.randn(2, 4)
x

tensor([[ 0.1225,  0.1723, -2.2045,  1.0320],
        [-0.6691, -0.3356, -1.1725,  0.1653]])

Running with `model.train()`. This will give you a different output each time you call it (unless you set a seed). 

Incidentally, this is the key to Monte Carlo dropout, a technique for uncertainty estimation.

In [16]:
model = DOModel()
model.train()
y = model(x)
print(f'output:\n{y}')

output:
tensor([[ 0.0000,  0.4308, -5.5112,  2.5801],
        [-0.0000, -0.8390, -0.0000,  0.4134]])
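
A rough sketch of the Monte Carlo dropout idea: keep dropout active at prediction time, run several stochastic forward passes, and use the spread of the outputs as an uncertainty estimate. The small network `net` and the 100 passes are hypothetical choices for illustration, not part of the model above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small regression network with a dropout layer
net = nn.Sequential(
    nn.Linear(4, 32),
    nn.ReLU(),
    nn.Dropout(0.6),
    nn.Linear(32, 1),
)
x = torch.randn(2, 4)

net.train()                  # keep dropout active while predicting
with torch.no_grad():
    samples = torch.stack([net(x) for _ in range(100)])  # [100, 2, 1]

mean = samples.mean(dim=0)   # predictive mean per observation
std = samples.std(dim=0)     # spread across passes, used as uncertainty
print(mean.shape, std.shape)
```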


Running again with `model.eval()`. This will give you the same output no matter how many times you call it.

In [17]:
model.eval()
y = model(x)
print(f'output:\n{y}')

output:
tensor([[ 0.1225,  0.1723, -2.2045,  1.0320],
        [-0.6691, -0.3356, -1.1725,  0.1653]])


## 4. Transformers

### [`nn.TransformerEncoderLayer(d_model, nhead)`](https://pytorch.org/docs/master/nn.html?highlight=transformerencoderlayer#torch.nn.TransformerEncoderLayer)

Transformer operations are defined in "Attention Is All You Need" (2017). The difference between this and the decoder layer is that the encoder only attends to itself (key, value, and query all come from the source language), whereas the decoder layer attends over the memory (i.e., the encoding of the input sequence) in addition to applying self-attention over the target sequence.

**Input**: A 3D input of the structure [sequence length, batch size, hidden features].

**Output**: Same structure as input.

**Operation**: Key, Value, and Query are all the source sentence (only self-attention). See the [source code](https://pytorch.org/docs/master/_modules/torch/nn/modules/transformer.html#TransformerEncoderLayer):

```
src2 = self.self_attn(src, src, src, attn_mask=src_mask, key_padding_mask=src_key_padding_mask)[0]
src = src + self.dropout1(src2)
src = self.norm1(src)
src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
src = src + self.dropout2(src2)
src = self.norm2(src)
return src
```
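
The excerpt above can be replayed by hand using the layer's own submodules. This is a sketch, assuming the default configuration (layer norm applied after each residual connection) and eval mode so that every dropout is the identity. The small sizes are arbitrary, and the names `manual` and `expected` are only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, dim_feedforward=32)
layer.eval()                         # dropout becomes a no-op

src = torch.rand(10, 2, 16)          # [sequence length, batch size, features]
with torch.no_grad():
    # Self-attention block: query, key, and value are all `src`
    attn_out = layer.self_attn(src, src, src, need_weights=False)[0]
    h = layer.norm1(src + attn_out)
    # Position-wise feedforward block
    ff_out = layer.linear2(layer.activation(layer.linear1(h)))
    manual = layer.norm2(h + ff_out)
    expected = layer(src)

print(torch.allclose(manual, expected, atol=1e-5))
```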

**Parameters**

- **d_model** – the number of expected features in the input (required).
- **nhead** – the number of heads in the multiheadattention models (required).
- **dim_feedforward** – the dimension of the feedforward network model (default=2048).
- **dropout** – the dropout value (default=0.1).
- **activation** – the activation function of intermediate layer, relu or gelu (default=relu).

In [18]:
x = torch.rand(10, 32, 512)
print(f'input shape: {x.shape}')
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
y = encoder_layer(x)
print(f'output shape: {y.shape}')

input shape: torch.Size([10, 32, 512])
output shape: torch.Size([10, 32, 512])
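
In practice, several of these layers are usually stacked with `nn.TransformerEncoder`, and padded positions are hidden from attention with `src_key_padding_mask`. A minimal sketch, with small sizes chosen only to keep it fast and a made-up padding pattern:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, d_model = 10, 4, 16
x = torch.rand(seq_len, batch, d_model)   # [sequence length, batch size, features]

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, dim_feedforward=32)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# Padding mask has shape [batch size, sequence length]; True marks padding.
# Here the last three positions of the first sequence are treated as padding.
pad_mask = torch.zeros(batch, seq_len, dtype=torch.bool)
pad_mask[0, 7:] = True

y = encoder(x, src_key_padding_mask=pad_mask)
print(f'output shape: {y.shape}')
```

Note that the mask is batch-first even though the input is sequence-first.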
