# PyTorch Reference Layers

A reference made for personal use on common layers in PyTorch. Not meant to be comprehensive. 

Currently:

1. Linear
2. Embeddings
3. Dropout
4. Transformers
5. Normalization

Created by Josiah Davis.

In [2]:
import platform; print("Platform", platform.platform())
import sys; print("Python", sys.version)
import torch; print("PyTorch", torch.__version__)
import torch.nn as nn
import torch.nn.functional as F

Platform Linux-4.15.0-1060-aws-x86_64-with-debian-buster-sid
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0]
PyTorch 1.3.1


## 1. Linear

Handles the core matrix multiply of the form $y = xA^T + b$, where $x$ is the data, $A$ is the learned weight parameter and $b$ is the bias term.

### [`nn.Linear(in_features, out_features)`](https://pytorch.org/docs/stable/nn.html#linear)

- **Input**: 2D,3D,4D of the form [observations [, n_heads, seq_len,] in_features] (e.g., [10, 8]).
- **Arguments**: First arg is number of columns in the input called `in_features` (e.g., 8), second arg is number of columns in output called `out_features` (e.g., 16).
- **Output**: 2D,3D,4D of the form [observations [, n_heads, seq_len,] out_features] (e.g., [10, 16])
- **Stores**: Layer stores two parameters: the bias with shape [`out_features`] and the weight with shape [`out_features`, `in_features`]. The weight matrix is transposed before being multiplied by the input matrix.
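As a quick check of the formula $y = xA^T + b$, the following sketch compares the layer's output with an explicit matrix multiply using the stored parameters. The variable names (`x_demo`, `lin_demo`) are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_demo = torch.rand(10, 8)       # [observations, in_features]
lin_demo = nn.Linear(8, 16)

# Manual version of y = x A^T + b using the layer's own weight and bias
y_manual = x_demo @ lin_demo.weight.t() + lin_demo.bias
y_layer = lin_demo(x_demo)

print(y_layer.shape)
print(torch.allclose(y_layer, y_manual, atol=1e-6))
```

The same broadcasting applies to 3D and 4D inputs, since the multiply only acts on the last dimension.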

2D example

In [18]:
x = torch.rand(10, 8) # e.g., [observations, hidden]
lin = nn.Linear(8, 16)
y = lin(x)
y.shape

torch.Size([10, 16])

3D example, e.g., language modelling

In [19]:
x = torch.rand(10, 3, 8) # e.g., [observations, seq_len, hidden]
y = lin(x)
y.shape

torch.Size([10, 3, 16])

4D example, e.g., transformer self-attention

In [15]:
x = torch.rand(10, 2, 3, 8) # e.g., [observations, n_heads, seq_len, hidden_per_head]
y = lin(x)
y.shape

torch.Size([10, 2, 3, 16])

In [16]:
lin.bias.shape # [out_features]

torch.Size([16])

In [17]:
lin.weight.shape # [out_features, in_features]

torch.Size([16, 8])

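Values are initialized by default from a uniform distribution on $[-k, k]$ with $k = \sqrt{\frac{1}{{in\_features}}}$ ([source code](https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Linear)).

For example, the layer above has 8 input features, so the bound is about 0.354 and the histogram bin edges of the weights stay inside it.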

In [5]:
import numpy as np
print(np.sqrt(1/8))

0.3535533905932738


In [6]:
vals = lin.weight.detach().numpy()
counts, cutoffs = np.histogram(vals)
print(cutoffs)

[-3.4990752e-01 -2.7985743e-01 -2.0980732e-01 -1.3975722e-01
 -6.9707118e-02  3.4298003e-04  7.0393078e-02  1.4044318e-01
  2.1049328e-01  2.8054339e-01  3.5059348e-01]


Initialization can be changed as desired. For example:

In [7]:
class CustomLin(nn.Module):
    
    def __init__(self):
        super(CustomLin, self).__init__()
        self.lin1 = nn.Linear(8, 16)
        self.lin2 = nn.Linear(16, 6)
        self._init_layers()
        
    def _init_layers(self):
        init_1 = np.sqrt(6) / np.sqrt(8 + 16)
        self.lin1.weight.data.uniform_(-init_1, init_1)
        self.lin1.bias.data.zero_()
        
        self.lin2.weight.data.normal_(mean=0, std=np.sqrt(6 / 16))
        self.lin2.bias.data.zero_()
        
    def forward(self, x):
        return self.lin2(self.lin1(x))

In [8]:
cl = CustomLin()

In [9]:
print(np.sqrt(6) / np.sqrt(8 + 16))
counts, cutoffs = np.histogram(cl.lin1.weight.detach().numpy())
print(cutoffs)

0.5
[-0.49227554 -0.39338982 -0.29450414 -0.19561842 -0.09673272  0.00215298
  0.10103868  0.19992438  0.2988101   0.39769578  0.4965815 ]


In [10]:
print(np.sqrt(6 / 16))
wts = cl.lin2.weight.detach().numpy()
wts.mean(), wts.std()

0.6123724356957945


(0.055361047, 0.65983176)
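The uniform bound used for `lin1` above, $\sqrt{6} / \sqrt{8 + 16}$, has the same form as the Glorot/Xavier uniform bound. As a hedged sketch, the built-in `nn.init.xavier_uniform_` (with its default `gain=1`) should give weights in the same range without computing the bound by hand. The name `layer` is illustrative.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(8, 16)

# Built-in Glorot/Xavier uniform init: bound = gain * sqrt(6 / (fan_in + fan_out))
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)

bound = math.sqrt(6 / (8 + 16))
print(bound)
print(layer.weight.min().item(), layer.weight.max().item())
```

Using the `nn.init` functions avoids touching `.data` directly and keeps the fan-in/fan-out arithmetic out of the model code.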

For learning about some of the history and development of weight initialization, consult:
- [Understanding the difficulty of training deep feedforward neural networks (2010)](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) by Xavier Glorot and Yoshua Bengio
- [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (2015)](https://arxiv.org/pdf/1502.01852.pdf) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun

## 2. Embeddings

### [`nn.Embedding(num_embeddings, embedding_dim)`](https://pytorch.org/docs/stable/nn.html#embedding)

Simple lookup table that maps each value of a categorical variable (often an NLP vocabulary) to a dense floating-point vector.

#### Shape

- Input: 1D or 2D tensor of integer indices, depending on the application.
- Output: If the input is 1D, the output is 2D; if the input is 2D, the output is 3D (an `embedding_dim` axis is appended).

#### Parameters

- **num_embeddings** (int) – the size of the vocabulary, i.e., how many categories there are (e.g., 25)
- **embedding_dim** (int) – the number of features (e.g., 5)
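To make the "lookup" description concrete, here is a minimal sketch showing that calling the layer returns the same values as indexing rows of its weight matrix. The names `table` and `idx` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
table = nn.Embedding(num_embeddings=25, embedding_dim=5)
idx = torch.tensor([[15, 20, 7], [23, 10, 6]])   # integer category ids

looked_up = table(idx)         # calling the layer
indexed = table.weight[idx]    # plain row indexing into the weight matrix

print(looked_up.shape)
print(torch.equal(looked_up, indexed))
```

Every index must be smaller than `num_embeddings`, since each one selects a row of the weight matrix.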

1D input example (e.g., a single column in tabular data)

In [11]:
x = torch.tensor([15, 20, 7])
print(f'input shape: {x.shape}')
emb = nn.Embedding(num_embeddings=25, embedding_dim=5)
y = emb(x)
print(f'output shape: {y.shape}')

input shape: torch.Size([3])
output shape: torch.Size([3, 5])


2D input example (e.g., text for language modelling with sequence on the rows and "text chunk" on the columns)

In [12]:
x = torch.tensor([[15, 20, 7],[23, 10, 6]])
print(f'input shape: {x.shape}')
emb = nn.Embedding(num_embeddings=25, embedding_dim=5)
y = emb(x)
print(f'output shape: {y.shape}')

input shape: torch.Size([2, 3])
output shape: torch.Size([2, 3, 5])


In [13]:
emb.weight

Parameter containing:
tensor([[-0.0860,  1.2532,  0.0364, -0.4219,  0.7459],
        [ 3.5131, -0.4061,  0.3684,  0.6196, -0.9366],
        [ 0.4579, -2.9096, -0.1020,  1.9300,  0.6954],
        [-0.9681, -0.6774,  0.2202,  1.0518,  0.9375],
        [-1.6973,  1.3657, -0.1358,  0.1784,  0.8525],
        [ 0.4791, -0.6650,  1.3976, -2.5129, -0.0492],
        [-0.1043, -0.8984, -1.5367, -1.1828,  1.3752],
        [ 0.9068, -0.0710, -1.5567,  0.8874,  0.1792],
        [-1.0142,  1.0876, -1.7989, -0.9655, -0.2474],
        [-1.4501, -1.2906,  0.5528,  3.1640, -1.8497],
        [ 0.2514, -0.4580,  0.4639,  0.1454, -0.4032],
        [-1.1059,  0.4435,  1.2726, -0.0141, -0.0754],
        [ 0.0366, -0.9740,  0.5161,  1.2957,  0.7978],
        [-0.3188, -0.9667, -0.8374,  0.3967,  0.3336],
        [ 0.7357, -1.3129, -2.5905, -0.1475,  0.5877],
        [-0.5707,  0.0450, -0.2404,  0.6378, -0.9729],
        [ 1.4942, -0.5200,  1.0599, -0.2466, -1.0081],
        [ 0.2316,  0.4852,  0.7389,  0.4103

## 3. Dropout

### [`nn.Dropout(p)`](https://pytorch.org/docs/master/nn.html#dropout)

Randomly zeroes elements of the input with probability `p` for regularization. It is a no-op when `model.eval()` is set. During training, the elements that are not zeroed are scaled by $\frac{1}{1-p}$.

**Input:** Can be any shape

**Output:** Same shape as input

**See Also:** `nn.Dropout2d` and `nn.Dropout3d` for zeroing entire channels at a time.

**Paper**: Improving neural networks by preventing co-adaptation of feature detectors (2012) ([arxiv](https://arxiv.org/pdf/1207.0580.pdf))

In [14]:
x = torch.randn(2, 4)
print(f'input shape: {x.shape}')
do = nn.Dropout(p=0.6)
y = do(x)
print(f'output shape: {y.shape}')
print(f'output:\n{y}')

input shape: torch.Size([2, 4])
output shape: torch.Size([2, 4])
output:
tensor([[ 4.3167,  0.0000,  0.0000, -0.8370],
        [-1.0616, -0.0000,  0.0000, -1.3493]])
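The $\frac{1}{1-p}$ scaling is easier to see on an input of all ones. The sketch below (illustrative names, fixed seed) checks that the surviving elements equal $1 / (1 - 0.6)$, that roughly a fraction `p` of elements is zeroed, and that the layer is the identity in eval mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.6
drop = nn.Dropout(p=p)
ones = torch.ones(1000)

out = drop(ones)                        # modules start in training mode
kept = out[out != 0]
print(kept.unique())                    # surviving values, scaled by 1 / (1 - p)
print((out == 0).float().mean())        # fraction zeroed, roughly p

drop.eval()
print(torch.equal(drop(ones), ones))    # identity in eval mode
```

The scaling keeps the expected value of each element unchanged between training and evaluation.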


Dropout is only active in training mode, which is the default for a newly constructed module.

In [15]:
class DOModel(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.do = nn.Dropout(0.6)
        
    def forward(self, x):
        return self.do(x)

In [16]:
x = torch.randn(2, 4)
x

tensor([[ 0.3708, -0.2339,  0.5004, -0.7387],
        [-1.8816, -0.2013, -0.8256, -0.7826]])

Running with `model.train()`. This will give you a different output each time you call it (unless you set a seed). 

Incidentally, this is the key to Monte Carlo dropout, a technique for uncertainty estimation.

In [17]:
model = DOModel()
model.train()
y = model(x)
print(f'output:\n{y}')

output:
tensor([[ 0.9271, -0.5847,  0.0000, -1.8468],
        [-0.0000, -0.0000, -0.0000, -0.0000]])


Running again with `model.eval()`. This will give you the same output no matter how many times you call it.

In [18]:
model.eval()
y = model(x)
print(f'output:\n{y}')

output:
tensor([[ 0.3708, -0.2339,  0.5004, -0.7387],
        [-1.8816, -0.2013, -0.8256, -0.7826]])
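The Monte Carlo dropout idea mentioned above can be sketched roughly as follows: keep dropout active at prediction time, run several stochastic forward passes, and summarize them. The toy model `mc_model` and the choice of 100 samples are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model; any network containing dropout layers would do
mc_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(0.6), nn.Linear(8, 1))
mc_model.train()                        # keep dropout active at prediction time

inputs = torch.randn(2, 4)
n_samples = 100
with torch.no_grad():
    # Each forward pass draws a fresh dropout mask
    samples = torch.stack([mc_model(inputs) for _ in range(n_samples)])

pred_mean = samples.mean(dim=0)         # averaged prediction, shape [2, 1]
pred_std = samples.std(dim=0)           # spread across passes, a rough uncertainty signal
print(samples.shape, pred_mean.shape, pred_std.shape)
```

Note that `train()` also changes the behaviour of other layers such as batch normalization, so in a larger model it may be preferable to switch only the dropout modules to training mode.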


## 4. Transformers

### [`nn.TransformerEncoderLayer(d_model, nhead)`](https://pytorch.org/docs/master/nn.html?highlight=transformerencoderlayer#torch.nn.TransformerEncoderLayer)

Transformer operations are defined in "Attention Is All You Need" (2017). The difference between this and the decoder layer is that the encoder layer only attends to itself (key, value, and query all come from the source sequence), whereas the decoder layer attends over the memory (i.e., the encoding of the input sequence) in addition to self-attention over the target sequence.

**Input**: A 3D input of the structure [sequence length, batch size, hidden features].

**Output**: Same structure as input.

**Operation**: Key, Value, and Query are all the source sentence (only self-attention). See the [source code](https://pytorch.org/docs/master/_modules/torch/nn/modules/transformer.html#TransformerEncoderLayer):

```
src2 = self.self_attn(src, src, src, attn_mask=src_mask, key_padding_mask=src_key_padding_mask)[0]
src = src + self.dropout1(src2)
src = self.norm1(src)
src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
src = src + self.dropout2(src2)
src = self.norm2(src)
return src
```

**Parameters**

- **d_model** – the number of expected features in the input (required).
- **nhead** – the number of heads in the multiheadattention models (required).
- **dim_feedforward** – the dimension of the feedforward network model (default=2048).
- **dropout** – the dropout value (default=0.1).
- **activation** – the activation function of intermediate layer, relu or gelu (default=relu).

In [19]:
x = torch.rand(10, 32, 512)
print(f'input shape: {x.shape}')
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
y = encoder_layer(x)
print(f'output shape: {y.shape}')

input shape: torch.Size([10, 32, 512])
output shape: torch.Size([10, 32, 512])
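To illustrate the encoder/decoder contrast described at the top of this section, here is a small sketch that feeds the encoder layer's output to `nn.TransformerDecoderLayer` as `memory`. The sizes (`d_model=16`, `nhead=2`, `dim_feedforward=32`) are arbitrary small values chosen for speed, not recommendations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, nhead = 16, 2                  # small illustrative sizes

enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, dim_feedforward=32)
dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, dim_feedforward=32)

src = torch.rand(10, 4, d_model)        # [source seq_len, batch size, features]
tgt = torch.rand(7, 4, d_model)         # [target seq_len, batch size, features]

memory = enc_layer(src)                 # self-attention over the source only
out = dec_layer(tgt, memory)            # self-attention over tgt plus attention over memory

print(memory.shape)
print(out.shape)
```

The decoder output follows the shape of `tgt`, while `memory` only needs to match it in batch size and feature dimension.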


## 5. Normalization

### [`nn.LayerNorm`](https://pytorch.org/docs/stable/nn.html#layernorm)

Layer Normalization is described in the 2016 paper [Layer Normalization](https://arxiv.org/pdf/1607.06450.pdf). It consists of subtracting the mean and dividing by the standard deviation (the square root of the variance plus a small `eps`) of the data coming into the layer. You choose which trailing dimensions are used when computing the mean and variance. Unlike batch normalization, this layer computes its statistics from the current input in the same way during training and inference.

In [115]:
x = torch.randn(3, 5)
# Without learnable parameters (elementwise_affine=False)
m = nn.LayerNorm(5, elementwise_affine=False)

In [116]:
x

tensor([[-0.9569,  0.2346, -0.1040, -1.5393, -1.0113],
        [-0.0372,  1.7077, -2.5073, -0.8248,  0.7692],
        [-0.3095,  0.7462,  0.1451,  1.7440, -0.3375]])

In [118]:
E_x = np.mean(x[0,:].detach().numpy())
E_x

-0.67539585

In [119]:
V_x = np.var(x[0,:].detach().numpy())
V_x

0.41861182

In [120]:
(x[0,0] - E_x) / np.sqrt(V_x + m.eps)

tensor(-0.4351)

In [121]:
m(x)

tensor([[-0.4351,  1.4065,  0.8831, -1.3353, -0.5192],
        [ 0.0984,  1.3131, -1.6212, -0.4499,  0.6597],
        [-0.9071,  0.4471, -0.3240,  1.7271, -0.9430]])
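The single-element check above can be extended to the whole tensor. This sketch (illustrative names, fresh random input) computes the per-row mean and biased variance over the last dimension and compares the result with the layer's output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_ln = torch.randn(3, 5)
ln = nn.LayerNorm(5, elementwise_affine=False)

# Per-row statistics over the last dimension (biased variance, as np.var uses by default)
mean = x_ln.mean(dim=-1, keepdim=True)
var = ((x_ln - mean) ** 2).mean(dim=-1, keepdim=True)
manual = (x_ln - mean) / torch.sqrt(var + ln.eps)

print(torch.allclose(ln(x_ln), manual, atol=1e-5))
```

Each row of the output has mean close to zero, since the statistics are computed per observation rather than per batch.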