# Convolutional Language Models

In this notebook we explore convolutional language models, i.e., language models that make use of convolutional layers.

## 1. Convolutions and Convolutional Layers

In order to understand convolutional language models, we first need to understand convolutional layers. What's a convolutional layer? A convolutional layer is nothing more than a parameterized convolution. What's a convolution? Well, convolutions can be approached from various perspectives (e.g. signal processing, probability theory, etc.); however, most often in deep learning, convolutions provide a way to search for occurrences of a particular pattern within a larger signal. They can be seen as sliding-window feature detectors (as opposed to the global-window feature detectors of MLPs). This is implemented as follows:

- Let $k : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^+$ be a kernel (i.e. a function giving some notion of *similarity* between vectors). Here, $n$ is the *window size* of the kernel.

- Let $\mathbf{s} \in \mathbb{R}^m$ be a signal.

- The convolution of $\mathbf{s}$ with $k(\cdot, \cdot)$ is given by sliding the kernel across $\mathbf{s}$, producing a new *convolved* signal. The convolved signal will have higher levels in locations where $\mathbf{s}$ had features corresponding to what the kernel was looking for.

Most often, the kernel is implemented by computing the inner product between the current window and a weight vector $\mathbf{f}$. The weights can then be interpreted much like those of a linear layer, each corresponding to a linear correlation between a feature and the kernel output. This process can be generalized to multi-dimensional and multi-channel signals. For instance, when considering images, we have signals with $C$ channels and two spatial dimensions. In this case, the kernel is typically specified using a *cube* of weights, where each slice corresponds to a channel. The product-sum still results in a scalar score, however. In convolutional neural networks, we often run multiple such kernels over a given signal, producing another multi-channel 2D signal. Hence, convolutional layers can be stacked to develop rich hierarchical representations of high-dimensional data.
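As a minimal sketch of the sliding inner product described above, a 1D convolution without padding can be written in plain Python. The function name `convolve_1d` and the toy signal are illustrative assumptions, not part of the original notebook. Like most deep learning frameworks, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve_1d(signal, weights):
    """Slides `weights` across `signal`, taking an inner product with each window."""
    window_size = len(weights)
    positions = len(signal) - window_size + 1  # The kernel never runs over the edge.
    return [
        sum(w * s for w, s in zip(weights, signal[start:start + window_size]))
        for start in range(positions)
    ]


signal = [0, 0, 1, 2, 1, 0, 0]  # Contains the pattern (1, 2, 1) in the middle.
weights = [1, 2, 1]             # A kernel "looking for" that same pattern.

convolved = convolve_1d(signal, weights)
print(convolved)  # [1, 4, 6, 4, 1]
```

The response peaks at index 2, which is exactly where the window lines up with the pattern in the signal.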

### 1.1. Hyperparameters: Kernel Size, Padding, Stride, Dilation, Groups

Beyond the basic convolution operation described above, deep learning frameworks such as PyTorch provide a number of additional hyperparameters that can be varied for specific applications. These typically include: kernel size, padding, stride, dilation and groups. We'll discuss each of these in turn.


#### 1.1.1. Kernel Size

Kernel size is perhaps the simplest hyperparameter to understand. It simply describes the spatial dimensions of the kernel that's slid across the input signal. In the diagram below, a kernel of size 3x3 is slid across a 7x7 image (note that the channel dimension is not shown).

![](https://i.ibb.co/xLkpMs0/full-padding-no-strides-transposed-3633935968.gif)

Notice that the convolved signal has a lower resolution than the input signal. This comes from the fact that once the kernel reaches the end of one spatial dimension, it does not "run over" the edge. Specifically, if we have a kernel of size $(N, M)$ and we slide it over a $(K, L)$ sized signal (one unit per step), then the output size is:

$$\left(K - N + 1, L - M + 1\right).$$

To see why, consider the starting position of the kernel. At this point, there are $K - N$ horizontal spaces left to move and $L - M$ vertical spaces left to move. The total number of times we can apply the kernel along each axis is therefore one plus the corresponding number of shifts. This gives the expression above.
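The counting argument can be checked by brute force. The sketch below (the helper name `count_kernel_positions` is a hypothetical choice) enumerates every top-left corner at which the kernel still fits inside the signal and counts them per axis.

```python
def count_kernel_positions(signal_size, kernel_size):
    """Counts, per axis, the kernel placements that stay inside the signal."""
    K, L = signal_size
    N, M = kernel_size
    first_axis = [start for start in range(K) if start + N <= K]
    second_axis = [start for start in range(L) if start + M <= L]
    return len(first_axis), len(second_axis)


output_size = count_kernel_positions((7, 7), (3, 3))
print(output_size)  # (5, 5), i.e. (7 - 3 + 1, 7 - 3 + 1)
```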

#### 1.1.2. Padding

In many cases, we may wish to have a little more *explicit* control over the size of the output. *Padding* can help with this by allowing the kernel to "run over the edge" of the input signal. This is often used to keep the spatial dimensions identical before and after applying a convolution. Consider again the previous equation:

$$(K - N + 1, L - M + 1).$$

In order to retain the input dimensions, we essentially need to add $N - 1$ and $M - 1$ extra shifts via padding. In the diagram below this has been done with a kernel of size $(3, 3)$ and a signal of size $(5, 5)$. Applying what we've just said, there should be $3 - 1 = 2$ units of padding added along each axis in order to retain the signal's dimensions. Indeed, we see this below, with one unit added on either side.


![](https://i.ibb.co/yFGdKG5/image12-2355761825.gif)

Note that in deep learning frameworks, padding is usually assumed to be symmetric. In this case, `padding = 1` indicates adding one unit of padding to *each side* of the input signal.

Another commonly desired behaviour is for every partial or complete superimposition of the kernel to be accounted for. In this case, we need to add enough padding so that the kernel can be pushed until it almost "falls off the edge."

![](https://i.ibb.co/pjv9d09/Capture.png)

This is usually called *full padding*, as opposed to the *half* or *same padding* described above. For full padding, the output should have size $(K + N - 1, L + M - 1)$. Hence, the total amount of padding required along each axis can be worked out as $2N - 2$ and $2M - 2$ respectively. In the diagram above, we have $2N - 2 = 2\times 3 - 2 = 4$, split into symmetric padding of $2$ on either side.
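The two padding schemes can be compared with a small sketch. The helper names `total_padding` and `padded_output_size` are assumptions made for illustration; both work with the *total* padding per axis (both sides combined) and assume a stride of one.

```python
def total_padding(kernel_size, mode):
    """Total padding per axis (both sides combined) for a stride-1 convolution."""
    if mode == 'same':
        return tuple(n - 1 for n in kernel_size)
    if mode == 'full':
        return tuple(2 * n - 2 for n in kernel_size)
    raise ValueError(f'Unknown mode: {mode}')


def padded_output_size(signal_size, kernel_size, mode):
    """Applies the (K + padding - N + 1) rule along every axis."""
    padding = total_padding(kernel_size, mode)
    return tuple(k + p - n + 1 for k, p, n in zip(signal_size, padding, kernel_size))


same_size = padded_output_size((5, 5), (3, 3), 'same')
full_size = padded_output_size((5, 5), (3, 3), 'full')
print(same_size, full_size)  # (5, 5) (7, 7)
```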

#### 1.1.3. Stride

Another key hyperparameter is *stride*. So far we've only shown convolutions where the kernel is shifted one unit at each step. However, there's nothing stopping us from increasing or decreasing it. The diagram below shows a convolution with kernel size $(3, 3)$, no padding and a *stride* of two.

![](https://i.ibb.co/mHx6WFG/8ebb38993cb39631ee16a7ae27904381-4158148158.gif)


When using a stride greater than one, the spatial dimensions of the output are more dramatically reduced compared to the input. Specifically, an input of size $(K, L)$ with symmetric padding $P$ and a kernel of size $(N, M)$ with stride $S$ gives an output of size

$$\left(\text{floor}\left(\dfrac{K + 2P - N}{S}\right) + 1, \text{floor}\left(\dfrac{L + 2P - M}{S}\right) + 1\right).$$

So for the example above, we have $K = L = 5$ and $N = M = 3$. We also have $P = 0$ and $S = 2$. Then, the output has size

$$
\begin{aligned}
\left(\text{floor}\left(\dfrac{5 + 0 - 3}{2}\right) + 1, \text{floor}\left(\dfrac{5 + 0 - 3}{2}\right) + 1\right) &= \left(\text{floor}\left(\dfrac{2}{2}\right) + 1, \text{floor}\left(\dfrac{2}{2}\right) + 1\right)\\
&= (2, 2).
\end{aligned}
$$

This matches the diagram!
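The general formula is easy to wrap in a helper. This is a sketch under the same assumptions as the formula (symmetric padding $P$ on each side, the same stride $S$ on every axis); the name `conv_output_size` is hypothetical.

```python
def conv_output_size(signal_size, kernel_size, padding=0, stride=1):
    """Output size per axis: floor((K + 2P - N) / S) + 1."""
    return tuple(
        (k + 2 * padding - n) // stride + 1
        for k, n in zip(signal_size, kernel_size)
    )


strided = conv_output_size((5, 5), (3, 3), padding=0, stride=2)
same = conv_output_size((5, 5), (3, 3), padding=1, stride=1)
print(strided, same)  # (2, 2) (5, 5)
```

With `stride=1` and `padding=0` the expression reduces to the $(K - N + 1, L - M + 1)$ rule from the kernel size section.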

We mentioned that the stride can be increased *or decreased*. How can we decrease the stride below one? Well, we can actually have *fractional strides*. With fractional strides, we imagine padding *between* the values of the input signal. This is best understood visually.

![](https://i.ibb.co/ncz2qdK/2aSir.gif)

The diagram above shows a 3x3 kernel with a *fractional stride* of $1/2$. Due to the virtual padding between input values, the kernel only moves $1/2$ of an input unit at every step; hence the stride is fractional. Fractional strides are frequently used in *transposed convolutions*. When we unroll a convolution into a single linear transformation (a matrix), the transpose of that matrix is the transposed convolution. Transposed convolutions invert the *shapes* of convolutions: if a convolution maps inputs of size $(K, L)$ to outputs of size $(K', L')$, then its transpose maps inputs of size $(K', L')$ to outputs of size $(K, L)$. Note however that transposed convolutions are not *functional* inverses: applying one after the other does not, in general, recover the original signal.
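To make the link between fractional strides and transposed convolutions concrete, here is a 1D sketch in plain Python. All function names are illustrative assumptions. It computes the same result in two ways: by inserting zeros between the input values and sliding a (reversed) kernel with stride one, and by scattering a scaled copy of the kernel into the output for each input value.

```python
def insert_zeros(signal, factor):
    """Places `factor - 1` zeros between neighbouring values (the "virtual padding")."""
    stretched = []
    for index, value in enumerate(signal):
        if index > 0:
            stretched.extend([0] * (factor - 1))
        stretched.append(value)
    return stretched


def fractionally_strided_convolution(signal, weights, factor):
    """Stride-1 sliding inner product over the zero-stuffed, fully padded signal."""
    n = len(weights)
    padded = [0] * (n - 1) + insert_zeros(signal, factor) + [0] * (n - 1)
    flipped = weights[::-1]  # Transposing the unrolled matrix reverses the kernel.
    return [
        sum(w * s for w, s in zip(flipped, padded[start:start + n]))
        for start in range(len(padded) - n + 1)
    ]


def transposed_convolution(signal, weights, stride):
    """Scatters a scaled copy of the kernel into the output for every input value."""
    output = [0] * ((len(signal) - 1) * stride + len(weights))
    for index, value in enumerate(signal):
        for offset, weight in enumerate(weights):
            output[index * stride + offset] += value * weight
    return output


signal, weights = [1, 2, 3], [1, 2, 3]
upsampled = fractionally_strided_convolution(signal, weights, factor=2)
print(upsampled)  # [1, 2, 5, 4, 9, 6, 9]
```

A length-3 input becomes a length-7 output, which is the size a stride-2 convolution with a kernel of size 3 would map back down to length 3.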

#### 1.1.4. Groups

Grouped convolutions split the input feature map into $k$ chunks along the channel dimension. A separate filter group is then learned for each chunk, producing $k$ output feature maps, which are combined (concatenated along the channel dimension) to produce a single feature map. The primary effects of grouping are increased sparsity (since each filter group can only access part of the input feature map) and increased model parallelism.

![](https://i.ibb.co/kQbXybf/groups.png)
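The sparsity introduced by grouping shows up directly in the number of weights. The sketch below counts the weights of a 1D convolution (ignoring biases), assuming that each output channel only connects to the `in_channels / groups` input channels of its own group; the helper name is hypothetical.

```python
def conv1d_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of kernel weights (no bias) in a grouped 1D convolution."""
    if in_channels % groups or out_channels % groups:
        raise ValueError('Both channel counts must be divisible by `groups`.')
    return out_channels * (in_channels // groups) * kernel_size


dense = conv1d_weight_count(128, 128, kernel_size=3, groups=1)
grouped = conv1d_weight_count(128, 128, kernel_size=3, groups=4)
print(dense, grouped)  # 49152 12288
```

Using four groups divides the number of weights by four, since each filter only sees a quarter of the input channels.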

## 2. Causal Convolutions

Autoregressive language models are inherently causal in nature: past observations can influence future observations, but *not* vice versa. In MLP-based models, this can be achieved by applying masks to the weights of each layer such that there's no information flow from future to past. A similar approach can be taken for CNNs using *causal convolutions*.

![](https://i.ibb.co/ydRHCvc/causal.png)

In a standard convolutional layer, information flows bidirectionally. For instance, in figure (a), the purple units have access to information from two past *and* future positions. However, in order to enforce causality we need to ensure that units only have access to *past* positions. This change is shown in figure (b).

In practice this can be achieved by *padding* the input sequence before applying the convolution.

![](https://i.ibb.co/0ZwjsTk/padding.jpg)

In this instance, padding the input with two zeros has the effect of re-aligning the input and output feature map such that $x_i'$ is influenced only by $x_{j \le i}$. The final two units are also removed as they don't correspond to any input position. Of course, this isn't yet causality preserving as $x_i$ can influence $x_i'$. To fix this, we can simply shift the input one more space to the right.

![](https://i.ibb.co/0FWyBCW/padding.png)

The specific amount of padding required is the kernel size minus one (for the input layer it's just the kernel size). Also, notice that as we increase the depth of the stack, the receptive field (context size) increases. While the first hidden layer units can only condition on two previous positions, the second hidden layer units can condition on four. This differs from causal MLPs, which generally have at least one unit able to condition on all past positions.
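The padding scheme can be sketched in plain Python. The function `causal_convolve_1d` and its `shift` argument are assumptions made for illustration: `shift=0` pads with `kernel_size - 1` zeros (hidden layers), while `shift=1` pads with `kernel_size` zeros so that each output only sees strictly earlier inputs (input layer).

```python
def causal_convolve_1d(signal, weights, shift=0):
    """Left-pads with `kernel_size - 1 + shift` zeros and keeps `len(signal)` outputs."""
    n = len(weights)
    padded = [0] * (n - 1 + shift) + list(signal)
    # Windows that would start beyond position `len(signal) - 1` are never
    # computed, which corresponds to dropping the trailing units.
    return [
        sum(w * s for w, s in zip(weights, padded[start:start + n]))
        for start in range(len(signal))
    ]


signal = [1, 2, 3, 4]
weights = [1, 1]

hidden = causal_convolve_1d(signal, weights, shift=0)
shifted = causal_convolve_1d(signal, weights, shift=1)
print(hidden, shifted)  # [1, 3, 5, 7] [0, 1, 3, 5]
```

Changing the last input value leaves every output of the shifted version untouched, and only alters the final output of the unshifted version, which is the causality property we are after.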

And with that we're done; we now have a mechanism for preserving causality in CNNs.

### 2.1. Dilated Causal Convolutions


One extension to the above scheme is to increase the receptive field of the stack through dilated convolutions. Dilation increases the receptive field without requiring additional layers.

![](https://i.ibb.co/FmyT92B/dialated.png)

In the above diagram, the convolutional layers are set to have exponentially increasing dilation from $2^0$ to $2^3$.

> Additionally the stride is set so as to ignore unconnected units (e.g., some units of the feature map in hidden layer 1 are unconnected to the input, so we ignore these in hidden layer 2 by using an appropriate stride). We do the same for the remaining layers.

When working with dilated and/or strided causal convolutions, the causal padding must be adapted accordingly.
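For the dilated case, the adaptation is straightforward: a kernel of size $N$ with dilation $d$ spans $d(N - 1) + 1$ positions, so $d(N - 1)$ zeros are needed on the left. The sketch below assumes a stride of one and a kernel size of 2 (an assumption, since the kernel size is not stated in the text), and counts the current position as part of the receptive field. The helper names are hypothetical.

```python
def causal_padding(kernel_size, dilation=1):
    """Zeros to add on the left so that no output sees a future position."""
    return dilation * (kernel_size - 1)


def receptive_field(kernel_size, dilations):
    """Number of input positions visible to one unit at the top of the stack."""
    return 1 + sum(causal_padding(kernel_size, dilation) for dilation in dilations)


dilations = [2 ** layer for layer in range(4)]  # 1, 2, 4, 8
paddings = [causal_padding(2, dilation) for dilation in dilations]
print(paddings)                          # [1, 2, 4, 8]
print(receptive_field(2, dilations))     # 16
print(receptive_field(2, [1, 1, 1, 1]))  # 5
```

With the same four layers, exponentially increasing dilation sees 16 positions where undilated layers see only 5.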

### 2.2. Multi-dimensional Causal Convolutions

So far we've constrained discussion to 1-dimensional convolutions. These are appropriate for sequence modelling. However, modalities such as images can also be considered to have two spatial dimensions. In this case certain relationships may be easier to learn if we consider past positions along both spatial axes. This is the approach taken by models such as PixelCNN.

To preserve causality in higher dimensions, we simply extend the padding scheme to the relevant number of axes. For instance, in 2D we can pad both the left and top side of each feature map (e.g. by `kernel_size - 1`). The following is an excerpt from the PixelCNN paper.

![](https://i.ibb.co/wcKpq7v/exerpt.png)

We can see that their approach is a direct extension of the 1D masking scheme to two dimensions. They mention that the masks can be implemented by zeroing the kernel weights. This is indeed an alternative way to implement masks rather than padding.
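A sketch of such a weight mask for a single channel is given below. It assumes a square kernel with odd size and a raster-scan ordering (rows top to bottom, columns left to right); the function name and the `include_centre` flag are illustrative, and the colour-channel ordering used in PixelCNN is ignored here.

```python
def causal_mask_2d(kernel_size, include_centre):
    """Returns a 0/1 mask keeping only weights over already-generated positions."""
    centre = kernel_size // 2
    mask = [[0] * kernel_size for _ in range(kernel_size)]
    for row in range(kernel_size):
        for column in range(kernel_size):
            above = row < centre
            left_on_same_row = row == centre and column < centre
            at_centre = row == centre and column == centre
            if above or left_on_same_row or (at_centre and include_centre):
                mask[row][column] = 1
    return mask


input_layer_mask = causal_mask_2d(3, include_centre=False)
hidden_layer_mask = causal_mask_2d(3, include_centre=True)
for mask_row in input_layer_mask:
    print(mask_row)
# [1, 1, 1]
# [1, 0, 0]
# [0, 0, 0]
```

Multiplying the kernel weights elementwise by the mask before each convolution zeroes out the connections to current and future positions. Excluding the centre plays the same role as the extra one-step shift used for the input layer in the 1D case.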

## 3. Implementation

### 3.1. Generating IMDb Reviews

In this section we leverage 1D causal convolutions to train an autoregressive language model on IMDb reviews.

#### 3.1.1. Download the IMDb Reviews Dataset

In [None]:
#@markdown

!pip -q install nltk normality

In [None]:
#@markdown

import nltk

nltk.download('movie_reviews')
nltk.download('punkt')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### 3.1.2. Preprocess and Tokenize the Reviews

In [None]:
#@markdown

!pip -q install tokenizers

In [None]:
#@markdown

from normality import normalize

from nltk.corpus import movie_reviews
from tokenizers import SentencePieceBPETokenizer

from tqdm import tqdm


positive_file_ids = movie_reviews.fileids('pos')
negative_file_ids = movie_reviews.fileids('neg')

keep_punctuation = {
    '. ': 'keepperiod',
    ', ': 'keepcomma',
    '!': 'keepexclamation',
    '\'': 'keepquote',
}


def preprocess(text: str) -> str:

    for character, placeholder in keep_punctuation.items():
        text = text.replace(character, placeholder)

    text = normalize(text)

    for character, placeholder in keep_punctuation.items():
        text = text.replace(f' {placeholder} ', character)
        text = text.replace(f' {placeholder}', character)
        text = text.replace(f'{placeholder} ', character)
        text = text.replace(f'{placeholder}', character)

    return text

# For tokenization we use SentencePiece

corpus = []

print('Generating corpus...')

for file_id in tqdm(positive_file_ids):
    corpus.append(preprocess(movie_reviews.raw(file_id)))

print('Training tokenizer...')

tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(corpus, vocab_size=5000)
tokenizer.add_special_tokens(['<bos>', '<eos>', '<pad>'])
tokenizer.save('./tokenizer.json')

Generating corpus...


100%|██████████| 1000/1000 [00:02<00:00, 475.93it/s]


Training tokenizer...


#### 3.1.3. Define the Model

In [None]:
#@markdown

from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F


@dataclass
class CausalCNNConfiguration:
    """Causal CNN configuration."""

    vocabulary_size: int
    embedding_dimension: int
    channels: int
    kernel_size: int
    pad_id: int


class CausalCNN(nn.Module):
    """Causal CNN."""

    def __init__(self, configuration: CausalCNNConfiguration) -> None:
        """Initializes the module."""

        super(CausalCNN, self).__init__()

        self.configuration = configuration

        self.embedding = nn.Embedding(num_embeddings=configuration.vocabulary_size, embedding_dim=configuration.embedding_dimension, padding_idx=configuration.pad_id)

        self.convolution_0 = nn.Conv1d(in_channels=configuration.embedding_dimension, out_channels=configuration.channels, kernel_size=configuration.kernel_size, stride=1, padding=0)  # We want to handle padding manually.
        self.convolution_1 = nn.Conv1d(in_channels=configuration.channels, out_channels=configuration.channels, kernel_size=configuration.kernel_size, stride=1, padding=0)
        self.convolution_2 = nn.Conv1d(in_channels=configuration.channels, out_channels=configuration.channels, kernel_size=configuration.kernel_size, stride=1, padding=0)
        self.convolution_3 = nn.Conv1d(in_channels=configuration.channels, out_channels=configuration.channels, kernel_size=configuration.kernel_size, stride=1, padding=0)
        self.convolution_4 = nn.Conv1d(in_channels=configuration.channels, out_channels=configuration.channels, kernel_size=configuration.kernel_size, stride=1, padding=0)
        self.convolution_5 = nn.Conv1d(in_channels=configuration.channels, out_channels=configuration.channels, kernel_size=configuration.kernel_size, stride=1, padding=0)
        self.convolution_6 = nn.Conv1d(in_channels=configuration.channels, out_channels=configuration.vocabulary_size, kernel_size=configuration.kernel_size, stride=1, padding=0)


    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass."""

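        # Hidden layers: pad the length axis with `kernel_size - 1` zeros on the left only.
        padding = (self.configuration.kernel_size - 1, 0, 0, 0)
        # Input layer: pad `kernel_size` tokens on the left and crop one token on the right
        # (negative padding), so that position i only sees tokens at positions < i.
        input_padding = (self.configuration.kernel_size, -1)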

        x = F.pad(x, input_padding, value=self.configuration.pad_id)  # Pad and shift right.
        x = self.embedding(x)
        x = x.transpose(-2, -1)  # Convert from (B, L, C) to (B, C, L)
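        x = F.relu(self.convolution_0(x))  # Without a nonlinearity here, convolution_0 and convolution_1 would compose into a single linear map.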

        x = F.relu(self.convolution_1(F.pad(x, padding)))
        x = F.relu(self.convolution_2(F.pad(x, padding)))
        x = F.relu(self.convolution_3(F.pad(x, padding)))
        x = F.relu(self.convolution_4(F.pad(x, padding)))
        x = F.relu(self.convolution_5(F.pad(x, padding)))
        x = self.convolution_6(F.pad(x, padding))

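        # Note: log_softmax returns log-probabilities (as expected by nn.NLLLoss), not raw logits.
        log_probabilities = F.log_softmax(x.transpose(-2, -1), dim=-1)  # Convert from (B, C, L) to (B, L, C)
                                      # Can also be done with x.permute(0, 2, 1).contiguous()

        return log_probabilities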

In [None]:
#@markdown

configuration = CausalCNNConfiguration(
    vocabulary_size = tokenizer.get_vocab_size(),
    embedding_dimension = 256,
    channels=128,
    kernel_size=3,
    pad_id=tokenizer.token_to_id('<pad>'),
)

model = CausalCNN(configuration).cuda()
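It's worth estimating how much context this model can condition on. The sketch below (the helper name `context_size` is hypothetical) follows the padding used in `CausalCNN.forward`: the first convolution sees `kernel_size` strictly-past tokens, and each of the remaining layers extends the view by `kernel_size - 1` positions to the left.

```python
def context_size(kernel_size, layers):
    """Number of strictly-past tokens visible to each output position."""
    first_layer = kernel_size                        # Input is padded by `kernel_size` and shifted right.
    later_layers = (layers - 1) * (kernel_size - 1)  # Each further layer pads by `kernel_size - 1`.
    return first_layer + later_layers


tokens_of_context = context_size(kernel_size=3, layers=7)
print(tokens_of_context)  # 15
```

With seven convolutional layers and a kernel size of 3, each prediction can only depend on the previous 15 tokens.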

#### 3.1.4. Train the Model

In [None]:
#@markdown

from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=1e-3)
criterion = nn.NLLLoss()

In [None]:
#@markdown

epochs = 20
sequences = len(corpus)

for epoch in range(epochs):
    losses = []

    for sequence_index, sequence in enumerate(corpus):

        tokens = tokenizer.encode(f'<bos>{sequence}<eos>')
        tokens = torch.tensor(tokens.ids)
        tokens = tokens.cuda()

        sequence_length = tokens.size(0)

        optimizer.zero_grad()
        logits = model(tokens.view(1, -1)).view(sequence_length, -1)
        loss = criterion(logits, tokens)
        loss.backward()
        optimizer.step()

        losses.append(loss.item())

        if (sequence_index % 10) == 0:
            mean_loss = sum(losses) / len(losses)
            losses.clear()

            print(f'Epoch {epoch}/{epochs}, Sequence {sequence_index}/{sequences} - loss: {mean_loss}')  # Report the running mean rather than only the latest loss.

Epoch 0/20, Sequence 0/1000 - loss: 8.51842975616455
Epoch 0/20, Sequence 10/1000 - loss: 7.877602577209473
Epoch 0/20, Sequence 20/1000 - loss: 7.4164137840271
Epoch 0/20, Sequence 30/1000 - loss: 7.476500988006592
Epoch 0/20, Sequence 40/1000 - loss: 7.0692057609558105
Epoch 0/20, Sequence 50/1000 - loss: 7.266458511352539
Epoch 0/20, Sequence 60/1000 - loss: 7.045554161071777
Epoch 0/20, Sequence 70/1000 - loss: 7.349377155303955
Epoch 0/20, Sequence 80/1000 - loss: 7.187324047088623
Epoch 0/20, Sequence 90/1000 - loss: 7.198117733001709
Epoch 0/20, Sequence 100/1000 - loss: 7.118460178375244
Epoch 0/20, Sequence 110/1000 - loss: 7.125805854797363
Epoch 0/20, Sequence 120/1000 - loss: 7.254831790924072
Epoch 0/20, Sequence 130/1000 - loss: 7.211723327636719
Epoch 0/20, Sequence 140/1000 - loss: 7.159432411193848
Epoch 0/20, Sequence 150/1000 - loss: 7.230100154876709
Epoch 0/20, Sequence 160/1000 - loss: 6.9448442459106445
Epoch 0/20, Sequence 170/1000 - loss: 7.242609977722168
Epoc

KeyboardInterrupt: ignored
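As a rough sanity check on the logged values, the negative log-likelihood of a model that predicts uniformly over the vocabulary is the log of the vocabulary size. The sketch below assumes a vocabulary of 5000 tokens (the `vocab_size` requested when training the tokenizer; the actual size may be slightly larger once special tokens are added) and also converts a mean loss into perplexity.

```python
import math

vocabulary_size = 5000  # Assumption: the `vocab_size` passed to the tokenizer.

uniform_loss = math.log(vocabulary_size)
print(round(uniform_loss, 3))  # 8.517


def perplexity(mean_negative_log_likelihood):
    """Perplexity is the exponential of the mean negative log-likelihood (in nats)."""
    return math.exp(mean_negative_log_likelihood)


print(perplexity(uniform_loss))  # Approximately 5000: as uncertain as a uniform guess.
print(perplexity(7.12))          # Perplexity at a loss of roughly 7.12.
```

The first logged loss (about 8.518) is close to this uniform baseline, as expected for a freshly initialized model.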

#### 3.1.5. Sample from the Model

In [None]:
#@markdown

def sample_top_k(prompt: str, sequence_length: int, k: int = 5) -> torch.Tensor:
    """Samples token ids from the model using top-k sampling (decode them with the tokenizer)."""

    prompt_tokens = torch.tensor(tokenizer.encode(f'<bos>{prompt}').ids).cuda()
    prompt_length = len(prompt_tokens)

    sequence = torch.zeros(sequence_length, dtype=torch.long).cuda()  # Positions after the prompt are placeholders until sampled.
    sequence[: prompt_length] = prompt_tokens  # Insert the prompt.

    # Sample the completion.

    for token_position in range(prompt_length, sequence_length):
        logits = model(sequence.view(1, -1)).detach()
        logits = logits[0][token_position]

        top_k_logits = torch.topk(logits, k=k)
        top_k_distribution = F.softmax(top_k_logits.values, dim=-1)

        index = torch.multinomial(top_k_distribution, num_samples=1, replacement=True)
        token = top_k_logits.indices[index]

        sequence[token_position] = token

    return sequence


In [None]:
tokenizer.decode(sample_top_k('i think', 100, k=5).tolist())

'i think the film would be fine of the movie are the film is so really a good sense of pace.  and was a lot to be a lot of the fact that he is nowhere for the film is not as much as it is so really it would be anything to do that i was surprised that i was surprised that the film would not forget the film to get the way to see a film that will have been really forget the whole time and it is a fine job of the'
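The core of top-k sampling can be isolated from the model. The following is a framework-free sketch (the name `top_k_sample` and the toy distribution are assumptions): keep the `k` most likely tokens, renormalize their probabilities, and sample from that truncated distribution.

```python
import math
import random


def top_k_sample(log_probabilities, k, rng):
    """Samples a token index from the k most likely entries."""
    ranked = sorted(
        range(len(log_probabilities)),
        key=lambda token: log_probabilities[token],
        reverse=True,
    )
    candidates = ranked[:k]
    # `choices` normalizes the weights, which matches taking a softmax
    # over the top-k log-probabilities.
    weights = [math.exp(log_probabilities[token]) for token in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
log_probabilities = [math.log(p) for p in (0.5, 0.3, 0.1, 0.06, 0.04)]
samples = [top_k_sample(log_probabilities, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # [0, 1]
```

Tokens outside the top `k` are never sampled, which trades diversity for fewer low-probability (and often incoherent) continuations.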