# 3. ELMo Implementation

### About this notebook

This notebook was used in the 50.039 Deep Learning course at the Singapore University of Technology and Design.

**Author:** Matthieu DE MARI (matthieu_demari@sutd.edu.sg)

**Version:** 1.1 (29/08/2023)

This notebook discusses a possible implementation of the ELMo language model architecture, which is based on the paper from Peters et al., "Deep contextualized word representations", 2018.

https://arxiv.org/abs/1802.05365

This code is not meant to be run as a complete pipeline: the training code (dataset, loss, trainer functions, etc.) is not provided.

It is for illustrative purposes only, as we do not expect students to train such heavy language models (costly in time and resources). This notebook also shows how we may design several nn.Module classes and combine them into a more sophisticated architecture.

**Requirements:**
- Python 3 (tested on v3.11.4)
- Matplotlib (tested on v3.7.2)
- Numpy (tested on v1.25.2)
- Torch (tested on v2.0.1+cu118)
- We also strongly recommend setting up CUDA on your machine!

### Imports

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import functools
import matplotlib.pyplot as plt
CUDA = torch.cuda.is_available()
device = torch.device("cuda" if CUDA else "cpu")

### Character-level 1D-CNNs

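In this implementation of the character-level convolutions, seven 1D convolutions with kernel widths 1 to 7 are applied in parallel to the character embeddings (implemented with Conv2d and kernels of height 1, so that each word is convolved independently). Each output is then max-pooled over the character positions. Concatenating the pooled outputs (2 + 2 + 4 + 8 + 16 + 32 + 64 channels) gives a 128-length vector for representing each word.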

In [2]:
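# Placeholder values (assumptions, not set in the original notebook):
# adjust them to your character vocabulary and embedding size.
CHAR_VOCAB_SIZE = 262
CHAR_EMBED_DIM = 16

class CharConv(nn.Module):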
    
    def __init__(self):
        super().__init__()
        
        # Embedding layer to start with
        self.char_embedding = nn.Embedding(CHAR_VOCAB_SIZE, CHAR_EMBED_DIM)
        
        # Some convolution layers
        self.conv1 = nn.Conv2d(CHAR_EMBED_DIM, 2, 1)
        self.conv2 = nn.Conv2d(CHAR_EMBED_DIM, 2, (1, 2))
        self.conv3 = nn.Conv2d(CHAR_EMBED_DIM, 4, (1, 3))
        self.conv4 = nn.Conv2d(CHAR_EMBED_DIM, 8, (1, 4))
        self.conv5 = nn.Conv2d(CHAR_EMBED_DIM, 16, (1, 5))
        self.conv6 = nn.Conv2d(CHAR_EMBED_DIM, 32, (1, 6))
        self.conv7 = nn.Conv2d(CHAR_EMBED_DIM, 64, (1, 7))
        self.convs = [self.conv1, self.conv2, self.conv3, self.conv4,
                      self.conv5, self.conv6, self.conv7,]
        
    
    def forward(self, x):
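        # Character-level convolution
        # x holds character indices, shape (n_batch, seq_len, word_len),
        # with word_len >= 7 so that the widest kernel fits
        # Starts with embeddings and some reshaping, which gives
        # (n_batch, CHAR_EMBED_DIM, seq_len, word_len)
        x = self.char_embedding(x).permute(0,3,1,2)
        # Go through all convolution layers (in parallel, on the same input)
        x = [conv(x) for conv in self.convs]
        # Max Pooling over the character axis
        x = [F.max_pool2d(x_c, kernel_size = (1, x_c.shape[3])) for x_c in x]
        # Concatenate/Squeeze into final vector
        # Final tensor will be of size (n_batch, concat_length, seq_len),
        # with concat_length = 128 here
        x = [torch.squeeze(x_p, dim = 3) for x_p in x]
        x = torch.hstack(x) 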
        return x
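
The remark about using Conv2d with a height-1 kernel can be checked numerically. The sketch below is not part of the original notebook: the tensor sizes and variable names are arbitrary assumptions chosen for illustration. It copies the weights of a Conv2d with kernel (1, k) into a Conv1d with kernel k and verifies that both produce the same features when the Conv1d is applied to each word separately.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Illustrative sizes (assumptions, not taken from the notebook)
batch, seq_len, word_len, embed_dim, k, out_ch = 2, 5, 10, 16, 3, 4

conv2d = nn.Conv2d(embed_dim, out_ch, (1, k))
conv1d = nn.Conv1d(embed_dim, out_ch, k)
with torch.no_grad():
    # Conv2d weight is (out_ch, embed_dim, 1, k): drop the height-1 axis
    conv1d.weight.copy_(conv2d.weight.squeeze(2))
    conv1d.bias.copy_(conv2d.bias)

# Embedded characters, laid out as in CharConv after the permute
x = torch.randn(batch, embed_dim, seq_len, word_len)

# Conv2d path: the height-1 kernel never mixes different words
out2d = conv2d(x)  # (batch, out_ch, seq_len, word_len - k + 1)

# Conv1d path: fold the word axis into the batch axis, convolve, unfold
x_words = x.permute(0, 2, 1, 3).reshape(batch * seq_len, embed_dim, word_len)
out1d = conv1d(x_words).reshape(batch, seq_len, out_ch, -1).permute(0, 2, 1, 3)

same = torch.allclose(out2d, out1d, atol=1e-5)

# Max-pooling over the character axis leaves one value per filter and word
pooled = F.max_pool2d(out2d, kernel_size=(1, out2d.shape[3])).squeeze(3)
print(same, pooled.shape)
```

Because the kernel has height 1, the word axis is simply carried along, which is why a single Conv2d call can process every word of every sentence in the batch at once.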

### Bi-Directional LSTMs

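A possible implementation of a Bi-LSTM is given below. It uses two separate LSTMs, one reading the sequence from left to right and one reading it from right to left. Note that we also show how to stack a second pair of forward and backward LSTMs on top of the first, and more could be added if needed.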

In [3]:
class BiLSTM(nn.Module):
    
    def __init__(self):
        
        super().__init__()
        # To build a bi-directional LSTM, we will need a few LSTM layers
        self.lstm_f1 = nn.LSTM(128, 128)
        self.lstm_r1 = nn.LSTM(128, 128)
        self.dropout = nn.Dropout(0.1)
        self.proj = nn.Linear(128, 64, bias = False)
        self.lstm_f2 = nn.LSTM(64, 128)
        self.lstm_r2 = nn.LSTM(64, 128)
    
    def forward(self, x):
        # Note: we expect word embeddings of size 128 (as the result of the
        # previous character-level CNN network!)
        # input shape is then: (seq_len, batch_size, 128)
        
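        # 1st layer - Forward LSTM + Dropout
        x_f = x
        o_f1, (h_f1, __) = self.lstm_f1(x_f)
        o_f1 = self.dropout(o_f1)
        
        # 1st layer - Backward LSTM + Dropout
        # (the sequence is flipped along the time axis, so o_r1 is
        # in reversed time order, like x_r)
        x_r = x.flip(dims=[0])
        o_r1, (h_r1, __) = self.lstm_r1(x_r)
        o_r1 = self.dropout(o_r1)
        # Final hidden states of both directions: (2, batch_size, 128)
        h1 = torch.stack((h_f1, h_r1)).squeeze(dim = 1)
        
        # Assemble: residual connection, then projection down to 64
        # (the same projection is shared by both directions)
        x2_f = self.proj(o_f1 + x_f)
        x2_r = self.proj(o_r1 + x_r)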
        
        # If we want, we can repeat the bi-directional LSTM
        # a second time (or more, if needed), as such.
        _, (h_f2, __) = self.lstm_f2(x2_f)
        _, (h_r2, __) = self.lstm_r2(x2_r)
        h2 = torch.stack((h_f2, h_r2)).squeeze(dim = 1)
        
        # Return the final hidden states of both layers,
        # each of shape (2, batch_size, 128)
        return h1, h2
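
As a sanity check on the "two LSTMs plus a flip" construction above, the sketch below compares it with PyTorch's built-in `nn.LSTM(..., bidirectional=True)`. This comparison is not part of the original notebook, and the sizes are arbitrary assumptions. The weights of a forward and a backward LSTM are copied into a single bidirectional LSTM, and the outputs are checked for agreement.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Illustrative sizes (assumptions, not taken from the notebook)
seq_len, batch, dim = 6, 3, 8

lstm_f = nn.LSTM(dim, dim)   # reads the sequence left to right
lstm_r = nn.LSTM(dim, dim)   # will be fed the flipped sequence
bi = nn.LSTM(dim, dim, bidirectional=True)

# Copy the two unidirectional LSTMs into the bidirectional one
with torch.no_grad():
    for name in ("weight_ih_l0", "weight_hh_l0", "bias_ih_l0", "bias_hh_l0"):
        getattr(bi, name).copy_(getattr(lstm_f, name))
        getattr(bi, name + "_reverse").copy_(getattr(lstm_r, name))

x = torch.randn(seq_len, batch, dim)
o_f, (h_f, _) = lstm_f(x)
o_r, (h_r, _) = lstm_r(x.flip(dims=[0]))
o_bi, (h_bi, _) = bi(x)

# Forward half of the bidirectional output matches the forward LSTM
fwd_match = torch.allclose(o_bi[..., :dim], o_f, atol=1e-5)
# Backward half matches the backward LSTM once its output is flipped back
bwd_match = torch.allclose(o_bi[..., dim:], o_r.flip(dims=[0]), atol=1e-5)
# Final hidden states: index 0 is forward, index 1 is backward
h_match = torch.allclose(h_bi, torch.cat((h_f, h_r)), atol=1e-5)
print(fwd_match, bwd_match, h_match)
```

Note that the backward outputs only line up with the forward ones after being flipped back along the time axis. Keeping the two directions as separate modules, as in the class above, makes it easy to insert the residual connection and projection between layers for each direction.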

### Full model assembly

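We now assemble the previous blocks together, with a highway layer placed between the character-level CNN and the Bi-LSTM. The highway layer uses a learned gate to decide, for each feature, how much of the transformed input and how much of the original input to pass on.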

We will not train this model, as it would require a massive dataset and heavy computation power.

This notebook only serves to show what the implementation of said model could look like.

In [4]:
class BiLangModel(nn.Module):

    def __init__(self, char_cnn, bi_lstm):
        
        super().__init__()
        # Blocks to be used for the highway connection
        self.highway = nn.Linear(128, 128)
        self.transform = nn.Linear(128, 128)
        
        # Character level CNN model
        self.char_cnn = char_cnn
        
        # Bi-LSTM model
        self.bi_lstm = bi_lstm
        
    def forward(self, x):
        
        # 1. Character-level convolution
        x = self.char_cnn(x).permute(2, 0, 1)
        
        # 2. Highway layer: a learned gate (t_gate) mixes the
        # transformed input h with the untouched input x
        h = self.highway(x)
        t_gate = torch.sigmoid(self.transform(x))
        c_gate = 1 - t_gate
        x_ = h * t_gate + x * c_gate
        
        # 3. Bi-LSTM
        x1, x2 = self.bi_lstm(x_)
        
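        # Feel free to play around and have a look
        # at the x, x1 and x2 vectors!
        # x: word embeddings from the CNN, (seq_len, batch_size, 128)
        # x1, x2: final hidden states of Bi-LSTM layers 1 and 2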
        return x, x1, x2
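
The highway step in the forward pass above can also be isolated as a small module of its own. The sketch below is one possible way to do so, under stated assumptions: the `Highway` class name and the `gate_bias` parameter are hypothetical and do not appear in the notebook. Initialising the transform gate bias to a negative value is a common choice for highway networks, so that the layer starts close to the identity. As in the notebook, the transformed branch is kept linear.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Hypothetical standalone version of the highway step in BiLangModel."""

    def __init__(self, dim, gate_bias=-2.0):
        super().__init__()
        self.highway = nn.Linear(dim, dim)
        self.transform = nn.Linear(dim, dim)
        # Assumption: a negative bias makes the gate mostly closed at
        # initialisation, so the layer initially carries x through.
        nn.init.constant_(self.transform.bias, gate_bias)

    def forward(self, x):
        h = self.highway(x)                         # transformed input
        t_gate = torch.sigmoid(self.transform(x))   # transform gate in (0, 1)
        c_gate = 1 - t_gate                         # carry gate
        return h * t_gate + x * c_gate

torch.manual_seed(0)
x = torch.randn(7, 2, 128)   # (seq_len, batch_size, 128), as in the notebook

# With a strongly negative gate bias, the layer is almost the identity
carry_layer = Highway(128, gate_bias=-20.0)
y = carry_layer(x)
print(y.shape, torch.allclose(y, x, atol=1e-4))
```

With the gate nearly closed the output is practically `x` itself. During training the gate can open wherever the transformed features turn out to be useful.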