# Lesson 9: Introduction to Sequence Modelling

Lots of data comes as sequences, and text is one of the most common examples, so that is what this lesson works with.

## Embeddings: Working With Tokens

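An embedding maps each token id to a learned vector, so the model works with continuous inputs rather than arbitrary integer ids.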

Reference https://deeplearning.neuromatch.io/tutorials/W2D5_TimeSeriesAndNaturalLanguageProcessing/student/W2D5_Tutorial1.html and maybe link Lyle's 'Embeddings Rule' video?


In [None]:
import torch
from torch import nn

## Modelling Sequences: Language Models

A language model is trained to predict the next token given the tokens that came before it.
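
As a rough illustration of that objective, the sketch below turns one token sequence into (input, target) pairs by shifting it one position, then scores some logits with cross-entropy. The ids are the 'What a nice flooble!' encoding used later in this lesson; the random logits are an assumption standing in for a real model's output.

```python
import torch
import torch.nn.functional as F

vocab_size = 30522  # size of the bert-base-uncased vocabulary used in this lesson
ids = torch.tensor([101, 2054, 1037, 3835, 13109, 9541, 3468, 999, 102])

inputs = ids[:-1]   # the model sees every token except the last
targets = ids[1:]   # and at each position must predict the token that follows

# Stand-in for model(inputs): one row of logits per input position (assumption).
torch.manual_seed(0)
logits = torch.randn(len(inputs), vocab_size)

# Mean negative log-likelihood of the true next tokens.
loss = F.cross_entropy(logits, targets)
print(inputs.shape, targets.shape, loss.item())
```

Because every position supplies its own target, a single sequence of n tokens gives n-1 training examples in one forward pass.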

## Tokenizing Text

How do we split text into tokens? We often talk about 'words' as the unit of text, but a vocabulary with one token for every word we might encounter would be massive (on the order of 1M entries) and filled mostly with obscure or misspelled words. On the other hand, using individual letters would take far more tokens to represent the same sentence.

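One solution is subword tokenization, as used by WordPiece and similar schemes: common words get their own token, while rarer words are split into smaller pieces.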

Tokenizers: https://huggingface.co/docs/tokenizers/index
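
To make the subword idea concrete, here is a minimal sketch of greedy longest-match-first splitting in the style of WordPiece. It only covers how a word is split given an existing vocabulary, not how the vocabulary is learned, and `toy_vocab` and `wordpiece_split` are hypothetical names invented for this example.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # '##' marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter piece
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, far smaller than a real one.
toy_vocab = {"what", "a", "nice", "fl", "##oo", "##ble", "!"}

print(wordpiece_split("nice", toy_vocab))     # ['nice']
print(wordpiece_split("flooble", toy_vocab))  # ['fl', '##oo', '##ble']
print(wordpiece_split("xyz", toy_vocab))      # ['[UNK]']
```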



In [None]:
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

In [None]:
encoding = tokenizer.encode('What a nice flooble!')
print('Encoding:', encoding)

ids = encoding.ids
print('ids:', ids)

Encoding: Encoding(num_tokens=9, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
ids: [101, 2054, 1037, 3835, 13109, 9541, 3468, 999, 102]


In [None]:
for t in ids:
    print(f'{t}:{tokenizer.decode([t])}')

101:
2054:what
1037:a
3835:nice
13109:fl
9541:##oo
3468:##ble
999:!
102:


We have special tokens for the start (101) and end (102) of the sequence, tokens for symbols like '!' (999), and separate '##'-prefixed tokens for strings like 'oo' or 'ble' when they don't occur at the start of a word. The special tokens print as empty strings above because `decode` skips them by default. Common words get their own token, while uncommon ones like 'flooble' are broken down into components. The full vocabulary size of this tokenizer is about 30,000 tokens:

In [None]:
vocab_size = len(tokenizer.get_vocab()) # Number of entries in the token -> id mapping
vocab_size

30522

## An Embedding Layer

In [None]:
emb_dim = 256
emb_layer = nn.Embedding(vocab_size, emb_dim)
emb_layer

Embedding(30522, 256)

In [None]:
emb_layer(torch.tensor(ids)).shape # Passing our tokens through

torch.Size([9, 256])
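
An embedding layer is essentially a lookup table: each token id selects one row of a weight matrix. The sketch below checks this on a deliberately tiny layer (the sizes and `toy_` names are assumptions chosen for readability), comparing the layer's output with plain row indexing and with a one-hot matrix product.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
toy_vocab_size, toy_emb_dim = 10, 4  # small hypothetical sizes
toy_emb = nn.Embedding(toy_vocab_size, toy_emb_dim)
toy_ids = torch.tensor([3, 1, 3])

looked_up = toy_emb(toy_ids)          # (3, 4): one vector per id
indexed = toy_emb.weight[toy_ids]     # the same rows, picked by plain indexing

# A one-hot row times the weight matrix selects the same vector.
one_hot = F.one_hot(toy_ids, toy_vocab_size).float()
via_matmul = one_hot @ toy_emb.weight

print(torch.equal(looked_up, indexed))
print(torch.allclose(looked_up, via_matmul))
```

Note that the repeated id 3 gets the same vector both times: at this stage nothing depends on where in the sequence a token appears.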

## A Simple MLP

Similar to Karpathy's makemore demo (TODO link)

How do we work with sequences of different lengths? Padding + truncation seem non-ideal...
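
For reference, a minimal sketch of padding and truncation is shown below. `pad_or_truncate` is a hypothetical helper written for this example, and `pad_id=0` is an assumption (it happens to be the `[PAD]` id in `bert-base-uncased`). The returned mask records which positions hold real tokens.

```python
import torch

def pad_or_truncate(sequences, max_len, pad_id=0):
    """Return a (batch, max_len) id tensor and a boolean mask marking real tokens."""
    batch = torch.full((len(sequences), max_len), pad_id, dtype=torch.long)
    mask = torch.zeros(len(sequences), max_len, dtype=torch.bool)
    for i, seq in enumerate(sequences):
        seq = seq[:max_len]  # truncate sequences that are too long
        batch[i, :len(seq)] = torch.tensor(seq, dtype=torch.long)
        mask[i, :len(seq)] = True  # the remaining positions stay as padding
    return batch, mask

toy_sequences = [[101, 2054, 102], [101, 2054, 1037, 3835, 999, 102]]
padded, mask = pad_or_truncate(toy_sequences, max_len=4)
print(padded)
print(mask)
```

The short sequence wastes a slot on padding and the long one loses its last two tokens, which is the trade-off the question above is pointing at.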

In [None]:
batch_size = 32
seq_len = 64
batch_ids = torch.randint(vocab_size, (batch_size, seq_len)) # Random token ids as a stand-in for real data
batch_ids.shape

torch.Size([32, 64])

In [None]:
emb_layer(batch_ids).shape

torch.Size([32, 64, 256])

In [None]:
# A minimal model (output sizes shown in comments)
model = nn.Sequential(
    nn.Embedding(vocab_size, emb_dim), # (batch_size, seq_len, emb_dim)
    nn.Flatten(), # (batch_size, seq_len*emb_dim)
    nn.Linear(emb_dim*seq_len, 64), # (batch_size, 64)
    nn.ReLU(), # (batch_size, 64)
    nn.Linear(64, 2), # (batch_size, 2)
)
model(batch_ids).shape

torch.Size([32, 2])

Q: What happens when word position changes?
Q: Would this work on different length sequences?
Q: think of more Qs
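
One way to probe the second question is simply to try it. The sketch below rebuilds a smaller version of the flatten-then-linear model (the sizes and `toy_` names are hypothetical, chosen for speed) and feeds it a batch one token longer than it was built for.

```python
import torch
from torch import nn

toy_vocab_size, toy_emb_dim, toy_seq_len = 100, 8, 16  # hypothetical small sizes
toy_model = nn.Sequential(
    nn.Embedding(toy_vocab_size, toy_emb_dim),
    nn.Flatten(),
    nn.Linear(toy_emb_dim * toy_seq_len, 2),  # input width is tied to toy_seq_len
)

same_len = torch.randint(toy_vocab_size, (4, toy_seq_len))
print(toy_model(same_len).shape)  # works: the flattened width matches the linear layer

longer = torch.randint(toy_vocab_size, (4, toy_seq_len + 1))
try:
    toy_model(longer)
    failed = False
except RuntimeError as err:
    failed = True
    print("Longer input failed:", err)
```

The first linear layer has one weight per (position, embedding feature) pair, so the sequence length is baked into the model.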

## Recurrent Neural Networks and LSTMs

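An RNN processes a sequence one step at a time: at each step it combines the current input with a hidden state carried over from the previous step, and the same weights are reused at every step.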

https://pytorch.org/docs/stable/generated/torch.nn.RNN.html

![an unrolled RNN](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png)

Maybe demo?

Brief hand-wave explanation of LSTMs and link ULMFIT and co for the curious

Great blog by colah https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Karpathy on RNN effectiveness: http://karpathy.github.io/2015/05/21/rnn-effectiveness/



In [None]:
# Create the RNN
input_size = 10 # Number of features in the input (embedding dim)
hidden_size = 20 # Number of features in the hidden state h
num_layers = 1 # Set to 2 for a 'stacked' RNN with 2 layers
rnn = nn.RNN(input_size, hidden_size, num_layers) # The model

# Run some dummy data through
# Create the model with batch_first=True if you'd like the batch dimension to come first
batch_size = 8
input_length = 5
x = torch.randn(input_length, batch_size, input_size) # (seq_len, batch_size, input_size)
h0 = torch.randn(num_layers, batch_size, hidden_size)
output, hn = rnn(x, h0)

# Check the output shapes
output.shape, hn.shape

(torch.Size([5, 8, 20]), torch.Size([1, 8, 20]))
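
The recurrence that a single-layer `nn.RNN` computes can be written out by hand as `h_t = tanh(x_t W_ih^T + b_ih + h_(t-1) W_hh^T + b_hh)`. The sketch below unrolls that loop using the layer's own weights and compares the result with the built-in forward pass; the `toy_rnn` and `manual_` names are introduced only for this example.

```python
import torch
from torch import nn

torch.manual_seed(0)
input_size, hidden_size, seq_len, batch_size = 10, 20, 5, 8
toy_rnn = nn.RNN(input_size, hidden_size, num_layers=1)
x = torch.randn(seq_len, batch_size, input_size)

W_ih, W_hh = toy_rnn.weight_ih_l0, toy_rnn.weight_hh_l0
b_ih, b_hh = toy_rnn.bias_ih_l0, toy_rnn.bias_hh_l0

with torch.no_grad():
    h = torch.zeros(batch_size, hidden_size)  # the default initial hidden state
    manual_steps = []
    for t in range(seq_len):
        # The same weights are applied at every time step.
        h = torch.tanh(x[t] @ W_ih.T + b_ih + h @ W_hh.T + b_hh)
        manual_steps.append(h)
    manual_output = torch.stack(manual_steps)  # (seq_len, batch_size, hidden_size)
    reference_output, reference_hn = toy_rnn(x)

print(torch.allclose(manual_output, reference_output, atol=1e-5))
print(torch.allclose(h, reference_hn[0], atol=1e-5))
```

The `output` tensor holds the hidden state at every step, while `hn` is just the final one.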

In [None]:
class MyRNNClassifier(nn.Module):
    def __init__(self, input_size=10, hidden_size=20, num_layers=2):
        super().__init__()
        self.emb_layer = nn.Embedding(vocab_size, input_size)
        self.rnn = nn.RNN(input_size, hidden_size, num_layers=num_layers, batch_first=True) # batch_first=True so inputs are (batch_size, seq_len, input_size)
        self.mlp = nn.Linear(hidden_size, 2)
        
    def forward(self, x):
        x = self.emb_layer(x) # To embeddings (batch_size, seq_len, input_size)
        net_output, h = self.rnn(x) # Through the RNN (batch_size, seq_len, hidden_size)
        averaged_output = net_output.mean(dim=1) # Take the mean over the sequence dimension ('mean pooling')
        result = self.mlp(averaged_output) # Through the linear layer to get 2 outputs (assuming binary classification)
        return result
    


In [None]:
net = MyRNNClassifier()
net(batch_ids).shape

torch.Size([32, 2])

In [None]:
sum([p.numel() for p in net.parameters()])

306742

In [None]:
[p.shape for p in net.parameters()]

[torch.Size([30522, 10]),
 torch.Size([20, 10]),
 torch.Size([20, 20]),
 torch.Size([20]),
 torch.Size([20]),
 torch.Size([20, 20]),
 torch.Size([20, 20]),
 torch.Size([20]),
 torch.Size([20]),
 torch.Size([2, 20]),
 torch.Size([2])]
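
The total above can be reproduced by hand from those shapes. The arithmetic below is a quick check using the default sizes from `MyRNNClassifier`; `rnn_layer_params` is a helper name made up for this example.

```python
vocab_size, input_size, hidden_size, n_classes = 30522, 10, 20, 2

def rnn_layer_params(in_features, hidden):
    # input-to-hidden weights + hidden-to-hidden weights + two bias vectors
    return hidden * in_features + hidden * hidden + 2 * hidden

embedding_params = vocab_size * input_size                    # 305220
layer1_params = rnn_layer_params(input_size, hidden_size)     # 640
layer2_params = rnn_layer_params(hidden_size, hidden_size)    # 840 (its input is layer 1's hidden state)
head_params = hidden_size * n_classes + n_classes             # 42

total_params = embedding_params + layer1_params + layer2_params + head_params
print(total_params)  # 306742
```

Almost all of the parameters sit in the embedding table; the recurrent layers themselves are tiny by comparison.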

In [None]:
# TODO try on some data

**LSTMs**

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

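LSTMs add a separate cell state alongside the hidden state, with gates that control what is written to it, kept in it, and read from it. This helps carry information across longer sequences.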
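
A small sketch of what changes at the API level when moving from `nn.RNN` to `nn.LSTM`: the layer returns a cell state as well as a hidden state, and its weight matrices are four times as tall because the gate weights are stacked together. The sizes match the earlier RNN demo, and the `toy_` names are introduced only for this example.

```python
import torch
from torch import nn

input_size, hidden_size = 10, 20
toy_lstm = nn.LSTM(input_size, hidden_size, num_layers=1)
toy_rnn = nn.RNN(input_size, hidden_size, num_layers=1)

x = torch.randn(5, 8, input_size)       # (seq_len, batch_size, input_size)
output, (hn, cn) = toy_lstm(x)          # hidden state hn AND cell state cn

print(output.shape, hn.shape, cn.shape)
print(toy_lstm.weight_ih_l0.shape)      # (4*hidden_size, input_size)

lstm_params = sum(p.numel() for p in toy_lstm.parameters())
rnn_params = sum(p.numel() for p in toy_rnn.parameters())
print(lstm_params, rnn_params)          # the LSTM layer has 4x the parameters
```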

In [None]:
class MyLSTMClassifier(nn.Module):
    def __init__(self, input_size=10, hidden_size=20, num_layers=2):
        super().__init__()
        self.emb_layer = nn.Embedding(vocab_size, input_size)
        self.rnn = nn.LSTM(input_size, hidden_size, num_layers=num_layers, batch_first=True) # batch_first=True so inputs are (batch_size, seq_len, input_size)
        self.mlp = nn.Linear(hidden_size, 2)
        
    def forward(self, x):
        x = self.emb_layer(x) # To embeddings (batch_size, seq_len, input_size)
        net_output, h = self.rnn(x) # Through the LSTM (batch_size, seq_len, hidden_size); h is the (hidden, cell) state tuple
        averaged_output = net_output.mean(dim=1) # Take the mean over the sequence dimension ('mean pooling')
        result = self.mlp(averaged_output) # Through the linear layer to get 2 outputs (assuming binary classification)
        return result
    


In [None]:
net = MyLSTMClassifier()
net(batch_ids).shape

torch.Size([32, 2])

In [None]:
sum([p.numel() for p in net.parameters()])

311182

## Using Learned Representations

Do review classification or something using a learned embedding combined with an RNN? Or model tunes and then classify into type/key/mode?

Yeah tunes will be good. LM objective first, then re-training
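
The 'LM objective first, then re-training' plan could look roughly like the sketch below. Everything here is hypothetical: `TinySequenceModel`, the sizes, and the choice to freeze the body are assumptions for illustration, and the actual training loops are left out.

```python
import torch
from torch import nn

class TinySequenceModel(nn.Module):
    """Hypothetical model: a shared body (embedding + LSTM) plus a swappable head."""
    def __init__(self, vocab_size, emb_dim=16, hidden_size=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)  # LM head: one logit per token

    def forward(self, ids, pool=False):
        out, _ = self.rnn(self.emb(ids))   # (batch, seq_len, hidden_size)
        if pool:
            out = out.mean(dim=1)          # (batch, hidden_size) for classification
        return self.head(out)

toy_vocab_size, n_classes = 50, 3
seq_model = TinySequenceModel(toy_vocab_size)
toy_ids = torch.randint(toy_vocab_size, (4, 12))
lm_logits = seq_model(toy_ids)             # (4, 12, 50): next-token logits per position

# ... language-model training would happen here ...

# Keep the learned body, swap in a fresh classification head, and freeze the body.
seq_model.head = nn.Linear(32, n_classes)
for p in list(seq_model.emb.parameters()) + list(seq_model.rnn.parameters()):
    p.requires_grad = False
class_logits = seq_model(toy_ids, pool=True)   # (4, 3)

trainable = [name for name, p in seq_model.named_parameters() if p.requires_grad]
print(lm_logits.shape, class_logits.shape, trainable)
```

Freezing the body is only one option; fine-tuning the whole model with a lower learning rate is another common choice.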

In [None]:
# TODO try on some data (LM first then new classification head)

In [None]:
# TODO talk about efficiency of training vs sampling
# TODO demo different sampling approaches
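
As a starting point for the sampling demo, here is a minimal sketch of three common ways to pick the next token from a model's logits: greedy, temperature, and top-k. `sample_next_token` is a hypothetical helper, and treating `temperature=0` as greedy decoding is a convention assumed here.

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Pick a token id from a 1-D tensor of logits."""
    if temperature == 0:
        return logits.argmax().item()  # greedy: always the most likely token
    logits = logits / temperature      # <1 sharpens the distribution, >1 flattens it
    if top_k is not None:
        kth_largest = torch.topk(logits, top_k).values[-1]
        # Rule out everything below the k-th largest logit.
        logits = logits.masked_fill(logits < kth_largest, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

torch.manual_seed(0)
toy_logits = torch.tensor([0.1, 2.0, -1.0, 1.5])  # made-up logits for a 4-token vocabulary

greedy_id = sample_next_token(toy_logits, temperature=0)
top1_id = sample_next_token(toy_logits, top_k=1)
top2_ids = {sample_next_token(toy_logits, top_k=2) for _ in range(50)}
print(greedy_id, top1_id, top2_ids)
```

Sampling has to run one token at a time, feeding each choice back in as input, whereas training can score every position of a sequence in a single pass.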
