# 03. SEQ2SEQ

1. Seq2seq models
2. Tokenization: text
3. Tokenization: code
4. Exercise
5. References

# 1. Seq2seq models

Paper: [Sutskever et al - Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)

- English to French translation
- two recurrent neural networks work together to transform one sequence to another
- an encoder network condenses an input sequence into a vector
- a decoder network eats that vector to produce an output sequence

![](https://pytorch.org/tutorials/_images/seq2seq.png)

Unlike sequence prediction with a single RNN, where every input element corresponds to one output element, a seq2seq model does not tie the output to the length or word order of the input, which makes it well suited to translation between two languages.

Consider the sentence *"Je ne suis pas le chat noir"* --- *"I am not the black cat"*. Most of the words in the input sentence have a direct translation in the output sentence, but are in slightly *different* orders, e.g. chat noir and black cat. Because of the *ne/pas* construction there is also one more word in the input sentence. It would be difficult to produce a correct translation directly from the sequence of input words.

With a seq2seq model, the encoder creates a single vector which, in the ideal case, encodes the "meaning" of the input sequence: a single point in some N-dimensional space of sentences.

More details: [Robertson - NLP From Scratch: Translation with a Sequence to Sequence Network and Attention](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)

#### Encoder

- takes the input sequence
- captures its essential information
- produces hidden states

#### Decoder

- takes hidden states
- generates the output sequence
- operates in an autoregressive manner, producing one element of the output sequence at a time. At each step, it considers the previously generated elements, hidden states, and the input to make predictions for the next element in the output sequence
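
The autoregressive loop can be sketched in plain Python. This is only an illustration of the control flow: `TOY_TRANSLATIONS`, `encode`, `next_token` and `greedy_decode` are hypothetical names, and the "model" is a lookup table standing in for trained encoder and decoder networks.

```python
# Toy stand-in for a trained model. A real encoder would return a vector and a
# real decoder would score every vocabulary entry; here both are lookups.
TOY_TRANSLATIONS = {
    ("Je", "ne", "suis", "pas", "le", "chat", "noir"):
        ["I", "am", "not", "the", "black", "cat"],
}
BOS, EOS = "<s>", "</s>"  # begin-of-sequence / end-of-sequence markers


def encode(source_tokens):
    """Stand-in for the encoder: condense the whole input into one context."""
    return tuple(source_tokens)


def next_token(context, generated):
    """Stand-in for one decoder step: predict the next token from the context
    and from everything generated so far."""
    target = TOY_TRANSLATIONS[context]
    position = len(generated) - 1  # generated[0] is BOS
    return target[position] if position < len(target) else EOS


def greedy_decode(context, max_len=20):
    generated = [BOS]
    for _ in range(max_len):
        token = next_token(context, generated)
        if token == EOS:
            break
        generated.append(token)  # the prediction is fed back in on the next step
    return generated[1:]


source = "Je ne suis pas le chat noir".split()
translation = greedy_decode(encode(source))
print(len(source), "input tokens ->", len(translation), "output tokens")
print(translation)
```

The output length is decided by the decoder (it stops when it emits the end marker), not by the input length.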

#### Transformers

![Alammar - The Illustrated Transformer](http://jalammar.github.io/images/t/transformer_resideual_layer_norm_3.png)


More:
- [Alammar - The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Voita - Sequence to Sequence (seq2seq) and Attention](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html)
- [pytorch.org - Language Modeling with nn.Transformer and torchtext](https://pytorch.org/tutorials/beginner/transformer_tutorial.html#language-modeling-with-nn-transformer-and-torchtext)

# 2. Tokenization: text

- Word-level tokenization
- Char-level tokenization
- Subword tokenization
    - [Byte pair encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) (BPE)
    - [Sentencepiece](https://github.com/google/sentencepiece)

Tokenization $\rightarrow$ Vocabulary $\rightarrow$ Embeddings

![](./res/03_vocabulary.png)
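
The three stages of this pipeline can be sketched with the standard library alone. The embedding values below are random placeholders (in a real model they are learned), and `embedding_dim = 4` is an arbitrary illustrative size.

```python
import random

text = "to be or not to be"

# 1. Tokenization: split the text into tokens (here: words).
tokens = text.split()

# 2. Vocabulary: give every distinct token an integer id.
vocab = {tok: idx for idx, tok in enumerate(dict.fromkeys(tokens))}
ids = [vocab[tok] for tok in tokens]

# 3. Embeddings: a lookup table with one vector per vocabulary entry.
rng = random.Random(0)
embedding_dim = 4  # illustrative size
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab
]
vectors = [embedding_table[i] for i in ids]

print(vocab)
print(ids)
print(len(vectors), "vectors of size", embedding_dim)
```

Repeated tokens share the same id and therefore the same embedding vector.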

#### Word-level tokenization

In [None]:
text = 'Convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors'
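words = text.split(' ')
# give each distinct word one id, numbered in order of first appearance
# (enumerating `words` directly would leave gaps in the ids when a word repeats)
tokens = {word: idx for idx, word in enumerate(dict.fromkeys(words))}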

print(tokens)

Drawbacks?

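- huge vocabulary: every distinct word form needs its own entry
- out-of-vocabulary (OOV) words: words first encountered at test time have no id and have to be mapped to a generic unknown token (UNK)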
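
A minimal sketch of the OOV problem, assuming a whitespace tokenizer and a reserved `<unk>` entry with id 0 (both are choices made for this example):

```python
UNK = "<unk>"


def build_vocab(training_text):
    """Assign ids to the words seen in training; id 0 is reserved for UNK."""
    vocab = {UNK: 0}
    for word in training_text.split():
        vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Map words to ids; anything not seen in training collapses to UNK."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]


vocab = build_vocab("the cat sat on the mat")
seen = encode("the cat sat", vocab)
unseen = encode("the dog sat", vocab)
print(seen)
print(unseen)
```

Every unseen word ends up with the same id, so the model cannot tell "dog" from any other unknown word.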

#### Char-level tokenization

In [None]:
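# one id per distinct character (enumerating `text` directly would map each character to its last position)
tokens = {ch: idx for idx, ch in enumerate(sorted(set(text)))}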

print(tokens)

- simple
- solves the OOV problem
- individual characters carry little meaning, so it is much harder for the model to learn meaningful input representations (compare "a" with "apple"), and performance suffers

#### Subword tokenization

![](https://www.oreilly.com/api/v2/epubs/9781492062561/files/assets/anlp_0401.png)

- Increase the amount of information per token (compared with char-level tokenization).
- Decrease the vocabulary size (compared with word-level tokenization).

Subword tokenizations:

- Byte pair encoding (BPE). See R. Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units,” arXiv, 2015, https://arxiv.org/abs/1508.07909
- WordPiece. See M. Schuster and K. Nakajima, “Japanese and Korean Voice Search,” International Conference on Acoustics, Speech and Signal Processing, IEEE (2012), https://research.google/pubs/japanese-and-korean-voice-search/
- SentencePiece. See T. Kudo and J. Richardson, “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing,” arXiv, 2018, https://arxiv.org/abs/1808.06226
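
To make the idea of subword units concrete, here is a sketch of greedy longest-match-first splitting in the style of WordPiece inference, where a `##` prefix marks a piece that continues a word. The vocabulary is written by hand for this example; real vocabularies are learned from a corpus, and the function name is hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"token", "##ization", "##izer", "##s", "low", "##er"}
for w in ["tokenization", "tokenizers", "lower", "xyz"]:
    print(w, "->", wordpiece_tokenize(w, toy_vocab))
```

Frequent words can stay whole while rare words are assembled from a few reusable pieces, which keeps the vocabulary small without giving up all word-level information.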

Reference:
- https://www.oreilly.com/library/view/applied-natural-language/9781492062561/ch04.html

Visualization of tokenization: [tiktokenizer.vercel.app](https://tiktokenizer.vercel.app/)

GPT-4o, English
![](res/03_to_be_or_not_to_be_en.png)

GPT-4o, Russian
![](res/03_to_be_or_not_to_be_ru.png)

GPT-3.5, Russian
![](res/03_to_be_or_not_to_be_ru_3.5.png)

#### BPE --- Byte pair encoding

#### [Original BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding)

Suppose the data to be encoded is

> aaabdaaabac

The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, such as "Z".
Now there is the following data and replacement table:

> ZabdZabac
> 
> Z=aa

Then the process is repeated with byte pair "ab", replacing it with "Y":

> ZYdZYac
> 
> Y=ab
> 
> Z=aa

The only literal byte pair left occurs only once, and the encoding might stop here. Alternatively, the process could continue with recursive byte pair encoding, replacing "ZY" with "X":

> XdXac
> 
> X=ZY
> 
> Y=ab
> 
> Z=aa

This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.

To decompress the data, simply perform the replacements in the reverse order. 
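
The compression walk-through above can be sketched on Python strings. The function names and the `spare_symbols` parameter are assumptions of this sketch, and the spare symbols must not occur in the input. When two pairs are equally frequent, the sketch may pick a different one than the walk-through does, so the intermediate table can differ even though decompression still restores the input.

```python
from collections import Counter


def bpe_compress(data, spare_symbols="ZYXWV"):
    """Repeatedly replace the most frequent adjacent pair with a spare symbol."""
    table = []  # (symbol, pair) entries in the order they were created
    for symbol in spare_symbols:
        pairs = Counter(zip(data, data[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair occurs more than once: nothing left to gain
        data = data.replace(a + b, symbol)
        table.append((symbol, a + b))
    return data, table


def bpe_decompress(data, table):
    """Undo the replacements in reverse order of creation."""
    for symbol, pair in reversed(table):
        data = data.replace(symbol, pair)
    return data


compressed, table = bpe_compress("aaabdaaabac")
print(compressed, table)
print(bpe_decompress(compressed, table))
```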


#### Implementation

Source: https://gist.github.com/bigsnarfdude/8e99709d5c3d9d58b3831221fcbdaf68

In [None]:
text = "Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception."
tokens = text.encode('utf-8') # raw bytes
tokens = list(map(int, tokens)) # convert to a list of integers in range 0..255 for convenience
print('---')
print(text)
print('length:', len(text))
print('---')
print(tokens)
print('length:', len(tokens))

In [None]:
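def get_stats(ids):
    """Count how often each pair of adjacent ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `idx`."""
    newids = []
    i = 0
    while i < len(ids):
        # a match needs two ids, so the last position can never start one
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids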

# ---
vocab_size = 276 # the desired final vocabulary size
num_merges = vocab_size - 256
ids = list(tokens) # copy so we don't destroy the original list

merges = {} # (int, int) -> int
for i in range(num_merges):
  stats = get_stats(ids)
  pair = max(stats, key=stats.get)
  idx = 256 + i
  print(f"merging {pair} into a new token {idx}")
  ids = merge(ids, pair, idx)
  merges[pair] = idx

In [None]:
print("tokens length:", len(tokens))
print("ids length:", len(ids))
print(f"compression ratio: {len(tokens) / len(ids):.2f}X")
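
A learned merge table can also be used to tokenize new text and to turn ids back into text. The block below is a self-contained sketch in the spirit of the byte-level BPE code above; the function names `train`, `encode` and `decode` are chosen here for illustration, and a tiny training string is used so that the block runs on its own.

```python
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def merge(ids, pair, idx):
    newids, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids


def train(text, num_merges):
    """Learn up to `num_merges` merges on the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (int, int) -> int, kept in training order
    for i in range(num_merges):
        stats = get_stats(ids)
        if not stats:
            break
        pair = max(stats, key=stats.get)
        ids = merge(ids, pair, 256 + i)
        merges[pair] = 256 + i
    return merges


def encode(text, merges):
    """Apply the learned merges to new text, earliest-learned merge first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # nothing left that can be merged
        ids = merge(ids, pair, merges[pair])
    return ids


def decode(ids, merges):
    """Expand every id back into bytes and decode them as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), idx in merges.items():  # training order, so a and b are known
        vocab[idx] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


toy_merges = train("aaabdaaabac", 3)
toy_ids = encode("aaabdaaabac", toy_merges)
print(toy_ids)
print(decode(toy_ids, toy_merges))
```

Because every merged token expands back into raw bytes, any text can be encoded and decoded, including text that contains characters never seen during training.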

#### Forced splits (pre-tokenization)

In [None]:
import regex as re
gpt2pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(re.findall(gpt2pat, "Hello've world123 how's are you!!!?"))

In [None]:
example = """
for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
"""
print(re.findall(gpt2pat, example))
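
The purpose of these forced splits is that BPE then counts and merges pairs only inside each chunk, so a merge can never glue a word to the punctuation that follows it. The sketch below illustrates this with the standard-library `re` module (imported as `std_re` to avoid clashing with the `regex` import above). `SIMPLE_PAT` is only an approximation of the GPT-2 pattern, since the standard library does not support `\p{L}` and `\p{N}`, and `pair_counts` is a hypothetical helper.

```python
import re as std_re
from collections import Counter

# Approximation of the GPT-2 split pattern using stdlib character classes.
SIMPLE_PAT = std_re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"
)


def pair_counts(chunks):
    """Count adjacent byte pairs inside each chunk, never across chunk borders."""
    counts = Counter()
    for chunk in chunks:
        ids = chunk.encode("utf-8")
        counts.update(zip(ids, ids[1:]))
    return counts


sample = "dog. dog! dog?"
chunks = SIMPLE_PAT.findall(sample)
with_splits = pair_counts(chunks)
without_splits = pair_counts([sample])

print(chunks)
print("('g', '.') with splits:   ", with_splits[(ord("g"), ord("."))])
print("('g', '.') without splits:", without_splits[(ord("g"), ord("."))])
```

Without the split, "dog.", "dog!" and "dog?" could each end up as separate tokens; with it, the three occurrences of "dog" contribute to the same pair counts.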

In [None]:
import tiktoken

# GPT-2 (does not merge spaces)
enc = tiktoken.get_encoding("gpt2")
print(enc.encode("    hello world!!!"))

# GPT-4 (merges spaces)
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("    hello world!!!"))

GPT-2 tokenizer
![](./res/03_gpt2.png)

GPT-4 tokenizer
![](./res/03_cl100kbase.png)

#### Special symbols

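Special tokens are extra vocabulary entries that do not come from the text itself. They are added during tokenization to carry structural information, for example `<|endoftext|>` in GPT-2 to separate documents, or `[CLS]`, `[SEP]`, `[PAD]` and `[MASK]` in BERT.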
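
One way to handle special tokens is to cut them out of the text before ordinary tokenization and map each one directly to a reserved id. The sketch below does this on top of a plain byte-level encoding; the id 256 and the function names are arbitrary choices for this example, not the values used by any particular model.

```python
import re as std_re

SPECIAL_TOKENS = {"<|endoftext|>": 256}  # id chosen for this sketch only
ID_TO_SPECIAL = {idx: tok for tok, idx in SPECIAL_TOKENS.items()}
SPLIT_PAT = std_re.compile(
    "(" + "|".join(std_re.escape(tok) for tok in SPECIAL_TOKENS) + ")"
)


def encode(text):
    ids = []
    # the capturing group makes split() return the special tokens as well
    for part in SPLIT_PAT.split(text):
        if part in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[part])  # one reserved id, never split
        else:
            ids.extend(part.encode("utf-8"))  # ordinary text: raw bytes
    return ids


def decode(ids):
    out = bytearray()
    for idx in ids:
        if idx in ID_TO_SPECIAL:
            out += ID_TO_SPECIAL[idx].encode("utf-8")
        else:
            out.append(idx)
    return out.decode("utf-8")


special_ids = encode("hi<|endoftext|>yo")
print(special_ids)
print(decode(special_ids))
```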

#### Sentencepiece

Commonly used because, unlike tiktoken, it can both train BPE tokenizers and run inference with them efficiently. It is used in both the Llama and Mistral series.

[sentencepiece on GitHub](https://github.com/google/sentencepiece).

**The big difference**: sentencepiece runs BPE on the Unicode code points directly. The `character_coverage` option determines which rare code points are left out of the vocabulary; these are either mapped to the UNK token or, if `byte_fallback` is turned on, encoded as UTF-8 and represented by the resulting raw bytes instead.

TLDR:

- tiktoken encodes the text to UTF-8 and then runs BPE on the bytes
- sentencepiece runs BPE on the [code points](https://en.wikipedia.org/wiki/Code_point) and optionally falls back to UTF-8 bytes for rare code points (rarity is determined by the `character_coverage` hyperparameter), which are then translated to byte tokens.
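
The difference between the two starting points, and the idea behind byte fallback, can be sketched without either library. `to_pieces` and `known_chars` are hypothetical names for this example; the `<0xNN>` spelling follows the way sentencepiece names its byte pieces.

```python
sample = "añ"

code_points = [ord(ch) for ch in sample]   # what sentencepiece starts from
utf8_bytes = list(sample.encode("utf-8"))  # what tiktoken starts from
print(code_points)
print(utf8_bytes)


def to_pieces(text, known_chars):
    """Keep covered characters as they are; spell the rest out as UTF-8 bytes."""
    pieces = []
    for ch in text:
        if ch in known_chars:
            pieces.append(ch)
        else:
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


# pretend that only "a" was frequent enough to be covered by the vocabulary
print(to_pieces(sample, known_chars={"a"}))
```

With byte fallback, a rare character costs several tokens but never has to be replaced by UNK.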


In [None]:
# Source: [Andrej Karpathy - Tokenization :(](https://youtu.be/zduSFxRajkE?si=f40nDHXAPYIKWGlS)

import sentencepiece as spm

In [None]:
# write a toy.txt file with some random text
with open('toy.txt', 'w', encoding='utf-8') as f:
  f.write('SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.]) and unigram language model [Kudo.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.')

In [None]:
# train a sentencepiece model on it
# the settings here are (best effort) those used for training Llama 2
import os

options = dict(
  # input spec
  input='toy.txt',
  input_format='text',
  # output spec
  model_prefix='tok400', # output filename prefix
  # algorithm spec
  # BPE alg
  model_type='bpe',
  vocab_size=400,
  # normalization
  normalization_rule_name='identity', # turn off normalization
  remove_extra_whitespaces=False,
  input_sentence_size=200000000, # max number of training sentences
  max_sentence_length=4192, # max number of bytes per sentence
  seed_sentencepiece_size=1000000,
  shuffle_input_sentence=True,
  # rare word treatment
  character_coverage=0.99995,
  byte_fallback=True,
  # merge rules
  split_digits=True,
  split_by_unicode_script=True,
  split_by_whitespace=True,
  split_by_number=True,
  max_sentencepiece_length=16,
  add_dummy_prefix=True,
  allow_whitespace_only_pieces=True,
  # special tokens
  unk_id=0, # the UNK token MUST exist
  bos_id=1, # the others are optional, set to -1 to turn off
  eos_id=2,
  pad_id=-1,
  # systems
  num_threads=os.cpu_count(), # use ~all system resources
)

spm.SentencePieceTrainer.train(**options)

In [None]:
sp = spm.SentencePieceProcessor()
sp.load('tok400.model')
vocab = [[sp.id_to_piece(idx), idx] for idx in range(sp.get_piece_size())]
vocab

#### Integers in GPT2

![](https://www.beren.io/assets/figures/number_tokenization_weirdness_gpt2.png)

- each row here represents 100 integers
- if a square is colored yellow it means a unique token is assigned to that integer
- if it is blue then the integer is coded by a composite set of tokens

![](https://www.beren.io/assets/figures/gpt2_number_composition.png)

#### Huggingface

- [Tokenizers.Quicktour](https://huggingface.co/docs/tokenizers/en/quicktour)
- [Building a tokenizer, block by block](https://huggingface.co/learn/nlp-course/en/chapter6/8)
- [Summary of the tokenizers](https://huggingface.co/docs/transformers/en/tokenizer_summary)

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('google-bert/bert-base-cased')

encoded_input = tokenizer('Do not meddle in the affairs of wizards, for they are subtle and quick to anger.')
print(encoded_input)

The tokenizer returns a dictionary with three important items:
- `input_ids` are the indices corresponding to each token in the sentence.
- `attention_mask` indicates whether a token should be attended to or not.
- `token_type_ids` identifies which sequence a token belongs to when there is more than one sequence.

In [None]:
tokenizer.decode(encoded_input['input_ids'])

#### Llama

- https://arxiv.org/abs/2302.13971
- BPE algorithm (Sennrich et al., 2015), using the implementation from SentencePiece (Kudo and Richardson, 2018)
- all numbers are split into individual digits, and unknown UTF-8 characters are decomposed into bytes as a fallback
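
Digit splitting can be pictured as a pre-tokenization step that isolates every digit before BPE runs, so that no multi-digit number can become a single token. The following is a minimal sketch of that idea with a hypothetical `split_digits` helper, not the SentencePiece implementation.

```python
import re as std_re


def split_digits(text):
    """Return chunks in which every digit stands alone."""
    return std_re.findall(r"\d|\D+", text)


digit_chunks = split_digits("year 2024!")
print(digit_chunks)
```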

#### Llama 2

- https://arxiv.org/abs/2307.09288
- the same tokenizer as Llama 1

# 3. Tokenization: code

- https://pypi.org/project/code-tokenize/
- [CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code](https://arxiv.org/abs/2308.00683)


#### InCoder

- https://arxiv.org/abs/2204.05999
- a byte-level BPE tokenizer (Sennrich et al., 2016; Radford et al., 2019)
- tokens are allowed to extend across whitespace (excluding newline characters) so that common code idioms (e.g., `import numpy as np`) are represented as single tokens in the vocabulary

This substantially improves the tokenizer's efficiency, reducing the total number of tokens required to encode the training corpus by 45% relative to the byte-level BPE tokenizer and vocabulary of GPT-2.

#### VulBERTa

- https://arxiv.org/abs/2205.12424
- comments are removed from the source code of each function using several regular expressions
- the source code is parsed
- the tokens produced by the parser are further processed by the BPE algorithm, modified to take into account the pre-defined tokens
- the vocabulary size is 50000
- the 451 pre-defined tokens are explicitly included in the vocabulary

![](./res/03_vulberta_pre-defined_tokens.png)
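
A rough sketch of such a pipeline for C-like code is given below. It is not the VulBERTa implementation: the regular expressions are simplified (for instance, they do not handle comment markers inside string literals), a small regex lexer stands in for a real parser, and `PREDEFINED` is a made-up subset of pre-defined tokens.

```python
import re as std_re

# line comments and block comments
COMMENT = std_re.compile(r"//[^\n]*|/\*.*?\*/", std_re.DOTALL)
# identifiers, integers, a few two-character operators, then any single symbol
LEXEME = std_re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|&&|\|\||\S")
# hypothetical subset of tokens that must stay whole
PREDEFINED = {"int", "if", "return", "malloc", "==", "(", ")", ";"}


def strip_comments(src):
    return COMMENT.sub("", src)


def lex(src):
    """Remove comments, then split the code into lexemes."""
    return LEXEME.findall(strip_comments(src))


def plan(src):
    """Mark which lexemes are kept as they are and which would go on to BPE."""
    return [(lexeme, lexeme in PREDEFINED) for lexeme in lex(src)]


code = "if (x == 10) return y; // done"
for lexeme, keep_whole in plan(code):
    print(repr(lexeme), "pre-defined" if keep_whole else "-> BPE")
```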

#### Code Llama

- https://arxiv.org/abs/2308.12950
- data is tokenized via BPE (Sennrich et al. (2016)), employing the same tokenizer as Llama and Llama 2

#### StarCoder

- https://arxiv.org/abs/2305.06161
- the Hugging Face Tokenizers library (Moi et al., 2022) is used to train a byte-level BPE tokenizer with a vocabulary size of 49,152 tokens, including the sentinel tokens
- the pre-tokenization step includes a digit-splitter and the regex splitter from the GPT-2 pre-tokenizer

#### DeepSeek Coder

- https://arxiv.org/abs/2401.14196
- the Hugging Face Tokenizers library is used to train BPE tokenizers
- vocabulary size is 32,000

#### StarCoder 2

- https://arxiv.org/abs/2402.19173
- follows the procedure of StarCoderBase and trains a byte-level BPE tokenizer
- vocabulary size is 49,152 tokens, including the sentinel tokens
- the pre-tokenization step includes a digit-splitter and the regex splitter from the GPT-2 pre-tokenizer

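#### Problems caused by tokenization

- character-level tasks (counting the characters in a word, writing a word in reverse), because the model sees tokens rather than individual characters
- arithmetic, because numbers are split into tokens inconsistently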

# 4. Exercise

Assume the input is a snippet of Python 3.11 code given as a string.
You need to:
1. define the rules for splitting the string into substrings before tokenization
2. implement these rules in Python
3. write a program that tokenizes the snippet with SentencePiece or BPE while respecting these rules

For example, indents and keywords should become separate tokens.

# 5. References

- [Andrej Karpathy - Tokenization :(](https://youtu.be/zduSFxRajkE?si=f40nDHXAPYIKWGlS)
- [Andrej Karpathy - Tokenization :( Colab](https://colab.research.google.com/drive/1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L)
- [Andrej Karpathy - minbpe](https://github.com/karpathy/minbpe)
- [HF playground](https://huggingface.co/spaces/Xenova/the-tokenizer-playground)
- [Exploring BERT's Vocabulary](https://juditacs.github.io/2019/02/19/bert-tokenization-stats.html)
- [A Programmer's Introduction to Unicode](https://www.reedbeta.com/blog/programmers-intro-to-unicode/)
- [Huggingface: Byte-Pair Encoding tokenization](https://huggingface.co/learn/nlp-course/en/chapter6/5)
- [Patel Arasanipalai - Applied Natural Language Processing in the Enterprise](https://www.oreilly.com/library/view/applied-natural-language/9781492062561/ch04.html)
- [Willison - Understanding GPT tokenizers](https://simonwillison.net/2023/Jun/8/gpt-tokenizers/)
- [Millidge - Integer tokenization is insane](https://www.beren.io/2023-02-04-Integer-tokenization-is-insane/)
- [gpt-tokenizer](https://github.com/niieani/gpt-tokenizer)
- [Free GPT Tokenizer](https://koala.sh/tools/free-gpt-tokenizer)
- [Grudzień - GPT Tokens Explained: what they are and how they work](https://www.quickchat.ai/post/tokens-entropy-question)
- [Rumbelow - SolidGoldMagikarp (plus, prompt generation)](https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation)
- [MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers](https://arxiv.org/abs/2305.07185)