# Subword Tokenization

The [transformers](https://github.com/huggingface/transformers) library from Hugging Face implements several subword tokenization algorithms; which one is used depends on the associated model. We will not go deep into the library here, but simply look at how different pretrained tokenizers split our data.

In [13]:
from transformers import pipeline

device = -1  # -1 runs the pipeline on CPU; use a GPU index such as 0 to run on GPU

Let's start with the GPT-2 model (which we'll see in the second module). GPT-2 uses [byte-level BPE](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf) tokenization.

In [14]:
text = "At first, historical linguistics served as the cornerstone of comparative linguistics primarily as a tool for linguistic reconstruction.[5] Scholars were concerned chiefly with establishing language families and reconstructing prehistoric proto-languages, using the comparative method and internal reconstruction."
text

'At first, historical linguistics served as the cornerstone of comparative linguistics primarily as a tool for linguistic reconstruction.[5] Scholars were concerned chiefly with establishing language families and reconstructing prehistoric proto-languages, using the comparative method and internal reconstruction.'

In [15]:
embedder = pipeline("feature-extraction",
                    model="gpt2",
                    device=device)
inputs = embedder.tokenizer(
                [text.lower()],
                return_tensors="pt",
            )
# Convert the ids of the first (and only) sentence back to token strings.
" ".join(embedder.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))

'at Ġfirst , Ġhistorical Ġlingu istics Ġserved Ġas Ġthe Ġcornerstone Ġof Ġcomparative Ġlingu istics Ġprimarily Ġas Ġa Ġtool Ġfor Ġlinguistic Ġreconstruction .[ 5 ] Ġscholars Ġwere Ġconcerned Ġchiefly Ġwith Ġestablishing Ġlanguage Ġfamilies Ġand Ġreconstruct ing Ġprehistoric Ġproto - l anguages , Ġusing Ġthe Ġcomparative Ġmethod Ġand Ġinternal Ġreconstruction .'
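
The `Ġ` in the output above is how GPT-2's byte-level BPE displays a leading space: each of the 256 possible byte values is first mapped to a printable Unicode character, and the space byte ends up as `Ġ`. The sketch below re-creates that mapping in plain Python, following the scheme used in the original GPT-2 encoder code. The function name `byte_to_printable` is made up for this illustration and is not part of the transformers API.

```python
def byte_to_printable():
    """Map each of the 256 byte values to a printable Unicode character.

    Bytes that are already printable keep their own character; the others
    (control characters, space, ...) are shifted to code points >= 256.
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


mapping = byte_to_printable()
# Encode the text to UTF-8 bytes, then display each byte with its mapped character.
encoded = "".join(mapping[b] for b in " first".encode("utf-8"))
print(encoded)  # Ġfirst
```

Because the tokenizer works on bytes, any input string can be represented, and a token such as `Ġfirst` simply means "the word `first` preceded by a space".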

In [16]:
embedder.tokenizer.vocab_size

50257
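
This size follows from how the BPE vocabulary is built: GPT-2 starts from the 256 byte symbols, adds 50,000 learned merges, and reserves one end-of-text token. To give an intuition of how merges are learned, here is a minimal character-level sketch of BPE training. It is a toy under simplifying assumptions (characters instead of bytes, no pre-tokenization, hypothetical word counts) and not GPT-2's actual training code; `learn_bpe` is an illustrative name.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} dict."""
    # Each word starts as a tuple of single-character symbols.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace each occurrence of the best pair with the merged symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus


# Hypothetical frequencies, chosen only for illustration.
word_counts = {"linguistics": 4, "linguistic": 3, "language": 5, "languages": 2}
merges, segmented = learn_bpe(word_counts, num_merges=8)
for symbols in segmented:
    print(" ".join(symbols))
```

Frequent character sequences are merged first, which is why common words tend to survive as single tokens while rarer ones are left in several pieces.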

Let's try another subword tokenization technique. ALBERT uses SentencePiece with [Unigram Language Model tokenization](https://arxiv.org/abs/1804.10959). Let's compare the two results.

In [17]:
embedder = pipeline("feature-extraction",
                    model="albert-base-v2",
                    device=device)
inputs = embedder.tokenizer(
                [text.lower()],
                return_tensors="pt",
            )
# Convert the ids of the first (and only) sentence back to token strings.
" ".join(embedder.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))

Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertModel: ['predictions.decoder.weight', 'predictions.bias', 'predictions.dense.bias', 'predictions.LayerNorm.weight', 'predictions.decoder.bias', 'predictions.LayerNorm.bias', 'predictions.dense.weight']
- This IS expected if you are initializing AlbertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


'[CLS] ▁at ▁first , ▁historical ▁linguistics ▁served ▁as ▁the ▁cornerstone ▁of ▁comparative ▁linguistics ▁primarily ▁as ▁a ▁tool ▁for ▁linguistic ▁reconstruction . [ 5 ] ▁scholars ▁were ▁concerned ▁chiefly ▁with ▁establishing ▁language ▁families ▁and ▁reconstruct ing ▁prehistoric ▁proto - language s , ▁using ▁the ▁comparative ▁method ▁and ▁internal ▁reconstruction . [SEP]'

In [18]:
embedder.tokenizer.vocab_size

30000
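
Unlike BPE, which applies learned merges in order, a unigram language model assigns a probability to every piece in its vocabulary and picks the segmentation with the highest total probability. A minimal sketch of that search (a Viterbi pass over character positions) is given below. The tiny vocabulary and its log-probabilities are invented for illustration and are not taken from ALBERT's SentencePiece model; `unigram_segment` is a hypothetical helper name.

```python
import math


def unigram_segment(text, log_probs):
    """Return the highest-scoring split of `text` into pieces from `log_probs`."""
    n = len(text)
    # best[i] = (score of the best segmentation of text[:i], start of its last piece)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Hypothetical log-probabilities, chosen only for illustration.
toy_vocab = {
    "▁reconstruct": -8.0,
    "▁re": -5.0,
    "construct": -9.0,
    "ing": -4.0,
    "i": -6.0,
    "n": -6.0,
    "g": -6.0,
}
print(unigram_segment("▁reconstructing", toy_vocab))  # ['▁reconstruct', 'ing']
```

With these toy scores, `▁reconstruct` + `ing` (total -12) beats `▁re` + `construct` + `ing` (total -18), which matches the split seen in the ALBERT output above.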

## Multilingual tokenization

So far, the tokenizers we saw were built for English only. A multilingual model shares a single vocabulary across all of its languages, so each language gets only a fraction of it and words tend to be split into smaller pieces.

In [19]:
embedder = pipeline("feature-extraction",
                    model="xlm-roberta-base",
                    device=device)

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaModel: ['lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
inputs = embedder.tokenizer(
                [text.lower()],
                return_tensors="pt",
            )
# Convert the ids of the first (and only) sentence back to token strings.
" ".join(embedder.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))

'<s> ▁at ▁first , ▁historical ▁linguis tics ▁served ▁as ▁the ▁corner stone ▁of ▁comparativ e ▁linguis tics ▁primari ly ▁as ▁a ▁tool ▁for ▁linguis tic ▁re construction . [5] ▁scholar s ▁were ▁concerned ▁chief ly ▁with ▁establish ing ▁language ▁families ▁and ▁reconstru c ting ▁pre histori c ▁proto - language s , ▁using ▁the ▁comparativ e ▁method ▁and ▁internal ▁re construction . </s>'

Since this model covers many more languages (around [100](https://aclanthology.org/2020.acl-main.747.pdf)), it needs a much larger vocabulary.

In [20]:
embedder.tokenizer.vocab_size

250002

Since the share of the vocabulary devoted to English is much smaller, we see many more words cut into pieces.
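
One rough way to quantify this is to count tokens per whitespace-separated word. The sketch below does so directly on the ALBERT and XLM-R token strings printed earlier (copied here as plain strings, so no model download is needed). The helper `fragmentation` and the `SPECIAL` set are illustrative names, not library features.

```python
# Token strings copied from the ALBERT and XLM-R outputs shown earlier.
albert = "[CLS] ▁at ▁first , ▁historical ▁linguistics ▁served ▁as ▁the ▁cornerstone ▁of ▁comparative ▁linguistics ▁primarily ▁as ▁a ▁tool ▁for ▁linguistic ▁reconstruction . [ 5 ] ▁scholars ▁were ▁concerned ▁chiefly ▁with ▁establishing ▁language ▁families ▁and ▁reconstruct ing ▁prehistoric ▁proto - language s , ▁using ▁the ▁comparative ▁method ▁and ▁internal ▁reconstruction . [SEP]"
xlmr = "<s> ▁at ▁first , ▁historical ▁linguis tics ▁served ▁as ▁the ▁corner stone ▁of ▁comparativ e ▁linguis tics ▁primari ly ▁as ▁a ▁tool ▁for ▁linguis tic ▁re construction . [5] ▁scholar s ▁were ▁concerned ▁chief ly ▁with ▁establish ing ▁language ▁families ▁and ▁reconstru c ting ▁pre histori c ▁proto - language s , ▁using ▁the ▁comparativ e ▁method ▁and ▁internal ▁re construction . </s>"

# Sentence-boundary markers added by the tokenizers; not part of the text.
SPECIAL = {"[CLS]", "[SEP]", "<s>", "</s>"}


def fragmentation(token_string):
    """Return (number of tokens, number of words) for a SentencePiece-style output."""
    tokens = [t for t in token_string.split() if t not in SPECIAL]
    # A token starting with "▁" opens a new whitespace-delimited word.
    words = sum(1 for t in tokens if t.startswith("▁"))
    return len(tokens), words


for name, output in [("albert-base-v2", albert), ("xlm-roberta-base", xlmr)]:
    n_tokens, n_words = fragmentation(output)
    print(f"{name}: {n_tokens} tokens for {n_words} words "
          f"({n_tokens / n_words:.2f} tokens per word)")
```

Both tokenizers see the same words, so a higher tokens-per-word ratio for XLM-R reflects the extra splitting discussed above. Punctuation pieces are counted as tokens here, which slightly inflates both ratios.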

We can also apply it to Japanese.

In [7]:
text = "チンドン屋は、チンドン太鼓と呼ばれる楽器を鳴らすなどして人目を集め、その地域の商品や店舗などの宣伝を行う日本の請負広告業である。披露目屋・広目屋・東西屋と呼ぶ地域もある。"

inputs = embedder.tokenizer(
                [text.lower()],
                return_tensors="pt",
            )
# Use " -- " as the separator so token boundaries stay visible in unspaced Japanese text.
" -- ".join(embedder.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))

'<s> -- ▁ -- チン -- ド -- ン -- 屋 -- は -- 、 -- チン -- ド -- ン -- 太 -- 鼓 -- と呼ばれる -- 楽 -- 器 -- を -- 鳴 -- ら -- す -- など -- して -- 人 -- 目 -- を集め -- 、 -- その -- 地域の -- 商品 -- や -- 店舗 -- などの -- 宣 -- 伝 -- を行う -- 日本の -- 請 -- 負 -- 広告 -- 業 -- である -- 。 -- 披露 -- 目 -- 屋 -- ・ -- 広 -- 目 -- 屋 -- ・ -- 東西 -- 屋 -- と -- 呼 -- ぶ -- 地域 -- もある -- 。 -- </s>'