# Tokenizer test

This notebook tests how a tokenizer trained on English text behaves on Portuguese text.

In [1]:
from pathlib import Path

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

from src.utils import compute_perplexity, train_tokenizer


DATA_DIR = Path("data")
RESOURCES_DIR = Path("resources")
MODEL = "microsoft/phi-1_5"



In [64]:
llm = AutoModelForCausalLM.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

In [3]:
en_tkns = tokenizer.tokenize("Hello, I'm a single sentence!")
en_tkns

['Hello', ',', 'ĠI', "'m", 'Ġa', 'Ġsingle', 'Ġsentence', '!']

The tokenizer converts spaces into a special character (`Ġ`) before BPE splitting, mostly so that spaces are not consumed, since the standard BPE algorithm uses spaces in its process. See this [discussion](https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475/2?u=joaogante).
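
The `Ġ` marker, and the `Ã¡` that appears for `á` in the Portuguese example further down, are both consistent with the byte-to-character table used by GPT-2 style byte-level BPE tokenizers. The sketch below rebuilds that table in plain Python, assuming this tokenizer follows the same convention. The helper names `bytes_to_unicode` and `to_byte_level` are illustrative and are not part of this notebook.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 convention)."""
    # Bytes that are already printable keep their own character.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = keep[:]
    code_points = keep[:]
    shift = 0
    for b in range(256):
        if b not in keep:
            # Remaining bytes (space, control characters, ...) are moved past U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


def to_byte_level(text):
    """Render text the way a byte-level BPE vocabulary displays it."""
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


print(to_byte_level(" I"))   # ĠI
print(to_byte_level("Olá"))  # OlÃ¡
```

Under this mapping the space byte becomes `Ġ`, and the two UTF-8 bytes of `á` become `Ã` and `¡`, so accented characters start out as two symbols before any BPE merge is applied.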

In [4]:
pt_tkns = tokenizer.tokenize("Olá, eu sou uma frase simples!")
pt_tkns

['Ol',
 'Ã¡',
 ',',
 'Ġe',
 'u',
 'Ġsou',
 'Ġu',
 'ma',
 'Ġfr',
 'ase',
 'Ġsim',
 'ples',
 '!']

Note that the tokenizer split the word `"eu"` into `"Ġe"` and `"u"`, which is strange, since `"eu"` is a very common word in Portuguese. Also note that the word `"sentence"` is kept as a single token, while its Portuguese equivalent `"frase"` is split into two tokens, `"Ġfr"` and `"ase"`.

In [5]:
print(f"Number of tokens in English: {len(en_tkns)}")
print(f"Number of tokens in Portuguese: {len(pt_tkns)}")

Number of tokens in English: 8
Number of tokens in Portuguese: 13


As a last remark, note that the Portuguese sentence produces about 1.6 times as many tokens as the English one (13 versus 8). This is an efficiency problem, as the system needs much more compute to produce the text in Portuguese than the text in English.
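
The gap can be quantified with a small sketch that reuses the token lists printed above. The `tokens_per_word` helper is an illustrative name, and splitting on whitespace is only a rough assumption about what counts as a word.

```python
en_text = "Hello, I'm a single sentence!"
pt_text = "Olá, eu sou uma frase simples!"

# Token lists copied from the cell outputs above.
en_tkns = ['Hello', ',', 'ĠI', "'m", 'Ġa', 'Ġsingle', 'Ġsentence', '!']
pt_tkns = ['Ol', 'Ã¡', ',', 'Ġe', 'u', 'Ġsou', 'Ġu', 'ma',
           'Ġfr', 'ase', 'Ġsim', 'ples', '!']


def tokens_per_word(text, tokens):
    """Average number of tokens per whitespace-separated word."""
    return len(tokens) / len(text.split())


ratio = len(pt_tkns) / len(en_tkns)
print(f"Portuguese/English token ratio: {ratio:.3f}")                           # 1.625
print(f"Tokens per word (English): {tokens_per_word(en_text, en_tkns):.2f}")     # 1.60
print(f"Tokens per word (Portuguese): {tokens_per_word(pt_text, pt_tkns):.2f}")  # 2.17
```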

Is there any way to limit this phenomenon?

## Compute Perplexity

In [6]:
ppl_en = compute_perplexity(llm, tokenizer, "Hello, I'm a single sentence!")
ppl_pt = compute_perplexity(llm, tokenizer, "Olá, eu sou uma frase simples!")
print(f"Perplexity of English: {ppl_en}")
print(f"Perplexity of Portuguese: {ppl_pt}")

Perplexity of English: 28.975448608398438
Perplexity of Portuguese: 160.80172729492188


The Portuguese sentence has a much higher perplexity than the English sentence (about 160.8 versus 29.0), meaning that the model finds the sequence of tokens in the Portuguese sentence more surprising than the one in the English sentence. This is expected, as perplexity measures how well a language model predicts a text. Since the phi model was only trained on English text, it is normal for the Portuguese text to have a much higher perplexity. The question is: can we maintain or even lower the perplexity on Portuguese text while reducing the number of tokens generated?
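
For reference, perplexity is usually defined as the exponential of the mean negative log-probability the model assigns to each token. The `compute_perplexity` helper used above comes from `src.utils` and its implementation is not shown here, so the sketch below is only an assumption about the usual definition. The probabilities are hypothetical and do not come from the model.

```python
import math


def perplexity(token_probs):
    """Exponential of the mean negative log-probability of the tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# Hypothetical per-token probabilities, not taken from the model.
confident = [0.5, 0.25, 0.125]  # tokens the model predicts well
uncertain = [0.05, 0.02, 0.01]  # tokens the model finds surprising

print(perplexity(confident))  # close to 4
print(perplexity(uncertain))  # much larger
```

Lower probabilities per token give a higher perplexity, which matches the pattern seen for the Portuguese sentence.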

As a first approach, we will try the following. First, we select a Portuguese corpus (probably Lusa news). Second, we compute the perplexity of the phi model on that corpus, which gives us a baseline to use as a reference. Third, we train a tokenizer on the Portuguese corpus. Then we check which tokens in the new vocabulary are missing from the original one. The final step is to assess the best way to create embeddings for these new tokens and add them to the original tokenizer, so that the model's perplexity on the Portuguese corpus gets lower.

The strategy to create the new embeddings might be an aggregation strategy or training the model.

## Train the tokenizer on Portuguese text

### Read data

In [7]:
corpus = (DATA_DIR / "sample.txt").read_text()
print(corpus[:1000])

                                        Câmara dos Senhores Deputados da Nação Portugueza 1822-1910
                                Câmara dos Senhores Deputados da Nação Portugueza 1822-1910
                        O texto apresentado é obtido de forma automática, não levando em conta elementos gráficos e podendo conter erros. Se encontrar algum erro, por favor informe os serviços através da página de contactos.
II Série — Número 104Quarta-feira, 26 de Junho de 1985DIÁRIOda Assembleia da RepúblicaIII LEGISLATURA2.a SESSÃO LEGISLATIVA (1984-1985)SUMÁRIOPropostas da lei:N.° 107/III [autoriza o Governo, através do Ministério das Finanças e do Plano, a contrair junto do Banco Internacional para a Reconstrução e Desenvolvimento (BIRD) um empréstimo externo até ao montante glo-bal equivalente a 66 milhões dc dólares dos Estados Unidos da América]:Relatório e parecer da Comissão de Economia, Finanças e Plano sobre a proposta de lei.N.° 108/111 (cooperação fi

In [8]:
lines = corpus.split("\n")
print(f"Number of lines: {len(lines)}")

lines = list(set(lines))
lines = [line.strip() for line in lines if line.strip()]
print(f"Number of unique lines {len(set(lines))}")

Number of lines: 1000
Number of unique lines 927
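
The cell above deduplicates through a `set` before stripping, so the order of `lines` can change between runs, and lines that differ only in surrounding whitespace survive until the final `set` call. A possible alternative, sketched below with the illustrative helper name `unique_stripped_lines`, strips first and keeps first-seen order. The sample text is a toy input, not the real corpus.

```python
def unique_stripped_lines(text):
    """Strip each line, drop empty ones, and remove duplicates in first-seen order."""
    stripped = (line.strip() for line in text.split("\n"))
    # dict keys keep insertion order, so the result is the same on every run.
    return list(dict.fromkeys(line for line in stripped if line))


# Toy input, not the real corpus.
sample = "  Câmara dos Deputados\nCâmara dos Deputados\n\nII Série\n"
print(unique_stripped_lines(sample))  # ['Câmara dos Deputados', 'II Série']
```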


In [9]:
for line in lines[:10]:
    print(line)

984-(24)II SÉRIE — NÚMERO 39"VER DIÁRIO ORIGINAL"
984-(178)II SÉRIE — NÚMERO 39"VER DIÁRIO ORIGINAL"
10 DE ABRIL DE 19852585CAPÍTULO II Serviço cívico Artigo 4.° (Concei o de serviço cívico)1 — Entende-se por serviço cívico adequado à situação de objector de consciência aquele que, sendo exclusivamente de natureza civil, não esteja vinculado ou subordinado a instituições militares ou militarizadas e que constitua uma participação útil em tarefas necessárias à colectividade, possibilitando uma adequada aplicação das habilitações e interesses vocacionais dos objectores.2 — O serviço cívico será organizado nos termos do diploma previsto no artigo 44.° e efectuar-se-á preferentemente nos seguintes domínios:a) Assistência em hospitais e outros estabelecimentos de saúde;b) Rastreio de doenças e acções de defesa da saúde pública;c) Luta contra o tabagismo, o alcoolismo e a droga;,d) Assistência a deficientes, crianças e idosos;e) Prevenção e combat

### Train the tokenizer

In [10]:
tokenizer_pt = train_tokenizer(tokenizer, lines)






Check the number of tokens with this tokenizer.

In [11]:
pt_tkns = tokenizer.tokenize("Olá, eu sou uma frase simples!")
print(f"Number of tokens with original tokenizer: {len(pt_tkns)}")

pt_tkns = tokenizer_pt.tokenize("Olá, eu sou uma frase simples!")
print(f"Number of tokens with new tokenizer: {len(pt_tkns)}")


Number of tokens with original tokenizer: 13
Number of tokens with new tokenizer: 12


This is good. The new tokenizer produces fewer tokens than the original one (12 versus 13).

Which tokens in the new tokenizer are not in the original one?

In [12]:
vocab_org = tokenizer.vocab.keys()
vocab_new = tokenizer_pt.vocab.keys()

In [13]:
new_tokens = list(set(vocab_new) - set(vocab_org))
print(f"Number of new tokens: {len(new_tokens)}")
print(f"(some) New tokens:\n{new_tokens[:10]}")

Number of new tokens: 25186
(some) New tokens: ['cumprimento', 'Ġcontrato', 'Ġprolongada', 'Ġfacultar', 'Ġcreditando', 'Ġgerar', 'Ġges', 'BDisc', 'nter', 'giu']
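
The vocabulary comparison above can be sketched on toy data. Because `set` ordering is not stable across runs, sorting the difference is one way to make the sample of new tokens reproducible. The vocabularies and ids below are made up for illustration.

```python
# Toy vocabularies (token -> id); the ids are made up for illustration.
toy_vocab_org = {"Ġcontr": 0, "ato": 1, "Ġa": 2}
toy_vocab_new = {"Ġcontr": 0, "ato": 1, "Ġa": 2, "Ġcontrato": 3, "Ġgerar": 4}

# Set difference on the keys; sorting gives the same order on every run.
toy_new_tokens = sorted(set(toy_vocab_new) - set(toy_vocab_org))
print(toy_new_tokens)  # ['Ġcontrato', 'Ġgerar']
```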


Several of these are fairly frequent Portuguese words that were missing from the original vocabulary (others are word fragments). Let's now try to add the new tokens to the original vocabulary.

In [14]:
print(f"A sample of the tokens to be added:\n{new_tokens[:15]}")

A sample of the tokens to be added:
['cumprimento', 'Ġcontrato', 'Ġprolongada', 'Ġfacultar', 'Ġcreditando', 'Ġgerar', 'Ġges', 'BDisc', 'nter', 'giu', 'Ġequiparado', 'reas', 'Ġsurpreender', 'demos', 'Ġremetido']


Let's first take one token as an example and see how the original tokenizer would tokenize it.

In [55]:
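# new_tokens is built from a set, so its order can change between runs;
# hard-code the example instead of relying on new_tokens[1].
example = " contrato"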
example

' contrato'

In [65]:
tokens = tokenizer.tokenize(example)
print(f"Previous tokens: {tokens}")

new_token = "".join(tokens)
print(f"New token: {new_token}")


Previous tokens: ['Ġcontr', 'ato']
New token: Ġcontrato


Let's now get the embeddings for these tokens.

In [66]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids

[3445, 5549]

In [67]:
model = llm.base_model
token_embs = model.embed_tokens(torch.tensor(token_ids))
token_embs

tensor([[-0.0078,  0.0262,  0.0003,  ...,  0.0161,  0.0155, -0.0227],
        [ 0.0025,  0.0034, -0.0124,  ..., -0.0029, -0.0109,  0.0031]],
       grad_fn=<EmbeddingBackward0>)

In [68]:
token_embs_agg = token_embs.mean(dim=0)
token_embs_agg

tensor([-0.0026,  0.0148, -0.0060,  ...,  0.0066,  0.0023, -0.0098],
       grad_fn=<MeanBackward1>)

Let's add this new token to the tokenizer and the new embedding to the model.

In [62]:
print(f"Number of tokens before adding the token: {len(tokenizer)}")

Number of tokens before adding the token: 50298


In [69]:
tokenizer.add_tokens([new_token])
new_token_id = tokenizer.vocab[new_token]
print(f"New token id: {new_token_id}")


New token id: 50295


In [71]:
tokenizer.tokenize(example)

['Ġcontr', 'ato']

In [73]:
tokenizer.tokenize("O tipo nao tem contrato")

['O', 'Ġtip', 'o', 'Ġn', 'ao', 'Ġtem', 'Ġcontr', 'ato']
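
The tokenization is unchanged after adding `Ġcontrato`. One possible explanation, stated here as an assumption and not verified in this notebook, is that added tokens are matched against the raw input text, before spaces are rewritten to `Ġ`. Under that assumption the literal string `Ġcontrato` never occurs in the input, which the plain-Python check below illustrates. The `replace` call is a simplification that only holds for tokens made of ASCII characters plus the space marker.

```python
text = "O tipo nao tem contrato"

byte_level_token = "Ġcontrato"                  # form stored in the BPE vocabulary
raw_token = byte_level_token.replace("Ġ", " ")  # form that appears in raw text

print(byte_level_token in text)  # False
print(raw_token in text)         # True
```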

In [24]:
len(tokenizer)

50296

The new token has been added with token id 50295. Now we need to add an embedding for that id to the model.

In [25]:
embed = model.embed_tokens
type(embed)

torch.nn.modules.sparse.Embedding

The mismatch between the number of rows in the embedding matrix and the vocabulary size is explained in this [discussion](https://huggingface.co/bigscience/bloom/discussions/120).

In [26]:
weight = embed.weight.data
print(f"Shape of weight matrix: {weight.shape}")

Shape of weight matrix: torch.Size([51200, 2048])


In [27]:
# append the aggregated embedding as an extra row of the embedding matrix
weight = torch.cat([weight, token_embs_agg.unsqueeze(0)], dim=0)
print(f"Shape of weight matrix: {weight.shape}")

Shape of weight matrix: torch.Size([51201, 2048])


In [28]:
weight[new_token_id] = token_embs_agg

In [29]:
embed.weight.data = weight

In [30]:
assert  torch.equal(llm.model.embed_tokens(torch.tensor(new_token_id)), token_embs_agg)
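
The embedding matrix printed above has 51200 rows, while the new token id is 50295, so a row for that id already exists before the extra row is appended. The sketch below shows the same idea on a toy `torch.nn.Embedding`: average the sub-token embeddings, then write the result into the row of the new id, growing the table only when no spare row is available. All sizes and ids are hypothetical, and only the input embedding table is touched.

```python
import torch

torch.manual_seed(0)

# Toy table with 10 rows; assume only ids 0..7 are used by the vocabulary.
embed = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)

sub_token_ids = [2, 5]  # hypothetical ids of the pieces the new token replaces
new_token_id = 8        # hypothetical id assigned to the new token

with torch.no_grad():
    # Aggregate the sub-token embeddings with a plain mean.
    new_vector = embed.weight[sub_token_ids].mean(dim=0)

    if new_token_id < embed.num_embeddings:
        # A spare row already exists: overwrite it in place.
        embed.weight[new_token_id] = new_vector
    else:
        # No spare row: build a larger table with one extra row.
        grown = torch.cat([embed.weight, new_vector.unsqueeze(0)], dim=0)
        embed = torch.nn.Embedding.from_pretrained(grown, freeze=False)

# Looking up the new id now returns the aggregated vector.
looked_up = embed(torch.tensor(new_token_id))
print(torch.allclose(looked_up, new_vector))  # True
```

Writing in place keeps the table shape unchanged, so the module's `num_embeddings` attribute stays consistent with the weight matrix.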

Let's now test whether this reduces the perplexity of the model.

In [31]:
llm_original = AutoModelForCausalLM.from_pretrained(MODEL)


In [32]:
tokenizer_original = AutoTokenizer.from_pretrained(MODEL)

In [33]:
compute_perplexity(llm_original, tokenizer_original, "Olá, eu sou uma frase simples com a palavra incapacidade!")

tensor(72.2419, grad_fn=<ExpBackward0>)

In [34]:
compute_perplexity(llm, tokenizer, "Olá, eu sou uma frase simples com a palavra incapacidade!")

tensor(248.0512, grad_fn=<ExpBackward0>)

In [35]:
test_sentence = "Olá, eu sou uma frase simples com a palavra incapacidade!"
tokenizer_original.tokenize(test_sentence)

['Ol',
 'Ã¡',
 ',',
 'Ġe',
 'u',
 'Ġsou',
 'Ġu',
 'ma',
 'Ġfr',
 'ase',
 'Ġsim',
 'ples',
 'Ġcom',
 'Ġa',
 'Ġpal',
 'av',
 'ra',
 'Ġincapac',
 'id',
 'ade',
 '!']

In [36]:
tokenizer.tokenize(test_sentence)

['Ol',
 'Ã¡',
 ',',
 'Ġe',
 'u',
 'Ġsou',
 'Ġu',
 'ma',
 'Ġfr',
 'ase',
 'Ġsim',
 'ples',
 'Ġcom',
 'Ġa',
 'Ġpal',
 'av',
 'ra',
 'Ġ',
 'incapacidade',
 '!']