## Training a new tokenizer from an old one

* You can train a new tokenizer that keeps the architecture of an existing one
* This is not the same as training a model (which uses gradient descent).
* Tokenizer training is a statistical process that tries to identify which subwords are the best to pick for a given corpus.
* The tokenizer needs to be trained on a corpus that is similar to the text of your task
* Make sure the corpus matches in language, characters, domain (e.g. medical vs. legal), and style

In [8]:
from datasets import load_dataset
from transformers import BertTokenizerFast

In [9]:
tokenizer = BertTokenizerFast.from_pretrained(
    "huggingface-course/bert-base-uncased-tokenizer-without-normalizer")

In [10]:
text = 'here is a sentence adapted to our tokenizer'
print(tokenizer.tokenize(text))

['here', 'is', 'a', 'sentence', 'adapted', 'to', 'our', 'token', '##izer']


In [12]:
text2 = 'afewqcinqweirnvq3obni3p0p3 qvrmnqerbnq3r09q qvwn0qw3nrbq03rq'
print(tokenizer.tokenize(text2))

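# No [UNK] tokens appear here, but the text is shattered into many tiny pieces.
# Heavy fragmentation like this (or unknown tokens) is a problem.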

['af', '##ew', '##q', '##cin', '##q', '##wei', '##rn', '##v', '##q', '##3', '##ob', '##ni', '##3', '##p', '##0', '##p', '##3', 'q', '##vr', '##m', '##n', '##q', '##er', '##bn', '##q', '##3', '##r', '##0', '##9', '##q', 'q', '##v', '##wn', '##0', '##q', '##w', '##3', '##nr', '##b', '##q', '##0', '##3', '##r', '##q']


In [14]:
text3 = 'the medical vocabulary is divided into many sub-tokens: paracetamol phrayngitis'
print(tokenizer.tokenize(text3))

['the', 'medical', 'vocabulary', 'is', 'divided', 'into', 'many', 'sub', '-', 'token', '##s', ':', 'para', '##ce', '##tam', '##ol', 'ph', '##ray', '##ng', '##itis']
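One rough way to quantify how well a tokenizer fits a text is the average number of tokens per whitespace-separated word. The sketch below only reuses the token lists printed above (9 tokens for 8 words, 20 tokens for 10 words); the helper name `tokens_per_word` is an assumption made for illustration, not part of any library.

```python
# Token lists copied from the outputs above.
general_text = "here is a sentence adapted to our tokenizer"
general_tokens = ["here", "is", "a", "sentence", "adapted", "to", "our",
                  "token", "##izer"]

medical_text = ("the medical vocabulary is divided into many sub-tokens: "
                "paracetamol phrayngitis")
medical_tokens = ["the", "medical", "vocabulary", "is", "divided", "into",
                  "many", "sub", "-", "token", "##s", ":", "para", "##ce",
                  "##tam", "##ol", "ph", "##ray", "##ng", "##itis"]


def tokens_per_word(text, tokens):
    """Average number of tokens per whitespace-separated word (hypothetical helper)."""
    return len(tokens) / len(text.split())


general_ratio = tokens_per_word(general_text, general_tokens)
medical_ratio = tokens_per_word(medical_text, medical_tokens)

print(f"general: {general_ratio:.3f} tokens per word")
print(f"medical: {medical_ratio:.3f} tokens per word")
```

A ratio close to 1 suggests the vocabulary suits the text; a clearly higher ratio, as for the medical sentence, suggests that a tokenizer trained on in-domain text could help.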


## Steps to train a new tokenizer
* Gather a corpus of texts
* Choose a tokenizer architecture
* Train the tokenizer on the corpus
* Save the result


In [15]:
# Use the train_new_from_iterator() method of a tokenizer loaded with AutoTokenizer
# to train a tokenizer with a known architecture on a new corpus

In [16]:
# Gather your training corpus
from datasets import load_dataset

raw_datasets = load_dataset('code_search_net', 'python')

def get_training_corpus():
    dataset = raw_datasets['train']
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx: start_idx + 1000]
        yield samples['whole_func_string']




In [18]:
from transformers import AutoTokenizer

training_corpus = get_training_corpus()

old_tokenizer = AutoTokenizer.from_pretrained('gpt2')

new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)

new_tokenizer.save_pretrained('code-search-net-tokenizer')






('code-search-net-tokenizer/tokenizer_config.json',
 'code-search-net-tokenizer/special_tokens_map.json',
 'code-search-net-tokenizer/vocab.json',
 'code-search-net-tokenizer/merges.txt',
 'code-search-net-tokenizer/added_tokens.json',
 'code-search-net-tokenizer/tokenizer.json')

In [1]:
my_list = [i for i in range(10)]

print(type(my_list))
print(my_list)
print(my_list)

<class 'list'>
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [3]:
my_generator = (i for i in range(10))

print(type(my_generator))
print(list(my_generator))
print(list(my_generator))

<class 'generator'>
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]
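The same behaviour matters for a training corpus: a generator object can be consumed only once, so the corpus is wrapped in a function that builds a fresh generator on every call. A minimal sketch, where `batch_iterator` and the five-item `texts` list are hypothetical stand-ins for the real dataset:

```python
def batch_iterator(texts, batch_size):
    """Yield successive lists of `batch_size` texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


texts = [f"document {i}" for i in range(5)]  # hypothetical stand-in corpus

first_pass = batch_iterator(texts, batch_size=2)
batches = list(first_pass)    # consumes the generator
leftover = list(first_pass)   # same object again: nothing left to yield

# Calling the function again creates a new generator that starts from the top.
second_pass = list(batch_iterator(texts, batch_size=2))

print(batches)
print(leftover)
print(second_pass == batches)
```

Yielding lists of texts rather than single texts also lets the trainer work on whole batches at a time.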


In [4]:
# Note: train_new_from_iterator() only works if you are using a "fast" tokenizer.
# The Transformers library contains 2 types of tokenizers: some written in pure Python (slow) and some backed by Rust (fast)

# Python is the language used most often for data science and deep learning, but it is slow for work that has to be parallelized

In [25]:
# Most of the Transformer models have a "fast" tokenizer available.
# The AutoTokenizer API selects the fast tokenizer for you whenever one is available.


In [None]:
# Fast tokenizers: tokenize many texts at once (e.g. batched=True in Dataset.map) to get the real performance benefits

### Normalization and pre-tokenization
Before splitting a text into subtokens, the tokenizer performs two steps: (1) normalization and (2) pre-tokenization.  
Then come (3) the model, which splits each pre-tokenized word into subtokens, and (4) post-processing, which adds special tokens.  

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(type(tokenizer.backend_tokenizer))

<class 'tokenizers.Tokenizer'>


In [9]:
text = 'This is a text with Héllò hôw are ü? and CAPITAL LETTERS too Hey?'

print(tokenizer.backend_tokenizer.normalizer.normalize_str(text))

this is a text with hello how are u? and capital letters too hey?


In [10]:
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

[('Hello', (0, 5)),
 (',', (5, 6)),
 ('how', (7, 10)),
 ('are', (11, 14)),
 ('you', (16, 19)),
 ('?', (19, 20))]
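The four stages can be approximated with the standard library alone. This is a rough sketch rather than the actual BERT implementation: the function names are hypothetical, the regex only approximates BERT's splitting rules, and the "model" stage is a placeholder that keeps each word whole.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, then strip accents (NFD decomposition, drop combining marks)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def pre_tokenize(text):
    """Split into words and punctuation marks, keeping character offsets."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def model(pre_tokens):
    """Placeholder for the subword model: each pre-token is kept whole."""
    return [word for word, _ in pre_tokens]


def post_process(tokens):
    """Add BERT-style special tokens around the sequence."""
    return ["[CLS]"] + tokens + ["[SEP]"]


normalized = normalize("Héllò hôw are ü?")
pre_tokens = pre_tokenize("Hello, how are  you?")
tokens = post_process(model(pre_tokenize(normalized)))

print(normalized)
print(pre_tokens)
print(tokens)
```

For these two inputs the sketch agrees with the normalizer and pre-tokenizer outputs shown above; offsets refer to the text passed to `pre_tokenize`, which is why the double space in the second input leaves a gap between `(11, 14)` and `(16, 19)`.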

In [11]:
# SentencePiece = tokenization algorithm for preprocessing text
# it treats the text as a sequence of Unicode characters and replaces spaces with the "▁" character (U+2581)
# because spaces are kept this way, tokenization is reversible
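Why marking spaces makes tokenization reversible can be sketched in a few lines. This is a toy illustration, not the SentencePiece library: `split_pieces` simply starts a new piece at every marker instead of using a learned model, and all function names are hypothetical.

```python
META = "\u2581"  # the "▁" character used as a visible stand-in for a space


def encode_spaces(text):
    """Replace every space with the marker so that no information is lost."""
    return text.replace(" ", META)


def split_pieces(encoded):
    """Toy segmentation: start a new piece at every marker (not a learned model)."""
    pieces = []
    for ch in encoded:
        if ch == META or not pieces:
            pieces.append(ch)
        else:
            pieces[-1] += ch
    return pieces


def decode(pieces):
    """Concatenate the pieces and turn the markers back into spaces."""
    return "".join(pieces).replace(META, " ")


text = "Hello, how are  you?"
pieces = split_pieces(encode_spaces(text))
restored = decode(pieces)

print(pieces)
print(restored == text)
```

Because every space survives as a marker inside the pieces, decoding is plain concatenation, and even the double space in the input is restored exactly.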

## Byte-Pair Encoding tokenization
* initially developed as an algorithm to compress texts
* then used by OpenAI for tokenization when pretraining GPT
* used by many Transformer models including GPT-2

In [13]:
# With a character-level base vocabulary, a character that never appeared in the training corpus
# is converted to the unknown token. Be careful when handling emojis.
# Byte-level BPE (as in GPT-2) avoids this by using bytes as the base vocabulary.
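The byte-level idea can be illustrated without any tokenizer library: every string, emojis included, becomes a sequence of UTF-8 byte values between 0 and 255, so a base vocabulary of 256 symbols covers all input. The helper names below are hypothetical.

```python
def to_byte_symbols(text):
    """Represent text as a list of UTF-8 byte values (0..255)."""
    return list(text.encode("utf-8"))


def from_byte_symbols(symbols):
    """Rebuild the original text from its byte values."""
    return bytes(symbols).decode("utf-8")


emoji = "\U0001F600"  # grinning face emoji
symbols = to_byte_symbols(emoji)
round_trip = from_byte_symbols(symbols)

print(symbols)
print(round_trip == emoji)
```

A single emoji turns into four byte symbols here, so nothing is ever unknown, at the cost of longer sequences for characters the merges did not cover.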

In [14]:
# training learns "merge" rules: each rule combines two elements (characters or parts of words) of the existing vocab into a new one
# start with single characters; the first merges create two-character tokens, which then get combined into longer subwords
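A minimal sketch of how merge rules could be learned, assuming a tiny hypothetical corpus given as word frequencies. It is a simplified illustration of the BPE idea (count adjacent pairs, merge the most frequent one, repeat), not the implementation used by any library.

```python
from collections import Counter


def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules from a {word: frequency} mapping."""
    splits = {word: list(word) for word in word_freqs}  # start from characters
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pair_freqs = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for left, right in zip(symbols, symbols[1:]):
                pair_freqs[(left, right)] += freq
        if not pair_freqs:
            break
        best = max(pair_freqs, key=pair_freqs.get)
        merges.append(best)
        # Apply the new merge rule to every word.
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = merged
    return merges, splits


# Hypothetical word frequencies, as a pre-tokenizer might produce them.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, splits = train_bpe(word_freqs, num_merges=3)

print(merges)
print(splits)
```

With these counts the pair `("u", "g")` is merged first (frequency 20), then `("u", "n")` (16), then `("h", "ug")` (15), so tokens grow from two characters to three.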

In [15]:
# tokenizing a new text: normalization, pre-tokenization, splitting words into individual characters, and
# applying the learned merge rules, in the order they were learned, to those splits
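The tokenization steps can be sketched as follows. The merge list is hypothetical, and normalization and pre-tokenization are reduced to lowercasing and whitespace splitting to keep the example short.

```python
# Hypothetical merge rules, listed in the order they were learned.
MERGES = [("u", "g"), ("u", "n"), ("h", "ug")]


def bpe_tokenize_word(word, merges):
    """Split a word into characters, then apply each merge rule in order."""
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols


def bpe_tokenize(text, merges):
    """Simplified pipeline: lowercase, split on whitespace, tokenize each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(bpe_tokenize_word(word, merges))
    return tokens


hugs_tokens = bpe_tokenize_word("hugs", MERGES)
sentence_tokens = bpe_tokenize("Hugs bun", MERGES)

print(hugs_tokens)
print(sentence_tokens)
```

The order of the rules matters: `("h", "ug")` can only fire because `("u", "g")` was applied first.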

## WordPiece tokenization
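* Google used this to pretrain BERT
* Google never open-sourced its implementation of the training algorithm
* Training is similar to BPE, but the actual tokenization is done differently (longest match first, with a `##` prefix on continuation pieces)
* It only saves the final vocabulary, not the merge rules learned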
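Since only the vocabulary is saved, tokenization has to work from the vocabulary alone. A common description is greedy longest-match-first: take the longest prefix found in the vocabulary, then repeat on the remainder with a `##` prefix. The sketch below assumes a tiny hypothetical vocabulary built from pieces seen in the outputs earlier in this notebook.

```python
def wordpiece_tokenize_word(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical tiny vocabulary for illustration only.
VOCAB = {"token", "##izer", "##s", "para", "##ce", "##tam", "##ol"}

tokenizer_pieces = wordpiece_tokenize_word("tokenizer", VOCAB)
paracetamol_pieces = wordpiece_tokenize_word("paracetamol", VOCAB)
unknown_pieces = wordpiece_tokenize_word("tokenx", VOCAB)

print(tokenizer_pieces)
print(paracetamol_pieces)
print(unknown_pieces)
```

Note that a single piece that cannot be matched turns the whole word into the unknown token, unlike BPE, where only the unmatched characters would be affected.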

## Unigram tokenization
* Used by T5 (through SentencePiece)
* A unigram model is a type of statistical language model that assumes each token occurs independently of the tokens before it, so the probability of a tokenization is the product of its tokens' probabilities
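Under the independence assumption, the best tokenization of a word is the split whose token probabilities have the highest product. A minimal sketch using dynamic programming (Viterbi), with made-up token probabilities; it shows only the tokenization step, not how a Unigram vocabulary is trained.

```python
import math


def best_segmentation(word, token_probs):
    """Return the most probable split of `word` and its log-probability."""
    n = len(word)
    # best[i] = (log-prob of the best split of word[:i], start of its last token)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in token_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(token_probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None, -math.inf  # the word cannot be built from the vocabulary
    # Walk backwards through the stored start indices to recover the tokens.
    tokens = []
    end = n
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1], best[n][0]


# Hypothetical token probabilities for illustration only.
TOKEN_PROBS = {"h": 0.1, "u": 0.2, "g": 0.1, "s": 0.15,
               "hu": 0.1, "ug": 0.2, "hug": 0.15}

hugs_tokens, hugs_logp = best_segmentation("hugs", TOKEN_PROBS)
ugh_tokens, _ = best_segmentation("ugh", TOKEN_PROBS)

print(hugs_tokens, math.exp(hugs_logp))
print(ugh_tokens)
```

For "hugs", the split `["hug", "s"]` has probability 0.15 × 0.15, which beats alternatives such as `["h", "ug", "s"]` at 0.1 × 0.2 × 0.15.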

# Tokenizer - building from scratch

* Normalizers
* PreTokenizers
* Models
* Trainers
* PostProcessors
* Decoders



1. Create a training dataset
2. Create a backend_tokenizer with the Hugging Face Tokenizers library
3. Load the backend_tokenizer into a Hugging Face Transformers tokenizer

In [17]:
from datasets import load_dataset

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")





In [18]:
def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

In [27]:
with open("wikitext-2.txt", "w", encoding="utf-8") as f:
    for i in range(len(dataset)):
        f.write(dataset[i]["text"] + "\n")

In [19]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

In [20]:
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)

In [22]:
print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

hello how are u?


In [23]:
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

In [24]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

In [25]:
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)






In [28]:
# Alternative: reset the model, then train from text files instead of an iterator
tokenizer.model = models.WordPiece(unk_token="[UNK]")
tokenizer.train(["wikitext-2.txt"], trainer=trainer)




