# Training a new tokenizer from an old one.

Transformer models use a subword tokenization algorithm. 

Training a tokenizer is a **statistical process** that tries to identify which subwords are the best to pick for a given corpus, and the exact rules used to pick them depend on the tokenization algorithm. 

It’s **deterministic**, meaning you always get the same results when training with the same algorithm on the same corpus.
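To make the "statistical" and "deterministic" points concrete, here is a minimal sketch of one counting step in the style of byte-pair encoding. It is a toy illustration only, not what any particular library implements: the corpus, the function name `most_frequent_pair`, and the alphabetical tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair across all words.

    Ties are broken alphabetically so the result never depends on
    iteration order (an assumption made to keep this toy deterministic).
    """
    counts = Counter()
    for word in words:
        symbols = list(word)
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += 1
    # Highest count first, then alphabetical order for ties.
    return min(counts.items(), key=lambda item: (-item[1], item[0]))

corpus = ["lower", "lowest", "low", "newer"]  # hypothetical toy corpus
best_pair, frequency = most_frequent_pair(corpus)
print(best_pair, frequency)
```

Running the function twice on the same corpus returns the same pair, which is the sense in which training is deterministic.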

When should you train a new tokenizer for a model trained from scratch?
* If the corpus is in a new language, or uses new characters, a new domain, or a new style.
* The tokenizer should suit the corpus used to train the language model from scratch.
* Excessive splitting of words may hurt the model's performance.
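One rough way to spot excessive splitting is to measure how many tokens a tokenizer produces per word. The sketch below assumes a hypothetical helper, `tokens_per_word`, and uses two stand-in tokenizers so that it runs without any downloads; neither stand-in is a real subword tokenizer.

```python
def tokens_per_word(tokenize, text):
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)

# Two stand-in tokenizers (hypothetical, for illustration only).
def whitespace_tokenize(text):
    return text.split()

def character_tokenize(text):
    return [ch for ch in text if not ch.isspace()]

sample = "def add_numbers(a, b): return a + b"
print(tokens_per_word(whitespace_tokenize, sample))
print(tokens_per_word(character_tokenize, sample))
```

With a real tokenizer, its `tokenize` method can be passed as the `tokenize` argument; a higher ratio on your corpus suggests the vocabulary fits the text poorly.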

**Steps**
* Gather a corpus of texts
* Choose a tokenizer architecture (use the tokenizer used by pretrained model)
* Train the tokenizer on the corpus
* Save the result 

Example: GPT-2 model on Python code

* The text is very specific.
* `AutoTokenizer.train_new_from_iterator` expects an iterator.
* Dataset used: `code_search_net`

https://youtu.be/DJimQynXZsQ

## Setup

In [1]:
%%capture
!pip install datasets transformers[sentencepiece]

In [34]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.


In [35]:
# setup git
!git config --global user.email "manisnesan@users.noreply.github.com"
!git config --global user.name "Manikandan Sivanesan"

In [3]:
# hub login
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


## Assembling a corpus


The task is to train GPT-2 from scratch, but on a language other than English: Python code.

In [5]:
from datasets import load_dataset

In [6]:
raw_datasets = load_dataset('code_search_net', 'python')

Downloading and preparing dataset code_search_net/python (download: 897.32 MiB, generated: 1.62 GiB, post-processed: Unknown size, total: 2.49 GiB) to /root/.cache/huggingface/datasets/code_search_net/python/1.0.0/80a244ab541c6b2125350b764dc5c2b715f65f00de7a56107a28915fac173a27...


Dataset code_search_net downloaded and prepared to /root/.cache/huggingface/datasets/code_search_net/python/1.0.0/80a244ab541c6b2125350b764dc5c2b715f65f00de7a56107a28915fac173a27. Subsequent calls will reuse this data.



In [8]:
print(raw_datasets['train'])

Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
    num_rows: 412178
})


In [9]:
raw_datasets['train'][123456]['whole_func_string']

"def build_cert_chain(self, flags=SSL_BUILD_CHAIN_FLAG_NONE):\n        u'''\n        Used for server side only!\n\n        :param flags:\n        :return: 1 for success and 0 for failure\n        '''\n        retVal = SSL_CTX_build_cert_chain(self._ctx, flags)\n        return retVal"

We'll just use the `whole_func_string` column to train our tokenizer.

* Transform the dataset into an iterator of lists of texts, that is, an iterator whose items are batches (lists) of texts.


In [10]:
# Using a Python generator, we can avoid Python loading anything into memory until it’s actually necessary.

training_corpus = (
    raw_datasets['train'][i : i+1000]['whole_func_string'] 
    for i in range(0, len(raw_datasets['train']), 1000)
)


In [14]:
#The problem with a generator object is that it can only be used once. So, instead of this giving us the list of the first 10 digits twice:

gen = (i for i in range(10))
print(list(gen))
print(list(gen))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]


In [15]:
# That’s why we define a function that returns a generator instead:

def get_training_corpus():
  return (
    raw_datasets['train'][i : i+1000]['whole_func_string'] 
    for i in range(0, len(raw_datasets['train']), 1000)
)

training_corpus = get_training_corpus()  

In [16]:
## Alternatively, we can define a generator function with a for loop and `yield`, which returns an iterator of batches of texts

def get_training_corpus():
  dataset = raw_datasets['train']
  for start_idx in range(0, len(dataset), 1000):
    samples = dataset[start_idx : start_idx+1000]
    yield samples['whole_func_string']
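The batching pattern can be tried without downloading the dataset. This sketch uses a plain list as a stand-in corpus and a hypothetical function name, `get_batches`. A real `Dataset` slice returns a dict of columns, which is why the cells above index `['whole_func_string']`; a plain list slice needs no such step.

```python
def get_batches(texts, batch_size):
    """Yield successive lists of at most `batch_size` texts."""
    for start_idx in range(0, len(texts), batch_size):
        yield texts[start_idx : start_idx + batch_size]

toy_texts = [f"def f{i}(): pass" for i in range(5)]  # stand-in corpus

first_pass = list(get_batches(toy_texts, 2))
second_pass = list(get_batches(toy_texts, 2))  # a fresh generator on each call

print([len(batch) for batch in first_pass])  # [2, 2, 1]
print(first_pass == second_pass)             # True
```

Because each call builds a new generator, the corpus can be iterated again simply by calling the function again.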

## Training a new tokenizer


In [17]:
from transformers import AutoTokenizer

In [18]:
old_tokenizer = AutoTokenizer.from_pretrained('gpt2')


In [19]:
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

In [21]:
tokens = old_tokenizer.tokenize(example); print(tokens)

['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


The tokenizer uses a few special symbols, like `Ġ` and `Ċ`, which denote spaces and newlines, respectively.
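The symbols `Ġ` and `Ċ` come from GPT-2's byte-level encoding, which maps every byte to a printable character before applying merges. The sketch below re-implements that mapping in plain Python to show where the two symbols come from; it is written for illustration and the variable names are my own, so treat it as a sketch rather than the library's code.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    for byte in range(256):
        if byte not in printable:
            # Remaining bytes (space, newline, control bytes, ...) are
            # shifted to code points starting at 256.
            byte_values.append(byte)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

mapping = bytes_to_unicode()
print(mapping[ord(" ")], mapping[ord("\n")])  # Ġ Ċ
```

The space byte (32) lands on code point 288 and the newline byte (10) on code point 266, which render as `Ġ` and `Ċ`.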

In [22]:
# 52000 is the target vocabulary size (the `vocab_size` argument)
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)

**AutoTokenizer.train_new_from_iterator()** only works if the tokenizer you are using is a *“fast”* tokenizer.

There are two types of tokenizers: some are written purely in Python, and others (the fast ones) are backed by the 🤗 Tokenizers library, written in Rust.
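Tokenizers in 🤗 Transformers expose an `is_fast` attribute, so the requirement can be checked before training. The guard below is a minimal sketch: `ensure_fast` is a hypothetical helper, and the two `Fake...` classes are stand-ins so the example runs without downloading a model.

```python
def ensure_fast(tokenizer):
    """Raise a clear error if `tokenizer` is not a fast tokenizer."""
    if not getattr(tokenizer, "is_fast", False):
        raise TypeError(
            f"{type(tokenizer).__name__} is not a fast tokenizer; "
            "train_new_from_iterator needs a fast one."
        )
    return tokenizer

# Stand-in objects (hypothetical) that only carry the `is_fast` flag.
class FakeSlowTokenizer:
    is_fast = False

class FakeFastTokenizer:
    is_fast = True

print(ensure_fast(FakeFastTokenizer()).is_fast)  # True
```

In the notebook, the same check would be `ensure_fast(old_tokenizer)` before calling `train_new_from_iterator`.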

In [24]:
tokens = tokenizer.tokenize(example); tokens

['def',
 'Ġadd',
 '_',
 'numbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'ĊĠĠĠ',
 'Ġ"""',
 'Add',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`."""',
 'ĊĠĠĠ',
 'Ġreturn',
 'Ġa',
 'Ġ+',
 'Ġb']

In [25]:
print(len(tokens))
print(len(old_tokenizer.tokenize(example)))

27
36
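The two counts above can be turned into a relative saving. This is plain arithmetic on the numbers printed by the notebook; the variable names are my own.

```python
old_count = 36  # tokens from the original GPT-2 tokenizer
new_count = 27  # tokens from the retrained tokenizer

reduction = (old_count - new_count) / old_count
print(f"{reduction:.0%} fewer tokens")  # 25% fewer tokens
```

Fewer tokens for the same text means more code fits into a fixed context length.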


In [27]:
example = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weights + self.bias
    """
print(tokenizer.tokenize(example))

['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',', 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_', 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(', 'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weights', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']


## Saving the tokenizer


In [28]:
tokenizer.save_pretrained('code-search-net-tokenizer')

('code-search-net-tokenizer/tokenizer_config.json',
 'code-search-net-tokenizer/special_tokens_map.json',
 'code-search-net-tokenizer/vocab.json',
 'code-search-net-tokenizer/merges.txt',
 'code-search-net-tokenizer/added_tokens.json',
 'code-search-net-tokenizer/tokenizer.json')
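A quick sanity check after saving is to confirm that the files listed above exist on disk. The sketch below assumes a hypothetical helper, `missing_files`, and takes its expected file names from the output of this notebook run; other library versions may write a different set of files.

```python
from pathlib import Path
import tempfile

# File names copied from the save_pretrained output above (an assumption
# that may not hold for other library versions).
EXPECTED_FILES = [
    "tokenizer_config.json",
    "special_tokens_map.json",
    "vocab.json",
    "merges.txt",
    "added_tokens.json",
    "tokenizer.json",
]

def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from `directory`."""
    directory = Path(directory)
    return [name for name in expected if not (directory / name).is_file()]

# Demonstration on a throwaway directory holding only two of the files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("vocab.json", "merges.txt"):
        (Path(tmp) / name).write_text("")
    demo_missing = missing_files(tmp)
    print(demo_missing)
```

In the notebook, `missing_files('code-search-net-tokenizer')` should return an empty list if every file was written.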

In [30]:
from huggingface_hub import notebook_login
notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


In [None]:
# Alternatively, when not working in a notebook, log in from a terminal:
# huggingface-cli login

In [38]:
# Faced `ValueError: If not specifying `clone_from`, you need to pass Repository a valid git clone.`
# Workaround: `use_temp_dir=True` from https://discuss.huggingface.co/t/chapter-4-questions/6801/4
tokenizer.push_to_hub('code-search-net-tokenizer', use_temp_dir=True)

Cloning https://huggingface.co/msivanes/code-search-net-tokenizer into local empty directory.
To https://huggingface.co/msivanes/code-search-net-tokenizer
   2f3cfe7..ce71688  main -> main



'https://huggingface.co/msivanes/code-search-net-tokenizer/commit/ce71688438b352cb49b54d78b7d5bd168b48817f'

In [39]:
# Replace "msivanes" below with your own namespace to use your own tokenizer
tokenizer = AutoTokenizer.from_pretrained("msivanes/code-search-net-tokenizer")
