# Training a new tokenizer from an old one

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [2]:
%%capture
!pip install datasets evaluate transformers[sentencepiece]

Also, log in to Hugging Face.

In [17]:
from huggingface_hub import notebook_login

notebook_login()

If a <font color='blue'>language model</font> is <font color='blue'>not available</font> in the language you are interested in, or if your <font color='blue'>corpus</font> is very <font color='blue'>different</font> from the one your language model was trained on, you will most likely want to <font color='blue'>retrain</font> the model from <font color='blue'>scratch</font> using a tokenizer <font color='blue'>adapted</font> to your <font color='blue'>data</font>. That will require <font color='blue'>training</font> a <font color='blue'>new tokenizer</font> on your dataset. But what exactly does that mean? When we first looked at tokenizers in [Chapter 2](https://huggingface.co/learn/llm-course/chapter2/4), we saw that most Transformer models use a <font color='blue'>subword tokenization algorithm</font>. To identify which subwords are of interest and occur most frequently in the corpus at hand, the tokenizer needs to take a hard look at all the texts in the corpus -- a process we call <font color='blue'>training</font>. The exact rules that govern this training depend on the <font color='blue'>type of tokenizer</font> used, and we'll go over the three main algorithms later in this chapter.


<Tip warning={true}>

⚠️ <font color='blue'>Training</font> a <font color='blue'>tokenizer</font> is <font color='blue'>not</font> the <font color='blue'>same</font> as <font color='blue'>training</font> a <font color='blue'>model</font>! <font color='blue'>Model training</font> uses <font color='blue'>stochastic gradient descent</font> to make the loss a little bit smaller for each batch. It's randomized by nature (meaning you have to set some seeds to get the same results when doing the same training twice). Training a <font color='blue'>tokenizer</font> is a <font color='blue'>statistical process</font> that tries to <font color='blue'>identify</font> which <font color='blue'>subwords</font> are the <font color='blue'>best to pick</font> for a given corpus, and the exact rules used to pick them depend on the tokenization algorithm. It's deterministic, meaning you always get the same results when training with the same algorithm on the same corpus.

</Tip>
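
To make the "deterministic statistical process" concrete, here is a minimal sketch in plain Python. It is a toy, not the algorithm used by any real tokenizer: it only counts adjacent character pairs in a tiny made-up corpus and picks the most frequent one. The helper name `most_frequent_pair` and the corpus are assumptions for illustration. Running it twice on the same corpus gives the same answer, with no seed involved.

```python
from collections import Counter


def most_frequent_pair(corpus):
    """Return the (pair, count) for the most common adjacent character pair."""
    counts = Counter()
    for text in corpus:
        for left, right in zip(text, text[1:]):
            counts[left + right] += 1
    # Break ties alphabetically so the result never depends on insertion order.
    return min(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up corpus, used only for this illustration.
toy_corpus = ["low lower lowest", "new newer newest"]

first_run = most_frequent_pair(toy_corpus)
second_run = most_frequent_pair(toy_corpus)
print(first_run)                # ('we', 4)
print(first_run == second_run)  # True
```

Real subword algorithms repeat a step like this many times and apply more elaborate rules, but the statistics are computed the same way on every run.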



## Assembling a corpus

There's a very simple API in 🤗 Transformers that you can use to train a new tokenizer with the same characteristics as an existing one: `AutoTokenizer.train_new_from_iterator()`. To see this in action, let's say we want to train <font color='blue'>GPT-2</font> from <font color='blue'>scratch</font>, but in a <font color='blue'>different language</font> than <font color='blue'>English</font>. Our <font color='blue'>first task</font> will be to gather <font color='blue'>lots of data</font> in that language in a <font color='blue'>training corpus</font>. To provide examples everyone will be able to understand, we won't use a language like Russian or Chinese here, but rather a specialized English language: <font color='blue'>Python code</font>.

The [🤗 Datasets](https://github.com/huggingface/datasets) library can help us assemble a corpus of Python source code. We'll use the usual `load_dataset()` function to download and cache the [CodeSearchNet](https://huggingface.co/datasets/code_search_net) dataset. This dataset was created for the [CodeSearchNet challenge](https://wandb.ai/github/CodeSearchNet/benchmark) and contains millions of functions from open source libraries on GitHub in several programming languages. Here, we will load the Python part of this dataset:

In [7]:
from datasets import load_dataset

raw_datasets = load_dataset("code_search_net", "python", download_mode="force_redownload")

We can have a look at the <font color='blue'>training split</font> to see which columns we have access to:

In [5]:
raw_datasets["train"]

Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
    num_rows: 412178
})

We can see the dataset <font color='blue'>separates docstrings</font> from <font color='blue'>code</font> and suggests a <font color='blue'>tokenization of both</font>. Here, we'll just use the `whole_func_string` column to train our tokenizer. We can look at an example of one of these functions by indexing into the `train` split:

In [8]:
print(raw_datasets["train"][123456]["whole_func_string"])

def get_new_token(self, netloc):
        """Get a new token from BIG-IP and store it internally.

        Throws relevant exception if it fails to get a new token.

        This method will be called automatically if a request is attempted
        but there is no authentication token, or the authentication token
        is expired.  It is usually not necessary for users to call it, but
        it can be called if it is known that the authentication token has
        been invalidated by other means.
        """
        login_body = {
            'username': self.username,
            'password': self.password,
        }

        if self.auth_provider:
            if self.auth_provider == 'local':
                login_body['loginProviderName'] = 'local'
            elif self.auth_provider == 'tmos':
                login_body['loginProviderName'] = 'tmos'
            elif self.auth_provider not in ['none', 'default']:
                providers = self.get_auth_providers(netloc)
         

The first thing we need to do is <font color='blue'>transform</font> the <font color='blue'>dataset</font> into an <font color='blue'>iterator</font> of <font color='blue'>lists of texts</font> -- for instance, a list of lists of texts. Using lists of texts will enable our <font color='blue'>tokenizer</font> to go <font color='blue'>faster</font> (training on batches of texts instead of processing individual texts one by one), and it should be an iterator if we want to <font color='blue'>avoid</font> having <font color='blue'>everything in memory</font> at once. If your corpus is huge, you will want to take advantage of the fact that 🤗 Datasets <font color='blue'>does not load everything</font> into <font color='blue'>RAM</font> but stores the elements of the dataset on disk.

Doing the following would create a <font color='blue'>list of lists</font> of <font color='blue'>1,000 texts</font> each, but would <font color='blue'>load everything</font> in <font color='blue'>memory</font>:

In [9]:
# Don't uncomment the following line unless your dataset is small!
# training_corpus = [raw_datasets["train"][i: i + 1000]["whole_func_string"] for i in range(0, len(raw_datasets["train"]), 1000)]

Using a <font color='blue'>Python generator</font>, we can avoid Python <font color='blue'>loading</font> anything <font color='blue'>into memory</font> <font color='blue'>until</font> it's actually <font color='blue'>necessary</font>. To create such a generator, you just need to replace the brackets with parentheses:

In [10]:
training_corpus = (
    raw_datasets["train"][i : i + 1000]["whole_func_string"]
    for i in range(0, len(raw_datasets["train"]), 1000)
)

This line of code doesn't fetch any elements of the dataset; it just <font color='blue'>creates</font> an <font color='blue'>object</font> you can <font color='blue'>use</font> in a Python <font color='blue'>for loop</font>. The texts will only be loaded when you need them (that is, when you're at the step of the `for` loop that requires them), and <font color='blue'>only 1,000 texts</font> at a time will be <font color='blue'>loaded</font>. This way you won't exhaust all your memory even if you are processing a huge dataset.
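
The lazy behaviour can be checked with a small stand-in for the dataset. The sketch below assumes a hypothetical `LoggingCorpus` class (not part of 🤗 Datasets) that records every slice it is asked for; the generator expression has the same shape as the one above.

```python
class LoggingCorpus:
    """Hypothetical stand-in for a dataset split that records each slice read."""

    def __init__(self, texts):
        self.texts = texts
        self.reads = []

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, key):
        # Remember which slice was requested, then return the texts.
        self.reads.append((key.start, key.stop))
        return self.texts[key]


corpus = LoggingCorpus([f"def f{i}(): pass" for i in range(2500)])

# Creating the generator asks for the length but reads no texts yet.
batches = (corpus[i : i + 1000] for i in range(0, len(corpus), 1000))
reads_before = list(corpus.reads)

# Only when a batch is requested is the first slice actually read.
first_batch = next(batches)
reads_after_first = list(corpus.reads)

print(reads_before)       # []
print(reads_after_first)  # [(0, 1000)]
print(len(first_batch))   # 1000
```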

The problem with a <font color='blue'>generator</font> object is that it can <font color='blue'>only</font> be <font color='blue'>used once</font>. So, instead of giving us the list of the first 10 digits twice, the following code prints them once and then prints an empty list:

In [11]:
gen = (i for i in range(10))
print(list(gen))
print(list(gen))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]


That's why we define a <font color='blue'>function</font> that <font color='blue'>returns</font> a <font color='blue'>generator</font> instead:

In [12]:
def get_training_corpus():
    return (
        raw_datasets["train"][i : i + 1000]["whole_func_string"]
        for i in range(0, len(raw_datasets["train"]), 1000)
    )


training_corpus = get_training_corpus()
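
As a quick check of why the function helps, here is a minimal sketch on a tiny in-memory list (the name `get_batches` and the sample texts are made up for illustration). Each call returns a fresh generator, so the data can be iterated as many times as needed.

```python
def get_batches(texts, batch_size=2):
    """Return a new generator over batches each time the function is called."""
    return (texts[i : i + batch_size] for i in range(0, len(texts), batch_size))


texts = ["a = 1", "b = 2", "c = 3"]

first_pass = list(get_batches(texts))
second_pass = list(get_batches(texts))

print(first_pass)                 # [['a = 1', 'b = 2'], ['c = 3']]
print(second_pass == first_pass)  # True
```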

You can also define your <font color='blue'>generator</font> inside a <font color='blue'>for loop</font> by using the `yield` statement:

In [13]:
def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]

This produces <font color='blue'>exactly</font> the same <font color='blue'>generator</font> as <font color='blue'>before</font>, but lets you use more complex logic than fits comfortably in a generator expression.
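
As one example of such logic, the sketch below skips empty or whitespace-only texts while building batches. It works on a plain list of strings rather than the dataset, and the function name `get_filtered_corpus` and its `min_chars` parameter are hypothetical.

```python
def get_filtered_corpus(texts, batch_size=1000, min_chars=1):
    """Yield batches of texts, skipping entries that are empty or too short."""
    batch = []
    for text in texts:
        if len(text.strip()) < min_chars:
            continue  # ignore blank entries
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


sample = ["def f(): pass", "", "   ", "x = 1", "y = 2"]
batches = list(get_filtered_corpus(sample, batch_size=2))
print(batches)  # [['def f(): pass', 'x = 1'], ['y = 2']]
```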

## Training a new tokenizer

Now that we have our corpus in the form of an <font color='blue'>iterator</font> of <font color='blue'>batches of texts</font>, we are ready to <font color='blue'>train</font> a <font color='blue'>new tokenizer</font>. To do this, we first need to load the tokenizer we want to pair with our model (here, GPT-2):

In [14]:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

Even though we are going to train a new tokenizer, it's a good idea to do this to avoid starting entirely from scratch. This way, we won't have to <font color='blue'>specify anything</font> about the <font color='blue'>tokenization algorithm</font> or the <font color='blue'>special tokens</font> we want to use; our <font color='blue'>new tokenizer</font> will be <font color='blue'>exactly</font> the <font color='blue'>same</font> as <font color='blue'>GPT-2</font>, and the only thing that will <font color='blue'>change</font> is the <font color='blue'>vocabulary</font>, which will be determined by the training on our corpus.

First let's have a look at how this tokenizer would treat an example function:

In [15]:
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
tokens

['def',
 'Ġadd',
 '_',
 'n',
 'umbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġ"""',
 'Add',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`',
 '."',
 '""',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġreturn',
 'Ġa',
 'Ġ+',
 'Ġb']

This tokenizer has a few <font color='blue'>special symbols</font>, like `Ġ` and `Ċ`, which denote <font color='blue'>spaces</font> and <font color='blue'>newlines</font>, respectively. As we can see, this is <font color='blue'>not too efficient</font>: the tokenizer returns <font color='blue'>individual tokens</font> for <font color='blue'>each space</font>, when it <font color='blue'>could group</font> together <font color='blue'>indentation levels</font> (since having sets of four or eight spaces is going to be very common in code). It also <font color='blue'>split</font> the <font color='blue'>function name</font> a <font color='blue'>bit weirdly</font>, not being used to seeing words with the `_` character.
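
The `Ġ` and `Ċ` symbols come from the byte-level encoding GPT-2 applies before tokenizing: every byte is displayed as a printable character. The sketch below rebuilds that byte-to-character table in plain Python, following the scheme used in GPT-2's published encoder as understood here. The function name `byte_to_char_table` is hypothetical, and the snippet is an illustration rather than the library's code.

```python
def byte_to_char_table():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # '¡' .. '¬'
        + list(range(0xAE, 0x100))  # '®' .. 'ÿ'
    )
    table = {b: chr(b) for b in printable}
    # Every other byte (space, newline, control bytes, ...) is shifted
    # to the next unused code point starting at 256.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table


table = byte_to_char_table()
space_symbol = table[ord(" ")]
newline_symbol = table[ord("\n")]
encoded = "".join(table[b] for b in "\n    return".encode("utf-8"))

print(space_symbol, newline_symbol)  # Ġ Ċ
print(encoded)                       # ĊĠĠĠĠreturn
```

Under this mapping, a run of four spaces appears as `ĠĠĠĠ`, which is why indentation is so visible in the token lists.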

Let's <font color='blue'>train</font> a <font color='blue'>new tokenizer</font> and see if it solves those issues. For this, we'll use the method `train_new_from_iterator()`:

In [16]:
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)

This command might take a bit of time if your corpus is very large, but for this dataset of 1.6 GB of texts it's blazing fast (1 minute 16 seconds on an AMD Ryzen 9 3900X CPU with 12 cores).

Note that `AutoTokenizer.train_new_from_iterator()` <font color='blue'>only works</font> if the <font color='blue'>tokenizer</font> you are using is a <font color='blue'>fast tokenizer</font>. As you'll see in the next section, the 🤗 Transformers library contains two types of tokenizers: some are written purely in Python and others (the <font color='blue'>fast ones</font>) are backed by the 🤗 <font color='blue'>Tokenizers library</font>, which is written in the [Rust](https://www.rust-lang.org) programming language. Python is the language most often used for data science and deep learning applications, but when <font color='blue'>anything</font> needs to be <font color='blue'>parallelized</font> to <font color='blue'>be fast</font>, it is usually written in <font color='blue'>another language</font>. For instance, the matrix multiplications that are at the core of the model computation are written in CUDA, NVIDIA's C/C++-based programming platform for GPUs.

Training a brand new tokenizer in <font color='blue'>pure Python</font> would be <font color='blue'>excruciatingly slow</font>, which is why we developed the 🤗 Tokenizers library. Note that just as you didn't have to learn the CUDA language to be able to execute your model on a batch of inputs on a GPU, you won't need to learn Rust to use a fast tokenizer. The 🤗 Tokenizers library provides Python bindings for many methods that internally call some piece of code in Rust; for example, to parallelize the training of your new tokenizer or, as we saw in [Chapter 3](https://huggingface.co/learn/llm-course/chapter3/2), the tokenization of a batch of inputs.

Most of the <font color='blue'>Transformer models</font> have a <font color='blue'>fast tokenizer available</font> (there are some exceptions that you can check [here](https://huggingface.co/transformers/#supported-frameworks)), and the `AutoTokenizer` <font color='blue'>API</font> <font color='blue'>always selects</font> the <font color='blue'>fast tokenizer</font> for you if it's available. In the next section we'll take a look at some of the <font color='blue'>other special features</font> fast tokenizers have, which will be really <font color='blue'>useful</font> for tasks like <font color='blue'>token classification</font> and <font color='blue'>question answering</font>. Before diving into that, however, let's try our brand new tokenizer on the previous example:


In [18]:
tokens = tokenizer.tokenize(example)
tokens

['def',
 'Ġadd',
 '_',
 'numbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'ĊĠĠĠ',
 'Ġ"""',
 'Add',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`."""',
 'ĊĠĠĠ',
 'Ġreturn',
 'Ġa',
 'Ġ+',
 'Ġb']

Here we again see the <font color='blue'>special symbols</font> `Ġ` and `Ċ` that <font color='blue'>denote spaces and newlines</font>, but we can also see that our tokenizer <font color='blue'>learned some tokens</font> that are <font color='blue'>highly specific</font> to a corpus of <font color='blue'>Python functions</font>: for example, there is a `ĊĠĠĠ` token that represents an <font color='blue'>indentation</font>, and a `Ġ"""` token that represents the <font color='blue'>three quotes</font> that <font color='blue'>start</font> a <font color='blue'>docstring</font>. The tokenizer also correctly split the function name on `_`. This is quite a compact representation; comparatively, using the plain English tokenizer on the same example will give us a longer sequence of tokens:


In [19]:
print(len(tokens))
print(len(old_tokenizer.tokenize(example)))

27
36
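
The two counts can be turned into a relative saving with a line of arithmetic. This is a small sketch; the helper name `token_reduction` is made up for illustration.

```python
def token_reduction(old_count, new_count):
    """Percentage of tokens saved by the new tokenizer relative to the old one."""
    return 100 * (old_count - new_count) / old_count


# Token counts printed above for the `add_numbers` example.
reduction = token_reduction(old_count=36, new_count=27)
print(f"{reduction:.1f}% fewer tokens")  # 25.0% fewer tokens
```

This figure describes a single short example; the saving on other code will vary.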


Let's look at <font color='blue'>another example</font>:

In [26]:
example = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weights + self.bias
    """
tokenizer.tokenize(example)

['class',
 'ĠLinear',
 'Layer',
 '():',
 'ĊĠĠĠ',
 'Ġdef',
 'Ġ__',
 'init',
 '__(',
 'self',
 ',',
 'Ġinput',
 '_',
 'size',
 ',',
 'Ġoutput',
 '_',
 'size',
 '):',
 'ĊĠĠĠĠĠĠĠ',
 'Ġself',
 '.',
 'weight',
 'Ġ=',
 'Ġtorch',
 '.',
 'randn',
 '(',
 'input',
 '_',
 'size',
 ',',
 'Ġoutput',
 '_',
 'size',
 ')',
 'ĊĠĠĠĠĠĠĠ',
 'Ġself',
 '.',
 'bias',
 'Ġ=',
 'Ġtorch',
 '.',
 'zeros',
 '(',
 'output',
 '_',
 'size',
 ')',
 'ĊĊĠĠĠ',
 'Ġdef',
 'Ġ__',
 'call',
 '__(',
 'self',
 ',',
 'Ġx',
 '):',
 'ĊĠĠĠĠĠĠĠ',
 'Ġreturn',
 'Ġx',
 'Ġ@',
 'Ġself',
 '.',
 'weights',
 'Ġ+',
 'Ġself',
 '.',
 'bias',
 'ĊĠĠĠĠ']

In addition to the <font color='blue'>token</font> corresponding to an <font color='blue'>indentation</font>, here we can also see a <font color='blue'>token</font> for a <font color='blue'>double indentation</font>: `ĊĠĠĠĠĠĠĠ`. The special <font color='blue'>Python words</font> like `class`, `init`, `call`, `self`, and `return` are each <font color='blue'>tokenized</font> as <font color='blue'>one token</font>, and we can see that as well as splitting on `_` and `.` the tokenizer <font color='blue'>correctly splits</font> even <font color='blue'>camel-cased</font> names: `LinearLayer` is tokenized as `["ĠLinear", "Layer"]`.


## Saving the tokenizer

To make sure we can <font color='blue'>use it later</font>, we need to <font color='blue'>save</font> our <font color='blue'>new tokenizer</font>. As with models, this is done with the `save_pretrained()` method:

In [21]:
tokenizer.save_pretrained("code-search-net-tokenizer")

('code-search-net-tokenizer/tokenizer_config.json',
 'code-search-net-tokenizer/special_tokens_map.json',
 'code-search-net-tokenizer/vocab.json',
 'code-search-net-tokenizer/merges.txt',
 'code-search-net-tokenizer/added_tokens.json',
 'code-search-net-tokenizer/tokenizer.json')

This will create a new folder named <font  color='blue'>code-search-net-tokenizer</font>, which will contain <font color='blue'>all</font> the <font color='blue'>files</font> the <font color='blue'>tokenizer needs</font> to be reloaded. If you want to share this tokenizer with your colleagues and friends, you can <font  color='blue'>upload it</font> to the <font  color='blue'>Hub</font> by logging into your account. If you're working in a notebook, there's a convenience function to help you with this:


In [22]:
from huggingface_hub import notebook_login

notebook_login()

This will display a widget where you can enter your Hugging Face login credentials. If you aren't working in a notebook, run the following command in your terminal instead (drop the leading `#!`, which only comments out the notebook shell escape here):

In [23]:
#!huggingface-cli login

Once you've logged in, you can <font  color='blue'>push your tokenizer</font> by executing the following command:

In [24]:
tokenizer.push_to_hub("code-search-net-tokenizer")

CommitInfo(commit_url='https://huggingface.co/Axion004/code-search-net-tokenizer/commit/de7f1afa8e0f97cb40db3360d02cf6bb0d68ae66', commit_message='Upload tokenizer', commit_description='', oid='de7f1afa8e0f97cb40db3360d02cf6bb0d68ae66', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Axion004/code-search-net-tokenizer', endpoint='https://huggingface.co', repo_type='model', repo_id='Axion004/code-search-net-tokenizer'), pr_revision=None, pr_num=None)


This will create a <font  color='blue'>new repository</font> in your <font  color='blue'>namespace</font> with the name `code-search-net-tokenizer`, containing the <font  color='blue'>tokenizer files</font>. You can then load the tokenizer from anywhere with the `from_pretrained()` method:

In [25]:
# Replace "Axion004" below with your actual namespace to use your own tokenizer
tokenizer = AutoTokenizer.from_pretrained("Axion004/code-search-net-tokenizer")

To confirm that the tokenizer works as expected, we can once again tokenize the second example.

In [27]:
example = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weights + self.bias
    """
tokenizer.tokenize(example)

['class',
 'ĠLinear',
 'Layer',
 '():',
 'ĊĠĠĠ',
 'Ġdef',
 'Ġ__',
 'init',
 '__(',
 'self',
 ',',
 'Ġinput',
 '_',
 'size',
 ',',
 'Ġoutput',
 '_',
 'size',
 '):',
 'ĊĠĠĠĠĠĠĠ',
 'Ġself',
 '.',
 'weight',
 'Ġ=',
 'Ġtorch',
 '.',
 'randn',
 '(',
 'input',
 '_',
 'size',
 ',',
 'Ġoutput',
 '_',
 'size',
 ')',
 'ĊĠĠĠĠĠĠĠ',
 'Ġself',
 '.',
 'bias',
 'Ġ=',
 'Ġtorch',
 '.',
 'zeros',
 '(',
 'output',
 '_',
 'size',
 ')',
 'ĊĊĠĠĠ',
 'Ġdef',
 'Ġ__',
 'call',
 '__(',
 'self',
 ',',
 'Ġx',
 '):',
 'ĊĠĠĠĠĠĠĠ',
 'Ġreturn',
 'Ġx',
 'Ġ@',
 'Ġself',
 '.',
 'weights',
 'Ġ+',
 'Ġself',
 '.',
 'bias',
 'ĊĠĠĠĠ']

Loading the tokenizer using the `from_pretrained()` method works as expected.

You're now all set for <font  color='blue'>training</font> a <font  color='blue'>language model</font> from <font  color='blue'>scratch</font> and fine-tuning it on your task at hand! We'll get to that in [Chapter 7](https://huggingface.co/learn/llm-course/chapter7/1), but first, in the rest of this chapter we'll take a <font  color='blue'>closer look</font> at <font  color='blue'>fast tokenizers</font> and explore in detail what actually happens when we call the method `train_new_from_iterator()`.
