<a href="https://colab.research.google.com/github/kailas711/AI-Origins/blob/main/Training%20a%20new%20tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installation

In [None]:
!pip install -q datasets transformers[sentencepiece]

# Training a Tokenizer from an old one

**Note**
- Training a tokenizer is not the same as training a model! Model training is stochastic: we use SGD to reduce the loss on each batch, so every run gives a slightly different result unless we fix a seed for consistency.
- Training a tokenizer is a statistical process that identifies the best subwords for a given corpus; the exact rules used to pick them depend on the tokenization algorithm. It’s deterministic, meaning you always get the same results when training with the same algorithm on the same corpus.
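To make the determinism point concrete, here is a minimal sketch of a BPE-style merge learner in plain Python. It is a toy illustration, not the algorithm as implemented in any library: the function name `learn_merges`, the tiny corpus, and the alphabetical tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter

def learn_merges(corpus, num_merges):
    """Learn BPE-style merges from a list of words (toy version)."""
    # Represent each word as a tuple of single-character symbols.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken alphabetically
        # (an assumption of this sketch) so the choice is fully determined.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Apply the chosen merge to every word.
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

corpus = ["def", "def", "self", "self", "self", "return"]
first_run = learn_merges(corpus, 3)
second_run = learn_merges(corpus, 3)
print(first_run)
print(first_run == second_run)  # same corpus, same rules, same merges
```

There is no random seed anywhere in this procedure: the merges depend only on the pair counts and the selection rule.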

In 🤗 Transformers you can use an API to train a new tokenizer with the same characteristics as an existing one:

`AutoTokenizer.train_new_from_iterator()`

### 1. Assembling a corpus

The CodeSearchNet dataset will be used here.

In [None]:
from huggingface_hub import login
login()


In [13]:
from datasets import load_dataset

# This can take a few minutes to load, so grab a coffee or tea while you wait!
raw_datasets = load_dataset("claudios/code_search_net", "python")

In [14]:
raw_datasets["train"]

Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_documentation_string', 'func_code_url'],
    num_rows: 412178
})

In [15]:
print(raw_datasets["train"][45]["whole_func_string"])

def json(self,attribs =None, recurse=True, ignorelist=False):
        """See :meth:`AbstractElement.json`"""
        if not attribs: attribs = {}
        if self.idref:
            attribs['id'] = self.idref
        return super(AbstractTextMarkup,self).json(attribs,recurse, ignorelist)


Here, we’ll just use the `whole_func_string` column to train our tokenizer.

- The first thing we need to do is transform the dataset into an iterator of lists of texts. Using lists of texts lets the tokenizer train faster (on batches of texts instead of one text at a time).
- An iterator avoids holding everything in memory at once, which matters for a corpus of this size.
- 🤗 Datasets does not load everything into RAM; it stores the elements of the dataset on disk.


Using a Python generator, we can avoid Python loading anything into memory until it’s actually necessary.

In [16]:
training_corpus = (
    raw_datasets["train"][i : i + 1000]["whole_func_string"]
    for i in range(0, len(raw_datasets["train"]), 1000)
)
print(training_corpus)
# print(list(training_corpus)[:10])  # careful: uncommenting this would exhaust the generator

<generator object <genexpr> at 0x7f38726e50e0>


- The problem with a generator object is that it can only be used once.
- That’s why we define a function that returns a generator instead.
- You can define that generator with a for loop and the `yield` statement:

In [17]:
def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]
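The single-use behaviour is easy to check on a small scale. The sketch below uses a short in-memory list as a stand-in for the dataset column; the names `texts` and `batches` are hypothetical and exist only for this illustration.

```python
# Stand-in for the real dataset column: a small in-memory list of strings.
texts = [f"def f{i}(): pass" for i in range(5)]

def batches(batch_size=2):
    """Yield successive lists of texts, in the same spirit as get_training_corpus()."""
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]

gen = batches()
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # the generator is already exhausted, so this is empty

# Calling the function again creates a brand-new generator.
fresh_pass = list(batches())

print(len(first_pass), len(second_pass), len(fresh_pass))
```

This is why passing `get_training_corpus()` (a fresh generator on every call) is safer than reusing a generator object that may already have been iterated.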

**Training a new tokenizer**

Even though we are going to train a new tokenizer, it’s a good idea to load an existing one first to avoid starting entirely from scratch. This way, we won’t have to specify anything about the tokenization algorithm or the special tokens we want to use; our new tokenizer will be exactly the same as GPT-2's, and the only thing that will change is the vocabulary, which is determined by training on our corpus.

In [18]:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

In [19]:
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
tokens

['def',
 'Ġadd',
 '_',
 'n',
 'umbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġ"""',
 'Add',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`',
 '."',
 '""',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġreturn',
 'Ġa',
 'Ġ+',
 'Ġb']

This tokenizer has a few special symbols, like Ġ and Ċ, which denote spaces and newlines, respectively. As we can see, this is not too efficient: the tokenizer returns individual tokens for each space, when it could group together indentation levels.
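The space and newline markers come from the byte-to-character table used by byte-level BPE. The sketch below rebuilds that table as it appears in the reference GPT-2 encoder; treat it as an illustration written from memory of that scheme rather than library code, and note that the function name is simply the one used here.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character:
    # '!'..'~', then 0xA1..0xAC and 0xAE..0xFF.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = keep[:]
    code_points = keep[:]
    shift = 0
    for b in range(256):
        if b not in keep:
            # Non-printable bytes (space, newline, control characters, ...)
            # are shifted to code points starting at 256.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, (chr(c) for c in code_points)))

table = bytes_to_unicode()
space_marker = table[ord(" ")]     # U+0120
newline_marker = table[ord("\n")]  # U+010A
print(repr(space_marker), repr(newline_marker))
```

Under this mapping a space (byte 32) lands on U+0120 and a newline (byte 10) on U+010A, which are the two markers that appear in the token list.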

Note that `AutoTokenizer.train_new_from_iterator()` only works if the tokenizer you are using is a “fast” tokenizer.
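One way to fail early is to check the tokenizer's `is_fast` attribute before training. The helper below is a hypothetical convenience, not part of 🤗 Transformers, and the two small classes are stand-ins so the sketch runs without downloading a model.

```python
def ensure_fast(tokenizer):
    """Return the tokenizer unchanged, or raise if it is not a fast tokenizer."""
    # Tokenizers in Transformers expose an `is_fast` attribute; objects
    # without it are treated as slow here (an assumption of this sketch).
    if not getattr(tokenizer, "is_fast", False):
        raise ValueError("train_new_from_iterator() needs a fast tokenizer")
    return tokenizer

class FakeFastTokenizer:
    """Hypothetical stand-in for a fast tokenizer."""
    is_fast = True

class FakeSlowTokenizer:
    """Hypothetical stand-in for a slow tokenizer."""
    is_fast = False

checked = ensure_fast(FakeFastTokenizer())

try:
    ensure_fast(FakeSlowTokenizer())
    slow_rejected = False
except ValueError:
    slow_rejected = True

print(type(checked).__name__, slow_rejected)
```

In the notebook, the equivalent call would be `ensure_fast(old_tokenizer)` just before training.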

In [20]:
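# Use a fresh generator from get_training_corpus(), since a generator can only be consumed once.
tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), vocab_size=52000)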

In [22]:
example = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    """
tokenizer.tokenize(example)

['class',
 'ĠLinear',
 'Layer',
 '():',
 'ĊĠĠĠ',
 'Ġdef',
 'Ġ__',
 'init',
 '__(',
 'self',
 ',',
 'Ġinput',
 '_',
 'size',
 ',',
 'Ġoutput',
 '_',
 'size',
 '):',
 'ĊĠĠĠĠĠĠĠ',
 'Ġself',
 '.',
 'weight',
 'Ġ=',
 'Ġtorch',
 '.',
 'randn',
 '(',
 'input',
 '_',
 'size',
 ',',
 'Ġoutput',
 '_',
 'size',
 ')',
 'ĊĠĠĠĠĠĠĠ',
 'Ġself',
 '.',
 'bias',
 'Ġ=',
 'Ġtorch',
 '.',
 'zeros',
 '(',
 'output',
 '_',
 'size',
 ')',
 'ĊĊĠĠĠĠ']
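A quick way to judge the effect of retraining is to compare how many tokens each tokenizer needs for the same text. The helper below is a sketch that only assumes each object has a `tokenize()` method; `token_counts` and the two stand-in classes are hypothetical and are used here so the example runs on its own.

```python
def token_counts(text, **tokenizers):
    """Return {name: number of tokens} for each tokenizer passed by keyword."""
    return {name: len(tok.tokenize(text)) for name, tok in tokenizers.items()}

class WhitespaceTokenizer:
    """Hypothetical stand-in: one token per whitespace-separated chunk."""
    def tokenize(self, text):
        return text.split()

class CharTokenizer:
    """Hypothetical stand-in: one token per character."""
    def tokenize(self, text):
        return list(text)

sample = "def add(a, b): return a + b"
counts = token_counts(sample, coarse=WhitespaceTokenizer(), fine=CharTokenizer())
print(counts)
```

With the objects from this notebook, the same helper could be called as `token_counts(example, old=old_tokenizer, new=tokenizer)`; a lower count for the new tokenizer would indicate that it represents code more compactly.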

**Saving the tokenizer**

In [24]:
tokenizer.save_pretrained("code-search-net-tokenizer")

('code-search-net-tokenizer/tokenizer_config.json',
 'code-search-net-tokenizer/special_tokens_map.json',
 'code-search-net-tokenizer/vocab.json',
 'code-search-net-tokenizer/merges.txt',
 'code-search-net-tokenizer/added_tokens.json',
 'code-search-net-tokenizer/tokenizer.json')

# Training a Tokenizer from scratch