## GPT Tokenizer files
---

## Learning Objectives

The goal of this lab is to examine the difference between a plain BPE tokenizer and the GPTBPE tokenizer.

Later on, we will use the observations from this notebook to train a GPT Tokenizer with our own raw text data.

We will load and verify the GPTBPE tokenizer and make sure the output tokens and token ids are as expected.


Let's review the source code of the [gpt2 tokenizer](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html), whose docstring says:

    This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
    be encoded differently whether it is at the beginning of the sentence (without space) or not:

        from transformers import GPT2Tokenizer
        tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

        tokenizer(" Hello world")['input_ids']
        [18435, 995]


Install the necessary Python libraries.

In [None]:
!pip install tokenizers transformers ipywidgets
!jupyter nbextension enable --py widgetsnbextension

Next, we fetch the pretrained GPT tokenizer files, namely the vocab and merges files.

In [None]:
!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

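Examine the vocab and merges files and note the presence of the Ġ character.
Ġ is the character at code point 288, that is, space (32) + 256. It stands in for the space byte, so a token that starts with Ġ is a word that was preceded by a space.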
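To make the `space + 256` rule concrete, here is a minimal sketch of the byte-to-character table, modeled on the `bytes_to_unicode` helper in the GPT-2 tokenizer source linked above. It is a re-implementation for illustration, not the library code itself; the names `byte_map` and `space_symbol` are our own.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a visible Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # The remaining bytes (control characters, space, ...) are shifted past 255.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

byte_map = bytes_to_unicode()
space_symbol = byte_map[ord(" ")]
print(space_symbol, hex(ord(space_symbol)))  # Ġ 0x120
# Every byte of the text is replaced by its visible stand-in.
print("".join(byte_map[b] for b in " Hello world".encode("utf-8")))  # ĠHelloĠworld
```

Because bytes 0 to 32 are all non-printable, the space byte (32) lands exactly on 256 + 32, which is why the shortcut "space + 256" works.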

In [None]:
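import json
import random

# Print a random window of 20 consecutive vocabulary entries.
with open('gpt2-vocab.json', encoding='utf-8') as ip_file:
    o = json.load(ip_file)
take = 20
rn = random.randint(0, len(o) - take)  # start index that leaves room for `take` entries
print("note that Ġ (space + 256) marks a leading space")
print(list(o.keys())[rn:rn+take])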

In [None]:
!tail -n 5 gpt2-merges.txt
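Each line of the merges file is a pair of symbols separated by a space, and earlier lines take priority over later ones. The sketch below shows how such rules can be applied greedily to one word. It is a simplified illustration, not the tokenizers library's implementation, and the merge rules in `toy_merges` are hypothetical, not taken from `gpt2-merges.txt`.

```python
def parse_merges(text):
    """Map each merge pair to its rank (lower rank = applied earlier)."""
    ranks = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#version"):
            continue  # skip blank lines and the version header
        left, right = line.split(" ")
        ranks[(left, right)] = len(ranks)
    return ranks

def apply_bpe(word, ranks):
    """Repeatedly merge the best-ranked adjacent pair until none applies."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge is left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge rules, written in the same layout as a merges file.
toy_merges = """#version: 0.2
Ġ w
o r
Ġw or
l d
Ġwor ld
"""
toy_ranks = parse_merges(toy_merges)
toy_tokens = apply_bpe("Ġworld", toy_ranks)
print(toy_tokens)  # ['Ġworld']
```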

The following code block loads GPT2Tokenizer from the HuggingFace transformers library and verifies the following:

            from transformers import GPT2Tokenizer
            tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

            tokenizer(" Hello world")['input_ids']
            # expected token ids for " Hello world": [18435, 995]

In [None]:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

print('\n notice the **SPACE** in front of ** Hello world** \n')
sample_text=" Hello world"
print(sample_text)
out=tokenizer.tokenize(sample_text)
print("tokens:",out)
ids=tokenizer(sample_text)['input_ids']
print("ids:",ids)
## expected output :
## [18435, 995]

Below is the expected output:
    
         Hello world
        tokens: ['ĠHello', 'Ġworld']
        ids: [18435, 995]

The next code block uses the HuggingFace tokenizers library to observe the difference between setting `use_gpt` to True and to False.

Setting `use_gpt` to True applies the following:

        tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
        tokenizer.decoder = ByteLevelDecoder()

This is the expected tokenizer behavior for GPT models, i.e. the GPTBPE tokenizer: it loads the vocab.json and merges.txt files and tokenizes as expected. Setting `use_gpt` to False yields a plain BPE tokenizer, which tokenizes the same text differently.
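To build some intuition for what byte-level pre-tokenization contributes, the sketch below splits text so that every word keeps its leading space, shown as Ġ. It is a rough, ASCII-only approximation: the split pattern `SIMPLE_SPLIT` and the function name are our own simplifications, the real `ByteLevel` pre-tokenizer uses a more detailed pattern and maps every byte, and this sketch never adds a prefix space (the library exposes an `add_prefix_space` option for that).

```python
import re

# Simplified stand-in for the GPT-2 split pattern: an optional leading space
# followed by non-space characters, or a run of whitespace.
SIMPLE_SPLIT = re.compile(r" ?\S+|\s+")

def simple_byte_level_pretokenize(text):
    """Split text into pieces and show each space as the Ġ marker."""
    return [piece.replace(" ", "Ġ") for piece in SIMPLE_SPLIT.findall(text)]

with_space = simple_byte_level_pretokenize(" Hello world")
without_space = simple_byte_level_pretokenize("Hello world")
print(with_space)     # ['ĠHello', 'Ġworld']
print(without_space)  # ['Hello', 'Ġworld']
```

The pieces are produced before any BPE merge runs, so a merge can never join characters from two different words.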

In [None]:
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.models import BPE
import json

def load_tokenizer(vocab_file, merge_file, use_gpt):
    # Build a BPE tokenizer from the pretrained vocab and merges files.
    tokenizer = Tokenizer(BPE.from_file(vocab_file, merge_file))
    with open(vocab_file, 'r', encoding='utf-8') as f2:
        vocab = json.load(f2)
    if use_gpt:
        # Byte-level pre-tokenization and decoding give the GPTBPE behavior.
        tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
        tokenizer.decoder = ByteLevelDecoder()
    return tokenizer, vocab
vocab_file='./gpt2-vocab.json'
merge_file='./gpt2-merges.txt'
tokenizers_gpt,_=load_tokenizer(vocab_file,merge_file,True)
sample_text=' Hello world' 
output=tokenizers_gpt.encode(sample_text)
ids=output.ids
tokens=output.tokens
#print(tokens ,'\n')
print("tokens: ",tokens)
print("ids:",ids)

tokenizers_bpe,_=load_tokenizer(vocab_file,merge_file, False)
sample_text=' Hello world'
output=tokenizers_bpe.encode(sample_text)
ids=output.ids
tokens=output.tokens
print("---"*10)
print('\nnotice the difference when using BPE as tokenizer instead of GPT2BPE tokenizer')
print("tokens: ",tokens)
print("ids:",ids)

Below is the expected output:

        tokens:  ['ĠHello', 'Ġworld']
        ids: [18435, 995]
        ------------------------------

        notice the difference when using BPE as tokenizer instead of GPT2BPE tokenizer
        tokens:  ['H', 'ellow', 'orld']
        ids: [39, 5037, 1764]

What did we observe?

Setting `use_gpt` to True gives us the expected behavior of GPTBPE tokenization. The following two lines ensure the presence of Ġ:

    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
    tokenizer.decoder = ByteLevelDecoder()

Without them, the leading-space marker is lost and, as the expected output above shows, the merges run across the word boundary (`'ellow'`, `'orld'`).

Therefore, we will keep these two lines when training our own GPTBPE tokenizer with our own raw text data.


We will now move gpt2-vocab.json and gpt2-merges.txt to the correct data folder in preparation for the next step.

In [None]:
!mv gpt2-vocab.json ../dataset/EN/50k/
!mv gpt2-merges.txt ../dataset/EN/50k/
!ls ../dataset/EN/50k/
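If the target folder might not exist yet, the same move can be done from Python. This is an optional sketch; the helper name `move_tokenizer_files` is hypothetical and not part of the lab material. It creates the destination folder when needed and skips files that are missing or were already moved.

```python
import shutil
from pathlib import Path

def move_tokenizer_files(src_dir, dst_dir,
                         names=("gpt2-vocab.json", "gpt2-merges.txt")):
    """Move the named files into dst_dir and return the new paths."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    moved = []
    for name in names:
        src = Path(src_dir) / name
        if src.is_file():  # skip files that are missing or already moved
            moved.append(Path(shutil.move(str(src), str(dst / name))))
    return moved

# Example call, left commented out so the files are not moved twice:
# move_tokenizer_files(".", "../dataset/EN/50k/")
```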

---

## Links and Resources
Don't forget to check out additional resources such as the [HuggingFace Tokenizer Documentation](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) and [Train GPT-2 in your own language](https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171).


-----
## <p style="text-align:center;border:3px; padding: 1em"> <a href=../Start_Here.ipynb>HOME</a>&nbsp; &nbsp; &nbsp; <a href=./Lab1-5_jsonfy_and_process2mmap.ipynb>NEXT</a></p>

-----


## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). 