# 3_About GPT vocab and merge files
---

## Learning Objectives
- **The goal of this lab is to:**
    - Understand the difference between a plain BPE tokenizer and the byte-level GPT-2 BPE (GPTBPE) tokenizer
    - Load the GPTBPE tokenizer and verify that it tokenizes text as expected


Download the GPT vocab and merge files 

Download vocab file [English_vocab](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json)

Download merge file [English_merge](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt)


#### Let's review the source code of the [GPT-2 tokenizer](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html)

Construct a GPT-2 tokenizer. Based on byte-level Byte-Pair-Encoding.

This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
be encoded differently whether it is at the beginning of the sentence (without space) or not:

    from transformers import GPT2Tokenizer
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    tokenizer(" Hello world")['input_ids']
    # [18435, 995]


In [1]:
!pip install tokenizers  transformers ipywidgets

Defaulting to user installation because normal site-packages is not writeable


In [2]:
!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
!wget https://huggingface.co/openai-gpt/resolve/main/vocab.json
!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
!wget https://huggingface.co/openai-gpt/resolve/main/merges.txt

--2021-09-15 09:29:57--  https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.95.125
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.95.125|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1042301 (1018K) [application/json]
Saving to: ‘gpt2-vocab.json’


2021-09-15 09:29:58 (1.53 MB/s) - ‘gpt2-vocab.json’ saved [1042301/1042301]

--2021-09-15 09:29:58--  https://huggingface.co/openai-gpt/resolve/main/vocab.json
Resolving huggingface.co (huggingface.co)... 107.23.77.87, 34.200.164.230, 34.195.144.223, ...
Connecting to huggingface.co (huggingface.co)|107.23.77.87|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 815973 (797K) [application/json]
Saving to: ‘vocab.json’


2021-09-15 09:29:59 (1.78 MB/s) - ‘vocab.json’ saved [815973/815973]

--2021-09-15 09:30:00--  https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
Resolving s3.amazonaws.com (s3

In [3]:
!jupyter nbextension enable --py widgetsnbextension

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


## Examine the vocab and merge files

In [4]:
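import json
import random

with open('gpt2-vocab.json') as ip_file:
    vocab = json.load(ip_file)  # maps token string -> token id

take = 20
# pick a random start that always leaves a full window of `take` entries
rn = random.randint(0, len(vocab) - take)
print("note: Ġ is chr(ord(' ') + 256), the byte-level stand-in for a space")
print(list(vocab.keys())[rn:rn + take])

note: Ġ is chr(ord(' ') + 256), the byte-level stand-in for a space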
['Ġassorted', 'ĠRevision', 'ĠPiano', 'ĠGideon', 'Ocean', 'Ġsalon', 'Ġbustling', 'ognitive', 'ĠRahman', 'Ġwaiter', 'Ġpresets', 'ĠOsh', 'ĠGHC', 'operator', 'Ġreptiles', 'Ġ413', 'ĠGarr', 'ĠChak', 'Ġhashes', 'Ġfailings']
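
To see where `Ġ` comes from, the sketch below rebuilds the byte-to-character table along the lines of the `bytes_to_unicode` function in the GPT-2 tokenizer source, lightly restructured for readability. The helper `to_byte_level` is a hypothetical name introduced only for this illustration; it is not part of `transformers` or `tokenizers`.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character.

    Bytes that are already printable keep their own character; the others
    (control bytes, space, ...) are shifted to code points 256, 257, ...
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shifted = 0
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shifted)
            shifted += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


def to_byte_level(text):
    # hypothetical helper for illustration: UTF-8 encode, then map every byte
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


table = bytes_to_unicode()
print(repr(table[ord(" ")]), ord(table[ord(" ")]))
print(to_byte_level(" Hello world"))
```

Under this table the space byte (32) is the 33rd non-printable byte, so it lands on code point 256 + 32 = 288, which is `Ġ`. That is why so many vocab entries start with that character: they are tokens that begin with a space.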


In [5]:
!tail -n 5 gpt2-merges.txt

om inated
Ġreg ress
ĠColl ider
Ġinform ants
Ġg azed
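
Each line of the merge file is one merge rule: two symbols that are joined into a single token, with earlier lines taking priority over later ones. The sketch below shows the idea on a toy merge table. The function `apply_merges` and the three toy rules are assumptions made for illustration; they are not taken from `gpt2-merges.txt`, and the loop is a simplified version of what a real BPE implementation does.

```python
def apply_merges(symbols, merges):
    """Repeatedly merge the adjacent pair with the best (lowest) rank."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(symbols)
    while len(symbols) > 1:
        # rank every adjacent pair; pairs without a rule get infinity
        candidates = [
            (ranks.get((left, right), float("inf")), i)
            for i, (left, right) in enumerate(zip(symbols, symbols[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no applicable merge rule is left
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# toy merge table, highest priority first (hypothetical rules)
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
result = apply_merges("lower", toy_merges)
print(result)
```

With these toy rules, `l o` merges first, then `lo w`, then `e r`, leaving the two tokens `low` and `er`.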


## Sanity check: load GPT2Tokenizer from transformers

In [6]:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

print('\n notice the **SPACE** in front of ** Hello world** \n')
sample_text=" Hello world"
print(sample_text)
out=tokenizer.tokenize(sample_text)
print("tokens:",out)
ids=tokenizer(sample_text)['input_ids']
print("ids:",ids)
## expected output :
## [18435, 995]


 notice the **SPACE** in front of ** Hello world** 

 Hello world
tokens: ['ĠHello', 'Ġworld']
ids: [18435, 995]


In [7]:
import json

from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.models import BPE


def load_tokenizer(vocab_file, merge_file, gpt2):
    """Build a BPE tokenizer from vocab/merge files; return (tokenizer, vocab dict)."""
    tokenizer = Tokenizer(BPE.from_file(vocab_file, merge_file))
    with open(vocab_file, 'r') as f2:
        vocab = json.load(f2)
    if gpt2:
        # the byte-level pre-tokenizer and decoder turn plain BPE into GPT-2 style BPE
        tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
        tokenizer.decoder = ByteLevelDecoder()
    return tokenizer, vocab


vocab_file = './gpt2-vocab.json'
merge_file = './gpt2-merges.txt'
sample_text = ' Hello world'

# GPT-2 style: BPE with byte-level pre-tokenization
tokenizers_gpt, _ = load_tokenizer(vocab_file, merge_file, True)
output = tokenizers_gpt.encode(sample_text)
print("tokens: ", output.tokens)
print("ids:", output.ids)

# plain BPE: same vocab and merges, but no byte-level pre-tokenization
tokenizers_bpe, _ = load_tokenizer(vocab_file, merge_file, False)
output = tokenizers_bpe.encode(sample_text)

print("---"*10)
print('\nnotice the difference when using BPE as tokenizer instead of GPT2BPE tokenizer')
print("tokens: ", output.tokens)
print("ids:", output.ids)


tokens:  ['ĠHello', 'Ġworld']
ids: [18435, 995]
------------------------------

notice the difference when using BPE as tokenizer instead of GPT2BPE tokenizer
tokens:  ['H', 'ellow', 'orld']
ids: [39, 5037, 1764]
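
The only difference between the two tokenizers above is the byte-level pre-tokenization step, which splits the text into word-like pieces and rewrites each space as `Ġ` before any merge rule is applied. The sketch below imitates that step with the standard library only. The regular expression is an ASCII-only simplification of the splitting pattern in the GPT-2 tokenizer source (an assumption for illustration: the real pattern also covers contractions and full unicode letter and number classes), and `simple_byte_level_pre_tokenize` is a hypothetical helper name. It does not model the `add_prefix_space` option of `pre_tokenizers.ByteLevel`.

```python
import re

# ASCII-only simplification of the GPT-2 splitting pattern (assumption)
SPLIT_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")
SPACE_MARKER = chr(ord(" ") + 256)  # the character 'Ġ'


def simple_byte_level_pre_tokenize(text):
    # split into pieces that keep their leading space, then mark the space
    pieces = SPLIT_PATTERN.findall(text)
    return [piece.replace(" ", SPACE_MARKER) for piece in pieces]


with_space = simple_byte_level_pre_tokenize(" Hello world")
without_space = simple_byte_level_pre_tokenize("Hello world")
print(with_space)
print(without_space)
```

With the leading space, the first piece becomes `ĠHello`, which exists in the vocab as a single token. Without pre-tokenization, plain BPE receives the raw string, has no vocab entry for a literal space, and ends up merging characters across the word boundary, which is consistent with the `'ellow'` and `'orld'` tokens in the output above.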


In [8]:
## clean up: remove the openai-gpt vocab.json and merges.txt downloaded in In [2]
!rm -f merges.txt
!rm -f vocab.json

---
## Up Next:

[Jsonify and convert to mmap](./Day2-4_jsonfy_and_process2mmap.ipynb)

## Back To Start Menu
[start menu](../Start_Here.ipynb)

-----


## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). 