## 7. GPT Tokenizer

Train a word-level tokenizer for GPT. The tokenizer splits text on whitespace and assigns an id to each of the `vocab_size` most frequent tokens; any other token is mapped to `<unk>`.
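
To make the idea concrete, here is a minimal pure-Python sketch of "split on whitespace, keep the most frequent tokens". It is an illustration only, not the `tokenizers` implementation: the function name `build_word_level_vocab` and the alphabetical tie-break are my own assumptions.

```python
from collections import Counter

def build_word_level_vocab(texts, vocab_size):
    """Map the `vocab_size` most frequent whitespace-separated tokens to ids."""
    counts = Counter(token for text in texts for token in text.split())
    # Sort by descending frequency; ties are broken alphabetically
    # (an assumption made here so the result is deterministic).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {token: idx for idx, (token, _) in enumerate(ranked[:vocab_size])}

toy_corpus = ["0101 0011 0101", "0101 0011 1111"]
vocab = build_word_level_vocab(toy_corpus, vocab_size=2)
print(vocab)
```

With `vocab_size=2`, the rarest token (`1111`) gets no id, which is why an unknown-token fallback is needed.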

In [3]:
from tokenizers import Tokenizer, Regex
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import WhitespaceSplit, Split
from pathlib import Path
import os

Define where to load the text files from and where to save the tokenizer:

In [6]:
TXT_LOCATION = Path('chords-txt-augmented/')
TOKENIZER_SAVEDIR = Path('tokenizers/chord-augmented-tokenizer')
TOKENIZER_SAVEDIR.mkdir(exist_ok=True, parents=True)

In [3]:
tokenizer = Tokenizer(WordLevel(unk_token="<unk>"))

Define the special tokens and the vocabulary size. For the raw text and note pair GPT models, I use a large vocabulary of about `30000` tokens. For the cleaned chord dataset, I intentionally restrict the vocabulary to `8000` tokens because of how that dataset is preprocessed (see `09-make-cleaned-chord-dataset.ipynb`).

In [4]:
SPECIAL_TOKENS = [
    "<start>",
    "</start>",
    "<pad>",
    "<unk>",
]
VOCAB_SIZE = 8000
trainer = WordLevelTrainer(show_progress=True, special_tokens=SPECIAL_TOKENS, vocab_size=VOCAB_SIZE)
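
The sketch below shows how a capped vocabulary and an unknown token interact. It is a toy model in plain Python, not the library's code. It assumes, as I understand the trainer to behave, that special tokens receive the lowest ids and count toward `vocab_size`. The helpers `assign_ids` and `encode` are hypothetical names.

```python
def assign_ids(ranked_tokens, special_tokens, vocab_size):
    """Assign ids to special tokens first, then to tokens ranked by frequency."""
    # Assumption: special tokens take the lowest ids and count toward vocab_size.
    kept = (list(special_tokens) + list(ranked_tokens))[:vocab_size]
    return {token: idx for idx, token in enumerate(kept)}

def encode(text, vocab, unk_token="<unk>"):
    """Look up each whitespace-separated token, falling back to the unknown id."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in text.split()]

specials = ["<start>", "</start>", "<pad>", "<unk>"]
# "a", "b", "c" stand in for tokens already sorted from most to least frequent.
toy_vocab = assign_ids(["a", "b", "c"], specials, vocab_size=6)
toy_ids = encode("a b c", toy_vocab)
print(toy_vocab)
print(toy_ids)
```

Under these assumptions only two regular tokens fit into a vocabulary of six, so `c` is encoded with the id of `<unk>`.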

### Create and train word-level tokenizer

In [None]:
tokenizer.pre_tokenizer = WhitespaceSplit()

In [8]:
files = [str(TXT_LOCATION / path) for path in os.listdir(TXT_LOCATION)]
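
`os.listdir` returns entries in arbitrary order and also lists subdirectories, so an index such as `files[11]` can point at a different file on another machine. A possible alternative, sketched with `pathlib`, is shown here. The `*.txt` pattern is an assumption about how the corpus files are named, and `list_training_files` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

def list_training_files(directory, pattern="*.txt"):
    """Return matching file paths as strings, sorted by name, skipping directories."""
    return [str(p) for p in sorted(Path(directory).glob(pattern)) if p.is_file()]

# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.txt", "a.txt"):
        (Path(tmp) / name).write_text("0101 0011")
    (Path(tmp) / "nested").mkdir()
    demo_files = list_training_files(tmp)
    demo_names = [Path(f).name for f in demo_files]

print(demo_names)
```

In this notebook the call would be `files = list_training_files(TXT_LOCATION)`, provided the pattern matches the actual file names.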

In [7]:
tokenizer.train(files, trainer)

Let's do a sanity check to make sure the tokenizer works as expected:

In [9]:
with open(files[11], "r") as f:
    text = f.read()

In [10]:
text_sample = ' '.join(text.split()[:50])
print(text_sample)

000000000000001010010000001010000000 000010000010000100000000000000000000 000000000000000100010100001000000001 000000000000001000010000000000010000 000001000000000010000100000000000000 000010000000001000010001000000000000 000010000100100010010100000000000000 000000100000100010000000000010001001 000000010100100001000000000000000000 000001001001000010000000000000000000 000001000110001000010000000000000000 000000001000010000100010000000000000 000000100001000010001000000000000000 000010000100001000010000000000000000 100000000100100000000000000000000000 000000001000001000010000100010010000 000000000000000010000000100010000000 000000010000001010010000000100000000 000000000000001010000000001010000000 000000010000000001010000000000000000 000000010010010010000000000000000000 000000100000000000000000101000000000 000000010000000000100100000000000000 000001000001000100010000000000000000 000000000100001010000000000000000000 000010000010000100000000000000000000 000000000010000010000100000000000000 0

In [11]:
# Encode the sample
encoding = tokenizer.encode(text_sample)
print(encoding.ids[:10])

[2970, 189, 4363, 892, 1870, 885, 4908, 1044, 186, 322]


In [12]:
# Decode the ids and check that they reproduce the original sample
decoded = tokenizer.decode(encoding.ids)
decoded == text_sample

True
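
The check above covers a single 50-token sample. With the vocabulary capped, rarer tokens elsewhere in the corpus may still be mapped to `<unk>`, so it can help to measure how often that happens. The helper below is a small sketch; `unk_fraction` is a hypothetical name and the ids in the example are made up for illustration.

```python
def unk_fraction(ids, unk_id):
    """Return the share of ids that equal the unknown-token id."""
    if not ids:
        return 0.0
    return sum(1 for token_id in ids if token_id == unk_id) / len(ids)

# Illustrative ids only: one of the four equals the assumed unknown id 3.
example_fraction = unk_fraction([2970, 189, 3, 892], unk_id=3)
print(example_fraction)
```

With the objects from this notebook, the call would be `unk_fraction(encoding.ids, tokenizer.token_to_id("<unk>"))`.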

### Save tokenizer to file

In [13]:
tokenizer.save(str(TOKENIZER_SAVEDIR / 'tokenizer.json'))

In [7]:
# Load the tokenizer back from the saved file
tokenizer = Tokenizer.from_file(str(TOKENIZER_SAVEDIR / 'tokenizer.json'))