## Training the tokenizer
This notebook shows how to use the tokenizer.

First, we import the tokenizer. It uses a byte-pair encoding (BPE) algorithm and is implemented in the class BPETokenizer.

You can train the algorithm on multiple documents or on a single text file.
In this example we use multiple documents.

You also have to choose the target number of tokens. For this example we use 1000.

In [1]:
from BPE import BPETokenizer
import pickle

tokenizer = BPETokenizer()
tokenizer.tokenizer_fit_multidocs(1000, 'corpus')

Opening files text


  0%|          | 0/10000 [00:00<?, ?it/s]

Applying regex rules on files text...
Selecting tokens


  0%|          | 0/10000 [00:00<?, ?it/s]

training...


  0%|          | 0/744 [00:00<?, ?it/s]

pair (32, 100) -> 256
pair (32, 101) -> 257
pair (114, 97) -> 258
pair (32, 97) -> 259
pair (256, 101) -> 260
pair (111, 115) -> 261
pair (32, 112) -> 262
pair (110, 116) -> 263
pair (32, 99) -> 264
pair (101, 115) -> 265
pair (111, 114) -> 266
pair (101, 114) -> 267
pair (97, 115) -> 268
pair (32, 115) -> 269
pair (105, 97) -> 270
pair (105, 115) -> 271
pair (97, 100) -> 272
pair (105, 110) -> 273
pair (195, 163) -> 274
pair (114, 101) -> 275
pair (274, 111) -> 276
pair (109, 97) -> 277
pair (105, 99) -> 278
pair (32, 111) -> 279
pair (264, 111) -> 280
pair (116, 97) -> 281
pair (195, 167) -> 282
pair (32, 110) -> 283
pair (114, 111) -> 284
pair (109, 101) -> 285
pair (113, 117) -> 286
pair (105, 100) -> 287
pair (32, 102) -> 288
pair (97, 114) -> 289
pair (110, 100) -> 290
pair (195, 169) -> 291
pair (116, 101) -> 292
pair (97, 108) -> 293
pair (32, 49) -> 294
pair (32, 67) -> 295
pair (105, 111) -> 296
pair (32, 65) -> 297
pair (256, 111) -> 298
pair (32, 124) -> 299
pair (195, 173)
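
Each line of the log above reports one merge: the most frequent adjacent pair of ids is replaced by a new id. The sketch below illustrates that idea on a toy string. It is a minimal illustration of byte-level BPE, not the BPETokenizer implementation: the function names (count_pairs, merge_pair, train_bpe) are hypothetical, and the regex pre-splitting step mentioned in the log is left out.

```python
from collections import Counter

def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))

def merge_pair(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged

def train_bpe(text, vocab_size):
    """Learn merges until the vocabulary reaches `vocab_size` tokens."""
    ids = list(text.encode("utf-8"))  # start from raw bytes: ids 0..255
    merges = {}
    for new_id in range(256, vocab_size):
        pairs = count_pairs(ids)
        if not pairs:
            break  # nothing left to merge
        best = pairs.most_common(1)[0][0]
        merges[best] = new_id
        ids = merge_pair(ids, best, new_id)
    return merges

# Toy corpus: the first merge learned is (32, 100) -> 256,
# that is, a space followed by 'd'.
merges = train_bpe("de do de do de", 258)
for pair, new_id in merges.items():
    print(f"pair {pair} -> {new_id}")
```

Starting from the 256 byte values, a target of 1000 tokens leaves room for 1000 - 256 = 744 merges, which is consistent with the last progress bar above.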

### Saving the tokenizer

You may have noticed that this task takes some time to finish.
This is partly because the code is not optimized for large documents.
Since we split the texts into words, a simple way to optimize the code would be
to build a dictionary of tokens instead of a long list of tokens, so that we would not need to
iterate over the same word multiple times.
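
One possible reading of this suggestion is sketched below: store each distinct word once, together with its frequency, and weight the pair counts by that frequency. The helper count_pairs_by_word and the sample words are hypothetical and only meant to illustrate the idea; the real tokenizer splits text with its own regex rules.

```python
from collections import Counter

def count_pairs_by_word(word_counts):
    """Count adjacent id pairs, visiting each distinct word only once."""
    pairs = Counter()
    for word_ids, freq in word_counts.items():
        for pair in zip(word_ids, word_ids[1:]):
            pairs[pair] += freq  # weight by how often the word occurs
    return pairs

# Hypothetical pre-split words, stored as tuples of byte values.
words = [" de", " do", " de", " de", " do"]
word_counts = Counter(tuple(w.encode("utf-8")) for w in words)

pairs = count_pairs_by_word(word_counts)
print(len(word_counts))   # 2 distinct words instead of 5 entries
print(pairs[(32, 100)])   # 5: the pair (space, 'd') occurs in every word
```

With this layout, the cost of counting pairs grows with the number of distinct words rather than with the total number of words in the corpus.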

Either way, we suggest saving the tokenizer. This lets you reuse the trained tokenizer
and run its encode and decode methods later on.

In [2]:
with open('tokenizer1000.pkl', 'wb') as file:
    pickle.dump(tokenizer, file)
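
Loading works the same way in reverse. The sketch below shows a save and load round trip with the standard pickle module. The helper names and the stand-in dictionary are assumptions used so that the example runs without the BPE module; with the real tokenizer you would pass the BPETokenizer instance and the file name used above.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Serialize `obj` to `path` with pickle."""
    with open(path, "wb") as file:
        pickle.dump(obj, file)

def load_pickle(path):
    """Load a previously pickled object from `path`."""
    with open(path, "rb") as file:
        return pickle.load(file)

# Stand-in object so the sketch runs on its own.
stand_in = {"merges": {(32, 100): 256}}
path = os.path.join(tempfile.mkdtemp(), "tokenizer_demo.pkl")

save_pickle(stand_in, path)
restored = load_pickle(path)
print(restored == stand_in)  # True
```

Two things to keep in mind: pickle needs the class definition to be importable when loading, so the BPE module must be available in the session that loads the file, and pickle files should only be loaded from sources you trust.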

### Checking the vocabulary

You can check the vocabulary through tokenizer.vocab.
The vocabulary is stored in a dictionary that maps each token id to its binary (bytes) value.

In [3]:
tokenizer.vocab

{0: b'\x00',
 1: b'\x01',
 2: b'\x02',
 3: b'\x03',
 4: b'\x04',
 5: b'\x05',
 6: b'\x06',
 7: b'\x07',
 8: b'\x08',
 9: b'\t',
 10: b'\n',
 11: b'\x0b',
 12: b'\x0c',
 13: b'\r',
 14: b'\x0e',
 15: b'\x0f',
 16: b'\x10',
 17: b'\x11',
 18: b'\x12',
 19: b'\x13',
 20: b'\x14',
 21: b'\x15',
 22: b'\x16',
 23: b'\x17',
 24: b'\x18',
 25: b'\x19',
 26: b'\x1a',
 27: b'\x1b',
 28: b'\x1c',
 29: b'\x1d',
 30: b'\x1e',
 31: b'\x1f',
 32: b' ',
 33: b'!',
 34: b'"',
 35: b'#',
 36: b'$',
 37: b'%',
 38: b'&',
 39: b"'",
 40: b'(',
 41: b')',
 42: b'*',
 43: b'+',
 44: b',',
 45: b'-',
 46: b'.',
 47: b'/',
 48: b'0',
 49: b'1',
 50: b'2',
 51: b'3',
 52: b'4',
 53: b'5',
 54: b'6',
 55: b'7',
 56: b'8',
 57: b'9',
 58: b':',
 59: b';',
 60: b'<',
 61: b'=',
 62: b'>',
 63: b'?',
 64: b'@',
 65: b'A',
 66: b'B',
 67: b'C',
 68: b'D',
 69: b'E',
 70: b'F',
 71: b'G',
 72: b'H',
 73: b'I',
 74: b'J',
 75: b'K',
 76: b'L',
 77: b'M',
 78: b'N',
 79: b'O',
 80: b'P',
 81: b'Q',
 82: b'R',
 83: b'
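
Ids 0 to 255 are the single byte values, as the listing shows. For ids from 256 upward, a natural way to obtain the bytes is to concatenate the bytes of the two merged tokens. The sketch below assumes that construction; build_vocab is a hypothetical helper, not a method of BPETokenizer, and the merges are a handful copied from the training log.

```python
def build_vocab(merges):
    """Map each token id to its bytes, given merges as {(a, b): new_id}."""
    vocab = {i: bytes([i]) for i in range(256)}
    # Process merges in the order they were learned, so that both
    # parts of a pair are already in the vocabulary.
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[a] + vocab[b]
    return vocab

# A few merges copied from the training log.
merges = {
    (32, 100): 256,
    (32, 101): 257,
    (256, 101): 260,
    (195, 163): 274,
    (274, 111): 276,
}
vocab = build_vocab(merges)
print(vocab[260])                  # b' de'
print(vocab[276].decode("utf-8"))  # ão
```

Note that pairs such as (195, 163) join the two bytes of a single UTF-8 character, which is why accented letters need a merge before they become one token.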

### Encoding and decoding

Here we test the tokenizer's encode and decode methods.

We can see how the tokenizer splits the text when it is encoded.

In [9]:
encoded_text = tokenizer.encode("As raízes etimológicas do termo Brasil são de difícil reconstrução.")
# Decode each token id on its own to see where the text was split.
for token_id in encoded_text:
    display(tokenizer.decode([token_id]))

'A'

's'

' ra'

'í'

'z'

'es'

' e'

'ti'

'm'

'ol'

'ó'

'g'

'icas'

' do'

' ter'

'mo'

' Brasil'

' são'

' de'

' d'

'if'

'íc'

'il'

' re'

'con'

'st'

'ru'

'ção'

'.'
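
To give an intuition for how such a split can arise, the sketch below encodes a short text by repeatedly applying the earliest-learned merge that is still applicable. This is one common way to implement BPE encoding and is only an assumption about how BPETokenizer works internally. The functions encode and decode are stand-alone illustrations, and the merge table is a small subset of the training log, so the split differs from the one the trained tokenizer produces.

```python
def encode(text, merges):
    """Greedy BPE encoding: apply the earliest-learned merge first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # Pick the pair with the lowest merge id (learned first).
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break  # no remaining pair can be merged
        new_id, out, i = merges[best], [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

def decode(ids, vocab):
    """Join the bytes of each token and decode them as UTF-8."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

# Subset of the merges from the training log.
merges = {(32, 100): 256, (256, 101): 260, (32, 115): 269,
          (195, 163): 274, (274, 111): 276}
vocab = {i: bytes([i]) for i in range(256)}
for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
    vocab[new_id] = vocab[a] + vocab[b]

ids = encode(" de são", merges)
pieces = [decode([i], vocab) for i in ids]
print(ids)     # [260, 269, 276]
print(pieces)  # [' de', ' s', 'ão']
```

The trained tokenizer has many more merges, which is why it produces ' são' as a single token while this reduced table stops at ' s' and 'ão'.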

We can also use encode to generate a list of token ids.

In [11]:
text = "gostaria que meu tokenizer conhecesse mais palavras"
display(tokenizer.encode(text))

[896,
 281,
 419,
 324,
 456,
 117,
 519,
 107,
 325,
 416,
 267,
 881,
 265,
 389,
 472,
 262,
 293,
 97,
 118,
 392]

In a similar way, we can decode a list of token ids.

In [16]:
display(tokenizer.decode([896, 281, 419, 324, 456, 117]))

'gostaria que meu'

Finally, we can check that encoding a text and then decoding it returns the same text.

In [17]:
display(tokenizer.decode(tokenizer.encode(text)))

'gostaria que meu tokenizer conhecesse mais palavras'
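
The same check can be automated over several texts. The sketch below is a small helper for that purpose. The name roundtrip_ok is hypothetical, and a plain byte-level codec stands in for the tokenizer so that the example runs on its own.

```python
def roundtrip_ok(encode, decode, texts):
    """Return True if decode(encode(t)) == t for every text in `texts`."""
    return all(decode(encode(t)) == t for t in texts)

# Stand-in codec: one token per UTF-8 byte, no merges.
def byte_encode(text):
    return list(text.encode("utf-8"))

def byte_decode(ids):
    return bytes(ids).decode("utf-8")

samples = [
    "gostaria que meu tokenizer conhecesse mais palavras",
    "As raízes etimológicas do termo Brasil são de difícil reconstrução.",
    "",
]
result = roundtrip_ok(byte_encode, byte_decode, samples)
print(result)  # True
```

With the trained tokenizer, the call would presumably be roundtrip_ok(tokenizer.encode, tokenizer.decode, samples), assuming encode and decode keep the signatures used in this notebook.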