# Tokenizers (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

[![Video Title](https://img.youtube.com/vi/VFp38yj8h3A/0.jpg)](https://www.youtube.com/watch?v=VFp38yj8h3A)

We'll look at three tokenization approaches: word-based, character-based, and subword.

![image.png](attachment:image.png)

> ### Word-based tokenizers

[![Video Title](https://img.youtube.com/vi/nhJxYji1aho/0.jpg)](https://www.youtube.com/watch?v=nhJxYji1aho)

In [1]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


> ### Character-based tokenizers

[![Video Title](https://img.youtube.com/vi/ssLq_EK2jLE/0.jpg)](https://www.youtube.com/watch?v=ssLq_EK2jLE)

![image.png](attachment:image.png)

* Character-based vocabularies are more complete than word-based ones: a small set of characters can spell words the vocabulary has never seen.
* Individual characters carry less semantic information than words.
* Character-based tokenization produces much longer token sequences than word-based tokenization, which reduces how much text fits in the model's limited context window.
* It is not perfect, but it solves many of the issues of word-based tokenizers, so it is worth considering for new problems.
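
To make the first and third points concrete, here is a minimal sketch in plain Python that tokenizes the sentence from the word-based example both ways. The vocabularies are built from this single sentence only, and the probe word `"tape"` is a hypothetical choice for illustration.

```python
text = "Jim Henson was a puppeteer"

# Word-based: split on whitespace, as in the word-based cell above.
word_tokens = text.split()

# Character-based: every character (spaces included) becomes a token.
char_tokens = list(text)

print(len(word_tokens))  # 5
print(len(char_tokens))  # 26

# Toy vocabularies built from this one sentence (illustration only).
word_vocab = set(word_tokens)
char_vocab = set(char_tokens)

# A word that never appeared in the sentence.
new_word = "tape"
word_known = new_word in word_vocab
chars_known = all(ch in char_vocab for ch in new_word)

print(word_known)   # False: the word-based vocabulary has no entry for it
print(chars_known)  # True: every character is already in the vocabulary
```

The same sentence costs 5 tokens with the word-based split and 26 with the character-based one, while the character vocabulary can still represent a word the word vocabulary has never seen.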

In [2]:
from transformers import BertTokenizer

# load the tokenizer through its model-specific class
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")


In [None]:
from transformers import AutoTokenizer

# let AutoTokenizer pick the right tokenizer class from the checkpoint name
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [4]:
import os

# create the output directory if it does not already exist
os.makedirs("models", exist_ok=True)

In [None]:
tokenizer.save_pretrained("models")

In [2]:
from transformers import AutoModel, AutoTokenizer

# model = AutoModel.from_pretrained("bert-base-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)


['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']




In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]

In [None]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a transformer network is simple

> ### Sub-word tokenization

* Generally used by the models that achieve best-in-class performance on English!


[![Video Title](https://img.youtube.com/vi/zHvTiHr506c/0.jpg)](https://www.youtube.com/watch?v=zHvTiHr506c)

![image.png](attachment:image.png)

> ### Goal: find a middle ground between word-based and character-based tokenization

![image-2.png](attachment:image-2.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

> ### Tokenization Pipeline

[![Video Title](https://img.youtube.com/vi/Yffk5aydLzg/0.jpg)](https://www.youtube.com/watch?v=Yffk5aydLzg)

# Encoding

> #### Tokenization: `sequence` to `tokens`

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
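
The `##` prefix marks a piece that continues the previous one rather than starting a new word. As a rough illustration of how such a split can come about, the sketch below applies greedy longest-match lookup, the strategy WordPiece-style tokenizers use at tokenization time. It is a simplified toy, not the library's implementation: `toy_wordpiece` and `toy_vocab` are hypothetical names, and the vocabulary holds only the seven pieces printed above rather than the real `bert-base-cased` vocabulary.

```python
def toy_wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: treat the whole word as unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary: only the pieces shown in the output above.
toy_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple"}

toy_sequence = "Using a Transformer network is simple"
toy_tokens = [
    piece
    for word in toy_sequence.split()
    for piece in toy_wordpiece(word, toy_vocab)
]
print(toy_tokens)
# ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
```

Because `Transformer` is not in the toy vocabulary as a whole word, the lookup falls back to the longest known prefix `Trans` and then matches the remainder as `##former`.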


> #### From `tokens` to `input_ids`

In [6]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]
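
Conceptually, this step is a dictionary lookup in the tokenizer's vocabulary. The sketch below reproduces it with a toy vocabulary that contains only the seven token/ID pairs printed above. The helper name `toy_convert_tokens_to_ids` is hypothetical, and the fallback ID `100` for unknown tokens is an assumption made for illustration.

```python
# Toy vocabulary: only the token/ID pairs shown in the outputs above.
toy_token_to_id = {
    "Using": 7993,
    "a": 170,
    "Trans": 13809,
    "##former": 23763,
    "network": 2443,
    "is": 1110,
    "simple": 3014,
}

UNK_ID = 100  # assumed placeholder ID for tokens missing from the vocabulary


def toy_convert_tokens_to_ids(tokens, vocab, unk_id=UNK_ID):
    """Map each token to its vocabulary index, falling back to unk_id."""
    return [vocab.get(token, unk_id) for token in tokens]


toy_tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
toy_ids = toy_convert_tokens_to_ids(toy_tokens, toy_token_to_id)
print(toy_ids)
# [7993, 170, 13809, 23763, 2443, 1110, 3014]
```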


# Decoding

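Decoding goes the other way: from vocabulary indices back to a string, merging subword pieces into whole words along the way. Note that the IDs decoded below (`11303`, `1200`) differ from the ones produced in the encoding step (`13809`, `23763`), which is why the output reads "transformer" in lowercase.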

In [7]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a transformer network is simple
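
Decoding does more than look each ID up: pieces that start with `##` have to be glued back onto the preceding piece. Here is a minimal sketch of that idea, using the IDs printed in the encoding step and a toy reverse vocabulary built from the outputs above. `toy_decode` and `toy_id_to_token` are hypothetical names, and the real `decode` method handles further details (special tokens, punctuation spacing) that are left out here.

```python
# Toy reverse vocabulary: only the ID/token pairs from the encoding step.
toy_id_to_token = {
    7993: "Using",
    170: "a",
    13809: "Trans",
    23763: "##former",
    2443: "network",
    1110: "is",
    3014: "simple",
}


def toy_decode(ids, id_to_token):
    """Turn IDs back into text, merging '##' pieces into the previous word."""
    text = ""
    for token_id in ids:
        token = id_to_token[token_id]
        if token.startswith("##"):
            # Continuation piece: append without a space, dropping the prefix.
            text += token[2:]
        elif text:
            # New word: separate it from the previous one with a space.
            text += " " + token
        else:
            text = token
    return text


toy_decoded = toy_decode([7993, 170, 13809, 23763, 2443, 1110, 3014], toy_id_to_token)
print(toy_decoded)
# Using a Transformer network is simple
```

With these IDs the original capitalization survives the round trip, because `13809` and `23763` map back to `Trans` and `##former`.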
