<a href="https://colab.research.google.com/github/not-sid-29/transformers_huggingface/blob/main/4_Working_With_Tokenizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook: Working with Tokenizers:



> Tokenizers are an integral part of the transformer workflow. Because transformer models operate only on numeric values, tokenizers are responsible for converting raw text into input IDs.

Raw Text → Tokens → Input IDs<br>
ex: "Transformers use attention blocks" → ["Transformer", "s", "use", "attention", "blocks"] → [122, 19, 88, 367, 246] (the IDs here are only illustrative)
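The three stages above can be sketched in plain Python. This is only a toy illustration: `TOY_VOCAB`, `toy_tokenize` and `toy_encode` are hypothetical names, the vocabulary is made up to match the example, and the greedy longest-prefix matching is an assumption chosen for simplicity rather than the algorithm of any particular model.

```python
# Hypothetical toy vocabulary; the IDs are made up for illustration
# and do not come from any real model.
TOY_VOCAB = {"Transformer": 122, "s": 19, "use": 88, "attention": 367, "blocks": 246}

def toy_tokenize(text, vocab):
    """Split on whitespace, then greedily take the longest known prefix of each word."""
    pieces = []
    for word in text.split():
        start = 0
        while start < len(word):
            end = len(word)
            # Shrink the candidate piece until it appears in the vocabulary.
            while end > start and word[start:end] not in vocab:
                end -= 1
            if end == start:
                raise ValueError(f"cannot tokenize {word!r} with this vocabulary")
            pieces.append(word[start:end])
            start = end
    return pieces

def toy_encode(text, vocab):
    """Raw text -> tokens -> input IDs."""
    return [vocab[piece] for piece in toy_tokenize(text, vocab)]

toy_tokens = toy_tokenize("Transformers use attention blocks", TOY_VOCAB)
toy_ids = toy_encode("Transformers use attention blocks", TOY_VOCAB)
print(toy_tokens)  # ['Transformer', 's', 'use', 'attention', 'blocks']
print(toy_ids)     # [122, 19, 88, 367, 246]
```

A real tokenizer does the same two steps, but with a vocabulary learned from a large corpus instead of a hand-written dictionary.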

## 1. Setting up & Installing libraries:

In [None]:
!pip install datasets
!pip install evaluate
!pip install transformers[sentencepiece]


## 2. Word-Based Tokenization approach:
⇒ The key idea is to split raw text into individual words, using `whitespace`, `punctuation` and other `separators` as boundaries.

In [None]:
# Word-based approach: split the raw text on whitespace
input_string = "The cat is climbing the tree"
tokens = input_string.split()
print(tokens)

['The', 'cat', 'is', 'climbing', 'the', 'tree']
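Note that `str.split()` only splits on whitespace, so a punctuation mark stays attached to the neighbouring word. A minimal sketch of splitting on punctuation as well, assuming a simple regular expression is good enough for the text at hand (`word_tokenize` is a hypothetical helper, not a library function):

```python
import re

def word_tokenize(text):
    # \w+ matches a run of letters, digits or underscores (a "word");
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "The cat is climbing the tree."
print(sentence.split())         # ['The', 'cat', 'is', 'climbing', 'the', 'tree.']
print(word_tokenize(sentence))  # ['The', 'cat', 'is', 'climbing', 'the', 'tree', '.']
```

This pattern would also break contractions such as "don't" into three pieces, which is one reason practical tokenizers use more careful rules.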


In [None]:
# Load the pretrained tokenizer that belongs to a BERT checkpoint:
from transformers import AutoTokenizer
model = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model)



In [None]:
tokenizer(input_string)

{'input_ids': [101, 1996, 4937, 2003, 8218, 1996, 3392, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
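The output contains two more IDs than the sentence has words: in `bert-base-uncased`, `101` is the `[CLS]` token and `102` is the `[SEP]` token, which the tokenizer adds around the sequence. The following hand-written sketch rebuilds the same dictionary from the six word IDs. It is only meant to show what each field holds for a single, unpadded sentence; `wrap_single_sentence` is a hypothetical helper and not how the library is implemented.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the bert-base-uncased vocabulary

def wrap_single_sentence(word_ids):
    """Mimic the tokenizer output for one unpadded sentence."""
    input_ids = [CLS_ID] + list(word_ids) + [SEP_ID]
    return {
        "input_ids": input_ids,
        # All zeros: every token belongs to the first (and only) segment.
        "token_type_ids": [0] * len(input_ids),
        # All ones: there is no padding, so every position is attended to.
        "attention_mask": [1] * len(input_ids),
    }

# IDs for "the cat is climbing the tree", copied from the output above.
encoded = wrap_single_sentence([1996, 4937, 2003, 8218, 1996, 3392])
print(encoded)
```

To see the special tokens directly in the notebook, `tokenizer.convert_ids_to_tokens(...)` can be applied to the `input_ids` list.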

In [None]:
# Save the tokenizer files to a local directory:
tokenizer.save_pretrained("bert-base-tknzr")

('bert-base-tknzr/tokenizer_config.json',
 'bert-base-tknzr/special_tokens_map.json',
 'bert-base-tknzr/vocab.txt',
 'bert-base-tknzr/added_tokens.json',
 'bert-base-tknzr/tokenizer.json')

In [None]:
# Check how the tokenizer splits a raw text sequence into tokens:
sequence = "Transformers main function is to generate the next word."
token_seq = tokenizer.tokenize(sequence)
print(token_seq)

['transformers', 'main', 'function', 'is', 'to', 'generate', 'the', 'next', 'word', '.']


In [None]:
#converting tokens into input_ids:
ids = tokenizer.convert_tokens_to_ids(token_seq)
print(ids)

[19081, 2364, 3853, 2003, 2000, 9699, 1996, 2279, 2773, 1012]


→ Each number is the index of the corresponding token in the model's vocabulary; it does not describe the token's position in the input sequence.
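To make the token-to-ID relationship concrete: a BERT-style `vocab.txt` stores one token per line, and a token's ID is its zero-based line number. The sketch below rebuilds that lookup for a tiny in-memory vocabulary. `load_vocab`, `lookup_ids` and the four-line toy vocabulary are hypothetical and only for illustration.

```python
def load_vocab(lines):
    """Map each token to its zero-based line index."""
    return {line.rstrip("\n"): index for index, line in enumerate(lines)}

def lookup_ids(tokens, vocab, unk_token="[UNK]"):
    """Replace each token by its vocabulary index, falling back to the unknown token."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]

# Toy stand-in for the contents of a vocab.txt file.
toy_lines = ["[PAD]\n", "[UNK]\n", "the\n", "cat\n"]
toy_vocab = load_vocab(toy_lines)

print(toy_vocab["cat"])                      # 3
print(lookup_ids(["the", "dog"], toy_vocab))  # [2, 1]
```

Passing an open file handle for the saved `bert-base-tknzr/vocab.txt` to `load_vocab` should give a dictionary comparable to the one returned by `tokenizer.get_vocab()`.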

In [None]:
# Decoding input_ids back into a text string:
decoded_text = tokenizer.decode(ids)
print(decoded_text)

transformers main function is to generate the next word.
