# Tokenizers (PyTorch)

This notebook accompanies chapter 2, section 4 of the Hugging Face course: [Tokenizers](https://huggingface.co/course/chapter2/4?fw=pt).

The original code for this notebook is in Hugging Face's notebooks repository; the following link imports it into SageMaker Studio Lab: [section4_pt.ipynb](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter2/section4_pt.ipynb)

## Run conditions

This notebook has been tested in the following environment:
- Environment: Project created in [Paperspace Gradient](https://gradient.paperspace.com) with Python 3.9.13.
- Machine: P5000 (30 GiB RAM, 8 CPUs, 16 GiB GPU) (more details on [Paperspace Machines](https://docs.paperspace.com/gradient/machines/)).
- IDE: Visual Studio Code using remote Jupyter server.

## Install dependencies

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
# Install the libraries datasets v2.7.1, evaluate v0.3.0, and transformers v4.25.1 with quiet and upgrade flags.
%pip install -q datasets==2.7.1 evaluate==0.3.0 transformers==4.25.1 --upgrade

Note: you may need to restart the kernel to use updated packages.


## Loading and saving

In [2]:
# Import BertTokenizerFast from Transformers.
from transformers import BertTokenizerFast

# Create a tokenizer object.
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
# Tokenize a text.
tokenizer('Using a Transformer network is simple')

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [3]:
# Save the tokenizer to disk in the path "hugging_face_course/2_using_transformers/tokenizer/section_4".
tokenizer.save_pretrained('hugging_face_course/2_using_transformers/tokenizer/section_4')

('hugging_face_course/2_using_transformers/tokenizer/section_4/tokenizer_config.json',
 'hugging_face_course/2_using_transformers/tokenizer/section_4/special_tokens_map.json',
 'hugging_face_course/2_using_transformers/tokenizer/section_4/vocab.txt',
 'hugging_face_course/2_using_transformers/tokenizer/section_4/added_tokens.json',
 'hugging_face_course/2_using_transformers/tokenizer/section_4/tokenizer.json')

## Tokenization

In [4]:
# Import the AutoTokenizer from Transformers.
from transformers import AutoTokenizer

# Create a tokenizer object.
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
# Tokenize a text.
tokenizer('Using a Transformer network is simple')
# Get tokens to print them.
tokens = tokenizer.tokenize('Using a Transformer network is simple')
# Print the tokens.
print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
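The split of `Transformer` into `Trans` and `##former` comes from WordPiece, the subword algorithm BERT tokenizers use: each word is matched greedily against the vocabulary, longest piece first, and every piece after the first gets a `##` prefix. The following is a minimal sketch of that idea, not the library's implementation. The function `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and exist only for illustration; the real `bert-base-cased` vocabulary has tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word (illustrative sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matched: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for the example sentence.
toy_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple"}

sentence = "Using a Transformer network is simple"
tokens = [piece for word in sentence.split() for piece in wordpiece_tokenize(word, toy_vocab)]
print(tokens)
```

With this toy vocabulary the sketch reproduces the token list printed above. The real tokenizer also normalizes the text and splits on punctuation before applying WordPiece, which this sketch skips.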


## From tokens to input IDs

In [5]:
# Convert the tokens to IDs.
ids = tokenizer.convert_tokens_to_ids(tokens)
# Print the IDs.
print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]
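These IDs differ from the `input_ids` returned by calling the tokenizer directly in the "Loading and saving" section: that list starts with 101 and ends with 102. In `bert-base-cased` these are the IDs of the special tokens `[CLS]` and `[SEP]`, which the tokenizer call adds and `convert_tokens_to_ids` does not. The sketch below mimics both steps with a plain dictionary. The helper names `lookup_ids` and `wrap_with_special_tokens` are hypothetical, and the token-to-ID pairs are copied from the outputs in this notebook rather than read from the real vocabulary file.

```python
# Token-to-ID pairs taken from the outputs shown in this notebook.
vocab = {
    "Using": 7993,
    "a": 170,
    "Trans": 13809,
    "##former": 23763,
    "network": 2443,
    "is": 1110,
    "simple": 3014,
}

CLS_ID = 101  # ID of [CLS] in bert-base-cased
SEP_ID = 102  # ID of [SEP] in bert-base-cased


def lookup_ids(tokens, vocab):
    """Map each token to its ID with a dictionary lookup (hypothetical helper)."""
    return [vocab[token] for token in tokens]


def wrap_with_special_tokens(ids):
    """Add [CLS] at the start and [SEP] at the end of a single sequence."""
    return [CLS_ID] + ids + [SEP_ID]


tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
plain_ids = lookup_ids(tokens, vocab)
model_ids = wrap_with_special_tokens(plain_ids)
print(plain_ids)
print(model_ids)
```

The second list matches the `input_ids` produced by `tokenizer('Using a Transformer network is simple')`, which is the form a model expects as input.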


## Decoding

In [6]:
# Decode the IDs to a string of text.
text = tokenizer.decode(ids)
# Print the text.
print(text)

Using a Transformer network is simple
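Decoding does more than map IDs back to tokens: pieces that start with `##` are glued onto the previous piece, which is why the output reads `Transformer` and not `Trans ##former`. A minimal sketch of that merging step follows, assuming WordPiece-style tokens. The function `decode_sketch` and the `id_to_token` table are hypothetical, with the pairs copied from the outputs in this notebook.

```python
# ID-to-token pairs taken from the outputs shown in this notebook.
id_to_token = {
    7993: "Using",
    170: "a",
    13809: "Trans",
    23763: "##former",
    2443: "network",
    1110: "is",
    3014: "simple",
}


def decode_sketch(ids):
    """Turn IDs back into text, merging '##' continuation pieces (illustrative)."""
    words = []
    for token_id in ids:
        token = id_to_token[token_id]
        if token.startswith("##") and words:
            # A continuation piece is appended to the previous word without a space.
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


text = decode_sketch([7993, 170, 13809, 23763, 2443, 1110, 3014])
print(text)
```

The real `decode` method also handles special tokens and cleans up spacing around punctuation, which this sketch leaves out.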
