# 🤖 **Understanding Tokenizers with BERT**

This notebook shows how to use a BERT tokenizer to turn text into tokens and token IDs, the numeric input that the model consumes.

## 🛠️ Setup and Installation

First, we need to install the libraries we will use.

In [1]:
!pip install pandas==2.0.1
!pip install transformers==4.29.2

Collecting pandas==2.0.1
  Downloading pandas-2.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.0.3
    Uninstalling pandas-2.0.3:
      Successfully uninstalled pandas-2.0.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==2.0.3, but you have pandas 2.0.1 which is incompatible.
Successfully installed pandas-2.0.1
Collecting transformers==4.29.2
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting tokenizers!=0.11.3,<0.14,


## 📚 Importing Libraries

We import the libraries necessary for our tasks.

In [2]:
# Import required libraries
from transformers import BertModel, AutoTokenizer
import pandas as pd


## 🤖 Model Setup

We load a pre-trained BERT model and its tokenizer.

In [3]:
# Specify the pre-trained model to use: BERT-base-cased
model_name = "bert-base-cased"

In [4]:
# Instantiate the model and tokenizer for the specified pre-trained model
model = BertModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)


## 📝 Tokenizing Text

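We use the tokenizer to split a sentence into tokens. BERT uses a WordPiece vocabulary, so a word that is not in the vocabulary as a whole is split into sub-word pieces; a `##` prefix marks a piece that continues the previous token.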

In [5]:
# Set a sentence for analysis
sentence = "When life gives you lemons, don't make lemonade."


In [6]:
# Tokenize the sentence
tokens = tokenizer.tokenize(sentence)
tokens

['When',
 'life',
 'gives',
 'you',
 'lemon',
 '##s',
 ',',
 'don',
 "'",
 't',
 'make',
 'lemon',
 '##ade',
 '.']
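
To make the `lemon` / `##s` split above more concrete, here is a minimal sketch of greedy longest-match-first splitting, the idea behind WordPiece. It is a toy illustration and not the library's implementation: the function name `wordpiece_split` and the miniature vocabulary `toy_vocab` are hypothetical, and the real tokenizer also handles normalization, punctuation splitting, and a vocabulary of roughly 29,000 entries.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the ## prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical miniature vocabulary, for illustration only.
toy_vocab = {"lemon", "##s", "##ade", "make", "life"}

print(wordpiece_split("lemons", toy_vocab))
print(wordpiece_split("lemonade", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```

Because `lemon` is in the toy vocabulary but `lemons` and `lemonade` are not, both words start with the same `lemon` piece, which mirrors the tokens shown above.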


## 📘 Vocabulary and Token IDs

We create a DataFrame to see the tokenizer's vocabulary and sort it by token IDs.

In [7]:
# Create a DataFrame with the tokenizer's vocabulary
vocab = tokenizer.vocab
vocab_df = pd.DataFrame({"token": vocab.keys(), "token_id": vocab.values()})
vocab_df = vocab_df.sort_values(by="token_id").set_index("token_id")

vocab_df

| token_id | token |
|---|---|
| 0 | [PAD] |
| 1 | [unused1] |
| 2 | [unused2] |
| 3 | [unused3] |
| 4 | [unused4] |
| ... | ... |
| 28991 | ##） |
| 28992 | ##， |
| 28993 | ##－ |
| 28994 | ##／ |


## 🔍 Encoding and Decoding

Encode the sentence into token IDs; decoding them back to text follows later in the notebook.

In [8]:
# Encode the sentence into token_ids using the tokenizer
token_ids = tokenizer.encode(sentence)
token_ids

[101,
 1332,
 1297,
 3114,
 1128,
 22782,
 1116,
 117,
 1274,
 112,
 189,
 1294,
 22782,
 6397,
 119,
 102]
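
Conceptually, `encode` maps each token to its vocabulary ID and wraps the result in the special tokens `[CLS]` (ID 101) and `[SEP]` (ID 102). The sketch below reproduces that for the example sentence in plain Python. The helper `toy_encode` and the small `token_to_id` dictionary are hypothetical; the IDs are copied from the outputs in this notebook, not from the full vocabulary.

```python
# Token -> ID lookup restricted to the example sentence
# (IDs copied from this notebook's outputs).
token_to_id = {
    "When": 1332, "life": 1297, "gives": 3114, "you": 1128,
    "lemon": 22782, "##s": 1116, ",": 117, "don": 1274,
    "'": 112, "t": 189, "make": 1294, "##ade": 6397, ".": 119,
}
CLS_ID, SEP_ID = 101, 102  # IDs of [CLS] and [SEP] in bert-base-cased


def toy_encode(tokens):
    """Map tokens to IDs and add [CLS] at the start and [SEP] at the end."""
    ids = [token_to_id[token] for token in tokens]
    return [CLS_ID] + ids + [SEP_ID]


tokens = ["When", "life", "gives", "you", "lemon", "##s", ",",
          "don", "'", "t", "make", "lemon", "##ade", "."]
token_ids = toy_encode(tokens)

print(token_ids)
print(len(tokens), len(token_ids))  # the ID list is two entries longer
```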


## 🔎 Compare Token Lengths

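Compare the number of tokens with the number of token IDs. `encode` adds the special tokens `[CLS]` and `[SEP]`, so the list of IDs is two entries longer than the list of tokens.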


In [9]:

# Print the length of tokens and token_ids
print("Number of tokens:", len(tokens))
print("Number of token IDs:", len(token_ids))


Number of tokens: 14
Number of token IDs: 16



## 🔄 Explore Token Data

Look at specific tokens by their IDs.

In [10]:
# Look up tokens in the vocabulary DataFrame by token ID (the index label)
print("Token with ID 101:", vocab_df.loc[101, "token"])
print("Token with ID 102:", vocab_df.loc[102, "token"])

Token with ID 101: [CLS]
Token with ID 102: [SEP]
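
Looking up a token by its ID does not require a DataFrame: inverting the token-to-ID mapping gives a direct ID-to-token lookup. The sketch below assumes a small, hypothetical excerpt of the vocabulary (`vocab_excerpt`), with entries taken from this notebook's outputs. With the real tokenizer the same inversion could be applied to the full `tokenizer.vocab` dictionary.

```python
# Small excerpt of a token -> ID mapping; entries taken from this notebook's outputs.
vocab_excerpt = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "When": 1332, "lemon": 22782}

# Invert the mapping so that an ID can be looked up directly.
id_to_token = {token_id: token for token, token_id in vocab_excerpt.items()}

for token_id in (101, 102):
    print(f"Token with ID {token_id}: {id_to_token[token_id]}")
```

In the DataFrame above, positional and label-based lookup return the same row only because the index holds the token IDs, which run contiguously from 0 after sorting.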

## 📃 Token and ID Pairing

Show pairs of tokens and their IDs.

In [11]:
# Zip tokens and token_ids (excluding the first and last token_ids for [CLS] and [SEP])
list(zip(tokens, token_ids[1:-1]))

[('When', 1332),
 ('life', 1297),
 ('gives', 3114),
 ('you', 1128),
 ('lemon', 22782),
 ('##s', 1116),
 (',', 117),
 ('don', 1274),
 ("'", 112),
 ('t', 189),
 ('make', 1294),
 ('lemon', 22782),
 ('##ade', 6397),
 ('.', 119)]

In [12]:
# Decode the token_ids (excluding the first and last token_ids for [CLS] and [SEP]) back into the original sentence
tokenizer.decode(token_ids[1:-1])

"When life gives you lemons, don't make lemonade."

In [13]:
# Tokenize the sentence using the tokenizer's `__call__` method
tokenizer_out = tokenizer(sentence)
tokenizer_out

{'input_ids': [101, 1332, 1297, 3114, 1128, 22782, 1116, 117, 1274, 112, 189, 1294, 22782, 6397, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


## 🧩 Handling Multiple Sentences

Tokenize two sentences of different lengths as one batch with padding, then decode each of them.

In [14]:
# Create a new sentence by removing "don't " from the original sentence
sentence2 = sentence.replace("don't ", "")
sentence2

'When life gives you lemons, make lemonade.'

In [15]:
# Tokenize both sentences with padding
tokenizer_out2 = tokenizer([sentence, sentence2], padding=True)
tokenizer_out2

{'input_ids': [[101, 1332, 1297, 3114, 1128, 22782, 1116, 117, 1274, 112, 189, 1294, 22782, 6397, 119, 102], [101, 1332, 1297, 3114, 1128, 22782, 1116, 117, 1294, 22782, 6397, 119, 102, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]}
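
The padded output above can be reproduced by hand: the shorter sequence is filled up with the `[PAD]` ID (0) until it matches the longest one, and the attention mask marks real tokens with 1 and padding with 0. The following is a minimal sketch under those assumptions. The helper `pad_batch` is hypothetical and covers only right-side padding to the longest sequence, not the other padding strategies the library offers.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs of the two example sentences, copied from the outputs above.
ids_1 = [101, 1332, 1297, 3114, 1128, 22782, 1116, 117,
         1274, 112, 189, 1294, 22782, 6397, 119, 102]
ids_2 = [101, 1332, 1297, 3114, 1128, 22782, 1116, 117,
         1294, 22782, 6397, 119, 102]

batch = pad_batch([ids_1, ids_2])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```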

In [16]:
# Decode the input_ids of the first sentence
tokenizer.decode(tokenizer_out2["input_ids"][0])

"[CLS] When life gives you lemons, don't make lemonade. [SEP]"

In [17]:
# Decode the input_ids of the second sentence (the padding shows up as [PAD] tokens)
tokenizer.decode(tokenizer_out2["input_ids"][1])

'[CLS] When life gives you lemons, make lemonade. [SEP] [PAD] [PAD] [PAD]'
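
Decoding roughly reverses tokenization: the tokens are joined with spaces, `##` pieces are glued back onto the preceding token, and spaces before punctuation are removed. The sketch below is a simplified approximation and not the library's decoder. The function `toy_decode`, its `skip_special_tokens` flag (named after the library parameter of the same name), and the short clean-up list are assumptions that cover only the punctuation in the example sentences.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def toy_decode(tokens, skip_special_tokens=False):
    """Join tokens back into text, merging ## pieces and tidying punctuation."""
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    # Glue continuation pieces onto the previous token.
    text = " ".join(tokens).replace(" ##", "")
    # Simplified clean-up: only the punctuation used in the example sentences.
    for spaced, tight in ((" .", "."), (" ,", ","), (" ' ", "'")):
        text = text.replace(spaced, tight)
    return text


padded_tokens = ["[CLS]", "When", "life", "gives", "you", "lemon", "##s", ",",
                 "make", "lemon", "##ade", ".", "[SEP]", "[PAD]", "[PAD]", "[PAD]"]

print(toy_decode(padded_tokens))
print(toy_decode(padded_tokens, skip_special_tokens=True))
```

Dropping the special tokens before joining is what removes `[CLS]`, `[SEP]`, and the trailing `[PAD]` entries from the decoded text.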


## 🌟 Conclusion

This notebook walked you through how to use a BERT tokenizer to process text, turning it into tokens and IDs, and how to handle multiple sentences. Feel free to change the sentences or explore more functions of the tokenizer.