# Tokenization

Tokenization is handled by the tokenizer's `tokenize()` method, which splits raw text into tokens:


<a href="https://colab.research.google.com/github/miztiik/llm-bootcamp/blob/main/chapters/tokenization_vs_embeddings/tokenization_vs_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="./images/tokenize.png" width=50%>

In [None]:
%%capture
# Comment the above line to see the installation logs

# Install the latest version of the transformers library
!pip install -qU transformers
!pip install -qU torch

In [23]:
from transformers import BertTokenizer

# The uncased checkpoint lowercases text, which matches the tokens and IDs printed below
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [75]:
tokens = tokenizer.tokenize("Using a Transformer Model is simple")
print(tokens)

['using', 'a', 'transform', '##er', 'model', 'is', 'simple']


This tokenizer is a subword tokenizer: it splits the words until it obtains tokens that can be represented by its vocabulary. That’s the case here with `transformer`, which is split into two tokens: `transform` and `##er`.
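The splitting behavior described above can be illustrated with a minimal sketch of greedy longest-match-first matching. This is not the library's implementation: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the toy vocabulary holds only the pieces needed for this sentence.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real BERT vocabulary
toy_vocab = {"using", "a", "transform", "##er", "model", "is", "simple"}

sentence = "Using a Transformer Model is simple"
tokens = []
for word in sentence.lower().split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
print(tokens)
```

Because `transformer` is not in the toy vocabulary, the longest matching prefix `transform` is taken first and the remainder becomes the continuation piece `##er`.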

## From tokens to input IDs

In [76]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[2478, 1037, 10938, 2121, 2944, 2003, 3722]
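Conceptually, this conversion is a dictionary lookup from token string to integer. The sketch below assumes a tiny hypothetical vocabulary excerpt (`toy_vocab`) built from the IDs printed above; the `[UNK]` entry and its ID of 100 are assumptions added for illustration, and `tokens_to_ids` is not a library function.

```python
# Hypothetical excerpt of a vocabulary, using the IDs printed above
toy_vocab = {
    "[UNK]": 100,  # assumed ID for the unknown token
    "using": 2478,
    "a": 1037,
    "transform": 10938,
    "##er": 2121,
    "model": 2944,
    "is": 2003,
    "simple": 3722,
}


def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its ID, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


def ids_to_tokens(ids, vocab):
    """Invert the vocabulary to map IDs back to token strings."""
    inverse = {token_id: token for token, token_id in vocab.items()}
    return [inverse[token_id] for token_id in ids]


tokens = ["using", "a", "transform", "##er", "model", "is", "simple"]
ids = tokens_to_ids(tokens, toy_vocab)
print(ids)
print(ids_to_tokens(ids, toy_vocab))
```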


## Decoding the input IDs

<img src="./images/tokenize_encode_decode.png" width=50%>



In [78]:
decoded_string = tokenizer.decode(ids)
# IDs from the bert-base-cased vocabulary decode correctly only with that tokenizer:
# decoded_string = tokenizer.decode([7993, 170, 13809, 23763, 6747, 1110, 3014])

print(decoded_string)

using a transformer model is simple


In [28]:
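# bert-base-cased keeps capitalization, so its IDs differ from the uncased ones above
cased_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
cased_tokenizer("Using a Transformer Model is simple")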

{'input_ids': [101, 7993, 170, 13809, 23763, 6747, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
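Compared with `convert_tokens_to_ids`, calling the tokenizer directly wraps the sequence in two extra IDs and returns two more lists. A rough sketch of that assembly for a single, unpadded sentence follows; `build_inputs` is a hypothetical helper, and reading `101` and `102` as the `[CLS]` and `[SEP]` IDs is an assumption based on the output above.

```python
CLS_ID, SEP_ID = 101, 102  # assumed [CLS] and [SEP] IDs (first and last IDs in the output above)


def build_inputs(ids):
    """Wrap token IDs in special tokens and build the companion lists."""
    input_ids = [CLS_ID] + ids + [SEP_ID]
    return {
        "input_ids": input_ids,
        # A single sentence belongs entirely to segment 0
        "token_type_ids": [0] * len(input_ids),
        # Every position is a real token (no padding), so the mask is all ones
        "attention_mask": [1] * len(input_ids),
    }


encoded = build_inputs([7993, 170, 13809, 23763, 6747, 1110, 3014])
print(encoded)
```

Seven token IDs become nine entries once the two special tokens are added.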

## Embeddings

<img src="./images/text_embeddings.png" width=50%>

In [32]:
from transformers import DistilBertTokenizerFast, DistilBertModel

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
tokenizer.tokenize("Using a Transformer Model is simple")

['using', 'a', 'transform', '##er', 'model', 'is', 'simple']

In [33]:
tokens = tokenizer.encode(
    'Using a Transformer Model is simple', return_tensors='pt')
print("Tokens:", tokens)
print("Decoded tokens:", tokenizer.decode(tokens[0]))
# for token in tokens[0]:
#     print("These are decoded tokens!", tokenizer.decode([token]))

Tokens: tensor([[  101,  2478,  1037, 10938,  2121,  2944,  2003,  3722,   102]])
Decoded tokens: [CLS] using a transformer model is simple [SEP]


In [52]:
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

txt_embeddings = model.embeddings.word_embeddings(tokens)


# for e in model.embeddings.word_embeddings(tokens)[0]:
#     print("This is an embedding!", e)

In [53]:
txt_embeddings.shape

torch.Size([1, 9, 768])

The output `torch.Size([1, 9, 768])` is the shape of `txt_embeddings`. Each dimension has a meaning:

- `1`: the batch size. We passed a single sentence, so the first dimension is `1`.
- `9`: the sequence length, i.e. the number of tokens in the input: the seven word pieces plus the special tokens `[CLS]` and `[SEP]` added by the tokenizer. No padding is added for a single sentence.
- `768`: the size of each embedding vector. Every token in the sequence gets its own vector, and `DistilBERT` uses vectors of size `768`.
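The three dimensions can be reproduced with a plain lookup table. The sketch below is only an illustration in pure Python: `table` and `embed` are hypothetical, the vectors are random rather than learned, and the hidden size is shrunk to a toy value of 4 instead of 768.

```python
import random

HIDDEN_SIZE = 4  # toy value for readability; the model above uses 768

random.seed(0)

# One sentence (batch size 1) of nine token IDs, taken from the encode() output above
batch = [[101, 2478, 1037, 10938, 2121, 2944, 2003, 3722, 102]]

# Hypothetical embedding table: one random vector per token ID that occurs
table = {
    token_id: [random.uniform(-0.1, 0.1) for _ in range(HIDDEN_SIZE)]
    for sequence in batch
    for token_id in sequence
}


def embed(batch, table):
    """Replace every token ID with its vector from the table."""
    return [[table[token_id] for token_id in sequence] for sequence in batch]


embeddings = embed(batch, table)
shape = (len(embeddings), len(embeddings[0]), len(embeddings[0][0]))
print(shape)  # (batch size, sequence length, hidden size)
```

The lookup adds one trailing dimension to the `(batch, sequence)` grid of IDs, which is why the real output has shape `[1, 9, 768]`.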

In [54]:
txt_embeddings

tensor([[[ 0.0390, -0.0123, -0.0208,  ...,  0.0607,  0.0230,  0.0238],
         [-0.0591,  0.0156, -0.0018,  ...,  0.0122,  0.0091, -0.0047],
         [ 0.0062,  0.0100,  0.0071,  ..., -0.0043, -0.0132,  0.0166],
         ...,
         [-0.0440, -0.0236, -0.0283,  ...,  0.0053, -0.0081,  0.0170],
         [ 0.0253, -0.0128, -0.0227,  ..., -0.0412, -0.0077, -0.0559],
         [-0.0199, -0.0095, -0.0099,  ..., -0.0235,  0.0071, -0.0071]]],
       grad_fn=<EmbeddingBackward0>)

In [57]:
len(txt_embeddings[0][0])

768

In [59]:
# decode() expects integer token IDs; embeddings are float vectors, so this call fails
tokenizer.decode(txt_embeddings[0])

TypeError: argument 'ids': 'list' object cannot be interpreted as an integer