# Representing Text

We can represent text in many ways: character strings are a standard representation, but we can also create numerical representations of text. In this notebook we will discuss embeddings.

## Features

The features of a machine learning model can be continuous or categorical.

- We use continuous features to represent numerical values: income, number of times the user clicked on a link, prices, etc.
- Categorical features represent an instance of a class or category. They have a finite number of possible values: job title, genre of a movie, breed of a dog, etc.

## Embeddings

An **embedding** is a trained numerical representation of a categorical feature:

- We use the word *trained* to highlight that embeddings are learned during model training.
- Different models and training procedures can be used to obtain embeddings. Word2Vec and BERT embeddings, for example, are different and capture different characteristics of the features.
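
As a rough illustration of the idea, an embedding can be thought of as a lookup table that maps each category to a vector of numbers. The sketch below uses a made-up `genres` feature, an arbitrary `embedding_dim`, and randomly initialized vectors. In a real model these values would be adjusted during training rather than left random.

```python
import random

random.seed(0)  # reproducible toy values

# Hypothetical categorical feature: the genre of a movie.
genres = ["comedy", "drama", "horror"]
embedding_dim = 4  # arbitrary size chosen for illustration

# Map each category to an integer index, as a vocabulary would.
genre_to_index = {genre: i for i, genre in enumerate(genres)}

# One row per category. Training would update these numbers;
# here they are random placeholders.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in genres
]


def embed(genre):
    """Return the embedding vector (a table row) for a category."""
    return embedding_table[genre_to_index[genre]]


print(embed("drama"))
```

The BERT embeddings discussed later in this notebook follow the same pattern, with tokens as the categories.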

[OpenAI's documentation](https://platform.openai.com/docs/guides/embeddings) lists a few uses of embeddings:

- Search: results are ranked by relevance to a query string.
- Clustering: text strings are grouped by similarity.
- Recommendations: items with related text strings are recommended.
- Anomaly detection: outliers with little relatedness are identified.
- Diversity measurement: similarity distributions are analyzed.
- Classification: text strings are classified by their most similar label.
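
Most of these uses rely on measuring how similar two embedding vectors are. The sketch below ranks documents against a query using cosine similarity, one common choice of similarity measure. The three-dimensional vectors are invented for illustration and do not come from any model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings; a real system would obtain them from a model.
document_embeddings = {
    "cats are fun": [0.9, 0.1, 0.0],
    "dogs are loyal": [0.7, 0.3, 0.1],
    "stock prices fell": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.0]

# Search: sort documents from most to least similar to the query.
ranked = sorted(
    document_embeddings,
    key=lambda doc: cosine_similarity(query_embedding, document_embeddings[doc]),
    reverse=True,
)
print(ranked)  # ['cats are fun', 'dogs are loyal', 'stock prices fell']
```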

## BERT Embeddings

BERT embeddings are computed from the three ingredients discussed below:

+ Token embeddings (also called word embeddings)
+ Positional embeddings
+ Token type embeddings (also called sentence embeddings)

![](img/02_bert_architecture.png)

### Tokenization

Embedding computations start with tokenization: representing the original text as tokens in a vocabulary. 

To illustrate the process, we can use the [`transformers`](https://huggingface.co/docs/transformers/en/index) library from [HuggingFace](https://huggingface.co/).

In [1]:
import transformers

documents = ["cats are fun"]

tokenizer = transformers.BertTokenizer.from_pretrained(
    'bert-base-uncased')
tokens = tokenizer(documents)
print(tokens)


{'input_ids': [[101, 8870, 2024, 4569, 102]], 'token_type_ids': [[0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1]]}


In the code snippet above, we used the `transformers` library to obtain the tokens that represent the phrase 'cats are fun'. The tokenizer returns a dictionary-like object with an entry called `'input_ids'`, which contains a list of five integers: one for each of the three words in the phrase, plus two special tokens added by the tokenizer. Each integer is the index of a token in the model's vocabulary. The [vocabulary](https://huggingface.co/bert-base-uncased/blob/main/vocab.txt) belongs to the `'bert-base-uncased'` checkpoint and is loaded with the method `.from_pretrained()`.

We can show the vocabulary entries with:

In [None]:
print(f"Index of 'cats': {tokenizer.vocab['cats']}")
print(f"Index of 'are': {tokenizer.vocab['are']}")
print(f"Index of 'fun': {tokenizer.vocab['fun']}")

IDs 101 and 102 are special tokens:

- ID 101 is the `[CLS]` token, indicating the beginning of a sequence.
- ID 102 is the `[SEP]` token, indicating the end of a sequence.

They are inserted automatically into the output of the BERT tokenizer. The BERT vocabulary includes 30,522 unique tokens. Words that are not in the vocabulary are split into smaller subword tokens using the WordPiece algorithm; the unknown token, `[UNK]`, is used only when a word cannot be represented this way. You can read more about this tokenizer in [BERT Tokenization (Nowak, 2023)](https://tinkerd.net/blog/machine-learning/bert-tokenization/) and Mastering Text Similarity ([Guadagnolo, 2024](https://medium.com/eni-digitalks/mastering-text-similarity-combining-embedding-techniques-and-distance-metrics-98d3bb80b1b6)).
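
To give a feel for how WordPiece handles words outside the vocabulary, here is a simplified sketch of its greedy longest-match-first splitting step. The function `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and written only for this illustration. The real tokenizer also normalizes text and splits punctuation, and it uses the full 30,522-entry vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes the unknown token.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, invented for this example.
toy_vocab = {"cat", "cats", "fun", "##s", "##ny", "play", "##ing"}

print(wordpiece_tokenize("cats", toy_vocab))     # ['cats']
print(wordpiece_tokenize("funny", toy_vocab))    # ['fun', '##ny']
print(wordpiece_tokenize("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```

The `##` prefix marks a piece that continues a word rather than starting one, which is how the pieces can be joined back into the original word.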



### Token Embeddings

The tokens obtained from the previous step are mapped to the model's precomputed embeddings. For each token in the model vocabulary, there is an embedding vector.

<div>
<img src="img/02_skip_gram_architecture.png" width="700"/>
</div>

Image source: Mastering Text Similarity ([Guadagnolo, 2024](https://medium.com/eni-digitalks/mastering-text-similarity-combining-embedding-techniques-and-distance-metrics-98d3bb80b1b6))

In [2]:
model = transformers.BertModel.from_pretrained('bert-base-uncased')
embedding_layer = model.embeddings
embedding_layer.word_embeddings.weight

Parameter containing:
tensor([[-0.0102, -0.0615, -0.0265,  ..., -0.0199, -0.0372, -0.0098],
        [-0.0117, -0.0600, -0.0323,  ..., -0.0168, -0.0401, -0.0107],
        [-0.0198, -0.0627, -0.0326,  ..., -0.0165, -0.0420, -0.0032],
        ...,
        [-0.0218, -0.0556, -0.0135,  ..., -0.0043, -0.0151, -0.0249],
        [-0.0462, -0.0565, -0.0019,  ...,  0.0157, -0.0139, -0.0095],
        [ 0.0015, -0.0821, -0.0160,  ..., -0.0081, -0.0475,  0.0753]],
       requires_grad=True)

In [3]:
embedding_layer.word_embeddings.weight.shape

torch.Size([30522, 768])

The attribute `.weight` of the embedding layer holds the actual embeddings. It is a matrix of 30,522 rows and 768 columns (the object is, in fact, a 2-dimensional tensor).

+ The number of rows is equal to the size of the model vocabulary.
+ The number of columns is the hidden size or the size of the model's internal representation. 

In [4]:
tokens = tokenizer(['cats are fun'])
input_ids = tokens.input_ids[0]
input_ids

[101, 8870, 2024, 4569, 102]

In [5]:
doc_embeddings = embedding_layer.word_embeddings.weight[input_ids]
doc_embeddings

tensor([[ 0.0136, -0.0265, -0.0235,  ...,  0.0087,  0.0071,  0.0151],
        [-0.0590, -0.0339,  0.0108,  ..., -0.0328, -0.0285,  0.0624],
        [-0.0134, -0.0135,  0.0250,  ...,  0.0013, -0.0183,  0.0227],
        [-0.0073, -0.0459,  0.0314,  ..., -0.0196, -0.0372, -0.0150],
        [-0.0145, -0.0100,  0.0060,  ..., -0.0250,  0.0046, -0.0015]],
       grad_fn=<IndexBackward0>)

### Position Embeddings

In addition to token embeddings, the BERT model also keeps track of positions through position embeddings. Whereas the token embedding matrix has one row per vocabulary entry, the position embedding matrix has one row per position. Its shape is (512, 768) because the BERT model can take at most 512 tokens at a time.

In [6]:
embedding_layer.position_embeddings.weight

Parameter containing:
tensor([[ 1.7505e-02, -2.5631e-02, -3.6642e-02,  ...,  3.3437e-05,
          6.8312e-04,  1.5441e-02],
        [ 7.7580e-03,  2.2613e-03, -1.9444e-02,  ...,  2.8910e-02,
          2.9753e-02, -5.3247e-03],
        [-1.1287e-02, -1.9644e-03, -1.1573e-02,  ...,  1.4908e-02,
          1.8741e-02, -7.3140e-03],
        ...,
        [ 1.7418e-02,  3.4903e-03, -9.5621e-03,  ...,  2.9599e-03,
          4.3435e-04, -2.6949e-02],
        [ 2.1687e-02, -6.0216e-03,  1.4736e-02,  ..., -5.6118e-03,
         -1.2590e-02, -2.8085e-02],
        [ 2.6413e-03, -2.3298e-02,  5.4922e-03,  ...,  1.7537e-02,
          2.7550e-02, -7.7656e-02]], requires_grad=True)

In [7]:
embedding_layer.position_embeddings.weight.shape

torch.Size([512, 768])
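
The rows of this matrix are selected by position rather than by token id: the first token uses row 0, the second uses row 1, and so on. The sketch below shows that indexing on the example phrase. The helper `position_ids` is hypothetical, and raising an error for long inputs is a choice made for this illustration; in practice, longer inputs are usually truncated before they reach the model.

```python
max_positions = 512  # number of rows in the position embedding matrix


def position_ids(input_ids):
    """Return one position index per token, starting at 0."""
    if len(input_ids) > max_positions:
        raise ValueError(
            f"{len(input_ids)} tokens exceed the {max_positions} available positions"
        )
    return list(range(len(input_ids)))


# Token ids for 'cats are fun', including [CLS] and [SEP].
print(position_ids([101, 8870, 2024, 4569, 102]))  # [0, 1, 2, 3, 4]
```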

### Token Type Embeddings

Token type embeddings, also called segment embeddings, indicate which sentence a token belongs to. BERT was trained on the Next Sentence Prediction task: given two sentences, A and B, the model determines whether B logically follows A.

In [8]:
embedding_layer.token_type_embeddings

Embedding(2, 768)
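
The layer has only two rows because a token can belong to one of two segments: sentence A (type 0) or sentence B (type 1). The sketch below assembles a sentence pair in the `[CLS] A [SEP] B [SEP]` layout and the matching `token_type_ids`. The helper `build_pair_inputs` is hypothetical, and sentence B simply reuses token ids from the example above ('fun cats').

```python
CLS_ID, SEP_ID = 101, 102  # special token ids from the tokenizer output


def build_pair_inputs(ids_a, ids_b):
    """Lay out two sentences as [CLS] A [SEP] B [SEP] with segment labels."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # [CLS], sentence A and the first [SEP] are type 0;
    # sentence B and the final [SEP] are type 1.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids


sentence_a = [8870, 2024, 4569]  # 'cats are fun'
sentence_b = [4569, 8870]        # 'fun cats' (ids reused for illustration)

input_ids, token_type_ids = build_pair_inputs(sentence_a, sentence_b)
print(input_ids)       # [101, 8870, 2024, 4569, 102, 4569, 8870, 102]
print(token_type_ids)  # [0, 0, 0, 0, 0, 1, 1, 1]
```

With a single sentence, as in the 'cats are fun' example, every token has type 0, which matches the `token_type_ids` printed by the tokenizer earlier.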

### The Embedding Layer

The embedding layer takes a list of token ids and converts them to embeddings that combine the three types discussed above: token, position, and type embeddings.

In [9]:
import torch
tokens = tokenizer(['cats are fun'])
final_embeddings = embedding_layer(input_ids=torch.tensor(tokens.input_ids))
final_embeddings

tensor([[[ 0.1686, -0.2858, -0.3261,  ..., -0.0276,  0.0383,  0.1640],
         [-0.4827,  0.0312,  0.3199,  ...,  0.1614,  0.3668,  1.2662],
         [-0.2330,  0.1164,  0.5087,  ...,  0.3019,  0.1804,  0.3744],
         [ 0.1432, -0.4452,  0.5792,  ...,  0.2212, -0.2110, -0.0521],
         [-0.3643, -0.1617,  0.0902,  ..., -0.1785,  0.1282, -0.0451]]],
       grad_fn=<NativeLayerNormBackward0>)
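
The `grad_fn` in the output above indicates that layer normalization is the last operation applied. A simplified sketch of the combination for a single token follows: the three vectors are added element-wise and the sum is normalized. The four-dimensional vectors are invented for illustration, and the learned scale and shift parameters of the normalization, as well as dropout, are left out.

```python
import math


def layer_norm(vector, eps=1e-12):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]


def combine(token_vec, position_vec, type_vec):
    """Add the three embeddings element-wise, then normalize the sum."""
    summed = [t + p + s for t, p, s in zip(token_vec, position_vec, type_vec)]
    return layer_norm(summed)


# Toy 4-dimensional vectors; the real ones have 768 dimensions.
token_vec = [0.5, -0.2, 0.1, 0.3]
position_vec = [0.0, 0.1, -0.1, 0.2]
type_vec = [0.1, 0.1, 0.0, -0.1]

combined = combine(token_vec, position_vec, type_vec)
print(combined)
```

Because the three embeddings are summed rather than concatenated, the result keeps the same size as each input, which is why all three matrices share the 768-column width.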