# Embeddings

In a previous post about [tokens](https://maximofn.com/tokens/), we saw the minimal representation of each word, which consists of assigning a number to each of the smallest pieces into which words are split.

However, transformers, and therefore LLMs, do not represent the information in words this way; they use `embeddings` instead.

This notebook has been automatically translated to make it accessible to more people, please let me know if you see any typos.

We will first look at two simpler ways of representing words inside transformers, `ordinal encoding` and `one hot encoding`. Seeing the problems of these two representations will lead us to `embeddings`.

## Ordinal encoding

This is the most basic way to represent the words inside the transformers. It consists of giving a number to each word, or keeping the numbers already assigned to the tokens.

However, this type of representation has two problems:

 * Imagine that `table` corresponds to token 3, `cat` to token 1 and `dog` to token 2. One could then conclude that `table = cat + dog`, but no such relationship exists between these words. We might think that a careful assignment of token numbers could produce meaningful relationships of this kind, but that idea falls apart with words that have more than one meaning, such as `bank`.

 * The second problem is that neural networks perform a lot of numerical calculations internally, so if `table` has token 3, the network could end up treating it as more important than `cat`, which has token 1.

So this type of word representation can be discarded very quickly.

## One hot encoding

Here `N`-dimensional vectors are used, where `N` is the size of the vocabulary. Each word is represented by a vector that is all zeros except for a 1 in the position of its token. For example, we saw that OpenAI has a vocabulary of `100277` distinct tokens, so with `one hot encoding` each word would be represented by a vector of `100277` dimensions.

However, one hot encoding has two other major problems:

 * It does not take the relationship between words into account. Two synonyms such as `cat` and `feline` get two completely different vectors, no more alike than those of any other pair of words.
 In language the relationship between words is very important, and ignoring it is a big problem.

 * The second problem is that the vectors are very large. With a vocabulary of `100277` tokens, each word is represented by a vector of `100277` dimensions, which is expensive in both memory and computation. In addition, these vectors are all zeros except in the position corresponding to the word's token, so most of the calculations are multiplications by zero that contribute nothing, and a lot of memory is allocated to vectors that contain a single 1.
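The two problems can be seen in a small sketch. It is written in plain Python, with a three-word vocabulary invented for illustration; the helper names `one_hot` and `dot` are assumptions of this example, not functions from a library.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector

def dot(a, b):
    """Dot product of two vectors of the same length."""
    return sum(x * y for x, y in zip(a, b))

# Tiny illustrative vocabulary (a real one would have around 100k entries).
vocab = ["cat", "feline", "table"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

print(vectors["cat"])                          # [1, 0, 0]
print(dot(vectors["cat"], vectors["feline"]))  # 0: synonyms look unrelated
print(dot(vectors["cat"], vectors["table"]))   # 0: same as any other pair

# With the 100277-token vocabulary mentioned above, all entries but one are zero.
big = one_hot(42, 100277)
print(sum(big), len(big))                      # 1 100277
```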

## Word embeddings

Word embeddings try to solve the problems of the two previous representations. They also use `N`-dimensional vectors, but with far fewer dimensions than the `100277` of the vocabulary. For example, we will see that OpenAI uses `1536` dimensions.

Each of the dimensions of these vectors can be thought of as representing a characteristic of the word. For example, one dimension could represent whether the word is a verb or a noun, another whether the word is an animal or not, another whether it is a proper noun or not, and so on.

However, these features are not defined by hand; they are learned automatically, so in practice an individual dimension rarely has such a clean, human-readable meaning. During the training of the transformers, the values of each dimension of the vectors are adjusted so that the characteristics of each word are learned.
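One way to picture this is as a table with one row per token, where looking up an embedding just means picking a row. The sketch below uses toy sizes and random numbers as a stand-in for learned values; the names `embedding_matrix`, `embed` and `one_hot_times_matrix` are hypothetical. It also shows that the lookup gives the same result as multiplying a one hot vector by the table, without the wasted multiplications by zero.

```python
import random

random.seed(0)

vocab_size = 5     # toy value; the post mentions 100277 for OpenAI
embedding_dim = 3  # toy value; the post mentions 1536 for OpenAI

# One row per token. Random numbers stand in for values learned during training.
embedding_matrix = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(token_id):
    """Look up the embedding of a token: just pick its row."""
    return embedding_matrix[token_id]

def one_hot_times_matrix(token_id):
    """Same result computed the expensive way: one hot vector times the matrix."""
    one_hot = [1 if i == token_id else 0 for i in range(vocab_size)]
    return [
        sum(one_hot[i] * embedding_matrix[i][j] for i in range(vocab_size))
        for j in range(embedding_dim)
    ]

token_id = 2
looked_up = embed(token_id)
multiplied = one_hot_times_matrix(token_id)
print(looked_up)
print(all(abs(a - b) < 1e-12 for a, b in zip(looked_up, multiplied)))
```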

Because each dimension of the vector represents a characteristic of the word, words with similar characteristics will have similar vectors. For example, the words `cat` and `feline` will have very similar vectors, since they are both animals, and the words `table` and `chair` will have similar vectors, since both are furniture.

In the following image we can see a 3-dimensional representation of words, and we can see that all words related to `school` are close, all words related to `food` are close and all words related to `ball` are close.

![word_embedding_3_dimmension](http://maximofn.com/wp-content/uploads/2023/12/word_embedding_3_dimmension.webp)

Having each of the dimensions of the vectors represent a characteristic of the word allows us to perform operations with words. For example, if we subtract the word `man` from the word `king` and add the word `woman`, we get a word very similar to the word `queen`. We will check this later with an example.
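As a toy illustration of this kind of arithmetic (`king - man + woman`), the sketch below uses hand-made 2-dimensional vectors whose dimensions are `[royalty, gender]`. The values are invented so that the analogy works exactly; real embeddings are learned and only approximate this behaviour.

```python
# Hand-made 2-dimensional "embeddings": [royalty, gender].
# These values are invented for illustration; real embeddings are learned.
toy = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    """Element-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]

result = add(sub(toy["king"], toy["man"]), toy["woman"])
print(result)                  # [1.0, -1.0]
print(result == toy["queen"])  # True
```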

### Similarity between words

Since each word is represented by an `N`-dimensional vector, we can calculate the similarity between two words. `Cosine similarity` is used for this purpose.

If two words are close in vector space, it means that the angle between their vectors is small, so their cosine is close to 1. If there is an angle of 90 degrees between the vectors, the cosine is 0, meaning that there is no similarity between the words. And if there is an angle of 180 degrees between the vectors, the cosine is -1, that is, the words are opposites.

![cosine similarity](http://maximofn.com/wp-content/uploads/2023/12/cosine_similarity-scaled.webp)
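The three cases described above can be checked with a small plain-Python sketch. The cosine of the angle between two vectors `a` and `b` is their dot product divided by the product of their lengths. The function name `cosine_sim` is chosen here only to avoid confusion with library functions.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b: (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_sim([1, 0], [2, 0]))   # 1.0  (same direction, 0 degrees)
print(cosine_sim([1, 0], [0, 3]))   # 0.0  (90 degrees)
print(cosine_sim([1, 0], [-4, 0]))  # -1.0 (180 degrees)
```

Note that the length of the vectors does not matter, only their direction.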

### Example with OpenAI embeddings

Now that we know what `embeddings` are, let's see some examples with the `embeddings` provided by the `API` of `OpenAI`.

To do this we first need to have the `OpenAI` package installed.

```bash
pip install openai
```

We import the necessary libraries

```python
from openai import OpenAI
import torch
from torch.nn.functional import cosine_similarity
```

We use an OpenAI `API key`. To do this, go to the [OpenAI](https://openai.com/) page, and register. Once registered, go to the [API Keys](https://platform.openai.com/api-keys) section, and create a new `API Key`.

![open ai api key](https://raw.githubusercontent.com/maximofn/alfred/main/gifs/openaix2.gif)

```python
api_key = "Put your API key here"
```

We select which embeddings model we want to use. In this case we are going to use `text-embedding-ada-002`, which was the one recommended by `OpenAI` in its [embeddings](https://platform.openai.com/docs/guides/embeddings/) documentation when this post was written.

```python
model_openai = "text-embedding-ada-002"
```

Create an `API` client

```python
client_openai = OpenAI(api_key=api_key, organization=None)
```

Let's see what the `embeddings` of the word `Rey` (Spanish for king) look like.

```python
word = "Rey"
# Request the embedding from the API and convert it to a tensor
embedding_openai = torch.Tensor(client_openai.embeddings.create(input=word, model=model_openai).data[0].embedding)

embedding_openai.shape, embedding_openai
```

```
(torch.Size([1536]),
 tensor([-0.0103, -0.0005, -0.0189,  ..., -0.0009, -0.0226,  0.0045]))
```

As we can see, we obtain a vector of `1536` dimensions.

### Operations with words

Let's get the embeddings of the Spanish words `rey` (king), `hombre` (man), `mujer` (woman) and `reina` (queen).

```python
embedding_openai_rey = torch.Tensor(client_openai.embeddings.create(input="rey", model=model_openai).data[0].embedding)
embedding_openai_hombre = torch.Tensor(client_openai.embeddings.create(input="hombre", model=model_openai).data[0].embedding)
embedding_openai_mujer = torch.Tensor(client_openai.embeddings.create(input="mujer", model=model_openai).data[0].embedding)
embedding_openai_reina = torch.Tensor(client_openai.embeddings.create(input="reina", model=model_openai).data[0].embedding)

embedding_openai_reina.shape, embedding_openai_reina
```

```
(torch.Size([1536]),
 tensor([-0.0110, -0.0084, -0.0115,  ...,  0.0082, -0.0096, -0.0024]))
```

Now let's compute the embedding that results from subtracting the embedding of `hombre` from the embedding of `rey` and then adding the embedding of `mujer`.

```python
# rey - hombre + mujer
embedding_openai = embedding_openai_rey - embedding_openai_hombre + embedding_openai_mujer

embedding_openai.shape, embedding_openai
```

```
(torch.Size([1536]),
 tensor([-0.0226, -0.0323,  0.0017,  ...,  0.0014, -0.0290, -0.0188]))
```

Finally, we compare the result obtained with the embedding of `reina`. For this we use the `cosine_similarity` function provided by the `PyTorch` library.

```python
# cosine_similarity expects batched inputs, so we add a batch dimension with unsqueeze(0)
similarity_openai = cosine_similarity(embedding_openai.unsqueeze(0), embedding_openai_reina.unsqueeze(0)).item()

print(f"similarity_openai: {similarity_openai}")
```

```
similarity_openai: 0.7564167976379395
```


As we can see, the value is fairly close to 1 (about 0.76), so we can say that the result obtained is quite similar to the embedding of `reina`.

If we use English words, we get a result closer to 1.

```python
embedding_openai_rey = torch.Tensor(client_openai.embeddings.create(input="king", model=model_openai).data[0].embedding)
embedding_openai_hombre = torch.Tensor(client_openai.embeddings.create(input="man", model=model_openai).data[0].embedding)
embedding_openai_mujer = torch.Tensor(client_openai.embeddings.create(input="woman", model=model_openai).data[0].embedding)
embedding_openai_reina = torch.Tensor(client_openai.embeddings.create(input="queen", model=model_openai).data[0].embedding)

# king - man + woman
embedding_openai = embedding_openai_rey - embedding_openai_hombre + embedding_openai_mujer

similarity_openai = cosine_similarity(embedding_openai.unsqueeze(0), embedding_openai_reina.unsqueeze(0))
print(f"similarity_openai: {similarity_openai}")
```

```
similarity_openai: tensor([0.8849])
```


This is to be expected, since the OpenAI model has presumably been trained on more text in English than in Spanish.

### Types of word embeddings

There are several types of word embeddings, and each of them has its advantages and disadvantages. Let's take a look at the most important ones:

 * Word2Vec
 * GloVe
 * FastText
 * BERT
 * GPT-2

#### Word2Vec

Word2Vec is an algorithm used to create word embeddings. This algorithm was created by Google in 2013, and it is one of the most used algorithms to create word embeddings.

It has two variants, `CBOW` and `Skip-gram`. `CBOW` is faster to train, while `Skip-gram` tends to be more accurate, especially for infrequent words. Let's see how each of them works.

##### CBOW

`CBOW` or `Continuous Bag of Words` is an algorithm used to predict a word from the surrounding words. For example, if we have the sentence `The cat is an animal`, the algorithm will try to predict the word `cat` from the surrounding words, in this case `The`, `is`, `an` and `animal`.

![CBOW](http://maximofn.com/wp-content/uploads/2023/12/cbow-scaled.webp)

In this architecture, the model predicts which word is the most likely in the given context. Therefore, words that are likely to appear in the same contexts are considered similar and end up closer in the vector space.

Suppose that in a sentence we replace `boat` with `ship`: the model predicts the probability of each, and if the probabilities turn out to be similar we can consider the words similar.
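To make the shape of the training data concrete, here is a minimal sketch of how (context, target) pairs could be generated for a `CBOW`-style model. It only builds the pairs and does not train anything; the function name `cbow_pairs` and the `window` parameter are assumptions of this example.

```python
def cbow_pairs(tokens, window):
    """Return (context_words, target_word) pairs for a CBOW-style model."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        pairs.append((left + right, target))
    return pairs

sentence = "The cat is an animal".split()
pairs = cbow_pairs(sentence, window=4)
for context, target in pairs:
    print(context, "->", target)
```

For the target `cat`, the context is `['The', 'is', 'an', 'animal']`, as in the example above.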

##### Skip-gram

`Skip-gram` (often trained with negative sampling, hence the name `Skip-gram with Negative Sampling`) is an algorithm used to predict the words surrounding a word. For example, if we have the sentence `The cat is an animal`, the algorithm will try to predict the words `The`, `is`, `an` and `animal` from the word `cat`.

![Skip-gram](http://maximofn.com/wp-content/uploads/2023/12/Skip-gram-scaled.webp)

This architecture is similar to that of `CBOW`, but the model works the other way around: it predicts the context from the given word. Therefore, words that appear in the same contexts are considered similar and end up closer in the vector space.
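The reversed direction can be seen in the training pairs. The following sketch, again without any training, builds (center, context) pairs; the name `skipgram_pairs` and the `window` parameter are hypothetical.

```python
def skipgram_pairs(tokens, window):
    """Return (center_word, context_word) pairs for a Skip-gram-style model."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "The cat is an animal".split()
pairs = skipgram_pairs(sentence, window=4)
print([context for center, context in pairs if center == "cat"])
# ['The', 'is', 'an', 'animal']
```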

#### GloVe

`GloVe` or `Global Vectors for Word Representation` is an algorithm used to create word embeddings. It was created at Stanford University in 2014.

`Word2Vec` ignores the fact that some context words occur more frequently than others, and it only takes the local context into account, so it does not capture the global statistics of the corpus.

`GloVe` instead uses a co-occurrence matrix to create the word embeddings. This matrix contains the number of times each word appears near each of the other words in the vocabulary, that is, within a context window around it.
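A simplified sketch of how such co-occurrence counts could be collected is shown below. The two-sentence corpus, the function name `cooccurrence_counts` and the `window` parameter are assumptions made for illustration, and real implementations differ in details such as how counts are weighted.

```python
from collections import Counter

def cooccurrence_counts(sentences, window):
    """Count how often each pair of words appears within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(len(tokens), i + 1 + window)):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1  # keep the matrix symmetric
    return counts

corpus = ["the cat is an animal", "the dog is an animal"]
counts = cooccurrence_counts(corpus, window=2)
print(counts[("is", "an")])       # 2
print(counts[("the", "cat")])     # 1
print(counts[("cat", "animal")])  # 0, they are more than 2 positions apart
```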

#### FastText

`FastText` is an algorithm that is used to create word embeddings. This algorithm was created by Facebook in 2016.

One of the main disadvantages of `Word2Vec` and `GloVe` is that they cannot encode unknown or out-of-vocabulary words.

So, to deal with this problem, Facebook proposed a `FastText` model. It is an extension of `Word2Vec` and follows the same `Skip-gram` and `CBOW` model. But unlike `Word2Vec` which feeds whole words into the neural network, `FastText` first splits words into several subwords (or `n-grams`) and then feeds them to the neural network.

For example, if the value of `n` is 3 and the word is `apple`, its tri-grams are [`<ap`, `app`, `ppl`, `ple`, `le>`] and its word embedding is the sum of the vector representations of these tri-grams. Here, the hyperparameters `min_n` and `max_n` are both set to 3, and the characters `<` and `>` mark the beginning and end of the word.

Therefore, using this methodology, unknown words can also be represented in vector form, since it is very likely that their `n-grams` are also present in other words.

This algorithm is an improvement on `Word2Vec`, since in addition to taking into account the words surrounding a word, it also takes into account the `n-grams` of the word itself. For example, for the word `cat` and `n = 2`, the `n-grams` are `ca` and `at` (ignoring the boundary characters).
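The splitting step can be sketched in a few lines of plain Python. The function name `char_ngrams` and its `boundaries` flag are hypothetical helpers for this post, not part of the `FastText` library.

```python
def char_ngrams(word, n, boundaries=True):
    """Return the character n-grams of `word`, optionally wrapped in < and >."""
    text = f"<{word}>" if boundaries else word
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("apple", 3))
# ['<ap', 'app', 'ppl', 'ple', 'le>']
print(char_ngrams("cat", 2, boundaries=False))
# ['ca', 'at']
```

An unseen word such as `apples` shares most of its tri-grams with `apple`, which is why it can still get a reasonable vector.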

#### Limitations of word embeddings

Word embedding techniques give decent results, but the approach is not precise enough. They do not take into account the order in which the words appear, which leads to a loss of syntactic and semantic understanding of the sentence.

For example, take `You go there to teach, not to play` and `You go there to play, not to teach`. If word order is ignored (for example, by averaging the word embeddings), both sentences have the same representation in vector space, but they do not mean the same thing.

In addition, a word embedding model assigns a single vector to each word, which does not give satisfactory results when the same word has a different meaning depending on the context of the sentence.

For example, take `I am going to sit on the bank` and `I am going to do business at the bank`. In these two sentences, the word `bank` has different meanings.

Therefore, we require a type of representation that can retain the contextual meaning of the word present in a sentence.

## Sentence embeddings

Sentence embeddings are similar to word embeddings, but instead of single words they encode the whole sentence in the vector representation.

A simple way to obtain a sentence embedding is to average the word embeddings of all the words present in the sentence, but the result is not accurate enough.
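One reason averaging falls short is that it discards word order. The sketch below uses invented 2-dimensional word vectors (the values in `toy_vectors` are made up for illustration, and `average_embedding` is a hypothetical helper) to show that two sentences with the same words in a different order get the same averaged vector.

```python
import math

# Invented 2-dimensional word vectors, only for illustration.
toy_vectors = {
    "you": [0.1, 0.3], "go": [0.5, 0.2], "there": [0.4, 0.4],
    "to": [0.0, 0.1], "teach": [0.9, 0.7], "not": [0.2, 0.8],
    "play": [0.6, 0.1],
}

def average_embedding(sentence):
    """Sentence embedding computed as the mean of its word vectors."""
    words = sentence.lower().replace(",", "").split()
    dims = len(next(iter(toy_vectors.values())))
    return [
        math.fsum(toy_vectors[w][d] for w in words) / len(words)
        for d in range(dims)
    ]

a = average_embedding("You go there to teach, not to play")
b = average_embedding("You go there to play, not to teach")

# The two sentences mean different things but get the same vector.
print(all(math.isclose(x, y) for x, y in zip(a, b)))  # True
```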

Some of the most advanced models for sentence embedding are `ELMo`, `InferSent` and `Sentence-BERT`.

### ELMo

`ELMo` or `Embeddings from Language Models` is a model that was created by the Allen Institute for AI in 2018. It uses a deep bidirectional LSTM network to produce vector representations. Strictly speaking, it produces context-dependent word vectors, which can then be combined into a sentence representation. `ELMo` can represent unknown or out-of-vocabulary words in vector form since it is character-based.

### InferSent

`InferSent` is a sentence embedding model that was created by Facebook in 2017. It uses a bidirectional LSTM network with max pooling to produce vector representations. Unlike `ELMo`, it is not character-based: it builds on pretrained word vectors such as `GloVe`. Sentences are encoded in a 4096-dimensional vector representation.

The model is trained on the Stanford Natural Language Inference (`SNLI`) dataset, which contains about 570K sentence pairs written and labeled by humans.

### Sentence-BERT

`Sentence-BERT` is a sentence embedding model that was created at the Technical University of Darmstadt (UKP Lab) in 2019. It does not use an LSTM: it passes each sentence through a pretrained `BERT` network and pools the output into a fixed-size vector. Because `BERT` uses subword tokenization, it can still represent unknown or out-of-vocabulary words. With a BERT-base model, sentences are encoded in a 768-dimensional vector representation.

The `BERT` model is excellent at Semantic Textual Similarity tasks, but comparing sentences with it is very slow on a large corpus: finding the most similar pair among 10,000 sentences takes about 65 hours, because both sentences of every pair have to be fed into the network together, which multiplies the computation by a huge factor.

Therefore, `Sentence-BERT` modifies the `BERT` model so that each sentence gets its own embedding, computed only once, and sentences can then be compared with cosine similarity.
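A rough count shows where the cost comes from, assuming the task is to compare every pair among the 10,000 sentences quoted above. This is only back-of-the-envelope arithmetic, not a benchmark.

```python
from math import comb

n_sentences = 10_000

# Feeding both sentences into the network together means one forward pass per pair.
n_pairs = comb(n_sentences, 2)
print(n_pairs)      # 49995000, roughly 50 million forward passes

# With one embedding per sentence, only one forward pass per sentence is needed;
# the pairwise comparisons then become cheap cosine similarities.
print(n_sentences)  # 10000
```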