# Chapter 3 Embeddings

In this chapter, we will learn about embeddings. An embedding is a vector representation of text that is easier for computers to process. Embeddings are typically used to convert discrete, high-dimensional data (such as words, sentences, or documents) into continuous, low-dimensional vectors.

The purpose of embeddings is to capture the semantic and grammatical relationships in the data, so that similar items lie closer together in the embedding space. For example, in natural language processing, word embeddings map words to a continuous vector space in which words with similar meanings are close to each other. This property makes embeddings one of the most important components of large language models (LLMs).

This chapter requires a Cohere API key.

## Table of Contents

- [I. Environment Configuration](#I.)

- [II. Word Embeddings](#II.)

- [2.1 Understanding the Concept of Embeddings](#2.1)

- [2.2 Implementing Word Embeddings](#2.2)

- [III. Sentence Embeddings](#III.)

- [3.1 Understanding the Concept of Sentence Embeddings](#3.1)

- [3.2 Implementing Sentence Embeddings](#3.2)

- [IV. Document Embeddings](#IV.)

## I. Environment Configuration <a id="I."></a>

Let's first prepare some Python libraries and APIs that we will need.

In [None]:
!pip install cohere umap-learn altair datasets python-dotenv

In [None]:
# Load the API key from a local .env file into the environment
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # Read the local .env file

Next, we import the Cohere library and create a Cohere client using the API key.

The Cohere library provides functions for calling Cohere's large language models through its API.

In [None]:
import cohere
co = cohere.Client(os.environ['COHERE_API_KEY'])
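If the key was not loaded, `os.environ['COHERE_API_KEY']` raises a bare `KeyError`. The following is a minimal sketch of a friendlier check; `require_env` is a hypothetical helper introduced here for illustration and is not part of the Cohere library or python-dotenv.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value

# Usage (commented out so the sketch runs without a key):
# co = cohere.Client(require_env('COHERE_API_KEY'))
```

The only difference from indexing `os.environ` directly is the error message, which names the missing variable and suggests where to define it.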

We also import the pandas library, which is used for data analysis and data processing.

In [2]:
import pandas as pd

## II. Word Embeddings <a id="II."></a>

### 2.1 Understanding the Concept of Embeddings <a id="2.1"></a>

Let's first learn what embeddings are.

Here we have a grid with horizontal and vertical axes, and a number of words placed on it.
If you had to put the word "apple" in the right place on this grid, where would you put it?

![Alt text](images/3-1.png)

As you can see, similar words are grouped together on this grid.
In the upper left there are footballs, basketballs, and table tennis balls; in the lower left, houses, buildings, and castles; in the lower right, vehicles such as bicycles and cars; and in the upper right, fruits.
The word "apple" therefore belongs with the fruits in the upper right.
We then associate each word with its coordinates on the grid. Here, the coordinates of "apple" are (5, 5).

This is a simple embedding: it maps each word to a vector of two values.

In practice, embeddings map words to many more values. Because a real vocabulary contains a very large number of words, the embeddings actually used map each word to hundreds or even thousands of values.
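The grid idea can be sketched in a few lines of Python. Only the coordinates of "apple", (5, 5), come from the text above; every other coordinate, and the helper names `euclidean_distance` and `closest_word`, are made up for illustration.

```python
import math

# Hypothetical 2-D coordinates; only apple's (5, 5) comes from the text.
toy_embeddings = {
    "apple": (5.0, 5.0),
    "banana": (6.0, 4.0),
    "football": (0.0, 5.0),
    "house": (1.0, 0.0),
    "car": (5.0, 1.0),
}

def euclidean_distance(a, b):
    """Straight-line distance between two points on the grid."""
    return math.dist(a, b)

def closest_word(word, embeddings):
    """Return the other word whose coordinates lie nearest to `word`."""
    target = embeddings[word]
    others = (w for w in embeddings if w != word)
    return min(others, key=lambda w: euclidean_distance(target, embeddings[w]))

print(closest_word("apple", toy_embeddings))
```

With these made-up coordinates the nearest neighbour of "apple" is the other fruit, which is the grouping behaviour the grid is meant to illustrate.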

### 2.2 Implementing Word Embeddings <a id="2.2"></a>

We will use a very small table containing three words: joy, happiness, and potato. We create it with pandas as follows:

In [4]:
three_words = pd.DataFrame({'text':
  [
      'joy',
      'happiness',
      'potato'
  ]})

three_words

| | text |
|---|---|
| 0 | joy |
| 1 | happiness |
| 2 | potato |


In [3]:
# Chinese version
three_words = pd.DataFrame({'text':
  [
      '欢乐',
      '快乐',
      '马铃薯'
  ]})

three_words

| | text |
|---|---|
| 0 | 欢乐 |
| 1 | 快乐 |
| 2 | 马铃薯 |


Next, we create embeddings for these three words using the embed function from the Cohere library.
The embed function takes a few inputs: "texts", the list of strings to embed (here, the "text" column of "three_words"), and "model", the name of the embedding model to use.

In [None]:
three_words_emb = co.embed(texts=list(three_words['text']),
                           model='embed-english-v2.0').embeddings  # The English version uses the English embedding model embed-english-v2.0

In [None]:
# Chinese version
three_words_emb = co.embed(texts=list(three_words['text']),
                           model='embed-multilingual-v2.0').embeddings  # The Chinese version uses the multilingual embedding model embed-multilingual-v2.0

Now let's look at the embedding of each word. We call the embedding of the word "joy" "word_1"; it is the first element of "three_words_emb". Similarly, "word_2" and "word_3" are the embeddings of "happiness" and "potato".

We can print the first 10 values of the embedding of "joy", that is, the first ten values in "word_1".

In [None]:
word_1 = three_words_emb[0]
word_2 = three_words_emb[1]
word_3 = three_words_emb[2]

print(word_1[:10])
# Note: the output below is for the English version; the Chinese version produces different values.

[2.3203125,
-0.18334961,
-0.578125,
-0.7314453,
-2.2050781,
-2.59375,
0.35205078,
-1.6220703,
0.27954102,
0.3083496]
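The raw numbers are hard to interpret on their own; what matters is how close two vectors are. One common way to measure this is cosine similarity. The sketch below uses short made-up vectors in place of the real embeddings so that it runs without an API key; `cosine_similarity` and the `toy_*` values are illustrative assumptions, not output from Cohere.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 4-value vectors standing in for word_1, word_2 and word_3
toy_joy = [0.9, 0.8, 0.1, 0.0]
toy_happiness = [0.8, 0.9, 0.2, 0.1]
toy_potato = [0.1, 0.0, 0.9, 0.8]

print(cosine_similarity(toy_joy, toy_happiness))  # high: similar direction
print(cosine_similarity(toy_joy, toy_potato))     # low: different direction
```

With the real vectors, you would pass "word_1", "word_2", and "word_3" to the same function.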

## III. Sentence Embeddings <a id="III."></a>

### 3.1 Understanding the Concept of Sentence Embeddings <a id="3.1"></a>

Embeddings can be used not only for words but also for longer pieces of text, even very long ones.

![Alt text](images/3-2.png)

In this example, we have some sentence embeddings: each sentence has been converted into a vector, that is, a list of values.

Notice that the first sentence is "hello, how are you?" and the last sentence is "Hi, how's it going?".

They do not share the same words, but their meanings are very similar, so the embedding maps them to vectors that are very close to each other.

### 3.2 Implementing Sentence Embeddings <a id="3.2"></a>

We prepare a small dataset of eight sentences that come in pairs: each question is followed by its answer. For example, the answer to "What color is the sky?" is "The sky is blue", and the answer to "What is an apple?" is "An apple is a fruit".

In [None]:
# Create a DataFrame containing eight sentences (four question/answer pairs)
sentences = pd.DataFrame({'text':
  [
   'Where is the world cup?',  # Sentence 1 (question)
   'The world cup is in Qatar',  # Sentence 2 (answer)
   'What color is the sky?',  # Sentence 3 (question)
   'The sky is blue',  # Sentence 4 (answer)
   'Where does the bear live?',  # Sentence 5 (question)
   'The bear lives in the woods',  # Sentence 6 (answer)
   'What is an apple?',  # Sentence 7 (question)
   'An apple is a fruit',  # Sentence 8 (answer)
  ]})


In [None]:
# Chinese version
sentences = pd.DataFrame({'text':
  [
   '世界杯在哪里？',
   '世界杯在卡塔尔', 
   '天空是什么颜色的?', 
   '天空是蓝色的', 
   '熊住在哪里？',  
   '熊住在森林里', 
   '苹果是什么?',  
   '苹果是一种水果',  
  ]})


Now, again using the embed function from the Cohere library, we convert all of these sentences into embeddings and observe which sentences are close to or far from each other.

In [None]:
emb = co.embed(texts=list(sentences['text']),
               model='embed-english-v2.0').embeddings  # The English version uses the English embedding model embed-english-v2.0

In [None]:
# Chinese version
emb = co.embed(texts=list(sentences['text']),
               model='embed-multilingual-v2.0').embeddings  # The Chinese version uses the multilingual embedding model embed-multilingual-v2.0

Let's look at the first 3 values of the embedding for each sentence.

In [None]:
for e in emb:
    print(e[:3])
# Note: the output below is for the English version; the Chinese version produces different values.

[0.27319336, -0.37768555, -1.0273438]

[0.49804688, 1.2236328, 0.4074707]

[-0.23571777, -0.9375, 0.9614258]

[0.08300781, -0.32080078, 0.9272461]

[0.49780273, -0.35058594, -1.6171875]

[1.2294922, -1.3779297, -1.8378906]

[0.15686035, -0.92041016, 1.5996094]

[1.0761719, -0.7211914, 0.9296875]

Let's check how many values each sentence's embedding has.

In [None]:
print(len(emb[0]))

In this particular case the answer is 4096, but different embedding models produce vectors of different lengths.

Let's visualize the embeddings for this dataset.
We call the umap_plot function from the utils module, which uses the umap and altair packages to generate the following plot.

In [None]:
from utils import umap_plot
# Generate a chart using the umap_plot function
chart = umap_plot(sentences, emb)
# Call the interactive method to display an interactive chart
chart.interactive()

![Alt text](images/3-3.png)

This figure shows eight points, each representing a sentence in our dataset. Hovering the mouse over a point shows which sentence it represents.

We can observe that sentences with similar meanings are very close to each other, such as 'Where does the bear live?' and 'The bear lives in the woods'.

We can conclude that embeddings place texts with similar meanings close together and texts with very different meanings far apart.
Usually, the sentence most similar to a question is its answer.
Therefore, we can find the answer to a question by searching for the sentence closest to it.
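The search described above can be sketched as a nearest-neighbour lookup. This is a minimal illustration using cosine similarity and made-up three-value vectors, not the real 4096-value embeddings; `nearest_neighbor`, `toy_texts`, and `toy_emb` are hypothetical names chosen for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbor(query_index, embeddings):
    """Index of the embedding most similar to embeddings[query_index], excluding itself."""
    query = embeddings[query_index]
    candidates = [i for i in range(len(embeddings)) if i != query_index]
    return max(candidates, key=lambda i: cosine_similarity(query, embeddings[i]))

# Made-up vectors standing in for real sentence embeddings
toy_texts = [
    'What color is the sky?',
    'The sky is blue',
    'What is an apple?',
    'An apple is a fruit',
]
toy_emb = [
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.0, 0.2, 0.9],
    [0.1, 0.4, 0.8],
]

for question_index in (0, 2):
    answer_index = nearest_neighbor(question_index, toy_emb)
    print(toy_texts[question_index], '->', toy_texts[answer_index])
```

To try this with the real data, pass "emb" instead of "toy_emb" and "list(sentences['text'])" instead of "toy_texts". A linear scan like this is fine for eight sentences; it is only a sketch and makes no attempt to scale to large datasets.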

## IV. Document Embeddings <a id="IV."></a>

Now that you know how to embed a small dataset of eight sentences, let's work with a larger one.

We will use a dataset of Wikipedia articles.
It contains 2000 articles, each with a title, the text of its first paragraph, and the embedding of that paragraph.
Let's load the dataset with the following code.

In [None]:
import pandas as pd
wiki_articles = pd.read_pickle('wikipedia.pkl')
wiki_articles[['title','text','emb']]

![Alt text](images/3-4.png)

We import the NumPy library and a function that helps us visualize this dataset in a plot very similar to the previous one.
The embeddings are reduced to two dimensions so that we can plot them.

In [None]:
import numpy as np
from utils import umap_plot_big

# Get data for the 'title' and 'text' columns from wiki_articles
articles = wiki_articles[['title', 'text']]

# Get the data of the 'emb' column from wiki_articles and convert it to a numpy array
embeds = np.array([d for d in wiki_articles['emb']])

# Generate a chart using the umap_plot_big function
chart = umap_plot_big(articles, embeds)

# Call the interactive method to display an interactive chart
chart.interactive()

Hover the mouse over a point in the plot to display the content of the article. You can see that similar articles are located close to each other.

![Alt text](images/3-5.png)
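The source of umap_plot_big is not shown in this chapter, so the following does not reproduce it. As a rough illustration of what "reducing to two dimensions" means, here is a sketch that swaps in a simpler technique, PCA computed with NumPy's SVD, instead of UMAP. The function name `project_to_2d` and the random `toy_embeds` array are assumptions made for this example.

```python
import numpy as np

def project_to_2d(embeds: np.ndarray) -> np.ndarray:
    """Project row vectors onto their two directions of greatest variance (PCA)."""
    # Center the data so each column has mean zero
    centered = embeds - embeds.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Random stand-in for an (articles x dimensions) embedding matrix
rng = np.random.default_rng(0)
toy_embeds = rng.normal(size=(20, 16))

points_2d = project_to_2d(toy_embeds)
print(points_2d.shape)
```

PCA is a linear projection, whereas UMAP is nonlinear and generally preserves local neighbourhoods better, so the resulting plots will look different. With the real data you could pass "embeds" from the cell above to the same function.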

That concludes our introduction to embeddings.