# 2.1 Understanding word embeddings

Deep neural network models, including large language models (LLMs), cannot process raw text directly. 
Because text is categorical data, it is incompatible with the mathematical operations used to implement and train neural networks. 
Therefore, we need a way to represent words as continuous-valued vectors. 
(Readers who are not familiar with vectors and tensors in a computational context can learn more in Appendix A, A2.2, “Understanding Tensors.”)
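
The following minimal sketch illustrates the point above. It assumes a hand-picked, hypothetical three-dimensional vector for the word "eagle" and a hypothetical set of layer weights; neither comes from a trained model. Scaling a raw string by a weight fails, while the same operation on a vector is an ordinary weighted sum.

```python
# Raw text is categorical: the arithmetic a network layer relies on is undefined for it.
try:
    "eagle" * 0.5  # multiplying a string by a float weight raises TypeError
    text_supports_scaling = True
except TypeError:
    text_supports_scaling = False

# Hypothetical, hand-picked values standing in for a learned representation of "eagle".
eagle_vector = [0.9, 0.8, -0.1]

# Hypothetical layer weights, chosen only for illustration.
layer_weights = [0.5, -1.0, 2.0]

# A weighted sum (dot product) is the basic operation inside a neural network layer,
# and it is well defined once the word is a continuous-valued vector.
activation = sum(w * x for w, x in zip(layer_weights, eagle_vector))

print(text_supports_scaling)
print(activation)
```

The string fails because there is no meaningful way to multiply a category by a real number, which is exactly why a vector representation is needed first.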

The concept of converting data into a vector format is often referred to as embedding.
As shown in Figure 2.2, by using specific neural network layers or other pre-trained neural network models, we can embed different types of data, such as video, audio, and text.

**Figure 2.2 Deep learning models cannot process raw data formats such as video, audio, and text. Therefore, we use embedding models to convert raw data into dense vector representations that deep learning architectures can readily process. Specifically, this figure shows raw data being converted into a three-dimensional numerical vector. Note that different data formats require different embedding models. For example, an embedding model designed for text is not suitable for embedding audio or video data.**

![fig2.2](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-2.jpg?raw=true)

Essentially, an embedding is a mapping from discrete objects, such as words, images, or even entire documents, to points in a continuous vector space. The main purpose of embeddings is to convert non-numeric data into a format that neural networks can process.

While word embeddings are the most common form of text embedding, there are also embeddings for sentences, paragraphs, or entire documents.
Sentence or paragraph embeddings are a popular choice for retrieval-augmented generation. Retrieval-augmented generation combines generation (e.g., producing text) and retrieval (e.g., searching an external knowledge base) to pull relevant information while generating text. These techniques are beyond the scope of this book.
Since our goal is to train large language models (LLMs) like GPT, which learn to generate text one word at a time, this chapter focuses on word embeddings.

Many algorithms and frameworks have been developed to generate word embeddings.
One of the earliest and most popular examples is the Word2Vec method.
Word2Vec trains a neural network architecture to generate word embeddings by predicting the context of a word given the target word, or vice versa.
The main idea of the Word2Vec architecture is that words that appear in similar contexts tend to have similar meanings.
Therefore, when the word embeddings are projected into a two-dimensional space for easy visualization, similar terms can be seen clustered together, as shown in Figure 2.3.
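
To make "predicting the context of a given target word" more concrete, the sketch below builds the (target, context) training pairs that a skip-gram style setup would learn from. It only prepares the pairs and does not train a network. The function name `skipgram_pairs`, the window size, and the example sentence are illustrative assumptions, not part of any Word2Vec library.

```python
def skipgram_pairs(words, window=2):
    """Return (target, context) pairs for every word within `window` positions of the target."""
    pairs = []
    for i, target in enumerate(words):
        start = max(0, i - window)               # clip the window at the sentence start
        end = min(len(words), i + window + 1)    # clip the window at the sentence end
        for j in range(start, end):
            if j != i:                           # a word is not its own context
                pairs.append((target, words[j]))
    return pairs


# Hypothetical example sentence, split on whitespace for simplicity.
sentence = "the eagle flew over the city".split()
pairs = skipgram_pairs(sentence, window=1)

for target, context in pairs[:4]:
    print(target, "->", context)
```

Words that occur in similar sentences end up with many of the same context words in these pairs, which is what pushes their learned vectors toward each other during training.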

**Figure 2.3 If word embeddings are two-dimensional, we can plot them in a two-dimensional scatter plot for easy visualization, as shown in this figure. When using word embedding techniques such as Word2Vec, words corresponding to similar concepts are usually close to each other in the embedding space. For example, different types of birds are closer to each other in the embedding space than countries and cities.**

![fig2.3](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-3.jpg?raw=true)
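
The notion of "close to each other in the embedding space" can be checked numerically. The sketch below uses hand-picked, hypothetical two-dimensional coordinates (not values from a trained Word2Vec model) and compares Euclidean distances, which is the same notion of closeness a scatter plot conveys.

```python
import math

# Hypothetical 2D embeddings chosen by hand so that related words sit near each other.
toy_embeddings = {
    "eagle":   [0.9, 0.8],
    "duck":    [0.8, 0.9],
    "germany": [-0.7, 0.6],
    "berlin":  [-0.6, 0.7],
}


def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


bird_distance = euclidean_distance(toy_embeddings["eagle"], toy_embeddings["duck"])
mixed_distance = euclidean_distance(toy_embeddings["eagle"], toy_embeddings["germany"])

print(f"eagle-duck:    {bird_distance:.3f}")
print(f"eagle-germany: {mixed_distance:.3f}")
```

With these assumed coordinates, the two birds are much closer to each other than a bird is to a country. In practice, cosine similarity is often used instead of Euclidean distance when comparing embeddings.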

Word embeddings can have different dimensions, ranging from one dimension to thousands of dimensions.
As shown in Figure 2.3, we can choose a two-dimensional word embedding for easy visualization.
Higher dimensions may capture more subtle relationships between words, but at the cost of decreased computational efficiency.

While we can use pretrained models such as Word2Vec to generate embeddings for machine learning models, large language models (LLMs) typically generate their own embeddings that are part of the input layer and updated during training.
The advantage of optimizing embeddings as part of LLM training, rather than using Word2Vec, is that the embeddings are optimized for the specific task and data at hand.
We will implement such an embedding layer later in this chapter.
In addition, as we will discuss in Chapter 3, LLMs can also create contextualized output embeddings.
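
As a rough illustration of what "updated during training" means, the sketch below treats an embedding table as a list of trainable rows and applies plain gradient descent to the one row that was looked up. The squared-error loss against a fixed `target` vector is a stand-in assumption for the real LLM training signal, and the function name `training_step` is hypothetical. This is not the embedding layer implemented later in this chapter.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
vocab_size, embedding_dim = 4, 3

# Trainable parameters: one randomly initialized row per token ID.
weights = [[random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
           for _ in range(vocab_size)]


def training_step(token_id, target, lr=0.1):
    """One gradient-descent step on the squared error between the looked-up row and `target`."""
    row = weights[token_id]
    loss = sum((w - t) ** 2 for w, t in zip(row, target))
    # d(loss)/d(w) = 2 * (w - t); only the looked-up row receives a gradient.
    for k in range(embedding_dim):
        row[k] -= lr * 2.0 * (row[k] - target[k])
    return loss


target = [0.5, -0.5, 0.0]           # stand-in training signal (assumption)
untouched_row = list(weights[1])    # copy of a row that is never looked up

losses = [training_step(token_id=2, target=target) for _ in range(20)]

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```

The row for token ID 2 moves toward whatever reduces the loss, while rows for tokens that never appear in the data stay at their initial values. This is the sense in which embeddings trained inside a model are shaped by its specific task and data.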

Unfortunately, high-dimensional embeddings are challenging to visualize because our sensory perception and common graphical representations are inherently limited to three dimensions or fewer, which is why Figure 2.3 shows a two-dimensional embedding in a two-dimensional scatter plot.
However, when working with large language models (LLMs), we typically use embeddings with dimensions much higher than those shown in Figure 2.3.
For GPT-2 and GPT-3, the embedding size (often referred to as the dimensionality of the model's hidden states) varies with the specific model variant and size, reflecting a trade-off between performance and efficiency.
To give concrete examples, the smallest GPT-2 (117 million parameters) and GPT-3 (125 million parameters) models use an embedding size of 768 dimensions. The largest GPT-3 model (175 billion parameters) uses an embedding size of 12,288 dimensions.
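
To get a feel for the efficiency side of this trade-off, the sketch below counts the parameters in the token embedding table alone, which is the vocabulary size multiplied by the embedding size. The vocabulary size of 50,257 tokens is an assumption not stated in this section (it is the size of the byte pair encoding vocabulary used by GPT-2 and GPT-3), and the helper name is hypothetical.

```python
# Assumption: GPT-2 and GPT-3 use a byte pair encoding vocabulary of 50,257 tokens.
VOCAB_SIZE = 50_257


def embedding_parameter_count(vocab_size, embedding_dim):
    """Number of weights in an embedding table with one row per token."""
    return vocab_size * embedding_dim


small = embedding_parameter_count(VOCAB_SIZE, 768)      # smallest GPT-2 / GPT-3 variants
large = embedding_parameter_count(VOCAB_SIZE, 12_288)   # largest GPT-3 variant

print(f"768-dimensional table:    {small:,} parameters")   # 38,597,376
print(f"12,288-dimensional table: {large:,} parameters")   # 617,558,016
```

Under this assumption, moving from 768 to 12,288 dimensions makes the embedding table 16 times larger, before counting any other layer of the model.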

The rest of this chapter describes the steps required to prepare embeddings for a large language model (LLM), including segmenting text into words, converting words into tokens, and converting tokens into embedding vectors.
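
As a preview, the three steps can be sketched end to end in a few lines. This is a simplified illustration under stated assumptions: the regular expression, the lowercasing, the tiny vocabulary built from a single sentence, and the random four-dimensional vectors are all placeholders, not the tokenizer or embedding layer developed in the rest of the chapter.

```python
import random
import re

text = "The eagle flew over the city."  # hypothetical example input

# Step 1: split the text into word and punctuation tokens (simplified rule, an assumption).
tokens = re.findall(r"\w+|[^\w\s]", text.lower())

# Step 2: map each unique token to an integer ID using a small sorted vocabulary.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[token] for token in tokens]

# Step 3: look up one vector per token ID in an embedding table.
# The values are random placeholders; a model would learn them during training.
random.seed(123)
embedding_dim = 4
embedding_table = [[random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
                   for _ in range(len(vocab))]
vectors = [embedding_table[token_id] for token_id in token_ids]

print(tokens)
print(token_ids)
print(len(vectors), "vectors of size", embedding_dim)
```

Note that both occurrences of "the" receive the same ID and therefore the same vector, since the lookup depends only on the token and not on its position in the sentence.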