# Embedding from Scratch

This notebook focuses on traditional embedding methods and implementing one by ourselves from scratch.

## What Are Embeddings?

Processing text for NLP tasks requires us to have a numeric representation of each word. Every embedding method comes down to turning a "word" (or token) into a "vector". How a method achieves this goal is what sets embedding techniques apart from each other. A high-quality embedding gives the program or neural network a better understanding of what each token means.

Embedding is not only for text. In a general sense, embedding is the process of converting data into vectors, and it can be applied to text, image, audio, etc. Of course, the embeddings and the embedding methods of each modality are different and unique. Here, when mentioning embeddings, I am referring to the embedding of text.

An overview of different methods can be viewed here:


![image.png](attachment:image.png)

*Figure 1: Overview of different word embedding techniques. (Selva and Kanniga, 2021)*


So how can we evaluate an embedding technique? In other words, what makes an embedding ideal?
- **Quality of Semantic Representation**: Embeddings must capture the semantic relationships between words. Words with similar meanings should be placed close together in the vector space and unrelated words must be set apart. The vectors of "cat" and "dog" must be more similar than those of "dog" and "barrel".
- **Dimensionality Efficiency**: How big must the embedding vectors be? 15, 50, 300? Striking the right balance is key. Smaller vectors (lower dimensions) are more efficient to keep in memory or to process, while bigger vectors (higher dimensions) can capture intricate relationships, but are also prone to overfitting. For reference, the models in the GPT-2 family use an embedding size of 768 or more.
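
To make "close in the vector space" concrete, here is a minimal sketch of cosine similarity, the measure used later in this notebook to compare embeddings. The three vectors are hand-picked toy values, not the output of any trained model; they only illustrate what we would like a good embedding to do.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-picked toy vectors (an assumption for illustration, NOT real embeddings)
toy_vectors = {
    "cat":    [0.9, 0.8, 0.1],
    "dog":    [0.8, 0.9, 0.2],
    "barrel": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
dog_barrel = cosine_similarity(toy_vectors["dog"], toy_vectors["barrel"])
print(f"cat vs dog:    {cat_dog:.3f}")
print(f"dog vs barrel: {dog_barrel:.3f}")
```

With these toy values, "cat" and "dog" score higher than "dog" and "barrel", which is the behavior the first criterion asks for.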

***NOTE***: When reading about embeddings you may come across "static" vs. "dynamic/contextualized" word embeddings. Static embeddings have a fixed representation for each word, regardless of the context it appears in. For example, the word "tear" has very different meanings in "Tears fell from her eyes" and "tearing a page out"; a static embedding gives it the same vector in both cases, whereas dynamic word embeddings change the representation based on the context of the word.

## Traditional Embedding Techniques
Almost every embedding technique relies on a corpus of text data to extract the relationships between words. Previously, word embedding methods relied on statistical methods. These methods are based on the co-occurrence of words in a text: words that often appear together must have a closer relationship than words that never appear together. For us in the modern day, who know how much more sophisticated embeddings can be, this doesn't seem like a reliable approach. But to get an idea, let's check out one of these traditional embedding methods in practice:

### TF-IDF (Term Frequency-Inverse Document Frequency):
The idea of TF-IDF is to calculate the importance of a word in a document by considering two factors[1]:
1. **Term-Frequency (TF)**: How frequently a term appears in a document. A higher TF shows that a term is more important to the document.
2. **Inverse Document Frequency (IDF)**: How rare a term is across documents. This is based on the assumption that terms that appear in many of the documents are less important than terms that are unique to fewer documents. 

$$
\text{tf}(t,d) = \begin{cases}
1 + \log_e(f_{t,d}) & \text{if } f_{t,d} > 0 \\
0 & \text{if } f_{t,d} = 0
\end{cases}
$$
where $f_{t,d}$ is the raw frequency of term $t$ in document $d$


$$
\text{idf}(t,\mathcal{D}) = \log\left(\frac{N + 1}{\text{df}(t) + 1}\right) + 1
$$
where:

- $t$ is a term in the vocabulary
- $\mathcal{D}$ is the corpus of documents
- $N$ is the total number of documents in $\mathcal{D}$
- $\text{df}(t)$ is the document frequency of term $t$, i.e. the number of documents that contain $t$
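
Before reaching for a library, the two formulas above can be sketched in a few lines of plain Python. This is only an illustration on a made-up three-document corpus (`toy_corpus` and the helper names are my own, not part of any library), using the natural logarithm for both TF and IDF.

```python
import math
from collections import Counter

def term_frequency(term, doc_tokens):
    """Logarithmically scaled TF: 1 + ln(count) if the term occurs, else 0."""
    count = Counter(doc_tokens)[term]
    return 1.0 + math.log(count) if count > 0 else 0.0

def inverse_document_frequency(term, corpus_tokens):
    """Smoothed IDF: ln((N + 1) / (df + 1)) + 1."""
    n_docs = len(corpus_tokens)
    doc_freq = sum(1 for doc in corpus_tokens if term in doc)
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1.0

def tf_idf(term, doc_tokens, corpus_tokens):
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus_tokens)

# Made-up toy corpus, already split into tokens
toy_corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the king spoke".split(),
]

for term in ["the", "cat", "king"]:
    scores = [round(tf_idf(term, doc, toy_corpus), 3) for doc in toy_corpus]
    print(f"{term:>5}: {scores}")
```

"the" appears in every document, so its IDF is the minimum value of 1, while "king" appears in a single document and gets a higher IDF.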


Now let's use TF-IDF on the [TinyShakespeare](https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt) dataset.

In [20]:
# load the dataset
with open("tinyshakespeare.txt", "r") as file:
    corpus = file.read()

print(f"Text corpus includes {len(corpus.split())} words.")

# to simulate multiple documents, we chunk up the corpus into 10 equal-sized pieces (by character count)
chunk_size = len(corpus) // 10
documents = [corpus[i:i+chunk_size] for i in range(0, len(corpus), chunk_size)]

documents = documents[:-1]  # the last chunk is the short leftover, so we drop it
# now we have 10 documents from the corpus

Text corpus includes 202651 words.


In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
embeddings = vectorizer.fit_transform(documents)
words = vectorizer.get_feature_names_out()

print(f"Word count: {len(words)} e.g.: {words[:10]}")
print(f"Embedding shape: {embeddings.shape}")

Word count: 11446 e.g.: ['abandon' 'abase' 'abate' 'abated' 'abbey' 'abbot' 'abed' 'abel' 'abet'
 'abhor']
Embedding shape: (10, 11446)


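Let's now visualize the embeddings in 2D space. Each word is represented by its column in the matrix above, i.e. its TF-IDF score in each of the 10 documents.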

In [39]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

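pca = PCA(n_components=2)
# transpose so that each row is a word described by its 10 per-document TF-IDF scores;
# convert to a dense array because PCA in older scikit-learn versions rejects sparse input
emb_2d = pca.fit_transform(embeddings.T.toarray())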

In [50]:
import pandas as pd 
import holoviews as hv
hv.extension('bokeh')

df = pd.DataFrame({
    'x': emb_2d[:, 0],
    'y': emb_2d[:, 1],
    'word': list(words)
})

# sample of words we are interested in
special_words = ['dog', 'cat', 'animal', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']
# show only 200 words that are not special, otherwise the plot would be too dense
mask = df['word'].isin(special_words)
non_special_df = df[~mask].sample(n=200, random_state=42)
special_df = df[mask]
df = pd.concat([special_df, non_special_df])

# show special words in red
df['color'] = 'gray'
df.loc[df['word'].isin(special_words), 'color'] = 'red'

df['size'] = 5  
df.loc[df['word'].isin(special_words), 'size'] = 15

# add label color column
df['label_color'] = 'gray'
df.loc[df['word'].isin(special_words), 'label_color'] = 'red'

points = hv.Points(df, kdims=['x', 'y'], vdims=['word', 'color', 'size', 'label_color'])

# add labels and customize
labels = hv.Labels(points, ['x', 'y'], ['word', 'label_color'])

# Create plot with separate options for Points and Labels
points_opts = hv.opts.Points(
    width=800, height=600,
    tools=['hover', 'box_zoom', 'wheel_zoom', 'pan', 'reset'],
    alpha=0.3,  # More transparent for regular words
    color='color',
    size='size'
)

labels_opts = hv.opts.Labels(
    text_font_size='8pt',
    text_color='label_color'
)

plot = (points.opts(points_opts) * labels.opts(labels_opts)).opts(
    xlabel='Component 1', 
    ylabel='Component 2'
)

# Save the plot
hv.save(plot, 'tf-idf-embeddings.html')


In [52]:
plot

Because TF-IDF is based only on how often terms occur in documents, it doesn't capture semantic meaning. Words whose vectors lie close to each other are often unrelated in meaning, and words that are semantically close, like the numbers from one to ten, show no particular relationship in the vector space. This limitation is what makes TF-IDF and similar approaches unsuitable for many NLP tasks. However, their simplicity makes these methods useful in applications such as information retrieval, keyword extraction, and basic text analysis. You can read about some of these methods in [2].

## word2vec
word2vec is another approach, less traditional than TF-IDF and based on deep learning. As the name suggests, word2vec is a network that aims to convert words into embedding vectors. It achieves this by defining a side goal, something to optimize the network for. For example, in CBOW (continuous bag of words), the word2vec network is trained to predict a missing word when it's given the neighbors of that word as input. The intuition is that you can infer the embedding of a word from the words around it.

The word2vec architecture is pretty simple: one hidden layer that we extract the embeddings from, and one output layer which predicts the probabilities of all words in the vocabulary. On the surface, the network is trained to predict the right missing word given its neighbors, but in reality, this is an excuse to train the hidden layer of the network and find the right embeddings for each word. After the network is trained, the last layer can be tossed out the window because the embeddings are what we're looking for.  

![image.png](attachment:image.png)

*Figure 2: word2vec in a CBOW example*

Aside from CBOW, another variant is Skip-gram, which works the opposite way: it aims to predict the neighbors, given a particular word as input.

Let's see what happens in the case of a CBOW word2vec: after choosing a context window (e.g. 2 in the image above), we take the two words that appear before and the two words that appear after a particular word. The four words are encoded as one-hot vectors and passed through the hidden layer. The hidden layer has no non-linear activation function (technically it has a linear activation, so it passes on whatever it computes unchanged). The outputs of the hidden layer are aggregated (e.g. by averaging them) and then fed to the final layer which, by using Softmax, predicts a probability for each possible word. The word with the highest probability is considered the output of the network.
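
To make the context window concrete, here is a small sketch of how training examples could be built from a tokenized sentence for both variants. The helper names `cbow_pairs` and `skipgram_pairs` are hypothetical, and for simplicity the sketch skips words that don't have a full window on both sides (real implementations usually handle those edges too).

```python
def cbow_pairs(tokens, window=2):
    """(context words, target word) examples: the network sees the context and predicts the target."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs

def skipgram_pairs(tokens, window=2):
    """(input word, neighbor) examples: the network sees one word and predicts each neighbor."""
    return [(target, neighbor)
            for context, target in cbow_pairs(tokens, window)
            for neighbor in context]

sentence = "the quick brown fox jumps over".split()

for context, target in cbow_pairs(sentence):
    print(f"CBOW:      {context} -> {target}")

for word, neighbor in skipgram_pairs(sentence)[:4]:
    print(f"Skip-gram: {word} -> {neighbor}")
```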

As mentioned before the hidden layer holds the embeddings. It has a shape of *Vocabulary size x Embedding size* and as we give a one-hot vector of a word to the network, that specific `1` triggers the embeddings of that word to be passed to the next layers. You can see a cool and simple implementation of the word2vec network in [3].
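
The architecture described above fits in a handful of lines. The following is a minimal PyTorch sketch under my own assumptions (the class name `TinyCBOW`, the tiny vocabulary of 10 ids, and the random example batch are all made up); it uses an index look-up instead of explicit one-hot vectors, which gives the same result as multiplying a one-hot vector with the hidden layer's weight matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # hidden layer: one row per word, this is where the embeddings live
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # output layer: one score per word in the vocabulary
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context_ids):
        context_vectors = self.embeddings(context_ids)  # (batch, context, dim)
        hidden = context_vectors.mean(dim=1)            # average the context words
        return self.output(hidden)                      # logits; softmax is applied inside the loss

torch.manual_seed(0)
cbow = TinyCBOW(vocab_size=10, embedding_dim=4)

# Two made-up training examples: four context word ids each, plus the id of the missing word
context_ids = torch.tensor([[1, 2, 4, 5], [0, 3, 6, 7]])
target_ids = torch.tensor([3, 5])

logits = cbow(context_ids)
loss = F.cross_entropy(logits, target_ids)
loss.backward()  # gradients flow back into the embedding rows of the context words

# After training, the output layer is discarded and this matrix is what we keep
word_vectors = cbow.embeddings.weight.detach()
print(logits.shape, word_vectors.shape)
```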

![image.png](attachment:image.png)

Since the network learns from the local context each word appears in, rather than from document-level occurrence counts as in TF-IDF, it is able to capture **semantic relationships** between the words. What does this mean? Words that have closer meanings also have closer embeddings. Let's see word2vec embeddings in action. You can download the pretrained version from Google's official page [4]:

In [1]:
# let's load the pretrained embeddings and see how they look
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

In [2]:
print(f"The embedding size: {model.vector_size}")
print(f"The vocabulary size: {len(model)}")


The embedding size: 300
The vocabulary size: 3000000


In [3]:
# italy - rome + london = england
model.most_similar(positive=['london', 'italy'], negative=['rome'])

[('england', 0.5743448734283447),
 ('europe', 0.537047266960144),
 ('liverpool', 0.5141493678092957),
 ('chelsea', 0.5138063430786133),
 ('barcelona', 0.5128480792045593),
 ('birmingham', 0.5125836730003357),
 ('spain', 0.4980141520500183),
 ('sweden', 0.49154016375541687),
 ('leeds', 0.4871762692928314),
 ('holland', 0.4858900308609009)]
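
To see why this vector arithmetic can work at all, here is a toy re-creation of the idea with NumPy. The five vectors are hand-built so that each dimension has an obvious meaning; they are an assumption for illustration and have nothing to do with the real word2vec weights. The function `toy_most_similar` is my own rough approximation of the idea behind gensim's `most_similar` (add the positive vectors, subtract the negative ones, rank the remaining words by cosine similarity), not its actual implementation.

```python
import numpy as np

toy_words = ["italy", "rome", "england", "london", "france"]
# columns: [Italy-related, England-related, France-related, is a country, is a city]
toy_matrix = np.array([
    [1, 0, 0, 1, 0],  # italy
    [1, 0, 0, 0, 1],  # rome
    [0, 1, 0, 1, 0],  # england
    [0, 1, 0, 0, 1],  # london
    [0, 0, 1, 1, 0],  # france
], dtype=float)
toy_index = {word: i for i, word in enumerate(toy_words)}

def unit(vector):
    return vector / np.linalg.norm(vector)

def toy_most_similar(positive, negative, topn=3):
    # build the query: sum of positive vectors minus sum of negative vectors
    query = np.zeros(toy_matrix.shape[1])
    for word in positive:
        query += unit(toy_matrix[toy_index[word]])
    for word in negative:
        query -= unit(toy_matrix[toy_index[word]])

    # cosine similarity between the query and every word vector
    normalized = toy_matrix / np.linalg.norm(toy_matrix, axis=1, keepdims=True)
    scores = normalized @ unit(query)

    # the input words themselves are excluded from the ranking
    excluded = set(positive) | set(negative)
    ranked = [(w, float(scores[toy_index[w]])) for w in toy_words if w not in excluded]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)[:topn]

toy_result = toy_most_similar(positive=["london", "italy"], negative=["rome"])
print(toy_result)
```

In this toy space, subtracting "rome" from "italy" removes the "Italy-related" part and swaps "city" for "country", so adding the result to "london" lands on "england".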

In [9]:
# the release of word2vec sparked conversations on social bias
model.most_similar(positive=['woman', 'doctor'], negative=['man'])

[('gynecologist', 0.7093892097473145),
 ('nurse', 0.6477287411689758),
 ('doctors', 0.6471460461616516),
 ('physician', 0.6438996195793152),
 ('pediatrician', 0.6249487996101379),
 ('nurse_practitioner', 0.6218312978744507),
 ('obstetrician', 0.6072013974189758),
 ('ob_gyn', 0.5986713171005249),
 ('midwife', 0.5927063226699829),
 ('dermatologist', 0.5739566683769226)]

Semantic relationships are a fun topic to study. You can explore the biases of society or of the data, or explore how words have evolved over time by utilizing older manuscripts.

## BERT (Bidirectional Encoder Representations from Transformers)

BERT is the Sopranos of the NLP world: it's old, but you find references to it in whatever you look for. It's a good idea to do yourself a favor and learn about BERT once and for all, as it is the source of many ideas and techniques when it comes to LLMs. Here's a good video to get started [5].

In summary, BERT is an encoder-only transformer model consisting of 4 main parts:
1. Tokenizer: chops up texts into sequences of integers
2. Embedding: the module that converts discrete tokens into vectors
3. Encoder: a stack of transformer blocks with self-attention 
4. Task head: when encoder is finished with the representations, this task-specific head handles them for token generation or classification tasks. 

BERT was inspired by the Transformer architecture introduced in "Attention is all you need", and became an encoder-only transformer that is able to produce meaningful representations and is aimed at understanding language. The idea was to pretrain BERT to understand language, and then, depending on the specific problems we would like it to solve, fine-tune it to learn those tasks. These specific tasks can be Q&A (question + passage -> answer), text summarization, classification, etc.

In the pretraining phase, BERT is trained to learn two tasks simultaneously:
1. Masked Language Modeling: predict the masked words in a sentence (I [MASK] this book before -> read)
2. Next Sentence Prediction: given two sentences A and B, predict whether B actually follows A in the original text. The special [SEP] token separates the two sentences, and the task is a binary classification.

Note the other special token, [CLS]. This special token helps with classification tasks. As the model processes input layer by layer, [CLS] becomes an aggregation of all the input tokens, which can later be used for classification purposes.
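
The sketch below shows how a pretraining input along these lines could be assembled from two already-tokenized sentences. The helper names are hypothetical and the masking is simplified: it only replaces the selected tokens with [MASK], whereas the original BERT recipe also sometimes swaps in a random token or leaves the selected token unchanged.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}

def build_pair_input(sentence_a, sentence_b):
    """Lay out two tokenized sentences the way BERT expects: [CLS] A [SEP] B [SEP]."""
    return ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of ordinary tokens with [MASK]; labels hold the hidden originals."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if token not in SPECIAL_TOKENS and rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(token)   # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)    # position is ignored by the masked-LM loss
    return masked, labels

pair = build_pair_input("i read this book before".split(), "it was very good".split())
masked_pair, mlm_labels = mask_tokens(pair, mask_prob=0.3)

print(pair)
print(masked_pair)
print(mlm_labels)
```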

![image-2.png](attachment:image-2.png)

*Figure 3: Overview of BERT architecture [6]*

So why is BERT important? 

BERT is among the first instances of Transformer-based **contextualized, dynamic embeddings**. When given a sentence as input, the layers of the BERT model leverage self-attention and feed-forward mechanisms to update each token's representation and incorporate context from all other tokens in the sentence. The final output of each Transformer layer is a contextualized representation of the word, which differs depending on the context.

## Embeddings in Modern LLMs
Modern LLMs, such as the GPT family or the recent DeepSeek models, use embeddings as a foundational component. In the context of Large Language Models, "embeddings" is a broad term. For the purpose of this article, we focus on "embeddings" as the module that transforms tokens into vector representations.

### Where does the embedding fit into LLMs?
In transformer-based models, the term "embedding" can refer to both static embeddings and dynamic contextual representations:
1. **Static Embeddings** are generated in the first layer and combine token embeddings (vectors representing tokens) with positional embeddings (vectors encoding each token's position in the sequence, which we'll cover in a later notebook/article).
2. **Dynamic Contextual Representations**. As input tokens pass through the self-attention and feed-forward layers, their embeddings are updated to become contextual. These dynamic representations capture the meaning of tokens based on their surrounding context. For example, the word "bank" appears both in "river bank" and in "bank robbery", and while the **token embedding** of the word "bank" is the same in both cases, the transformations it goes through in the layers of the network account for the context in which the word "bank" appears.
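
The "bank" example can be reproduced in miniature. The sketch below uses a randomly initialized embedding table and a single untrained self-attention layer, so the numbers themselves are meaningless; the three-word vocabulary and all `toy_*` names are made up, and positional embeddings are left out to keep it short. The point is only that the look-up result for "bank" is identical in both phrases, while its representation after attention is not.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_vocab = {"river": 0, "bank": 1, "robbery": 2}
toy_embed = nn.Embedding(len(toy_vocab), 8)
toy_attention = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

def contextualize(words):
    """Return (static, contextual) vectors for each word of a short phrase."""
    ids = torch.tensor([[toy_vocab[w] for w in words]])  # shape (1, sequence)
    with torch.no_grad():
        static = toy_embed(ids)                           # plain table look-up
        contextual, _ = toy_attention(static, static, static)  # every token attends to the phrase
    return static[0], contextual[0]

static_a, contextual_a = contextualize(["river", "bank"])    # "bank" is at position 1
static_b, contextual_b = contextualize(["bank", "robbery"])  # "bank" is at position 0

print("same static embedding:    ", torch.equal(static_a[1], static_b[0]))
print("same contextual embedding:", torch.allclose(contextual_a[1], contextual_b[0]))
```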

So where do we draw the line?

**What should we actually call embeddings?**

I personally don't find it helpful to call the latent representations of the later LLM layers "embeddings"; it introduces unnecessary confusion. This naming is only reinforced by the distinction between "dynamic" and "static" embeddings that is made in many courses and articles, and the only way for an embedding to be "dynamic" and "context-aware" is for it to be processed through the layers of the network. So I prefer to reserve the term "embedding" for the first module of the LLM architecture: **The Embedding layer**.

<img src="llm-overview.png" alt="llm-overview.png" width="500" style="display: block; margin: 0 auto;"/>


### LLM Embeddings are Trained
LLM embeddings are optimized during the training process. Borrowing from Sebastian Raschka's **Build a Large Language Model (From Scratch)** [7]: "While we can use pretrained models such as Word2Vec to generate embeddings for machine learning models, LLMs commonly produce their own embeddings that are part of the input layer and are updated during training. The advantage of optimizing the embeddings as part of the LLM training instead of using Word2Vec is that the embeddings are optimized to the specific task and data at hand."

### torch.nn.Embedding
The embedding layer in LLMs works as a look-up table: given a list of indices (token IDs), it returns their embeddings. [7] shows this concept pretty well.

<img src="embeddings_as_lookup.png" alt="embeddings_as_lookup.png" width="500" style="display: block; margin: 0 auto;"/>

The practical implementation of an embedding layer in PyTorch is `torch.nn.Embedding`, which acts as a simple look-up table. There is nothing special about this layer compared to a simple Linear layer, other than that it takes indices as input rather than one-hot encodings. In other words, the Embedding layer is simply a Linear layer (without a bias) that works with indices.

<img src="embedding_as_linear.png" alt="embedding_as_linear.png" width="500" style="display: block; margin: 0 auto;"/>
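
The equivalence is easy to check numerically. In this small sketch (the sizes and `demo_*` names are arbitrary choices of mine), a bias-free `nn.Linear` is given the same weights as an `nn.Embedding`, and feeding it one-hot vectors produces exactly the rows that the look-up returns.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
demo_vocab, demo_dim = 6, 3

demo_lookup = nn.Embedding(demo_vocab, demo_dim)
demo_linear = nn.Linear(demo_vocab, demo_dim, bias=False)

# nn.Linear stores its weight as (out_features, in_features), hence the transpose
with torch.no_grad():
    demo_linear.weight.copy_(demo_lookup.weight.T)

token_ids = torch.tensor([2, 5, 0])
one_hot = F.one_hot(token_ids, num_classes=demo_vocab).float()

from_lookup = demo_lookup(token_ids)  # picks rows 2, 5 and 0 of the weight matrix
from_linear = demo_linear(one_hot)    # the matrix multiplication selects the same rows

print(torch.allclose(from_lookup, from_linear))
```

The look-up avoids building the one-hot vectors and the mostly-zero matrix multiplication, which matters a lot when the vocabulary has more than a hundred thousand entries.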

This notebook by Sebastian Raschka explains the Embedding layer in depth [8]. 

Now let's work with the embedding of a model and see some visuals!


# Embeddings in Action: DeepSeek-R1-Distill-Qwen-1.5B
Let's dissect the embeddings of DeepSeek-R1-Distill-Qwen-1.5B, the distilled version of DeepSeek-R1 based on the Qwen model. Some parts of the following code are inspired by [9].

In [1]:
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
model_name = tokenizer_name

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Load the pre-trained model
model = AutoModel.from_pretrained(model_name)

# Extract the embeddings layer
embeddings = model.get_input_embeddings()

# Print out the embeddings
print(f"Extracted Embeddings Layer for {model_name}: {embeddings}")

# Save the embeddings layer
torch.save(embeddings.state_dict(), "embeddings_qwen.pth")


Extracted Embeddings Layer for deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B: Embedding(151936, 1536)


Now let's load the embedding layer and work with it. The goal of separating the embedding from the other parts of the model is to work with it more efficiently, without loading the rest of the model.
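
A quick back-of-the-envelope calculation shows how much of the model this single layer is. The shape comes from the `Embedding(151936, 1536)` printed above; the bytes-per-value figures are the standard sizes of the two dtypes, and which one applies to the saved file depends on how the model was loaded, so treat the sizes as estimates. The variable names are my own.

```python
# Shape taken from the printed layer: Embedding(151936, 1536)
emb_rows, emb_cols = 151936, 1536
n_params = emb_rows * emb_cols

for dtype_name, bytes_per_value in [("float32", 4), ("bfloat16", 2)]:
    size_mib = n_params * bytes_per_value / 1024**2
    print(f"{dtype_name}: {n_params:,} parameters, about {size_mib:,.0f} MiB")
```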

In [2]:
vocab_size = 151936
dimensions = 1536
embeddings_filename = "embeddings_qwen.pth"

In [3]:
import torch.nn as nn

class EmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(EmbeddingModel, self).__init__()
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

    def forward(self, input_ids):
        return self.embedding(input_ids)

In [4]:
# Initialize the custom embedding model
model = EmbeddingModel(vocab_size, dimensions)

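# Load the saved embeddings from the file
# (the file is a state dict of plain tensors, so weights_only=True is sufficient and safer)
saved_embeddings = torch.load(embeddings_filename, weights_only=True)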

# Ensure the 'weight' key exists in the saved embeddings dictionary
if 'weight' not in saved_embeddings:
    raise KeyError("The saved embeddings file does not contain 'weight' key.")

embeddings_tensor = saved_embeddings['weight']

# Check if the dimensions match
if embeddings_tensor.size() != (vocab_size, dimensions):
    raise ValueError(f"The dimensions of the loaded embeddings do not match the model's expected dimensions ({vocab_size}, {dimensions}).")

# Assign the extracted embeddings tensor to the model's embedding layer
model.embedding.weight.data = embeddings_tensor

# put the model in eval mode
model.eval()


EmbeddingModel(
  (embedding): Embedding(151936, 1536)
)

### Now we have the embedding model loaded!
Let's see how a string is tokenized and what the embeddings look like.

In [53]:
from prettytable import PrettyTable

def prompt_to_embeddings(prompt:str):
    # tokenize the input text
    tokens = tokenizer(prompt, return_tensors="pt")
    input_ids = tokens['input_ids']

    # look up the embedding of every token id (the model is just the embedding layer)
    embeddings = model(input_ids)

    # decode each token id back into its string form
    token_id_list = tokenizer.encode(prompt, add_special_tokens=True)
    token_str = [tokenizer.decode(t_id, skip_special_tokens=True) for t_id in token_id_list]

    return token_id_list, embeddings, token_str


def print_tokens_and_embeddings(prompt:str):
    table = PrettyTable()

    token_id_list, embeddings, token_str = prompt_to_embeddings(prompt)

    headers = ["token_id", "token", "Embedding Vector"]
    token_emb_table = []
    for i, (t_id, t_str) in enumerate(zip(token_id_list, token_str)):
        embedding_values = embeddings[0][i].tolist()
        embedding_str = f"{embedding_values[0]:.6f}, {embedding_values[1]:.6f}, {embedding_values[2]:.6f} ... {embedding_values[-1]:.6f}"
        t_str = t_str.replace(" ",  "#")
        token_emb_table.append([t_id, t_str, embedding_str])

    table.title = "Token Embeddings"
    table.field_names = headers
    table.add_rows(token_emb_table)
    print(table)


In [55]:
print_tokens_and_embeddings("HTML coders are not considered programmers")

+------------------------------------------------------------------------+
|                            Token Embeddings                            |
+----------+--------------+----------------------------------------------+
| token_id |    token     |               Embedding Vector               |
+----------+--------------+----------------------------------------------+
|  151646  |              | -0.027466, 0.002899, -0.005188 ... 0.021606  |
|   5835   |     HTML     | -0.018555, 0.000912, 0.010986 ... -0.015991  |
|  20329   |     #cod     | -0.026978, -0.012939, 0.021362 ... 0.042725  |
|   388    |     ers      | -0.012085, 0.001244, -0.069336 ... -0.001213 |
|   525    |     #are     | -0.001785, -0.008789, 0.006195 ... -0.016235 |
|   537    |     #not     |  0.016357, -0.039062, 0.045898 ... 0.001686  |
|   6509   | #considered  | -0.000721, -0.021118, 0.027710 ... -0.051270 |
|  54846   | #programmers | -0.047852, 0.057861, -0.069336 ... 0.005280  |
+----------+--------------+----------------------------------------------+

As you can see, the prompt is chopped into tokens. Each tokenizer has a different approach to breaking strings down into meaningful pieces, and sometimes the result can be surprising too! (coders = cod + ers)
The first row is the special beginning-of-sequence token that the tokenizer prepends, which is why its text is empty. Also, I have replaced the whitespace " " with # in tokens for easier understanding.

### Visualize the embeddings
Now let's run a visualization experiment. For a given prompt, I will find the `n` closest embeddings to each token and visualize them in a 2D space.

In [63]:
def find_similar_embeddings(target_embedding, n=10):
    """
    Find the n most similar embeddings to the target embedding using cosine similarity
    
    Args:
        target_embedding: The embedding vector to compare against
        n: Number of similar embeddings to return (default 10)
    
    Returns:
        List of tuples containing (word, similarity_score) sorted by similarity
    """
    # Convert target to tensor if not already
    if not isinstance(target_embedding, torch.Tensor):
        target_embedding = torch.tensor(target_embedding)
        
    # Get all embeddings from the model
    all_embeddings = model.embedding.weight
    
    # Compute cosine similarity between target and all embeddings
    similarities = torch.nn.functional.cosine_similarity(
        target_embedding.unsqueeze(0), 
        all_embeddings
    )
    
    # Get top n similar embeddings
    top_n_similarities, top_n_indices = torch.topk(similarities, n)
    
    # Convert to word-similarity pairs
    results = []
    for idx, score in zip(top_n_indices, top_n_similarities):
        word = tokenizer.decode(idx)
        results.append((word, score.item()))
        
    return results


In [97]:
token_id_list, prompt_embeddings, prompt_token_str = prompt_to_embeddings("USA and China are the most prominent countries in AI.")

tokens_and_neighbors = {}
# start at index 1 to skip the beginning-of-sequence token that the tokenizer prepends
for i in range(1, len(prompt_embeddings[0])):
    token_results = find_similar_embeddings(prompt_embeddings[0][i], n=6)
    similar_embs = []
    for word, score in token_results:
        similar_embs.append(word.replace(" ", "#"))
    tokens_and_neighbors[prompt_token_str[i]] = similar_embs


In [98]:
all_token_embeddings = {}

# Process each token and its neighbors
for token, neighbors in tokens_and_neighbors.items():
    # Get embedding for the original token
    # (index 0 is the prepended beginning-of-sequence token, so the token itself is at index 1)
    token_id, token_emb, _ = prompt_to_embeddings(token)
    all_token_embeddings[token] = token_emb[0][1]
    
    # Get embeddings for each neighbor token
    for neighbor in neighbors:
        # Replace # with space
        neighbor = neighbor.replace("#", " ")
        # Re-tokenize the neighbor string; if it splits into several tokens, only the first one is kept
        neighbor_id, neighbor_emb, _ = prompt_to_embeddings(neighbor)
        all_token_embeddings[neighbor] = neighbor_emb[0][1]

In [104]:
import numpy as np
import holoviews as hv
hv.extension('bokeh')

# Convert embeddings to numpy array for PCA
embeddings_array = np.array([emb.detach().numpy() for emb in all_token_embeddings.values()])
words = list(all_token_embeddings.keys())


# Perform PCA to reduce to 2 dimensions
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings_array)

# Create points and labels datasets
points_data = []
labels_data = []
for i, word in enumerate(words):
    color = 'red' if word in prompt_token_str else 'blue'
    points_data.append((embeddings_2d[i,0], embeddings_2d[i,1], word, color, 6, color))
    # Labels are drawn at the same coordinates as their points
    labels_data.append((embeddings_2d[i,0], embeddings_2d[i,1], word.replace(" ","#"), color))

# Create scatter plot with labels
points = hv.Points(points_data, ['x', 'y'], ['word', 'color', 'size', 'label_color'])
labels = hv.Labels(labels_data, ['x', 'y'], ['word', 'label_color'])

# Combine plot elements and set options
plot = (points * labels).opts({
    'Points': {'alpha': 0.6, 'size': 'size', 'color': 'color'},
    'Labels': {'text_font_size': '8pt'},
    'Overlay': {'title': '2D PCA Visualization of Token Embeddings', 
                'width': 800,
                'height': 600}
})

hv.save(plot, 'token_embeddings_qwen.html')


In [105]:
plot

## Conclusion
Embeddings remain one of the fundamental parts of natural language processing and modern LLMs. While research in machine learning and LLMs rapidly uncovers new methods and techniques, embeddings haven't seen much change in large language models. They are essential, easy to understand, and easy to work with. In this notebook I have covered the essentials of embeddings, and their evolution from traditional statistical methods to their use in today's LLMs.

[1] Vardhan, H. (2024, November 22). A Comprehensive Guide to Word Embeddings in NLP - Harsh Vardhan - Medium. Medium. https://medium.com/@harsh.vardhan7695/a-comprehensive-guide-to-word-embeddings-in-nlp-ee3f9e4663ed

[2] Turing. (2022, February 10). A Guide on Word embeddings in NLP. https://www.turing.com/kb/guide-on-word-embeddings-in-nlp

[3] Sarkar, D. (n.d.). Implementing deep learning methods and feature engineering for text data: The Continuous Bag of Words (CBOW) - KDNuggets. KDnuggets. https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-cbow.html

[4] Google Code Archive - Long-term storage for Google Code Project Hosting. (n.d.). https://code.google.com/archive/p/word2vec/

[5] CodeEmporium. (2020, May 4). BERT Neural Network - EXPLAINED! [Video]. YouTube. https://www.youtube.com/watch?v=xI0HHN5XKDo

[6] Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018, October 11). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv.org. https://arxiv.org/abs/1810.04805

[7] Raschka, S. (n.d.). Build a Large Language Model (From Scratch). Manning Publications. https://www.manning.com/books/build-a-large-language-model-from-scratch

[8] Rasbt. (n.d.). LLMs-from-scratch/ch02/03_bonus_embedding-vs-matmul at main · rasbt/LLMs-from-scratch. GitHub. https://github.com/rasbt/LLMs-from-scratch/tree/main/ch02/03_bonus_embedding-vs-matmul

[9] Chrishayuk. (n.d.). GitHub - chrishayuk/embeddings. https://github.com/chrishayuk/embeddings/tree/main