# Embedding from Scratch

This notebook focuses on traditional embedding methods and on implementing one ourselves from scratch.

## What Are Embeddings?

Processing text for NLP tasks requires a numeric representation of each word. Every embedding method comes down to turning a "word" (or token) into a "vector"; how a method does this is what sets embedding techniques apart. A high-quality embedding gives the program or neural network a better understanding of what each token means.

Embedding is not only for text. In a general sense, embedding is the process of converting data into vectors, and it can be applied to text, images, audio, etc. Of course, the embeddings and the embedding methods of each modality are different and unique. Here, when mentioning embeddings, I am referring to text embeddings.

An overview of different methods can be viewed here:


![image.png](attachment:image.png)

*Figure 1: Overview of different word embedding techniques. (Selva and Kanniga, 2021)*


So how can we evaluate an embedding technique? In other words, what makes an embedding ideal?
- **Quality of Semantic Representation**: Embeddings must capture the semantic relationships between words. Words with similar meanings should be placed close together in the vector space, and unrelated words should be set apart. The vectors of "cat" and "dog" should be more similar than those of "dog" and "barrel".
- **Dimensionality Efficiency**: How big must the embedding vectors be? 15, 50, 300? Striking the right balance is key. Smaller vectors (lower dimensions) are more efficient to keep in memory and to process, while bigger vectors (higher dimensions) can capture more intricate relationships but are also more prone to overfitting. For reference, the smallest model in the GPT-2 family uses an embedding size of 768.
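
To make "close in the vector space" concrete, here is a minimal sketch that compares vectors with cosine similarity, one common way to measure closeness. The three-dimensional vectors in `toy_embeddings` are hypothetical values picked by hand for illustration; they do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "barrel": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_dog_barrel = cosine_similarity(toy_embeddings["dog"], toy_embeddings["barrel"])

print(f"cat vs dog:    {sim_cat_dog:.3f}")
print(f"dog vs barrel: {sim_dog_barrel:.3f}")
```

With a good embedding, we would expect the first score to be clearly higher than the second, as it is for these toy values.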

***NOTE***: When reading about embeddings you may come across "static" vs. "dynamic/contextualized" word embeddings. Static embeddings have a fixed representation for each word, regardless of the context it appears in. Dynamic word embeddings, by contrast, change the representation based on the context of the word. For example, the word "tear" has very different meanings in "Tears fell down from her eyes" and "tearing a page out": a static embedding assigns it the same vector in both cases, while a dynamic embedding does not.
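
As a rough illustration of the "static" case, a static embedding can be thought of as a plain lookup table. The table `static_table` and the helper `embed_static` below are hypothetical, and the vectors are toy values; the point is only that the lookup ignores the surrounding words.

```python
# Hypothetical static embedding table: one fixed vector per word (toy values).
static_table = {
    "tear": [0.2, -0.5, 0.7],
    "page": [0.6, 0.1, -0.3],
}

def embed_static(sentence, table):
    """Look up each known word; the surrounding words are ignored."""
    return {w: table[w] for w in sentence.lower().split() if w in table}

crying = embed_static("a tear fell from her eye", static_table)
ripping = embed_static("tear a page out", static_table)

# Both senses of "tear" receive exactly the same vector.
print(crying["tear"] == ripping["tear"])
```

A contextualized model would instead compute the vector for "tear" from the whole sentence, so the two results would differ.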

## Traditional Embedding Techniques
Almost every embedding technique relies on a corpus of text data to extract the relationships between words. Earlier word embedding methods relied on statistical methods based on the co-occurrence of words in a text: words that often appear together are assumed to be more closely related than words that never appear together. Knowing how sophisticated modern embeddings can be, this may not seem like a reliable approach. But to get an idea, let's check out one of these traditional embedding methods in practice:

### TF-IDF (Term Frequency-Inverse Document Frequency)
The idea of TF-IDF is to calculate the importance of a word in a document by considering two factors [1]:
1. **Term Frequency (TF)**: How frequently a term appears in a document. A higher TF shows that a term is more important to the document.
2. **Inverse Document Frequency (IDF)**: How rare a term is across documents. This is based on the assumption that terms that appear in many of the documents are less important than terms that are unique to fewer documents. 

$$
\text{tf}(t,d) = \begin{cases}
1 + \log_e(f_{t,d}) & \text{if } f_{t,d} > 0 \\
0 & \text{if } f_{t,d} = 0
\end{cases}
$$
where $f_{t,d}$ is the raw frequency of term $t$ in document $d$.


$$
\text{idf}(t,\mathcal{D}) = \log\left(\frac{N + 1}{\text{df}(t) + 1}\right) + 1
$$
where:

- $t$ is a term in the vocabulary
- $\mathcal{D}$ is the corpus of documents
- $N$ is the total number of documents in $\mathcal{D}$
- $\text{df}(t)$ is the document frequency of term $t$, i.e. the number of documents in $\mathcal{D}$ that contain $t$
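
The two formulas can be sketched directly in plain Python. This is a minimal illustration, not a replacement for a library implementation: the corpus `toy_docs` is hypothetical, both logarithms are assumed to be natural logarithms, and the final score is taken as the usual product $\text{tf} \cdot \text{idf}$.

```python
import math
from collections import Counter

def tf(term, document_tokens):
    """Sublinear term frequency: 1 + ln(f) if the term occurs, else 0."""
    f = Counter(document_tokens)[term]
    return 1 + math.log(f) if f > 0 else 0.0

def idf(term, tokenized_documents):
    """Smoothed inverse document frequency: ln((N + 1) / (df + 1)) + 1."""
    n_docs = len(tokenized_documents)
    df = sum(1 for doc in tokenized_documents if term in doc)
    return math.log((n_docs + 1) / (df + 1)) + 1

# Hypothetical toy corpus, tokenized by whitespace.
toy_docs = [
    "the king speaks to the queen".split(),
    "the queen leaves".split(),
    "a dog barks".split(),
]

for term in ["the", "queen", "dog"]:
    score = tf(term, toy_docs[0]) * idf(term, toy_docs)
    print(f"{term!r}: tf-idf in document 0 = {score:.3f}")
```

Note that scikit-learn's `TfidfVectorizer` uses the raw count for the TF part by default (the logarithmic variant is enabled with `sublinear_tf=True`) and additionally L2-normalizes each document vector, so its numbers will generally differ from this sketch.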


Now let's apply TF-IDF to the [TinyShakespeare](https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt) dataset.

In [20]:
# load the dataset
with open("tinyshakespeare.txt", "r") as file:
    corpus = file.read()

print(f"Text corpus includes {len(corpus.split())} words.")

# to simulate multiple documents, we split the corpus into 10 equal-length chunks
n_docs = 10
chunk_size = len(corpus) // n_docs  # chunk length in characters
documents = [corpus[i * chunk_size:(i + 1) * chunk_size] for i in range(n_docs)]
# any leftover characters at the end of the corpus are dropped
# now we have 10 documents from the corpus

Text corpus includes 202651 words.


In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
embeddings = vectorizer.fit_transform(documents)
words = vectorizer.get_feature_names_out()

print(f"Word count: {len(words)} e.g.: {words[:10]}")
print(f"Embedding shape: {embeddings.shape}")

Word count: 11446 e.g.: ['abandon' 'abase' 'abate' 'abated' 'abbey' 'abbot' 'abed' 'abel' 'abet'
 'abhor']
Embedding shape: (10, 11446)


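Let's now visualize the embeddings in 2D space. Each row of `embeddings` is a document and each column is a word, so transposing the matrix gives one 10-dimensional vector per word.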

In [39]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

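pca = PCA(n_components=2)
# transpose so that each row is a word; convert the sparse matrix to a dense array,
# since PCA does not accept sparse input in every scikit-learn version
emb_2d = pca.fit_transform(embeddings.T.toarray())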

In [50]:
import pandas as pd 
import holoviews as hv
hv.extension('bokeh')

df = pd.DataFrame({
    'x': emb_2d[:, 0],
    'y': emb_2d[:, 1],
    'word': list(words)
})

# sample of words we are interested in
special_words = ['dog', 'cat', 'animal', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']
# show only 200 words that are not special, otherwise the plot would be too dense
mask = df['word'].isin(special_words)
non_special_df = df[~mask].sample(n=200, random_state=42)
special_df = df[mask]
df = pd.concat([special_df, non_special_df])

# show special words in red
df['color'] = 'gray'
df.loc[df['word'].isin(special_words), 'color'] = 'red'

df['size'] = 5  
df.loc[df['word'].isin(special_words), 'size'] = 15

# add label color column
df['label_color'] = 'gray'
df.loc[df['word'].isin(special_words), 'label_color'] = 'red'

points = hv.Points(df, kdims=['x', 'y'], vdims=['word', 'color', 'size', 'label_color'])

# add labels and customize
labels = hv.Labels(points, ['x', 'y'], ['word', 'label_color'])

# Create plot with separate options for Points and Labels
points_opts = hv.opts.Points(
    width=800, height=600,
    tools=['hover', 'box_zoom', 'wheel_zoom', 'pan', 'reset'],
    alpha=0.3,  # More transparent for regular words
    color='color',
    size='size'
)

labels_opts = hv.opts.Labels(
    text_font_size='8pt',
    text_color='label_color'
)

plot = (points.opts(points_opts) * labels.opts(labels_opts)).opts(
    xlabel='Component 1', 
    ylabel='Component 2'
)

# Save the plot
hv.save(plot, 'tf-idf-embeddings.html')


In [52]:
plot

Because TF-IDF is based on how frequently terms occur in the documents, it doesn't capture semantic meaning. Words whose vectors are close to each other are not necessarily related in meaning, and words that are semantically close, like the numbers from one to ten, show no clear relationship in the vector space. This limitation of TF-IDF and similar approaches is what makes them unsuitable for many NLP tasks. However, their simplicity makes these methods useful in applications such as information retrieval, keyword extraction, and basic text analysis. You can read about some of these methods in [2].

## word2vec


![image.png](attachment:image.png)

*Figure 2: word2vec in a CBOW example*

[1] Vardhan, H. (2024, November 22). A Comprehensive Guide to Word Embeddings in NLP. Medium. https://medium.com/@harsh.vardhan7695/a-comprehensive-guide-to-word-embeddings-in-nlp-ee3f9e4663ed

[2] Turing. (2022, February 10). A Guide on Word embeddings in NLP. https://www.turing.com/kb/guide-on-word-embeddings-in-nlp