# 🧠 Text Embeddings

## → Simple document vectors

### 🔢 Count / frequency

You will recall that we previously used token counts to represent text, and showed this for an example news headline corpus in a document-feature matrix.

In [None]:
# import spaCy, the count vectorizer, and pandas to display the results
import spacy
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

financial_headlines = [
    "Apple stocks soar to record highs as investors celebrate strong earnings",
    "Global markets crash amid fears of a deepening recession and economic turmoil",
    "Central bank raises interest rates, boosting confidence in economic recovery",
    "Tesco reports impressive quarterly profits as retail sales surge",
    "Oil prices collapse after oversupply announcement, sparking industry panic",
    "Unemployment rate drops to lowest level in a decade as job market thrives",
    "Debenhams files for bankruptcy protection as retail sector faces disaster",
    "Investors optimistic as trade tensions ease and markets rally strongly",
    "Housing market slows down as prices fall and buyers worry",
    "Novo Nordisk faces lawsuit over drug safety, shares plunge on negative news",
    "Consumer confidence reaches all-time high as retail sales boom",
    "Tesla recalls thousands of vehicles over safety concerns, shares tumble",
    "Earnings miss sends shares of Trump Inc plummeting as profits disappoint",
    "Mergers and acquisitions activity hits new peak as companies celebrate growth",
    "Cryptocurrency market rebounds after sharp decline, investors cheer recovery",
    "Retail sales disappoint during holiday season as consumer confidence drops",
    "Government stimulus package boosts economic outlook, markets surge",
    "Manufacturing sector contracts for third straight month as demand weakens",
    "Lab grown meat startup secures major funding round, industry hails innovation",
    "Layoffs announced as company restructures operations, employees face uncertainty"
]

nlp = spacy.load("en_core_web_sm")  # load English tokenizer, POS tagger, etc.

# Preprocess texts with spaCy (lemmatize, remove stopwords/punctuation)
def spacy_tokenizer(text):
    doc = nlp(text)
    return [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]

# Use CountVectorizer with the spaCy tokenizer
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer)
X = vectorizer.fit_transform(financial_headlines)

# Convert to dense matrix and view as DataFrame
df = pd.DataFrame(X.toarray(), columns=list(vectorizer.get_feature_names_out()),
                  index=financial_headlines)

df

Each row is a vector that represents its corresponding document. The columns are the features, which in this case are the (lemmatised) words in the vocabulary. Each cell in the matrix is a count of how many times that word appears in that document. We can numerically describe the documents in this way, but raw counts are not very informative: they favour longer documents and treat every word as equally important.
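
To make the structure of a document-feature matrix concrete, here is a minimal sketch that builds one by hand using only the standard library. The three tokenised documents are a made-up toy corpus (not the headlines above), and this is only an illustration of the idea, not how `CountVectorizer` is implemented internally.

```python
from collections import Counter

# Toy corpus, already tokenised (hypothetical example data)
docs = [
    ["market", "rally", "market"],
    ["market", "crash"],
    ["profit", "rally"],
]

# The vocabulary is the sorted set of all tokens: one column per token
vocab = sorted({token for doc in docs for token in doc})

# One row per document, one count per vocabulary entry (0 if the token is absent)
matrix = [[Counter(doc)[token] for token in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each printed row lines up with the vocabulary list, so the second column of the first row is the count of "market" in the first document.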


### 📊 TF-IDF

We can, however, represent these texts in a different way. Firstly, we could normalise the counts to be between 0 and 1 to account for different length documents. Another method is to use a weighted representation, where the weights are based on the **importance** of the word in the document. This is known as term frequency-inverse document frequency (**TF-IDF**). The idea is that if a word appears frequently in a document but not in many other documents, it is likely to be important for that document. The TF-IDF score is calculated as follows:

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \cdot \text{IDF}(t)
$$

where:
- $\text{TF}(t, d)$ is the term frequency of term $t$ in document $d$, which is the number of times $t$ appears in $d$ divided by the total number of terms in $d$.
- $\text{IDF}(t)$ is the inverse document frequency of term $t$, which is calculated as:
$$
\text{IDF}(t) = \log\left(\frac{N}{n_t}\right)
$$
where:
- $N$ is the total number of documents in the corpus.
- $n_t$ is the number of documents containing term $t$.

The TF-IDF score is a measure of how important a word is to a document in a corpus, taking into account the words in the other documents in the corpus. It is often used in information retrieval and text mining. The TF-IDF vector for a document describes it by a weighted representation of its important words.
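
As a worked example of the formula above, the following sketch computes TF-IDF by hand with the standard library. The toy corpus and the helper names (`tf`, `idf`, `tf_idf`) are assumptions made for illustration; the natural logarithm is used for the IDF term.

```python
import math
from collections import Counter

# Toy corpus, already tokenised (hypothetical example data)
docs = [
    ["market", "rally", "market"],
    ["market", "crash"],
    ["profit", "rally"],
]
N = len(docs)  # total number of documents


def tf(term, doc):
    """Count of the term in the document divided by the document length."""
    return Counter(doc)[term] / len(doc)


def idf(term):
    """log(N / n_t), where n_t is the number of documents containing the term."""
    n_t = sum(1 for doc in docs if term in doc)
    return math.log(N / n_t)


def tf_idf(term, doc):
    return tf(term, doc) * idf(term)


# Both terms appear once in the second document, but "crash" is rarer in the corpus
for term in ["market", "crash"]:
    print(term, round(tf_idf(term, docs[1]), 3))
```

"market" and "crash" have the same term frequency in the second document, but "crash" appears in only one of the three documents, so it gets the higher score.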

Let's see how we can calculate the TF-IDF score for our example news headlines. We will use the `TfidfVectorizer` class from the `sklearn` library to do this. The `TfidfVectorizer` class takes a list of documents as input and returns a matrix of TF-IDF scores for each document and each term in the vocabulary.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Use TfidfVectorizer with the spaCy tokenizer
vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)
X = vectorizer.fit_transform(financial_headlines)

# Convert to dense matrix and view as DataFrame
df = pd.DataFrame(X.toarray(), columns=list(vectorizer.get_feature_names_out()),
                  index=financial_headlines)

df
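
One caveat when reading this table: the numbers will not match a hand calculation with the textbook formula. As far as the scikit-learn documentation describes it, the defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`) use raw counts for TF, a smoothed IDF of $\ln\left(\frac{1 + N}{1 + n_t}\right) + 1$, and then scale each row to unit length. The sketch below reproduces that variant in plain Python on a made-up toy corpus; the helper names are hypothetical and this is an approximation of the documented behaviour, not the library's code.

```python
import math
from collections import Counter

# Toy corpus, already tokenised (hypothetical example data)
docs = [
    ["market", "rally", "market"],
    ["market", "crash"],
    ["profit", "rally"],
]
N = len(docs)
vocab = sorted({token for doc in docs for token in doc})


def smoothed_idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + n_t)) + 1, so it is never zero."""
    n_t = sum(1 for doc in docs if term in doc)
    return math.log((1 + N) / (1 + n_t)) + 1


def tfidf_row(doc):
    """Raw count times smoothed IDF, then L2-normalised to unit length."""
    counts = Counter(doc)
    raw = [counts[term] * smoothed_idf(term) for term in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


rows = [tfidf_row(doc) for doc in docs]
for doc, row in zip(docs, rows):
    print(doc, [round(x, 3) for x in row])
```

Because of the smoothing, a word that appears in every document still gets a small positive weight rather than zero.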


TF-IDF provides a more informative representation of the text than simple counts as a way to describe a document. For example, common words in these headlines like "market" don't tell us much about the content of the document, so they receive a lower score (if they are present). Unique words in a document give us more information about the document, so they receive a higher score. This is useful for tasks like document classification, where we want to identify the topic of a document based on its content. This is perhaps more apparent when we have longer documents and a larger corpus. Consider the following examples of Guardian opinion pieces about COVID-19:

In [None]:
# Read the data
covid_df = pd.read_csv("data/covid_stories.csv")
display(covid_df)

# Use TfidfVectorizer with the spaCy tokenizer on the COVID dataset
vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)
X = vectorizer.fit_transform(covid_df['bodyContent'])

# Convert and view as DataFrame
tfidf_df = pd.DataFrame(X.toarray(), columns=list(vectorizer.get_feature_names_out()),
                  index=covid_df['webTitle'])

display(tfidf_df)


In [None]:
# Display a sorted TF-IDF vector for the first story
print(tfidf_df.index[0])
display(tfidf_df.iloc[0].sort_values(ascending=False))

The first row in the dataframe indicates the most important words in the first document, relative to the rest of the corpus. It is clearly focused on the outbreak itself in Wuhan. "Coronavirus", "Covid-19", etc. are terms likely to appear across most documents in the corpus, so whilst they might appear frequently in the document, they are not as important for distinguishing it from the others.

## 🧠 Embeddings

The aforementioned methods of representing text are known as **bag-of-words** models. They are simple and effective, but they have some limitations. For example, they do not take into account the order of the words in the text, and they do not capture the meaning of the words. Without some dimensionality reduction they are also unwieldy to work with, since each vector has one dimension per vocabulary term and is mostly zeros. This is where embeddings come in.
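
To make the word-order limitation concrete, here is a tiny sketch with two invented sentences that mean different things but contain the same words. Their bags of words are identical, so any count-based or TF-IDF representation cannot tell them apart.

```python
from collections import Counter

# Two made-up sentences with the same words in a different order
a = "tesla sues apple".split()
b = "apple sues tesla".split()

print(a == b)                    # False: the token sequences differ
print(Counter(a) == Counter(b))  # True: the bags of words are identical
```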

In NLP, an embedding is a dense vector representation of a piece of text (could be a word or document) such that the vector captures the semantic "meaning" of the text. Words or documents that are similar in meaning should end up with vectors that are close to each other in this vector space. This idea underpins a lot of modern NLP – instead of dealing with words as discrete symbols, we embed them in a continuous multidimensional vector space that a machine learning model can work with.

* Word embeddings
  * Map individual words to vectors. These models learn from large corpora such that words used in similar contexts end up near each other in the vector space (i.e. their vectors look similar). For example, “king” and “queen” might be close in space, and analogies can be solved with vector arithmetic (classic example: king – man + woman ≈ queen). Word2Vec and GloVe are classic algorithms that produce word embeddings. More advanced models like BERT and GPT also produce contextualized word embeddings, meaning the same word can have different vectors depending on its context in a sentence.

* Sentence / document embeddings
  * Produce a vector for a whole sentence or larger text. This is often done by using models like BERT (a transformer) or specialized sentence embedding models (like Sentence Transformers) that aggregate meaning over multiple words. Sentence embeddings are useful to compare similarity of entire sentences or to feed documents into clustering or classification algorithms.
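
The analogy arithmetic mentioned above can be sketched with toy numbers. The three-dimensional vectors below are invented purely for illustration (real embeddings are learned from data and have hundreds of dimensions), and the `cosine` helper is a hypothetical name. The point is only to show the mechanics: compute king – man + woman, then look for the nearest remaining word.

```python
import math

# Invented toy vectors; the dimensions loosely stand for (royalty, male, female)
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
    "apple": [0.0, 0.1, 0.1],
}


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# king - man + woman, computed dimension by dimension
target = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

# Exclude the three input words, then pick the closest remaining word
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```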

### 💬 Word embeddings

spaCy has a built-in word embedding model that can be used to convert words into vectors. The `en_core_web_md` model contains pre-trained ('Explosion') word vectors that can be used to represent words in a continuous vector space.

In [None]:
# !python -m spacy download en_core_web_md

nlp = spacy.load("en_core_web_md")  # load English tokenizer, POS tagger, etc. Note that we use a medium model here

word1 = nlp("king")
word2 = nlp("queen")
word3 = nlp("man")
word4 = nlp("woman")
word5 = nlp("programming")

df = pd.concat([pd.Series(word1.vector, name='king'),
                 pd.Series(word2.vector, name='queen'),
                 pd.Series(word3.vector, name='man'),
                 pd.Series(word4.vector, name='woman'),
                 pd.Series(word5.vector, name='programming')],
                axis=1).T
display(df)


We end up with a 300-dimensional vector for each word. The vectors are dense, meaning they have non-zero values in most dimensions, and they capture semantic relationships between words.

We use cosine similarity to measure the similarity between two word vectors, i.e. how similar the meanings of the words are based on their context in the training data. Cosine similarity is defined as the cosine of the angle between two vectors. It ranges from -1 (the vectors point in opposite directions), through 0 (the vectors are orthogonal, i.e. unrelated), to 1 (the vectors point in the same direction). It is calculated as follows:
$$
\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{||A|| \cdot ||B||}
$$
where:
- $A$ and $B$ are the two vectors.
- $A \cdot B$ is the dot product of the two vectors.
- $||A||$ and $||B||$ are the magnitudes (lengths) of the vectors.
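
The formula translates directly into a few lines of plain Python. This is a minimal sketch with a hypothetical helper name and hand-picked vectors, chosen so that the three boundary cases come out exactly.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([3, 4], [6, 8]))    # same direction -> 1.0
print(cosine_similarity([1, 0], [0, 1]))    # orthogonal -> 0.0
print(cosine_similarity([3, 4], [-3, -4]))  # opposite direction -> -1.0
```

Note that `[3, 4]` and `[6, 8]` score 1.0 even though one is twice as long as the other: cosine similarity only compares direction, not magnitude.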

In [None]:
print("Similarity (king–queen):", word1.similarity(word2))
print("Similarity (king–man):", word1.similarity(word3))
print("Similarity (king–woman):", word1.similarity(word4))
print("Similarity (queen–man):", word2.similarity(word3))
print("Similarity (queen–woman):", word2.similarity(word4))
print("Similarity (king–programming):", word1.similarity(word5))

### 📄 Document / sentence embeddings

A simple approach is to just average the word embeddings of all words in a sentence to get a single vector. This is known as **mean pooling**. However, more sophisticated methods like using attention mechanisms or transformers can yield better results (coming soon).
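
As a minimal sketch of mean pooling, the snippet below averages two invented word vectors dimension by dimension. The vectors and the `mean_pool` helper are assumptions for illustration only.

```python
# Invented 3-d word vectors (real ones would come from a trained model)
word_vectors = {
    "economy": [1.0, 0.0, 2.0],
    "growing": [3.0, 2.0, 0.0],
}


def mean_pool(tokens):
    """Average the word vectors of the tokens, one dimension at a time."""
    vecs = [word_vectors[t] for t in tokens]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]


pooled = mean_pool(["economy", "growing"])
print(pooled)  # [2.0, 1.0, 1.0]
```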

In [None]:
doc1 = nlp("The economy is growing.")
doc2 = nlp("The market is expanding.")
doc3 = nlp("The weather is nice.")
doc4 = nlp("The economy is shrinking.")
doc5 = nlp("Scooby Doo where are you?")

We can still produce a document-feature matrix. However, the features are now abstract dimensions of the embedding space rather than words, which makes them less directly interpretable.

In [None]:
df = pd.concat([pd.Series(doc1.vector, name='doc1'),
                 pd.Series(doc2.vector, name='doc2'),
                 pd.Series(doc3.vector, name='doc3'),
                 pd.Series(doc4.vector, name='doc4'),
                 pd.Series(doc5.vector, name='doc5')],
                axis=1).T
display(df)

Again, we can use cosine similarity to measure the similarity between two document vectors. This is useful for tasks like document clustering, where we want to group similar documents together based on their content.

In [None]:
print("Similarity (doc1–doc2):", doc1.similarity(doc2))
print("Similarity (doc1–doc3):", doc1.similarity(doc3))
print("Similarity (doc1–doc4):", doc1.similarity(doc4))
print("Similarity (doc1–doc5):", doc1.similarity(doc5))

All of these similarities are quite high, which is a side effect of mean pooling. Shared common words like "the" and "is" inflate the similarity scores, even when the sentences are not semantically similar. In addition, many unrelated sentences end up with vectors near the mean of the embedding space, so the cosine similarity between them is often still high.
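
The inflation effect can be reproduced with toy numbers. In the sketch below, the word vectors are invented and the stop words are deliberately given a large shared component, so this illustrates the mechanism rather than measuring anything about real embeddings. The helper names and the two-word stop list are assumptions.

```python
import math

# Invented 3-d vectors; the stop words share a large common component
word_vectors = {
    "the": [4.0, 4.0, 0.0],
    "is": [4.0, 3.0, 0.0],
    "economy": [1.0, 0.0, 3.0],
    "growing": [0.0, 1.0, 2.0],
    "weather": [0.0, 3.0, -2.0],
    "nice": [1.0, 2.0, -3.0],
}
STOP_WORDS = {"the", "is"}


def mean_pool(tokens):
    """Average the word vectors of the tokens, one dimension at a time."""
    vecs = [word_vectors[t] for t in tokens]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


s1 = "the economy is growing".split()
s2 = "the weather is nice".split()

# Similarity when every token is pooled, stop words included
with_stops = cosine(mean_pool(s1), mean_pool(s2))

# Similarity when stop words are dropped before pooling
without_stops = cosine(
    mean_pool([t for t in s1 if t not in STOP_WORDS]),
    mean_pool([t for t in s2 if t not in STOP_WORDS]),
)

print(round(with_stops, 2), round(without_stops, 2))
```

With the stop words included, the two unrelated sentences look similar; once they are removed, the content words dominate and the similarity drops sharply.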

A better approach is to use pre-trained models that are specifically designed to produce high-quality sentence / document embeddings. For this, we’ll use the `sentence-transformers` library, which has pre-trained models that can convert sentences to vectors. One popular model is all-MiniLM-L6-v2 – a MiniLM (small distilled transformer) that outputs 384-dimensional sentence embeddings and is fast.

In [None]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')  # Small, fast, decent quality

# List of documents
documents = [
    "The economy is growing.",
    "The market is expanding.",
    "The weather is nice.",
    "The economy is shrinking.",
    "Scooby Doo where are you?"
]

# Get embeddings
embeddings = model.encode(documents, convert_to_tensor=True)

# Check embedding shape
print(embeddings.shape)  # torch.Size([5, 384])

# Convert embeddings to a DataFrame for better visualization
df = pd.DataFrame(embeddings.cpu().numpy(), index=documents)
display(df)


As before, we can compare the document vectors using cosine similarity. This time we use scikit-learn's `cosine_similarity`, which returns the full pairwise similarity matrix in one call.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity
cosine_sim = cosine_similarity(embeddings.cpu().numpy())
# Convert to DataFrame for better visualization
cosine_sim_df = pd.DataFrame(cosine_sim, index=documents, columns=documents)
# Display the cosine similarity matrix
display(cosine_sim_df)
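
Once we have similarities, finding the nearest neighbours of a document is just a sort. The sketch below uses invented two-dimensional vectors and hypothetical document names to show the ranking step in plain Python.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented 2-d document vectors (hypothetical example data)
docs = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
    "doc_d": [-1.0, 0.2],
}

query = "doc_a"

# Score every other document against the query, then sort from most to least similar
ranked = sorted(
    ((name, cosine(docs[query], vec)) for name, vec in docs.items() if name != query),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(name, round(score, 3))
```

With a pandas similarity matrix such as `cosine_sim_df`, the same idea should be roughly `cosine_sim_df.iloc[0].sort_values(ascending=False)`, remembering that the top entry will be the document itself.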

## 🪆 Embedding models

Different embedding models are designed for different tasks. Models can vary by size / speed, language support, training data, and task suitability. Researchers and companies create a model by training on large text corpora to predict next / missing / contextual words.

It's not actually that hard to train or 'finetune' your own small model, and it can even outperform larger models on specific tasks. However, this is often not necessary as there are many pre-trained models available that are performant out of the box on a wide range of domains. spaCy and sentence_transformers have several built-in, but check out [Hugging Face's Model Hub](https://huggingface.co/) for a wide range of models. You can use the `transformers` library to load and use these models in your own code.

N.B. I'm referring to embeddings here, but the same applies to task-specific models for things like classification, translation, etc.

### 🏋️ Exercise
1. Use the cosine similarity function to find the most similar documents to the first document in the df based on TF-IDF
2. Use the cosine similarity function to find the most similar documents to the first document in the df based on embeddings
3. Compare the results of the two methods (plot as a scatter graph). Do they yield similar results? Why or why not?

## 🌯 Wrap up

We have seen how to represent text as vectors using TF-IDF and embeddings. These representations allow us to capture the meaning of the text (to different degrees) and compare documents based on their content. These representations underpin a wide range of NLP tasks.