# Text Embedding

AI models cannot process raw text directly; the text must first be converted into a numerical format. This process involves **tokenization** and **embedding**.

***

### 1. Tokenization

The text is first broken down into smaller units called **tokens** (words, sub-words, or characters) by a **tokenizer**. Each token is then assigned a unique **numerical ID** from the model's vocabulary.
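
As a rough illustration of this step, the sketch below splits text on whitespace and looks each token up in a tiny hand-made vocabulary. The vocabulary, the `[UNK]` fallback and both helper functions are made up for this example; real tokenizers typically learn a sub-word vocabulary from data rather than splitting on spaces.

```python
# Toy vocabulary: every known token maps to one integer ID (hypothetical values).
toy_vocab = {"[UNK]": 0, "uber": 1, "revenue": 2, "grew": 3, "in": 4, "2024": 5}


def toy_tokenize(text: str) -> list[str]:
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def tokens_to_ids(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Map each token to its ID, falling back to the [UNK] ID for unknown tokens."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


tokens = toy_tokenize("Uber revenue grew in 2024")
token_ids = tokens_to_ids(tokens, toy_vocab)
print(tokens)     # ['uber', 'revenue', 'grew', 'in', '2024']
print(token_ids)  # [1, 2, 3, 4, 5]
```

A token that is missing from the vocabulary, such as `"lyft"` here, would map to the `[UNK]` ID `0`.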

### 2. Embedding

These numerical IDs are passed to an **embedding model**, which generates a **vector** of floating-point numbers for each token. This vector is the **embedding**, and it captures the **semantic meaning** of the token. Tokens with similar meanings are positioned closer together in the high-dimensional embedding space.
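
At its simplest, the ID-to-vector step is a table lookup, which the sketch below shows with NumPy. The sizes and the random values are placeholders chosen for illustration; in a trained model the table is learned, and further layers usually transform the vectors before they are pooled into one vector for the whole text.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

vocab_size, embedding_dim = 6, 4  # toy sizes; real models use far larger values
# In a trained model these values are learned; here they are random placeholders.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

token_ids = np.array([1, 2, 3])              # IDs produced by a tokenizer
token_vectors = embedding_matrix[token_ids]  # one row (vector) per token

# Mean pooling: average the token vectors to get a single vector for the text.
text_vector = token_vectors.mean(axis=0)

print(token_vectors.shape)  # (3, 4)
print(text_vector.shape)    # (4,)
```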

***

### Key Points

* **Model Specificity:** Every model's embedding is unique, defined by its specific **training data and architecture**. Therefore, two different models will generate different embeddings for the exact same token.
* **Dimensionality:** The **dimension** (length) of the embedding vector is a fixed architectural parameter. It determines the **capacity** for the vector to store semantic information, but the specific **semantic information** is encoded by the values within the vector, which are learned during training.
* **Types:** Embeddings are a general concept used across different data types, including **text**, **image/video**, and **audio**. 

Learn more here: https://huggingface.co/docs/chat-ui/en/configuration/embeddings

We will be using the `all-MiniLM-L6-v2` model. Learn more here: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

## First operation: `ETL: Extract - Transform - Load`
Here, we first extract the text content from the source, then transform it, and finally load it.

Our process will be:
- Load the document
- Chunk it by paragraph, sentence, etc.
- Embed the document
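
The chunking step can be sketched with the standard library alone. The two helpers below are hypothetical and deliberately naive (the sentence splitter will break on abbreviations such as "Inc."), and the sample string is made up; they are only meant to show the difference between paragraph-level and sentence-level chunks.

```python
import re


def chunk_by_paragraph(text: str) -> list[str]:
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def chunk_by_sentence(text: str) -> list[str]:
    """Naive split: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


sample = "First sentence. Second sentence.\n\nNew paragraph here."
print(chunk_by_paragraph(sample))
# ['First sentence. Second sentence.', 'New paragraph here.']
print(chunk_by_sentence(sample))
# ['First sentence.', 'Second sentence.', 'New paragraph here.']
```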

In [2]:
from pypdf import PdfReader
import pdfplumber

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# from umap import UMAP


#### Create a PDF reader 

In [3]:
pdf_reader = pdfplumber.open("../Data/Uber-2024-Annual-Report.pdf")
len(pdf_reader.pages)

142

In [None]:
reader = PdfReader("../Data/Uber-2024-Annual-Report.pdf")
len(reader.pages)

#### Extract all the tables

In [None]:
tables = []
for i, page in enumerate(pdf_reader.pages):
    table_content = page.extract_tables()
    if table_content:
        tables.append({
            "type": "table",
            "page": f"{i+1}",
            "content": table_content
        })

tables[0]


{'type': 'table',
 'page': '51',
 'content': [[['October 1, 2024 to October 31,', '3,605', '', '3,605', ''],
   [None, None, '$ 76.56', None, '$ 6,024'],
   ['2024', None, None, None, None],
   ['November 1, 2024 to November', '2,024', '', '2,024', ''],
   [None, None, '$ 70.77', None, '$ 5,881'],
   ['30, 2024', None, None, None, None],
   ['December 1, 2024 to December', '2,032', '', '2,032', ''],
   [None, None, '$ 64.34', None, '$ 5,750'],
   ['31, 2024', None, None, None, None],
   ['Total', '7,661', '', '7,661', '']]]}
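
Extracted tables arrive as nested lists with `None` and empty-string cells, so they usually need flattening before they can be embedded as text. The helper below is one possible approach, not part of `pdfplumber`; `table_to_text` is a hypothetical name, and the sample rows are copied from the output above.

```python
from typing import Optional


def table_to_text(table: list[list[Optional[str]]]) -> str:
    """Join the non-empty cells of each row with ' | ', one row per line."""
    lines = []
    for row in table:
        cells = [cell.strip() for cell in row if cell and cell.strip()]
        if cells:
            lines.append(" | ".join(cells))
    return "\n".join(lines)


# First rows of the table shown in the output above.
sample_table = [
    ["October 1, 2024 to October 31,", "3,605", "", "3,605", ""],
    [None, None, "$ 76.56", None, "$ 6,024"],
    ["2024", None, None, None, None],
]
flattened = table_to_text(sample_table)
print(flattened)
```

Note that this keeps the rows in extraction order, so a label split across rows (such as the date range here) stays split.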

#### Extract all the text
- Chunking is at page level (each page becomes one chunk)

In [6]:
text_content = []
document_name = "".join(pdf_reader.stream.name.split("/")[-1].split(".")[:-1])

for i, page in enumerate(pdf_reader.pages):
    text_page = page.extract_text()
    text_content.append({
        "type" : "text",
        "document": document_name,
        "page": f"{i+1}",
        "content": text_page
    })

text_content[0]

{'type': 'text',
 'document': 'Uber-2024-Annual-Report',
 'page': '1',
 'content': 'On Our Way\n2024 ANNUAL REPORT'}

#### Create dataFrame from text content
- content - str
- Metadata - pageNo, documentName

In [7]:
text_doc = pd.DataFrame(text_content)
text_doc

Unnamed: 0,type,document,page,content
0,text,Uber-2024-Annual-Report,1,On Our Way\n2024 ANNUAL REPORT
1,text,Uber-2024-Annual-Report,2,Uber’s Mission\nWe reimagine the way the world...
2,text,Uber-2024-Annual-Report,3,UNITED STATES\nSECURITIES AND EXCHANGE COMMISS...
3,text,Uber-2024-Annual-Report,4,Large accelerated filer ☒ Accelerated filer ☐\...
4,text,Uber-2024-Annual-Report,5,"UBER TECHNOLOGIES, INC.\nTABLE OF CONTENTS\nPa..."
...,...,...,...,...
137,text,Uber-2024-Annual-Report,138,"KNOW ALL PERSONS BY THESE PRESENTS, that each ..."
138,text,Uber-2024-Annual-Report,139,Alexander Wynaendts\n135
139,text,Uber-2024-Annual-Report,140,[THIS PAGE INTENTIONALLY LEFT BLANK]
140,text,Uber-2024-Annual-Report,141,Board of Directors Officers Stock Exchange\nUb...


In [8]:
text_doc["MetaData"] = text_doc.apply(lambda x: {"Document": x["document"], "Page": x["page"], "Type": x["type"]}, axis=1)
text_doc

Unnamed: 0,type,document,page,content,MetaData
0,text,Uber-2024-Annual-Report,1,On Our Way\n2024 ANNUAL REPORT,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
1,text,Uber-2024-Annual-Report,2,Uber’s Mission\nWe reimagine the way the world...,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
2,text,Uber-2024-Annual-Report,3,UNITED STATES\nSECURITIES AND EXCHANGE COMMISS...,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
3,text,Uber-2024-Annual-Report,4,Large accelerated filer ☒ Accelerated filer ☐\...,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
4,text,Uber-2024-Annual-Report,5,"UBER TECHNOLOGIES, INC.\nTABLE OF CONTENTS\nPa...","{'Document': 'Uber-2024-Annual-Report', 'Page'..."
...,...,...,...,...,...
137,text,Uber-2024-Annual-Report,138,"KNOW ALL PERSONS BY THESE PRESENTS, that each ...","{'Document': 'Uber-2024-Annual-Report', 'Page'..."
138,text,Uber-2024-Annual-Report,139,Alexander Wynaendts\n135,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
139,text,Uber-2024-Annual-Report,140,[THIS PAGE INTENTIONALLY LEFT BLANK],"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
140,text,Uber-2024-Annual-Report,141,Board of Directors Officers Stock Exchange\nUb...,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."


In [9]:
text_doc = text_doc.drop(["type", "document", "page"], axis=1)
text_doc

Unnamed: 0,content,MetaData
0,On Our Way\n2024 ANNUAL REPORT,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
1,Uber’s Mission\nWe reimagine the way the world...,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
2,UNITED STATES\nSECURITIES AND EXCHANGE COMMISS...,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
3,Large accelerated filer ☒ Accelerated filer ☐\...,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
4,"UBER TECHNOLOGIES, INC.\nTABLE OF CONTENTS\nPa...","{'Document': 'Uber-2024-Annual-Report', 'Page'..."
...,...,...
137,"KNOW ALL PERSONS BY THESE PRESENTS, that each ...","{'Document': 'Uber-2024-Annual-Report', 'Page'..."
138,Alexander Wynaendts\n135,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
139,[THIS PAGE INTENTIONALLY LEFT BLANK],"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
140,Board of Directors Officers Stock Exchange\nUb...,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."


## Text Embedding
- Create the model
- Pass the content to embed

## Types of Embedding
- `Dense Embedding`: Captures semantic meaning. The vector is relatively low-dimensional (typically around 100 to 1024 dimensions) and most of its values are non-zero.
- `Sparse Embedding`: Captures word-occurrence statistics. The vector is very high-dimensional and mostly zeros. E.g. one-hot encoding, TF-IDF, bag-of-words.
- `Quantised Embedding`: Real-valued vectors are compressed using fewer bits per value. Commonly used in FAISS.
- `Binary Embedding`: Each element takes one of two values (0/1 or -1/+1). It can be built using locality-sensitive hashing (LSH).
- `MultiModal Embedding`: Maps different modalities, such as text and images, into a shared vector space.
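
To make a few of these types concrete, the sketch below builds a bag-of-words count vector and then derives binary and 8-bit versions of a small dense vector. All values are invented for illustration. The binary step uses plain sign thresholding rather than LSH, and the quantisation is simple scalar scaling that assumes values in [-1, 1], which is not necessarily how FAISS does it.

```python
import numpy as np

# Sparse: bag-of-words counts over a fixed vocabulary.
vocabulary = ["cost", "driver", "growth", "revenue", "rider", "trip"]
document = "revenue growth and revenue per trip"
words = document.split()
sparse_vector = np.array([words.count(term) for term in vocabulary])
print(sparse_vector.tolist())  # [0, 0, 1, 2, 0, 1]

# Dense vector with made-up values.
dense_vector = np.array([0.12, -0.40, 0.03, -0.07])

# Binary: keep only the sign of each dimension (1 if positive, else 0).
binary_vector = (dense_vector > 0).astype(np.uint8)
print(binary_vector.tolist())  # [1, 0, 1, 0]

# Scalar quantisation: scale floats in [-1, 1] to 8-bit integers.
quantised_vector = np.round(dense_vector * 127).astype(np.int8)
print(quantised_vector.tolist())  # [15, -51, 4, -9]
```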

Model - `all-MiniLM-L6-v2` <br/>
Dim - `384`

In [10]:
model_name = "all-MiniLM-L6-v2"
embedding_model = SentenceTransformer(model_name)
only_text = text_doc["content"].tolist()
embeddings = embedding_model.encode(only_text)

# embeddings[0]

In [11]:
len(embeddings[0]), embeddings[0]

(384,
 array([-3.97824645e-02,  2.67725661e-02,  6.64736563e-03,  5.84164560e-02,
         9.89165250e-03,  3.31628025e-02, -9.03214291e-02, -2.95946933e-02,
        -4.26093265e-02, -2.09719799e-02, -4.05032262e-02,  8.99754092e-02,
         2.43941396e-02,  1.09723127e-02, -8.02493244e-02,  5.96309416e-02,
        -1.86618492e-02, -5.57510816e-02,  1.24213481e-02, -3.63436192e-02,
         3.27867605e-02,  8.62288550e-02,  1.85550246e-02,  2.59517692e-02,
        -1.16655827e-01,  9.22764186e-03, -4.61959839e-02,  8.64057150e-03,
        -3.97837497e-02,  1.66031700e-02, -2.92449482e-02,  7.76578262e-02,
        -1.07767312e-02,  1.32794213e-02, -3.23496833e-02,  1.77402068e-02,
         6.50395527e-02,  9.64716449e-03,  1.01716906e-01, -1.90990996e-02,
         3.27548124e-02, -9.78315771e-02,  7.64240623e-02,  3.75914387e-02,
        -4.14180085e-02, -2.05772347e-03, -2.98356805e-02, -9.65187326e-03,
         1.88575927e-02,  1.09276094e-01, -7.35093653e-02,  3.57681252e-02,
      

## Query search
- Ask a query
- Convert the query to an embedding
- Compute the cosine similarity between the query and the documents

In [12]:
query = "what is Ubers revenue?"
query_emd = embedding_model.encode([query])
top_k = 3

#### Cosine similarity
- Considers only the direction of the vectors, not their magnitude
- Many sentence-embedding models are trained with this metric
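
For reference, cosine similarity is the dot product divided by the product of the vector lengths. The small NumPy sketch below uses made-up vectors to show that scaling a vector does not change the score.

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (||a|| * ||b||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 3.0])
same_direction = 10 * a                  # longer vector, same direction
orthogonal = np.array([3.0, 0.0, -1.0])  # a . orthogonal = 3 + 0 - 3 = 0

print(round(cosine_sim(a, same_direction), 6))  # 1.0
print(round(cosine_sim(a, orthogonal), 6))      # 0.0
```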

In [13]:
similarities = cosine_similarity(query_emd, embeddings)

In [15]:
len(similarities[0])

142

In [None]:
similarities[0][4]

np.float32(0.4728542)

Use `argmax` to get the most similar item from the similarity scores.

In [21]:
best_match_indexes = [np.argmax(item) for item in similarities]
print(best_match_indexes)

best_match_index = best_match_indexes[0]

best_match_content = text_doc.loc[best_match_index, "content"]
best_match_metadata = text_doc.loc[best_match_index, "MetaData"]

[np.int64(78)]


In [23]:
print(f"Search Query: '{query}'")
print("-" * 30)
print(f"Best Match Index: {best_match_index}")
print(f"Similarity Score: {similarities[0][best_match_index]:.4f}")
print(f"Content: '{best_match_content[:1000]}'")
print(f"Metadata: {best_match_metadata}")

Search Query: 'what is Ubers revenue?'
------------------------------
Best Match Index: 78
Similarity Score: 0.6940
Content: 'UBER TECHNOLOGIES, INC.
CONSOLIDATED STATEMENTS OF OPERATIONS
(In millions, except share amounts which are reflected in thousands, and per share amounts)
Year Ended December 31,
2022 2023 2024
Revenue $ 31,877 $ 37,281 $ 43,978
Costs and expenses
Cost of revenue, exclusive of depreciation and amortization shown separately below 19,659 22,457 26,651
Operations and support 2,413 2,689 2,732
Sales and marketing 4,756 4,356 4,337
Research and development 2,798 3,164 3,109
General and administrative 3,136 2,682 3,639
Depreciation and amortization 947 823 711
Total costs and expenses 33,709 36,171 41,179
Income (loss) from operations (1,832) 1,110 2,799
Interest expense (565) (633) (523)
Other income (expense), net (7,029) 1,844 1,849
Income (loss) before income taxes and income (loss) from equity method investments (9,426) 2,321 4,125
Provision for (benefit from) inc

_____
Getting the top k results from the similarity search
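
The `argmax` and `argsort` patterns used in this notebook are easier to follow on a small array. The scores below are made up for illustration.

```python
import numpy as np

scores = np.array([0.10, 0.69, 0.47, 0.05, 0.52])  # made-up similarity scores
top_k = 3

best_index = int(np.argmax(scores))               # index of the single best match
top_k_indices = np.argsort(scores)[::-1][:top_k]  # sort ascending, reverse, keep first k

print(best_index)              # 1
print(top_k_indices.tolist())  # [1, 4, 2]
```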

In [25]:
top_k_indices = np.argsort(similarities[0])[::-1][:top_k]
print(f"Search Query: '{query}'")
print(f"Top {top_k} most relevant pages:")
print("-" * 40)

for idx in top_k_indices:
    content = text_doc.loc[idx, 'content']
    metadata = text_doc.loc[idx, 'MetaData']
    score = similarities[0][idx]
    
    print(f"Index: {idx} | Score: {score:.4f}")
    print(f"Content: '{content[:1000]}'")
    print(f"Metadata: {metadata}\n")

Search Query: 'what is Ubers revenue?'
Top 3 most relevant pages:
----------------------------------------
Index: 78 | Score: 0.6940
Content: 'UBER TECHNOLOGIES, INC.
CONSOLIDATED STATEMENTS OF OPERATIONS
(In millions, except share amounts which are reflected in thousands, and per share amounts)
Year Ended December 31,
2022 2023 2024
Revenue $ 31,877 $ 37,281 $ 43,978
Costs and expenses
Cost of revenue, exclusive of depreciation and amortization shown separately below 19,659 22,457 26,651
Operations and support 2,413 2,689 2,732
Sales and marketing 4,756 4,356 4,337
Research and development 2,798 3,164 3,109
General and administrative 3,136 2,682 3,639
Depreciation and amortization 947 823 711
Total costs and expenses 33,709 36,171 41,179
Income (loss) from operations (1,832) 1,110 2,799
Interest expense (565) (633) (523)
Other income (expense), net (7,029) 1,844 1,849
Income (loss) before income taxes and income (loss) from equity method investments (9,426) 2,321 4,125
Provision for (

#### Dot Product
- Some models are trained to produce a higher magnitude for more relevant text
- Considers both the magnitude and the direction of the vectors
- Use only with models trained for dot-product similarity
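
The difference from cosine similarity shows up when vector lengths vary. In the sketch below (made-up 2-D vectors), the longer document wins on dot product even though the shorter one points closer to the query. Once vectors are L2-normalised, the two measures coincide.

```python
import numpy as np

query = np.array([1.0, 0.0])
doc_short = np.array([0.9, 0.1])  # well aligned with the query, small magnitude
doc_long = np.array([3.0, 3.0])   # less aligned, large magnitude


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def unit(v: np.ndarray) -> np.ndarray:
    """Scale a vector to length 1 (L2 normalisation)."""
    return v / np.linalg.norm(v)


# Dot product prefers the long document; cosine prefers the aligned one.
print(float(query @ doc_short) < float(query @ doc_long))    # True
print(cosine(query, doc_short) > cosine(query, doc_long))    # True

# After normalisation, dot product and cosine similarity agree.
print(bool(np.isclose(unit(query) @ unit(doc_long), cosine(query, doc_long))))  # True
```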

In [27]:
# Calculate the dot product between the query and all documents.
# We use .T to transpose the document embeddings matrix for matrix multiplication.
similarities_dot_mul = np.dot(query_emd, embeddings.T)

# Get the indices of the top_k most similar documents (same as before)
top_k_indices = np.argsort(similarities_dot_mul[0])[::-1][:top_k]

# Retrieve and display the results
print(f"Search Query: '{query}'")
print(f"Top {top_k} most relevant pages (using Dot Product):")
print("-" * 50)

for idx in top_k_indices:
    content = text_doc.loc[idx, 'content']
    metadata = text_doc.loc[idx, 'MetaData']
    score = similarities_dot_mul[0][idx]
    
    print(f"Index: {idx} | Score: {score:.4f}")
    print(f"Content: '{content[:1000]}'")
    print(f"Metadata: {metadata}\n")

Search Query: 'what is Ubers revenue?'
Top 3 most relevant pages (using Dot Product):
--------------------------------------------------
Index: 78 | Score: 0.6940
Content: 'UBER TECHNOLOGIES, INC.
CONSOLIDATED STATEMENTS OF OPERATIONS
(In millions, except share amounts which are reflected in thousands, and per share amounts)
Year Ended December 31,
2022 2023 2024
Revenue $ 31,877 $ 37,281 $ 43,978
Costs and expenses
Cost of revenue, exclusive of depreciation and amortization shown separately below 19,659 22,457 26,651
Operations and support 2,413 2,689 2,732
Sales and marketing 4,756 4,356 4,337
Research and development 2,798 3,164 3,109
General and administrative 3,136 2,682 3,639
Depreciation and amortization 947 823 711
Total costs and expenses 33,709 36,171 41,179
Income (loss) from operations (1,832) 1,110 2,799
Interest expense (565) (633) (523)
Other income (expense), net (7,029) 1,844 1,849
Income (loss) before income taxes and income (loss) from equity method investments (9,42

### Visualise the Embedding in clusters

In [28]:
from sklearn.decomposition import PCA
pca_reducer = PCA(n_components=2)
embeddings_2d = pca_reducer.fit_transform(embeddings)
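
Conceptually, PCA centres the data and projects it onto the directions of greatest variance. The NumPy sketch below does this with an SVD on random toy data. It is a simplified illustration, not scikit-learn's implementation, and the signs of the components may differ from what `PCA` returns.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(10, 5))  # 10 toy "embeddings" with 5 dimensions

X_centered = X - X.mean(axis=0)  # PCA works on mean-centred data
# Rows of Vt are the principal directions, ordered by explained variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T     # project onto the first two directions

print(X_2d.shape)  # (10, 2)
```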

In [29]:
from sklearn.cluster import KMeans

# Apply K-means clustering
num_clusters = 3
kmeans = KMeans(n_clusters=num_clusters)
cluster_labels = kmeans.fit_predict(embeddings_2d)


In [30]:
# Add cluster labels as a new column to the DataFrame
text_doc['cluster_label'] = cluster_labels

In [31]:
text_doc["cluster_label"].value_counts()

cluster_label
0    82
1    31
2    29
Name: count, dtype: int64
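
For intuition about what `KMeans` does, here is a bare-bones version of the assign-then-update loop in NumPy. The function name, the naive "first k points" initialisation and the four sample points are assumptions made for this sketch; scikit-learn uses a smarter initialisation and stopping rule.

```python
import numpy as np


def kmeans_sketch(points: np.ndarray, k: int, n_iter: int = 10):
    """Minimal Lloyd's algorithm: assign points to the nearest centroid, then update centroids."""
    centroids = points[:k].copy()  # naive init: first k points (keeps the example deterministic)
    for _ in range(n_iter):
        # Distance from every point to every centroid: shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # skip empty clusters
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids


points = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels.tolist())     # [0, 1, 0, 1]
print(centroids.tolist())  # [[0.0, 0.5], [10.0, 10.5]]
```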

In [32]:
import plotly.express as px
import textwrap


def wrap_text(text, width=50):
    """Inserts <br> tags into text for Plotly line wrapping."""
    wrapped_lines = textwrap.wrap(text, width=width)
    return '<br>'.join(wrapped_lines)

# Create a DataFrame for the data
cluster_df = pd.DataFrame({
    'x': embeddings_2d[:, 0],
    'y': embeddings_2d[:, 1],
    'label': cluster_labels,
    'sentence': text_doc['content']
})

# Create an interactive scatter plot using plotly
fig = px.scatter(
    cluster_df,
    x='x', y='y',
    color='label',
    hover_name='sentence',
    title='Uber 2024 annual report',
    labels={'label': 'Cluster'},
    width=800,  # Adjust the width as desired
    height=600,  # Adjust the height as desired
)

fig.update_traces(
    marker=dict(size=8)  # Adjust the size value as needed
)

# Set the plot background color to white
fig.update_layout(
    plot_bgcolor='white',
)

fig.show()