AI models cannot directly process raw text; they must convert it into a numerical format. This process involves **tokenization** and **embedding**.

***

### 1. Tokenization

The text is first broken down into smaller units called **tokens** (words, sub-words, or characters) by a **tokenizer**. Each token is then assigned a unique **numerical ID** from the model's vocabulary.
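
To make the token-to-ID step concrete, here is a minimal sketch using a toy whitespace tokenizer and a hand-built vocabulary. The names `toy_vocab`, `toy_tokenize`, and `tokens_to_ids` are hypothetical and exist only for illustration; real tokenizers usually split text into sub-words and use a much larger vocabulary.

```python
# Toy vocabulary: every known token maps to a unique integer ID.
# "[UNK]" is a stand-in for any token that is not in the vocabulary.
toy_vocab = {"[UNK]": 0, "uber": 1, "revenue": 2, "grew": 3, "in": 4, "2024": 5}


def toy_tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def tokens_to_ids(tokens, vocab):
    """Look up each token's ID, falling back to the [UNK] ID for unknown tokens."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


tokens = toy_tokenize("Uber revenue grew in 2024")
token_ids = tokens_to_ids(tokens, toy_vocab)
print(tokens)
print(token_ids)
```

Any word missing from the vocabulary collapses to the same `[UNK]` ID in this sketch, which is one reason production tokenizers prefer sub-word units.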

### 2. Embedding

These numerical IDs are passed to an **embedding model**, which generates a **vector** of floating-point numbers for each token. This vector is the **embedding**, and it captures the **semantic meaning** of the token. Tokens with similar meanings are positioned closer together in the high-dimensional embedding space.
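
The ID-to-vector step can be pictured as a row lookup in a table. The sketch below assumes a toy vocabulary of 6 tokens and 4-dimensional vectors, and fills the table with random placeholder values; in a trained model these values are learned.

```python
import numpy as np

# Assumed toy sizes: a 6-token vocabulary and 4-dimensional embeddings.
vocab_size, embedding_dim = 6, 4

# Random placeholders standing in for learned embedding values.
rng = np.random.default_rng(seed=0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

token_ids = [1, 2, 3]                        # IDs produced by a tokenizer
token_vectors = embedding_table[token_ids]   # one row (vector) per token ID

print(token_vectors.shape)
```

A sentence-embedding model typically goes further: it passes these token vectors through its network layers and pools them into a single vector per input text.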

***

### Key Points

* **Model Specificity:** Every model's embedding is unique, defined by its specific **training data and architecture**. Therefore, two different models will generate different embeddings for the exact same token.
* **Dimensionality:** The **dimension** (length) of the embedding vector is a fixed architectural parameter. It determines the **capacity** for the vector to store semantic information, but the specific **semantic information** is encoded by the values within the vector, which are learned during training.
* **Types:** Embeddings are a general concept used across different data types, including **text**, **image/video**, and **audio**. 

Learn more here: https://huggingface.co/docs/chat-ui/en/configuration/embeddings

We will be using the `all-MiniLM-L6-v2` model. Learn more here: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

## First operation: `ETL: Extract - Transform - Load`
Here, we first extract the text content from the source, then transform it, and finally load it.

Our process will be:
- Load the document
- Chunk it by paragraph, by sentence, etc.
- Embed the document
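
As a rough illustration of the chunking step, the sketch below splits a string by paragraph and by sentence using regular expressions. The helper names and the sample text are made up for illustration, and the sentence splitter is deliberately naive (it will mis-split abbreviations and decimals).

```python
import re


def chunk_by_paragraph(text):
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def chunk_by_sentence(text):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


# Made-up sample text with two paragraphs and three sentences.
sample = "First sentence. Second sentence.\n\nA new paragraph starts here."

paragraphs = chunk_by_paragraph(sample)
sentences = chunk_by_sentence(sample)
print(len(paragraphs), len(sentences))
```

Smaller chunks give more precise matches but carry less context per chunk, so the right granularity depends on the queries you expect.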

In [1]:
from pypdf import PdfReader
import pdfplumber
import pandas as pd
import numpy as np

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

from umap import UMAP
import matplotlib.pyplot as plt

#### Create a PDF reader 

In [2]:
pdf_reader = pdfplumber.open("../Data/Uber-2024-Annual-Report.pdf")
len(pdf_reader.pages)

142

In [3]:
reader = PdfReader("../Data/Uber-2024-Annual-Report.pdf")
len(reader.pages)

142

#### Extract all the tables

In [4]:
tables = []
for i, page in enumerate(pdf_reader.pages):
    table_content = page.extract_tables()
    if table_content:
        tables.append({
            "type": "table",
            "page": f"{i+1}",
            "content": table_content
        })
tables[0]


{'type': 'table',
 'page': '51',
 'content': [[['October 1, 2024 to October 31,', '3,605', '', '3,605', ''],
   [None, None, '$ 76.56', None, '$ 6,024'],
   ['2024', None, None, None, None],
   ['November 1, 2024 to November', '2,024', '', '2,024', ''],
   [None, None, '$ 70.77', None, '$ 5,881'],
   ['30, 2024', None, None, None, None],
   ['December 1, 2024 to December', '2,032', '', '2,032', ''],
   [None, None, '$ 64.34', None, '$ 5,750'],
   ['31, 2024', None, None, None, None],
   ['Total', '7,661', '', '7,661', '']]]}
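
The raw output above mixes `None`, empty strings, and rows split across lines, which is awkward to embed as text. Below is a minimal sketch of one way to flatten a table into text lines. `table_to_lines` is a hypothetical helper, not part of pdfplumber, and it does not attempt to re-join cells that the extractor split across rows.

```python
def table_to_lines(table):
    """Join the non-empty cells of each row with ' | ', skipping rows with no text."""
    lines = []
    for row in table:
        cells = [cell.strip() for cell in row if cell and cell.strip()]
        if cells:
            lines.append(" | ".join(cells))
    return lines


# Two rows copied from the output above, in the list-of-rows shape of one table.
sample_table = [
    ['October 1, 2024 to October 31,', '3,605', '', '3,605', ''],
    [None, None, '$ 76.56', None, '$ 6,024'],
]

lines = table_to_lines(sample_table)
print(lines)
```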

#### Extract all the text
- Chunking is at page level: each page of the document becomes one chunk

In [5]:
text_content = []
document_name = "".join(pdf_reader.stream.name.split("/")[-1].split(".")[:-1])
for i, page in enumerate(reader.pages):
    text_page = page.extract_text()
    text_content.append({
        "type" : "text",
        "document": document_name,
        "page": f"{i+1}",
        "content": text_page
    })

text_content[0]

{'type': 'text',
 'document': 'Uber-2024-Annual-Report',
 'page': '1',
 'content': 'On Our Way\n2024 ANNUAL REPORT'}

#### Create a DataFrame from the text content
- content - str
- Metadata - pageNo, documentName

In [6]:
text_doc = pd.DataFrame(text_content)
text_doc

Unnamed: 0,type,document,page,content
0,text,Uber-2024-Annual-Report,1,On Our Way\n2024 ANNUAL REPORT
1,text,Uber-2024-Annual-Report,2,Uber’s Mission\nWe reimagine the way the world...
2,text,Uber-2024-Annual-Report,3,\n \nUNITED STATES \nSECURITIES AND EXCHANGE ...
3,text,Uber-2024-Annual-Report,4,\n \nLarge accelerated filer ☒ Accelerated ...
4,text,Uber-2024-Annual-Report,5,"\n1 \nUBER TECHNOLOGIES, INC. \nTABLE OF CONT..."
...,...,...,...,...
137,text,Uber-2024-Annual-Report,138,"\n134 \nKNOW ALL PERSONS BY THESE PRESENTS, t..."
138,text,Uber-2024-Annual-Report,139,\n135 \nAlexander Wynaendts \n
139,text,Uber-2024-Annual-Report,140,[THIS PAGE INTENTIONALL Y LEFT BLANK]
140,text,Uber-2024-Annual-Report,141,Board of Directors\nRonald Sugar\nChairperson ...


In [None]:
text_doc["MetaData"] = text_doc.apply(lambda x: {"Document": x["document"], "Page": x["page"], "Type": x["type"]}, axis=1)
text_doc

Unnamed: 0,type,document,page,content,MetaData
0,text,Uber-2024-Annual-Report,1,On Our Way\n2024 ANNUAL REPORT,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
1,text,Uber-2024-Annual-Report,2,Uber’s Mission\nWe reimagine the way the world...,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
2,text,Uber-2024-Annual-Report,3,\n \nUNITED STATES \nSECURITIES AND EXCHANGE ...,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
3,text,Uber-2024-Annual-Report,4,\n \nLarge accelerated filer ☒ Accelerated ...,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
4,text,Uber-2024-Annual-Report,5,"\n1 \nUBER TECHNOLOGIES, INC. \nTABLE OF CONT...","{'Document': 'Uber-2024-Annual-Report', 'Page'..."
...,...,...,...,...,...
137,text,Uber-2024-Annual-Report,138,"\n134 \nKNOW ALL PERSONS BY THESE PRESENTS, t...","{'Document': 'Uber-2024-Annual-Report', 'Page'..."
138,text,Uber-2024-Annual-Report,139,\n135 \nAlexander Wynaendts \n,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
139,text,Uber-2024-Annual-Report,140,[THIS PAGE INTENTIONALL Y LEFT BLANK],"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
140,text,Uber-2024-Annual-Report,141,Board of Directors\nRonald Sugar\nChairperson ...,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."


In [8]:
text_doc = text_doc.drop(["type", "document", "page"], axis=1)
text_doc

Unnamed: 0,content,MetaData
0,On Our Way\n2024 ANNUAL REPORT,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
1,Uber’s Mission\nWe reimagine the way the world...,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
2,\n \nUNITED STATES \nSECURITIES AND EXCHANGE ...,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
3,\n \nLarge accelerated filer ☒ Accelerated ...,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
4,"\n1 \nUBER TECHNOLOGIES, INC. \nTABLE OF CONT...","{'Document': 'Uber-2024-Annual-Report', 'Page'..."
...,...,...
137,"\n134 \nKNOW ALL PERSONS BY THESE PRESENTS, t...","{'Document': 'Uber-2024-Annual-Report', 'Page'..."
138,\n135 \nAlexander Wynaendts \n,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
139,[THIS PAGE INTENTIONALL Y LEFT BLANK],"{'Document': 'Uber-2024-Annual-Report', 'Page'..."
140,Board of Directors\nRonald Sugar\nChairperson ...,"{'Document': 'Uber-2024-Annual-Report', 'Page'..."


## Text Embedding
- Create the model
- Pass the content to embed

## Types of Embedding
- `Dense Embedding`: Captures semantic meaning. The vector is comparatively low-dimensional (typically around 100 to 1024 dimensions) and most values are non-zero.
- `Sparse Embedding`: Captures word-occurrence statistics such as counts. The vector dimension is very high (often the vocabulary size) and most values are zero.
Eg: One-Hot Encoding, TF-IDF, Bag of Words.
- `Quantised Embedding`: Real-valued vectors are compressed using fewer bits per value. Commonly used in vector-search libraries such as FAISS.
- `Binary Embedding`: Each element takes one of two values, either 0/1 or -1/+1. It can be built using techniques such as locality-sensitive hashing (LSH).
- `MultiModal Embedding`: Maps more than one modality, such as text and images, into a shared vector space.

Model - `all-MiniLM-L6-v2` <br/>
Dim - `384`

In [9]:
model_name = "all-MiniLM-L6-v2"
embedding_model = SentenceTransformer(model_name)
only_text = text_doc["content"].tolist()
embeddings = embedding_model.encode(only_text)

# embeddings[0]

In [10]:
len(embeddings[0])

384

## Query search
- Ask a query
- Convert the query to an embedding
- Compute the cosine similarity between the query and the document embeddings

In [11]:
query = "what is Ubers revenue?"
query_emd = embedding_model.encode([query])
top_k = 3

#### Cosine similarity
- Considers only the direction of the vectors, not their magnitude
- Most text-embedding models are trained with this measure
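
The direction-only behaviour can be checked by hand. The sketch below computes cosine similarity directly from its definition, dot(a, b) / (|a| * |b|), on small made-up vectors; `cosine_sim` is an illustrative helper, separate from scikit-learn's `cosine_similarity`.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])      # same direction as a, twice the length
c = np.array([-1.0, -2.0, -3.0])   # opposite direction to a

print(cosine_sim(a, b))   # close to 1: scaling a vector does not change the score
print(cosine_sim(a, c))   # close to -1: opposite directions
```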

In [None]:
similarities = cosine_similarity(query_emd, embeddings)

Use `argmax` to get the index of the most similar page from the similarity scores.

In [None]:
best_match_indexes = [np.argmax(item) for item in similarities]
print(best_match_indexes)

best_match_index = best_match_indexes[0]

best_match_content = text_doc.loc[best_match_index, "content"]
best_match_metadata = text_doc.loc[best_match_index, "MetaData"]

[np.int64(78)]

In [13]:
print(f"Search Query: '{query}'")
print("-" * 30)
print(f"Best Match Index: {best_match_index}")
print(f"Similarity Score: {similarities[0][best_match_index]:.4f}")
print(f"Content: '{best_match_content[:50]}'")
print(f"Metadata: {best_match_metadata}")

Search Query: 'what is Ubers revenue?'
------------------------------
Best Match Index: 78
Similarity Score: 0.6778
Content: ' 
75 
UBER TECHNOLOGIES, INC. 
CONSOLIDATED STATEM'
Metadata: {'Document': 'Uber-2024-Annual-Report', 'Page': '79', 'Yype': 'text'}


_____
Getting the top k results from the similarity search

In [14]:
top_k_indices = np.argsort(similarities[0])[::-1][:top_k]
print(f"Search Query: '{query}'")
print(f"Top {top_k} most relevant pages:")
print("-" * 40)

for idx in top_k_indices:
    content = text_doc.loc[idx, 'content']
    metadata = text_doc.loc[idx, 'MetaData']
    score = similarities[0][idx]
    
    print(f"Index: {idx} | Score: {score:.4f}")
    print(f"Content: '{content[:100]}'")
    print(f"Metadata: {metadata}\n")

Search Query: 'what is Ubers revenue?'
Top 3 most relevant pages:
----------------------------------------
Index: 78 | Score: 0.6778
Content: ' 
75 
UBER TECHNOLOGIES, INC. 
CONSOLIDATED STATEMENTS OF OPERATIONS 
(In millions, except share amo'
Metadata: {'Document': 'Uber-2024-Annual-Report', 'Page': '79', 'Yype': 'text'}

Index: 97 | Score: 0.6653
Content: ' 
94 
15, 2026, and interim periods within fiscal years beginning afte r December 15, 2027. Early ad'
Metadata: {'Document': 'Uber-2024-Annual-Report', 'Page': '98', 'Yype': 'text'}

Index: 57 | Score: 0.6225
Content: ' 
54 
The following table sets forth the components of our consolidated statements of operations for'
Metadata: {'Document': 'Uber-2024-Annual-Report', 'Page': '58', 'Yype': 'text'}



#### Dot Product
- Some models are trained to produce higher-magnitude vectors for more relevant text
- Considers both the magnitude and the direction of the vectors
- Use only with models trained for dot-product similarity
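
The difference from cosine similarity shows up only when vector lengths differ. The sketch below uses made-up 2-D vectors (`q_vec`, `short_doc`, and `long_doc` are illustrative names) to show the raw dot product favouring a longer vector, and the ranking flipping once everything is scaled to unit length.

```python
import numpy as np


def l2_normalise(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


q_vec = np.array([1.0, 0.0])
short_doc = np.array([1.0, 0.0])   # same direction as the query
long_doc = np.array([3.0, 3.0])    # different direction, larger magnitude

# Raw dot product rewards magnitude: long_doc outranks short_doc.
dot_short = float(np.dot(q_vec, short_doc))
dot_long = float(np.dot(q_vec, long_doc))

# After L2 normalisation, the dot product equals cosine similarity.
cos_short = float(np.dot(l2_normalise(q_vec), l2_normalise(short_doc)))
cos_long = float(np.dot(l2_normalise(q_vec), l2_normalise(long_doc)))

print(dot_short, dot_long)
print(cos_short, cos_long)
```

If a model already returns unit-length embeddings (worth checking with `np.linalg.norm`), the dot product and cosine similarity give the same scores and the same ranking.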

In [15]:
# Calculate the dot product between the query and all documents.
# .T transposes the document embeddings matrix for matrix multiplication.
similarities_dot_mul = np.dot(query_emd, embeddings.T)

# Get the indices of the top_k most similar documents (same as before)
top_k_indices = np.argsort(similarities_dot_mul[0])[::-1][:top_k]

# Retrieve and display the results
print(f"Search Query: '{query}'")
print(f"Top {top_k} most relevant pages (using Dot Product):")
print("-" * 50)

for idx in top_k_indices:
    content = text_doc.loc[idx, 'content']
    metadata = text_doc.loc[idx, 'MetaData']
    score = similarities_dot_mul[0][idx]  # report the dot-product score, not the cosine score
    
    print(f"Index: {idx} | Score: {score:.4f}")
    print(f"Content: '{content[:100]}'")
    print(f"Metadata: {metadata}\n")

Search Query: 'what is Ubers revenue?'
Top 3 most relevant pages (using Dot Product):
--------------------------------------------------
Index: 78 | Score: 0.6778
Content: ' 
75 
UBER TECHNOLOGIES, INC. 
CONSOLIDATED STATEMENTS OF OPERATIONS 
(In millions, except share amo'
Metadata: {'Document': 'Uber-2024-Annual-Report', 'Page': '79', 'Yype': 'text'}

Index: 97 | Score: 0.6653
Content: ' 
94 
15, 2026, and interim periods within fiscal years beginning afte r December 15, 2027. Early ad'
Metadata: {'Document': 'Uber-2024-Annual-Report', 'Page': '98', 'Yype': 'text'}

Index: 57 | Score: 0.6225
Content: ' 
54 
The following table sets forth the components of our consolidated statements of operations for'
Metadata: {'Document': 'Uber-2024-Annual-Report', 'Page': '58', 'Yype': 'text'}



### Visualise the Embedding in clusters

In [16]:
from sklearn.decomposition import PCA
pca_reducer = PCA(n_components=2)
embeddings_2d = pca_reducer.fit_transform(embeddings)

In [17]:
from sklearn.cluster import KMeans

# Apply K-means clustering
num_clusters = 3
kmeans = KMeans(n_clusters=num_clusters)
cluster_labels = kmeans.fit_predict(embeddings_2d)


In [18]:
# Add cluster labels as a new column to the DataFrame
text_doc['cluster_label'] = cluster_labels

In [19]:
text_doc["cluster_label"].value_counts()

cluster_label
1    62
0    45
2    35
Name: count, dtype: int64

In [20]:
import plotly.express as px
import textwrap


def wrap_text(text, width=50):
    """Inserts <br> tags into text for Plotly line wrapping."""
    wrapped_lines = textwrap.wrap(text, width=width)
    return '<br>'.join(wrapped_lines)

# Create a DataFrame for the data
cluster_df = pd.DataFrame({
    'x': embeddings_2d[:, 0],
    'y': embeddings_2d[:, 1],
    'label': cluster_labels,
    'sentence': text_doc['content']
})

# Create an interactive scatter plot using plotly
fig = px.scatter(
    cluster_df,
    x='x', y='y',
    color='label',
    hover_name='sentence',
    title='Uber 2024 annual report',
    labels={'label': 'Cluster'},
    width=800,  # Adjust the width as desired
    height=600,  # Adjust the height as desired
)

fig.update_traces(
    marker=dict(size=8)  # Adjust the size value as needed
)

# Set the plot background color to white
fig.update_layout(
    plot_bgcolor='white',
)

fig.show()