# VectorDB

Vector databases are optimized for semantic search. They use **ANN (Approximate Nearest Neighbor)** algorithms, which trade a small amount of recall for much faster queries than exact **KNN (k-Nearest Neighbors)**. Exact KNN compares the query against every stored vector, so each query costs *O(N)* distance computations, while graph-based ANN indexes such as HNSW typically answer a query in roughly *O(log N)* time.
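
To make the *O(N)* cost of exact KNN concrete, here is a minimal brute-force sketch. It uses only the standard library; the function names `cosine` and `exact_knn` and the toy 2-D vectors are assumptions made up for illustration, not part of any vector database API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_knn(query, vectors, k):
    """Return the indices of the k vectors most similar to the query.

    Every stored vector is scored, so the work grows linearly with len(vectors).
    """
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy 2-D vectors (made up for illustration)
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
top2 = exact_knn([1.0, 0.05], vectors, k=2)
print(top2)
```

An ANN index avoids scoring every vector by restricting the search to a small candidate set, which is where the speed-up and the occasional missed neighbor both come from.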

### VectorDB vs RDBMS

* **RDBMS** stores structured data (rows and columns) and answers queries by exact or pattern-based matching on column values.
* **VectorDB** stores unstructured data (text, images, audio, video) as vector embeddings and enables similarity-based search.
* VectorDBs are faster for semantic search and are critical in GenAI and **RAG (Retrieval-Augmented Generation)** systems.
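
The contrast above can be sketched with a toy example: a keyword lookup in SQLite next to a similarity lookup over vectors. The documents, the 2-D "embeddings", the query vector, and the 0.95 cutoff are all hypothetical values chosen by hand; a real system would obtain embeddings from a model.

```python
import math
import sqlite3

docs = ["car rental prices", "automobile hire rates", "banana bread recipe"]
# Hand-made 2-D "embeddings" (hypothetical): axis 0 ~ vehicles, axis 1 ~ food
toy_embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

# Relational lookup: only rows containing the literal keyword match
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO docs (id, body) VALUES (?, ?)", list(enumerate(docs)))
rows = conn.execute("SELECT id FROM docs WHERE body LIKE ? ORDER BY id", ("%car%",))
keyword_hits = [row[0] for row in rows]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Similarity lookup: rank by closeness in vector space instead of shared characters
query_embedding = [0.85, 0.15]  # hypothetical embedding of the query "car"
scores = [cosine(query_embedding, e) for e in toy_embeddings]
semantic_hits = [i for i, s in enumerate(scores) if s > 0.95]

print(keyword_hits, semantic_hits)
```

The keyword query finds only the document that literally contains "car", while the similarity query also returns the "automobile" document because its vector lies close to the query vector.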

### Indexing Techniques in VectorDB

* **Flat**: Brute-force search.
* **LSH (Locality-Sensitive Hashing)**: Groups vectors into hash buckets.
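* **IVF (Inverted File Index)**: Partitions vectors into clusters and searches only the clusters closest to the query; **IVFPQ** additionally compresses the stored vectors with product quantization.

  * *Note:* A query near a cluster boundary may have neighbors in adjacent clusters, so several nearby clusters are usually probed.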
* **HNSW (Hierarchical Navigable Small World)**: Organizes vectors in a multi-layer graph for fast navigation across similar vectors.
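
The IVF idea, including the note about borderline vectors, can be sketched in a few lines. This is a toy under stated assumptions: the centroids are fixed by hand (a real index would typically learn them with k-means), and `build_ivf`, `ivf_search`, and `nprobe` are illustrative names rather than a specific library's API.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector index to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Search only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    candidates.sort(key=lambda i: l2(query, vectors[i]))
    return candidates[:k]

# Toy data on a line: two hand-picked centroids and five vectors
centroids = [[0.0, 0.0], [10.0, 0.0]]
vectors = [[0.0, 0.0], [1.0, 0.0], [5.2, 0.0], [9.0, 0.0], [10.0, 0.0]]
lists = build_ivf(vectors, centroids)

# The query is closest to centroid 0, but its true nearest neighbor
# (vector 2) was assigned to cluster 1: a borderline case.
query = [4.8, 0.0]
one_probe = ivf_search(query, vectors, centroids, lists, k=1, nprobe=1)
two_probes = ivf_search(query, vectors, centroids, lists, k=1, nprobe=2)
print(one_probe, two_probes)
```

With a single probe the search returns vector 1 and misses the true neighbor; probing both clusters recovers vector 2, at the cost of scanning more candidates.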

In [None]:
import pdfplumber
import pandas as pd
import numpy as np

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Alternative vector stores: FAISS, Pinecone
import chromadb
from chromadb import PersistentClient
from chromadb.config import Settings

In [2]:
pdf_reader = pdfplumber.open("../Data/Uber-2024-Annual-Report.pdf")
len(pdf_reader.pages)

142

#### Chunking Strategies
- Fixed-size chunking: splits text into chunks of a fixed length
- Sentence-based chunking: splits at end-of-sentence (EOS) markers
- Newline-based chunking: splits at `\n`
- Paragraph-based chunking: splits at `\n\n`
- Page-based chunking: one chunk per page
- Token-based chunking: uses a fixed number of tokens rather than words
- Sliding-window chunking: each chunk overlaps some content from the previous chunk
- Hierarchical chunking: breaks documents down at multiple levels, such as sections, subsections, and paragraphs
- Content-aware chunking: chunks text at the paragraph level and keeps tables as separate entities
- Table-aware chunking: keeps each table together as its own chunk
- Keyword-based chunking: splits at section keywords such as Introduction, Conclusion, and Summary
- Hybrid chunking: combines different chunking strategies depending on the data
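
As an illustration of one strategy from the list, here is a minimal sketch of sliding-window chunking over words. The function name and the `chunk_size` and `overlap` parameters are assumptions for this example; setting `overlap=0` reduces it to plain fixed-size chunking.

```python
def sliding_window_chunks(words, chunk_size, overlap):
    """Split a list of words into chunks of chunk_size words sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

text = "one two three four five six seven eight nine ten"
chunks = sliding_window_chunks(text.split(), chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighboring chunk, at the cost of storing some text twice.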

In [None]:
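text_content = []
# File name without directory or extension, e.g. "Uber-2024-Annual-Report"
document_name = ".".join(pdf_reader.stream.name.split("/")[-1].split(".")[:-1])

# Newline-based chunking: every extracted line is a candidate chunk
for i, page in enumerate(pdf_reader.pages):
    text_page = page.extract_text() or ""  # guard against pages with no text

    split_text = text_page.split("\n")

    for text in split_text:
        # Keep only lines with more than 10 words to drop headers and fragments
        if len(text.split(" ")) > 10:
            text_content.append({
                "type": "text",
                "document": document_name,
                "page": f"{i+1}",
                "content": text
            })

text_content[0]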

In [None]:
len(text_content)

In [3]:
text_content = []

def find_middle_newline(s):
    # Step 1: Find all indexes of '\n'
    newline_indices = [i for i, char in enumerate(s) if char == '\n']
    
    if not newline_indices:
        return None  # No newline found
    
    # Step 2: Find the middle index
    middle_index = len(newline_indices) // 2
    
    # Step 3: Return the position of the middle '\n'
    return newline_indices[middle_index]


document_name = ".".join(pdf_reader.stream.name.split("/")[-1].split(".")[:-1])


# Page-based chunking: one chunk per page, with long pages split in two
for i, page in enumerate(pdf_reader.pages):
    text_page = page.extract_text() or ""  # guard against pages with no text
    word_count = len(text_page.split(" "))

    # Skip near-empty pages such as covers and blank pages
    if word_count < 10:
        print(f"Page number: {i+1}, count: {word_count}")
        continue

    # Only pages longer than 5000 characters are split, at their middle newline
    mid_index = find_middle_newline(text_page) if len(text_page) > 5000 else None

    if mid_index is not None:
        parts = [text_page[:mid_index], text_page[mid_index+1:]]
    else:
        parts = [text_page]

    for split, part in enumerate(parts):
        text_content.append({
            "type": "text",
            "document": document_name,
            "page": f"{i+1}",
            "split": f"{split}",
            "content": part
        })

text_content[0]

Page number: 1, count: 5
Page number: 139, count: 2
Page number: 140, count: 5


{'type': 'text',
 'document': 'Uber-2024-Annual-Report',
 'page': '2',
 'split': '0',
 'content': 'Uber’s Mission\nWe reimagine the way the world moves for the better\nWe are Uber. The go-getters. The kind of people who are relentless about our\nmission to help people go anywhere and get anything and earn their way.\nMovement is what we power. It’s our lifeblood. It runs through our veins. It’s\nwhat gets us out of bed each morning. It pushes us to constantly reimagine\nhow we can move better. For you. For all the places you want to go. For all the\nthings you want to get. For all the ways you want to earn. Across the entire\nworld. In real time. At the incredible speed of now.'}

In [4]:
len(text_content)

205

In [5]:
text_doc = pd.DataFrame(text_content)
text_doc.head()

| | type | document | page | split | content |
|---|---|---|---|---|---|
| 0 | text | Uber-2024-Annual-Report | 2 | 0 | Uber’s Mission\nWe reimagine the way the world... |
| 1 | text | Uber-2024-Annual-Report | 3 | 0 | UNITED STATES\nSECURITIES AND EXCHANGE COMMISS... |
| 2 | text | Uber-2024-Annual-Report | 4 | 0 | Large accelerated filer ☒ Accelerated filer ☐\... |
| 3 | text | Uber-2024-Annual-Report | 5 | 0 | UBER TECHNOLOGIES, INC.\nTABLE OF CONTENTS\nPa... |
| 4 | text | Uber-2024-Annual-Report | 6 | 0 | SPECIAL NOTE REGARDING FORWARD-LOOKING STATEME... |


In [6]:
text_doc["MetaData"] = text_doc.apply(lambda x: {"Document": x["document"], "Page": x["page"], "Split": x["split"], "Type": x["type"]}, axis=1)
text_doc = text_doc.drop(["type", "document", "page"], axis=1)
text_doc.head()


| | split | content | MetaData |
|---|---|---|---|
| 0 | 0 | Uber’s Mission\nWe reimagine the way the world... | {'Document': 'Uber-2024-Annual-Report', 'Page'... |
| 1 | 0 | UNITED STATES\nSECURITIES AND EXCHANGE COMMISS... | {'Document': 'Uber-2024-Annual-Report', 'Page'... |
| 2 | 0 | Large accelerated filer ☒ Accelerated filer ☐\... | {'Document': 'Uber-2024-Annual-Report', 'Page'... |
| 3 | 0 | UBER TECHNOLOGIES, INC.\nTABLE OF CONTENTS\nPa... | {'Document': 'Uber-2024-Annual-Report', 'Page'... |
| 4 | 0 | SPECIAL NOTE REGARDING FORWARD-LOOKING STATEME... | {'Document': 'Uber-2024-Annual-Report', 'Page'... |


In [7]:
model_name = "all-MiniLM-L6-v2"
embedding_model = SentenceTransformer(model_name)
only_text = text_doc["content"].tolist()
embeddings = embedding_model.encode(only_text)

In [8]:
Chroma_DB_Path = "../Store/2_VectorDB"
COLLECTION_NAME = "uber_revenue"

# chroma_client = chromadb.Client(Settings(
#     persist_directory=Chroma_DB_Path,
#     anonymized_telemetry=False
# ))

chroma_client = PersistentClient(path=Chroma_DB_Path)

collection = chroma_client.get_or_create_collection(name=COLLECTION_NAME)

In [9]:
ids = text_doc["MetaData"].apply(lambda x: f"{x['Document']}_p{x['Page']}_s{x['Split']}")
ids[:5]

0    Uber-2024-Annual-Report_p2_s0
1    Uber-2024-Annual-Report_p3_s0
2    Uber-2024-Annual-Report_p4_s0
3    Uber-2024-Annual-Report_p5_s0
4    Uber-2024-Annual-Report_p6_s0
Name: MetaData, dtype: object

In [10]:
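# No `embeddings` argument is passed, so Chroma embeds the documents itself
# with the collection's embedding function instead of reusing `embeddings` above
collection.add(
    documents=text_doc['content'].tolist(),
    metadatas=text_doc['MetaData'].tolist(),
    ids=ids.tolist()
)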
print("Successfully stored")

Successfully stored


In [11]:
caching = []
cache_emd = []

In [12]:
def get_chroma_results(query):
    # Embed the query with the same model used for the cached query embeddings
    query_emd = embedding_model.encode([query])

    if len(cache_emd) > 0:
        # Semantic cache lookup: compare the query against every cached query
        cache_emd_array = np.vstack(cache_emd)
        similarities = cosine_similarity(query_emd, cache_emd_array)
        best_match_index = int(np.argmax(similarities[0]))
        best_score = similarities[0][best_match_index]

        # Reuse the cached results when the closest cached query is similar enough
        if best_score > 0.8:
            print(f"Returning from query: {caching[best_match_index]['query']} cache with score: {best_score:.4f}")
            return caching[best_match_index]["results"]

    # Cache miss: query the collection and remember the results
    results = collection.query(
        query_texts=[query],
        n_results=3
    )

    caching.append({"query": query, "results": results})
    cache_emd.append(query_emd)

    return results
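
The 0.8 threshold in the function above decides whether a query is served from the cache. The sketch below isolates that decision in a small standalone class so the effect of the threshold can be checked without a model or a database. `SemanticCache`, its methods, and the 2-D vectors are hypothetical stand-ins, not part of Chroma or sentence-transformers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCache:
    """Toy semantic cache: reuse results when a query embedding is close to a cached one."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []  # list of (embedding, results) pairs

    def store(self, embedding, results):
        self.entries.append((embedding, results))

    def lookup(self, embedding):
        """Return (results, score) on a hit, or (None, best score seen) on a miss."""
        best_score, best_results = -1.0, None
        for cached_embedding, results in self.entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_results = score, results
        if best_score > self.threshold:
            return best_results, best_score
        return None, best_score

# Hypothetical 2-D embeddings standing in for two related but different questions
revenue_query = [1.0, 0.0]
profit_query = [0.9, 0.4]

loose = SemanticCache(threshold=0.8)
loose.store(revenue_query, "revenue results")
strict = SemanticCache(threshold=0.95)
strict.store(revenue_query, "revenue results")

loose_hit, score = loose.lookup(profit_query)
strict_hit, _ = strict.lookup(profit_query)
print(loose_hit, strict_hit, round(score, 4))
```

A lower threshold produces more cache hits but also returns cached results for queries that only resemble an earlier one; a higher threshold sends more queries to the collection.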

In [13]:
query = "What is the revenue of Uber?"
results = get_chroma_results(query=query)
results

{'ids': [['Uber-2024-Annual-Report_p79_s0',
   'Uber-2024-Annual-Report_p98_s0',
   'Uber-2024-Annual-Report_p58_s0']],
 'embeddings': None,
 'documents': [['UBER TECHNOLOGIES, INC.\nCONSOLIDATED STATEMENTS OF OPERATIONS\n(In millions, except share amounts which are reflected in thousands, and per share amounts)\nYear Ended December 31,\n2022 2023 2024\nRevenue $ 31,877 $ 37,281 $ 43,978\nCosts and expenses\nCost of revenue, exclusive of depreciation and amortization shown separately below 19,659 22,457 26,651\nOperations and support 2,413 2,689 2,732\nSales and marketing 4,756 4,356 4,337\nResearch and development 2,798 3,164 3,109\nGeneral and administrative 3,136 2,682 3,639\nDepreciation and amortization 947 823 711\nTotal costs and expenses 33,709 36,171 41,179\nIncome (loss) from operations (1,832) 1,110 2,799\nInterest expense (565) (633) (523)\nOther income (expense), net (7,029) 1,844 1,849\nIncome (loss) before income taxes and income (loss) from equity method investments (9,

In [20]:
query = "What is the profit of Uber?"
results = get_chroma_results(query=query)
results

Returning from query: What is the revenue of Uber? cache with score: 0.9118


{'ids': [['Uber-2024-Annual-Report_p79_s0',
   'Uber-2024-Annual-Report_p98_s0',
   'Uber-2024-Annual-Report_p58_s0']],
 'embeddings': None,
 'documents': [['UBER TECHNOLOGIES, INC.\nCONSOLIDATED STATEMENTS OF OPERATIONS\n(In millions, except share amounts which are reflected in thousands, and per share amounts)\nYear Ended December 31,\n2022 2023 2024\nRevenue $ 31,877 $ 37,281 $ 43,978\nCosts and expenses\nCost of revenue, exclusive of depreciation and amortization shown separately below 19,659 22,457 26,651\nOperations and support 2,413 2,689 2,732\nSales and marketing 4,756 4,356 4,337\nResearch and development 2,798 3,164 3,109\nGeneral and administrative 3,136 2,682 3,639\nDepreciation and amortization 947 823 711\nTotal costs and expenses 33,709 36,171 41,179\nIncome (loss) from operations (1,832) 1,110 2,799\nInterest expense (565) (633) (523)\nOther income (expense), net (7,029) 1,844 1,849\nIncome (loss) before income taxes and income (loss) from equity method investments (9,

In [21]:
query = "What is the loss of Uber?"
results = get_chroma_results(query=query)
results

{'ids': [['Uber-2024-Annual-Report_p84_s0',
   'Uber-2024-Annual-Report_p79_s0',
   'Uber-2024-Annual-Report_p98_s0']],
 'embeddings': None,
 'documents': [['UBER TECHNOLOGIES, INC.\nCONSOLIDATED STATEMENTS OF CASH FLOWS\n(In millions)\nYear Ended December 31,\n2022 2023 2024\nCash flows from operating activities\nNet income (loss) including non-controlling interests $ (9,138) $ 2,156 $ 9,845\nAdjustments to reconcile net income (loss) to net cash provided by operating activities:\nDepreciation and amortization 947 823 737\nBad debt expense 114 92 61\nStock-based compensation 1,793 1,935 1,796\nLoss from sale of investments — 74 —\nGain on business divestitures (14) (204) —\nDeferred income taxes (441) 26 (6,027)\nAccretion of discounts on marketable debt securities, net (9) (154) (251)\nImpairments of goodwill, long-lived assets and other assets 28 86 —\nImpairment of equity method investment 182 — —\nLoss (income) from equity method investments, net (107) (48) 38\nUnrealized (gain) l

In [22]:
query = "How much degrade for Uber?"
results = get_chroma_results(query=query)
results

{'ids': [['Uber-2024-Annual-Report_p98_s0',
   'Uber-2024-Annual-Report_p79_s0',
   'Uber-2024-Annual-Report_p84_s0']],
 'embeddings': None,
 'documents': [['15, 2026, and interim periods within fiscal years beginning after December 15, 2027. Early adoption is permitted. We are currently\nevaluating the impact of this accounting standard update on our consolidated financial statements and related disclosures.\nNote 2 – Revenue\nThe following tables present our revenues disaggregated by offering and geographical region. Revenue by geographical region is\nbased on where the transaction occurred. This level of disaggregation takes into consideration how the nature, amount, timing, and\nuncertainty of revenue and cash flows are affected by economic factors (in millions):\nYear Ended December 31,\n2022 2023 2024\nMobility revenue (1) $ 14,029 $ 19,832 $ 25,087\nDelivery revenue (1) 10,901 12,204 13,750\nFreight revenue 6,947 5,245 5,141\nTotal revenue $ 31,877 $ 37,281 $ 43,978\n(1) We offe

In [23]:
query = "How much negative margin for Uber?"
results = get_chroma_results(query=query)
results

{'ids': [['Uber-2024-Annual-Report_p84_s0',
   'Uber-2024-Annual-Report_p79_s0',
   'Uber-2024-Annual-Report_p78_s0']],
 'embeddings': None,
 'documents': [['UBER TECHNOLOGIES, INC.\nCONSOLIDATED STATEMENTS OF CASH FLOWS\n(In millions)\nYear Ended December 31,\n2022 2023 2024\nCash flows from operating activities\nNet income (loss) including non-controlling interests $ (9,138) $ 2,156 $ 9,845\nAdjustments to reconcile net income (loss) to net cash provided by operating activities:\nDepreciation and amortization 947 823 737\nBad debt expense 114 92 61\nStock-based compensation 1,793 1,935 1,796\nLoss from sale of investments — 74 —\nGain on business divestitures (14) (204) —\nDeferred income taxes (441) 26 (6,027)\nAccretion of discounts on marketable debt securities, net (9) (154) (251)\nImpairments of goodwill, long-lived assets and other assets 28 86 —\nImpairment of equity method investment 182 — —\nLoss (income) from equity method investments, net (107) (48) 38\nUnrealized (gain) l

In [24]:
query = "What is the margin for Uber?"
results = get_chroma_results(query=query)
results

Returning from query: How much negative margin for Uber? cache with score: 0.8939


{'ids': [['Uber-2024-Annual-Report_p84_s0',
   'Uber-2024-Annual-Report_p79_s0',
   'Uber-2024-Annual-Report_p78_s0']],
 'embeddings': None,
 'documents': [['UBER TECHNOLOGIES, INC.\nCONSOLIDATED STATEMENTS OF CASH FLOWS\n(In millions)\nYear Ended December 31,\n2022 2023 2024\nCash flows from operating activities\nNet income (loss) including non-controlling interests $ (9,138) $ 2,156 $ 9,845\nAdjustments to reconcile net income (loss) to net cash provided by operating activities:\nDepreciation and amortization 947 823 737\nBad debt expense 114 92 61\nStock-based compensation 1,793 1,935 1,796\nLoss from sale of investments — 74 —\nGain on business divestitures (14) (204) —\nDeferred income taxes (441) 26 (6,027)\nAccretion of discounts on marketable debt securities, net (9) (154) (251)\nImpairments of goodwill, long-lived assets and other assets 28 86 —\nImpairment of equity method investment 182 — —\nLoss (income) from equity method investments, net (107) (48) 38\nUnrealized (gain) l

In [25]:
query = "What much money Uber made?"
results = get_chroma_results(query=query)
results

Returning from query: What is the revenue of Uber? cache with score: 0.8098


{'ids': [['Uber-2024-Annual-Report_p79_s0',
   'Uber-2024-Annual-Report_p98_s0',
   'Uber-2024-Annual-Report_p58_s0']],
 'embeddings': None,
 'documents': [['UBER TECHNOLOGIES, INC.\nCONSOLIDATED STATEMENTS OF OPERATIONS\n(In millions, except share amounts which are reflected in thousands, and per share amounts)\nYear Ended December 31,\n2022 2023 2024\nRevenue $ 31,877 $ 37,281 $ 43,978\nCosts and expenses\nCost of revenue, exclusive of depreciation and amortization shown separately below 19,659 22,457 26,651\nOperations and support 2,413 2,689 2,732\nSales and marketing 4,756 4,356 4,337\nResearch and development 2,798 3,164 3,109\nGeneral and administrative 3,136 2,682 3,639\nDepreciation and amortization 947 823 711\nTotal costs and expenses 33,709 36,171 41,179\nIncome (loss) from operations (1,832) 1,110 2,799\nInterest expense (565) (633) (523)\nOther income (expense), net (7,029) 1,844 1,849\nIncome (loss) before income taxes and income (loss) from equity method investments (9,