# VectorDB

Vector databases are optimized for semantic search. They use **ANN (Approximate Nearest Neighbor)** algorithms, which trade a small amount of accuracy for much faster queries than exact **KNN (k-Nearest Neighbors)**. Exact KNN compares the query against every stored vector, so each query costs *O(N)*. ANN indexes aim for sub-linear query time, for example roughly *O(log N)* for graph-based indexes such as HNSW.
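
To make the exact baseline concrete, here is a minimal sketch of brute-force k-nearest-neighbor search with cosine similarity in NumPy. The function name `exact_knn` and the toy vectors are illustrative assumptions, not part of any vector database API. The query is compared against every stored vector, which is the linear scan that ANN indexes try to avoid.

```python
import numpy as np

def exact_knn(query, vectors, k=3):
    """Return indices and cosine scores of the k vectors most similar to query."""
    # Normalize so that a dot product equals cosine similarity.
    vectors_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = vectors_norm @ query_norm   # one comparison per stored vector
    top = np.argsort(-scores)[:k]        # highest scores first
    return top, scores[top]

# Toy 2-D vectors (illustrative only)
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
indices, scores = exact_knn(np.array([1.0, 0.1]), vectors, k=2)
print(indices, scores)
```

The cost grows linearly with the number of stored vectors, which is acceptable for a few thousand chunks but becomes the bottleneck at millions.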

### VectorDB vs RDBMS

* **RDBMS** stores structured data (rows and columns) and is queried with exact-match predicates, such as keyword or value lookups.
* **VectorDB** stores unstructured data (text, images, audio, video) as vector embeddings and retrieves items by similarity to a query embedding.
* VectorDBs are faster for semantic search, which makes them a core component of GenAI and **RAG (Retrieval-Augmented Generation)** systems.

### Indexing Techniques in VectorDB

* **Flat**: Brute-force search.
* **LSH (Locality-Sensitive Hashing)**: Groups vectors into hash buckets.
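* **IVF (Inverted File Index)**: Partitions vectors into clusters and searches only the clusters closest to the query; **IVFPQ** additionally compresses the stored vectors with product quantization.
  * *Note:* A query near a cluster boundary can have true neighbors in adjacent clusters, so several nearby clusters are usually searched.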
* **HNSW (Hierarchical Navigable Small World)**: Organizes vectors in a multi-layer graph for fast navigation across similar vectors.
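
The IVF idea from the list above can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not how any particular library implements it: `build_ivf`, `ivf_search`, and the random data are hypothetical, and the parameter name `nprobe` (the number of clusters searched) follows common usage. Raising `nprobe` is how the borderline-vector problem is handled: more clusters are searched at the cost of more comparisons.

```python
import numpy as np

def _sq_dists(a, b):
    # Squared L2 distance between every row of a and every row of b.
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def build_ivf(vectors, n_clusters=4, n_iters=10, seed=0):
    """Cluster vectors with a few k-means iterations; return centroids and inverted lists."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        assign = _sq_dists(vectors, centroids).argmin(axis=1)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    assign = _sq_dists(vectors, centroids).argmin(axis=1)
    # Inverted lists: for each cluster, the ids of the vectors assigned to it.
    lists = [np.where(assign == c)[0] for c in range(n_clusters)]
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k=3, nprobe=1):
    """Search only the nprobe clusters whose centroids are closest to the query."""
    centroid_d = ((centroids - query) ** 2).sum(axis=1)
    probe = np.argsort(centroid_d)[:nprobe]
    candidates = np.concatenate([lists[c] for c in probe])
    d = ((vectors[candidates] - query) ** 2).sum(axis=1)
    return candidates[np.argsort(d)[:k]]

rng = np.random.default_rng(42)
data = rng.normal(size=(200, 8))          # random stand-in for embeddings
centroids, lists = build_ivf(data)
q = rng.normal(size=8)
approx = ivf_search(q, data, centroids, lists, k=3, nprobe=2)
full = ivf_search(q, data, centroids, lists, k=3, nprobe=len(centroids))
print(approx, full)
```

With `nprobe` equal to the number of clusters, the search degenerates to the exact Flat result; smaller values are faster but may miss neighbors that sit in unprobed clusters.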

In [None]:
import pdfplumber
import pandas as pd
import numpy as np

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


import chromadb
from chromadb import PersistentClient
from chromadb.config import Settings

In [None]:
pdf_reader = pdfplumber.open("../Data/Uber-2024-Annual-Report.pdf")
len(pdf_reader.pages)

#### Chunking Strategies
- Fixed-size chunking: splits text into chunks of a fixed length.
- Sentence-based chunking: splits at sentence boundaries, approximated in this notebook by line breaks (`\n`).
- Paragraph-based chunking: splits at blank lines (`\n\n`).
- Page-based chunking: treats each page as one chunk.
- Token-based chunking: uses a fixed number of tokens rather than words.
- Sliding-window chunking: each chunk overlaps some content from the previous chunk.
- Hierarchical chunking: breaks documents down at multiple levels, such as sections, subsections, and paragraphs.
- Content-aware chunking: chunks text at paragraph level and treats tables as separate entities.
- Table-aware chunking: keeps each table intact as its own chunk.
- Keyword-based chunking: splits at marker headings such as "Introduction", "Conclusion", and "Summary".
- Hybrid chunking: combines different chunking strategies depending on the data.
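
As an illustration of one strategy from the list, here is a minimal sketch of sliding-window chunking over words. The function name `sliding_window_chunks` and its parameters are assumptions made for this example; the notebook itself uses line-based and page-based chunking in the cells that follow.

```python
def sliding_window_chunks(text, chunk_size=50, overlap=10):
    """Split text into chunks of chunk_size words, each sharing `overlap` words with the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Tiny synthetic input so the overlap is easy to see
sample = " ".join(f"w{i}" for i in range(12))
chunks = sliding_window_chunks(sample, chunk_size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.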

In [None]:
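text_content = []
# File name without directory or extension, e.g. "Uber-2024-Annual-Report"
document_name = "".join(pdf_reader.stream.name.split("/")[-1].split(".")[:-1])
for i, page in enumerate(pdf_reader.pages):
    # Pages without extractable text (e.g. image-only pages) fall back to an empty string
    text_page = page.extract_text() or ""

    # Line-based chunking: every line of the page is a candidate chunk
    for text in text_page.split("\n"):
        # Keep only lines with more than 10 words to skip headers and fragments
        if len(text.split(" ")) > 10:
            text_content.append({
                "type": "text",
                "document": document_name,
                "page": f"{i+1}",
                "content": text
            })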

text_content[0]

In [None]:
len(text_content)

In [None]:
text_content = []

def find_middle_newline(s):
    """Return the position of the middle '\n' in s, or None if s has no newline."""
    newline_indices = [i for i, char in enumerate(s) if char == '\n']

    if not newline_indices:
        return None  # No newline found

    return newline_indices[len(newline_indices) // 2]

document_name = "".join(pdf_reader.stream.name.split("/")[-1].split(".")[:-1])
for i, page in enumerate(pdf_reader.pages):
    # Pages without extractable text fall back to an empty string
    text_page = page.extract_text() or ""

    # Skip near-empty pages (fewer than 10 words) and report them
    word_count = len(text_page.split(" "))
    if word_count < 10:
        print(f"Page number: {i+1}, count: {word_count}")
        continue

    # Page-based chunking; pages longer than 5000 characters are split in two
    # at the middle newline. A long page without any newline stays whole.
    mid_index = find_middle_newline(text_page) if len(text_page) > 5000 else None
    if mid_index is not None:
        parts = [text_page[:mid_index], text_page[mid_index + 1:]]
    else:
        parts = [text_page]

    for split_no, part in enumerate(parts):
        text_content.append({
            "type": "text",
            "document": document_name,
            "page": f"{i+1}",
            "split": f"{split_no}",
            "content": part
        })

text_content[0]

In [None]:
len(text_content)

In [None]:
text_doc = pd.DataFrame(text_content)
text_doc.head()

In [None]:
text_doc["MetaData"] = text_doc.apply(lambda x: {"Document": x["document"], "Page": x["page"], "Split": x["split"], "Type": x["type"]}, axis=1)
text_doc = text_doc.drop(["type", "document", "page"], axis=1)
text_doc.head()


In [None]:
model_name = "all-MiniLM-L6-v2"
embedding_model = SentenceTransformer(model_name)
only_text = text_doc["content"].tolist()
embeddings = embedding_model.encode(only_text)

In [None]:
CHROMA_DB_PATH = "../Store/2_VectorDB"
COLLECTION_NAME = "uber_revenue"

# PersistentClient stores the collection on disk at CHROMA_DB_PATH,
# so it survives notebook restarts.
chroma_client = PersistentClient(path=CHROMA_DB_PATH)

collection = chroma_client.get_or_create_collection(name=COLLECTION_NAME)

In [None]:
ids = text_doc["MetaData"].apply(lambda x: f"{x['Document']}_p{x['Page']}_s{x['Split']}")
ids[:5]

In [None]:
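# Pass the embeddings computed above so Chroma does not re-embed the documents
# with its default embedding function. upsert (rather than add) keeps re-runs
# of this cell from clashing with IDs that are already stored.
collection.upsert(
    documents=text_doc['content'].tolist(),
    embeddings=embeddings.tolist(),
    metadatas=text_doc['MetaData'].tolist(),
    ids=ids.tolist()
)
print("Successfully stored")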

In [None]:
caching = []
cache_emd = []

In [None]:
def get_chroma_results(query, threshold=0.8):
    """Return Chroma results for query, reusing cached results of semantically similar queries."""
    query_emd = embedding_model.encode([query])  # shape: (1, dim)

    if cache_emd:
        # Compare the new query against every cached query embedding
        similarities = cosine_similarity(query_emd, np.vstack(cache_emd))[0]
        best = int(np.argmax(similarities))

        if similarities[best] > threshold:
            print(f"Returning cached results of query '{caching[best]['query']}' with score: {similarities[best]:.4f}")
            return caching[best]["results"]

    # Cache miss: query the collection, reusing the embedding computed above
    results = collection.query(
        query_embeddings=query_emd.tolist(),
        n_results=3
    )

    caching.append({"query": query, "results": results})
    cache_emd.append(query_emd)

    return results
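
The caching logic above can also be packaged as a small class, which avoids module-level lists and makes the threshold explicit. The sketch below is a hedged illustration, not part of the notebook's pipeline: `SemanticCache`, `toy_embed`, and `VOCAB` are hypothetical names, and the bag-of-words embedding only stands in for a real embedding model so that the example runs without downloading one.

```python
import numpy as np

class SemanticCache:
    """Cache results by query embedding; a hit requires cosine similarity above a threshold."""

    def __init__(self, embed_fn, threshold=0.8):
        self.embed_fn = embed_fn      # assumed callable: str -> 1-D array
        self.threshold = threshold
        self.embeddings = []          # unit-length query embeddings
        self.results = []             # result stored for each cached query

    def get(self, query, compute_fn):
        """Return (result, was_cache_hit); call compute_fn(query) only on a miss."""
        emb = np.asarray(self.embed_fn(query), dtype=float)
        emb = emb / np.linalg.norm(emb)
        if self.embeddings:
            # Rows are unit length, so the dot product is the cosine similarity.
            sims = np.vstack(self.embeddings) @ emb
            best = int(np.argmax(sims))
            if sims[best] > self.threshold:
                return self.results[best], True
        result = compute_fn(query)
        self.embeddings.append(emb)
        self.results.append(result)
        return result, False

# Hypothetical stand-in for an embedding model: word counts over a tiny vocabulary
VOCAB = ["revenue", "profit", "uber", "weather", "today"]

def toy_embed(text):
    words = text.lower().replace("?", "").split()
    # The small constant avoids a zero vector when no vocabulary word occurs
    return np.array([words.count(w) for w in VOCAB], dtype=float) + 1e-9

cache = SemanticCache(toy_embed, threshold=0.8)
first = cache.get("Uber revenue", lambda q: "answer-1")
second = cache.get("revenue of Uber?", lambda q: "answer-2")
third = cache.get("weather today", lambda q: "answer-3")
print(first, second, third)
```

In the notebook's setting, `embed_fn` would wrap `embedding_model.encode` and `compute_fn` would wrap the collection query. The 0.8 threshold is a tuning choice: set too low, different questions (for example revenue versus profit) may return each other's cached results.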

In [None]:
query = "What is the revenue of Uber?"
results = get_chroma_results(query=query)
results

In [None]:
query = "What is the profit of Uber?"
results = get_chroma_results(query=query)
results

In [None]:
query = "What is the loss of Uber?"
results = get_chroma_results(query=query)
results

In [None]:
query = "How much degrade for Uber?"
results = get_chroma_results(query=query)
results

In [None]:
query = "How much negative margin for Uber?"
results = get_chroma_results(query=query)
results

In [None]:
query = "What is the margin for Uber?"
results = get_chroma_results(query=query)
results

In [None]:
query = "How much money did Uber make?"
results = get_chroma_results(query=query)
results