# 📚 DeepDocSearch - AI-powered Document Search

DeepDocSearch is an AI-powered system that allows users to search for relevant information 
within internal documents using FAISS (Facebook AI Similarity Search) and a Large Language Model (LLM).
This notebook will guide you through the entire process from indexing documents to querying with AI.

In [28]:
import os
import time
import fitz 
import faiss
import numpy as np
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from deep_doc_search.query_handler import search_in_vector_store, normalize_query
from deep_doc_search.vector_store import create_vector_store_from_pdf
from deep_doc_search.llm_handler import generate_response


## 📄 Step 1: Extract Text from a PDF
We first extract text from a PDF document using PyMuPDF.

In [None]:
def extract_text_from_pdf(pdf_path):
    """Extracts text from a PDF document."""
    doc = fitz.open(pdf_path)
    text = "\n".join(page.get_text() for page in doc)
    return text
 
# Example usage
pdf_path = 'data/test.pdf'
document_text = extract_text_from_pdf(pdf_path)
print(document_text[:500]) 

2023 A NNUA L R EPORT
Passionate
about creativity




Passionate  
about creativity

The LVMH spirit
The LVMH Group was formed in 1987, following the merger between 
Louis Vuitton and Moët Hennessy. From the outset, Bernard Arnault 
gave the Group a clear vision: to become the world leader in luxury, 
with a philosophy summed up in its motto, “Passionate about creativity”. 
Today, the LVMH Group comprises 75 exceptional Maisons, each of 
which creates products that embody unique craftsmanship, r


## 🔍 Step 2: Indexing with FAISS
We generate embeddings for text chunks and store them in FAISS for efficient retrieval.

In [13]:
DB_PATH = 'data/faiss_index'
METADATA_PATH = 'data/metadata.pkl'
pdf_path = "data/test.pdf"

if not os.path.exists(DB_PATH):
    create_vector_store_from_pdf(pdf_path)
else:
    print("FAISS vector database already exists.")

FAISS vector database already exists.


## 🔎 Step 3: Searching with FAISS
Now we query FAISS to retrieve the most relevant text chunks.

In [16]:
query = 'What is the plan to protect water resources?'
results, distances = search_in_vector_store(query)

print("\nSearch Results:")
for i, (res, dist) in enumerate(zip(results, distances)):
    print(f"\nResult {i+1} (Distance: {dist:.4f}):\n{res}")


Search Results:

Result 1 (Distance: 0.8325):
positive outcomes, as well as the actions it will take to 
meet its 2026 and 2030 targets, at the event, which 
was notably attended by Christophe Béchu (France’s 
Minister of Sustainability and Regional Cohesion) and 
Virginijus Sinkevičius (European Commissioner for the 
Environment, Oceans and Fisheries).
Protecting water resources 
and biodiversity
In 2023, the Group unveiled the first part of its plan to 
protect water resources, which are essential for its 
Wines & Spirits and its Perfumes & Cosmetics business 
groups and also critical for its fashion and leather goods 
items. The goal is a 30% reduction in the amount of 
water used by LVMH’s operations and its value chain 
by 2030, especially in regions experiencing water 
stress. In 2023, LVMH ramped up its program of bio-
diversity initiatives, launching regenerative agriculture 
projects in Turkey and Chad for cotton, in Australia for 
merino wool, in Indonesia for palm oil, and 

## 🤖 Step 4: Generating an AI Response
We pass the retrieved results to an LLM to generate a structured answer.

In [17]:
context = '\n\n'.join(results)
prompt = f'''
You are an AI assistant specialized in analyzing internal documents.
Here is an excerpt from the document that may help answer the question:

{context}

Question: {query}
Respond accurately and concisely using only the provided information.
'''
response = generate_response(prompt)
print('🤖 AI Response:', response)

🤖 AI Response:  The document mentions that LVMH unveiled a plan in 2023 to protect water resources. The goal is a 30% reduction in the amount of water used by LVMH’s operations and its value chain by 2030, with a focus on regions experiencing water stress. The strategies include ramping up programs for biodiversity initiatives and launching regenerative agriculture projects in various locations such as Turkey, Chad, Australia, Indonesia, and France. Additionally, LVMH has implemented a Business Partners program to help its suppliers reduce their carbon footprint and impact on water and biodiversity. The document does not provide specific details about the measures or actions to achieve this goal beyond these initiatives.


## ⚡ Step 5: Evaluating Performance
We measure the time taken for FAISS search and LLM response.

In [None]:
start_time = time.time()
results, _ = search_in_vector_store(query, k=3)
search_time = time.time() - start_time

start_time = time.time()
response = generate_response(prompt)
llm_time = time.time() - start_time

print(f'🔍 FAISS Search Time: {search_time:.4f} sec')
print(f'🤖 LLM Response Time: {llm_time:.4f} sec')

🔍 FAISS Search Time: 3.8441 sec
🤖 LLM Response Time: 23.9297 sec


In [23]:
# Search without normalization
results_raw, distances_raw = search_in_vector_store(query)

# Search with normalization
query = 'WHAT ARE LVMH’S SUSTAINABILITY GOALS?'
query_normalized = normalize_query(query)
results_norm, distances_norm = search_in_vector_store(query_normalized)

print("\nResults WITHOUT normalization:")
for i, (res, dist) in enumerate(zip(results_raw, distances_raw)):
    print(f"\nResult {i+1} (Distance: {dist:.4f}):\n{res}")

print("\nResults WITH normalization:")
for i, (res, dist) in enumerate(zip(results_norm, distances_norm)):
    print(f"\nResult {i+1} (Distance: {dist:.4f}):\n{res}")

print("\nComparison of results:")
if results_raw == results_norm:
    print("Normalization does not change the results.")
else:
    print("Normalization has modified the FAISS results!")


Results WITHOUT normalization:

Result 1 (Distance: 0.8325):
positive outcomes, as well as the actions it will take to 
meet its 2026 and 2030 targets, at the event, which 
was notably attended by Christophe Béchu (France’s 
Minister of Sustainability and Regional Cohesion) and 
Virginijus Sinkevičius (European Commissioner for the 
Environment, Oceans and Fisheries).
Protecting water resources 
and biodiversity
In 2023, the Group unveiled the first part of its plan to 
protect water resources, which are essential for its 
Wines & Spirits and its Perfumes & Cosmetics business 
groups and also critical for its fashion and leather goods 
items. The goal is a 30% reduction in the amount of 
water used by LVMH’s operations and its value chain 
by 2030, especially in regions experiencing water 
stress. In 2023, LVMH ramped up its program of bio-
diversity initiatives, launching regenerative agriculture 
projects in Turkey and Chad for cotton, in Australia for 
merino wool, in Indonesia for

In [None]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

text_1 = "Amazon optimizes its logistics with AI."
text_2 = "amazon optimizes its logistics with ai"

vec1 = np.array(embeddings.embed_query(text_1))
vec2 = np.array(embeddings.embed_query(text_2))

# Compute cosine similarity between both versions
similarity = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print(f"Cosine similarity between the two versions: {similarity:.4f}")


  embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")


Cosine similarity between the two versions: 0.9832


In [30]:
EMBEDDING_MODELS = [
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/all-distilroberta-v1",
]

# List of chunk sizes and overlaps to test
CHUNK_SIZES = [200, 500, 1000]
CHUNK_OVERLAPS = [50, 100]

QUERY = "What is the plan to protect water resources?"
GROUND_TRUTH = "The goal is a 30% reduction in the amount of water used by LVMH’s operations and its value chain by 2030"

def normalize_text(text):
    """Cleans text by removing extra spaces, line breaks, and converting to lowercase."""
    return " ".join(text.lower().strip().split())

def evaluate_embedding_quality(model_name, chunk_size, chunk_overlap, text):
    """Tests an embedding model with given chunk size and overlap, evaluates FAISS retrieval."""
    print(f"\nTesting {model_name} | Chunk Size: {chunk_size} | Chunk Overlap: {chunk_overlap}")

    embeddings = HuggingFaceEmbeddings(model_name=model_name)

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    chunks = text_splitter.split_text(text)

    vectors = [embeddings.embed_query(chunk) for chunk in chunks]
    vectors = np.array(vectors, dtype="float32")

    dimension = vectors.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(vectors)

    query_vector = np.array([embeddings.embed_query(QUERY)], dtype="float32")
    D, I = index.search(query_vector, k=5)  # Retrieve up to 5 results

    found_rank = -1
    best_distance = float('inf')

    print(f"\nResults for Query: {QUERY}")
    for rank, (idx, dist) in enumerate(zip(I[0], D[0])):
        chunk_text = chunks[idx]
        normalized_chunk = normalize_query(chunk_text)
        normalized_ground_truth = normalize_query(GROUND_TRUTH)

        print(f"📊 Rank {rank + 1} | Distance: {dist:.4f} | Chunk: {chunk_text}...")

        if normalized_ground_truth in normalized_chunk:
            found_rank = rank + 1
            best_distance = dist
            break  

    if found_rank != -1:
        print(f"\nGround truth found at rank {found_rank} with a distance of {best_distance:.4f}")
    else:
        print("\nGround truth not found in the top 5 results.")

    return found_rank, best_distance

pdf_path = "data/test.pdf"
text = extract_text_from_pdf(pdf_path)

# Test all hyperparameter combinations
results = []
for model in EMBEDDING_MODELS:
    for chunk_size in CHUNK_SIZES:
        for chunk_overlap in CHUNK_OVERLAPS:
            found_rank, best_distance = evaluate_embedding_quality(model, chunk_size, chunk_overlap, text)
            results.append((model, chunk_size, chunk_overlap, found_rank, best_distance))

print("\nSUMMARY OF TEST RESULTS:")
for model, chunk_size, chunk_overlap, rank, dist in results:
    if rank != -1:
        print(f"Model: {model} | Chunk Size: {chunk_size} | Overlap: {chunk_overlap} | Found at Rank: {rank} | Distance: {dist:.4f}")
    else:
        print(f"Model: {model} | Chunk Size: {chunk_size} | Overlap: {chunk_overlap} | Ground truth NOT FOUND in the top 5")


Testing sentence-transformers/all-MiniLM-L6-v2 | Chunk Size: 200 | Chunk Overlap: 50

Results for Query: What is the plan to protect water resources?
📊 Rank 1 | Distance: 0.4713 | Chunk: Protecting water resources 
and biodiversity
In 2023, the Group unveiled the first part of its plan to 
protect water resources, which are essential for its...
📊 Rank 2 | Distance: 0.8842 | Chunk: with environmental concerns, reporting on our ­policies 
and projects and the progress achieved in meeting our 
objectives. Contributing to environmental protection...
📊 Rank 3 | Distance: 0.9465 | Chunk: innovative and ambitious environmental practices 
implemented covering water consumption, efficient 
use of air conditioning, clean energy use, and design 
and construction practices....
📊 Rank 4 | Distance: 0.9679 | Chunk: Minister of Sustainability and Regional Cohesion) and 
Virginijus Sinkevičius (European Commissioner for the 
Environment, Oceans and Fisheries).
Protecting water resources 
and biodiver