# 🚀 Semantic Intelligence: Building a PDF Vector Brain
### **Powered by Mohammad Sefidgar**

Welcome to a masterclass in modern AI retrieval. This notebook turns static PDF documents into a searchable, semantically aware vector database. We use **Amazon Bedrock**, **LangChain**, and **FAISS** to build a system that doesn't just match keywords; it retrieves passages by meaning.

![Diagram](data/build_pdf_vector_db.jpg)

## 🛠️ 1. The Power Stack
To run this engine, you'll need the following tools in your environment:

* **Boto3**: The AWS SDK for Python to communicate with Amazon Bedrock.
* **LangChain**: The orchestrator for our LLM and vector workflows.
* **FAISS**: Facebook AI Similarity Search, our high-speed vector engine.
* **PyPDF**: Reads PDF files and extracts their text.
* **SemanticChunker**: Part of LangChain Experimental for meaning-based splitting.
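
Before running the rest of the notebook, it can help to confirm that these packages are importable. The snippet below is a minimal sketch using only the standard library; the `missing_modules` helper and the `REQUIRED_MODULES` list are illustrative names, not part of any of the libraries above.

```python
from importlib.util import find_spec

# Import names can differ from pip package names (e.g. faiss-cpu -> faiss).
REQUIRED_MODULES = [
    "boto3",
    "langchain",
    "langchain_community",
    "langchain_aws",
    "langchain_experimental",
    "pypdf",
    "faiss",
]

def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing or "none")
```

If anything is reported missing, uncommenting and running the install cell should resolve it.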

In [None]:
#!pip install -qU boto3 langchain langchain-community langchain-aws langchain-experimental langchain-text-splitters pypdf faiss-cpu

In [None]:
import boto3
import os
import numpy as np
from langchain_aws import BedrockEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFLoader
# The splitters live in the langchain-text-splitters package in current LangChain releases
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker

# Initialize the Bedrock Client - Mohammad Sefidgar Configuration
bedrock_client = boto3.client("bedrock-runtime", region_name='us-east-1') 

# Powering our intelligence with Amazon Bedrock Embeddings
# We use Cohere Multilingual for robust cross-language semantic understanding.
embeddings_model = BedrockEmbeddings(
    model_id="cohere.embed-multilingual-v3", 
    client=bedrock_client
)

# Creating the client makes no network call; the first request happens when we embed text
print("✅ Bedrock client and embedding model configured.")

## 📂 2. Interactive PDF Upload
Instead of using hardcoded paths, this section allows you to process any custom document on the fly. Simply provide the path to your PDF file.
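
A path that exists is not necessarily a PDF. As a hedged sketch, the hypothetical helper `looks_like_pdf` below checks for the `%PDF-` header that PDF files carry near their start. It only inspects the first bytes, so it is a quick sanity check rather than full validation.

```python
import os

def looks_like_pdf(path):
    """Return True if path is a regular file whose first bytes contain the PDF header."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as handle:
        # Some files carry a few junk bytes before the header, so scan a small window.
        head = handle.read(1024)
    return b"%PDF-" in head
```

A call such as `looks_like_pdf(custom_pdf_path)` could be used to reject the wrong file type before any loading or embedding takes place.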

In [None]:
# Interactive Input powered by Mohammad Sefidgar's workflow
custom_pdf_path = input("Enter the path to your PDF file (e.g., my_document.pdf): ").strip()

# isfile() also rejects directories, which exists() would accept
if not os.path.isfile(custom_pdf_path):
    print("❌ File not found. Please ensure the path is correct!")
else:
    print(f"📖 Successfully located: {custom_pdf_path}")

## ✂️ 3. Smart Splitting: Recursive vs. Semantic
How we cut the text determines what each retrieved chunk contains, and therefore how well the AI "remembers" it.

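1. **Recursive Character Splitting**: Splits on a hierarchy of separators (paragraph breaks, newlines, spaces), falling back to finer separators until each chunk fits the size limit. Neighboring chunks share an overlap so context is not lost at the boundaries.
2. **Semantic Chunking**: Embeds neighboring sentences with the embedding model (not a text-generating LLM) and starts a new chunk wherever the distance between neighbors is unusually large, so each chunk covers one coherent topic.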
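
To make the two ideas concrete, here is a simplified, standard-library sketch. It is not the LangChain implementation: `sliding_chunks` shows only the overlap mechanics (it ignores separators), and `semantic_breakpoints` loosely mirrors the percentile rule, using hand-made toy vectors in place of real embeddings. All function names are illustrative.

```python
import math

def sliding_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Fixed-size chunks where each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]

def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

def semantic_breakpoints(sentence_vectors, threshold_pct=80):
    """Indices i such that a new chunk should start after sentence i."""
    if len(sentence_vectors) < 2:
        return []
    distances = [
        cosine_distance(sentence_vectors[i], sentence_vectors[i + 1])
        for i in range(len(sentence_vectors) - 1)
    ]
    cutoff = percentile(distances, threshold_pct)
    return [i for i, d in enumerate(distances) if d > cutoff]

# Toy "embeddings": two sentences about one topic, then two about another.
toy_vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
print(semantic_breakpoints(toy_vectors))
print(sliding_chunks("abcdefghijklmnopqrstuvwxy", chunk_size=10, chunk_overlap=2))
```

A higher `threshold_pct` yields fewer, larger chunks because fewer distances exceed the cutoff; a lower value splits more aggressively.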

In [None]:
def process_document(file_path, method="semantic"):
    loader = PyPDFLoader(file_path)
    
    if method == "recursive":
        # Traditional high-speed splitting
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000, 
            chunk_overlap=100
        )
        docs = loader.load_and_split(splitter)
    else:
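        # Advanced meaning-based splitting: break where the distance between
        # neighboring sentences exceeds the 80th percentile of all such distances
        splitter = SemanticChunker(
            embeddings_model,
            breakpoint_threshold_type="percentile",
            breakpoint_threshold_amount=80,
        )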
        raw_docs = loader.load()
        docs = splitter.split_documents(raw_docs)
        
    # Clean up empty or whitespace-only fragments
    clean_docs = [doc for doc in docs if doc.page_content.strip()]
    return clean_docs

# Applying Mohammad Sefidgar's semantic logic
processed_docs = process_document(custom_pdf_path, method="semantic")
print(f"✨ Created {len(processed_docs)} semantically coherent chunks.")

## 🧠 4. Building the Vector Brain (FAISS)
We convert the text chunks into mathematical vectors and store them in **FAISS**. This allows for quick similarity searches and retrieval of related documents.
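
Conceptually, a similarity search compares the query vector against every stored vector and keeps the closest ones. The sketch below does this by brute force in plain Python with made-up two-dimensional vectors. FAISS performs the same kind of comparison far more efficiently; the `nearest` helper and the toy data are illustrative assumptions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query_vec, stored, k=2):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    scored = sorted((squared_l2(query_vec, vec), text) for text, vec in stored)
    return scored[:k]

# Toy store: (chunk text, made-up embedding)
toy_store = [
    ("cats purr", [0.9, 0.1]),
    ("dogs bark", [0.8, 0.3]),
    ("stocks fell", [0.0, 1.0]),
]

for distance, text in nearest([1.0, 0.0], toy_store, k=2):
    print(f"{distance:.3f}  {text}")
```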



In [None]:
vector_db = FAISS.from_documents(processed_docs, embeddings_model)
print(f"🧠 Vector Database ready with {vector_db.index.ntotal} indexed vectors.")

## 🔍 5. Interrogating Your Data
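We can run a plain similarity search, or one that also returns a score for each hit. With the default FAISS index this score is an L2 distance rather than a confidence value, so a lower score means a closer match.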
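
If a "higher is better" number is easier to read, a distance can be converted to cosine similarity under two assumptions that are worth verifying for your setup: the embeddings are unit-length, and the reported score is the squared L2 distance. In that case ‖a − b‖² = 2 − 2·cos(a, b). The helper names below are illustrative.

```python
import math

def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_from_squared_l2(squared_distance):
    """Valid only for unit-length vectors: ||a - b||^2 = 2 - 2*cos(a, b)."""
    return 1.0 - squared_distance / 2.0

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

direct_cosine = sum(x * y for x, y in zip(a, b))
converted = cosine_from_squared_l2(squared_l2(a, b))
print(round(direct_cosine, 4), round(converted, 4))
```

For vectors that are not normalized, the conversion does not hold and the raw distance should be used for ranking only.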



In [None]:
query = "What is the main topic of this document?"

# Similarity search with score
results = vector_db.similarity_search_with_score(query, k=2)

# The score is a distance, so lower means more similar
for res, score in results:
    print(f"* [DISTANCE={score:.3f}] {res.page_content[:200]}... [{res.metadata}]")

## 💾 6. Local Persistence & Management
You can save the vector store locally to avoid re-processing the entire PDF in future sessions.
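
To decide whether to rebuild or reload, you can first check whether a saved index is already on disk. The sketch below assumes the default file naming used by the LangChain FAISS store (`index.faiss` plus `index.pkl` inside the folder); `saved_index_exists` is a hypothetical helper, not a library function.

```python
import os

def saved_index_exists(folder, index_name="index"):
    """Return True if both files written by save_local are present in folder."""
    expected = (f"{index_name}.faiss", f"{index_name}.pkl")
    return all(os.path.isfile(os.path.join(folder, name)) for name in expected)
```

In a later session, a check such as `saved_index_exists("custom_pdf_index")` could gate a call to `FAISS.load_local`, falling back to `FAISS.from_documents` when the files are absent.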

In [None]:
db_folder = "custom_pdf_index"
vector_db.save_local(db_folder)
print(f"💾 Vector index successfully saved to {db_folder}")

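# Loading the database back. The document store is saved with pickle, so only
# enable allow_dangerous_deserialization for index files you created yourself.
new_db = FAISS.load_local(db_folder, embeddings_model, allow_dangerous_deserialization=True)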

# Checking total count
print(f"Loaded database contains {new_db.index.ntotal} records.")