# 🚀 Semantic Intelligence: Building a PDF Vector Brain
### **Powered by Mohammad Sefidgar**

Welcome to a masterclass in modern AI retrieval. This notebook turns static PDF documents into a searchable, semantically aware vector database. We use **Hugging Face** embeddings, **LangChain**, and **FAISS** to build a system that doesn't just match keywords: it retrieves passages by meaning.

![Diagram](https://github.com/mhsefidgar/AI-Engineering-Pro/blob/main/Practical%20RAG/Semantic%20Search%20AMAZON%20Titan%20Embedding%20FAISS/data/build_pdf_vector_db.jpg?raw=1)

## 🛠️ The Power Stack
To run this engine, you'll need the following tools in your environment:

* **Hugging Face**: Local embeddings for semantic understanding without API keys.
* **LangChain**: The orchestrator for our LLM and vector workflows.
* **FAISS**: Facebook AI Similarity Search, our high-speed vector engine.
* **PyPDF**: To unlock and read PDF data.
* **SemanticChunker**: Part of LangChain Experimental for meaning-based splitting.

In [None]:
# Install local processing requirements
!pip install -qU langchain-huggingface sentence-transformers
!pip install -qU langchain-community langchain-text-splitters pypdf faiss-cpu
!pip install -qU langchain-experimental

In [None]:
import os
import numpy as np
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFLoader
# Splitters live in the langchain-text-splitters package in current releases
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker

# Powering our intelligence with Local Hugging Face Embeddings
# 'all-MiniLM-L6-v2' is fast, accurate, and runs locally without any credentials.
embeddings_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

print("✅ Local Hugging Face Embeddings initialized.")
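
To make "semantic" concrete: an embedding model maps text to a vector, and texts with related meaning end up pointing in similar directions. The sketch below measures that with cosine similarity. The three-dimensional vectors and the `cosine_similarity` helper are illustrative assumptions, not output from `all-MiniLM-L6-v2`, which produces 384-dimensional vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; in the notebook, real vectors would come
# from embeddings_model.embed_query(text).
toy_vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "money-back guarantee": [0.8, 0.2, 0.1],
    "server uptime": [0.0, 0.2, 0.9],
}

query_vec = toy_vectors["refund policy"]
scores = {text: cosine_similarity(query_vec, vec) for text, vec in toy_vectors.items()}

# Rank texts from most to least similar to the query
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

"money-back guarantee" shares no keywords with "refund policy", yet it ranks above "server uptime" because its vector points in nearly the same direction.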

## 📂 2. Interactive PDF Upload
Instead of using hardcoded paths, this section lets you process any custom document on the fly. Run the cell and pick a PDF in the Colab upload dialog; the file is saved to the working directory and its name is stored in `custom_pdf_path`.

In [None]:
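from google.colab import files
import os

# Opens the Colab file picker; returns a dict of {filename: file bytes}
uploaded = files.upload()

custom_pdf_path = None
if not uploaded:
    print("❌ No file uploaded.")
else:
    # Colab saves uploads into the current working directory
    custom_pdf_path = next(iter(uploaded))
    if os.path.exists(custom_pdf_path):
        print(f"📖 Successfully uploaded and located: {custom_pdf_path}")
    else:
        print(f"❌ Upload finished but file not found: {custom_pdf_path}")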
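
The `google.colab` module only exists inside Colab. Outside it, a plain file path can stand in for the upload step. The following is a minimal sketch of that fallback; `resolve_pdf_path` is a hypothetical helper, and the check relies on the fact that PDF files normally begin with the bytes `%PDF`.

```python
from pathlib import Path

def resolve_pdf_path(path_str):
    """Return a Path to an existing PDF, or raise with a clear message."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No file found at: {path}")
    # PDF files normally start with the magic bytes "%PDF"
    with path.open("rb") as handle:
        if handle.read(4) != b"%PDF":
            raise ValueError(f"Not a PDF file: {path}")
    return path

# Example use outside Colab (the file name is a placeholder):
# custom_pdf_path = str(resolve_pdf_path("my_report.pdf"))
```

Failing early here gives a clearer error than letting the loader fail later on a missing or mislabeled file.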

## 🧠 3. Advanced Document Processing
We can process the document using two methods: Traditional Recursive splitting or Advanced Semantic splitting.
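
The traditional approach is controlled by two numbers: how large each chunk is, and how many characters neighbouring chunks share. The sketch below shows that idea with a plain sliding window. It is a simplification: `sliding_window_chunks` is a hypothetical helper, whereas `RecursiveCharacterTextSplitter` first tries to cut on paragraph, line, and word boundaries before falling back to raw characters.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. The cut points ignore meaning entirely, which is the limitation semantic splitting addresses.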

In [None]:
def process_document(file_path, method="semantic"):
    loader = PyPDFLoader(file_path)
    
    if method == "recursive":
        # Traditional high-speed splitting
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000, 
            chunk_overlap=100
        )
        docs = loader.load_and_split(splitter)
    else:
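        # Meaning-based splitting using local embeddings: break wherever the
        # distance between neighbouring sentences exceeds the 80th percentile
        splitter = SemanticChunker(
            embeddings_model,
            breakpoint_threshold_type="percentile",
            breakpoint_threshold_amount=80,
        )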
        raw_docs = loader.load()
        docs = splitter.split_documents(raw_docs)
        
    # Drop empty or whitespace-only fragments
    clean_docs = [doc for doc in docs if doc.page_content.strip()]
    return clean_docs

# Applying Mohammad Sefidgar's semantic logic
processed_docs = process_document(custom_pdf_path, method="semantic")
print(f"✨ Created {len(processed_docs)} semantically coherent chunks.")
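
The value `80` passed to `SemanticChunker` is a percentile. As a simplified sketch, assuming the splitter's percentile mode: the distance between each pair of neighbouring sentences is measured, and a new chunk starts wherever that distance exceeds the chosen percentile of all distances. The `percentile` and `find_breakpoints` helpers and the distance values are hypothetical stand-ins for illustration.

```python
def percentile(values, pct):
    """Linear-interpolation percentile (the default method in NumPy)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * pct / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

def find_breakpoints(distances, threshold_pct=80):
    """Indices where the gap to the next sentence exceeds the percentile cutoff."""
    cutoff = percentile(distances, threshold_pct)
    return [i for i, d in enumerate(distances) if d > cutoff]

# Hypothetical cosine distances between consecutive sentences
distances = [0.10, 0.12, 0.55, 0.09, 0.11, 0.60, 0.08, 0.13, 0.10, 0.12]
breaks = find_breakpoints(distances, threshold_pct=80)
print(breaks)
```

Only the two large jumps in topic are selected as split points. Lowering the percentile produces more, smaller chunks; raising it produces fewer, larger ones.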

## 🏗️ 4. Vector Database Construction
Each processed chunk is embedded and its vector is loaded into a FAISS index for fast similarity search.

In [None]:
vector_db = FAISS.from_documents(processed_docs, embeddings_model)
print(f"🏗️ Vector database created with {vector_db.index.ntotal} vectors.")
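
Conceptually, the simplest kind of vector index stores every vector and compares the query against all of them. The sketch below mimics that behaviour in plain Python, on the assumption that the store uses an exhaustive ("flat") index with squared Euclidean distance. `ToyFlatIndex` is a hypothetical teaching aid, not part of FAISS or LangChain.

```python
class ToyFlatIndex:
    """Exhaustive nearest-neighbour search over stored vectors."""

    def __init__(self):
        self.vectors = []

    @property
    def ntotal(self):
        """Number of stored vectors, mirroring the attribute printed above."""
        return len(self.vectors)

    def add(self, vector):
        self.vectors.append(list(vector))

    def search(self, query, k=3):
        """Return the k closest (squared_distance, position) pairs."""
        scored = [
            (sum((q - v) ** 2 for q, v in zip(query, vec)), pos)
            for pos, vec in enumerate(self.vectors)
        ]
        return sorted(scored)[:k]

index = ToyFlatIndex()
for vec in ([0.0, 0.0], [1.0, 1.0], [5.0, 5.0]):
    index.add(vec)

hits = index.search([0.9, 1.2], k=2)
print(index.ntotal, hits)
```

FAISS performs the same comparison with optimized native code, which is what makes it practical for large collections.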

## 🔍 5. Precision Semantic Retrieval
Testing the brain's ability to find relevant information based on meaning.

In [None]:
query = "What are the key findings or main topics in this document?"
results = vector_db.similarity_search(query, k=3)

print(f"\n🔍 Query: {query}\n")
for i, res in enumerate(results):
    print(f"[Result {i+1}]: {res.page_content[:200]}... [{res.metadata}]\n")
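
`similarity_search` always returns `k` results, even when nothing in the document is a good match. One way to guard against weak hits is to request scores with `similarity_search_with_score` and drop anything too far from the query. The sketch below shows only the filtering step on stand-in data; `filter_by_distance`, the sample texts, and the cutoff of `1.0` are assumptions to tune against your own documents.

```python
def filter_by_distance(scored_results, max_distance):
    """Keep (item, distance) pairs whose distance is at or below max_distance."""
    return [(item, dist) for item, dist in scored_results if dist <= max_distance]

# Hypothetical stand-in for vector_db.similarity_search_with_score(query, k=3),
# which returns (Document, distance) pairs; plain strings replace Documents here.
scored_results = [
    ("Chapter 2 summarises the main findings.", 0.42),
    ("The appendix lists the raw tables.", 0.95),
    ("Copyright and licensing notice.", 1.63),
]

kept = filter_by_distance(scored_results, max_distance=1.0)
print([text for text, _ in kept])
```

With a distance-based score, lower means closer, so the filter keeps the small values and discards the large ones.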

## 💾 6. Local Persistence & Management
Save the vector store locally to avoid re-processing in future sessions.

In [None]:
db_folder = "custom_pdf_index"
vector_db.save_local(db_folder)
print(f"💾 Vector index successfully saved to {db_folder}")

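# Loading the database back.
# load_local unpickles the stored document data, so allow_dangerous_deserialization=True
# should only be used with index folders you created yourself or otherwise trust.
new_db = FAISS.load_local(db_folder, embeddings_model, allow_dangerous_deserialization=True)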

print(f"Loaded database contains {new_db.index.ntotal} records.")
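
To actually skip re-processing in a later session, the notebook needs to know whether a saved index is already on disk. The following is a minimal sketch, assuming the default layout in which `save_local` writes `index.faiss` and `index.pkl` into the target folder. `index_exists` is a hypothetical helper.

```python
from pathlib import Path

def index_exists(folder, index_name="index"):
    """True if both files of a saved index are present in folder."""
    base = Path(folder)
    return all((base / f"{index_name}{ext}").is_file() for ext in (".faiss", ".pkl"))

# Possible use in the notebook (names taken from the cells above):
# if index_exists(db_folder):
#     vector_db = FAISS.load_local(
#         db_folder, embeddings_model, allow_dangerous_deserialization=True
#     )
# else:
#     vector_db = FAISS.from_documents(processed_docs, embeddings_model)
#     vector_db.save_local(db_folder)
```

The index must be loaded with the same embedding model that built it; otherwise query vectors and stored vectors will not be comparable.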