# 🚀 Semantic Intelligence: Building a PDF Vector Brain
### **Powered by Mohammad Sefidgar**

Welcome to a masterclass in modern AI retrieval. This notebook turns a static PDF into a searchable, semantically aware vector database. It combines **Hugging Face** embeddings, **LangChain**, and **FAISS** to build a system that doesn't just look for keywords; it retrieves passages by meaning.

![Diagram](https://github.com/mhsefidgar/AI-Engineering-Pro/blob/main/Practical%20RAG/Semantic%20Search%20AMAZON%20Titan%20Embedding%20FAISS/data/build_pdf_vector_db.jpg?raw=1)

## 🛠️ The Power Stack
To run this engine, you'll need the following tools in your environment:

* **Hugging Face**: Local embeddings for semantic understanding without API keys.
* **LangChain**: The orchestrator for our LLM and vector workflows.
* **FAISS**: Facebook AI Similarity Search, our high-speed vector engine.
* **PyPDF**: To extract text from PDF pages.
* **SemanticChunker**: Part of LangChain Experimental for meaning-based splitting.

In [1]:
# Install local processing requirements
!pip install -qU langchain-huggingface sentence-transformers
!pip install -qU langchain-community pypdf faiss-cpu
!pip install -qU langchain-experimental

In [3]:
!pip install semantic-chunker-langchain

In [4]:
import os

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFLoader
# RecursiveCharacterTextSplitter lives in langchain_text_splitters;
# the older langchain.text_splitter path may be missing in newer releases.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker

# Powering our intelligence with local Hugging Face embeddings.
# 'all-MiniLM-L6-v2' is a small, fast sentence-transformers model that runs locally without any credentials.
embeddings_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

print("✅ Local Hugging Face Embeddings initialized.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


✅ Local Hugging Face Embeddings initialized.


## 📂 2. Interactive PDF Upload
Instead of using a hardcoded path, this section lets you process any custom document on the fly. Pick a PDF in the Colab upload dialog; the file is saved to the working directory and its name is stored in `custom_pdf_path`.
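
A quick sanity check can catch a non-PDF upload before it reaches the loader. The sketch below is not part of the original notebook: `looks_like_pdf` is a hypothetical helper that checks the file suffix and the `%PDF-` signature that well-formed PDFs normally start with. It assumes the file is already on disk, as it is once the upload cell in this section has run.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if path is an existing .pdf file that starts with the %PDF- signature."""
    candidate = Path(path)
    # Reject missing files and files without a .pdf suffix (case-insensitive).
    if not candidate.is_file() or candidate.suffix.lower() != ".pdf":
        return False
    # Well-formed PDFs normally begin with these five bytes.
    with candidate.open("rb") as handle:
        return handle.read(5) == b"%PDF-"
```

Once the upload cell has run, `looks_like_pdf(custom_pdf_path)` could gate the call to the loader.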

In [5]:
from google.colab import files
import os

uploaded = files.upload()

if not uploaded:
    print("❌ No file uploaded.")
else:
    custom_pdf_path = list(uploaded.keys())[0]
    if os.path.exists(custom_pdf_path):
        print(f"📖 Successfully uploaded and located: {custom_pdf_path}")

Saving latestv1 Mohammad Sefidgar DataScience.pdf to latestv1 Mohammad Sefidgar DataScience.pdf
📖 Successfully uploaded and located: latestv1 Mohammad Sefidgar DataScience.pdf


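## 🧠 3. Advanced Document Processing
We can split the document in two ways: traditional recursive splitting, which cuts the text into fixed-size character windows with some overlap, or advanced semantic splitting, which starts a new chunk where the embedding distance between neighboring sentences jumps.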
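
To make the semantic option less of a black box, here is a toy sketch of percentile-based breakpoints, the idea behind it: measure the distance between neighboring sentence embeddings and start a new chunk wherever that distance is unusually large. The 2-D vectors are hand-made stand-ins for real embeddings, and `semantic_chunks`, `cosine_distance`, and `percentile` are illustrative names, not LangChain APIs. `SemanticChunker` itself works on real embeddings and also mixes in neighboring sentences before comparing, so treat this as an approximation.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(sentences, vectors, pct=80):
    """Start a new chunk wherever the distance to the next sentence exceeds the pct-th percentile."""
    if len(sentences) < 2:
        return list(sentences)
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # large jump in meaning: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Hand-made vectors: the first three sentences point one way, the last two another.
sentences = [
    "Cats purr.",
    "Cats nap.",
    "Cats hunt mice.",
    "FAISS indexes vectors.",
    "FAISS searches fast.",
]
vectors = [(1.0, 0.0), (0.98, 0.05), (0.95, 0.1), (0.05, 1.0), (0.0, 1.0)]
chunks = semantic_chunks(sentences, vectors)
```

With these toy vectors the only distance above the 80th percentile is the one between the third and fourth sentences, so `chunks` holds two entries: the three cat sentences and the two FAISS sentences.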

In [6]:
def process_document(file_path, method="semantic"):
    """Load a PDF and split it into chunks with the chosen method."""
    raw_docs = PyPDFLoader(file_path).load()

    if method == "recursive":
        # Traditional high-speed splitting on fixed character windows
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=100
        )
    else:
        # Advanced meaning-based splitting using local embeddings:
        # break where the distance between neighboring sentences
        # exceeds the 80th percentile of all such distances
        splitter = SemanticChunker(
            embeddings_model,
            breakpoint_threshold_type="percentile",
            breakpoint_threshold_amount=80
        )

    docs = splitter.split_documents(raw_docs)

    # Clean up empty fragments
    return [doc for doc in docs if doc.page_content]

# Applying Mohammad Sefidgar's semantic logic
processed_docs = process_document(custom_pdf_path, method="semantic")
print(f"✨ Created {len(processed_docs)} semantically coherent chunks.")

✨ Created 10 semantically coherent chunks.


## 🏗️ 4. Vector Database Construction
Each semantically processed chunk is embedded and stored in a FAISS index for lightning-fast nearest-neighbor retrieval.
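
Conceptually, the simplest kind of vector index just stores every vector and compares the query against all of them. The sketch below shows that idea in plain Python with squared L2 distance. `FlatL2Index` is a made-up class for illustration, not the FAISS API, and whether a given FAISS store uses exactly this strategy depends on how it was built. FAISS's flat indexes do the same kind of comparison in optimized native code.

```python
import heapq


class FlatL2Index:
    """Toy index: keeps vectors in a list and searches them exhaustively."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = []

    @property
    def ntotal(self):
        # Number of stored vectors, named after the attribute the notebook prints.
        return len(self.vectors)

    def add(self, vector):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self.vectors.append(tuple(vector))

    def search(self, query, k):
        """Return up to k (squared_distance, position) pairs, nearest first."""
        scored = (
            (sum((q - v) ** 2 for q, v in zip(query, vec)), position)
            for position, vec in enumerate(self.vectors)
        )
        return heapq.nsmallest(k, scored)


index = FlatL2Index(dim=2)
for vec in [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]:
    index.add(vec)

hits = index.search((0.9, 1.2), k=2)
nearest_positions = [position for _, position in hits]
```

Here the query sits closest to the second stored vector, then the first, so `nearest_positions` lists those two positions in that order. The cost grows linearly with the number of stored vectors, which is fine for a 10-chunk document.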

In [7]:
vector_db = FAISS.from_documents(processed_docs, embeddings_model)
print(f"🏗️ Vector database created with {vector_db.index.ntotal} vectors.")

🏗️ Vector database created with 10 vectors.


## 🔍 5. Precision Semantic Retrieval
Testing the brain's ability to find relevant information based on meaning: the query is embedded with the same model and the three closest chunks (`k=3`) are returned.

In [8]:
query = "What are the key findings or main topics in this document?"
results = vector_db.similarity_search(query, k=3)

print(f"\n🔍 Query: {query}\n")
for i, res in enumerate(results):
    print(f"[Result {i+1}]: {res.page_content[:200]}... [{res.metadata}]\n")


🔍 Query: What are the key findings or main topics in this document?

[Result 1]: • Making reports and documentation for the developed system.... [{'producer': 'Microsoft® Word for Office 365', 'creator': 'Microsoft® Word for Office 365', 'creationdate': '2026-01-06T10:04:30-05:00', 'title': 'Microsoft Word - mh-cv-computervision.docx', 'author': 'Mohammad Sefidgar', 'moddate': '2026-01-06T10:04:30-05:00', 'source': 'latestv1 Mohammad Sefidgar DataScience.pdf', 'total_pages': 3, 'page': 2, 'page_label': '3'}]

[Result 2]: CGPA.... [{'producer': 'Microsoft® Word for Office 365', 'creator': 'Microsoft® Word for Office 365', 'creationdate': '2026-01-06T10:04:30-05:00', 'title': 'Microsoft Word - mh-cv-computervision.docx', 'author': 'Mohammad Sefidgar', 'moddate': '2026-01-06T10:04:30-05:00', 'source': 'latestv1 Mohammad Sefidgar DataScience.pdf', 'total_pages': 3, 'page': 1, 'page_label': '2'}]

[Result 3]: Research and Development Engineer | Mehr-Sanat, Iran  
• Research, des
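
The raw metadata dictionary makes each hit hard to scan. As one possible tidy-up, the hypothetical helper `format_hit` below keeps a short preview plus the `source` and `page_label` fields that appear in the output above. It is a display convenience, not part of LangChain.

```python
def format_hit(rank, text, metadata, width=200):
    """Render one search hit as a single line: preview text plus source and page."""
    # Collapse newlines and repeated spaces so the preview stays on one line.
    preview = " ".join(text.split())
    if len(preview) > width:
        preview = preview[:width].rstrip() + "..."
    source = metadata.get("source", "unknown source")
    # Prefer the human-readable page label, fall back to the zero-based page index.
    page = metadata.get("page_label", metadata.get("page", "?"))
    return f"[Result {rank}] {preview} ({source}, p. {page})"


example = format_hit(
    2,
    "CGPA.",
    {"source": "latestv1 Mohammad Sefidgar DataScience.pdf", "page": 1, "page_label": "2"},
)
# example == "[Result 2] CGPA. (latestv1 Mohammad Sefidgar DataScience.pdf, p. 2)"
```

With the retrieval cell in this section, it could be called as `format_hit(i, res.page_content, res.metadata)` while looping over `enumerate(results, start=1)`.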

## 💾 6. Local Persistence & Management
Save the vector store locally to avoid re-processing in future sessions.

In [9]:
db_folder = "custom_pdf_index"
vector_db.save_local(db_folder)
print(f"💾 Vector index successfully saved to {db_folder}")

# Loading the database back.
# load_local unpickles the stored docstore, so allow_dangerous_deserialization=True
# should only be used for index folders you created yourself or fully trust.
new_db = FAISS.load_local(db_folder, embeddings_model, allow_dangerous_deserialization=True)

print(f"Loaded database contains {new_db.index.ntotal} records.")

💾 Vector index successfully saved to custom_pdf_index
Loaded database contains 10 records.
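
To actually skip re-processing in a later session, the notebook needs to know whether a saved index is already on disk. A minimal sketch, assuming the default `index_name` of `"index"`, under which `save_local` writes `index.faiss` and `index.pkl`. `saved_index_exists` is a hypothetical helper, not a LangChain function.

```python
from pathlib import Path


def saved_index_exists(folder, index_name="index"):
    """Return True when both files of a saved FAISS store are present in folder."""
    base = Path(folder)
    # Assumed layout: <index_name>.faiss holds the vectors, <index_name>.pkl the docstore.
    return all(
        (base / f"{index_name}{suffix}").is_file()
        for suffix in (".faiss", ".pkl")
    )
```

A later session could then call `FAISS.load_local(...)` when `saved_index_exists(db_folder)` is true and fall back to rebuilding the index from the PDF otherwise.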
