# Academic PDF Semantic Search with Chroma

Build a semantic search system for academic PDFs using:
- **LangChain** for document processing
- **OpenAI embeddings** for vectors
- **Chroma** for in-memory vector storage
- **Q&A system** for natural language queries

Load PDFs from arXiv, URLs, or local files → Split into chunks → Generate embeddings → Search and ask questions!

---

## Install Packages

Install required packages for PDF processing and semantic search:

In [None]:
!pip install langchain langchain-core langchain-openai langchain-community langchain-text-splitters
!pip install openai chromadb pypdf pymupdf arxiv tiktoken python-dotenv requests tqdm

## Setup API Key

**Get your OpenAI API key**: [platform.openai.com/api-keys](https://platform.openai.com/api-keys)

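**Colab**: add `OPENAI_API_KEY` under Secrets (🔑 icon) | **Local**: create a `.env` file containing `OPENAI_API_KEY=<your key>`, or set the environment variable directly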

In [None]:
import os
import warnings
warnings.filterwarnings('ignore')

# Detect environment
try:
    import google.colab
    IN_COLAB = True
    print("🔗 Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("💻 Running locally")

# Setup API key based on environment
if IN_COLAB:
    # Google Colab: Try to get from secrets, fallback to manual input
    try:
        from google.colab import userdata
        openai_api_key = userdata.get('OPENAI_API_KEY')
        if openai_api_key:
            os.environ["OPENAI_API_KEY"] = openai_api_key
            print("✅ API key loaded from Colab secrets!")
        else:
            print("⚠️ No OPENAI_API_KEY found in Colab secrets")
            from getpass import getpass
            os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
            print("✅ API key set manually!")
    except Exception as e:
        print(f"⚠️ Could not access Colab secrets: {e}")
        from getpass import getpass
        os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
        print("✅ API key set manually!")
else:
    # Local environment: load a .env file if python-dotenv is installed,
    # then rely on whatever ends up in the process environment
    try:
        from dotenv import load_dotenv
        loaded_env_file = load_dotenv()  # True when a .env file with entries was found
    except ImportError:
        # No python-dotenv, so only shell environment variables are available
        loaded_env_file = False

    if os.environ.get("OPENAI_API_KEY"):
        key_source = ".env file or environment variables" if loaded_env_file else "environment variables"
        print(f"✅ API key found ({key_source})")
    else:
        print("❌ No API key found. Please set OPENAI_API_KEY in a .env file or as an environment variable")

# Final check
if os.environ.get("OPENAI_API_KEY") and os.environ.get("OPENAI_API_KEY") != "your-api-key-here":
    print("🚀 Ready to go!")
else:
    print("⚠️ Please set your OpenAI API key before continuing!")
    print("   • Colab: Add OPENAI_API_KEY to secrets (🔑 icon)")
    print("   • Local: Set environment variable or create .env file")

## Import Libraries

In [None]:
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.document_loaders import PyPDFLoader, ArxivLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_core.documents import Document  # langchain.schema is a deprecated import path
from langchain_community.vectorstores import Chroma

import requests
from typing import List, Dict
from tqdm import tqdm

print("✅ Libraries imported!")

## Configuration

In [None]:
# Setup
EMBEDDING_MODEL = "text-embedding-3-small"
CHAT_MODEL = "gpt-4o-mini"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200

# Initialize components
embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
chat_model = ChatOpenAI(model=CHAT_MODEL, temperature=0.1)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
)

vectorstore = None  # Will be created when adding documents

print("✅ Configuration ready!")
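
**How `CHUNK_SIZE` and `CHUNK_OVERLAP` interact**: the sketch below is a simplified illustration, not the algorithm `RecursiveCharacterTextSplitter` uses (the real splitter prefers to break on separators such as paragraphs and sentences). It assumes plain fixed-size character windows, and `chunk_text` is a hypothetical helper written only for this example. It shows why overlap matters: the tail of each chunk is repeated at the head of the next, so a sentence cut at a boundary still appears whole in one of them.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Tiny values so the overlap is easy to see; the notebook uses 1000 / 200
sample_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = chunk_text(sample_text, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk)
```

With the notebook's settings each window would advance by roughly 800 characters, so larger overlaps mean more chunks (and more embedding calls) for the same document.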

## Initialize Chroma

In [None]:
print("✅ Chroma ready! (in-memory vector database)")
print("  - No external setup needed")
print("  - Will be created when documents are added")

## Document Loaders

In [None]:
def load_arxiv_papers(queries: List[str], max_docs: int = 3) -> List[Document]:
    """Load papers from arXiv."""
    documents = []
    for query in queries:
        print(f"🔍 Searching arXiv: {query}")
        try:
            loader = ArxivLoader(query=query, load_max_docs=max_docs)
            docs = loader.load()
            for doc in docs:
                doc.metadata['search_query'] = query
                doc.metadata['source'] = 'arxiv'
            documents.extend(docs)
            print(f"  ✅ Found {len(docs)} papers")
        except Exception as e:
            print(f"  ❌ Error: {e}")
    return documents

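import tempfile

def load_pdf_from_url(url: str, title: str = None) -> List[Document]:
    """Load PDF from URL."""
    temp_path = None
    try:
        print(f"📥 Loading PDF: {url}")
        # A timeout keeps the cell from hanging on an unresponsive server
        response = requests.get(url, timeout=30)
        response.raise_for_status()

        # Use a unique temp file instead of a hard-coded /tmp path (portable across OSes)
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as f:
            f.write(response.content)
            temp_path = f.name

        loader = PyPDFLoader(temp_path)
        documents = loader.load()

        for doc in documents:
            doc.metadata.update({'source': 'url', 'url': url})
            if title:
                doc.metadata['title'] = title

        print(f"  ✅ Loaded {len(documents)} pages")
        return documents
    except Exception as e:
        print(f"  ❌ Error: {e}")
        return []
    finally:
        # Remove the temporary download whether or not parsing succeeded
        if temp_path and os.path.exists(temp_path):
            os.remove(temp_path)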

def load_local_pdf(file_path: str) -> List[Document]:
    """Load PDF from local file."""
    try:
        print(f"📄 Loading: {file_path}")
        loader = PyPDFLoader(file_path)
        documents = loader.load()

        for doc in documents:
            doc.metadata.update({'source': 'local', 'file_path': file_path})

        print(f"  ✅ Loaded {len(documents)} pages")
        return documents
    except Exception as e:
        print(f"  ❌ Error: {e}")
        return []

print("✅ Document loaders ready!")
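
**Optional sanity check for downloads**: a URL that looks like a PDF link sometimes returns an HTML landing page or an error document, and the resulting parser error can be hard to read. The sketch below assumes a strict check on the standard `%PDF-` signature at the start of the file. `looks_like_pdf` is a hypothetical helper, not part of LangChain or `requests`, and a few unusual but valid PDFs with leading bytes before the header would fail this check.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header signature."""
    # PDF files normally begin with the ASCII marker "%PDF-" followed by a version
    return data.startswith(b"%PDF-")


# Example payloads: a PDF header and an HTML page served in its place
pdf_bytes = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"
html_bytes = b"<!DOCTYPE html><html><body>Not found</body></html>"

print(looks_like_pdf(pdf_bytes))
print(looks_like_pdf(html_bytes))
```

One way to use it would be to call `looks_like_pdf(response.content)` right after the download and return an empty list with a clear message when it fails.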

## Document Processing

In [None]:
def process_and_store_documents(documents: List[Document]) -> int:
    """Process documents and store in Chroma."""
    global vectorstore

    if not documents:
        return 0

    print(f"📝 Processing {len(documents)} documents...")

    # Split into chunks (split_documents accepts the whole list and keeps each chunk's metadata)
    all_chunks = text_splitter.split_documents(documents)

    print(f"✂️ Created {len(all_chunks)} chunks")

    try:
        if vectorstore is None:
            vectorstore = Chroma.from_documents(documents=all_chunks, embedding=embeddings)
        else:
            vectorstore.add_documents(all_chunks)

        print(f"✅ Stored {len(all_chunks)} chunks")
        return len(all_chunks)
    except Exception as e:
        print(f"❌ Error: {e}")
        return 0

def search_similar_documents(query: str, k: int = 5) -> List[Document]:
    """Search for similar documents."""
    if vectorstore is None:
        return []
    return vectorstore.similarity_search(query, k=k)

print("✅ Processing functions ready!")
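
**What similarity search does, conceptually**: the query is embedded into the same vector space as the chunks, and the closest chunk vectors are returned. The sketch below is a brute-force illustration under stated assumptions: hand-made 3-dimensional toy vectors stand in for real embeddings (`text-embedding-3-small` produces 1536 dimensions by default), cosine similarity is used as the score, and `cosine_similarity` / `top_k` are hypothetical helpers. Chroma itself uses an approximate nearest-neighbor index, and its configured distance metric may not be cosine.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          corpus: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Score every chunk against the query and keep the k best matches."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings": made-up numbers, chosen only to make the ranking obvious
toy_corpus = [
    ("attention weights tokens", [0.9, 0.1, 0.0]),
    ("convolution filters images", [0.1, 0.9, 0.0]),
    ("gradient descent optimizer", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.0]

ranked = top_k(toy_query, toy_corpus, k=2)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

The `k` argument plays the same role as `k` in `search_similar_documents`: it caps how many of the best-scoring chunks come back.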

In [None]:
def get_vectorstore():
    """Get the current vectorstore."""
    global vectorstore
    if vectorstore is None:
        raise ValueError("No vectorstore available. Process documents first.")
    return vectorstore

def get_vectorstore_stats() -> Dict:
    """Get statistics about the current vectorstore."""
    global vectorstore
    if vectorstore is None:
        return {"status": "empty", "total_documents": 0}

    try:
        # Get collection info
        collection = vectorstore._collection
        count = collection.count()

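        # Try to get embedding dimension
        embedding_dim = None
        if count > 0:
            sample = collection.peek(limit=1)
            sample_embeddings = sample.get('embeddings') if sample else None
            # Explicit None/len checks: some chromadb versions return NumPy arrays,
            # whose truth value is ambiguous and would raise in a bare `if`
            if sample_embeddings is not None and len(sample_embeddings) > 0:
                embedding_dim = len(sample_embeddings[0])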

        return {
            "status": "ready",
            "total_documents": count,
            "embedding_dimension": embedding_dim
        }
    except Exception as e:
        return {"status": "error", "error": str(e), "total_documents": 0}

print("✅ Utility functions ready!")

## Demo: Load Sample Papers

Load sample papers from arXiv for testing:

In [None]:
# Define search queries for academic papers
search_queries = [
    "transformer neural networks attention mechanism",
    "large language models GPT BERT",
    "computer vision deep learning CNN"
]

# Load papers from arXiv
print("🚀 Loading academic papers from arXiv...\n")
sample_documents = load_arxiv_papers(search_queries, max_docs=2)

print(f"\n📚 Total documents loaded: {len(sample_documents)}")

# Display information about loaded papers
if sample_documents:
    print("\n📋 Loaded Papers:")
    print("=" * 50)

    for i, doc in enumerate(sample_documents, 1):
        title = doc.metadata.get('Title', 'Unknown Title')
        authors = doc.metadata.get('Authors', 'Unknown Authors')
        published = doc.metadata.get('Published', 'Unknown Date')

        print(f"{i}. {title}")
        print(f"   Authors: {authors}")
        print(f"   Published: {published}")
        print(f"   Content length: {len(doc.page_content)} characters")
        print()
else:
    print("⚠️ No documents were loaded. This might be due to API limits or network issues.")
    print("You can try again later or add your own PDF files using the functions above.")

## Demo: Process and Store

Process documents and store in Chroma:

In [None]:
if sample_documents:
    # Process and store documents
    stored_count = process_and_store_documents(sample_documents)

    # Get vectorstore stats
    stats = get_vectorstore_stats()

    print(f"\n📊 Vectorstore Statistics:")
    print(f"  - Status: {stats['status']}")
    print(f"  - Chunks stored this session: {stored_count}")
    if stats['status'] == 'ready':
        print(f"  - Storage: In-memory Chroma database")
        print(f"  - Embedding dimension: {stats['embedding_dimension']}")

else:
    print("⚠️ No documents to process. Please load some documents first.")
    stored_count = 0

## Demo: Semantic Search

Test semantic search functionality:

In [None]:
# Test semantic search
if stored_count > 0:
    print("🔍 Testing Semantic Search\n")

    # Example search queries
    test_queries = [
        "What is attention mechanism in neural networks?",
        "How do transformers work?",
        "Computer vision applications",
        "Deep learning architectures"
    ]

    for query in test_queries:
        print(f"Query: '{query}'")
        print("-" * 50)

        try:
            # Search for similar documents
            results = search_similar_documents(query, k=3)

            if results:
                for i, result in enumerate(results, 1):
                    # arXiv documents use 'Title'; custom PDFs added later use 'title'
                    title = result.metadata.get('Title') or result.metadata.get('title', 'Unknown Title')
                    content_preview = result.page_content[:200] + "..."

                    print(f"{i}. From: {title}")
                    print(f"   Content: {content_preview}")
                    print()
            else:
                print("   No results found")

        except Exception as e:
            print(f"   ❌ Error: {e}")

        print("=" * 60)
        print()

else:
    print("⚠️ No documents in the index. Please load and process documents first.")

## Q&A System

Build question-answering system:

In [None]:
def create_qa_system() -> RetrievalQA:
    """Create Q&A system using the vector database."""
    vectorstore = get_vectorstore()

    qa_chain = RetrievalQA.from_chain_type(
        llm=chat_model,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
        return_source_documents=True,
        verbose=False
    )
    return qa_chain

def ask_question(qa_chain: RetrievalQA, question: str) -> Dict:
    """Ask a question and get answer with sources."""
    try:
        # .invoke() replaces the deprecated direct call qa_chain({...})
        result = qa_chain.invoke({"query": question})
        return {
            "answer": result["result"],
            "sources": result["source_documents"]
        }
    except Exception as e:
        return {"answer": f"Error: {e}", "sources": []}

# Create Q&A system
if stored_count > 0:
    print("🤖 Creating Q&A system...")
    qa_system = create_qa_system()
    print("✅ Q&A system ready!")
else:
    print("⚠️ Load documents first.")
    qa_system = None
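
**What `chain_type="stuff"` means**: the retrieved chunks are concatenated ("stuffed") into a single prompt together with the question, and the model is called once. The sketch below illustrates only that assembly step. `build_stuffed_prompt` is a hypothetical helper and the template wording is an assumption for illustration, not the prompt LangChain actually sends.

```python
from typing import List


def build_stuffed_prompt(question: str, chunks: List[str]) -> str:
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)  # every chunk goes into the same prompt
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunk texts standing in for the k retrieved documents
retrieved_chunks = [
    "Self-attention relates every token in a sequence to every other token.",
    "Multi-head attention runs several attention operations in parallel.",
]
example_prompt = build_stuffed_prompt("What is self-attention?", retrieved_chunks)
print(example_prompt)
```

Because everything lands in one prompt, the retriever's `k` and the chunk size together determine how much context the model receives; very large values can exceed the model's context window.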

## Demo: Q&A

Test the question-answering system:

In [None]:
if qa_system:
    print("🤔 Testing Question Answering System\n")

    # Example questions
    questions = [
        "What is the attention mechanism and how does it work?",
        "What are the main advantages of transformer architectures?",
        "How do large language models like GPT work?",
        "What are the key components of computer vision systems?",
        "What are some applications of deep learning in different domains?"
    ]

    for i, question in enumerate(questions, 1):
        print(f"Question {i}: {question}")
        print("=" * 80)

        # Get answer
        result = ask_question(qa_system, question)

        print(f"Answer: {result['answer']}")

        # Show sources
        if result['sources']:
            print(f"\nSources ({len(result['sources'])} documents):")
            for j, source in enumerate(result['sources'][:3], 1):  # Show top 3 sources
                # arXiv documents use 'Title'; custom PDFs added later use 'title'
                title = source.metadata.get('Title') or source.metadata.get('title', 'Unknown Title')
                print(f"  {j}. {title}")

        print("\n" + "=" * 80 + "\n")

else:
    print("⚠️ Q&A system not available. Please load documents first.")

## Ask Your Questions

Interactive Q&A with your papers:

In [None]:
if qa_system:
    # Ask your own question
    your_question = input("Enter your question about the academic papers: ")

    if your_question.strip():
        print(f"\n🤔 Question: {your_question}")
        print("=" * 60)

        result = ask_question(qa_system, your_question)

        print(f"🤖 Answer: {result['answer']}")

        if result['sources']:
            print(f"\n📚 Based on {len(result['sources'])} source documents:")
            for i, source in enumerate(result['sources'][:3], 1):
                # arXiv documents use 'Title'; custom PDFs added later use 'title'
                title = source.metadata.get('Title') or source.metadata.get('title', 'Unknown Title')
                print(f"  {i}. {title}")
    else:
        print("No question entered.")

else:
    print("⚠️ Q&A system not available. Please load documents first.")
    print("\nTo use this system:")
    print("1. Make sure you have a valid OpenAI API key")
    print("2. Run the document loading cells")
    print("3. Process and store the documents")
    print("4. Then try asking questions!")

## Advanced Functions

Additional utility functions:

In [None]:
def add_custom_pdf(pdf_path_or_url: str, title: str = None) -> int:
    """
    Add a custom PDF to the knowledge base.

    Args:
        pdf_path_or_url: Local path or URL to PDF
        title: Optional title for the document

    Returns:
        Number of chunks stored
    """
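    # Match full URL schemes so a local file named e.g. "http_notes.pdf" is not treated as a URL
    if pdf_path_or_url.startswith(('http://', 'https://')):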
        documents = load_pdf_from_url(pdf_path_or_url, title)
    else:
        documents = load_local_pdf(pdf_path_or_url)
        if title and documents:
            for doc in documents:
                doc.metadata['title'] = title

    if documents:
        return process_and_store_documents(documents)
    return 0

def search_papers_by_topic(topic: str, max_papers: int = 5) -> int:
    """
    Search and add papers from arXiv on a specific topic.

    Args:
        topic: Research topic to search for
        max_papers: Maximum number of papers to add

    Returns:
        Number of chunks stored
    """
    documents = load_arxiv_papers([topic], max_docs=max_papers)

    if documents:
        return process_and_store_documents(documents)
    return 0

def get_current_statistics() -> Dict:
    """
    Get current statistics about the Chroma vectorstore.

    Returns:
        Dictionary with vectorstore statistics
    """
    return get_vectorstore_stats()

def clear_vectorstore():
    """
    Clear all documents from the Chroma vectorstore.
    ⚠️ WARNING: This will delete all stored documents!
    """
    global vectorstore

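    confirmation = input("Are you sure you want to clear ALL documents? Type 'yes' to confirm: ")
    if confirmation.lower() == 'yes':
        if vectorstore is not None:
            # Dropping the Python reference alone may leave the in-memory
            # collection alive, so delete it explicitly first
            vectorstore.delete_collection()
        vectorstore = None
        print("✅ Vectorstore cleared successfully!")
        print("   Re-create the Q&A system after adding new documents.")
    else:
        print("❌ Operation cancelled.")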

print("✅ Advanced functions defined!")
print("\nAvailable functions:")
print("- add_custom_pdf(pdf_path_or_url, title): Add your own PDF")
print("- search_papers_by_topic(topic, max_papers): Search arXiv for specific topics")
print("- get_current_statistics(): View current vectorstore stats")
print("- clear_vectorstore(): Clear all documents (use with caution!)")

## Usage Examples

Try these advanced features:

In [None]:
# Example 1: Add papers on a specific topic
# Uncomment and modify the topic as needed
# topic = "quantum computing machine learning"
# stored = search_papers_by_topic(topic, max_papers=2)
# print(f"Added {stored} chunks from papers on '{topic}'")

# Example 2: Add a PDF from URL
# Uncomment and replace with an actual PDF URL
# pdf_url = "https://example.com/paper.pdf"
# stored = add_custom_pdf(pdf_url, "Custom Paper Title")
# print(f"Added {stored} chunks from custom PDF")

# Example 3: Check current vectorstore statistics
if stored_count > 0:
    stats = get_current_statistics()
    print("📊 Current Vectorstore Statistics:")
    print(f"  - Status: {stats['status']}")
    print(f"  - Total chunks stored: {stats['total_documents']}")
    if 'embedding_dimension' in stats:
        print(f"  - Embedding dimension: {stats['embedding_dimension']}")
else:
    print("Vectorstore is empty. Load some documents first!")

# Example 4: Interactive topic search
# Uncomment to allow user input
# topic = input("Enter a research topic to search for papers: ")
# if topic.strip():
#     stored = search_papers_by_topic(topic, max_papers=3)
#     print(f"Added {stored} chunks from papers on '{topic}'")

## Summary

✅ **Complete PDF semantic search system ready!**

**Features**: Load PDFs → Chunk & embed → Search & Q&A  
**Stack**: LangChain + OpenAI + Chroma (in-memory)  
**Sources**: arXiv, URLs, local files

**Next**: Add more papers, customize settings, build apps!

📚 [LangChain Docs](https://docs.langchain.com/) | 🤖 [OpenAI API](https://platform.openai.com/docs) | 🔍 [Chroma Docs](https://docs.trychroma.com/)