# Chat with PDFs Application

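This Google Colab notebook lets you upload PDF files and ask questions about their content using AI models. It reads the text layer of each PDF, so scanned, image-only files need OCR before they can be queried.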

**Works with text-based PDFs from any domain:**
- 📄 Utility bills, invoices, receipts
- 📋 Contracts, legal documents
- 📊 Reports, presentations, spreadsheets
- 📚 Books, research papers, articles
- 🏥 Medical records, prescriptions
- 📝 Forms, applications, certificates
- 💼 Business documents, memos
- Any other PDF with extractable text

**Features:**
- Upload and process multiple PDF files of any type
- Extract and analyze text from any domain
- Ask questions in natural language
- Get answers with source citations (file name and page number)
- Works with GPT-3.5 or open-source models


## Step 1: Install Required Packages


In [None]:
%pip install -q langchain langchain-community langchain-openai
%pip install -q pypdf chromadb
%pip install -q sentence-transformers
%pip install -q openai tiktoken
%pip install -q transformers accelerate bitsandbytes
%pip install -q faiss-cpu


## Step 2: Import Libraries


In [None]:
import os
import warnings
warnings.filterwarnings('ignore')

from google.colab import files
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_community.llms import HuggingFaceHub

print("✅ All libraries imported successfully!")


## Step 3: Configuration

Choose which model you want to use:
- **Option 1:** OpenAI GPT-3.5-turbo (requires API key)
- **Option 2:** Open-source models from HuggingFace (free, but may be slower; the hosted model used in Step 6 needs a HuggingFace access token)


In [None]:
# Configuration
from getpass import getpass  # reads secrets without echoing them into the cell output

USE_OPENAI = False  # Set to True if you want to use GPT-3.5, False for open-source

if USE_OPENAI:
    # If using OpenAI, enter your API key
    OPENAI_API_KEY = getpass("Enter your OpenAI API key: ")
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
    print("✅ OpenAI API key set!")
else:
    # The HuggingFaceHub wrapper used in Step 6 calls the hosted Inference API,
    # so it needs an access token in HUGGINGFACEHUB_API_TOKEN
    print("Using open-source models from HuggingFace (access token required)")
    HUGGINGFACE_TOKEN = getpass("Enter your HuggingFace token: ")
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACE_TOKEN
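
Re-running the configuration cell asks for the key again every time. A minimal sketch of an environment-first lookup is shown below; `resolve_secret` and its parameters are hypothetical names introduced for illustration, not part of LangChain or Colab.

```python
import os
from getpass import getpass


def resolve_secret(env_name, prompt, environ=os.environ, ask=getpass):
    """Return the secret stored under env_name, prompting only when it is missing.

    `environ` and `ask` are injectable so the function can be exercised
    without touching the real environment or blocking on user input.
    """
    value = environ.get(env_name, "").strip()
    if not value:
        value = ask(prompt).strip()
    if not value:
        raise ValueError(f"{env_name} is empty")
    return value


# Possible use in the configuration cell (assumption, shown as a comment):
# os.environ["OPENAI_API_KEY"] = resolve_secret("OPENAI_API_KEY", "Enter your OpenAI API key: ")
```

Checking the environment first means the prompt appears once per session, and the empty-value check surfaces a missing key here rather than later in Step 6.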


## Step 4: Upload PDF Files


In [None]:
# Upload PDF files
print("📁 Please upload your PDF files...")
uploaded = files.upload()

# Save uploaded files
pdf_files = []
for filename in uploaded.keys():
    if filename.lower().endswith('.pdf'):  # also accept .PDF, .Pdf, etc.
        pdf_files.append(filename)
        print(f"✅ Uploaded: {filename}")

print(f"\n📚 Total PDF files uploaded: {len(pdf_files)}")
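
The extension check above trusts the file name alone. Since `files.upload()` returns a dictionary mapping each file name to its raw bytes, the content can be checked as well. The sketch below assumes a well-formed PDF begins with the `%PDF-` marker; `select_pdfs` is a hypothetical helper, and the `fake_upload` dictionary stands in for a real upload.

```python
def select_pdfs(uploaded):
    """Return the names of uploaded files that look like PDFs.

    `uploaded` maps file names to bytes, mirroring the shape of the
    dictionary returned by google.colab's files.upload().
    """
    selected = []
    for name, data in uploaded.items():
        has_pdf_extension = name.lower().endswith(".pdf")
        has_pdf_header = data[:5] == b"%PDF-"  # assumed file signature
        if has_pdf_extension and has_pdf_header:
            selected.append(name)
    return selected


# Stand-in for a real upload: one valid PDF, one renamed text file, one image
fake_upload = {
    "Invoice.PDF": b"%PDF-1.7\n...",
    "notes.pdf": b"plain text saved with the wrong extension",
    "photo.png": b"\x89PNG\r\n",
}
pdf_names = select_pdfs(fake_upload)
print(pdf_names)
```

Rejecting mislabeled files at upload time gives a clearer message than a parser error in Step 5.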


## Step 5: Process PDFs and Create Vector Store


In [None]:
# Load and split PDFs
print("📖 Loading and processing PDFs...")

all_documents = []

for pdf_file in pdf_files:
    loader = PyPDFLoader(pdf_file)
    documents = loader.load()
    all_documents.extend(documents)
    print(f"✅ Loaded {len(documents)} pages from {pdf_file}")

print(f"\n📄 Total pages loaded: {len(all_documents)}")

# Split documents into chunks
print("\n✂️ Splitting documents into chunks...")
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

chunks = text_splitter.split_documents(all_documents)
print(f"✅ Created {len(chunks)} text chunks")
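
To see what `chunk_size` and `chunk_overlap` control, the sketch below splits a string into fixed-width windows. It is a deliberate simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and lines, so its real chunks are usually shorter and end at more natural boundaries. `split_with_overlap` is a hypothetical name used only for this illustration.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))  # 2,500 characters
pieces = split_with_overlap(sample)
print([len(p) for p in pieces])  # [1000, 1000, 900]
```

The shared 200 characters keep a sentence that straddles a boundary intact in at least one chunk.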


In [None]:
# Create embeddings and vector store
print("\n🔢 Creating embeddings (this may take a few minutes)...")

# Using HuggingFace sentence-transformers embeddings (free, computed locally on the CPU)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'}
)

# Create FAISS vector store
vectorstore = FAISS.from_documents(chunks, embeddings)
print("✅ Vector store created successfully!")
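
Conceptually, the vector store turns every chunk into a vector (384 numbers each for `all-MiniLM-L6-v2`) and answers a query by returning the chunks whose vectors lie closest to the query vector. The toy sketch below uses hand-written 3-dimensional vectors and Euclidean distance as an assumed metric; FAISS does the same kind of search with optimized index structures. `top_k` and `toy_index` are hypothetical names.

```python
import math


def top_k(query_vec, indexed, k=2):
    """Return the k (label, distance) pairs closest to query_vec."""
    scored = [(label, math.dist(query_vec, vec)) for label, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up vectors standing in for real embeddings
toy_index = [
    ("chunk about billing", (0.9, 0.1, 0.0)),
    ("chunk about delivery", (0.1, 0.8, 0.1)),
    ("chunk about refunds", (0.7, 0.2, 0.1)),
]

hits = top_k((1.0, 0.0, 0.0), toy_index, k=2)
for label, distance in hits:
    print(f"{label}: {distance:.3f}")
```

A smaller distance means a closer match, so the billing and refunds chunks rank ahead of the delivery chunk for this query.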


## Step 6: Setup Question-Answering Chain


In [None]:
# Create prompt template (works for ANY document type)
prompt_template = """You are a helpful AI assistant that answers questions about documents.
Use the following pieces of context from the uploaded PDF(s) to answer the question.
The document could be anything: a bill, invoice, contract, report, article, receipt, or any other type.

Be specific and extract exact information when available (numbers, dates, names, amounts, etc.).
If you don't know the answer based on the context, just say that you don't know - don't make up information.

Context:
{context}

Question: {question}

Helpful Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template, 
    input_variables=["context", "question"]
)

# Setup the language model
if USE_OPENAI:
    print("🤖 Setting up GPT-3.5-turbo...")
    llm = ChatOpenAI(
        model_name="gpt-3.5-turbo",
        temperature=0.7,
    )
else:
    print("🤖 Setting up open-source model (Google FLAN-T5-XXL)...")
    # Using FLAN-T5 which is good for question answering
    llm = HuggingFaceHub(
        repo_id="google/flan-t5-xxl",
        model_kwargs={"temperature": 0.7, "max_length": 512}
    )

# Create QA chain; the retriever passes the 4 most similar chunks to the model
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),  # raise k to give the model more context
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

print("✅ QA chain ready!")
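
The `"stuff"` chain type places all retrieved chunks into the single `{context}` slot of the prompt and sends one request to the model. A minimal sketch of that assembly follows, using a shortened template and assuming the chunks are joined with blank lines; `build_stuffed_prompt` is a hypothetical helper, not the LangChain implementation.

```python
TEMPLATE = (
    "Context:\n"
    "{context}\n\n"
    "Question: {question}\n\n"
    "Helpful Answer:"
)


def build_stuffed_prompt(chunks, question, separator="\n\n"):
    """Join the retrieved chunks and fill both template slots."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


retrieved = [
    "Invoice total: $120.50, due 2024-03-01.",
    "Payments are accepted by bank transfer.",
]
prompt = build_stuffed_prompt(retrieved, "When is the invoice due?")
print(prompt)
```

Because everything goes into one prompt, its length grows with both `k` and `chunk_size`, which matters for models with a small input window.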


## Step 7: Chat with Your PDFs!


In [None]:
def ask_question(question):
    """
    Ask a question about the uploaded PDFs
    """
    print(f"\n❓ Question: {question}")
    print("\n🔍 Searching through documents...\n")
    
    result = qa_chain.invoke({"query": question})
    
    print("💡 Answer:")
    print("-" * 80)
    print(result["result"])
    print("-" * 80)
    
    # Show source documents
    print("\n📚 Sources:")
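    for i, doc in enumerate(result["source_documents"], 1):
        source = doc.metadata.get('source', 'Unknown')
        page = doc.metadata.get('page')
        # PyPDFLoader pages are 0-indexed; only add 1 when a page number is present
        page_label = page + 1 if isinstance(page, int) else 'Unknown'
        print(f"  [{i}] {source} (Page {page_label})")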
    
    return result["result"]
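
Several retrieved chunks often come from the same page, so the source list can repeat itself. The sketch below collapses duplicates while keeping the retrieval order. It works on plain metadata dictionaries shaped like those `PyPDFLoader` attaches (`source` and a 0-indexed `page`); `unique_citations` is a hypothetical helper.

```python
def unique_citations(metadatas):
    """Return one 'file (Page n)' string per distinct source/page pair."""
    seen = set()
    citations = []
    for meta in metadatas:
        source = meta.get("source", "Unknown")
        page = meta.get("page")
        # Convert the 0-indexed page to a human-readable number when present
        label = page + 1 if isinstance(page, int) else "Unknown"
        key = (source, label)
        if key not in seen:
            seen.add(key)
            citations.append(f"{source} (Page {label})")
    return citations


example = [
    {"source": "bill.pdf", "page": 0},
    {"source": "bill.pdf", "page": 0},
    {"source": "contract.pdf", "page": 4},
    {"source": "notes.pdf"},
]
print(unique_citations(example))
```

Inside `ask_question`, the input could be built with `[doc.metadata for doc in result["source_documents"]]`.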


In [None]:
# Example usage - Ask your first question!
question = input("\n🗣️ Enter your question: ")
answer = ask_question(question)


## Step 8: Interactive Chat Loop (Optional)

Run this cell for a continuous chat experience. Type 'quit', 'exit', or 'q' to stop.


In [None]:
print("💬 Interactive Chat Mode")
print("Type 'quit', 'exit', or 'q' to stop\n")
print("=" * 80)

while True:
    question = input("\n🗣️ You: ")
    
    if question.strip().lower() in ['quit', 'exit', 'q']:  # ignore surrounding spaces
        print("\n👋 Thanks for chatting! Goodbye!")
        break
    
    if not question.strip():
        print("⚠️ Please enter a valid question.")
        continue
    
    try:
        answer = ask_question(question)
    except Exception as e:
        print(f"\n❌ Error: {str(e)}")
        print("Please try rephrasing your question.")


## Additional Features

### Get Similar Documents


In [None]:
def search_similar_content(query, k=5):
    """
    Search for similar content in the PDFs
    """
    print(f"\n🔍 Searching for content related to: '{query}'\n")
    
    docs = vectorstore.similarity_search(query, k=k)
    
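    for i, doc in enumerate(docs, 1):
        page = doc.metadata.get('page')
        # Pages are 0-indexed; avoid adding 1 to a missing value
        page_label = page + 1 if isinstance(page, int) else 'Unknown'
        print(f"\n[{i}] Source: {doc.metadata.get('source', 'Unknown')} (Page {page_label})")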
        print("-" * 80)
        print(doc.page_content[:300] + "..." if len(doc.page_content) > 300 else doc.page_content)
        print("-" * 80)

# Example usage
# search_similar_content("your search term here")


### Save Vector Store (Optional)

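Save the vector store so you don't need to reprocess PDFs next time. Colab clears local files when the runtime is recycled, so download the `pdf_vectorstore` folder or copy it to Google Drive if you want to keep it.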


In [None]:
# Save vector store
vectorstore.save_local("pdf_vectorstore")
print("✅ Vector store saved!")

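# To load it later (recent langchain-community versions require opting in to
# pickle deserialization; only do this for index files you created yourself):
# vectorstore = FAISS.load_local("pdf_vectorstore", embeddings, allow_dangerous_deserialization=True)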


## Tips for Better Results

1. **Be specific with your questions**: Instead of "What is this about?", try "What are the main conclusions about X?"

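2. **Ask follow-up questions**: The chain keeps no chat history, so make each follow-up self-contained. Repeat the names, dates, or section titles you mean instead of writing "it" or "that"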

3. **Model Selection**:
   - **GPT-3.5**: Better quality answers, faster, but requires API key and costs money
   - **FLAN-T5**: Free, open-source, good for factual questions, but may be slower

4. **For longer or more complex PDFs**: Consider increasing the `k` parameter in the retriever to search more chunks

5. **If answers are not good**: Try adjusting:
   - `chunk_size`: Larger chunks (1500-2000) for more context
   - `chunk_overlap`: More overlap (300-400) for better continuity
   - `temperature`: Lower (0.3-0.5) for more focused answers
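
The `k`, `chunk_size`, and `chunk_overlap` settings interact, and a quick estimate shows how. The sketch below assumes fixed-width chunks, so the counts are rough lower bounds: the real splitter breaks on separators and often produces more, shorter chunks. Both function names are hypothetical.

```python
import math


def estimate_chunk_count(text_length, chunk_size, chunk_overlap):
    """Approximate number of chunks for text_length characters."""
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new characters covered per extra chunk
    return 1 + math.ceil((text_length - chunk_size) / step)


def max_context_chars(k, chunk_size):
    """Upper bound on the context characters sent to the model per question."""
    return k * chunk_size


# A 50,000-character document with the notebook defaults vs. larger chunks
print(estimate_chunk_count(50_000, 1000, 200))  # 63
print(estimate_chunk_count(50_000, 1500, 300))  # 42
print(max_context_chars(4, 1000))               # 4000
print(max_context_chars(4, 2000))               # 8000
```

Larger chunks mean fewer vectors to embed but a longer prompt per question, so raising `chunk_size` and `k` together can exceed what a small model accepts.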
