<a href="https://colab.research.google.com/github/mbb15761/SEIS767/blob/main/2025Fall_SEIS767_FinalProject_YanNi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Complete Cleaning and fresh start

import os
import glob

# 1. Delete ALL PDF files
print("Cleaning up all PDF files...")
pdf_files = glob.glob("/content/*.pdf")
for pdf_file in pdf_files:
    os.remove(pdf_file)
    print(f"Deleted: {pdf_file}")

# 2. Delete the vector store directories
print("Cleaning up vector stores...")
vector_dirs = glob.glob("/content/chroma_db*")
for dir_path in vector_dirs:
    !rm -rf {dir_path}
    print(f"Deleted: {dir_path}")

print("Cleanup complete!")
!ls -la /content/

Cleaning up all PDF files...
Deleted: /content/s41586-025-08628-5.pdf
Cleaning up vector stores...
Cleanup complete!
total 16
drwxr-xr-x 1 root root 4096 Dec  4 23:47 .
drwxr-xr-x 1 root root 4096 Dec  4 23:43 ..
drwxr-xr-x 4 root root 4096 Nov 20 14:30 .config
drwxr-xr-x 1 root root 4096 Nov 20 14:30 sample_data


## **Phase 1**: Upload a PDF and split it into chunks

### **Step 1**: Library installation
* *pypdf* is the Python library used to read and parse PDF files
* *langchain-community* provides the PDF document loader, and *langchain* provides the text splitter

In [2]:
!pip install -q pypdf==3.17.0
!pip install -q langchain==0.1.0 langchain-community==0.0.10

### **Step 2**: Import Specific Components
* *PyPDFLoader* is a LangChain class that converts each PDF page into a Document object with page_content and metadata
* *RecursiveCharacterTextSplitter* splits long text into small chunks
* *google.colab.files* is Colab's built-in file upload utility


In [3]:
from langchain_community.document_loaders import PyPDFLoader  # loaders live in langchain-community; the old langchain path is deprecated
from langchain.text_splitter import RecursiveCharacterTextSplitter
from google.colab import files

### **Step 3**: File upload
* Upload the PDF file and get its file name

In [4]:
print("📤 Please upload your PDF file:")
uploaded = files.upload()

📤 Please upload your PDF file:


Saving s41586-025-08628-5.pdf to s41586-025-08628-5.pdf


In [5]:
# Get the uploaded filename
pdf_filename = list(uploaded.keys())[0] # uploaded returns a dictionary where keys are filenames and values are file content
print(f"✅ Uploaded: {pdf_filename}")

✅ Uploaded: s41586-025-08628-5.pdf


### **Step 4**: Document Processing Function
* Define a function that loads the document, splits it into small chunks, removes duplicate chunks, and prints a few sample chunks
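
To make the `chunk_size` and `chunk_overlap` settings used in the function below concrete, here is a minimal sketch of overlap-based splitting in plain Python. It is a simplification under stated assumptions: `split_with_overlap` is a hypothetical helper, not part of LangChain, and it cuts at fixed character offsets, whereas *RecursiveCharacterTextSplitter* prefers to break at separators such as paragraph and line breaks.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters.

    Hypothetical helper for illustration only (not a LangChain API).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

In this sketch, the notebook's settings (1000 and 200) would advance the window by 800 characters at a time, so the last 200 characters of one chunk reappear at the start of the next.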

In [6]:
def process_pdf_dedup(pdf_path):
    """Loads and splits a PDF into chunks."""
    print("🚀 Starting PDF processing...")

    # 1. LOADING
    print("📖 Loading PDF pages...")
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()   # reads every page and converts each one into a Document object with page_content and metadata
    print(f"   → Loaded {len(documents)} pages")
    print(f"   → Sample document metadata: {documents[0].metadata}")
    print(f"   → First page content preview: {documents[0].page_content[:100]}...")


    # 2. SPLITTING
    print("✂️  Splitting into chunks...")
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,  # 1000 characters is approximately 250-300 tokens
        chunk_overlap=200 # keeps context that would otherwise be cut at a chunk boundary (each page is split separately)
    )
    chunks = text_splitter.split_documents(documents)
    print(f"   → Created {len(chunks)} chunks")

    # 3. DEDUPLICATION
    print("Removing duplicate chunks...")
    unique_chunks = []
    seen_content = set()

    for chunk in chunks:
        # Create a unique identifier for this chunk
        # Use first 300 chars + page number to identify duplicates
        content_key = f"page{chunk.metadata.get('page', 0)}:{chunk.page_content[:300].strip()}"

        if content_key not in seen_content:
            unique_chunks.append(chunk)
            seen_content.add(content_key)
        else:
            print(f"Removed duplicate chunk from page {chunk.metadata.get('page', 0)}")

    print(f"After deduplication: {len(unique_chunks)} unique chunks")

    # 4. VERIFICATION
    print("\nSample of unique chunks:")
    for i in range(min(3, len(unique_chunks))):
        print(f"Chunk {i+1}: Page {unique_chunks[i].metadata.get('page', 'Unknown')}")
        print(f"Content: {unique_chunks[i].page_content[:150]}...")
        print()

    return unique_chunks

# Execute Phase 1 with deduplication
chunks = process_pdf_dedup(pdf_filename)
print(f"Phase 1 Complete! Ready {len(chunks)} unique chunks for vector storage.")

🚀 Starting PDF processing...
📖 Loading PDF pages...




   → Loaded 10 pages
   → Sample document metadata: {'source': 's41586-025-08628-5.pdf', 'page': 0}
   → First page content preview: 624 | Nature | Vol 639 | 20 March 2025
ArticleA generative model for inorganic materials 
design
Cla...
✂️  Splitting into chunks...
   → Created 64 chunks
Removing duplicate chunks...
After deduplication: 64 unique chunks

Sample of unique chunks:
Chunk 1: Page 0
Content: 624 | Nature | Vol 639 | 20 March 2025
ArticleA generative model for inorganic materials 
design
Claudio Zeni1,8, Robert Pinsler1,8, Daniel Zügner2,8,...

Chunk 2: Page 0
Content: given desired property constraints, but current methods have a low success rate in 
proposing stable crystals or can satisfy only a limited set of pro...

Chunk 3: Page 0
Content: we synthesize one of the generated structures and measure its property value to be 
within 20% of our target. We believe that the quality of generated...

Phase 1 Complete! Ready 64 unique chunks for vector storage.


After finishing Phase 1 we have the following:
1. Loading: the PDF file is loaded into a list of Document objects (one per page)

2. Splitting: each page Document is split into smaller chunk Documents

3. Deduplication: chunks whose page number and first 300 characters match an earlier chunk are dropped (none were found here, so all 64 chunks remain)

4. Inspection: we examine the first 3 chunks to verify that everything worked
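
The duplicate check in `process_pdf_dedup` keys each chunk on its page number plus the first 300 characters. A minimal sketch of the same idea on plain tuples follows. `dedup_key` and `drop_duplicates` are hypothetical helpers, the whitespace normalization and SHA-256 digest are assumptions added for illustration, and the toy chunks are made-up data.

```python
import hashlib


def dedup_key(page: int, text: str, prefix_chars: int = 300) -> str:
    """Key = page number + digest of a lowercased, whitespace-collapsed text prefix."""
    normalized = " ".join(text[:prefix_chars].lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"page{page}:{digest}"


def drop_duplicates(chunks: list[tuple[int, str]]) -> list[tuple[int, str]]:
    """Keep the first (page, text) pair seen for each key, preserving order."""
    seen = set()
    unique = []
    for page, text in chunks:
        key = dedup_key(page, text)
        if key not in seen:
            seen.add(key)
            unique.append((page, text))
    return unique


# Toy data: the second entry differs from the first only in whitespace.
toy_chunks = [
    (0, "A generative model for inorganic materials design"),
    (0, "A  generative model for\ninorganic materials design"),
    (1, "A generative model for inorganic materials design"),
]
unique = drop_duplicates(toy_chunks)
print(len(unique))
```

Because the page number is part of the key, identical text on two different pages is kept twice; whether that is desirable depends on the document.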


## **Phase 2**: Embeddings and Vector databases
In this part, we convert the small document chunks from Phase 1 into vectors (embeddings) that can be searched quickly by similarity. We use the embedding model *all-mpnet-base-v2* from Hugging Face (sentence-transformers).
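
As a rough illustration of what searching a vector space means, the sketch below ranks a few stored vectors by cosine similarity to a query vector. The three-dimensional vectors and labels are made up, and `cosine_similarity` and `top_k` are hypothetical helpers. Real embeddings have hundreds of dimensions, and a vector database such as Chroma uses its own index and distance settings rather than this brute-force loop.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], stored: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the labels of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:k]]


# Toy "vector store": (label, embedding) pairs with made-up values.
toy_store = [
    ("chunk about diffusion", [0.9, 0.1, 0.0]),
    ("chunk about stability", [0.1, 0.9, 0.0]),
    ("chunk about funding", [0.0, 0.1, 0.9]),
]
query = [0.8, 0.2, 0.0]
nearest = top_k(query, toy_store, k=2)
print(nearest)
```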

### **Step 1:** Library installation and import specific components

In [7]:
!pip uninstall -y sentence-transformers huggingface-hub chromadb
!pip install -q sentence-transformers  # Newer version that doesn't need cached_download
!pip install -q huggingface-hub
!pip install -q chromadb

Found existing installation: sentence-transformers 5.1.2
Uninstalling sentence-transformers-5.1.2:
  Successfully uninstalled sentence-transformers-5.1.2
Found existing installation: huggingface-hub 0.36.0
Uninstalling huggingface-hub-0.36.0:
  Successfully uninstalled huggingface-hub-0.36.0
Found existing installation: chromadb 1.3.5
Uninstalling chromadb-1.3.5:
  Successfully uninstalled chromadb-1.3.5


In [8]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

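### **Step 2:** Set up the embedding model and create the vector database
* Cell `In [9]` loads the embedding model and builds a Chroma vector database from the chunks, persisted to `./chroma_db`

### **Step 3:** Test the retrieval system
* Cell `In [10]` runs a few sample queries against the vector database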

In [9]:
# Embedding Model Setup
print("Loading embedding model...")
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
print("Model loaded: all-mpnet-base-v2")

# Vector Database Creation
print("Creating vector database...")
vector_store = Chroma.from_documents(
    documents=chunks,  # Now using deduplicated chunks
    embedding=embeddings,
    persist_directory="./chroma_db"
)
print("Vector database created and saved to disk")

Loading embedding model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Model loaded: all-mpnet-base-v2
Creating vector database...
Vector database created and saved to disk


In [10]:
# Improved Test Retrieval System
print("Testing retrieval system with improved parameters...")
test_queries = [
    "What is matterGen?",
    "What does diffusion work",
    "Conclusion"
]

for query in test_queries:
    print(f"Testing query: '{query}'")

    # Try different search strategies

    # Strategy 1: Regular search with more results
    print("  Strategy 1: Regular search (k=5)")
    results_regular = vector_store.similarity_search(query, k=5)
    unique_regular = []
    seen_regular = set()
    for doc in results_regular:
        content_start = doc.page_content[:150]
        if content_start not in seen_regular:
            unique_regular.append(doc)
            seen_regular.add(content_start)

    for i, doc in enumerate(unique_regular[:3]):  # Show up to 3
        print(f"    Result {i+1}: Page {doc.metadata.get('page', 'Unknown')}")
        print(f"    Preview: {doc.page_content[:80]}...")

    # Strategy 2: Search with score threshold
    print("  Strategy 2: Search with similarity threshold")
    results_threshold = vector_store.similarity_search_with_relevance_scores(query, k=5)

    # Filter by similarity score (only show results with decent similarity)
    good_results = [(doc, score) for doc, score in results_threshold if score > 0.2]

    if good_results:
        print(f"    Found {len(good_results)} results with similarity > 0.2")
        for i, (doc, score) in enumerate(good_results[:3]):
            print(f"    Result {i+1}: Page {doc.metadata.get('page', 'Unknown')} (score: {score:.3f})")
            print(f"    Preview: {doc.page_content[:80]}...")
    else:
        print("    No results with good similarity scores")

    print()  # Empty line between queries

# Database Statistics
print("Database Statistics:")
print(f"Total unique chunks stored: {vector_store._collection.count()}")

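# Additional analysis
# Note: the empty query returns the 20 chunks nearest to the embedding of "",
# so this is a rough sample of the store, not its full page distribution.
print("\nContent Distribution Analysis:")
all_chunks_sample = vector_store.similarity_search("", k=20)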
page_distribution = {}
for chunk in all_chunks_sample:
    page = chunk.metadata.get('page', 'Unknown')
    page_distribution[page] = page_distribution.get(page, 0) + 1

print("Pages represented in sample:")
for page, count in sorted(page_distribution.items()):
    print(f"  Page {page}: {count} chunks")

print("Phase 2 Complete! Vector database ready for Q&A system.")

Testing retrieval system with improved parameters...
Testing query: 'What is matterGen?'
  Strategy 1: Regular search (k=5)
    Result 1: Page 9
    Preview: The source code for MatterGen is available at GitHub (https://github.
com/micros...
    Result 2: Page 2
    Preview: analysis). To assess stability, we perform DFT calculations on 1,024 
generated ...
    Result 3: Page 2
    Preview: unique structures is 100% when generating 1,000 structures and only 
drops to 52...
  Strategy 2: Search with similarity threshold
    Found 3 results with similarity > 0.2
    Result 1: Page 9 (score: 0.255)
    Preview: The source code for MatterGen is available at GitHub (https://github.
com/micros...
    Result 2: Page 2 (score: 0.235)
    Preview: analysis). To assess stability, we perform DFT calculations on 1,024 
generated ...
    Result 3: Page 2 (score: 0.228)
    Preview: unique structures is 100% when generating 1,000 structures and only 
drops to 52...

Testing query: 'What does diffusi



## **Phase 3**: Question & Answer System
With the vector store from Phase 2 in place, we now build the question-answering system that combines retrieval with generation. The Retrieval-Augmented Generation (RAG) flow has four steps: convert the user question to a vector, find the most relevant document chunks via vector search, send the retrieved context together with the question to the LLM, and have the generative LLM produce an answer based on that context.
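
The flow above can be sketched as one function that takes the retrieval and generation steps as parameters. This is a minimal sketch, not the notebook's implementation: `rag_answer`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the stand-ins return canned text so the example runs without a vector store or a model.

```python
from typing import Callable


def rag_answer(
    question: str,
    retrieve: Callable[[str, int], list[str]],
    generate: Callable[[str], str],
    k: int = 3,
) -> tuple[str, list[str]]:
    """Retrieve k passages, build a prompt from them, and pass it to the generator."""
    passages = retrieve(question, k)  # embedding + vector search happen in here
    context = "\n\n".join(f"[Source {i}]: {p}" for i, p in enumerate(passages, 1))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    return generate(prompt), passages


# Stand-ins (assumptions for illustration) so the sketch runs without any model.
def fake_retrieve(question: str, k: int) -> list[str]:
    corpus = ["passage one", "passage two", "passage three", "passage four"]
    return corpus[:k]


def fake_generate(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"


answer, used = rag_answer("What is MatterGen?", fake_retrieve, fake_generate, k=2)
print(answer)
print(used)
```

Passing the two steps in as callables keeps the wiring separate from the choice of vector store and language model, so either can be swapped without touching the flow.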

### Step 1: Install libraries and import components

In [11]:
# Install Q&A system libraries
!pip install -q transformers torch accelerate
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
import torch

### Step 2: Load the language model

In [12]:
print("Loading language model...")

# Use Google's Flan-T5 model - good for Q&A and free
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

print("Model loaded: Flan-T5 Base")
print("Model type: Text-to-Text Generation")

Loading language model...


Model loaded: Flan-T5 Base
Model type: Text-to-Text Generation


### Step 3: Create the Q&A Pipeline

In [13]:
print("Creating Q&A pipeline...")

# Create a text generation pipeline
qa_pipeline = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0.3,
    do_sample=True,
    device=0 if torch.cuda.is_available() else -1  # Use GPU if available
)

print("Q&A pipeline ready!")

Device set to use cuda:0


Creating Q&A pipeline...
Q&A pipeline ready!


### Step 4: Create Prompt Template

In [14]:
def create_qa_prompt(question, context_chunks):
    """
    Creates a formatted prompt that combines context and question.
    This is where we tell the model HOW to answer.
    """

    # Combine all context chunks into one string
    context_text = ""
    for i, chunk in enumerate(context_chunks):
        page = chunk.metadata.get('page', 'Unknown')
        context_text += f"[Source {i+1}, Page {page}]: {chunk.page_content}\n\n"

    # Create the instruction prompt
    prompt = f"""Based on the following context, answer the question.
If the answer cannot be found in the context, say "I cannot find the answer in the provided documents."

Context:
{context_text}

Question: {question}

Answer:"""

    return prompt
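
One thing to watch with this prompt: the Phase 3 run logs a warning that the tokenized prompt is longer than the model's 512-token limit (878 > 512). A minimal sketch of trimming the retrieved context to a budget before building the prompt is shown here. It counts characters as a rough proxy for tokens (an assumption; an exact version would count with the tokenizer), and `fit_context_to_budget` is a hypothetical helper, not part of the notebook.

```python
def fit_context_to_budget(passages: list[str], max_chars: int) -> list[str]:
    """Keep passages in ranked order until the character budget is spent.

    The last passage that fits only partially is truncated; later ones are dropped.
    Hypothetical helper for illustration only.
    """
    kept = []
    remaining = max_chars
    for passage in passages:
        if remaining <= 0:
            break
        kept.append(passage[:remaining])
        remaining -= len(kept[-1])
    return kept


# Toy passages standing in for retrieved chunks, best match first.
ranked = ["a" * 600, "b" * 600, "c" * 600]
trimmed = fit_context_to_budget(ranked, max_chars=1000)
print([len(p) for p in trimmed])
```

Trimming the context rather than the whole prompt keeps the question and the final `Answer:` cue intact.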

### Step 5: Create the answer generation function

In [15]:
def answer_question(question, vector_store, num_context_chunks=3):
    """
    Main function that answers questions using RAG.

    Steps:
    1. Retrieve relevant context from vector store
    2. Create prompt with context + question
    3. Generate answer using language model
    4. Return answer with sources
    """

    print(f"Processing question: {question}")

    # Step 1: Retrieve relevant context
    print("  Searching for relevant context...")

    # Convert question to embedding and search
    if hasattr(vector_store, 'query'):
        # Direct ChromaDB approach: embed the question with the Phase 2 embedding model.
        # (The global `model` is the Flan-T5 generator and has no encode() method.)
        query_embedding = [embeddings.embed_query(question)]
        results = vector_store.query(
            query_embeddings=query_embedding,
            n_results=num_context_chunks
        )

        # Simple object that mimics a LangChain Document
        class SimpleDoc:
            def __init__(self, content, metadata):
                self.page_content = content
                self.metadata = metadata

        context_chunks = [
            SimpleDoc(content, metadata or {})
            for content, metadata in zip(results['documents'][0], results['metadatas'][0])
        ]
    else:
        # LangChain approach
        context_chunks = vector_store.similarity_search(question, k=num_context_chunks)

    print(f"  Found {len(context_chunks)} relevant context chunks")

    # Step 2: Create the prompt
    prompt = create_qa_prompt(question, context_chunks)

    # Step 3: Generate answer
    print("  Generating answer...")
    response = qa_pipeline(
        prompt,
        max_length=400,
        num_return_sequences=1,
        temperature=0.1,
        repetition_penalty=1.2
    )

    answer = response[0]['generated_text']
    print("  Answer generated successfully!")

    return answer, context_chunks

### Step 6: Test the Q&A System

In [16]:
def test_qa_system(vector_store):
    """Comprehensive testing of the Q&A system"""

    print("=" * 70)
    print("TESTING Q&A SYSTEM")
    print("=" * 70)

    test_questions = [
        "What is MatterGen?",
        "How does the diffusion process work?",
        "What are the main results or findings?",
        "What materials were generated successfully?",
        "How does MatterGen compare to other methods?"
    ]

    for i, question in enumerate(test_questions, 1):
        print(f"\n{'='*50}")
        print(f"TEST {i}: {question}")
        print('='*50)

        answer, sources = answer_question(question, vector_store)

        print(f"\nANSWER: {answer}")

        print(f"\nSOURCES (showing where the answer came from):")
        for j, source in enumerate(sources, 1):
            page = source.metadata.get('page', 'Unknown')
            print(f"  {j}. Page {page}: {source.page_content[:120]}...")

In [17]:
# Execute Phase 3
print("Starting Phase 3 execution...")
test_qa_system(vector_store)
print("\n🎉 PHASE 3 COMPLETE! Q&A system is fully functional.")

Token indices sequence length is longer than the specified maximum sequence length for this model (878 > 512). Running this sequence through the model will result in indexing errors


Starting Phase 3 execution...
TESTING Q&A SYSTEM

TEST 1: What is MatterGen?
Processing question: What is MatterGen?
  Searching for relevant context...
  Found 3 relevant context chunks
  Generating answer...
  Answer generated successfully!

ANSWER: The source code for MatterGen is available at GitHub (https://github. com/microsoft/mattergen).

SOURCES (showing where the answer came from):
  1. Page 9: The source code for MatterGen is available at GitHub (https://github.
com/microsoft/mattergen).
Acknowledgements We than...
  2. Page 2: analysis). To assess stability, we perform DFT calculations on 1,024 
generated structures. Figure  2b shows that 78% of...
  3. Page 2: unique structures is 100% when generating 1,000 structures and only 
drops to 52% after generating 10 million structures...

TEST 2: How does the diffusion process work?
Processing question: How does the diffusion process work?
  Searching for relevant context...
  Found 3 relevant context chunks
  Generating answer.

## **Phase 4**: Web interface

In [18]:
# === PHASE 4: WEB INTERFACE ===
!pip install -q gradio
import gradio as gr

print("Creating Document Q&A Web Interface...")

# Create interface function
def answer_question_interface(question):
    """
    Interface wrapper for the answer_question function
    """
    try:
        # Use the vector_store from previous phases
        answer, sources = answer_question(question, vector_store)

        # Format the response with sources
        response = f"**Answer:** {answer}\n\n"

        if sources:
            response += "**Sources from document:**\n"
            for i, source in enumerate(sources, 1):
                page = source.metadata.get('page', 'Unknown')
                content_preview = source.page_content[:150] + "..." if len(source.page_content) > 150 else source.page_content
                response += f"{i}. Page {page}: {content_preview}\n"
        else:
            response += "*No sources found in document*"

        return response

    except NameError:
        return "⚠️ Please run Phases 1-3 first to load the document and create the vector store."
    except Exception as e:
        return f"Error: {str(e)}"

# Create the Gradio interface
interface = gr.Interface(
    fn=answer_question_interface,
    inputs=gr.Textbox(
        label="Question about Material Science Paper",
        placeholder="Ask about the research, methods, findings...",
        lines=3
    ),
    outputs=gr.Textbox(
        label="Answer with Document Sources",
        lines=10,
        show_copy_button=True
    ),
    title="📚 Material Science Research Q&A",
    description="Ask questions about your uploaded material science research papers. Answers are generated from the document content with citations.",
    examples=[
        ["What is the main research topic?"],
        ["What methodology was used?"],
        ["What are the key findings?"],
        ["What materials were studied?"],
        ["How does this compare to previous work?"]
    ],
    theme=gr.themes.Soft()
)

print("✅ Interface created successfully!")
print("\n" + "="*60)
print("🌐 LAUNCHING MATERIAL SCIENCE Q&A SYSTEM")
print("="*60)
print("Make sure you've:")
print("1. Run Phase 1-3 to process documents")
print("2. Have 'vector_store' variable available")
print("="*60 + "\n")

# Launch the interface
interface.launch(share=True)

Creating Document Q&A Web Interface...
✅ Interface created successfully!

🌐 LAUNCHING MATERIAL SCIENCE Q&A SYSTEM
Make sure you've:
1. Run Phase 1-3 to process documents
2. Have 'vector_store' variable available

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://b9201ace80deff4ecb.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




## Dynamic Document Q&A System

In [19]:
!pip install -q gradio pypdf langchain-community sentence-transformers chromadb transformers torch

import gradio as gr
import tempfile
import os
import torch
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers import pipeline

class CompleteDocumentQA:
    """
    Complete dynamic document Q&A system
    """

    def __init__(self):
        self.vector_store = None
        self.embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2"
        )
        self.qa_pipeline = pipeline(
            "text2text-generation",
            model="google/flan-t5-base",
            max_length=300,
            do_sample=True,   # temperature is ignored unless sampling is enabled
            temperature=0.1,
            device=0 if torch.cuda.is_available() else -1
        )
        self.current_doc_info = None

    def process_pdf(self, pdf_file):
        """Process uploaded PDF and create searchable database"""
        try:
            # Save uploaded file
            with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp_file:
                if hasattr(pdf_file, 'read'):
                    tmp_file.write(pdf_file.read())
                else:
                    import shutil
                    shutil.copy(pdf_file.name, tmp_file.name)
                temp_path = tmp_file.name

            # Load and process PDF
            loader = PyPDFLoader(temp_path)
            documents = loader.load()

            # Split into chunks
            splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
            chunks = splitter.split_documents(documents)

            # Remove duplicates
            unique_chunks = []
            seen = set()
            for chunk in chunks:
                key = f"page{chunk.metadata.get('page', 0)}:{chunk.page_content[:300]}"
                if key not in seen:
                    unique_chunks.append(chunk)
                    seen.add(key)

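            # Create vector store. Drop the previous upload's collection first so
            # chunks from earlier documents do not mix into the new document's results.
            if self.vector_store is not None:
                self.vector_store.delete_collection()
            self.vector_store = Chroma.from_documents(
                documents=unique_chunks,
                embedding=self.embeddings,
                persist_directory="./dynamic_docs"
            )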

            # Store document info
            self.current_doc_info = {
                'name': os.path.basename(pdf_file.name) if hasattr(pdf_file, 'name') else "document.pdf",
                'pages': len(documents),
                'chunks': len(unique_chunks)
            }

            # Cleanup
            os.unlink(temp_path)

            return f"✅ Ready! Processed {len(documents)} pages into {len(unique_chunks)} chunks. Ask away!"

        except Exception as e:
            return f"❌ Error: {str(e)}"

    def ask_question(self, question):
        """Answer questions about the current document"""
        if not self.vector_store:
            return "Please upload a PDF document first!"

        # Get relevant context
        contexts = self.vector_store.similarity_search(question, k=3)

        # Build prompt
        context_text = "\n".join([
            f"[Page {c.metadata.get('page', '?')}]: {c.page_content[:400]}..."
            for c in contexts
        ])

        prompt = f"""Answer based on this context:

{context_text}

Question: {question}

Answer:"""

        # Generate answer
        response = self.qa_pipeline(prompt, max_length=200)
        answer = response[0]['generated_text'].strip()

        # Format with sources
        result = f"**Answer:** {answer}\n\n**Sources:**\n"
        for i, ctx in enumerate(contexts, 1):
            result += f"{i}. Page {ctx.metadata.get('page', '?')}\n"

        return result

# Initialize and launch
system = CompleteDocumentQA()

with gr.Blocks(theme=gr.themes.Soft()) as app:
    gr.Markdown("# 📚 Document Q&A - Upload & Ask")

    with gr.Row():
        with gr.Column():
            file = gr.File(label="Upload PDF", file_types=[".pdf"])
            status = gr.Textbox(label="Status", interactive=False)
            process_btn = gr.Button("Process PDF", variant="primary")

        with gr.Column():
            question = gr.Textbox(label="Your Question", lines=3)
            answer = gr.Textbox(label="Answer", lines=6)
            ask_btn = gr.Button("Ask Question", variant="primary")

    # Events
    process_btn.click(system.process_pdf, [file], [status])
    ask_btn.click(system.ask_question, [question], [answer])
    question.submit(system.ask_question, [question], [answer])

print("🌐 Launching dynamic document Q&A system...")
app.launch(share=True)

Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
  with gr.Blocks(theme=gr.themes.Soft()) as app:


🌐 Launching dynamic document Q&A system...
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://bdd10484c90bd42f60.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




In [20]:
# === ENHANCED DOCUMENT Q&A ===

!pip install -q gradio pypdf langchain-community sentence-transformers chromadb transformers torch

import gradio as gr
import tempfile
import os
import torch
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers import pipeline
import re

class EnhancedDocumentQA:

    def __init__(self):
        self.vector_store = None
        self.current_doc_info = None

        print("🚀 Loading enhanced models for T4 GPU...")

        # Better embedding model - still fits in T4 memory
        self.embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-mpnet-base-v2",  # Better than MiniLM
            model_kwargs={'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
        )

        # Enhanced Q&A model - larger but still T4-compatible
        try:
            self.qa_pipeline = pipeline(
                "text2text-generation",
                model="google/flan-t5-large",  # Larger and more capable
                max_length=400,
                do_sample=True,   # temperature is ignored unless sampling is enabled
                temperature=0.1,
                device=0 if torch.cuda.is_available() else -1,
                torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
            )
            print("✅ Loaded Flan-T5-Large (enhanced model)")
        except Exception as e:
            print(f"⚠️ Flan-T5-Large failed, using base: {e}")
            self.qa_pipeline = pipeline(
                "text2text-generation",
                model="google/flan-t5-base",
                max_length=300,
                do_sample=True,   # temperature is ignored unless sampling is enabled
                temperature=0.1,
                device=0 if torch.cuda.is_available() else -1
            )
            print("✅ Loaded Flan-T5-Base (reliable)")

    def process_pdf(self, pdf_file):
        """Enhanced PDF processing with better chunking"""
        try:
            # Save uploaded file
            with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp_file:
                if hasattr(pdf_file, 'read'):
                    tmp_file.write(pdf_file.read())
                else:
                    import shutil
                    shutil.copy(pdf_file.name, tmp_file.name)
                temp_path = tmp_file.name

            # Load and process PDF
            loader = PyPDFLoader(temp_path)
            documents = loader.load()

            # Enhanced chunking strategy
            splitter = RecursiveCharacterTextSplitter(
                chunk_size=800,  # Smaller chunks for better precision
                chunk_overlap=150,
                separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""]
            )
            chunks = splitter.split_documents(documents)

            # Better deduplication
            unique_chunks = []
            seen_content = set()

            for chunk in chunks:
                # Deduplicate on normalized text: lowercased, whitespace-collapsed, first 50 words of the first 400 characters
                content_preview = ' '.join(chunk.page_content[:400].strip().lower().split()[:50])
                content_hash = hash(content_preview)
                if content_hash not in seen_content:
                    unique_chunks.append(chunk)
                    seen_content.add(content_hash)

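            # Create vector store. Drop the previous upload's collection first so
            # chunks from earlier documents do not mix into the new document's results.
            if self.vector_store is not None:
                self.vector_store.delete_collection()
            self.vector_store = Chroma.from_documents(
                documents=unique_chunks,
                embedding=self.embeddings,
                persist_directory="./enhanced_docs"
            )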

            # Store document info
            self.current_doc_info = {
                'name': os.path.basename(pdf_file.name) if hasattr(pdf_file, 'name') else "document.pdf",
                'pages': len(documents),
                'chunks': len(unique_chunks)
            }

            # Cleanup
            os.unlink(temp_path)

            return f"✅ Enhanced Processing Complete!\n- Pages: {len(documents)}\n- Intelligent chunks: {len(unique_chunks)}\n- Ready for high-quality Q&A!"

        except Exception as e:
            return f"❌ Error: {str(e)}"

    def ask_question(self, question):
        """Enhanced question answering with better prompting"""
        if not self.vector_store:
            return "📝 Please upload a PDF document first!"

        try:
            # Get relevant context
            contexts = self.vector_store.similarity_search(question, k=4)  # More context

            if not contexts:
                return "❌ No relevant information found for this question."

            # Build enhanced prompt
            context_text = ""
            for i, ctx in enumerate(contexts):
                # PyPDFLoader stores 0-based page indices; show 1-based page numbers
                page = ctx.metadata.get('page')
                page = page + 1 if isinstance(page, int) else '?'
                content = ctx.page_content
                # Collapse runs of whitespace left over from PDF extraction
                clean_content = ' '.join(content.split())
                context_text += f"[Source {i+1}, Page {page}]: {clean_content}\n\n"

            # Enhanced prompt for better reasoning
            prompt = f"""Based on the following document excerpts, provide a comprehensive and accurate answer to the question. Be specific and reference the sources.

DOCUMENT CONTEXT:
{context_text}

QUESTION: {question}

INSTRUCTIONS:
1. Answer using only information from the provided context
2. Be specific and detailed
3. Reference which pages support your answer
4. If the context doesn't contain the answer, say so clearly

ANSWER:"""

            # Generate enhanced answer
            response = self.qa_pipeline(
                prompt,
                max_length=350,  # Longer for more detailed answers
                num_return_sequences=1,
                do_sample=False  # Deterministic decoding; temperature only applies when sampling
            )
            answer = response[0]['generated_text'].strip()

            # Enhanced formatting with better source display.
            # Pages are converted to 1-based numbers to match the prompt; chunks
            # without an integer page are skipped so sorting cannot fail on mixed types.
            source_pages = sorted({
                ctx.metadata['page'] + 1 for ctx in contexts
                if isinstance(ctx.metadata.get('page'), int)
            })

            result = f"""**Enhanced Answer:** {answer}

**Sources Analyzed:** Pages {', '.join(map(str, source_pages))}

*Answer generated using enhanced AI analysis*"""

            return result

        except Exception as e:
            return f"❌ Error generating answer: {str(e)}"

# Initialize enhanced system
print("🧠 Initializing Enhanced Document Q&A System...")
enhanced_system = EnhancedDocumentQA()

# Create enhanced interface
with gr.Blocks(
    title="Enhanced Document Q&A",
    theme=gr.themes.Soft(),
    css="""
    .gradio-container { max-width: 1000px; margin: auto; }
    .enhanced-badge { background: #4CAF50; color: white; padding: 4px 8px; border-radius: 4px; font-size: 0.9em; }
    """
) as app:

    gr.Markdown("""
    # 📚 Enhanced Document Q&A
    **Upload PDFs and get more intelligent, detailed answers using enhanced AI models.**
    """)

    with gr.Row():
        with gr.Column(scale=1):
            gr.Markdown("### 📄 1. Upload PDF")
            file = gr.File(
                label="Select PDF File",
                file_types=[".pdf"],
                file_count="single"
            )
            status = gr.Textbox(
                label="Processing Status",
                lines=3,
                interactive=False
            )
            process_btn = gr.Button("🚀 Enhanced Processing", variant="primary")

            gr.Markdown("""
            ### 💡 Enhanced Features
            - **Better Embeddings**: all-mpnet-base-v2 model
            - **Improved Q&A**: Larger language model
            - **Smart Chunking**: Optimized document processing
            - **Detailed Answers**: More comprehensive responses
            """)

        with gr.Column(scale=2):
            gr.Markdown("### ❓ 2. Ask Questions")
            question = gr.Textbox(
                label="Your Question",
                placeholder="Ask detailed questions about the document content...",
                lines=3
            )
            answer = gr.Textbox(
                label="Enhanced Answer",
                lines=8,
                show_copy_button=True
            )
            ask_btn = gr.Button("🧠 Get Enhanced Answer", variant="primary")

            gr.Markdown("### 🎯 Enhanced Question Examples")
            gr.Examples(
                examples=[
                    "What is the main research question and how is it addressed?",
                    "Explain the methodology used and its key steps",
                    "What are the primary findings and their significance?",
                    "How do the conclusions relate to the initial objectives?",
                    "What evidence supports the main arguments presented?"
                ],
                inputs=question,
                label="Try these detailed questions"
            )

    # Event handlers
    process_btn.click(
        enhanced_system.process_pdf,
        [file],
        [status]
    )

    def ask_wrapper(question_text):
        # Guard against None (cleared textbox) as well as whitespace-only input
        if not question_text or not question_text.strip():
            return "Please enter a question."
        return enhanced_system.ask_question(question_text)

    ask_btn.click(
        ask_wrapper,
        [question],
        [answer]
    )

    question.submit(
        ask_wrapper,
        [question],
        [answer]
    )

print("🌐 Launching Enhanced Document Q&A System...")
print("This version uses better models while staying within T4 GPU limits!")
app.launch(share=True)

🧠 Initializing Enhanced Document Q&A System...
🚀 Loading enhanced models for T4 GPU...


`torch_dtype` is deprecated! Use `dtype` instead!


Device set to use cuda:0


✅ Loaded Flan-T5-Large (enhanced model)


🌐 Launching Enhanced Document Q&A System...
This version uses better models while staying within T4 GPU limits!
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://85134887cdcbe54c56.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
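
The chunk deduplication and page-citation steps in the cell above can be tried in isolation, without loading any models. The following is a minimal sketch, not the notebook's actual implementation: `Chunk` is a hypothetical stand-in for a LangChain `Document`, and `dedup_key`, `deduplicate`, and `cited_pages` are illustrative helper names that do not appear in the notebook. It assumes, as with `PyPDFLoader`, that the `page` metadata value is a 0-based index.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedup_key(text, max_chars=400, max_words=50):
    """Normalize the start of a chunk: lowercase, collapse whitespace, cap the length."""
    return " ".join(text[:max_chars].strip().lower().split()[:max_words])


def deduplicate(chunks):
    """Keep the first chunk seen for each normalized prefix, preserving order."""
    seen = set()
    unique = []
    for chunk in chunks:
        key = dedup_key(chunk.page_content)
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


def cited_pages(chunks):
    """Return sorted 1-based page numbers, skipping chunks without an integer page."""
    pages = {
        c.metadata["page"] + 1
        for c in chunks
        if isinstance(c.metadata.get("page"), int)
    }
    return sorted(pages)


chunks = [
    Chunk("Results  are shown\nin Table 2.", {"page": 4}),
    Chunk("results are shown in table 2.", {"page": 9}),  # duplicate after normalization
    Chunk("Methods are described below.", {"page": 0}),
    Chunk("Appendix text.", {}),                           # no page metadata
]
unique = deduplicate(chunks)
print(len(unique))          # 3
print(cited_pages(unique))  # [1, 5]
```

Storing the normalized string itself, rather than its `hash()`, costs a little memory but rules out two different chunks colliding on the same hash value. Filtering to integer pages before sorting avoids the `TypeError` that arises when a `'?'` placeholder is sorted together with integers.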


