## 1. Extract Text from PDFs

In [3]:
!pip install pypdf



In [5]:
import os
from pypdf import PdfReader

def extract_text_from_pdfs(pdf_folder):
    docs = []
    for filename in os.listdir(pdf_folder):
        if filename.endswith('.pdf'):
            path = os.path.join(pdf_folder, filename)
            reader = PdfReader(path)
            # Join pages with a newline so words at page boundaries do not fuse
            text = '\n'.join(page.extract_text() or '' for page in reader.pages)
            docs.append({'filename': filename, 'text': text})
    return docs

pdf_folder = 'papers'
documents = extract_text_from_pdfs(pdf_folder)
print(f'Loaded {len(documents)} documents.')

Loaded 13 documents.


## 2. Chunk and Embed the Text

In [6]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

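def chunk_text(text, chunk_size=500, overlap=100):
    # Sliding window over words: consecutive chunks share `overlap` words
    if overlap >= chunk_size:
        raise ValueError('overlap must be smaller than chunk_size')
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks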

all_chunks = []
chunk_sources = []
for doc in documents:
    chunks = chunk_text(doc['text'])
    all_chunks.extend(chunks)
    chunk_sources.extend([doc['filename']] * len(chunks))

# Use a biomedical embedding model for better retrieval
model_embed = SentenceTransformer('pritamdeka/S-PubMedBert-MS-MARCO')
embeddings = model_embed.encode(all_chunks, show_progress_bar=True)

index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(np.array(embeddings))

Batches:   0%|          | 0/12 [00:00<?, ?it/s]
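
To see what the sliding window in `chunk_text` produces, here is a small self-contained sketch with toy sizes. The values `chunk_size=5` and `overlap=2` are chosen only for illustration, and the function is re-declared so the snippet runs on its own.

```python
def chunk_text(text, chunk_size=500, overlap=100):
    # Same sliding-window logic as the notebook cell above
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

sample = ' '.join(f'w{n}' for n in range(10))  # "w0 w1 ... w9"
demo_chunks = chunk_text(sample, chunk_size=5, overlap=2)
for c in demo_chunks:
    print(c)
# w0 w1 w2 w3 w4
# w3 w4 w5 w6 w7
# w6 w7 w8 w9
# w9
```

Each window starts `chunk_size - overlap` words after the previous one, so neighbouring chunks share two words here. The final chunk (`w9`) is already contained in the one before it, so short trailing chunks can be redundant.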

## 3. Define the RAG QA Function (using a local Ollama LLM)

### How the RAG Pipeline Works with Ollama
- The notebook code handles all retrieval and prompt construction logic (RAG).
- When you ask a question, the code retrieves the most relevant chunks from your PDFs using embeddings and FAISS.
- These chunks are combined with your question to form a prompt.
- The prompt is sent through Ollama's REST API to a model running locally (e.g., Mistral or Llama 2).
- The model does not know it is part of a RAG pipeline; it simply receives the prompt and generates a response based on the provided context and question.
- This allows you to use powerful local LLMs for answer generation, while your notebook controls the retrieval and context.
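
The retrieve-then-prompt flow in the list above can be sketched without FAISS or Ollama. The snippet below is a minimal illustration only: it assumes a toy keyword-overlap scorer in place of embedding search, and `toy_retrieve` and `build_prompt` are hypothetical helper names that do not appear in the notebook.

```python
def toy_retrieve(question, chunks, top_k=2):
    """Rank chunks by words shared with the question (stand-in for embedding search)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: len(q_words & set(chunks[i].lower().split())),
        reverse=True,
    )
    return [chunks[i] for i in ranked[:top_k]]


def build_prompt(question, context_chunks, max_context_chars=800):
    """Combine the retrieved chunks and the question into one prompt string."""
    context = "\n".join(context_chunks)[:max_context_chars]
    return (
        "Answer the question based on the context.\n"
        f"Context: {context}\nQuestion: {question}"
    )


chunks = [
    "GPCR structures are hard to obtain",
    "FAISS indexes dense vectors",
    "Rag2Mol generates drug molecules",
]
question = "What does Rag2Mol generate"
top = toy_retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The language model only ever sees the final `prompt` string. Retrieval quality therefore determines what it can answer from.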

In [8]:
import requests

# Use the Ollama REST API for answer generation with Mistral or Llama 2
def ollama_rag_answer(question, top_k=2, max_context_chars=800, model="mistral"):
    # `model_embed` is the embedding model; `model` names the Ollama LLM
    q_emb = model_embed.encode([question])
    D, I = index.search(np.array(q_emb), top_k)  # one row per query, so use I[0]
    context_chunks = [all_chunks[i] for i in I[0]]
    sources = [chunk_sources[i] for i in I[0]]
    context = '\n'.join(context_chunks)
    # Hard character cut: text past max_context_chars is dropped,
    # but `sources` still lists every retrieved paper
    context = context[:max_context_chars]
    print("Retrieved context:\n", context)
    print("Sources (papers used):", list(dict.fromkeys(sources)))
    prompt = f"Answer the question based on the context.\nContext: {context}\nQuestion: {question}"
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,  # do not wait indefinitely; raise this on slower hardware
    )
    response.raise_for_status()  # raise on HTTP errors instead of failing on a missing key
    return response.json()["response"].strip()
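
For intuition about what `index.search` returns, the sketch below reimplements exact nearest-neighbour search by squared L2 distance in plain Python, for a single query. `l2_search` is a hypothetical helper written for illustration and is not how FAISS is implemented internally. FAISS returns batched arrays with one row per query, which is why the function above reads `I[0]`.

```python
def l2_search(query, vectors, top_k):
    """Return (distances, indices) of the top_k vectors closest to `query`."""
    dists = [sum((q - v) ** 2 for q, v in zip(query, vec)) for vec in vectors]
    order = sorted(range(len(vectors)), key=lambda i: dists[i])[:top_k]
    return [dists[i] for i in order], order


vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
D_row, I_row = l2_search([0.9, 1.2], vectors, top_k=2)
print(I_row)  # [1, 0]: vector 1 is closest, then vector 0
```

The indices in `I_row` play the same role as `I[0]` in the notebook: they are positions into `all_chunks` and `chunk_sources`.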

In [9]:
ollama_rag_answer("Summarize Rag2Mol paper in 2 sentences")


Retrieved context:
 screened by Rag2Mol-R significantly outperform those selected by other advanced virtual screening tools, covering a broader chemical landscape. Thus, Rag2Mol-G fits targets with multiple binding templates, while Rag2Mol-R excels with traditionally “undruggable” targets. Furthermore, both workflows could identify promising drug candidates for the challenging case of PTPN2, outperforming current active site inhibitors. Finally , although we employ widely recognized autoregressive-based methods, almost all AI-based SBDD methods could be integrated into our framework. Meanwhile, each module of the protocol exhibits convenient extensibility . Materials and methods Overview of Rag2Mol We choose RAG architecture to incorporate prior chemical knowl- edge and topological rules into the generation p
Sources (papers used): ['Paper_10.pdf', 'Paper_7.pdf']


'The Rag2Mol paper presents a novel method, utilizing Reinforced Autoencoder Graph (RAG) architecture, that incorporates prior chemical knowledge and topological rules to generate drug molecules. This approach outperforms other virtual screening tools, particularly excelling with traditionally "undruggable" targets, as demonstrated in the case of PTPN2.'

In [10]:
ollama_rag_answer("Name the experimental methods names to determine GPCR structures")


Retrieved context:
 many diseases Downloaded from https://academic.oup.com/bib/article/25/4/bbae281/7691386 by guest on 12 August 2025 2 | Zhang et al. associated with GPCRs still lack approved drugs that can effectively modulate them, thus underscoring the enormous potential of GPCRs as novel targets for disease curing. Unfor- tunately , the limited availability of GPCR–ligand structures (below 500) [ 5] and the difficulty in obtaining reliable GPCR structures because of their embedding in the lipid membrane pose significant challenges for developing structure-based GPCR– ligand prediction models. It is worth noting that the unique properties of GPCRs compared to soluble proteins in protein– ligand binding further complicate the modeling process. In recent years, deep learning has emerged as a powerful tool 
Sources (papers used): ['Paper_12.pdf', 'Paper_8.pdf']


'The context does not explicitly mention specific experimental methods for determining GPCR structures. However, it is suggested that the limited availability of GPCR–ligand structures and the difficulty in obtaining reliable GPCR structures are major challenges due to their embedding in lipid membranes. Common experimental methods used to determine GPCR structures include X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (Cryo-EM). These techniques can help overcome the challenges mentioned by providing structural information about GPCRs.'
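
In both runs above, the printed context is cut at 800 characters and appears to come from a single chunk, while two papers are listed as sources. One possible way to keep the source list in step with the text actually sent is sketched below. `fit_context` is a hypothetical helper and is not part of the notebook.

```python
def fit_context(chunks, sources, max_chars=800):
    """Keep chunk text up to max_chars; report only sources that contributed text."""
    kept_text, kept_sources, used = [], [], 0
    for chunk, source in zip(chunks, sources):
        remaining = max_chars - used
        if remaining <= 0:
            break  # budget exhausted: later chunks contribute nothing
        piece = chunk[:remaining]
        kept_text.append(piece)
        used += len(piece) + 1  # +1 accounts for the joining newline
        if source not in kept_sources:
            kept_sources.append(source)
    return "\n".join(kept_text), kept_sources


context, used_sources = fit_context(
    ["a" * 1000, "b" * 1000], ["Paper_10.pdf", "Paper_7.pdf"], max_chars=800
)
print(len(context), used_sources)  # 800 ['Paper_10.pdf']
```

With this approach, a paper appears in the source list only if some of its text reached the prompt.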

## 4. Build a Simple Gradio UI

In [11]:
!pip install gradio




In [56]:
!pip install --upgrade gradio huggingface_hub




In [13]:
import gradio as gr

def gradio_rag_interface(question):
    try:
        q_emb = model_embed.encode([question])
        D, I = index.search(np.array(q_emb), 2)  # top_k=2
        context_chunks = [all_chunks[i] for i in I[0]]
        sources = [chunk_sources[i] for i in I[0]]
        context = "\n".join(context_chunks)[:800]

        prompt = f"Answer the question based on the context.\nContext: {context}\nQuestion: {question}"
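        response = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "mistral", "prompt": prompt, "stream": False},
            timeout=120,  # do not wait indefinitely; raise this on slower hardware
        )
        response.raise_for_status()  # raise on HTTP errors so the except branch reports them
        answer = response.json()["response"].strip()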

        # Return answer + unique sources
        unique_sources = list(dict.fromkeys(sources))
        return answer, ", ".join(unique_sources)

    except Exception as e:
        return f"Error: {str(e)}", ""

with gr.Blocks() as demo:
    gr.Markdown("## 📄 RAG Bio Papers QA")

    with gr.Row():
        question_box = gr.Textbox(label="Ask a question about your papers")

    with gr.Row():
        answer_box = gr.Textbox(label="Answer", lines=8)
    
    with gr.Row():
        sources_box = gr.Textbox(label="Sources", lines=2)

    ask_btn = gr.Button("Get Answer")

    ask_btn.click(
        fn=gradio_rag_interface,
        inputs=question_box,
        outputs=[answer_box, sources_box]
    )

demo.launch(share=True)


* Running on local URL:  http://127.0.0.1:7863
* Running on public URL: https://e73aa738dbc97991e0.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


