In [3]:
import torch  # Deep learning framework for GPU-accelerated tensor operations
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig  # Hugging Face tools for loading and configuring language models
from sentence_transformers import SentenceTransformer  # For generating text embeddings/vectors
import faiss  # Fast similarity search and clustering of dense vectors
import numpy as np  # Numerical computing library for array operations
from typing import List, Dict  # Type hints for better code documentation
from scape import PaperScraper



In [None]:
from huggingface_hub import login

login("")

## Model Information:

In [5]:
# Initialize embedding model for generating text embeddings
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')  # Load the compact all-MiniLM-L6-v2 sentence embedding model
embedding_model.to('cuda')  # Move embedding model to GPU for faster inference

# Configure quantization for efficient GPU usage
bnb_config = BitsAndBytesConfig(  # Configure 4-bit quantization settings
    load_in_4bit=True,  # Enable 4-bit quantization for reduced memory usage
    bnb_4bit_quant_type="nf4",  # Use normalized float4 quantization for better accuracy
    bnb_4bit_compute_dtype=torch.float16,  # Use float16 for compute to balance speed and precision
    bnb_4bit_use_double_quant=True  # Enable double quantization for additional memory savings
)

# Initialize LLM and tokenizer
model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # Specify the Mistral model to use
tokenizer = AutoTokenizer.from_pretrained(model_name)  # Load tokenizer for converting text to tokens
model = AutoModelForCausalLM.from_pretrained(  # Load the language model
    model_name,  # Use the specified Mistral model
    quantization_config=bnb_config,  # Apply the quantization settings
    torch_dtype=torch.float16,  # Use float16 for the modules that are not quantized
    device_map="auto",  # Let accelerate place the model on the available devices
)

INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|██████████| 2/2 [00:14<00:00,  7.25s/it]
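
As a rough back-of-the-envelope check on why 4-bit loading helps, the sketch below estimates the storage needed for the weights of a model with about 7 billion parameters. The parameter count is an assumed round figure, and the estimate ignores quantization constants, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


ASSUMED_PARAMS = 7e9  # assumption: round figure for a "7B" model

fp16_gib = weight_memory_gib(ASSUMED_PARAMS, 16)  # unquantized float16 weights
nf4_gib = weight_memory_gib(ASSUMED_PARAMS, 4)    # 4-bit quantized weights

print(f"float16 weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:   ~{nf4_gib:.1f} GiB")
```

The ratio between the two figures is exactly four, since only the bits per parameter change.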


### Without retrieval (standard Transformer).

In [6]:
class ResearchAssistant:
    def __init__(self, model, tokenizer):
        """
        Initialize the standard Transformer model for direct QA.
        """
        self.model = model  # Pre-trained language model (e.g., GPT, Llama)
        self.tokenizer = tokenizer  # Tokenizer for the model

    def answer_question(self, question: str):
        """
        Answer a question using the transformer model WITHOUT retrieval.
        """
        # Construct the prompt (no retrieval context added)
        prompt = f"Answer the following question:\n\nQuestion: {question}\n\nAnswer:"

        # Tokenize the input question
        inputs = self.tokenizer(
            prompt, 
            return_tensors="pt",
            truncation=True,
            max_length=1024  # Limit token length
        ).to("cuda")

        # Generate response using the transformer model
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=256,  # Limit response length
            temperature=0.7,  # Add some randomness
            num_return_sequences=1,  # Generate one response
            do_sample=True  # Use sampling for diversity
        )

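        # Decode the generated output.
        # Note: outputs[0] holds the prompt tokens followed by the new tokens,
        # so the decoded string starts with the prompt text.
        answer = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return answer.strip()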


In [7]:
# Initialize the assistant
assistant = ResearchAssistant(model, tokenizer)


# Ask questions
questions = [
    "What are the main challenges in using CRISPR for cancer therapy?",
]

for question in questions:
    print(f"\nQ: {question}")
    print(f"\nA: {assistant.answer_question(question)}")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Q: What are the main challenges in using CRISPR for cancer therapy?

A: Answer the following question:

Question: What are the main challenges in using CRISPR for cancer therapy?

Answer:

CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) is a powerful gene editing tool that has shown promise for the treatment of various diseases, including cancer. However, there are several challenges associated with using CRISPR for cancer therapy, which include:

1. Off-target effects: One of the major challenges in using CRISPR for cancer therapy is the possibility of off-target effects, which occur when the gene editing tool cuts DNA at unintended sites. This can lead to unintended consequences and potentially harmful side effects.
2. Tumor heterogeneity: Cancer tumors are complex and heterogeneous, meaning they contain a mixture of different types of cells. This can make it difficult to target specific cancer cells with CRISPR, as the tool may not be able to distinguish between 
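
The answer above begins with the prompt because, for decoder-only models, `generate` returns the prompt tokens followed by the new tokens, and the whole sequence is decoded. Below is a minimal sketch of one way to keep only the continuation, using plain lists in place of tensors. `split_generated` is a hypothetical helper, not part of `transformers`, and the token IDs are made up.

```python
from typing import List, Sequence


def split_generated(output_ids: Sequence[int], prompt_length: int) -> List[int]:
    """Return only the token IDs that were produced after the prompt."""
    if prompt_length > len(output_ids):
        raise ValueError("prompt_length exceeds the output length")
    return list(output_ids[prompt_length:])


# Toy IDs standing in for real tokenizer output (values are invented).
prompt_ids = [1, 733, 16289]
output_ids = prompt_ids + [415, 2191, 2]

new_ids = split_generated(output_ids, len(prompt_ids))
print(new_ids)
```

With the tensors used in `answer_question`, the equivalent slice would be `outputs[0][inputs["input_ids"].shape[1]:]`, applied before calling `tokenizer.decode`.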

### Build a simple document retriever using ChromaDB, indexing a small collection of scientific papers.

In [8]:
import chromadb

# Initialize ChromaDB client
chroma_client = chromadb.PersistentClient(path="chromadb_store")  # Update path if necessary

# Load collection where papers are stored
collection = chroma_client.get_or_create_collection(name="research_papers")

# Define a search query
query_text = "Graph Neural Networks for biological applications"

# Convert the query to an embedding with the SentenceTransformer model loaded above
query_embedding = embedding_model.encode(query_text)

# Perform similarity search
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=10  # Number of relevant papers to retrieve
)

# Extract paper IDs
paper_ids = results['ids'][0]  # ChromaDB returns a list of lists
print("Relevant Paper IDs:", paper_ids)

INFO:chromadb.telemetry.product.posthog:Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.
Batches: 100%|██████████| 1/1 [00:00<00:00, 12.43it/s]

Relevant Paper IDs: []
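
The query returned an empty list, which is what a nearest-neighbour search gives back when the collection holds no vectors; none of the cells above add documents to `research_papers`. The sketch below illustrates the idea with a tiny in-memory index ranked by cosine similarity. It is not how ChromaDB is implemented, and the vectors and IDs are invented for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(index: Dict[str, Sequence[float]], query: Sequence[float], k: int) -> List[str]:
    """Return up to k document IDs, most similar to the query first."""
    ranked = sorted(index, key=lambda doc_id: cosine(index[doc_id], query), reverse=True)
    return ranked[:k]


empty_index: Dict[str, Sequence[float]] = {}
toy_index = {
    "paper_a": [1.0, 0.0],
    "paper_b": [0.0, 1.0],
    "paper_c": [0.7, 0.7],
}
query = [0.9, 0.1]

print(top_k(empty_index, query, k=10))  # nothing stored, so nothing to return
print(top_k(toy_index, query, k=2))
```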





### With retrieval (RAG-based Transformer).

### BM25 (text-based search).

In [9]:
from rank_bm25 import BM25Okapi
import nltk

nltk.download('punkt')

class ResearchAssistantBM25:
    def __init__(self):
        self.scraper = PaperScraper()  # Paper retrieval module
        self.tokenizer = tokenizer  # LLM tokenizer
        self.model = model  # LLM model for answering questions
        
        self.paper_texts = []  # Store paper texts
        self.paper_metadata = []  # Store metadata (title, authors, etc)
        self.bm25 = None  # BM25 index (to be built)

    def search_papers(self, query: str, max_results: int = 10):
        """Search and download papers for the given query"""
        print(f"Searching for papers about: {query}")

        pmids = self.scraper.search_pubmed(query, max_results)
        papers = self.scraper.fetch_pubmed_details(pmids)

        for paper in papers:
            if pdf_url := paper.get('full_text_link'):
                try:
                    pdf_path = self.scraper.download_pdf(pdf_url, f"temp_{paper['pubmed_id']}.pdf")
                    text = self.scraper.extract_text_from_pdf(pdf_path)

                    self.paper_texts.append(text)
                    self.paper_metadata.append(paper)

                except Exception as e:
                    print(f"Error processing paper {paper['pubmed_id']}: {e}")

        self._build_bm25_index()
        print(f"Successfully processed {len(self.paper_texts)} papers")

    def _build_bm25_index(self):
        """Build BM25 index from paper texts"""
        self.chunks = []
        self.chunk_metadata = []

        for text, metadata in zip(self.paper_texts, self.paper_metadata):
            paragraphs = text.split('\n\n')
            for i in range(0, len(paragraphs), 2):  # Non-overlapping chunks of two paragraphs
                chunk = ' '.join(paragraphs[i:i+2])
                if len(chunk.split()) > 30:
                    self.chunks.append(chunk)
                    self.chunk_metadata.append(metadata)

        # Tokenize each chunk for BM25
        tokenized_chunks = [nltk.word_tokenize(chunk.lower()) for chunk in self.chunks]

        # Initialize BM25 Index
        self.bm25 = BM25Okapi(tokenized_chunks)

    def answer_question(self, question: str, k: int = 5):
        """Retrieve relevant papers using BM25 and answer with LLM"""
        query_tokens = nltk.word_tokenize(question.lower())

        # Get top-k ranked documents
        scores = self.bm25.get_scores(query_tokens)
        ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

        # Build context from retrieved documents
        context = ""
        used_papers = set()
        total_tokens = 0
        max_tokens = 2048

        for idx in ranked_indices:
            chunk = self.chunks[idx]
            metadata = self.chunk_metadata[idx]
            paper_id = metadata["pubmed_id"]

            if paper_id not in used_papers:
                chunk_tokens = len(self.tokenizer.encode(chunk))
                if total_tokens + chunk_tokens > max_tokens:
                    break

                used_papers.add(paper_id)
                context += f"\nFrom paper '{metadata['title']}':\n{chunk}\n"
                total_tokens += chunk_tokens

        # Construct final LLM prompt
        prompt = f"""Answer based on these excerpts. Include citations.

        Excerpts: {context}

        Question: {question}

        Answer: """

        # Generate answer using LLM
        inputs = self.tokenizer(
            prompt, 
            return_tensors="pt",
            truncation=True,
            max_length=2048
        ).to("cuda")

        outputs = self.model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.7,
            num_return_sequences=1,
            do_sample=True
        )
        input_length = inputs["input_ids"].shape[1]  # Number of tokens in input prompt
        output_token_length = outputs.shape[1] - input_length  # Total generated tokens excluding input
        # Extract answer
        answer = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return answer.split("Answer: ")[-1].strip(), output_token_length


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\prart\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
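
For readers unfamiliar with the ranking used in this class, the following is a minimal sketch of Okapi BM25 scoring in plain Python. It uses a common IDF variant with `+ 1` inside the logarithm and assumes the defaults `k1=1.5` and `b=0.75`; `rank_bm25.BM25Okapi` may handle IDF differently, so the scores are not expected to match it exactly. The toy corpus is invented.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(corpus: List[List[str]], query: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document in `corpus` against the tokenized `query`."""
    if not corpus:
        return []
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Number of documents that contain each term at least once
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalization: longer-than-average documents are penalized
            denom = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


corpus = [
    ["crispr", "off", "target", "effects"],
    ["tumor", "heterogeneity", "in", "cancer"],
    ["crispr", "delivery", "to", "cancer", "cells"],
]
scores = bm25_scores(corpus, ["crispr", "delivery"])
best = max(range(len(scores)), key=lambda i: scores[i])
print(scores, best)
```

The third document matches both query terms, so it ranks first; the second matches neither and scores zero.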


In [10]:
# Initialize the assistant
assistant = ResearchAssistantBM25()

# Search for papers on a topic
assistant.search_papers(
    query="latest developments in CRISPR gene editing cancer therapy",
    max_results=20
)

# Ask questions
questions = [
    "What are the main challenges in using CRISPR for cancer therapy?",
]

for question in questions:
    print(f"\nQ: {question}")
    ans = assistant.answer_question(question)
    print(f"\nA: {ans[0]}")
    print(f"\nOutput token length: {ans[1]}")

INFO:scape:Searching PubMed for : latest developments in CRISPR gene editing cancer therapy


Searching for papers about: latest developments in CRISPR gene editing cancer therapy


INFO:scape:Found 28 results
INFO:scape:Fetching details for 28 papers...
INFO:scape:Successfully processed PMID 37356052
INFO:scape:Successfully processed PMID 36610813
INFO:scape:Successfully processed PMID 36272261
INFO:scape:Successfully processed PMID 35337340
INFO:scape:Successfully processed PMID 39708520
INFO:scape:Successfully processed PMID 38050977
INFO:scape:Successfully processed PMID 34411650
INFO:scape:Successfully processed PMID 31739699
INFO:scape:Successfully processed PMID 36560658
INFO:scape:Successfully processed PMID 33003295
INFO:scape:Successfully processed PMID 39317648
INFO:scape:Successfully processed PMID 35547744
INFO:scape:Successfully processed PMID 39292321
INFO:scape:Successfully processed PMID 35999480
INFO:scape:Successfully processed PMID 32264803
INFO:scape:Successfully processed PMID 38041049
INFO:scape:Successfully processed PMID 36139078
INFO:scape:Successfully processed PMID 29691470
INFO:scape:Successfully processed PMID 35358798
INFO:scape:Succ

Successfully processed 12 papers

Q: What are the main challenges in using CRISPR for cancer therapy?

A: One of the main challenges in using CRISPR for cancer therapy is the potential for unintended consequences or off-target effects. This refers to the possibility that the CRISPR-Cas9 system could accidentally edit genes that are not intended to be modified. This can lead to unintended mutations or changes in gene expression that could have harmful effects on healthy cells.

            Another challenge is the difficulty in precisely targeting cancer cells while avoiding healthy cells. This requires careful design of the CRISPR-Cas9 system to specifically recognize and bind to cancer cells, while leaving healthy cells unharmed.

            Additionally, the CRISPR-Cas9 system may not be effective in all types of cancer cells. Some cancer cells may have genetic mutations that make them resistant to the CRISPR-Cas9 system, or may have mechanisms to repair any damage caused by the sys
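
The context-building loop in `answer_question` keeps at most one chunk per paper and stops once the token budget would be exceeded. The sketch below isolates that logic so it can be exercised without a model. Whitespace word counts stand in for `tokenizer.encode`, `build_context` is a hypothetical helper, and the chunk data is made up.

```python
from typing import Callable, Iterable, List, Tuple


def build_context(
    ranked_chunks: Iterable[Tuple[str, str]],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Pick chunks in rank order: one per paper, within a total token budget."""
    selected: List[str] = []
    seen_papers = set()
    used = 0
    for paper_id, chunk in ranked_chunks:
        if paper_id in seen_papers:
            continue  # only the first chunk from each paper is kept
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        seen_papers.add(paper_id)
        selected.append(chunk)
        used += cost
    return selected


ranked = [
    ("p1", "off target edits remain a risk"),
    ("p1", "a second chunk from the same paper"),
    ("p2", "delivery to tumour cells is difficult"),
    ("p3", "this chunk would exceed the remaining budget"),
]
context = build_context(ranked, max_tokens=12)
print(context)
```

Like the original loop, this stops at the first chunk that does not fit rather than trying smaller chunks further down the ranking.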

### Dense embeddings (sentence-transformers/all-MiniLM-L6-v2)

In [11]:
import chromadb
from chromadb.utils import embedding_functions

class ResearchAssistant_sentrans:
    def __init__(self):
        # Initialize components for paper processing and analysis
        self.scraper = PaperScraper()  # For retrieving papers from PubMed
        self.embedding_model = embedding_model  # For generating text embeddings
        self.tokenizer = tokenizer  # For tokenizing text for the LLM
        self.model = model  # The language model for answering questions
        self.paper_texts = []  # Store the full text of processed papers
        self.paper_metadata = []  # Store metadata (title, authors, etc) for papers
        
        # Initialize ChromaDB client and collection
        self.chroma_client = chromadb.PersistentClient(path="./chroma_db")  # Persist data
        self.collection = self.chroma_client.get_or_create_collection(name="sentence_papers")

    def search_papers(self, query: str, max_results: int = 10):
        """Search and download papers for the given query"""
        print(f"Searching for papers about: {query}")
        
        # Get paper IDs from PubMed search
        pmids = self.scraper.search_pubmed(query, max_results)
        # Fetch detailed information for each paper
        papers = self.scraper.fetch_pubmed_details(pmids)
        
        # Process each paper found
        for paper in papers:
            if pdf_url := paper.get('full_text_link'):  # Check if full text PDF is available
                try:
                    # Download PDF to temporary file
                    pdf_path = self.scraper.download_pdf(
                        pdf_url, 
                        f"temp_{paper['pubmed_id']}.pdf"
                    )
                    # Extract plain text from PDF
                    text = self.scraper.extract_text_from_pdf(pdf_path)
                    
                    # Save paper content and metadata
                    self.paper_texts.append(text)
                    self.paper_metadata.append(paper)
                    
                except Exception as e:
                    print(f"Error processing paper {paper['pubmed_id']}: {e}")
                    
        self._build_index()  # Create search index from processed papers
        print(f"Successfully processed {len(self.paper_texts)} papers")

    def _build_index(self):
        """Create ChromaDB index from paper embeddings"""
        self.chunks = []
        self.chunk_metadata = []

        # Process each paper into chunks
        for text, metadata in zip(self.paper_texts, self.paper_metadata):
            # Split text into paragraphs
            paragraphs = text.split('\n\n')
            # Create non-overlapping chunks of 2 paragraphs each
            for i in range(0, len(paragraphs), 2):
                chunk = ' '.join(paragraphs[i:i+2])
                if len(chunk.split()) > 30:  # Only keep chunks with sufficient content
                    self.chunks.append(chunk)
                    self.chunk_metadata.append(metadata)

        # Generate embeddings for all chunks
        embeddings = self.embedding_model.encode(
            self.chunks,
            batch_size=4,  # Process 4 chunks at a time
            show_progress_bar=True,
            convert_to_numpy=True  # Convert to numpy for ChromaDB compatibility
        )

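        # Store embeddings in ChromaDB. upsert overwrites entries whose IDs already
        # exist, so re-running against the persistent collection does not leave
        # stale chunks stored under reused IDs.
        for i, (chunk, metadata, embedding) in enumerate(zip(self.chunks, self.chunk_metadata, embeddings)):
            self.collection.upsert(
                ids=[str(i)],
                embeddings=[embedding.tolist()],
                metadatas=[{"title": metadata["title"], "pubmed_id": metadata["pubmed_id"], "chunk": chunk}]
            )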

    def answer_question(self, question: str, k: int = 5):
        """Answer a question using RAG"""
        # Convert question to embedding vector
        q_embedding = self.embedding_model.encode([question])[0]

        # Find k most similar chunks using ChromaDB
        results = self.collection.query(
            query_embeddings=[q_embedding.tolist()],
            n_results=k
        )

        # Build context from relevant chunks - LIMIT TOTAL LENGTH
        context = ""
        used_papers = set()
        total_tokens = 0
        max_tokens = 2048  # Set a reasonable limit for context length
        
        for result in results["metadatas"][0]:
            chunk = result["chunk"]
            paper_id = result["pubmed_id"]

            # Only include first chunk from each paper and check token length
            if paper_id not in used_papers:
                chunk_tokens = len(self.tokenizer.encode(chunk))
                if total_tokens + chunk_tokens > max_tokens:
                    break

                used_papers.add(paper_id)
                context += f"\nFrom paper '{result['title']}':\n{chunk}\n"
                total_tokens += chunk_tokens

        # Construct the prompt from the retrieved excerpts
        prompt = f"""Answer based on these excerpts. Include citations.

        Excerpts: {context}

        Question: {question}

        Answer: """

        # Generate answer using LLM with controlled length
        inputs = self.tokenizer(
            prompt, 
            return_tensors="pt",
            truncation=True,
            max_length=2048  # Hard limit on input length
        ).to("cuda")
        
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=256,  # Limit response length
            temperature=0.7,  # Add some randomness to generation
            num_return_sequences=1,  # Generate one response
            do_sample=True,  # Use sampling instead of greedy decoding
        )
        input_length = inputs["input_ids"].shape[1]  # Number of tokens in input prompt
        output_token_length = outputs.shape[1] - input_length  # Total generated tokens excluding input

        # Extract and clean up the generated answer
        answer = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return answer.split("Answer: ")[-1].strip(), output_token_length
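
`_build_index` steps through the paragraphs two at a time and joins two paragraphs per chunk, so consecutive chunks share no text. If overlap is wanted, one option is a sliding window whose stride is smaller than its size. Below is a minimal sketch; `window` and `stride` are hypothetical parameters that do not appear in the class above, and the minimum-length filter is omitted for brevity.

```python
from typing import List


def chunk_paragraphs(paragraphs: List[str], window: int = 2, stride: int = 1) -> List[str]:
    """Join `window` consecutive paragraphs per chunk, advancing by `stride`."""
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    chunks: List[str] = []
    for start in range(0, len(paragraphs), stride):
        chunks.append(" ".join(paragraphs[start:start + window]))
        if start + window >= len(paragraphs):
            break  # the last paragraph has been covered
    return chunks


paragraphs = ["A", "B", "C", "D"]

no_overlap = chunk_paragraphs(paragraphs, window=2, stride=2)  # same stepping as the class
overlap = chunk_paragraphs(paragraphs, window=2, stride=1)     # neighbours share a paragraph
print(no_overlap)
print(overlap)
```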


In [12]:
# Initialize the assistant
assistant = ResearchAssistant_sentrans()

INFO:chromadb.telemetry.product.posthog:Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.


In [13]:


# Search for papers on a topic
assistant.search_papers(
    query="latest developments in CRISPR gene editing cancer therapy",
    max_results=20
)

# Ask questions
questions = [
    "What are the main challenges in using CRISPR for cancer therapy?",
]

for question in questions:
    print(f"\nQ: {question}")
    ans = assistant.answer_question(question)
    print(f"\nA: {ans[0]}")
    print(f"\nOutput token length: {ans[1]}")

INFO:scape:Searching PubMed for : latest developments in CRISPR gene editing cancer therapy


Searching for papers about: latest developments in CRISPR gene editing cancer therapy


INFO:scape:Found 28 results
INFO:scape:Fetching details for 28 papers...
INFO:scape:Successfully processed PMID 37356052
INFO:scape:Successfully processed PMID 36610813
INFO:scape:Successfully processed PMID 36272261
INFO:scape:Successfully processed PMID 35337340
INFO:scape:Successfully processed PMID 39708520
INFO:scape:Successfully processed PMID 38050977
INFO:scape:Successfully processed PMID 34411650
INFO:scape:Successfully processed PMID 31739699
INFO:scape:Successfully processed PMID 36560658
INFO:scape:Successfully processed PMID 33003295
INFO:scape:Successfully processed PMID 39317648
INFO:scape:Successfully processed PMID 35547744
INFO:scape:Successfully processed PMID 39292321
INFO:scape:Successfully processed PMID 35999480
INFO:scape:Successfully processed PMID 32264803
INFO:scape:Successfully processed PMID 38041049
INFO:scape:Successfully processed PMID 36139078
INFO:scape:Successfully processed PMID 29691470
INFO:scape:Successfully processed PMID 35358798
INFO:scape:Succ

Successfully processed 12 papers

Q: What are the main challenges in using CRISPR for cancer therapy?


Batches: 100%|██████████| 1/1 [00:00<00:00, 64.00it/s]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



A: CRISPR is a revolutionary gene editing technology that has shown promise in cancer therapy. However, there are still several challenges that need to be overcome before it can be widely used in clinical settings. 

        One of the main challenges is the safety of using CRISPR in humans. While CRISPR has shown to be effective in animal models, there is still a risk of unintended consequences when used in humans. For example, there is a possibility of off-target effects, where CRISPR edits genes other than the intended target, which could lead to unintended side effects. 

        Another challenge is the delivery of CRISPR to cancer cells. CRISPR requires specific conditions to be effective, and it may be difficult to deliver it to cancer cells in the body. Additionally, the body's immune system may recognize CRISPR as foreign and attack it, further complicating the delivery process. 

        Finally, there is a lack of understanding of how CRISPR works in cancer cells, which mak

In [14]:


# Search for papers on a topic
assistant.search_papers(
    query="latest developments in CRISPR gene editing cancer therapy",
    max_results=20
)

# Ask questions
questions = [
    "What are the main challenges in using CRISPR for cancer therapy?",
]

for question in questions:
    print(f"\nQ: {question}")
    ans = assistant.answer_question(question)
    print(f"\nA: {ans[0]}")
    print(f"\nOutput token length: {ans[1]}")

INFO:scape:Searching PubMed for : latest developments in CRISPR gene editing cancer therapy


Searching for papers about: latest developments in CRISPR gene editing cancer therapy


INFO:scape:Found 28 results
INFO:scape:Fetching details for 28 papers...
INFO:scape:Successfully processed PMID 37356052
INFO:scape:Successfully processed PMID 36610813
INFO:scape:Successfully processed PMID 36272261
INFO:scape:Successfully processed PMID 35337340
INFO:scape:Successfully processed PMID 39708520
INFO:scape:Successfully processed PMID 38050977
INFO:scape:Successfully processed PMID 34411650
INFO:scape:Successfully processed PMID 31739699
INFO:scape:Successfully processed PMID 36560658
INFO:scape:Successfully processed PMID 33003295
INFO:scape:Successfully processed PMID 39317648
INFO:scape:Successfully processed PMID 35547744
INFO:scape:Successfully processed PMID 39292321
INFO:scape:Successfully processed PMID 35999480
INFO:scape:Successfully processed PMID 32264803
INFO:scape:Successfully processed PMID 38041049
INFO:scape:Successfully processed PMID 36139078
INFO:scape:Successfully processed PMID 29691470
INFO:scape:Successfully processed PMID 35358798
INFO:scape:Succ

Successfully processed 24 papers

Q: What are the main challenges in using CRISPR for cancer therapy?


Batches: 100%|██████████| 1/1 [00:00<00:00, 26.86it/s]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



A: One of the main challenges in using CRISPR for cancer therapy is off-target effects. CRISPR-Cas9 relies on the specificity of the guide RNA to bind to a particular target sequence in the genome. However, if the guide RNA is not designed perfectly, it can bind to unintended targets, causing unintended effects. This can lead to the activation of healthy cells or the suppression of cancer cells that are essential for the body's immune system to function properly (Fried et al., 2013).

        Another challenge is the risk of generating mutations. CRISPR-Cas9 can introduce double-strand breaks in the genome, which can lead to the formation of mutations. While this can be beneficial in some cases, such as in the treatment of genetic diseases, it can also be harmful if it occurs in cancer cells (Rodriguez et al., 2015).

        Additionally, the complexity of the human genome and the variability in the DNA sequences of cancer cells make it difficult to design guide RNAs that will be eff
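
The second run reports 24 processed papers where the first reported 12, which is consistent with `search_papers` appending to `self.paper_texts` without checking whether a paper is already stored. Below is a minimal sketch of a guard keyed on `pubmed_id`. The `PaperStore` class is hypothetical and only illustrates the bookkeeping, and the IDs and texts are placeholders.

```python
from typing import Dict, List


class PaperStore:
    """Keeps paper texts and metadata, skipping papers that were already added."""

    def __init__(self) -> None:
        self.texts: List[str] = []
        self.metadata: List[Dict[str, str]] = []
        self._seen_ids = set()

    def add(self, paper: Dict[str, str], text: str) -> bool:
        """Store the paper and return True, or return False if it is a duplicate."""
        paper_id = paper["pubmed_id"]
        if paper_id in self._seen_ids:
            return False
        self._seen_ids.add(paper_id)
        self.texts.append(text)
        self.metadata.append(paper)
        return True


store = PaperStore()
batch = [
    ({"pubmed_id": "id-1", "title": "Placeholder A"}, "text of paper A"),
    ({"pubmed_id": "id-2", "title": "Placeholder B"}, "text of paper B"),
]

# Simulate running the same search twice
for _ in range(2):
    for paper, text in batch:
        store.add(paper, text)

print(len(store.texts))
```

With this guard, repeating the same search leaves the stored paper count unchanged.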