# Biomedical RAG Introduction

This notebook provides context for RAG on various types of documents that are relevant in biomedical research:
 
* URLs
* Academic papers (PDF)
* Clinical Trial Data (JSON)

## Dependencies 

### Packages

In [None]:
! pip install langchain chromadb tiktoken pypdf langchainhub anthropic openai gpt4all pandas jq beautifulsoup4

### API keys

Required (to access OpenAI LLMs):
* `OPENAI_API_KEY`

Optional, for [Anthropic](https://docs.anthropic.com/claude/reference/getting-started-with-the-api) LLMs (with [LangSmith](https://docs.smith.langchain.com/) for tracing):
* `ANTHROPIC_API_KEY`


In [None]:
import os
# os.environ['OPENAI_API_KEY'] = 'xxx'
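Before running the cells below, it can help to check which keys are set without printing their values. The following is a minimal sketch; the helper name `missing_keys` is hypothetical and not part of any library.

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in `env` (values are never printed)."""
    return [name for name in names if not env.get(name)]


# OPENAI_API_KEY is required; ANTHROPIC_API_KEY is only needed for the Claude cells.
missing = missing_keys(["OPENAI_API_KEY"])
if missing:
    print("Missing required keys:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.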

## Document Loading 

### PDFs

We can load academic papers (e.g., [Clinical trials of interest for 2023](https://www.nature.com/articles/s41591-022-02132-3)) via the [PDF loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf).

In [25]:
from langchain.document_loaders import PyPDFLoader
# Define loader for PDF
path = "/Users/rlm/Desktop/GENE-workshop/s41591-022-02132-3.pdf"
loader = PyPDFLoader(path)
pdf_pages = loader.load()

Explore a broad set of loaders [here](https://python.langchain.com/docs/integrations/document_loaders) and [here](https://integrations.langchain.com/).

### URLs

We can load from [URLs](https://python.langchain.com/docs/integrations/document_loaders/url).

In [3]:
from langchain.document_loaders import WebBaseLoader
# Define loader for URL
loader = WebBaseLoader("https://berthub.eu/articles/posts/amazing-dna/")
blog = loader.load()

### JSON

Clinical trial data can be downloaded [as a database](https://classic.clinicaltrials.gov/api/gui/ref/download_all) of JSON files from clinicaltrials.gov.

The full zip file is about 2 GB, and we can [load each JSON record](https://python.langchain.com/docs/modules/data_connection/document_loaders/json) with `JSONLoader`.

In [5]:
# Define the directory path
import os 
dir_path = "/Users/rlm/Desktop/Clinical-Trials/AllAPIJSON/NCT0000xxxx"

# List all files in the directory
all_files = os.listdir(dir_path)

# Filter only JSON files
json_files = [file for file in all_files if file.endswith('.json')]

# Sort and select the first 5 JSON files (you can customize the sorting if needed)
first_5_jsons = sorted(json_files)[:5]

We can look at the structure of each record.

In [3]:
import json
import pandas as pd

# first_5_jsons holds file names, so join each with dir_path to get a full path
with open(os.path.join(dir_path, first_5_jsons[0]), 'r') as f:
    data = json.load(f)

# Normalize the JSON data to a DataFrame
df = pd.json_normalize(data)

# Print the columns to understand the structure
print(df.columns)

Index(['FullStudy.Rank',
       'FullStudy.Study.ProtocolSection.IdentificationModule.NCTId',
       'FullStudy.Study.ProtocolSection.IdentificationModule.OrgStudyIdInfo.OrgStudyId',
       'FullStudy.Study.ProtocolSection.IdentificationModule.SecondaryIdInfoList.SecondaryIdInfo',
       'FullStudy.Study.ProtocolSection.IdentificationModule.Organization.OrgFullName',
       'FullStudy.Study.ProtocolSection.IdentificationModule.Organization.OrgClass',
       'FullStudy.Study.ProtocolSection.IdentificationModule.BriefTitle',
       'FullStudy.Study.ProtocolSection.StatusModule.StatusVerifiedDate',
       'FullStudy.Study.ProtocolSection.StatusModule.OverallStatus',
       'FullStudy.Study.ProtocolSection.StatusModule.ExpandedAccessInfo.HasExpandedAccess',
       'FullStudy.Study.ProtocolSection.StatusModule.StudyFirstSubmitDate',
       'FullStudy.Study.ProtocolSection.StatusModule.StudyFirstSubmitQCDate',
       'FullStudy.Study.ProtocolSection.StatusModule.StudyFirstPostDateStruct.Stud
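The dotted column names come from flattening the nested JSON. The sketch below is a plain-Python approximation of that flattening (not pandas' implementation), applied to a made-up, heavily trimmed record that only mimics the shape of a trial file; the `NCTId` value is a placeholder.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one dict with dot-separated keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # recurse into nested objects
        else:
            flat[path] = value  # leaves (including lists) are kept as-is
    return flat


# Hypothetical record; real files contain many more modules and fields
toy = {
    "FullStudy": {
        "Rank": 1,
        "Study": {
            "ProtocolSection": {
                "IdentificationModule": {"NCTId": "NCT00000000"},
            }
        },
    }
}

for column in flatten(toy):
    print(column)
```

Each printed key corresponds to one path from the root of the record to a leaf value, which is why the real column names are so long.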

To build a vectorstore, we should decide what to keep as `metadata` and what to `embed` for semantic retrieval.

`JSONLoader` accepts a `metadata_func` argument that we can use to supply metadata.

The `jq_schema` is a jq expression that selects the `ProtocolSection` of the study; the selected record is passed to `metadata_func` as a dictionary.
 
`extract_metadata` first looks up `IdentificationModule` in the `sample` dictionary using the `get` method.

If `IdentificationModule` is present, `get` returns its value; otherwise, it returns an empty dictionary (`{}`).

Next, the function looks up `NCTId` in that result.

If `NCTId` is present, its value is returned; otherwise, `None` is returned.

We can perform this for each desired metadata field.
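The chained `get` pattern can be tried on small dictionaries without any loader. The records below are made-up stand-ins for a `ProtocolSection`; their values are placeholders rather than data from a real trial, and the helper names are hypothetical.

```python
from typing import Any, Dict, Optional


def get_nct_id(section: Dict[str, Any]) -> Optional[str]:
    # A missing module falls back to {}, so the second .get returns None
    # instead of raising a KeyError.
    return section.get("IdentificationModule", {}).get("NCTId")


def get_phases(section: Dict[str, Any]) -> Optional[Any]:
    # Same idea, one level deeper: each missing level falls back to {}.
    return section.get("DesignModule", {}).get("PhaseList", {}).get("Phase")


complete = {
    "IdentificationModule": {"NCTId": "NCT00000000"},
    "DesignModule": {"PhaseList": {"Phase": ["Phase 1"]}},
}
partial = {"DesignModule": {"StudyType": "Observational"}}  # no ID, no phases

print(get_nct_id(complete), get_phases(complete))
print(get_nct_id(partial), get_phases(partial))
```

The second record shows the point of the pattern: absent fields surface as `None` rather than stopping the load.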

In [1]:
from typing import Any, Dict
from langchain.docstore.document import Document
from langchain.document_loaders import JSONLoader

def extract_metadata(sample: Dict[str, Any], default_metadata: Dict[str, Any]) -> Dict[str, Any]:
    nctid = sample.get('IdentificationModule', {}).get('NCTId', None)
    study_type = sample.get('DesignModule', {}).get('StudyType', None)
    phase_list = sample.get('DesignModule', {}).get('PhaseList', {}).get('Phase', None)
    metadata = {
        **default_metadata,
        'NCTId': nctid,
        'StudyType': study_type,
        'PhaseList': str(phase_list)  # Stored as a string: Chroma metadata values cannot be lists
    }
    return metadata

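def load_json(file_path: str) -> Document:
    """Load one clinical trial JSON file as a single Document."""
    loader = JSONLoader(
        file_path=file_path,
        # jq expression that selects the ProtocolSection of the study
        jq_schema='.FullStudy.Study.ProtocolSection',
        metadata_func=extract_metadata,
        # The selected content is a dict rather than a string
        text_content=False
    )

    data = loader.load()
    # The jq expression yields one record per file
    return data[0]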

# Process each of the selected JSON files
clinical_trial_data_sample = []
for json_file in first_5_jsons:
    file_path = os.path.join(dir_path, json_file)
    data = load_json(file_path)
    clinical_trial_data_sample.append(data)

## Summarization

### Prompt Definition 

* See list of OpenAI models [here](https://platform.openai.com/docs/models/gpt-3-5).

In [26]:
# Prompt 
from langchain.prompts import ChatPromptTemplate
template = """Summarize the following context:
{context}
"""
prompt = ChatPromptTemplate.from_template(template)

# LLM
from langchain.chat_models import ChatOpenAI
llm_openai = ChatOpenAI(model="gpt-3.5-turbo-16k",
                        temperature=0)

# Chain
# The StrOutputParser is a built-in output parser that converts the output of a language model into a string.
from langchain.schema.output_parser import StrOutputParser
chain = prompt | llm_openai | StrOutputParser()

# Docs
all_pages = ' '.join([p.page_content for p in pdf_pages])

# Invoke
chain.invoke({"context" : all_pages})

"The article discusses 11 clinical trials that are expected to shape medicine in 2023. These trials cover a range of medical conditions, including Parkinson's disease, Alzheimer's disease, ovarian cancer, muscular dystrophy, cervical cancer, weight loss, sleeping sickness, metastatic breast cancer, and COVID-19 vaccination in individuals with HIV. Leading researchers provide insights into the significance and potential outcomes of these trials. The article highlights the challenges faced by the biopharmaceutical industry in 2022, including clinical trial failures and disruptions caused by COVID-19. Despite these challenges, the article emphasizes the potential for new advancements and breakthroughs in medicine in the coming year."

### LangSmith  

* Use LangSmith to view the trace [here](https://smith.langchain.com/public/8fa348b6-3aa5-494f-bb3e-4e61bdc97744/r).
* Reference on LangSmith: [here](https://docs.smith.langchain.com/)

### Vary the LLM and Prompt

Claude 2 has a [100k token context window](https://www.anthropic.com/index/claude-2).
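Whether a whole paper fits in a single call depends on the model's context window. The sketch below is a rough budget check only: it assumes about 4 characters per token, which is a rule of thumb for English text and not an exact count (the `tiktoken` package gives exact counts for OpenAI models). The function names and the window sizes passed in are illustrative.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the 4-chars-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1


def fits_context(text: str, context_window: int, reserved_for_output: int) -> bool:
    """Check that the prompt text plus the reserved output budget fits the window."""
    return estimate_tokens(text) + reserved_for_output <= context_window


sample = "word " * 20_000  # 100,000 characters, standing in for `all_pages`

print(estimate_tokens(sample))
print(fits_context(sample, context_window=16_000, reserved_for_output=1_000))
print(fits_context(sample, context_window=100_000, reserved_for_output=10_000))
```

When the check fails for a smaller window, the options are a larger-context model or splitting the document, as in the RAG section.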

We can also use LangChain hub to explore different summarization prompts.
  
* [LangChain Docs](https://python.langchain.com/docs/integrations/chat/anthropic)
* [LangChain Hub](https://blog.langchain.dev/langchain-prompt-hub/)
* [Review of interesting prompts](https://blog.langchain.dev/the-prompt-landscape/)
* [Example summarization prompt](https://smith.langchain.com/hub/hwchase17/anthropic-paper-qa?ref=blog.langchain.dev)

In [6]:
from langchain import hub
from langchain.chat_models import ChatAnthropic

The summarization prompt is a fun, silly example that highlights the flexibility of Claude 2.

In [None]:
# Prompt from the Hub
prompt = hub.pull("hwchase17/anthropic-paper-qa")

In [22]:
# LLM
llm_anthropic = ChatAnthropic(temperature=0, model='claude-2', max_tokens=10000)

# Chain
chain = prompt | llm_anthropic | StrOutputParser()

# Invoke the chain
chain.invoke({"text" : all_pages})

' <kindergarten_abstract>\nThe doctors did a bunch of studies to see what medicines and tests work best. They looked at medicines for Parkinson\'s, Alzheimer\'s, and other diseases. They also looked at tests for cancer and COVID vaccines. The results will help doctors treat patients better next year.\n</kindergarten_abstract>\n\n<moosewood_methods>\nIngredients:\n- 11 leading medical experts\n- A dash of speculation\n- A pinch of prognostication\n\nInstructions:\n1. Gather the experts in a room with coffee and pastries. Make sure they are well-caffeinated. \n2. Ask each expert to name one clinical trial they are most excited about in 2023. Let them ramble on for a bit about the details. Take notes.\n3. Stir the trials together and look for common themes. Do the trials focus on new drug treatments, improved screening methods, or preventative measures? Categorize accordingly.  \n4. Sprinkle in some educated guesses about when results will be announced and potential impacts on medical pra

## RAG 

We may want to perform question-answering based on document context. 

### PDF

We can apply this to the PDF.

In [31]:
# Split documents
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(pdf_pages)

# Embed and add to vectorDB
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_documents(
    documents=all_splits,
    collection_name="rag",
    embedding=OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever()

# Prompt
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
rag_prompt = ChatPromptTemplate.from_template(template)

# RAG chain
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
chain = (
    RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
    | rag_prompt
    | llm_openai
    | StrOutputParser()
)
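To see what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-window splitter. It is a stand-in for illustration, not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line breaks; the function name `split_fixed` is hypothetical.

```python
from typing import List


def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


text = "abcdefghij" * 3  # 30 characters

no_overlap = split_fixed(text, chunk_size=12)
with_overlap = split_fixed(text, chunk_size=12, chunk_overlap=4)

print(no_overlap)
print(with_overlap)
```

With no overlap the chunks partition the text exactly; with overlap, the end of each chunk is repeated at the start of the next, which can help when an answer straddles a boundary.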

In [30]:
chain.invoke("What are some example clinical trials that are focused on cancer?")

'Based on the given context, one example of a clinical trial focused on cancer is the trial for mirvetuximab soravtan-sine from ImmunoGen for platinum-resistant ovarian cancer.'

Look at LangSmith trace [here](https://smith.langchain.com/public/c87b797b-78ef-42de-a5f3-3986af379943/r).
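The retriever returns the chunks whose embeddings are closest to the embedding of the question. The sketch below illustrates the idea with hand-written 3-dimensional toy vectors in place of real embeddings, and with cosine similarity as one common choice of score; the vector store's actual distance metric may differ. All names and numbers are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k entries of `index` most similar to `query`, best first."""
    scored = [(name, cosine(query, vector)) for name, vector in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings": the first axis loosely stands for cancer-related content
index = {
    "ovarian cancer chunk": [0.9, 0.1, 0.0],
    "Parkinson's chunk": [0.1, 0.9, 0.0],
    "cervical cancer chunk": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # pretend embedding of a question about cancer

results = top_k(query, index, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```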

## Private RAG 

We may want to perform question-answering based on document context without passing anything to external APIs.

We can use [Ollama.ai](https://ollama.ai/).

Download the app, and then pull your LLM of choice:

e.g., `ollama pull zephyr` for [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha), an LLM fine-tuned from Mistral.

Also, we will use local embeddings from GPT4All (CPU-optimized BERT embeddings).

In [33]:
from langchain.chat_models import ChatOllama
from langchain.embeddings import GPT4AllEmbeddings

In [35]:
# Add to vectorDB
vectorstore_private = Chroma.from_documents(
    documents=all_splits,
    collection_name="rag-private",
    embedding=GPT4AllEmbeddings(),
)
retriever_private = vectorstore_private.as_retriever()

# LLM
ollama_llm = "zephyr"
llm_private = ChatOllama(model=ollama_llm)

# RAG chain (uses the private retriever so that no text is sent to external APIs)
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
chain_private = (
    RunnableParallel({"context": retriever_private, "question": RunnablePassthrough()})
    | rag_prompt
    | llm_private
    | StrOutputParser()
)

Found model file at  /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin


In [36]:
chain_private.invoke("What are some example clinical trials that are focused on cancer?")

'Two examples of clinical trials that are focused on cancer mentioned in the given context are:\n1. Mirvetuximab soravtan-sin, a antibody-drug conjugate (ADC) for ovarian cancer. This trial resulted in accelerated approval by the US Food and Drug Administration based on results from a single-arm study enrolling 106 patients with platinum-resistant ovarian cancer whose tumors had high expression of a protein called folate receptor alpha (FRA). The author, Robert L. Coleman, expects this to be the most imminent and important upcoming trial result in his field in 2023.\n2. ADCs for previously treated cervical cancer, which are currently in development. Successful approval of these trials will provide a solid framework for clinical trials evaluating novel combinations in several disease settings.\nNote: The first example provided is specifically for ovarian cancer, while the second example is more general and encompasses other types of cancer as well.'

### Clinical Trial Data

We can apply this to the extracted clinical trial data.

In [2]:
# Split documents
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
all_splits = text_splitter.split_documents(clinical_trial_data_sample)

We preserve the metadata with each split.

In [3]:
all_splits[0].metadata

{'source': '/Users/rlm/Desktop/GENE-workshop/AllAPIJSON/NCT0000xxxx/NCT00000102.json',
 'seq_num': 1,
 'NCTId': 'NCT00000102',
 'StudyType': 'Interventional',
 'PhaseList': "['Phase 1', 'Phase 2']"}

In [5]:
# Embed and add to vectorDB
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
vectorstore = Chroma.from_documents(
    documents=all_splits,
    collection_name="rag-biomedical",
    persist_directory='./vectorstore',
    embedding=OpenAIEmbeddings(),
)

In [6]:
# Test
retriever = vectorstore.as_retriever()
docs = retriever.get_relevant_documents("What was the focus of trial NCT00000102?")
[doc.metadata["NCTId"] for doc in docs]

['NCT00000104', 'NCT00000104', 'NCT00000105', 'NCT00000105']

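We get a mix of results, none of them from the requested trial: similarity search alone does not treat the trial ID as an exact match.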

[Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma) allows for metadata filtering. 

* [LangChain guide](https://python.langchain.com/docs/integrations/vectorstores/chroma)
* [Chroma guide](https://docs.trychroma.com/usage-guide#filtering-by-metadata)
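Conceptually, a metadata filter keeps only the chunks whose metadata match the requested values before (or instead of) ranking by similarity. The sketch below shows plain equality matching on a list of dictionaries; it is a simplified illustration, not Chroma's implementation, and the chunk texts and IDs are placeholders.

```python
from typing import Any, Dict, List


def filter_by_metadata(chunks: List[Dict[str, Any]], where: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Keep chunks whose metadata equals `where` on every listed key."""
    return [
        chunk
        for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in where.items())
    ]


# Hypothetical chunks with the same metadata fields as in this notebook
chunks = [
    {"text": "chunk 1", "metadata": {"NCTId": "NCT-A", "StudyType": "Interventional"}},
    {"text": "chunk 2", "metadata": {"NCTId": "NCT-A", "StudyType": "Interventional"}},
    {"text": "chunk 3", "metadata": {"NCTId": "NCT-B", "StudyType": "Observational"}},
]

matches = filter_by_metadata(chunks, {"NCTId": "NCT-A"})
print([chunk["text"] for chunk in matches])
```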

In [49]:
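docs = vectorstore.get(where={"NCTId": "NCT00000102"})
# Note: `get` returns a dict of parallel lists (ids, documents, metadatas, ...),
# so len(docs) counts its keys; len(docs["ids"]) counts the matching chunks
len(docs)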

4

We can build a retriever that reasons about metadata filter(s) from the user question.

In [45]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# Provide context about the metadata
metadata_field_info = [
    
    AttributeInfo(
        name="NCTId",
        description="The unique identifier assigned to each clinical trial when registered on ClinicalTrials.gov. ",
        type="string",
    ),
    AttributeInfo(
        name="StudyType",
        description="The nature of the study, indicating whether participants receive specific interventions or are merely observed for specific outcomes.",
        type="string",
    ),
    AttributeInfo(
        name="PhaseList",
        description="This pertains to the phase of the study in drug trials.",
        type="string",
    )
]

# Overall context for the data
document_content_description = "Information about clinical trials on ClinicalTrials.gov"

# LLM
llm = OpenAI(temperature=0)

# Retriever
retriever_self_query = SelfQueryRetriever.from_llm(
    llm, 
    vectorstore, 
    document_content_description, 
    metadata_field_info, 
    verbose=True
)

In [46]:
docs = retriever_self_query.get_relevant_documents("What was the focus of trial NCT00000102?")
[doc.metadata["NCTId"] for doc in docs]

['NCT00000102', 'NCT00000102', 'NCT00000102', 'NCT00000102']
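The self-query retriever uses an LLM to turn the question into a search string plus a metadata filter. As a rough intuition only, the sketch below replaces that LLM step with a regular expression that looks for an NCT identifier (`NCT` followed by eight digits); the function `split_query` is hypothetical and handles just this one field.

```python
import re
from typing import Any, Dict, Tuple

NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def split_query(question: str) -> Tuple[str, Dict[str, Any]]:
    """Return (semantic_query, metadata_filter) for a user question.

    The question itself is kept as the semantic query; a filter is added
    only when an NCT identifier is found.
    """
    match = NCT_PATTERN.search(question)
    if match is None:
        return question, {}
    return question, {"NCTId": match.group(0)}


query, where = split_query("What was the focus of trial NCT00000102?")
print(query)
print(where)
```

An LLM-based query constructor generalizes this to fields that cannot be matched with a fixed pattern, such as study type or phase, at the cost of an extra model call.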

## Other Resources

* [Semi-Structured RAG](https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb)
* [Multi-Modal RAG](https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_structured_and_multi_modal_RAG.ipynb)
* [Plate Reader Template](https://github.com/langchain-ai/langchain/tree/master/templates/plate-chain)