## Dependencies 

In [None]:
! pip install langchain "unstructured[all-docs]" pypdf langchainhub anthropic openai chromadb gpt4all

## Summarization

### Document Loading 

* File: [Clinical trials of interest for 2023](https://www.nature.com/articles/s41591-022-02132-3)
* Reference on loaders: [LangChain Docs](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf)

In [25]:
from langchain.document_loaders import PyPDFLoader
path = "/Users/rlm/Desktop/GENE-workshop/s41591-022-02132-3.pdf"
loader = PyPDFLoader(path)
pdf_pages = loader.load()

You can explore a broad set of loaders [here](https://python.langchain.com/docs/integrations/document_loaders) and [here](https://integrations.langchain.com/).

For example, we can load from URLs or [CSVs](https://python.langchain.com/docs/integrations/document_loaders/csv) (e.g., from [Clinical Trials](https://clinicaltrials.gov/study/NCT04296890)).

In [24]:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
blog = loader.load()
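
The cell above covers URLs. For CSVs, a common approach is to turn each row into its own document. The sketch below shows that idea with the standard library only. `Doc`, `rows_to_docs`, and the sample rows are hypothetical and made up for illustration; this is not the LangChain loader itself.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loader's document object."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_docs(csv_text: str, source: str = "inline") -> list[Doc]:
    """Turn each CSV row into one document made of 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for i, row in enumerate(reader):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append(Doc(page_content=content, metadata={"source": source, "row": i}))
    return docs


# Made-up rows for illustration only; these are not real trial records.
sample = "trial_id,condition,phase\nT-001,ovarian cancer,3\nT-002,cervical cancer,2\n"
docs = rows_to_docs(sample)
print(docs[0].page_content)
```

Keeping the row index in the metadata makes it possible to trace an answer back to the row it came from.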

### Prompt Definition 

* See list of OpenAI models [here](https://platform.openai.com/docs/models/gpt-3-5).

In [None]:
import os
# os.environ['OPENAI_API_KEY'] = 'xxx'

In [26]:
# Prompt 
from langchain.prompts import ChatPromptTemplate
template = """Summarize the following context:
{context}
"""
prompt = ChatPromptTemplate.from_template(template)

# LLM
from langchain.chat_models import ChatOpenAI
llm_openai = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)

# Chain
from langchain.schema.output_parser import StrOutputParser
chain = prompt | llm_openai | StrOutputParser()

# Docs
all_pages = ' '.join([p.page_content for p in pdf_pages])

# Invoke
chain.invoke({"context": all_pages})

"The article discusses 11 clinical trials that are expected to shape medicine in 2023. These trials cover a range of medical conditions, including Parkinson's disease, Alzheimer's disease, ovarian cancer, muscular dystrophy, cervical cancer, weight loss, sleeping sickness, metastatic breast cancer, and COVID-19 vaccination in individuals with HIV. Leading researchers provide insights into the significance and potential outcomes of these trials. The article highlights the challenges faced by the biopharmaceutical industry in 2022, including clinical trial failures and disruptions caused by COVID-19. Despite these challenges, the article emphasizes the potential for new advancements and breakthroughs in medicine in the coming year."

### LangSmith  

* Use LangSmith to view the trace [here](https://smith.langchain.com/public/8fa348b6-3aa5-494f-bb3e-4e61bdc97744/r).
* Reference on LangSmith: [here](https://docs.smith.langchain.com/)

### Vary the LLM and Prompt

Claude 2 has a [100k token context window](https://www.anthropic.com/index/claude-2).
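
Because this notebook passes the whole document to the model as a single string, it can help to estimate whether the text fits a given context window before invoking a chain. The sketch below uses a crude four-characters-per-token heuristic, which is an assumption; a real tokenizer will give different counts. The function names, the placeholder pages, and the `reserved_for_output` budget are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count; real tokenizers will differ."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(text: str, context_window: int, reserved_for_output: int = 1000) -> bool:
    """Check the estimated prompt size plus an output budget against the window."""
    return estimate_tokens(text) + reserved_for_output <= context_window


# Placeholder pages; in the notebook this would be the joined PDF text.
sample_pages = ["page one text " * 200, "page two text " * 200]
joined = " ".join(sample_pages)

# Window sizes are passed in explicitly rather than looked up per model.
for window in (2_000, 16_000, 100_000):
    print(window, fits_in_context(joined, context_window=window))
```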

We can also use LangChain hub to explore different summarization prompts.
  
* [LangChain Docs](https://python.langchain.com/docs/integrations/chat/anthropic)
* [LangChain Hub](https://blog.langchain.dev/langchain-prompt-hub/)
* [Review of interesting prompts](https://blog.langchain.dev/the-prompt-landscape/)
* [Example summarization prompt](https://smith.langchain.com/hub/hwchase17/anthropic-paper-qa?ref=blog.langchain.dev)

In [5]:
from langchain import hub
from langchain.chat_models import ChatAnthropic

In [None]:
# os.environ['ANTHROPIC_API_KEY'] = 'xxx'

The summarization prompt is a fun, silly example to highlight the flexibility of Claude 2.
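
Note that the Hub prompt is invoked with the key `text`, while the earlier template was invoked with `context`: the key passed to `invoke` has to match the placeholder name inside the prompt. A minimal, library-free way to see which placeholders a format-style template expects is sketched below. `template_variables` is a hypothetical helper, not a LangChain function.

```python
from string import Formatter


def template_variables(template: str) -> list[str]:
    """Return the named {placeholders} of a format-style template, in order."""
    names = []
    for _, field_name, _, _ in Formatter().parse(template):
        if field_name and field_name not in names:
            names.append(field_name)
    return names


summary_template = """Summarize the following context:
{context}
"""
print(template_variables(summary_template))
```

For a prompt pulled from the Hub, inspecting the prompt object before invoking it serves the same purpose.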

In [22]:
# Prompt from the Hub
prompt = hub.pull("hwchase17/anthropic-paper-qa")

# LLM
llm_anthropic = ChatAnthropic(temperature=0, model='claude-2', max_tokens=10000)

# Chain
chain = prompt | llm_anthropic | StrOutputParser()

# Invoke the chain
chain.invoke({"text": all_pages})

' <kindergarten_abstract>\nThe doctors did a bunch of studies to see what medicines and tests work best. They looked at medicines for Parkinson\'s, Alzheimer\'s, and other diseases. They also looked at tests for cancer and COVID vaccines. The results will help doctors treat patients better next year.\n</kindergarten_abstract>\n\n<moosewood_methods>\nIngredients:\n- 11 leading medical experts\n- A dash of speculation\n- A pinch of prognostication\n\nInstructions:\n1. Gather the experts in a room with coffee and pastries. Make sure they are well-caffeinated. \n2. Ask each expert to name one clinical trial they are most excited about in 2023. Let them ramble on for a bit about the details. Take notes.\n3. Stir the trials together and look for common themes. Do the trials focus on new drug treatments, improved screening methods, or preventative measures? Categorize accordingly.  \n4. Sprinkle in some educated guesses about when results will be announced and potential impacts on medical pra

## RAG 

We may want to perform question-answering based on document context. 
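
As a rough mental model of the retrieval step (split, embed, find the closest chunks), here is a library-free sketch. The word-count "embedding", the function names, and the sample sentences are all made up for illustration; the notebook itself uses OpenAI embeddings and Chroma, which behave quite differently in practice.

```python
import math
import re
from collections import Counter


def split_text(text: str, chunk_size: int) -> list[str]:
    """Fixed-width character chunks with no overlap (a simplified splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Illustrative chunks only; not text from the article.
chunks = [
    "A trial of a new antibody drug for ovarian cancer.",
    "A study of weight loss medication.",
    "Vaccination against COVID in people with HIV.",
]
print(retrieve("Which trials focus on cancer?", chunks))
```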

In [31]:
# Split documents
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(pdf_pages)

# Embed and add to vectorDB
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_documents(
    documents=all_splits,
    collection_name="rag",
    embedding=OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever()

# Prompt
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
rag_prompt = ChatPromptTemplate.from_template(template)

# RAG chain
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
chain = (
    RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
    | rag_prompt
    | llm_openai
    | StrOutputParser()
)
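
The chain above can be pictured as plain function composition: the dict step sends the same question to both the retriever and a passthrough, and each later stage consumes the previous stage's output. The sketch below mimics that shape with ordinary functions. Every name in it is a hypothetical stand-in, and it is not how LangChain implements runnables.

```python
from typing import Callable


def parallel(steps: dict[str, Callable[[str], object]]) -> Callable[[str], dict]:
    """Feed the same input to every step and collect the results under each key."""
    return lambda x: {key: step(x) for key, step in steps.items()}


def pipe(*stages: Callable) -> Callable:
    """Compose stages left to right, similar in spirit to the | operator."""
    def run(x):
        for stage in stages:
            x = stage(x)
        return x
    return run


# Hypothetical stand-ins for the retriever, prompt, and model.
def fake_retriever(question: str) -> list[str]:
    return ["chunk about ovarian cancer"]


def passthrough(question: str) -> str:
    return question


def fill_prompt(inputs: dict) -> str:
    return f"Context: {inputs['context']}\nQuestion: {inputs['question']}"


def fake_llm(prompt_text: str) -> str:
    return f"(model reply to {len(prompt_text)} chars)"


toy_chain = pipe(
    parallel({"context": fake_retriever, "question": passthrough}),
    fill_prompt,
    fake_llm,
)
print(toy_chain("Which trials focus on cancer?"))
```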

In [30]:
chain.invoke("What are some example clinical trials that are focused on cancer?")

'Based on the given context, one example of a clinical trial focused on cancer is the trial for mirvetuximab soravtan-sine from ImmunoGen for platinum-resistant ovarian cancer.'

See the LangSmith trace [here](https://smith.langchain.com/public/c87b797b-78ef-42de-a5f3-3986af379943/r).

## Private RAG 

We may want to perform question-answering based on document context without passing anything to external APIs.

We can use [Ollama.ai](https://ollama.ai/).

Download the app, and then pull your LLM of choice:

For example, `ollama pull zephyr` pulls [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha), an LLM fine-tuned from Mistral.

Also, we will use local embeddings from GPT4All (CPU-optimized BERT embeddings).
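
Before building the private chain, it can be useful to confirm that the local Ollama server is reachable and that the model has been pulled. The sketch below assumes the server listens on `http://localhost:11434` and lists pulled models as JSON at `/api/tags`; treat both as assumptions to check against your installation. `local_ollama_models` is a hypothetical helper.

```python
import json
import urllib.error
import urllib.request


def local_ollama_models(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return names of locally pulled models, or None if the server is unreachable."""
    try:
        # Assumed endpoint; adjust if your Ollama setup differs.
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None
    if not isinstance(payload, dict):
        return None
    models = payload.get("models", [])
    return [m.get("name", "") for m in models if isinstance(m, dict)]


models = local_ollama_models()
if models is None:
    print("Ollama server not reachable; start the app first.")
else:
    print("Pulled models:", models)
```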

In [33]:
from langchain.chat_models import ChatOllama
from langchain.embeddings import GPT4AllEmbeddings

In [35]:
# Add to vectorDB
vectorstore_private = Chroma.from_documents(
    documents=all_splits,
    collection_name="rag-private",
    embedding=GPT4AllEmbeddings(),
)
retriever_private = vectorstore_private.as_retriever()

# LLM
ollama_llm = "zephyr"
llm_private = ChatOllama(model=ollama_llm)

# RAG chain
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
chain_private = (
    # Use the local retriever so the question is never sent to an external embedding API
    RunnableParallel({"context": retriever_private, "question": RunnablePassthrough()})
    | rag_prompt
    | llm_private
    | StrOutputParser()
)

Found model file at  /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin


In [36]:
chain_private.invoke("What are some example clinical trials that are focused on cancer?")

'Two examples of clinical trials that are focused on cancer mentioned in the given context are:\n1. Mirvetuximab soravtan-sin, a antibody-drug conjugate (ADC) for ovarian cancer. This trial resulted in accelerated approval by the US Food and Drug Administration based on results from a single-arm study enrolling 106 patients with platinum-resistant ovarian cancer whose tumors had high expression of a protein called folate receptor alpha (FRA). The author, Robert L. Coleman, expects this to be the most imminent and important upcoming trial result in his field in 2023.\n2. ADCs for previously treated cervical cancer, which are currently in development. Successful approval of these trials will provide a solid framework for clinical trials evaluating novel combinations in several disease settings.\nNote: The first example provided is specifically for ovarian cancer, while the second example is more general and encompasses other types of cancer as well.'

## Semi-Structured RAG 

Cookbook: [Semi_Structured_RAG.ipynb](https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb)

## Multi-Modal RAG

Cookbook: [Semi_structured_and_multi_modal_RAG.ipynb](https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_structured_and_multi_modal_RAG.ipynb)

## Templates

Laboratory plates: [plate-chain](https://github.com/langchain-ai/langchain/tree/master/templates/plate-chain)