## RAG using a PDF book
* see: https://python.langchain.com/docs/use_cases/question_answering/
* using Cohere embeddings
* using a custom prompt

In [1]:
# for pdf post processing
import re

# modified to load from a PDF
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate

# two possible vector stores
from langchain.vectorstores import Chroma
from langchain.vectorstores import FAISS

# removed OpenAI, using Cohere embeddings
from langchain.embeddings import CohereEmbeddings

from langchain import hub

# removed OpenAI, using OCI GenAI
from oci.config import from_file

# oci_llm is in a local file
from oci_llm import OCIGenAILLM

from langchain.schema.runnable import RunnablePassthrough

# private configs
from config_private import COMPARTMENT_OCID, COHERE_API_KEY

In [2]:
# to enable some debugging
DEBUG = False

#### Template for custom prompt

In [3]:
# this is the template for the prompt
template = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer don't try to make up an answer. 
Use five sentences maximum. 
Always say "Thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""
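
As a quick check of what the model will actually receive, the template can be filled by hand. This is a minimal sketch that uses plain `str.format` instead of `PromptTemplate`, on the assumption that both substitute the `{context}` and `{question}` placeholders the same way for a simple template like this one. The sample context and question strings are made up for illustration.

```python
# Stand-alone copy of the prompt template (same wording as in the cell above).
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer don't try to make up an answer.
Use five sentences maximum.
Always say "Thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""

# Illustrative values only: in the notebook the context comes from the retriever.
filled = template.format(
    context="Botulism is caused by a toxin produced by Clostridium botulinum.",
    question="What causes botulism?",
)

print(filled)
```

Printing the filled prompt makes it easy to verify that the retrieved context lands before the question and that the prompt ends with `Helpful Answer:`.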

In [4]:
# functions
def get_answer(rag_chain, question):
    response = rag_chain.invoke(question)

    print(f"Question: {question}")
    print()
    print("The response:")
    print(response)
    print()

In [5]:
# read OCI config to connect to OCI with API key
CONFIG_PROFILE = "DEFAULT"
config = from_file("~/.oci/config", CONFIG_PROFILE)

# OCI GenAI endpoint (for now Chicago)
ENDPOINT = "https://generativeai.aiservice.us-chicago-1.oci.oraclecloud.com"

# inspect the config that holds the API key settings
if DEBUG:
    print(config)

#### Loading the document

In [6]:
# BLOG_POST = "https://python.langchain.com/docs/get_started/introduction"
BOOK = "./CurrentEssentialsOfMedicine.pdf"

loader = PyPDFLoader(BOOK)

data = loader.load()

#### Splitting the document in chunks

In [7]:
CHUNK_SIZE = 2000
CHUNK_OVERLAP = 100

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
)

splits = text_splitter.split_documents(data)

In [8]:
print(f"We have split the PDF into {len(splits)} splits...")

We have split the PDF into 686 splits...
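
To illustrate how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sketch: a fixed-width sliding window over characters. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to split on separators such as paragraph breaks, newlines, and spaces; the function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-width chunks; consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)

for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one. The overlap serves the same purpose in the notebook: a sentence that straddles a chunk boundary still appears in full in at least one chunk, as long as it is shorter than the overlap.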


In [9]:
# some post processing

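# replace newlines with spaces, then replace every character that is
# not a letter, a digit, a space or a period with a space
for split in splits:
    split.page_content = split.page_content.replace("\n", " ")
    split.page_content = re.sub(r"[^a-zA-Z0-9 \n.]", " ", split.page_content)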

In [10]:
# have a look at a single split
splits[20].page_content

'ReferencePoole Wilson PA  V ok  Z  Kirwan BA  de Brouwer S  Dunselman PH  Lubsen J  ACTION investigators. Clinical course of isolated stable angina due to coronaryheart disease. Eur Heart J 2007 28 1928.  PMID  17562665 '
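
The cleanup above also removes any non-ASCII character, which can leave holes inside words. The sketch below shows one possible mitigation, assuming the extracted text contains Unicode ligature characters such as U+FB01 ("ﬁ"): applying NFKC normalization first expands them to plain letters. The helper `clean_text` and the sample string are hypothetical, and NFKC also rewrites other compatibility characters, so the effect on a real book should be checked before adopting it.

```python
import re
import unicodedata


def clean_text(text: str, expand_ligatures: bool) -> str:
    """Apply the notebook's cleanup, optionally expanding ligatures first."""
    if expand_ligatures:
        # NFKC maps compatibility characters such as U+FB01 to "fi"
        text = unicodedata.normalize("NFKC", text)
    text = text.replace("\n", " ")
    return re.sub(r"[^a-zA-Z0-9 \n.]", " ", text)


# made-up sample containing two "fi" ligatures and a hyphen
sample = "mid-lung \ufb01eld in\ufb01ltrates"

without_nfkc = clean_text(sample, expand_ligatures=False)
with_nfkc = clean_text(sample, expand_ligatures=True)

print(repr(without_nfkc))
print(repr(with_nfkc))
```

Without normalization each ligature becomes a space and the words are broken; with it the letters survive, which should give the embedding model cleaner input.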

#### Embeddings and Vector Store

In [11]:
%%time

cohere = CohereEmbeddings(cohere_api_key=COHERE_API_KEY)

# using Chroma or FAISS as vector store
vectorstore = Chroma.from_documents(documents=splits, embedding=cohere)
# vectorstore = FAISS.from_documents(documents=splits, embedding=cohere)

retriever = vectorstore.as_retriever()

CPU times: user 2.93 s, sys: 254 ms, total: 3.19 s
Wall time: 6.38 s


#### Define the prompt structure

In [12]:
rag_prompt_custom = PromptTemplate.from_template(template)

#### Define the LLM: OCI GenAI

In [13]:
# compartment OCID from config_private.py

# using mostly defaults
llm = OCIGenAILLM(
    temperature=1.0,
    max_tokens=1500,
    config=config,
    compartment_id=COMPARTMENT_OCID,
    endpoint=ENDPOINT,
    debug=DEBUG,
)

#### Define the (Lang)Chain

In [14]:
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()} | rag_prompt_custom | llm
)
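
Conceptually, the chain sends the question both to the retriever and through unchanged, fills the prompt with the two results, and passes the prompt to the LLM. The plain-Python sketch below mirrors that data flow without LangChain. `CORPUS`, `fake_retriever`, `format_docs`, `fake_llm`, `build_prompt`, and `run_rag` are hypothetical stand-ins, not part of the notebook or of any library, and the explicit `format_docs` step is an assumption about how retrieved texts could be joined before they reach `{context}`.

```python
from typing import Dict, List

PROMPT = "{context}\nQuestion: {question}\nHelpful Answer:"

# toy "vector store": keyword -> chunk text
CORPUS: Dict[str, str] = {
    "botulism": "Chunk of text about botulism.",
    "tuberculosis": "Chunk of text about tuberculosis.",
}


def fake_retriever(question: str) -> List[str]:
    """Return the chunks whose keyword occurs in the question."""
    lowered = question.lower()
    return [text for keyword, text in CORPUS.items() if keyword in lowered]


def format_docs(docs: List[str]) -> str:
    """Join the retrieved chunks into one context string."""
    return "\n\n".join(docs)


def build_prompt(question: str) -> str:
    # same shape as {"context": retriever, "question": RunnablePassthrough()}
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    return PROMPT.format(**inputs)


def fake_llm(prompt: str) -> str:
    """Stand-in for the model call."""
    return f"[model reply to a prompt of {len(prompt)} characters]"


def run_rag(question: str) -> str:
    return fake_llm(build_prompt(question))


print(build_prompt("What is botulism?"))
print(run_rag("What is botulism?"))
```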

#### Process the question

In [15]:
# a list of possible questions
QUESTION1 = "What is the suggested treatment for Botulism?"
QUESTION2 = "List diagnosis for Botulims?"
QUESTION3 = "List the antibiotics commonly used for tubercolosis"

In [16]:
%%time

# the question
get_answer(rag_chain, question=QUESTION1)

Question: What is the suggested treatment for Botulism?

The response:
 The suggested treatment for botulism is removal of unabsorbed toxin from the gut and specific antitoxin. 
Passive immunization with tetanus immune globulin and concurrent active immunization may also be required. 
Intravenous immunoglobulin therapy may provide a short-term benefit in some cases, especially in myasthenic crisis.

CPU times: user 77 ms, sys: 7.54 ms, total: 84.6 ms
Wall time: 3.54 s


In [17]:
%%time

# the question
get_answer(rag_chain, question=QUESTION2)

Question: List diagnosis for Botulims?

The response:
 Here are some possible diagnoses for botulism:

- Clostridium botulinum: this is the most common cause of botulism, and it is usually associated with the consumption of contaminated food. 
- Bulbar poliomyelitis: this is a paralytic disease caused by a virus that affects the nervous system. 
- Myasthenia gravis: this is a disease that causes muscle weakness and fatigue. 
- Posterior cerebral circulation ischemia: this is a stroke caused by a blockage in the blood flow to the back of the brain. 
- Tick paralysis: this is a disease caused by a tick bite that can cause paralysis in the affected limb. 
- Guillain Barr syndrome or variant: this is a rare disorder of the immune system that can cause muscle weakness and paralysis. 
- Inorganic phosphorus poisoning: this is a toxic metal poisoning that can cause muscle weakness and neurological damage.

It is important to note that the diagnosis of botulism can be difficult, as the symptom

In [18]:
%%time

# the question
get_answer(rag_chain, question=QUESTION3)

Question: List the antibiotics commonly used for tubercolosis

The response:
 The most common antibiotics used for tuberculosis are isoniazid and rifampin. 
Other antibiotics that may be used include ethambutol, pyrazinamide and streptomycin. 
Isoniazid and rifampin are both considered first-line treatments for tuberculosis, as they have been found to be effective in treating the disease. 

Tuberculosis is caused by the bacteria Mycobacterium tuberculosis. It is a highly contagious disease that can be spread through the air by coughing or sneezing. It is most common in developing countries, but it can also be found in developed countries.

Tuberculosis can be treated with a combination of antibiotics and other medications. The most common treatment is a four-drug regimen that includes isoniazid and rifampin. Other drugs that may be used include ethambutol, pyrazinamide, and streptomycin.

It is important to note that tuberculosis is a serious disease that can be life-threatening if not

#### Explore the vector store

In [19]:
# Retrieve relevant splits for any question using similarity search.

# This is simply "top K" retrieval where we select documents based on embedding similarity to the query.

TOP_K = 5

docs = vectorstore.similarity_search(QUESTION3, k=TOP_K)

len(docs)

5
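
The idea behind "top K" retrieval can be sketched in a few lines of plain Python. Cosine similarity is used here as an assumption; the metric actually applied depends on the vector store and its configuration. The two-dimensional vectors are toy values, whereas real embeddings have hundreds or thousands of dimensions, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# toy embeddings for four chunks and one query
chunk_vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
query_vector = [1.0, 0.1]

best = top_k(query_vector, chunk_vectors, k=2)
print(best)
```

A real vector store typically replaces the exhaustive scan with an index so that the search stays fast as the number of chunks grows.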

In [20]:
for i, doc in enumerate(docs):
    print(f"chunk n. {i+1}")
    print(doc.page_content)
    print()

chunk n. 1
238 Current Essentials of Medicine 8Tuberculosis   Mycobacterium tuberculosis    Essentials of Diagnosis  Most infections subclinical  with positive skin test only  Symptoms progressive and include cough  dyspnea  fever  nightsweats  weight loss  and hemoptysis  In primary infection  mid lung  eld in ltrates with regional lym phadenopathy  pleural effusion common  Apical  bronodular pulmonary in ltrate on chest  lm  with orwithout cavitation  is most typical in reactivated disease  Posttussive rales noted on auscultation  Most common extrapulmonary manifestations include meningi tis  genitourinary infection  miliary disease  arthritis  with local ized symptoms and signs  Differential Diagnosis  Pneumonia of other cause  bacterial and fungal  histoplasmosis coccidioidomycosis  most similar  Other mycobacterial infection  HIV infection  may be associated   Prolonged fever of other cause  Urinary tract infection  oligoarticular arthritis of other cause  Carcinoma of the lung  L