## Retrieval Augmented Generation (RAG)

We will build a Retrieval Augmented Generation (RAG) application using the steps discussed in the lesson:
* Step 1: Document Loading
* Step 2: Splitting Text into Chunks
* Step 3: Storing Text in a Vector Store
* Step 4: Query and Retrieve Text
* Step 5: Output Answer with Retrieved Text and LLM Augmented Generation

## Load Libraries

In [None]:
!pip install langchain openai tiktoken chromadb python-dotenv langchain_community
!pip install -U langchain_openai
!pip install docarray
!pip install gdown
!pip install pypdf


In [None]:

from langchain_openai import AzureChatOpenAI
import os
from dotenv import load_dotenv

# Load the variables defined in the .env file into the environment
load_dotenv()

AZURE_ENDPOINT = os.getenv('AZURE_ENDPOINT')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
DEPLOYMENT_NAME = os.getenv('DEPLOYMENT_NAME')
OPENAI_API_VERSION = os.getenv('OPENAI_API_VERSION')

# AzureChatOpenAI from langchain_openai replaces the deprecated langchain.chat_models class
llm = AzureChatOpenAI(
    azure_deployment=DEPLOYMENT_NAME,
    api_version=OPENAI_API_VERSION,
    api_key=OPENAI_API_KEY,
    azure_endpoint=AZURE_ENDPOINT,
    temperature=0.9,
)

# Test the LLM
print(llm.invoke([{'role': 'user', 'content': 'Which is the largest country by area in the world?'}]).content)



The largest country by area in the world is Russia. It spans over 17 million square kilometers (approximately 6.6 million square miles) and covers a significant portion of Eastern Europe and northern Asia.
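
`os.getenv` returns `None` when a variable is absent, so a missing `.env` entry only surfaces later as an authentication or endpoint error. Below is a minimal sketch of an early check, assuming the four variable names used above. The helper name `missing_settings` is hypothetical and not part of any library.

```python
import os

# Variable names taken from the configuration cell above
REQUIRED_SETTINGS = ("AZURE_ENDPOINT", "OPENAI_API_KEY", "DEPLOYMENT_NAME", "OPENAI_API_VERSION")


def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the required names that are unset or empty in the mapping `env`."""
    return [name for name in required if not env.get(name)]


# Example with a hand-written mapping; in the notebook you would pass os.environ
example_env = {"AZURE_ENDPOINT": "https://example.invalid", "OPENAI_API_KEY": ""}
print(missing_settings(example_env))
```

Calling `missing_settings(os.environ)` right after `load_dotenv()` and raising an error when the list is non-empty would likely give a clearer failure than a rejected API call.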


## Step 1: Document Loading

Create a directory named `data` and copy the PDF document into it.

In [None]:
!mkdir -p data

Read the PDF document into a list of documents, one per page of text.

In [None]:
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader

# Extract data from the PDF files in a directory
def load_pdf_file(data):
    # Load every PDF in the directory; PyPDFLoader yields one Document per page
    loader = DirectoryLoader(data,
                             glob="*.pdf",
                             loader_cls=PyPDFLoader)
    documents = loader.load()
    return documents

extracted_data = load_pdf_file(data='./data/')

In [None]:
len(extracted_data)

68
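
The page count alone does not show whether text was actually extracted. Image-only pages, if the PDF has any (an assumption, nothing above indicates it), would come back with empty text. The sketch below counts pages and flags empty ones. `summarize_pages` is a hypothetical helper, and `SimpleNamespace` objects stand in for the loaded pages so the example runs on its own.

```python
from types import SimpleNamespace


def summarize_pages(documents):
    """Count pages and list the indexes of pages whose extracted text is empty."""
    empty = [i for i, doc in enumerate(documents) if not doc.page_content.strip()]
    return {"pages": len(documents), "empty_pages": empty}


# Stand-in objects exposing the same `page_content` attribute as the loaded pages
sample = [
    SimpleNamespace(page_content="The GALE\nENCYCLOPEDIA"),
    SimpleNamespace(page_content="   "),
    SimpleNamespace(page_content="Acupressure is a form of touch therapy"),
]
print(summarize_pages(sample))
```

In the notebook, `summarize_pages(extracted_data)` would apply the same check to the 68 loaded pages.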

## Step 2: Splitting Text into Chunks

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the data into text chunks
def text_split(extracted_data):
    # Chunks of at most 500 characters, with up to 20 characters of overlap between neighbours
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
    text_chunks = text_splitter.split_documents(extracted_data)
    return text_chunks

text_chunks = text_split(extracted_data)
print("Length of Text Chunks", len(text_chunks))

Length of Text Chunks 561
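
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is only an illustration of the two parameters, not the algorithm `RecursiveCharacterTextSplitter` uses (that class prefers to break on separators such as paragraphs and newlines). The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

Each chunk starts with the last character of the previous one, which is the role the 20-character overlap plays at the larger scale used above.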


In [None]:
text_chunks[0]

Document(metadata={'source': 'data/encyclopedia-of-medicine-shortv1.pdf', 'page': 1}, page_content='The GALE\nENCYCLOPEDIA\nof MEDICINE\nSECOND EDITION')

## Step 3: Storing Text in a Vector Store

In [None]:
from langchain_community.vectorstores import Chroma
from langchain_openai import AzureOpenAIEmbeddings

# Reuses OPENAI_API_KEY, OPENAI_API_VERSION and AZURE_ENDPOINT loaded from .env earlier
embeddings = AzureOpenAIEmbeddings(
    model="text-embedding-3-large",
    api_key=OPENAI_API_KEY,
    openai_api_version=OPENAI_API_VERSION,
    azure_endpoint=AZURE_ENDPOINT,
)

# Embed each chunk's text and persist it to ./meddoc_db (only page_content is stored, not metadata)
vectordb = Chroma.from_texts(
    [t.page_content for t in text_chunks],
    embeddings,
    collection_name="meddoc",
    persist_directory="./meddoc_db",
)
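
`Chroma.from_texts` embeds every chunk in a single call. If the embedding endpoint enforces request-size or rate limits (an assumption; nothing in the run above indicates a failure), one option is to add the chunks in smaller batches. The sketch below shows only the batching logic. `batched` is a hypothetical helper, and the call to the store appears only as a comment.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"chunk {i}" for i in range(5)]
for batch in batched(texts, 2):
    # With an existing store, this line could be: vectordb.add_texts(batch)
    print(len(batch), batch[0])
```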


## Step 4: Query and Retrieve Text

In [None]:
question = "What are Acupressure?"
results = vectordb.similarity_search(question, k=3)
print(results[0])


page_content='Definition
Acupressure is a form of touch therapy that utilizes
the principles of acupuncture and Chinese medicine. In
acupressure, the same points on the body are used as in
acupuncture, but are stimulated with finger pressure
GALE ENCYCLOPEDIA OF MEDICINE 2 35
Acupressure
GEM - 0001 to 0432 - A  10/22/03 1:41 PM  Page 35'
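
Conceptually, a similarity search embeds the question and returns the `k` stored vectors closest to it. The sketch below uses hand-made 2-D vectors and cosine similarity to show the ranking step. It is an illustration only: Chroma's index and distance metric may differ, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indexes of the k vectors most similar to the query, best first."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hand-made 2-D vectors standing in for real embeddings
doc_vecs = [[1.0, 0.1], [0.0, 1.0], [0.8, 0.6]]
query_vec = [1.0, 0.0]
print(top_k(query_vec, doc_vecs, k=2))
```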


In [None]:
# Specifying top k
retriever = vectordb.as_retriever(search_kwargs={"k": 10})
print(retriever.invoke("What are Acupressure?")[0])

# Similarity score threshold retrieval
# naive_retriever = vectordb.as_retriever(search_kwargs={"score_threshold": 0.8}, search_type="similarity_score_threshold")

# Maximum marginal relevance retrieval
# naive_retriever = vectordb.as_retriever(search_type="mmr")

page_content='Definition
Acupressure is a form of touch therapy that utilizes
the principles of acupuncture and Chinese medicine. In
acupressure, the same points on the body are used as in
acupuncture, but are stimulated with finger pressure
GALE ENCYCLOPEDIA OF MEDICINE 2 35
Acupressure
GEM - 0001 to 0432 - A  10/22/03 1:41 PM  Page 35'
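
Maximum marginal relevance (MMR), mentioned in the commented-out option above, trades relevance against redundancy: each pick maximizes `lambda_mult * relevance - (1 - lambda_mult) * similarity to what is already selected`. The following is a minimal greedy sketch on hand-made vectors, not LangChain's or Chroma's implementation. The function names and the default `lambda_mult=0.5` are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedy maximal marginal relevance; returns the selected indexes in pick order."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine_similarity(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to anything already picked
            redundancy = max(
                (cosine_similarity(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Vectors 0 and 1 are near-duplicates; vector 2 is less relevant but different
doc_vecs = [[1.0, 0.1], [1.0, 0.15], [1.0, -0.5]]
query_vec = [1.0, 0.0]
print(mmr(query_vec, doc_vecs, k=2))
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to a plain relevance ranking, which here would pick the two near-duplicates instead.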


## Step 5: Output Answer with Retrieved Text and LLM Augmented Generation

Augmentation

In [None]:
from langchain_core.prompts import ChatPromptTemplate

TEMPLATE = """\
You are a medical assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(TEMPLATE)
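
To see the text the model will receive, the template can be filled by hand. In this sketch plain `str.format` stands in for `ChatPromptTemplate`, `SimpleNamespace` objects stand in for retrieved documents, and `format_docs` is a hypothetical helper that joins their text.

```python
from types import SimpleNamespace

TEMPLATE = """\
You are a medical assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""


def format_docs(docs):
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


# Stand-ins for retrieved Document objects
docs = [
    SimpleNamespace(page_content="Acupressure is a form of touch therapy."),
    SimpleNamespace(page_content="It uses the same points as acupuncture."),
]
prompt_text = TEMPLATE.format(question="What are Acupressure?", context=format_docs(docs))
print(prompt_text)
```

In an LCEL chain, a helper like this could be piped after the retriever (`retriever | format_docs`) so that the prompt receives plain text rather than the string form of a list of `Document` objects.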

Generation

Finally, we create a RAG chain. To do so, we use LCEL (LangChain Expression Language), which composes runnables into a chain with the `|` operator.
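
In LCEL, steps are chained with the `|` operator so that each step's output becomes the next step's input. The toy class below imitates that idea in plain Python. It is a conceptual sketch, not LangChain's implementation, and every name in it (`Step`, `gather`, `fake_llm`, `parse`) is hypothetical.

```python
class Step:
    """A tiny stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b builds a new Step that runs a, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stages mirroring retrieve -> prompt -> model -> parse
gather = Step(lambda q: {"question": q, "context": "Acupressure uses finger pressure."})
prompt = Step(lambda d: f"Query: {d['question']}\nContext: {d['context']}")
fake_llm = Step(lambda text: {"content": text.upper()})  # pretends to be a model
parse = Step(lambda message: message["content"])

chain = gather | prompt | fake_llm | parse
print(chain.invoke("What are Acupressure?"))
```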



In [None]:
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

# Pass the question through unchanged and fetch the context with the retriever
setup_and_retrieval = RunnableParallel({"question": RunnablePassthrough(), "context": retriever})
output_parser = StrOutputParser()

# question -> {question, context} -> prompt -> LLM -> plain string
retrieval_chain = setup_and_retrieval | rag_prompt | llm | output_parser

retrieval_chain.invoke("What are Acupressure?")


"Acupressure is a form of touch therapy that utilizes the principles of acupuncture and Chinese medicine. Instead of using needles as in acupuncture, acupressure involves stimulating the same points on the body with finger pressure. This technique aims to activate specific pressure points or acupoints on the body's chi meridians to relieve symptoms, increase energy, reduce stress, and treat various health conditions. Acupressure can be performed by professionals or learned for self-treatment and is considered effective for a range of issues, including headaches, general aches and pains, colds, and more."

In [None]:
retrieval_chain.invoke( "What are the COVID?")


"I don't know. The provided context does not contain any information about COVID."

In [None]:
print(retrieval_chain.invoke( "What are the different between Acupressure and Acupuncture?"))


The main differences between acupressure and acupuncture are:

1. **Method of Stimulation**:
   - **Acupressure**: Involves stimulating specific points on the body using finger pressure.
   - **Acupuncture**: Involves inserting fine needles into specific points on the body.

2. **Invasiveness**:
   - **Acupressure**: Non-invasive as it does not involve breaking the skin.
   - **Acupuncture**: Invasive as it involves puncturing the skin with needles.

Both practices are based on similar principles of Chinese medicine, targeting the same points on the body to relieve various symptoms and conditions by manipulating the flow of chi (energy). However, acupressure can be performed by a layperson and is relatively easier to learn, while acupuncture generally requires a trained professional.
