# Building a RAG application using LangChain, Timescale, and Claude 3 Opus

## Data Preprocessing and Loading
For the knowledge base, I used the MedQuad dataset from [Kaggle](https://www.kaggle.com/datasets/jpmiller/layoutlm?select=medquad.csv), which has 16,412 entries about different human diseases. I extracted the first 250 entries and kept only the `answer` column, leaving out the questions to reduce the number of chunks generated and the repetitiveness among them. Here is a snapshot of the MedQuad dataset:

In [33]:
import pandas as pd

medquad_dataset = pd.read_csv("medquad.csv")
medquad_dataset.head()

| Unnamed: 0 | question | answer | source | focus_area |
|---|---|---|---|---|
| 0 | What is (are) Glaucoma ? | Glaucoma is a group of diseases that can damag... | NIHSeniorHealth | Glaucoma |
| 1 | What causes Glaucoma ? | Nearly 2.7 million people have glaucoma, a lea... | NIHSeniorHealth | Glaucoma |
| 2 | What are the symptoms of Glaucoma ? | Symptoms of Glaucoma Glaucoma can develop in ... | NIHSeniorHealth | Glaucoma |
| 3 | What are the treatments for Glaucoma ? | Although open-angle glaucoma cannot be cured, ... | NIHSeniorHealth | Glaucoma |
| 4 | What is (are) Glaucoma ? | Glaucoma is a group of diseases that can damag... | NIHSeniorHealth | Glaucoma |


In [34]:
medquad_subset = medquad_dataset['answer'].head(250)
medquad_subset.to_csv('medquad_250.csv', sep=',', index=False, encoding='utf-8')
medquad_subset.head()

0    Glaucoma is a group of diseases that can damag...
1    Nearly 2.7 million people have glaucoma, a lea...
2    Symptoms of Glaucoma  Glaucoma can develop in ...
3    Although open-angle glaucoma cannot be cured, ...
4    Glaucoma is a group of diseases that can damag...
Name: answer, dtype: object

## Document Chunking with LangChain

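In this step, I used `RecursiveCharacterTextSplitter`, which splits text on a hierarchy of separators (paragraphs first, then lines, then words) so that each chunk stays under the size limit while related text is kept together where possible. The resulting chunks are small enough to pass to the embedding model, and a small overlap between neighboring chunks preserves context across chunk boundaries.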

In [None]:
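# Install necessary packages (including the LangChain integration packages imported later)
%pip install langchain langchain-community langchain-openai langchain-anthropic openai tiktoken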

In [41]:
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load each CSV row as one document
loader = CSVLoader(file_path='medquad_250.csv', encoding='utf-8', csv_args={'delimiter': ','})
docs = loader.load()

# Split the documents into chunks of up to 1500 characters with a 100-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)
chunks = splitter.create_documents([datum.page_content for datum in docs])
chunks[:5]

[Document(page_content="answer: Glaucoma is a group of diseases that can damage the eye's optic nerve and result in vision loss and blindness. While glaucoma can strike anyone, the risk is much greater for people over 60. How Glaucoma Develops  There are several different types of glaucoma. Most of these involve the drainage system within the eye. At the front of the eye there is a small space called the anterior chamber. A clear fluid flows through this chamber and bathes and nourishes the nearby tissues. (Watch the video to learn more about glaucoma. To enlarge the video, click the brackets in the lower right-hand corner. To reduce the video, press the Escape (Esc) button on your keyboard.) In glaucoma, for still unknown reasons, the fluid drains too slowly out of the eye. As the fluid builds up, the pressure inside the eye rises. Unless this pressure is controlled, it may cause damage to the optic nerve and other parts of the eye and result in loss of vision. Open-angle Glaucoma The
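To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a deliberate simplification: `split_with_overlap` is a hypothetical helper that cuts at exact character positions, whereas `RecursiveCharacterTextSplitter` prefers to cut at separators such as paragraph and line breaks.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; neighbors share chunk_overlap characters.

    Hypothetical helper for illustration only, not the LangChain implementation.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the same idea as the 100-character overlap used above, only at a smaller scale.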

## Timescale Database Setup & Embedding Model 

In [None]:
%pip install psycopg2 pgvector timescale_vector

After creating a Timescale account, I created a service, which, as mentioned above, is free for the first 90 days. Once the service is created and deployed, Timescale provides the code below, which makes it easy to connect to the database even if you don’t have any prior SQL experience.

### Timescale Database Connection 

In [36]:
import psycopg2

CONNECTION = "DATABASE_URL"  # placeholder: replace with your service's connection string
conn = psycopg2.connect(CONNECTION)
cursor = conn.cursor()
# use the cursor to interact with your database
cursor.execute("SELECT 'hello world'")
print(cursor.fetchone())

('hello world',)
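Since the connection string contains credentials, one option is to read it from an environment variable instead of writing it into the notebook. The sketch below assumes a variable named `TIMESCALE_SERVICE_URL`; both that name and the `get_service_url` helper are hypothetical and not part of Timescale's generated snippet.

```python
import os


def get_service_url(environ=os.environ, env_var: str = "TIMESCALE_SERVICE_URL") -> str:
    """Return the connection string stored under env_var, or fail with a clear message.

    The variable name is an assumption made for this example.
    """
    url = environ.get(env_var)
    if not url:
        raise RuntimeError(f"Set {env_var} to your Timescale connection string")
    return url


# Demo with an in-memory mapping so the example runs without a real service
demo_env = {"TIMESCALE_SERVICE_URL": "postgres://user:password@host:5432/dbname"}
print(get_service_url(demo_env))
```

In the notebook, `CONNECTION = get_service_url()` would then pick up the value from the real environment.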


### Embedding Model Setup

I used OpenAI's `text-embedding-3-large` as the embedding model. Because the model is already integrated with LangChain, the API key is all I needed to access it.

You can generate an OpenAI API key by following this [tutorial](https://platform.openai.com/docs/quickstart).

In [37]:
from langchain_openai import OpenAIEmbeddings
embeddings_model = OpenAIEmbeddings(
    model="text-embedding-3-large",
    api_key="your-openai-api-key",
)

### Generating Embeddings

LangChain and Timescale’s compatibility shines in this part of the code. Using one function, [TimescaleVector.from_documents](https://python.langchain.com/v0.2/docs/integrations/vectorstores/timescalevector/), I was able to generate embeddings for all the text chunks and ingest them directly into my Timescale database instance under the table name `human_disorder_embeddings`. This seamless integration is possible because of pgvector’s ability to store and query large numbers of vector embeddings.

In [None]:
from langchain_community.vectorstores.timescalevector import TimescaleVector

# Create a Timescale Vector instance from the collection of documents
COLLECTION_NAME = "human_disorder_embeddings"
db = TimescaleVector.from_documents(
    embedding=embeddings_model,
    documents=chunks,
    collection_name=COLLECTION_NAME,
    service_url=CONNECTION,
)

### Retrieval Setup
In this step, I created a retriever from the database object that stores the embeddings. The `search_type` argument sets the kind of search the retriever performs, which here is a similarity search. You can also set the number of relevant documents retrieved for each query; the default is 4.

In [38]:
retriever = db.as_retriever(search_type="similarity")

In [40]:
docs_related_to_anxiety_disorder = retriever.invoke("elaborate on the causes of anxiety disorders")
docs_related_to_anxiety_disorder

[Document(page_content="answer: Anxiety disorders sometimes run in families, but no one knows for sure why some people have them while others don't. Anxiety disorders are more common among younger adults than older adults, and they typically start in early life. However, anyone can develop an anxiety disorder at any time. Below are risk factors for these anxiety disorders. - Generalized Anxiety Disorder (GAD)  -  Social Anxiety Disorder (Social Phobia)  - Panic Disorder Generalized Anxiety Disorder (GAD) Social Anxiety Disorder (Social Phobia) Panic Disorder Generalized Anxiety Disorder - Risk Factors Generalized anxiety disorder (GAD) affects about 6.8 million American adults, including twice as many women as men. The disorder develops gradually and can begin at any point in the life cycle, although the years of highest risk are between childhood and middle age. The average age of onset is 31 years old. Social Phobia - Risk Factors Social phobia affects about 15 million American adult
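To illustrate what a similarity search does conceptually, here is a toy top-k ranking in plain Python. It assumes cosine similarity as the metric, which is a common choice but is not confirmed here as what the vector store uses; the three-dimensional vectors and the `top_k` helper are made up for illustration, since real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the names of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy vectors standing in for real embeddings
toy_vectors = {
    "anxiety causes": [0.9, 0.1, 0.0],
    "anxiety symptoms": [0.8, 0.3, 0.1],
    "glaucoma treatment": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]
best = top_k(query_vector, toy_vectors, k=2)
print(best)  # ['anxiety causes', 'anxiety symptoms']
```

With the LangChain retriever, the number of results is typically set through `search_kwargs`, for example `db.as_retriever(search_type="similarity", search_kwargs={"k": 2})`.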

## Prompt Creation

LangChain lets the developer define a standardized prompt for the LLM using `ChatPromptTemplate`, which combines the LLM’s instructions, the retrieved contextual information, and the user’s query. Building the prompt this way ensures that the LLM always receives a contextualized prompt.

In [None]:
from langchain.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)

# Define the bot's personality and capabilities
bot_instructions = """
You are a friendly bot capable of providing concise, accurate information about human diseases and health conditions.
Answer questions about descriptions, symptoms, causes, and treatments. Keep your responses brief and simple. 
"""

# Template for user input and context
human_message = """
This is the relevant information: {context}
This is the user input: {query}
"""

prompt = ChatPromptTemplate(
    messages=[
        SystemMessagePromptTemplate.from_template(bot_instructions),
        HumanMessagePromptTemplate.from_template(human_message), 
    ],
    input_variables=['context','query'], 
)
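As a rough picture of what the template does at run time, the sketch below fills the same `{context}` and `{query}` placeholders using Python's built-in `str.format`. This is only an approximation of the substitution step, not LangChain's internals, and `fill_prompt` is a hypothetical helper.

```python
human_message = """
This is the relevant information: {context}
This is the user input: {query}
"""


def fill_prompt(context: str, query: str) -> str:
    """Substitute the retrieved context and the user query into the template."""
    return human_message.format(context=context, query=query).strip()


filled = fill_prompt(
    context="Anxiety disorders sometimes run in families.",
    query="What causes anxiety disorders?",
)
print(filled)
```

The filled text is what the LLM receives as the human message, alongside the system instructions.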

## LLM Integration

In this last step, I used Anthropic’s Claude 3 Opus as the LLM for the generation part of the RAG application. It is one of the most capable models in the Claude series and is known as a robust model for reasoning over text and problem-solving.

I also chained together all the components defined earlier, following the RAG workflow discussed above: the retriever fetches contextual documents for the query, the prompt template combines them with the query, and the resulting prompt is passed to Claude 3 Opus for response generation.

In [42]:
from langchain_anthropic import ChatAnthropic
from langchain_core.runnables import RunnablePassthrough

model = ChatAnthropic(
    model='claude-3-opus-20240229',  # Specify which Anthropic model to use
    api_key="your-anthropic-api-key",
)
# Define the RAG (Retrieval-Augmented Generation) chain
rag = (
    {"context": retriever, "query": RunnablePassthrough()}
    | prompt  # Apply the prompt template
    | model  # Send to the LLM for generation
)
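In the chain above, the retriever's output goes into `{context}` as a list of documents. A possible refinement is to join the documents' text into one string first. The sketch below shows such a helper in plain Python; `FakeDocument` is a hypothetical stand-in for LangChain's `Document` so the example runs on its own, and `format_docs` is a name chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document (only page_content is modeled)."""
    page_content: str


def format_docs(docs) -> str:
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [
    FakeDocument("answer: Anxiety disorders sometimes run in families."),
    FakeDocument("answer: Panic disorder often begins in early adulthood."),
]
context_text = format_docs(retrieved)
print(context_text)
```

In the real chain, a helper like this is commonly piped after the retriever, for example `{"context": retriever | format_docs, "query": RunnablePassthrough()}`, so that the prompt receives plain text rather than the printed form of a list of `Document` objects.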

In [43]:
claude_output = rag.invoke("Tell me about anxiety disorders specifically about who are affected the most and the leading causes.")
print(claude_output.content)

Based on the provided information, here are the key points about who is most affected by anxiety disorders and their leading causes:

Who is most affected:
- Anxiety disorders are more common in younger adults than older adults.
- Women are twice as likely as men to have Generalized Anxiety Disorder (GAD) and Panic Disorder.
- Social Anxiety Disorder (Social Phobia) affects women and men equally.
- Anxiety disorders affect up to 15% of older adults in a given year.

Leading causes and risk factors:
- The exact causes are unknown, but anxiety disorders sometimes run in families, suggesting a genetic component.
- GAD typically develops gradually between childhood and middle age, with the average onset at 31 years old.
- Social Phobia usually begins in childhood or early adolescence.
- Panic Disorder often begins in late adolescence or early adulthood.
- The tendency to develop panic attacks appears to be inherited.
