<a href="https://colab.research.google.com/github/moienQ/RAG-and-AI-agents-for-healthcare/blob/main/RAG_and_AI_agents_for_Healthcare.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

       


# **RAG and AI agents for healthcare**





This project uses LangChain to build a Retrieval-Augmented Generation (RAG) pipeline for healthcare question answering. Rather than relying only on what a language model memorized during training, the system retrieves relevant passages from a collection of medical documents (here, PDFs on lung cancer and mesothelioma) and passes them to the model as context, so that answers to complex medical queries are grounded in the source material. The notebook walks through loading and merging the documents, chunking, embedding and storing them in a vector database, retrieving passages for a query, generating answers with an LLM, and benchmarking both the retriever and the generator with cosine similarity. The same retrieval pipeline can serve as the knowledge component of AI agents that support patient care and clinical decision-making.


In [None]:
# Install packages (each package is listed once)
!pip install langchain langchain-community
!pip install PyPDF2 pypdf chromadb text-generation sentence-transformers


In [None]:

# Imports
import os

import chromadb  # Vector database for storing embeddings in a collection
from chromadb.utils import embedding_functions  # Provides different embedding functions
from PyPDF2 import PdfReader  # For reading the content of PDFs
from text_generation import InferenceAPIClient, Client
from langchain import hub
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
# Text splitters for creating chunks
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
# Hugging Face / sentence-transformers embeddings
from langchain_community.embeddings import HuggingFaceEmbeddings, SentenceTransformerEmbeddings
from langchain_community.llms.huggingface_endpoint import HuggingFaceEndpoint
from langchain_community.vectorstores import Chroma



## Connecting Colab with Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive




In [None]:
# Ask the user to paste the path to the uploaded folder
# (the argument to input() is the prompt text, not a default value)
folder_path = input("Enter the path to the uploaded folder: ").strip()

# Validate the path
if not os.path.exists(folder_path):
  raise ValueError("Invalid folder path. Please check and try again.")

# List the contents of the folder using the provided path
folder_contents = os.listdir(folder_path)
print(folder_contents)

# Proceed with further processing using the files in folder_path

/content/healthcare_RAG_data/content/healthcare_RAG_data
['Cancer - 2014 - Byers - Small cell lung cancer  Where do we go from here (1).pdf', 'Lung Cancer_ Types, Stages, Symptoms, Diagnosis & Treatment.pdf', 'What Is Lung Cancer_ _ Types of Lung Cancer _ American Cancer Society.pdf', 'Worldwide Overview of the Current Status.pdf', 'CA A Cancer J Clinicians - 2019 - Carbone - Mesothelioma  Scientific clues for prevention  diagnosis  and therapy.pdf', 'lung-cancer-where-are-we-today.pdf', 'lung cancer research paper (1).pdf', 'Lung cancer - Symptoms and causes - Mayo Clinic.pdf', 'Basic Information About Lung Cancer _ CDC.pdf', 'The pathogenesis of mesothelioma - ScienceDirect (1).pdf']


The code below merges all PDF files in a folder into a single PDF. It finds every PDF in the folder, reads each one page by page, and appends the pages to a new PDF file.

In [None]:
import os
import PyPDF2

# User-defined function to merge PDFs in a folder
def merge_pdfs(folder_path, output_file_name="MergedFiles1.pdf"):

  # Get a sorted list of all PDF files in the folder, skipping a previously
  # merged output file so that re-running the cell does not merge it into itself
  pdf_files = sorted(
      os.path.join(folder_path, filename)
      for filename in os.listdir(folder_path)
      if filename.endswith(".pdf") and filename != output_file_name
  )

  if not pdf_files:
    print("No PDF files found in the specified folder.")
    return

  # Create a new PdfWriter object to store the merged PDF
  pdf_writer = PyPDF2.PdfWriter()

  # Loop through each PDF file and append its pages to the writer
  for pdf_file in pdf_files:
    with open(pdf_file, 'rb') as file:
      pdf_reader = PyPDF2.PdfReader(file)
      for page in pdf_reader.pages:
        pdf_writer.add_page(page)

  # Write the merged PDF to a new file
  with open(os.path.join(folder_path, output_file_name), 'wb') as output_file:
    pdf_writer.write(output_file)

  print(f"Merged PDF files successfully and saved as '{output_file_name}' in the folder.")

# Replace this with the actual path to your uploaded folder
folder_path = "/content/healthcare_RAG_data"

# Call the merge_pdfs function
merge_pdfs(folder_path)


Merged PDF files successfully and saved as 'MergedFiles1.pdf' in the folder.


In [None]:
from google.colab import files
files.download('/content/merged_file_new.csv')


In [None]:
loader = PyPDFLoader("/content/healthcare_RAG_data/MergedFiles1.pdf")
text = loader.load_and_split()

## Chunking the Data

In [None]:
def get_chunk(text):

    text_splitter = RecursiveCharacterTextSplitter(
        # separator="\n",  # Adjust separator if needed
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_documents(text)
    return chunks

In [None]:
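from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the loaded pages into overlapping chunks for embedding
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
# `text` holds the documents returned by loader.load_and_split() above
text_chunks = text_splitter.split_documents(text)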
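
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified, library-free sketch that cuts a string into fixed-size character windows, each sharing its last 200 characters with the start of the next one. This is a simplification under stated assumptions: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks, so its chunks will generally differ. The helper name `sliding_window_chunks` and the sample string are hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into character windows of `chunk_size` that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Hypothetical 2,500-character sample document
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
```

A larger overlap makes it less likely that a sentence relevant to a query is split across two chunks, at the cost of storing more redundant text in the vector database.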


# Creating Embeddings and Storing in VectorDB
1. Using all-MiniLM-L6-v2 embeddings from HuggingFace

2. Storing embeddings in Chroma DB

In [None]:
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma.from_documents(text_chunks,embeddings,persist_directory="chroma_persist")


In [None]:
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 4})
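
As a rough illustration of what a similarity retriever with `k=4` does, the sketch below ranks a few toy vectors by cosine similarity to a query vector and returns the top `k`. The document names, the three-dimensional vectors, and the `top_k` helper are made up for this example. The real retriever delegates the search to Chroma, whose distance metric depends on how the collection is configured, so cosine is used here only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return the ids of the k vectors most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy "embeddings"; real ones have hundreds of dimensions
docs = {
    "screening": [1.0, 0.0, 0.0],
    "treatment": [0.0, 1.0, 0.0],
    "elderly_treatment": [0.0, 0.9, 0.4],
    "symptoms": [0.0, 0.0, 1.0],
    "staging": [0.7, 0.7, 0.0],
}
query = [0.0, 1.0, 0.3]

print(top_k(query, docs, k=4))
```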

Example: using the retriever to fetch the documents most relevant to a user query

In [None]:
retrieved_docs1 = retriever.invoke("What are the considerations for treating elderly patient")

In [None]:
retrieved_docs1

[Document(metadata={'row': 16, 'source': '/content/merged_file_new.csv'}, page_content='Question: What are the considerations for treating elderly patients with SCLC?\nAnswer: Elderly patients with SCLC may present unique challenges in treatment due to factors such as comorbidities and reduced tolerance to aggressive therapies. However, studies suggest that elderly patients with good performance status and normal organ function can tolerate standard chemotherapy and radiation therapy regimens similar to younger patients.'),
 Document(metadata={'row': 16, 'source': '/content/merged_file_new.csv'}, page_content="One study found that a reasonably high initial dose of chemotherapy is important for elderly patients with SCLC. Another study suggested that a hyperfractionated radiation regimen may be more effective for elderly patients with SCLC, but there is still debate over the optimal dose.\n\nIt's also worth noting that elderly patients with good performance status and normal organ funct

In [1]:
!pip install python-dotenv



In [2]:
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
import os

In [4]:
api_key = os.getenv("HUGGINGFACEHUB_API_TOKEN")

# Confirm that the token was loaded without printing the secret itself
print(f"API key loaded: {api_key is not None}")


Using Mistral-7B-Instruct as the generator LLM

In [None]:
repo_id = "mistralai/Mistral-7B-Instruct-v0.1"

# Pass the token read from the environment above (stored in `api_key`)
llm = HuggingFaceEndpoint(repo_id=repo_id, huggingfacehub_api_token=api_key, temperature=0.05)




Other LLMs, such as Falcon or other Mistral-7B variants, can also be explored.

In [None]:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


`format_docs` joins the retrieved passages into a single context string. Together with the retriever, a prompt, and the large language model, it forms a pipeline that answers medical questions in a user-friendly way.

The pipeline retrieves relevant information, formats it, generates an answer, and outputs it as a string.
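
The four stages described above can be sketched without any library, as below. This is only a sketch of the data flow under stated assumptions: `Doc`, `stub_retriever`, `stub_llm`, `PROMPT_TEMPLATE`, and `answer_question` are hypothetical stand-ins, not the notebook's retriever, prompt, or model. In LangChain, the same stages are typically composed with the `|` operator, using the `RunnablePassthrough` and `StrOutputParser` classes imported above.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document (hypothetical)."""
    page_content: str


def format_docs(docs):
    # Same formatting as in the notebook: passages separated by blank lines
    return "\n\n".join(doc.page_content for doc in docs)


# Hypothetical prompt wording, chosen for illustration only
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def stub_retriever(question):
    # A real retriever would run a similarity search over the vector store
    return [Doc("passage one"), Doc("passage two")]


seen_prompts = []  # records what the stub model was asked


def stub_llm(prompt):
    # A real LLM call would go here; the stub returns a fixed string
    seen_prompts.append(prompt)
    return "  stub answer  "


def answer_question(question, retriever, llm):
    docs = retriever(question)                    # 1. retrieve
    context = format_docs(docs)                   # 2. format
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return str(llm(prompt)).strip()               # 3. generate, 4. output as string


result = answer_question("Who should be screened for lung cancer?", stub_retriever, stub_llm)
print(result)
```

Keeping the retriever and the LLM as parameters makes it easy to swap in the real `retriever` and `llm` objects, or to test the formatting and prompt logic offline with stubs.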

# Benchmarking RAG System
Using Cosine Similarity
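
For reference, the standard definition of cosine similarity between two $d$-dimensional embedding vectors is given below. The symbols are chosen for illustration and do not appear in the notebook: $\mathbf{a}$ stands for the embedding of the ground-truth answer and $\mathbf{b}$ for the embedding of the retriever or generator answer.

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_{i=1}^{d} a_i b_i}{\sqrt{\sum_{i=1}^{d} a_i^2} \, \sqrt{\sum_{i=1}^{d} b_i^2}}
$$

A value of 1 means the two vectors point in the same direction, and 0 means they are orthogonal.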

In [None]:
import pandas as pd
df=pd.read_csv("/content/merged_file_new.csv")


The dataset contains, for each query, the question, the ground-truth answer, the retriever answer, and the generated answer.

In [None]:
df.head()

| Unnamed: 0 | Question | Answer | Retriever answer | Generated answer |
|---|---|---|---|---|
| 0 | What potential drug targets are currently bein... | Novel drug targets under investigation in clin... | Document(page_content='(and rebiopsies at the ... | Several potential drug targets are currently ... |
| 1 | What recent advances in SCLC research have con... | Several recent advances in SCLC research, incl... | [Document(page_content='Current Barriers and C... | Several recent advances in SCLC research have... |
| 2 | Does a chest X-ray show lung cancer? | X-rays aren’t as good as CT scans for showin... | [Document(page_content='been updated to assess... | \n\nA chest X-ray can detect lung cancer, but ... |
| 3 | Who Should Be Screened for Lung Cancer? | Lung cancer screening is recommended only for ... | [Document(page_content='2/25/24, 5:54 PM Basic... | Lung cancer screening is recommended only for... |
| 4 | Explain Lung cancer screening | You can increase your chances of catching canc... | [Document(page_content='Better understanding o... | Lung cancer screening refers to the process o... |


In [None]:
df.isnull().sum()


| Unnamed: 0 | 0 |
|---|---|
| Question | 0 |
| Answer | 0 |
| Retriever answer | 0 |
| Generated answer | 0 |


## Pre-processing

In [None]:
import pandas as pd


# List to store modified retriever answers
modified_retriever_answers = []

# Iterate through all the entries in the 'Retriever answer' column
for index, row in df.iterrows():
    retriever_answer = row['Retriever answer']

    # Remove the "[Document(page_content=" wrapper from the retriever answer
    # (entries that lack the opening "[" are left unchanged by this step)
    retriever_answer = retriever_answer.replace("[Document(page_content=", "")

    # Remove newline characters from the retriever answer
    retriever_answer = retriever_answer.replace("\n", "")

    # Append the modified retriever answer to the list
    modified_retriever_answers.append(retriever_answer)

# Add the modified retriever answers to the DataFrame
df['Modified Retriever Answer'] = modified_retriever_answers

# Print the DataFrame with modified retriever answers
print(df)

                                             Question  \
0   What potential drug targets are currently bein...   
1   What recent advances in SCLC research have con...   
2                Does a chest X-ray show lung cancer?   
3             Who Should Be Screened for Lung Cancer?   
4                       Explain Lung cancer screening   
5           How Is Lung Cancer Diagnosed and Treated?   
6   What are the challenges associated with the ea...   
7               What do relative survival rates mean?   
8   What role does positron emission tomography (P...   
9    What are the challenges in improving survival...   
10  What is the current state of lung cancer incid...   
11  What are some challenges associated with the i...   
12       what are Signs and symptoms of lung cancer ?   
13                 PET (Positron Emission Tomography)   
14        Explain Genetic alterations in mesothelioma   
15  Transbronchial Lymph Node Sampling and Endobro...   
16  What are the considerations

In [None]:

import pandas as pd
import re

# Function to remove metadata part from the modified retriever answer
def remove_metadata(retriever_answer):
    # Use regular expression to match the metadata part
    return re.sub(r', metadata=.*\)', '', retriever_answer)

# Apply the remove_metadata function to each entry in the 'Modified Retriever Answer' column
df['Modified Retriever Answer'] = df['Modified Retriever Answer'].apply(remove_metadata)

# Print the DataFrame with modified retriever answers
print(df)


                                             Question  \
0   What potential drug targets are currently bein...   
1   What recent advances in SCLC research have con...   
2                Does a chest X-ray show lung cancer?   
3             Who Should Be Screened for Lung Cancer?   
4                       Explain Lung cancer screening   
5           How Is Lung Cancer Diagnosed and Treated?   
6   What are the challenges associated with the ea...   
7               What do relative survival rates mean?   
8   What role does positron emission tomography (P...   
9    What are the challenges in improving survival...   
10  What is the current state of lung cancer incid...   
11  What are some challenges associated with the i...   
12       what are Signs and symptoms of lung cancer ?   
13                 PET (Positron Emission Tomography)   
14        Explain Genetic alterations in mesothelioma   
15  Transbronchial Lymph Node Sampling and Endobro...   
16  What are the considerations
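
To show what the `, metadata=.*\)` pattern does, the sketch below applies the same `re.sub` call to two made-up strings shaped like the pre-processed retriever answers. The sample strings are hypothetical. Because `.*` is greedy, the match runs from the first `, metadata=` to the last closing parenthesis, so when an entry holds several documents only the first passage survives.

```python
import re


def remove_metadata(retriever_answer):
    # Same pattern as in the notebook: drop everything from the first
    # ", metadata=" up to the last ")" in the string
    return re.sub(r', metadata=.*\)', '', retriever_answer)


# Hypothetical entries, shaped like the pre-processed retriever answers
single = "'only passage', metadata={'page': 3})]"
multiple = (
    "'first passage', metadata={'page': 1}), "
    "Document(page_content='second passage', metadata={'page': 2})]"
)

print(remove_metadata(single))
print(remove_metadata(multiple))
```

If every retrieved passage should be kept for the benchmark, a non-greedy pattern or a proper parser would be needed instead; that is a design choice outside what the notebook does.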

In [None]:

df.head(5)

| Unnamed: 0 | Question | Answer | Retriever answer | Generated answer | Modified Retriever Answer |
|---|---|---|---|---|---|
| 0 | What potential drug targets are currently bein... | Novel drug targets under investigation in clin... | Document(page_content='(and rebiopsies at the ... | Several potential drug targets are currently ... | Document(page_content='(and rebiopsies at the ... |
| 1 | What recent advances in SCLC research have con... | Several recent advances in SCLC research, incl... | [Document(page_content='Current Barriers and C... | Several recent advances in SCLC research have... | 'Current Barriers and Challenges in Translatio... |
| 2 | Does a chest X-ray show lung cancer? | X-rays aren’t as good as CT scans for showin... | [Document(page_content='been updated to assess... | \n\nA chest X-ray can detect lung cancer, but ... | 'been updated to assess the incidence of lung ... |
| 3 | Who Should Be Screened for Lung Cancer? | Lung cancer screening is recommended only for ... | [Document(page_content='2/25/24, 5:54 PM Basic... | Lung cancer screening is recommended only for... | '2/25/24, 5:54 PM Basic Information About Lung... |
| 4 | Explain Lung cancer screening | You can increase your chances of catching canc... | [Document(page_content='Better understanding o... | Lung cancer screening refers to the process o... | 'Better understanding of genetic predispositio... |


In [None]:
# Remove the 'Retriever answer' column from the DataFrame
df = df.drop(columns=['Retriever answer'])

# Print the DataFrame after removing the 'Retriever answer' column
df.head(5)

| Unnamed: 0 | Question | Answer | Generated answer | Modified Retriever Answer |
|---|---|---|---|---|
| 0 | What potential drug targets are currently bein... | Novel drug targets under investigation in clin... | Several potential drug targets are currently ... | Document(page_content='(and rebiopsies at the ... |
| 1 | What recent advances in SCLC research have con... | Several recent advances in SCLC research, incl... | Several recent advances in SCLC research have... | 'Current Barriers and Challenges in Translatio... |
| 2 | Does a chest X-ray show lung cancer? | X-rays aren’t as good as CT scans for showin... | \n\nA chest X-ray can detect lung cancer, but ... | 'been updated to assess the incidence of lung ... |
| 3 | Who Should Be Screened for Lung Cancer? | Lung cancer screening is recommended only for ... | Lung cancer screening is recommended only for... | '2/25/24, 5:54 PM Basic Information About Lung... |
| 4 | Explain Lung cancer screening | You can increase your chances of catching canc... | Lung cancer screening refers to the process o... | 'Better understanding of genetic predispositio... |



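## Generator Benchmarking

Embeddings are computed with BERT (`bert-base-uncased`) by mean-pooling the last hidden state, and each generated answer is compared with its ground-truth answer.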

In [None]:
import pandas as pd
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import torch


# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to get BERT embeddings for a given text
def get_bert_embedding(text):
    tokens = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**tokens)
    embeddings = outputs['last_hidden_state']
    # Take the mean along the sequence length dimension to get a single vector
    embeddings = torch.mean(embeddings, dim=1)
    return embeddings.numpy()


# Function to calculate cosine similarity between two vectors
def cosine_similarity_score(vector1, vector2):
    return cosine_similarity(vector1.reshape(1, -1), vector2.reshape(1, -1))[0][0]

# List to store cosine similarity values
cosine_similarities = []

# Calculate cosine similarity for each pair of entries in the DataFrame
for index, row in df.iterrows():
    answer = row['Answer']
    generator_answer = row['Generated answer']

    # Get BERT embeddings for answer and generator_answer
    answer_embedding = get_bert_embedding(answer)
    generator_answer_embedding = get_bert_embedding(generator_answer)

    # Calculate cosine similarity
    similarity = cosine_similarity_score(answer_embedding, generator_answer_embedding)
    cosine_similarities.append(similarity)

# Print the list of cosine similarity values
print("Cosine Similarities:", cosine_similarities)



Cosine Similarities: [0.89000005, 0.90225536, 0.870137, 0.8971709, 0.81383157, 0.9525387, 0.9581766, 0.92783666, 0.90392053, 0.9113158, 0.9377673, 0.94270825, 0.9080872, 0.9335466, 0.9512731, 0.93486977, 0.9172662, 0.9728339, 0.9385454, 0.9753946, 0.92300117, 0.9514225, 0.9331232, 0.9261077, 0.85460573, 0.9071871, 0.9620663, 0.8966063, 0.9150687, 0.8196472, 0.96358895, 0.9097562, 0.9071934, 0.8516846, 0.9444642]
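
The step in `get_bert_embedding` that averages over the sequence dimension can be illustrated with plain lists. The sketch below is a simplified stand-in, not the notebook's code: `mean_pool` and the two-dimensional token vectors are hypothetical, and the optional `attention_mask` argument shows how padding tokens could be excluded if several texts were ever embedded in one batch (the notebook embeds one text at a time, so no padding is involved there).

```python
def mean_pool(token_vectors, attention_mask=None):
    """Average per-token vectors into one vector, skipping masked-out tokens."""
    if attention_mask is None:
        attention_mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no tokens left to pool")
    dim = len(kept[0])
    # Mean of each dimension across the kept tokens
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Hypothetical hidden states: 3 tokens, 2 dimensions each
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

pooled = mean_pool(tokens)             # all three tokens contribute
masked = mean_pool(tokens, [1, 1, 0])  # third token treated as padding
print(pooled, masked)
```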


In [None]:

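# Average cosine similarity between ground-truth and generated answers
# (uses `total` rather than `sum` to avoid shadowing the built-in)
total = 0
for score in cosine_similarities:
  total = total + score
avg = total / len(cosine_similarities)
avg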

0.9172856773648943

## Retriever Benchmarking

In [None]:
import pandas as pd
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to get BERT embeddings for a given text
def get_bert_embedding(text):
    tokens = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**tokens)
    embeddings = outputs['last_hidden_state']
    # Take the mean along the sequence length dimension to get a single vector
    embeddings = torch.mean(embeddings, dim=1)
    return embeddings.numpy()

# Function to calculate cosine similarity between two vectors
def cosine_similarity_score(vector1, vector2):
    return cosine_similarity(vector1.reshape(1, -1), vector2.reshape(1, -1))[0][0]



# List to store cosine similarity values
cosine_similarities1 = []

# Calculate cosine similarity for each pair of entries in the DataFrame
for index, row in df.iterrows():
    answer = row['Answer']
    retriever_answer = row['Modified Retriever Answer']

    # Get BERT embeddings for answer and retriever_answer
    answer_embedding = get_bert_embedding(answer)
    retriever_answer_embedding = get_bert_embedding(retriever_answer)

    # Calculate cosine similarity
    similarity = cosine_similarity_score(answer_embedding,retriever_answer_embedding)
    cosine_similarities1.append(similarity)

# Print the list of cosine similarity values
print("Cosine Similarities:", cosine_similarities1)


Cosine Similarities: [0.7601465, 0.75728923, 0.8269283, 0.765139, 0.7468462, 0.79164267, 0.7298228, 0.59746623, 0.79644644, 0.7478242, 0.7600353, 0.8007215, 0.7870269, 0.8406664, 0.86307013, 0.8252269, 0.823068, 0.7612905, 0.93521696, 0.92186165, 0.8792753, 0.9051857, 0.78233635, 0.8015131, 0.87585384, 0.79394984, 0.8996691, 0.72691417, 0.8681529, 0.7622239, 0.81189245, 0.6391486, 0.8197434, 0.82287145, 0.7696135]


In [None]:
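# Average cosine similarity between ground-truth and retrieved answers
# (uses `total` rather than `sum` to avoid shadowing the built-in)
total = 0
for score in cosine_similarities1:
  total = total + score
avg1 = total / len(cosine_similarities1)
avg1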

0.7998879841395787