# Medical Intelligence Agent
This notebook builds a simple agent that answers queries about tuberculosis: symptoms, treatment, prevention, and recent research and development in vaccines. The agent retrieves its data from trusted sources such as the WHO and PubMed.

## Loading Environment variables

In [40]:
import os
from dotenv import load_dotenv
load_dotenv()

os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY")
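
`os.getenv` returns `None` for a variable that is not set, and `os.environ` only accepts string values, so a missing key surfaces as a `TypeError` rather than a helpful message. A minimal sketch of a fail-fast check, assuming a hypothetical helper named `require_env` (not part of `python-dotenv`) and a placeholder variable so the example runs without a real key:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell")
    return value


# Placeholder variable (an assumption for this demo, not a real credential).
os.environ["EXAMPLE_API_KEY"] = "placeholder-value"
print(require_env("EXAMPLE_API_KEY"))
```

In the notebook, the same check could be applied to `GROQ_API_KEY` right after `load_dotenv()`.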


## Part 1 - Document Loading

In [41]:
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://www.who.int/news-room/questions-and-answers/item/tuberculosis")
docs1 = loader.load()

from langchain_community.document_loaders import PubMedLoader
loader = PubMedLoader("Tuberculosis", load_max_docs=3)
docs2 = loader.load()

# future work - check out pubmed tools, retriever, PubMedAPIWrapper

In [42]:
docs1

[Document(metadata={'source': 'https://www.who.int/news-room/questions-and-answers/item/tuberculosis', 'title': '\r\n\tTuberculosis\r\n', 'language': 'en'}, page_content="      \r\n\tTuberculosis\r\n                     \n   Skip to main content       \n\n\n \n\n\n\n\n\n\n\nGlobal\n\n\nRegions\n\n\n\n\n\n\n\nWHO Regional websites\n\n\n\n\n\n\n\nAfrica\n\n\n\n\n\nAmericas\n\n\n\n\n\nSouth-East Asia\n\n\n\n\n\nEurope\n\n\n\n\n\nEastern Mediterranean\n\n\n\n\n\nWestern Pacific\n\n\n\n\n\n\n\n\n\n   \n\n\n\n\n\n\n\n\n\n\n\n\n\nWhen autocomplete results are available use up and down arrows to review and enter to select.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\r\n        Select language\r\n    \n\nSelect language\nEnglish\nالعربية\n中文\nFrançais\nРусский\nEspañol\nPortuguês\n\n\n\n\n        \n            \n\n\n\n\n\n\n\n\n\n\n       \n\n\n\n\n\n\n\n\n\n\n\nHome\n\n\n\n\n\n\n\n\n\n\n\n\n\nHealth Topics\n\n\n\n\n\n\n\n\nAll topicsABCDEFGHIJKLMNOPQRSTUVWXYZ\n\n\n\n\n\n\n\nResources\n\n\nFact sheets\n\n\

In [43]:
docs2

[Document(metadata={'uid': '40985018', 'Title': 'Community-driven strategies and policies for drug-resistant tuberculosis control in Banyumas Regency, Indonesia: A comprehensive 2023 analysis.', 'Published': '2025-09-20', 'Copyright Information': '© The Author(s) 2025.'}, page_content='BACKGROUND: Central Java, Indonesia, struggles with low drug-resistant tuberculosis (DR-TB) case detection (33%) and treatment rates (25%), far below the 60% target. Despite policies, including Minister of Health Regulation No. 67/2016 and Presidential Regulation No. 67/2021, along with the National TB Strategy for Tuberculosis Control 2020-2024 have been implemented, targets remain unmet due to weak community involvement. This study analyzed TB policy implementation in high-prevalence Banyumas Regency, focusing on cadres and community organizations.\nDESIGN AND METHODS: This qualitative study employed the Van Meter and Van Horn framework to assess policy implementation. Key stakeholders involved informa

In [44]:
print(type(docs1))  
print(len(docs1))
print(docs1[0].page_content[:1000])

<class 'list'>
1
      
	Tuberculosis
                     
   Skip to main content       


 







Global


Regions







WHO Regional websites







Africa





Americas





South-East Asia





Europe





Eastern Mediterranean





Western Pacific









   













When autocomplete results are available use up and down arrows to review and enter to select.
















        Select language
    

Select language
English
العربية
中文
Français
Русский
Español
Português




        
            










       











Home













Health Topics








All topicsABCDEFGHIJKLMNOPQRSTUVWXYZ







Resources


Fact sheets


Facts in pictures


Multimedia


Podcasts


Publications


Questions and answers


Tools and toolkits










Popular


Dengue


Endometriosis


Excessive heat


Herpes


Mental disorders


Mpox

















Countries








All countriesABCDEFGHIJKLMNOPQRSTUVWXYZ







Regions


Africa


Americas


Europe


Eastern Mediterranean




In [46]:
import re

raw_text = docs1[0].page_content

# Collapse multiple newlines and strip leading/trailing whitespace
clean_text = re.sub(r'\n+', '\n', raw_text).strip()

# Optionally remove extra spaces too
docs1[0].page_content = re.sub(r'[ \t]+', ' ', clean_text)

print(docs1[0].page_content[:1000])

Tuberculosis
 
 Skip to main content 
 
Global
Regions
WHO Regional websites
Africa
Americas
South-East Asia
Europe
Eastern Mediterranean
Western Pacific
 
When autocomplete results are available use up and down arrows to review and enter to select.

 Select language
 
Select language
English
العربية
中文
Français
Русский
Español
Português
 
 
 
Home
Health Topics
All topicsABCDEFGHIJKLMNOPQRSTUVWXYZ
Resources
Fact sheets
Facts in pictures
Multimedia
Podcasts
Publications
Questions and answers
Tools and toolkits
Popular
Dengue
Endometriosis
Excessive heat
Herpes
Mental disorders
Mpox
Countries
All countriesABCDEFGHIJKLMNOPQRSTUVWXYZ
Regions
Africa
Americas
Europe
Eastern Mediterranean
South-East Asia
Western Pacific
WHO in countries
Data by country
Country presence 
Country cooperation strategies 
Country office profiles
Strengthening country offices 
Newsroom
All news
News releases
Statements
Campaigns
Events
Feature stories
Press conferences
Speeches
Commentaries
Photo library
Head
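
The output above still contains lines that hold only a space, because `\n+` merges newlines only when they are directly adjacent and the page separates some of them with spaces. One possible refinement, shown as a sketch, is to strip every line and drop the ones that end up empty. `normalize_whitespace` is a hypothetical helper name, and the sample string is made up for illustration:

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and drop lines that are empty after stripping."""
    cleaned_lines = []
    for line in text.splitlines():
        # Collapse internal runs of spaces and tabs, then trim both ends.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


# Invented sample that imitates the whitespace pattern of the scraped page.
sample = "  \r\n\tTuberculosis\r\n   \n   Skip to main content   \n\n \n\nGlobal\n"
print(normalize_whitespace(sample))
```

This removes whitespace only; the navigation text itself (menus, language selector) would still need separate filtering.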


In [38]:
print(type(docs2))             
print(len(docs2))              
print(type(docs2[2]))          
print(docs2[2].page_content) 

<class 'list'>
3
<class 'langchain_core.documents.base.Document'>
PURPOSE: Pulmonary or pleural tuberculosis and malignancy are the most common causes of lymphocyte-rich pleural effusion. In resource-limited settings, establishing a definitive diagnosis for patients with pleural effusion (PEs) may be challenging, emphasizing the need for a simple diagnostic test. This study aimed to investigate the diagnostic value of quantitative C-reactive protein (qCRP) for differentiating between tuberculous pleural effusion (TPE) and malignant pleural effusion (MPE).
METHODS: A cross-sectional study was performed on 81 patients admitted for lymphocyte-rich exudative pleural effusion at a tertiary care center. Comprehensive biochemical analyses, including qCRP levels in the pleural fluid and pathological examinations of blood, pleural fluid, and/or tissue, were performed. The t-test was used for continuous variables, and the chi-square test was used for categorical variables. To assess the independ

## Part 2 - Text Splitting

In [47]:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs1_chunks = text_splitter.split_documents(docs1)
docs2_chunks = text_splitter.split_documents(docs2)
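
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line boundaries); `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share text:

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows whose neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.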

## Part 3 - Embedding

In [48]:
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")



## Part 4 - Storing the documents in the vector store

In [56]:
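from langchain_community.vectorstores import FAISS

# Note: this indexes the unsplit documents (docs1, docs2), so the chunks from Part 2 are not used.
# Pass docs1_chunks and docs2_chunks instead to index at chunk level.
vectorstore = FAISS.from_documents(docs1, embeddings)
vectorstore.add_documents(docs2)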

['9324b6f3-4fc2-4c55-baf3-c24a9fc4369e',
 'b1bf0f69-cac1-4ac3-bb8a-039df48aa85b',
 'fe77d455-66dd-46e4-aa87-7c8213dbb3f9']

## Retrieval
Finds the most relevant documents based on their similarity to the user query.

In [62]:
# Create the retriever object
retriever = vectorstore.as_retriever()

# Retrieve documents for a sample query
query = "What is Tuberculosis?"
retrieved_docs = retriever.invoke(query)
print(f"Number of related documents: {len(retrieved_docs)}")

Number of related documents: 4
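
Conceptually, the retriever embeds the query and returns the stored vectors closest to it; the four results above match the default `k=4` of `as_retriever()`. The sketch below ranks toy vectors by cosine similarity in plain Python. The vectors and labels are invented, and LangChain's FAISS wrapper ranks by L2 distance by default, so treat this as an illustration of nearest-neighbour search rather than a reproduction of FAISS:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the names of the k vectors most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Invented 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_vectors = {
    "tb_overview": [0.9, 0.1, 0.0],
    "tb_policy_study": [0.4, 0.8, 0.1],
    "unrelated_topic": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.0, 0.0], toy_vectors, k=2))
```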


## Generation
Uses retrieved context to generate accurate, contextual responses with an LLM.

In [87]:
from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

#initialize LLM
llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)

#prompt template
prompt = ChatPromptTemplate.from_template(
    """
    You are a medical intelligence assistant. Answer the following question, based on the context provided.
    If you don't know the answer, say that you don't know.
    Context: {context}
    Question: {input}
    """
)

#adding string output parser to the chain
output_parser = StrOutputParser()

#creating retrieval chains
document_chain = create_stuff_documents_chain(llm, prompt) | output_parser
retrieval_chain = create_retrieval_chain(retriever, document_chain)
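
A "stuff" documents chain places the text of all retrieved documents into the `{context}` slot of the prompt and makes a single LLM call. A rough plain-Python sketch of that idea follows. `stuff_context` and `build_prompt` are hypothetical names, the blank-line separator is an assumption, the snippets are invented, and the real chain also handles document formatting and chat-message construction:

```python
PROMPT_TEMPLATE = (
    "You are a medical intelligence assistant. Answer the following question, "
    "based on the context provided.\n"
    "If you don't know the answer, say that you don't know.\n"
    "Context: {context}\n"
    "Question: {input}\n"
)


def stuff_context(page_contents: list[str], separator: str = "\n\n") -> str:
    """Join the text of all retrieved documents into one context string."""
    return separator.join(page_contents)


def build_prompt(question: str, page_contents: list[str]) -> str:
    """Fill the template with the stuffed context and the user question."""
    return PROMPT_TEMPLATE.format(context=stuff_context(page_contents), input=question)


# Invented snippets standing in for retrieved page_content values.
snippets = ["TB is caused by bacteria.", "TB most often affects the lungs."]
print(build_prompt("What is Tuberculosis?", snippets))
```

Because everything is sent in one prompt, the combined length of the retrieved documents has to fit in the model's context window, which is one reason to index smaller chunks.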


In [88]:
#generate response
query = "What is Tuberculosis?"
response = retrieval_chain.invoke({"input": query})
response['answer']

"Tuberculosis (TB) is caused by bacteria (Mycobacterium tuberculosis) that most often affect the lungs. About one-quarter of the world's population has been infected with TB, most of whom are not ill and cannot transmit TB."

In [91]:
response2 = retrieval_chain.invoke({"input": "What are the symptoms of Tuberculosis?"})
print(response2['answer'])

The symptoms of Tuberculosis (TB) disease include:

1. A persistent cough, sometimes with mucus or even blood.
2. Chest pain.
3. Weakness or fatigue.
4. Unintentional weight loss.
5. Fever.
6. Night sweats.

These symptoms can be mild at first and develop slowly. If you experience these signs, seeking medical attention is crucial for diagnosis and treatment of TB.


In [92]:
response3 = retrieval_chain.invoke({"input": "What are the treatment options for Tuberculosis?"})
print(response3['answer'])

According to the World Health Organization (WHO), Tuberculosis (TB) is mostly curable with a course of 4 antimicrobials lasting 6 months, or even 4 months in some cases. Information, supervision, and patient support by a health worker or trained volunteer are key to a successful outcome. Additionally, TB preventive treatment (TPT) can be given to avoid progression from TB infection to disease in people at risk.


In [95]:
response4 = retrieval_chain.invoke({"input": "what is the recent research on Tuberculosis?"})
print(response4['answer'])

Recent research on Tuberculosis (TB) includes several studies and findings. 

1. A study in Central Java, Indonesia, analyzed TB policy implementation in high-prevalence Banyumas Regency, focusing on cadres and community organizations. The study found that policies lack sufficient local budget allocation, and there is a weak motivation of TB cadres, lack of commitment among regional organizations, and persistent stigma in the community.

2. The World Health Organization (WHO) reports that TB is the world's leading cause of death from a single infectious agent, with 1.25 million people dying from TB in 2023. The organization also notes that TB is largely preventable, and TB preventive treatment (TPT) can be given to avoid progression from TB infection to disease in people at risk.

3. A cross-sectional study investigated the diagnostic value of quantitative C-reactive protein (qCRP) for differentiating between tuberculous pleural effusion (TPE) and malignant pleural effusion (MPE). The 

In [97]:
response5 = retrieval_chain.invoke({"input": "What are the side effects of Tuberculosis treatment?"})
print(response5['answer'])

I don't know the specific side effects of Tuberculosis treatment from the provided context. The context primarily discusses the diagnosis, symptoms, and prevention of Tuberculosis, as well as some general information about its treatment, but it does not provide a detailed list of side effects associated with TB treatment. For accurate and comprehensive information on the side effects of TB treatment, I recommend consulting a medical professional or a reliable medical resource.
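
When the model reports that the context lacks an answer, as above, it can help to check which documents were actually retrieved. Besides `answer`, the dictionary returned by `create_retrieval_chain` carries the retrieved documents under `context`. The sketch below uses a stand-in `FakeDocument` dataclass and a hand-built response so that it runs without an API key; `list_sources` is a hypothetical helper, and the metadata keys mirror the `source` and `Title` fields seen in the loader outputs earlier:

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def list_sources(response: dict) -> list[str]:
    """Collect one label per retrieved document, without duplicates."""
    sources = []
    for doc in response.get("context", []):
        label = doc.metadata.get("source") or doc.metadata.get("Title") or "unknown"
        if label not in sources:
            sources.append(label)
    return sources


# Hand-built response with placeholder content (not real model output).
fake_response = {
    "input": "What are the side effects of Tuberculosis treatment?",
    "context": [
        FakeDocument("placeholder text", {"source": "https://www.who.int/news-room/questions-and-answers/item/tuberculosis"}),
        FakeDocument("placeholder text", {"Title": "Example PubMed title"}),
        FakeDocument("placeholder text", {}),
    ],
    "answer": "placeholder answer",
}
print(list_sources(fake_response))
```

With a real response, `list_sources(response5)` would show whether any retrieved document was likely to cover treatment side effects at all.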


In [98]:
response6 = retrieval_chain.invoke({"input": "How is Tuberculosis diagnosed?"})
print(response6['answer'])

The provided context does not explicitly explain how Tuberculosis (TB) is diagnosed. However, it does mention that "Comprehensive biochemical analyses, including qCRP levels in the pleural fluid and pathological examinations of blood, pleural fluid, and/or tissue, were performed" in a study related to differentiating between tuberculous pleural effusion and malignant pleural effusion. 

For a definitive answer on how TB is diagnosed, I would need more specific information. Generally, TB diagnosis involves a combination of clinical evaluation, imaging studies (like chest X-rays), sputum tests (to detect the bacteria), and sometimes biopsies or other invasive procedures to confirm the presence of the disease. However, without more specific details in the context provided, I cannot give a precise answer on the diagnostic methods for TB.


In [None]:
response7 = retrieval_chain.invoke({"input": "What is the recent development in Tuberculosis vaccines?"})
print(response7['answer'])


I don't know the recent developments in Tuberculosis vaccines. The provided context and text do not mention any recent developments in Tuberculosis vaccines. They discuss the current state of TB, its symptoms, treatment, and prevention, as well as some studies related to TB diagnosis, but do not provide information on recent vaccine developments.


In [None]:
#to be added - diagnosis, recent development in TB Vaccines, side effects of treatment

In [None]:
disease = input("Enter the disease you want to research: ")
custom_query = f"What is {disease}?"

# Load PubMed abstracts for the disease the user entered
custom_data_load = PubMedLoader(disease, load_max_docs=3)
custom_doc = custom_data_load.load()

# Split, embed, and index the loaded abstracts
custom_doc_chunks = text_splitter.split_documents(custom_doc)
custom_vectorstore = FAISS.from_documents(custom_doc_chunks, embeddings)
custom_retriever = custom_vectorstore.as_retriever()

# Retrieve documents for the custom query
retrieved_docs_custom = custom_retriever.invoke(custom_query)
print(f"Number of related documents: {len(retrieved_docs_custom)}")