<a href="https://colab.research.google.com/github/heerthiraja/Generative-AI/blob/main/BioMistral_ChatBot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build BioMistral Medical RAG Chatbot using BioMistral Open Source LLM
## In the notebook we will build a Medical Chatbot with BioMistral LLM and Heart Health pdf file.
## Load the google drive

In [1]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Installation


In [2]:
#!pip install langchain sentence-transformers chromadb llama-cpp-python langchain_community pypdf

Collecting opentelemetry-proto==1.28.2 (from opentelemetry-exporter-otlp-proto-grpc>=1.2.0->chromadb)
  Using cached opentelemetry_proto-1.28.2-py3-none-any.whl.metadata (2.3 kB)
Collecting protobuf (from onnxruntime>=1.14.1->chromadb)
  Using cached protobuf-5.29.0-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)
Using cached opentelemetry_proto-1.28.2-py3-none-any.whl (55 kB)
Using cached protobuf-5.29.0-cp38-abi3-manylinux2014_x86_64.whl (319 kB)
Installing collected packages: protobuf, opentelemetry-proto
  Attempting uninstall: protobuf
    Found existing installation: protobuf 4.23.4
    Uninstalling protobuf-4.23.4:
      Successfully uninstalled protobuf-4.23.4
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.17.1 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3, but you have protobuf 5.29

## Importing libraries


In [11]:
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma
from langchain_community.llms import LlamaCpp
from langchain.chains import RetrievalQA, LLMChain

## Import the document


In [12]:
loader = PyPDFDirectoryLoader("/content/drive/MyDrive/BioMistral/Data")
docs = loader.load()

In [13]:
len(docs)  # number of pages

95

In [18]:
docs[6]

Document(metadata={'source': '/content/drive/MyDrive/BioMistral/Data/healthyheart.pdf', 'page': 6}, page_content='2\nThese facts may seem frightening, but they need not be. The good\nnews is that you have a lot of power to protect and improve your\nheart health. This guidebook will help you find out your own risk\nof heart disease and take steps to prevent it.\n“But,” you may still be thinking, “I take pretty good care of myself.\nI’m unlikely to get heart disease.” Yet a recent national survey shows\nthat only 3 percent of U.S. adults practice all of the “Big Four”\nhabits that help to prevent heart disease: eating a healthy diet, \ngetting regular physical activity, maintaining a healthy weight, and\navoiding smoking. Many young people are also vulnerable. A\nrecent study showed that about two-thirds of teenagers already have\nat least one risk factor for heart disease.\nEvery risk factor counts. Research shows that each individual risk\nfactor greatly increases the chances of develo

## Chunking


In [15]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = text_splitter.split_documents(docs)

In [16]:
len(chunks)

585

In [20]:
chunks[400]

Document(metadata={'source': '/content/drive/MyDrive/BioMistral/Data/healthyheart.pdf', 'page': 64}, page_content='you should eat, depending on how many calories you take in\neach day. If you have high blood cholesterol or heart disease,\nthe amount of saturated fat will be different. (See “Give Your\nHeart a Little TLC,” on page 55.) Check the Nutrition Facts')

## Embeddings creations


In [21]:
import os
os.environ['HUGGINGFACEHUB_API_TOKEN'] = "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"

In [22]:
embeddings = SentenceTransformerEmbeddings(model_name="NeuML/pubmedbert-base-embeddings")

  embeddings = SentenceTransformerEmbeddings(model_name="NeuML/pubmedbert-base-embeddings")
  from tqdm.autonotebook import tqdm, trange
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/123 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/6.12k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/667 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.30k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/706k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/74.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

## Vector Store creation


In [24]:
vectorstore = Chroma.from_documents(chunks, embeddings)

In [25]:
query = "Who is at risk of heart disease?"

search_results = vectorstore.similarity_search(query)

In [26]:
search_results

[Document(metadata={'page': 8, 'source': '/content/drive/MyDrive/BioMistral/Data/healthyheart.pdf'}, page_content='4\nWho Is at Risk?\nRisk factors are conditions or habits that make a person more likely\nto develop a disease. They can also increase the chances that an\nexisting disease will get worse. Important risk factors for heart dis-\nease that you can do something about are cigarette smoking, high'),
 Document(metadata={'page': 8, 'source': '/content/drive/MyDrive/BioMistral/Data/healthyheart.pdf'}, page_content='heart disease risk increases enormously. The message is clear: You\nneed to take heart disease risk seriously, and the best time to reduce\nthat risk is now.\nYour Guide to a Healthy Heart'),
 Document(metadata={'page': 6, 'source': '/content/drive/MyDrive/BioMistral/Data/healthyheart.pdf'}, page_content='at least one risk factor for heart disease.\nEvery risk factor counts. Research shows that each individual risk\nfactor greatly increases the chances of developing hea

In [27]:
retriever = vectorstore.as_retriever(search_kwargs={'k':5}) #k is knearestneighbour

In [28]:
retriever.get_relevant_documents(query)

  retriever.get_relevant_documents(query)


[Document(metadata={'page': 8, 'source': '/content/drive/MyDrive/BioMistral/Data/healthyheart.pdf'}, page_content='4\nWho Is at Risk?\nRisk factors are conditions or habits that make a person more likely\nto develop a disease. They can also increase the chances that an\nexisting disease will get worse. Important risk factors for heart dis-\nease that you can do something about are cigarette smoking, high'),
 Document(metadata={'page': 8, 'source': '/content/drive/MyDrive/BioMistral/Data/healthyheart.pdf'}, page_content='heart disease risk increases enormously. The message is clear: You\nneed to take heart disease risk seriously, and the best time to reduce\nthat risk is now.\nYour Guide to a Healthy Heart'),
 Document(metadata={'page': 6, 'source': '/content/drive/MyDrive/BioMistral/Data/healthyheart.pdf'}, page_content='at least one risk factor for heart disease.\nEvery risk factor counts. Research shows that each individual risk\nfactor greatly increases the chances of developing hea

## LLM Model loading

In [35]:
llm = LlamaCpp(
    model_path="/content/drive/MyDrive/BioMistral/BioMistral-7B.Q4_K_M.gguf",
    temperature=0.2,
    max_tokens = 2048,
    top_p=1
)

llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /content/drive/MyDrive/BioMistral/BioMistral-7B.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = hub
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.att

## Use LLM and retriver and query, to generate final response


In [37]:
template = """
<|context|>
You are an Medical Assistant that follows the instructions and generate the accurate response based on the query and the context provided.
Please be truthful and give direct answers.

<|user|>
{query}

<|assistant|>
"""

In [38]:
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from langchain.prompts import ChatPromptTemplate

In [39]:
prompt = ChatPromptTemplate.from_template(template)

In [40]:
rag_chain = (
    {"context": retriever, "query": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [41]:
response=rag_chain.invoke(query)

llama_perf_context_print:        load time =   28424.57 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    69 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /    68 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   70962.76 ms /   137 tokens


In [42]:
response

'The risk of heart disease is influenced by a variety of factors, including age, family history, lifestyle habits such as smoking and physical activity, and medical conditions such as high blood pressure or diabetes. It is generally recommended to follow a healthy diet, exercise regularly, maintain a healthy weight, and avoid smoking to reduce the risk of heart disease.'

In [43]:
import sys

while True:
  user_input = input(f"Input query: ")
  if user_input == 'exit':
    print("Exiting...")
    sys.exit()
  if user_input=="":
    continue
  result = rag_chain.invoke(user_input)
  print("Answer: ", result)

Input query: What are the diseases that affect heart health?


Llama.generate: 52 prefix-match hit, remaining 18 prompt tokens to eval
llama_perf_context_print:        load time =   28424.57 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    18 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /    35 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   30386.68 ms /    53 tokens


Answer:  The diseases that affect heart health are: high blood pressure, coronary artery disease, congestive heart failure, arrhythmia, and cardiomyopathy.
Input query: what are the preventive measures


Llama.generate: 52 prefix-match hit, remaining 15 prompt tokens to eval
llama_perf_context_print:        load time =   28424.57 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    15 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   221 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  146264.70 ms /   236 tokens


Answer:  The preventive measures include: 1. Washing your hands frequently with soap and water for at least 20 seconds especially after being in a public place or after blowing your nose, coughing or sneezing. If soap and water are not available, use an alcohol-based hand sanitizer containing at least 60% alcohol. 2. Avoid touching your eyes, nose and mouth with unwashed hands. 3. Avoid close contact with anyone showing symptoms of respiratory illness such as coughing and sneezing. 4. Stay home when you are sick. 5. Cover coughs and sneezes with your elbow or a tissue, then throw the tissue in the trash. 6. Clean and disinfect frequently touched objects and surfaces daily. Use EPA-registered household disinfectants. Follow the instructions carefully on the label. 7. Wear a mask or cloth face coverings (not goggles) when in public. 8. Stay informed about the latest guidance from CDC and other public health officials.
Input query: How High blood Cholesterol affect heart health?


Llama.generate: 52 prefix-match hit, remaining 20 prompt tokens to eval
llama_perf_context_print:        load time =   28424.57 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    20 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /    55 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   43003.19 ms /    75 tokens


Answer:  High blood cholesterol is a risk factor for heart disease. It can cause the blood vessels to become narrow and hard, which increases the pressure on the heart and reduces the amount of blood and oxygen that reaches it. This can lead to heart failure or an attack.
Input query: exit
Exiting...


SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


# Thank You!