<a href="https://colab.research.google.com/github/leighvdveen/chatsnip/blob/main/MH_Wells_BioMistral7B_augmented.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build BioMistral RAG Agent I/F for Wells PoC with BioMistral Open Source LLM

Wellness Agent with BioMistral LLM and additional PDF sources.

## Installation

In [None]:
!pip install langchain sentence-transformers chromadb llama-cpp-python langchain_community pypdf

Collecting langchain
  Downloading langchain-0.2.16-py3-none-any.whl.metadata (7.1 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Collecting chromadb
  Downloading chromadb-0.5.5-py3-none-any.whl.metadata (6.8 kB)
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.90.tar.gz (63.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m63.8/63.8 MB[0m [31m11.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting langchain_community
  Downloading langchain_community-0.2.16-py3-none-any.whl.metadata (2.7 kB)
Collecting pypdf
  Downloading pypdf-4.3.1-py3-none-any.whl.metadata (7.4 kB)
Collecting langchain-core<0.3.0,>=0.2.38 (from langchain)
  Downloading langchain_core-0.2.38-py3-n

## Import libraries

In [None]:
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import CharacterTextSplitter,RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS, Chroma
from langchain_community.llms import LlamaCpp
from langchain.chains import RetrievalQA, LLMChain

In [None]:
import pathlib
import textwrap
from IPython.display import display
from IPython.display import Markdown



def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [None]:
# Used to securely store your API key
from google.colab import userdata

## Setup HuggingFace Access Token

- Log in to [HuggingFace.co](https://huggingface.co/)
- Click on your profile icon at the top-right corner, then choose [“Settings.”](https://huggingface.co/settings/)
- In the left sidebar, navigate to [“Access Token”](https://huggingface.co/settings/tokens)
- Generate a new access token, assigning it the “write” role.


In [None]:
# Or use `os.getenv('HUGGINGFACEHUB_API_TOKEN')` to fetch an environment variable.
import os
from getpass import getpass

HUGGINGFACEHUB_API_TOKEN = userdata.get("HUGGINGFACEHUB_API_TOKEN")
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "HUGGINGFACEHUB_API_TOKEN"

## Import document

In [None]:
loader = PyPDFDirectoryLoader("/content/sample_data")
docs = loader.load()

In [None]:
docs

[Document(metadata={'source': '/content/sample_data/outlive.pdf', 'page': 0}, page_content='Chapter\n1-\nThe\nLong\nGame:\nFrom\nFast\nDeath\nto\nSlow\nDeath\nScience\nand\nActionable\nStrategies\nin\nthe\nText:\n●\nLongevity\nvs.\nImmortality:\nThe\ntext\nemphasizes\nthat\nlongevity\ndoesn\'t\nmean\nimmortality.\nEveryone\nwill \neventually\ndie,\nbut\nthe\ngoal\nis\nto\nextend\nthe\nhealthy\nyears\nof\nlife.\n●\nThe\nImportance\nof\nEarly\nIntervention:\nThe\nkey\ntakeaway\nis\nthat\nmodern\nmedicine\noften\nintervenes\ntoo \nlate\nin\nthe\nprogression\nof\nchronic\ndiseases.\nThe\nauthor\nsuggests\nthat\nthe\nbest\napproach\nis\nto\nact\nearly \nto\nprevent\nthese\ndiseases\nrather\nthan\nwaiting\nuntil\nthey\nare\nentrenched. \n●\n●\nUnderstanding\nRoot\nCauses:\nThe\ntext\nmentions\nthat\nmainstream\nmedicine\nsometimes \nmisunderstands\nthe\nroot\ncauses\nof\ndiseases\nlike\nheart\ndisease,\ncancer,\nand\ntype\n2\ndiabetes.\nIt \nsuggests\nthat\naddressing\nthese\nroot\ncauses\ne

## Text Splitting - Chunking

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = text_splitter.split_documents(docs)

In [None]:
len(chunks)

393

In [None]:
chunks[0]

Document(metadata={'source': '/content/sample_data/outlive.pdf', 'page': 0}, page_content="Chapter\n1-\nThe\nLong\nGame:\nFrom\nFast\nDeath\nto\nSlow\nDeath\nScience\nand\nActionable\nStrategies\nin\nthe\nText:\n●\nLongevity\nvs.\nImmortality:\nThe\ntext\nemphasizes\nthat\nlongevity\ndoesn't\nmean\nimmortality.\nEveryone\nwill \neventually\ndie,\nbut\nthe\ngoal\nis\nto\nextend\nthe\nhealthy\nyears\nof\nlife.\n●\nThe\nImportance\nof")

In [None]:
chunks[1]

Document(metadata={'source': '/content/sample_data/outlive.pdf', 'page': 0}, page_content='the\nhealthy\nyears\nof\nlife.\n●\nThe\nImportance\nof\nEarly\nIntervention:\nThe\nkey\ntakeaway\nis\nthat\nmodern\nmedicine\noften\nintervenes\ntoo \nlate\nin\nthe\nprogression\nof\nchronic\ndiseases.\nThe\nauthor\nsuggests\nthat\nthe\nbest\napproach\nis\nto\nact\nearly \nto\nprevent\nthese\ndiseases\nrather\nthan\nwaiting\nuntil\nthey\nare')

In [None]:
chunks[2]

Document(metadata={'source': '/content/sample_data/outlive.pdf', 'page': 0}, page_content='these\ndiseases\nrather\nthan\nwaiting\nuntil\nthey\nare\nentrenched. \n●\n●\nUnderstanding\nRoot\nCauses:\nThe\ntext\nmentions\nthat\nmainstream\nmedicine\nsometimes \nmisunderstands\nthe\nroot\ncauses\nof\ndiseases\nlike\nheart\ndisease,\ncancer,\nand\ntype\n2\ndiabetes.\nIt \nsuggests\nthat\naddressing\nthese\nroot\ncauses\nearly\nis')

In [None]:
chunks[3]

Document(metadata={'source': '/content/sample_data/outlive.pdf', 'page': 0}, page_content='that\naddressing\nthese\nroot\ncauses\nearly\nis\ncrucial\nfor\nprevention.\n●\nNutrition\nand\nMetabolism:\nThe\nauthor\ndelves\ninto\nthe\nimportance\nof\nunderstanding\nnutrition\nand \nmetabolism.\nNutrition\nplays\na\nsignificant\nrole\nin\npreventing\nchronic\ndiseases,\nand\ndifferent\nindividuals \nmay\nrequire\ndifferent')

In [None]:
chunks[4]

Document(metadata={'source': '/content/sample_data/outlive.pdf', 'page': 0}, page_content='and\ndifferent\nindividuals \nmay\nrequire\ndifferent\ndietary\npatterns. \n●\n●\nThe\nRole\nof\nProtein:\nThe\ntext\nhighlights\nthe\nsignificance\nof\nprotein,\nparticularly\nas\npeople\nage.\nProtein \nintake\nand\nquality\nare\nessential\nfor\nmaintaining\nhealth.\n●\nExercise\nas\na\nLongevity\nTool:\nExercise\nis\npresented\nas\nthe')

## Embeddings

In [None]:
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")

## Vector Store - FAISS or ChromaDB

In [None]:
vectorstore = Chroma.from_documents(chunks, embeddings)

In [None]:
vectorstore

<langchain_community.vectorstores.chroma.Chroma at 0x7e57127cb8b0>

In [None]:
query = "who is at risk of heart disease"
search = vectorstore.similarity_search(query)

In [None]:
to_markdown(search[0].page_content)

> Chapter
> 7
> -
> The
> Ticker:
> Confronting
> and
> Preventing
> Heart
> Disease,
> the
> Deadliest
> Killer
> on
> the
> Planet
> Here
> the
> risks
> and
> progression
> of
> atherosclerotic
> cardiovascular
> disease
> (ASCVD)
> are
> discussed,
> particularly 
> focusing
> on
> heart
> disease.
> The
> importance
> of
> understanding
> the
> complexities
> of

## Retriever

In [None]:
retriever = vectorstore.as_retriever(
    search_kwargs={'k': 5}
)

In [None]:
retriever.get_relevant_documents(query)

  retriever.get_relevant_documents(query)


[Document(metadata={'page': 10, 'source': '/content/sample_data/outlive.pdf'}, page_content='Chapter\n7\n-\nThe\nTicker:\nConfronting\nand\nPreventing\nHeart\nDisease,\nthe\nDeadliest\nKiller\non\nthe\nPlanet\nHere\nthe\nrisks\nand\nprogression\nof\natherosclerotic\ncardiovascular\ndisease\n(ASCVD)\nare\ndiscussed,\nparticularly \nfocusing\non\nheart\ndisease.\nThe\nimportance\nof\nunderstanding\nthe\ncomplexities\nof'),
 Document(metadata={'page': 11, 'source': '/content/sample_data/outlive.pdf'}, page_content='disease\n(ASCVD). \n●\nLp(a)\nis\nhighlighted\nas\na\nsignificant\nrisk\nfactor,\nespecially\nfor\nindividuals\nwith\na\nfamily\nhistory\nof \npremature\nheart\ndisease.\nDiet\nand\nLifestyle\nModifications:\n●\nLifestyle\nchanges,\nincluding\ndiet\nmodifications,\ncan\nhelp\nreduce\ncardiovascular\nrisk. \n●\nSpecific\ndietary'),
 Document(metadata={'page': 10, 'source': '/content/sample_data/outlive.pdf'}, page_content='Key\npoints\nand\nactionable\nstrategies:\n●\nFamily\nHi

## Large Language Model - Open Source

In [None]:
#connect to google drive
from google.colab import drive

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [43]:
llm = LlamaCpp(
    model_path= "/content/drive/MyDrive/Model&Data/BioMistral-7B.Q4_K_M.gguf",
    temperature=0.3,
    max_tokens=2048,
    top_p=1)

ValidationError: 1 validation error for LlamaCpp
__root__
  Could not load Llama model from path: /content/drive/MyDrive/Model&Data/BioMistral-7B.Q4_K_M.gguf. Received error Model path does not exist: /content/drive/MyDrive/Model&Data/BioMistral-7B.Q4_K_M.gguf (type=value_error)

## RAG Chain

In [None]:
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from langchain.prompts import ChatPromptTemplate

In [None]:
template = """
<|context|>
You are an AI assistant that follows instruction extremely well.
Please be truthful and give direct answers
</s>
<|user|>
{query}
</s>
 <|assistant|>
"""

In [None]:
prompt = ChatPromptTemplate.from_template(template)

In [None]:
rag_chain = (
    {"context": retriever,  "query": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
response = rag_chain.invoke("what disease affect the heart?")

Llama.generate: prefix-match hit

llama_print_timings:        load time =    4075.93 ms
llama_print_timings:      sample time =      79.20 ms /   103 runs   (    0.77 ms per token,  1300.52 tokens per second)
llama_print_timings: prompt eval time =    9099.07 ms /    16 tokens (  568.69 ms per token,     1.76 tokens per second)
llama_print_timings:        eval time =   87709.65 ms /   102 runs   (  859.90 ms per token,     1.16 tokens per second)
llama_print_timings:       total time =   97416.49 ms /   118 tokens


In [None]:
to_markdown(response)

> The heart is affected by many diseases, some of which include coronary artery disease, cardiomyopathy, endocarditis, myocarditis, arrhythmia, congestive heart failure, atherosclerosis, hypertrophic cardiomyopathy, valvular heart disease, arrhythmogenic right ventricular cardiomyopathy, dilated cardiomyopathy, hypertension, and heart valve stenosis.

In [None]:
import sys

while True:
  user_input = input(f"Input Prompt: ")
  if user_input == 'exit':
    print('Exiting')
    sys.exit()
  if user_input == '':
    continue
  result = rag_chain.invoke(user_input)
  print("Answer: ",result)

Llama.generate: prefix-match hit

llama_print_timings:        load time =    4075.93 ms
llama_print_timings:      sample time =     251.28 ms /   368 runs   (    0.68 ms per token,  1464.47 tokens per second)
llama_print_timings: prompt eval time =    7480.05 ms /    14 tokens (  534.29 ms per token,     1.87 tokens per second)
llama_print_timings:        eval time =  300823.51 ms /   367 runs   (  819.68 ms per token,     1.22 tokens per second)
llama_print_timings:       total time =  310512.93 ms /   381 tokens


Answer:   Heart diseases refer to a group of conditions that involve the heart and blood vessels, including coronary artery disease, heart failure, arrhythmias, and congenital heart defects. Coronary artery disease is the most common type of heart disease and involves the buildup of plaque in the arteries that supply blood to the heart, which can lead to reduced blood flow and oxygen supply to the heart muscle. This can cause chest pain, shortness of breath, and other symptoms, and can increase the risk of heart attack and stroke. Heart failure is another common type of heart disease, which occurs when the heart muscle becomes weakened or damaged, causing it to pump blood less efficiently than it should. This can lead to symptoms such as fatigue, shortness of breath, swelling in the legs, and weight gain, and can increase the risk of premature death. Arrhythmias are another type of heart disease, which occur when there are abnormalities in the rate or rhythm of the heartbeat. This can 