<a href="https://colab.research.google.com/github/kswanjara/ayurveda_llm_rag/blob/main/Ayurveda_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install requests pymongo together langchain_community pypdf langchain langchain-together
!apt-get -qq install poppler-utils tesseract-ocr
# Upgrade Pillow to the latest version
%pip install -q --user --upgrade pillow
# Install Python packages
!pip install unstructured[pdf]
!pip install --quiet langchain_experimental langchain_openai
!pip install unstructured==0.7.12



In [4]:
import os
import together
import pymongo
from pymongo.server_api import ServerApi
from google.colab import userdata

TOGETHER_API_KEY = userdata.get('TOGETHER_API_KEY')
together.api_key = TOGETHER_API_KEY
os.environ['TOGETHER_API_KEY'] = TOGETHER_API_KEY
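# Read the MongoDB connection string from Colab secrets instead of hard-coding credentials.
# 'MONGODB_URI' is an assumed secret name; store your own Atlas connection string under it.
URI = userdata.get('MONGODB_URI')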

# Tutorial

In [None]:
from typing import List

def generate_embeddings(input_texts: List[str], model_api_string: str) -> List[List[float]]:
    """Generate embeddings from Together python library.

    Args:
        input_texts: a list of string input texts.
        model_api_string: str. An API string for a specific embedding model of your choice.

    Returns:
        embeddings_list: a list of embeddings. Each element corresponds to one input text.
    """
    together_client = together.Together()
    outputs = together_client.embeddings.create(
        input=input_texts,
        model=model_api_string,
    )
    return [x.embedding for x in outputs.data]


In [None]:
embedding_model_string = 'togethercomputer/m2-bert-80M-8k-retrieval' # model API string from Together.
vector_database_field_name = 'embedding_together_m2-bert-8k-retrieval' # define your embedding field name.
NUM_DOC_LIMIT = 200 # number of documents to process and embed.

sample_output = generate_embeddings(["This is a test."], embedding_model_string)
print(f"Embedding size is: {len(sample_output[0])}")

Embedding size is: 768


In [None]:
from tqdm import tqdm
import time

# Connect here so this cell does not depend on the client created later in the notebook.
client = pymongo.MongoClient(URI, server_api=ServerApi('1'))
db = client.sample_airbnb
collection = db.listingsAndReviews

keys_to_extract = ["name", "summary", "space", "description", "neighborhood_overview", "notes", "transit", "access", "interaction", "house_rules", "property_type", "room_type", "bed_type", "minimum_nights", "maximum_nights", "accommodates", "bedrooms", "beds"]

for doc in tqdm(collection.find({"summary":{"$exists": True}}).limit(NUM_DOC_LIMIT), desc="Documents processing"):
  # Skip documents that already carry an embedding, so no write or wait is wasted on them.
  if vector_database_field_name in doc:
    continue
  # Flatten the selected fields into one "key: value" line per field.
  extracted_str = "\n".join([k + ": " + str(doc[k]) for k in keys_to_extract if k in doc])
  doc[vector_database_field_name] = generate_embeddings([extracted_str], embedding_model_string)[0]
  collection.replace_one({'_id': doc['_id']}, doc)
  time.sleep(1)  # brief pause between embedding requests
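The loop above sends one text per embedding request, although `generate_embeddings` accepts a list. A minimal sketch of grouping texts into batches is shown below. The helper name `batched` and the sample strings are hypothetical, and the batch size the embeddings endpoint tolerates is an assumption to verify against your own account limits.

```python
from typing import Iterator, List


def batched(texts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each holding at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Hypothetical stand-ins for the per-listing strings built in the loop above.
sample_texts = [f"listing {i}" for i in range(5)]
batches = list(batched(sample_texts, batch_size=2))
print(batches)
```

Each batch could then be passed to `generate_embeddings(batch, embedding_model_string)`, which returns one embedding per input text in the same order.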

# Ayurveda

Read data

In [5]:
!wget https://ia902808.us.archive.org/21/items/TheCompleteBookOfAyurvedicHomeRemedies/The%20Complete%20Book%20of%20Ayurvedic%20Home%20Remedies.pdf -P ayurveda_pdfs

--2024-05-02 06:57:05--  https://ia902808.us.archive.org/21/items/TheCompleteBookOfAyurvedicHomeRemedies/The%20Complete%20Book%20of%20Ayurvedic%20Home%20Remedies.pdf
Resolving ia902808.us.archive.org (ia902808.us.archive.org)... 207.241.232.108
Connecting to ia902808.us.archive.org (ia902808.us.archive.org)|207.241.232.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3808289 (3.6M) [application/pdf]
Saving to: ‘ayurveda_pdfs/The Complete Book of Ayurvedic Home Remedies.pdf’


2024-05-02 06:57:06 (6.96 MB/s) - ‘ayurveda_pdfs/The Complete Book of Ayurvedic Home Remedies.pdf’ saved [3808289/3808289]



In [25]:
from langchain_community.document_loaders import PyPDFLoader, UnstructuredFileLoader

loader = UnstructuredFileLoader("ayurveda_pdfs/The Complete Book of Ayurvedic Home Remedies.pdf")
data = loader.load()
#data

In [7]:
# Fixed-size chunking ignores semantic boundaries and may hurt retrieval; semantic chunking is worth trying.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
recursive_split_docs = text_splitter.split_documents(data)

# text_splitter = SemanticChunker(together_embedding)
# semantic_chunked_docs = text_splitter.create_documents([data[0].page_content])
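To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping fixed-size windows over a string. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraph and line breaks; the function name `sliding_window_chunks` is hypothetical and only illustrates how the two parameters interact.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


toy_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(toy_chunks)
```

With `chunk_size=1000` and `chunk_overlap=150`, as in the cell above, each window would advance by 850 characters in this simplified model.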


Debugging chunks

In [9]:
# Write parsed text in Google Drive, for debugging

from google.colab import drive
drive.mount('/content/drive')

file_path = '/content/drive/My Drive/rag_llm_expt/recursive.txt'

with open(file_path, 'w') as f:
  for d in recursive_split_docs:
    f.write(d.page_content)
    f.write("\n\n\n---------\n\n\n")

print("File created successfully at:", file_path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
File created successfully at: /content/drive/My Drive/rag_llm_expt/recursive.txt


Database Connection

In [32]:
from tqdm import tqdm
from langchain_community.vectorstores import MongoDBAtlasVectorSearch
from langchain_experimental.text_splitter import SemanticChunker
from langchain_together.embeddings import TogetherEmbeddings


together_embedding = TogetherEmbeddings(model='togethercomputer/m2-bert-80M-8k-retrieval')
vector_database_field_name = 'embedding_together_m2-bert-8k-retrieval' # define your embedding field name.

# Create a new client and connect to the server
client = pymongo.MongoClient(URI, server_api=ServerApi('1'))

# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

DB_NAME = "ayurveda_db"
COLLECTION_NAME = "vasant_lad_unstructured_recursive"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "together_embedding"

MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]

Pinged your deployment. You successfully connected to MongoDB!


Create Index

In [33]:
# Build the vector store once and reuse it for every insert.
vectorstore = MongoDBAtlasVectorSearch(
    MONGODB_COLLECTION,
    together_embedding,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)
# Insert the documents into MongoDB Atlas along with their embeddings.
for d in tqdm(recursive_split_docs):
  vectorstore.add_documents([d])

100%|██████████| 860/860 [08:39<00:00,  1.66it/s]


In [34]:
# Example query.
# query = "Pitta"
# query_emb = generate_embeddings([query], 'togethercomputer/m2-bert-80M-8k-retrieval')[0]

# results = MONGODB_COLLECTION.aggregate([
#   {
#     "$vectorSearch": {
#       "queryVector": query_emb,
#       "path": 'embedding',
#       "numCandidates": 100, # this should be 10-20x the limit
#       "limit": 10, # the number of documents to return in the results
#       "index": "remedy_embedding", # the index name you used in Step 4.
#     }
#   }
# ])

# results_as_dict = {doc['text']: doc for doc in results}

# print(f"From your query \"{query}\", the following docs were found:\n")
# print("\n".join([str(i+1) + ". " + name for (i, name) in enumerate(results_as_dict.keys())]))

query = "I cannot sleep"
print("Querying from Database %s and Collection %s"%(DB_NAME, COLLECTION_NAME))
vector_search = MongoDBAtlasVectorSearch.from_connection_string(
    URI,
    DB_NAME + "." + COLLECTION_NAME,
    together_embedding,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)
results = vector_search.similarity_search_with_score(
    query=query,
    k=5,
)

# Display results
for result in results:
    print("---")
    print(result[0].page_content)
    print("---")

Querying from Database ayurveda_db and Collection vasant_lad_unstructured_recursive
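Conceptually, a similarity search scores every stored embedding against the query embedding and returns the `k` best matches. The sketch below does this by brute force with cosine similarity on toy three-dimensional vectors. It is an illustration only: Atlas Vector Search uses an approximate index, the similarity metric is whatever the index was configured with, and the documents and vectors here are made up.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          docs: List[Tuple[str, Sequence[float]]],
          k: int) -> List[Tuple[str, float]]:
    """Score every (text, vector) pair against the query and keep the k highest scores."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up documents with toy embeddings (real embeddings here have 768 dimensions).
toy_docs = [
    ("sleep remedies", [0.9, 0.1, 0.0]),
    ("digestion tips", [0.1, 0.9, 0.0]),
    ("skin care", [0.0, 0.1, 0.9]),
]
toy_query = [1.0, 0.2, 0.0]
ranked = top_k(toy_query, toy_docs, k=2)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```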


Initialize Model, Vector Store and Prompt

In [35]:
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain_together import Together


# 1. If the question is to request documents, please only return the source documents with no answer.
# 2. If you don't know the answer, don't try to make up an answer. Just say **I can't find the final answer but you may want to check the following links** and add the source documents as a list.
# 3. If you find the answer, write the answer in a concise way and add the list of sources that are **directly** used to derive the answer. Exclude the sources that are irrelevant to the final answer.
# 4. Keep the context as close as possible to the actual query.
# For instance, if someone has cough, a possible home remedy could be having warm water with Ginger and Turmeric.

together_llm = Together(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    temperature=0.0,
    max_tokens=512,
    top_k=10,
    top_p=0.8,
)

ayurveda_retriever = vector_search.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5},
)
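The retriever returns up to `k` chunks per query, and the `"stuff"` chain type used in the cells that follow places all of them into a single prompt. A rough sketch of that assembly step is shown below, assuming the chunks are joined with blank lines. The template, the function name `build_stuffed_prompt`, and the chunk texts are hypothetical stand-ins rather than LangChain's actual implementation.

```python
from typing import List

# Hypothetical template, shaped like the [INST] prompts used in this notebook.
TEMPLATE = "<s>[INST] You are an Ayurveda Doctor.\n{context}\n\nQuestion: {question} [/INST]\n"


def build_stuffed_prompt(chunks: List[str], question: str, template: str = TEMPLATE) -> str:
    """Join the retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


toy_prompt = build_stuffed_prompt(
    ["Chunk about sleep.", "Chunk about warm milk."],  # made-up retrieved chunks
    "I cannot sleep",
)
print(toy_prompt)
```

Because every retrieved chunk ends up in one prompt, `k` and the chunk size together determine how much of the model's context window the retrieval step consumes.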


# Let's diagnose you

## Identify Body Type

In [36]:
most_energetic = 'Evening' # @param ["Morning", "Afternoon", "Evening"]
digestion_pattern = 'Slow and methodical' # @param ["Quick and fiery", "Consistent and Balanced", "Slow and methodical"]
body_frame = 'Solid and Well-built' # @param ["Lean and Flexible", "Athletic and Muscular", "Solid and Well-built"]


In [42]:
body_type_prompt_template = """<s>[INST] You are an Ayurveda Doctor. You are supposed to answer in a single word only.
{context}

Question: {question} [/INST]
"""
BODY_TYPE_PROMPT = PromptTemplate(
    template=body_type_prompt_template, input_variables=["context", "question"]
)

body_type_qa = RetrievalQA.from_chain_type(
    llm=together_llm,
    chain_type="stuff",
    retriever=ayurveda_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": BODY_TYPE_PROMPT},
)

body_type_result = body_type_qa({"query": "I am most energetic during %s, my digestive pattern is %s and I have a %s body frame. Tell me my body type, Answer in 1 word without any description."%(most_energetic, digestion_pattern, body_frame)})

print(body_type_result["result"])

Kapha. (This refers to the Ayurvedic body type that is characterized by slow digestion, solid body frame, and being most energetic in the evening.)
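The model did not keep to a single word here, and the full response is later interpolated into the ailment query. One possible safeguard is to extract just the dosha name before reusing it. The helper below is a sketch under that assumption: `extract_dosha` is a hypothetical name, and it returns only the first dosha mentioned, so dual body types would need extra handling.

```python
import re
from typing import Optional

# The three doshas recognised in Ayurveda.
DOSHAS = ("Vata", "Pitta", "Kapha")


def extract_dosha(response: str) -> Optional[str]:
    """Return the first dosha name found in `response`, or None if none is mentioned."""
    pattern = r"\b(" + "|".join(DOSHAS) + r")\b"
    match = re.search(pattern, response, flags=re.IGNORECASE)
    return match.group(1).capitalize() if match else None


# The response printed by the cell above.
sample_response = (
    "Kapha. (This refers to the Ayurvedic body type that is characterized by "
    "slow digestion, solid body frame, and being most energetic in the evening.)"
)
body_type = extract_dosha(sample_response)
print(body_type)
```

The cleaned value could then replace `body_type_result["result"]` when building the follow-up query.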


## State your problem

In [43]:
ailment = 'I have diabetes. What should I do?' # @param {type:"string"}

In [44]:
prompt_template = """<s>[INST] You are an Ayurveda Doctor. You are supposed to answer a question related to an ailment. Structure the answer into two parts: What is the relation of the ailment to the body type and what home remedies can be used to treat or keep the ailment under control. Respond "Unsure about answer" if not sure about the answer.
{context}

Question: {question} [/INST]
"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

ailment_qa = RetrievalQA.from_chain_type(
    llm=together_llm,
    chain_type="stuff",
    retriever=ayurveda_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT},
)

ailment_result = ailment_qa({"query": "My body type is %s. %s"%(body_type_result["result"], ailment)})

print(ailment_result["result"])
# for d in ailment_result["source_documents"]:
#   print(d.page_content)
#   print("-------")

Ailment and Body Type:
In Ayurveda, each dosha or body type has certain characteristics that can make individuals more prone to specific health issues. As a Kapha body type, you generally have a solid frame, slow digestion, and are most energetic in the evening. Kapha dosha is also associated with anabolism, which includes the storage and regulation of water and fat in the body.

Diabetes, particularly type 2, is often linked to an imbalance in Kapha dosha due to the accumulation of excess fat and water in the body. This can lead to insulin resistance, making it difficult for your body to regulate blood sugar levels.

Home Remedies:

1. Diet: Focus on a Kapha-pacifying diet, which includes light, warm, and dry foods. Include more bitter, astringent, and pungent tastes in your meals. Foods like leafy greens, broccoli, cauliflower, beans, lentils, and spices such as turmeric, cinnamon, and black pepper can help manage blood sugar levels.

2. Exercise: Regular physical activity is essenti