<h1>IMPORT LIBRARIES</h1>

In [None]:
# %pip install pandas==1.4.2 sentence-transformers==3.0.1 qdrant-client==1.10.1 transformers==4.44.1 tensorflow-text==2.10.0 tensorflow-gpu==2.10.0

In [2]:
%pip install numpy==1.22.4



In [3]:
%pip install -r /content/requirements.txt



In [4]:
#Import libraries
import pandas as pd
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams, PointStruct
import tensorflow as tf
from transformers import TFAutoModel
from transformers import AutoTokenizer
import time

<h1>DATA CLEANING</h1>

In [5]:
dataset = pd.read_csv('medquad.csv')
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16412 entries, 0 to 16411
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   question    16412 non-null  object
 1   answer      16407 non-null  object
 2   source      16412 non-null  object
 3   focus_area  16398 non-null  object
dtypes: object(4)
memory usage: 513.0+ KB


In [6]:
dataset.head()

,question,answer,source,focus_area
0,What is (are) Glaucoma ?,Glaucoma is a group of diseases that can damag...,NIHSeniorHealth,Glaucoma
1,What causes Glaucoma ?,"Nearly 2.7 million people have glaucoma, a lea...",NIHSeniorHealth,Glaucoma
2,What are the symptoms of Glaucoma ?,Symptoms of Glaucoma Glaucoma can develop in ...,NIHSeniorHealth,Glaucoma
3,What are the treatments for Glaucoma ?,"Although open-angle glaucoma cannot be cured, ...",NIHSeniorHealth,Glaucoma
4,What is (are) Glaucoma ?,Glaucoma is a group of diseases that can damag...,NIHSeniorHealth,Glaucoma


In [7]:
print("number of duplications : ", dataset.duplicated().sum())

number of duplications :  48


In [8]:
dataset.drop_duplicates(inplace=True)
print("number of duplications after cleaning : ", dataset.duplicated().sum())

number of duplications after cleaning :  0


In [9]:
dataset.isna().sum()

question       0
answer         5
source         0
focus_area    14
dtype: int64


In [10]:
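#Drop rows with missing values, then confirm that none remain
dataset.dropna(inplace=True)
print('number of NaN after cleaning : ', dataset.isna().sum().sum())

number of NaN after cleaning :  0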

In [11]:
#Save Dataset
dataset.to_csv('cleaned_medquad.csv', index=False)

<h1>TOKENIZATION & EMBEDDING</h1>

In [12]:
class TFSentenceTransformer(tf.keras.layers.Layer):
    def __init__(self, model_name_or_path, **kwargs):
        super(TFSentenceTransformer, self).__init__()
        #Load transformers model
        self.model = TFAutoModel.from_pretrained(model_name_or_path, **kwargs)

    def call(self, inputs, normalize=True):
        #Run model on inputs
        model_output = self.model(inputs)
        #Perform pooling.
        embeddings = self.mean_pooling(model_output, inputs['attention_mask'])
        #Normalize the embeddings
        if normalize:
            embeddings = self.normalize(embeddings)
        return embeddings

    def mean_pooling(self, model_output, attention_mask):
        token_embeddings = model_output[0]
        input_mask_expanded = tf.cast(
            tf.broadcast_to(tf.expand_dims(attention_mask, -1), tf.shape(token_embeddings)),
            tf.float32
        )
        return tf.math.reduce_sum(token_embeddings * input_mask_expanded, axis=1) / tf.clip_by_value(tf.math.reduce_sum(input_mask_expanded, axis=1), 1e-9, tf.float32.max)

    def normalize(self, embeddings):
        embeddings, _ = tf.linalg.normalize(embeddings, 2, axis=1)
        return embeddings
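
The pooling step in `mean_pooling` averages only the token vectors whose attention-mask entry is 1, so padding does not dilute the sentence embedding, and `normalize` then rescales the result to unit length. The sketch below reproduces that arithmetic on a toy sequence in plain Python so it runs without TensorFlow. The helper names `masked_mean_pool` and `l2_normalize` are illustrative and are not part of the notebook or of any library.

```python
import math

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                summed[j] += value
    # Guard against an all-padding sequence, like the clip in mean_pooling
    denom = max(count, 1e-9)
    return [s / denom for s in summed]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

# Toy sequence: two real tokens followed by one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
unit = l2_normalize(pooled)
print(pooled)  # [2.0, 3.0]: the padding vector is ignored
```

Because the embeddings are normalized to unit length, cosine similarity between two of them reduces to a plain dot product.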

In [13]:
#Model ID
model_id = 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'
#Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = TFSentenceTransformer(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['embeddings.position_ids']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [None]:
#Combine question and answer columns
dataset['question_answer'] = dataset['question'].fillna('') + ' ' + dataset['answer'].fillna('')

batch_size = 32

#Create a list from the question_answer column
qa = dataset['question_answer'].tolist()

#Tokenization (pads to the longest sequence and truncates at the model's maximum length)
tokenized_qa = tokenizer(qa, padding=True, truncation=True, return_tensors='tf')

#Build a batched tf.data pipeline from the tokenized tensors
qa_dataset = tf.data.Dataset.from_tensor_slices(dict(tokenized_qa))
qa_dataset = qa_dataset.batch(batch_size)
qa_dataset = qa_dataset.prefetch(tf.data.AUTOTUNE)

#Get embeddings from the model
all_embeddings = []
batch_num = 1

#Start measuring processing time
start_time = time.time()

for batch in qa_dataset:
    batch_embeddings = model(batch)
    #Convert the (batch, dim) tensor to a list of Python lists
    all_embeddings.extend(batch_embeddings.numpy().tolist())
    #Report progress after each embedded batch
    print(f"Uploaded Batch {batch_num}")
    batch_num += 1

#Calculate total time
total_time = time.time() - start_time
print(f"Total Processing Time: {total_time:.2f} seconds")

Uploaded Batch 1
Uploaded Batch 2
Uploaded Batch 3
Uploaded Batch 4
Uploaded Batch 5
Uploaded Batch 6
Uploaded Batch 7
Uploaded Batch 8
Uploaded Batch 9
Uploaded Batch 10
Uploaded Batch 11
Uploaded Batch 12
Uploaded Batch 13
Uploaded Batch 14
Uploaded Batch 15
Uploaded Batch 16
Uploaded Batch 17
Uploaded Batch 18
Uploaded Batch 19
Uploaded Batch 20
Uploaded Batch 21
Uploaded Batch 22
Uploaded Batch 23
Uploaded Batch 24
Uploaded Batch 25
Uploaded Batch 26
Uploaded Batch 27
Uploaded Batch 28
Uploaded Batch 29
Uploaded Batch 30
Uploaded Batch 31
Uploaded Batch 32
Uploaded Batch 33
Uploaded Batch 34
Uploaded Batch 35
Uploaded Batch 36
Uploaded Batch 37
Uploaded Batch 38
Uploaded Batch 39
Uploaded Batch 40
Uploaded Batch 41
Uploaded Batch 42
Uploaded Batch 43
Uploaded Batch 44
Uploaded Batch 45
Uploaded Batch 46
Uploaded Batch 47
Uploaded Batch 48
Uploaded Batch 49
Uploaded Batch 50
Uploaded Batch 51
Uploaded Batch 52
Uploaded Batch 53
Uploaded Batch 54
Uploaded Batch 55
Uploaded Batch 56
[output truncated]

<h1>QDRANT</h1>

In [None]:
#Database Initialization
client = QdrantClient(
    "http://34.101.137.149:6333",
)

In [None]:
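#Input Data to Qdrant
#recreate_collection is deprecated in recent qdrant-client releases,
#so drop any existing collection explicitly and create it again
if client.collection_exists('Healthcare'):
    client.delete_collection('Healthcare')

client.create_collection(
    collection_name='Healthcare',
    vectors_config=VectorParams(
        size=len(all_embeddings[0]),
        distance=Distance.COSINE
    )
)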

points = [
    PointStruct(
        id=i,
        vector=all_embeddings[i],
        payload={"question" : dataset['question'].iloc[i], 'answer' : dataset['answer'].iloc[i]}
    )
    for i in range(len(all_embeddings))
]

batch_size = 500

#Split data to smaller batches
for i in range(0, len(points), batch_size):
    batch_points = points[i:i+batch_size]

    client.upsert(
        collection_name='Healthcare',
        wait=True,
        points=batch_points
    )
    print(f'Uploaded batch {i // batch_size + 1}')
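
The upload loop above slices `points` into groups of `batch_size` so that each `upsert` request stays small. The same slicing can be factored into a reusable generator. This is a minimal sketch: `chunked` is a hypothetical helper name, and the integers stand in for `PointStruct` objects.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in data: 1200 integers instead of PointStruct objects
fake_points = list(range(1200))
batches = list(chunked(fake_points, 500))
print([len(b) for b in batches])  # [500, 500, 200]
```

The last batch is simply shorter than the rest, so no padding or special casing is needed before uploading it.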

In [None]:
def search(query):
    # Tokenize query
    query_vector = tokenizer(query, padding=True, truncation=True, return_tensors="tf")

    # Generate embeddings using the model
    query_vector = model(query_vector).numpy().tolist()

    # Perform search in Qdrant
    results = client.search(
        collection_name='Healthcare',
        query_vector=query_vector[0],  # Use the first embedding in the batch
        limit=3
    )

    # Sort results by score
    sorted_result = sorted(results, key=lambda x: x.score, reverse=True)

    # Return formatted results
    return [res.payload['question'] + ' ' + res.payload['answer'] for res in sorted_result]

query = 'I have blurred vision, eye pain and redness, seeing flashes of light. What disease do I suffer from?'
results = search(query)
for result in results:
    print(result)
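
With `Distance.COSINE`, Qdrant scores each stored vector by its cosine similarity to the query and returns the highest-scoring points first. The sketch below shows that ranking as a brute-force scan over a tiny in-memory list. It is only an illustration of the scoring rule, since Qdrant itself answers queries through an approximate index rather than a full scan. `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=3):
    """Return the k best (score, index) pairs, highest score first."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy two-dimensional "embeddings"
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.1], corpus, k=2)
for score, index in hits:
    print(index, round(score, 3))
```

Here the query points almost along the first axis, so vector 0 ranks first and the diagonal vector 2 ranks second.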

### READ DATA FROM QDRANT USING LANGCHAIN

In [None]:
%pip install -U langchain-ollama langchain-community

In [None]:
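from langchain_ollama import OllamaEmbeddings

# NOTE: the "Healthcare" collection was embedded with the MiniLM sentence-transformer above.
# Queries must be embedded by that same model (same vector size); otherwise retrieval
# against this collection fails or returns meaningless matches.
embeddings = OllamaEmbeddings(model="llama3.2:1b")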

In [None]:
%pip install langchain-qdrant

In [None]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(url="http://34.101.137.149:6333")

In [None]:
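from langchain_qdrant import QdrantVectorStore

# The points in "Healthcare" keep their text under the "question" and "answer"
# payload keys rather than LangChain's default "page_content", so read "answer"
vector_store = QdrantVectorStore(
    client=client,
    collection_name="Healthcare",
    embedding=embeddings,
    content_payload_key="answer",
)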
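
A vector store can only compare a query with stored points when both were produced by the same embedding model, because the vector sizes (and the meaning of each dimension) must match. The sketch below is a small guard that fails early on a size mismatch. `assert_same_dimension` is a hypothetical helper, 384 is the output size of the MiniLM model used to build the collection, and 1024 is an arbitrary stand-in for some other embedder. In a live setup the expected size could be taken from `len(all_embeddings[0])`, or, as far as I know, from `client.get_collection("Healthcare").config.params.vectors.size`.

```python
def assert_same_dimension(query_vector, collection_vector_size):
    """Raise early if a query embedding cannot be compared with the stored vectors."""
    if len(query_vector) != collection_vector_size:
        raise ValueError(
            f"query has {len(query_vector)} dimensions, "
            f"collection expects {collection_vector_size}"
        )

# 384 matches the MiniLM sentence embeddings stored in the collection
stored_size = 384
assert_same_dimension([0.0] * 384, stored_size)  # passes silently

try:
    # Hypothetical embedder with a different output size
    assert_same_dimension([0.0] * 1024, stored_size)
except ValueError as err:
    mismatch_message = str(err)
    print(mismatch_message)
```

Matching sizes are necessary but not sufficient: two different models with the same output size would pass this check and still retrieve poorly.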

In [None]:
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.2:1b",
    temperature=0.5,
)

In [None]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

input_text = (
    "You are a healthcare chatbot. "
    "Please answer the question in a professional and friendly tone. "
    "Here are some rules on how to answer: "
    "- Use language that is professional and friendly. "
    "- Before answering say: 'Thank you for consulting with DoCare AI.' "
    "- At the end of the answer say: 'Hope this information helps and wish you a speedy recovery. Thank you.' "
    "- If the answer is not available, don't answer; do not make up an answer. "
    "- Answer questions based on the language of the question given. "
    "- Give all the answers related to the question disease. If there is no answer related to the disease, don't say 'no information,' but say: 'This is the only information I got.' Do not make up an answer. "
    "The following is informational text to answer questions later: "
    "{context}"
)

prompt = ChatPromptTemplate.from_messages([
    ("system", input_text),
    ("human", "{input}")
])

In [None]:
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 5})

In [None]:
question_answer_chain = create_stuff_documents_chain(llm, prompt)

rag_chain = create_retrieval_chain(
    retriever,
    question_answer_chain
)

response = rag_chain.invoke({"input": "I got hurt with my teeth. why i felt that?"})
print(response['answer'])

In [None]:
# #Model ID
# model_id_2 = 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2'
# #Load model and tokenizer
# tokenizer = AutoTokenizer.from_pretrained(model_id_2)
# model_2 = TFSentenceTransformer(model_id_2)

In [None]:
# #Combine question and answer columns
# dataset['question_answer'] = dataset['question'].fillna('') + ' ' + dataset['answer'].fillna('')

# batch_size = 32

# #Function to process the data in batches
# def process_in_batches(data, batch_size):
#     for i in range(0, len(data), batch_size):
#         yield data[i:i + batch_size]

# #Create a list from the question_answer column
# qa = dataset['question_answer'].tolist()

# #Tokenization
# tokenized_qa = tokenizer(qa, padding=True, truncation=True, return_tensors='tf')

# qa_dataset = tf.data.Dataset.from_tensor_slices(tokenized_qa)
# qa_dataset = qa_dataset.batch(batch_size)
# qa_dataset = qa_dataset.prefetch(tf.data.AUTOTUNE)

# #Get embeddings from the model
# all_embeddings = []
# batch_num = 1

# #Start measuring processing time
# start_time = time.time()

# for batch in qa_dataset:
#     batch_embeddings = model_2(batch)
#     embeddings_list = [embedding.numpy().tolist() for embedding in batch_embeddings]
#     all_embeddings.extend(embeddings_list)
#     #Displays the results of the batch being processed
#     print(f"Uploaded Batch {batch_num}")
#     batch_num += 1

# #Calculates total time
# total_time = time.time() - start_time
# print(f"Total Processing Time: {total_time:.2f} seconds")

In [None]:
# #Input Data to Qdrant
# client.recreate_collection(
#     collection_name='Healthcare_2',
#     vectors_config=VectorParams(
#         size=(len(all_embeddings[0])),
#         distance=Distance.COSINE
#     )
# )

# points = [
#     PointStruct(
#         id=i,
#         vector=all_embeddings[i],
#         payload={"question" : dataset['question'].iloc[i], 'answer' : dataset['answer'].iloc[i]}
#     )
#     for i in range(len(all_embeddings))
# ]

# batch_size = 500

# #Split data to smaller batches
# for i in range(0, len(points), batch_size):
#     batch_points = points[i:i+batch_size]

#     client.upsert(
#         collection_name='Healthcare_2',
#         wait=True,
#         points=batch_points
#     )
#     print(f'Uploaded batch {i // batch_size + 1}')



Uploaded batch 1
Uploaded batch 2
Uploaded batch 3
Uploaded batch 4
Uploaded batch 5
Uploaded batch 6
Uploaded batch 7
Uploaded batch 8
Uploaded batch 9
Uploaded batch 10
Uploaded batch 11
Uploaded batch 12
Uploaded batch 13
Uploaded batch 14
Uploaded batch 15
Uploaded batch 16
Uploaded batch 17
Uploaded batch 18
Uploaded batch 19
Uploaded batch 20
Uploaded batch 21
Uploaded batch 22
Uploaded batch 23
Uploaded batch 24
Uploaded batch 25
Uploaded batch 26
Uploaded batch 27
Uploaded batch 28
Uploaded batch 29
Uploaded batch 30
Uploaded batch 31
Uploaded batch 32
Uploaded batch 33


In [None]:
# def search(query):
#     # Tokenize query
#     query_vector = tokenizer(query, padding=True, truncation=True, return_tensors="tf")

#     # Generate embeddings using the model
#     query_vector = model_2(query_vector).numpy().tolist()

#     # Perform search in Qdrant
#     results = client.search(
#         collection_name='Healthcare_2',
#         query_vector=query_vector[0],  # Use the first embedding in the batch
#         limit=3
#     )

#     # Sort results by score
#     sorted_result = sorted(results, key=lambda x: x.score, reverse=True)

#     # Return formatted results
#     return [res.payload['question'] + ' ' + res.payload['answer'] for res in sorted_result]

# query = 'I have blurred vision, eye pain and redness, seeing flashes of light. What disease do I suffer from?'
# results = search(query)
# for result in results:
#     print(result)

What is (are) Coats disease ? Coats disease is an eye disorder characterized by abnormal development of the blood vessels in the retina (retinal telangiectasia). Most affected people begin showing symptoms of the condition in childhood. Early signs and symptoms vary but may include vision loss, crossed eyes (strabismus), and a white mass in the pupil behind the lens of the eye (leukocoria). Overtime, coats disease may also lead to retinal detachment, glaucoma, and clouding of the lens of the eye (cataracts) as the disease progresses. In most cases, only one eye is affected (unilateral). The exact underlying cause is not known but some cases may be due to somatic mutations in the NDP gene. Treatment depends on the symptoms present and may include cryotherapy, laser therapy, and/or surgery.
What is (are) Eales disease ? Eales disease is a rare vision disorder that appears as an inflammation and white haze around the outercoat of the veins in the retina. This condition is most common amon