<h1>IMPORT LIBRARIES</h1>

In [1]:
#Import Library
import pandas as pd
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams, PointStruct
import tensorflow as tf


<h1>DATA CLEANING</h1>

In [2]:
dataset = pd.read_csv('medquad.csv')
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16412 entries, 0 to 16411
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   question    16412 non-null  object
 1   answer      16407 non-null  object
 2   source      16412 non-null  object
 3   focus_area  16398 non-null  object
dtypes: object(4)
memory usage: 513.0+ KB


In [3]:
dataset.head()

                                 question                                             answer           source  focus_area
0                What is (are) Glaucoma ?  Glaucoma is a group of diseases that can damag...  NIHSeniorHealth    Glaucoma
1                  What causes Glaucoma ?  Nearly 2.7 million people have glaucoma, a lea...  NIHSeniorHealth    Glaucoma
2     What are the symptoms of Glaucoma ?   Symptoms of Glaucoma Glaucoma can develop in ...  NIHSeniorHealth    Glaucoma
3  What are the treatments for Glaucoma ?  Although open-angle glaucoma cannot be cured, ...  NIHSeniorHealth    Glaucoma
4                What is (are) Glaucoma ?  Glaucoma is a group of diseases that can damag...  NIHSeniorHealth    Glaucoma


In [4]:
print("number of duplications : ", dataset.duplicated().sum())

number of duplications :  48


In [5]:
dataset.drop_duplicates(inplace=True)
print("number of duplications after cleaning : ", dataset.duplicated().sum())

number of duplications after cleaning :  0


In [6]:
dataset.isna().sum()

question       0
answer         5
source         0
focus_area    14
dtype: int64

In [7]:
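#Drop rows with missing values, then confirm that none remain
dataset.dropna(inplace=True)
print("number of NaN after cleaning : ", dataset.isna().sum().sum())

number of NaN after cleaning :  0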

In [8]:
#Save Dataset
dataset.to_csv('cleaned_medquad.csv', index=False)

<h1>TOKENIZATION & EMBEDDING</h1>

In [9]:
question = dataset['question'].tolist()
answer = dataset['answer'].tolist()
qa_combined = (dataset['question'] + " " + dataset['answer']).tolist()


tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(qa_combined)
qa = tokenizer.texts_to_sequences(qa_combined)
maxlen_qa = max([len(x) for x in qa])
padded_qa = tf.keras.preprocessing.sequence.pad_sequences(qa, maxlen=maxlen_qa, padding='post')
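
To make the tokenization and padding step above easier to follow, here is a minimal pure-Python sketch of the same idea: build a word index, map texts to integer ids, and right-pad with zeros. It is a simplified stand-in and not the Keras implementation. The helper names (`fit_word_index`, `texts_to_ids`, `pad_post`) are hypothetical, the splitting is plain whitespace with no punctuation filtering, and overly long sequences are cut at the end, which is an assumption of this sketch.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign integer ids to words, most frequent first, starting at 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() keeps first-seen order among equally frequent words
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def texts_to_ids(texts, word_index):
    """Map each text to a list of ids, dropping words that are not in the index."""
    return [[word_index[w] for w in text.lower().split() if w in word_index] for text in texts]

def pad_post(sequences, maxlen):
    """Right-pad with zeros up to maxlen; longer sequences are cut at the end (sketch assumption)."""
    return [seq[:maxlen] + [0] * (maxlen - len(seq)) for seq in sequences]

corpus = ["what causes glaucoma", "what is glaucoma"]
word_index = fit_word_index(corpus)

# The third text contains a word the index has never seen
ids = texts_to_ids(corpus + ["what is retinopathy"], word_index)
padded = pad_post(ids, maxlen=4)

print(word_index)
print(padded)
```

The unseen word in the last text is dropped, which is also what happens to any query word that never appeared in the fitted corpus.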

In [10]:
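#Embedding model: 768-dimensional token embeddings averaged into one vector per text
#Note: the weights are randomly initialized; this notebook never compiles or trains the model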
embedding_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=768, input_length=maxlen_qa),
    tf.keras.layers.GlobalAveragePooling1D()
])


In [11]:
embedding_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 4253, 768)         22765056  
                                                                 
 global_average_pooling1d (G  (None, 768)              0         
 lobalAveragePooling1D)                                          
                                                                 
Total params: 22,765,056
Trainable params: 22,765,056
Non-trainable params: 0
_________________________________________________________________
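
The summary shows every input padded to 4253 positions before pooling. With the layers configured as above, no mask is produced, so the average includes the padding positions as well. The sketch below uses a tiny hypothetical embedding table and plain Python lists (not TensorFlow) to compare a plain mean with a mean restricted to real tokens. The function names are illustrative only. In Keras, setting `mask_zero=True` on the `Embedding` layer is one option worth exploring for this.

```python
def mean_pool(vectors):
    """Plain average over every position, padding included."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def masked_mean_pool(token_ids, vectors, pad_id=0):
    """Average only the positions whose token id is not the padding id."""
    kept = [vec for tid, vec in zip(token_ids, vectors) if tid != pad_id]
    if not kept:
        return [0.0] * len(vectors[0])
    return [sum(col) / len(kept) for col in zip(*kept)]

# Hypothetical 2-dimensional embedding table; id 0 is the padding token
table = {0: [9.0, 9.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
token_ids = [1, 2, 0, 0, 0, 0]  # two real tokens followed by four padding positions
vectors = [table[t] for t in token_ids]

plain = mean_pool(vectors)
masked = masked_mean_pool(token_ids, vectors)

print(plain)   # dominated by the padding vector
print(masked)  # reflects only the two real tokens
```

For a short query padded to thousands of positions, the plain mean is almost entirely the padding vector, so most short texts end up with nearly identical embeddings.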


In [12]:
#Save Model
embedding_model.save("./embedding_model.h5")



In [13]:
embedding_qa = embedding_model.predict(padded_qa)



<h1>QDRANT</h1>

In [14]:
#Database Initialization
client = QdrantClient("http://10.12.9.105:6333")

In [15]:
#Input Data to Qdrant
client.recreate_collection(
    collection_name='Healthcare_2',
    vectors_config=VectorParams(
        size=embedding_qa.shape[1],
        distance=Distance.COSINE
    )
)

points = [
    PointStruct(
        id=i,
        vector=embedding_qa[i].tolist(),
        payload={"question" : dataset['question'].iloc[i], 'answer' : dataset['answer'].iloc[i]}
    )
    for i in range(len(embedding_qa))
]

batch_size = 500

#Split data to smaller batches
for i in range(0, len(points), batch_size):
    batch_points = points[i:i+batch_size]
    
    client.upsert(
        collection_name='Healthcare_2',
        wait=True,
        points=batch_points
    )
    print(f'Uploaded batch {i // batch_size + 1}')



Uploaded batch 1
Uploaded batch 2
Uploaded batch 3
Uploaded batch 4
Uploaded batch 5
Uploaded batch 6
Uploaded batch 7
Uploaded batch 8
Uploaded batch 9
Uploaded batch 10
Uploaded batch 11
Uploaded batch 12
Uploaded batch 13
Uploaded batch 14
Uploaded batch 15
Uploaded batch 16
Uploaded batch 17
Uploaded batch 18
Uploaded batch 19
Uploaded batch 20
Uploaded batch 21
Uploaded batch 22
Uploaded batch 23
Uploaded batch 24
Uploaded batch 25
Uploaded batch 26
Uploaded batch 27
Uploaded batch 28
Uploaded batch 29
Uploaded batch 30
Uploaded batch 31
Uploaded batch 32
Uploaded batch 33
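
The batching loop above can also be factored into a small reusable generator. This is only a sketch: `iter_batches` and `upload_all` are hypothetical helper names, and the `upload` callback stands in for a call such as `client.upsert(...)` so the example runs without a Qdrant server.

```python
def iter_batches(items, batch_size):
    """Yield (batch_number, batch) pairs, numbering batches from 1."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield start // batch_size + 1, items[start:start + batch_size]

def upload_all(items, batch_size, upload):
    """Send every batch through the upload callback and return how many items were sent."""
    sent = 0
    for number, batch in iter_batches(items, batch_size):
        upload(batch)
        sent += len(batch)
        print(f"Uploaded batch {number}")
    return sent

# Stand-in uploader that only records what it receives
received = []
total = upload_all(list(range(7)), batch_size=3, upload=received.append)
print(total, received)
```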


In [16]:
def search(query):
    query_vector = tokenizer.texts_to_sequences([query])
    padded_query = tf.keras.preprocessing.sequence.pad_sequences(query_vector, maxlen=maxlen_qa, padding='post')
    
    #Get the embedding vector for the query
    embedding_query = embedding_model.predict(padded_query)[0]
    
    #Perform the search
    results = client.search(
        collection_name='Healthcare_2',
        query_vector=embedding_query.tolist(),
        limit=3
    )

    #Sort results by score
    sorted_result = sorted(results, key=lambda x: x.score, reverse=True)
    
    return [res.payload['question'] + ' ' + res.payload['answer'] for res in sorted_result]

query = 'I have blurred vision, eye pain and redness, seeing flashes of light. What disease do I suffer from?'
results = search(query)
for result in results:
    print(result)

what can i do to prevent poisoning by marine toxins? General guidelines for safe seafood consumption:
What causes Childhood Ependymoma ? The cause of most childhood brain tumors is unknown.
What is the outlook for Thyrotoxic Myopathy ? With treatment, muscle weakness may improve or be reversed.
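
The three hits above are unrelated to the query. One plausible reason is that the Keras embedding layer in this notebook is never trained, so the vectors carry no learned meaning. To make the ranking itself concrete, here is a minimal sketch of what a cosine-distance search computes, using plain Python lists and hypothetical 2-dimensional vectors. It illustrates the scoring idea and is not Qdrant's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=3):
    """Score every stored vector against the query and return the k best (score, label) pairs."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Hypothetical stored vectors keyed by a short label
stored = {"eye": [1.0, 0.1], "thyroid": [0.1, 1.0], "seafood": [-1.0, 0.0]}
ranking = top_k([0.9, 0.2], stored, k=2)
print(ranking)
```

The ranking is only as meaningful as the vectors: if the embeddings are effectively random, the highest cosine score says nothing about topical relevance.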


In [17]:
#Import Library
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer
import torch
from qdrant_client.http.models import Distance, VectorParams, PointStruct

#Initialize the database and the embedding model
client = QdrantClient("http://10.12.9.105:6333")

model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2', device='cuda' if torch.cuda.is_available() else 'cpu')

def search(query):
    query_vector= model.encode(query)
    results = client.search(
        collection_name = 'Healthcare',
        query_vector=query_vector,
        limit = 3
    )
    #Sort results by relevance score
    sorted_result = sorted(results, key=lambda x: x.score, reverse=True)
    return [res.payload['question'] + ' ' + res.payload['answer'] for res in sorted_result]

query = 'I have blurred vision, eye pain and redness, seeing flashes of light. What disease do I suffer from?'
results = search(query)
for result in results:
    print(result)



What is (are) Eye Diseases ? Some eye problems are minor and don't last long. But some can lead to a permanent loss of vision.    Common eye problems include       - Refractive errors    - Cataracts - clouded lenses    - Glaucoma - a disorder caused by damage to the optic nerve    - Retinal disorders - problems with the nerve layer at the back of the eye    - Macular degeneration - a disease that destroys sharp, central vision    - Diabetic eye problems    - Conjunctivitis - an infection also known as pinkeye       Your best defense is to have regular checkups, because eye diseases do not always have symptoms. Early detection and treatment could prevent vision loss. See an eye care professional right away if you have a sudden change in vision, if everything looks dim, or if you see flashes of light. Other symptoms that need quick attention are pain, double vision, fluid coming from the eye, and inflammation.    NIH: National Eye Institute
What are the symptoms of Diabetic Retinopathy ?