<a href="https://colab.research.google.com/github/khanfawaz/MedChat/blob/main/MedChat_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SCOPE OF THE RESEARCH

## SCOPE:

The primary aim of this research is to assess the performance and effectiveness of an open-source, small-scale LLM as a chatbot application in the medical domain. The study explores how chatbot functionality, open-source platforms, and small-scale LLMs can work together in a medical setting. Using a comprehensive dataset of medical terms and their corresponding explanations, it seeks to determine the model's accuracy, reliability, and overall utility in providing medical information and support. The ultimate goal is to enhance the accessibility and quality of healthcare communication through cost-effective AI-driven solutions.

Retrieval Augmented Generation (RAG) is used to improve the performance of the medical chatbot. During generation, relevant medical information is retrieved from a knowledge base and supplied to the model so that its answers are grounded in that information.
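The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not the notebook's implementation: it swaps in a plain bag-of-words cosine similarity for whatever embedding model and vector store the chatbot uses, and the names `KNOWLEDGE_BASE`, `tokenize`, `cosine`, `retrieve`, and `build_prompt` are hypothetical. Two of the three entries are toy placeholders and are not dataset contents.

```python
import math
import re
from collections import Counter

# Hypothetical mini knowledge base. Entries marked "Toy entry" are
# placeholders written for this sketch, not text from the dataset.
KNOWLEDGE_BASE = {
    "Colitis": "Colitis is an inflammation of the colon.",
    "Acromegaly": "Toy entry: a hormone disorder affecting growth.",
    "Gephyrophobia": "Toy entry: an anxiety disorder involving fear of bridges.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, k=1):
    """Return the k (title, text) pairs most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda item: cosine(q, Counter(tokenize(item[0] + " " + item[1]))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, k=1):
    """Place the retrieved text ahead of the question for the LLM."""
    context = "\n".join(text for _, text in retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


prompt = build_prompt("What is inflammation of the colon?")
print(prompt)
```

The prompt that reaches the model therefore carries the retrieved passage as context, which is the "augmentation" in RAG.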


## TECHNOLOGY STACK:

1. Python
2. Data Analysis
3. Large Language Model (LLM)
4. LangChain
5. Retrieval Augmented Generation (RAG)
6. Streamlit or Gradio
7. GitHub
8. Amazon SageMaker
9. Google Colaboratory

## DATASET USED:

https://huggingface.co/datasets/gamino/wiki_medical_terms

## DISCLAIMER:

This model is intended for research purposes only and should not be used as a substitute for professional medical advice. Always consult a qualified healthcare provider about any medical concerns.

# DATA ANALYSIS

In [None]:
# Importing Necessary Libraries for Data Analysis
import pandas as pd
import numpy as np

# Library to suppress the warnings
import warnings
warnings.filterwarnings(action="ignore")

In [None]:
# Loading the Dataset
df = pd.read_parquet("hf://datasets/gamino/wiki_medical_terms/wiki_medical_terms.parquet")

In [None]:
# Display the Dataset
df.head()

Unnamed: 0,page_title,page_text
0,Paracetamol poisoning,"Paracetamol poisoning, also known as acetamino..."
1,Acromegaly,Acromegaly is a disorder that results from exc...
2,Actinic keratosis,"Actinic keratosis (AK), sometimes called solar..."
3,Congenital adrenal hyperplasia,Congenital adrenal hyperplasia (CAH) is a grou...
4,Adrenocortical carcinoma,Adrenocortical carcinoma (ACC) is an aggressi...


In [None]:
# Shape of the dataset:
print(df.shape)

(6861, 2)


In [None]:
# Summary of the Dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6861 entries, 0 to 7275
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   page_title  6861 non-null   object
 1   page_text   6861 non-null   object
dtypes: object(2)
memory usage: 160.8+ KB


In [None]:
# Check for missing values
print(df.isnull().sum())

page_title    0
page_text     0
dtype: int64


<b>INSIGHT:</b>
- `No missing values`
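`isnull()` only flags `None`/`NaN`, so empty or whitespace-only strings would still pass this check. A possible follow-up on a toy frame (the data and the names `toy`, `null_counts`, and `blank_counts` are made up for illustration):

```python
import pandas as pd

# Toy frame: no NaN values, but two page_text cells are effectively empty.
toy = pd.DataFrame(
    {
        "page_title": ["Colitis", "Acromegaly", "Toy term"],
        "page_text": ["Colitis is an inflammation of the colon.", "   ", ""],
    }
)

# Standard missing-value check: reports zero for both columns.
null_counts = toy.isnull().sum()

# Count cells that are empty once surrounding whitespace is stripped.
blank_counts = toy.apply(lambda col: col.str.strip().eq("").sum())

print(null_counts)
print(blank_counts)
```

Running the same `blank_counts` expression on `df` would show whether any article has an empty title or body.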

In [None]:
# Count the fully duplicated rows
print(df.duplicated().sum())

99


<b>INSIGHT:</b>
- `99 rows are exact duplicates of an earlier row`
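Called without arguments, `duplicated()` compares every column, so it only counts rows whose title and text both match an earlier row. Passing `subset` restricts the comparison to chosen columns. The toy frame below (made-up data) shows how the two counts can differ:

```python
import pandas as pd

# Toy frame: "Colitis" appears three times, but only two rows share the same text.
toy = pd.DataFrame(
    {
        "page_title": ["Colitis", "Colitis", "Colitis", "Acromegaly"],
        "page_text": ["text A", "text A", "text B", "text C"],
    }
)

# Rows identical to an earlier row in every column.
full_row_dupes = int(toy.duplicated().sum())

# Rows whose title alone matches an earlier row.
title_dupes = int(toy.duplicated(subset=["page_title"]).sum())

print(full_row_dupes, title_dupes)
```

When the two counts agree on the real data, every repeated title is also a repeated article.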

In [None]:
df['page_title'].duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
7271    False
7272    False
7273    False
7274    False
7275    False
Name: page_title, Length: 6861, dtype: bool

<b>INSIGHT:</b>
- `The truncated output shows only the first and last rows, so no duplicates are visible here`
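Because the printed mask is truncated, summarising it is more informative than reading it. A small sketch on a toy Series (the titles are taken from the outputs above; the variable names are illustrative):

```python
import pandas as pd

titles = pd.Series(
    ["Colitis", "Acromegaly", "Colitis", "Gephyrophobia"], name="page_title"
)

# True for every title that already appeared earlier in the Series.
mask = titles.duplicated()

has_dupes = bool(mask.any())      # is there at least one repeated title?
n_dupes = int(mask.sum())         # how many rows repeat an earlier title?
repeated = titles[mask].tolist()  # which titles are repeated?

print(has_dupes, n_dupes, repeated)
```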

In [None]:
# Display the rows whose page_title repeats an earlier row
df.loc[df['page_title'].duplicated()]

Unnamed: 0,page_title,page_text
62,Paracetamol poisoning,"Paracetamol poisoning, also known as acetamino..."
63,Acromegaly,Acromegaly is a disorder that results from exc...
64,Actinic keratosis,"Actinic keratosis (AK), sometimes called solar..."
65,Congenital adrenal hyperplasia,Congenital adrenal hyperplasia (CAH) is a grou...
66,Adrenocortical carcinoma,Adrenocortical carcinoma (ACC) is an aggressi...
...,...,...
708,Benign prostatic hyperplasia,"Benign prostatic hyperplasia (BPH), also calle..."
713,Megaloblastic anemia,Megaloblastic anemia is a type of macrocytic a...
716,Ascending cholangitis,"Ascending cholangitis, also known as acute cho..."
722,Colitis,Colitis is an inflammation of the colon. Colit...


<b>INSIGHT:</b>
- `This clearly shows the duplicates: "Paracetamol poisoning" appears at index 0 in df.head() and again here at index 62.`
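To see each original next to its copy in a single view, `duplicated(keep=False)` marks every occurrence of a repeated title instead of only the later ones. A sketch on a toy frame that mimics the index values seen above (the `page_text` strings are placeholders):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "page_title": [
            "Paracetamol poisoning",
            "Acromegaly",
            "Actinic keratosis",
            "Paracetamol poisoning",
        ],
        "page_text": ["text 0", "text 1", "text 2", "text 0"],
    },
    index=[0, 1, 2, 62],
)

# keep=False flags all copies, including the first occurrence.
all_copies = toy[toy["page_title"].duplicated(keep=False)].sort_values(
    "page_title", kind="stable"
)
print(all_copies)
```

Sorting by title groups the copies, so rows 0 and 62 appear side by side.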

In [None]:
df['page_title'].nunique()

6762

<b>INSIGHT:</b>
- `This re-confirms the presence of 99 duplicate values. The dataset has 6861 rows but only 6762 unique titles, and 6861 - 6762 = 99.`
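The arithmetic behind this insight can be checked directly from the two numbers printed above:

```python
total_rows = 6861     # from df.shape
unique_titles = 6762  # from df['page_title'].nunique()

# Every row beyond the first occurrence of a title is a repeat.
repeated_rows = total_rows - unique_titles
print(repeated_rows)  # 99
```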

In [None]:
# Show the rows that are not duplicates (the first occurrence of each title)
df[~df.page_title.duplicated()]

Unnamed: 0,page_title,page_text
0,Paracetamol poisoning,"Paracetamol poisoning, also known as acetamino..."
1,Acromegaly,Acromegaly is a disorder that results from exc...
2,Actinic keratosis,"Actinic keratosis (AK), sometimes called solar..."
3,Congenital adrenal hyperplasia,Congenital adrenal hyperplasia (CAH) is a grou...
4,Adrenocortical carcinoma,Adrenocortical carcinoma (ACC) is an aggressi...
...,...,...
7271,Gephyrophobia,Gephyrophobia is the anxiety disorder or speci...
7272,Coronary artery bypass surgery,"Coronary artery bypass surgery, also known as ..."
7273,Unemployment,"Unemployment, according to the OECD (Organisat..."
7274,Surgical instrument,A surgical instrument is a tool or device for ...


<b>INSIGHT:</b>
- `Shows the rows that are not duplicates (the first occurrence of each title), along with the shape of that subset.`

In [None]:
# Drop rows with a repeated page_title, keeping the first occurrence
df.drop_duplicates(subset=['page_title'], keep='first', inplace=True)

<b>INSIGHT:</b>
- `Drops the duplicate values`

In [None]:
# Check the shape to confirm the dropping of duplicate values
print(df.shape)

(6762, 2)


<b>INSIGHT:</b>

- `Confirms that the duplicates were dropped. The dataset now contains no missing values and no duplicate titles.`

- `Now this dataset is ready for MedChat development.`
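Before handing the data to later steps, the cleaning can be locked in with explicit checks. Dropping rows also leaves gaps in the index (the `df.info()` output above shows labels running up to 7275), so resetting it is a common optional step. The sketch below uses a toy frame; whether the notebook resets the index is an assumption, not something shown in this section.

```python
import pandas as pd

# Toy frame with a repeated title and a gap in the index.
toy = pd.DataFrame(
    {
        "page_title": ["Colitis", "Acromegaly", "Colitis"],
        "page_text": ["text A", "text B", "text A"],
    },
    index=[0, 1, 62],
)

# Keep the first occurrence of each title, then renumber rows from 0.
clean = toy.drop_duplicates(subset=["page_title"], keep="first").reset_index(drop=True)

# Fail loudly if the cleaning assumptions do not hold.
assert clean["page_title"].is_unique, "duplicate titles remain"
assert not clean.isnull().values.any(), "missing values remain"

print(clean.shape)
```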