# MEDIBOT - SOCIAL HEALTH INSURANCE CHATBOT PROJECT

Authors : Asam Olala, Immaculate Kithei, Derrick Waititu, Stephen Ochieng, Elizabeth Atieno

### Table of Contents 

[Data Exploration](#Data-Exploration)

[Data Preprocessing](#Data-Preprocessing)


[Modelling](#Modelling)

- [XG Boost](#XG-Boost)

- [Bert Model](#Bert-Model)

- [Rasa Model](#Rasa-Model)

[Model Comparison](#Model-Comparison)

[Conclusion](#Conclusion)

[Recommendation](#Recommendation)

[Next Steps](#Next-Steps)





### 1. Business Understanding

#### Overview 

The introduction of the Social Health Authority (SHA) as a replacement for the National Health Insurance Fund (NHIF) has caused confusion among Kenyans due to a lack of clear, timely and accessible information. This has led to misinformation, low enrollment and difficulties in registration and claims.
As a result, customer service centers are overwhelmed, healthcare access is delayed, and trust in the new scheme is declining.

#### Problem Statement 
Kenya is undergoing a major shift in its healthcare insurance system as the government pushes for full implementation of SHA. However, challenges such as misinformation, registration issues, and policy misunderstandings are slowing adoption. Proactively addressing these issues is key to ensuring a seamless transition for millions who depend on health insurance.

This project aims to close the information gap on SHA by providing accurate, timely, and clear details on healthcare coverage. Through technology, public engagement, and stakeholder collaboration, it will support a smooth transition from NHIF to SHA, enhancing healthcare access and efficiency in Kenya.

#### Main Objective 
Develop an AI Chatbot aimed at promoting understanding and increased adoption of SHA in the country by providing accurate, instant and accessible information about the scheme. 

#### Specific Objective 
- Implement an NLP model that understands user queries about scheme eligibility, benefits, contributions, claim procedures and the enrollment process.
- Implement and deploy a web-based chatbot using Flask to ensure user interactivity

#### Research Questions

- How can an NLP model accurately determine a user's eligibility for SHA based on provided details?
- Can an NLP model effectively summarize SHA’s benefits based on user queries?
- Can the chatbot validate and guide users through claim submission based on SHA’s policies?
- Can the chatbot pre-fill application forms based on extracted user input?

#### Data Understanding
The dataset for the SHA Medical chatbot was sourced from the official SHA Website [Social Health Insurance (General) Regulations, 2023](https://www.health.go.ke/sites/default/files/2023-11/SOCIAL%20HEALTH%20INSURANCE%20%28GENERAL%29%20REGULATIONS%20%2C2023.pdf)

The Intents, Domain and Stories have been developed based on the SHA Act (2023) to guarantee precise responses concerning scheme eligibility, benefits, contributions, claim procedures and the enrollment process.

##### - Question and Answer Dataset

The Question and Answer dataset for the SHA medical chatbot project contains 7,904 entries, sourced from the SHA website. It includes fully populated Question and Answer columns, capturing user inquiries and responses about the Social Health Authority scheme.

This dataset serves as a core resource for training the NLP model to address SHA-related topics like eligibility, benefits, and enrollment.

In [22]:
import pandas as pd 

df = pd.read_csv( 'sha_dataset.csv')
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Question,Answer
0,0,0.0,What is the purpose of the Primary Healthcare ...,"To purchase primary healthcare services, pay h..."
1,1,1.0,Who is eligible to access healthcare under the...,Every person resident in Kenya who is register...
2,2,2.0,What services are covered under the Primary He...,"Promotive, preventive, curative, rehabilitativ..."
3,3,3.0,What is the role of the Authority in financing...,To mobilize resources for purchasing primary h...
4,4,4.0,What is the Social Health Insurance Fund used ...,To pool all contributions and purchase healthc...


##### - Frequently Asked Questions(FAQ) Dataset

A collection of common questions and answers about the Social Health Authority (SHA), sourced from official materials. It provides clear, user-friendly details on the scheme, supporting the chatbot’s NLP training.



In [23]:
# FAQs dataset
import pdfplumber

# Path to the FAQ PDF, relative to the project root
file_path = "data/Frequently-Asked-Questions-FAQs-on-Social-Health-Authority-SHA-.pdf"

with pdfplumber.open(file_path) as pdf:
    # extract_text() may return None or an empty string for pages without text;
    # joining with a newline keeps words at page boundaries from being glued together
    faq_text = "\n".join(page.extract_text() or "" for page in pdf.pages)

# Preview the first 500 characters
faq_text[:500]




'SOCIAL HEALTH INSURANCE ACT 2023 &\nSOCIAL HEALTH INSURANCE REGULATIONS 2024\nFREQUENTLY ASKED QUESTIONS\nImportant areas to understand:\nA. Understanding Social Health\nAuthority (SHA).\nB. Institutions created by the Universal\nHealth Coverage laws and transition\nprocess.\nC. NHIF staff considerations during the\ntransition process.\nD. Primary Health Care & the PHC fund.\nE. Emergency, chronic and critical illness\nfund.\nF. Registration, means testing &\ncontributions.\nG. Benefits, tarrifs & claims\nmanage'

##### - Social Health Insurance (SHA) Act Dataset

This dataset has been extracted from the Social Health Insurance (General) Regulations, 2023. 

It is based on the official legal framework of SHA in Kenya. Sourced from the Ministry of Health, it covers eligibility, benefits, contributions, claims, and enrollment, forming the chatbot’s core knowledge base.

In [56]:
# SHA Act dataset
import pdfplumber

# Path to the regulations PDF, relative to the project root
pdf_path = "data/SOCIAL HEALTH INSURANCE (GENERAL) REGULATIONS, 2023.pdf"

with pdfplumber.open(pdf_path) as pdf:
    # extract_text() may return None or an empty string for pages without text
    sha_act = "\n".join(page.extract_text() or "" for page in pdf.pages)

# Preview the first 500 characters
sha_act[:500]

#### Data Limitations 
- SHA Act (2023) lacks real-world user scenarios for training intents and stories.
- PDF format of regulations hinders easy NLP parsing.
- No historical NHIF user query data to compare with SHA.
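
The PDF parsing limitation shows up in the previews above, where sentences are broken by hard line wraps (`\n`). The following is a minimal, dependency-free sketch of one way to normalise such text before further NLP steps. The helper name `normalize_pdf_text` is hypothetical and not part of the project code, and the rule for re-joining hyphenated words is an assumption that may wrongly merge genuine hyphenated compounds that happen to fall at a line break.

```python
import re


def normalize_pdf_text(raw: str) -> str:
    """Join hard-wrapped lines and collapse repeated whitespace (illustrative helper)."""
    # Re-join words that were hyphenated across a line break, e.g. "contri-\nbutions"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Treat the remaining newlines as ordinary spaces
    text = text.replace("\n", " ")
    # Collapse runs of whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()


# Fragment taken from the FAQ preview shown earlier
sample = (
    "A. Understanding Social Health\nAuthority (SHA).\n"
    "B. Institutions created by the Universal\nHealth Coverage laws and transition\nprocess."
)
cleaned = normalize_pdf_text(sample)
print(cleaned)
```

Splitting the cleaned text into individual questions and answers would still need rules specific to the layout of each PDF.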

The essential libraries and functions utilized in this project were consolidated in a separate notebook named My_functions, located within the project’s directory.

This My_functions notebook was subsequently imported into the current notebook, as below.

In [25]:
import My_functions as myf

#### Data Exploration

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7904 entries, 0 to 7903
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0.1  7904 non-null   int64  
 1   Unnamed: 0    932 non-null    float64
 2   Question      7904 non-null   object 
 3   Answer        7904 non-null   object 
dtypes: float64(1), int64(1), object(2)
memory usage: 247.1+ KB


#### Summary Of Data

The dataset consists of 7,904 entries and 4 columns:

- `Unnamed: 0.1` is an index-like column with no missing entries.
- `Unnamed: 0` is largely incomplete, with only 932 non-null values.
- `Question` is fully populated with text-based questions.
- `Answer` is fully populated with text-based responses.

#### Data Preprocessing

Imports relevant libraries that will be used to clean and process text data, then train a model.

In [27]:
import re
import string
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('stopwords')
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [28]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')  # Optional but recommended


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

The `DataPreprocessor` class cleans and preprocesses the text columns of a pandas DataFrame.

In [29]:
class DataPreprocessor:
    def __init__(self, df, text_columns):
        """Initialize with dataframe and text columns to process."""
        # Create a copy of the DataFrame to avoid modifying the original
        self.df = df.copy()
        # Drop unnecessary index-like columns, ignoring errors if they don't exist
        self.df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], errors='ignore', inplace=True)
        # Store the columns to preprocess
        self.text_columns = text_columns
        # Load English stopwords from NLTK
        self.stop_words = set(stopwords.words('english'))
        # Initialize NLTK's WordNet lemmatizer
        self.lemmatizer = WordNetLemmatizer()
    
    def normalize_case(self, text):
        """Convert text to lowercase."""
        return text.lower()
    
    def remove_punctuation(self, text):
        """Remove punctuation from text."""
        return text.translate(str.maketrans('', '', string.punctuation))
    
    def tokenize(self, text):
        """Tokenize the text."""
        return word_tokenize(text)
    
    def convert_non_alphabetic(self, tokens):
        """Remove non-alphabetic tokens."""
        return [token for token in tokens if token.isalpha()]
    
    def lemmatize_tokens(self, tokens):
        """Apply lemmatization to tokens."""
        return [self.lemmatizer.lemmatize(token) for token in tokens]
    
    def filter_stopwords(self, tokens):
        """Remove stopwords from token list."""
        return [token for token in tokens if token not in self.stop_words]
    
    def preprocess_text(self, text):
        """Apply all preprocessing steps to a given text."""
        #Convert to lowercase
        text = self.normalize_case(text)
        #Remove punctuation
        text = self.remove_punctuation(text)
        #Split into tokens
        tokens = self.tokenize(text)
        #Keep only alphabetic tokens
        tokens = self.convert_non_alphabetic(tokens)
        #Remove stopwords
        tokens = self.filter_stopwords(tokens)
        #Lemmatize tokens
        tokens = self.lemmatize_tokens(tokens)
        return ' '.join(tokens)
    
    def apply_preprocessing(self):
        """Apply preprocessing to all specified text columns."""
        # Iterate over each text column
        for col in self.text_columns:
            # Convert to string and apply preprocessing to every entry in the column
            self.df[col] = self.df[col].astype(str).apply(self.preprocess_text)
        # Return the fully processed DataFrame
        return self.df

In [30]:
# Instantiate the DataPreprocessor class with the original DataFrame and specify text columns to clean
# Then apply the preprocessing pipeline to the 'Question' and 'Answer' columns
df_cleaned = DataPreprocessor(df, ['Question', 'Answer']).apply_preprocessing()

# Display the first 5 rows of the cleaned DataFrame to inspect the preprocessed results
df_cleaned.head()

Unnamed: 0,Question,Answer
0,purpose primary healthcare fund,purchase primary healthcare service pay health...
1,eligible access healthcare primary healthcare ...,every person resident kenya registered member ...
2,service covered primary healthcare fund,promotive preventive curative rehabilitative p...
3,role authority financing primary healthcare fund,mobilize resource purchasing primary healthcar...
4,social health insurance fund used,pool contribution purchase healthcare service ...


In [31]:
# Display the last 20 rows of the cleaned DataFrame to inspect the preprocessed results
df_cleaned.tail(20)

Unnamed: 0,Question,Answer
7884,explain appeal process denied claim,healthcare provider appeal claim denial disput...
7885,would define appeal process denied claim,healthcare provider appeal claim denial disput...
7886,could clarify appeal process denied claim,healthcare provider appeal claim denial disput...
7887,simple term appeal process denied claim,healthcare provider appeal claim denial disput...
7888,appeal process denied claim,healthcare provider appeal claim denial disput...
7889,healthcare cost controlled within scheme,tariff reviewed periodically align economic he...
7890,right contributor scheme,contributor right quality healthcare transpare...
7891,scheme integrate private health insurance,private insurer complement coverage social hea...
7892,healthcare provider charge additional fee bene...,must adhere prescribed tariff without unauthor...
7893,scheme ensure equitable access healthcare,subsidizing contribution indigent household en...


In [32]:
# Count the total number of duplicate rows in the cleaned DataFrame
df_cleaned.duplicated().sum()

2950

In [33]:
# Remove the duplicates, keeping the first occurrence of each row
df_cleaned = df_cleaned.drop_duplicates(keep="first")
df_cleaned.duplicated().sum()

0
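
One plausible reason for the large duplicate count is that rows which differ only in casing, punctuation or filler words become identical once they are preprocessed. The sketch below illustrates this effect without pandas or NLTK. The `normalize` helper, the small `filler` set (a stand-in for the NLTK stopword list) and the three example rows are assumptions made for illustration only.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and drop a few filler words (illustrative helper)."""
    filler = {"what", "is", "the", "of", "a", "are"}  # tiny stand-in for NLTK stopwords
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in filler)


rows = [
    ("What is the purpose of the Primary Healthcare Fund?", "To purchase primary healthcare services."),
    ("What is the purpose of the primary healthcare fund", "To purchase primary healthcare services."),
    ("What services are covered?", "Promotive and preventive services."),
]

seen = set()
unique_rows = []
for question, answer in rows:
    key = (normalize(question), normalize(answer))
    if key not in seen:  # keep only the first occurrence, like keep="first"
        seen.add(key)
        unique_rows.append(key)

n_duplicates = len(rows) - len(unique_rows)
print(n_duplicates, unique_rows)
```

Here the first two rows collapse to the same normalised pair, so one of them counts as a duplicate.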

In [34]:
# Export the cleaned DataFrame to 'cleaned_sha_data.csv' for storage or further use.
# index=False avoids writing the row index as an extra 'Unnamed: 0' column.
df_cleaned.to_csv("cleaned_sha_data.csv", index=False)

#### Modelling

####  XG Boost

Import the necessary libraries for modelling.

In [35]:
import pandas as pd
import numpy as np
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import accuracy_score

Preprocesses text by converting it to lowercase, removing punctuation, tokenizing it into words using NLTK and filtering out stopwords.

In [36]:
def preprocess_text(text):
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    tokens = nltk.word_tokenize(text)
    return ' '.join([word for word in tokens if word not in stop_words])

In [39]:
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')  
stop_words = set(stopwords.words('english'))  


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [41]:
df["Processed_Question"] = df["Question"].apply(preprocess_text)
df["Processed_Question"]

0                         purpose primary healthcare fund
1       eligible access healthcare primary healthcare ...
2                services covered primary healthcare fund
3        role authority financing primary healthcare fund
4                       social health insurance fund used
                              ...                        
7899                  primary healthcare fund aim achieve
7900                     function primary healthcare fund
7901                    primary healthcare fund important
7902    primary healthcare fund support healthcare ser...
7903             responsibilities primary healthcare fund
Name: Processed_Question, Length: 7904, dtype: object

In [42]:
# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Transform the 'Processed_Question' column into a TF-IDF matrix
X = vectorizer.fit_transform(df["Processed_Question"])
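
To make the TF-IDF step more concrete, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The function name `tfidf_vectors` is hypothetical, and whitespace tokenisation is a simplification of the vectorizer's own tokeniser, so this is an illustration of the idea rather than a drop-in replacement.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


docs = [
    "purpose primary healthcare fund",
    "services covered primary healthcare fund",
    "social health insurance fund used",
]
vocab, rows = tfidf_vectors(docs)
fund, purpose = vocab.index("fund"), vocab.index("purpose")
# "fund" occurs in every document, so it gets a lower weight than the rarer "purpose"
print(rows[0][fund] < rows[0][purpose])
```

Terms that appear in almost every question, such as "fund", contribute little to telling questions apart, while rarer terms carry most of the weight.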

In [43]:
# Convert the 'Answer' column to categorical type and encode categories as numeric values
df["Answer_Label"] = df["Answer"].astype('category').cat.codes
# Assign the encoded labels to 'y' as the target variable
y = df["Answer_Label"]
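
The label encoding above treats every distinct answer string as its own class. The sketch below reproduces the idea in plain Python, assuming (as pandas does for sortable string values) that categories are ordered lexically before codes are assigned. The three example answers are made up for illustration.

```python
answers = [
    "To pool all contributions.",
    "Every person resident in Kenya.",
    "To pool all contributions.",
]

# Distinct answers in sorted order play the role of the categories
categories = sorted(set(answers))
code_of = {answer: code for code, answer in enumerate(categories)}

# Integer labels, analogous to .cat.codes
labels = [code_of[answer] for answer in answers]

# Decoding a predicted label back to its answer text
decoded = categories[labels[0]]
print(labels, decoded)
```

Because identical answers share one code, several paraphrased questions can map to the same class, while two answers that differ by a single character become separate classes.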

In [44]:
# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [45]:
# Initialize XGBoost classifier for multi-class classification
xgb_classifier = xgb.XGBClassifier(objective="multi:softmax", num_class=len(df["Answer"].unique()), eval_metric="mlogloss")

# Train the classifier on the training dataset
xgb_classifier.fit(X_train, y_train)


In [47]:
# Predict the class labels for the test dataset using the trained XGBoost model
y_pred = xgb_classifier.predict(X_test)
y_pred

array([261, 163, 412, ..., 311, 355, 149])

In [48]:
# Calculate the accuracy of the model by comparing predicted labels with actual labels
accuracy = accuracy_score(y_test, y_pred)

# Print the model's accuracy, formatted to two decimal places
print(f"Model Accuracy: {accuracy:.2f}")

Model Accuracy: 0.89


In [49]:
def get_answer(user_query):
    # Preprocess the input question
    user_query = preprocess_text(user_query)
    user_vector = vectorizer.transform([user_query])
    
    # Predict the best answer category
    predicted_label = xgb_classifier.predict(user_vector)[0]
    
    # Retrieve the corresponding answer
    answer = df[df["Answer_Label"] == predicted_label]["Answer"].values[0]
    return answer

# Example usage
user_query = "What is the purpose of the primary healthcare fund?"
response = get_answer(user_query)
print("Chatbot:", response)


Chatbot: To purchase primary healthcare services, pay health facilities, and establish a pool for receipt and payment of funds.


In [54]:
user_query = "The main purpose of the healthcare fund?"
response = get_answer(user_query)
print("Chatbot:", response)

Chatbot: Purchase primary healthcare services from primary healthcare facilities. Pay health facilities for providing quality primary healthcare services. Establish a pool for receipt and payment of funds for primary healthcare.


The model reaches 89% accuracy on the held-out 20% test split. It returns the expected answer when a query closely matches a question in the dataset. When a question is phrased in a way that is not represented in the dataset, the predicted answer may be less accurate.

####  Bert Model

In [50]:
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW  # transformers' own AdamW is deprecated in favour of the PyTorch optimizer
from transformers import RobertaForQuestionAnswering, RobertaTokenizer
from sentence_transformers import SentenceTransformer, util
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

- This code defines an extractive question-answering pipeline built on RoBERTa (`deepset/roberta-base-squad2`, fine-tuned on SQuAD 2.0). A SentenceTransformer model first retrieves the most relevant context from the dataset, and RoBERTa then extracts the answer span from that context.
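
The retrieval step can be illustrated without loading any model. The sketch below ranks contexts by cosine similarity, keeps the top `top_n`, and gives up when even the best match falls below a threshold. The function `top_contexts`, the hand-written 3-dimensional vectors and the placeholder context strings are all assumptions for illustration; in the real pipeline the vectors come from the sentence-embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_contexts(query_vec, context_vecs, contexts, threshold=0.5, top_n=3):
    """Join the top_n most similar contexts, or return None if the best score is too low."""
    scored = sorted(
        ((cosine(query_vec, vec), text) for vec, text in zip(context_vecs, contexts)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_n]
    if not scored or scored[0][0] < threshold:
        return None
    return " ".join(text for _, text in scored)


contexts = ["eligibility text", "claims text", "benefits text"]
context_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.8, 0.6, 0.0]]

on_topic = top_contexts([1.0, 0.0, 0.0], context_vecs, contexts, top_n=2)
off_topic = top_contexts([0.0, 0.0, 1.0], context_vecs, contexts, top_n=2)
print(on_topic, off_topic)
```

Note that only the best score is compared with the threshold, so lower-ranked contexts are included even when they are only weakly related to the question.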

In [51]:
class QADataset(Dataset):
    def __init__(self, dataframe, model_name='all-mpnet-base-v2'):
        self.questions = dataframe['Question'].tolist()
        self.answers = dataframe['Answer'].tolist()
        self.model = SentenceTransformer(model_name)
    
    def __len__(self):
        return len(self.questions)
    
    def __getitem__(self, idx):
        question_embedding = self.model.encode(self.questions[idx], convert_to_tensor=True)
        answer_embedding = self.model.encode(self.answers[idx], convert_to_tensor=True)
        
        return {
            'question_embedding': question_embedding,
            'answer_embedding': answer_embedding,
        }


class BERTQuestionAnswering:
    def __init__(self, model_name='deepset/roberta-base-squad2', lr=3e-5):
        self.tokenizer = RobertaTokenizer.from_pretrained(model_name)
        self.model = RobertaForQuestionAnswering.from_pretrained(model_name)
        self.optimizer = AdamW(self.model.parameters(), lr=lr)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)
        self.embedding_model = SentenceTransformer('all-mpnet-base-v2')  # Improved embedding model

    def find_relevant_context(self, question, dataframe, threshold=0.5, top_n=3):
        contexts = dataframe['Answer'].tolist()
        question_embedding = self.embedding_model.encode([question], convert_to_tensor=False)
        context_embeddings = self.embedding_model.encode(contexts, convert_to_tensor=False)
        similarities = cosine_similarity(question_embedding, context_embeddings)[0]
        
        best_indices = similarities.argsort()[-top_n:][::-1]  # Get top N most similar
        best_similarities = [similarities[i] for i in best_indices]
        
        if max(best_similarities) < threshold:
            return "No relevant context found."
        
        best_contexts = " ".join([contexts[i] for i in best_indices])
        return best_contexts
    
    def answer_question(self, question, dataframe):
        context = self.find_relevant_context(question, dataframe)
        if context == "No relevant context found.":
            return context
        
        inputs = self.tokenizer(question, context, return_tensors='pt', truncation=True, padding='longest', max_length=384).to(self.device)
        outputs = self.model(**inputs)
        
        answer_start = torch.argmax(outputs.start_logits).item()
        answer_end = torch.argmax(outputs.end_logits).item() + 1
        
        if answer_start >= answer_end or (answer_end - answer_start) > 30:
            return "Sorry, I couldn't find a relevant answer."
        
        answer = self.tokenizer.convert_tokens_to_string(
            self.tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end])
        )
        
        answer_start_score = torch.max(outputs.start_logits).item()
        answer_end_score = torch.max(outputs.end_logits).item()
        confidence = (answer_start_score + answer_end_score) / 2  # Mean of the raw start/end logits: an unnormalized score, not a probability
        
        return f"Answer: {answer} (Confidence: {confidence:.2f})"

# Load dataset
dataframe = pd.read_csv("sha_dataset.csv").drop(columns=['Unnamed: 0'])

# Instantiate the model
model = BERTQuestionAnswering()

# Ask a question without specifying context
question = "Who is eligible to access healthcare?"
answer = model.answer_question(question, dataframe)

print("Predicted Answer:", answer)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Predicted Answer: Answer:  if they are registered and contribute to the Social Health Insurance Fund (Confidence: 1.10)


In [55]:
# Ask a question without specifying context
question = "What is the role of the Primary Healthcare Fund??"
answer = model.answer_question(question, dataframe)

print("\n✅ Predicted Answer:", answer)


✅ Predicted Answer: Answer:  provides essential healthcare services to eligible individuals (Confidence: 6.08)
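
The confidence values printed above (1.10 and 6.08) are averages of raw logits, so they are not bounded between 0 and 1 and are hard to compare across questions. One possible alternative, sketched here with made-up logits, is to apply a softmax to the start and end logits and multiply the probabilities of the chosen positions. The helper `span_confidence` is hypothetical and is not part of the model code above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def span_confidence(start_logits, end_logits):
    """Return (start, end, probability) for the most likely start and end positions."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    start = max(range(len(start_probs)), key=start_probs.__getitem__)
    end = max(range(len(end_probs)), key=end_probs.__getitem__)
    # Product of the two probabilities, which always lies in (0, 1]
    return start, end, start_probs[start] * end_probs[end]


# Toy logits for a three-token input (illustrative values only)
start, end, confidence = span_confidence([0.1, 6.0, 0.2], [0.0, 0.3, 5.5])
print(start, end, round(confidence, 3))
```

A score on this scale could then be compared against a fixed cut-off to decide when the chatbot should decline to answer.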
