#### **CLM Service Delivery Prediction App**

CLM stands for Community-led Monitoring, a framework for collecting data in which the data is owned by the community. The community in this case is PLHIV (People Living with HIV) who were interviewed at various facilities in Kenya about their experience of service delivery. Service delivery can sway a patient's health outcome and should therefore be treated with the utmost care.

The project focuses on predicting patient satisfaction with HIV healthcare services by analyzing factors such as service availability, quality of care, confidentiality, facility accessibility, and patient demographics. The goal is to enhance the quality of care for HIV patients: the predictive model will help identify areas for improvement, optimize resource allocation, and improve patient satisfaction and health outcomes.

The dataset consists of approximately 36,000 patient responses from healthcare facility surveys, including demographic details, healthcare experiences, and service delivery satisfaction levels. The objective is to use machine learning to predict satisfaction, determine key satisfaction drivers, and guide healthcare providers in resource allocation for service delivery improvements.

##### About the app

This app pre-processes the survey data and applies a machine learning algorithm to predict how satisfied a patient is with the services. The outputs of the predictive model can inform decisions on improving service delivery and, in turn, patient satisfaction.

##### Import the dependencies and the data

In [1]:
import numpy as np
import pandas as pd
import re
from scipy.stats import randint, uniform
import joblib

import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report, roc_auc_score, precision_recall_curve, roc_curve, auc, average_precision_score, confusion_matrix

from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

In [2]:
imonitor = pd.read_csv('data/imonitor_1703.csv')
imonitor.head()


Unnamed: 0,Survey ID,Created Date,Facility ownership,Please specify,County,What is your month; and year of birth,How do you consider yourself?,What is the highest level of education you completed?,Please specify.1,What is your current marital status?,...,how long do you wait on average to get a service; which service was that?,Do you consider the waiting time for lab test results long?,how long do you wait on average to get your lab test result?,Does the facility offer support groups?,Specify the support group you belong to,In your opinion are the services offered at this facility youth friendly?,What measures have been put in place to create GBV awareness and its harmful effects within the community?,Please Specify,PWD In your opinion are the services offered at this facility persons-with-disability friendly?,What are the top 1-3 things you don’t like about this facility with regards to care and treatment?
0,2390063,04-Dec-23,GOK,,Nairobi,03/09/1977,Male,Primary school,,Married,...,,No,,Yes,Adults,Yes,Presence of GBV Desk;,Chiefs office,Yes,
1,2390062,04-Dec-23,GOK,,Nairobi,12/08/1972,Female,Secondary school,,Married,...,,No,,No,,Yes,Presence of GBV Desk;,Chiefs office,,
2,2390061,04-Dec-23,GOK,,Nairobi,31/08/1984,Female,Primary school,,Married,...,,Yes,2 hours,No,,Yes,Presence of GBV Desk;,Chiefs office,Yes,
3,2390060,04-Dec-23,GOK,,Nairobi,07/05/1977,Female,Primary school,,Married,...,,No,,No,,Yes,Presence of GBV Desk;,Police station,,
4,2390059,04-Dec-23,GOK,,Nairobi,13/06/1987,Male,Vocational training or technician,,Married,...,1 hour,Yes,2 hours,Yes,Adults,Yes,Presence of GBV Desk;,Police station,,


In [3]:
# Check the shape of the data
imonitor.shape

(36511, 82)

##### Cleaning and pre-processing the data

This step removes columns that are not necessary for analysis, likely because they contain redundant or uninformative text such as "Please specify". Pruning helps to focus on more relevant features.

Columns are renamed to have more descriptive titles. This improves readability and may aid in understanding the features' roles in subsequent analysis.

This section also defines custom functions that find inconsistent values in the data and standardize them to a uniform format, which is important for data quality.

In [4]:
# Find and drop columns that contain "Please specify" or "Please Specify"
cols_to_drop = [col for col in imonitor.columns if "Please specify" in col or "Please Specify" in col]

# Drop these columns from the DataFrame in a single operation
imonitor.drop(cols_to_drop, axis=1, inplace=True)
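
As a small illustration of the pruning step, the sketch below uses a toy DataFrame with hypothetical column names (they only mimic the survey headers) and a case-insensitive match, which covers both spellings in a single test. It is an alternative way to express the same filter, not the notebook's own code.

```python
import pandas as pd

# Toy frame with hypothetical columns mimicking the survey headers
toy = pd.DataFrame(columns=["County", "Please specify", "Please Specify", "Please specify.1"])

# Case-insensitive match catches "Please specify" and "Please Specify" alike
mask = toy.columns.str.contains("please specify", case=False)

# Keep only the columns that did not match
trimmed = toy.loc[:, ~mask]
print(list(trimmed.columns))
```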

In [5]:
imonitor.columns = imonitor.columns.map(lambda x: x.strip())

In [6]:
for c in imonitor.columns:
    print(c)

Survey ID
Created Date
Facility ownership
County
What is your month; and year of birth
How do you consider yourself?
What is the highest level of education you completed?
What is your current marital status?
Which county do you currently live in?
What are your sources of income?
For how long have you been accessing services (based on the expected package of services) in this facility?
Are you aware of the package of services that you are entitled to?
According to you; which HIV related services are you likely to receive in this facility?
Is there a service that you needed that was not provided?
For that service that was not provided; were you referred?
If referred; did you receive the service where you were referred to?
If Yes which Service/Test/Medicine
On a scale of 1 to 5; how satisfied are you with the package of services received in this facility? If 1 is VERY UNSATISFIED and 5 is VERY SATISFIED.
What did you like about the services you received?
What did you not like about the se

In [7]:
columns_to_drop = [
    "Survey ID",
    "What is your month; and year of birth",
    "How do you consider yourself?",
    "What is the highest level of education you completed?",
    "What is your current marital status?",
    "Which county do you currently live in?",
    "What are your sources of income?",
    "What did you like about the services you received?",
    "What did you not like about the services you received?",
    "In your opinion what would you like to be improved?",
    "In your opinion what can be done to improve access to the services you seek at the facility?",
    "Why",
    "Were reasons provided as to why these services were not available?",
    "Were reasons provided as to why these services were not available?.1",
    "What are the barriers to uptake of VMMC by males 25+years and above?",
    "What are some of the current site level practices that community members like and would love to maintain for KP/PP ?",
    "What would you like this facility to change/do better?",
    "Throughout your visit what did you find interesting/pleasing about this facility that should be emulated by other facilities?",
    "What do you think can be improved",
    "Anything else that you would like to mention?",
    "What are the top 1-3 things you like about this facility with regards to care and treatment?",
    "What are the top 1-3 things you don’t like about this facility with regards to care and treatment?",
    "how long do you wait on average to get a service; which service was that?",
    "how long do you wait on average to get your lab test result?",
    "Specify the support group you belong to"
]

# Drop the columns
imonitor.drop(columns=columns_to_drop, axis=1, inplace=True)

In [8]:
column_name_mapping = {
    "Created Date": "Date",
    "Facility ownership": "FacilityOwnership",
    "County": "FacilityCounty",
    "For how long have you been accessing services (based on the expected package of services) in this facility?": "ServiceAccessDuration",
    "Are you aware of the package of services that you are entitled to?": "ServicesAwareness",
    "According to you; which HIV related services are you likely to receive in this facility?": "ExpectedHIVServices",
    "Is there a service that you needed that was not provided?": "UnprovidedService",
    "Facility name no service": "UnprovidedServiceFacilityName",
    "For that service that was not provided; were you referred?": "ReferralForUnprovidedService",
    "If referred; did you receive the service where you were referred to?": "ReferralServiceReceived",
    "If Yes which Service/Test/Medicine": "ReceivedServiceDetail",
    "On a scale of 1 to 5; how satisfied are you with the package of services received in this facility? If 1 is VERY UNSATISFIED and 5 is VERY SATISFIED.": "ServiceSatisfaction",
    "Do you face any challenges when accessing the services at the facility?": "AccessChallenges",
    "Common issues that can be added in the drop-down box": "CommonIssuesDropdown",
    "Was confidentiality considered while you were being served?": "Confidentiality",
    "Are there age-appropriate health services for specific groups?": "AgeAppropriateServices",
    "Does the facility allow you to share your concerns with the administration?": "ConcernsSharing",
    "Do you know your health-related rights as a client of this facility?": "RightsAwareness",
    "Have you ever been denied services at this facility?": "ServiceDenial",
    "Are you comfortable with getting services at this facility": "ComfortWithServices",
    "Have you ever been counseled?": "CounselingReceived",
    "Did you identify any gaps in the facility when you tried to access the services": "IdentifiedGaps",
    "Service type": "ServiceGapsType",
    "Are the HIV testing services readily available when required?": "HIVTestingAvailability",
    "Have you ever Interrupted your treatment?": "TreatmentInterruption",
    "Are the PMTCT services readily available when required?": "PMTCTServiceAvailability",
    "Are the HIV prevention; testing; treatment and care services adequate for KPs?": "KPServiceAdequacy",
    "Facility Level": "FacilityLevel",
    "Facility Operation times": "OperationTimes",
    "Facility Operation Days": "OperationDays",
    "What are your preferred days of visiting the facility": "PreferredVisitDays",
    "What are your preferred time of visiting the facility": "PreferredVisitTimes",
    "On a scale of 1-5; how clean do you find the facility?": "FacilityCleanliness",
    "How do you reach this facility?": "FacilityAccessMode",
    "How long does it take to reach this facility?": "FacilityAccessTime",
    "On a scale of 1-5; how accessible do you find this facility?": "FacilityAccessibility",
    "Do you consider the waiting time to be seen at this facility long?": "GeneralWaitingTime",
    "Do you consider the waiting time for lab test results long?": "LabResultsWaitingTime",
    "Does the facility offer support groups?": "SupportGroupAvailability",
    "In your opinion are the services offered at this facility youth friendly?": "YouthFriendlyServices",
    "What measures have been put in place to create GBV awareness and its harmful effects within the community?": "GBVAwarenessMeasures",
    "PWD In your opinion are the services offered at this facility persons-with-disability friendly?": "PWDFriendlyServicesOpinion"
}

# Rename the columns using the mapping above
df = imonitor.rename(columns=column_name_mapping)

In [9]:
for c in df.columns:
    print(c)

Date
FacilityOwnership
FacilityCounty
ServiceAccessDuration
ServicesAwareness
ExpectedHIVServices
UnprovidedService
ReferralForUnprovidedService
ReferralServiceReceived
ReceivedServiceDetail
ServiceSatisfaction
AccessChallenges
CommonIssuesDropdown
Confidentiality
AgeAppropriateServices
ConcernsSharing
RightsAwareness
ServiceDenial
ComfortWithServices
CounselingReceived
IdentifiedGaps
ServiceGapsType
HIVTestingAvailability
TreatmentInterruption
PMTCTServiceAvailability
KPServiceAdequacy
FacilityLevel
OperationTimes
OperationDays
PreferredVisitDays
PreferredVisitTimes
FacilityCleanliness
FacilityAccessMode
FacilityAccessTime
FacilityAccessibility
GeneralWaitingTime
LabResultsWaitingTime
SupportGroupAvailability
YouthFriendlyServices
GBVAwarenessMeasures
PWDFriendlyServicesOpinion


In [10]:
columns_to_clean1 = [
    'GeneralWaitingTime',
    'LabResultsWaitingTime'
]

def replace_dont_know(df, column):
    df[column] = df[column].replace("Dont Know", "Do not know", regex=False)
    return df

for column in columns_to_clean1:
    df = replace_dont_know(df, column)

In [11]:
columns_to_clean2 = [
    'FacilityCleanliness',
    'FacilityAccessibility'
    ]

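# Text labels for 1-5 ratings stored as ints or as strings starting with a digit
RATING_LABELS = {
    1: 'Very Unsatisfied',
    2: 'Unsatisfied',
    3: 'Okay',
    4: 'Satisfied',
    5: 'Very Satisfied'
}

def replace_mixed_with_text(df, column_name):
    def replace_value(value):
        # Strings that start with a digit are identified by that digit;
        # the slice avoids an IndexError on empty strings
        if isinstance(value, str) and value[:1].isdigit():
            num = int(value[0])
        elif isinstance(value, int):
            num = value
        else:
            # Leave everything else (NaN, floats, text answers) unchanged
            return value

        return RATING_LABELS.get(num, value)

    df[column_name] = df[column_name].apply(replace_value)
    return df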

for column in columns_to_clean2:
    df = replace_mixed_with_text(df, column)

In [12]:
def standardize_satisfaction(df, column_name):
    # Mapping for consolidating variations of satisfaction levels
    satisfaction_map = {
        '5': 'Very Satisfied',
        5.0: 'Very Satisfied',
        '4': 'Satisfied',
        4.0: 'Satisfied',
        '3': 'Okay',
        3.0: 'Okay',
        '2': 'Unsatisfied',
        2.0: 'Unsatisfied',
        '1': 'Very Unsatisfied',
        1.0: 'Very Unsatisfied',
        'Dissatisfied': 'Unsatisfied'
    }
    
    # Replace values based on the map
    df[column_name] = df[column_name].replace(satisfaction_map)
    return df

df = standardize_satisfaction(df, 'ServiceSatisfaction')


In [13]:
print(df['FacilityLevel'].value_counts())

FacilityLevel
4.0    4732
3.0    4476
2.0    2853
5.0    2193
1.0     556
6.0      14
Name: count, dtype: int64


In [14]:
def standardize_facility(df, column_name):
    # Mapping from numeric facility level codes to descriptive level names
    facility_level_map = {
        1.0: 'Community Health Unit',
        2.0: 'Dispensaries and Private Clinics',
        3.0: 'Health Centers',
        4.0: 'Sub-County Hospitals',
        5.0: 'County Referral Hospitals',
        6.0: 'National Referral Hospitals',
    }

    # Replace values based on the map
    df[column_name] = df[column_name].replace(facility_level_map)
    return df

df = standardize_facility(df, 'FacilityLevel')

In [15]:
def replace_symbols_and_words1(df, column_name):
    df[column_name] = df[column_name].str.replace('<', 'Less than', regex=False)
    df[column_name] = df[column_name].str.replace('>', 'More than', regex=False)
    df[column_name] = df[column_name].str.replace('minutes', 'mins', regex=False)
    return df

df = replace_symbols_and_words1(df, 'FacilityAccessTime')

In [16]:
def replace_symbols_and_words2(df, column_name):
    df[column_name] = df[column_name].str.replace('Less than 30mins', 'Less than 30 mins', regex=False)
    df[column_name] = df[column_name].str.replace('More than45 mins', 'More than 45 mins', regex=False)
    return df

df = replace_symbols_and_words2(df, 'FacilityAccessTime')

In [17]:
def replace_county(df, column_name):
    df[column_name] = df[column_name].str.replace('Homabay', 'Homa Bay', regex=False)
    return df

df = replace_county(df, 'FacilityCounty')

In [18]:
def convert_mixed_dates(date_column):
    """
    This function takes a Pandas Series of mixed dates and Excel serial dates and converts them to datetime objects.
    
    Parameters:
    date_column (pd.Series): A pandas Series with mixed date formats and serial dates.
    
    Returns:
    pd.Series: A pandas Series with all dates converted to datetime objects.
    """
    excel_epoch = pd.Timestamp('1899-12-30')
    converted_dates = []
    for date in date_column:
        if isinstance(date, str) and re.match(r'^\d+(\.\d+)?$', date):
            serial_value = float(date)
            converted_date = excel_epoch + pd.to_timedelta(serial_value, unit='D')
        elif isinstance(date, (int, float)):
            converted_date = excel_epoch + pd.to_timedelta(date, unit='D')
        else:
            converted_date = pd.to_datetime(date, errors='coerce')
        converted_dates.append(converted_date)
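    # Keep the original index so the result aligns with the DataFrame on assignment
    return pd.Series(converted_dates, index=date_column.index)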

df['Date'] = convert_mixed_dates(df['Date'])
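
To make the serial-date branch concrete, here is a minimal sketch of the same arithmetic on a single value. The helper name `excel_serial_to_timestamp` and the sample serial `45264` are illustrative assumptions, not values taken from the dataset; the epoch of 1899-12-30 is the one used in the function above.

```python
import pandas as pd

EXCEL_EPOCH = pd.Timestamp("1899-12-30")

def excel_serial_to_timestamp(serial):
    """Convert an Excel serial day number (number or numeric string) to a Timestamp."""
    return EXCEL_EPOCH + pd.to_timedelta(float(serial), unit="D")

# Hypothetical serial value, chosen to land on the date seen in the sample rows
converted = excel_serial_to_timestamp("45264")

# The same day written in the text format that appears in the data preview
parsed = pd.to_datetime("04-Dec-23", format="%d-%b-%y")

print(converted, parsed)
```

Both representations resolve to the same calendar day, which is why the two branches can share one output column.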

In [19]:
def standardize_gbv_awareness(df, column_name):
    df[column_name] = df[column_name].str.replace('Is there a desk to report GBV as community or individual', 'Presence of GBV Desk', regex=False)
    df[column_name] = df[column_name].str.replace('Are there training events on GBV for the community', 'Community trained on GBV', regex=False)
    return df

df = standardize_gbv_awareness(df, 'GBVAwarenessMeasures')

##### Feature engineering

The goal of this section is to encode the data, which consists mostly of categorical variables, into a numeric format suitable for modelling. Multi-select answers (several options separated by `;`) are split into one binary (0/1) column per option, and the remaining categorical variables are one-hot encoded later on. This step is necessary because many machine learning algorithms require numerical input.

In [20]:
def encode_multi_select(df, columns):
    for col in columns:
        split_series = df[col].str.replace(' ', '').str.split(';')
        encoded = split_series.str.join('|').str.get_dummies()
        encoded.columns = [f"{col}_{option}" for option in encoded.columns]
        df = df.join(encoded)
    return df
columns_to_encode = ['ExpectedHIVServices', 'OperationTimes', 'OperationDays', 'PreferredVisitDays', 'PreferredVisitTimes', 'GBVAwarenessMeasures']

df2 = encode_multi_select(df, columns_to_encode)
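
The effect of the multi-select encoding is easiest to see on a toy column. The sketch below uses made-up answers and the shorter `str.get_dummies(sep=';')` form, which should give the same result as the split-and-join approach above; treat it as an illustration rather than a drop-in replacement.

```python
import pandas as pd

# Made-up multi-select answers; the trailing ';' and the missing value mimic the raw data
toy = pd.DataFrame({"OperationDays": ["Monday;Tuesday", "Tuesday;", None]})

# Remove spaces, then create one 0/1 column per option
encoded = toy["OperationDays"].str.replace(" ", "").str.get_dummies(sep=";")
encoded.columns = [f"OperationDays_{option}" for option in encoded.columns]

print(encoded)
```

Missing answers end up as a row of zeros, and the empty option produced by a trailing `;` does not create a column.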

In [21]:
df2.drop(columns=columns_to_encode, axis=1, inplace=True)

In [22]:
for c in df2.columns:
    print(c)

Date
FacilityOwnership
FacilityCounty
ServiceAccessDuration
ServicesAwareness
UnprovidedService
ReferralForUnprovidedService
ReferralServiceReceived
ReceivedServiceDetail
ServiceSatisfaction
AccessChallenges
CommonIssuesDropdown
Confidentiality
AgeAppropriateServices
ConcernsSharing
RightsAwareness
ServiceDenial
ComfortWithServices
CounselingReceived
IdentifiedGaps
ServiceGapsType
HIVTestingAvailability
TreatmentInterruption
PMTCTServiceAvailability
KPServiceAdequacy
FacilityLevel
FacilityCleanliness
FacilityAccessMode
FacilityAccessTime
FacilityAccessibility
GeneralWaitingTime
LabResultsWaitingTime
SupportGroupAvailability
YouthFriendlyServices
PWDFriendlyServicesOpinion
ExpectedHIVServices_ARTmedicine
ExpectedHIVServices_CD4COUNT
ExpectedHIVServices_Cervicalcancerscreening
ExpectedHIVServices_ChestXray(thiscapturesonlyonesectorofclientswithchestissues)
ExpectedHIVServices_CondomDistribution
ExpectedHIVServices_Contraceptives
ExpectedHIVServices_Diagnosis
ExpectedHIVServices_Different

In [23]:
missing_percentage = df2.isnull().mean() * 100

threshold = 60

columns_to_drop = missing_percentage[missing_percentage > threshold].index.tolist()

print("Columns to drop:", columns_to_drop)

print("Number of columns to drop:", len(columns_to_drop))

df2.drop(columns=columns_to_drop, axis=1, inplace=True)

print("DataFrame shape after dropping columns:", df2.shape)

Columns to drop: ['ReferralForUnprovidedService', 'ReferralServiceReceived', 'ReceivedServiceDetail', 'CommonIssuesDropdown', 'ServiceGapsType', 'HIVTestingAvailability', 'TreatmentInterruption', 'PMTCTServiceAvailability', 'KPServiceAdequacy', 'YouthFriendlyServices', 'PWDFriendlyServicesOpinion']
Number of columns to drop: 11
DataFrame shape after dropping columns: (36511, 75)


In [24]:
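# Share of columns that must be non-null for a row to be kept (100 keeps only complete rows)
threshold_percentage = 100

# dropna(thresh=...) expects the minimum number of non-null values per row
threshold = int(len(df2.columns) * (threshold_percentage / 100))

data = df2.dropna(thresh=threshold).copy()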

print("Original DataFrame shape:", df2.shape)
print("Cleaned DataFrame shape:", data.shape)

rows_dropped = df2.shape[0] - data.shape[0]
print("Rows dropped:", rows_dropped)

Original DataFrame shape: (36511, 75)
Cleaned DataFrame shape: (13575, 75)
Rows dropped: 22936


In [25]:
data['ServiceSatisfaction'].value_counts()

ServiceSatisfaction
Very Satisfied      6451
Satisfied           6243
Okay                 445
Unsatisfied          364
Very Unsatisfied      72
Name: count, dtype: int64

In [26]:
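# Collapse the answers into three classes: 2 = very satisfied, 1 = satisfied/okay, 0 = unsatisfied
# 99 marks non-answers, which are filtered out in the next cell
recategorization_mapping = {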
    'Very Satisfied': 2,
    'Satisfied': 1,
    'Okay': 1,
    'Unsatisfied': 0,
    'Very Unsatisfied': 0,
    'Do not know': 99,
    'Prefer not to answer ': 99
}

data.loc[:, 'ServiceSatisfaction'] = data['ServiceSatisfaction'].replace(recategorization_mapping)

data['ServiceSatisfaction'] = data['ServiceSatisfaction'].astype(int)

print(data['ServiceSatisfaction'].value_counts())

ServiceSatisfaction
1    6688
2    6451
0     436
Name: count, dtype: int64



In [27]:
model_data = data[data.ServiceSatisfaction != 99]

In [28]:
model_df = model_data.drop(['Date', 'FacilityCounty', 'FacilityOwnership'], axis=1)

In [29]:
value_counts1 = model_df['ServiceSatisfaction'].value_counts().reset_index()
value_counts1.columns = ['ServiceSatisfaction', 'Counts']

fig = px.bar(value_counts1, x='ServiceSatisfaction', y='Counts',
             title='Service Satisfaction Value Counts (Imbalanced)',
             labels={'ServiceSatisfaction': 'Service Satisfaction', 'Counts': 'Count'})

fig.show()

In [30]:
# Split by class: 2 = very satisfied, 1 = satisfied/okay, 0 = unsatisfied
class_2_df = model_df[model_df['ServiceSatisfaction'] == 2]
class_1_df = model_df[model_df['ServiceSatisfaction'] == 1]
class_0_df = model_df[model_df['ServiceSatisfaction'] == 0]

# Undersample the two larger classes down to the size of the smallest class
target_number = class_0_df.shape[0]

class_2_sampled_df = class_2_df.sample(n=target_number, random_state=42)
class_1_sampled_df = class_1_df.sample(n=target_number, random_state=42)

balanced_df = pd.concat([class_2_sampled_df, class_1_sampled_df, class_0_df])
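
The undersampling above can also be written generically with `groupby(...).sample(...)`, so it does not depend on knowing which class is the smallest. The sketch below runs on a toy frame with made-up counts and is only one possible formulation.

```python
import pandas as pd

# Toy data with an imbalanced target (made-up counts)
toy = pd.DataFrame({
    "ServiceSatisfaction": [2] * 6 + [1] * 5 + [0] * 2,
    "feature": range(13),
})

# Size of the smallest class
minority_size = int(toy["ServiceSatisfaction"].value_counts().min())

# Draw the same number of rows from every class
balanced = toy.groupby("ServiceSatisfaction").sample(n=minority_size, random_state=42)

print(balanced["ServiceSatisfaction"].value_counts())
```

Note the trade-off: balancing this way discards most of the majority-class rows, which is why the notebook's data shrinks from 13,575 to 1,308 rows.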

In [31]:
value_counts2 = balanced_df['ServiceSatisfaction'].value_counts().reset_index()
value_counts2.columns = ['ServiceSatisfaction', 'Counts']

fig = px.bar(value_counts2, x='ServiceSatisfaction', y='Counts',
             title='Service Satisfaction Value Counts (Balanced)',
             labels={'ServiceSatisfaction': 'Service Satisfaction', 'Counts': 'Count'})

fig.show()

In [32]:
# Use a list of column names: `in` on a Series checks its index labels, not column names
ordinal_vars = ['ServiceSatisfaction']
nominal_vars = [col for col in balanced_df.columns if balanced_df[col].dtype == 'object' and col not in ordinal_vars]
encoded_data = pd.get_dummies(balanced_df, columns=nominal_vars)

print("NaN counts after pandas get_dummies:", encoded_data.isnull().sum().sum())

NaN counts after pandas get_dummies: 0


In [33]:
encoded_data.shape

(1308, 109)

In [34]:
encoded_data.to_csv('data/saved_data.csv')

In [35]:
X = encoded_data.drop('ServiceSatisfaction', axis=1)
y = encoded_data['ServiceSatisfaction']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

##### Data modelling

The following classification models were chosen for this problem: CatBoost, LightGBM, XGBoost, Random Forest, Logistic Regression, and SVC.

In [36]:
def test_models(X_train, y_train, X_test, y_test):
    models = {
        'CatBoostClassifier': CatBoostClassifier(verbose=0),
        'LGBMClassifier': LGBMClassifier(),
        'XGBClassifier': XGBClassifier(use_label_encoder=False, eval_metric='mlogloss'),
        'RandomForestClassifier': RandomForestClassifier(),
        'LogisticRegression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, multi_class='ovr')),
        'SVC': make_pipeline(StandardScaler(), SVC(probability=True, decision_function_shape='ovr'))
    }
    
    best_model = None
    best_score = -1
    model_results = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        roc_auc = roc_auc_score(y_test, model.predict_proba(X_test), multi_class='ovr', average='weighted') if hasattr(model, "predict_proba") else None
        report = classification_report(y_test, y_pred, output_dict=True)
        
        model_result = {
            'Model': name,
            'ROC AUC': roc_auc,
            'Accuracy': report['accuracy'],
            'F1 Score': report['weighted avg']['f1-score'],
        }
        model_results.append(model_result)
        
        # Check if this model is the best based on ROC AUC
        if roc_auc is not None and roc_auc > best_score:
            best_score = roc_auc
            best_model = model

    return pd.DataFrame(model_results), best_model

results_df, best_model = test_models(X_train, y_train, X_test, y_test)
print("Best model is:", best_model)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000880 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 170
[LightGBM] [Info] Number of data points in the train set: 915, number of used features: 85
[LightGBM] [Info] Start training from score -1.075921
[LightGBM] [Info] Start training from score -1.092076
[LightGBM] [Info] Start training from score -1.128565
Best model is: RandomForestClassifier()


##### Model Optimization and Evaluation 

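CatBoost was selected for further tuning. Note that the comparison run shown above printed `RandomForestClassifier()` as the top model by weighted ROC AUC; the ranking may change between runs because the random forest is fitted without a fixed `random_state`. Randomized search was then used to look for better CatBoost hyperparameters.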

In [37]:
model = CatBoostClassifier(verbose=0, thread_count=-1)
param_distributions = {
    'iterations': randint(100, 1000),
    'learning_rate': uniform(0.01, 0.3),  
    'depth': randint(3, 10),
    'l2_leaf_reg': randint(1, 10)
}

# Setup Randomized Search with Cross-Validation
random_search = RandomizedSearchCV(model, param_distributions, n_iter=10, cv=3, n_jobs=-1, random_state=42, verbose=3)
random_search.fit(X_train, y_train)

best_parameters = random_search.best_params_
print("Best Parameters:", best_parameters)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best Parameters: {'depth': 8, 'iterations': 352, 'l2_leaf_reg': 9, 'learning_rate': 0.09736874205941257}
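
When reading the search space, it helps to remember how the SciPy distributions are parameterised: `uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, and `randint(low, high)` excludes `high`. The sketch below samples from the same two distributions to show the ranges actually searched; the sample size and seed are arbitrary choices.

```python
from scipy.stats import randint, uniform

# Same distribution objects as in the search space above
learning_rate_dist = uniform(0.01, 0.3)  # loc=0.01, scale=0.3, so up to 0.31
depth_dist = randint(3, 10)              # integers 3 to 9, upper bound excluded

# Draw samples to inspect the ranges actually covered
lr_samples = learning_rate_dist.rvs(size=1000, random_state=0)
depth_samples = depth_dist.rvs(size=1000, random_state=0)

print("learning_rate range:", lr_samples.min(), lr_samples.max())
print("depth values:", sorted(set(depth_samples.tolist())))
```

If an upper limit of 0.3 for the learning rate is intended, `uniform(0.01, 0.29)` would be the matching definition.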


In [38]:
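# Hyperparameters hard-coded from an earlier RandomizedSearchCV run.
# Note: they differ from the best parameters printed by the search above;
# pass `best_parameters` instead to use the most recent search result.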
best_params = {
    'depth': 9, 
    'iterations': 221, 
    'l2_leaf_reg': 3, 
    'learning_rate': 0.039992474745400866
}

# Create a new CatBoost model using the best parameters
best_cat_model = CatBoostClassifier(
    verbose=0,
    thread_count=-1,
    **best_params
)

# Train the new model on training data
best_cat_model.fit(X_train, y_train)

<catboost.core.CatBoostClassifier at 0x21d8a732c20>

In [39]:
# Save the model to a file
joblib.dump(best_cat_model, 'catboost_model_clm.pkl')

['catboost_model_clm.pkl']

In [40]:
# Predict the values
y_pred = best_cat_model.predict(X_test)

unique_classes = np.unique(y_test)

report_dict = classification_report(y_test, y_pred, target_names=[str(cls) for cls in unique_classes], output_dict=True)

report_best_model = pd.DataFrame(report_dict)

report_best_model = report_best_model.transpose()

report_best_model = report_best_model.drop(index=['macro avg', 'weighted avg'])

report_best_model

Unnamed: 0,precision,recall,f1-score,support
0,0.922414,0.862903,0.891667,124.0
1,0.68323,0.852713,0.758621,129.0
2,0.827586,0.685714,0.75,140.0
accuracy,0.796438,0.796438,0.796438,0.796438
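
As a quick sanity check on the report, the overall accuracy can be recovered from the per-class recall and support, since recall times support gives the number of correctly classified samples in each class. The sketch below simply copies the figures from the table above.

```python
# Per-class recall and support copied from the classification report
recall = {0: 0.862903, 1: 0.852713, 2: 0.685714}
support = {0: 124, 1: 129, 2: 140}

# Correct predictions per class = recall * support (rounded to whole samples)
correct = {cls: round(recall[cls] * support[cls]) for cls in recall}

# Accuracy = all correct predictions / all test samples
accuracy = sum(correct.values()) / sum(support.values())

print(correct, round(accuracy, 6))
```

The result matches the reported accuracy and shows that class 2 (very satisfied) contributes the most misclassifications.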


##### Model Predictions

In [42]:
# Compute ROC curve and ROC area for each class
n_classes = len(np.unique(y_test))
y_test_bin = label_binarize(y_test, classes=np.arange(n_classes))
y_score = best_cat_model.predict_proba(X_test)

fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

fpr["micro"], tpr["micro"], _ = roc_curve(y_test_bin.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

fig = go.Figure()
for i in range(n_classes):
    fig.add_trace(go.Scatter(x=fpr[i], y=tpr[i], mode='lines', name=f'Class {i} (area = {roc_auc[i]:0.2f})'))

fig.add_shape(type='line', line=dict(dash='dash'), x0=0, x1=1, y0=0, y1=1)
fig.update_layout(title='Multiclass ROC Curve', xaxis_title='False Positive Rate', yaxis_title='True Positive Rate')
fig.show()

In [43]:
# Compute Precision-Recall and plot for each class
precision = dict()
recall = dict()
average_precision = dict()
for i in range(n_classes):
    precision[i], recall[i], _ = precision_recall_curve(y_test_bin[:, i], y_score[:, i])
    average_precision[i] = average_precision_score(y_test_bin[:, i], y_score[:, i])

fig = go.Figure()
for i in range(n_classes):
    fig.add_trace(go.Scatter(x=recall[i], y=precision[i], mode='lines', name=f'Class {i} (area = {average_precision[i]:0.2f})'))

fig.update_layout(title='Multiclass Precision-Recall Curve', xaxis_title='Recall', yaxis_title='Precision')
fig.show()

In [44]:
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
fig = ff.create_annotated_heatmap(z=cm, x=[f'Predicted {i}' for i in range(n_classes)], y=[f'Actual {i}' for i in range(n_classes)], colorscale='Viridis')

fig.update_layout(title='Confusion Matrix', xaxis=dict(title='Predicted label'), yaxis=dict(title='True label'))
fig.show()

In [45]:
feature_importances = best_cat_model.feature_importances_

importances = pd.Series(feature_importances, index=X_train.columns)

top_10_importances = importances.sort_values(ascending=False)[:10][::-1]

fig = px.bar(top_10_importances, x=top_10_importances.values, y=top_10_importances.index, orientation='h',
             labels={'x': 'Importance', 'index': 'Feature'},
             title='Top 10 Feature Importances (Highest to Lowest)')

fig.show()