Disease classification model with decision trees - clusters as labels

In [1]:
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.feature_selection import SelectFromModel, RFE
import warnings

warnings.filterwarnings(action='ignore')
pd.set_option('display.max_columns', None)

In [2]:
# Load dataset
scaled_df = pd.read_csv('../Data/scaled_df.csv')

In [3]:

# Checking the prepared dataset
scaled_df.head()

Unnamed: 0,Age,Gender,Sickness_Duration_Months,RBC_Count,Hemoglobin,Hematocrit,MCV,MCH,MCHC,RDW,Reticulocyte_Count,WBC_Count,Neutrophils,Lymphocytes,Monocytes,Eosinophils,Basophils,PLT_Count,MPV,ANA,Esbach,MBL_Level,ESR,C3,C4,CRP,Anti-dsDNA,Anti-Sm,Rheumatoid factor,ACPA,Anti-TPO,Anti-Tg,Anti-SMA,Low-grade fever,Fatigue or chronic tiredness,Dizziness,Weight loss,Rashes and skin lesions,Stiffness in the joints,Brittle hair or hair loss,Dry eyes and/or mouth,General unwell feeling,Joint pain,Anti_dsDNA,Anti_enterocyte_antibodies,ASCA,Anti_BP180,ASMA,IgG_IgE_receptor,Anti_SRP,Anti_La_SSB,Anti_Jo1,Anti_desmoglein_1,EMA,Anti_type_VII_collagen,C1_inhibitor,Anti_epidermal_basement_membrane_IgA,Anti_OmpC,pANCA,Anti_tissue_transglutaminase,anti_Scl_70,Anti_Mi2,Anti_parietal_cell,Progesterone_antibodies,Anti_Sm,Diseases_ID,Diseases_eng,Diseases_group
0,0.712121,1,0.218487,0.016667,0.424,0.596429,0.828276,0.966667,0.424,0.226667,0.632,0.534817,0.938889,0.16,0.9075,0.755,0.58,0.822857,0.216,1,0.482759,0.894444,0.816327,0.836364,0.78,0.31,1,1,0,1,0,1,1,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,Linear IgA disease,7
1,0.651515,0,0.890756,0.6125,0.630667,0.456429,0.661887,0.656667,0.014,0.811111,0.284,0.574072,0.594,0.922667,0.0825,0.1675,0.66,0.968049,0.384,1,0.572414,0.488889,0.469388,0.0,0.14,0.273,0,0,1,1,0,1,0,1,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,2,Dermatomyositis,4
2,0.363636,0,0.05042,0.170833,0.488,0.457857,0.441814,0.571667,0.922,0.893333,0.872,0.589949,0.464889,0.540333,0.62,0.83,0.57,1.0,0.14,1,0.824138,0.677778,0.897959,0.5,0.32,0.102,1,0,1,1,1,1,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,Ord's thyroiditis,8
3,0.409091,1,0.092437,0.445833,0.661333,0.292857,0.364788,0.236667,0.586,0.142222,0.516,0.462308,0.248889,0.62,0.5675,0.53,0.22,0.074418,0.79,1,0.224138,0.472222,0.510204,0.6,0.32,0.545,0,1,0,1,1,1,0,1,0,1,1,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,Restless legs syndrome,5
4,0.909091,1,0.252101,0.575,0.161333,0.112857,0.605202,0.645,0.058,0.922222,0.368,0.719465,0.351556,0.919,0.13125,0.9,0.9,0.918556,0.702,1,0.227586,0.522222,0.857143,0.3,0.3,0.105,0,1,0,1,0,0,1,0,1,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,Autoimmune polyendocrine syndrome type 2 (APS2),8


In [4]:
scaled_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12499 entries, 0 to 12498
Data columns (total 68 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Age                                   12499 non-null  float64
 1   Gender                                12499 non-null  int64  
 2   Sickness_Duration_Months              12499 non-null  float64
 3   RBC_Count                             12499 non-null  float64
 4   Hemoglobin                            12499 non-null  float64
 5   Hematocrit                            12499 non-null  float64
 6   MCV                                   12499 non-null  float64
 7   MCH                                   12499 non-null  float64
 8   MCHC                                  12499 non-null  float64
 9   RDW                                   12499 non-null  float64
 10  Reticulocyte_Count                    12499 non-null  float64
 11  WBC_Count      

In [5]:
# Checking the number of cases in each cluster
scaled_df['Diseases_group'].value_counts()

Diseases_group
6    2499
5    2206
7    1649
9    1332
8    1222
3    1171
4     752
1     677
2     506
0     485
Name: count, dtype: int64

In [6]:
# Preparing features and labels DataFrames
X = scaled_df.drop(columns=['Diseases_group', 'Diseases_ID', 'Diseases_eng'])
y = scaled_df['Diseases_group']

In [7]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
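
The class sizes shown above range from 485 to 2499, so a purely random split can leave the rarer groups slightly over- or under-represented in the test set. A minimal sketch of a stratified split follows. It rebuilds the label vector from the printed counts; `X_demo`, `y_demo` and the zero-valued placeholder features are illustrative assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class sizes copied from the value_counts() output above
counts = {6: 2499, 5: 2206, 7: 1649, 9: 1332, 8: 1222,
          3: 1171, 4: 752, 1: 677, 2: 506, 0: 485}

# Assumption: stand-in labels and placeholder features, not the real dataset
y_demo = np.repeat(list(counts.keys()), list(counts.values()))
X_demo = np.zeros((y_demo.size, 1))

# stratify=y_demo keeps the class proportions (almost) identical in both parts
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Compare each class's share in the full data with its share in the test set
full_share = np.bincount(y_demo, minlength=10) / y_demo.size
test_share = np.bincount(y_te, minlength=10) / y_te.size
max_gap = np.abs(full_share - test_share).max()
print(f"Largest difference in class share: {max_gap:.4f}")
```

In the notebook itself this would amount to passing `stratify=y` to `train_test_split`; the reported results were obtained without it.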

In [8]:
# Apply SMOTE to balance the training set
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
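
SMOTE balances classes by creating synthetic minority samples on the line segments between a minority sample and one of its nearest minority-class neighbours. The sketch below illustrates that interpolation step with NumPy only. `smote_like_samples` is a hypothetical helper written for illustration and is not imblearn's implementation; the random data is also an assumption.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=42):
    """Hypothetical helper: create n_new synthetic points by interpolating
    between minority samples and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)              # random base points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour per base
    gap = rng.random((n_new, 1))                                # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Assumption: 20 random minority points with 3 features
X_minority = np.random.default_rng(0).random((20, 3))
X_synthetic = smote_like_samples(X_minority, n_new=30)
print(X_synthetic.shape)
```

Each synthetic row is a convex combination of two existing minority rows, so it always lies inside the bounding box of the original minority points.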

In [9]:
# Feature selection using Recursive Feature Elimination (RFE) with Logistic Regression
rfe_model = LogisticRegression(max_iter=1000, random_state=42)
rfe = RFE(estimator=rfe_model, n_features_to_select=10)  # Select top 10 features
rfe.fit(X_train_res, y_train_res)

In [10]:
# Select features based on RFE
X_train_rfe = rfe.transform(X_train_res)
X_test_rfe = rfe.transform(X_test)

In [11]:
# Hyperparameter tuning with RandomizedSearchCV
param_dist = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

model = DecisionTreeClassifier(random_state=42)
random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=20, cv=3, scoring='accuracy', random_state=42)
random_search.fit(X_train_rfe, y_train_res)

In [12]:
# Evaluate the best model
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test_rfe)

In [13]:

print("Best parameters:", random_search.best_params_)
print("Classification report:")
print(classification_report(y_test, y_pred))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))

Best parameters: {'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': None}
Classification report:
              precision    recall  f1-score   support

           0       0.60      0.88      0.71        95
           1       0.04      0.05      0.04       150
           2       0.06      0.10      0.07       110
           3       0.12      0.12      0.12       217
           4       0.21      0.22      0.22       169
           5       0.26      0.19      0.22       434
           6       1.00      1.00      1.00       494
           7       0.31      0.26      0.28       320
           8       0.10      0.09      0.10       253
           9       0.14      0.14      0.14       258

    accuracy                           0.35      2500
   macro avg       0.28      0.31      0.29      2500
weighted avg       0.36      0.35      0.35      2500

Confusion matrix:
[[ 84   1   2   2   1   0   0   1   3   1]
 [  2   7  18  19  13  21   0  16  25  29]
 [  3  15  11  10   3  20   0 

This classification model was trained on labels produced by k-means clustering: each cluster of the grouping variable serves as one target class. On the test set it reaches an accuracy of 0.35, and only clusters 0 and 6 are recognised well (recall of 0.88 and 1.00, respectively); the remaining clusters have F1-scores below 0.30. The model was compared with a classification model in which the categories of the grouping variable were defined using domain (medical) knowledge. The difference in accuracy between the two models is small and favours the non-cluster-based model. In addition, the disease groups created through clustering may be difficult to interpret substantively. This is a problem given the business objective of the classification model: assigning patients to autoimmune disease groups based on symptoms and test results, and providing recommendations for further medical diagnosis and disease prevention.
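
One way such a comparison could be set up is sketched below: the same decision tree is trained once on cluster-derived labels and once on predefined labels, and the hold-out accuracies are printed side by side. Everything here is an assumption made for illustration: the data is synthetic, `y_domain` merely stands in for knowledge-based groups, and `holdout_accuracy` is a hypothetical helper, not the procedure used for the comparison described above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic features with predefined classes as "domain" labels
X_demo, y_domain = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=42
)
# Cluster-derived labels for the same rows
y_cluster = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)

def holdout_accuracy(X, y):
    """Hypothetical helper: train a decision tree and score it on a 20% hold-out."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
    return accuracy_score(y_te, tree.predict(X_te))

scores = {
    "domain labels": holdout_accuracy(X_demo, y_domain),
    "cluster labels": holdout_accuracy(X_demo, y_cluster),
}
for name, acc in scores.items():
    print(f"{name}: accuracy = {acc:.2f}")
```

The two scores are measured against different target definitions, so they indicate how learnable each labelling is from the features rather than which labelling is medically more meaningful.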