# Project 3: Diabetes - classification analysis using cost-sensitive model training

**Team members:** Ronald (leader)

### About dataset
Source: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

*This is a clean dataset of 253,680 survey responses to the CDC's BRFSS2015. The target variable Diabetes_012 has 3 classes. 0 is for no diabetes or only during pregnancy, 1 is for prediabetes, and 2 is for diabetes. There is class imbalance in this dataset. This dataset has 21 feature variables*

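The data file used in this notebook (the binary-target version, `diabetes_binary_health_indicators_BRFSS2015.csv`) contains the following columns: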

**Diabetes_binary (target variable)**
0 = no diabetes 1 = diabetes

**HighBP**
0 = no high BP 1 = high BP

**HighChol**
0 = no high cholesterol 1 = high cholesterol

**CholCheck**
0 = no cholesterol check in 5 years 1 = yes cholesterol check in 5 years

**BMI**
Body Mass Index

**Smoker**
Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 0 = no 1 = yes

**Stroke**
(Ever told) you had a stroke. 0 = no 1 = yes

**HeartDiseaseorAttack**
coronary heart disease (CHD) or myocardial infarction (MI) 0 = no 1 = yes

**PhysActivity**
physical activity in past 30 days - not including job 0 = no 1 = yes

**Fruits**
Consume Fruit 1 or more times per day 0 = no 1 = yes 

**Veggies**
Consume Vegetables 1 or more times per day 0 = no 1 = yes

**HvyAlcoholConsump**
Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) 0 = no 1 = yes

**AnyHealthcare**
Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. 0 = no 1 = yes

**NoDocbcCost**
Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 0 = no 1 = yes

**GenHlth**
Would you say that in general your health is: scale 1-5 (1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor)

**MentHlth**
Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? scale 0-30 days

**PhysHlth**
Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? scale 0-30 days

**DiffWalk**
Do you have serious difficulty walking or climbing stairs? 0 = no 1 = yes

**Sex**
0 = female 1 = male

**Age**
13-level age category (AGEG5YR, see codebook): 1 = 18-24, 9 = 60-64, 13 = 80 or older

**Education**
Education level (EDUCA, see codebook), scale 1-6: 1 = Never attended school or only kindergarten; 2 = Grades 1 through 8 (Elementary); 3 = Grades 9 through 11 (Some high school); 4 = Grade 12 or GED (High school graduate); 5 = College 1 year to 3 years (Some college or technical school); 6 = College 4 years or more (College graduate)

**Income**
Income scale (INCOME2 see codebook) scale 1-8. 1 = less than 10k USD; 5 = less than 35k USD; 8 = 75k+ USD 

## Note
This notebook focuses on classification; exploratory data analysis (EDA) and related work can be found in another notebook.
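
To make the cost metric concrete, here is a minimal sketch of how a total misclassification cost can be computed from true and predicted labels, using the same four outcome names as the `cost_dict` in the code cell. The helper name `total_cost_from_labels` and the toy labels are made up for illustration, and the cost values are assumed figures rather than measured ones.

```python
def total_cost_from_labels(y_true, y_pred, cost_dict):
    """Sum the cost of every prediction; label 1 = diabetes, 0 = no diabetes."""
    counts = {'hit': 0, 'correct rejection': 0, 'miss': 0, 'false alarm': 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts['hit'] += 1                # diabetes correctly detected
        elif truth == 0 and pred == 0:
            counts['correct rejection'] += 1  # healthy correctly cleared
        elif truth == 1 and pred == 0:
            counts['miss'] += 1               # diabetes overlooked (false negative)
        else:
            counts['false alarm'] += 1        # healthy flagged (false positive)
    total = sum(cost_dict[name] * n for name, n in counts.items())
    return total, counts


# assumed costs (illustrative): a miss is five times as costly as a false alarm
example_costs = {'hit': 0, 'correct rejection': 0, 'miss': 5000, 'false alarm': 1000}

# toy labels, invented for this example
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0]

total, counts = total_cost_from_labels(y_true, y_pred, example_costs)
print(counts)  # {'hit': 1, 'correct rejection': 2, 'miss': 2, 'false alarm': 1}
print(total)   # 11000
```

With equal unit costs for both error types, the same function simply counts the errors, so minimizing cost becomes equivalent to maximizing accuracy.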

In [None]:
# general imports
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import sys
import time
import warnings
from math import ceil
from collections import defaultdict
from sklearn.model_selection import RandomizedSearchCV, train_test_split, StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

# import my personal library
sys.path.insert(0, '../../../ronaldlib/')
import ronaldlib.utils as rutils

# settings
pd.options.display.float_format = "{:.2f}".format
warnings.filterwarnings('ignore', 'The objective has been evaluated ')
warnings.filterwarnings('ignore', 'The max_iter was reached ')
warnings.filterwarnings('ignore', 'The least populated class in ')
warnings.filterwarnings('ignore', 'The total space of parameters ')

# data file
data_file = './data/diabetes_binary_health_indicators_BRFSS2015.csv'
analysis_start_time = time.time()

# load data
df = rutils.load_data(data_file, resave_as_pickle=True)

# define the (relative) cost for each possible classification result (assumed dollar values)
cost_dict = { 'hit'               : 0,     # 'cost' of correctly diagnosing a diabetes patient (not used in model training)
              'correct rejection' : 0,     # 'cost' of correctly diagnosing non-diabetes patient (not used in model training)
              'miss'              : 5000,  # we assume a cost of $5000 for erroneously diagnosing a diabetes patient as healthy (e.g. the additional medical cost due to worsened health due to not diagnosing it in time)
              'false alarm'       : 1000,  # we assume a cost of $1000 for erroneously diagnosing a healthy person as a diabetes patient (e.g. the cost of a follow-up visit with a more detailed examination)
            }

# NOTE: this second definition overrides the dollar costs above. With equal unit costs the class weights
# are 1:1, so training is effectively cost-insensitive and the total cost equals the number of errors.
# Comment this block out to train with the dollar costs instead.
cost_dict = { 'hit'               : 0,     # 'cost' of correctly diagnosing a diabetes patient (not used in model training)
              'correct rejection' : 0,     # 'cost' of correctly diagnosing non-diabetes patient (not used in model training)
              'miss'              : 1,     # unit cost for erroneously diagnosing a diabetes patient as healthy
              'false alarm'       : 1,     # unit cost for erroneously diagnosing a healthy person as a diabetes patient
            }

# model training settings
n_iter_randomized_search = 10
n_cv_folds = 5
n_jobs = 6
cv = StratifiedKFold(n_splits=n_cv_folds)
n_samples_per_run = -1 # set to -1 to use all samples

# adjust the costs to the class weights
class_weights = {0: cost_dict['false alarm'], 1: cost_dict['miss']}

# define models
models = {
    'LogisticRegression': LogisticRegression(class_weight=class_weights),
    'RandomForest': RandomForestClassifier(class_weight=class_weights),
    'KNeighbors': KNeighborsClassifier(),
    'XGBoost': XGBClassifier(scale_pos_weight=class_weights[1]/class_weights[0]),  # XGBoost uses scale_pos_weight for imbalance
}

# define the hyperparameter space for each model 
logisticregression_param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [.0001, .001, .01, .1, 1, 10],
    'solver': ['liblinear']  # 'liblinear' supports both the 'l1' and 'l2' penalties
}

randomforest_param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_features': ['sqrt'],
    'max_depth': [10, 20, 40, 80, 160],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

kneighbors_param_grid = {
    'n_neighbors': list(range(1, 30)),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski']
}

xgboost_param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7, 9],
    'colsample_bytree': [0.3, 0.5, 0.7],
    'gamma': [0, 0.1, 0.2]
}

# create a dictionary that maps the model name to the right param grid
param_grid = {
    'LogisticRegression': logisticregression_param_grid,
    'RandomForest': randomforest_param_grid,
    'KNeighbors': kneighbors_param_grid,
    'XGBoost': xgboost_param_grid
}

# initialize dictionaries that will store best scores and parameters
best_scores = defaultdict(list)
best_params = defaultdict(list)

# subsample
if n_samples_per_run > 0:
    df_sample = df.sample(n_samples_per_run)
else:
    df_sample = df
    
# split features and outcome variable
X = df_sample.drop('Diabetes_binary', axis=1)
y = df_sample['Diabetes_binary']

# stratified train/test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

# specify column transformer
binary_cols = [col for col in X_train.columns if X_train[col].nunique() == 2]
continuous_cols = [col for col in X_train.columns if col not in binary_cols]
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), continuous_cols),
        ('passthrough', 'passthrough', binary_cols)
    ])

# transform the columns
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)
    
# train models and store their performance
cost_scores = {model: [] for model in models.keys()}
accuracy_scores = {model: [] for model in models.keys()}
for model_name, model in models.items():
    # fit the classifier model using RandomizedSearchCV
    start_time = time.time()
    grid_search = RandomizedSearchCV(model, param_grid[model_name], cv=cv, n_iter=n_iter_randomized_search, n_jobs=n_jobs)
    grid_search.fit(X_train, y_train)

    # compute predictions for the held-out test set
    y_pred = grid_search.predict(X_test)    
    
    # compute accuracy
    accuracy_scores[model_name].append(accuracy_score(y_test, y_pred))
        
    # compute cost
    conf_mat = confusion_matrix(y_test, y_pred)
    TP = conf_mat[1, 1]
    TN = conf_mat[0, 0]
    FP = conf_mat[0, 1]
    FN = conf_mat[1, 0]
    total_cost = cost_dict['hit']*TP + cost_dict['correct rejection']*TN + cost_dict['false alarm']*FP + cost_dict['miss']*FN
    cost_scores[model_name].append(total_cost)

# Plot the accuracy of each model
plt.figure(figsize=(10, 5))
plt.bar(accuracy_scores.keys(), [np.mean(v) for v in accuracy_scores.values()], color='orange')
plt.title('Accuracy of each model')
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.show()

# Plot the cost of each model
plt.figure(figsize=(10, 5))
plt.bar(cost_scores.keys(), [np.mean(v) for v in cost_scores.values()], color='green')
plt.title('Cost of each model')
plt.xlabel('Model')
plt.ylabel('Cost')
plt.show()

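# Print info about the best model (average each model's score list before comparing)
mean_accuracy = {name: np.mean(v) for name, v in accuracy_scores.items()}
mean_cost = {name: np.mean(v) for name, v in cost_scores.items()}
print(f"\nModel with best accuracy: {max(mean_accuracy, key=mean_accuracy.get)} (accuracy = {max(mean_accuracy.values()):.4f})")
print(f"\nModel with lowest cost: {min(mean_cost, key=mean_cost.get)} (cost = {min(mean_cost.values())})")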

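# Print and plot confusion matrix for each model
for model_name, model in models.items():
    print(f"Confusion matrix for {model_name}:")
    
    # Refit the classifier using RandomizedSearchCV
    # (the search is random, so results may differ slightly from the first fit above)
    grid_search = RandomizedSearchCV(model, param_grid[model_name], cv=cv, n_iter=n_iter_randomized_search, n_jobs=n_jobs)
    grid_search.fit(X_train, y_train)
    y_pred = grid_search.predict(X_test)
    
    # Recompute confusion matrix, accuracy, and cost for THIS model's predictions
    # (previously the values left over from the last model in the training loop were printed)
    TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()
    acc = accuracy_score(y_test, y_pred)
    total_cost = cost_dict['hit']*TP + cost_dict['correct rejection']*TN + cost_dict['false alarm']*FP + cost_dict['miss']*FN
    
    print(f"TP(cost={cost_dict['hit']*TP}): {TP}, FN(cost={cost_dict['miss']*FN}): {FN}")
    print(f"FP(cost={cost_dict['false alarm']*FP}): {FP}, TN(cost={cost_dict['correct rejection']*TN}): {TN}")
    print(f"Total cost: {total_cost}")
    
    # Plot confusion matrix
    disp = ConfusionMatrixDisplay.from_estimator(grid_search, X_test, y_test, display_labels=["No Diabetes", "Diabetes"])
    plt.title(f"{model_name}, acc = {acc:.3f}, cost = {total_cost}")
    plt.show()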

loading data from pickle file...
data loaded (took 0.0 seconds)
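
The searches above call `RandomizedSearchCV` without a `scoring` argument, so hyperparameters are selected by accuracy (the default for classifiers) even though the models are afterwards compared on cost. One possible extension, sketched here under assumptions, is to pass a cost-based scorer so that selection and reporting use the same quantity. The function `misclassification_cost`, the synthetic data, the 1:5 class weights, and the small parameter grid are illustrative only and not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

COST_MISS = 5000         # assumed cost of a false negative
COST_FALSE_ALARM = 1000  # assumed cost of a false positive


def misclassification_cost(y_true, y_pred):
    """Total cost of the errors; correct predictions are assumed to cost nothing."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return COST_FALSE_ALARM * fp + COST_MISS * fn


# greater_is_better=False makes the scorer return the negated cost,
# so that "higher score is better" still holds inside the search
cost_scorer = make_scorer(misclassification_cost, greater_is_better=False)

# synthetic, imbalanced toy data (NOT the BRFSS data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=400) > 1.0).astype(int)

search = RandomizedSearchCV(
    # class weights use the cost ratio (1:5) rather than the raw dollar values
    LogisticRegression(class_weight={0: 1, 1: COST_MISS / COST_FALSE_ALARM}, solver='liblinear'),
    {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2']},
    n_iter=4,
    scoring=cost_scorer,
    cv=StratifiedKFold(n_splits=5),
    random_state=0,
)
search.fit(X_demo, y_demo)

demo_cost = misclassification_cost(y_demo, search.predict(X_demo))
print("best params:", search.best_params_)
print("mean negated cost per CV fold:", search.best_score_)
print("cost on the toy data:", demo_cost)
```

Because the scorer sums costs within each validation fold, `best_score_` is a (negated) per-fold total; dividing by the fold size would give a per-person cost that is easier to compare across dataset sizes.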
