Instructions:

- Build upon the classification completed in the mini-project, adding additional modeling from new classification algorithms.
- Add explanations that are in line with the CRISP-DM framework.
- Use appropriate cross validation for all of your analysis. Explain your chosen method of performance validation in detail.
- Try to use as much testing data as possible in a realistic manner. Define what you think is realistic and why.

- Identify two tasks from the dataset to regress or classify. That is:  
  - two classification tasks OR
  - two regression tasks OR
  - one classification task and one regression task  
- Example from the diabetes dataset:
  (1) Classify if a patient will be readmitted within a 30 day period or not.
  (2) Regress the total number of days a patient will spend in the hospital, given their history and specifics of the encounter such as tests administered and previous admissions.

### Setup and Data Import

In [1]:
# Essential Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import re

# Machine Learning Libraries
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics as mt
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

# Display plots inline
%matplotlib inline

# Load dataset
df = pd.read_csv('data/diabetes+130-us+hospitals+for+years+1999-2008/diabetic_data.csv')
df.head()


| | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |


## Data Preparation

- Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.
- Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).

#### Data Cleaning & Preprocessing

In [2]:
# Make a copy of the dataset
df_clean = df.copy()

# Replace '?' with NaN
df_clean.replace('?', np.nan, inplace=True)

# Fill missing values
df_clean[['medical_specialty', 'payer_code', 'race']] = df_clean[['medical_specialty', 'payer_code', 'race']].fillna('Unknown')
df_clean[['diag_1', 'diag_2', 'diag_3']] = df_clean[['diag_1', 'diag_2', 'diag_3']].fillna('Unknown/None')
df_clean[['max_glu_serum', 'A1Cresult']] = df_clean[['max_glu_serum', 'A1Cresult']].fillna('Untested')

# Convert numeric categorical columns to strings explicitly (not category yet)
numeric_categorical_cols = ['admission_type_id', 'discharge_disposition_id', 'admission_source_id']
df_clean[numeric_categorical_cols] = df_clean[numeric_categorical_cols].astype(str)

# Drop unnecessary columns
df_clean.drop(columns=['encounter_id', 'examide', 'citoglipton', 'weight', 'patient_nbr'], inplace=True)

# Define ordinal category orders
category_orders = {
    'readmitted': ['<30', '>30', 'NO'],
    'max_glu_serum': ['Untested', 'Norm', '>200', '>300'],
    'A1Cresult': ['Untested', 'Norm', '>7', '>8'],
    'age': ['[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)',
            '[50-60)', '[60-70)', '[70-80)', '[80-90)', '[90-100)']
}

# Convert ordinal variables
for col, order in category_orders.items():
    df_clean[col] = pd.Categorical(df_clean[col], categories=order, ordered=True)

# Convert drug variables to ordinal categories
drug_order = ['No', 'Down', 'Steady', 'Up']
drug_cols = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 
                'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'tolazamide', 
                'pioglitazone', 'rosiglitazone', 'troglitazone', 'acarbose', 'miglitol', 
                'insulin', 'glyburide-metformin', 'glipizide-metformin',
                'metformin-rosiglitazone', 'metformin-pioglitazone', 'glimepiride-pioglitazone']
for col in drug_cols:
    df_clean[col] = pd.Categorical(df_clean[col], categories=drug_order, ordered=True)

# Preprocess diag_1, diag_2, diag_3 combining all codes with decimals under their integer values
for col in ['diag_1', 'diag_2', 'diag_3']:
    df_clean[col] = df_clean[col].str.split('.').str[0]  # Drop decimals and digits after

df_clean.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 45 columns):
 #   Column                    Non-Null Count   Dtype   
---  ------                    --------------   -----   
 0   race                      101766 non-null  object  
 1   gender                    101766 non-null  object  
 2   age                       101766 non-null  category
 3   admission_type_id         101766 non-null  object  
 4   discharge_disposition_id  101766 non-null  object  
 5   admission_source_id       101766 non-null  object  
 6   time_in_hospital          101766 non-null  int64   
 7   payer_code                101766 non-null  object  
 8   medical_specialty         101766 non-null  object  
 9   num_lab_procedures        101766 non-null  int64   
 10  num_procedures            101766 non-null  int64   
 11  num_medications           101766 non-null  int64   
 12  number_outpatient         101766 non-null  int64   
 13  number_emergency          101

#### Feature Engineering & Selection

In [3]:
# Extract response variable
y = df_clean['readmitted']
X = df_clean.drop(columns=['readmitted'])

# Convert target variable to binary (1 = readmitted within 30 days, 0 = otherwise;
# '>30' and 'NO' are both mapped to 0)
y_binary = np.where(y == '<30', 1, 0)


## Feature Selection from RF Variable Importance (Done in EDA)
rf_features_js = ['num_lab_procedures', 'diag_1', 'diag_2', 'diag_3', 'num_medications', 'time_in_hospital', 'age', 
                  'number_inpatient', 'medical_specialty', 'discharge_disposition_id', 'payer_code', 'num_procedures', 
                  'number_diagnoses', 'admission_type_id', 'admission_source_id']
rf_features_kh = ['num_lab_procedures', 'num_medications', 'time_in_hospital', 'number_inpatient', 'number_diagnoses', 
                  'num_procedures', 'number_outpatient', 'number_emergency', 'diag_3', 'gender', 'diag_1', 'medical_specialty', 
                  'diag_2', 'payer_code', 'race', 'discharge_disposition_id']

# Get the union of both feature lists (combined RF-selected features)
rf_features_all = list(set(rf_features_js) | set(rf_features_kh))

print(f"Total RF-Selected Features ({len(rf_features_all)}): {rf_features_all}")

# Alternatively, we could take only the features common to both lists
# rf_features_common = list(set(rf_features_js) & set(rf_features_kh))
# print(f"\nFeatures in both RF lists ({len(rf_features_common)}):\n", rf_features_common)

# Create a Reduced Dataset with RF-Selected Features
X_rf_selected = X[rf_features_all].copy()

# Define categorical and numerical feature subsets
categorical_cols = X_rf_selected.select_dtypes(include=['object', 'category']).columns
numerical_cols = X_rf_selected.select_dtypes(include=['int64', 'float64']).columns

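# One-hot encode categorical variables; drop_first avoids perfect multicollinearity
# among the dummy columns, which matters for logistic regression.
# NOTE: X_encoded is not used further below; the train/test split uses X_rf_selected.
X_encoded = pd.get_dummies(X_rf_selected, columns=categorical_cols, drop_first=True)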

# print("Categorical columns:", list(categorical_cols))

# # Convert numerical categorical columns to string explicitly
# for col in categorical_cols:
#     if X_rf_selected[col].dtype in ['int64', 'float64', 'category']:  # Check for problematic types
#         X_rf_selected[col] = X_rf_selected[col].astype(str)
# # Reassign to ensure all categorical data is string type
# X_rf_selected.loc[:, categorical_cols] = X_rf_selected.loc[:, categorical_cols].astype(str)
# Ensure all categorical columns are strings
# X_rf_selected[categorical_cols] = X_rf_selected[categorical_cols].astype(str)

# # Function to group ICD-9 codes to avoid unseen categories
# def map_icd9_group(code):
#     try:
#         if code.startswith('V') or code.startswith('E'):
#             return 'External'
#         elif code.isdigit():
#             code = int(code)
#             if 1 <= code <= 139:
#                 return 'Infectious'
#             elif 140 <= code <= 239:
#                 return 'Neoplasms'
#             elif 240 <= code <= 279:
#                 return 'Endocrine/Metabolic'
#             elif 280 <= code <= 289:
#                 return 'Blood Disorders'
#             elif 290 <= code <= 319:
#                 return 'Mental Disorders'
#             elif 320 <= code <= 389:
#                 return 'Neurological/Sensory'
#             elif 390 <= code <= 459:
#                 return 'Circulatory'
#             elif 460 <= code <= 519:
#                 return 'Respiratory'
#             elif 520 <= code <= 579:
#                 return 'Digestive'
#             elif 580 <= code <= 629:
#                 return 'Genitourinary'
#             elif 630 <= code <= 679:
#                 return 'Pregnancy'
#             elif 680 <= code <= 709:
#                 return 'Skin'
#             elif 710 <= code <= 739:
#                 return 'Musculoskeletal'
#             elif 740 <= code <= 759:
#                 return 'Congenital'
#             elif 760 <= code <= 779:
#                 return 'Perinatal'
#             elif 780 <= code <= 799:
#                 return 'Symptoms/Signs'
#             elif 800 <= code <= 999:
#                 return 'Injury'
#         return 'Unknown'
#     except:
#         return 'Unknown'

# # Apply to diagnosis columns
# for col in ['diag_1', 'diag_2', 'diag_3']:
#     X_rf_selected.loc[:, col] = X_rf_selected[col].astype(str).apply(map_icd9_group)


# # Replace rare categories (enhanced for numeric IDs)
# def replace_rare_categories(df, categorical_cols, threshold):
#     df = df.copy()
#     for col in categorical_cols:
#         freq = df[col].value_counts()
#         rare_categories = freq[freq < threshold].index.tolist()
#         df[col] = df[col].replace(rare_categories, 'Other')
#         # Explicitly group uncommon numeric IDs if still present
#         # if col in ['admission_type_id', 'discharge_disposition_id', 'admission_source_id']:
#         #     df[col] = df[col].apply(lambda x: x if freq.get(x, 0) >= threshold else 'Other')
#     return df

# X_rf_selected = replace_rare_categories(X_rf_selected, categorical_cols, threshold=50)

# # Debugging: Check unique values
# print("Unique values after preprocessing:")
# for col in categorical_cols:
#     print(f"{col}: {X_rf_selected[col].unique()}")


# # Column Transformer: OneHotEncode categorical, Scale numerical
# preprocessor = ColumnTransformer([
#     ('num', StandardScaler(), numerical_cols),
#     ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), categorical_cols)
# ])

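# Column Transformer: scale numerical features only.
# NOTE: ColumnTransformer defaults to remainder='drop', so the categorical columns
# are discarded and the pipelines below are fit on the numerical features alone.
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_cols)
])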


Total RF-Selected Features (19): ['diag_3', 'time_in_hospital', 'age', 'discharge_disposition_id', 'num_lab_procedures', 'gender', 'diag_1', 'admission_source_id', 'diag_2', 'medical_specialty', 'number_emergency', 'payer_code', 'number_diagnoses', 'num_procedures', 'admission_type_id', 'number_outpatient', 'race', 'num_medications', 'number_inpatient']
Categorical columns: ['diag_3', 'age', 'discharge_disposition_id', 'gender', 'diag_1', 'admission_source_id', 'diag_2', 'medical_specialty', 'payer_code', 'admission_type_id', 'race']
Unique values after preprocessing:
diag_3: ['Unknown' 'Endocrine/Metabolic' 'External' 'Circulatory' 'Infectious'
 'Respiratory' 'Injury' 'Neoplasms' 'Genitourinary' 'Musculoskeletal'
 'Symptoms/Signs' 'Digestive' 'Skin' 'Mental Disorders' 'Congenital'
 'Neurological/Sensory' 'Pregnancy' 'Blood Disorders']
age: ['[0-10)' '[10-20)' '[20-30)' '[30-40)' '[40-50)' '[50-60)' '[60-70)'
 '[70-80)' '[80-90)' '[90-100)']
discharge_disposition_id: ['25' '1' '3' '6' 
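
The active `preprocessor` in the cell above scales only the numerical columns. The sketch below shows one way the categorical columns could be encoded in the same `ColumnTransformer`, using a toy frame. The `toy_` names, the made-up values, and the choice of `handle_unknown="ignore"` and `sparse_threshold=0` are assumptions for illustration, not the configuration used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training frame reusing two of the notebook's column names (values are made up).
toy_train = pd.DataFrame({
    "num_medications": [5, 12, 20, 9],
    "payer_code": ["MC", "HM", "MC", "Unknown"],
})
# Toy test row containing a payer code that was never seen during fit.
toy_test = pd.DataFrame({
    "num_medications": [7],
    "payer_code": ["SP"],
})

toy_preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ["num_medications"]),
        # handle_unknown="ignore" encodes unseen categories as all zeros
        # instead of raising an error at transform time.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["payer_code"]),
    ],
    sparse_threshold=0,  # always return a dense array, for easy inspection
)

train_matrix = toy_preprocessor.fit_transform(toy_train)
test_matrix = toy_preprocessor.transform(toy_test)

# One scaled numeric column plus one indicator column per training category.
print(train_matrix.shape, test_matrix.shape)
print(test_matrix)
```

Fitting the encoder on the training split only, and ignoring unseen categories at transform time, is one common way to handle high-cardinality columns such as the diagnosis codes without leaking information from the test split.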

## Modeling and Evaluation

- Choose and explain the evaluation metrics that you will use (e.g., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.
- Choose the method you will use for dividing your data into training and testing splits (e.g., are you using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate.
- Create three different classification/regression models (e.g., random forest, KNN, and SVM). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric.
- Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.
- Discuss the advantages of each model for each classification task, if any. If there are no advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods.
- Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.
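
As a minimal sketch of the 95% confidence comparison asked for above, one option is a paired confidence interval on the per-fold score differences of two models evaluated on the same folds. The fold scores below are hypothetical placeholders, not results from this notebook, and `paired_confidence_interval` is an illustrative helper name. Because cross-validation folds share training data, the interval should be read as approximate.

```python
import math
import statistics

# Hypothetical per-fold recall scores for two models evaluated on the SAME
# 10 stratified folds (illustrative numbers, not results from this notebook).
model_a = [0.55, 0.54, 0.56, 0.55, 0.53, 0.57, 0.55, 0.54, 0.56, 0.55]
model_b = [0.52, 0.53, 0.54, 0.52, 0.51, 0.55, 0.53, 0.52, 0.53, 0.52]


def paired_confidence_interval(scores_a, scores_b, t_crit):
    """Return (mean difference, lower bound, upper bound) for paired fold scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation, k - 1).
    std_err = statistics.stdev(diffs) / math.sqrt(k)
    half_width = t_crit * std_err
    return mean_diff, mean_diff - half_width, mean_diff + half_width


# Two-sided 95% critical value of Student's t with k - 1 = 9 degrees of freedom.
T_CRIT_95_DF9 = 2.262

mean_diff, lower, upper = paired_confidence_interval(model_a, model_b, T_CRIT_95_DF9)

# The difference is called significant when the interval excludes zero.
significant = lower > 0 or upper < 0
print(f"Mean difference: {mean_diff:.3f}, 95% CI: [{lower:.3f}, {upper:.3f}]")
print("Significant at 95% confidence" if significant else "Not significant at 95% confidence")
```

With 5 folds instead of 10, the critical value would be 2.776 (4 degrees of freedom).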

#### Creating a Stratified Holdout Test Set

In [4]:
# Split into train (80%) and holdout test (20%) - Stratified
X_train, X_test, y_train, y_test = train_test_split(X_rf_selected, y_binary, test_size=0.2, stratify=y_binary, random_state=1234)


#### Define Classifiers

In [5]:
# Define Classifiers
clf_lr = SGDClassifier(loss="log_loss", penalty="l2", alpha=1e-05, eta0=0.01,
                       max_iter=1000, class_weight="balanced",
                       learning_rate="adaptive", n_jobs=-1, random_state=1234)
clf_nb = MultinomialNB(alpha=0.5)  # Tune alpha later
clf_dt = AdaBoostClassifier(estimator=DecisionTreeClassifier(criterion='entropy', max_depth=1),
                            n_estimators=500)  # Tune estimators later 

# Define classifier labels
clf_labels = ['SGD Logistic Regression', 'Multinomial Naive Bayes', 'AdaBoost Decision Tree']

# Pipelines
clf_lr_pipeline = Pipeline([('preprocessor', preprocessor), ('clf', clf_lr)])
# clf_nb_pipeline = Pipeline([('preprocessor', OneHotEncoder()), ('clf', clf_nb)])  # No scaling needed
clf_dt_pipeline = Pipeline([('preprocessor', preprocessor), ('clf', clf_dt)])


#### Tune Hyperparameters

In [6]:
# # Hyperparameter tuning for MultinomialNB
# param_grid_nb = {'clf__alpha': np.arange(0.1, 1.1, 0.1)}
# grid_nb = GridSearchCV(clf_nb_pipeline, param_grid_nb, cv=5, scoring='recall', n_jobs=-1)
# grid_nb.fit(X_train, y_train)

# # Get best alpha for NB
# best_alpha = grid_nb.best_params_['clf__alpha']
# print(f"Best alpha for MNB: {best_alpha}")

# # Update NB classifier with best alpha
# clf_nb_pipeline.set_params(clf__alpha=best_alpha)

# # Hyperparameter tuning for AdaBoost
# param_grid_ab = {'clf__n_estimators': [50, 100, 200, 500]}
# grid_ab = GridSearchCV(clf_dt_pipeline, param_grid_ab, cv=5, scoring='recall', n_jobs=-1)
# grid_ab.fit(X_train, y_train)

# # Get best n_estimators for AdaBoost
# best_n_estimators = grid_ab.best_params_['clf__n_estimators']
# print(f"Best n_estimators for AdaBoost: {best_n_estimators}")

# # Update AdaBoost classifier with best n_estimators
# clf_dt_pipeline.set_params(clf__n_estimators=best_n_estimators)


#### SGD Logistic Regression

In [7]:
%%time
# Define Stratified K-Fold (5 splits)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)

# # Cross-validation for all three pipelines
# print('5-fold cross validation (CV):\n')
# for clf, label in zip([clf_lr_pipeline, clf_nb_pipeline, clf_dt_pipeline], clf_labels):
#     scores = cross_val_score(estimator=clf,
#                              X=X_train,
#                              y=y_train,
#                              cv=skf, # Stratified 5-fold CV
#                              scoring='recall',  # Prioritize recall
#                              n_jobs=-1)  # Use all CPUs
#     print(f"CV recall: {scores.mean():.2f} (+/- {scores.std():.2f}) [{label}]")

# Cross-validation
print('5-fold cross validation (CV):\n')
# The label is wrapped in a list: zipping with a bare string would pair the
# pipeline with only the string's first character.
for clf, label in zip([clf_lr_pipeline], ['LR Classifier']):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=skf, # Stratified 5-fold CV
                             scoring='recall',  # Prioritize recall
                             n_jobs=-1)  # Use all CPUs
    print(f"CV recall: {scores.mean():.2f} (+/- {scores.std():.2f}) [{label}]")


5-fold cross validation (CV):

CV recall: 0.55 (+/- 0.01) [LR Classifier]
CPU times: user 29 ms, sys: 88.3 ms, total: 117 ms
Wall time: 1.17 s
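
The cell above scores with `scoring='recall'`. A small worked example may help show why recall can be more informative than accuracy when the positive class is rare. The class counts below are hypothetical, chosen only for illustration, and `confusion_counts` is an illustrative helper rather than part of the notebook.

```python
# Hypothetical holdout of 1,000 encounters, 110 of them readmitted within
# 30 days (illustrative counts, not taken from this notebook's output).
y_true = [1] * 110 + [0] * 890

# A trivial baseline that never predicts readmission.
y_pred_never = [0] * 1000


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred_never)

accuracy = (tp + tn) / (tp + fp + fn + tn)
# Recall = TP / (TP + FN): the share of actual positives that were caught.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"Accuracy: {accuracy:.2f}, Recall: {recall:.2f}")
```

The baseline looks strong on accuracy while identifying none of the positive cases, which is the failure mode a recall-based score is meant to expose.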




## Deployment

- How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it were used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?

## Exceptional Work

- You have free rein to provide additional modeling.
- One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?
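
A minimal sketch of the grid-search idea above, run on synthetic data from `make_classification` rather than the diabetes dataset. The parameter grid, the `X_demo`/`y_demo` names, and `N_JOBS` are assumptions chosen to keep the example small. `GridSearchCV` parallelizes over parameter combinations and folds through its `n_jobs` argument.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in data (not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.85, 0.15], random_state=1234)

# Small illustrative grid; a real search would likely cover wider ranges.
param_grid = {"n_estimators": [10, 25, 50], "learning_rate": [0.5, 1.0]}

N_JOBS = 1  # set to -1 to spread the fits across all available CPU cores

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)
search = GridSearchCV(AdaBoostClassifier(random_state=1234), param_grid,
                      cv=cv, scoring="recall", n_jobs=N_JOBS)
search.fit(X_demo, y_demo)

# cv_results_ holds one row per parameter combination; pivoting the mean
# test score gives a table that can be passed to a heatmap.
results = pd.DataFrame(search.cv_results_)
summary = results.pivot(index="param_n_estimators",
                        columns="param_learning_rate",
                        values="mean_test_score")
print(summary)
print("Best parameters:", search.best_params_)
```

The pivoted `summary` table could be passed to `sns.heatmap` to visualize how the score varies across the two parameters.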