Instructions:

- Build upon the classification completed in the mini-project, adding additional modeling from new classification algorithms.
- Add explanations that are in line with the CRISP-DM framework.
- Use appropriate cross validation for all of your analysis. Explain your chosen method of performance validation in detail.
- Try to use as much testing data as possible in a realistic manner. Define what you think is realistic and why.

- Identify two tasks from the dataset to regress or classify. That is:  
  - two classification tasks OR
  - two regression tasks OR
  - one classification task and one regression task  
- Example from the diabetes dataset:
  (1) Classify whether a patient will be readmitted within a 30-day period.
  (2) Regress the total number of days a patient will spend in the hospital, given their history and specifics of the encounter such as tests administered and previous admissions.

### Setup and Data Import

In [2]:
# Essential Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import re

# Machine Learning Libraries
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn import metrics as mt
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

# Display plots inline
%matplotlib inline

# Load dataset
df = pd.read_csv('data/diabetes+130-us+hospitals+for+years+1999-2008/diabetic_data.csv')
df.head()


## Data Preparation

- Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.
- Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).

#### Data Cleaning & Preprocessing

In [None]:
# Make a copy of the dataset
df_clean = df.copy()

# Replace '?' with NaN
df_clean.replace('?', np.nan, inplace=True)

# Fill missing values
df_clean[['medical_specialty', 'payer_code', 'race']] = df_clean[['medical_specialty', 'payer_code', 'race']].fillna('Unknown')
df_clean[['diag_1', 'diag_2', 'diag_3']] = df_clean[['diag_1', 'diag_2', 'diag_3']].fillna('Unknown/None')
df_clean[['max_glu_serum', 'A1Cresult']] = df_clean[['max_glu_serum', 'A1Cresult']].fillna('Untested')

# Convert categorical integer variables to category dtype
categorical_int_cols = ['admission_type_id', 'discharge_disposition_id', 'admission_source_id']
df_clean[categorical_int_cols] = df_clean[categorical_int_cols].astype('category')

# Drop unnecessary columns
df_clean.drop(columns=['encounter_id', 'examide', 'citoglipton', 'weight', 'patient_nbr'], inplace=True)

# Define ordinal category orders
category_orders = {
    'readmitted': ['<30', '>30', 'NO'],
    'max_glu_serum': ['Untested', 'Norm', '>200', '>300'],
    'A1Cresult': ['Untested', 'Norm', '>7', '>8'],
    'age': ['[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)',
            '[50-60)', '[60-70)', '[70-80)', '[80-90)', '[90-100)']
}

# Convert ordinal variables
for col, order in category_orders.items():
    df_clean[col] = pd.Categorical(df_clean[col], categories=order, ordered=True)

# Convert drug variables to ordinal categories
drug_order = ['No', 'Down', 'Steady', 'Up']
drug_cols = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 
                'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'tolazamide', 
                'pioglitazone', 'rosiglitazone', 'troglitazone', 'acarbose', 'miglitol', 
                'insulin', 'glyburide-metformin', 'glipizide-metformin',
                'metformin-rosiglitazone', 'metformin-pioglitazone', 'glimepiride-pioglitazone']
for col in drug_cols:
    df_clean[col] = pd.Categorical(df_clean[col], categories=drug_order, ordered=True)

# Preprocess diag_1, diag_2, diag_3 combining all codes with decimals under their integer values
for col in ['diag_1', 'diag_2', 'diag_3']:
    df_clean[col] = df_clean[col].str.split('.').str[0]  # Drop decimals and digits after

df_clean.info()


#### Feature Engineering: Encoding and Scaling

In [None]:
# Extract response variable
y = df_clean['readmitted']
X = df_clean.drop(columns=['readmitted'])

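# Make a binary version of the response variable: 1 = readmitted within 30 days, 0 = otherwise.
# Integer labels let scoring='recall' treat 1 (readmitted) as the positive class;
# with 'Yes'/'No' strings the default recall scorer cannot find its positive label.
y_binary = np.where(y == '<30', 1, 0)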


# One-Hot Encoding categorical variables
categorical_cols = X.select_dtypes(include=['object', 'category']).columns
X_encoded = pd.get_dummies(X, columns=categorical_cols, drop_first=True)  # drop_first for multicollinearity issues - log reg

# Standardize numerical features
num_cols = X_encoded.select_dtypes(include=['int64', 'float64']).columns
scaler = StandardScaler()
X_encoded[num_cols] = scaler.fit_transform(X_encoded[num_cols])


#### Feature Selection

In [None]:
# Extract numerical features before one-hot encoding
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

## Feature Selection from RF Variable Importance (Done in EDA)
rf_features_js = ['num_lab_procedures', 'diag_1', 'diag_2', 'diag_3', 'num_medications', 'time_in_hospital', 'age', 
                  'number_inpatient', 'medical_specialty', 'discharge_disposition_id', 'payer_code', 'num_procedures', 
                  'number_diagnoses', 'admission_type_id', 'admission_source_id']
rf_features_kh = ['num_lab_procedures', 'num_medications', 'time_in_hospital', 'number_inpatient', 'number_diagnoses', 
                  'num_procedures', 'number_outpatient', 'number_emergency', 'diag_3', 'gender', 'diag_1', 'medical_specialty', 
                  'diag_2', 'payer_code', 'race', 'discharge_disposition_id']

# Get the union of both feature lists (combined RF-selected features);
# sorting keeps the column order reproducible across runs
rf_features_all = sorted(set(rf_features_js) | set(rf_features_kh))

print(f"Total RF-Selected Features ({len(rf_features_all)}): {rf_features_all}")

# Alternatively, we could take only the common features
# rf_features_common = sorted(set(rf_features_js) & set(rf_features_kh))
# print(f"\nFeatures in both RF lists ({len(rf_features_common)}):\n", rf_features_common)

## Create Reduced Datasets with RF-Selected Features
X_rf_selected = X[rf_features_all]  # Select the relevant features


# # Identify Categorical Features
# rf_features_categorical = list(set(X.select_dtypes(include=['object', 'category']).columns) & set(rf_features_all))
# rf_features_numeric = list(set(rf_features_all) - set(rf_features_categorical))  # Keep numeric features

# # Identify One-Hot Encoded Columns
# one_hot_cols = X_encoded.columns

# # Get All One-Hot Encoded Versions of Categorical Features
# rf_encoded_features = []
# for cat_feat in rf_features_categorical:
#     rf_encoded_features.extend([col for col in one_hot_cols if col.startswith(cat_feat + "_")])

# # Merge Selected Features: Numeric + One-Hot Encoded Categorical
# rf_features_final = rf_features_numeric + rf_encoded_features

# print(f"Final RF-Selected Features Count: {len(rf_features_final)}")

# # Create the Reduced Dataset
# X_rf_selected = X_encoded[rf_features_final]

# # Check the new dataset shape
# print(f"RF-Selected Dataset Shape: {X_rf_selected.shape}")

## Modeling and Evaluation

- Choose and explain the evaluation metrics that you will use (e.g., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measures appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.
- Choose the method you will use for dividing your data into training and testing splits (e.g., are you using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate.
- Create three different classification/regression models (e.g., random forest, KNN, and SVM). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric.
- Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.
- Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods.
- Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.
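
One way to approach the 95% confidence question above is a paired comparison of per-fold scores obtained from the same cross-validation splits. The sketch below is a minimal illustration, not a method prescribed by the assignment: the helper `paired_ci_95` and the fold scores are hypothetical placeholders, and the constant 2.262 is the two-sided 95% Student-t critical value for 9 degrees of freedom (10 folds). Fold scores from k-fold CV are not fully independent, so the interval should be read as approximate.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 9 degrees of freedom (10 folds).
# Use a different critical value if the number of folds is not 10.
T_CRIT_DF9 = 2.262

def paired_ci_95(scores_a, scores_b, t_crit=T_CRIT_DF9):
    """Return (mean_diff, lower, upper) for the mean paired difference a - b."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference: sample std. dev. / sqrt(number of folds).
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    half_width = t_crit * std_err
    return mean_diff, mean_diff - half_width, mean_diff + half_width

# Hypothetical per-fold recall values for two models, for illustration only.
recall_model_a = [0.58, 0.61, 0.57, 0.60, 0.59, 0.62, 0.58, 0.60, 0.61, 0.59]
recall_model_b = [0.52, 0.55, 0.53, 0.54, 0.51, 0.56, 0.52, 0.55, 0.54, 0.53]

mean_diff, lower, upper = paired_ci_95(recall_model_a, recall_model_b)
# The difference is significant at the 95% level only if the interval excludes 0.
significant = lower > 0 or upper < 0
print(f"mean difference: {mean_diff:.3f}, 95% CI: [{lower:.3f}, {upper:.3f}], significant: {significant}")
```

In practice the two score lists would come from `cross_val_score` run with the same `StratifiedKFold` object for both models, so that each pair of scores refers to the same validation fold.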

In [5]:
# Split into train (80%) and holdout test (20%) - Stratified
X_train, X_test, y_train, y_test = train_test_split(X_rf_selected, y_binary, test_size=0.2, stratify=y_binary, random_state=1234)

# Define categorical and numerical feature subsets
categorical_cols = X_rf_selected.select_dtypes(include=['object', 'category']).columns
numerical_cols = X_rf_selected.select_dtypes(include=['int64', 'float64']).columns

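# Column Transformer: scale numerical features, one-hot encode categorical ones.
# handle_unknown='ignore' keeps categories unseen during fitting from raising errors at prediction time.
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_cols),
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_cols)
])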
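
The effect of `stratify=` in the split above can be checked on a small stand-in example. The sketch below uses synthetic labels with an assumed positive rate of about 11% (an illustrative figure, not measured from this dataset); `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels, for illustration only: roughly 11% positives (assumed rate).
rng = np.random.default_rng(1234)
y_demo = (rng.random(10_000) < 0.11).astype(int)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1234
)

# With stratify=..., the positive rate is nearly identical in both splits.
print(f"overall: {y_demo.mean():.4f}, train: {y_tr.mean():.4f}, test: {y_te.mean():.4f}")
```

Without stratification the minority-class share can drift between the training and holdout sets, which makes recall on the holdout set harder to compare with cross-validated recall.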



### Classification Task: Predic

In [None]:
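# Pipelines (clf_lr, clf_nb, and clf_dt are defined in the "SGD Logistic Regression" cell below; run that cell first)
clf_lr_pipeline = Pipeline([('preprocessor', preprocessor), ('clf', clf_lr)])
# MultinomialNB needs non-negative inputs, so every column is one-hot encoded instead of scaled;
# handle_unknown='ignore' avoids errors on categories that appear only in a validation fold
clf_nb_pipeline = Pipeline([('preprocessor', OneHotEncoder(handle_unknown='ignore')), ('clf', clf_nb)])
clf_dt_pipeline = Pipeline([('preprocessor', preprocessor), ('clf', clf_dt)])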

#### Tune Hyperparameters

In [None]:
# Define grid for MultinomialNB
param_grid_nb = {'clf__alpha': np.arange(0.1, 1.1, 0.1)}

# GridSearch for MNB
grid_nb = GridSearchCV(clf_nb_pipeline, param_grid_nb, cv=5, scoring='recall', n_jobs=-1)
grid_nb.fit(X_train, y_train)

print(f"Best alpha for MNB: {grid_nb.best_params_}")

# Define grid for AdaBoost
param_grid_ab = {'clf__n_estimators': [50, 100, 200, 500]}

# GridSearch for AdaBoost
grid_ab = GridSearchCV(clf_dt_pipeline, param_grid_ab, cv=5, scoring='recall', n_jobs=-1)
grid_ab.fit(X_train, y_train)

print(f"Best n_estimators for AdaBoost: {grid_ab.best_params_}")


#### SGD Logistic Regression

In [None]:
%%time
# # Initialize SGD Classifier for Logistic Regression
#     sgd_clf = SGDClassifier(loss="log_loss", penalty="l2", 
#                             alpha=1e-05, eta0=0.01,
#                             max_iter=1000, class_weight="balanced",
#                             learning_rate="adaptive", n_jobs=-1, random_state=1234)

#     # Ensure X_train, X_test, y_train, y_test are NumPy arrays
#     X_train, X_test, y_train, y_test = X_train.values, X_test.values, y_train.values, y_test.values

    
#     # Perform K-Fold Cross-Validation
#     num_folds = 5
#     cv_object = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=1234)

#     cv_accuracies = []
    
#     for train_idx, val_idx in cv_object.split(X_train, y_train):
#         X_train_fold, X_val_fold = X_train[train_idx], X_train[val_idx]
#         y_train_fold, y_val_fold = y_train[train_idx], y_train[val_idx]

#         # Train on training fold
#         sgd_clf.fit(X_train_fold, y_train_fold)

#         # Validate on validation fold
#         y_val_pred = sgd_clf.predict(X_val_fold)
#         acc = mt.accuracy_score(y_val_fold, y_val_pred)
#         cv_accuracies.append(acc)

#     print(f"Cross-Validation Mean Accuracy: {np.mean(cv_accuracies):.3f}")

#     # Train Final Model on Full Training Data
#     sgd_clf.fit(X_train, y_train)

#     # Evaluate on Independent Test Set
#     y_test_pred = sgd_clf.predict(X_test)
#     test_acc = mt.accuracy_score(y_test, y_test_pred)
#     conf_matrix = mt.confusion_matrix(y_test, y_test_pred)

#     print(f"Model Converged in {sgd_clf.n_iter_} Iterations")
#     print(f"Final Model Test Accuracy: {test_acc:.3f}")
#     print(f"Confusion Matrix:\n{conf_matrix}")
#     print(classification_report(y_test, y_test_pred, target_names=['No', 'Yes']))  # binary target: two classes


# Define Classifiers
clf_lr = SGDClassifier(loss="log_loss", penalty="l2", alpha=1e-05, eta0=0.01,
                       max_iter=1000, class_weight="balanced",
                       learning_rate="adaptive", n_jobs=-1, random_state=1234)

clf_nb = MultinomialNB(alpha=0.5)  # Tune alpha via GridSearchCV later
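# Boosted decision stumps (scikit-learn >= 1.2 names this parameter `estimator`; older releases call it `base_estimator`)
clf_dt = AdaBoostClassifier(estimator=DecisionTreeClassifier(criterion='entropy', max_depth=1),
                            n_estimators=500, random_state=1234)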

# Define classifier labels
clf_labels = ['SGD Logistic Regression', 'Multinomial Naive Bayes', 'AdaBoost Decision Tree']

# Define Stratified K-Fold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1234)

# Cross-validation
print('10-fold cross validation (CV):\n')
for clf, label in zip([clf_lr_pipeline, clf_nb_pipeline, clf_dt_pipeline], clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=skf, # Stratified 10-fold CV
                             scoring='recall',  # Prioritize recall
                             n_jobs=-1)  # Use all CPUs
    print(f"CV recall: {scores.mean():.2f} (+/- {scores.std():.2f}) [{label}]")
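
Because the loop above scores by recall, it may help to see how recall relates to precision and F1 in terms of confusion-matrix counts. The sketch below uses hypothetical counts (not results from this dataset) and a hypothetical helper named `binary_metrics`.

```python
def binary_metrics(tp, fp, fn):
    """Compute precision, recall, and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts, for illustration only: 60 readmissions caught,
# 40 readmissions missed, and 140 false alarms.
precision, recall, f1 = binary_metrics(tp=60, fp=140, fn=40)
print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```

A model tuned for recall tends to accept more false alarms, so reporting precision or F1 alongside recall gives a fuller picture than recall or accuracy alone when the positive class is rare.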


In [None]:
# Initialize the three classifiers
clf_lr = SGDClassifier(loss="log_loss", penalty="l2", alpha=1e-05, eta0=0.01,
                       max_iter=1000, class_weight="balanced",
                       learning_rate="adaptive", n_jobs=-1, random_state=1234)
clf_nb = MultinomialNB(alpha=0.5)  # candidate alphas: np.arange(0.1, 1.1, 0.1)
tree = DecisionTreeClassifier(criterion='entropy', max_depth=1)
# scikit-learn >= 1.2 names this parameter `estimator`; older releases call it `base_estimator`
clf_dt = AdaBoostClassifier(estimator=tree, n_estimators=500)  # maybe also tune hyperparameters

# Wrap each classifier with its preprocessing so the raw columns are encoded inside every CV fold
clf_lr_pipeline = Pipeline([('preprocessor', preprocessor), ('clf', clf_lr)])
# MultinomialNB needs non-negative inputs, so every column is one-hot encoded instead of scaled
clf_nb_pipeline = Pipeline([('preprocessor', OneHotEncoder(handle_unknown='ignore')), ('clf', clf_nb)])
clf_dt_pipeline = Pipeline([('preprocessor', preprocessor), ('clf', clf_dt)])

# Define classifier labels
clf_labels = ['SGD Logistic Regression', 'Multinomial Naive Bayes', 'AdaBoost Decision Tree']

# Evaluate each classifier with 10-fold cross validation on the training data.
# This step does not search for the best hyperparameter values (that is the job of GridSearchCV from sklearn.model_selection);
# it estimates performance for a single, fixed set of hyperparameter values.
print('10-fold cross validation (CV):\n')
for clf, label in zip([clf_lr_pipeline, clf_nb_pipeline, clf_dt_pipeline], clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             n_jobs=1)  # n_jobs = the number of CPUs to use; set to -1 to use all
    # cross_val_score() returns one score per fold (accuracy by default for classifiers)
    print("CV accuracy: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), label))

## Deployment

- How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it were used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?

## Exceptional Work

- You have free rein to provide additional modeling.
- One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?
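
The grid-search idea above could be sketched as follows. This is a minimal illustration on synthetic stand-in data (not the diabetes dataset); the parameter grid, the names `X_demo`, `y_demo`, and `score_table`, and the choice of AdaBoost are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in data (illustration only).
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.85, 0.15], random_state=1234)

# Hypothetical grid: two hyperparameters, six combinations in total.
param_grid = {"n_estimators": [10, 25, 50], "learning_rate": [0.5, 1.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)

# n_jobs=-1 evaluates the parameter combinations in parallel across CPU cores.
grid = GridSearchCV(AdaBoostClassifier(random_state=1234), param_grid,
                    cv=cv, scoring="recall", n_jobs=-1)
grid.fit(X_demo, y_demo)

# Arrange mean cross-validated recall as an n_estimators x learning_rate table.
results = pd.DataFrame(grid.cv_results_)
score_table = results.pivot_table(index="param_n_estimators",
                                  columns="param_learning_rate",
                                  values="mean_test_score")
print(score_table.round(3))
```

A table like `score_table` can be passed to `sns.heatmap(score_table, annot=True)` to visualize which parameter moves the score the most; a row or column along which the values barely change suggests that parameter matters little over the searched range.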