<a href="https://colab.research.google.com/github/Ananya-AJ/CMPE255-SafeDose/blob/main/Models_Casetype.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



---

**This colab implements multiclass classification algorithms to determine the type of case for records that are marked as 'Others' in the original dataset. The aim is to overcome the limitations of manually determining the type of case when the documentation of substance abuse is incomplete or inexplicit. Several algorithms are tried and compared on a validation set, and the RandomForestClassifier is finally selected for predicting on the test set.**


---


In [1]:
# Import libraries
from google.colab import drive
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import classification_report, precision_score, f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

In [2]:
drive.mount('/content/drive')

Mounted at /content/drive


### Import data

In [None]:
# Read train and test data
c_train = pd.read_csv('/content/drive/Shareddrives/CMPE255/data/pca/X_train_mca_casetype.csv')
c_test = pd.read_csv('/content/drive/Shareddrives/CMPE255/data/pca/X_test_mca_casetype.csv')
c_train = c_train.iloc[:, 1:]
c_test = c_test.iloc[:, 1:]

columns = ['CASETYPE_1', 'CASETYPE_2', 'CASETYPE_3', 'CASETYPE_4', 'CASETYPE_5', 'CASETYPE_6', 'CASETYPE_7']

# Get X and y from train and test
c_X_train = c_train.drop(columns, axis = 1)
c_y_train = c_train[columns]

X_test = c_test.drop(['CASETYPE_8'], axis = 1)

# Reverse one hot encoding on casetype: idxmax returns the name of the column holding the 1.
# Building a new one-column frame avoids assigning into a slice of c_train.
c_y_train = c_y_train[columns].idxmax(axis = 1).to_frame('target')

In [None]:
# Check the target class counts
c_y_train.target.value_counts()

CASETYPE_4    85777
CASETYPE_5    16810
CASETYPE_2    13529
CASETYPE_1     7872
CASETYPE_3     7421
CASETYPE_7     3125
CASETYPE_6      768
Name: target, dtype: int64

We apply SMOTE (Synthetic Minority Oversampling Technique) because the CASETYPE classes are heavily imbalanced. Oversampling brings every CASETYPE to a near-equal number of records, which prevents bias towards the majority class and reduces the chance of poor classification of the minority classes.
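To make the idea concrete, here is a minimal pure-Python sketch of the interpolation step behind SMOTE-style oversampling. It is not the imbalanced-learn implementation: the `minority` rows, the helper names and `k = 2` are hypothetical choices made only for illustration.

```python
import random

def nearest_neighbors(point, others, k):
    """Return the k rows of `others` closest to `point` (squared Euclidean distance)."""
    def sq_dist(other):
        return sum((a - b) ** 2 for a, b in zip(point, other))
    return sorted(others, key=sq_dist)[:k]

def interpolate(point, neighbor, rng):
    """Return a synthetic row on the segment between `point` and `neighbor`."""
    gap = rng.random()  # random fraction in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)

# Hypothetical minority-class rows with two features each
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]

synthetic = []
for _ in range(6):
    base = rng.choice(minority)
    others = [row for row in minority if row is not base]
    neighbor = rng.choice(nearest_neighbors(base, others, k=2))
    synthetic.append(interpolate(base, neighbor, rng))

print(synthetic)
```

Each synthetic row lies between two real minority rows. One consequence worth keeping in mind is that a validation set drawn after oversampling can contain rows interpolated from training rows, so validation scores may be somewhat optimistic.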

### Oversample using SMOTE

In [None]:
# SMOTE
sm = SMOTE(random_state = 42)
X_smote, y_smote = sm.fit_resample(c_X_train, c_y_train)

In [None]:
# Split train into train and validation set
X_train, X_val, y_train, y_val = train_test_split(X_smote, y_smote, test_size = 0.2, shuffle = True, stratify = y_smote, random_state = 42)

In [None]:
print('Shape before oversampling')
print('X_train = ', c_X_train.shape)

print('Shape after oversampling and splitting into train and validation sets')
print('X_train = ', X_train.shape)
print('X_val = ', X_val.shape)

Shape before oversampling
X_train =  (135302, 2)
Shape after oversampling and splitting into train and validation sets
X_train =  (480351, 2)
X_val =  (120088, 2)


In [None]:
y_train.target.value_counts()

CASETYPE_1    68622
CASETYPE_6    68622
CASETYPE_3    68622
CASETYPE_2    68622
CASETYPE_4    68621
CASETYPE_5    68621
CASETYPE_7    68621
Name: target, dtype: int64

The target classes in the training set are now balanced.
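The printed shapes and class counts can be checked with a few lines of arithmetic. This sketch assumes SMOTE's default behaviour of raising every class to the majority count (85777 for CASETYPE_4) and a validation size that is rounded up to a whole number of rows; the variable names are illustrative.

```python
import math

majority_count = 85777   # CASETYPE_4 count before oversampling
n_classes = 7
test_size = 0.2

# After oversampling, every class is assumed to match the majority count
total = majority_count * n_classes

# 20% goes to validation (rounded up), the rest stays in train
n_val = math.ceil(total * test_size)
n_train = total - n_val

# A stratified split spreads the rows as evenly as possible over the classes
train_per_class, train_extra = divmod(n_train, n_classes)
val_per_class, val_extra = divmod(n_val, n_classes)

print(total, n_train, n_val)
print(train_per_class, train_extra)
print(val_per_class, val_extra)
```

Under these assumptions the totals agree with the shapes printed above (480351 and 120088 rows), and the remainders explain why some classes hold one more row than others.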

In [None]:
# Create dictionary to store the precision and f1 score obtained from every model
models = ['NaiveBayes', 'LogisticRegression', 'LightGBM', 'RandomForest', 'KneighborsClassifier']
model_performance = {k:{} for k in models}

### Naive Bayes

In [None]:
# Naive Bayes Classifier
nb = GaussianNB()
nb.fit(X_train, y_train)

# Predict on validation set
y_pred_val = nb.predict(X_val)

# Calculate precision and f1 score on validation set
f1 = f1_score(y_val, y_pred_val, average = 'macro')
precision = precision_score(y_val, y_pred_val, average = 'macro')

# Save metrics to dictionary
model_performance['NaiveBayes'] = {'prec':precision, 'f1':f1}

### Logistic Regression

In [None]:
# Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Predict on validation set
y_pred_val = lr.predict(X_val)

# Calculate precision and f1 score on validation set
f1 = f1_score(y_val, y_pred_val, average = 'macro')
precision = precision_score(y_val, y_pred_val, average = 'macro')

# Save metrics to dictionary
model_performance['LogisticRegression'] = {'prec':precision, 'f1':f1}

### KNN

In [None]:
# KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 4)
knn.fit(X_train, y_train)

# Predict on validation set
y_pred_val = knn.predict(X_val)

# Calculate precision and f1 score on validation set
f1 = f1_score(y_val, y_pred_val, average = 'macro')
precision = precision_score(y_val, y_pred_val, average = 'macro')

# Save metrics in dictionary
model_performance['KneighborsClassifier'] = {'prec':precision, 'f1':f1}

### Light GBM

In [None]:
# LightGBM
gbc = LGBMClassifier(n_estimators = 100, learning_rate = 0.01, max_depth = 5, random_state = 11)
gbc.fit(X_train, y_train.values.ravel())

# Predict on validation
y_pred_val = gbc.predict(X_val)

# Calculate precision and f1 score on validation set
precision = precision_score(y_val, y_pred_val, average = 'macro')
f1 = f1_score(y_val, y_pred_val, average = 'macro')

# Store metrics in dict
model_performance['LightGBM'] = {'prec':precision, 'f1':f1}

### Random Forest

In [None]:
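# Random Forest Classifier
forest = RandomForestClassifier(n_estimators = 50, min_samples_leaf = 1, min_samples_split = 2, random_state = 1)
# y_train is a one-column DataFrame, so MultiOutputClassifier fits a single forest for the 'target' column
multi_target_forest = MultiOutputClassifier(forest, n_jobs = 2)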

# Fit on train set
multi_target_forest.fit(X_train, y_train)

# Predict on validation
y_pred_val = multi_target_forest.predict(X_val)

# Calculate precision and f1 score on validation set
f1 = f1_score(y_val, y_pred_val, average = 'macro')
precision = precision_score(y_val, y_pred_val, average = 'macro')

# Store in dict
model_performance['RandomForest'] = {'prec':precision, 'f1':f1}

### Choose best performing model

In predicting casetypes, precision is an important metric because the type of case determines the next steps of treatment for the patient. Precision tells us, out of the cases predicted as a particular casetype, how many actually belong to that casetype. Recall should also be considered: it tells us, out of the actual cases of each casetype, how many were correctly identified. However, precision is more significant in our case, so precision is the primary metric and the F1 score, which combines precision and recall, is the secondary metric.
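As an illustration of what `average = 'macro'` computes, the following sketch derives per-class precision, recall and F1 from scratch and then averages them with equal weight per class. The labels `A`, `B`, `C` and both lists are hypothetical toy data, not values from the dataset.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)   # rows predicted as this label
        actual = sum(t == label for t in y_true)      # rows that truly are this label
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores

def macro_average(scores, index):
    """Unweighted mean of one metric (0 = precision, 1 = recall, 2 = f1) over classes."""
    return sum(s[index] for s in scores.values()) / len(scores)

# Hypothetical toy labels
y_true = ['A', 'A', 'A', 'B', 'B', 'C']
y_pred = ['A', 'A', 'B', 'B', 'C', 'C']

scores = per_class_scores(y_true, y_pred)
macro_precision = macro_average(scores, 0)
macro_recall = macro_average(scores, 1)
macro_f1 = macro_average(scores, 2)

print(scores)
print(macro_precision, macro_recall, macro_f1)
```

Because every class contributes equally to a macro average, a rare casetype counts as much as a common one, which suits a setting where each casetype leads to a different treatment path.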

In [None]:
# Determine the best performing model from the model_performance dictionary
model_performance

{'NaiveBayes': {'prec': 0.4694479031174916, 'f1': 0.4184357830207569},
 'LogisticRegression': {'prec': 0.39684318366591825, 'f1': 0.3475446677797421},
 'LightGBM': {'prec': 0.5948406498947373, 'f1': 0.5787964770160663},
 'RandomForest': {'prec': 0.8123381034762271, 'f1': 0.812707916098966},
 'KneighborsClassifier': {'prec': 0.7964366269670303,
  'f1': 0.7954894780225236}}
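The selection can also be made programmatically rather than by reading the dictionary. This is a small sketch that ranks the models by precision first and F1 second, using the scores printed above; the variable names `ranking` and `best_model` are illustrative.

```python
# Scores copied from the output above
model_performance = {
    'NaiveBayes': {'prec': 0.4694479031174916, 'f1': 0.4184357830207569},
    'LogisticRegression': {'prec': 0.39684318366591825, 'f1': 0.3475446677797421},
    'LightGBM': {'prec': 0.5948406498947373, 'f1': 0.5787964770160663},
    'RandomForest': {'prec': 0.8123381034762271, 'f1': 0.812707916098966},
    'KneighborsClassifier': {'prec': 0.7964366269670303, 'f1': 0.7954894780225236},
}

# Sort by (precision, f1) so that f1 only breaks ties in precision
ranking = sorted(
    model_performance,
    key=lambda name: (model_performance[name]['prec'], model_performance[name]['f1']),
    reverse=True,
)
best_model = ranking[0]

print(ranking)
print(best_model)
```

With these scores the ranking puts RandomForest first and KneighborsClassifier second, in line with the choice made in this notebook.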

### Prediction on validation and test sets using the chosen model, and metrics

In [None]:
# Predict on validation
y_pred_val = multi_target_forest.predict(X_val)

# Prediction on test set
y_pred_test = multi_target_forest.predict(X_test)

In [None]:
# Create dataframes for predicted casetype and actual casetype
predictions_test = pd.DataFrame(list(y_pred_test), columns = ['PRED_CASETYPE'])
predictions_val = pd.DataFrame(list(y_pred_val), columns = ['PRED_CASETYPE'])

casetypes = {
    'CASETYPE_1' : 'Suicide Attempt',
    'CASETYPE_2' : 'Seeking Detox',
    'CASETYPE_3' : 'Alcohol consumed below 21 years',
    'CASETYPE_4' : 'Adverse Reaction',
    'CASETYPE_5' : 'Overmedication',
    'CASETYPE_6' : 'Malicious Poisoning',
    'CASETYPE_7': 'Accidental Ingestion'
}

# Map casetype number to casetype name
predictions_test['PRED_CASETYPE'] = predictions_test['PRED_CASETYPE'].map(casetypes)
predictions_val['PRED_CASETYPE'] = predictions_val['PRED_CASETYPE'].map(casetypes)

y_val['target'] = y_val['target'].map(casetypes)

In [None]:
predictions_test.value_counts()

PRED_CASETYPE       
Seeking Detox           42144
Suicide Attempt         13380
Malicious Poisoning     12038
Overmedication           6730
Adverse Reaction         6563
Accidental Ingestion     2793
dtype: int64

This is the classification of the casetypes for the 'Others' category. Since the true casetypes of these records are unavailable, we look at the precision and f1 score on the validation set.

In [None]:
# Classification report
print(classification_report(y_val, predictions_val))

                                 precision    recall  f1-score   support

           Accidental Ingestion       0.80      0.83      0.81     17156
               Adverse Reaction       0.70      0.67      0.69     17156
Alcohol consumed below 21 years       1.00      1.00      1.00     17155
            Malicious Poisoning       0.91      0.93      0.92     17155
                 Overmedication       0.68      0.68      0.68     17156
                  Seeking Detox       0.87      0.87      0.87     17155
                Suicide Attempt       0.72      0.72      0.72     17155

                       accuracy                           0.81    120088
                      macro avg       0.81      0.81      0.81    120088
                   weighted avg       0.81      0.81      0.81    120088
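Since precision is the primary metric, it can help to see which casetypes pull the macro average down. The sketch below uses the rounded per-class precision values from the report above; the variable names are illustrative.

```python
# Per-class precision copied (rounded) from the classification report
precision_by_class = {
    'Accidental Ingestion': 0.80,
    'Adverse Reaction': 0.70,
    'Alcohol consumed below 21 years': 1.00,
    'Malicious Poisoning': 0.91,
    'Overmedication': 0.68,
    'Seeking Detox': 0.87,
    'Suicide Attempt': 0.72,
}

# Macro average: unweighted mean over the seven classes
macro_precision = sum(precision_by_class.values()) / len(precision_by_class)

# Classes whose precision falls below the macro average
below_average = sorted(
    name for name, value in precision_by_class.items() if value < macro_precision
)

print(round(macro_precision, 2))
print(below_average)
```

The supports are nearly identical across classes, which is why the macro and weighted averages in the report coincide.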

