<a href="https://colab.research.google.com/github/Ananya-AJ/CMPE255-SafeDose/blob/main/Model_Abuse.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



---
**This notebook presents the models used to identify the types of substances abused in a medical case reported to the ED. 
It is a multilabel classification problem as there can be more than one substance abuses occuring in a single case.
The best performing model is trained on the train set and evaluated on the test set using f1 and recall metrics.**


---


In [2]:
!pip install multilabel_knn
!pip install evaluations

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting evaluations
  Downloading evaluations-0.0.5-py3-none-any.whl (13 kB)
Installing collected packages: evaluations
Successfully installed evaluations-0.0.5


In [14]:
# Import libraries
from google.colab import drive

import pandas as pd
import numpy as np
import lzma
import pickle

from sklearn.metrics import f1_score, accuracy_score, recall_score

from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
import multilabel_knn as mlk

In [4]:
drive.mount('/content/drive/')

Mounted at /content/drive/


### Import data

In [5]:
# Read train and test data
abuse_train = pd.read_csv('/content/drive/Shareddrives/CMPE255/data/dimensionality_reduction/X_train_abuse.csv')
abuse_test = pd.read_csv('/content/drive/Shareddrives/CMPE255/data/dimensionality_reduction/X_test_abuse.csv')

In [6]:
# Drop index column
abuse_x_train = abuse_train.iloc[:, 1:]
abuse_x_test = abuse_train.iloc[:, 1:]

### Create train and test sets

In [7]:
# Create X and y from train and test set
x_train_data = abuse_x_train.drop(['ALLABUSE','NONALCILL','ALCOHOL','NONMEDPHARMA','PHARMA'], axis = 1)
y_train_labels = abuse_x_train[['ALLABUSE','NONALCILL','ALCOHOL','NONMEDPHARMA','PHARMA']]

x_test_data = abuse_x_test.drop(['ALLABUSE','NONALCILL','ALCOHOL','NONMEDPHARMA','PHARMA'], axis = 1)
y_test_labels = abuse_x_test[['ALLABUSE','NONALCILL','ALCOHOL','NONMEDPHARMA','PHARMA']]

### Random Forest Classifier

In [8]:
# RandomForestClassifier for multilabl classification
forest = RandomForestClassifier(n_estimators = 30, random_state = 1)
multi_target_forest = MultiOutputClassifier(forest, n_jobs = 10)

# Fit on training set
multi_target_forest.fit(x_train_data, y_train_labels)

# Predcit in test set
predictions = multi_target_forest.predict(x_test_data)

In [None]:
# Save model
pickle.dump(multi_target_forest, open('/content/drive/Shareddrives/CMPE255/pickles/randomclassifier.pkl', 'wb'))

# Compress model pickle
with lzma.open("/content/drive/Shareddrives/CMPE255/pickles/compressed_randomclassifier_pickle.xz", "wb") as f:
    pickle.dump(multi_target_forest, f)

In [9]:
# Calculate metrics on test set
accuracy = accuracy_score(y_test_labels, np.array(predictions))
f1 = f1_score(y_test_labels, predictions, average = 'macro')
recall = recall_score(y_test_labels, predictions, average = 'macro')

In [10]:
print('F1 score = ', f1)
print('Recall = ', recall)

F1 score =  0.9999547198732006
Recall =  0.9999160317933239


## K Neighbors Classifier

In [11]:
# Convert dataframes to numpy arrays for input to multilabel_knn
x_train_data_np = np.array(x_train_data).copy(order = 'C')
y_train_labels_np = np.array(y_train_labels).copy(order = 'C')

x_test_data_np = np.array(x_test_data).copy(order = 'C')
y_test_labels_np = np.array(y_test_labels).copy(order = 'C')

In [12]:
# Multilabel KNN
model = mlk.multilabel_kNN(k = 10, metric = 'cosine')
model.fit(x_train_data_np, y_train_labels_np)

# Predict probabilities of labels on test set
Y_prob = model.predict(x_test_data_np, return_prob = True)

# Get labels on test set
Y_pred = model.predict(x_test_data_np)

# Calculate f1 score on test set
f1 = mlk.evaluation.macro_f1score(y_test_labels_np, Y_pred)

In [13]:
print('f1 score = ', f1)

f1 score =  0.9999547198732006


From the above two models, RandomClassifier gives the best performance in terms of f1 score and recall as compared to KneighboursClassifier. Hence, going forward we will use RandomForest to make identify abuse labels on user input data.