<a href="https://colab.research.google.com/github/mehdimerbah/CompDrugDiscovery/blob/main/models/ClassificationModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Libraries and Data Import

In [15]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc
import json


data = pd.read_csv('https://raw.githubusercontent.com/mehdimerbah/CompDrugDiscovery/main/data/classification_model_data.csv')
data.head(5)

Unnamed: 0,SubFP1,SubFP2,SubFP3,SubFP4,SubFP5,SubFP6,SubFP7,SubFP8,SubFP9,SubFP10,...,SubFP299,SubFP300,SubFP301,SubFP302,SubFP303,SubFP304,SubFP305,SubFP306,SubFP307,activity_class
0,0,0,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,active
1,1,0,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,inactive
2,0,0,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,active
3,0,0,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,active
4,0,0,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,active


In [3]:
features = data.drop(columns = ['activity_class'])
targets = data.activity_class

In [5]:
def remove_low_variance(input_data, threshold=0.1):
    selection = VarianceThreshold(threshold)
    selection.fit(input_data)
    return input_data[input_data.columns[selection.get_support(indices=True)]]


features = remove_low_variance(features, threshold=0.1)
features.head(5)

Unnamed: 0,SubFP1,SubFP2,SubFP3,SubFP20,SubFP38,SubFP49,SubFP85,SubFP88,SubFP96,SubFP100,SubFP135,SubFP137,SubFP171,SubFP181,SubFP182,SubFP183,SubFP184,SubFP279,SubFP280,SubFP287
0,0,0,0,0,0,1,0,1,0,0,1,1,0,0,0,0,0,0,0,1
1,1,0,0,0,0,1,0,1,0,0,1,1,0,1,1,0,1,0,0,1
2,0,0,0,0,0,1,0,1,0,0,1,1,0,0,1,0,1,0,0,1
3,0,0,0,0,0,1,0,1,0,0,1,1,0,0,1,0,1,0,0,1
4,0,0,0,0,0,1,0,1,0,0,1,1,0,0,0,1,1,0,0,1


# Models Building and Training

In [7]:
def get_metrics(predicted,true):
    metrics = dict()
    metrics['accuracy'] = round(accuracy_score(predicted, true), 5)
    metrics['precision'] = round(precision_score(predicted, true, average = 'weighted'), 5)
    metrics['recall'] = round(recall_score(predicted, true, average = 'weighted'), 5)
    metrics['f1'] = round(f1_score(predicted, true, average = 'weighted'), 5)
    
    return metrics

## Random Forest Classifier
Splitting the Data into training and validation sets. 80% training and 20% validation should be a good split in this case.

In [6]:
X_training_set, X_validation_set, y_training_set, y_validation_set = train_test_split(features, targets, test_size=0.2, random_state=42)

Create Random Forest Classifier and fit it to the training data.

In [8]:
RF_model = RandomForestClassifier(n_estimators=500, random_state=42)
RF_model.fit(X_training_set, y_training_set)

RandomForestClassifier(n_estimators=500, random_state=42)

Make some predictions on training and validation data to tests for Mathews Correlation values.

In [10]:
y_training_pred = RF_model.predict(X_training_set)
y_validation_pred = RF_model.predict(X_validation_set)

In [11]:
RF_mcc_test = matthews_corrcoef(y_validation_set, y_validation_pred)
RF_mcc_test

0.7774957785358391

Calculate alternative accuracy metrics. Accuracy, Precision, Recall, and F1-Score.

In [21]:
RF_metrics = pd.DataFrame([get_metrics(y_validation_pred, y_validation_set)])
RF_metrics

Unnamed: 0,accuracy,precision,recall,f1
0,0.88889,0.89226,0.88889,0.88935


## Support Vector Machine Classifier

Initialize Support Vector Machine Classifier (SVC) as Linear SVC.
Fit SVM to training data.

In [16]:
SVM_classifier = LinearSVC()
SVM_classifier.fit(X_training_set, y_training_set)

LinearSVC()

Make some predictions with SVM classifier.

In [20]:
y_SVM_pred = SVM_classifier.predict(X_validation_set)
SVM_metrics = pd.DataFrame([get_metrics(y_SVM_pred, y_validation_set)])
SVM_metrics

Unnamed: 0,accuracy,precision,recall,f1
0,0.90741,0.91438,0.90741,0.90809
