## Gene Expression RNA-Seq Classification 
 
#### Classification Models
I will be working with subsets of my original normalized dataframe that contain only genes whose correlation meets a given threshold.

The three correlation thresholds I will work with are:
- 0.262105
- 0.50
- 0.75

I will use these datasets first to explore different models, and then readjust them if needed to test out other models.
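The filtering step itself is not shown in this notebook. As a rough sketch, assuming the threshold is applied to the absolute Pearson correlation between each gene and the numerically encoded tumor label, it could look like the following. The function name `filter_by_correlation`, the column name `Class`, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def filter_by_correlation(df, target_col, threshold):
    """Keep gene columns whose absolute Pearson correlation with the target meets the threshold."""
    features = df.drop(columns=[target_col])
    corr = features.corrwith(df[target_col]).abs()
    keep = corr[corr >= threshold].index.tolist()
    return df[keep + [target_col]]

# Toy data (not the real dataset): gene_a tracks the label closely, gene_b is pure noise
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=200)
toy = pd.DataFrame({
    "gene_a": labels + rng.normal(0, 0.1, size=200),
    "gene_b": rng.normal(0, 1, size=200),
    "Class": labels,
})

subset = filter_by_correlation(toy, "Class", threshold=0.50)
print(subset.columns.tolist())
```

A higher threshold keeps fewer genes, so the 0.75 subset would be the smallest of the three under this assumption.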

There are different types of classification models:
- Binary
- Multi-Class
- Multi-Label
- Imbalanced Classifications

My dataset is primarily a Multi-Class Classification problem: each sample is assigned to one of several tumor types. I aim to determine the most suitable algorithm among the available Multi-Class options.

Beyond Multi-Class Classification, I would also like to test the dataset with a Multi-Label algorithm. Because my dataset pairs gene expression levels with tumor types, I am particularly interested in how the model distributes percentages across the tumor types for each sample. This could show which tumor types the model finds hard to tell apart from their gene expression profiles.
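A note on the percentages mentioned above: in scikit-learn, a Multi-Class model such as `RandomForestClassifier` already exposes per-class probabilities through `predict_proba`, whereas a Multi-Label setup is intended for samples that can carry several labels at once. The sketch below uses synthetic data from `make_classification` as a stand-in for the expression matrix (an assumption, not the notebook's data) to show what those per-class percentages look like.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the expression matrix: 5 classes, 20 features
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1
proba = clf.predict_proba(X_test)
for class_label, p in zip(clf.classes_, proba[0]):
    print(f"Class {class_label}: {p:.1%}")
```

The columns of `proba` follow the order of `clf.classes_`, so the percentages can be mapped back to tumor types once the label encoding is known.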

In [1]:
# Import packages 
import pandas as pd

# Import helper functions (split_train_test_data, hyperparameter_grid_search)
from functions import *

# Train and Test package
from sklearn.model_selection import train_test_split

# Modelling
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report, precision_score,
    recall_score, f1_score, roc_auc_score, matthews_corrcoef, ConfusionMatrixDisplay,
)

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Tree Visualisation
from sklearn.tree import export_graphviz
from IPython.display import Image
# import graphviz

### Retrieving our datasets and splitting them into train, test, and validation sets

In [2]:
# Load the three correlation-filtered datasets
subset_data_1 = pd.read_csv("/Users/kim/Desktop/repos/RNA-Seq_GeneExpression_Model/Datasets/min-max_threshold_df_0.26.csv")
subset_data_2 = pd.read_csv("/Users/kim/Desktop/repos/RNA-Seq_GeneExpression_Model/Datasets/min-max_threshold_df_0.50.csv")
subset_data_3 = pd.read_csv("/Users/kim/Desktop/repos/RNA-Seq_GeneExpression_Model/Datasets/min-max_threshold_df_0.75.csv")

In [3]:
# Split each of the 3 dataframes into train, test, and validation sets of features (x) and target (y)
x_train_1, x_test_1, x_val_1, y_train_1, y_test_1, y_val_1 = split_train_test_data(subset_data_1)
x_train_2, x_test_2, x_val_2, y_train_2, y_test_2, y_val_2 = split_train_test_data(subset_data_2)
x_train_3, x_test_3, x_val_3, y_train_3, y_test_3, y_val_3 = split_train_test_data(subset_data_3)
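`split_train_test_data` comes from the local `functions` module and is not shown here. A minimal sketch of what such a helper might look like is given below, assuming a 60/20/20 stratified split and a target column named `Class`. Both are assumptions, and only the order of the returned values is taken from the calls above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_train_test_data(df, target_col="Class", test_size=0.2, val_size=0.2, random_state=42):
    """Hypothetical reimplementation: split a dataframe into train, test, and validation sets."""
    x = df.drop(columns=[target_col])
    y = df[target_col]

    # First carve off the test set, stratifying so rare classes stay represented
    x_rest, x_test, y_rest, y_test = train_test_split(
        x, y, test_size=test_size, stratify=y, random_state=random_state)

    # Then split the remainder into train and validation sets
    rel_val = val_size / (1.0 - test_size)
    x_train, x_val, y_train, y_val = train_test_split(
        x_rest, y_rest, test_size=rel_val, stratify=y_rest, random_state=random_state)

    return x_train, x_test, x_val, y_train, y_test, y_val

# Toy dataframe (not the real data): 100 samples, 5 balanced classes
toy = pd.DataFrame({
    "gene_a": range(100),
    "gene_b": range(100, 200),
    "Class": [i % 5 for i in range(100)],
})
x_train, x_test, x_val, y_train, y_test, y_val = split_train_test_data(toy)
print(len(x_train), len(x_test), len(x_val))
```

Stratifying on the target matters for this kind of data because some tumor classes have very few samples, and an unstratified split could leave one of them out of a set entirely.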

### 1. Random Forest


### Hyperparameter Tuning
Find the best hyperparameters for the Random Forest algorithm.

In [4]:
# Use GridSearchCV for hyperparameter tuning 
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

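# Note: this search runs on the validation split of dataset 1,
# whereas the commented-out searches below use the training splits
best_params1 = hyperparameter_grid_search(x_val_1, y_val_1, param_grid)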
# best_params2 = hyperparameter_grid_search(x_train_2, y_train_2, param_grid)
# best_params3 = hyperparameter_grid_search(x_train_3, y_train_3, param_grid)
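`hyperparameter_grid_search` is also defined in the local `functions` module. A plausible sketch, assuming it wraps `GridSearchCV` around a `RandomForestClassifier` and returns `best_params_`, is shown below. The `cv`, `scoring`, and `random_state` arguments are assumptions, and the example uses synthetic data with a reduced grid so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def hyperparameter_grid_search(x, y, param_grid, cv=3, random_state=42):
    """Hypothetical sketch: return the best Random Forest parameters from a cross-validated grid search."""
    search = GridSearchCV(
        estimator=RandomForestClassifier(random_state=random_state),
        param_grid=param_grid,
        cv=cv,
        scoring="accuracy",
    )
    search.fit(x, y)
    return search.best_params_

# Small synthetic example (not the real data) with a reduced grid
X, y = make_classification(n_samples=150, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
small_grid = {"n_estimators": [10, 25], "max_depth": [None, 5]}
best_params = hyperparameter_grid_search(X, y, small_grid)
print(best_params)
```

The grid defined in the cell above has 3 × 3 × 3 × 3 = 81 combinations, so with 3-fold cross-validation this sketch would fit 243 forests. `RandomizedSearchCV`, which is already imported, samples a fixed number of combinations instead and may be worth considering if the search is slow.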

#### Model 1

In [5]:
# Initialize the RandomForestClassifier with the best parameters
# (**best_params1 unpacks the dictionary of hyperparameter values)
rf_model1 = RandomForestClassifier(**best_params1, random_state=42)

# Train the model 
rf_model1.fit(x_train_1, y_train_1)

# Test the model by making predictions on the test data
y_pred1 = rf_model1.predict(x_test_1)
y_pred1

array([0, 4, 2, 0, 4, 0, 2, 0, 0, 0, 4, 0, 2, 4, 0, 2, 0, 2, 4, 4, 2, 2,
       0, 4, 3, 2, 4, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 4, 0, 3, 1, 2, 0,
       4, 2, 4, 4, 3, 0, 3, 0, 0, 2, 0, 0, 0, 4, 2, 4, 3, 0, 0, 2, 0, 2,
       0, 3, 0, 1, 0, 0, 0, 0, 0, 4, 4, 4, 0, 3, 0, 3, 0, 1, 4, 1, 0, 0,
       2, 0, 0, 0, 1, 3, 2, 2, 2, 2])

### Evaluate our model

In [6]:
# Check Accuracy
accuracy1 = accuracy_score(y_test_1, y_pred1)
print(f"Accuracy of Model1: {accuracy1:.2f}")

Accuracy of Model1: 1.00


In [7]:
# Confusion matrix of true labels (rows) vs. predicted labels (columns)
conf_matrix = confusion_matrix(y_test_1, y_pred1)
print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[45  0  0  0  0]
 [ 0  5  0  0  0]
 [ 0  0 19  0  0]
 [ 0  0  0  9  0]
 [ 0  0  0  0 20]]


In [8]:
# Classification Report
print("Classification Report:")
print(classification_report(y_test_1, y_pred1))

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        45
           1       1.00      1.00      1.00         5
           2       1.00      1.00      1.00        19
           3       1.00      1.00      1.00         9
           4       1.00      1.00      1.00        20

    accuracy                           1.00        98
   macro avg       1.00      1.00      1.00        98
weighted avg       1.00      1.00      1.00        98



In [9]:
# Predict class probabilities on the test set
y_pred_proba1 = rf_model1.predict_proba(x_test_1)

# Calculate AUC-ROC for each class
class_auc_roc = {}
for i in range(y_pred_proba1.shape[1]):
    y_true_class = (y_test_1 == i).astype(int)
    y_probs_class = y_pred_proba1[:, i]
    auc_roc_class = roc_auc_score(y_true_class, y_probs_class)
    class_auc_roc[i] = auc_roc_class

# Overall AUC-ROC for multiclass classification
overall_auc_roc = roc_auc_score(y_test_1, y_pred_proba1, multi_class='ovr')

# Print the results
print(f"Overall AUC-ROC: {overall_auc_roc}")
print("Class-wise AUC-ROC:")
for class_label, auc_roc_class in class_auc_roc.items():
    print(f"Class {class_label}: {auc_roc_class}")

Overall AUC-ROC: 1.0
Class-wise AUC-ROC:
Class 0: 1.0
Class 1: 1.0
Class 2: 1.0
Class 3: 1.0
Class 4: 1.0






# Precision, Recall, F1 Score (weighted by class support)
precision1 = precision_score(y_test_1, y_pred1, average='weighted')
recall1 = recall_score(y_test_1, y_pred1, average='weighted')
f1_1 = f1_score(y_test_1, y_pred1, average='weighted')

print(f"Precision: {precision1:.2f}")
print(f"Recall: {recall1:.2f}")
print(f"F1 Score: {f1_1:.2f}")

# ROC AUC Score: the multiclass case needs class probabilities, not hard predictions
roc_auc1 = roc_auc_score(y_test_1, y_pred_proba1, multi_class='ovr')
print(f"ROC AUC Score: {roc_auc1:.2f}")

# Matthews Correlation Coefficient
mcc1 = matthews_corrcoef(y_test_1, y_pred1)
print(f"Matthews Correlation Coefficient: {mcc1:.2f}")
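The metrics above use `average='weighted'`, which weights each class by its support. Since the test set is imbalanced (class 1 has only 5 samples against 45 for class 0), it may be worth comparing against `average='macro'`, which treats every class equally. The toy labels below are invented purely to illustrate the difference and are not results from the model.

```python
from sklearn.metrics import f1_score, matthews_corrcoef

# Toy labels: class 0 dominates, and the rare class 1 is always misclassified as class 0
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2]

weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
mcc = matthews_corrcoef(y_true, y_pred)

print(f"Weighted F1: {weighted_f1:.2f}")
print(f"Macro F1: {macro_f1:.2f}")
print(f"Matthews Correlation Coefficient: {mcc:.2f}")
```

The macro average drops further than the weighted one because the completely missed rare class counts as much as the dominant class. On a perfectly classified test set the two averages coincide.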

