# Machine Learning in Computational Biology
## Course Project
## 2 class problem classification - Pipeline Notebook
#### Papadopoulou Marianna, ID:7115152200032
#### Vossos Charalampos, ID:7115152200037
#### Fillipidou Thalassini-Marina, ID:7115152200022

This notebook presents the complete pipeline for the two-class classification of cells based on their gene profile.
It contains all the optimization steps, the hyperparameter tuning and the feature selection techniques that were tested.
To replicate the findings outlined in this document, please follow the steps below:

1. Open the provided notebook in Google Colab.
2. Access the data section of the environment by clicking the folder icon on the left side.
Then upload either:
  - the final_data.csv file <br>
  or the files
  - E-MTAB-6108.aggregated_filtered_normalised_counts.mtx_cols
  - E-MTAB-6108.aggregated_filtered_normalised_counts.mtx_rows
  - E-MTAB-6108.aggregated_filtered_normalised_counts.mtx<br>
  AND
  - ground_truth.xlsx

 These files are provided within the exercise's zip file. To upload them, simply drag and drop the files into the designated area.

3. Run the cells in the notebook in the order they appear, making sure to follow the instructions provided in each cell.

When executing the notebook in Jupyter, make sure that the necessary packages are installed. In addition, pass the path to the dataset (final_data.csv or the other files) to the corresponding file-reading calls (*pd.read_csv()*, *open()* and *pd.read_excel()*).

*Note: the results of these classifiers are presented in the notebook "Papadopoulou_Fillipidou_Vossos_Final_Project_Metrics-Plots(2 class problem).ipynb" in the corresponding zip file of the exercise.*

# Libraries

In [None]:
# Execute if notebook is opened in Colab
!pip install mrmr_selection
!pip install optuna

In [5]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import make_scorer, matthews_corrcoef, balanced_accuracy_score, f1_score, fbeta_score, recall_score, precision_score, average_precision_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
import optuna
from optuna.samplers import RandomSampler
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
import random
from mrmr import mrmr_classif

# Load datasets and construct the final dataframe that will be used for classification

In this section, the three files mentioned at the beginning of this notebook are used to create the final dataframe that serves as input to our classification pipeline. We first consolidate the three MTX files, which together represent the gene profiles of our cells. We then parse the Excel file containing the ground truth labels (obtained from the clustering results provided by the authors of the chosen paper) and append its contents as additional columns to the gene profile dataframe.
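
As a compact alternative to parsing the file line by line, a Matrix Market file can also be read with `scipy.io.mmread`. The following is only a minimal sketch on a toy file written to a temporary directory; the file name, gene names and cell names are hypothetical, and it assumes the same genes-by-cells orientation as the data used here.

```python
import os
import tempfile

import pandas as pd
from scipy.io import mmread

# Toy Matrix Market file: 3 genes (rows) x 2 cells (columns), 3 non-zero entries.
mtx_text = (
    "%%MatrixMarket matrix coordinate real general\n"
    "3 2 3\n"
    "1 1 5.0\n"
    "3 1 2.5\n"
    "2 2 7.0\n"
)
toy_genes = ["geneA", "geneB", "geneC"]   # hypothetical gene names
toy_cells = ["cell-1", "cell-2"]          # hypothetical cell names

with tempfile.TemporaryDirectory() as tmp_dir:
    mtx_path = os.path.join(tmp_dir, "toy_counts.mtx")
    with open(mtx_path, "w") as handle:
        handle.write(mtx_text)
    # mmread handles the header and the 1-based indices and returns a sparse matrix (genes x cells)
    sparse_counts = mmread(mtx_path)

# Transpose to cells x genes and convert to a dense array (fine for a toy matrix)
toy_df = pd.DataFrame(sparse_counts.T.toarray(), columns=toy_genes)
toy_df.insert(loc=0, column="cell", value=toy_cells)
print(toy_df)
```

The result has one row per cell and one column per gene, which is the layout the rest of the notebook expects.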

In [None]:
# open file and read its content
with open("/content/E-MTAB-6108.aggregated_filtered_normalised_counts.mtx") as f:
    lines = f.readlines()

In [None]:
# Create a dictionary 'normalized_counts_mtx' with three lists as values: "gene", "cell", and "counts".
# These lists are populated with data from the file 'E-MTAB-6108.aggregated_filtered_normalised_counts.mtx'.
normalized_counts_mtx = {"gene": [], "cell": [], "counts": []}

# Loop through each line in the 'lines' list starting from the third line (skipping the two header lines).
# Each remaining line holds a gene index, a cell index, and the corresponding count.
for line in lines[2:]:
    gene, cell, counts = line.split()[:3]  # split each line only once
    normalized_counts_mtx["gene"].append(gene)
    normalized_counts_mtx["cell"].append(cell)
    normalized_counts_mtx["counts"].append(counts)

# Convert the dictionary into a Pandas DataFrame and cast all three columns to float.
normalized_counts_mtx = pd.DataFrame(normalized_counts_mtx).astype(float)

In [None]:
# Find the number of columns (genes) and rows (cells) present in the 'normalized_counts_mtx' DataFrame.
# This information will be used to initialize an empty matrix for storing counts.
cols= int(normalized_counts_mtx["gene"].max())
rows = int(normalized_counts_mtx["cell"].max())
matrix = np.zeros((rows, cols)) # Initialize an empty matrix of size (rows, cols)

# Loop through each row in the 'normalized_counts_mtx' DataFrame.
for i in range(normalized_counts_mtx.shape[0]):
    # Retrieve the gene index and cell index (zero-based) for each row in the DataFrame.
    gene = int(normalized_counts_mtx.loc[i, "gene"]) - 1
    cell = int(normalized_counts_mtx.loc[i, "cell"]) - 1
    counts = normalized_counts_mtx.loc[i, "counts"]
    # Assign the counts value to the corresponding position in the 'matrix' (cell, gene).
    matrix[cell, gene] = counts


In [None]:
df = pd.DataFrame(matrix)

In [None]:
# Read gene names from the corresponding file
genes = []
with open("/content/E-MTAB-6108.aggregated_filtered_normalised_counts.mtx_rows") as f:
    lines = f.readlines()
for line in lines:
    genes.append(line.split()[0])
df.columns = genes # Assign the gene names as column names in the 'df' DataFrame.

In [None]:
# Read cell names from the corresponding file.
with open("/content/E-MTAB-6108.aggregated_filtered_normalised_counts.mtx_cols") as f:
    lines = f.readlines()

df.insert(loc=0, column='cell', value=lines) # Add the cell names as the first column in the 'df' DataFrame
df["cell"] = df["cell"].str.rstrip("\n") # Remove the newline character from the 'cell' column values in the 'df' DataFrame.

## Adding ground truth labels from clustering results

In [None]:
# Create an empty DataFrame named 'final_data' to store the final processed data.
final_data = pd.DataFrame({})

# Read the ground truth data from the Excel file 'ground_truth.xlsx' and store it in a DataFrame named 'ground_truth'.
ground_truth = pd.read_excel("/content/ground_truth.xlsx")

temp_df = pd.DataFrame({})

# Loop through each column in the 'ground_truth' DataFrame, starting from the third column.
# Each column represents a different cell (e.g., 'column' corresponds to the cell name).
for number, column in enumerate(ground_truth.columns[2:]):
    df_clusters = pd.DataFrame({})

    # Extract a subset of the original DataFrame 'df' where the 'cell' column matches the current cell (column).
    temp_df = pd.DataFrame(df[df["cell"] == column])

    # Iterate through each row in the 'ground_truth' DataFrame.
    # 'ground_truth.loc[i, "K"]' contains the number of clusters for the current row (index 'i').
    # This number represents the number of clusters used in a certain split given by the row index 'i'.
    for i in range(ground_truth.shape[0]):

        # Get the number of clusters for the current row (index 'i').
        number_of_clusters = ground_truth.loc[i, "K"]

        # Assign the value from the 'ground_truth' DataFrame to a new column in 'temp_df' with a name like "cluster_2".
        # The name is generated using the format "cluster_number_of_clusters".
        # This column will contain the cluster label that the cell (column) belongs to for a certain split given by index 'i'.
        temp_df["cluster_{}".format(number_of_clusters)] = ground_truth.loc[i, column]

    # Concatenate the 'temp_df' DataFrame to the 'final_data' DataFrame, adding data for the current cell (column) and clusters.
    final_data = pd.concat([final_data, temp_df])
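
The per-cell loop above can also be expressed as a transpose followed by a merge. This is only a sketch on toy data: the frames below are made-up stand-ins for `df` and `ground_truth`, the name of the first ground-truth column is hypothetical, and the row order follows the gene profile frame rather than the column order of the Excel file.

```python
import pandas as pd

# Toy stand-ins for 'df' and 'ground_truth' (all values are made up for illustration)
toy_df = pd.DataFrame({"cell": ["c1", "c2", "c3"], "geneA": [1.0, 2.0, 3.0]})
toy_ground_truth = pd.DataFrame({
    "sel.K": [False, True],   # hypothetical first column, ignored here
    "K": [2, 4],              # number of clusters used in each row
    "c1": [1, 3],
    "c2": [2, 2],
    "c3": [1, 1],
})

# Cell names start at the third column, as in the loop above
cell_columns = toy_ground_truth.columns[2:]

# Rows become cells, columns become one label column per value of K
labels = toy_ground_truth.set_index("K")[cell_columns].T
labels.columns = [f"cluster_{k}" for k in labels.columns]
labels = labels.rename_axis("cell").reset_index()

# Keep only the cells present in both frames
toy_final = toy_df.merge(labels, on="cell", how="inner")
print(toy_final)
```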

In [None]:
final_data

Unnamed: 0,cell,ENSG00000000003,ENSG00000000005,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,...,ENSG00000289716,cluster_2,cluster_4,cluster_7,cluster_8,cluster_11,cluster_18,cluster_23,cluster_31,cluster_37
0,ERR2538859-AAACCTGAGACCACGA,122.99367,0.0,0.000000,0.0,0.00000,0.0,0.000000,30.748417,0.000000,...,0.0,1,3,3,3,2,2,19,17,12
1,ERR2538859-AAACCTGTCTGATACG,215.19258,0.0,0.000000,0.0,0.00000,0.0,0.000000,0.000000,0.000000,...,0.0,2,2,6,7,7,3,6,6,1
2,ERR2538859-AAACGGGAGTGTTGAA,0.00000,0.0,0.000000,0.0,0.00000,0.0,0.000000,0.000000,0.000000,...,0.0,1,1,1,1,1,14,14,26,26
3,ERR2538859-AAAGATGTCCGAACGC,0.00000,0.0,65.466450,0.0,0.00000,0.0,32.733227,0.000000,65.466450,...,0.0,1,4,5,5,10,18,22,28,33
4,ERR2538859-AAAGTAGGTTAGTGGG,168.26047,0.0,0.000000,0.0,0.00000,0.0,56.086823,0.000000,56.086823,...,0.0,1,4,4,6,5,7,20,11,21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1422,ERR2538860-TTTGGTTTCGTTACGA,0.00000,0.0,0.000000,0.0,0.00000,0.0,0.000000,0.000000,0.000000,...,0.0,1,2,5,4,3,9,3,5,5
1423,ERR2538860-TTTGTCAAGCCCAACC,0.00000,0.0,172.205950,0.0,0.00000,0.0,0.000000,0.000000,0.000000,...,0.0,2,2,6,7,7,3,6,25,1
1424,ERR2538860-TTTGTCACATTGGGCC,220.29666,0.0,0.000000,0.0,0.00000,0.0,0.000000,0.000000,0.000000,...,0.0,1,1,2,2,4,13,13,12,8
1425,ERR2538860-TTTGTCAGTTGCGTTA,0.00000,0.0,55.496975,0.0,0.00000,0.0,0.000000,277.484860,0.000000,...,0.0,1,1,2,2,6,8,7,13,11


# Get the 1500 genes with most variance

We keep only the 1,500 genes with the highest variance across cells, as proposed in the paper.

In [None]:
# Calculate the variance across cells for each gene
variances = final_data.iloc[:, 1:-9].var(axis=0)

# Sort the genes based on variance in descending order and select the top 1,500 genes
top_genes = variances.sort_values(ascending=False).index[:1500]

# Get the last columns from the original DataFrame
last_columns = final_data.iloc[:, -9:]

# Filter the DataFrame to keep only the top 1,500 genes and the last columns
df_filtered = pd.concat([final_data[['cell'] + list(top_genes)], last_columns], axis=1)

# Optionally, you can reset the index of the filtered DataFrame
df_filtered.reset_index(drop=True, inplace=True)

In [None]:
df_filtered

Unnamed: 0,cell,ENSG00000205542,ENSG00000198804,ENSG00000167996,ENSG00000198712,ENSG00000156508,ENSG00000087086,ENSG00000075624,ENSG00000229117,ENSG00000026025,...,ENSG00000185565,cluster_2,cluster_4,cluster_7,cluster_8,cluster_11,cluster_18,cluster_23,cluster_31,cluster_37
0,ERR2538859-AAACCTGAGACCACGA,32962.305,9065.20200,8240.5760,5350.2246,4735.2563,5042.7400,4950.4950,9255.0700,3659.0615,...,0.00000,1,3,3,3,2,2,19,17,12
1,ERR2538859-AAACCTGTCTGATACG,34861.200,0.00000,5579.0670,0.0000,16139.4430,4519.0444,3873.4666,12696.3620,3443.0813,...,0.00000,2,2,6,7,7,3,6,6,1
2,ERR2538859-AAACGGGAGTGTTGAA,17791.770,9193.14500,4845.0840,3431.0170,13334.9840,4739.7560,1061.1393,14643.7230,2193.0212,...,70.74262,1,1,1,1,1,14,14,26,26
3,ERR2538859-AAAGATGTCCGAACGC,26513.914,10016.36700,7070.3770,5564.6484,10540.0990,4058.9202,3404.2556,7725.0415,8837.9720,...,65.46645,1,4,5,5,10,18,22,28,33
4,ERR2538859-AAAGTAGGTTAGTGGG,30791.666,18116.04300,7291.2870,7347.3735,5664.7690,5384.3350,2636.0806,8469.1100,3140.8620,...,0.00000,1,4,4,6,5,7,20,11,21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1422,ERR2538860-TTTGGTTTCGTTACGA,20833.332,11986.30100,5707.7627,6849.3150,6563.9270,5707.7627,3139.2693,14554.7940,1141.5525,...,0.00000,1,2,5,4,3,9,3,5,5
1423,ERR2538860-TTTGTCAAGCCCAACC,73876.350,172.20595,13776.4760,0.0000,5510.5903,9126.9150,18598.2420,10160.1510,7577.0615,...,0.00000,2,2,6,7,7,3,6,25,1
1424,ERR2538860-TTTGTCACATTGGGCC,22690.557,14539.58000,3524.7466,4038.7722,10060.2140,2570.1277,3671.6110,8003.4385,6094.8745,...,0.00000,1,1,2,2,4,13,13,12,8
1425,ERR2538860-TTTGTCAGTTGCGTTA,54775.516,5771.68550,11874.2080,2830.3457,6992.6187,11654.3640,8047.0615,8713.0250,6160.1640,...,0.00000,1,1,2,2,6,8,7,13,11


In [None]:
#df_filtered.to_csv('final_df.csv',index=False) # save the filtered file for future use

# Load Dataset (csv ready, already filtered)

This stage replaces all of the preceding steps: the CSV file loaded here is their end result.

In [6]:
final_data = pd.read_csv('final_df.csv')

In [None]:
final_data

Unnamed: 0,cell,ENSG00000205542,ENSG00000198804,ENSG00000167996,ENSG00000198712,ENSG00000156508,ENSG00000087086,ENSG00000075624,ENSG00000229117,ENSG00000026025,...,ENSG00000185565,cluster_2,cluster_4,cluster_7,cluster_8,cluster_11,cluster_18,cluster_23,cluster_31,cluster_37
0,ERR2538859-AAACCTGAGACCACGA,32962.305,9065.20200,8240.5760,5350.2246,4735.2563,5042.7400,4950.4950,9255.0700,3659.0615,...,0.00000,1,3,3,3,2,2,19,17,12
1,ERR2538859-AAACCTGTCTGATACG,34861.200,0.00000,5579.0670,0.0000,16139.4430,4519.0444,3873.4666,12696.3620,3443.0813,...,0.00000,2,2,6,7,7,3,6,6,1
2,ERR2538859-AAACGGGAGTGTTGAA,17791.770,9193.14500,4845.0840,3431.0170,13334.9840,4739.7560,1061.1393,14643.7230,2193.0212,...,70.74262,1,1,1,1,1,14,14,26,26
3,ERR2538859-AAAGATGTCCGAACGC,26513.914,10016.36700,7070.3770,5564.6484,10540.0990,4058.9202,3404.2556,7725.0415,8837.9720,...,65.46645,1,4,5,5,10,18,22,28,33
4,ERR2538859-AAAGTAGGTTAGTGGG,30791.666,18116.04300,7291.2870,7347.3735,5664.7690,5384.3350,2636.0806,8469.1100,3140.8620,...,0.00000,1,4,4,6,5,7,20,11,21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1422,ERR2538860-TTTGGTTTCGTTACGA,20833.332,11986.30100,5707.7627,6849.3150,6563.9270,5707.7627,3139.2693,14554.7940,1141.5525,...,0.00000,1,2,5,4,3,9,3,5,5
1423,ERR2538860-TTTGTCAAGCCCAACC,73876.350,172.20595,13776.4760,0.0000,5510.5903,9126.9150,18598.2420,10160.1510,7577.0615,...,0.00000,2,2,6,7,7,3,6,25,1
1424,ERR2538860-TTTGTCACATTGGGCC,22690.557,14539.58000,3524.7466,4038.7722,10060.2140,2570.1277,3671.6110,8003.4385,6094.8745,...,0.00000,1,1,2,2,4,13,13,12,8
1425,ERR2538860-TTTGTCAGTTGCGTTA,54775.516,5771.68550,11874.2080,2830.3457,6992.6187,11654.3640,8047.0615,8713.0250,6160.1640,...,0.00000,1,1,2,2,6,8,7,13,11


# Split dataset to features and labels

In [9]:
# Separate features and target variable
# If final_df.csv was loaded, use the second pair of lines (active below)
# If the whole process was executed from the beginning, use the first pair of lines (commented out)

#X = df_filtered.iloc[:,1:-9]
#y_labels = df_filtered.iloc[:,-9:]

X = final_data.iloc[:,1:-9]
y_labels = final_data.iloc[:,-9:]

In [None]:
#X_features

In [None]:
#y_labels

# 2class classification approach

# Split dataset into two sets (training and test)

*The test set remains hidden in all the procedures below. We only use it for our final evaluations.*

In [10]:
# Extract the column named 'cluster_2' from the 'y_labels' DataFrame and store it in the variable 'y'.
y = y_labels['cluster_2']

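# Recode the labels as binary: cluster 1 becomes class 0, every other value (cluster 2) becomes class 1
y = np.where(y == 1, 0, 1)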

In [11]:
# Standardize the features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

In [12]:
# Convert y to a 1-dimensional array
y = np.ravel(y)

In [13]:
# Split the dataset into training/validation set and test set
X_train_val, X_test, y_train_val, y_test = train_test_split(X_std, y, test_size=0.2, random_state=42)
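
In the cells above, the scaler is fitted on all samples before the split, so the test rows contribute to the mean and variance used for standardization. A possible variant, shown here only as a sketch on synthetic data (array names, sizes and values are illustrative assumptions, not the project's data), splits first and fits the scaler on the training portion alone.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 4))   # synthetic features
y_toy = rng.integers(0, 2, size=100)                     # synthetic binary labels

# Split first, so the held-out rows play no part in the scaling statistics
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

toy_scaler = StandardScaler()
X_tr_std = toy_scaler.fit_transform(X_tr)   # mean and variance estimated on training rows only
X_te_std = toy_scaler.transform(X_te)       # test rows reuse the training statistics
```

With this ordering the training columns have zero mean after scaling, while the test columns are only approximately centred, which is the expected behaviour for unseen data.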

# ANOVA for feature selection

## Number of features to keep

The goal of this code is to optimize the number of features kept by ANOVA F-test feature selection. For each candidate number of features, several classifiers are trained on the selected features and scored with the Matthews correlation coefficient (MCC). The Optuna library searches for the optimal number of features, and the results are stored in the study object, which records the best number of features found during the optimization process.

We split the dataset using the *perform_dataset_split()* function so that the feature selection never sees the evaluation samples and our results are evaluated appropriately.
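
To illustrate the idea on a small scale, the sketch below fits `SelectKBest` with the ANOVA F-test on one split and applies the fitted selector to the other split through `transform`. The data are synthetic and the variable names are hypothetical; only feature 0 is constructed to carry class information.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=300)
X_toy = rng.normal(size=(300, 6))
X_toy[:, 0] += 3.0 * y_toy          # only feature 0 carries class information

X_fit, X_eval, y_fit, y_eval = train_test_split(X_toy, y_toy, test_size=0.30, random_state=30)

selector = SelectKBest(score_func=f_classif, k=1)
X_fit_sel = selector.fit_transform(X_fit, y_fit)   # F-scores computed on the fitting split only
X_eval_sel = selector.transform(X_eval)            # same columns applied to the evaluation split

clf = GaussianNB().fit(X_fit_sel, y_fit)
toy_mcc = matthews_corrcoef(y_eval, clf.predict(X_eval_sel))
print(selector.get_support(), round(toy_mcc, 3))
```

Using `transform` on the evaluation split is equivalent to indexing with the boolean mask from `get_support()`, and it keeps the selection tied to the split it was fitted on.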

In [14]:
def perform_dataset_split(X_train_val, y_train_val):
    X_train_features, X_evaluation, y_train_features, y_evaluation = train_test_split(X_train_val, y_train_val, test_size=0.30, random_state=30)
    # Wrap the arrays in DataFrames; feature selection is later fitted on X_train_features only
    X_train_features_df = pd.DataFrame(X_train_features)
    y_train_features_df = pd.DataFrame(y_train_features)
    X_evaluation_df = pd.DataFrame(X_evaluation)
    y_evaluation_df = pd.DataFrame(y_evaluation)
    return X_train_features_df, y_train_features_df, X_evaluation_df,y_evaluation_df

In [20]:
def objective_feature_selection_ANOVA(trial,X_train_features_df,y_train_features_df,X_evaluation_df,y_evaluation_df,scores):
  # Define the search space for the number of features
  num_features = trial.suggest_int("num_features", 1, X_train_features_df.shape[1])

  # Perform feature selection on the training set
  feature_selector = SelectKBest(score_func=f_classif, k=num_features)
  X_train_features_selected_df = feature_selector.fit_transform(X_train_features_df, y_train_features_df)
  selected_features = feature_selector.get_support()

  # Apply the same feature selection on the validation set
  X_evaluation_selected = X_evaluation_df.iloc[:, selected_features]

  SVC_cl = SVC()
  GNB_cl = GaussianNB()
  LR_cl = LogisticRegression()
  RF_cl = RandomForestClassifier()
  XG_cl = XGBClassifier()

  # Train the classifier on the selected features
  SVC_cl.fit(X_train_features_selected_df ,y_train_features_df)
  GNB_cl.fit(X_train_features_selected_df ,y_train_features_df)
  LR_cl.fit(X_train_features_selected_df,y_train_features_df)
  RF_cl.fit(X_train_features_selected_df,y_train_features_df)
  XG_cl.fit(X_train_features_selected_df,y_train_features_df)

  # Make predictions on the validation set
  y_pred_SVC = SVC_cl.predict(X_evaluation_selected)
  y_pred_GNB = GNB_cl.predict(X_evaluation_selected)
  y_pred_LR = LR_cl.predict(X_evaluation_selected)
  y_pred_RF = RF_cl.predict(X_evaluation_selected)
  y_pred_XG = XG_cl.predict(X_evaluation_selected)

  # Calculate the Matthews correlation coefficient (MCC) of each classifier for this trial
  mcc_features_evaluation_GNB = matthews_corrcoef(y_evaluation_df, y_pred_GNB)
  mcc_features_evaluation_LR = matthews_corrcoef(y_evaluation_df, y_pred_LR)
  mcc_features_evaluation_RF = matthews_corrcoef(y_evaluation_df, y_pred_RF)
  mcc_features_evaluation_XG = matthews_corrcoef(y_evaluation_df, y_pred_XG)
  trial_scores = [mcc_features_evaluation_GNB, mcc_features_evaluation_LR,
                  mcc_features_evaluation_RF, mcc_features_evaluation_XG]
  scores.extend(trial_scores)  # keep a running log of the scores of all trials

  # Average over this trial's scores only, so the objective value does not depend on earlier trials
  avg_mcc = np.mean(trial_scores)
  return avg_mcc

In [21]:
# Perform feature selection
X_train_features_df,y_train_features_df,X_evaluation_df,y_evaluation_df = perform_dataset_split(X_train_val, y_train_val)

In [22]:
scores=[]

In [23]:
# Define the objective function with selected features
def objective(trial):
    return objective_feature_selection_ANOVA(trial,X_train_features_df,y_train_features_df,X_evaluation_df,y_evaluation_df,scores)

In [None]:
study = optuna.create_study(direction='maximize', sampler=RandomSampler(seed=42))
study.optimize(objective, n_trials=100)

## SVC



#### Cross val for hyperparameter tuning and feature selection

The goal of this code is to tune the hyperparameters of a Support Vector Machine (SVM) classifier (specifically, the SVC implementation) using the Optuna library. Each trial is scored by the mean MCC over 5-fold stratified cross-validation, with the ANOVA feature selection (k=300) refitted inside every fold.
The code also stores a feature mask in the global best_features variable. Since no intermediate values are reported to Optuna, no trial is pruned, and best_features ends up holding the mask selected in the last fold of the most recently completed trial.

In [None]:
# Define the number of folds for cross-validation
n_folds = 5

In [None]:
# Define the objective function for Optuna hyperparameter tuning
def objective_SVC(trial):
  global best_features
  # Define the hyperparameters to be optimized
  C = trial.suggest_float('C', 0.1, 10)
  gamma = trial.suggest_float('gamma', 0.01, 1)
  kernel = trial.suggest_categorical('kernel', ['linear', 'poly', 'rbf', 'sigmoid'])
  degree = trial.suggest_int('degree', 2, 10)
  # Instantiate the classifier with the current hyperparameters
  classifier = SVC(C=C, gamma=gamma,kernel=kernel,degree=degree)

  # Perform cross-validation with feature selection
  kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
  cv_scores = []
  feature_selector = SelectKBest(score_func=f_classif, k=300)  # Adjust the number of features as desired or found through optimization
  best_features = None

  for train_index, val_index in kf.split(X_train_val, y_train_val):
      X_train, X_val = X_train_val[train_index], X_train_val[val_index]
      y_train, y_val = y_train_val[train_index], y_train_val[val_index]

      # Perform feature selection on the training set
      X_train_selected = feature_selector.fit_transform(X_train, y_train)
      selected_features = feature_selector.get_support()

      # Apply the same feature selection on the validation set
      X_val_selected = X_val[:, selected_features]

      # Fit the classifier on the selected features and evaluate on the validation set
      classifier.fit(X_train_selected, y_train)
      y_pred = classifier.predict(X_val_selected)
      mcc = matthews_corrcoef(y_val, y_pred)
      cv_scores.append(mcc)

  # Calculate the average MCC
  avg_mcc = np.mean(cv_scores)

  if not trial.should_prune():
    best_features = selected_features.copy()

  return avg_mcc

In [None]:
# Define the Optuna study
study = optuna.create_study(direction='maximize',sampler=RandomSampler(seed=42))
study.optimize(objective_SVC, n_trials=100)

[I 2023-06-23 09:00:11,830] A new study created in memory with name: no-name-b79a7270-e773-457e-8073-8bc20ea9fad6
[I 2023-06-23 09:00:12,047] Trial 0 finished with value: 0.8791884824338855 and parameters: {'C': 3.807947176588889, 'gamma': 0.951207163345817, 'kernel': 'linear', 'degree': 2}. Best is trial 0 with value: 0.8791884824338855.
[I 2023-06-23 09:00:13,547] Trial 1 finished with value: 0.0 and parameters: {'C': 8.675143843171858, 'gamma': 0.6051038616257767, 'kernel': 'rbf', 'degree': 3}. Best is trial 0 with value: 0.8791884824338855.
[I 2023-06-23 09:00:14,050] Trial 2 finished with value: 0.3347510634344725 and parameters: {'C': 1.9000671753502962, 'gamma': 0.1915704647548995, 'kernel': 'poly', 'degree': 7}. Best is trial 0 with value: 0.8791884824338855.
[I 2023-06-23 09:00:15,093] Trial 3 finished with value: 0.0 and parameters: {'C': 1.4809892204552142, 'gamma': 0.29922320204986597, 'kernel': 'rbf', 'degree': 6}. Best is trial 0 with value: 0.8791884824338855.
[I 2023-06

In [None]:
# Get the best hyperparameters from Optuna
best_params = study.best_params

In [None]:
best_params

{'C': 3.807947176588889,
 'gamma': 0.951207163345817,
 'kernel': 'linear',
 'degree': 2}

In [None]:
best_features

array([ True,  True, False, ..., False, False, False])

#### Get best parameters and make predictions
##### We suggest executing the notebook "*Papadopoulou_Fillipidou_Vossos_Final_Project_Metrics-Plots(2 class problem).ipynb*" for all the sections of predictions and results from the classifiers

In [None]:
selected_features_indices = [i for i, value in enumerate(best_features) if value]
X_test_selected = X_test[:, selected_features_indices]

In [None]:
X_train_val_selected_final_features = X_train_val[:, selected_features_indices]

In [None]:
X_test_selected.shape

(286, 300)

In [None]:
# best_classifier_SVC = SVC(C=3.807947176588889, gamma=0.951207163345817, kernel='linear', degree=2) # fallback in case unpacking **best_params fails
best_classifier_SVC = SVC(**best_params)

In [None]:
best_classifier_SVC.fit(X_train_val_selected_final_features, y_train_val)

In [None]:
y_test_pred = best_classifier_SVC.predict(X_test_selected)

In [None]:
balanced_acc = balanced_accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
f1 = f1_score(y_test, y_test_pred)

## Random Forest

#### Cross val for hyperparameter tuning and feature selection

The goal of this code is to tune the hyperparameters of a RandomForest classifier using the Optuna library. Each trial is scored by the mean MCC over 5-fold stratified cross-validation, with the ANOVA feature selection (k=300) refitted inside every fold.
The code also stores a feature mask in the global best_features variable. Since no intermediate values are reported to Optuna, no trial is pruned, and best_features ends up holding the mask selected in the last fold of the most recently completed trial.
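
The same "select features inside each fold, then score with MCC" scheme can be written with scikit-learn's `Pipeline`, `cross_val_score` and `make_scorer`, which are already imported at the top of this notebook. The following is a minimal sketch on synthetic data with hypothetical names and deliberately small settings (k=5, 50 trees); it is not a replacement for the tuned configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=200)
X_toy = rng.normal(size=(200, 20))
X_toy[:, :3] += 2.0 * y_toy[:, None]   # the first three features are informative

# The selector is a pipeline step, so it is refitted on the training part of every fold
toy_pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_mcc = cross_val_score(toy_pipeline, X_toy, y_toy, cv=cv,
                           scoring=make_scorer(matthews_corrcoef))
print(fold_mcc.mean())
```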

In [None]:
# Define the number of folds for cross-validation
n_folds = 5

In [None]:
# Define the objective function for Optuna hyperparameter tuning
def objective_RF(trial):
  global best_features
  # Define the hyperparameters to be optimized
  n_estimators= trial.suggest_int('n_estimators', 100,1000)
  max_depth= trial.suggest_int('max_depth', 5, 31)
  min_samples_split= trial.suggest_int('min_samples_split', 2, 100)
  min_samples_leaf= trial.suggest_int('min_samples_leaf', 1, 4)
  bootstrap=trial.suggest_categorical('bootstrap', [True, False])

  # Instantiate the classifier with the current hyperparameters
  classifier = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,min_samples_split=min_samples_split,min_samples_leaf=min_samples_leaf,bootstrap=bootstrap)

  # Perform cross-validation with feature selection
  kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
  cv_scores = []
  feature_selector = SelectKBest(score_func=f_classif, k=300)  # Adjust the number of features as desired
  best_features = None

  for train_index, val_index in kf.split(X_train_val, y_train_val):
      X_train, X_val = X_train_val[train_index], X_train_val[val_index]
      y_train, y_val = y_train_val[train_index], y_train_val[val_index]

      # Perform feature selection on the training set
      X_train_selected = feature_selector.fit_transform(X_train, y_train)
      selected_features = feature_selector.get_support()

      # Apply the same feature selection on the validation set
      X_val_selected = X_val[:, selected_features]

      # Fit the classifier on the selected features and evaluate on the validation set
      classifier.fit(X_train_selected, y_train)
      y_pred = classifier.predict(X_val_selected)
      mcc = matthews_corrcoef(y_val, y_pred)
      cv_scores.append(mcc)

  # Calculate the average MCC
  avg_mcc = np.mean(cv_scores)

  if not trial.should_prune():
    best_features = selected_features.copy()

  return avg_mcc

In [None]:
# Define the Optuna study
study = optuna.create_study(direction='maximize',sampler=RandomSampler(seed=42))
study.optimize(objective_RF, n_trials=100)

[I 2023-06-23 09:59:24,345] A new study created in memory with name: no-name-8965de21-3a04-4fa3-9955-d4ede5192c4c
[I 2023-06-23 09:59:34,718] Trial 0 finished with value: 0.9586952887473211 and parameters: {'n_estimators': 437, 'max_depth': 30, 'min_samples_split': 74, 'min_samples_leaf': 3, 'bootstrap': True}. Best is trial 0 with value: 0.9586952887473211.
[I 2023-06-23 09:59:42,288] Trial 1 finished with value: 0.9586952887473211 and parameters: {'n_estimators': 152, 'max_depth': 28, 'min_samples_split': 61, 'min_samples_leaf': 3, 'bootstrap': False}. Best is trial 0 with value: 0.9586952887473211.
[I 2023-06-23 10:00:14,844] Trial 2 finished with value: 0.9661398274865641 and parameters: {'n_estimators': 850, 'max_depth': 10, 'min_samples_split': 20, 'min_samples_leaf': 1, 'bootstrap': False}. Best is trial 2 with value: 0.9661398274865641.
[I 2023-06-23 10:00:29,887] Trial 3 finished with value: 0.9586952887473211 and parameters: {'n_estimators': 489, 'max_depth': 12, 'min_samples

In [None]:
# Get the best hyperparameters from Optuna
best_params = study.best_params

In [None]:
best_params

{'n_estimators': 850,
 'max_depth': 10,
 'min_samples_split': 20,
 'min_samples_leaf': 1,
 'bootstrap': False}

In [None]:
#best_params

{'n_estimators': 854,
 'max_depth': 24,
 'min_samples_split': 28,
 'min_samples_leaf': 3,
 'bootstrap': True}

In [None]:
best_features

array([ True,  True, False, ..., False, False, False])

#### Get best parameters and make predictions
We suggest executing the notebook "Papadopoulou_Fillipidou_Vossos_Final_Project_Metrics-Plots(2 class problem).ipynb" for all the sections of predictions and results from the classifiers

In [None]:
selected_features_indices = [i for i, value in enumerate(best_features) if value]
X_test_selected = X_test[:, selected_features_indices]

In [None]:
X_train_val_selected_final_features = X_train_val[:, selected_features_indices]

In [None]:
X_test_selected.shape

(286, 300)

In [None]:
best_classifier_RF = RandomForestClassifier(**best_params)

In [None]:
best_classifier_RF.fit(X_train_val_selected_final_features, y_train_val)

In [None]:
y_test_pred = best_classifier_RF.predict(X_test_selected)

In [None]:
balanced_acc = balanced_accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
f1 = f1_score(y_test, y_test_pred)

#### Feature importance

In [None]:
# Filter the DataFrame using the boolean array
data_filtered_features = df_filtered.iloc[:,:-10].loc[:, best_features]

# Print the filtered DataFrame
data_filtered_features

Unnamed: 0,cell,ENSG00000205542,ENSG00000167996,ENSG00000087086,ENSG00000075624,ENSG00000198918,ENSG00000111640,ENSG00000198938,ENSG00000145425,ENSG00000112306,...,ENSG00000115866,ENSG00000165775,ENSG00000126261,ENSG00000159352,ENSG00000167123,ENSG00000172785,ENSG00000104419,ENSG00000151835,ENSG00000077942,ENSG00000241553
0,ERR2538859-AAACCTGAGACCACGA,32962.305,8240.5760,5042.7400,4950.4950,3874.3005,6949.1420,2982.5964,3351.5774,3474.5710,...,153.742080,153.742080,61.496834,61.496834,297.796360,107.619460,0.000000,61.496834,245.987340,30.748417
1,ERR2538859-AAACCTGTCTGATACG,34861.200,5579.0670,4519.0444,3873.4666,10329.2440,9038.0890,0.0000,8607.7030,10114.0520,...,215.192580,0.000000,0.000000,215.192580,0.000000,0.000000,0.000000,0.000000,0.000000,215.192580
2,ERR2538859-AAACGGGAGTGTTGAA,17791.770,4845.0840,4739.7560,1061.1393,11424.9340,6402.2075,3218.7893,8878.1990,8489.1140,...,70.742620,141.485240,0.000000,70.742620,156.347350,78.863240,35.371310,35.371310,0.000000,35.371310
3,ERR2538859-AAAGATGTCCGAACGC,26513.914,7070.3770,4058.9202,3404.2556,5695.5815,3076.9233,2029.4601,4517.1850,5302.7827,...,65.466450,0.000000,0.000000,0.000000,65.466450,120.912285,196.399350,65.466450,130.932900,0.000000
4,ERR2538859-AAAGTAGGTTAGTGGG,30791.666,7291.2870,5384.3350,2636.0806,4823.4670,4655.2060,6449.9844,3533.4697,4822.8145,...,112.173645,56.086823,112.173645,56.086823,112.173645,0.000000,0.000000,56.086823,56.086823,224.347290
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1422,ERR2538860-TTTGGTTTCGTTACGA,20833.332,5707.7627,5707.7627,3139.2693,6563.9270,3995.4336,7420.0913,7420.0913,7705.4795,...,285.388120,0.000000,570.776250,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1423,ERR2538860-TTTGTCAAGCCCAACC,73876.350,13776.4760,9126.9150,18598.2420,6027.2080,7921.4736,0.0000,5855.0024,5855.0024,...,172.205950,172.205950,516.617860,172.205950,172.205950,0.000000,0.000000,172.205950,0.000000,0.000000
1424,ERR2538860-TTTGTCACATTGGGCC,22690.557,3524.7466,2570.1277,3671.6110,9105.5960,8150.9766,5140.2554,6976.0610,4993.3910,...,367.161100,146.864440,73.432220,73.432220,0.000000,220.296660,0.000000,73.432220,367.161100,0.000000
1425,ERR2538860-TTTGTCAGTTGCGTTA,54775.516,11874.2080,11654.3640,8047.0615,4384.2610,9101.5040,2830.3457,3995.7822,3884.7883,...,166.490920,55.496975,0.000000,166.490920,83.245460,221.987900,110.993950,0.000000,0.000000,55.496975


In [None]:
# view the feature scores
#X_df = pd.DataFrame(X_train_val,columns=X.columns)
feature_scores = pd.Series(best_classifier_RF.feature_importances_).sort_values(ascending=False)

In [None]:
feature_scores

6      0.087005
1      0.081445
8      0.072879
2      0.061646
26     0.052522
         ...   
293    0.000016
239    0.000008
256    0.000000
261    0.000000
284    0.000000
Length: 300, dtype: float64

In [None]:
# Get the column names corresponding to the series indices
column_names = data_filtered_features.columns[feature_scores.index]

# Print the column names
print(column_names)

Index(['ENSG00000111640', 'ENSG00000205542', 'ENSG00000145425',
       'ENSG00000167996', 'ENSG00000168209', 'ENSG00000177600',
       'ENSG00000105193', 'ENSG00000096384', 'ENSG00000096150',
       'ENSG00000163041',
       ...
       'ENSG00000115866', 'ENSG00000178980', 'ENSG00000101966',
       'ENSG00000159210', 'ENSG00000169871', 'ENSG00000159352',
       'ENSG00000063046', 'ENSG00000149357', 'ENSG00000079482',
       'ENSG00000136003'],
      dtype='object', length=300)


In [None]:
# Get the column names corresponding to the series indices above the threshold
filtered_column_names = data_filtered_features.columns[feature_scores[feature_scores > 0.05].index]

# Print the filtered column names
print(filtered_column_names)

Index(['ENSG00000111640', 'ENSG00000205542', 'ENSG00000145425',
       'ENSG00000167996', 'ENSG00000168209', 'ENSG00000177600'],
      dtype='object')


## XGBoost

#### Cross-validation for hyperparameter tuning and feature selection

The code below tunes the hyperparameters of an XGBoost classifier with the Optuna library, scoring each trial by the mean Matthews correlation coefficient (MCC) over stratified 5-fold cross-validation.
Within each fold, SelectKBest keeps the 300 features with the highest ANOVA F-scores on the training part. The mask selected in the last fold of the most recent trial is stored in the global best_features variable. Because no intermediate values are reported to Optuna, trial.should_prune() always returns False here, so no trial is actually pruned.
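
One alternative to carrying a fold-specific mask out of the objective is to refit the selector once on the full training and validation set after tuning. The sketch below illustrates this under stated assumptions: the data are synthetic stand-ins for X_train_val, y_train_val and X_test, and the names X_tv, y_tv, X_te and final_mask are hypothetical. It is not the procedure used to produce the results reported in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the real expression matrix (assumption)
X_all, y_all = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=42
)
X_tv, y_tv = X_all[:150], y_all[:150]  # plays the role of X_train_val / y_train_val
X_te = X_all[150:]                     # plays the role of X_test

# Fit the selector once on all training + validation samples
selector = SelectKBest(score_func=f_classif, k=10).fit(X_tv, y_tv)
final_mask = selector.get_support()    # boolean mask over the 50 features

# Apply the same fitted selector to both sets
X_tv_sel = selector.transform(X_tv)
X_te_sel = selector.transform(X_te)
print(final_mask.sum(), X_tv_sel.shape, X_te_sel.shape)
```

The test set is only transformed, never used for fitting the selector, so no information from it leaks into the choice of features.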

In [None]:
# Define the number of folds for cross-validation
n_folds = 5

In [None]:
# Define the objective function for Optuna hyperparameter tuning
def objective_XG(trial):
  global best_features

  params = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 200),
            'max_depth': trial.suggest_int('max_depth', 5, 20),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1),
            'subsample': trial.suggest_float('subsample', 0.5, 1.0),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
            'gamma': trial.suggest_float('gamma', 0, 5),
            'min_child_weight': trial.suggest_int('min_child_weight', 1, 5),
            'random_state': 42,
            'tree_method': 'hist',  # Optional: Use 'hist' method for faster training
            'objective': 'multi:softmax',
            'num_class': 2  # Replace with the actual number of classes
        }

  # Instantiate the classifier with the current hyperparameters
  classifier = XGBClassifier(**params)

  # Perform cross-validation with feature selection
  kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
  cv_scores = []
  feature_selector = SelectKBest(score_func=f_classif, k=300)  # Adjust the number of features as desired
  best_features = None

  for train_index, val_index in kf.split(X_train_val, y_train_val):
      X_train, X_val = X_train_val[train_index], X_train_val[val_index]
      y_train, y_val = y_train_val[train_index], y_train_val[val_index]

      # Perform feature selection on the training set
      X_train_selected = feature_selector.fit_transform(X_train, y_train)
      selected_features = feature_selector.get_support()

      # Apply the same feature selection on the validation set
      X_val_selected = X_val[:, selected_features]

      # Fit the classifier on the selected features and evaluate on the validation set
      classifier.fit(X_train_selected, y_train)
      y_pred = classifier.predict(X_val_selected)
      mcc = matthews_corrcoef(y_val, y_pred)
      cv_scores.append(mcc)

  # Calculate the average MCC
  avg_mcc = np.mean(cv_scores)

  if not trial.should_prune():
    best_features = selected_features.copy()

  return avg_mcc
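
The objective above returns the mean MCC over the folds. As a reminder of what this score measures, the sketch below computes the MCC of a binary problem directly from the confusion counts. The helper name mcc_from_labels and the toy labels are hypothetical; returning 0 when the denominator is zero follows the convention of scikit-learn's matthews_corrcoef.

```python
import numpy as np

def mcc_from_labels(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0/1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    if denom == 0:
        # e.g. the classifier predicts a single class for every sample
        return 0.0
    return (tp * tn - fp * fn) / denom

# Toy example: TP=2, FN=1, FP=1, TN=4
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 1]
toy_mcc = mcc_from_labels(toy_true, toy_pred)

# A classifier that always predicts the majority class scores 0
constant_mcc = mcc_from_labels(toy_true, [0] * 8)
print(toy_mcc, constant_mcc)
```

A score of exactly 0.0 in an Optuna log can therefore indicate a model that collapsed to a single predicted class rather than one that is merely uninformative.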

In [None]:
# Define the Optuna study
study = optuna.create_study(direction='maximize', sampler=RandomSampler(seed=42))
study.optimize(objective_XG, n_trials=100)

[I 2023-06-23 09:02:18,874] A new study created in memory with name: no-name-f4a9e6e3-4e36-4df5-923e-d44af625dff6
  'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 0.1),
[I 2023-06-23 09:02:24,592] Trial 0 finished with value: 0.9659070867794387 and parameters: {'n_estimators': 106, 'max_depth': 20, 'learning_rate': 0.05395030966670229, 'subsample': 0.7993292420985183, 'colsample_bytree': 0.5780093202212182, 'gamma': 0.7799726016810132, 'min_child_weight': 1}. Best is trial 0 with value: 0.9659070867794387.
  'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 0.1),
[I 2023-06-23 09:02:33,689] Trial 1 finished with value: 0.9517078298970004 and parameters: {'n_estimators': 180, 'max_depth': 14, 'learning_rate': 0.051059032093947576, 'subsample': 0.5102922471479012, 'colsample_bytree': 0.9849549260809971, 'gamma': 4.162213204002109, 'min_child_weight': 2}. Best is trial 0 with value: 0.9659070867794387.
  'learning_rate': trial.suggest_loguniform('learning_

In [None]:
# Get the best hyperparameters from Optuna
best_params = study.best_params

In [None]:
best_params

{'n_estimators': 71,
 'max_depth': 9,
 'learning_rate': 0.023246728489504348,
 'subsample': 0.728034992108518,
 'colsample_bytree': 0.8925879806965068,
 'gamma': 0.9983689107917987,
 'min_child_weight': 3}

In [None]:
#best_params

{'eta': 0.9998923071077181,
 'gamma': 0.48062735928448386,
 'min_child_weight ': 0.9165527129072861,
 'max_delta_step': 9,
 'subsample': 0.9036565255711814}

In [None]:
best_features

array([ True,  True, False, ..., False, False, False])

#### Get best parameters and make predictions
For the predictions and results of all classifiers, we suggest running the notebook "Papadopoulou_Fillipidou_Vossos_Final_Project_Metrics-Plots(2 class problem).ipynb".

In [None]:
selected_features_indices = [i for i, value in enumerate(best_features) if value]
X_test_selected = X_test[:, selected_features_indices]

In [None]:
X_train_val_selected_final_features = X_train_val[:, selected_features_indices]

In [None]:
X_test_selected.shape

(286, 300)

In [None]:
best_classifier_XG = XGBClassifier(**best_params)

In [None]:
best_classifier_XG.fit(X_train_val_selected_final_features, y_train_val)

In [None]:
y_test_pred = best_classifier_XG.predict(X_test_selected)

In [None]:
balanced_acc = balanced_accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
f1 = f1_score(y_test, y_test_pred)

## Logistic Regression

#### Cross-validation for hyperparameter tuning and feature selection

The code below tunes the hyperparameters of a LogisticRegression classifier with the Optuna library, scoring each trial by the mean Matthews correlation coefficient (MCC) over stratified 5-fold cross-validation.
Within each fold, SelectKBest keeps the 300 features with the highest ANOVA F-scores on the training part. The mask selected in the last fold of the most recent trial is stored in the global best_features variable. Because no intermediate values are reported to Optuna, trial.should_prune() always returns False here, so no trial is actually pruned.
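
The same idea of selecting features inside every fold can also be expressed with a scikit-learn Pipeline, which refits the selector on the training part of each split automatically. The sketch below is a minimal illustration on synthetic data; X_demo, y_demo, the value k=20 and the fixed hyperparameters are assumptions chosen for the example, not values used in this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for X_train_val / y_train_val (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=60, n_informative=8, random_state=0
)

# Selector and classifier are fitted together on each training fold
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=20)),
    ("clf", LogisticRegression(C=1.0, solver="lbfgs", max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_mcc = cross_val_score(
    pipe, X_demo, y_demo, cv=cv, scoring=make_scorer(matthews_corrcoef)
)
print(fold_mcc.mean())
```

Inside an Optuna objective, the pipeline would be rebuilt with the suggested hyperparameters and the mean of fold_mcc returned as the trial value.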

In [None]:
# Define the number of folds for cross-validation
n_folds = 5

In [None]:
# Define the objective function for Optuna hyperparameter tuning
def objective_LG(trial):
  global best_features
  params = {
              'tol' : trial.suggest_float('tol' , 1e-6 , 1e-3),
              'C' : trial.suggest_float("C", 1e-5, 100),
              "n_jobs" : -1,
              'solver': trial.suggest_categorical('solver', ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'])
          }

  # Instantiate the classifier with the current hyperparameters
  classifier = LogisticRegression(**params)

  # Perform cross-validation with feature selection
  kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
  cv_scores = []
  feature_selector = SelectKBest(score_func=f_classif, k=300)  # Adjust the number of features as desired
  best_features = None

  for train_index, val_index in kf.split(X_train_val, y_train_val):
      X_train, X_val = X_train_val[train_index], X_train_val[val_index]
      y_train, y_val = y_train_val[train_index], y_train_val[val_index]

      # Perform feature selection on the training set
      X_train_selected = feature_selector.fit_transform(X_train, y_train)
      selected_features = feature_selector.get_support()

      # Apply the same feature selection on the validation set
      X_val_selected = X_val[:, selected_features]

      # Fit the classifier on the selected features and evaluate on the validation set
      classifier.fit(X_train_selected, y_train)
      y_pred = classifier.predict(X_val_selected)
      mcc = matthews_corrcoef(y_val, y_pred)
      cv_scores.append(mcc)

  # Calculate the average MCC
  avg_mcc = np.mean(cv_scores)

  if not trial.should_prune():
    best_features = selected_features.copy()

  return avg_mcc

In [None]:
# Define the Optuna study
study = optuna.create_study(direction='maximize', sampler=RandomSampler(seed=42))
study.optimize(objective_LG, n_trials=100)

[I 2023-06-23 09:27:21,257] A new study created in memory with name: no-name-67271ba0-9880-4066-bfe2-b80b7cda5d9c
[I 2023-06-23 09:27:23,483] Trial 0 finished with value: 0.9082276043023576 and parameters: {'tol': 0.0003751655787285152, 'C': 95.07143113384855, 'solver': 'newton-cg'}. Best is trial 0 with value: 0.9082276043023576.
[I 2023-06-23 09:27:24,024] Trial 1 finished with value: 0.8290908451538833 and parameters: {'tol': 0.0008663099696291604, 'C': 60.11150516317076, 'solver': 'liblinear'}. Best is trial 0 with value: 0.9082276043023576.
[I 2023-06-23 09:27:26,087] Trial 2 finished with value: 0.8375464915589985 and parameters: {'tol': 0.00018264314223989352, 'C': 18.340459151298283, 'solver': 'saga'}. Best is trial 0 with value: 0.9082276043023576.
[I 2023-06-23 09:27:26,633] Trial 3 finished with value: 0.8290908451538833 and parameters: {'tol': 0.0001403543667913898, 'C': 29.21447193207533, 'solver': 'liblinear'}. Best is trial 0 with value: 0.9082276043023576.
[I 2023-06-23

In [None]:
# Get the best hyperparameters from Optuna
best_params = study.best_params

In [None]:
best_params

{'tol': 0.0003751655787285152, 'C': 95.07143113384855, 'solver': 'newton-cg'}

In [None]:
best_features

array([ True,  True, False, ..., False, False, False])

#### Get best parameters and make predictions
For the predictions and results of all classifiers, we suggest running the notebook "Papadopoulou_Fillipidou_Vossos_Final_Project_Metrics-Plots(2 class problem).ipynb".

In [None]:
selected_features_indices = [i for i, value in enumerate(best_features) if value]
X_test_selected = X_test[:, selected_features_indices]

In [None]:
X_train_val_selected_final_features = X_train_val[:, selected_features_indices]

In [None]:
X_test_selected.shape

(286, 300)

In [None]:
best_classifier_LG = LogisticRegression(**best_params)

In [None]:
best_classifier_LG.fit(X_train_val_selected_final_features, y_train_val)

In [None]:
y_test_pred = best_classifier_LG.predict(X_test_selected)

In [None]:
balanced_acc = balanced_accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
f1 = f1_score(y_test, y_test_pred)

## GaussianNB

#### Cross-validation for hyperparameter tuning and feature selection

The code below tunes the var_smoothing hyperparameter of a GaussianNB classifier with the Optuna library, scoring each trial by the mean Matthews correlation coefficient (MCC) over stratified 5-fold cross-validation.
Within each fold, SelectKBest keeps the 300 features with the highest ANOVA F-scores on the training part. The mask selected in the last fold of the most recent trial is stored in the global best_features variable. Because no intermediate values are reported to Optuna, trial.should_prune() always returns False here, so no trial is actually pruned.
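
In scikit-learn, var_smoothing is a fraction of the largest feature variance that is added to every per-class variance, which is one reason to search it on a logarithmic scale. The sketch below checks this relationship on synthetic data through the fitted epsilon_ attribute; X_nb, y_nb and the chosen value 1e-3 are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the selected features (assumption)
X_nb, y_nb = make_classification(n_samples=200, n_features=10, random_state=0)

var_smoothing = 1e-3
model = GaussianNB(var_smoothing=var_smoothing).fit(X_nb, y_nb)

# Absolute value added to the variances: fraction * largest feature variance
expected_eps = var_smoothing * np.var(X_nb, axis=0).max()
print(model.epsilon_, expected_eps)
```

Because the added amount scales with the largest variance in the data, the effect of a given var_smoothing value depends on how the features are scaled.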

In [None]:
# Define the number of folds for cross-validation
n_folds = 5

In [None]:
# Define the objective function for Optuna hyperparameter tuning
def objective_GNB(trial):
  global best_features
  params = {
            'var_smoothing': trial.suggest_float('var_smoothing', 1e-10, 1e-3, log=True)
          }

  # Instantiate the classifier with the current hyperparameters
  classifier = GaussianNB(**params)

  # Perform cross-validation with feature selection
  kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
  cv_scores = []
  feature_selector = SelectKBest(score_func=f_classif, k=300)  # Adjust the number of features as desired
  best_features = None

  for train_index, val_index in kf.split(X_train_val, y_train_val):
      X_train, X_val = X_train_val[train_index], X_train_val[val_index]
      y_train, y_val = y_train_val[train_index], y_train_val[val_index]

      # Perform feature selection on the training set
      X_train_selected = feature_selector.fit_transform(X_train, y_train)
      selected_features = feature_selector.get_support()

      # Apply the same feature selection on the validation set
      X_val_selected = X_val[:, selected_features]

      # Fit the classifier on the selected features and evaluate on the validation set
      classifier.fit(X_train_selected, y_train)
      y_pred = classifier.predict(X_val_selected)
      mcc = matthews_corrcoef(y_val, y_pred)
      cv_scores.append(mcc)

  # Calculate the average MCC
  avg_mcc = np.mean(cv_scores)

  if not trial.should_prune():
    best_features = selected_features.copy()

  return avg_mcc

In [None]:
# Define the Optuna study
study = optuna.create_study(direction='maximize', sampler=RandomSampler(seed=42))
study.optimize(objective_GNB, n_trials=100)

[I 2023-06-23 09:42:52,640] A new study created in memory with name: no-name-97694674-8357-409a-ab60-fb09004a7e1e
[I 2023-06-23 09:42:52,840] Trial 0 finished with value: 0.7774806049494355 and parameters: {'var_smoothing': 4.185822729546966e-08}. Best is trial 0 with value: 0.7774806049494355.
[I 2023-06-23 09:42:53,028] Trial 1 finished with value: 0.775315618686399 and parameters: {'var_smoothing': 0.0004518560951024107}. Best is trial 0 with value: 0.7774806049494355.
[I 2023-06-23 09:42:53,213] Trial 2 finished with value: 0.7678940594228604 and parameters: {'var_smoothing': 1.3303245101522907e-05}. Best is trial 0 with value: 0.7774806049494355.
[I 2023-06-23 09:42:53,421] Trial 3 finished with value: 0.7774806049494355 and parameters: {'var_smoothing': 1.5509913987594307e-06}. Best is trial 0 with value: 0.7774806049494355.
[I 2023-06-23 09:42:53,715] Trial 4 finished with value: 0.7774806049494355 and parameters: {'var_smoothing': 1.2363188277052218e-09}. Best is trial 0 with v

In [None]:
# Get the best hyperparameters from Optuna
best_params = study.best_params

In [None]:
best_params

{'var_smoothing': 0.0008094845352286139}

In [None]:
best_features

array([ True,  True, False, ..., False, False, False])

#### Get best parameters and make predictions
For the predictions and results of all classifiers, we suggest running the notebook "Papadopoulou_Fillipidou_Vossos_Final_Project_Metrics-Plots(2 class problem).ipynb".

In [None]:
selected_features_indices = [i for i, value in enumerate(best_features) if value]
X_test_selected = X_test[:, selected_features_indices]

In [None]:
X_train_val_selected_final_features = X_train_val[:, selected_features_indices]

In [None]:
X_test_selected.shape

(286, 300)

In [None]:
best_classifier_GNB = GaussianNB(**best_params)

In [None]:
best_classifier_GNB.fit(X_train_val_selected_final_features, y_train_val)

In [None]:
y_test_pred = best_classifier_GNB.predict(X_test_selected)

In [None]:
balanced_acc = balanced_accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
f1 = f1_score(y_test, y_test_pred)

# MRMR for feature selection

In this section, the optimisation procedure is similar to that of the preceding section, with one notable difference: feature selection is not performed inside the hyperparameter-optimisation loop. Instead, mRMR feature selection is run once on a separate subset of the training and validation data (30%), and the hyperparameters are then tuned on the remaining 70% using only the selected features. This approach was adopted because the mRMR procedure is time-intensive.
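
The sketch below illustrates the general greedy idea behind mRMR (maximum relevance, minimum redundancy): at each step, pick the feature whose relevance to the label is highest relative to its average redundancy with the features already chosen. It is not the mrmr_selection implementation; as a simplifying assumption it uses the absolute Pearson correlation for both relevance and redundancy, and the function name greedy_mrmr and the toy data are hypothetical.

```python
import numpy as np

def greedy_mrmr(X, y, k):
    """Greedily pick k column indices, trading relevance against redundancy."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    # Relevance: |correlation| between each feature and the label
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    # Redundancy lookup: |correlation| between every pair of features
    feature_corr = np.abs(np.corrcoef(X, rowvar=False))
    selected, remaining = [], list(range(n_features))
    for _ in range(min(k, n_features)):
        if selected:
            redundancy = feature_corr[np.ix_(remaining, selected)].mean(axis=1)
            scores = relevance[remaining] / np.maximum(redundancy, 1e-12)
        else:
            scores = relevance[remaining]  # first pick: relevance only
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: column 1 is a near-copy of column 0, column 2 is weaker but distinct
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
strong = labels + 0.5 * rng.normal(size=500)
near_copy = strong + 0.01 * rng.normal(size=500)
weaker = labels + 1.0 * rng.normal(size=500)
X_toy = np.column_stack([strong, near_copy, weaker])

picked = greedy_mrmr(X_toy, labels, k=2)
print(picked)
```

After one of the two nearly identical columns is chosen, the second pick is the weaker but non-redundant column, which is the behaviour that distinguishes mRMR from ranking features by relevance alone.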

## Number of features to keep

In [None]:
def perform_dataset_split(X_train_val, y_train_val):
    X_train_features, X_evaluation, y_train_features, y_evaluation = train_test_split(X_train_val, y_train_val, test_size=0.30, random_state=30)
    # Wrap the splits in DataFrames, as expected by mrmr_classif (feature selection itself is run later, on X_train_features only)
    X_train_features_df = pd.DataFrame(X_train_features)
    y_train_features_df = pd.DataFrame(y_train_features)
    X_evaluation_df = pd.DataFrame(X_evaluation)
    y_evaluation_df = pd.DataFrame(y_evaluation)
    return X_train_features_df, y_train_features_df, X_evaluation_df,y_evaluation_df

In [None]:
def objective_feature_selection(trial,X_train_features_df,y_train_features_df,X_evaluation_df,y_evaluation_df,scores):
  # Define the search space for the number of features
  num_features = trial.suggest_int("num_features", 1, X_train_features_df.shape[1])

  selected_features = mrmr_classif(X_train_features_df, y_train_features_df, K=num_features)
  X_train_features_selected_df = X_train_features_df.loc[:,selected_features]
  X_evaluation_selected = X_evaluation_df.loc[:, selected_features]

  SVC_cl = SVC()
  GNB_cl = GaussianNB()
  LR_cl = LogisticRegression()
  RF_cl = RandomForestClassifier()
  XG_cl = XGBClassifier()

  # Train the classifier on the selected features
  SVC_cl.fit(X_train_features_selected_df ,y_train_features_df)
  GNB_cl.fit(X_train_features_selected_df ,y_train_features_df)
  LR_cl.fit(X_train_features_selected_df,y_train_features_df)
  RF_cl.fit(X_train_features_selected_df,y_train_features_df)
  XG_cl.fit(X_train_features_selected_df,y_train_features_df)

  # Make predictions on the validation set
  y_pred_SVC = SVC_cl.predict(X_evaluation_selected)
  y_pred_GNB = GNB_cl.predict(X_evaluation_selected)
  y_pred_LR = LR_cl.predict(X_evaluation_selected)
  y_pred_RF = RF_cl.predict(X_evaluation_selected)
  y_pred_XG = XG_cl.predict(X_evaluation_selected)

  # Calculate the MCC of each classifier on the evaluation set
  # (the SVC predictions are computed above but are not included in the average)
  trial_scores = [
      matthews_corrcoef(y_evaluation_df, y_pred_GNB),
      matthews_corrcoef(y_evaluation_df, y_pred_LR),
      matthews_corrcoef(y_evaluation_df, y_pred_RF),
      matthews_corrcoef(y_evaluation_df, y_pred_XG),
  ]
  scores.extend(trial_scores)  # keep a running log of all scores across trials

  # Average over the classifiers of this trial only, so that the objective reflects
  # the current number of features rather than a running mean over all trials
  avg_mcc = np.mean(trial_scores)
  return avg_mcc

In [None]:
# Perform feature selection
X_train_features_df,y_train_features_df,X_evaluation_df,y_evaluation_df = perform_dataset_split(X_train_val, y_train_val)

In [None]:
scores=[]

In [None]:
# Define the objective function with selected features
def objective(trial):
    return objective_feature_selection(trial,X_train_features_df,y_train_features_df,X_evaluation_df,y_evaluation_df,scores)

In [None]:
study = optuna.create_study(direction='maximize', sampler=RandomSampler(seed=42))
study.optimize(objective, n_trials=100)

## SVC



#### Cross-validation for hyperparameter tuning and feature selection

In [None]:
# Define the number of folds for cross-validation
n_folds = 5

In [None]:
def perform_feature_selection(X_train_val, y_train_val):
    X_train_features, X_hyperparam_opt, y_train_features, y_hyperparam_opt = train_test_split(X_train_val, y_train_val, train_size=0.30, random_state=42)
    # Run mRMR on the 30% feature-selection split only
    X_train_val_df = pd.DataFrame(X_train_features)
    y_train_val_df = pd.DataFrame(y_train_features)
    X_hyperparam_opt = pd.DataFrame(X_hyperparam_opt)
    y_hyperparam_opt = pd.DataFrame(y_hyperparam_opt)
    selected_features = mrmr_classif(X_train_val_df, y_train_val_df, K=300)
    X_train_val_selected = X_hyperparam_opt.loc[:, selected_features]  # keep only the selected features in the hyperparameter-tuning split

    return X_train_val_selected, y_hyperparam_opt,selected_features

In [None]:
def objective_SVC(trial, X_hyperparam_opt, y_hyperparam_opt,selected_features):
    # Define the hyperparameters to be optimized
    C = trial.suggest_float('C', 0.1, 10)
    gamma = trial.suggest_float('gamma', 0.01, 1)
    kernel = trial.suggest_categorical('kernel', ['linear', 'poly', 'rbf', 'sigmoid'])
    degree = trial.suggest_int('degree', 2, 10)

    # Instantiate the classifier with the current hyperparameters
    classifier = SVC(C=C, gamma=gamma, kernel=kernel, degree=degree)

    # Perform cross-validation on the selected features
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    cv_scores = []

    for train_index, val_index in kf.split(X_hyperparam_opt, y_hyperparam_opt):
        X_train, X_val = X_hyperparam_opt.iloc[train_index], X_hyperparam_opt.iloc[val_index]
        y_train, y_val = y_hyperparam_opt.iloc[train_index], y_hyperparam_opt.iloc[val_index]

        # Fit the classifier on the selected features and evaluate on the validation set
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_val)
        mcc = matthews_corrcoef(y_val, y_pred)
        cv_scores.append(mcc)

    # Calculate the average MCC
    avg_mcc = np.mean(cv_scores)

    return avg_mcc

In [None]:
# Perform feature selection
X_hyperparam_opt, y_hyperparam_opt,selected_features = perform_feature_selection(X_train_val, y_train_val)

100%|██████████| 300/300 [02:54<00:00,  1.72it/s]


In [None]:
# Define the objective function with selected features
def objective(trial):
    return objective_SVC(trial, X_hyperparam_opt, y_hyperparam_opt,selected_features)

In [None]:
study = optuna.create_study(direction='maximize', sampler=RandomSampler(seed=42))
study.optimize(objective, n_trials=100)

[I 2023-06-23 13:05:54,447] A new study created in memory with name: no-name-a5ae983e-a302-4932-aeb4-224d2f54f8b6
[I 2023-06-23 13:05:55,247] Trial 0 finished with value: 0.8668486306868823 and parameters: {'C': 3.807947176588889, 'gamma': 0.951207163345817, 'kernel': 'linear', 'degree': 2}. Best is trial 0 with value: 0.8668486306868823.
[I 2023-06-23 13:05:57,939] Trial 1 finished with value: 0.0 and parameters: {'C': 8.675143843171858, 'gamma': 0.6051038616257767, 'kernel': 'rbf', 'degree': 3}. Best is trial 0 with value: 0.8668486306868823.
[I 2023-06-23 13:05:59,366] Trial 2 finished with value: 0.0 and parameters: {'C': 1.9000671753502962, 'gamma': 0.1915704647548995, 'kernel': 'poly', 'degree': 7}. Best is trial 0 with value: 0.8668486306868823.
[I 2023-06-23 13:06:01,140] Trial 3 finished with value: 0.0 and parameters: {'C': 1.4809892204552142, 'gamma': 0.29922320204986597, 'kernel': 'rbf', 'degree': 6}. Best is trial 0 with value: 0.8668486306868823.
[I 2023-06-23 13:06:01,47

In [None]:
# Get the best hyperparameters from Optuna
best_params = study.best_params

In [None]:
best_params #42 random seed

{'C': 3.807947176588889,
 'gamma': 0.951207163345817,
 'kernel': 'linear',
 'degree': 2}
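
The best trial uses the linear kernel. In scikit-learn's SVC, gamma only affects the rbf, poly and sigmoid kernels and degree only affects the poly kernel, so the gamma and degree values reported above have no influence on the fitted model. The sketch below checks this on synthetic data; X_svc, y_svc and the specific parameter values are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the selected features (assumption)
X_svc, y_svc = make_classification(n_samples=200, n_features=20, random_state=0)

# Two linear-kernel models that differ only in gamma and degree
pred_a = SVC(kernel="linear", C=1.0, gamma=0.01, degree=2).fit(X_svc, y_svc).predict(X_svc)
pred_b = SVC(kernel="linear", C=1.0, gamma=1.0, degree=9).fit(X_svc, y_svc).predict(X_svc)

same = bool(np.array_equal(pred_a, pred_b))
print(same)
```

For the linear kernel, only C is therefore an effective hyperparameter in this search space.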

#### Get best parameters and make predictions
For the predictions and results of all classifiers, we suggest running the notebook "Papadopoulou_Fillipidou_Vossos_Final_Project_Metrics-Plots(2 class problem).ipynb".

In [None]:
# selected_features holds the integer column indices returned by mrmr_classif
X_test_selected = X_test[:, selected_features]

In [None]:
X_test_selected.shape

(286, 300)

In [None]:
X_train_val_selected_final_features = X_train_val[:, selected_features]

In [None]:
X_train_val_selected_final_features.shape

(1141, 300)

In [None]:
best_classifier_SVC = SVC(**best_params)

In [None]:
best_classifier_SVC.fit(X_train_val_selected_final_features, y_train_val)

In [None]:
y_test_pred = best_classifier_SVC.predict(X_test_selected)

In [None]:
balanced_acc = balanced_accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
f1 = f1_score(y_test, y_test_pred)

## Random Forest

#### Cross-validation for hyperparameter tuning and feature selection

In [25]:
# Define the number of folds for cross-validation
n_folds = 5

In [26]:
def perform_feature_selection(X_train_val, y_train_val):
    X_train_features, X_hyperparam_opt, y_train_features, y_hyperparam_opt = train_test_split(X_train_val, y_train_val, train_size=0.30, random_state=42)
    # Run mRMR on the 30% feature-selection split only
    X_train_val_df = pd.DataFrame(X_train_features)
    y_train_val_df = pd.DataFrame(y_train_features)
    X_hyperparam_opt = pd.DataFrame(X_hyperparam_opt)
    y_hyperparam_opt = pd.DataFrame(y_hyperparam_opt)
    selected_features = mrmr_classif(X_train_val_df, y_train_val_df, K=300)
    X_train_val_selected = X_hyperparam_opt.loc[:, selected_features]  # keep only the selected features in the hyperparameter-tuning split

    return X_train_val_selected, y_hyperparam_opt,selected_features

In [27]:
def objective_RF(trial, X_hyperparam_opt, y_hyperparam_opt,selected_features):

  # Define the hyperparameters to be optimized
  n_estimators= trial.suggest_int('n_estimators', 100,1000)
  max_depth= trial.suggest_int('max_depth', 5, 31)
  min_samples_split= trial.suggest_int('min_samples_split', 2, 100)
  min_samples_leaf= trial.suggest_int('min_samples_leaf', 1, 4)
  bootstrap=trial.suggest_categorical('bootstrap', [True, False])

  # Instantiate the classifier with the current hyperparameters
  classifier = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,min_samples_split=min_samples_split,min_samples_leaf=min_samples_leaf,bootstrap=bootstrap)

  # Perform cross-validation on the selected features
  kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
  cv_scores = []

  for train_index, val_index in kf.split(X_hyperparam_opt, y_hyperparam_opt):
      X_train, X_val = X_hyperparam_opt.iloc[train_index], X_hyperparam_opt.iloc[val_index]
      y_train, y_val = y_hyperparam_opt.iloc[train_index], y_hyperparam_opt.iloc[val_index]
      # Fit the classifier on the selected features and evaluate on the validation set
      classifier.fit(X_train, y_train)
      y_pred = classifier.predict(X_val)
      mcc = matthews_corrcoef(y_val, y_pred)
      cv_scores.append(mcc)
  # Calculate the average MCC
  avg_mcc = np.mean(cv_scores)

  return avg_mcc

In [28]:
# Perform feature selection
X_hyperparam_opt, y_hyperparam_opt,selected_features = perform_feature_selection(X_train_val, y_train_val)

100%|██████████| 300/300 [03:01<00:00,  1.65it/s]


In [29]:
# Define the objective function with selected features
def objective(trial):
    return objective_RF(trial, X_hyperparam_opt, y_hyperparam_opt,selected_features)

In [None]:
study = optuna.create_study(direction='maximize', sampler=RandomSampler(seed=42))
study.optimize(objective, n_trials=100)

In [None]:
# Get the best hyperparameters from Optuna
best_params = study.best_params

In [None]:
best_params #42 random seed

{'n_estimators': 110,
 'max_depth': 17,
 'min_samples_split': 7,
 'min_samples_leaf': 1,
 'bootstrap': False}

#### Get best parameters and make predictions
For the predictions and results of all classifiers, we suggest running the notebook "Papadopoulou_Fillipidou_Vossos_Final_Project_Metrics-Plots(2 class problem).ipynb".

In [None]:
# selected_features holds the integer column indices returned by mrmr_classif
X_test_selected = X_test[:, selected_features]

In [None]:
X_test_selected.shape

(286, 300)

In [None]:
X_train_val_selected_final_features = X_train_val[:, selected_features]

In [None]:
X_train_val_selected_final_features.shape

(1141, 300)

In [None]:
best_classifier_RF = RandomForestClassifier(**best_params)

In [None]:
best_classifier_RF.fit(X_train_val_selected_final_features, y_train_val)

In [None]:
y_test_pred = best_classifier_RF.predict(X_test_selected)

In [None]:
balanced_acc = balanced_accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
f1 = f1_score(y_test, y_test_pred)

## XGBoost

#### Cross-validation for hyperparameter tuning and feature selection

In [None]:
# Define the number of folds for cross-validation
n_folds = 5

In [None]:
def perform_feature_selection(X_train_val, y_train_val):
    X_train_features, X_hyperparam_opt, y_train_features, y_hyperparam_opt = train_test_split(X_train_val, y_train_val, train_size=0.30, random_state=42)
    # Run mRMR on the 30% feature-selection split only
    X_train_val_df = pd.DataFrame(X_train_features)
    y_train_val_df = pd.DataFrame(y_train_features)
    X_hyperparam_opt = pd.DataFrame(X_hyperparam_opt)
    y_hyperparam_opt = pd.DataFrame(y_hyperparam_opt)
    selected_features = mrmr_classif(X_train_val_df, y_train_val_df, K=300)
    X_train_val_selected = X_hyperparam_opt.loc[:, selected_features]  # keep only the selected features in the hyperparameter-tuning split

    return X_train_val_selected, y_hyperparam_opt,selected_features

In [None]:
def objective_XG(trial, X_hyperparam_opt, y_hyperparam_opt,selected_features):

  params = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 200),
            'max_depth': trial.suggest_int('max_depth', 5, 20),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1),
            'subsample': trial.suggest_float('subsample', 0.5, 1.0),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
            'gamma': trial.suggest_float('gamma', 0, 5),
            'min_child_weight': trial.suggest_int('min_child_weight', 1, 5),
            'random_state': 42,
            'tree_method': 'hist',  # Optional: Use 'hist' method for faster training
            'objective': 'multi:softmax',
            'num_class': 2  # Replace with the actual number of classes
        }

  # Instantiate the classifier with the current hyperparameters
  classifier = XGBClassifier(**params)

  # Perform cross-validation on the selected features
  kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
  cv_scores = []

  for train_index, val_index in kf.split(X_hyperparam_opt, y_hyperparam_opt):
      X_train, X_val = X_hyperparam_opt.iloc[train_index], X_hyperparam_opt.iloc[val_index]
      y_train, y_val = y_hyperparam_opt.iloc[train_index], y_hyperparam_opt.iloc[val_index]
      # Fit the classifier on the selected features and evaluate on the validation set
      classifier.fit(X_train, y_train)
      y_pred = classifier.predict(X_val)
      mcc = matthews_corrcoef(y_val, y_pred)
      cv_scores.append(mcc)
  # Calculate the average MCC
  avg_mcc = np.mean(cv_scores)

  return avg_mcc

In [None]:
# Perform feature selection
X_hyperparam_opt, y_hyperparam_opt,selected_features = perform_feature_selection(X_train_val, y_train_val)

100%|██████████| 300/300 [03:16<00:00,  1.53it/s]


In [None]:
# Define the objective function with selected features
def objective(trial):
    return objective_XG(trial, X_hyperparam_opt, y_hyperparam_opt,selected_features)

In [None]:
study = optuna.create_study(direction='maximize', sampler=RandomSampler(seed=42))
study.optimize(objective, n_trials=100)

[I 2023-06-23 15:03:05,906] A new study created in memory with name: no-name-48c5052b-3ded-476f-8b38-99c3f1e2cd3a
[I 2023-06-23 15:03:46,773] Trial 0 finished with value: 0.9488298257967969 and parameters: {'n_estimators': 106, 'max_depth': 20, 'learning_rate': 0.07587945476302646, 'subsample': 0.7993292420985183, 'colsample_bytree': 0.5780093202212182, 'gamma': 0.7799726016810132, 'min_child_weight': 1}. Best is trial 0 with value: 0.9488298257967969.
[I 2023-06-23 15:04:41,323] Trial 1 finished with value: 0.9388856389707488 and parameters: {'n_estimators': 180, 'max_depth': 14, 'learning_rate': 0.0737265320016441, 'subsample': 0.5102922471479012, 'colsample_bytree': 0.9849549260809971, 'gamma': 4.162213204002109, 'min_child_weight': 2}. Best is trial 0 with value: 0.9488298257967969.
[I 2023-06-23 15:05:15,891] Trial 2 finished with value: 0.9301788222413163 and parameters: {'n_estimators': 77, 'max_depth': 7, 'learning_rate': 0.0373818018663584, 'subsample': 0.762378215816119, 'col

In [None]:
# Get the best hyperparameters from Optuna
best_params = study.best_params

In [None]:
best_params

{'n_estimators': 106,
 'max_depth': 20,
 'learning_rate': 0.07587945476302646,
 'subsample': 0.7993292420985183,
 'colsample_bytree': 0.5780093202212182,
 'gamma': 0.7799726016810132,
 'min_child_weight': 1}
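
`study.best_params` holds only the values sampled through `trial.suggest_*`, so the fixed entries of the search dictionary (`random_state`, `tree_method`, `objective`, `num_class`) are not part of it, as the output above shows. If the final model should use exactly the tuned configuration, the fixed settings can be merged back in. A minimal sketch, where `fixed_params` and `final_params` are hypothetical names and the best values are copied from the output above:

```python
# Fixed settings used inside objective_XG (not sampled by Optuna)
fixed_params = {
    'random_state': 42,
    'tree_method': 'hist',
    'objective': 'multi:softmax',
    'num_class': 2,
}

# Best sampled values, copied from the output above
best_params = {
    'n_estimators': 106,
    'max_depth': 20,
    'learning_rate': 0.07587945476302646,
    'subsample': 0.7993292420985183,
    'colsample_bytree': 0.5780093202212182,
    'gamma': 0.7799726016810132,
    'min_child_weight': 1,
}

# Later keys win on conflict; here the two dictionaries do not overlap
final_params = {**fixed_params, **best_params}
print(final_params)
# final_params could then be passed on, e.g. XGBClassifier(**final_params)
```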

#### Get best parameters and make predictions
For the predictions and results of all classifiers, we suggest running the notebook "Papadopoulou_Fillipidou_Vossos_Final_Project_Metrics-Plots(2 class problem).ipynb".

In [None]:
# Keep only the selected feature columns of the test set
X_test_selected = X_test[:, selected_features]

In [None]:
X_test_selected.shape

(286, 300)

In [None]:
X_train_val_selected_final_features = X_train_val[:, selected_features]

In [None]:
X_train_val_selected_final_features.shape

(1141, 300)

In [None]:
best_classifier_XG = XGBClassifier(**best_params)

In [None]:
best_classifier_XG.fit(X_train_val_selected_final_features, y_train_val)

In [None]:
y_test_pred = best_classifier_XG.predict(X_test_selected)

In [None]:
balanced_acc = balanced_accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
f1 = f1_score(y_test, y_test_pred)
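
The tuning objective is MCC, while the cell above stores balanced accuracy, precision, recall and F1 in separate variables without displaying them. One way to collect and print them together, including MCC on the test set, is sketched below. The label lists are toy values (assumption), and `summarize_metrics` is a hypothetical helper that is not part of the original notebook.

```python
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             matthews_corrcoef, precision_score, recall_score)

def summarize_metrics(y_true, y_pred):
    """Return the main binary-classification metrics in one dictionary."""
    return {
        'balanced_accuracy': balanced_accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
        'mcc': matthews_corrcoef(y_true, y_pred),
    }

# Toy labels (assumption) standing in for y_test and y_test_pred
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

metrics = summarize_metrics(y_true_demo, y_pred_demo)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```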

## Logistic Regression

#### Cross-validation for hyperparameter tuning and feature selection

In [None]:
# Define the number of folds for cross-validation
n_folds = 5

In [None]:
def perform_feature_selection(X_train_val, y_train_val):
    # Use 30% of the train/validation data for feature selection;
    # the remaining 70% is kept for hyperparameter tuning
    X_train_features, X_hyperparam_opt, y_train_features, y_hyperparam_opt = train_test_split(X_train_val, y_train_val, train_size=0.30, random_state=42)
    X_train_val_df = pd.DataFrame(X_train_features)
    y_train_val_df = pd.DataFrame(y_train_features)
    X_hyperparam_opt = pd.DataFrame(X_hyperparam_opt)
    y_hyperparam_opt = pd.DataFrame(y_hyperparam_opt)
    # Select 300 features with mRMR, using the feature-selection split only
    selected_features = mrmr_classif(X_train_val_df, y_train_val_df, K=300)
    X_train_val_selected = X_hyperparam_opt.loc[:, selected_features]  # restrict the hyperparameter-tuning split to the selected features

    return X_train_val_selected, y_hyperparam_opt, selected_features
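
`train_test_split` does not preserve class proportions unless `stratify` is passed. If the two classes are imbalanced, a stratified variant of the 30%/70% split may be preferable. A minimal sketch on synthetic labels (assumption); the original notebook does not stratify this split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): 80 samples of class 0, 20 of class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo keeps the class ratio the same in both parts of the split
X_fs, X_opt, y_fs, y_opt = train_test_split(
    X_demo, y_demo, train_size=0.30, random_state=42, stratify=y_demo
)

# Fraction of class 1 in the feature-selection and tuning splits
print(y_fs.mean(), y_opt.mean())
```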

In [None]:
def objective_LR(trial, X_hyperparam_opt, y_hyperparam_opt, selected_features):
  # Optuna objective for logistic regression: returns the mean MCC over the stratified CV folds.
  # X_hyperparam_opt is already restricted to the selected features;
  # n_folds is read from the notebook's global scope.

  params = {
              'tol': trial.suggest_float('tol', 1e-6, 1e-3),
              'C': trial.suggest_float('C', 1e-5, 100),
              'n_jobs': -1,
              'solver': trial.suggest_categorical('solver', ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'])
          }

  # Instantiate the classifier with the current hyperparameters
  classifier = LogisticRegression(**params)

  # Perform cross-validation on the selected features
  kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
  cv_scores = []

  for train_index, val_index in kf.split(X_hyperparam_opt, y_hyperparam_opt):
      X_train, X_val = X_hyperparam_opt.iloc[train_index], X_hyperparam_opt.iloc[val_index]
      y_train, y_val = y_hyperparam_opt.iloc[train_index], y_hyperparam_opt.iloc[val_index]
      # Fit the classifier on the selected features and evaluate on the validation set
      # (ravel() passes the labels as a 1-D array, as scikit-learn expects)
      classifier.fit(X_train, y_train.values.ravel())
      y_pred = classifier.predict(X_val)
      mcc = matthews_corrcoef(y_val, y_pred)
      cv_scores.append(mcc)
  # Calculate the average MCC
  avg_mcc = np.mean(cv_scores)

  return avg_mcc

In [None]:
# Perform feature selection
X_hyperparam_opt, y_hyperparam_opt,selected_features = perform_feature_selection(X_train_val, y_train_val)

100%|██████████| 300/300 [03:21<00:00,  1.49it/s]


In [None]:
# Define the objective function with selected features
def objective(trial):
    return objective_LR(trial, X_hyperparam_opt, y_hyperparam_opt,selected_features)

In [None]:
study = optuna.create_study(direction='maximize', sampler=RandomSampler(seed=42))
study.optimize(objective, n_trials=100)

[I 2023-06-23 14:45:08,003] A new study created in memory with name: no-name-078e3a57-be4b-4e05-81a4-0622646491e9
[I 2023-06-23 14:45:10,723] Trial 0 finished with value: 0.8668486306868823 and parameters: {'tol': 0.0003751655787285152, 'C': 95.07143113384855, 'solver': 'newton-cg'}. Best is trial 0 with value: 0.8668486306868823.
[I 2023-06-23 14:45:13,481] Trial 1 finished with value: 0.5057760313622911 and parameters: {'tol': 0.0008663099696291604, 'C': 60.11150516317076, 'solver': 'liblinear'}. Best is trial 0 with value: 0.8668486306868823.
[I 2023-06-23 14:45:21,986] Trial 2 finished with value: 0.4804327086352961 and parameters: {'tol': 0.00018264314223989352, 'C': 18.340459151298283, 'solver': 'saga'}. Best is trial 0 with value: 0.8668486306868823.
[I 2023-06-23 14:45:24,899] Trial 3 finished with value: 0.4983549388463773 and parameters: {'tol': 0.0001403543667913898, 'C': 29.21447193207533, 'solver': 'liblinear'}. Best is trial 0 with value: 0.8668486306868823.
[I 2023-06-23

In [None]:
# Get the best hyperparameters from Optuna
best_params = study.best_params

In [None]:
best_params #42 random seed

{'tol': 0.0009966402002368315, 'C': 55.54317500594569, 'solver': 'lbfgs'}

#### Get best parameters and make predictions
For the predictions and results of all classifiers, we suggest running the notebook "Papadopoulou_Fillipidou_Vossos_Final_Project_Metrics-Plots(2 class problem).ipynb".

In [None]:
# Keep only the selected feature columns of the test set
X_test_selected = X_test[:, selected_features]

In [None]:
X_test_selected.shape

(286, 300)

In [None]:
X_train_val_selected_final_features = X_train_val[:, selected_features]

In [None]:
X_train_val_selected_final_features.shape

(1141, 300)

In [None]:

best_classifier_LR = LogisticRegression(**best_params)

In [None]:
best_classifier_LR.fit(X_train_val_selected_final_features, y_train_val)

In [None]:
y_test_pred = best_classifier_LR.predict(X_test_selected)

In [None]:
balanced_acc = balanced_accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
f1 = f1_score(y_test, y_test_pred)

## GaussianNB


#### Cross-validation for hyperparameter tuning and feature selection

In [None]:
# Define the number of folds for cross-validation
n_folds = 5

In [None]:
def perform_feature_selection(X_train_val, y_train_val):
    # Use 30% of the train/validation data for feature selection;
    # the remaining 70% is kept for hyperparameter tuning
    X_train_features, X_hyperparam_opt, y_train_features, y_hyperparam_opt = train_test_split(X_train_val, y_train_val, train_size=0.30, random_state=42)
    X_train_val_df = pd.DataFrame(X_train_features)
    y_train_val_df = pd.DataFrame(y_train_features)
    X_hyperparam_opt = pd.DataFrame(X_hyperparam_opt)
    y_hyperparam_opt = pd.DataFrame(y_hyperparam_opt)
    # Select 300 features with mRMR, using the feature-selection split only
    selected_features = mrmr_classif(X_train_val_df, y_train_val_df, K=300)
    X_train_val_selected = X_hyperparam_opt.loc[:, selected_features]  # restrict the hyperparameter-tuning split to the selected features

    return X_train_val_selected, y_hyperparam_opt, selected_features

In [None]:
def objective_GNB(trial, X_hyperparam_opt, y_hyperparam_opt, selected_features):
  # Optuna objective for Gaussian Naive Bayes: returns the mean MCC over the stratified CV folds.
  # X_hyperparam_opt is already restricted to the selected features;
  # n_folds is read from the notebook's global scope.

  params = {
            'var_smoothing': trial.suggest_float('var_smoothing', 1e-10, 1e-3, log=True)
          }

  # Instantiate the classifier with the current hyperparameters
  classifier = GaussianNB(**params)

  # Perform cross-validation on the selected features
  kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
  cv_scores = []

  for train_index, val_index in kf.split(X_hyperparam_opt, y_hyperparam_opt):
      X_train, X_val = X_hyperparam_opt.iloc[train_index], X_hyperparam_opt.iloc[val_index]
      y_train, y_val = y_hyperparam_opt.iloc[train_index], y_hyperparam_opt.iloc[val_index]
      # Fit the classifier on the selected features and evaluate on the validation set
      # (ravel() passes the labels as a 1-D array, as scikit-learn expects)
      classifier.fit(X_train, y_train.values.ravel())
      y_pred = classifier.predict(X_val)
      mcc = matthews_corrcoef(y_val, y_pred)
      cv_scores.append(mcc)
  # Calculate the average MCC
  avg_mcc = np.mean(cv_scores)

  return avg_mcc

In [None]:
# Perform feature selection
X_hyperparam_opt, y_hyperparam_opt,selected_features = perform_feature_selection(X_train_val, y_train_val)

100%|██████████| 300/300 [03:20<00:00,  1.50it/s]


In [None]:
# Define the objective function with selected features
def objective(trial):
    return objective_GNB(trial, X_hyperparam_opt, y_hyperparam_opt,selected_features)

In [None]:
study = optuna.create_study(direction='maximize', sampler=RandomSampler(seed=42))
study.optimize(objective, n_trials=100)

[I 2023-06-23 14:58:13,783] A new study created in memory with name: no-name-a5a2727e-7ae5-41bf-9cd2-fbdef4eaea93
[I 2023-06-23 14:58:14,001] Trial 0 finished with value: 0.48788674353486067 and parameters: {'var_smoothing': 4.185822729546966e-08}. Best is trial 0 with value: 0.48788674353486067.
[I 2023-06-23 14:58:14,217] Trial 1 finished with value: 0.5523622794189398 and parameters: {'var_smoothing': 0.0004518560951024107}. Best is trial 1 with value: 0.5523622794189398.
[I 2023-06-23 14:58:14,433] Trial 2 finished with value: 0.5068216787033559 and parameters: {'var_smoothing': 1.3303245101522907e-05}. Best is trial 1 with value: 0.5523622794189398.
[I 2023-06-23 14:58:14,652] Trial 3 finished with value: 0.5068216787033559 and parameters: {'var_smoothing': 1.5509913987594307e-06}. Best is trial 1 with value: 0.5523622794189398.
[I 2023-06-23 14:58:14,863] Trial 4 finished with value: 0.48788674353486067 and parameters: {'var_smoothing': 1.2363188277052218e-09}. Best is trial 1 wi

In [None]:
# Get the best hyperparameters from Optuna
best_params = study.best_params

In [None]:
best_params #42 random seed

{'var_smoothing': 0.0008094845352286139}
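
In scikit-learn's `GaussianNB`, `var_smoothing` is the portion of the largest feature variance that is added to all variances for numerical stability, so larger values smooth the per-class Gaussians more. The sketch below checks this relationship on synthetic data (assumption) through the fitted attribute `epsilon_`; the `var_smoothing` value is the one reported above.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data (assumption): 5 features with increasing spread, random binary labels
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 5)) * np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = rng.integers(0, 2, size=100)

var_smoothing = 0.0008094845352286139  # best value reported above
clf = GaussianNB(var_smoothing=var_smoothing).fit(X_demo, y_demo)

# Additive variance term: var_smoothing times the largest feature variance
expected_epsilon = var_smoothing * X_demo.var(axis=0).max()
print(clf.epsilon_, expected_epsilon)
```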

#### Get best parameters and make predictions
For the predictions and results of all classifiers, we suggest running the notebook "Papadopoulou_Fillipidou_Vossos_Final_Project_Metrics-Plots(2 class problem).ipynb".

In [None]:
# Keep only the selected feature columns of the test set
X_test_selected = X_test[:, selected_features]

In [None]:
X_test_selected.shape

(286, 300)

In [None]:
X_train_val_selected_final_features = X_train_val[:, selected_features]

In [None]:
X_train_val_selected_final_features.shape

(1141, 300)

In [None]:
# Instantiate the final classifier with the best hyperparameters found by Optuna
best_classifier_GNB = GaussianNB(**best_params)

In [None]:
best_classifier_GNB.fit(X_train_val_selected_final_features, y_train_val)

In [None]:
y_test_pred = best_classifier_GNB.predict(X_test_selected)

In [None]:
balanced_acc = balanced_accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
f1 = f1_score(y_test, y_test_pred)