# CASE DATASET ANALYSIS NOTEBOOK
Created on Tue Oct  3 16:46:11 2023

@author: lrm22005

This notebook can be run after updating only the dataset paths.

## EDA_GRAPH FEATURES CLASSIFICATION WITH OUR LABELS

**Data Loading and Preprocessing:**
The dataset is loaded from a CSV file named 'EDA_graph_features.csv'.
The 'subject' column is used to identify different subjects.

**Classifier Definitions:**
Seven classifiers (Naive Bayes, K-Nearest Neighbors, Random Forest, AdaBoost, Gradient Boosting, Decision Tree, and Support Vector Machine), each wrapped in OneVsRestClassifier, are defined together with the hyperparameter grids used for grid search.
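
Because each classifier is wrapped in OneVsRestClassifier, the grid keys address the inner model through the `estimator__` prefix. The following is a minimal sketch of that naming convention on synthetic data; `X_demo`, `y_demo`, and the reduced grid are illustrative assumptions, not the notebook's dataset or full grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the real feature matrix (assumption).
X_demo, y_demo = make_classification(
    n_samples=120, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Parameters of the wrapped model are reached with the "estimator__" prefix.
clf = OneVsRestClassifier(KNeighborsClassifier())
param_grid = {
    "estimator__n_neighbors": [1, 3, 5],
    "estimator__weights": ["uniform", "distance"],
}

search = GridSearchCV(clf, param_grid, cv=3)
search.fit(X_demo, y_demo)
best_params = search.best_params_
print(best_params)
```

Without the prefix (for example `n_neighbors` alone), GridSearchCV would try to set the parameter on the wrapper and raise an error, since OneVsRestClassifier itself has no such parameter.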

**Subjects Identification:**
The unique subjects present in the dataset are identified.

**Cross-Validation Strategy (Leave-One-Subject-Out):**
GroupKFold is configured with the number of splits equal to the number of unique subjects, so each subject is held out as the test set in exactly one iteration while the remaining subjects are used for training. In the code, the same splits are built by looping over the subjects directly.
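
As a small illustration of this strategy, the sketch below checks on toy data that GroupKFold with one split per subject holds out the same subjects as scikit-learn's LeaveOneGroupOut. The arrays `X_demo`, `y_demo`, and `groups` are hypothetical and only stand in for the real features, labels, and subject column.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Hypothetical toy data: 6 samples from 3 subjects (not the real dataset).
X_demo = np.arange(12).reshape(6, 2)
y_demo = np.array([0, 1, 0, 1, 0, 1])
groups = np.array(["s1", "s1", "s2", "s2", "s3", "s3"])

n_subjects = len(np.unique(groups))
gkf = GroupKFold(n_splits=n_subjects)
logo = LeaveOneGroupOut()

# Collect which subject(s) end up in the test fold of every split.
gkf_test_subjects = sorted(
    tuple(np.unique(groups[test_idx]).tolist())
    for _, test_idx in gkf.split(X_demo, y_demo, groups)
)
logo_test_subjects = sorted(
    tuple(np.unique(groups[test_idx]).tolist())
    for _, test_idx in logo.split(X_demo, y_demo, groups)
)
print(gkf_test_subjects)
print(logo_test_subjects)
```

Either splitter can be passed as `cv=` to scikit-learn utilities together with `groups=`, which is an alternative to building the boolean masks by hand.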

**Classifier Evaluation Loop:**
The code iterates over each classifier defined earlier.
    For each classifier:
        It initializes variables for accuracy and balanced accuracy.
        It initializes empty lists to store true labels and predicted labels across all test subjects.
        It iterates over each unique subject:
            Separates the data into training and test sets, where the current subject is the test set.
            Performs a grid search to find the best hyperparameters for the classifier using the training data.
            Trains the best classifier on the training data.
            Makes predictions on the test data.
            Appends the true and predicted labels for later evaluation.
        Calculates accuracy and balanced accuracy over the pooled predictions from all held-out subjects.
        Prints the classification report, including precision, recall, and F1-score.
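
Since the loop collects true and predicted labels from every held-out subject before scoring, the reported accuracy is a pooled value, which can differ from the mean of per-subject accuracies when subjects contribute different numbers of samples. The sketch below shows the difference on made-up labels; the `folds` dictionary is hypothetical and not taken from the dataset.

```python
from sklearn.metrics import accuracy_score

# Hypothetical per-subject results as (true labels, predicted labels).
folds = {
    "s1": ([0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]),  # 8 samples, all correct
    "s2": ([0, 1], [1, 0]),                                      # 2 samples, none correct
}

y_test_all, y_pred_all, per_subject = [], [], []
for subject, (y_true, y_pred) in folds.items():
    y_test_all.extend(y_true)
    y_pred_all.extend(y_pred)
    per_subject.append(accuracy_score(y_true, y_pred))

# Pooled: every sample counts equally (8 correct out of 10).
pooled_accuracy = accuracy_score(y_test_all, y_pred_all)
# Per-subject mean: every subject counts equally (mean of 1.0 and 0.0).
mean_subject_accuracy = sum(per_subject) / len(per_subject)
print(pooled_accuracy, mean_subject_accuracy)
```

Pooling weights subjects by their number of samples, so subjects with more windows have more influence on the final score.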

## CLASSIFICATION USING THE 5 MOST RELEVANT FEATURES

In [2]:
#---------------------------------------------------------------------------------------------
# load the libraries that are required for this project:
#---------------------------------------------------------------------------------------------
import sys

import time                     # Time is for estimating the computational time of every operation
import pandas as pd
import numpy as np              # NumPy is for numerical operations
import matplotlib               # MatPlotLib is for making plots & figures
import matplotlib.pyplot as plt # PyPlot is a subset of the library for making MATLAB-style plots
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, balanced_accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score, LeaveOneGroupOut, GridSearchCV
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import warnings

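def grid_search(clf, params, X_train, y_train):
    """Run a 5-fold grid search on the training split and return the best (refitted) estimator."""
    # Use a distinct local name so the function name is not shadowed
    search = GridSearchCV(clf, params, cv=5, n_jobs=-1)
    search.fit(X_train, y_train)
    return search.best_estimator_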

# Load the dataset (update `path` to the folder that contains the CSV file)
path = r'C:\Users\lrm22005\OneDrive - University of Connecticut\Research\emotion_graph\codes\EDA-graph'
data = pd.read_csv(path + r'\EDA_graph_features.csv')

# scaler = StandardScaler()
X = data[['total_load_centrality','total_harmonic_centrality','graph_number_of_cliqs','P_diameter','P_radius','subject']]
y = data['class']

# Define the classifiers
classifiers = {
    'nb': (OneVsRestClassifier(GaussianNB()), {'estimator__var_smoothing': [1e-6, 1e-5, 1e-4]}),
    'knn': (OneVsRestClassifier(KNeighborsClassifier()), {'estimator__n_neighbors': [1, 2, 3, 4, 5], 'estimator__weights': ['uniform', 'distance'], 'estimator__p': [1, 2]}),
    'rf': (OneVsRestClassifier(RandomForestClassifier()), {'estimator__n_estimators': [100, 200], 'estimator__max_depth': [20, 30], 'estimator__random_state': [42]}),
    'abc': (OneVsRestClassifier(AdaBoostClassifier()), {'estimator__n_estimators': [50, 100, 150], 'estimator__learning_rate': [0.1, 0.5]}),
    'gbc': (OneVsRestClassifier(GradientBoostingClassifier()), {'estimator__n_estimators': [50, 100], 'estimator__learning_rate': [0.1, 1.0], 'estimator__max_depth': [40, 60]}),
    'dt': (OneVsRestClassifier(DecisionTreeClassifier()), {'estimator__max_depth': [10, 30], 'estimator__min_samples_split': [2, 10], 'estimator__random_state': [42]}),
    'svm': (OneVsRestClassifier(SVC()), {'estimator__C': [1, 10, 100], 'estimator__gamma': [0.1, 1, 10], 'estimator__kernel': ['rbf']})
}

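# Define the subjects
subjects = X['subject']
unique_subjects = subjects.unique()

# Drop the subject identifier so it is not used as a feature
X = X.drop('subject', axis=1)

# Leave-one-subject-out splitter (one fold per subject). Note: the loop below
# builds the same splits manually with boolean masks, so `cv` is not used.
cv = GroupKFold(n_splits=len(unique_subjects))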

# Perform grid search for each classifier
best_classifiers = {}
for name, (clf, params) in classifiers.items():
    avg_accuracy = 0.0
    avg_balanced_accuracy = 0.0
    
    # Initialize variables for evaluation
    y_test_all = []
    y_pred_all = []
    
    # Iterate over each subject as a separate test fold
    for subject in unique_subjects:
        # Identify the indices for the current subject in both features and target labels
        subject_indices = subjects == subject
        X_test, y_test = X[subject_indices], y[subject_indices]
        X_train, y_train = X[~subject_indices], y[~subject_indices]
        
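        # Grid search for the current classifier. With the default refit=True,
        # best_estimator_ is already trained on the full training split, so no
        # additional call to fit() is needed.
        best_clf = grid_search(clf, params, X_train, y_train)
        best_classifiers[name] = best_clf  # overwritten each fold; keeps the last fold's model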
        
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=UserWarning)
            y_pred = best_clf.predict(X_test)  
            
        # Append current test data and predictions for overall evaluation
        y_test_all.extend(y_test)
        y_pred_all.extend(y_pred)
    
    # Calculate overall metrics
    avg_accuracy = accuracy_score(y_test_all, y_pred_all)
    avg_balanced_accuracy = balanced_accuracy_score(y_test_all, y_pred_all)
    
    # Print classification report
    print(f"{name}:")
    print(f"Average Accuracy: {avg_accuracy:.3f}")
    print(f"Average Balanced Accuracy: {avg_balanced_accuracy:.3f}")
    print(classification_report(y_test_all, y_pred_all, zero_division=0))

nb:
Average Accuracy: 0.747
Average Balanced Accuracy: 0.243
              precision    recall  f1-score   support

           0       0.78      0.94      0.85     12527
           1       0.00      0.00      0.00       834
           2       0.00      0.00      0.00       413
           3       0.38      0.27      0.32      1807
           4       0.00      0.00      0.00       889

    accuracy                           0.75     16470
   macro avg       0.23      0.24      0.23     16470
weighted avg       0.63      0.75      0.68     16470

knn:
Average Accuracy: 0.662
Average Balanced Accuracy: 0.218
              precision    recall  f1-score   support

           0       0.76      0.84      0.80     12527
           1       0.10      0.06      0.08       834
           2       0.00      0.00      0.00       413
           3       0.15      0.14      0.15      1807
           4       0.13      0.04      0.06       889

    accuracy                           0.66     16470
   macro
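
The gap between accuracy and balanced accuracy in the reports above follows from how the latter is defined: scikit-learn's `balanced_accuracy_score` is the unweighted mean of the per-class recalls. For the nb report, the recalls 0.94, 0.00, 0.00, 0.27, and 0.00 average to about 0.24, consistent with the reported 0.243. A minimal sketch on hypothetical labels (not the notebook's data):

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical imbalanced labels: 8 samples of class 0, 2 samples of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10  # a classifier that always predicts the majority class

acc = accuracy_score(y_true, y_pred)               # fraction of correct predictions
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
macro_recall = recall_score(y_true, y_pred, average="macro")
print(acc, bal_acc, macro_recall)
```

A model that mostly predicts the dominant class can therefore score high accuracy while its balanced accuracy stays near 1 divided by the number of classes.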

## TRADITIONAL METHODS CLASSIFICATION WITH OUR LABELS

In [3]:
#---------------------------------------------------------------------------------------------
# load the libraries that are required for this project:
#---------------------------------------------------------------------------------------------
import sys
import time                     # Time is for estimating the computational time of every operation
import pandas as pd
import numpy as np              # NumPy is for numerical operations
import matplotlib               # MatPlotLib is for making plots & figures
import matplotlib.pyplot as plt # PyPlot is a subset of the library for making MATLAB-style plots
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, balanced_accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score, LeaveOneGroupOut, GridSearchCV
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import warnings

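def grid_search(clf, params, X_train, y_train):
    """Run a 5-fold grid search on the training split and return the best (refitted) estimator."""
    # Use a distinct local name so the function name is not shadowed
    search = GridSearchCV(clf, params, cv=5, n_jobs=-1)
    search.fit(X_train, y_train)
    return search.best_estimator_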

# Load the dataset (update `path` to the folder that contains the CSV file)
path = r'C:\Users\lrm22005\OneDrive - University of Connecticut\Research\emotion_graph\codes\EDA-graph'
data = pd.read_csv(path + r'\traditional_lab_feature_matrix_labeled.csv')

# scaler = StandardScaler()
X = data.drop(['class','valence','arousal','Class'], axis=1)
y = data['class']

# Define the classifiers
classifiers = {
    'nb': (OneVsRestClassifier(GaussianNB()), {'estimator__var_smoothing': [1e-6, 1e-5, 1e-4]}),
    'knn': (OneVsRestClassifier(KNeighborsClassifier()), {'estimator__n_neighbors': [1, 2, 3, 4, 5], 'estimator__weights': ['uniform', 'distance'], 'estimator__p': [1, 2]}),
    'rf': (OneVsRestClassifier(RandomForestClassifier()), {'estimator__n_estimators': [100, 200], 'estimator__max_depth': [20, 30], 'estimator__random_state': [42]}),
    'abc': (OneVsRestClassifier(AdaBoostClassifier()), {'estimator__n_estimators': [50, 100, 150], 'estimator__learning_rate': [0.1, 0.5]}),
    'gbc': (OneVsRestClassifier(GradientBoostingClassifier()), {'estimator__n_estimators': [50, 100], 'estimator__learning_rate': [0.1, 1.0], 'estimator__max_depth': [40, 60]}),
    'dt': (OneVsRestClassifier(DecisionTreeClassifier()), {'estimator__max_depth': [10, 30], 'estimator__min_samples_split': [2, 10], 'estimator__random_state': [42]}),
    'svm': (OneVsRestClassifier(SVC()), {'estimator__C': [1, 10, 100], 'estimator__gamma': [0.1, 1, 10], 'estimator__kernel': ['rbf']})
}

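# Define the subjects
subjects = X['subject']
unique_subjects = subjects.unique()

# Drop the subject identifier so it is not used as a feature
X = X.drop('subject', axis=1)

# Leave-one-subject-out splitter (one fold per subject). Note: the loop below
# builds the same splits manually with boolean masks, so `cv` is not used.
cv = GroupKFold(n_splits=len(unique_subjects))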

# Perform grid search for each classifier
best_classifiers = {}
for name, (clf, params) in classifiers.items():
    avg_accuracy = 0.0
    avg_balanced_accuracy = 0.0
    
    # Initialize variables for evaluation
    y_test_all = []
    y_pred_all = []
    
    # Iterate over each subject as a separate test fold
    for subject in unique_subjects:
        # Identify the indices for the current subject in both features and target labels
        subject_indices = subjects == subject
        X_test, y_test = X[subject_indices], y[subject_indices]
        X_train, y_train = X[~subject_indices], y[~subject_indices]
        
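        # Grid search for the current classifier. With the default refit=True,
        # best_estimator_ is already trained on the full training split, so no
        # additional call to fit() is needed.
        best_clf = grid_search(clf, params, X_train, y_train)
        best_classifiers[name] = best_clf  # overwritten each fold; keeps the last fold's model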
        
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=UserWarning)
            y_pred = best_clf.predict(X_test)  
            
        # Append current test data and predictions for overall evaluation
        y_test_all.extend(y_test)
        y_pred_all.extend(y_pred)
    
    # Calculate overall metrics
    avg_accuracy = accuracy_score(y_test_all, y_pred_all)
    avg_balanced_accuracy = balanced_accuracy_score(y_test_all, y_pred_all)
    
    # Print classification report
    print(f"{name}:")
    print(f"Average Accuracy: {avg_accuracy:.3f}")
    print(f"Average Balanced Accuracy: {avg_balanced_accuracy:.3f}")
    print(classification_report(y_test_all, y_pred_all, zero_division=0))


nb:
Average Accuracy: 0.670
Average Balanced Accuracy: 0.166
              precision    recall  f1-score   support

           0       0.67      1.00      0.80     11085
           1       0.00      0.00      0.00       438
           2       0.00      0.00      0.00       222
           3       0.00      0.00      0.00       446
           4       0.00      0.00      0.00       864
           5       1.00      0.00      0.00      3415

    accuracy                           0.67     16470
   macro avg       0.28      0.17      0.13     16470
weighted avg       0.66      0.67      0.54     16470

knn:
Average Accuracy: 0.553
Average Balanced Accuracy: 0.161
              precision    recall  f1-score   support

           0       0.68      0.77      0.72     11085
           1       0.03      0.01      0.02       438
           2       0.00      0.00      0.00       222
           3       0.00      0.00      0.00       446
           4       0.04      0.03      0.03       864
         

NameError: name 'class_counts_train' is not defined