# Table of Contents

- [Import libraries](#import-libraries)
- [1. Load and preprocess the data](#1.-load-and-preprocess-the-data)
- [2. Feature Selection Methods](#2.-feature-selection-methods)
    - [2.1. Feature Selection using Statistical Tests and Machine Learning Models](#2.1.-feature-selection-using-statistical-tests-and-machine-learning-models)
        - [2.1.1. Random Forest Feature Importance](#2.1.1.-random-forest-feature-importance)
        - [2.1.2. Mutual Information](#2.1.2.-mutual-information)
        - [2.1.3. Recursive Feature Elimination (RFE)](#2.1.3.-recursive-feature-elimination-(rfe))
        - [2.1.4. Implementing Feature Selection Methods](#2.1.4.-implementing-feature-selection-methods)

# Import libraries

In [22]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import joblib
import json
import seaborn as sns

# 1. Load and preprocess the data

- Load and take a look at the data:

In [6]:
file_path_to_eeg_data = '/home/letruongzzio/PRML-MidTerm-Project/data/extracted_eeg15.txt'
eeg_data = pd.read_csv(file_path_to_eeg_data)

In [282]:
eeg_data

Unnamed: 0,t,ED_AF3,ED_F7,ED_F3,ED_FC5,ED_T7,ED_P7,ED_O1,ED_O2,ED_P8,ED_T8,ED_FC6,ED_F4,ED_F8,ED_AF4,state
0,0,36.308042,-55.186421,152.140934,-69.699336,-33.371978,7.011785,3.676519,-36.256532,-7.410987,13.321748,-12.999812,10.527336,18.369719,-26.431018,focused
1,1,145.275957,-222.151700,613.645042,-281.397540,-134.994174,29.029020,15.308817,-145.544705,-29.492744,53.302692,-51.218925,42.033961,72.757687,-106.553388,focused
2,2,221.156883,-342.470203,948.443954,-435.638306,-209.620124,47.306369,24.710285,-224.045032,-46.279805,81.361800,-73.960070,63.947824,109.184230,-164.097805,focused
3,3,174.543454,-277.631682,768.868855,-354.526413,-171.274459,41.546387,20.535896,-182.844315,-42.460191,64.876410,-47.334591,50.743829,85.960672,-131.003854,focused
4,4,130.972434,-211.498239,578.466515,-268.812212,-129.841300,28.959749,13.057790,-141.738067,-38.558418,48.789203,-21.136371,38.098960,68.718020,-95.478064,focused
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
443967,443967,5.453479,-8.705584,-12.619785,9.845330,8.242794,1.594757,-7.147407,-7.720854,-4.353289,4.299503,6.283642,5.927045,5.917661,-7.017291,drownsy
443968,443968,4.024225,-3.848896,-7.586209,8.348809,6.616827,1.864122,-3.487370,-8.551686,-7.983080,3.394277,5.859646,4.946830,4.072439,-7.669937,drownsy
443969,443969,3.083207,-4.562668,-8.853218,8.208134,4.907967,3.653700,-2.050684,-6.375272,-6.582744,3.772559,6.699536,3.841765,2.970310,-8.712591,drownsy
443970,443970,3.250555,-10.614087,-12.047371,7.896189,5.027693,3.962645,-2.481863,-2.894899,-3.065567,6.920169,5.976134,3.078008,3.257538,-8.265146,drownsy


In [283]:
eeg_data.describe()

Unnamed: 0,t,ED_AF3,ED_F7,ED_F3,ED_FC5,ED_T7,ED_P7,ED_O1,ED_O2,ED_P8,ED_T8,ED_FC6,ED_F4,ED_F8,ED_AF4
count,443972.0,443972.0,443972.0,443972.0,443972.0,443972.0,443972.0,443972.0,443972.0,443972.0,443972.0,443972.0,443972.0,443972.0,443972.0
mean,221985.5,-0.000474,0.000254,0.000264,-0.000489,-0.000301,0.000357,0.000878,0.000812,0.000293,-0.000492,-0.00021,-0.000466,-0.000433,7e-06
std,128163.821194,5.971115,5.789471,9.580238,6.345,5.736325,7.184243,8.862893,10.177508,8.781786,5.496789,6.715437,5.447905,5.644232,10.627613
min,0.0,-333.198537,-342.470203,-345.877259,-435.638306,-331.658436,-86.345335,-133.566758,-224.045032,-272.703568,-322.266576,-636.440961,-332.748491,-333.645798,-220.221984
25%,110992.75,-3.437857,-3.34082,-4.601394,-3.544239,-3.224365,-4.221609,-5.136229,-5.667328,-4.730517,-3.154651,-3.276947,-3.117195,-3.240347,-4.595818
50%,221985.5,-0.02947,-0.001993,0.002495,0.015273,0.01074,0.0486,0.049457,0.069562,-0.009202,-0.021418,-0.018482,-0.024031,0.005189,-0.035168
75%,332978.25,3.400954,3.344064,4.623114,3.545748,3.230481,4.264691,5.198373,5.762395,4.768521,3.117404,3.246473,3.080821,3.229116,4.55518
max,443971.0,221.156883,343.302853,948.443954,141.838762,107.872518,260.156752,329.346813,348.731149,368.233946,107.825538,286.522466,107.653541,109.18423,1123.877264


In [284]:
eeg_data['state'].unique()

array(['focused', 'unfocused', 'drownsy'], dtype=object)

- Separate the data into features and labels:

In [7]:
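# Every column except the label is used as a feature.
# Note: this keeps the running counter column `t` among the features.
X = eeg_data.drop(columns=["state"]).values
y = eeg_data["state"].values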

- Label encoding for the target column (`LabelEncoder` assigns codes in sorted order of the class names):
  - `drownsy` (spelled as in the dataset) -> 0
  - `focused` -> 1
  - `unfocused` -> 2

In [8]:
lb = LabelEncoder()
y_encoded = lb.fit_transform(y)
print(y_encoded)

[1 1 1 ... 0 0 0]
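The mapping listed above can be checked directly from the fitted encoder. The following is a minimal sketch on a toy label array (`labels` and `mapping` are illustrative names, not part of the notebook); it assumes only that the class names match those returned by `eeg_data['state'].unique()`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels using the same three class names as eeg_data['state'].
labels = np.array(["focused", "unfocused", "drownsy", "focused"])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each name is its integer code.
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)  # {'drownsy': 0, 'focused': 1, 'unfocused': 2}
print(codes)    # [1 2 0 1]

# inverse_transform recovers the original string labels.
decoded = encoder.inverse_transform(codes)
```

Keeping the fitted encoder around makes it possible to translate model predictions back into state names later.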


- Scale the features:

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
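
As a sanity check on what the scaling step does, the sketch below compares `StandardScaler` with a manual z-score on synthetic data. The array `X_toy` is made up for illustration and stands in for the EEG feature matrix.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (100 samples, 4 features).
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=3.0, size=(100, 4))

X_std = StandardScaler().fit_transform(X_toy)

# Manual z-score per column; StandardScaler uses the population std (ddof=0).
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_std, X_manual))  # True
```

After scaling, every column has mean 0 and unit variance, so features measured on different scales contribute comparably to scale-sensitive methods.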

# 2. Feature Selection Methods

## 2.1. Feature Selection using Statistical Tests and Machine Learning Models

### 2.1.1. Random Forest Feature Importance

In [288]:
def select_features_with_random_forest(X, y, num_features=10):
    """Select features using random forest classifier"""
    rf = RandomForestClassifier(random_state=42)
    rf.fit(X, y)
    importances = rf.feature_importances_
    indices = np.argsort(importances)[::-1][:num_features]
    return X[:, indices], indices
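
To see how the importance ranking behaves, here is a small usage sketch on synthetic data rather than the EEG recordings. The function body repeats the one above so the example runs on its own; `X_toy` and `y_toy` are hypothetical, with the label depending only on columns 2 and 5.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_features_with_random_forest(X, y, num_features=10):
    """Same logic as the notebook function above."""
    rf = RandomForestClassifier(random_state=42)
    rf.fit(X, y)
    importances = rf.feature_importances_
    # Sort descending by importance and keep the top num_features indices.
    indices = np.argsort(importances)[::-1][:num_features]
    return X[:, indices], indices

# Synthetic data: 8 features, but only columns 2 and 5 determine the label.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 8))
y_toy = (X_toy[:, 2] + X_toy[:, 5] > 0).astype(int)

X_top, top_indices = select_features_with_random_forest(X_toy, y_toy, num_features=2)
print(sorted(top_indices.tolist()))  # [2, 5]
```

The two informative columns receive far more impurity-based importance than the noise columns, so they are the ones selected.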

### 2.1.2. Mutual Information

Mutual Information (MI) measures the dependency between two random variables. It is defined for both continuous and discrete variables and captures any kind of statistical dependency, not only linear relationships.

Mathematical Principle: Let $X$ and $Y$ be two discrete random variables. The mutual information between $X$ and $Y$ is defined as follows (for continuous variables, the sums become integrals over the densities):

  $$
  MI(X; Y) := \sum_{x \in X} \sum_{y \in Y} \mathbb{P}(x, y) \log \left( \frac{\mathbb{P}(x, y)}{\mathbb{P}(x)\mathbb{P}(y)} \right)
  $$
  Where:
  - $\mathbb{P}(x, y)$: Joint probability distribution of $X$ and $Y$.
  - $\mathbb{P}(x)$: Marginal probability distribution of $X$.
  - $\mathbb{P}(y)$: Marginal probability distribution of $Y$.

MI is non-negative: it grows with the strength of the dependency between the variables and is equal to zero if and only if $X$ and $Y$ are independent.
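
The definition can be evaluated by hand for small discrete distributions. The sketch below is only an illustration of the formula; `mutual_information` is a hypothetical helper, not the estimator that `mutual_info_classif` uses internally. It computes MI in nats for an independent and a perfectly dependent pair of binary variables.

```python
import numpy as np

def mutual_information(joint):
    """MI (in nats) of a discrete joint probability table P(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal P(x), shape (n, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal P(y), shape (1, m)
    product = px @ py                      # P(x) * P(y) for every cell
    nonzero = joint > 0                    # cells with P(x, y) = 0 contribute 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / product[nonzero])))

# Independent variables: the joint table is the outer product of the marginals.
independent = np.outer([0.5, 0.5], [0.3, 0.7])
# Perfectly dependent variables: Y always equals X.
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])

mi_independent = mutual_information(independent)  # approximately 0
mi_dependent = mutual_information(dependent)      # log(2), about 0.693 nats
print(mi_independent, mi_dependent)
```

The independent case gives zero because every ratio inside the logarithm equals 1, while the dependent case reaches the entropy of a fair coin.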

In [289]:
def select_features_with_mutual_info(X, y, num_features=10):
    """Select features using mutual information"""
    selector = SelectKBest(score_func=mutual_info_classif, k=num_features)
    X_new = selector.fit_transform(X, y)
    indices = selector.get_support(indices=True)
    return X_new, indices

### 2.1.3. Recursive Feature Elimination (RFE)

Recursive Feature Elimination (RFE) is a feature selection method that recursively removes the least important features based on their importance weights. It requires a base model that can evaluate feature importance, such as linear models (Logistic Regression, Linear SVM) or tree-based models (Random Forest, Decision Trees).

RFE is useful when we want a compact subset of features tailored to a specific model. However, it can be computationally expensive, especially for large datasets with many features, because the model is refit after every elimination step. In addition, the number of features to select is a hyperparameter that needs to be tuned based on model performance.

Procedure:
  1. Train a model (e.g., Logistic Regression, SVM, Random Forest) on the current set of features.
  2. Rank the features by importance and remove the least important one.
  3. Repeat on the remaining features until the desired number of features is reached.
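
The loop described above can be sketched without scikit-learn. This example is an assumption-laden illustration: it uses an ordinary least-squares linear model as the base model and the absolute coefficient as the importance, whereas the notebook uses a random forest inside `RFE`. The name `manual_rfe` and the synthetic data are hypothetical.

```python
import numpy as np

def manual_rfe(X, y, num_features):
    """Drop the weakest feature one at a time until num_features remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > num_features:
        # Step 1: fit the base model on the features still in play.
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # Step 2: the feature with the smallest |coefficient| is removed.
        weakest = int(np.argmin(np.abs(coef)))
        del remaining[weakest]
        # Step 3: the loop repeats with the reduced feature set.
    return remaining

# Synthetic regression target that depends only on columns 0 and 3.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 6))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 3] + 0.1 * rng.normal(size=200)

kept = manual_rfe(X_toy, y_toy, num_features=2)
print(kept)  # [0, 3]
```

Each pass refits the model, which is why the cost grows with the number of features to eliminate.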

In [290]:
def select_features_with_rfe(X, y, num_features=10):
    """Select features using recursive feature elimination"""
    rfe_model = RandomForestClassifier(random_state=42)
    selector = RFE(estimator=rfe_model, n_features_to_select=num_features)
    X_new = selector.fit_transform(X, y)
    indices = selector.get_support(indices=True)
    return X_new, indices

### 2.1.4. Implementing Feature Selection Methods

In [294]:
selected_features = {}

methods = {
    "Random Forest": select_features_with_random_forest,
    "Mutual Info": select_features_with_mutual_info,
    "RFE": select_features_with_rfe,
}

for method_name, method_func in methods.items():
    X_selected, indices = method_func(X_scaled, y_encoded, num_features=8)
    selected_features[method_name] = {
        "features": eeg_data.drop(columns=["state"]).columns[indices].tolist(),
        "indices": indices,
    }

In [295]:
# Display selected features
for method, result in selected_features.items():
    print(f"\n{method} Selected Features:")
    if method == "PCA":
        print(f"Principal Components: {result['principal_components']}")
        print(f"Top Features per Component:")
        for pc, features in result['top_features_per_component'].items():
            print(f"  {pc}: {features}")
    else:
        print(f"Feature Names: {result['features']}")


Random Forest Selected Features:
Feature Names: ['t', 'ED_F8', 'ED_F3', 'ED_P7', 'ED_O2', 'ED_AF3', 'ED_T8', 'ED_T7']

Mutual Info Selected Features:
Feature Names: ['t', 'ED_AF3', 'ED_F3', 'ED_T7', 'ED_O2', 'ED_P8', 'ED_T8', 'ED_F4']

RFE Selected Features:
Feature Names: ['t', 'ED_AF3', 'ED_F3', 'ED_P7', 'ED_O1', 'ED_O2', 'ED_T8', 'ED_F8']
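
One way to compare the three methods is to look at the features they agree on. The sketch below copies the feature lists from the output above into a dictionary (the variable names are illustrative) and intersects them.

```python
# Feature lists copied from the printed output above.
selected = {
    "Random Forest": ['t', 'ED_F8', 'ED_F3', 'ED_P7', 'ED_O2', 'ED_AF3', 'ED_T8', 'ED_T7'],
    "Mutual Info": ['t', 'ED_AF3', 'ED_F3', 'ED_T7', 'ED_O2', 'ED_P8', 'ED_T8', 'ED_F4'],
    "RFE": ['t', 'ED_AF3', 'ED_F3', 'ED_P7', 'ED_O1', 'ED_O2', 'ED_T8', 'ED_F8'],
}

# Features chosen by every method.
common = sorted(set.intersection(*(set(names) for names in selected.values())))
print(common)  # ['ED_AF3', 'ED_F3', 'ED_O2', 'ED_T8', 't']
```

Five of the eight selected features are shared by all three methods, including the column `t`.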


In [None]:
def convert_ndarray_to_list(data):
    """Convert all ndarrays in a nested structure to lists"""
    if isinstance(data, np.ndarray):
        return data.tolist()
    elif isinstance(data, dict):
        return {key: convert_ndarray_to_list(value) for key, value in data.items()}
    elif isinstance(data, list):
        return [convert_ndarray_to_list(item) for item in data]
    else:
        return data

# Convert all ndarrays in selected_features to lists
selected_features_serializable = convert_ndarray_to_list(selected_features)

# File paths for saving results
methods_output_path = "/home/letruongzzio/PRML-MidTerm-Project/model_implementation/LDA/feature_selection_files/selected_features_methods.json"

# Save other methods' results to another JSON file
with open(methods_output_path, "w") as methods_file:
    json.dump(selected_features_serializable, methods_file, indent=4)
print(f"Selected features for other methods saved to {methods_output_path}")

Selected features for other methods saved to selected_features_methods.json
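
The conversion step is needed because the standard `json` module cannot serialize NumPy arrays. The sketch below shows this on a toy result dictionary and writes to a temporary directory instead of the project path; the dictionary contents are illustrative only.

```python
import json
import os
import tempfile

import numpy as np

# Toy result in the same shape as one entry of selected_features.
result = {"RFE": {"features": ["t", "ED_AF3"], "indices": np.array([0, 1])}}

# json cannot handle ndarray values directly.
try:
    json.dumps(result)
    raised = False
except TypeError:
    raised = True

# Converting the arrays to plain lists makes the structure serializable.
serializable = {
    method: {"features": entry["features"], "indices": entry["indices"].tolist()}
    for method, entry in result.items()
}

# Round trip through a temporary file to confirm nothing is lost.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "selected_features_methods.json")
    with open(path, "w") as f:
        json.dump(serializable, f, indent=4)
    with open(path) as f:
        loaded = json.load(f)

print(raised, loaded == serializable)  # True True
```

Reading the file back with `json.load` returns plain Python lists, which can be passed to `np.array` if index arrays are needed again.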
