This notebook exists to experiment with different methods for performing feature selection. Methods include:
- Statistical Measures
- PCA
- Wrapper Methods
- Embedded Methods

In [1]:
%run "Parameter_Estimation.ipynb" #allowing access to parameters

100%|████████████████████████████████████████████████████████████████████████████████| 549/549 [00:27<00:00, 19.74it/s]
100%|████████████████████████████████████████████████████████████████████████████████| 202/202 [00:02<00:00, 88.80it/s]


Unhealthy lf_power: mean:0.002246771347577131, std: 0.005716348092590171
Healthy lf_power: mean:0.001986485016994376, std:0.004750847001099235
Unhealthy hf_power: mean:0.004085063574954664, std: 0.00801349983605936
Healthy hf_power: mean:0.0029768474608233256, std:0.006931241434845449
Unhealthy ratio of power bands: mean:2.1906510706529017, std: 4.5578331544128385
Healthy ratio of power bands: mean:3.6707058173199845, std:6.545749052455982


In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.svm import SVC

In [3]:
health_state = allowed_patients.get_diagnoses()

encoded_health_state = [label == 'Unhealthy' for label in health_state]  # True for unhealthy patients

## Parameter Selection

### Correlation
#### Pearson correlation coefficient (PMCC)


In [5]:
from scipy.stats import pearsonr
import itertools

for key1, key2 in itertools.combinations(params.keys(), 2):
    corr, p_value = pearsonr(params[key1], params[key2])
    if p_value < 0.05:
        print(f"parameter {key1} and parameter {key2} are significantly correlated, p = {p_value}, corr = {corr}")


parameter rr_mean and parameter rr_std are significantly correlated, p = 4.062395325115764e-08, corr = 0.3743340369120984
parameter rr_mean and parameter RMSSD are significantly correlated, p = 2.2987696809200955e-06, corr = 0.32539154949177745
parameter rr_mean and parameter pNN50 are significantly correlated, p = 0.0007941729849659782, corr = 0.2342059568006114
parameter rr_mean and parameter std are significantly correlated, p = 5.435433259682619e-10, corr = -0.4189936627892915
parameter rr_mean and parameter kurtosis are significantly correlated, p = 9.662209716789082e-10, corr = 0.4134062412423312
parameter rr_mean and parameter shannon_en are significantly correlated, p = 7.227845542211822e-05, corr = -0.275519796590391
parameter rr_std and parameter RMSSD are significantly correlated, p = 2.4389447379852025e-120, corr = 0.9666972499757714
parameter rr_std and parameter pNN50 are significantly correlated, p = 5.386345605631653e-40, corr = 0.7644747575824683
parameter rr_std and p
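
Statistical significance alone does not indicate redundancy: several of the pairs above are significant with a correlation below 0.3, while rr_std and RMSSD reach roughly 0.97. A minimal sketch of flagging only strongly correlated pairs follows. The cut-off of 0.9 is an arbitrary assumption, and `demo_params` with its keys is hypothetical stand-in data, not the notebook's `params`.

```python
import itertools
import numpy as np

# Hypothetical stand-in for the params dictionary (synthetic data)
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo_params = {
    "feat_a": base,
    "feat_b": base + rng.normal(scale=0.1, size=200),  # near-duplicate of feat_a
    "feat_c": rng.normal(size=200),                    # independent of the others
}

threshold = 0.9  # assumed cut-off for treating two parameters as redundant

redundant_pairs = []
for key1, key2 in itertools.combinations(demo_params, 2):
    # Pearson correlation coefficient between the two parameters
    r = np.corrcoef(demo_params[key1], demo_params[key2])[0, 1]
    if abs(r) >= threshold:
        redundant_pairs.append((key1, key2, r))

for key1, key2, r in redundant_pairs:
    print(f"{key1} and {key2} look redundant, corr = {r:.3f}")
```

Only one parameter from each flagged pair would then need to be carried into the model.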

#### Wilcoxon
The Wilcoxon signed-rank test compares two related samples, or repeated measurements of the same sample under different conditions, by testing whether the paired differences are centred on zero. It should be used to test whether including a parameter makes a significant difference to the model's performance.

Compare the above results with published papers.
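
A sketch of how the Wilcoxon signed-rank test might be applied to this question, assuming the model has been scored on the same cross-validation folds with and without the parameter of interest. The fold scores below are synthetic placeholders, not results from this notebook.

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic per-fold accuracies (placeholders): the same 10 folds scored
# once without the candidate parameter and once with it
rng = np.random.default_rng(0)
scores_without = rng.uniform(0.60, 0.70, size=10)
scores_with = scores_without + rng.normal(loc=0.02, scale=0.01, size=10)

# Paired test: are the fold-wise differences centred on zero?
statistic, p_value = wilcoxon(scores_with, scores_without)

print(f"Wilcoxon statistic = {statistic}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The parameter changes the model score significantly")
```

The test is paired, so both score lists must come from identical train/test splits.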

## PCA
- loses the direct link to the original features, so the components are less intuitive to interpret
- worth experimenting with anyway

In [6]:
# Initialize the array
X = np.zeros((no_patients, 4))#need no. samples as rows, no. features as columns for machine learning analysis

# Populate the array with values from the dictionary
X[:, 0] = params['rr_mean']
X[:, 1] = params['kurtosis']
X[:, 2] = params['RMSSD']
X[:, 3] = params['shannon_en']
    
#standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

#set desired number of principal components
num_components = 2

#using sklearn PCA
pca = PCA(n_components=num_components)
X_pca = pca.fit_transform(X_scaled)
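
Some of the interpretability lost by PCA can be recovered by inspecting the fitted object. The sketch below uses synthetic data and hypothetical names (`X_demo`, `pca_demo`) to show how the retained variance and the per-feature loadings could be read off; the same attributes exist on the `pca` object above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 4-feature matrix (placeholder data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
X_demo[:, 1] += 0.8 * X_demo[:, 0]  # make two features correlated

X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(X_demo_scaled)

# Fraction of total variance captured by each component, and cumulatively
explained = pca_demo.explained_variance_ratio_
cumulative = np.cumsum(explained)

# Each row is a component, each column the weight of an original feature
loadings = pca_demo.components_

print("Explained variance ratio:", explained)
print("Cumulative:", cumulative)
print("Loadings:\n", loadings)
```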

In [7]:
#using principal components to do ML

#splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_pca, health_state, test_size=0.3)

#init and train model, using radial basis functions
svm_classifier = SVC(kernel='rbf', gamma='scale')  # gamma='scale' sets gamma = 1 / (n_features * X.var())
svm_classifier.fit(X_train, y_train)

#predictions
y_pred = svm_classifier.predict(X_test)

#evaluating accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(y_pred)
print("Accuracy:", accuracy)

['Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy'
 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy'
 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy'
 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy'
 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy'
 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy'
 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy'
 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy'
 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy'
 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy' 'Unhealthy'
 'Unhealthy']
Accuracy: 0.6885245901639344
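
Every one of the 61 predictions above is 'Unhealthy', so the accuracy of 0.6885 (42/61) is simply the share of unhealthy patients in the test set. A small sketch of metrics that expose this, using labels rebuilt from the counts implied by that output (42 Unhealthy, 19 Healthy) rather than the real `y_test`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Labels rebuilt from the counts implied by the output above (not the real y_test)
y_true_demo = np.array(["Unhealthy"] * 42 + ["Healthy"] * 19)
# A classifier that always predicts the majority class
y_pred_demo = np.full(y_true_demo.shape, "Unhealthy")

accuracy_demo = accuracy_score(y_true_demo, y_pred_demo)
# Mean of per-class recall: 1.0 for Unhealthy, 0.0 for Healthy
balanced_demo = balanced_accuracy_score(y_true_demo, y_pred_demo)
# Rows are true labels, columns are predicted labels
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=["Healthy", "Unhealthy"])

print("Accuracy:", accuracy_demo)
print("Balanced accuracy:", balanced_demo)
print(cm_demo)
```

Balanced accuracy stays at 0.5 for a majority-class predictor, which makes it a more informative score than plain accuracy on this imbalanced data.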


## Wrapper Methods:

Wrapper methods select features by repeatedly fitting the model on candidate feature subsets and using the fitted model to decide which features to keep.

 - Forward Selection: Features are sequentially added to the model, starting with an empty set and adding the feature that improves model performance the most at each step.
 - Backward Elimination: Features are sequentially removed from the model, starting with the full set of features and removing the feature that decreases model performance the least at each step.
 - Recursive Feature Elimination (RFE): Features are recursively pruned based on the importance assigned to them by the model. Less important features are eliminated iteratively until the desired number of features is reached.

### RFE

In [8]:
from sklearn.feature_selection import RFE

# initializing parameter array
X = np.zeros((no_patients, 3))#need no. samples as rows, no. features as columns for machine learning analysis
X[:, 0] = params['rr_mean']
X[:, 1] = params['kurtosis']
X[:, 2] = params['shannon_en']
#X[:, 3] = params['pNN50']

#splitting data into test and train sets
X_train, X_test, y_train, y_test = train_test_split(X, health_state, test_size=0.3)

#initialise SVM -- RFE needs an estimator exposing coef_ or feature_importances_; SVC only provides coef_ with a linear kernel
svm = SVC(kernel="linear")

#initialize RFE with the SVM model and desired number of features
rfe = RFE(estimator=svm, n_features_to_select=1)

rfe.fit(X_train, y_train)

In [9]:
print("Selected features:", rfe.support_)
print("Feature ranking:", rfe.ranking_)

Selected features: [False False  True]
Feature ranking: [3 2 1]
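
The boolean mask and ranking are easier to read once mapped back to feature names. A short sketch, using the column order from the cell above and the arrays copied from this output:

```python
import numpy as np

# Column order used when building X for RFE
feature_names = np.array(["rr_mean", "kurtosis", "shannon_en"])

# Values copied from the RFE output above
support = np.array([False, False, True])
ranking = np.array([3, 2, 1])

# Features kept by RFE, and all features ordered from most to least important
selected = feature_names[support].tolist()
ordered = feature_names[np.argsort(ranking)].tolist()

print("Selected:", selected)
print("Ranked best to worst:", ordered)
```

With a fitted selector, `rfe.support_` and `rfe.ranking_` would take the place of the hard-coded arrays.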


In [10]:
# transform the dataset to include only the selected features
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)

# train the SVM on the selected features
svm.fit(X_train_rfe, y_train)

# Make predictions on the test set
y_pred = svm.predict(X_test_rfe)

# Evaluate the model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with selected features:", accuracy)

Accuracy with selected features: 0.6721311475409836


### Forward Selection/Backward Elimination

This may be slower than RFE, but it does not require the estimator to expose coefficients, so it is not restricted to a linear kernel.

In [11]:
from sklearn.feature_selection import SequentialFeatureSelector 

svm_rbf = SVC(kernel='rbf')

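# tol is the minimum score gain required to add another feature (used when n_features_to_select='auto');
# accuracy lies in [0, 1], so tol=5 stops the search after the first feature
SFS_forward = SequentialFeatureSelector(estimator=svm_rbf, tol=5)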

SFS_forward.fit(X_train, y_train)

SFS_forward.get_support()

array([ True, False, False])
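
One alternative to steering the search through `tol` is to fix the number of features explicitly and to scale the data inside a pipeline, so that each cross-validation fold is standardized on its own training part. A minimal sketch on synthetic data; the dataset, the choice of two features and the `balanced_accuracy` scorer are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data with 4 features, 2 of them informative
X_demo, y_demo = make_classification(
    n_samples=150, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# Scaling happens inside the pipeline, so it is refit on every CV training fold
model_demo = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

sfs_demo = SequentialFeatureSelector(
    model_demo,
    n_features_to_select=2,       # assumed target size
    direction="forward",
    scoring="balanced_accuracy",  # less sensitive to class imbalance than accuracy
    cv=5,
)
sfs_demo.fit(X_demo, y_demo)

mask_demo = sfs_demo.get_support()
print("Selected feature mask:", mask_demo)
```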

In [12]:
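# a negative tol lets backward elimination continue while the score drops by less than |tol|;
# accuracy lies in [0, 1], so tol=-5 always eliminates down to a single feature
SFS_backward = SequentialFeatureSelector(estimator=svm_rbf, tol=-5, direction='backward')

SFS_backward.fit(X_train, y_train)

SFS_backward.get_support()

In [13]:
# The recorded run printed SFS_forward.get_support() in both cells, so the backward result
# has not actually been observed yet and needs a re-run of In [12].
# Forward selection keeps only rr_mean, whereas RFE ranked shannon_en first. The two are not
# directly comparable: RFE ranks features by linear-SVM coefficients on unscaled data, while
# SequentialFeatureSelector uses the cross-validated score of an RBF SVM.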

## Embedded Methods:

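Embedded methods perform feature selection as part of fitting the model itself.

- Regularization: Techniques like LASSO (L1 regularization) and Ridge (L2 regularization) penalize the magnitude of feature coefficients. LASSO can drive the coefficients of less important features to exactly zero, which removes them from the model; Ridge only shrinks coefficients towards zero, so it down-weights features rather than selecting them.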

In [14]:
#LASSO/Ridge Regression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

#changing health_state to binary for use in regression
binary_health_state = [1 if label == 'Unhealthy' else 0 for label in health_state]

#splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, binary_health_state, test_size=0.3)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

#create and fit regression models
lasso_alpha = 0.1  # Regularization strength (hyperparameter)
lasso = Lasso(alpha=lasso_alpha)
lasso.fit(X_train_scaled, y_train)

ridge_alpha = 0.1 
ridge = Ridge(alpha=ridge_alpha)
ridge.fit(X_train_scaled, y_train)

#use the trained models for prediction
X_test_scaled = scaler.transform(X_test)  # reuse the scaling fitted on the training set; do not refit on test data

y_pred_ridge = ridge.predict(X_test_scaled)
y_pred_lasso = lasso.predict(X_test_scaled)
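
For feature selection the fitted coefficients matter more than the predictions. The sketch below, on synthetic data with hypothetical names (`X_demo`, `lasso_demo`, `ridge_demo`), shows how the coefficients could be inspected: LASSO zeroes out the uninformative columns, while Ridge keeps every coefficient non-zero. The alpha values are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of five features influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

X_demo_scaled = StandardScaler().fit_transform(X_demo)

lasso_demo = Lasso(alpha=0.5).fit(X_demo_scaled, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo_scaled, y_demo)

# Indices of features whose LASSO coefficient is non-zero, i.e. the selected ones
kept_by_lasso = np.flatnonzero(lasso_demo.coef_)

print("LASSO coefficients:", lasso_demo.coef_)
print("Ridge coefficients:", ridge_demo.coef_)
print("Features kept by LASSO:", kept_by_lasso)
```

On the notebook's models the same check would read `lasso.coef_` and `ridge.coef_`.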
