In [40]:
from pyGeneticAlgorithm.discrete_solver import discreteGeneticSolver
import pandas as pd
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.model_selection import train_test_split
import numpy as np
import warnings
warnings.filterwarnings("ignore")

# Using GA for feature selection

Feature selection in machine learning refers to selecting a subset of features to build a model. The idea is to remove irrelevant or redundant features, ending with a reduced set of features that can:

* Improve model performance by alleviating the curse of dimensionality and reducing overfitting
* Reduce the cost of data acquisition. In genomics research, we can probe the entire human genome for some biomedical application, such as classifying cancer patients based on their genome. If we can reduce the number of genes probed (from the entirety of the human genome to a few hundred) we can reduce the costs of data acquisition.
* Reduce data storage and processing time: even if the data acquisition cost is the same after feature selection, we can still benefit from lower storage costs and less processing power
* Simplify the model and improve interpretability: I would rather interpret a 10-variable model than a 100-variable model, how about you?

# Feature selection in the arcene data

The arcene dataset, described at https://www.openml.org/d/1458, is a mass spectrometry dataset in which we try to classify cancer and non-cancer patients from protein abundance values. I downloaded the CSV file, which was called php8tg99.csv, and renamed it to arcene_data.csv. There are 200 data points and 10000 variables in the data. We split the data 60% train, 20% validation, 20% test.
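
The 60/20/20 proportions follow from the two `train_test_split` calls used in this notebook, which pass `test_size=0.2` and then `test_size=0.25` on what remains. The arithmetic can be checked with a small sketch (the variable names here are only illustrative):

```python
n_samples = 200  # number of data points in the arcene data

# First split: hold out 20% of all samples as the test set.
n_test = round(n_samples * 0.2)
n_train_valid = n_samples - n_test

# Second split: 25% of the remaining samples go to validation.
n_val = round(n_train_valid * 0.25)
n_train = n_train_valid - n_val

print(n_train, n_val, n_test)  # 120 40 40
print(n_train / n_samples, n_val / n_samples, n_test / n_samples)  # 0.6 0.2 0.2
```

So a validation fraction of 0.25 applied to the remaining 80% gives 20% of the full data.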

Fitness function: since this is a classification problem, classification accuracy seems the logical choice. We could also try to optimize for sparseness. We fix the test set ahead of time and don't use it until the very end: no peeking at the test set!
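
Optimizing for sparseness is not done in this notebook, but one possible way to do it is to subtract a penalty proportional to the fraction of selected features. The sketch below is only an illustration: the function name `penalized_fitness` and the weight `penalty` are hypothetical, and the value 0.1 is an arbitrary assumption, not a tuned setting.

```python
def penalized_fitness(accuracy, features, penalty=0.1):
    """Trade validation accuracy against the fraction of selected features.

    `penalty` is a hypothetical weight, not a value used in this notebook.
    """
    n_selected = sum(bool(f) for f in features)
    fraction_selected = n_selected / len(features)
    return accuracy - penalty * fraction_selected


# Two masks with the same accuracy: the sparser one scores higher.
dense = [True, True, True, True]
sparse = [True, False, False, False]
print(penalized_fitness(0.8, dense))
print(penalized_fitness(0.8, sparse))
```

With a penalty like this, two subsets with equal validation accuracy are ranked by size, so the search is nudged toward smaller subsets.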

In order to find the feature subset which generalizes best to unseen data, we split the remaining data into train and validation, and calculate our fitness on the validation set. To avoid fitting the variable selection to one subset of the data, we re-split training and validation at every fitness evaluation.

We will try the default QDA classification algorithm. We should of course try to search for hyperparameters and other algorithms in a real setting, but that goes beyond the scope of this notebook.

Once we find the best subset, we train the model on the whole training set (train plus validation) using only those features and evaluate it on the test set.

In [42]:
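# Load the arcene data: the last column is used as the class label, the rest as features
data = pd.read_csv("./arcene_data.csv", sep=",")
X = data.values[:, :-1]
Y = data.values[:, -1]

# Hold out 20% as the test set; it is not touched until the final evaluation
X_train_valid, X_test, y_train_valid, y_test = train_test_split(X, Y, test_size=0.2)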

In [80]:
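def fitness(features):
    # Boolean mask over the columns: True means the feature is selected
    features = np.array(features)
    # Draw a fresh train/validation split on every call (25% of the remaining data for validation)
    X_train, X_val, y_train, y_val = train_test_split(X_train_valid, y_train_valid, test_size=0.25)
    x_train = X_train[:, features]
    x_val = X_val[:, features]
    cls = QDA()
    cls.fit(x_train, y_train.ravel())

    # Fitness is the classification accuracy on the validation split
    return np.mean(cls.predict(x_val) == y_val.ravel())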
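
The `features` argument above is used to index columns, which appears to rely on NumPy boolean-mask indexing (the solver is given `[False, True]` as the possible values per position). A minimal sketch of that indexing on a toy array, with made-up names `X_demo` and `mask`:

```python
import numpy as np

# Toy data: 3 samples, 4 features.
X_demo = np.arange(12).reshape(3, 4)

# Boolean mask: keep features 0 and 2, drop features 1 and 3.
mask = np.array([True, False, True, False])
X_selected = X_demo[:, mask]
print(X_selected.shape)  # (3, 2)
print(int(mask.sum()))   # 2 selected features

# Pitfall: an integer array of 0/1 is NOT a mask; it picks columns by index.
X_by_index = X_demo[:, np.array([1, 0, 1, 0])]
print(X_by_index.shape)  # (3, 4)
```

This is also why `np.sum` on a boolean mask counts the selected features.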

In [81]:
dGS = discreteGeneticSolver(0.01,"midpoint",[False,True],X_train_valid.shape[1],fitness,pop=200)
res = np.array(dGS.solve(25))

0.775
0.8
0.75
0.8
0.825
0.825
0.8
0.775
0.85
0.775
0.75
0.8
0.775
0.825
0.8
0.85
0.8
0.85
0.8
0.825
0.775
0.8
0.8
0.8
0.825


In [82]:
print(f"Best subset had {np.sum(res)} features")

Best subset had 4909 features


In [84]:
# Refit on the full train+validation data using only the GA-selected features
cls = QDA()
cls.fit(X_train_valid[:, res], y_train_valid.ravel())
test_pred = cls.predict(X_test[:, res])
print(f"Test accuracy using GA features: {np.mean(test_pred == y_test.ravel())}")

Test accuracy using GA features: 0.625


In [85]:
# Baseline: the same classifier trained on all features
cls = QDA()
cls.fit(X_train_valid, y_train_valid.ravel())
test_pred = cls.predict(X_test)
print(f"Test accuracy using all features: {np.mean(test_pred == y_test.ravel())}")

Test accuracy using all features: 0.5
