In this project, we compare fifteen feature selection methods by the accuracy each one achieves with five classification algorithms on ten publicly available datasets.

The feature selection methods compared are:

- Pair-wise Correlation
- Regularized Self Representation
- Variance Threshold
- Logistic Regression based selection
- Random Forest (Gini importance)
- Boruta Algorithm
- LASSO Algorithm
- Extra Tree Classifier
- Mutual Information Classifier
- Chi-Square Test
- Recursive Feature Elimination with RF
- Correlation
- Cosine Similarity and Standard deviation with Exponent
- Laplacian Score
- Iterative Laplacian Score

The classification algorithms are:

- Decision Trees
- Logistic Regression
- Random Forest
- KNN
- Naive Bayes

The datasets are:

- Iris
- Breast Cancer
- Pima Indians Diabetes
- Cirrhosis Prediction
- Parkinson's Disease
- Heart Disease
- Sonar
- Stroke Prediction
- Wine Quality
- Abalone

Two screenshots of the obtained results are given below.
Here, k is the number of top-ranked features retained. For example, k=2 means that the two best features ranked by each feature selection method were used for classification, and accuracy was computed on that reduced feature set.
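
The idea of keeping the k best features can be sketched in a few lines of plain Python. This is only an illustration, not the project's own code: the helper `top_k_features`, the toy matrix `X`, and the use of per-feature variance as the ranking score are all assumptions made for the example. Any of the methods listed above could supply the scores instead.

```python
from statistics import pvariance


def top_k_features(rows, scores, k):
    """Keep the k columns with the highest scores.

    Hypothetical helper: ties are broken by column order, and the
    kept column indices are returned in ascending order.
    """
    ranked = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    keep = sorted(ranked[:k])
    reduced = [[row[j] for j in keep] for row in rows]
    return keep, reduced


# Toy data (made up for illustration): 4 samples, 3 features.
X = [
    [1.0, 10.0, 0.5],
    [1.1, 20.0, 0.5],
    [0.9, 30.0, 0.5],
    [1.0, 40.0, 0.5],
]

# Stand-in scoring: population variance of each column.
# The third column is constant, so its score is zero.
scores = [pvariance(column) for column in zip(*X)]

keep, X_reduced = top_k_features(X, scores, k=2)
print(keep)       # indices of the retained features
print(X_reduced)  # samples restricted to those features
```

The reduced matrix would then be passed to each classifier, and the accuracy recorded for that value of k.
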
Accuracy = (TP + TN)/(TP + TN + FP + FN), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.
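
As a worked example of the accuracy formula, the sketch below counts the four outcomes for a binary problem and applies the formula. The function names `confusion_counts` and `accuracy` and the label vectors are hypothetical, chosen only for illustration. For multiclass datasets, accuracy is simply the fraction of correctly predicted samples.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)            # 3 3 1 1
print(accuracy(tp, tn, fp, fn))  # 0.75
```
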