feature-selection

Introduction

This project compares fifteen feature selection methods by measuring the classification accuracy that each method yields across five classification algorithms. Ten publicly available datasets were used for the comparison.

Feature Selection Methods Used:

Pair-wise Correlation
Regularized Self Representation
Variance Threshold
Logistic Regression based selection
Random Forest (Gini importance)
Boruta Algorithm
LASSO Algorithm
Extra Tree Classifier
Mutual Information Classifier
Chi-Square Test
Recursive Feature Elimination with RF
Correlation
Cosine Similarity and Standard deviation with Exponent
Laplacian Score
Iterative Laplacian Score

Classification Algorithms Used:

Decision Trees
Logistic Regression
Random Forest
KNN
Naive Bayes

Datasets Used:

Iris
Breast Cancer
Pima Indians Diabetes
Cirrhosis Prediction
Parkinson's Disease
Heart Disease
Sonar
Stroke Prediction
Wine Quality
Abalone

Results:

Two screenshots of the obtained results are given below.

k is the number of top-ranked features retained. For example, k=2 means that the two best features chosen by each feature selection method were used to perform classification, and accuracy was calculated on that reduced feature set.
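
The idea of keeping the k best features can be sketched as follows. This is a minimal illustration and not the code in this repository: the helper names `top_k_features` and `keep_columns` are hypothetical, the toy data is made up, and population variance stands in for whichever scoring rule a given feature selection method actually uses.

```python
from statistics import pvariance


def top_k_features(X, k, score=pvariance):
    """Return the indices of the k highest-scoring columns, best first.

    `score` maps one feature column to a number; variance is only a
    stand-in here (assumption), any per-feature score could be passed.
    """
    n_features = len(X[0])
    if not 1 <= k <= n_features:
        raise ValueError("k must be between 1 and the number of features")
    columns = list(zip(*X))  # transpose rows into feature columns
    scores = [score(col) for col in columns]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def keep_columns(X, indices):
    """Build a reduced dataset containing only the selected columns."""
    return [[row[j] for j in indices] for row in X]


# Toy data: 4 samples, 4 features (column 0 is constant, so it scores 0).
X = [
    [1.0, 10.0, 0.5, 3.0],
    [1.0, 20.0, 0.4, 1.0],
    [1.0, 30.0, 0.6, 2.0],
    [1.0, 40.0, 0.5, 4.0],
]

selected = top_k_features(X, k=2)
X_reduced = keep_columns(X, selected)
print(selected)   # [1, 3]
print(X_reduced)  # [[10.0, 3.0], [20.0, 1.0], [30.0, 2.0], [40.0, 4.0]]
```

The reduced matrix is what a classifier would then be trained and evaluated on for that value of k.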

Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives.
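
As a small worked example of the accuracy formula, the sketch below counts the four outcomes for a binary problem and applies the formula directly. The function names and the label vectors are illustrative assumptions, not taken from this repository. For multi-class datasets the same quantity is simply the number of correct predictions divided by the total number of predictions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)            # 3 3 1 1
print(accuracy(tp, tn, fp, fn))  # 0.75
```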
