A machine learning research project that implements a genetic algorithm-based feature selection method combined with probability-based (PO) statistical analysis for classification tasks on imbalanced datasets.
This project addresses the challenge of feature selection in imbalanced binary classification problems using a hybrid approach that combines:
- PO (Probability) Statistic - A novel feature importance measure based on true negative rate (TNR) and false negative rate (FNR)
- Genetic Algorithm (GA) - Evolutionary optimization for selecting optimal feature subsets
- HEOM Distance - Heterogeneous Euclidean-Overlap Metric for handling mixed data types
- Cross-Validation - Robust fitness evaluation using stratified k-fold cross-validation
- Automated Dataset Processing: Handles 44+ imbalanced datasets in KEEL format
- Parallel Processing: Multi-threaded execution for efficient computation
- Robust Error Handling: Timeout mechanisms and graceful failure recovery
- Mixed Data Type Support: Handles both numerical and categorical features
- Performance Optimization: Adaptive genetic algorithm parameters and efficient distance calculations
Computes feature importance based on classification performance metrics:
- Uses multiple k-values (1, 3, 5, 7) for robust estimation
- Combines TNR and FNR using power functions
- Normalizes scores to create probability distributions for feature selection
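The exact formula for combining TNR and FNR is not stated here, so the following is only a minimal sketch under a loudly stated assumption: the score for one k is taken to be `tnr**alpha * (1 - fnr)**beta`. The function names `po_score` and `feature_probabilities`, and the exponents `alpha` and `beta`, are hypothetical and are not taken from the project code. The sketch averages the per-k scores for each feature and normalizes them so they sum to 1.

```python
def po_score(tnr, fnr, alpha=1.0, beta=1.0):
    """Hypothetical combination: reward a high TNR and a low FNR via powers."""
    return (tnr ** alpha) * ((1.0 - fnr) ** beta)


def feature_probabilities(rates_per_feature, alpha=1.0, beta=1.0):
    """Turn per-feature (TNR, FNR) pairs into a probability distribution.

    rates_per_feature[i] holds one (tnr, fnr) pair per k value
    (for example k = 1, 3, 5, 7) measured for feature i.
    """
    raw = []
    for rates in rates_per_feature:
        scores = [po_score(t, f, alpha, beta) for t, f in rates]
        raw.append(sum(scores) / len(scores))  # average over the k values
    total = sum(raw)
    if total == 0:
        # No feature carries any signal: fall back to a uniform distribution.
        return [1.0 / len(raw)] * len(raw)
    return [s / total for s in raw]


# Two features, each evaluated at four k values (made-up rates).
rates = [
    [(0.9, 0.1)] * 4,  # informative feature
    [(0.5, 0.5)] * 4,  # weak feature
]
probs = feature_probabilities(rates)
```

A distribution of this kind could then bias which features are switched on when the initial GA population is sampled.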
- Population Size: Adaptive based on dataset size (20-50 individuals)
- Crossover: Two-point crossover with 80% probability
- Mutation: Adaptive mutation rate (starts at 5%, decreases with improvement)
- Selection: Tournament selection with elitism
- Termination: Early stopping based on convergence or performance threshold
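The operators listed above can be sketched in plain Python as follows. This is an illustration rather than the project's implementation: the tournament size of 3, the 0.9 decay factor, the 1% mutation floor, and the fixed population size are assumptions, and `evolve` and its helpers are hypothetical names. Early stopping is omitted for brevity.

```python
import random


def two_point_crossover(a, b, rng):
    """Swap the segment between two cut points of two parents."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def tournament_select(pop, scores, size, rng):
    """Return the fittest of `size` randomly drawn individuals."""
    contenders = rng.sample(range(len(pop)), size)
    return pop[max(contenders, key=scores.__getitem__)]


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(fitness_fn, n_features, pop_size=20, generations=50,
           crossover_prob=0.8, mutation_rate=0.05, seed=42):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scores = [fitness_fn(ind) for ind in pop]
        best_idx = max(range(pop_size), key=scores.__getitem__)
        # Assumed adaptive rule: shrink the mutation rate after an improvement.
        if history and scores[best_idx] > history[-1]:
            mutation_rate = max(0.01, mutation_rate * 0.9)
        history.append(scores[best_idx])
        next_pop = [pop[best_idx][:]]  # elitism: the best survives unchanged
        while len(next_pop) < pop_size:
            p1 = tournament_select(pop, scores, 3, rng)
            p2 = tournament_select(pop, scores, 3, rng)
            if rng.random() < crossover_prob:
                c1, c2 = two_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            next_pop.append(mutate(c1, mutation_rate, rng))
            if len(next_pop) < pop_size:
                next_pop.append(mutate(c2, mutation_rate, rng))
        pop = next_pop
    scores = [fitness_fn(ind) for ind in pop]
    best_idx = max(range(pop_size), key=scores.__getitem__)
    history.append(scores[best_idx])
    return pop[best_idx], history


# Toy fitness: number of positions that match a target feature mask.
target = [1, 0, 1, 1, 0, 0, 1, 0]


def toy_fitness(mask):
    return sum(int(m == t) for m, t in zip(mask, target))


best_mask, best_history = evolve(toy_fitness, n_features=len(target))
```

Because the elite individual is copied into each new generation unchanged, the best fitness recorded in `best_history` never decreases.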
Handles heterogeneous data by:
- Normalized Euclidean distance for numerical features
- Overlap distance for categorical features
- Precomputed ranges for efficiency
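A minimal sketch of HEOM as it is commonly defined: numerical differences are divided by the feature's range, categorical features contribute 0 when equal and 1 otherwise, and the per-feature distances are combined as a Euclidean norm. The helper names `compute_ranges` and `heom` are illustrative, and treating `None` as a missing value with distance 1 is an assumption about how missing data would be encoded.

```python
import math


def compute_ranges(rows, numeric_idx):
    """Precompute max - min for every numerical column index."""
    ranges = {}
    for j in numeric_idx:
        values = [row[j] for row in rows if row[j] is not None]
        ranges[j] = (max(values) - min(values)) if values else 0.0
    return ranges


def heom(x, y, ranges):
    """HEOM distance; columns not present in `ranges` are categorical."""
    total = 0.0
    for j, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:
            d = 1.0  # missing value: maximal per-feature distance
        elif j in ranges:
            # Range-normalized absolute difference for numerical features.
            d = abs(a - b) / ranges[j] if ranges[j] > 0 else 0.0
        else:
            d = 0.0 if a == b else 1.0  # overlap metric for categories
        total += d * d
    return math.sqrt(total)


rows = [(1.0, "a"), (3.0, "b"), (5.0, "a")]
ranges = compute_ranges(rows, numeric_idx=[0])
d01 = heom(rows[0], rows[1], ranges)  # numeric 0.5, categorical 1
d02 = heom(rows[0], rows[2], ranges)  # numeric 1.0, categorical 0
```

Computing the ranges once up front avoids rescanning the dataset for every pairwise distance.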
The system processes imbalanced binary classification datasets from the KEEL repository, including:
- Abalone (age prediction)
- Cleveland (heart disease)
- Ecoli (protein localization)
- Glass (glass type identification)
- LED7digit (digit recognition)
- Page-blocks (document layout)
- Shuttle (space shuttle status)
- Vowel (vowel recognition)
- Yeast (protein function prediction)
- Clone the repository:
    git clone https://github.com/ItsTatsuya/Evolution_FeatureExtraction.git
    cd Evolution_FeatureExtraction
- Install dependencies:
    pip install -r requirements.txt
- Ensure your datasets are in the dataset/ directory in KEEL format
- Run the main script:
    python main.py

The program will:
- Process all datasets in parallel
- Extract and parse KEEL .dat files
- Apply feature selection using PO statistics and GA
- Evaluate performance using 1-NN classification with HEOM distance
- Output AUC scores for each dataset
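The evaluation step can be illustrated with a short sketch. It assumes that AUC is computed from hard 1-NN predictions, in which case the ROC curve has a single operating point and the area reduces to the mean of TPR and TNR. Whether the project computes AUC this way is not stated here. `predict_1nn` and `auc_from_labels` are hypothetical names, and the distance function is passed in so that an HEOM implementation can be plugged in.

```python
def predict_1nn(train_X, train_y, x, distance):
    """Return the label of the training point closest to x."""
    nearest = min(range(len(train_X)), key=lambda i: distance(train_X[i], x))
    return train_y[nearest]


def auc_from_labels(y_true, y_pred, positive=1):
    """AUC of a classifier with hard outputs: (TPR + TNR) / 2."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return (tpr + tnr) / 2.0


def abs_distance(a, b):
    """Toy 1-D distance used only for this example."""
    return abs(a[0] - b[0])


train_X = [(0.0,), (1.0,), (10.0,)]
train_y = [0, 0, 1]
test_X = [(0.5,), (9.0,), (4.0,), (8.0,)]
test_y = [0, 1, 1, 1]

preds = [predict_1nn(train_X, train_y, x, abs_distance) for x in test_X]
auc = auc_from_labels(test_y, preds)
```

Under this definition a classifier that always predicts the majority class scores 0.5, which makes the metric suitable for imbalanced data.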
Key parameters can be modified in the configuration section:
import multiprocessing

BASE_DIR = "dataset/"
SEED = 42
MAX_THREADS = max(1, multiprocessing.cpu_count() - 1)  # leave one core free
DATASET_TIMEOUT = 1200  # 20 minutes per dataset
GA_ITERATIONS = 1000
GA_CROSSOVER_PROB = 0.8
N_FOLDS = 5  # Cross-validation folds

The program generates:
- Real-time progress updates for each dataset
- Final results table with AUC scores
- Summary statistics (average, min, max AUC)
- Processing time and success rate
- Error reports for failed datasets
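How per-dataset results, failures, and summary statistics might be collected is sketched below with the standard `concurrent.futures` module. This is an assumption-laden illustration, not the project's code: `run_all`, `summarize`, and `fake_process` are hypothetical names. Note that `Future.result(timeout=...)` only stops waiting for the result; it does not terminate the worker thread, and the `with` block still waits for running tasks when it exits.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_all(names, process_fn, max_workers=2, timeout=1200):
    """Run process_fn on every dataset name; collect results and errors."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(process_fn, name) for name in names}
        for name, future in futures.items():
            try:
                # The timeout is measured from this call, not from task start.
                results[name] = future.result(timeout=timeout)
            except FutureTimeout:
                errors[name] = "timed out"
            except Exception as exc:  # keep going when one dataset fails
                errors[name] = f"{type(exc).__name__}: {exc}"
    return results, errors


def summarize(results):
    """Average, minimum, and maximum over the successful AUC scores."""
    values = list(results.values())
    return {
        "average": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }


def fake_process(name):
    """Stand-in for real dataset processing (hypothetical)."""
    if name == "bad":
        raise ValueError("could not parse file")
    return {"a": 0.5, "b": 1.0}[name]


results, errors = run_all(["a", "bad", "b"], fake_process, timeout=5)
summary = summarize(results)
success_rate = f"{len(results)}/{len(results) + len(errors)}"
```

Enforcing a hard limit that actually stops a stuck dataset would require separate processes rather than threads.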
Example output:
=== Final Results ===
Dataset                  AUC
--------------------------------------------------
abalone19-5              0.4988
abalone9-18-5            0.9408
cleveland-0_vs_4-5       0.4844
...
Average AUC: 0.7845
Successfully processed: 44/44 datasets
- KEEL dataset repository for providing standardized imbalanced datasets
- Research community for foundational work in feature selection and evolutionary algorithms