ItsTatsuya/Evolution_FeatureExtraction

Evolution-Based Feature Extraction for Imbalanced Datasets

A machine learning research project that implements a genetic algorithm-based feature selection method combined with probability-based (PO) statistical analysis for classification tasks on imbalanced datasets.

Overview

This project addresses the challenge of feature selection in imbalanced binary classification problems using a hybrid approach that combines:

  1. PO (Probability) Statistic - A novel feature importance measure based on true negative rate (TNR) and false negative rate (FNR)
  2. Genetic Algorithm (GA) - Evolutionary optimization for selecting optimal feature subsets
  3. HEOM Distance - Heterogeneous Euclidean-Overlap Metric for handling mixed data types
  4. Cross-Validation - Robust fitness evaluation using stratified k-fold cross-validation

Key Features

  • Automated Dataset Processing: Handles 44+ imbalanced datasets in KEEL format
  • Parallel Processing: Multi-threaded execution for efficient computation
  • Robust Error Handling: Timeout mechanisms and graceful failure recovery
  • Mixed Data Type Support: Handles both numerical and categorical features
  • Performance Optimization: Adaptive genetic algorithm parameters and efficient distance calculations

Algorithm Components

1. PO Statistic

Computes feature importance based on classification performance metrics:

  • Uses multiple k-values (1, 3, 5, 7) for robust estimation
  • Combines TNR and FNR using power functions
  • Normalizes scores to create probability distributions for feature selection
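The normalization step can be sketched as follows. This README does not give the formula that combines TNR and FNR, so the input scores here are placeholders, and the function names `normalize_scores` and `sample_feature_mask` are hypothetical rather than taken from the project code.

```python
import random

def normalize_scores(scores):
    """Turn non-negative feature scores into a probability distribution."""
    total = sum(scores)
    if total <= 0:
        # Fall back to a uniform distribution when every score is zero.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]

def sample_feature_mask(probs, n_select, rng):
    """Draw n_select distinct features, weighted by probs, as a 0/1 mask."""
    remaining = list(range(len(probs)))
    weights = list(probs)
    mask = [0] * len(probs)
    for _ in range(min(n_select, len(remaining))):
        if sum(weights) > 0:
            pos = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        else:
            # Only zero-probability features are left: pick uniformly.
            pos = rng.randrange(len(remaining))
        mask[remaining.pop(pos)] = 1
        weights.pop(pos)
    return mask

# Placeholder scores for three features (not real PO values).
probs = normalize_scores([2.0, 1.0, 1.0])
mask = sample_feature_mask(probs, n_select=2, rng=random.Random(42))
```

A mask sampled this way could seed the initial GA population so that features with higher scores appear more often, though whether the project uses the distribution in exactly this way is an assumption.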

2. Genetic Algorithm

  • Population Size: Adaptive based on dataset size (20-50 individuals)
  • Crossover: Two-point crossover with 80% probability
  • Mutation: Adaptive mutation rate (starts at 5%, decreases with improvement)
  • Selection: Tournament selection with elitism
  • Termination: Early stopping based on convergence or performance threshold
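A minimal sketch of the operators listed above, operating on 0/1 feature masks. The 80% crossover probability and 5% starting mutation rate come from this README; the tournament size of 3, the single elite individual, and all function names are assumptions for illustration. The adaptive mutation schedule and early stopping are not reproduced here.

```python
import random

def two_point_crossover(parent_a, parent_b, rng, prob=0.8):
    """Swap the segment between two cut points with probability prob."""
    if len(parent_a) < 3 or rng.random() >= prob:
        return parent_a[:], parent_b[:]
    i, j = sorted(rng.sample(range(1, len(parent_a)), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b

def bit_flip_mutation(individual, rng, rate=0.05):
    """Flip each bit independently with the given rate."""
    return [1 - g if rng.random() < rate else g for g in individual]

def tournament_select(population, fitnesses, rng, k=3):
    """Return a copy of the fittest of k randomly chosen individuals."""
    contenders = rng.sample(range(len(population)), min(k, len(population)))
    best = max(contenders, key=lambda idx: fitnesses[idx])
    return population[best][:]

def next_generation(population, fitnesses, rng, elite=1):
    """Build a new population: elites first, then mutated offspring."""
    ranked = sorted(range(len(population)),
                    key=lambda idx: fitnesses[idx], reverse=True)
    new_pop = [population[idx][:] for idx in ranked[:elite]]
    while len(new_pop) < len(population):
        a = tournament_select(population, fitnesses, rng)
        b = tournament_select(population, fitnesses, rng)
        for child in two_point_crossover(a, b, rng):
            if len(new_pop) < len(population):
                new_pop.append(bit_flip_mutation(child, rng))
    return new_pop
```

Copying the elite individuals unchanged guarantees that the best fitness in the population never decreases from one generation to the next.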

3. HEOM Distance

Handles heterogeneous data using:

  • Normalized Euclidean distance for numerical features
  • Overlap distance for categorical features
  • Precomputed ranges for efficiency
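The three points above can be sketched as follows, using the standard HEOM definition. Treating a missing value (`None`) as the maximum per-feature distance of 1 is part of that definition, but how this project handles missing values is not stated here, and the names `compute_ranges` and `heom` are hypothetical.

```python
import math

def compute_ranges(rows, numeric_idx):
    """Precompute max - min for each numerical column, ignoring missing values."""
    ranges = {}
    for j in numeric_idx:
        values = [row[j] for row in rows if row[j] is not None]
        ranges[j] = (max(values) - min(values)) if values else 0.0
    return ranges

def heom(x, y, numeric_idx, ranges):
    """Heterogeneous Euclidean-Overlap Metric between two samples."""
    total = 0.0
    for j, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:
            d = 1.0                      # missing value: maximum distance
        elif j in numeric_idx:
            span = ranges[j]
            d = abs(a - b) / span if span > 0 else 0.0
        else:
            d = 0.0 if a == b else 1.0   # overlap distance for categories
        total += d * d
    return math.sqrt(total)

rows = [(1.0, "a"), (3.0, "b"), (5.0, "a")]
numeric_idx = {0}                        # column 0 is numerical, column 1 categorical
ranges = compute_ranges(rows, numeric_idx)
distance = heom(rows[0], rows[1], numeric_idx, ranges)
```

Computing the ranges once per training set, rather than on every call, keeps each distance evaluation linear in the number of features.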

Dataset Support

The system processes imbalanced binary classification datasets from the KEEL repository, including:

  • Abalone (age prediction)
  • Cleveland (heart disease)
  • Ecoli (protein localization)
  • Glass (glass type identification)
  • LED7digit (digit recognition)
  • Page-blocks (document layout)
  • Shuttle (space shuttle status)
  • Vowel (vowel recognition)
  • Yeast (protein function prediction)

Installation

  1. Clone the repository:
git clone https://github.com/ItsTatsuya/Evolution_FeatureExtraction.git
cd Evolution_FeatureExtraction
  2. Install dependencies:
pip install -r requirements.txt

Usage

  1. Ensure your datasets are in the dataset/ directory in KEEL format
  2. Run the main script:
python main.py

The program will:

  • Process all datasets in parallel
  • Extract and parse KEEL .dat files
  • Apply feature selection using PO statistics and GA
  • Evaluate performance using 1-NN classification with HEOM distance
  • Output AUC scores for each dataset
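As an illustration of the parsing step, the sketch below reads a KEEL-style file from a string. It assumes the usual layout of `@attribute` header lines followed by comma-separated rows after `@data`, with `?` or `<null>` marking a missing value. The function name `parse_keel` and the toy data are hypothetical, and the project's own parser may differ.

```python
import re

MISSING = {"?", "<null>"}

def _convert(value, kind):
    """Map a raw token to None, float, or str depending on the attribute kind."""
    if value in MISSING:
        return None
    return float(value) if kind == "numeric" else value

def parse_keel(text):
    """Return (attributes, rows) from KEEL-formatted text.

    attributes is a list of (name, kind), kind being "numeric" or "nominal".
    """
    attributes, rows = [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append([_convert(v, kind)
                         for v, (_, kind) in zip(values, attributes)])
        elif line.lower().startswith("@attribute"):
            match = re.match(r"@attribute\s+([^\s{]+)\s*(.*)", line, re.IGNORECASE)
            name, spec = match.group(1), match.group(2)
            # A brace-delimited value list marks a nominal attribute.
            kind = "nominal" if spec.startswith("{") else "numeric"
            attributes.append((name, kind))
        elif line.lower().startswith("@data"):
            in_data = True
        # Other header lines (@relation, @inputs, @outputs) are skipped here.
    return attributes, rows

sample = """@relation toy
@attribute Sex {M, F, I}
@attribute Length real [0.075, 0.815]
@attribute Class {positive, negative}
@inputs Sex, Length
@outputs Class
@data
M, 0.455, negative
F, ?, positive
"""
attributes, rows = parse_keel(sample)
```

The attribute kinds returned here are what a mixed-type distance such as HEOM needs in order to choose between the numerical and categorical per-feature distance.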

Configuration

Key parameters can be modified in the configuration section:

import multiprocessing

BASE_DIR = "dataset/"
SEED = 42
MAX_THREADS = max(1, multiprocessing.cpu_count() - 1)
DATASET_TIMEOUT = 1200  # 20 minutes per dataset
GA_ITERATIONS = 1000
GA_CROSSOVER_PROB = 0.8
N_FOLDS = 5  # Cross-validation folds

Output

The program generates:

  • Real-time progress updates for each dataset
  • Final results table with AUC scores
  • Summary statistics (average, min, max AUC)
  • Processing time and success rate
  • Error reports for failed datasets

Example output:

=== Final Results ===
Dataset                                         AUC
--------------------------------------------------
abalone19-5                                  0.4988
abalone9-18-5                                0.9408
cleveland-0_vs_4-5                           0.4844
...

Average AUC: 0.7845
Successfully processed: 44/44 datasets

Acknowledgments

  • KEEL dataset repository for providing standardized imbalanced datasets
  • Research community for foundational work in feature selection and evolutionary algorithms
