This project uses Genetic Algorithms (GA) to select features from a bank marketing dataset, with the goal of making predictive models more efficient.
The analysis requires the following Python libraries:
- pandas
- numpy
- scikit-learn (imported as sklearn)
- lightgbm
- sklearn-genetic (imported as genetic_selection)
- matplotlib
- itertools (standard library)
- DataTable
- warnings (standard library)
The bank dataset bank-full.csv is loaded and preprocessed. This involves:
- Reading the data into a pandas DataFrame.
- Factorizing categorical columns to convert them into integer codes.
- Splitting the data into training and testing datasets.
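The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `preprocess` helper, the target column name `y`, the 80/20 split, and the `;` separator mentioned in the comment are all assumptions, and a tiny stand-in DataFrame replaces bank-full.csv so the snippet runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def preprocess(df, target="y", test_size=0.2, seed=0):
    """Factorize non-numeric columns and split into train/test sets.

    Hypothetical helper; the target name and split ratio are assumptions.
    """
    df = df.copy()
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            # factorize maps each distinct value to an integer code
            df[col], _ = pd.factorize(df[col])
    X = df.drop(columns=[target])
    y = df[target]
    return train_test_split(X, y, test_size=test_size, random_state=seed, stratify=y)


# Stand-in data so the example is self-contained. The real data would be
# read with something like pd.read_csv("bank-full.csv", sep=";"), where
# the separator is an assumption about the file layout.
demo = pd.DataFrame({
    "age": [30, 45, 52, 23, 38, 61, 27, 49, 33, 58],
    "job": ["admin", "tech", "admin", "student", "tech",
            "retired", "student", "admin", "tech", "retired"],
    "y": ["no", "yes", "no", "no", "yes", "yes", "no", "no", "yes", "no"],
})
X_train, X_test, y_train, y_test = preprocess(demo)
print(X_train.dtypes)
```

Factorizing assigns arbitrary integer codes, which tree-based models tolerate well; models that assume an ordering between codes may call for one-hot encoding instead.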
A decision tree classifier trained on all features is used as a baseline model for comparison.
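A baseline of this kind might look like the sketch below. It uses synthetic data from `make_classification` in place of the bank dataset, and the hyperparameters (scikit-learn defaults with a fixed `random_state`) are assumptions rather than the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed bank data
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Baseline: a decision tree fitted on every available feature
baseline = DecisionTreeClassifier(random_state=0)
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(f"baseline accuracy on all features: {baseline_accuracy:.3f}")
```

The accuracy obtained here is the number a GA-selected feature subset would need to match or beat with fewer features.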
A Genetic Algorithm is used to perform feature selection:
- Multiple configurations of GA parameters, such as population size, crossover rate, and mutation rate, are tested.
- For each configuration, the feature subset with the highest cross-validated accuracy is selected.
- The results for each configuration are stored, and the best overall set of features is identified.
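To make the search loop concrete, here is a small hand-rolled GA over boolean feature masks, run across a grid of configurations built with `itertools.product`. It is a stand-in written for illustration, not the sklearn-genetic implementation the project relies on; the `fitness` and `run_ga` helpers, the grid values, the operators (tournament selection, uniform crossover, bit-flip mutation, elitism), and the synthetic data are all assumptions.

```python
import itertools

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def fitness(mask, X, y):
    """Mean cross-validated accuracy using only the columns selected by mask."""
    if not mask.any():
        return 0.0  # an empty feature set cannot be scored
    clf = DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=3, scoring="accuracy").mean()


def run_ga(X, y, pop_size, crossover_rate, mutation_rate, n_generations=5, seed=0):
    """Return (best_mask, best_score, best-so-far score per generation)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    population = rng.random((pop_size, n_features)) < 0.5
    best_mask, best_score, history = None, -1.0, []
    for _ in range(n_generations):
        scores = np.array([fitness(ind, X, y) for ind in population])
        top = int(scores.argmax())
        if scores[top] > best_score:
            best_mask, best_score = population[top].copy(), float(scores[top])
        history.append(best_score)
        # Tournament selection: keep the fitter of two randomly drawn individuals
        a, b = rng.integers(pop_size, size=(2, pop_size))
        parents = population[np.where(scores[a] >= scores[b], a, b)]
        # Uniform crossover between each parent and its neighbour
        partners = np.roll(parents, 1, axis=0)
        do_cross = rng.random((pop_size, 1)) < crossover_rate
        take_partner = do_cross & (rng.random((pop_size, n_features)) < 0.5)
        children = np.where(take_partner, partners, parents)
        # Bit-flip mutation
        flips = rng.random((pop_size, n_features)) < mutation_rate
        population = children ^ flips
        population[0] = best_mask  # elitism: carry the best individual forward
    return best_mask, best_score, history


X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)

# Hypothetical grid values; the configurations actually tested may differ
grid = itertools.product([6, 10], [0.6, 0.9], [0.01, 0.1])
results = []
for pop_size, cx, mut in grid:
    mask, score, history = run_ga(X, y, pop_size, cx, mut)
    results.append({"pop_size": pop_size, "crossover": cx, "mutation": mut,
                    "accuracy": score, "features": np.flatnonzero(mask).tolist()})

best = max(results, key=lambda r: r["accuracy"])
print(best)
```

Each entry in `results` records one configuration together with its best cross-validated accuracy and the indices of the selected features, so picking the overall winner reduces to a single `max` call.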
Two kinds of plots are generated:
- A plot of the best fitness value across generations for each GA configuration.
- Error plots showing the standard error for each GA configuration.
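Both kinds of plot can be produced with matplotlib along the lines of the sketch below. The histories are randomly generated placeholders, not results from this project, and the sketch assumes the standard error is computed across repeated runs of each configuration; the configuration labels and the output file name `ga_plots.png` are likewise hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: best fitness per generation for repeated runs of each
# configuration. These numbers are random, not measured results.
rng = np.random.default_rng(0)
n_runs, n_generations = 5, 20
labels = ["config A", "config B", "config C"]
histories = {
    label: np.maximum.accumulate(0.80 + 0.05 * rng.random((n_runs, n_generations)), axis=1)
    for label in labels
}

fig, (ax_fit, ax_err) = plt.subplots(1, 2, figsize=(10, 4))
generations = np.arange(1, n_generations + 1)
final_means, final_sems = [], []
for label, runs in histories.items():
    mean = runs.mean(axis=0)
    sem = runs.std(axis=0, ddof=1) / np.sqrt(n_runs)  # standard error of the mean
    ax_fit.plot(generations, mean, label=label)
    final_means.append(mean[-1])
    final_sems.append(sem[-1])

ax_fit.set_xlabel("Generation")
ax_fit.set_ylabel("Best fitness (accuracy)")
ax_fit.legend()

# Error plot: final-generation mean with standard-error bars per configuration
positions = list(range(len(labels)))
ax_err.errorbar(positions, final_means, yerr=final_sems, fmt="o", capsize=4)
ax_err.set_xticks(positions)
ax_err.set_xticklabels(labels)
ax_err.set_ylabel("Final best fitness")

fig.tight_layout()
fig.savefig("ga_plots.png")
```

If the standard error is instead taken across cross-validation folds, the same `std / sqrt(n)` computation applies with `n` set to the number of folds.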
The results are stored in a CSV file named GA_report.csv.
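Writing such a report is a one-liner with pandas. The sketch below assumes one row per GA configuration; the `save_report` helper, the column names, and the row values are hypothetical placeholders, and the file is written to a temporary directory here so the example leaves nothing behind.

```python
import os
import tempfile

import pandas as pd


def save_report(results, path="GA_report.csv"):
    """Write one row per configuration, best accuracy first (hypothetical helper)."""
    report = pd.DataFrame(results).sort_values("accuracy", ascending=False)
    report.to_csv(path, index=False)
    return report


# Placeholder rows, not actual results
rows = [
    {"pop_size": 10, "crossover": 0.6, "mutation": 0.01, "accuracy": 0.85, "n_features": 7},
    {"pop_size": 20, "crossover": 0.9, "mutation": 0.10, "accuracy": 0.88, "n_features": 5},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "GA_report.csv")
    save_report(rows, path)
    reloaded = pd.read_csv(path)

print(reloaded)
```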
- Dataset: Moro, Sérgio & Cortez, Paulo & Laureano, Raul. (2012). Enhancing Bank Direct Marketing through Data Mining. CAlg European Marketing Academy. Available at: https://repositorium.sdum.uminho.pt/handle/1822/21409
- GA Package: Manuel Calzolari. (2021, April 3). manuel-calzolari/sklearn-genetic: sklearn-genetic 0.4.1 (Version 0.4.1). Zenodo. http://doi.org/10.5281/zenodo.4661248