# Feature Selection with a Genetic Algorithm (GA)

The search space for this problem is the power set of the 54 attributes, which contains
$2^{54}$ (roughly $1.8 \times 10^{16}$) candidate subsets. Denote this space by $\mathcal{S}$. A heuristic search algorithm such as a GA can look for a good subset of attributes without having
to visit every element of this colossal space.
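
To give a sense of scale, the size of $\mathcal{S}$ can be computed directly. This is only an illustrative sketch: the evaluation rate used below is a made-up assumption, not a measurement.

```python
# Size of the search space: each of the 54 attributes is either in or out.
n_attributes = 54
n_subsets = 2 ** n_attributes
print(f"{n_subsets:,} candidate subsets")

# Hypothetical rate (an assumption for illustration): one million
# fitness evaluations per second.
evals_per_second = 1_000_000
seconds_per_year = 60 * 60 * 24 * 365
years = n_subsets / (evals_per_second * seconds_per_year)
print(f"about {years:.0f} years of exhaustive search at that rate")
```

Even at that optimistic rate, enumerating every subset would take centuries, which is why a heuristic search is used instead.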

Before applying the GA to this problem, we must define how solutions (subsets) are
represented and which function ranks them, known as the fitness function.

## Solution representation

For this problem, each chromosome is a vector $x^i \in \{0,1\}^{54}$ whose component $x^i_j$ indicates whether attribute $j$ is present ($1$) or absent ($0$) in solution $i$.
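
As a minimal sketch of this encoding, the snippet below decodes a chromosome into the attribute names it selects. It uses only four column names for brevity (the real chromosome has 54 bits), and the helper `decode` is a hypothetical name introduced here for illustration.

```python
# A four-attribute toy slice; the real problem has 54 attributes.
attributes = ["elevation", "aspect", "slope", "horiz_dist_hydro"]

# One bit per attribute: 1 means the attribute belongs to the subset.
chromosome = [1, 0, 0, 1]

def decode(chromosome, attributes):
    """Return the names of the attributes selected by a binary chromosome."""
    return [name for bit, name in zip(chromosome, attributes) if bit == 1]

selected = decode(chromosome, attributes)
print(selected)  # ['elevation', 'horiz_dist_hydro']
```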

## Fitness function

This function ranks each solution according to the objective of selecting the most appropriate attributes for this classification problem. Since the purpose is to determine the class of an instance from its attribute values, correlation is used here as the similarity measure. The correlation between two random variables $X$ and $Y$ is given by the
equation:

$$
corr(X,Y) = \frac{cov(X,Y)}{\sigma_X\sigma_Y}
$$

where $cov(X,Y)$ is the covariance between $X$ and $Y$, and $\sigma_X$, $\sigma_Y$ are the standard deviations of $X$ and $Y$, respectively. The fitness function is defined as the mean of the absolute correlations (each in $[0,1]$) between the selected attributes and the class:

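$$
f(x^i) = \frac{\sum_{j:\, x^i_j = 1} |corr(A_j, cover\_type)|}{\sum_{j:\, x^i_j = 1} 1},
$$

where $A_j$ denotes the column of values of attribute $j$ and the denominator is simply the number of selected attributes.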
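
As a small worked example of the formula above, the sketch below evaluates the fitness of one chromosome over three attributes. The correlation values are made up purely for illustration. Note that the formula is undefined when no attribute is selected; the implementation further down returns 0 in that case.

```python
# Made-up correlations of three attributes with the class (illustrative only).
correlations = [-0.27, 0.04, 0.15]

# Select the first and third attribute.
chromosome = [1, 0, 1]

# Absolute correlations of the selected attributes: 0.27 and 0.15.
selected = [abs(c) for bit, c in zip(chromosome, correlations) if bit == 1]

# Mean over the selected attributes: (0.27 + 0.15) / 2.
fitness_value = sum(selected) / len(selected)
print(round(fitness_value, 2))  # 0.21
```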

## Implementation

First, we import the libraries and load the dataset:

In [17]:
import pandas as pd
import numpy as np
from pyeasyga.pyeasyga import GeneticAlgorithm

# read the dataset
dataset = pd.read_csv("datasets/new_dataset_covertype.csv")
# preview of dataset
dataset.head()

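|   | elevation | aspect | slope | horiz_dist_hydro | vert_dist_hydro | horiz_dist_road | hillshade_9 | hill_shade_noon | hill_shade_15 | horiz_dist_fire | ... | soil_type_31 | soil_type_32 | soil_type_33 | soil_type_34 | soil_type_35 | soil_type_36 | soil_type_37 | soil_type_38 | soil_type_39 | cover_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3254 | 75 | 7 | 365 | 49 | 3034 | 228 | 228 | 133 | 4708 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 3149 | 341 | 16 | 216 | 30 | 3241 | 186 | 215 | 167 | 3085 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 2972 | 321 | 10 | 150 | 13 | 4796 | 194 | 230 | 176 | 4607 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 3097 | 265 | 21 | 430 | 60 | 3290 | 162 | 244 | 218 | 1503 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 3321 | 286 | 7 | 660 | 118 | 797 | 201 | 240 | 179 | 968 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |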


The correlations with respect to the class do not depend on the candidate solution, so they can be computed once, as follows:

In [None]:
# Pearson correlation of every column with the class
class_correlations = dataset.corr(method="pearson")['cover_type']
print(class_correlations)

# Replace NaN values with 0 (NaN appears, for example, when a column has zero variance)
class_correlations = class_correlations.fillna(0)
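
To make the link with the correlation formula explicit, the following dependency-free sketch computes the Pearson correlation straight from its definition and shows one way a `NaN` can arise. The helper `pearson` is a hypothetical name introduced here, the class labels are made up, and the code is not how pandas computes the value internally.

```python
import math

def pearson(x, y):
    """Pearson correlation from the definition cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sd_x == 0 or sd_y == 0:
        # A constant column has zero standard deviation, so the ratio is undefined.
        return float("nan")
    return cov / (sd_x * sd_y)

# Elevation values from the preview rows; the class labels are made up.
elevation = [3254, 3149, 2972, 3097, 3321]
cover_type = [1, 1, 2, 2, 1]
constant_column = [0, 0, 0, 0, 0]

r = pearson(elevation, cover_type)
print(r)
print(pearson(constant_column, cover_type))  # nan
```

Replacing such `NaN` values with 0 means a constant attribute contributes nothing to the fitness of a subset that contains it.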

Now, we define the fitness function:

In [30]:
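# Correlations of the 54 attributes with the class, in column order
data = class_correlations.iloc[:54].tolist()

# Fitness function
def fitness(individual, data):
    '''
    Compute the fitness of a solution: the mean absolute correlation
    of the selected attributes with the class (0.0 if none is selected).
    '''
    n_selected = individual.count(1)
    if n_selected > 0:
        total = 0.0  # running sum of absolute correlations
        for i in range(len(individual)):
            if individual[i] == 1:
                total += abs(data[i])
        return total / n_selected
    return 0.0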

# DataFrame to store the results: GA parameters plus one column per attribute
columns = ['population', 'max_generations',
           'combination_prob', 'mutation_prob', 'iteration']
columns = columns + class_correlations.iloc[:54].index.tolist()

results = pd.DataFrame(columns=columns)

Now, we run the algorithm for every combination of parameter values, repeating each configuration 30 times:

In [None]:
populations  = [25, 50, 100]
generations  = [50, 100, 200]
combinations = [.7, .8, .9]      # crossover probabilities
mutations    = [.05, .1, .15]    # mutation probabilities
iterations   = 30                # runs per parameter configuration

# Algorithm definition
ga = GeneticAlgorithm(data, elitism=True, maximise_fitness=True)

# Use our own fitness function
ga.fitness_function = fitness

index = 0

for population in populations:
    ga.population_size = population
    for generation in generations:
        ga.generations = generation
        for combination in combinations:
            ga.crossover_probability = combination
            for mutation in mutations:
                ga.mutation_probability = mutation
                for i in range(iterations):
                    # Run the genetic algorithm
                    ga.run()
                    print("pop: ", population, "generation: ", generation,
                          "comb: ", combination, "mu: ", mutation, "it: ", (i+1))
                    # best_individual() returns (fitness, genes); store the genes as booleans
                    best_genes = ga.best_individual()[1]
                    r = ([population, generation, combination, mutation, i+1]
                         + [x == 1 for x in best_genes])
                    results.loc[index] = r
                    index += 1
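
For readers unfamiliar with what a single GA run does, the following is a generic, self-contained sketch of a binary GA with tournament selection, one-point crossover, single-bit mutation, and elitism. It is not pyeasyga's actual code: the function `run_ga`, its defaults, and the toy correlations are all assumptions made for illustration.

```python
import random

def fitness(individual, data):
    """Mean absolute correlation of the selected attributes (0.0 if none)."""
    selected = [abs(c) for bit, c in zip(individual, data) if bit == 1]
    return sum(selected) / len(selected) if selected else 0.0

def run_ga(data, population_size=20, generations=30,
           crossover_probability=0.8, mutation_probability=0.1, seed=0):
    """Hypothetical minimal GA; returns (best_fitness, best_individual)."""
    rng = random.Random(seed)
    n = len(data)
    population = [[rng.randint(0, 1) for _ in range(n)]
                  for _ in range(population_size)]

    def tournament():
        # Pick two random individuals and keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a, data) >= fitness(b, data) else b

    for _ in range(generations):
        # Elitism: the current best individual survives unchanged.
        best = max(population, key=lambda ind: fitness(ind, data))
        next_population = [best[:]]
        while len(next_population) < population_size:
            child_1, child_2 = tournament()[:], tournament()[:]
            if rng.random() < crossover_probability:
                # One-point crossover: swap the tails after a random cut.
                point = rng.randint(1, n - 1)
                child_1, child_2 = (child_1[:point] + child_2[point:],
                                    child_2[:point] + child_1[point:])
            for child in (child_1, child_2):
                if rng.random() < mutation_probability:
                    # Mutation: flip one randomly chosen bit.
                    position = rng.randrange(n)
                    child[position] = 1 - child[position]
                if len(next_population) < population_size:
                    next_population.append(child)
        population = next_population

    best = max(population, key=lambda ind: fitness(ind, data))
    return fitness(best, data), best

# Made-up correlations for six attributes (illustrative only).
toy_correlations = [0.05, -0.30, 0.10, 0.02, -0.15, 0.20]
best_fitness, best_individual = run_ga(toy_correlations)
print(best_fitness, best_individual)
```

The parameters swept in the grid above (population size, number of generations, crossover probability, mutation probability) play the same roles as the arguments of this sketch.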
                    

Finally, we store the results in a CSV file:

In [34]:
# Write the results to a CSV file (the results/ directory must already exist)
results.to_csv('results/genetic_algorithm.csv', index=False)