# Feature Selection using Particle Swarm Optimization (PSO)

The search space for this problem is the power set of the 54 attributes, which contains
$2^{54}$ subsets. Denote this space by $\mathcal{S}$. A local search algorithm can look for a suitable subset of attributes without having
to visit every element of this colossal space.
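
To give a sense of scale, the short calculation below counts the subsets and estimates how long an exhaustive enumeration would take. The rate of one million fitness evaluations per second is an assumed figure, chosen only for illustration.

```python
# Size of the search space: every attribute is either in or out of a subset.
n_attributes = 54
n_subsets = 2 ** n_attributes

# Hypothetical evaluation rate, assumed only for illustration.
evals_per_second = 1_000_000
seconds_per_year = 60 * 60 * 24 * 365

years = n_subsets / evals_per_second / seconds_per_year
print(f"{n_subsets:,} subsets, about {years:.0f} years of exhaustive search")
```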

Before submitting this problem to the PSO algorithm, it is necessary to define how solutions (subsets) are
represented and the function used to rank them, known as the *fitness function*.
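
For orientation, the loop that a PSO algorithm runs can be sketched as follows. This is a minimal global-best variant written for illustration, not the implementation of any particular library; the function name `pso_sketch`, its parameter names, and the coefficient values are assumptions of this sketch.

```python
import numpy as np

def pso_sketch(fitness, lb, ub, swarm_size=20, max_iter=50,
               omega=0.5, phi_p=0.5, phi_g=0.5, seed=0):
    """Minimal global-best PSO that minimizes `fitness` inside the box [lb, ub]."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, dtype=float), np.asarray(ub, dtype=float)
    dim = lb.size

    # Random initial positions inside the bounds, zero initial velocities.
    pos = rng.uniform(lb, ub, size=(swarm_size, dim))
    vel = np.zeros((swarm_size, dim))

    # Personal bests and the global best seen so far.
    pbest = pos.copy()
    pbest_val = np.array([fitness(p) for p in pos])
    g = pbest_val.argmin()
    gbest, gbest_val = pbest[g].copy(), pbest_val[g]

    for _ in range(max_iter):
        r_p = rng.random((swarm_size, dim))
        r_g = rng.random((swarm_size, dim))
        # Velocity: inertia + pull toward personal best + pull toward global best.
        vel = omega * vel + phi_p * r_p * (pbest - pos) + phi_g * r_g * (gbest - pos)
        # Move, then clip back into the feasible box.
        pos = np.clip(pos + vel, lb, ub)

        vals = np.array([fitness(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]

        g = pbest_val.argmin()
        if pbest_val[g] < gbest_val:
            gbest, gbest_val = pbest[g].copy(), pbest_val[g]

    return gbest, gbest_val

# Toy usage: minimize the squared distance to the point (0.3, 0.3, 0.3).
best_x, best_val = pso_sketch(lambda x: np.sum((x - 0.3) ** 2), np.zeros(3), np.ones(3))
print(best_x, best_val)
```

The sketch only needs a fitness function and box bounds, which is why the two definitions that follow (representation and fitness) are enough to apply PSO to attribute selection.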

## Solution representation

For this problem, a solution $i$ is represented by a vector $x^i \in [0,1]^{54}$ in which each component $x^i_j$ is interpreted as the probability of selecting attribute $j$. Denote $[0,1]^{54}$ by $\mathcal{S_2}$. A threshold $0 \leq \theta \leq 1$ decides
whether an attribute is selected: $x^i_j \geq \theta$ means that attribute $j$ is selected in solution $i$,
as implemented by the simple function below. In this implementation, $\theta = 0.5$:

In [1]:
def is_selected(xij, theta=0.5):
    '''
    Determines whether attribute j in solution i is selected,
    based on the value at position j of vector x^i and the threshold theta.
    '''
    return xij >= theta
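
As an illustration of this decoding, the snippet below turns a four-component vector into a list of selected attributes. The vector values are made up, and only four attribute names are used to keep the example short.

```python
import numpy as np

# Hypothetical position vector and a short list of attribute names, for illustration only.
attribute_names = ["elevation", "aspect", "slope", "horiz_dist_hydro"]
x = np.array([0.91, 0.12, 0.50, 0.49])
theta = 0.5

# An attribute is selected when its component reaches the threshold.
mask = x >= theta
selected = [name for name, keep in zip(attribute_names, mask) if keep]
print(selected)  # ['elevation', 'slope']
```

Note that many different vectors in $\mathcal{S_2}$ decode to the same subset, since only the side of the threshold on which each component falls matters.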

## Fitness function

This function ranks each solution according to the objective of selecting the most appropriate attributes for this classification problem. Since the purpose is to determine the class of an instance from its attribute values, Pearson correlation is used here as the similarity measure. The correlation between two random variables $X$ and $Y$ is given by the
equation:

$$
corr(X,Y) = \frac{cov(X,Y)}{\sigma_X\sigma_Y}
$$

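where $cov(X,Y)$ is the covariance between $X$ and $Y$, and $\sigma_X$ and $\sigma_Y$ are their standard deviations. The fitness function is built from the mean of the absolute correlations (in $[0,1]$) between each selected attribute and the class. Let $A_j$ denote attribute $j$ and let $J_i = \{j : x^i_j \geq \theta\}$ be the set of attributes selected by solution $i$. Given that the PSO algorithm used here minimizes the fitness value, the fitness function $f:\mathcal{S_2}\to\mathbb{R}$ is given by the equation:

$$
f(x^i) = 1 - \frac{1}{|J_i|}\sum_{j \in J_i} |corr(A_j, cover\_type)|.
$$

When no attribute is selected ($J_i = \emptyset$), $f(x^i) = 1$, the worst possible value.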
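
A small worked example may help to read the formula. The correlation values below are invented, and `toy_fitness` is a hypothetical helper that mirrors the definition on four attributes only.

```python
import numpy as np

# Hypothetical absolute attribute-class correlations for four attributes.
abs_corr = np.array([0.30, 0.05, 0.20, 0.10])
theta = 0.5

def toy_fitness(x):
    """1 minus the mean absolute correlation of the selected attributes."""
    selected = np.asarray(x) >= theta
    if not selected.any():
        return 1.0  # worst value when nothing is selected
    return 1.0 - abs_corr[selected].mean()

both = toy_fitness([0.9, 0.1, 0.7, 0.2])    # selects attributes 0 and 2
single = toy_fitness([0.9, 0.1, 0.1, 0.2])  # selects attribute 0 only
print(both, single)
```

Because the score is a mean, adding an attribute whose correlation is below the current average always worsens it. In this toy case the single most correlated attribute therefore scores lower (better) than the pair.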

## Implementation

First of all, it is necessary to load the dataset:

In [2]:
import pandas as pd
# read the dataset
dataset = pd.read_csv("datasets/new_dataset_covertype.csv")
# preview
dataset.head()

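| | elevation | aspect | slope | horiz_dist_hydro | vert_dist_hydro | horiz_dist_road | hillshade_9 | hill_shade_noon | hill_shade_15 | horiz_dist_fire | ... | soil_type_31 | soil_type_32 | soil_type_33 | soil_type_34 | soil_type_35 | soil_type_36 | soil_type_37 | soil_type_38 | soil_type_39 | cover_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3254 | 75 | 7 | 365 | 49 | 3034 | 228 | 228 | 133 | 4708 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 3149 | 341 | 16 | 216 | 30 | 3241 | 186 | 215 | 167 | 3085 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 2972 | 321 | 10 | 150 | 13 | 4796 | 194 | 230 | 176 | 4607 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 3097 | 265 | 21 | 430 | 60 | 3290 | 162 | 244 | 218 | 1503 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 3321 | 286 | 7 | 660 | 118 | 797 | 201 | 240 | 179 | 968 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |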


The correlations with respect to the class do not depend on the candidate solution, so they can be computed once:

In [3]:
# compute attribute-class correlations
class_correlations = dataset.corr(method="pearson")['cover_type']
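
As a sanity check of what `corr(method="pearson")` returns, the sketch below compares it with the definition $cov(X,Y)/(\sigma_X\sigma_Y)$ on a small made-up table. The `toy` dataframe and its values are assumptions used only for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up table standing in for the real dataset.
toy = pd.DataFrame({
    "elevation": [3254, 3149, 2972, 3097, 3321, 2800],
    "slope": [7, 16, 10, 21, 7, 12],
    "cover_type": [1, 1, 2, 1, 1, 2],
})

# Pearson correlation of every column with the class, as pandas computes it.
toy_corr = toy.corr(method="pearson")["cover_type"]

# The same value for one attribute, straight from the definition.
x, y = toy["elevation"], toy["cover_type"]
manual = x.cov(y) / (x.std() * y.std())

print(toy_corr["elevation"], manual)
```

The resulting series also contains the correlation of `cover_type` with itself, which is 1, so the class entry has to be left out when scoring attribute subsets.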

Now, define the fitness function by:

In [4]:
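import numpy as np

def f(x, theta=0.5):
    '''
    Takes a vector in [0,1]^54 and computes its fitness value.
    '''
    # an attribute is selected when its component reaches the threshold
    selected_attrs = [xj >= theta for xj in x]
    if any(selected_attrs):
        # positional access; the last entry (cover_type itself) is skipped
        sum_corr = sum(abs(class_correlations.iloc[i]) for i in range(dataset.shape[1] - 1) if selected_attrs[i])
        count_attrs = sum(selected_attrs)
        return 1 - (sum_corr / count_attrs)
    else:
        # no attribute selected: worst possible fitness
        return 1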

The PSO implementation used here comes from the `pyswarm` library. For each combination of swarm size and maximum number of iterations, 30 trials are run:

In [5]:
from pyswarm import pso
import itertools
# number of decision variables (every column except the class)
n_attrs = dataset.shape[1] - 1
# define the variables' lower bound
lb = np.zeros(n_attrs)
# define the variables' upper bound
ub = np.ones(n_attrs)
# number of trials per parameter combination
max_trials = 30
# swarm sizes to test
swarm_sizes = [25, 50, 100]
# maximum numbers of iterations to test
max_iterations_options = [50, 100, 200]
# columns of the dataframe that stores the results
results_columns = pd.Index(['swarm_size', 'max_iterations', 'fitness']).append(dataset.columns[:-1])
# rows are collected in a list (DataFrame.append was removed in pandas 2.0)
rows = []
# run PSO max_trials times for each combination of parameters
for swarm_size, max_iterations in itertools.product(swarm_sizes, max_iterations_options):
    print("swarm_size = " + str(swarm_size) + ", max_iterations = " + str(max_iterations))
    for trial in range(1, max_trials + 1):
        # execute PSO
        xopt, fopt = pso(f, lb, ub, swarmsize=swarm_size, maxiter=max_iterations)
        # booleanize vector
        xopt = list(map(is_selected, xopt))
        # print info
        print("Trial " + str(trial) + ": " + str(fopt) + ", selected " + str(sum(xopt)))
        # store result
        rows.append([swarm_size, max_iterations, fopt] + xopt)
results = pd.DataFrame(rows, columns=results_columns)
results

swarm_size = 25, max_iterations = 50
Stopping search: maximum iterations reached --> 50
Trial 1: 0.8979742723107548, selected 27
Stopping search: maximum iterations reached --> 50
Trial 2: 0.9059822776566215, selected 26
Stopping search: maximum iterations reached --> 50
Trial 3: 0.9090162911859991, selected 24
Stopping search: maximum iterations reached --> 50
Trial 4: 0.9026336031366436, selected 23
Stopping search: maximum iterations reached --> 50
Trial 5: 0.9036737179769144, selected 24
Stopping search: maximum iterations reached --> 50
Trial 6: 0.913627273752634, selected 19
Stopping search: maximum iterations reached --> 50
Trial 7: 0.8997928685414588, selected 21
Stopping search: maximum iterations reached --> 50
Trial 8: 0.8998176409510279, selected 25
Stopping search: maximum iterations reached --> 50
Trial 9: 0.914120647250839, selected 27
Stopping search: maximum iterations reached --> 50
Trial 10: 0.9047620398913903, selected 26
[...]

Stopping search: maximum iterations reached --> 200
Trial 29: 0.9055072164200074, selected 25
Stopping search: maximum iterations reached --> 200
Trial 30: 0.9162644863698969, selected 22
swarm_size = 50, max_iterations = 50
Stopping search: maximum iterations reached --> 50
Trial 1: 0.8985518499458228, selected 20
Stopping search: maximum iterations reached --> 50
Trial 2: 0.9072121211536335, selected 25
Stopping search: maximum iterations reached --> 50
Trial 3: 0.9029287025768463, selected 27
Stopping search: maximum iterations reached --> 50
Trial 4: 0.9095715875474304, selected 24
Stopping search: maximum iterations reached --> 50
Trial 5: 0.9086918169925441, selected 21
Stopping search: maximum iterations reached --> 50
Trial 6: 0.9121010505922117, selected 23
Stopping search: maximum iterations reached --> 50
Trial 7: 0.9090619570425632, selected 28
Stopping search: maximum iterations reached --> 50
Trial 8: 0.8887725105133109, selected 18
[...]

Stopping search: maximum iterations reached --> 200
Trial 26: 0.9090132218904927, selected 26
Stopping search: maximum iterations reached --> 200
Trial 27: 0.907980884311404, selected 20
Stopping search: maximum iterations reached --> 200
Trial 28: 0.9146821235395435, selected 29
Stopping search: maximum iterations reached --> 200
Trial 29: 0.9039655274812013, selected 25
Stopping search: maximum iterations reached --> 200
Trial 30: 0.9129868127703151, selected 24
swarm_size = 100, max_iterations = 50
Stopping search: maximum iterations reached --> 50
Trial 1: 0.8992797588523199, selected 26
Stopping search: maximum iterations reached --> 50
Trial 2: 0.8974737890151389, selected 27
Stopping search: maximum iterations reached --> 50
Trial 3: 0.8892287854787637, selected 19
Stopping search: maximum iterations reached --> 50
Trial 4: 0.8987902469346695, selected 24
Stopping search: maximum iterations reached --> 50
Trial 5: 0.8866802740924468, selected 17
[...]

Stopping search: maximum iterations reached --> 200
Trial 23: 0.8802710189289364, selected 18
Stopping search: maximum iterations reached --> 200
Trial 24: 0.8892933008154154, selected 17
Stopping search: maximum iterations reached --> 200
Trial 25: 0.8921395422828429, selected 21
Stopping search: maximum iterations reached --> 200
Trial 26: 0.8970149963749383, selected 23
Stopping search: maximum iterations reached --> 200
Trial 27: 0.8960952014864028, selected 22
Stopping search: maximum iterations reached --> 200
Trial 28: 0.8823702922869051, selected 22
Stopping search: maximum iterations reached --> 200
Trial 29: 0.9098980098643604, selected 25
Stopping search: maximum iterations reached --> 200
Trial 30: 0.8879672952812039, selected 22


| | swarm_size | max_iterations | fitness | elevation | aspect | slope | horiz_dist_hydro | vert_dist_hydro | horiz_dist_road | hillshade_9 | ... | soil_type_30 | soil_type_31 | soil_type_32 | soil_type_33 | soil_type_34 | soil_type_35 | soil_type_36 | soil_type_37 | soil_type_38 | soil_type_39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 50 | 8.979743e-01 | False | True | False | True | True | True | True | ... | True | True | True | False | False | True | False | True | True | True |
| 1 | 25 | 50 | 9.059823e-01 | False | True | False | True | True | True | False | ... | True | True | True | False | False | True | False | True | True | False |
| 2 | 25 | 50 | 9.090163e-01 | True | False | False | False | True | True | True | ... | True | True | True | False | False | False | False | False | True | True |
| 3 | 25 | 50 | 9.026336e-01 | False | False | False | True | True | True | False | ... | False | False | False | False | True | False | False | True | True | True |
| 4 | 25 | 50 | 9.036737e-01 | True | False | False | False | False | True | True | ... | True | False | True | False | False | False | False | True | True | True |
| 5 | 25 | 50 | 9.136273e-01 | False | True | False | True | False | True | False | ... | False | True | True | True | True | False | False | True | True | False |
| 6 | 25 | 50 | 8.997929e-01 | False | True | False | False | True | True | False | ... | True | True | True | False | True | True | False | True | False | True |
| 7 | 25 | 50 | 8.998176e-01 | True | False | False | False | True | True | False | ... | False | True | True | False | True | False | False | True | True | True |
| 8 | 25 | 50 | 9.141206e-01 | True | False | True | False | False | False | False | ... | False | False | True | True | True | True | True | True | True | True |
| 9 | 25 | 50 | 9.047620e-01 | True | False | False | True | True | False | False | ... | False | True | True | False | False | True | False | True | True | True |
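
To compare parameter combinations, the trials can be aggregated per `(swarm_size, max_iterations)` pair. The sketch below shows the idea on a tiny made-up table; `toy_results` and its fitness values are hypothetical stand-ins for the real `results` dataframe.

```python
import pandas as pd

# Made-up fitness values standing in for the real `results` dataframe.
toy_results = pd.DataFrame({
    "swarm_size": [25, 25, 50, 50, 100, 100],
    "max_iterations": [50, 50, 50, 50, 50, 50],
    "fitness": [0.90, 0.91, 0.89, 0.91, 0.88, 0.90],
})

# Mean, spread and best fitness for each parameter combination.
summary = (
    toy_results
    .groupby(["swarm_size", "max_iterations"])["fitness"]
    .agg(["mean", "std", "min"])
)
print(summary)
```

The same `groupby` call should work unchanged on `results`, since it only relies on the `swarm_size`, `max_iterations` and `fitness` columns.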


Now, check the solution with the lowest fitness value:

In [6]:
# take a solution with minimum fitness
selected = results.loc[results['fitness'] == results['fitness'].min()]
selected

| | swarm_size | max_iterations | fitness | elevation | aspect | slope | horiz_dist_hydro | vert_dist_hydro | horiz_dist_road | hillshade_9 | ... | soil_type_30 | soil_type_31 | soil_type_32 | soil_type_33 | soil_type_34 | soil_type_35 | soil_type_36 | soil_type_37 | soil_type_38 | soil_type_39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 98 | 50 | 50 | 0.866096 | False | False | True | False | False | False | False | ... | False | True | False | True | True | False | False | True | True | True |
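
A row of booleans is hard to read, so it can help to list the names of the selected attributes. The sketch below does this on a made-up one-row dataframe with the same layout as `results`; the values and the reduced set of attribute columns are assumptions for illustration.

```python
import pandas as pd

# Made-up one-row result in the same layout as `results`.
toy_selected = pd.DataFrame([{
    "swarm_size": 50, "max_iterations": 50, "fitness": 0.87,
    "elevation": False, "aspect": False, "slope": True, "soil_type_39": True,
}])

# Drop the bookkeeping columns, then keep the names flagged True.
flags = toy_selected.iloc[0].drop(["swarm_size", "max_iterations", "fitness"])
chosen = [name for name, keep in flags.items() if keep]
print(chosen)  # ['slope', 'soil_type_39']
```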


Finally, store the results in a CSV file for later use:

In [7]:
# write the results in a csv
results.to_csv('results/pso_selected_attributes.csv', index=False)
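
`to_csv` does not create missing folders, so the `results/` directory has to exist before the call. A minimal sketch of a safe write and read-back follows; it uses a temporary directory and a made-up two-row dataframe so that it does not touch the real files.

```python
import os
import tempfile
import pandas as pd

# Made-up stand-in for the real `results` dataframe.
toy = pd.DataFrame({"fitness": [0.9, 0.88], "slope": [True, False]})

# A temporary directory stands in for the project folder.
with tempfile.TemporaryDirectory() as base:
    out_dir = os.path.join(base, "results")
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(out_dir, "pso_selected_attributes.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.dtypes)
```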