# Feature Selection using Particle Swarm Optimization (PSO)

The search space for this problem is the power set of the 54 attributes, which contains
$2^{54}$ subsets. Denote this space by $\mathcal{S}$. A local search algorithm can look for the desired subset of attributes without having
to visit every element of this colossal space.
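
To give a sense of scale, the short sketch below computes the size of $\mathcal{S}$ and a rough exhaustive-search time. The rate of one million subset evaluations per second is an assumed figure, chosen only for illustration.

```python
# Size of the search space: one on/off choice for each of the 54 attributes.
n_attributes = 54
n_subsets = 2 ** n_attributes

# Hypothetical evaluation rate, assumed only for illustration.
evals_per_second = 1_000_000
seconds_per_year = 60 * 60 * 24 * 365
years = n_subsets / (evals_per_second * seconds_per_year)

print(n_subsets)     # number of candidate subsets
print(round(years))  # years needed to enumerate them at the assumed rate
```

Even at that optimistic rate, full enumeration would take several centuries, which is why a heuristic search is used instead.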

Before submitting this problem to the PSO algorithm, it is necessary to define the way solutions (subsets) will
be represented and the function to rank the solutions, also known as the fitness function.

## Solution representation

For this problem, a solution $i$ is represented by a vector $x^i \in [0,1]^{54}$, in which each component $x^i_j$ is interpreted as the probability of selecting the corresponding attribute $j$. Denote $[0,1]^{54}$ by $\mathcal{S_2}$. A threshold $0 \leq \theta \leq 1$ decides
whether an attribute is selected: $x^i_j \geq \theta$ means that attribute $j$ is selected in solution $i$,
as implemented by the simple function below:

In [1]:
def is_selected(xij, theta):
    '''
    Determines if attribute j in solution i was selected
    based on the value on position j of vector x^i.
    '''
    return xij >= theta
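
Since a solution is a whole vector, the same test can also be applied to all components at once. The sketch below assumes NumPy is available and uses a hypothetical helper name, `selection_mask`, which is not part of the original notebook.

```python
import numpy as np

def selection_mask(x, theta):
    """Hypothetical helper: boolean mask of the components of x that reach theta."""
    return np.asarray(x) >= theta

# Toy solution with five components (values invented for illustration).
x = [0.95, 0.10, 0.80, 0.79, 0.50]
mask = selection_mask(x, theta=0.8)

print(mask)                  # which positions are selected
print(np.flatnonzero(mask))  # indices of the selected attributes
```

Because the comparison is `>=`, a component exactly equal to `theta` counts as selected, matching `is_selected`.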

## Fitness function

This function ranks each solution according to the objective of selecting the most appropriate attributes for this classification problem. Since the purpose is to determine the class of an instance from its attribute values, correlation is used as the similarity measure. The correlation between two random variables $X,Y$ is given by the
equation:

$$
corr(X,Y) = \frac{cov(X,Y)}{\sigma_X\sigma_Y}
$$

where $cov(X,Y)$ is the covariance between $X$ and $Y$, and $\sigma_X$, $\sigma_Y$ are their standard deviations. Larger subsets are also rewarded, in order to avoid losing solutions with potentially more information. The fitness function is therefore defined as a weighted sum of two terms: the minimum
absolute correlation between each selected attribute and the class, and the number of selected attributes. In other words, the fitness function $f:\mathcal{S_2}\to\mathbb{R}$ is given by the equation:

$$
f(x^i) = 0.8 \times \min_{j:\, x^i_j \geq \theta} |corr(a_j,cover\_type)| + 0.2 \times \eta_{[-1,1]}\Big(\sum_{j:\, x^i_j \geq \theta} 1\Big)
$$

where $a_j$ denotes the column of attribute $j$, $\theta$ is the selection threshold, and $\eta_{[-1,1]}(k) = (k - (-1))/(1 - (-1)) = (k+1)/2$ is a min-max rescaling with bounds $-1$ and $1$, applied to the number $k$ of selected attributes.
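
As a small worked example with made-up numbers (not taken from the dataset), take three attributes, $\theta = 0.8$, $x^i = (0.9, 0.1, 0.85)$, and class correlations $0.3$, $-0.5$ and $0.1$. Attributes 1 and 3 are selected, so the count is $2$. Assuming the count term is rescaled as $(k+1)/2$, which is what the implementation of `f` in this notebook computes, the fitness is:

$$
f(x^i) = 0.8 \times \min(|0.3|, |0.1|) + 0.2 \times \frac{2 + 1}{2} = 0.8 \times 0.1 + 0.2 \times 1.5 = 0.38
$$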

## Implementation

First of all, it is necessary to load the dataset:

In [2]:
import pandas as pd
# read the dataset
dataset = pd.read_csv("datasets/new_dataset_covertype.csv")
dataset.head()

Unnamed: 0,elevation,aspect,slope,horiz_dist_hydro,vert_dist_hydro,horiz_dist_road,hillshade_9,hill_shade_noon,hill_shade_15,horiz_dist_fire,...,soil_type_31,soil_type_32,soil_type_33,soil_type_34,soil_type_35,soil_type_36,soil_type_37,soil_type_38,soil_type_39,cover_type
0,3254,75,7,365,49,3034,228,228,133,4708,...,0,0,0,0,0,0,0,0,1,1
1,3149,341,16,216,30,3241,186,215,167,3085,...,0,0,0,0,0,0,0,0,0,1
2,2972,321,10,150,13,4796,194,230,176,4607,...,0,0,0,0,0,0,0,0,0,1
3,3097,265,21,430,60,3290,162,244,218,1503,...,0,0,0,0,0,0,0,0,0,1
4,3321,286,7,660,118,797,201,240,179,968,...,1,0,0,0,0,0,0,0,0,1


Correlations with respect to the class are fixed values, and can be calculated using the following:

In [3]:
# class correlations
class_correlations = dataset.corr(method="pearson")['cover_type']
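
The Pearson coefficient requested above is the same quantity as in the correlation equation. The following sketch checks this on toy data using NumPy only; the arrays are invented for illustration and do not come from the dataset.

```python
import numpy as np

# Toy data, invented for illustration: one attribute and a class label.
attribute = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
label = np.array([1.0, 1.0, 2.0, 2.0, 3.0])

# corr(X, Y) = cov(X, Y) / (sigma_X * sigma_Y), using sample statistics (ddof=1).
cov_xy = np.cov(attribute, label, ddof=1)[0, 1]
manual = cov_xy / (attribute.std(ddof=1) * label.std(ddof=1))

# Reference value from NumPy's built-in correlation matrix.
reference = np.corrcoef(attribute, label)[0, 1]

print(manual, reference)
```

The two values agree up to floating-point error, as long as the covariance and both standard deviations use the same `ddof`.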

Now, define the fitness function by:

In [4]:
import numpy as np

def f(x, theta=0.8, weights=np.array([0.8, 0.2])):
    '''
    Takes a position vector with one component in [0,1] per attribute
    and computes its fitness value.
    '''
    # Boolean list: True where the component reaches the threshold.
    selected_attrs = [is_selected(xj, theta) for xj in x]
    # Smallest absolute correlation between a selected attribute and the class.
    # Note: min() raises ValueError when no attribute is selected.
    min_corr = min([abs(class_correlations.iloc[i])
                    for i in range(dataset.shape[1] - 1) if selected_attrs[i]])
    # Min-max rescaling of the attribute count with bounds -1 and 1.
    count_attrs = (sum(selected_attrs) - (-1)) / (1 - (-1))
    return np.dot(weights, np.array([min_corr, count_attrs]))

The PSO implementation used here comes from the `pyswarm` library:

In [6]:
from pyswarm import pso

lb = np.zeros(dataset.shape[1]-1)
ub = np.ones(dataset.shape[1]-1)

xopt, fopt = pso(f, lb, ub)

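ValueError: min() arg is an empty sequence

The error is raised inside `f`: whenever no component of a particle reaches `theta`, the list passed to `min()` is empty.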
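
One way to avoid this failure is to guard the empty case. In addition, `pyswarm.pso` minimizes its objective, so a fitness meant to be maximized has to be negated before it is handed to the optimizer. The sketch below illustrates both points under stated assumptions: `make_objective`, the `empty_penalty` value and the toy correlations are hypothetical and not part of the original notebook.

```python
import numpy as np

def make_objective(abs_corrs, theta=0.8, weights=(0.8, 0.2), empty_penalty=1e6):
    """Hypothetical factory: build a minimization objective from |correlations|."""
    abs_corrs = np.asarray(abs_corrs, dtype=float)

    def objective(x):
        mask = np.asarray(x) >= theta
        if not mask.any():
            # No attribute selected: return a large cost instead of calling min().
            return empty_penalty
        min_corr = abs_corrs[mask].min()
        # Same rescaling of the attribute count as in f: (k - (-1)) / (1 - (-1)).
        count_term = (mask.sum() + 1) / 2
        fitness = weights[0] * min_corr + weights[1] * count_term
        # Negate so that a minimizer effectively maximizes the fitness.
        return -fitness

    return objective

# Toy absolute correlations, invented for illustration.
objective = make_objective([0.3, 0.5, 0.1])
print(objective(np.array([0.9, 0.1, 0.85])))  # two attributes selected
print(objective(np.array([0.1, 0.2, 0.3])))   # nothing selected, penalty returned
```

With the notebook's data, such an objective could be built from the absolute values of `class_correlations` for the attribute columns and passed to `pso` in place of `f`. The penalty value is arbitrary; it only needs to be worse than any reachable fitness.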

Now, check which attributes were selected:

In [None]:
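def _is_selected(x):
    return is_selected(x, theta=0.8)

print("Fitness: " + str(fopt))

# xopt has one component per attribute, so drop the class column
# (the last one) before applying the boolean mask.
selected = dataset.columns[:-1][list(map(_is_selected, xopt))]
selected.values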