# Feature Selection with R
Jaganadh Gopinadhan
http://jaganadhg.in

Feature selection is one of the important tasks in Machine Learning and Data Science. This notebook is a continuation of my notes on Feature Selection with sklearn. In this notebook we will discuss various feature selection utilities available in R and how to use them, with examples.

### Boruta Algorithm and Package 

One of the most widely used packages for feature selection in R is Boruta. This package wraps the randomForest package in R. A detailed note on the package and algorithm is available in the paper "Feature Selection with the Boruta Package" [1]. I am not going to discuss the algorithm here; instead, we will focus on usage.

We will use the Boston house price data here. Before starting the exercise, make sure that the required libraries are installed. To access the data we need the 'MASS' package; install it with 'install.packages("MASS")'. The next package we require is 'Boruta'; install it with 'install.packages("Boruta", dependencies=c("Depends","Suggests"))'.
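
The installation step can also be wrapped in a small check so that only missing packages are reported. This is a minimal sketch: the helper name `missing_packages` is hypothetical (it is not part of any package), and the package list simply assumes the libraries used later in this notebook.

```r
# Hypothetical helper (not from any package): return the names of the
# packages in 'pkgs' that are not installed on this machine.
missing_packages <- function(pkgs) {
    installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    pkgs[!installed]
}

# Packages used in this notebook
required <- c("MASS", "Boruta", "randomForest", "party", "mlbench")
to_install <- missing_packages(required)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so that running the cell never changes the local library by accident.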

First, let's load the data and examine it.

In [26]:
library(MASS)
data(Boston)
head(Boston)


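| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 2 | 0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 3 | 0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 4 | 0.03237 | 0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 5 | 0.06905 | 0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |
| 6 | 0.02985 | 0 | 2.18 | 0 | 0.458 | 6.43 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 | 28.7 |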


The data contains 13 predictor attributes, and 'medv' is the target variable. Now let's try to find the feature importance.
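
To make the split between predictors and target explicit, the columns can be inspected directly. This is a small sketch; the variable names `target` and `predictors` are only illustrative.

```r
library(MASS)
data(Boston)

# Boston has 14 columns: 13 predictors plus the target 'medv'
print(dim(Boston))

# Separate the predictor names from the target
target <- "medv"
predictors <- setdiff(names(Boston), target)
print(predictors)

# Column types and a quick summary of the target
str(Boston)
print(summary(Boston[[target]]))
```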

In [2]:
library(MASS)
library(Boruta)

data(Boston)


boruta_feat_imp <- function(data,formula){
    #Compute feature importance with Boruta algorithm
    #:param data: dataframe containing the data
    #:param formula: formula for randomForest 
    #:returns imp_feats: returns report on feature importance
    imp_feats <- Boruta(formula,data=data, doTrace = 2, ntree = 500)
    return(imp_feats)
} 

feats <- boruta_feat_imp(Boston,medv ~ .)
feats

 1. run of importance source...
 2. run of importance source...
 3. run of importance source...
 4. run of importance source...
 5. run of importance source...
 6. run of importance source...
 7. run of importance source...
 8. run of importance source...
 9. run of importance source...
 10. run of importance source...
 11. run of importance source...
Confirmed 13 attributes: age, black, chas, crim, dis and 8 more.


Boruta performed 11 iterations in 24.11364 secs.
 13 attributes confirmed important: age, black, chas, crim, dis and 8
more.
 No attributes deemed unimportant.

### What Just Happened?

We loaded the Boston data set first. Then we defined a generic function which accepts a data set and a formula as arguments. The function passes the data and formula to the Boruta algorithm, which eventually invokes the randomForest package. Once the computation is over, it returns the feature importance report. In the Boston case the algorithm found that all the features are important :-). Now it is time to check the same with some other data. Try the 'HouseVotes84' data from the 'mlbench' package.
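
A possible way to run the same check on 'HouseVotes84' is sketched below. It assumes the 'mlbench' package is installed. The data contains missing votes, which Boruta does not accept, so this sketch simply drops incomplete rows with na.omit (other ways of handling the missing values are possible). The seed and the variable names are arbitrary choices for illustration.

```r
library(Boruta)
library(mlbench)

data(HouseVotes84)

# Boruta cannot process NAs, so keep only the complete rows (an assumption
# made for this sketch; imputation would be an alternative).
votes <- na.omit(HouseVotes84)

set.seed(42)  # arbitrary seed, only for repeatability
vote_feats <- Boruta(Class ~ ., data = votes, doTrace = 0)
print(vote_feats)

# Per-attribute statistics (mean importance, decision, ...) sorted by importance
vote_stats <- attStats(vote_feats)
print(vote_stats[order(-vote_stats$meanImp), ])

# Names of the attributes confirmed as important
selected <- getSelectedAttributes(vote_feats, withTentative = FALSE)
print(selected)
```

attStats and getSelectedAttributes give the full per-feature picture, which the printed report abbreviates when many attributes are confirmed.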

### Feature Selection with randomForest

Now let's see how we can use the randomForest package to compute the feature importance.



In [9]:
library(MASS)
library(randomForest)

data(Boston)


rf_feat_imp <- function(data,formula){
    #Compute feature importance with randomForest algorithm
    #:param data: dataframe containing the data
    #:param formula: formula for randomForest 
    #:returns imp_feats: returns report on feature importance
    imp_feats <- randomForest(formula,data=data, mtry=2, ntree = 500,importance=TRUE)
    imp_feats_res <- importance(imp_feats,type=1)
    return(imp_feats_res)
} 

feats <- rf_feat_imp(Boston,medv ~ .)
feats

| | %IncMSE |
|---|---|
| crim | 18.09281 |
| zn | 6.570366 |
| indus | 13.27625 |
| chas | 5.204331 |
| nox | 17.64775 |
| rm | 32.02581 |
| age | 13.74005 |
| dis | 17.3245 |
| rad | 9.837306 |
| tax | 13.75862 |


### What Just Happened?
Similar to the previous example, we have created a generic function to compute the feature importance. The result is a matrix with one row per feature, giving the percentage increase in MSE ('%IncMSE', in this regression example). If we change to type=2 in the importance function, it will give the node impurity value ('IncNodePurity').
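
The two importance types can be compared side by side. The sketch below fits the forest once and queries both measures; the seed and variable names are arbitrary and only for illustration.

```r
library(MASS)
library(randomForest)

data(Boston)

set.seed(42)  # arbitrary seed, only for repeatability
rf_model <- randomForest(medv ~ ., data = Boston, mtry = 2, ntree = 500,
                         importance = TRUE)

# type = 1: permutation importance (%IncMSE for regression)
inc_mse <- importance(rf_model, type = 1)

# type = 2: total decrease in node impurity (IncNodePurity for regression)
node_purity <- importance(rf_model, type = 2)

# Rank the features by permutation importance, most important first
ranked <- sort(inc_mse[, "%IncMSE"], decreasing = TRUE)
print(ranked)
print(node_purity)
```

Sorting the result makes it easier to pick a cutoff when only the top few features are to be kept.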

### It is party time: Feature importance with the 'party' package

The next package we are exploring is 'party'.

In [None]:
library(party)
library(MASS)

data(Boston)


party_feat_imp <- function(data,formula){
    #Compute feature importance with party package
    #:param data: dataframe containing the data
    #:param formula: formula for cforest
    #:returns imp_feats: returns report on feature importance
    imp_feats <- cforest(formula,data=data, controls=cforest_unbiased(mtry=2,ntree=50))
    imp_feats_res <- varimp(imp_feats,conditional=TRUE)
    return(imp_feats_res)
} 

feats <- party_feat_imp(Boston,medv ~ .)
print(feats)

Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric

Loading required package: sandwich
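
varimp returns a named numeric vector, so the features can be ranked by sorting it. The sketch below uses the unconditional importance (conditional=FALSE, the default), which is considerably faster than the conditional variant; the seed and variable names are arbitrary choices for illustration.

```r
library(party)
library(MASS)

data(Boston)

set.seed(42)  # arbitrary seed, only for repeatability
cf_model <- cforest(medv ~ ., data = Boston,
                    controls = cforest_unbiased(mtry = 2, ntree = 50))

# Unconditional permutation importance: one value per predictor
cf_imp <- varimp(cf_model)

# Rank the predictors, most important first
cf_ranked <- sort(cf_imp, decreasing = TRUE)
print(cf_ranked)
```

Switching to varimp(cf_model, conditional=TRUE) accounts for correlation between predictors, at the cost of a longer run time.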


## References 
[1] Miron B. Kursa and Witold R. Rudnicki, Feature Selection with the Boruta Package, Journal of Statistical Software, September 2010, Volume 36, Issue 11. http://www.jstatsoft.org/v36/i11/paper

