### Using Boruta on the Madelon Data Set
Modified by: [Masaki Aota](aotamasakimail@gmail.com)<br>
[Original notebook](https://github.com/scikit-learn-contrib/boruta_py/blob/master/boruta/examples/Madalon_Data_Set.ipynb) Author: [Mike Bernico](mike.bernico@gmail.com)

This example demonstrates using Boruta to find all relevant features in the Madelon dataset, an artificial dataset used in the NIPS 2003 feature selection challenge and cited in the [Boruta paper](https://www.jstatsoft.org/article/view/v036i11/v36i11.pdf).


This dataset has 2000 observations and 500 features.  We will use Boruta to identify the features that are relevant to the classification task.

According to the [Madelon data set](https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/Dataset.pdf) description (p. 24), the data contains 20 important features.

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# BorutaPy is exposed by the `boruta` module of the PyPI package
from boruta import BorutaPy
from multiprocessing import cpu_count

In [2]:
def load_data():
    # URLs for the Madelon training data and labels in the UCI repository
    data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.data'
    label_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.labels'

    X_data = pd.read_csv(data_url, sep=" ", header=None)
    y_data = pd.read_csv(label_url, sep=" ", header=None)
    # Keep only the 500 feature columns; copy so that adding the target
    # column below does not write into a view of X_data
    data = X_data.iloc[:, 0:500].copy()
    data['target'] = y_data[0]
    return data
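
The `iloc[:, 0:500]` slice in `load_data` is presumably there because each row of the data file ends with a trailing space, which makes `pd.read_csv` with `sep=" "` produce one extra, empty column. The following is a minimal sketch of that effect on an in-memory string; the two sample rows and the trailing-space layout are assumptions for illustration, not the actual file contents.

```python
import io
import pandas as pd

# Two rows of three values each; every row ends with a trailing space
# (assumed layout, used here only to illustrate the parsing behaviour).
raw = "485 477 537 \n483 458 460 \n"

frame = pd.read_csv(io.StringIO(raw), sep=" ", header=None)
print(frame.shape)                     # one more column than values per row
print(frame.iloc[:, -1].isna().all())  # the extra column holds only NaN

# Dropping all-NaN columns is an alternative to slicing by position.
features = frame.dropna(axis=1, how="all")
print(features.shape)
```

Slicing by position, as the notebook does, is the simpler choice when the number of features is known in advance.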

In [3]:
data = load_data()
data.shape

(2000, 501)

In [4]:
data.head()

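| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 491 | 492 | 493 | 494 | 495 | 496 | 497 | 498 | 499 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 485 | 477 | 537 | 479 | 452 | 471 | 491 | 476 | 475 | 473 | ... | 481 | 477 | 485 | 511 | 485 | 481 | 479 | 475 | 496 | -1 |
| 1 | 483 | 458 | 460 | 487 | 587 | 475 | 526 | 479 | 485 | 469 | ... | 478 | 487 | 338 | 513 | 486 | 483 | 492 | 510 | 517 | -1 |
| 2 | 487 | 542 | 499 | 468 | 448 | 471 | 442 | 478 | 480 | 477 | ... | 481 | 492 | 650 | 506 | 501 | 480 | 489 | 499 | 498 | -1 |
| 3 | 480 | 491 | 510 | 485 | 495 | 472 | 417 | 474 | 502 | 476 | ... | 480 | 474 | 572 | 454 | 469 | 475 | 482 | 494 | 461 | 1 |
| 4 | 484 | 502 | 528 | 489 | 466 | 481 | 402 | 478 | 487 | 468 | ... | 479 | 452 | 435 | 486 | 508 | 481 | 504 | 495 | 511 | 1 |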


In [5]:
data['target'].value_counts()

 1    1000
-1    1000
Name: target, dtype: int64

#### train, test split

In [6]:
y = data['target']
X = data.drop(columns='target')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

#### Train and evaluate a random forest on the data as-is

In [7]:
rf=RandomForestClassifier(
    n_estimators=500,
    random_state=42,
    n_jobs=int(cpu_count()/2)
)

In [8]:
rf.fit(X_train.values, y_train.values)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=12,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [9]:
print(rf.classes_)
print(confusion_matrix(y_test.values, rf.predict(X_test.values), labels=rf.classes_))

[-1  1]
[[179  70]
 [ 96 155]]


#### Classify after feature selection with Boruta

Boruta conforms to the scikit-learn API and can be used in a Pipeline as well as on its own. Here we demonstrate standalone operation.

First we instantiate an estimator for Boruta to use. Then we instantiate a Boruta object.

In [10]:
rf = RandomForestClassifier(n_jobs=int(cpu_count()/2), max_depth=7)
# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', two_step=False, verbose=2, random_state=42)
# Without two_step, i.e. using the Bonferroni correction, the selection works better
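
For readers who want a feel for what the selector does with the estimator, the core idea described in the Boruta paper is to compare each real feature's importance against shuffled "shadow" copies of the features. The sketch below runs a single round of that comparison on a small synthetic dataset. It is not BorutaPy's implementation: the dataset, the names `X_demo`, `X_shadow` and `hits`, and the single-round setup are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Small synthetic problem (an assumption for illustration, not Madelon).
# With shuffle=False the 3 informative features are the first 3 columns.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=42,
)

# Shadow features: every column permuted independently, which breaks any
# relationship with the target while keeping each marginal distribution.
X_shadow = np.column_stack([rng.permutation(col) for col in X_demo.T])
X_aug = np.hstack([X_demo, X_shadow])

forest = RandomForestClassifier(n_estimators=200, max_depth=7, random_state=42)
forest.fit(X_aug, y_demo)

n_real = X_demo.shape[1]
real_importance = forest.feature_importances_[:n_real]
shadow_max = forest.feature_importances_[n_real:].max()

# A "hit" is a real feature that beats the best shadow in this round.
hits = real_importance > shadow_max
print(hits)
```

Boruta repeats rounds like this and accumulates hits per feature before deciding, which is why the fit below reports its progress iteration by iteration.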

Once built, we can use this object to identify the relevant features in our dataset.

In [11]:
feat_selector.fit(X_train.values,y_train.values)

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0

Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0

Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0

Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0

Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0

Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0

Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0

Iteration: 	8 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0

Iteration: 	9 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0

Iteration: 	10 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0

Iteration: 	11 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0

Iteration: 	12 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0

Iteration: 	13 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0

Iteration: 	14 / 100
Confirmed: 	18
Tentative: 	8
Rejected: 	474

Iteration: 	15 / 100
Confirmed: 	18
Tentative: 	8
Rejected: 	474

Iteration: 	16 / 100
Confirmed: 

BorutaPy(alpha=0.05,
     estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=7, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=90, n_jobs=12,
            oob_score=False,
            random_state=<mtrand.RandomState object at 0x7f1bbe335f30>,
            verbose=0, warm_start=False),
     max_iter=100, n_estimators='auto', perc=100,
     random_state=<mtrand.RandomState object at 0x7f1bbe335f30>,
     two_step=False, verbose=2)
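
The log above makes no decision for 13 iterations and then confirms or rejects most features at iteration 14. One possible explanation, assuming BorutaPy tests each feature's hit count with a one-sided binomial test at p = 0.5 and applies a Bonferroni correction over all 500 features when `two_step=False` (an assumption about the library internals that this notebook does not confirm), is that no outcome can be significant any earlier. A short arithmetic sketch under that assumption:

```python
alpha = 0.05       # significance level shown in the BorutaPy repr
n_features = 500   # features still undecided during the early iterations
threshold = alpha / n_features  # Bonferroni-corrected level (assumed)

# Under the null hypothesis each iteration is treated as a fair coin flip,
# so a feature that hit (or missed) in all n iterations has a one-sided
# p-value of 0.5 ** n. Find the first n where that can pass the threshold.
n = 1
while 0.5 ** n > threshold:
    n += 1

print(n)  # → 14
```

Under these assumptions, 0.5 ** 13 is still above 0.05 / 500, while 0.5 ** 14 falls below it, which is consistent with the first decisions appearing at iteration 14.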

Boruta succeeded in selecting 19 features.

Boruta has confirmed only a few features as useful. When our run ended, Boruta was undecided on 2 features.

We can inspect `.support_` to see which features were selected. `.support_` is an array of booleans that we can use to slice our feature matrix so that it includes only the relevant columns. Of course, `.transform` can also be used, as expected in the scikit-learn API.
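
To make the boolean-mask slicing concrete, here is a minimal sketch on a toy matrix. The names `X_toy` and `support` and the mask values are hypothetical stand-ins for the real feature matrix and `feat_selector.support_`.

```python
import numpy as np
import pandas as pd

# Toy feature matrix with four columns and a hypothetical support mask.
X_toy = pd.DataFrame(np.arange(12).reshape(3, 4))
support = np.array([False, True, False, True])

# Slicing a DataFrame keeps the original column labels ...
selected_df = X_toy.iloc[:, support]
# ... while slicing the underlying NumPy array returns a plain array.
selected_arr = X_toy.values[:, support]

print(list(selected_df.columns))  # [1, 3]
print(selected_arr.shape)         # (3, 2)
print(np.flatnonzero(support))    # integer indices of the selected columns
```

Keeping the DataFrame form is convenient here because the surviving column labels show which of the original features were kept.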

In [12]:
# check selected features
print(feat_selector.support_)

[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False  True False False False False False False False
 False False False False False False False False False False False False
  True False False False False False False False False False False False
 False False False False  True False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False  True False False
 False False False False False False False False False False False False
 False False False False False False False False  True False False False
 False False False False False False False False False False False False
 False False False False False False False False False  True False False
 False False False False False False False False Fa

In [13]:
# Select the chosen features from our DataFrame.
X_train_selected = X_train.iloc[:,feat_selector.support_]
X_test_selected = X_test.iloc[:,feat_selector.support_]
X_test_selected.head()

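| | 28 | 48 | 64 | 105 | 128 | 153 | 241 | 281 | 318 | 336 | 338 | 378 | 433 | 442 | 453 | 472 | 475 | 493 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1860 | 498 | 513 | 531 | 560 | 483 | 566 | 526 | 517 | 533 | 522 | 397 | 539 | 539 | 327 | 424 | 399 | 530 | 426 |
| 353 | 474 | 513 | 405 | 590 | 485 | 370 | 533 | 412 | 467 | 378 | 532 | 564 | 396 | 550 | 423 | 511 | 540 | 435 |
| 1333 | 492 | 542 | 586 | 572 | 486 | 664 | 486 | 561 | 517 | 577 | 384 | 582 | 622 | 358 | 571 | 422 | 469 | 594 |
| 905 | 489 | 421 | 412 | 605 | 487 | 659 | 518 | 557 | 514 | 378 | 334 | 414 | 616 | 429 | 597 | 452 | 521 | 617 |
| 1289 | 491 | 479 | 557 | 530 | 483 | 569 | 426 | 501 | 523 | 571 | 508 | 487 | 530 | 524 | 620 | 494 | 374 | 643 |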


In [14]:
rf=RandomForestClassifier(
    n_estimators=500,
    random_state=42,
    n_jobs=int(cpu_count()/2)
)
rf.fit(X_train_selected.values, y_train.values)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=12,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [15]:
print(rf.classes_)
print(confusion_matrix(y_test.values, rf.predict(X_test_selected.values), labels=rf.classes_))

[-1  1]
[[219  30]
 [ 29 222]]
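
To compare the two models with a single number, accuracy can be computed from each confusion matrix as the diagonal sum divided by the total. The sketch below copies the two matrices printed in this notebook; the helper name `accuracy` is introduced here only for illustration.

```python
import numpy as np

# Confusion matrices copied from the outputs of cells 9 and 15
# (rows: true class, columns: predicted class, label order [-1, 1]).
cm_all = np.array([[179, 70], [96, 155]])
cm_selected = np.array([[219, 30], [29, 222]])

def accuracy(cm):
    """Fraction of correct predictions: diagonal sum over total count."""
    return np.trace(cm) / cm.sum()

print(accuracy(cm_all))       # 0.668
print(accuracy(cm_selected))  # 0.882
```

On this particular train/test split, restricting the forest to the Boruta-selected columns raises test accuracy from 0.668 to 0.882.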
