 # Boruta
 
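- Choosing the right features is a critical step in any ML task. With too many features a model tends to overfit, and its performance on the data it actually has to predict degrades.
- Classical statistical approaches such as stepwise selection can be too costly in computation and time for large datasets.
- Selecting features by hand is also risky without a solid understanding of the domain.
- Boruta is an algorithm that originated as an R package (`BorutaPy`, used below, is its Python implementation). It can be used with any model that provides its own feature importances, such as tree-based and boosting models.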

***

## How It Works

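- Copy every original column to create a new set of features (shadow features).
- The copies hold exactly the same values as the originals, so shuffle (permute) each one at random to break any relationship with the target.
- Concatenate the original and shadow features, then train a model that provides feature importances.
- Take the largest importance among the shadow features as the threshold. An original feature whose importance falls below it is a candidate for removal.
- Repeat the steps above. Over the iterations, features that consistently beat the threshold are confirmed, and features that consistently fail to beat it are rejected.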
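
The loop above can be sketched in plain Python. This is only an illustration under several assumptions: the importance score is a stand-in (absolute Pearson correlation with the target) rather than a model's feature importance, the helper names `abs_correlation`, `binom_tail`, and `boruta_sketch` are hypothetical, and the sketch skips two things Boruta does in practice (dropping rejected features between iterations and correcting for multiple testing). The final decision uses a binomial test on how often each feature beat the best shadow feature.

```python
import math
import random


def abs_correlation(xs, ys):
    """Stand-in importance score: |Pearson correlation| with the target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no information
    return abs(cov) / math.sqrt(vx * vy)


def binom_tail(hits, n, p=0.5):
    """P(X >= hits) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(hits, n + 1))


def boruta_sketch(columns, y, n_iter=50, alpha=0.05, seed=0):
    """columns: dict of name -> list of values. Returns name -> decision."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_iter):
        # Steps 1-2: copy each column and shuffle the copy (shadow feature).
        shadow_scores = []
        for values in columns.values():
            shadow = values[:]
            rng.shuffle(shadow)
            shadow_scores.append(abs_correlation(shadow, y))
        # Steps 3-4: the best shadow score is the threshold for this round.
        threshold = max(shadow_scores)
        for name, values in columns.items():
            if abs_correlation(values, y) > threshold:
                hits[name] += 1
    # Step 5: decide from the hit counts accumulated over all rounds.
    decisions = {}
    for name, h in hits.items():
        if binom_tail(h, n_iter) < alpha:             # beats shadows too often
            decisions[name] = "confirmed"
        elif binom_tail(n_iter - h, n_iter) < alpha:  # P(X <= h), by symmetry
            decisions[name] = "rejected"
        else:
            decisions[name] = "tentative"
    return decisions


# Toy data (hypothetical): one informative, one random, one constant column.
data_rng = random.Random(42)
signal = [float(i) for i in range(200)]
y = [3.0 * x + data_rng.gauss(0, 1) for x in signal]
columns = {
    "signal": signal,
    "noise": [data_rng.gauss(0, 1) for _ in range(200)],
    "constant": [1.0] * 200,
}
decisions = boruta_sketch(columns, y)
print(decisions)
```

In this toy setup the informative column beats every shuffled copy in every round, while the constant column never does, so they end up confirmed and rejected respectively. The outcome for the random column depends on the draw.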
***
## Example

In [80]:
import pandas as pd
from boruta import BorutaPy
# NOTE: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_boston, load_breast_cancer
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

### 1) Regression

In [7]:
boston = load_boston()

In [12]:
data = pd.DataFrame(boston.data, columns = boston.feature_names)

In [14]:
data['target'] = boston.target

In [16]:
data.head()

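|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |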


In [48]:
X = data.drop('target', axis = 1).values
y = data.target.values

In [66]:
rf = RandomForestRegressor(max_depth = 4, n_estimators = 150)

In [67]:
boruta_selector = BorutaPy(rf, n_estimators = 'auto', max_iter = 20, verbose = 1)

In [68]:
boruta_selector.fit(X, y)

Iteration: 1 / 20
Iteration: 2 / 20
Iteration: 3 / 20
Iteration: 4 / 20
Iteration: 5 / 20
Iteration: 6 / 20
Iteration: 7 / 20
Iteration: 8 / 20
Iteration: 9 / 20
Iteration: 10 / 20
Iteration: 11 / 20
Iteration: 12 / 20
Iteration: 13 / 20
Iteration: 14 / 20
Iteration: 15 / 20
Iteration: 16 / 20
Iteration: 17 / 20
Iteration: 18 / 20
Iteration: 19 / 20


BorutaPy finished running.

Iteration: 	20 / 20
Confirmed: 	8
Tentative: 	1
Rejected: 	3


BorutaPy(estimator=RandomForestRegressor(max_depth=4, n_estimators=111,
                                         random_state=RandomState(MT19937) at 0x7F873108D140),
         max_iter=20, n_estimators='auto',
         random_state=RandomState(MT19937) at 0x7F873108D140, verbose=1)

In [69]:
boruta_selector.support_

array([ True, False, False, False,  True,  True,  True,  True, False,
        True,  True, False,  True])

The `support_` attribute is a boolean mask over all features, with `True` marking the selected ones.

In [86]:
use_cols = data.columns[:-1][boruta_selector.support_].tolist()

In [76]:
boruta_selector.n_features_

8

The `n_features_` attribute stores the number of features that were finally selected.

In [87]:
use_cols

['CRIM', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'LSTAT']

In [77]:
boruta_selector.ranking_

array([1, 6, 3, 5, 1, 1, 1, 1, 4, 1, 1, 2, 1])

The `ranking_` attribute holds the rank of each feature: 1 means selected (confirmed), 2 means tentative, and 3 or higher means rejected (not to be used).
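
To make the two arrays easier to read, the sketch below pairs them with the Boston column names. The arrays are copied from the outputs above rather than read from a fitted selector, and the `status` helper is a hypothetical name introduced only for this illustration.

```python
from itertools import compress

feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
# Values copied from boruta_selector.support_ and boruta_selector.ranking_ above.
support = [True, False, False, False, True, True, True, True, False,
           True, True, False, True]
ranking = [1, 6, 3, 5, 1, 1, 1, 1, 4, 1, 1, 2, 1]


def status(rank):
    """Translate a Boruta rank into a readable label."""
    if rank == 1:
        return "confirmed"
    if rank == 2:
        return "tentative"
    return "rejected"


# Group the feature names by their label.
by_status = {"confirmed": [], "tentative": [], "rejected": []}
for name, rank in zip(feature_names, ranking):
    by_status[status(rank)].append(name)

# The boolean mask picks exactly the rank-1 features.
selected = list(compress(feature_names, support))

for label, names in by_status.items():
    print(f"{label:>9}: {names}")
```

The rank-1 group matches the mask, `B` is the single tentative feature, and `ZN`, `INDUS`, `CHAS`, and `RAD` are rejected.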

In [81]:
rf_cv = cross_val_score(rf, X, y, scoring = 'neg_mean_squared_error', cv = KFold(n_splits = 5))

In [83]:
abs(rf_cv).mean()

23.76584917569792

With all features, the RandomForest model's mean MSE over 5 folds is 23.77.

In [92]:
boruta_X = data[use_cols]

In [93]:
boruta_cv = cross_val_score(rf, boruta_X, y, scoring = 'neg_mean_squared_error', cv = KFold(n_splits = 5))

In [94]:
abs(boruta_cv).mean()

23.058991700529923

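With only the features selected by Boruta, the RandomForest model's mean MSE over 5 folds is 23.06, slightly lower (better) than the 23.77 obtained with all features. The gap is small, the forest has no fixed `random_state`, and the features were selected on the full dataset before cross-validation, so this comparison is suggestive rather than conclusive.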
***
### 2) Classification

In [95]:
breast_cancer = load_breast_cancer()

In [96]:
data = pd.DataFrame(breast_cancer.data, columns = breast_cancer.feature_names)

In [97]:
data['target'] = breast_cancer.target

In [98]:
data.head()

|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | radius error | texture error | perimeter error | area error | smoothness error | compactness error | concavity error | concave points error | symmetry error | fractal dimension error | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | 1.095 | 0.9053 | 8.589 | 153.4 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | 0.5435 | 0.7339 | 3.398 | 74.08 | 0.005225 | 0.01308 | 0.0186 | 0.0134 | 0.01389 | 0.003532 | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | 0.7456 | 0.7869 | 4.585 | 94.03 | 0.00615 | 0.04006 | 0.03832 | 0.02058 | 0.0225 | 0.004571 | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | 0.4956 | 1.156 | 3.445 | 27.23 | 0.00911 | 0.07458 | 0.05661 | 0.01867 | 0.05963 | 0.009208 | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | 0.7572 | 0.7813 | 5.438 | 94.44 | 0.01149 | 0.02461 | 0.05688 | 0.01885 | 0.01756 | 0.005115 | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


In [100]:
X = data.drop('target', axis = 1).values
y = data.target.values

In [109]:
rf = RandomForestClassifier(max_depth = 4, n_estimators = 200)

In [112]:
boruta_selector = BorutaPy(rf, n_estimators = 'auto', max_iter = 100)

In [113]:
boruta_selector.fit(X, y)

BorutaPy(estimator=RandomForestClassifier(max_depth=4, n_estimators=193,
                                          random_state=RandomState(MT19937) at 0x7F873108D140),
         n_estimators='auto',
         random_state=RandomState(MT19937) at 0x7F873108D140)

In [114]:
boruta_selector.support_

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True, False,  True,  True, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True])

In [115]:
use_cols = data.columns[:-1][boruta_selector.support_].tolist()

In [116]:
boruta_selector.n_features_

27

In [117]:
use_cols

['mean radius',
 'mean texture',
 'mean perimeter',
 'mean area',
 'mean smoothness',
 'mean compactness',
 'mean concavity',
 'mean concave points',
 'mean symmetry',
 'mean fractal dimension',
 'radius error',
 'perimeter error',
 'area error',
 'compactness error',
 'concavity error',
 'concave points error',
 'fractal dimension error',
 'worst radius',
 'worst texture',
 'worst perimeter',
 'worst area',
 'worst smoothness',
 'worst compactness',
 'worst concavity',
 'worst concave points',
 'worst symmetry',
 'worst fractal dimension']

In [118]:
boruta_selector.ranking_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 3, 1, 1, 1, 2, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1])

In [129]:
rf_cv = cross_val_score(rf, X, y, scoring = 'roc_auc', cv = StratifiedKFold(n_splits = 5))

In [130]:
rf_cv.mean()

0.9907337435465507

In [121]:
boruta_X = data[use_cols]

In [127]:
boruta_cv = cross_val_score(rf, boruta_X, y, scoring = 'roc_auc', cv = StratifiedKFold(n_splits = 5))

In [128]:
boruta_cv.mean()

0.9895498312874663
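
For the classification task, Boruta kept 27 of the 30 features, and the 5-fold mean ROC AUC moved from 0.9907 to 0.9895. To put both experiments on the same footing, the sketch below computes the relative change for each. The scores are copied from the outputs above, and `relative_change` is a hypothetical helper name. Since neither forest fixes a `random_state`, differences this small may well be within run-to-run noise.

```python
def relative_change(baseline, candidate):
    """Signed change of candidate relative to baseline."""
    return (candidate - baseline) / baseline


# (score with all features, score with Boruta-selected features)
results = {
    "regression MSE (lower is better)": (23.76584917569792, 23.058991700529923),
    "classification ROC AUC (higher is better)": (0.9907337435465507, 0.9895498312874663),
}

changes = {}
for name, (all_features, boruta_features) in results.items():
    changes[name] = relative_change(all_features, boruta_features)
    print(f"{name}: {all_features:.4f} -> {boruta_features:.4f} "
          f"({changes[name]:+.2%})")
```

The MSE drops by roughly 3% with 8 of 13 features, while the AUC changes by about a tenth of a percent with 27 of 30 features, so in both cases the reduced feature set performs about as well as the full one.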