### Iterative Feature Selection
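* Repeatedly adjust the number of features to find the best-performing subset.
* Either start from 1 feature and keep adding, or start from all n features and keep removing, and keep the feature set that performs best.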
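
The add-one-at-a-time and remove-one-at-a-time strategies above can be sketched with scikit-learn's SequentialFeatureSelector. This is a minimal illustration, not the notebook's own method: the synthetic data from make_regression and the choice of 3 features are assumptions made only for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): 10 features, 3 informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

# Forward: start with no features and add the best one at each round.
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
).fit(X, y)

# Backward: start with all features and drop the least useful one at each round.
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="backward", cv=5
).fit(X, y)

forward_mask = forward.get_support()    # boolean mask over the 10 columns
backward_mask = backward.get_support()
print(forward_mask)
print(backward_mask)
```

Each round scores candidate subsets by cross-validation, so this is noticeably more expensive than a single model fit.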

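### RFE (Recursive Feature Elimination) : commonly used to remove features
* Needs an estimator that exposes feature importances or coefficients, e.g. Tree, LR
* n_features_to_select : how many features to keep (defaults to half of the features)
* step : how many features to remove at each iteration (a value between 0 and 1 is treated as a fraction of the features)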
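
As a small sketch of how the two parameters interact, the example below runs RFE on synthetic data. The dataset and the values n_features_to_select=5 and step=3 are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): 20 features, 5 informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

# Remove 3 features per iteration until 5 remain: 20 -> 17 -> 14 -> 11 -> 8 -> 5.
rfe = RFE(LinearRegression(), n_features_to_select=5, step=3)
rfe.fit(X, y)

X_selected = rfe.transform(X)   # keeps only the selected columns
print(X_selected.shape)
print(rfe.support_)             # boolean mask of kept features
print(rfe.ranking_)             # 1 = kept; larger values were eliminated earlier
```

A larger step finishes in fewer fits but makes coarser decisions, since several features are dropped based on one ranking.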

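### Cautions when selecting features
* Is the feature also available at prediction time?
* When real-time prediction is required, is the feature too expensive to generate?
* Is the scale consistent? Or can it be expressed as a ratio?
* What happens with newly appearing values in categorical data?
* Highly extreme distributions -> binarize with a threshold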
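
Two of these points can be illustrated in code: handling a category that was never seen during training, and binarizing a heavily skewed feature with a threshold. The sketch below is one possible approach using scikit-learn's OneHotEncoder and Binarizer. The toy values and the threshold of 0 are assumptions, not taken from the house price data.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, OneHotEncoder

# Hypothetical categorical feature; "stone" never appears at fit time.
train_cat = np.array([["brick"], ["wood"], ["brick"]])
new_cat = np.array([["stone"]])

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at prediction time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cat)
encoded_new = encoder.transform(new_cat).toarray()
print(encoded_new)

# Hypothetical heavily skewed feature: mostly zero with a few huge values.
skewed = np.array([[0.0], [0.0], [1.0], [250.0], [10000.0]])

# Values strictly greater than the threshold become 1, the rest 0.
binarized = Binarizer(threshold=0.0).fit_transform(skewed)
print(binarized.ravel())
```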

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import hourse_price_preprocessor as hpp
import os
import numpy as np
import pandas as pd

In [3]:
DATA_DIR = "house_price"
TEST_FILENAME = "test.csv"
TRAIN_FILENAME = "train.csv"

In [4]:
test_file = os.path.join(DATA_DIR, TEST_FILENAME)
train_file = os.path.join(DATA_DIR, TRAIN_FILENAME)

In [5]:
X_train, X_test, y_train, test_id_idx = hpp.get_train_test_split_dataset(train_file, test_file)

In [6]:
X_train.shape, y_train.shape, X_test.shape, test_id_idx.shape

((1460, 67), (1460,), (1459, 67), (1459,))

In [18]:
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

In [19]:
y_train

array([208500, 181500, 223500, ..., 266500, 142125, 147500], dtype=int64)

In [26]:
# n_features_to_select is left at its default, so RFE keeps half of the 67 features
select = RFE(RandomForestRegressor(n_estimators=100))
select.fit(X_train, y_train)

# transform training set
X_train_selected = select.transform(X_train)

In [27]:
X_train_selected.shape

(1460, 33)

In [28]:
np.mean(cross_val_score(LinearRegression(), X_train_selected, y_train, scoring="r2"))

0.7826559065361373

In [29]:
np.mean(cross_val_score(LinearRegression(), X_train, y_train, scoring="r2"))

-9.320642852685059e+20

In [18]:
select.get_support()


array([False,  True, False, False, False,  True, False, False, False,
       False, False, False, False, False, False,  True,  True,  True,
        True, False, False,  True,  True,  True,  True,  True,  True,
        True, False, False, False, False, False, False, False,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
       False,  True, False, False,  True,  True, False, False,  True,
        True,  True,  True,  True,  True,  True, False, False, False,
       False,  True, False, False])

In [19]:
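# pvalues_ is exposed by univariate selectors (e.g. SelectPercentile), not by RFE, which provides support_ and ranking_
select.pvalues_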

array([5.57502768e-001, 5.02124571e-015, 9.99995404e-001, 9.61452045e-002,
       9.78688681e-001, 1.56588106e-018, 8.53544002e-001, 6.08632426e-004,
       1.00000000e+000, 1.00000000e+000, 9.40689416e-001, 4.47384728e-003,
       7.28714288e-001, 9.99971838e-001, 1.00000000e+000, 0.00000000e+000,
       2.74137086e-080, 2.16603844e-032, 9.80605930e-008, 6.03435379e-005,
       5.21139084e-001, 3.28224819e-014, 3.66786948e-057, 3.40254555e-116,
       6.58540871e-136, 0.00000000e+000, 1.56650821e-015, 1.03867093e-011,
       7.86193808e-001, 1.48016042e-002, 9.99660590e-001, 9.72965311e-001,
       9.99999809e-001, 1.62972263e-002, 5.06478431e-003, 9.89489137e-057,
       1.29089040e-029, 2.28852795e-016, 1.17708063e-020, 5.80269212e-013,
       5.31304381e-001, 2.19085586e-007, 9.28134677e-030, 2.04558302e-030,
       3.34282253e-012, 9.82841633e-001, 1.01652264e-060, 4.32652907e-002,
       9.86567394e-001, 5.30633056e-040, 1.41075810e-006, 2.14400746e-003,
       9.62826203e-001, 1