There are many different methods for selecting the important features from a dataset. In this notebook I will show a quick way to select important features using Boruta.

Boruta is an all-relevant feature selection method: it tries to find every feature that carries information useful for an accurate classification. You can read more about Boruta [here](http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/).
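
To give a rough idea of the mechanism: Boruta compares each real feature against "shadow" features, which are shuffled copies of the original columns. Shuffling keeps each column's distribution but breaks any link to the target. The snippet below is only a minimal sketch of that shuffling step on a toy array (`X_toy` and the helper `make_shadow_features` are made up for illustration); it is not BorutaPy's actual implementation.

```python
import numpy as np

def make_shadow_features(X, seed=0):
    """Return a copy of X in which every column is shuffled independently."""
    rng = np.random.default_rng(seed)
    # rng.permutation returns a shuffled copy, so X itself is left untouched.
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

X_toy = np.arange(12, dtype=float).reshape(6, 2)  # 6 rows, 2 hypothetical features
X_shadow = make_shadow_features(X_toy)

# Real and shadow columns are stacked side by side before fitting the forest,
# so the importance of each real feature can be compared with that of pure noise.
X_extended = np.hstack([X_toy, X_shadow])
print(X_extended.shape)
```

A feature that cannot beat its shuffled competitors over repeated iterations is unlikely to carry real signal, which is the intuition behind the "Confirmed / Tentative / Rejected" counts printed later.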

Let's start by doing all necessary imports.

In [1]:
import numpy as np 
import pandas as pd 
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

Next we load only the 'application_train' data, since the goal here is just to demonstrate Boruta.

In [2]:
train = pd.read_csv("../input/application_train.csv")
train.shape

(307511, 122)

All categorical values will be one-hot encoded.

In [3]:
train = pd.get_dummies(train, drop_first=True, dummy_na=True)
train.shape

(307511, 246)
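
To make the effect of these two options concrete, here is a small sketch on a made-up frame (the column names `amount` and `color` are hypothetical, not taken from the competition data). `drop_first=True` drops the first level of each categorical column, and `dummy_na=True` adds an indicator column for missing categories; numeric columns are passed through unchanged.

```python
import pandas as pd

toy = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0],    # numeric: left as-is
    "color": ["red", "blue", None],  # categorical with a missing value
})

encoded = pd.get_dummies(toy, drop_first=True, dummy_na=True)

# Levels are sorted, so 'blue' becomes the dropped reference level;
# 'color_nan' flags the row where the category was missing.
print(list(encoded.columns))
```

Note that `dummy_na=True` only creates indicators for the categorical columns being encoded. Missing values in numeric columns are still present afterwards and need separate handling.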

Get all feature names from the dataset, excluding the target and the ID column.

In [4]:
features = [f for f in train.columns if f not in ['TARGET','SK_ID_CURR']]
len(features)

244

Replace all missing values with the column mean and clip extreme values to the range [-1e9, 1e9].

In [5]:
train[features] = train[features].fillna(train[features].mean()).clip(-1e9,1e9)
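
As a quick check of what this one-liner does, here is a sketch on a toy frame with made-up values (the columns `a` and `b` are hypothetical). Passing a Series of means to `fillna` fills each column with its own mean, and `clip` runs afterwards, so an imputed value outside the range is capped as well.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],   # mean of the observed values: 2.0
    "b": [np.nan, 2e10, 0.0],  # mean of the observed values: 1e10
})

# fillna with a Series matches on column labels, so each column gets its own mean;
# clip then caps every value to the range [-1e9, 1e9].
filled = toy.fillna(toy.mean()).clip(-1e9, 1e9)
print(filled)
```

In column `b` both the original 2e10 and the imputed mean of 1e10 end up as 1e9, because the clipping is applied to the already filled data.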

Get the final dataset *X* and labels *Y*.

In [6]:
X = train[features].values
Y = train['TARGET'].values.ravel()

Next we set up the *RandomForestClassifier* as the estimator for Boruta. The Boruta GitHub page advises a tree *max_depth* between 3 and 7.

In [7]:
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

Next we set up Boruta. It follows the *scikit-learn* interface as closely as possible, so we can use *fit(X, y)*, *transform(X)* and *fit_transform(X, y)*. I'll let it run for a maximum of *max_iter = 50* iterations. With *perc = 90*, the importance threshold is set at the 90th percentile of the importances of the randomized "shadow" features instead of their maximum. The lower the threshold, the more features will be selected. I usually use a percentage between 80 and 90.

In [8]:
boruta_feature_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=4242, max_iter = 50, perc = 90)
boruta_feature_selector.fit(X, Y)

Iteration: 	1 / 50
Confirmed: 	0
Tentative: 	244
Rejected: 	0
Iteration: 	2 / 50
Confirmed: 	0
Tentative: 	244
Rejected: 	0
Iteration: 	3 / 50
Confirmed: 	0
Tentative: 	244
Rejected: 	0
Iteration: 	4 / 50
Confirmed: 	0
Tentative: 	244
Rejected: 	0
Iteration: 	5 / 50
Confirmed: 	0
Tentative: 	244
Rejected: 	0
Iteration: 	6 / 50
Confirmed: 	0
Tentative: 	244
Rejected: 	0
Iteration: 	7 / 50
Confirmed: 	0
Tentative: 	244
Rejected: 	0
Iteration: 	8 / 50
Confirmed: 	97
Tentative: 	16
Rejected: 	131
Iteration: 	9 / 50
Confirmed: 	97
Tentative: 	16
Rejected: 	131
Iteration: 	10 / 50
Confirmed: 	97
Tentative: 	16
Rejected: 	131
Iteration: 	11 / 50
Confirmed: 	97
Tentative: 	16
Rejected: 	131
Iteration: 	12 / 50
Confirmed: 	98
Tentative: 	15
Rejected: 	131
Iteration: 	13 / 50
Confirmed: 	98
Tentative: 	15
Rejected: 	131
Iteration: 	14 / 50
Confirmed: 	98
Tentative: 	15
Rejected: 	131
Iteration: 	15 / 50
Confirmed: 	98
Tentative: 	15
Rejected: 	131
Iteration: 	16 / 50
Confirmed: 	98
Tentative: 	1

BorutaPy(alpha=0.05,
         estimator=RandomForestClassifier(bootstrap=True,
                                          class_weight='balanced',
                                          criterion='gini', max_depth=5,
                                          max_features='auto',
                                          max_leaf_nodes=None,
                                          min_impurity_decrease=0.0,
                                          min_impurity_split=None,
                                          min_samples_leaf=1,
                                          min_samples_split=2,
                                          min_weight_fraction_leaf=0.0,
                                          n_estimators=288, n_jobs=-1,
                                          oob_score=False,
                                          random_state=<mtrand.RandomState object at 0x7fe1a8bf9090>,
                                          verbose=0, warm_start=False),
         max_iter=
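
As a rough illustration of what `perc` controls: in each iteration a real feature scores a "hit" when its importance exceeds a percentile of the importances of the shadow features (shuffled copies of the real ones). The sketch below uses made-up importance values and plain NumPy. It only mimics the thresholding step and is not BorutaPy's internal code.

```python
import numpy as np

# Hypothetical importances from a single iteration (made-up numbers).
shadow_importances = np.array([0.01, 0.02, 0.03, 0.04, 0.10])
real_importances = np.array([0.05, 0.08, 0.12])

hits_per_perc = {}
for perc in (100, 90, 50):
    # perc=100 is the maximum shadow importance; lower values relax the threshold.
    threshold = np.percentile(shadow_importances, perc)
    hits_per_perc[perc] = int((real_importances > threshold).sum())

print(hits_per_perc)  # {100: 1, 90: 2, 50: 3}
```

With these toy numbers, lowering `perc` from 100 to 90 lets a second feature pass, which matches the rule of thumb that a lower threshold selects more features, at the cost of a higher risk of false positives.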

After Boruta has run we can transform our dataset.

In [9]:
X_filtered = boruta_feature_selector.transform(X)
X_filtered.shape

(307511, 99)

Finally, we create a list of the selected feature names in case we want to use them at a later stage.

In [10]:
# support_ is a boolean mask over the columns of X: True marks a confirmed feature.
final_features = [f for f, keep in zip(features, boruta_feature_selector.support_) if keep]
print(final_features)

['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MED
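
If you want to keep the column names attached to the filtered data, one option is to apply the boolean mask to a DataFrame instead of the NumPy array. The following is a minimal sketch in which a hand-written mask stands in for `boruta_feature_selector.support_` (the toy frame and the mask values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"f1": [1, 2], "f2": [3, 4], "f3": [5, 6]})
toy_features = list(toy.columns)

# Stand-in for boruta_feature_selector.support_: True marks a confirmed feature.
support = np.array([True, False, True])

# Names of the kept columns, in their original order.
selected = [name for name, keep in zip(toy_features, support) if keep]

# Boolean column indexing keeps the same columns, but with their labels intact.
toy_filtered = toy.loc[:, support]
print(selected)
```

On the real data the equivalent would be `train[features].loc[:, boruta_feature_selector.support_]`, which should give the same columns as `X_filtered` while keeping the labels.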

So I hope you enjoyed my very first Kaggle Kernel :-)
Let me know if you have any feedback or suggestions.