### Wrapper Methods
Wrapper methods judge a feature subset by the predictive power of a model trained on it.
Common wrapper methods include:

- Subset selection
- Forward stepwise selection
- Backward stepwise selection (for example, recursive feature elimination, RFE)

A wrapper method searches for the best-performing combination of variables.
For each candidate subset it builds a model, evaluates it (typically with cross-validation), and keeps the subset with the best score.
Of the three families of feature selection methods (filter, wrapper, and embedded), this is the most computationally intensive.
It is not recommended for datasets with a large number of features.
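To make the search concrete, here is a minimal sketch of greedy forward selection in plain Python. The names `forward_select`, `toy_score`, and the `gain` table are hypothetical and exist only for illustration; in a real wrapper method the scoring function would train a model on the candidate subset and return a cross-validated metric.

```python
def forward_select(features, score_fn, k):
    """Greedily grow a feature subset until it holds k features."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        # Score every candidate subset formed by adding one more feature.
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical stand-in for "train a model and cross-validate it".
gain = {"a": 0.5, "b": 0.3, "c": 0.1, "d": 0.0}


def toy_score(subset):
    # Reward useful features and lightly penalise subset size.
    return sum(gain[f] for f in subset) - 0.01 * len(subset)


chosen = forward_select(gain, toy_score, k=2)
print(chosen)  # ['a', 'b']
```

Each step costs one model evaluation per remaining feature, which is where the computational expense of wrapper methods comes from.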

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt  
%matplotlib inline

In [43]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.feature_selection import VarianceThreshold, SelectFromModel, RFE
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

In [21]:
from sklearn.datasets import load_wine, load_breast_cancer
from sklearn.preprocessing import StandardScaler

In [4]:
wine = load_wine()

In [5]:
X = pd.DataFrame(data=wine.data, columns=wine.feature_names)
y = wine.target

In [6]:
X.head()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0


In [7]:
X.isnull().sum()

alcohol                         0
malic_acid                      0
ash                             0
alcalinity_of_ash               0
magnesium                       0
total_phenols                   0
flavanoids                      0
nonflavanoid_phenols            0
proanthocyanins                 0
color_intensity                 0
hue                             0
od280/od315_of_diluted_wines    0
proline                         0
dtype: int64

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### Step forward feature selection

In [9]:
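# Forward selection: add one feature at a time, keeping the one with the best 5-fold CV accuracy, until 7 are chosen.
sfs = SFS(
    RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
    k_features=7, forward=True, floating=False,
    verbose=2, scoring='accuracy', cv=5, n_jobs=-1,
).fit(X_train, y_train)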

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  13 out of  13 | elapsed:   13.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  13 out of  13 | elapsed:   13.0s finished
[2020-06-23 10:16:45] Features: 1/7 -- score: 0.7605911330049261

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    7.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    7.2s finished
[2020-06-23 10:16:52] Features: 2/7 -- score: 0.9719211822660098

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  11 out of  11 | elapsed:    7.3s finished
[2020-06-23 10:16:59] Features: 3/7 -- score: 0.9859605911330049

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    6.8s finished
[2020-06-23 10:17:06] Features: 4/7 -- score: 0.985960

In [10]:
sfs.k_feature_names_

('alcohol',
 'ash',
 'magnesium',
 'flavanoids',
 'nonflavanoid_phenols',
 'color_intensity',
 'od280/od315_of_diluted_wines')

In [11]:
sfs.k_feature_idx_, sfs.k_score_

((0, 2, 4, 6, 7, 9, 11), 0.993103448275862)

In [12]:
pd.DataFrame.from_dict(sfs.get_metric_dict()).T

,avg_score,ci_bound,cv_scores,feature_idx,feature_names,std_dev,std_err
1,0.760591,0.113387,"[0.6206896551724138, 0.896551724137931, 0.75, ...","(6,)","(flavanoids,)",0.0882193,0.0441096
2,0.971921,0.0333441,"[0.9310344827586207, 1.0, 1.0, 0.9642857142857...","(6, 9)","(flavanoids, color_intensity)",0.0259429,0.0129714
3,0.985961,0.0221059,"[0.9655172413793104, 1.0, 1.0, 0.9642857142857...","(4, 6, 9)","(magnesium, flavanoids, color_intensity)",0.0171991,0.00859955
4,0.985961,0.0221059,"[0.9655172413793104, 1.0, 1.0, 0.9642857142857...","(0, 4, 6, 9)","(alcohol, magnesium, flavanoids, color_intensity)",0.0171991,0.00859955
5,0.993103,0.0177282,"[0.9655172413793104, 1.0, 1.0, 1.0, 1.0]","(0, 4, 6, 9, 11)","(alcohol, magnesium, flavanoids, color_intensi...",0.0137931,0.00689655
6,0.993103,0.0177282,"[0.9655172413793104, 1.0, 1.0, 1.0, 1.0]","(0, 2, 4, 6, 9, 11)","(alcohol, ash, magnesium, flavanoids, color_in...",0.0137931,0.00689655
7,0.993103,0.0177282,"[0.9655172413793104, 1.0, 1.0, 1.0, 1.0]","(0, 2, 4, 6, 7, 9, 11)","(alcohol, ash, magnesium, flavanoids, nonflava...",0.0137931,0.00689655
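The average score in the table stops improving after 5 features, and the gain from 3 to 5 features is smaller than the reported `ci_bound`. One way to read such a table programmatically is to take the smallest subset whose score is within a tolerance of the best one. The sketch below copies the `avg_score` column from the table; the helper `smallest_best_size` and its `tol` parameter are hypothetical names, not part of mlxtend.

```python
# Average CV accuracy per subset size, copied from the avg_score column.
avg_score = {1: 0.760591, 2: 0.971921, 3: 0.985961, 4: 0.985961,
             5: 0.993103, 6: 0.993103, 7: 0.993103}


def smallest_best_size(scores, tol=0.0):
    """Return the smallest subset size scoring within tol of the best."""
    best = max(scores.values())
    return min(k for k, s in scores.items() if s >= best - tol)


print(smallest_best_size(avg_score))            # 5
print(smallest_best_size(avg_score, tol=0.01))  # 3
```

Allowing a small tolerance trades a fraction of a percent of cross-validated accuracy for a simpler model.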


In [13]:
# k_features=(1, 8): search subset sizes from 1 to 8 and keep the best-scoring one.
sfs = SFS(
    RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
    k_features=(1, 8), forward=True, floating=False,
    verbose=2, scoring='accuracy', cv=5, n_jobs=-1,
).fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  13 out of  13 | elapsed:    8.3s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  13 out of  13 | elapsed:    8.3s finished
[2020-06-23 10:17:30] Features: 1/8 -- score: 0.7605911330049261

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    6.5s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    6.5s finished
[2020-06-23 10:17:37] Features: 2/8 -- score: 0.9719211822660098

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  11 out of  11 | elapsed:    6.3s finished
[2020-06-23 10:17:43] Features: 3/8 -- score: 0.9859605911330049

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    6.1s finished
[2020-06-23 10:17:49] Features: 4/8 -- score: 0.985960

In [14]:
sfs.k_feature_names_, sfs.k_score_

(('alcohol',
  'magnesium',
  'flavanoids',
  'color_intensity',
  'od280/od315_of_diluted_wines'),
 0.993103448275862)

### Step backward feature selection

In [15]:
# Backward selection (forward=False): start from all 13 features and drop one per round.
sbs = SFS(
    RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
    k_features=(1, 8), forward=False, floating=False,
    verbose=2, scoring='accuracy', cv=5, n_jobs=-1,
).fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  13 out of  13 | elapsed:    8.5s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  13 out of  13 | elapsed:    8.5s finished
[2020-06-23 10:18:19] Features: 12/1 -- score: 0.993103448275862

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    6.5s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    6.5s finished
[2020-06-23 10:18:25] Features: 11/1 -- score: 0.993103448275862

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  11 out of  11 | elapsed:    7.0s finished
[2020-06-23 10:18:33] Features: 10/1 -- score: 0.993103448275862

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    6.6s finished
[2020-06-23 10:18:39] Features: 9/1 -- score: 0.993103

In [16]:
sbs.k_feature_names_, sbs.k_score_

(('alcohol',
  'malic_acid',
  'ash',
  'magnesium',
  'total_phenols',
  'proanthocyanins',
  'color_intensity',
  'od280/od315_of_diluted_wines'),
 0.993103448275862)

### Exhaustive feature selection

In [17]:
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

In [18]:
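# Score every 4- and 5-feature subset. cv=None disables cross-validation, so the scores are training accuracy.
efs = EFS(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1), min_features=4, max_features=5,
          scoring='accuracy', cv=None, n_jobs=-1).fit(X_train, y_train)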

Features: 2002/2002
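The progress counter reports 2002 candidate subsets. That figure follows from the 13 wine features and the requested subset sizes of 4 and 5, and it can be compared with the number of candidates a forward search to 7 features evaluates (13, then 12, then 11, and so on, as the earlier logs show). A small arithmetic check, using only the standard library:

```python
from math import comb

n_features = 13  # columns in the wine data

# Exhaustive search: every subset of size 4 plus every subset of size 5.
exhaustive = sum(comb(n_features, k) for k in (4, 5))

# Forward selection up to 7 features: 13 candidates, then 12, then 11, ...
forward = sum(n_features - i for i in range(7))

print(exhaustive, forward)  # 2002 70
```

The gap widens quickly as the feature count grows, which is why exhaustive search is only practical for small feature sets.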

In [19]:
help(efs)

Help on ExhaustiveFeatureSelector in module mlxtend.feature_selection.exhaustive_feature_selector object:

class ExhaustiveFeatureSelector(sklearn.base.BaseEstimator, sklearn.base.MetaEstimatorMixin)
 |  ExhaustiveFeatureSelector(estimator, min_features=1, max_features=1, print_progress=True, scoring='accuracy', cv=5, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True)
 |  
 |  Exhaustive Feature Selection for Classification and Regression.
 |     (new in v0.4.3)
 |  
 |  Parameters
 |  ----------
 |  estimator : scikit-learn classifier or regressor
 |  min_features : int (default: 1)
 |      Minumum number of features to select
 |  max_features : int (default: 1)
 |      Maximum number of features to select
 |  print_progress : bool (default: True)
 |      Prints progress as the number of epochs
 |      to stderr.
 |  scoring : str, (default='accuracy')
 |      Scoring metric in {accuracy, f1, precision, recall, roc_auc}
 |      for classifiers,
 |      {'mean_absolute_error', 'm

In [20]:
efs.best_feature_names_, efs.best_score_

(('alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash'), 1.0)

### Recursive Feature Elimination (RFE) with Random Forest and Gradient Boosting

RFE is a wrapper-style feature selection algorithm that also uses filter-style ranking internally: in each round it fits the estimator and ranks the features by the importance scores the estimator provides.

RFE searches for a subset of features by starting with all features in the training dataset
and successively removing the least important ones until the desired number remains.
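The elimination loop can be sketched in a few lines of plain Python. This is an illustration of the idea, not scikit-learn's implementation: `recursive_elimination`, `toy_importance`, and the `weight` table are hypothetical, and the importance function stands in for refitting a model and reading its feature importances.

```python
def recursive_elimination(features, importance_fn, n_keep, step=1):
    """Drop the least important features until n_keep remain."""
    current = list(features)
    while len(current) > n_keep:
        # "Refit" on the surviving features and get one score per feature.
        importance = importance_fn(current)
        n_drop = min(step, len(current) - n_keep)
        weakest = sorted(current, key=lambda f: importance[f])[:n_drop]
        current = [f for f in current if f not in weakest]
    return current


# Hypothetical importances standing in for a fitted model's scores.
weight = {"w": 4.0, "x": 3.0, "y": 2.0, "z": 1.0}


def toy_importance(subset):
    # Normalise over the surviving features, as a refit model would.
    total = sum(weight[f] for f in subset)
    return {f: weight[f] / total for f in subset}


kept = recursive_elimination(weight, toy_importance, n_keep=2)
print(kept)  # ['w', 'x']
```

Because the importances are recomputed after every removal, a feature's rank can change once a correlated feature has been dropped; a single up-front ranking would miss that.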

In [22]:
data = load_breast_cancer()

In [24]:
data.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [28]:
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

In [29]:
X.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [31]:
X_train.shape, X_test.shape

((455, 30), (114, 30))

### Feature selection by feature importance of RF

In [34]:
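# With no threshold given, SelectFromModel keeps the features whose importance is at least the mean importance.
sel = SelectFromModel(RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0))
sel.fit(X_train, y_train)
sel.get_support()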

array([ True, False,  True,  True, False, False,  True,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,  True, False,  True,  True, False, False, False,
        True, False, False])

In [35]:
features = X_train.columns[sel.get_support()]
features

Index(['mean radius', 'mean perimeter', 'mean area', 'mean concavity',
       'mean concave points', 'area error', 'worst radius', 'worst perimeter',
       'worst area', 'worst concave points'],
      dtype='object')

In [37]:
sel.estimator_.feature_importances_

array([0.03699612, 0.01561296, 0.06016409, 0.0371452 , 0.0063401 ,
       0.00965994, 0.0798662 , 0.08669071, 0.00474992, 0.00417092,
       0.02407355, 0.00548033, 0.01254423, 0.03880038, 0.00379521,
       0.00435162, 0.00452503, 0.00556905, 0.00610635, 0.00528878,
       0.09556258, 0.01859305, 0.17205401, 0.05065305, 0.00943096,
       0.01565491, 0.02443166, 0.14202709, 0.00964898, 0.01001304])
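`SelectFromModel` was created without a `threshold`, and for an estimator without an L1 penalty scikit-learn then defaults to the mean importance. The sketch below applies that rule by hand to the importances printed above and recovers the same ten column positions that `get_support()` marked `True`. The variable names are only illustrative.

```python
# Importances copied from the sel.estimator_.feature_importances_ output.
importances = [
    0.03699612, 0.01561296, 0.06016409, 0.0371452, 0.0063401,
    0.00965994, 0.0798662, 0.08669071, 0.00474992, 0.00417092,
    0.02407355, 0.00548033, 0.01254423, 0.03880038, 0.00379521,
    0.00435162, 0.00452503, 0.00556905, 0.00610635, 0.00528878,
    0.09556258, 0.01859305, 0.17205401, 0.05065305, 0.00943096,
    0.01565491, 0.02443166, 0.14202709, 0.00964898, 0.01001304,
]

# Mean importance; since importances sum to 1, this is about 1/30.
threshold = sum(importances) / len(importances)

# Keep the positions whose importance reaches the threshold.
selected = [i for i, v in enumerate(importances) if v >= threshold]

print(round(threshold, 4))  # 0.0333
print(selected)             # [0, 2, 3, 6, 7, 13, 20, 22, 23, 27]
```

Passing an explicit `threshold` (for example `"median"` or a float) to `SelectFromModel` changes how many features survive.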

In [38]:
X_Train_rfc = sel.transform(X_train)
X_Test_rfc = sel.transform(X_test)

In [39]:
X_Train_rfc.shape

(455, 10)

In [41]:
def randomforest(X_train, X_test, y_train, y_test):
    """Fit a random forest on the training split and print its accuracy on the test split."""
    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    print("Accuracy on test set:", accuracy_score(y_test, y_pred))

In [42]:
%%time
randomforest(X_Train_rfc, X_Test_rfc, y_train, y_test)

Accuracy on test set: 0.9473684210526315
Wall time: 438 ms


### Recursive Feature Elimination (RFE) 

In [45]:
# Keep the 15 features that survive recursive elimination, dropping one feature per round (step=1).
sel = RFE(RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0),
          n_features_to_select=15, step=1, verbose=0)
sel.fit(X_train, y_train)
sel.get_support()

array([ True,  True,  True,  True, False, False,  True,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,  True,  True,  True,  True,  True, False,  True,
        True,  True, False])

In [46]:
features = X_train.columns[sel.get_support()]
features

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean concavity', 'mean concave points', 'area error', 'worst radius',
       'worst texture', 'worst perimeter', 'worst area', 'worst smoothness',
       'worst concavity', 'worst concave points', 'worst symmetry'],
      dtype='object')

In [47]:
X_Train_rfe = sel.transform(X_train)
X_Test_rfe = sel.transform(X_test)

In [48]:
%%time
randomforest(X_Train_rfe, X_Test_rfe, y_train, y_test)

Accuracy on test set: 0.9736842105263158
Wall time: 293 ms


### Feature selection by gradient boosted tree importance

In [50]:
from sklearn.ensemble import GradientBoostingClassifier

In [53]:
# Same RFE procedure, but ranking features with a gradient boosting model and keeping 12.
sel = RFE(GradientBoostingClassifier(n_estimators=100, random_state=0),
          n_features_to_select=12, step=1, verbose=0)
sel.fit(X_train, y_train)
sel.get_support()

array([False,  True, False, False,  True, False, False,  True,  True,
       False, False, False, False,  True, False, False,  True, False,
       False, False,  True,  True,  True,  True, False, False,  True,
        True, False, False])

In [54]:
features = X_train.columns[sel.get_support()]
features

Index(['mean texture', 'mean smoothness', 'mean concave points',
       'mean symmetry', 'area error', 'concavity error', 'worst radius',
       'worst texture', 'worst perimeter', 'worst area', 'worst concavity',
       'worst concave points'],
      dtype='object')

In [55]:
X_Train_gb = sel.transform(X_train)
X_Test_gb = sel.transform(X_test)

In [56]:
%%time
randomforest(X_Train_gb, X_Test_gb, y_train, y_test)

Accuracy on test set: 0.9736842105263158
Wall time: 303 ms


### Selecting the best number of features with a loop

In [58]:
# Try every feature count from 1 to 30 and report the test accuracy for each.
for n_features in range(1, 31):
    sel = RFE(GradientBoostingClassifier(n_estimators=100, random_state=0),
              n_features_to_select=n_features, step=1, verbose=0)
    sel.fit(X_train, y_train)
    X_Train_gb = sel.transform(X_train)
    X_Test_gb = sel.transform(X_test)

    print("selected feature:", n_features)
    randomforest(X_Train_gb, X_Test_gb, y_train, y_test)

selected feature: 1
Accuracy on test set: 0.8771929824561403
selected feature: 2
Accuracy on test set: 0.9035087719298246
selected feature: 3
Accuracy on test set: 0.9649122807017544
selected feature: 4
Accuracy on test set: 0.9736842105263158
selected feature: 5
Accuracy on test set: 0.9649122807017544
selected feature: 6
Accuracy on test set: 0.9912280701754386
selected feature: 7
Accuracy on test set: 0.9736842105263158
selected feature: 8
Accuracy on test set: 0.9649122807017544
selected feature: 9
Accuracy on test set: 0.9736842105263158
selected feature: 10
Accuracy on test set: 0.956140350877193
selected feature: 11
Accuracy on test set: 0.956140350877193
selected feature: 12
Accuracy on test set: 0.9736842105263158
selected feature: 13
Accuracy on test set: 0.956140350877193
selected feature: 14
Accuracy on test set: 0.956140350877193
selected feature: 15
Accuracy on test set: 0.9649122807017544
selected feature: 16
Accuracy on test set: 0.956140350877193
selected feature: 17
A

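With the gradient boosting ranking, 6 features give the highest test accuracy among the counts shown (0.991). Because the count is chosen on the test set, this figure is an optimistic estimate.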

In [63]:
# Refit RFE with the best-scoring count (6) from the loop.
sel = RFE(GradientBoostingClassifier(n_estimators=100, random_state=0),
          n_features_to_select=6, step=1, verbose=0)
sel.fit(X_train, y_train)
X_Train_gb = sel.transform(X_train)
X_Test_gb = sel.transform(X_test)

randomforest(X_Train_gb, X_Test_gb, y_train, y_test)

Accuracy on test set: 0.9912280701754386


In [60]:
# Repeat the sweep with a random forest as the ranking estimator inside RFE.
for n_features in range(1, 31):
    sel = RFE(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
              n_features_to_select=n_features, step=1, verbose=0)
    sel.fit(X_train, y_train)
    X_Train_rfe = sel.transform(X_train)
    X_Test_rfe = sel.transform(X_test)

    print("selected feature:", n_features)
    randomforest(X_Train_rfe, X_Test_rfe, y_train, y_test)

selected feature: 1
Accuracy on test set: 0.8947368421052632
selected feature: 2
Accuracy on test set: 0.9298245614035088
selected feature: 3
Accuracy on test set: 0.9473684210526315
selected feature: 4
Accuracy on test set: 0.9649122807017544
selected feature: 5
Accuracy on test set: 0.9649122807017544
selected feature: 6
Accuracy on test set: 0.956140350877193
selected feature: 7
Accuracy on test set: 0.956140350877193
selected feature: 8
Accuracy on test set: 0.9649122807017544
selected feature: 9
Accuracy on test set: 0.9736842105263158
selected feature: 10
Accuracy on test set: 0.9736842105263158
selected feature: 11
Accuracy on test set: 0.9649122807017544
selected feature: 12
Accuracy on test set: 0.9736842105263158
selected feature: 13
Accuracy on test set: 0.9649122807017544
selected feature: 14
Accuracy on test set: 0.9736842105263158
selected feature: 15
Accuracy on test set: 0.9736842105263158
selected feature: 16
Accuracy on test set: 0.9736842105263158
selected feature: 1

In [62]:
# Refit RFE with a random forest ranking and 17 features.
sel = RFE(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
          n_features_to_select=17, step=1, verbose=0)
sel.fit(X_train, y_train)
X_Train_rfe = sel.transform(X_train)
X_Test_rfe = sel.transform(X_test)

randomforest(X_Train_rfe, X_Test_rfe, y_train, y_test)

Accuracy on test set: 0.9824561403508771
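
Both sweeps above pick the feature count by looking at test accuracy. An alternative, sketched here under stated assumptions, is scikit-learn's `RFECV`, which chooses the count by cross-validation on the training split so the test split is used only once at the end. The smaller forest, `step=2`, and `cv=3` are choices made only to keep the sketch quick; they are not the settings used in the cells above, and the resulting count may differ from the ones reported there.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Assumed settings for speed: 50 trees, two features removed per round, 3 folds.
selector = RFECV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    step=2, cv=3, scoring='accuracy', min_features_to_select=1)
selector.fit(X_train, y_train)  # feature count chosen from training folds only

# The test split is touched once, for the final estimate.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(selector.transform(X_train), y_train)
test_acc = accuracy_score(y_test, rf.predict(selector.transform(X_test)))

print("features kept:", selector.n_features_)
print("Accuracy on test set:", test_acc)
```

`selector.support_` holds the boolean mask of kept columns, in the same form as the `get_support()` output shown earlier.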
