# Feature Importance via Inner Loop Cross-Validation

As machine learning practitioners, we must always be wary of introducing bias into our algorithms, which can lead to overfitting. Yet this is rarely considered during feature selection, where it is commonplace to test and select features before any model fitting.

However, according to Jason Brownlee, PhD, [this is a mistake](https://machinelearningmastery.com/an-introduction-to-feature-selection/):

> You must include feature selection within the inner-loop when you are using accuracy estimation methods such as cross-validation. This means that feature selection is performed on the prepared fold right before the model is trained. A mistake would be to perform feature selection first to prepare your data, then perform model selection and training on the selected features.

The error occurs because feature selection run on the full dataset has already seen the rows that later serve as held-out folds, so it is *biased* toward that data. The model can then report inflated results upon testing. Rather than perform feature selection prior to model fitting, we should do it within the cross-validation loops, using only the training portion of each fold. This way, we select the features that perform best when applied to unseen data.
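
To make the bias concrete, here is a minimal sketch using scikit-learn rather than XGBoost. The data is pure noise with random labels, so an honest accuracy estimate should sit near chance. The names `X_noise`, `y_noise`, `leaky_acc`, and `honest_acc`, and the choice of `SelectKBest` with `LogisticRegression`, are assumptions made for illustration and are not part of the notebook's method.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X_noise = rng.normal(size=(60, 2000))  # 2000 features of pure noise
y_noise = rng.randint(0, 2, size=60)   # labels unrelated to the features

# Biased: choose the 20 "best" features using every row, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5
).mean()

# Inner-loop: the selector is refit on the training part of each fold only
pipe = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_acc = cross_val_score(pipe, X_noise, y_noise, cv=5).mean()

print("Selection before CV:", round(leaky_acc, 3))
print("Selection inside CV:", round(honest_acc, 3))
```

Because the labels carry no signal, any accuracy that the first approach reports above the second one is an artifact of selecting features on rows that were later used for testing.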

So how do we summarize what happens when we do feature testing within each loop? Brownlee [suggests](https://machinelearningmastery.com/k-fold-cross-validation/#comment-450666):

>you will get different features, and perhaps you can take the average across the findings from each fold.

This notebook implements that suggestion for XGBoost classification.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import sklearn
from sklearn import datasets
from sklearn.model_selection import KFold
from xgboost import XGBClassifier

In [2]:
# Import Iris dataset
iris = datasets.load_iris()
y = iris.target
print(y)
# 0 is Setosa, 1 is Versicolour, and 2 is Virginica
X = pd.DataFrame(iris.data, columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'])
# Show features
X.iloc[[0, 1, 2, -3, -2, -1]]

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


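|     | Sepal Length | Sepal Width | Petal Length | Petal Width |
| --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 |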


### Feature Importance

Most tree-based machine learning models contain feature importance methods, which estimate how relevant a feature is in predicting outcomes. We'll use the scores returned in each cross-validation fold. Ordinarily, we would split the dataset into train/validation/test sets. Since the purpose of this notebook is not to train and test models, we'll assume the entire Iris dataset is the training set.

In [3]:
def xgb_cv_feature_importance(X, y, split, clf, eval_metric, esr=10):
    '''
    Example of XGBoost classification implementation of inner-loop cross-validation feature importance
    Must be modified for other algorithms
    
    Returns the best selected features from X along with feature importances
    
    Parameters
    ----------
    X : {Pandas DataFrame}, shape = [n_samples, n_features]
        Training vectors, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape = [n_samples]
        Target values.
    split : generator object that splits the data
    clf : XGBoost classifier object
    eval_metric : string
                  XGBoost classifier evaluation metric. 
    esr: int (default: 10)
         Number of early stopping rounds to help prevent overfitting.

    Returns
    -------
    Pandas DataFrame of average feature importances, indexed by feature name,
    sorted in descending order, shape = [n_features, 1]
    '''
    
    # Create lists of selected features and values
    selected_features = []
    selected_vals = []
    
    count = 1
    for train_index, test_index in split:
        print("Split #", count, "\nTRAIN:", train_index, "\nTEST:", test_index)
        
        # Use the splits to divide into X_train, y_train, X_test, y_test and fit
        X_train, X_test = X.iloc[train_index].values, X.iloc[test_index].values
        y_train, y_test = y[train_index], y[test_index]
        eval_set = [(X_train, y_train), (X_test, y_test)]
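        # Newer XGBoost releases expect eval_metric and early_stopping_rounds
        # to be set on the estimator rather than passed to fit()
        clf.set_params(eval_metric=eval_metric, early_stopping_rounds=esr)
        clf.fit(X_train, y_train, eval_set=eval_set, verbose=False)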
        
        # Add features and values to lists
        feature_names = sorted([(X.columns[i], clf.feature_importances_[i]) for i in 
                                range(len(X.columns))], key=lambda x: x[1], reverse=True)
        
        selected_feature = [x[0] for x in feature_names]
        selected_val = [x[1] for x in feature_names]
        selected_features.extend(selected_feature)
        selected_vals.extend(selected_val)
        print("Sorted Feature Importance:", feature_names)
        print()
        count += 1
    
    # Create Dataframe of average feature importances
    selected_together = list(zip(selected_features, selected_vals))
    selected_df = pd.DataFrame(selected_together, columns=['Selected_Cols', 'Importance'])
    selected_df['Avg. Importance'] = selected_df.groupby('Selected_Cols')['Importance'].transform(lambda x: x.mean())
    # Keep one row per feature; drop() needs the columns keyword in pandas 2.0+
    selected_df = selected_df.groupby('Selected_Cols').first().sort_values(by='Avg. Importance', 
                                                                           ascending=False).drop(columns='Importance')
    return selected_df

In [4]:
# Set remaining parameters
kf = KFold(n_splits=3, shuffle=True)
kf_split = kf.split(X, y)
# Iris has three classes, so use a multiclass objective to match 'mlogloss'
xgb = XGBClassifier(objective='multi:softprob', n_estimators=50)
eval_metric = 'mlogloss'
esr = 10

In [5]:
xgb_cv_feature_importance(X, y, kf_split, xgb, eval_metric, esr)

Split # 1 
TRAIN: [  0   1   2   4   6   8   9  13  14  16  17  18  19  20  22  24  25  26
  27  30  31  32  33  35  36  37  39  41  42  43  44  46  49  51  52  53
  54  55  57  59  61  63  65  66  67  68  70  73  74  75  76  77  78  79
  80  82  83  85  87  90  91  93  94  95  96  97  98 100 102 103 104 106
 110 111 112 115 116 118 119 120 121 122 123 124 125 127 131 133 134 135
 137 138 139 140 142 143 144 145 148 149] 
TEST: [  3   5   7  10  11  12  15  21  23  28  29  34  38  40  45  47  48  50
  56  58  60  62  64  69  71  72  81  84  86  88  89  92  99 101 105 107
 108 109 113 114 117 126 128 129 130 132 136 141 146 147]
Sorted Feature Importance: [('Petal Length', 0.5084746), ('Petal Width', 0.2857143), ('Sepal Length', 0.118644066), ('Sepal Width', 0.08716707)]

Split # 2 
TRAIN: [  2   3   5   6   7   8   9  10  11  12  14  15  18  20  21  22  23  24
  25  28  29  30  31  34  35  36  38  39  40  41  43  45  46  47  48  49
  50  51  53  56  58  59  60  62  64  65  66  67  69  

| Selected_Cols | Avg. Importance |
| --- | --- |
| Petal Length | 0.550971 |
| Petal Width | 0.282617 |
| Sepal Width | 0.101497 |
| Sepal Length | 0.064916 |
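
The averaging step can also be written more compactly by storing one row of importances per fold. The sketch below assumes scikit-learn's `GradientBoostingClassifier` as a stand-in for the XGBoost classifier so that it runs without extra dependencies; `X_demo`, `y_demo`, `kf_demo`, `per_fold`, and `summary` are illustrative names, not part of the notebook's function. It also reports the standard deviation across folds, which hints at how stable each importance is.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold

iris_demo = load_iris()
X_demo = pd.DataFrame(iris_demo.data, columns=iris_demo.feature_names)
y_demo = iris_demo.target

kf_demo = KFold(n_splits=3, shuffle=True, random_state=0)
rows = []
for train_idx, _ in kf_demo.split(X_demo):
    # Fit on the training part of the fold only (stand-in model, not XGBoost)
    model = GradientBoostingClassifier(n_estimators=50, random_state=0)
    model.fit(X_demo.iloc[train_idx], y_demo[train_idx])
    rows.append(model.feature_importances_)

# One row per fold, one column per feature
per_fold = pd.DataFrame(rows, columns=X_demo.columns)

# Mean and spread of each feature's importance across folds
summary = pd.DataFrame(
    {"mean": per_fold.mean(), "std": per_fold.std()}
).sort_values("mean", ascending=False)
print(summary)
```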


### More Complex Feature Importance

The Iris dataset only has four features, so there is no real need to prune the features to the most important ones. More likely, we'll be using datasets with many more features. In such scenarios, the effect of a feature might be masked by similar (correlated) features, because the importance is shared among them. One possible solution is to initially prune features based on importance, then run a second feature importance test on the survivors. In the following example, we'll select the top 3 features to pass to the second importance test.
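
The masking effect can be illustrated with a small experiment: add an exact copy of one feature and compare importances before and after. This sketch assumes scikit-learn's `RandomForestClassifier` as a stand-in model; the helper `importances` and the column name `Petal Length (copy)` are hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

cols = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"]
X_base = pd.DataFrame(load_iris().data, columns=cols)
y_base = load_iris().target

def importances(frame):
    """Fit a forest on `frame` and return its importances by column name."""
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(frame, y_base)
    return pd.Series(forest.feature_importances_, index=frame.columns)

# Duplicate one informative column to mimic two highly similar features
X_dup = X_base.assign(**{"Petal Length (copy)": X_base["Petal Length"]})

before = importances(X_base)
after = importances(X_dup)
print(pd.DataFrame({"before": before, "after": after}))
```

The original column's score falls once its copy competes for the same splits, even though the information it carries has not changed.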

In [6]:
def xgb_cv_feature_importance_complex(num_features, X, y, split, clf, eval_metric, esr=10):
    '''
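    Example of XGBoost classification implementation of two-stage inner-loop
    cross-validation feature importance. Must be modified for other algorithms.
    
    In each fold, keeps the top num_features features from a first fit, then
    refits on those features only. Returns the selected features from X along
    with their selection counts and average importances.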
    
    Parameters
    ----------
    num_features : int
                   The number of features to pass to the second importance test. 
    X : {Pandas DataFrame}, shape = [n_samples, n_features]
        Training vectors, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape = [n_samples]
        Target values.
    split : generator object that splits the data
    clf : XGBoost classifier object
    eval_metric : string
                  XGBoost classifier evaluation metric. 
    esr: int (default: 10)
         Number of early stopping rounds to help prevent overfitting.

    Returns
    -------
    Pandas DataFrame of selection counts and average feature importances,
    indexed by feature name, with one row per feature selected in any fold
    '''
    
    # Lists of selected features and their importance values across all folds
    selected_features = []
    selected_vals = []

    # Newer XGBoost releases expect eval_metric and early_stopping_rounds
    # to be set on the estimator rather than passed to fit()
    clf.set_params(eval_metric=eval_metric, early_stopping_rounds=esr)

    count = 1
    for train_index, test_index in split:
        print("Split #", count, "\nTRAIN:", train_index, "\nTEST:", test_index)

        # First pass: fit on all features and keep the top num_features by importance
        X_train, X_test = X.iloc[train_index].values, X.iloc[test_index].values
        y_train, y_test = y[train_index], y[test_index]
        eval_set = [(X_train, y_train), (X_test, y_test)]
        clf.fit(X_train, y_train, eval_set=eval_set, verbose=False)
        feature_names = sorted([(X.columns[i], clf.feature_importances_[i]) for i in
                                range(len(X.columns))], key=lambda x: x[1], reverse=True)[:num_features]

        # Second pass: refit on the selected features only and record their importances
        selected = [x[0] for x in feature_names]
        X_train, X_test = X.iloc[train_index][selected].values, X.iloc[test_index][selected].values
        eval_set = [(X_train, y_train), (X_test, y_test)]
        clf.fit(X_train, y_train, eval_set=eval_set, verbose=False)
        feature_names = sorted([(selected[i], clf.feature_importances_[i]) for i in
                                range(len(selected))], key=lambda x: x[1], reverse=True)
        selected_feature = [x[0] for x in feature_names]
        selected_val = [x[1] for x in feature_names]
        selected_features.extend(selected_feature)
        selected_vals.extend(selected_val)
        print("Sorted Feature Importance:", feature_names)
        print()
        count += 1

    # Count selections per feature and average importances over all folds;
    # a fold in which a feature was not selected contributes zero to its average
    n_folds = count - 1
    selected_together = list(zip(selected_features, selected_vals))
    selected_df = pd.DataFrame(selected_together, columns=['Selected_Cols', 'Importance'])
    selected_df['# Selections'] = selected_df.groupby('Selected_Cols')['Selected_Cols'].transform('count')
    selected_df['Avg. Importance'] = selected_df.groupby('Selected_Cols')['Importance'].transform(lambda x: x.sum() / n_folds)
    # drop() needs the columns keyword in pandas 2.0+
    selected_df = selected_df.groupby('Selected_Cols').first().sort_values(by='Avg. Importance', 
                                                                           ascending=False).drop(columns='Importance')
    return selected_df

In [7]:
kf = KFold(n_splits=3, shuffle=True)
kf_split = kf.split(X, y)
# Iris has three classes, so use a multiclass objective to match 'mlogloss'
xgb = XGBClassifier(objective='multi:softprob', n_estimators=50)
eval_metric = 'mlogloss'
esr = 10

num_features = 3

In [8]:
xgb_cv_feature_importance_complex(num_features, X, y, kf_split, xgb, eval_metric, esr)

Split # 1 
TRAIN: [  0   1   2   3   5   9  10  11  12  13  14  15  16  17  20  21  22  23
  25  27  28  29  30  31  35  36  37  38  39  40  42  43  44  45  47  49
  50  51  52  53  54  55  56  59  61  62  64  65  66  67  69  70  72  73
  74  75  76  77  78  79  81  82  83  85  86  88  89  91  92  94  97 105
 106 108 109 110 113 115 118 119 121 123 125 128 129 130 131 132 133 135
 136 137 138 139 141 142 143 147 148 149] 
TEST: [  4   6   7   8  18  19  24  26  32  33  34  41  46  48  57  58  60  63
  68  71  80  84  87  90  93  95  96  98  99 100 101 102 103 104 107 111
 112 114 116 117 120 122 124 126 127 134 140 144 145 146]
Sorted Feature Importance: [('Petal Length', 0.5075), ('Petal Width', 0.325), ('Sepal Width', 0.1675)]

Split # 2 
TRAIN: [  2   4   5   6   7   8  10  12  15  16  17  18  19  22  23  24  25  26
  28  32  33  34  35  37  38  40  41  43  44  46  48  49  54  55  57  58
  60  62  63  65  66  67  68  69  70  71  72  76  77  79  80  81  83  84
  85  86  87  90  92  9

| Selected_Cols | # Selections | Avg. Importance |
| --- | --- | --- |
| Petal Length | 3 | 0.583593 |
| Petal Width | 3 | 0.2903 |
| Sepal Width | 2 | 0.079323 |
| Sepal Length | 1 | 0.046784 |
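
Note that the function divides each feature's summed importance by the number of folds, not by the number of times the feature was selected, so a feature that misses the cut in a fold is counted as zero for that fold. The toy example below uses made-up importances for hypothetical features `a`, `b`, and `c` to show how the two averages differ.

```python
import pandas as pd

n_folds = 3

# Hypothetical (feature, importance) pairs for the features kept in each fold
records = [
    ("a", 0.6), ("b", 0.4),  # fold 1 kept a and b
    ("a", 0.5), ("b", 0.5),  # fold 2 kept a and b
    ("a", 0.7), ("c", 0.3),  # fold 3 kept a and c
]
df = pd.DataFrame(records, columns=["feature", "importance"])
grouped = df.groupby("feature")["importance"]

summary = pd.DataFrame({
    "n_selected": grouped.count(),
    # Average over only the folds in which the feature was selected
    "mean_when_selected": grouped.mean(),
    # Average over every fold, counting unselected folds as zero
    "mean_over_all_folds": grouped.sum() / n_folds,
}).sort_values("mean_over_all_folds", ascending=False)
print(summary)
```

Averaging over all folds penalizes features that are selected inconsistently, which is usually the desired behavior when the goal is a stable feature set.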


With feature importances averaged across cross-validation folds in hand, we can select features more reliably and avoid the selection bias that leads to overfitting.