The ***mlxtend*** package provides a function to perform variable permutation and calculate variable importance values: feature_importance_permutation. Let's see how to use it with the Breast Cancer dataset from sklearn.

In [0]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

In [0]:
data = load_breast_cancer()
X, y = data.data, data.target

In [3]:
rf_model = RandomForestClassifier(random_state=168)
rf_model.fit(X, y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=168,
                       verbose=0, warm_start=False)

Then, we will call the feature_importance_permutation function from mlxtend.evaluate. This function takes the following parameters:

predict_method: A function that will be called for model prediction. Here, we will provide the predict method of our trained rf_model.

X: The features from the dataset. It needs to be in NumPy array form.

y: The target variable from the dataset. It needs to be in NumPy array form.

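metric: The metric used for comparing the performance of the model. mlxtend accepts 'accuracy' (intended for classifiers), 'r2' (intended for regressors), or a custom scoring function. Accuracy is the natural choice for a classification task; note that the snippet below passes 'r2' instead.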

num_rounds: The number of times mlxtend will permute each feature and assess the change in performance.

seed: The seed set for getting reproducible results.

Consider the following code snippet:

In [4]:
from mlxtend.evaluate import feature_importance_permutation

imp_vals, _ = feature_importance_permutation(
    predict_method=rf_model.predict, X=X, y=y,
    metric='r2', num_rounds=1, seed=2)
imp_vals

array([0.       , 0.       , 0.       , 0.       , 0.       , 0.       ,
       0.       , 0.       , 0.       , 0.       , 0.0075181, 0.       ,
       0.       , 0.0075181, 0.       , 0.       , 0.       , 0.       ,
       0.       , 0.       , 0.       , 0.0075181, 0.0075181, 0.       ,
       0.       , 0.       , 0.0075181, 0.       , 0.       , 0.       ])
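To make the idea behind these numbers concrete, the following is a minimal sketch of permutation importance computed by hand: shuffle one column at a time, re-score the model, and record how much the score drops. The helper name manual_permutation_importance and its defaults are hypothetical, accuracy is used as the metric, and mlxtend's own implementation may differ in its details.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


def manual_permutation_importance(predict, X, y, num_rounds=3, seed=2):
    """Hypothetical helper: mean accuracy drop after shuffling each column."""
    rng = np.random.RandomState(seed)
    baseline = accuracy_score(y, predict(X))
    drops = np.zeros((X.shape[1], num_rounds))
    for col in range(X.shape[1]):
        for rnd in range(num_rounds):
            X_perm = X.copy()
            # Break the link between this feature and the target
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops[col, rnd] = baseline - accuracy_score(y, predict(X_perm))
    return drops.mean(axis=1), drops


data = load_breast_cancer()
X, y = data.data, data.target
rf_model = RandomForestClassifier(random_state=168).fit(X, y)

mean_drop, all_drops = manual_permutation_importance(rf_model.predict, X, y)
print(mean_drop.round(4))
```

A feature whose shuffled column leaves the score unchanged gets an importance of zero, which is why most entries in the array above are 0.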

Let's create a DataFrame containing these values and the names of the features and plot them on a graph with altair:

In [5]:
import pandas as pd
varimp_df = pd.DataFrame()
varimp_df['feature'] = data.feature_names
varimp_df['importance'] = imp_vals
varimp_df.head()

           feature  importance
0      mean radius         0.0
1     mean texture         0.0
2   mean perimeter         0.0
3        mean area         0.0
4  mean smoothness         0.0
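With thirty features, head() only shows the first five rows, all of which happen to be zero here. One way to surface the features that actually moved the score is to filter and sort the DataFrame, as in the sketch below. The importance values are copied from the array printed earlier, and the name nonzero_df is only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

feature_names = load_breast_cancer().feature_names

# Importance values copied from the array output shown earlier:
# all zeros except five positions that share the same value
imp_vals = np.zeros(len(feature_names))
imp_vals[[10, 13, 21, 22, 26]] = 0.0075181

varimp_df = pd.DataFrame({'feature': feature_names, 'importance': imp_vals})

# Keep only features with a non-zero importance, largest first
nonzero_df = (
    varimp_df[varimp_df['importance'] > 0]
    .sort_values('importance', ascending=False)
)
print(nonzero_df)
```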


In [6]:
import altair as alt
alt.Chart(varimp_df).mark_bar().encode(
    x='importance',
    y="feature"
)

**These results differ from the ones we got from RandomForest in the previous section. Here, worst concave points is the most important feature, followed by worst area, and worst perimeter has a higher value than mean radius. So, we got the same list of the most important variables, but in a different order. This confirms that these three features are indeed the most important in predicting whether a tumor is malignant or not. The variable importance from RandomForest and the permutation importance are computed with different logic, so their results can differ.**
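A single permutation round scored on the training data gives a fairly coarse estimate, so it can be worth checking how stable the ranking is. The sketch below repeats the shuffling several times on a held-out split and reports the spread. It swaps in scikit-learn's own permutation_importance function in place of mlxtend; the 25% test split and n_repeats=10 are assumptions chosen for illustration, not values from the original example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target,
    test_size=0.25, random_state=168, stratify=data.target,
)

rf_model = RandomForestClassifier(random_state=168).fit(X_train, y_train)

# Shuffle each feature 10 times on the held-out data and score with accuracy
result = permutation_importance(
    rf_model, X_test, y_test,
    scoring='accuracy', n_repeats=10, random_state=2,
)

# Rank features by mean accuracy drop and show the spread across repeats
order = result.importances_mean.argsort()[::-1]
for idx in order[:5]:
    print(f"{data.feature_names[idx]:<25} "
          f"{result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")
```

If the standard deviation of a feature is of the same size as its mean importance, its position in the ranking should be read with caution.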