# Select important features using permutation feature importance
------------------------
 
This notebook selects important features on the wine dataset using scikit-learn. The steps are:


1. Train a Random Forest classifier on the data.
2. Compute the feature importance score by permuting each feature.
3. Re-train the model with only the top features.
4. Check other classifiers for comparison.



## Pre-process and inspect data


In [2]:
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.svm import SVC
from sklearn.linear_model import Lasso, Ridge
from sklearn.tree import DecisionTreeClassifier


data = load_wine(as_frame=True)
data.frame.head()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


This dataset consists of 13 numerical features and 3 distinct classes of wine (categorical).

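We'll split the data into training and testing sets, then standardize it with `StandardScaler`. The scaler is fitted on the training set only, so no information from the test set leaks into preprocessing: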

In [3]:
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)


scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
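
The effect of fitting the scaler on the training split only can be checked directly. The following is a minimal sketch, not part of the original notebook; the variable names (`X_tr_s`, `X_te_s`, and so on) are illustrative and the data is reloaded so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# The scaler's mean and variance come from the training split only
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# Training columns are centered with unit variance by construction
print(np.allclose(X_tr_s.mean(axis=0), 0.0), np.allclose(X_tr_s.std(axis=0), 1.0))

# Test columns reuse the training statistics, so they are only roughly standardized
print(X_te_s.mean(axis=0).round(2))
```

The test-set means are expected to sit near zero without being exactly zero, which is the intended behavior when the test split is treated as unseen data.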

## Train the Random Forest classifier


In [4]:
rf_clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train, y_train)
rf_clf.score(X_test, y_test)

0.9111111111111111

The model reaches a mean accuracy of about 91% on the test set.

## Permutation feature importance

In [5]:
def calculate_feature_importance(clf, X, y, top_limit=None):
  """Print features ranked by mean permutation importance, highest first.

  Feature names are read from the global `data` object loaded above.
  """
  # Shuffle each feature 50 times and record the drop in clf.score
  feature_importances = permutation_importance(clf, X, y,
                                 n_repeats=50, random_state=42)

  feature_importances_means = feature_importances.importances_mean
  # Feature indices sorted by decreasing mean importance
  ordered_feature_importances_means_args = np.argsort(feature_importances_means)[::-1]

  # Slicing with top_limit=None keeps every feature
  for i in ordered_feature_importances_means_args[:top_limit]:
    name = data.feature_names[i]
    feature_importances_score = feature_importances_means[i]
    feature_importances_std = feature_importances.importances_std[i]
    print(f"Feature {name} with index {i} has an average importance score of {feature_importances_score:.3f} +/- {feature_importances_std:.3f}\n")

Permutation importance measures how much each feature contributes to the model's predictions. Here's how it works:

1. Train a model and calculate its baseline accuracy.
2. Shuffle one feature's values, breaking its relationship with the target variable.
3. Recalculate the model's accuracy with the shuffled data.
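4. The drop in accuracy indicates the feature's importance. Repeating the shuffle several times (here `n_repeats=50`) yields a mean and a standard deviation for each feature.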

This method highlights the impact of each feature on the model's performance, making it useful for model interpretation.
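
The four steps above can be written out by hand. The following is a minimal sketch for illustration only: `manual_permutation_importance` is a hypothetical helper, not a scikit-learn function, and the features are left unscaled to keep the example short. In practice `sklearn.inspection.permutation_importance` does this work.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def manual_permutation_importance(clf, X, y, n_repeats=10, seed=0):
    """Return the mean drop in accuracy per feature (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    baseline = clf.score(X, y)  # step 1: baseline accuracy
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_shuffled = X.copy()
            # step 2: shuffle a single column, leaving the others intact
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
            # steps 3 and 4: re-score and record the drop from the baseline
            drops[j, r] = baseline - clf.score(X_shuffled, y)
    return drops.mean(axis=1)


X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_tr, y_tr)

importances = manual_permutation_importance(model, X_te, y_te)
print(importances.round(3))
```

A negative value means that shuffling the feature happened to improve the score, which usually indicates the feature carries little signal for this model.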

In [6]:
calculate_feature_importance(rf_clf, X_train, y_train)

Feature flavanoids with index 6 has an average importance score of 0.227 +/- 0.025

Feature proline with index 12 has an average importance score of 0.142 +/- 0.019

Feature color_intensity with index 9 has an average importance score of 0.112 +/- 0.023

Feature od280/od315_of_diluted_wines with index 11 has an average importance score of 0.007 +/- 0.005

Feature total_phenols with index 5 has an average importance score of 0.003 +/- 0.004

Feature malic_acid with index 1 has an average importance score of 0.002 +/- 0.004

Feature proanthocyanins with index 8 has an average importance score of 0.002 +/- 0.003

Feature hue with index 10 has an average importance score of 0.002 +/- 0.003

Feature nonflavanoid_phenols with index 7 has an average importance score of 0.000 +/- 0.000

Feature magnesium with index 4 has an average importance score of 0.000 +/- 0.000

Feature alcalinity_of_ash with index 3 has an average importance score of 0.000 +/- 0.000

Feature ash with index 2 has an aver

In [7]:
calculate_feature_importance(rf_clf, X_test, y_test)

Feature flavanoids with index 6 has an average importance score of 0.202 +/- 0.047

Feature proline with index 12 has an average importance score of 0.143 +/- 0.042

Feature color_intensity with index 9 has an average importance score of 0.112 +/- 0.043

Feature alcohol with index 0 has an average importance score of 0.024 +/- 0.017

Feature magnesium with index 4 has an average importance score of 0.021 +/- 0.015

Feature od280/od315_of_diluted_wines with index 11 has an average importance score of 0.015 +/- 0.018

Feature hue with index 10 has an average importance score of 0.013 +/- 0.018

Feature total_phenols with index 5 has an average importance score of 0.002 +/- 0.016

Feature nonflavanoid_phenols with index 7 has an average importance score of 0.000 +/- 0.000

Feature alcalinity_of_ash with index 3 has an average importance score of 0.000 +/- 0.000

Feature malic_acid with index 1 has an average importance score of -0.002 +/- 0.017

Feature ash with index 2 has an average imp

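Notice that the three most important features (flavanoids, proline, and color_intensity) are the same for both splits. However, alcohol ranks fourth on the test split (0.024 +/- 0.017), while it does not appear among the leading features on the training split. This suggests that alcohol may help the model generalize, although its score is small compared with its standard deviation.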
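
One way to judge whether an importance score stands out from the shuffling noise is to compare each mean against its standard deviation. The sketch below keeps only features whose mean importance exceeds two standard deviations. The factor of two is a heuristic threshold assumed here, and the variable names are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

wine = load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(wine.data, wine.target, random_state=42)
forest = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_tr, y_tr)

result = permutation_importance(forest, X_te, y_te, n_repeats=50, random_state=42)

# Keep features whose mean importance is more than two standard deviations above zero
clearly_important = [
    wine.feature_names[i]
    for i in result.importances_mean.argsort()[::-1]
    if result.importances_mean[i] - 2 * result.importances_std[i] > 0
]
print(clearly_important)
```

Features that fail this check are not necessarily useless. Their measured importance is simply hard to distinguish from noise on a test split of this size.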


## Re-train the model with the most important features

In [8]:
print("On TRAIN split:\n")
calculate_feature_importance(rf_clf, X_train, y_train, top_limit=3)

print("\nOn TEST split:\n")
calculate_feature_importance(rf_clf, X_test, y_test, top_limit=3)

On TRAIN split:

Feature flavanoids with index 6 has an average importance score of 0.227 +/- 0.025

Feature proline with index 12 has an average importance score of 0.142 +/- 0.019

Feature color_intensity with index 9 has an average importance score of 0.112 +/- 0.023


On TEST split:

Feature flavanoids with index 6 has an average importance score of 0.202 +/- 0.047

Feature proline with index 12 has an average importance score of 0.143 +/- 0.042

Feature color_intensity with index 9 has an average importance score of 0.112 +/- 0.043



In [9]:
X_train_top_features = X_train[:,[6, 9, 12]]
X_test_top_features = X_test[:,[6, 9, 12]]

rf_clf_top = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train_top_features, y_train)

rf_clf_top.score(X_test_top_features, y_test)

0.9333333333333333

Adding the alcohol feature (index 0) to the selected features:

In [10]:
X_train_top_features = X_train[:,[0, 6, 9, 12]]
X_test_top_features = X_test[:,[0, 6, 9, 12]]

rf_clf_top = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train_top_features, y_train)

rf_clf_top.score(X_test_top_features, y_test)

1.0
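
Hard-coding the column indices works for this notebook. As a hedged alternative, the indices can be derived from the importance ranking itself. The sketch below ranks features on the training split only, because choosing features by looking at test-set importances can make the final test score optimistic. The names `k`, `top_idx`, and `forest_top` are illustrative, and the features are left unscaled for brevity.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

wine = load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(wine.data, wine.target, random_state=42)

forest = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_tr, y_tr)
result = permutation_importance(forest, X_tr, y_tr, n_repeats=50, random_state=42)

k = 3  # number of features to keep (an arbitrary choice for this sketch)
# Indices of the k largest mean importances, sorted for readable column order
top_idx = np.sort(np.argsort(result.importances_mean)[::-1][:k])
print([wine.feature_names[i] for i in top_idx])

# Retrain on the selected columns and evaluate on the untouched test split
forest_top = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_tr[:, top_idx], y_tr)
top_score = forest_top.score(X_te[:, top_idx], y_te)
print(top_score)
```

With only 45 test samples, a single misclassified wine changes the accuracy by about two percentage points, so small differences between feature subsets should be read with caution.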

## Comparison with other classifiers

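Permutation feature importance depends on the classifier. Each classifier follows different rules for classification, so each will consider different features to be important or unimportant.

Note that `Lasso` and `Ridge` are regression models. Their `score` method returns the coefficient of determination (R²), so the "accuracy" printed for them is an R² value rather than a classification accuracy.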

In [11]:
clfs = {"Lasso": Lasso(alpha=0.05),
        "Ridge": Ridge(),
        "Decision Tree": DecisionTreeClassifier(),
        "Support Vector": SVC()}

def fit_compute_importance(clf):
  clf.fit(X_train, y_train)
  print(f"📏 Mean accuracy score on the test set: {clf.score(X_test, y_test)*100:.2f}%\n")
  print("🔝 Top 4 features when using the test set:\n")
  calculate_feature_importance(clf, X_test, y_test, top_limit=4)

for name, clf in clfs.items():
  print("====="*20)
  print(f"➡️ {name} classifier\n")
  fit_compute_importance(clf)

➡️ Lasso classifier

📏 Mean accuracy score on the test set: 86.80%

🔝 Top 4 features when using the test set:

Feature flavanoids with index 6 has an average importance score of 0.323 +/- 0.055

Feature proline with index 12 has an average importance score of 0.203 +/- 0.035

Feature od280/od315_of_diluted_wines with index 11 has an average importance score of 0.146 +/- 0.030

Feature alcalinity_of_ash with index 3 has an average importance score of 0.038 +/- 0.014

➡️ Ridge classifier

📏 Mean accuracy score on the test set: 88.71%

🔝 Top 4 features when using the test set:

Feature flavanoids with index 6 has an average importance score of 0.445 +/- 0.071

Feature proline with index 12 has an average importance score of 0.210 +/- 0.035

Feature color_intensity with index 9 has an average importance score of 0.119 +/- 0.029

Feature od280/od315_of_diluted_wines with index 11 has an average importance score of 0.111 +/- 0.026

➡️ Decision Tree classifier

📏 Mean accuracy score on the tes

**flavanoids** and **proline** are important across all classifiers.
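
`Lasso` and `Ridge` from `sklearn.linear_model` are regressors, so their `score` is R² rather than accuracy. For a like-for-like comparison, a minimal sketch with linear classifiers could look as follows. Using `RidgeClassifier` and `LogisticRegression` as replacements is an assumption of this sketch, not part of the original notebook, and no results are claimed for them here.

```python
from sklearn.datasets import load_wine
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(wine.data, wine.target, random_state=42)

linear_clfs = {
    "Ridge classifier": RidgeClassifier(),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

accuracies = {}
for name, clf in linear_clfs.items():
    # The pipeline standardizes with training statistics before fitting the classifier
    model = make_pipeline(StandardScaler(), clf).fit(X_tr, y_tr)
    accuracies[name] = model.score(X_te, y_te)  # true classification accuracy
    result = permutation_importance(model, X_te, y_te, n_repeats=50, random_state=42)
    top4 = result.importances_mean.argsort()[::-1][:4]
    print(f"{name}: accuracy={accuracies[name]:.3f}, "
          f"top 4={[wine.feature_names[i] for i in top4]}")
```

Wrapping the scaler and the classifier in one pipeline means the permutation is applied to the raw features, and scaling is repeated consistently on every shuffled copy.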