## Overview

For many of the models we've fit, we've looked at feature importances by simply ranking the features after fitting the model. Beyond this basic method, we can also measure what happens to the model's performance when we scramble the values of a specific feature. This measure is called the *permutation importance*.

To permute something means to change its order. When we fit a model, we measure the accuracy by comparing the model's predictions to the test or validation data. We can test the importance of a feature by permuting its values in the test set and then recalculating the accuracy.
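As a small illustration of what permuting a column does, the sketch below shuffles a made-up array of humidity readings with NumPy. The values and the seed are invented for the example. The point is only that the shuffled copy holds exactly the same values in a different order, so the link between each row's value and its target is broken while the column's distribution stays the same.

```python
import numpy as np

# Seeded generator so the shuffle is repeatable (the seed is arbitrary)
rng = np.random.default_rng(0)

# Made-up feature values, one per row of a hypothetical test set
humidity = np.array([20, 35, 50, 65, 80, 95])

# permutation() returns a shuffled copy and leaves the original untouched
permuted = rng.permutation(humidity)

print("original:", humidity)
print("permuted:", permuted)

# The same values are present, only their order has changed
same_values = np.array_equal(np.sort(humidity), np.sort(permuted))
print("same values:", same_values)
```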

The process works something like this:

* Fit a model and calculate the accuracy
* Choose a feature (by ranking the importances or some other method) and randomly permute the values for just that feature
* Calculate the accuracy again with the permuted column
* If the accuracy decreases, that feature is important to the model
* If the accuracy stays the same, the feature isn't important to the model and could be replaced by random numbers
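The steps above can be sketched on synthetic data. This is a toy setup, not the weather model: the data, the `predict` rule standing in for a fitted model, and the `accuracy` helper are all assumptions made up for the illustration. Column 0 fully determines the label and column 1 is pure noise, so we would expect a large accuracy drop only when column 0 is permuted.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: column 0 determines the label, column 1 is pure noise
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(int)

def predict(X):
    """Stand-in for a fitted model: it only looks at column 0."""
    return (X[:, 0] > 0).astype(int)

def accuracy(X, y):
    """Fraction of rows where the prediction matches the label."""
    return float(np.mean(predict(X) == y))

# Step 1: accuracy on the untouched data
baseline = accuracy(X, y)

# Steps 2 and 3: permute one column at a time and re-score
drops = {}
for col in range(X.shape[1]):
    X_permuted = X.copy()
    X_permuted[:, col] = rng.permutation(X_permuted[:, col])
    drops[col] = baseline - accuracy(X_permuted, y)

print("baseline accuracy:", baseline)
print("accuracy drop per column:", drops)
```

Permuting the noise column leaves the predictions unchanged because the stand-in model never reads it, which is the "could be replaced by random numbers" case from the list.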

## Follow Along

We'll use the Australian weather data set from the previous module and permute, or randomize, a few of the features in the test set. The accuracy should decrease for features that are important to the model and remain essentially the same for features that are not very important to the model.

In [1]:
# Import libraries and load data
import pandas as pd
weather = pd.read_csv('weatherAUS.csv')

# Drop columns with a high percentage of missing values (and the leaky feature)
cols_drop = ['Location', 'Evaporation', 'Sunshine', 'Cloud9am', 'Cloud3pm', 'RISK_MM']
weather_drop = weather.drop(cols_drop, axis=1)

# Convert the 'Date' column to datetime, extract month
weather_drop['Date'] = pd.to_datetime(weather_drop['Date']).dt.month

### Create Pipeline

Here we're going to create the preprocessing and model fitting pipeline from the first module in this sprint, so you have seen this code before! Once the model is fit, we can demonstrate the feature-permutation process.

In [2]:
# Imports
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier


# Define the numeric features
numeric_features = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 
                    'WindSpeed9am','WindSpeed3pm', 'Humidity9am', 
                    'Humidity3pm', 'Pressure9am','Pressure3pm', 
                    'Temp9am', 'Temp3pm']

# Create the transformer (impute, scale)
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define the categorical features
categorical_features = ['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ordinal', OrdinalEncoder())])

# Define how the numeric and categorical features will be transformed
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Define the pipeline steps, including the classifier
clf = Pipeline(steps=[('preprocessor', preprocessor),
                  ('classifier', DecisionTreeClassifier())])

### Train and Fit the Model

In [3]:
# Create the feature matrix 
X = weather_drop.drop('RainTomorrow', axis=1)

# Create and encode the target array
from sklearn.preprocessing import LabelEncoder
label_enc = LabelEncoder()
y=label_enc.fit_transform(weather_drop['RainTomorrow'])

# Import the train_test_split utility
from sklearn.model_selection import train_test_split

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the model
clf.fit(X_train,y_train)
print('Validation Accuracy', clf.score(X_test, y_test))

Validation Accuracy 0.7793522979007701


### Feature Importances

We're looking at the feature importances now (after removing the problematic leaky feature). This is important because we need to see how the features rank before we change them around.

In [4]:
# Features (order in which they were preprocessed)
features_order = numeric_features + categorical_features

importances = pd.Series(clf.steps[1][1].feature_importances_, features_order)

# Plot feature importances
import matplotlib.pyplot as plt

n = 10
plt.figure(figsize=(10,n/2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh(color='grey')

plt.show()

![mod3_obj1_features.png](https://raw.githubusercontent.com/LambdaSchool/data-science-canvas-images/main/unit_2/sprint_3/mod3_obj1_features.png)

We can now try a few of the columns and see how permutation of their values affects the accuracy. We'll start with the most important feature (`Humidity3pm`) and then do the same with one of the less important features (`WindSpeed3pm`).

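We need to remember that the permuted data must be preprocessed in the same way as the original data. Because `clf` is the full pipeline, the `SimpleImputer()` and `StandardScaler()` steps for the numeric features run again whenever we call `clf.score()`, so we permute the raw, unscaled column. In the code below we also fill the column's missing values with its median before shuffling.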

In [5]:
# Permute the values in the more important column
feature = 'Humidity3pm'
X_test_permuted = X_test.copy()

# Fill in missing values (assign back so the filled column is kept)
X_test_permuted[feature] = X_test_permuted[feature].fillna(X_test_permuted[feature].median())

# Permute the filled column
X_test_permuted[feature] = np.random.permutation(X_test_permuted[feature])

print('Feature permuted: ', feature)
print('Validation Accuracy', clf.score(X_test, y_test))
print('Validation Accuracy (permuted)', clf.score(X_test_permuted, y_test))

Feature permuted:  Humidity3pm
Validation Accuracy 0.7793522979007701
Validation Accuracy (permuted) 0.699075213615106


The accuracy went down from about 0.78 to about 0.70, as we would expect if this feature were important to the model. So `Humidity3pm` has a clear effect on the model. Let's try another feature.

In [6]:
# Permute the values in a less important column
feature = 'WindSpeed3pm'
X_test_permuted = X_test.copy()

# Fill in missing values (assign back so the filled column is kept)
X_test_permuted[feature] = X_test_permuted[feature].fillna(X_test_permuted[feature].median())

# Permute the filled column
X_test_permuted[feature] = np.random.permutation(X_test_permuted[feature])

print('Feature permuted: ', feature)
print('Validation Accuracy', clf.score(X_test, y_test))
print('Validation Accuracy (permuted)', clf.score(X_test_permuted, y_test))

Feature permuted:  WindSpeed3pm
Validation Accuracy 0.7793522979007701
Validation Accuracy (permuted) 0.7672913956186926


The decrease in accuracy was much smaller this time (about 0.01, compared with about 0.08 for `Humidity3pm`), so `WindSpeed3pm` is not as important to the model.
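The two cells above repeat the same steps, and a single random shuffle gives a somewhat noisy answer. One way to tidy this up is a small helper that shuffles a column several times and averages the drop in score. The sketch below is hypothetical: `permuted_score_drop`, `ThresholdModel`, and the demo data are names and values invented here, not part of scikit-learn or the lesson's pipeline. It also assumes the model handles missing values itself (as the `clf` pipeline does through its imputers), so the column is shuffled as-is.

```python
import numpy as np
import pandas as pd

def permuted_score_drop(model, X, y, feature, n_repeats=5, seed=0):
    """Average drop in model.score() when one column of X is shuffled.

    Hypothetical helper: `model` is any fitted object with a
    .score(X, y) method and `X` is a pandas DataFrame.
    """
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = []
    for _ in range(n_repeats):
        X_permuted = X.copy()
        X_permuted[feature] = rng.permutation(X_permuted[feature].to_numpy())
        drops.append(baseline - model.score(X_permuted, y))
    return float(np.mean(drops))


class ThresholdModel:
    """Tiny stand-in model so the sketch runs on its own."""
    def score(self, X, y):
        predictions = (X["signal"] > 0).astype(int).to_numpy()
        return float(np.mean(predictions == np.asarray(y)))


# Made-up data: 'signal' determines the label, 'noise' is irrelevant
rng = np.random.default_rng(1)
demo_X = pd.DataFrame({"signal": rng.normal(size=500),
                       "noise": rng.normal(size=500)})
demo_y = (demo_X["signal"] > 0).astype(int).to_numpy()

for name in demo_X.columns:
    print(name, permuted_score_drop(ThresholdModel(), demo_X, demo_y, name))
```

With the objects from this lesson, a call such as `permuted_score_drop(clf, X_test, y_test, 'Humidity3pm')` should work the same way, since the pipeline exposes `.score()`.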

## Challenge

Using the above code (you'll need to download the data set from the link below), try permuting other features. We only showed the top 10 features; there are others with even lower feature importance that would be good candidates for the permutation process.
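For trying many features at once, scikit-learn also provides `sklearn.inspection.permutation_importance`, which shuffles each column several times and reports the mean and standard deviation of the drop in score. The sketch below runs it on made-up data (three random columns, with only the first one determining the label) rather than on the weather data, so the variable names and values here are assumptions for illustration only.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up data: only column 0 carries information about the label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Shuffle each column 10 times on the held-out data and re-score
result = permutation_importance(
    tree, X_te, y_te, n_repeats=10, random_state=42)

# Report columns from most to least important
for idx in result.importances_mean.argsort()[::-1]:
    print(f"column {idx}: "
          f"{result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

Assuming the fitted `clf` pipeline and test set from above, the equivalent call would be `permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)`, which should shuffle the raw columns of `X_test` and leave the preprocessing to the pipeline.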

## Additional Resources

* [Kaggle: Permutation Importance](https://www.kaggle.com/dansbecker/permutation-importance)
* [Kaggle: Australian Weather Data](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package)