# Feature Importance

Let's begin by loading our training data.

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/feature-selection/master/listings_train_df.csv"

train_validate_df = pd.read_csv(url)

Then we can separate our feature and target variables.

In [5]:
target_cols = ['price']
X = train_validate_df.drop(columns = target_cols)
y = train_validate_df['price']

And split our data.

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_validate, y_train, y_validate = train_test_split(X, y, random_state = 1, test_size = .2)

In [7]:
X_train.shape, X_validate.shape

((14361, 322), (3591, 322))

And then train our model.

In [8]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
model.score(X_validate, y_validate)

0.542429555631625

### Permutations

In [9]:
from eli5.sklearn import PermutationImportance
import eli5

perm = PermutationImportance(model).fit(X_validate, y_validate)

In [11]:
# eli5.explain_weights_df(perm, feature_names = list(X_train.columns))

### How Permutation Importance Works

Let's talk about how `eli5` calculates the scores above. The idea is to take each feature away in turn and measure how much the model's score falls. For example, let's try dropping the `first_reviewElapsed` feature and see how the model performs.

In [18]:
X_val_removed = X_validate.drop(columns = ['first_reviewElapsed'])

> Uncomment the line below and run it to see the error, then comment it out again.

In [20]:
# model.score(X_val_removed, y_validate)

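Running that line raises an error: the model was fit on 322 columns, so we can't simply score it on a dataset with one column missing. So instead we do the next best thing. Rather than removing the feature altogether, we shuffle its values across the rows, which breaks any relationship between that feature and the target, and then rescore the model. If the score drops a lot, then the model must have been relying heavily on that feature.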
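
The shuffle-and-rescore idea can be sketched by hand. This is a minimal illustration, not how `eli5` is implemented internally: the data is synthetic (an assumption, not the listings dataset), and `permutation_drop` is a hypothetical helper name introduced only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, not the listings dataset):
# y depends strongly on "strong", weakly on "weak", and not at all on "noise".
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=["strong", "weak", "noise"])
y_demo = 5 * X_demo["strong"] + 0.5 * X_demo["weak"] + rng.normal(scale=0.1, size=500)

demo_model = LinearRegression().fit(X_demo, y_demo)

def permutation_drop(model, X, y, column, n_repeats=5, seed=0):
    """Average drop in R^2 after shuffling one column (hypothetical helper)."""
    shuffle_rng = np.random.default_rng(seed)
    base = model.score(X, y)
    drops = []
    for _ in range(n_repeats):
        X_shuffled = X.copy()
        # Shuffle only this column; every other column keeps its original values.
        X_shuffled[column] = shuffle_rng.permutation(X_shuffled[column].to_numpy())
        drops.append(base - model.score(X_shuffled, y))
    return float(np.mean(drops))

drops = {col: permutation_drop(demo_model, X_demo, y_demo, col) for col in X_demo.columns}
print(drops)
```

The model is never refit here: the same fitted model is scored on data where one column has been scrambled, so the column count stays the same and `score` still works. Repeating the shuffle a few times and averaging smooths out the randomness of any single shuffle.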

### On with the show

We can see from the scores below that a lot of the features are repeats of each other. That is, whenever `last_reviewYear_is_na` is true, we know that `last_reviewMonth_is_na` is true as well.

In [28]:
exp_df[:10]

| Unnamed: 0 | feature | weight | std |
|---|---|---|---|
| 0 | first_reviewElapsed | 33565780.0 | 233362.828339 |
| 1 | last_reviewElapsed | 28474230.0 | 341312.429978 |
| 2 | last_reviewDayofyear_is_na | 786954.0 | 18947.061155 |
| 3 | last_reviewDay_is_na | 782537.3 | 15185.248973 |
| 4 | last_reviewYear_is_na | 780167.0 | 9295.946614 |
| 5 | last_reviewMonth_is_na | 774861.2 | 7128.549279 |
| 6 | last_reviewWeek_is_na | 774555.3 | 10930.45181 |
| 7 | last_reviewDayofweek_is_na | 770731.0 | 6016.109522 |
| 8 | first_reviewDayofweek_is_na | 688437.9 | 6846.01372 |
| 9 | reviews_per_month_is_na | 687906.7 | 9095.983871 |
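
One way to confirm that indicator columns really are repeats of each other is to look for duplicated columns directly. The sketch below uses a small hypothetical frame (`toy`, invented for illustration) rather than the listings data, and assumes the indicators can be picked out by the `_is_na` substring in their names.

```python
import pandas as pd

# Hypothetical toy frame: the three *_is_na columns all come from the same missing dates.
toy = pd.DataFrame({
    "last_reviewYear_is_na":  [True, False, False, True],
    "last_reviewMonth_is_na": [True, False, False, True],
    "last_reviewWeek_is_na":  [True, False, False, True],
    "reviews_per_month":      [0.0, 1.2, 3.4, 0.0],
})

# Keep only the indicator columns, then transpose so each column becomes a row.
na_flags = toy.filter(like="_is_na")

# duplicated() marks every row that repeats an earlier row (the first copy is kept).
is_repeat = na_flags.T.duplicated()
repeated_cols = list(is_repeat[is_repeat].index)
print(repeated_cols)
```

Transposing a wide frame can be slow, so on a dataset with hundreds of columns it may be worth restricting the check to the indicator columns first, as done here.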


We can remove all of these repeated `_is_na` columns (keeping only the year one) to reduce multicollinearity.

In [29]:
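def remove_na_cols(df):
    # Despite the name, this returns the names of the redundant *_is_na columns;
    # it does not modify df. The Year_is_na columns are deliberately kept.
    na_suffixes = ("Month_is_na", "Week_is_na", "Day_is_na", "Dayofweek_is_na", "Dayofyear_is_na")
    # str.endswith accepts a tuple of suffixes and matches if any of them fits.
    return [df_col for df_col in df.columns if df_col.endswith(na_suffixes)]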
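
Since `remove_na_cols` only returns column names, the columns still have to be dropped in a separate step. A minimal usage sketch follows, with the function restated so the block runs on its own and a hypothetical `toy` frame standing in for `X_train`.

```python
import pandas as pd

def remove_na_cols(df):
    # Restated from the cell above so this sketch is self-contained.
    na_suffixes = ("Month_is_na", "Week_is_na", "Day_is_na", "Dayofweek_is_na", "Dayofyear_is_na")
    return [col for col in df.columns if col.endswith(na_suffixes)]

# Hypothetical stand-in for X_train with a handful of columns.
toy = pd.DataFrame({
    "last_reviewYear_is_na": [True, False],
    "last_reviewMonth_is_na": [True, False],
    "last_reviewDayofweek_is_na": [True, False],
    "reviews_per_month_is_na": [True, False],
    "first_reviewElapsed": [1.5e9, 1.6e9],
})

to_remove = remove_na_cols(toy)
toy_reduced = toy.drop(columns=to_remove)
print(list(toy_reduced.columns))
```

The suffix match is case-sensitive, so `reviews_per_month_is_na` (lowercase `month`) is not caught by `Month_is_na` and stays in the frame. On the real data the same list would be dropped from both `X_train` and `X_validate`, and the model refit, since a fitted model expects the columns it was trained on.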

### Resources

[Interpretable ML Book](https://christophm.github.io/interpretable-ml-book/)