# Feature Importance

|  | Tuesday 4-5:15pm | Friday  4-5:30pm |
|:------:|:-------------------------------------------:|:--------------------------------------------------------------------------:|
| **Week 1** | Introduction | Introduction |
| **Week 2** | Custom computer vision tasks | State of the art in Computer Vision |
| **Week 3** | Introduction to Tabular modeling and pandas | Pandas workshop and feature engineering |
| **Week 4** | Tabular and Image Regression | **Feature importance and advanced feature  engineering** |
| **Week 5** | Natural Language Processing | State of the art in NLP |
| **Week 6** | Segmentation and Kaggle | Audio |
| **Week 7** | Computer vision from scratch | NLP from scratch |
| **Week 8** | Callbacks | Optimizers |
| **Week 9** | Generative Adversarial Networks | Research time / presentations |
| **Week 10** | Putting models into production | Putting models into production |

## What is it?

* A way to examine how particular variables or features are being used or the most helpful for a model

## Why is it important?

* Feature pruning
* Can help explain our models

## How do we do it?

* Permutation importance
* Measure the change in accuracy of a **fully trained model** on a seperate set

For example, instead of our 70% Train, 20% Validation, 10% Test, we could now have:

We have 10% Test, 10% Feature Importance, 64% Train, 16% Validation

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
import pandas as pd

In [0]:
def create_sets(df:pd.DataFrame, is_feat:bool):
  train, test = train_test_split(df, test_size=0.1)
  if is_feat: 
    train, feat = train_test_split(train, test_size=0.1)
    return train, test, feat
  else: return train, test

In [0]:
from fastai.tabular import *

In [0]:
path = untar_data(URLs.ADULT_SAMPLE)

In [0]:
df = pd.read_csv(path/'adult.csv')

In [0]:
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']

In [0]:
procs = [FillMissing, Categorify, Normalize]

In [10]:
train, test, feat = create_sets(df, True)
len(train), len(test), len(feat)

(26373, 3257, 2931)

# Initial Training

In [0]:
data = (TabularList.from_df(train, path=path, cat_names=cat_names, 
                            cont_names=cont_names, procs=procs)
       .split_by_rand_pct()
       .label_from_df(cols=dep_var)
       .databunch())

In [0]:
learn = tabular_learner(data, layers=[200,100], metrics=accuracy)

In [13]:
learn.fit_one_cycle(5, 3e-02)

epoch,train_loss,valid_loss,accuracy,time
0,0.391439,0.370338,0.828024,00:03
1,0.363934,0.370537,0.822905,00:03
2,0.362987,0.37177,0.824801,00:03
3,0.35023,0.366146,0.826887,00:03
4,0.340899,0.364021,0.82992,00:03


In [14]:
learn.fit_one_cycle(1, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,0.342807,0.365048,0.828972,00:03


# Permutation Selection Algorithm

In [15]:
learn.data.cat_names

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'education-num_na']

In [16]:
learn.data.cat_names

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'education-num_na']

In [0]:
def feature_importance(learn:Learner, dep_var:str, test:DataFrame):
  pd.options.mode.chained_assignment= None # Gets rid of annoying warning
  
  data = learn.data
  cats = [x for x in data.cat_names if '_na' not in x]
  conts = data.cont_names
  procs = data.procs
  
  dt = (TabularList.from_df(test, path='', cat_names=cats.copy(), cont_names=conts.copy(),
                           procs=procs)
       .split_none()
       .label_from_df(cols=dep_var))
  dt.valid = dt.train
  dt = dt.databunch()
  
  learn.data.valid_dl = dt.valid_dl
  loss0 = float(learn.validate()[1])
  
  types = [cats, conts]
  
  fi = dict()
  for j, t in enumerate(types):
    for i, c in enumerate(t):
      print(c)
      base = test.copy()
      base[c] = base[c].sample(n=len(base), replace=True).reset_index(drop=True)
      
      
      dt = (TabularList.from_df(base, path='', cat_names=cats.copy(), cont_names=conts.copy(),
                         procs=procs)
       .split_none()
       .label_from_df(cols=dep_var))
      dt.valid = dt.train
      dt = dt.databunch()
      
      learn.data.valid_dl = dt.valid_dl
      fi[c] = float(learn.validate()[1]) - loss0
  
  d = sorted(fi.items(), key=lambda kv: kv[1])
  df = pd.DataFrame({'Variable': [l for l, v in d], u'Δ Accuracy': [v for l, v in d]})
  df['Type'] = ''
  for x in range(len(df)):
    if df['Variable'].iloc[x] in cats:
      df['Type'].iloc[x] = 'categorical'
    else:
      df['Type'].iloc[x] = 'continuous'
  return df

In [0]:
res = feature_importance(learn, dep_var, feat)

Let's look at how we are measuring. New - Original percentage. The lower the value the more important a value is

In [19]:
res

Unnamed: 0,Variable,Δ Accuracy,Type
0,marital-status,-0.04333,categorical
1,age,-0.024224,continuous
2,occupation,-0.015694,categorical
3,education-num,-0.011259,continuous
4,education,-0.007165,categorical
5,race,-0.004435,categorical
6,fnlwgt,-0.003071,continuous
7,workclass,-0.002729,categorical
8,relationship,-0.002729,categorical


In this case if we wanted to prune features we could drop `relationship`, `workclass`, `fnlwgt`, `race`, and `education`. If anything was positive or zero then dropping the column would **increase** our accuracy so we should definitely drop them