<a href="https://colab.research.google.com/github/muellerzr/FastAI-Experiments/blob/master/FeatureImportanceExperimentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Importance Verification

In this notebook, I will be running a comparitive analysis to try to disprove the following hypothesis:

If the current implementation of permutation selection for deep learning is correct, training a model without the same features should result in the loss being what was found in the original table.

I will be using the ADULTs dataset first, as it is simpler to set up and grade, and then Rossman.

The data will be split into 70% train, 20% validation, and 10% test randomly from `train_test_split` in the sklearn library.

Each test will be done with five epochs.

## Libraries

In [0]:
from fastai import *
from fastai.tabular import *
from sklearn.model_selection import train_test_split

## Data

In [0]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

In [0]:
random_seed(5)

In [8]:
train, test = train_test_split(df, test_size = 0.1)
len(train), len(test)

(29304, 3257)

In [0]:
dep_var = 'salary'
cats = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
conts = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]

In [0]:
data = (TabularList.from_df(train, path=path, cat_names=cats, cont_names=conts, 
                            procs=procs)
                           .split_by_rand_pct(0.2)
                           .label_from_df(cols=dep_var)
                           .databunch())
data_test = (TabularList.from_df(test, path=path, cat_names=cats, cont_names=conts, 
                            procs=procs, processor=data.processor)
                           .split_none()
                           .label_from_df(cols=dep_var)
                           .databunch())

## Baseline

In [0]:
learn.data.valid_dl = data.valid_dl

In [0]:
learn = tabular_learner(data, layers=[200,100], metrics=accuracy)

In [12]:
learn.fit_one_cycle(5, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.358187,0.377014,0.825256,00:05
1,0.366855,0.362309,0.829863,00:05
2,0.357001,0.36194,0.830375,00:05
3,0.343896,0.355529,0.837713,00:05
4,0.342867,0.354512,0.838396,00:05


## Feature Importance

Here I will be using the permutation-based feature importance algorithm on the test set to derive the hypothetical overall importance

In [14]:
cats, conts

(['workclass',
  'education',
  'marital-status',
  'occupation',
  'relationship',
  'race',
  'education-num_na'],
 ['age', 'fnlwgt', 'education-num'])

In [0]:
  cats = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
  conts = ['age', 'fnlwgt', 'education-num']

In [0]:
def tfeature_importance(learn:Learner, cats:list, conts:list, dep_var:str, test:DataFrame):
  data = learn.data.train_ds.x
  procs = data.procs
  dt = (TabularList.from_df(test, path='', cat_names=cats, cont_names=conts, 
                            procs=procs)
                           .split_none()
                           .label_from_df(cols=dep_var)
                           .databunch())
  learn.data.valid_dl = dt.train_dl
  loss0 = float(learn.validate()[1])
  cats = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
  conts = ['age', 'fnlwgt', 'education-num']
  types = [cats, conts]
  fi=dict()
  for j, t in enumerate(types):
    for i, c in enumerate(t):
      print(c)
      base = test.copy()
      base[c] = base[c].sample(n=len(base), replace=True).reset_index(drop=True)
      cats = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
      conts = ['age', 'fnlwgt', 'education-num']
      dt = (TabularList.from_df(base, path='', cat_names=cats, cont_names=conts, 
                            procs=procs)
                           .split_none()
                           .label_from_df(cols=dep_var)
                           .databunch())
      learn.data.valid_dl = dt.train_dl
      fi[c] = float(learn.validate()[1]) - loss0
      
  d = sorted(fi.items(), key =lambda kv: kv[1], reverse=True)
  df = pd.DataFrame({'Variable': [l for l, v in d], 'Accuracy': [v for l, v in d]})
  df['Type'] = ''
  for x in range(len(df)):
    if df['Variable'].iloc[x] in cats:
      df['Type'].iloc[x] = 'categorical'
    if df['Variable'].iloc[x] in conts:
      df['Type'].iloc[x] = 'continuous'
  return df                  

In [34]:
res = tfeature_importance(learn, cats, conts, dep_var, test)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


So here, hypothetically what we should see is a boost in performance when we drop everything that is positive.

Next I will get the actual ground truth level for the test set as well, to compare the baselines

In [22]:
learn.data.valid_dl = data_test.train_dl
learn.validate()

[0.35384032, tensor(0.8422)]

**83.91%** is the baseline we will be going with

# Dropping Columns

Here I make a for loop that drops a variable from the list, continues to the next and trains for 5, gets our test result, and continues on

## Functions

In [0]:
 def random_seed(seed_value):
    import random 
    random.seed(seed_value) # Python
    import numpy as np
    np.random.seed(seed_value) # cpu vars
    import torch
    torch.manual_seed(seed_value) # cpu  vars
    
    if torch.cuda.is_available(): 
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value) # gpu vars
        torch.backends.cudnn.deterministic = True  #needed
        torch.backends.cudnn.benchmark = False

In [0]:
class DropTest:
  def __init__(self, cat_vars:list, cont_vars:list, dep_var:str, df:DataFrame):
    self.cats = cat_vars
    self.conts = cont_vars
    self.dep = dep_var
    self.df = df
    self.procs = [FillMissing, Categorify, Normalize]
    self.res = pd.DataFrame(columns=['Variable', 'Accuracy'])
    self.types = [self.cats, self.conts]
    
  def calc_drop(self):
    train, test = train_test_split(self.df, test_size=0.1)
    k = 0
    for j, t in enumerate(self.types):
      for i, c in enumerate(t):
        random_seed(5)
        cat_copy = self.cats.copy()
        cont_copy = self.conts.copy()
        if c in cat_copy:
          cat_copy.remove(c)
        else:
          cont_copy.remove(c)
        
        df = self.df.drop(c, axis=1)
        
        data = (TabularList.from_df(train, path='', cat_names=cat_copy,
                                   cont_names=cont_copy, procs=self.procs)
               .split_by_rand_pct(0.2)
               .label_from_df(cols=self.dep)
               .databunch())
        data_test = (TabularList.from_df(test, path='', cat_names=cat_copy,
                                        cont_names=cont_copy, procs=self.procs,
                                        processor=data.processor)
                     .split_none()
                     .label_from_df(cols=self.dep)
                     .databunch())
        learn = tabular_learner(data, layers=[200,100], metrics=accuracy)
        learn.fit_one_cycle(5, 1e-2)
        learn.data.valid_dl = data_test.train_dl
        val = learn.validate()
        val = float(val[1])
        self.res.loc[k] = [str(c), val]
        k += 1    
    

## DropTest

In [0]:
t = DropTest(cats, conts, 'salary', df)

In [0]:
t.calc_drop()

In [0]:
re= t.res

In [30]:
re.sort_values('Accuracy',ascending=False)

Unnamed: 0,Variable,Accuracy
8,education-num,0.836875
5,race,0.836563
1,education,0.834063
7,fnlwgt,0.832812
2,marital-status,0.831563
4,relationship,0.830312
3,occupation,0.827187
6,age,0.825625
0,workclass,0.825312


In [36]:
res.sort_values('Accuracy', ascending=True)

Unnamed: 0,Variable,Accuracy,Type
8,marital-status,-0.038125,categorical
7,age,-0.019375,continuous
6,education-num,-0.014375,continuous
5,occupation,-0.0125,categorical
4,relationship,-0.00125,categorical
3,workclass,-0.000938,categorical
2,education,-0.000625,categorical
1,fnlwgt,-0.000313,continuous
0,race,0.000625,categorical
