## Permuted Feature Importance
Problem Statement:
- We want to see how much each variable contributes to a model's results
- We can't, or don't want to, always rely on a random forest for this. A random forest's built-in (impurity-based) importance is biased: numeric variables offer many more candidate split points than binary variables, so the trees split on them more often and their importance is inflated.
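
The bias described above can be illustrated with a small experiment. The sketch below is not part of the original workbook: it uses synthetic data (an assumption) in which both features are pure noise, so neither should matter, and compares the impurity-based importances that scikit-learn exposes as `feature_importances_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Two pure-noise features: one continuous, one binary (synthetic, illustrative only).
X = np.column_stack([
    rng.normal(size=n),          # numeric noise: hundreds of candidate split points
    rng.integers(0, 2, size=n),  # binary noise: a single candidate split point
])
y = rng.integers(0, 2, size=n)   # labels are independent of both features

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
numeric_imp, binary_imp = forest.feature_importances_
print(f"numeric noise: {numeric_imp:.3f}, binary noise: {binary_imp:.3f}")
```

Even though neither feature carries information about `y`, the numeric column receives most of the importance, which is the motivation for a permutation-based measure.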

|    | Learning Objectives                                                                        |
| -- | ------------------------------------------------------------------------------------------ |
| A. | Have a conceptual and practical understanding of feature importance                        |
| B. | Be able to implement your own feature importance measure with your choice of loss function |


Goals:
- By the end of this workbook you will conceptually understand how to calculate feature importance



Resources:
- [Article: Bias when using Feature Importance in Random Forest](https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/1471-2105-8-25?site=bmcbioinformatics.biomedcentral.com)
- [Overview](https://blogs.technet.microsoft.com/machinelearning/2015/04/14/permutation-feature-importance/)



## PFI Overview
![](https://amunategui.github.io/variable-importance-shuffler/img/shuffler.png)

## Read in Data

In [54]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [384]:
df = pd.read_csv('./data/titanic_2.csv')

In [41]:
df.isnull().sum(0)

PClass      0
Age         0
Sex         0
Title       0
Survived    0
dtype: int64

In [42]:
df.dtypes

PClass       object
Age         float64
Sex          object
Title        object
Survived      int64
dtype: object

## Generate Dummies

In [385]:
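# One 0/1 indicator column per category level (dtype=int keeps 0/1 rather than True/False)
dummies = [pd.get_dummies(df[c], dtype=int) for c in ['PClass', 'Title', 'Sex']]
df = pd.concat([df] + dummies, axis=1)
df = df.drop(['Sex', 'PClass', 'Title'], axis=1)  # drop the original text columns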
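
To make the dummy step concrete, here is a minimal sketch on a three-row toy frame. The frame `toy` and its values are made up for illustration; only the column names mirror the Titanic data.

```python
import pandas as pd

# Toy frame (hypothetical values), with one categorical and one numeric column.
toy = pd.DataFrame({"PClass": ["1st", "3rd", "2nd"], "Age": [29.0, 2.0, 30.0]})

# One indicator column per category; dtype=int keeps 0/1 instead of True/False.
dummies = pd.get_dummies(toy["PClass"], dtype=int)

# Attach the indicators and drop the original text column.
encoded = pd.concat([toy.drop(columns="PClass"), dummies], axis=1)
print(encoded)
```

Each row ends up with exactly one `1` across the `1st`, `2nd`, and `3rd` columns.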

In [386]:
pd.concat([df.head(), df.tail()])

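|      | Age  | Survived | 1st | 2nd | 3rd | Miss | Mr | Mrs | Nothing | female | male |
| ---- | ---- | -------- | --- | --- | --- | ---- | -- | --- | ------- | ------ | ---- |
| 0    | 29.0 | 1        | 1   | 0   | 0   | 1    | 0  | 0   | 0       | 1      | 0    |
| 1    | 2.0  | 0        | 1   | 0   | 0   | 1    | 0  | 0   | 0       | 1      | 0    |
| 2    | 30.0 | 0        | 1   | 0   | 0   | 0    | 1  | 0   | 0       | 0      | 1    |
| 3    | 25.0 | 0        | 1   | 0   | 0   | 0    | 0  | 1   | 0       | 1      | 0    |
| 4    | 0.92 | 1        | 1   | 0   | 0   | 0    | 0  | 0   | 1       | 0      | 1    |
| 1308 | 27.0 | 0        | 0   | 0   | 1   | 0    | 1  | 0   | 0       | 0      | 1    |
| 1309 | 26.0 | 0        | 0   | 0   | 1   | 0    | 1  | 0   | 0       | 0      | 1    |
| 1310 | 22.0 | 0        | 0   | 0   | 1   | 0    | 1  | 0   | 0       | 0      | 1    |
| 1311 | 24.0 | 0        | 0   | 0   | 1   | 0    | 1  | 0   | 0       | 0      | 1    |
| 1312 | 29.0 | 0        | 0   | 0   | 1   | 0    | 0  | 0   | 1       | 0      | 1    |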


## Train / Test Split

In [387]:
X = df.drop('Survived', axis=1)
y = df.Survived

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .15, random_state = 100)
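
As a quick sanity check on what the split does, the sketch below applies the same `test_size` and `random_state` to a made-up 100-row array (the names `X_toy` and `y_toy` are illustrative only). It shows the resulting sizes and that a fixed `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, 2 columns, alternating 0/1 labels.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

# Same arguments twice: the fixed random_state should give the same rows both times.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.15, random_state=100)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.15, random_state=100)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_test, b_test))
```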

## Shuffler Function

In [568]:
def shuffler_col(col, frame_in):
    """Return a copy of frame_in with column `col` randomly shuffled."""
    frame_i = pd.concat(
        [frame_in[col].sample(frac=1).reset_index(drop=True),  # shuffle the column, reset index
         frame_in.drop(col, axis=1).reset_index(drop=True)],   # all other columns, original row order
        axis=1)

    frame_i = frame_i[frame_in.columns]  # restore the original column order
    return frame_i

In [588]:
shuffler_col(frame_in = X_test, col ='female').head()

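|    | Age  | 1st | 2nd | 3rd | Miss | Mr | Mrs | Nothing | female | male |
| -- | ---- | --- | --- | --- | ---- | -- | --- | ------- | ------ | ---- |
| 0  | 28.0 | 1   | 0   | 0   | 0    | 0  | 1   | 0       | 0      | 0    |
| 1  | 28.0 | 0   | 1   | 0   | 0    | 1  | 0   | 0       | 0      | 1    |
| 2  | 28.0 | 0   | 0   | 1   | 0    | 1  | 0   | 0       | 1      | 1    |
| 3  | 21.0 | 0   | 0   | 1   | 1    | 0  | 0   | 0       | 0      | 0    |
| 4  | 28.0 | 1   | 0   | 0   | 0    | 0  | 1   | 0       | 0      | 0    |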
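
Because `sample(frac = 1)` draws a fresh random order on every call, the output above changes from run to run. If reproducible shuffles are wanted, one option is to pass in a seeded NumPy generator. The helper `shuffle_column` below is a hypothetical variant, not the workbook's function, and the toy frame is made up.

```python
import numpy as np
import pandas as pd

def shuffle_column(frame, col, rng):
    """Return a copy of `frame` with only `col` permuted (hypothetical helper)."""
    out = frame.reset_index(drop=True).copy()
    out[col] = rng.permutation(out[col].to_numpy())
    return out

# Toy frame with illustrative values only.
toy = pd.DataFrame({"Age": [29.0, 2.0, 30.0, 25.0], "female": [1, 1, 0, 1]})
shuffled = shuffle_column(toy, "female", np.random.default_rng(0))

# Untouched columns keep their row order; the shuffled column keeps its values.
print(shuffled["Age"].equals(toy["Age"]))
print(sorted(shuffled["female"]) == sorted(toy["female"]))
```

The two checks capture what a shuffle should preserve: the column's overall distribution stays the same, and only its alignment with the other columns (and the target) is broken.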


## Run a Random Forest Model
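- Create a table of feature importance
- Create a baseline score on the test set. Note that `clf.score` returns mean accuracy for classifiers, not AUC; an AUC baseline would need `roc_auc_score` on predicted probabilities instead.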

In [None]:
clf = ...       # TODO: create the random forest model here
# TODO: fit clf on X_train, y_train
baseline = ...  # TODO: score clf on X_test, y_test
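
One possible way to fill in the cell above is sketched here on synthetic data, so that it runs without the Titanic file. The frame `X_demo`, the `_demo` names, and the hyperparameters are assumptions for illustration, not the workbook's intended answer. It fits a random forest, computes both an accuracy baseline (`score`) and an AUC baseline (`roc_auc_score`), and builds the importance table.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic frame (assumption, not the real data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=400),  # drives the label
    "noise": rng.normal(size=400),   # unrelated to the label
})
y_demo = (X_demo["signal"] + 0.5 * rng.normal(size=400) > 0).astype(int)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=100
)

clf_demo = RandomForestClassifier(n_estimators=200, random_state=0)
clf_demo.fit(Xd_train, yd_train)

baseline_acc = clf_demo.score(Xd_test, yd_test)  # mean accuracy on the test set
baseline_auc = roc_auc_score(yd_test, clf_demo.predict_proba(Xd_test)[:, 1])

# Impurity-based importances, sorted from most to least important.
importance_table = pd.DataFrame({
    "var": Xd_train.columns,
    "imp": clf_demo.feature_importances_,
}).sort_values("imp", ascending=False)
print(importance_table)
```

Whichever baseline is chosen, the shuffled scores later in the workbook should be computed with the same metric so the comparison is like for like.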


## Shuffle Columns for Feature Importance

In [589]:
frame_in = X_test
shuffles = 500
measure = []  # one importance value per column

for i in frame_in:
    col_i = []  # one True/False per shuffle of column i
    for j in range(shuffles):
        col_i_shuffled = shuffler_col(col=i, frame_in=frame_in)
        # True when the model scores worse on the shuffled data than on the original
        col_i.append(clf.score(col_i_shuffled, y_test) < baseline)
    # the mean is the proportion of shuffles that scored below the baseline
    measure.append(np.mean(col_i))

In [590]:
pd.DataFrame({'var':frame_in.columns,"imp":measure}).sort_values('imp', ascending = False)

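|    | imp   | var     |
| -- | ----- | ------- |
| 8  | 1.0   | female  |
| 9  | 0.906 | male    |
| 3  | 0.792 | 3rd     |
| 7  | 0.634 | Nothing |
| 2  | 0.606 | 2nd     |
| 1  | 0.196 | 1st     |
| 0  | 0.188 | Age     |
| 5  | 0.062 | Mr      |
| 4  | 0.048 | Miss    |
| 6  | 0.012 | Mrs     |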
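
As a cross-check on a hand-rolled measure, scikit-learn ships its own `permutation_importance` in `sklearn.inspection`. It reports the mean drop in score per feature rather than the proportion of shuffles below baseline, so the numbers are not directly comparable to the table above, although the ranking often is. The sketch below uses synthetic data and made-up names (`X_demo`, `model`) as assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data (assumption): only "signal" determines the label.
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({"signal": rng.normal(size=400), "noise": rng.normal(size=400)})
y_demo = (X_demo["signal"] > 0).astype(int)

# Fit on the first 300 rows, evaluate permutations on the held-out 100.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_demo.iloc[:300], y_demo.iloc[:300])

result = permutation_importance(
    model, X_demo.iloc[300:], y_demo.iloc[300:], n_repeats=20, random_state=0
)
summary = pd.DataFrame({
    "var": X_demo.columns,
    "mean_drop": result.importances_mean,  # average score drop across repeats
    "std": result.importances_std,
}).sort_values("mean_drop", ascending=False)
print(summary)
```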


## Question 1
Run a random forest classifier and acquire feature importance.

## Question 2
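Choose another way to measure variable importance. For example, instead of `clf.score`, try RMSE or log loss. Keep in mind that these are losses, so a shuffled column does worse when the value goes up, not down.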

In [None]:
# here
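
As one possible starting point, the sketch below swaps accuracy for log loss. It is an illustration on synthetic data with a logistic regression, not the workbook's solution, and all names in it are assumptions. The key change is the direction of the comparison: a loss gets worse when it increases.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Synthetic data (assumption): "signal" drives the label, "noise" does not.
rng = np.random.default_rng(2)
X_demo = pd.DataFrame({"signal": rng.normal(size=300), "noise": rng.normal(size=300)})
y_demo = (X_demo["signal"] + 0.3 * rng.normal(size=300) > 0).astype(int)

# Fit on the first 200 rows, evaluate on the remaining 100.
X_fit, y_fit = X_demo.iloc[:200], y_demo.iloc[:200]
X_eval, y_eval = X_demo.iloc[200:].reset_index(drop=True), y_demo.iloc[200:]

model = LogisticRegression().fit(X_fit, y_fit)
baseline_loss = log_loss(y_eval, model.predict_proba(X_eval))

worse_rate = {}
for col in X_eval.columns:
    worse = []
    for _ in range(50):
        shuffled = X_eval.copy()
        shuffled[col] = rng.permutation(shuffled[col].to_numpy())
        loss = log_loss(y_eval, model.predict_proba(shuffled))
        worse.append(loss > baseline_loss)  # a loss is worse when it goes UP
    worse_rate[col] = float(np.mean(worse))
print(worse_rate)
```

Because log loss uses predicted probabilities rather than hard labels, it can register a change even when shuffling does not flip any predicted class.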

## Question 3

Build a function that produces the above results.

In [None]:
# here
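
A function that reproduces the importance table might look like the sketch below. The name `permuted_importance`, its signature, and the synthetic usage data are all hypothetical; it assumes a fitted model with a `score` method and a pandas feature frame, and it uses the same "proportion of shuffles below baseline" measure as the loop earlier in the workbook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def permuted_importance(model, X, y, n_shuffles=100, seed=0):
    """Share of shuffles in which model.score drops below the unshuffled baseline.

    Hypothetical helper: the name and signature are illustrative only.
    """
    rng = np.random.default_rng(seed)
    X = X.reset_index(drop=True)
    baseline = model.score(X, y)
    rows = []
    for col in X.columns:
        below = 0
        for _ in range(n_shuffles):
            shuffled = X.copy()
            shuffled[col] = rng.permutation(shuffled[col].to_numpy())
            below += int(model.score(shuffled, y) < baseline)
        rows.append({"var": col, "imp": below / n_shuffles})
    return pd.DataFrame(rows).sort_values("imp", ascending=False)

# Toy usage on synthetic data (assumption): only "signal" determines the label.
rng = np.random.default_rng(3)
X_demo = pd.DataFrame({"signal": rng.normal(size=300), "noise": rng.normal(size=300)})
y_demo = (X_demo["signal"] > 0).astype(int)
model = LogisticRegression().fit(X_demo, y_demo)
table = permuted_importance(model, X_demo, y_demo, n_shuffles=50)
print(table)
```

For brevity the usage scores the model on the data it was fitted on; with real data, pass the held-out test set as in the cells above.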

## Question 4

1. Download a data set from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html?format=&task=cla&att=&area=&numAtt=&numIns=&type=&sort=nameUp&view=table)
2. Pick a classification model that we've covered before (naive Bayes, linear model, decision tree, random forest)
3. Train the model, then score it on the test set to produce a baseline
4. Tune the model
5. Acquire Feature Importance
