# Feature Selection using RFE and ANOVA
The previous automated feature selection using scikit-learn's `GridSearchCV` was useful, but it offered limited interpretability and transparency. A traditional feature selection process is worth trying, as it will help reduce the number of variables needed for modeling and improve the interpretability of the final model.<br><br>
I plan on using recursive feature elimination (RFE) to narrow the candidate set. Strictly speaking, RFE works backward (it repeatedly drops the weakest feature), so it serves here as an approximation of stepwise feature selection. Using statsmodels, I then plan on running an ANOVA on each feature against the outcome measure `death_within_7_days` to test for significance. Features are ranked by their F-test values and added one at a time to a model containing the previously added variables: the process starts with the most significant variable, adds the next most significant one, and continues until adding a variable yields only limited improvement in explained variance. This process is called *forward feature selection*.
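
The forward selection loop described above can be sketched on synthetic data. This is a minimal illustration rather than the procedure used in this notebook: it assumes a continuous outcome and uses the R² of a least-squares fit as the explained-variance measure, and the names `r_squared`, `forward_select`, and `min_gain` are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid.var() / y.var()

def forward_select(X, y, min_gain=0.01):
    """Greedily add the column that raises R^2 the most.

    Stops once the best remaining column improves R^2 by less than
    min_gain (an assumed stopping threshold).
    """
    selected, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # Score every candidate when added to the current selection.
        scores = {j: r_squared(X[:, selected + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best < min_gain:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected, best

# Synthetic data: only columns 2 and 4 carry signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 6))
y_demo = 3.0 * X_demo[:, 2] - 2.0 * X_demo[:, 4] + rng.normal(scale=0.5, size=500)

selected_demo, r2_demo = forward_select(X_demo, y_demo)
print(selected_demo, round(r2_demo, 3))
```

For a binary outcome such as `death_within_7_days`, the same loop could score candidates with a likelihood-based measure (for example a pseudo R²) instead of R².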


In [21]:
#load libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split #sklearn.cross_validation was removed in newer scikit-learn releases

from IPython.display import display #displays full dataframe columns
#display all dataframe columns when printed
pd.options.display.max_columns = None



In [8]:
#load data
df = pd.read_csv('C:/Users/Mark.Burghart/Documents/projects/hospice_carepoint/data/transformed/carepoint_transformed_dummied.csv', index_col=0)
df.shape

(271541, 120)

In [9]:
#separate variables (X) from outcome of interest (y)
cols = df.columns.tolist() #converts column names to list (get_values() is deprecated in newer pandas)
feature_cols = [x for x in cols if x != 'death_within_7_days'] #removes outcome of interest from list ('death_within_7_days')

#extract rows
#print(feature_cols) #debug
X = df.loc[:, feature_cols]
X.shape #outcome column has been removed

#save outcome variable as y
y = df.death_within_7_days
y.shape

#separate data into training/test (aka holdout) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 23) #random_state for reproducibility (if needed)
#X_test, y_test should not be used until NO MORE decisions are being made. 
#This is the final, FINAL validation, and more often just used for model performance and generalizability!

In [11]:
#impute missing values: replacing NaNs with Median Column value for each column
X_train = X_train.fillna(X_train.median()) 
y_train = y_train.fillna(y_train.median()) 
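
When the holdout set is eventually scored, it will need the same treatment for missing values. One option, shown here as a sketch on toy data, is to reuse the medians computed on the training set so that no information from the holdout rows leaks into preprocessing. The frames `train_demo` and `test_demo` and their column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_train / X_test (hypothetical columns).
train_demo = pd.DataFrame({"pain": [1.0, np.nan, 3.0, 5.0],
                           "sob": [0.0, 2.0, np.nan, 4.0]})
test_demo = pd.DataFrame({"pain": [np.nan, 2.0],
                          "sob": [1.0, np.nan]})

# Compute medians once, on the training rows only.
train_medians = train_demo.median()

# Fill both sets with the training medians.
train_filled = train_demo.fillna(train_medians)
test_filled = test_demo.fillna(train_medians)

print(train_medians.to_dict())
print(test_filled)
```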

## Recursive Feature Elimination

In [12]:
logreg = LogisticRegression()
rfe = RFE(logreg, n_features_to_select=40) #selecting top 40 variables for statistical testing
rfe = rfe.fit(X_train, y_train)
print(rfe.support_)
print(rfe.ranking_)

[False False False False  True False False False False False False False
 False False False False  True  True False False False False False False
 False False False False False False False False False False  True  True
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False  True  True  True  True  True  True
  True  True False  True  True  True  True False  True  True False  True
 False False False  True False  True False False False  True False False
  True  True  True  True  True  True  True  True  True False False False
  True  True  True False  True  True  True  True  True False False]
[ 9 69 14 12  1 41 71 25 43 20 78  7 27 32 24 22  1  1 70 30 61 52 17 56
 53 45 38 19 31  6 26 40 35 23  1  1 49 29 60 51 16 55 54 44 37 18 28 74
 50 46 36 57 72 68 76 64 80 73 58 63 59 62 79 47 21 48  1  1  1  1  1  1
  1  1 67  1  1  1  1 15  1  1 39  1 77  3 75  1 11  1 6

In [17]:
#manually flag sig variables
names = X_train.columns.values
print("Features sorted by their rank:")
print(sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names)))

Features sorted by their rank:
[(1, '3_visit_max_pain'), (1, '3_visit_max_shortnessofbreath'), (1, '3_visit_mean_pain'), (1, '3_visit_mean_shortnessofbreath'), (1, 'AdvanceDirective_Yes - Full Code / Advanced Cardiac Life Support (ACLS)'), (1, 'InsuranceType_Medicaid Traditional'), (1, 'InsuranceType_Medicare (HMO/Per Visit)'), (1, 'InsuranceType_Other Government'), (1, 'InsuranceType_Private Pay'), (1, 'InsuranceType_Self-Pay'), (1, 'InsuranceType_TRICARE'), (1, 'InsuranceType_Title Progams (e.g. Title III, V, or XX)'), (1, 'InsuranceType_Unknown'), (1, 'InsuranceType_Veteran Administration Plan'), (1, 'Lack_of_Appetite'), (1, 'LevelofCare_Continuous (CHC)'), (1, 'LevelofCare_Inpatient (GIP)'), (1, 'LevelofCare_Respite'), (1, 'LevelofCare_Routine'), (1, 'Race_Asian'), (1, 'Race_Black or African American'), (1, 'Race_Native Hawaiian or Pacific Islander'), (1, "ReferralType_Clinic or physician's office"), (1, 'ReferralType_Court/Law Enforcement'), (1, 'ReferralType_Information not avail

In [18]:
cols = ['3_visit_max_pain','3_visit_max_shortnessofbreath','3_visit_mean_pain', '3_visit_mean_shortnessofbreath', 
 'AdvanceDirective_Yes - Full Code / Advanced Cardiac Life Support (ACLS)', 'InsuranceType_Medicaid Traditional', 
 'InsuranceType_Medicare (HMO/Per Visit)', 'InsuranceType_Other Government', 'InsuranceType_Private Pay', 
 'InsuranceType_Self-Pay', 'InsuranceType_TRICARE', 'InsuranceType_Title Progams (e.g. Title III, V, or XX)', 
 'InsuranceType_Unknown', 'InsuranceType_Veteran Administration Plan', 'Lack_of_Appetite', 'LevelofCare_Continuous (CHC)',
'LevelofCare_Inpatient (GIP)', 'LevelofCare_Respite', 'LevelofCare_Routine', 
 'Race_Asian', 'Race_Black or African American', 'Race_Native Hawaiian or Pacific Islander', 
 "ReferralType_Clinic or physician's office", 'ReferralType_Court/Law Enforcement', 'ReferralType_Information not available',
 'ReferralType_Non-health care facility', 'ReferralType_Transfer from Home Health Agency', 
 'ReferralType_Transfer from Hospice', 'ReferralType_Transfer from SNF or ICF', 
 'icd10_cluster_Certain conditions originating in the perinatal period', 'icd10_cluster_Certain infectious and parasitic diseases',
 'icd10_cluster_Congenital malformations, deformations and chromosomal abnormalities', 'icd10_cluster_Diseases of the genitourinary system', 
 'icd10_cluster_Diseases of the musculoskeletal system and connective tissue', 'icd10_cluster_Diseases of the nervous system',
'icd10_cluster_Diseases of the skin and subcutaneous tissue', 
'icd10_cluster_Endocrine, nutritional, and metabolic diseases', 
 'icd10_cluster_Factors influencing health status and contact with health services', 
 'icd10_cluster_Injury, poisonining, and certain other consequences of external causes', 'icd10_cluster_Mental and behavioural disorders']
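
As an alternative to typing the list by hand, the selected column names can be read from the fitted selector, since `rfe.support_` is a boolean mask aligned with the columns passed to `fit`. The sketch below shows this on synthetic data; the column names `a` through `e` and the variable `selected_cols` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: only columns "a" and "d" drive the binary outcome.
rng = np.random.default_rng(23)
X_demo = pd.DataFrame(rng.normal(size=(300, 5)), columns=["a", "b", "c", "d", "e"])
noise = rng.normal(scale=0.5, size=300)
y_demo = (2.0 * X_demo["a"] - 2.0 * X_demo["d"] + noise > 0).astype(int)

# Keep the top 2 features, mirroring the RFE call used in this notebook.
rfe_demo = RFE(LogisticRegression(), n_features_to_select=2).fit(X_demo, y_demo)

# support_ marks the retained columns; index the column labels with it.
selected_cols = X_demo.columns[rfe_demo.support_].tolist()
print(selected_cols)
```

Applied to this notebook, the equivalent would be `X_train.columns[rfe.support_].tolist()`, which avoids transcription errors in long dummy-variable names.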

In [19]:
X_train=X_train[cols]
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 190078 entries, 15610 to 127718
Data columns (total 40 columns):
3_visit_max_pain                                                                        190078 non-null float64
3_visit_max_shortnessofbreath                                                           190078 non-null float64
3_visit_mean_pain                                                                       190078 non-null float64
3_visit_mean_shortnessofbreath                                                          190078 non-null float64
AdvanceDirective_Yes - Full Code / Advanced Cardiac Life Support (ACLS)                 190078 non-null int64
InsuranceType_Medicaid Traditional                                                      190078 non-null int64
InsuranceType_Medicare (HMO/Per Visit)                                                  190078 non-null int64
InsuranceType_Other Government                                                          190078 non-null int

In [25]:
from scipy import stats
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df) #workaround for deprecated chisq function
logit_model=sm.Logit(y_train,X_train)
result=logit_model.fit()
print(result.summary())

         Current function value: 0.450607
         Iterations: 35




                            Logit Regression Results                           
Dep. Variable:     death_within_7_days   No. Observations:               190078
Model:                           Logit   Df Residuals:                   190038
Method:                            MLE   Df Model:                           39
Date:                 Fri, 13 Apr 2018   Pseudo R-squ.:                  0.1792
Time:                         17:13:14   Log-Likelihood:                -85650.
converged:                       False   LL-Null:                   -1.0435e+05
                                         LLR p-value:                     0.000
                                                                                           coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------------------------------------------------------
3_visit_max_pain                                      

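Several of these variables could still be removed, but I could probably keep most of them. Removing the 6 variables likely wouldn't be necessary. Note that the fit above did not converge (`converged: False` after 35 iterations), so the coefficients and p-values should be read with caution.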

<br>
## Feature importance with Random Forest<br>
Building a random forest model for feature importance. This should produce results similar to those of a statistical test for inference, but it displays them in a way that is easier to interpret without background knowledge of p-values, F-tests, etc.