<img src="https://i.imgur.com/Yy47HAq.png" style="float: left; margin: 20px; height: 290px">

# Feature Selection 

---
Capstone - Predicting mortality outcomes of COVID-19 cases

**Author**: Miriam Sosa

1. [Dummies](#Dummies)
    - [Set Up x & y](#Set-Up-x-&-y) 
2. [Scale Data LR](#Scale-Data-LR) 
    - [Evaluate Performance LR](#Evaluate-Performance-LR) 
    - [Undersample More Frequent Class LR](#Undersample-More-Frequent-Class-LR)
3. [Random Forest Classifier](#Random-Forrest-Classifier)
    - [Evaluate Performance RFC](#Evaluate-Performance-RFC)
    - [Undersample More Frequent Class RFC](#Undersample-More-Frequent-Class-RFC)
4. [Summary](#Summary)

In [1]:
import matplotlib.pyplot as plt
import missingno as msno
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score, balanced_accuracy_score, classification_report, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


from imblearn.under_sampling import NearMiss, RandomUnderSampler
from imblearn.over_sampling import SMOTE, RandomOverSampler
from collections import Counter

In [2]:
df = pd.read_csv('../data/clean_subset.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 19 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   case_month                       100000 non-null  object 
 1   res_state                        100000 non-null  object 
 2   state_fips_code                  100000 non-null  int64  
 3   res_county                       100000 non-null  object 
 4   county_fips_code                 100000 non-null  object 
 5   age_group                        100000 non-null  object 
 6   sex                              100000 non-null  object 
 7   race                             100000 non-null  object 
 8   ethnicity                        100000 non-null  object 
 9   case_positive_specimen_interval  100000 non-null  float64
 10  case_onset_interval              100000 non-null  float64
 11  process                          100000 non-null  object 
 12  exp

In [4]:
df['case_month'] = pd.to_datetime(df['case_month'])
df['case_month'].value_counts().sort_index(ascending=False)

2021-06-01      162
2021-05-01     2851
2021-04-01     5932
2021-03-01     6248
2021-02-01     6415
2021-01-01    15540
2020-12-01    19100
2020-11-01    13641
2020-10-01     6759
2020-09-01     4026
2020-08-01     3957
2020-07-01     5120
2020-06-01     3257
2020-05-01     2581
2020-04-01     2989
2020-03-01     1396
2020-02-01       14
2020-01-01       12
Name: case_month, dtype: int64

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 19 columns):
 #   Column                           Non-Null Count   Dtype         
---  ------                           --------------   -----         
 0   case_month                       100000 non-null  datetime64[ns]
 1   res_state                        100000 non-null  object        
 2   state_fips_code                  100000 non-null  int64         
 3   res_county                       100000 non-null  object        
 4   county_fips_code                 100000 non-null  object        
 5   age_group                        100000 non-null  object        
 6   sex                              100000 non-null  object        
 7   race                             100000 non-null  object        
 8   ethnicity                        100000 non-null  object        
 9   case_positive_specimen_interval  100000 non-null  float64       
 10  case_onset_interval              100000 non-n

In [6]:
# copy df for lr
df_lr = df.copy()
df_lr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 19 columns):
 #   Column                           Non-Null Count   Dtype         
---  ------                           --------------   -----         
 0   case_month                       100000 non-null  datetime64[ns]
 1   res_state                        100000 non-null  object        
 2   state_fips_code                  100000 non-null  int64         
 3   res_county                       100000 non-null  object        
 4   county_fips_code                 100000 non-null  object        
 5   age_group                        100000 non-null  object        
 6   sex                              100000 non-null  object        
 7   race                             100000 non-null  object        
 8   ethnicity                        100000 non-null  object        
 9   case_positive_specimen_interval  100000 non-null  float64       
 10  case_onset_interval              100000 non-n

## Dummies

### Get Dummies for all Categorical Variables EXCEPT the Numeric `case_onset_interval` and `case_positive_specimen_interval`

In [7]:
#df.set_index("case_month", inplace=True)

In [8]:
df_lr = pd.get_dummies(df_lr, columns=['case_month',
                        'res_state', 
                           'res_county',            
                           'age_group', 
                           'sex', 
                           'race', 
                           'ethnicity', 
                           'process', 
                           'exposure_yn', 
                           'current_status', 
                           'symptom_status', 
                           'hosp_yn', 
                           'icu_yn', 
                           'underlying_conditions_yn'
                          ], drop_first=True)
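
As a small illustration of what `drop_first=True` does in the cell above, the sketch below encodes a toy frame. The toy data and the name `toy` are made up for illustration and are not taken from `clean_subset.csv`. The first level of each encoded column (alphabetically) is dropped and becomes the implicit baseline, while numeric columns pass through unchanged.

```python
import pandas as pd

# Toy data (hypothetical values, only the column names mirror the notebook).
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Female", "Unknown"],
    "case_onset_interval": [0.0, 0.1, 0.0, 0.3],
})

# Encode only the categorical column; the first level ("Female") is dropped.
encoded = pd.get_dummies(toy, columns=["sex"], drop_first=True)

print(encoded.columns.tolist())
```

A high-cardinality column such as `res_county` contributes one dummy per level (minus one), which is likely where most of the 1144 columns reported below come from.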

In [9]:
df_lr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Columns: 1144 entries, state_fips_code to underlying_conditions_yn_Yes
dtypes: float64(2), int64(2), object(1), uint8(1139)
memory usage: 112.4+ MB


In [10]:
df_lr.head()

Unnamed: 0,state_fips_code,county_fips_code,case_positive_specimen_interval,case_onset_interval,death_yn,case_month_2020-02-01 00:00:00,case_month_2020-03-01 00:00:00,case_month_2020-04-01 00:00:00,case_month_2020-05-01 00:00:00,case_month_2020-06-01 00:00:00,...,symptom_status_Symptomatic,symptom_status_Unknown,hosp_yn_No,hosp_yn_Unknown,hosp_yn_Yes,icu_yn_No,icu_yn_Unknown,icu_yn_Yes,underlying_conditions_yn_No,underlying_conditions_yn_Yes
0,6,6037.0,0.294085,0.0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
1,40,40137.0,0.294085,0.098388,0,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0
2,42,42055.0,0.0,0.098388,0,0,0,0,0,0,...,0,1,0,1,0,0,1,0,0,0
3,53,53077.0,0.0,0.0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
4,6,6067.0,0.294085,0.0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [11]:
df_lr['death_yn'].value_counts()

0    99034
1      966
Name: death_yn, dtype: int64
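
The counts above imply a heavily imbalanced target. A quick sketch of the arithmetic, using only the two counts printed above, shows the accuracy a model would reach by always predicting class 0, which is a useful reference point for the accuracy scores later in the notebook.

```python
# Class counts copied from the value_counts() output above.
n_survived, n_died = 99034, 966
n_total = n_survived + n_died

minority_share = n_died / n_total          # fraction of cases with death_yn == 1
baseline_accuracy = n_survived / n_total   # accuracy of always predicting 0

print(f"minority share:    {minority_share:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
```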

In [12]:
966*4

3864

In [13]:
#minority_class = df[df['death_yn'] == 1]
#majority_class = df[df['death_yn'] == 0].sample(n=3864, random_state=42)

#df_resampled = pd.concat([minority_class, majority_class], axis=0, ignore_index=True)
#df_resampled.shape

In [14]:
#df_resampled['death_yn'].value_counts()

## Set Up `x` & `y`

In [15]:
X = df_lr.drop(columns=['death_yn', 'state_fips_code', 'county_fips_code'])
y = df_lr['death_yn']

In [16]:
X

Unnamed: 0,case_positive_specimen_interval,case_onset_interval,case_month_2020-02-01 00:00:00,case_month_2020-03-01 00:00:00,case_month_2020-04-01 00:00:00,case_month_2020-05-01 00:00:00,case_month_2020-06-01 00:00:00,case_month_2020-07-01 00:00:00,case_month_2020-08-01 00:00:00,case_month_2020-09-01 00:00:00,...,symptom_status_Symptomatic,symptom_status_Unknown,hosp_yn_No,hosp_yn_Unknown,hosp_yn_Yes,icu_yn_No,icu_yn_Unknown,icu_yn_Yes,underlying_conditions_yn_No,underlying_conditions_yn_Yes
0,0.294085,0.000000,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
1,0.294085,0.098388,0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0
2,0.000000,0.098388,0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,1,0,0,0
3,0.000000,0.000000,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
4,0.294085,0.000000,0,0,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,0.000000,0.098388,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,1
99996,0.294085,0.098388,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
99997,0.294085,0.098388,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
99998,1.000000,0.000000,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [17]:
#y.value_counts(normalize=True)

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    stratify = y, 
                                                    test_size = 0.25, 
                                                    random_state = 42)
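
The `stratify = y` argument above keeps the class ratio the same in the train and test splits. The sketch below demonstrates this on toy labels (the arrays `X_toy` and `y_toy` are hypothetical and only mimic a roughly 1% positive rate).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 40 positives out of 4000 rows (1%).
y_toy = np.array([1] * 40 + [0] * 3960)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.25, random_state=42
)

# Both splits keep the 1% positive rate.
print(y_tr.mean(), y_te.mean())
```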

In [19]:
X.dtypes

case_positive_specimen_interval    float64
case_onset_interval                float64
case_month_2020-02-01 00:00:00       uint8
case_month_2020-03-01 00:00:00       uint8
case_month_2020-04-01 00:00:00       uint8
                                    ...   
icu_yn_No                            uint8
icu_yn_Unknown                       uint8
icu_yn_Yes                           uint8
underlying_conditions_yn_No          uint8
underlying_conditions_yn_Yes         uint8
Length: 1141, dtype: object

## Scale Data LR

In [22]:
ss = StandardScaler()

X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)
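
The cell above fits the scaler on the training split only and reuses those statistics for the test split, so no information from the test data leaks into preprocessing. A minimal sketch on random toy arrays (hypothetical data, not the notebook's features) makes the difference between `fit_transform` and `transform` explicit.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_test_toy = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

scaler = StandardScaler()
X_train_toy_sc = scaler.fit_transform(X_train_toy)  # learns mean/std from train only
X_test_toy_sc = scaler.transform(X_test_toy)        # reuses the train statistics

# Train columns are centered exactly; test columns are only approximately centered.
print(X_train_toy_sc.mean(axis=0).round(6))
print(X_test_toy_sc.mean(axis=0).round(3))
```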

In [23]:
log_reg = LogisticRegressionCV(Cs=[.1, 1, 10, 100], random_state=123)
log_reg.fit(X_train_sc, y_train)
log_reg.score(X_test_sc, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.9898
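
The convergence warning above means the solver hit its iteration limit (`max_iter` defaults to 100 for `LogisticRegressionCV`) before converging. One possible remedy, sketched below on synthetic data, is to raise `max_iter`. The value 1000 and the `make_classification` settings are assumptions for illustration; this has not been run on the notebook's 1141-column matrix, where a larger limit or a different solver might be needed.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real data (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_demo_sc = StandardScaler().fit_transform(X_demo)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model = LogisticRegressionCV(Cs=[0.1, 1, 10, 100], max_iter=1000, random_state=123)
    model.fit(X_demo_sc, y_demo)

# Count how many convergence warnings were raised during fitting.
n_convergence_warnings = sum(
    issubclass(w.category, ConvergenceWarning) for w in caught
)
print("convergence warnings:", n_convergence_warnings)
```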

In [24]:
log_reg.C_

array([0.1])

In [25]:
y_pred = log_reg.predict(X_train_sc)

In [26]:
lr_preds = log_reg.predict(X_test_sc)

## Evaluate Performance LR

In [27]:
#training score

f1_score(y_train, y_pred)

0.30390143737166325

In [28]:
#testing score

f1_score(y_test, lr_preds)

0.22018348623853212

In [29]:
balanced_accuracy_score(y_test, lr_preds)

0.5736790628463366

In [30]:
# Confusion matrix

print('Order of classes: ', log_reg.classes_)

confusion_df = pd.DataFrame(
                    data=confusion_matrix(y_test, lr_preds),
                    index=[f'actual {target_class}' for target_class in log_reg.classes_],
                    columns=[f'predicted {target_class}' for target_class in log_reg.classes_])

print('\nTest Confusion Matrix: \n', confusion_df)

Order of classes:  [0 1]

Test Confusion Matrix: 
           predicted 0  predicted 1
actual 0        24709           50
actual 1          205           36


In [31]:
print(classification_report(y_test, lr_preds))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99     24759
           1       0.42      0.15      0.22       241

    accuracy                           0.99     25000
   macro avg       0.71      0.57      0.61     25000
weighted avg       0.99      0.99      0.99     25000
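
To make the link between the confusion matrix and the scores explicit, the sketch below recomputes the class-1 metrics by hand from the four cells printed above. Only the cell values come from the notebook; the variable names are illustrative.

```python
# Cell values copied from the test confusion matrix above.
tn, fp = 24709, 50   # actual 0: predicted 0, predicted 1
fn, tp = 205, 36     # actual 1: predicted 0, predicted 1

precision = tp / (tp + fp)      # share of predicted deaths that were deaths
recall = tp / (tp + fn)         # sensitivity: share of deaths that were caught
specificity = tn / (tn + fp)    # share of survivors correctly identified
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.4f}")
print(f"specificity={specificity:.4f} balanced accuracy={balanced_accuracy:.4f}")
```

The results agree with `f1_score` and `balanced_accuracy_score` above: accuracy is high because specificity is close to 1, while recall on the minority class is low.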



## Undersample More Frequent Class LR

In [32]:
# random undersampling of the majority class (training data only)
rus = RandomUnderSampler()

X_train_under, y_train_under = rus.fit_resample(X_train_sc, y_train)
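
As far as I understand its defaults, `RandomUnderSampler` keeps every minority-class row and samples the majority class without replacement down to the same size. The sketch below reproduces that idea in plain NumPy; `random_undersample` is a hypothetical helper written for illustration, not part of imbalanced-learn.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Hypothetical helper: sample every class down to the smallest class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    # Draw n_keep row indices per class without replacement;
    # for the minority class this keeps all of its rows.
    keep_idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    return X[keep_idx], y[keep_idx]

# Toy data: 990 negatives and 10 positives.
y_toy = np.array([0] * 990 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_bal, y_bal = random_undersample(X_toy, y_toy)
print(np.bincount(y_bal))
```

Only the training split is resampled; the test split keeps its original class ratio so the evaluation reflects real-world prevalence.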



In [33]:
log_reg_under = LogisticRegressionCV(Cs=[.1, 1, 10, 100], random_state=123)

undersample_results = (log_reg_under, X_train_under, y_train_under, 
                                  X_test_sc, y_test)

In [34]:
log_reg_under.fit(X_train_under, y_train_under)
log_reg_under.score(X_test_sc, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.8912

In [35]:
y_pred_under = log_reg_under.predict(X_train_under)

In [36]:
lr_preds_under = log_reg_under.predict(X_test_sc)

In [37]:
# Confusion matrix

print('Order of classes: ', log_reg_under.classes_)

confusion_df_under = pd.DataFrame(
                    data=confusion_matrix(y_test, lr_preds_under),
                    index=[f'actual {target_class}' for target_class in log_reg_under.classes_],
                    columns=[f'predicted {target_class}' for target_class in log_reg_under.classes_])

print('\nTest Confusion Matrix: \n', confusion_df_under)

Order of classes:  [0 1]

Test Confusion Matrix: 
           predicted 0  predicted 1
actual 0        22059         2700
actual 1           20          221


In [38]:
print(classification_report(y_test, lr_preds_under))

              precision    recall  f1-score   support

           0       1.00      0.89      0.94     24759
           1       0.08      0.92      0.14       241

    accuracy                           0.89     25000
   macro avg       0.54      0.90      0.54     25000
weighted avg       0.99      0.89      0.93     25000



In [39]:
# precision for class 1, from the confusion matrix above: TP / (TP + FP)
round(221 / (221 + 2700), 4)

0.0757

## Random Forrest Classifier


In [40]:
#get dummies, some variables dropped due to grid search findings

In [41]:
df_rfc = pd.get_dummies(df, columns=['case_month',
                           'age_group', 
                           #'res_county', 
                           'res_state',          
                           'sex', 
                           'race', 
                           'ethnicity', 
                           'process', 
                           'exposure_yn', 
                           'current_status', 
                           'symptom_status', 
                           'hosp_yn', 
                           'icu_yn', 
                           'underlying_conditions_yn'
                          ], drop_first=True)

- Assign X and y. Drop `res_county` based on grid search results; `res_state` is kept as dummy columns.

In [42]:
X = df_rfc.drop(columns=['death_yn', 
                         'res_county', 
                         #'res_state',
                         'state_fips_code', 
                         'county_fips_code'])
y = df_rfc['death_yn']

In [43]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    stratify = y, 
                                                    test_size = 0.25, 
                                                    random_state = 42)

In [44]:
rfc = RandomForestClassifier()

In [45]:
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)

0.99

In [46]:
y_pred_rfc = rfc.predict(X_train)
rfc_preds = rfc.predict(X_test)
print('Train Accuracy: ', accuracy_score(y_train, y_pred_rfc))
print('Test Accuracy: ', accuracy_score(y_test, rfc_preds))

Train Accuracy:  0.9972666666666666
Test Accuracy:  0.99


## Evaluate Performance RFC

In [47]:
#training score

f1_score(y_train, y_pred_rfc)

0.8433919022154316

In [48]:
#testing score

f1_score(y_test, rfc_preds)

0.23780487804878053

In [49]:
balanced_accuracy_score(y_test, rfc_preds)

0.5799435185897446

In [50]:
# Confusion matrix

print('Order of classes: ', rfc.classes_)

confusion_df_rfc = pd.DataFrame(
                    data=confusion_matrix(y_test, rfc_preds),
                    index=[f'actual {target_class}' for target_class in rfc.classes_],
                    columns=[f'predicted {target_class}' for target_class in rfc.classes_])

print('\nTest Confusion Matrix: \n', confusion_df_rfc)

Order of classes:  [0 1]

Test Confusion Matrix: 
           predicted 0  predicted 1
actual 0        24711           48
actual 1          202           39


In [51]:
print(classification_report(y_test, rfc_preds))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99     24759
           1       0.45      0.16      0.24       241

    accuracy                           0.99     25000
   macro avg       0.72      0.58      0.62     25000
weighted avg       0.99      0.99      0.99     25000



In [52]:
#  GridSearch

rf_params = {
    'n_estimators': [100, 150, 200],
    'max_depth': [None, 1, 2, 3, 4, 5],
}

gs = GridSearchCV(rfc, param_grid=rf_params, cv=5)
gs.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [None, 1, 2, 3, 4, 5],
                         'n_estimators': [100, 150, 200]})

In [53]:
gs.best_params_, gs.best_score_

({'max_depth': 1, 'n_estimators': 100}, 0.9903333333333333)

In [54]:
gs.score(X_train, y_train), gs.score(X_test, y_test)

(0.9903333333333333, 0.99036)
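
The train and test scores above coincide with the share of the majority class in each split, which suggests (though it does not prove) that the selected depth-1 forest predicts class 0 for every case. `GridSearchCV` uses the estimator's default accuracy score unless `scoring` is set. The sketch below first checks the arithmetic, then shows, on synthetic data and with an assumed choice of `scoring="balanced_accuracy"`, how a different selection metric could be passed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Counts inferred from the outputs above: 966 deaths overall, 241 of them in the
# 25,000-row test split, leaving 725 in the 75,000-row training split.
n_train, n_test = 75_000, 25_000
deaths_test = 241
deaths_train = 966 - deaths_test

train_majority_share = (n_train - deaths_train) / n_train
test_majority_share = (n_test - deaths_test) / n_test
print(train_majority_share, test_majority_share)

# Synthetic, imbalanced stand-in for the real data (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Select hyperparameters by balanced accuracy instead of plain accuracy.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid={"max_depth": [None, 1, 3]},
    scoring="balanced_accuracy",
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```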

In [55]:
rf = gs.best_estimator_
type(rf)

sklearn.ensemble._forest.RandomForestClassifier

In [56]:
pd.set_option('display.max_rows', 125)

In [57]:
pd.DataFrame({
    'features': X_train.columns,
    'feature importance': gs.best_estimator_.feature_importances_
}).sort_values(by='feature importance', ascending=False)

| | features | feature importance |
|---|---|---|
| 21 | age_group_65+ years | 0.09 |
| 114 | icu_yn_Yes | 0.09 |
| 58 | res_state_NY | 0.08 |
| 4 | case_month_2020-04-01 00:00:00 | 0.08 |
| 111 | hosp_yn_Yes | 0.08 |
| 3 | case_month_2020-03-01 00:00:00 | 0.07 |
| 92 | ethnicity_Unknown | 0.06 |
| 116 | underlying_conditions_yn_Yes | 0.06 |
| 88 | race_White | 0.04 |
| 113 | icu_yn_Unknown | 0.04 |

In [58]:
pd.set_option('display.max_rows', 10)

## Undersample More Frequent Class RFC

In [59]:
# random undersampling of the majority class (training data only)
rus = RandomUnderSampler()

X_train_under, y_train_under = rus.fit_resample(X_train, y_train)

In [60]:
rfc_under = RandomForestClassifier()
undersample_RFC_results = (rfc_under, X_train_under, y_train_under, 
                                  X_test, y_test)

In [61]:
rfc_under.fit(X_train_under, y_train_under)
rfc_under.score(X_test, y_test)

0.89552

In [62]:
y_pred_rfc_under = rfc_under.predict(X_train)

In [63]:
rfc_preds_under = rfc_under.predict(X_test)

In [64]:
# Confusion matrix

print('Order of classes: ', rfc_under.classes_)

confusion_df_rfc_under = pd.DataFrame(
                    data=confusion_matrix(y_test, rfc_preds_under),
                    index=[f'actual {target_class}' for target_class in rfc_under.classes_],
                    columns=[f'predicted {target_class}' for target_class in rfc_under.classes_])

print('\nTest Confusion Matrix: \n', confusion_df_rfc_under)

Order of classes:  [0 1]

Test Confusion Matrix: 
           predicted 0  predicted 1
actual 0        22161         2598
actual 1           14          227


In [65]:
print(classification_report(y_test, rfc_preds_under))

              precision    recall  f1-score   support

           0       1.00      0.90      0.94     24759
           1       0.08      0.94      0.15       241

    accuracy                           0.90     25000
   macro avg       0.54      0.92      0.55     25000
weighted avg       0.99      0.90      0.94     25000



In [66]:
# accuracy from the confusion matrix above: (TP + TN) / total
(227 + 22161) / 25000

0.89552

## Summary

### Undersampling sharply reduced false negatives, raising sensitivity (recall) above 90%, at the expense of precision, which fell to about 8%; specificity also dropped from nearly 100% to about 90%.
### This was true for both the Log Reg and RFC models.
### The original models, meanwhile, had modest precision (~40-45%) and low sensitivity (recall ~15%).
### In short, the original models did not identify the minority class (deaths) well.
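
Because sensitivity, specificity and precision are easy to mix up, the sketch below recomputes all three for the four models directly from the test confusion matrices printed earlier in this notebook. The helper `rates` and the dictionary layout are illustrative choices, not part of the original analysis.

```python
# (tn, fp, fn, tp) copied from the four test confusion matrices in this notebook.
matrices = {
    "LR original": (24709, 50, 205, 36),
    "LR undersampled": (22059, 2700, 20, 221),
    "RFC original": (24711, 48, 202, 39),
    "RFC undersampled": (22161, 2598, 14, 227),
}

def rates(tn, fp, fn, tp):
    """Return the three rates discussed in the summary for one confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),   # recall for class 1
        "specificity": tn / (tn + fp),   # recall for class 0
        "precision": tp / (tp + fp),     # share of predicted deaths that were deaths
    }

summary = {name: rates(*cells) for name, cells in matrices.items()}

for name, r in summary.items():
    print(f"{name:<17} " + "  ".join(f"{k}={v:.3f}" for k, v in r.items()))
```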

In [67]:
rf_params = {
    'n_estimators': [100, 150, 200],
    'max_depth': [None, 1, 2, 3, 4, 5],
}

gs = GridSearchCV(rfc_under, param_grid=rf_params, cv=5)
gs.fit(X_train_under, y_train_under)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [None, 1, 2, 3, 4, 5],
                         'n_estimators': [100, 150, 200]})

In [68]:
gs.best_params_, gs.best_score_

({'max_depth': None, 'n_estimators': 100}, 0.9379310344827587)

In [69]:
gs.score(X_train_under, y_train_under), gs.score(X_test, y_test)

(0.9986206896551724, 0.89644)

In [70]:
rf = gs.best_estimator_
type(rf)

sklearn.ensemble._forest.RandomForestClassifier

In [77]:
pd.set_option('display.max_rows', 125)

In [78]:
pd.DataFrame({
    'features': X_train_under.columns,
    'feature importance': gs.best_estimator_.feature_importances_
}).sort_values(by='feature importance', ascending=False)

| | features | feature importance |
|---|---|---|
| 21 | age_group_65+ years | 0.278438 |
| 111 | hosp_yn_Yes | 0.111523 |
| 19 | age_group_18 to 49 years | 0.106773 |
| 109 | hosp_yn_No | 0.029599 |
| 20 | age_group_50 to 64 years | 0.026352 |
| 91 | ethnicity_Non-Hispanic/Latino | 0.025202 |
| 4 | case_month_2020-04-01 00:00:00 | 0.02008 |
| 116 | underlying_conditions_yn_Yes | 0.018449 |
| 88 | race_White | 0.018261 |
| 77 | sex_Male | 0.017742 |


In [73]:
pd.set_option('display.max_rows', 10)