# FEATURE SELECTION & CROSS-VALIDATION

- 05/18/21
- onl01-dtsc-ft-022221

## Learning Objectives

- To discuss the 3 general types of feature selection methods and give examples of each. 
- To discuss the ideal use of GridSearch/cross-validation in our modeling process. 
- To learn how to save and load models.

### Resources/References

- [Udemy Course: Feature Selection for Machine Learning Models](https://www.udemy.com/course/feature-selection-for-machine-learning/) - inspired much of today's content. 
- [Tamjid's Blog Post: "Beginners guide for feature selection"](https://tamjida.medium.com/beginners-guide-for-feature-selection-by-a-beginner-cd2158c5c36a)

___
# Predicting Parkinson's Disease from Speech

## INTRODUCTION

- Parkinson's Disease is a neurological disorder that affects coordination, balance, walking, and can also affect speech.
    - [NIA - Parkinson's Disease](https://www.nia.nih.gov/health/parkinsons-disease#:~:text=Parkinson's%20disease%20is%20a%20brain,have%20difficulty%20walking%20and%20talking)
    - [Parkinson's Foundation](https://www.parkinson.org/Understanding-Parkinsons/Symptoms/Non-Movement-Symptoms/Speech-and-Swallowing-Problems)
    
- This dataset was created for the publication "A comparative analysis of speech signal processing algorithms for Parkinson’s disease classification and the use of the tunable Q-factor wavelet transform".
    - https://doi.org/10.1016/j.asoc.2018.10.022

## OBTAIN

- The dataset was downloaded from https://archive.ics.uci.edu/ml/datasets/Parkinson%27s+Disease+Classification. 
>- "Abstract: The data used in this study were gathered from 188 patients with PD (107 men and 81 women) with ages ranging from 33 to 87 (65.1±10.9)."
    - Data Source: 

- [Related paper](https://www.sciencedirect.com/science/article/abs/pii/S1568494618305799?via%3Dihub)
    - PDF located inside `reference` folder.
    - See Table 1 on page 9.
    

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## Preprocessing tools
from sklearn.model_selection import train_test_split,cross_val_predict,cross_validate
from sklearn.preprocessing import MinMaxScaler,StandardScaler,OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTE,SMOTENC
from sklearn import metrics

## Models & Utils
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression,LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from time import time

In [None]:
# ## Changing Pandas Options to see full columns in previews and info
n=800
pd.set_option('display.max_columns',n)
pd.set_option("display.max_info_rows", n)
pd.set_option('display.max_info_columns',n)
pd.set_option('display.float_format',lambda x: f"{x:.2f}")

In [None]:
# Modeling Functions
%load_ext autoreload
%autoreload 2

import project_functions as pf

In [None]:
df = pd.read_csv('data/pd_speech_features.csv',skiprows=1)
df

## SCRUB

In [None]:
## null value check
nulls= df.isna().sum()
nulls.sum()

In [None]:
## Preview columns and dtypes
df.info()

## EXPLORE

> - There are too many features to visualize at once. A workflow for visualizing related columns is in the appendix, but it is still a work in progress.

In [None]:
corr = df.drop('id',axis=1).corr()
print(corr.shape)
plt.figure(figsize=(15,15))
sns.heatmap(corr,cmap='coolwarm')

#### Features

- In order to preprocess this dataset, I should identify related features based on their names and create a dictionary to be able to slice out all related columns for EDA. [Appendix'd for now]


- Features include the results of various speech signal processing algorithms (see Table 1 below):
    - Time Frequency Features
    - Mel Frequency Cepstral Coefficients (MFCCs)
    - Wavelet Transform based Features
    - Vocal Fold Features
    - Tunable Q-factor Wavelet Transform (TQWT) features

- Remaining Feature Questions

    - [ ] Which cols are "Fundamental frequency parameters"?
    
<img src="./reference/table_1.png" width=60%>



#### Finding Categorical Features

In [None]:
## Seeing which columns may be categorical
df.nunique()[(df.nunique() < 20)]

In [None]:
## making gender a str so it's caught by the categorical pipeline
df['gender'] = df['gender'].astype(str)

## PREPROCESSING 

### Train/Test Split

In [None]:
## Specifying the target column and the columns to drop before splitting into X and y
target_col = 'class'
drop_cols = ['id']

y = df[target_col].copy()
X = df.drop(columns=[target_col,*drop_cols]).copy()
y.value_counts(1)

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y)
X_train

In [None]:
from sklearn import set_config
set_config(display='diagram')

In [None]:
## saving list of numeric vs categorical feature
num_cols = list(X_train.select_dtypes('number').columns)
cat_cols = list(X_train.select_dtypes('object').columns)

## create pipelines and column transformer
num_transformer = Pipeline(steps=[
    ('imputer',SimpleImputer(strategy='median')),
    ('scale',MinMaxScaler())
])

## NOTE: `sparse=False` was renamed to `sparse_output=False` in scikit-learn 1.2
cat_transformer = Pipeline(steps=[
    ('imputer',SimpleImputer(strategy='constant',fill_value='MISSING')),
    ('encoder',OneHotEncoder(sparse=False,drop='first'))])

print('# of num_cols:',len(num_cols))
print('# of cat_cols:',len(cat_cols))

## COMBINE BOTH PIPELINES INTO ONE WITH COLUMN TRANSFORMER
preprocessor=ColumnTransformer(transformers=[
    ('num',num_transformer,num_cols),
    ('cat',cat_transformer,cat_cols)])

preprocessor

In [None]:
## Fit preprocessing pipeline on training data and pull out the feature names and X_cols
preprocessor.fit(X_train)

## Use the encoder's .get_feature_names
## NOTE: renamed to .get_feature_names_out in scikit-learn 1.0; the old name was removed in 1.2
cat_features = list(preprocessor.named_transformers_['cat'].named_steps['encoder']\
                            .get_feature_names(cat_cols))
X_cols = num_cols+cat_features

## Transform X_train, X_test and remake dfs
X_train_df = pd.DataFrame(preprocessor.transform(X_train),
                          index=X_train.index, columns=X_cols)
X_test_df = pd.DataFrame(preprocessor.transform(X_test),
                          index=X_test.index, columns=X_cols)

## Preview the transformed training data
X_train_df

In [None]:
y.value_counts(1)

### Resampling with SMOTENC

In [None]:
y_train.value_counts(1)

In [None]:
## Boolean mask marking which columns are categorical (needed by SMOTENC)
smote_feats = [False]*len(num_cols) +[True]*len(cat_features)
# smote_feats

In [None]:
## resample training data (pass the mask by keyword so the call is explicit)
smote = SMOTENC(categorical_features=smote_feats)
X_train_sm,y_train_sm = smote.fit_resample(X_train_df,y_train)
y_train_sm.value_counts()

## MODELING

#### Setting `train_test_list`

In [None]:
### SAVING XY DATA TO LIST TO UNPACK
train_test_list = [X_train_sm,y_train_sm,X_test_df,y_test]

## Baseline Model: Linear SVC

In [None]:
# Baseline model is a linear SVC 
svc_linear = SVC(kernel='linear',C=1)
pf.fit_and_time_model(svc_linear,*train_test_list)

___

# **⭐️Feature Selection Study Group⭐️**

- Office Hours for 022221FT
- 05/18/21

## Types  of Feature Selection

- Filter Methods.
- Wrapper Methods.
- Embedded Methods.
- Hybrid Methods (not discussed here, see resources at top of notebook for details)


### Filter Methods

> Filter methods rely on the characteristics of the features themselves and do not involve machine learning models. They are ideal for a quick screen and removal of irrelevant features.

- Advantages:
    - Model agnostic
    - Less computationally expensive than other methods. 
  
    
- Disadvantages:
    - Lower improvement in model performance vs other methods. 


- Example Filter Methods:
    - Variance
    - Correlation
    - Univariate selection

### Wrapper Methods

> Wrapper methods use predictive machine learning models to score various subsets of features. Train a new model for each feature subset.

- Advantages:
    -  Provides the best performing subset for given model type.
    
- Disadvantages:
    -  Very computationally expensive
    - May not produce the best feature combos for different model types.
    
- Example Wrapper Methods:
    - Forward selection
    - Backward elimination
    - Exhaustive Search


### Embedded Methods

> Embedded methods perform feature selection as part of the modeling/training process.

- Advantages:
    -  Consider the interactions between features and models.
    - Less computationally expensive than Wrapper methods (only fit the model 1 time vs many)
    
- Disadvantages:
    - only available in some models.
    - selected features may not always be appropriate for different model types
    
- Example Embedded Methods:
    - Lasso Regression
    - Tree importance

## Filter Methods - Applied

- Overall Filter Methods Process:
    1. Rank each feature according to some criterion
    2. Select features with highest ranking. 
- Example Filter Methods (used below):
    1. Variance Threshold
    2. Correlation
    3. Mutual Information
    4. Univariate Models

In [None]:
selected_features = {}

### FM1: Finding Constant & Quasi-Constant Features with `VarianceThreshold`

- `VarianceThreshold`:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html

- Constant Features have the same value for every observation.
- Quasi-Constant Features have the same value for the vast majority of observations (e.g., 95-98%). 

- Using sklearn's VarianceThreshold with either `threshold=0.0` for constant features or `threshold=0.01` for quasi-constant
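
A quick way to see what "quasi-constant" means is to measure, for each column, the share of rows taken up by its most common value. The sketch below uses a small made-up DataFrame; `dominant_value_share`, `toy`, and the 0.95 cutoff are illustrative assumptions and are not part of the notebook's workflow.

```python
import pandas as pd

def dominant_value_share(df):
    """Return, per column, the fraction of rows taken up by the most common value."""
    return df.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Hypothetical toy data: one constant, one quasi-constant, and one varied column
toy = pd.DataFrame({
    "constant": [1.0] * 100,
    "quasi_constant": [0.0] * 97 + [1.0] * 3,
    "varied": [i / 99 for i in range(100)],
})

shares = dominant_value_share(toy)

# Flag columns where a single value covers at least 95% of the rows
quasi_constant_cols = shares[shares >= 0.95].index.tolist()
print(shares)
print(quasi_constant_cols)
```

Note that a variance cutoff and a "share of rows" cutoff are not the same thing. For a 0/1 feature the variance is `p*(1-p)`, so `threshold=0.01` only flags features where one value covers roughly 99% of the rows, and the result also depends on how the features were scaled.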

In [None]:
from sklearn.feature_selection import VarianceThreshold

In [None]:
## checking for constant-features
selector = VarianceThreshold(threshold=0.00)
selector.fit(X_train_sm)

In [None]:
## get support returns true/false for keeping features
keep_features = selector.get_support()
print(keep_features.sum())
## compare against the data the selector was fit on
keep_features.sum()==len(X_train_sm.columns)

> No constant-features found in dataset. Check for quasi-constant (threshold=0.01)


In [None]:
## checking for quasi-constant features
selector = VarianceThreshold(threshold=0.01)
selector.fit(X_train_sm)

In [None]:
## get support returns true/false for keeping features
keep_features = selector.get_support()
print(keep_features.sum())

In [None]:
keep_features.shape, X_train_sm.shape

In [None]:
X_train_sel = X_train_sm.loc[:,keep_features]
X_test_sel = X_test_df.loc[:,keep_features]
X_train_sel

In [None]:
# tic = time() #timing!
svc_linear = SVC(kernel='linear',C=1)
pf.fit_and_time_model(svc_linear,X_train_sel,y_train_sm,X_test_sel,y_test)

In [None]:
## save to dict
selected_features['variance'] = keep_features

### FM2: Using Correlation to identify & remove highly-correlated features

In [None]:
def get_list_of_corrs(df,drop=None,
                      cutoff=0.75,only_above_cutoff=False,
                     sort_by_col=False):
    """Get dataframe of correlated features, with the option to only show the
    features with absolute correlations > cutoff"""
    ## Avoid a mutable default argument
    if drop is None:
        drop = []

    ## Calculate correlation (numeric columns only) and convert to 3-column table.
    corr_df = df.drop(columns=drop).select_dtypes('number').corr().unstack().reset_index()
    
    ## Remove self-correlations
    corr_df = corr_df.loc[ corr_df['level_0']!=corr_df['level_1']].copy()
    
    ## Make one column with an order-independent pair name and drop duplicate pairs
    corr_df['columns'] = corr_df.apply(
        lambda row: '_'.join(sorted(row[['level_0','level_1']])), axis=1)
    corr_df = corr_df.drop_duplicates(subset=['columns'])
    
    ## Rename Columns
    corr_df = corr_df.rename({0:'r','level_0':'Column1',
               'level_1':'Column2'},axis=1)

    ## Check if above cutoff (absolute value, so strong negative correlations count)
    corr_df['above_cutoff'] = corr_df['r'].abs() > cutoff
 
    ## Sort by col or by r-value
    if sort_by_col:
        corr_df = corr_df.sort_values( ['Column1','Column2'],ascending=True)
    else:
        corr_df =  corr_df.sort_values('r',ascending=False)
        
    ## Return only those above cutoff
    if only_above_cutoff:
        corr_df = corr_df[corr_df['above_cutoff']]
        
    ## Reset Index for Aesthetics (reset_index is not in-place, so assign the result)
    corr_df = corr_df.reset_index(drop=True)
    return corr_df.round(2)

In [None]:
# corr_df = get_list_of_corrs(df,cutoff=0.75, only_above_cutoff=True)
# corr_df.head()

> As with our Linear Regression, we would want to remove features that are highly multicollinear (pairs with an absolute correlation above roughly 0.7-0.8), keeping only one feature from each highly correlated pair.
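
One common way to act on this is to scan the upper triangle of the absolute correlation matrix and drop the later column of every pair above the cutoff. The sketch below runs on synthetic data; `find_correlated_columns` and `toy` are hypothetical names, and this is only one of several reasonable ways to pick which column of a pair to drop.

```python
import numpy as np
import pandas as pd

def find_correlated_columns(df, cutoff=0.75):
    """Return columns whose absolute correlation with an earlier column exceeds cutoff."""
    corr = df.corr().abs()
    # Keep only the upper triangle (above the diagonal) so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > cutoff).any()]

# Hypothetical toy data: "a_copy" is nearly identical to "a", "b" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

to_drop = find_correlated_columns(toy, cutoff=0.75)
toy_reduced = toy.drop(columns=to_drop)
print(to_drop)
```

In a train/test workflow, the list of columns to drop would be found on the training data only and then applied unchanged to the test data.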

### FM3: Using Mutual Information


- [Wikipedia: Mutual Information](https://en.wikipedia.org/wiki/Mutual_information)

<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/7f3385f4d779f062696c134223b5683e754a6f1c"> 

- Mutual Information represents how much we can learn about the target from our features. 
    - The higher the value for mi the more information a feature contains about the target.
    - We want to keep features with the highest mutual information with the target.
    
    
- How many features to keep is somewhat arbitrary.
    - Can use `SelectKBest` to select top `K` m.i. features 
    - Can use `SelectPercentile` to select top %
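
The notebook's cells use `SelectKBest`; for comparison, here is a minimal sketch of the `SelectPercentile` option on synthetic data. The dataset, the column names, and the choice of `percentile=25` are assumptions made purely for illustration.

```python
from functools import partial

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

# Hypothetical toy data: 20 features, only 4 of which carry signal
X_arr, y_toy = make_classification(n_samples=300, n_features=20, n_informative=4,
                                   n_redundant=0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(20)])

# Fix the random_state of the MI estimator so the scores are reproducible
score_func = partial(mutual_info_classif, random_state=0)

# Keep the top 25% of features ranked by mutual information with the target
pct_selector = SelectPercentile(score_func, percentile=25).fit(X_toy, y_toy)
top_pct_columns = X_toy.columns[pct_selector.get_support()]
print(list(top_pct_columns))
```

Mutual information is estimated with a nearest-neighbor method that has a random component, so passing a fixed `random_state` keeps the ranking stable between runs.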

In [None]:
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile

In [None]:
mi = mutual_info_classif(X_train_sm, y_train_sm)
mi[:5]

In [None]:
## Make a series so we can see which features
mi_scores = pd.Series(mi,index=X_train_sm.columns).sort_values(ascending=False)
mi_scores = mi_scores.to_frame('MI')
mi_scores

> Must choose to select top # or percentile of features to keep

In [None]:
k = 200
top_k_selector = SelectKBest(mutual_info_classif,k=k).fit(X_train_sm,y_train_sm)
top_k_columns = X_train_sm.columns[top_k_selector.get_support()]
top_k_columns

In [None]:
X_train_sel = X_train_sm.loc[:,top_k_columns]
X_test_sel = X_test_df.loc[:,top_k_columns]
X_train_sel

In [None]:
svc_linear = SVC(kernel='linear',C=1)
pf.fit_and_time_model(svc_linear,X_train_sel,y_train_sm,X_test_sel,y_test)

#### FM4: Univariate Models 

- While not demo'd in this notebook, an example univariate modeling approach would be to take the features of a house one at a time and fit a separate simple linear regression model for each. 
- Then, select the top K features whose models had the best performance (R-Squared).
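
Although the notebook does not run this on the Parkinson's data, the idea can be sketched on synthetic regression data: fit one single-feature model per column, score it with cross-validated R-Squared, and keep the best K. The data, column names, and `K=3` below are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical toy data standing in for "features of a house"
X_arr, y_toy = make_regression(n_samples=200, n_features=8, n_informative=3,
                               noise=5.0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

# Fit one simple linear regression per feature and record its mean CV R-Squared
r2_by_feature = {}
for col in X_toy.columns:
    scores = cross_val_score(LinearRegression(), X_toy[[col]], y_toy,
                             cv=5, scoring="r2")
    r2_by_feature[col] = scores.mean()

# Rank features by score and keep the top K
r2_scores = pd.Series(r2_by_feature).sort_values(ascending=False)
top_k_univariate = r2_scores.head(3).index.tolist()
print(r2_scores)
print(top_k_univariate)
```

Because each feature is scored alone, this approach cannot detect features that are only useful in combination with others.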

## Wrapper Methods Applied

- Overall Wrapper Methods Process:
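    - Use a specific classifier to select the optimal subset of features. 
    - The general approach is to fit many models in sequence, adding or removing one feature at a time and scoring the performance of each resulting subset. 
    - Stepwise selection is a greedy search: it keeps the best single change at each step rather than trying every combination (only exhaustive search tries all options).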

___

- Example Wrapper Methods (used below):
    1. Stepwise Forward Selection 
    2. Stepwise Backward Selection/Recursive Feature Elimination. 
    
    3. Exhaustive Feature Selection
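
To make the stepwise process concrete, here is a hand-rolled sketch of greedy forward selection on synthetic data. It is not how `mlxtend` implements it; `forward_select` is a hypothetical helper that only illustrates the add-one-feature-and-rescore loop.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_select(model, X, y, n_features, cv=3, scoring="recall"):
    """Greedy forward selection: add the single best-scoring feature each round."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        # Score every candidate subset formed by adding one remaining feature
        scores = {
            f: cross_val_score(model, X[:, selected + [f]], y,
                               cv=cv, scoring=scoring).mean()
            for f in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy data: 8 features, 3 informative
X_toy, y_toy = make_classification(n_samples=200, n_features=8, n_informative=3,
                                   n_redundant=0, random_state=0)
chosen = forward_select(SVC(kernel="linear", C=1), X_toy, y_toy, n_features=3)
print(chosen)
```

Selecting 3 of 8 features this way already requires 8 + 7 + 6 = 21 cross-validated fits, which is why wrapper methods become expensive on wide datasets.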

### WM1&2: Stepwise Forward/Backwards Selection with `mlxtend`'s `SequentialFeatureSelector`

- [mlxtend Sequential Feature Selector](http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/)

In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector

- To use SFS, you must:
    1. Make an instance of the model you wish to optimize for (e.g. Random Forests, SVC).
    2. Choose a stopping criterion (e.g. select 10 features).
    3. Specify whether to step forward or backward. 
    4. Choose the evaluation metric to use.
    5. Choose the cross-validation strategy (number of folds).

In [None]:
# svc_linear = SVC(kernel='linear',C=1)
sfs = SequentialFeatureSelector( SVC(kernel='linear',C=1), k_features=25,
                               forward=True, floating=True,
                                verbose=2, cv=2,
                                n_jobs=-1)
sfs.fit(X_train_sm,y_train_sm)

In [None]:
sfs.k_feature_idx_

In [None]:
## Save to a new name so the `selected_features` dict from earlier is not overwritten
sfs_features = list(sfs.k_feature_names_)
sfs_features

In [None]:
X_train_sel = X_train_sm[sfs_features]
X_test_sel = X_test_df[sfs_features]
X_train_sel

In [None]:
svc_linear = SVC(kernel='linear',C=1)
pf.fit_and_time_model(svc_linear,X_train_sel,y_train_sm,X_test_sel,y_test)

### WM3: Exhaustive Feature Selection

In [None]:
from mlxtend.feature_selection import ExhaustiveFeatureSelector
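
The notebook stops at the import, so the sketch below illustrates the idea behind exhaustive selection without `mlxtend`: score every feature subset up to a given size and keep the best one. It uses `itertools.combinations` on a tiny synthetic dataset; the data and the maximum subset size of 2 are assumptions.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical toy data, deliberately tiny: 5 features
X_toy, y_toy = make_classification(n_samples=150, n_features=5, n_informative=2,
                                   n_redundant=0, random_state=0)

# Score every subset of size 1 and 2 with cross-validation
results = {}
for size in (1, 2):
    for subset in combinations(range(X_toy.shape[1]), size):
        score = cross_val_score(SVC(kernel="linear", C=1),
                                X_toy[:, list(subset)], y_toy, cv=3).mean()
        results[subset] = score

best_subset = max(results, key=results.get)
print(len(results), best_subset)
```

Even here there are 5 + 10 = 15 subsets to score. The number of subsets grows combinatorially with the number of features, so exhaustive search is only practical after other methods have narrowed the candidates down to a handful.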

## Using Embedded Methods

- Overall Embedded Methods Process:
    1. Train a machine learning model (Feature selection performed during the model's training. )
    2. Derive feature importance from the model
    3. Remove non-important features.
___

- Example Embedded Methods (used below):
    1. Regression Coefficients 
    2. Tree importance 
    3. LASSO/L1-Regularization
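
The cells below rely on `SelectFromModel` with its default `threshold`, so it helps to know what that default does. Per the scikit-learn documentation, the cutoff is the mean of the absolute importances unless the estimator is L1-penalized, in which case a tiny cutoff of `1e-5` is used (keep anything not shrunk to zero). A minimal sketch on synthetic data, with hypothetical variable names:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 15 features, 4 informative
X_toy, y_toy = make_classification(n_samples=300, n_features=15, n_informative=4,
                                   n_redundant=0, random_state=0)

# Default (L2) logistic regression: features are kept if |coef| >= mean(|coef|)
l2_selector = SelectFromModel(LogisticRegression(max_iter=1000)).fit(X_toy, y_toy)
mean_abs_coef = np.abs(l2_selector.estimator_.coef_).mean()
print(l2_selector.threshold_, mean_abs_coef)
print("kept with L2:", l2_selector.get_support().sum())

# L1-penalized logistic regression: coefficients shrunk to zero are dropped
l1_selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
).fit(X_toy, y_toy)
print(l1_selector.threshold_)
print("kept with L1:", l1_selector.get_support().sum())
```

The `threshold` argument can also be set explicitly (for example `threshold="median"` or a number) when the default keeps too many or too few features.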


### LogisticRegression Coefficients

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression,LogisticRegressionCV
from sklearn.feature_selection import SelectFromModel

In [None]:
log_reg = LogisticRegression(C=1e12)
pf.fit_and_time_model(log_reg,*train_test_list)

In [None]:
selector = SelectFromModel(log_reg).fit(X_train_sm,y_train_sm)
selector

In [None]:
logreg_features = selector.get_support()
X_train_sm.columns[logreg_features]

In [None]:
coeffs = pd.Series(selector.estimator_.coef_.flatten(),
                   index=X_train_sm.columns)
coeffs[logreg_features]

In [None]:
X_train_sel = X_train_sm.loc[:,logreg_features]
X_test_sel = X_test_df.loc[:,logreg_features]
X_train_sel

In [None]:
svc_linear = SVC(kernel='linear',C=1)
pf.fit_and_time_model(svc_linear,X_train_sel,y_train_sm,X_test_sel,y_test)

### LogisticRegression Coefficients  With Lasso/L1 Reg

In [None]:
l1_reg = LogisticRegression(C=0.5,penalty='l1',solver='liblinear')
pf.fit_and_time_model(l1_reg,*train_test_list)

In [None]:
selector = SelectFromModel(l1_reg)
selector.fit(X_train_sm,y_train_sm)
lasso_features = selector.get_support()
lasso_features.sum()

In [None]:
X_train_sm.columns[lasso_features]

In [None]:
X_train_sel = X_train_sm.loc[:,lasso_features]
X_test_sel = X_test_df.loc[:,lasso_features]
X_train_sel

In [None]:
svc_linear = SVC(kernel='linear',C=1)
pf.fit_and_time_model(svc_linear,X_train_sel,y_train_sm,X_test_sel,y_test)

### Tree Importance

In [None]:
rf = RandomForestClassifier()
# pf.fit_and_time_model(rf,*train_test_list)
# importances = pf.get_importance(rf,X_train_df)
# importances.sort_values(ascending=False)

In [None]:
selector = SelectFromModel(rf).fit(X_train_sm,y_train_sm)
rf_features = selector.get_support()
rf_features.sum()

In [None]:
X_train_sel = X_train_sm.loc[:,rf_features]
X_test_sel = X_test_df.loc[:,rf_features]
X_train_sel

In [None]:
svc_linear = SVC(kernel='linear',C=1)
pf.fit_and_time_model(svc_linear,X_train_sel,y_train_sm,X_test_sel,y_test)

# ⭐️**Cross Validation**⭐️

#### Cross Validation Workflow: 

1. Train/Test split
2. Create a model. 
3. Apply Cross Validation with the training data (`GridSearchCV`,`cross_validate`,`cross_val_score`,`cross_val_predict`) to assess your model/hyperparameter choices.

4. Evaluate Cross validation scores
    - If not happy with the scores:
        - Try different model/hyperparameters.
    - If happy with scores/performance:
        - Train an **individual** model (not-cv), or take the grid search's `.best_estimator_`, fit on the **training data** and **evaluate with the test data.**



5. If individual model performs well on test data (isn't overfit) **and you are planning to deploy the model:** 
    - You would **re-train the model** on the **entire combined data set** (X =X_train+X_test, y=y_train+y_test) before pickling/saving the model.
    
    
>- ***Note: step 5 is intended for deploying models and is not required.***
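
The workflow above can be sketched end to end with `GridSearchCV` on synthetic data. The parameter grid, the variable names, and the toy dataset are assumptions for illustration; they are not the settings used for the Parkinson's data.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Step 1: train/test split (hypothetical toy data)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=4,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy,
                                          random_state=0)

# Steps 2-3: create a model and cross-validate hyperparameter choices on training data
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
grid = GridSearchCV(SVC(), param_grid, scoring="recall", cv=3)
grid.fit(X_tr, y_tr)

# Step 4: inspect the CV scores, then evaluate the best model on the test data.
# With the default refit=True, best_estimator_ is already refit on all of X_tr.
print(grid.best_params_, round(grid.best_score_, 3))
best_model = grid.best_estimator_
test_recall = metrics.recall_score(y_te, best_model.predict(X_te))
print(round(test_recall, 3))

# Step 5 (optional, deployment only): re-train on the entire dataset
final_model = SVC(**grid.best_params_).fit(X_toy, y_toy)
```

After step 5 there is no held-out data left, so the test score from step 4 is the last honest estimate of how the deployed model is expected to perform.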

## 3 Different Ways of Cross-Validating

In [None]:
from sklearn.model_selection import cross_validate,cross_val_score,cross_val_predict

- Cross validation **functions** from sklearn.model_selection:
    - `cross_validate`: 
        - returns a dict with one entry per fold for the fit times, score times, and validation scores (`test_score`); training scores are included only if `return_train_score=True`.
    - `cross_val_score`:
        - returns only the validation scores for each of the K folds' held-out splits.
        
    - `cross_val_predict`:
        - returns out-of-fold predictions: each row is predicted by a model that did not see that row during training.

In [None]:
X_train_final = X_train_sm.loc[:,logreg_features]
X_test_final = X_test_df.loc[:,logreg_features]
display(X_train_final.head(),X_test_final.head())

In [None]:
## make an instance of a model
clf = SVC(kernel='linear',C=1)

In [None]:
## cross_validate  returns scores and times
cv_results = cross_validate(clf,X_train_final,y_train_sm,scoring='recall')
cv_results

In [None]:
## cross_val_score returns only the scores
cv_score = cross_val_score(clf,X_train_final,y_train_sm,scoring='recall')
cv_score

In [None]:
## cross_val_predict returns predictions that can be used to validate
y_hat_train_cv = cross_val_predict(clf,X_train_final,y_train_sm)
print(metrics.classification_report(y_train_sm, y_hat_train_cv))

In [None]:
## If happy with results, train an individual model and evaluate with test data
clf = SVC(kernel='linear',C=1)
pf.fit_and_time_model(clf,X_train_final,y_train_sm,X_test_final, y_test,scoring='recall')

In [None]:
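## If happy with train/test split results, can re-train model on entire dataset
## (fresh model, same selected features as X_train_final)
X_tf = preprocessor.fit_transform(X)[:,logreg_features]
final_model = SVC(kernel='linear',C=1)
final_model.fit(X_tf,y)
## NOTE: these scores are on data the model was trained on, so they are optimistic
pf.evaluate_classification(final_model,X_tf,y)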

# ⭐️**Saving Models**⭐️

- Guide on Saving Models: 
    - https://scikit-learn.org/stable/modules/model_persistence.html

### With `Pickle`

In [None]:
import pickle
## use a context manager so the file is closed after writing
with open('best_model.pickle','wb') as f:
    pickle.dump(clf,f)
# s = pickle.dumps(clf)
# type(s)

In [None]:
with open('best_model.pickle','rb') as f:
    loaded_pickle = pickle.load(f)
loaded_pickle

In [None]:
pf.evaluate_classification(loaded_pickle,X_test_final,y_test)

### With `joblib` (sklearn's preferred method)

In [None]:
import joblib
joblib.dump(clf, 'best_model.joblib') 

In [None]:
clf_jb = joblib.load('best_model.joblib')
clf_jb

In [None]:
pf.evaluate_classification(clf_jb,X_test_final,y_test)
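
The saved `clf` above expects data that has already been scaled and filtered, so whoever loads it must repeat those steps. One possible alternative, sketched below on synthetic data, is to wrap the preprocessing and the model in a single `Pipeline` and save that instead. The pipeline steps, file name, and toy data are assumptions and do not reproduce the notebook's preprocessor.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Hypothetical toy data standing in for the raw (unscaled) features
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

# Bundle scaling and the classifier so both are saved together
model_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("clf", SVC(kernel="linear", C=1)),
]).fit(X_toy, y_toy)

# Save to a temporary folder and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_pipeline.joblib")
    joblib.dump(model_pipe, path)
    loaded_pipe = joblib.load(path)

# The loaded pipeline accepts raw features and gives identical predictions
same_predictions = (loaded_pipe.predict(X_toy) == model_pipe.predict(X_toy)).all()
print(same_predictions)
```

As the scikit-learn persistence guide notes, pickle and joblib files should only be loaded from sources you trust, and ideally with the same library versions that were used to save them.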

# Conclusions

- There are many different ways to select features for your models, each with advantages & disadvantages.
- How much you need to worry about feature selection depends on the size of your dataset and the number of features.


___

# APPENDIX


## INTERPRET

In [None]:
# from mlxtend 

## CONCLUSIONS & RECOMMENDATIONS

> Summarize your conclusions and bullet-point your list of recommendations, which are based on your modeling results.

## Grouping Features

In [None]:
## Defining Clusters of related columns for EDA/preprocessing

feature_types = dict(patient_info = ['id','gender'], 
     baseline = ['Jitter', 'Shimmer','Harmonicity', 'RPDE','DFA',"PPE"],
     time_frequency = ['intensity'], 
     mel_spectrogram = ['MFCC'],
     tqwt = ['tqwt'])

feature_types

In [None]:
## Quick test filter for stub names
# list(filter(lambda x: 'intensity' in x.lower(),df.columns))

In [None]:
def make_feature_dict(df,feature_types):
    """Finds column names by recognizing name stubs (partial col names)
    
    Args:
        df (Frame): df with columns to filter.
        feature_types (dict): dict with category of features as the first key
        and a list of stub names of columns that belong to that category.
        
    Returns:
        feature_cols: dict of filtered columns grouped by "feature_types" keys.
        all_columns: list of all filtered columns without grouping.
        
        
    EXAMPLE USAGE:
    >>  feature_types = dict(patient_info = ['id','gender'], 
                        time_frequency = ['intensity'],
                        baseline = ['Jitter','Harmonicity'])
    >> feature_cols ,all_cols = make_feature_dict(df,feature_types)
    >> feature_cols
    ## RETURNS: 
    {'patient_info': ['id', 'gender'],
     'time_frequency': ['minIntensity', 'maxIntensity', 'meanIntensity'],
     'baseline': ['locPctJitter',
      'locAbsJitter',
      'rapJitter',
      'ppq5Jitter',
      'ddpJitter',
      'meanAutoCorrHarmonicity',
      'meanNoiseToHarmHarmonicity',
      'meanHarmToNoiseHarmonicity']}
        """
    ## create empty dict to fill in related features and empty list for all cols
    feature_cols = {}
    all_columns= []
    
    ## For each feature type and the list of stub names
    for feat_type, name_list in feature_types.items():

        ## Make a list to collect the columns matched by any stub of this type
        curr_type_cols = []
        
        ## For each name stub
        for name in name_list:
            ## Get all columns containing stub (case-insensitive)
            cols = [c for c in df.columns if name.lower() in c.lower()]
            
            ## Add cols to both current type and all columns
            curr_type_cols.extend(cols)
            all_columns.extend(cols)
            
        ## save list of columns under feature_type (once per type)
        feature_cols[feat_type] = curr_type_cols
            
    return feature_cols, all_columns


In [None]:
## Saving dict of all identified clusters of features
feature_cols,filtered_cols = make_feature_dict(df,feature_types)
feature_cols.keys()

In [None]:
## How many cols grabbed by function
len(filtered_cols)

In [None]:
## testing feat_cols dict
feature_cols['baseline']

### BOOKMARK FOR LATER: Sorting out remaining cols to group/filter

In [None]:
# df_unmatched = df.drop(columns=filtered_cols)
# df_unmatched.info()

### RandomForest

In [None]:
# rf = RandomForestClassifier()
# pf.fit_and_time_model(rf,*train_test_list)

In [None]:
# pf.get_importance(rf,X_test_df,top_n=100);