__mlmachine - Part 4: Feature Selection__

1. [Crowd-Sourced Feature Importance Estimation](#Crowd-Sourced-Feature-Importance-Estimation)
    1. [Catalog of Techniques](#Catalog-of-Techniques)
1. [Feature Selection Through Iterative Cross-validation](#Feature-Selection-Through-Iterative-Cross-validation)


In [None]:
# standard library and settings
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, KBinsDiscretizer, RobustScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from category_encoders import WOEEncoder, TargetEncoder, CatBoostEncoder
from xgboost import XGBClassifier
import mlmachine as mlm
from mlmachine.data import titanic
from mlmachine.features.preprocessing import (
    DataFrameSelector,
    PandasTransformer,
    PandasFeatureUnion,
    GroupbyImputer,
    KFoldEncoder,
    DualTransformer,
)

# Crowd-Sourced Feature Importance Estimation

<a id = 'Crowd-Sourced-Feature-Importance-Estimation'></a>

## Catalog of Techniques
---
<br><br>
Here is a non-exhaustive list of feature importance estimation techniques:
- Tree-based Feature Importance
- Recursive Feature Elimination
- Sequential Forward Selection
- Sequential Backward Selection
- F-value / p-value
- Variance 
- Target Correlation
<br><br>

These techniques are spread across several different libraries. Ideally, we use all of them (where applicable) to get a broad understanding of the role each feature plays in a machine learning problem. Running each one by hand is a cumbersome series of tasks.
<br><br>

Even if we took the time to execute each method, the results would end up scattered across separate variables, making a holistic assessment tedious to compile.
<br><br>
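
To make that overhead concrete, here is a minimal sketch of compiling a few of the techniques above by hand with scikit-learn and pandas. It does not use mlmachine; the synthetic dataset, the `feat_*` column names, and the `manual_summary` variable are assumptions made purely for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, f_classif
from sklearn.linear_model import LogisticRegression

# synthetic stand-in data (not the Titanic dataset); column names are hypothetical
X_arr, y_arr = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=1, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])
y = pd.Series(y_arr, name="target")

# tree-based feature importance
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# recursive feature elimination; a rank of 1 marks the last feature standing
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1).fit(X, y)

# univariate F-test, returning an F-value and a p-value per feature
f_values, p_values = f_classif(X, y)

# every technique returns its own object, so we stitch the pieces together manually
manual_summary = pd.DataFrame(
    {
        "rf_importance": rf.feature_importances_,
        "rfe_rank": rfe.ranking_,
        "f_value": f_values,
        "p_value": p_values,
        "variance": X.var().values,
        "target_corr": X.corrwith(y).abs().values,
    },
    index=X.columns,
)
print(manual_summary.round(3))
```

Even this short version covers only one estimator per technique and leaves out sequential selection entirely, and each column sits on a different scale, so comparing them still takes extra work.
<br><br>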

mlmachine's `FeatureSelector` class makes it easy to run all of the feature importance estimation techniques listed above. Further, we can do this for a variety of estimators simultaneously. Let's see mlmachine in action. 
<br><br>

First, we apply data preprocessing techniques to clean up our data. For those who have read parts 1, 2, and 3 of this series, nothing in the code block below is new:
<br><br>

<a id = 'Catalog-of-Techniques'></a>

In [None]:
import mlmachine as mlm
from mlmachine.data import titanic

df_train, df_valid = titanic()

ordinal_encodings = {"Pclass": [1, 2, 3]}

mlmachine_titanic_train = mlm.Machine(
    data=df_train,
    target="Survived",
    remove_features=["PassengerId","Ticket","Name","Cabin"],
    identify_as_continuous=["Age","Fare"],
    identify_as_count=["Parch","SibSp"],
    identify_as_nominal=["Embarked"],
    identify_as_ordinal=["Pclass"],
    ordinal_encodings=ordinal_encodings,
    is_classification=True,
)

mlmachine_titanic_valid = mlm.Machine(
    data=df_valid,
    remove_features=["PassengerId","Ticket","Name","Cabin"],
    identify_as_continuous=["Age","Fare"],
    identify_as_count=["Parch","SibSp"],
    identify_as_nominal=["Embarked"],
    identify_as_ordinal=["Pclass"],
    ordinal_encodings=ordinal_encodings,
    is_classification=True,
)

### impute pipeline
impute_pipe = PandasFeatureUnion([
    ("age", make_pipeline(
        DataFrameSelector(include_columns=["Age","SibSp"]),
        GroupbyImputer(null_column="Age", groupby_column="SibSp", strategy="mean")
    )),
    ("fare", make_pipeline(
        DataFrameSelector(include_columns=["Fare","Pclass"]),
        GroupbyImputer(null_column="Fare", groupby_column="Pclass", strategy="mean")
    )),
    ("embarked", make_pipeline(
        DataFrameSelector(include_columns=["Embarked"]),
        PandasTransformer(SimpleImputer(strategy="most_frequent"))
    )),
    ("diff", make_pipeline(
        DataFrameSelector(exclude_columns=["Age","Fare","Embarked"])
    )),
])

mlmachine_titanic_train.data = impute_pipe.fit_transform(mlmachine_titanic_train.data)
mlmachine_titanic_valid.data = impute_pipe.transform(mlmachine_titanic_valid.data)

### encode & bin pipeline
encode_pipe = PandasFeatureUnion([
    ("nominal", make_pipeline(
        DataFrameSelector(include_columns=mlmachine_titanic_train.data.mlm_dtypes["nominal"]),
        PandasTransformer(OneHotEncoder(drop="first")),
    )),
    ("ordinal", make_pipeline(
        DataFrameSelector(include_columns=list(ordinal_encodings.keys())),
        PandasTransformer(OrdinalEncoder(categories=list(ordinal_encodings.values()))),
    )),
    ("bin", make_pipeline(
        DataFrameSelector(include_columns=mlmachine_titanic_train.data.mlm_dtypes["continuous"]),
        PandasTransformer(KBinsDiscretizer(encode="ordinal")),
    )),
    ("diff", make_pipeline(
        DataFrameSelector(exclude_columns=mlmachine_titanic_train.data.mlm_dtypes["nominal"] + list(ordinal_encodings.keys())),
    )),
])

mlmachine_titanic_train.data = encode_pipe.fit_transform(mlmachine_titanic_train.data)
# transform only: reuse the encoders and bin edges learned from the training data
mlmachine_titanic_valid.data = encode_pipe.transform(mlmachine_titanic_valid.data)

mlmachine_titanic_train.update_dtypes()
mlmachine_titanic_valid.update_dtypes()

### target encode pipeline
target_encode_pipe = PandasFeatureUnion([
    ("target", make_pipeline(
        DataFrameSelector(include_mlm_dtypes=["category"]), 
        KFoldEncoder(
            target=mlmachine_titanic_train.target,
            cv=KFold(n_splits=5, shuffle=True, random_state=0),
            encoder=TargetEncoder,
        ),
    )),
    ("woe", make_pipeline(
        DataFrameSelector(include_mlm_dtypes=["category"]),
        KFoldEncoder(
            target=mlmachine_titanic_train.target,
            cv=KFold(n_splits=5, shuffle=False),
            encoder=WOEEncoder,
        ),
    )),
    ("catboost", make_pipeline(
        DataFrameSelector(include_mlm_dtypes=["category"]),
        KFoldEncoder(
            target=mlmachine_titanic_train.target,
            cv=KFold(n_splits=5, shuffle=False),
            encoder=CatBoostEncoder,
        ),
    )),
    ("diff", make_pipeline(
        DataFrameSelector(exclude_mlm_dtypes=["category"]),
    )),
])

mlmachine_titanic_train.data = target_encode_pipe.fit_transform(mlmachine_titanic_train.data)
mlmachine_titanic_valid.data = target_encode_pipe.transform(mlmachine_titanic_valid.data)

mlmachine_titanic_train.update_dtypes()
mlmachine_titanic_valid.update_dtypes()

### scale values
scale = PandasTransformer(RobustScaler())

mlmachine_titanic_train.data = scale.fit_transform(mlmachine_titanic_train.data)
mlmachine_titanic_valid.data = scale.transform(mlmachine_titanic_valid.data)

---
<br><br>
Our `DataFrame` has been imputed, encoded in a variety of ways, and has several new features. We're ready for `FeatureSelector`:

In [None]:
estimators = [
    LogisticRegression,
    XGBClassifier,
]

fs = mlmachine_titanic_train.FeatureSelector(
    data=mlmachine_titanic_train.data,
    target=mlmachine_titanic_train.target,
    estimators=estimators,
)
feature_selector_summary = fs.feature_selector_suite(
    sequential_scoring="accuracy",
    sequential_n_folds=0,
    save_to_csv=True,
)

In [None]:
feature_selector_summary

In [None]:
rf2 = RandomForestClassifier(max_depth=2)
rf4 = RandomForestClassifier(max_depth=4)
rf6 = RandomForestClassifier(max_depth=6)

estimators = [
    RandomForestClassifier,
    rf2,
    rf4,
    rf6,
]

fs = mlmachine_titanic_train.FeatureSelector(
    data=mlmachine_titanic_train.data,
    target=mlmachine_titanic_train.target,
    estimators=estimators,
)
feature_selector_summary = fs.feature_selector_suite(
    sequential_scoring="roc_auc",
    sequential_n_folds=0,
    save_to_csv=True,
)

In [None]:
feature_selector_summary

# Feature Selection Through Iterative Cross-validation

<a id = 'Feature-Selection-Through-Iterative-Cross-validation'></a>


In [None]:
feature_selector_summary = fs.feature_selector_stats(feature_selector_summary)
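
One way to think about condensing a table of per-technique results is rank aggregation: convert each column to a rank, then summarize the ranks row by row. The sketch below illustrates that idea with plain pandas. It is not mlmachine's implementation of `feature_selector_stats`, and the feature names, scores, and summary column names are all made up for illustration.

```python
import pandas as pd

# made-up scores from three techniques; higher is better except for p_value
raw = pd.DataFrame(
    {
        "rf_importance": [0.40, 0.35, 0.15, 0.10],
        "f_value": [50.0, 80.0, 5.0, 1.0],
        "p_value": [0.001, 0.0001, 0.03, 0.40],
    },
    index=["feat_a", "feat_b", "feat_c", "feat_d"],
)

# put every technique on the same scale: rank 1 = most important
ranks = pd.DataFrame(
    {
        "rf_importance": raw["rf_importance"].rank(ascending=False),
        "f_value": raw["f_value"].rank(ascending=False),
        "p_value": raw["p_value"].rank(ascending=True),
    }
)

# summarize the ranks per feature and sort by average rank, best first
rank_stats = ranks.assign(
    average=ranks.mean(axis=1),
    stdev=ranks.std(axis=1),
    best=ranks.min(axis=1),
    worst=ranks.max(axis=1),
).sort_values("average")
print(rank_stats)
```

Working in ranks sidesteps the problem that importances, F-values, and p-values live on incompatible scales, at the cost of discarding how large the gaps between features are.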

In [None]:
cv_summary = fs.feature_selector_cross_val(
    feature_selector_summary=feature_selector_summary,
    estimators=estimators,
    scoring="accuracy",
    n_folds=5,
    step=1,
    n_jobs=4,
    save_to_csv=True,
)
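
For intuition, the general idea of iterative cross-validation can be sketched with scikit-learn alone: start from a ranked list of features, drop the lowest-ranked ones `step` at a time, and cross-validate after each drop. This is a simplified illustration under our own assumptions (synthetic data, a ranking based on absolute target correlation, a single estimator), not a description of how `feature_selector_cross_val` works internally.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in data; column names are hypothetical
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

# stand-in ranking, best first: absolute correlation with the target
ranked_features = (
    X.corrwith(pd.Series(y)).abs().sort_values(ascending=False).index.tolist()
)

step = 1  # number of features removed per iteration
rows = []
n_dropped = 0
while n_dropped < len(ranked_features):
    # keep everything except the n_dropped lowest-ranked features
    keep = ranked_features[: len(ranked_features) - n_dropped]
    scores = cross_val_score(
        LogisticRegression(max_iter=1000), X[keep], y, cv=5, scoring="accuracy"
    )
    rows.append(
        {
            "features_dropped": n_dropped,
            "n_features": len(keep),
            "mean_accuracy": scores.mean(),
        }
    )
    n_dropped += step

cv_sketch = pd.DataFrame(rows)
best = cv_sketch.loc[cv_sketch["mean_accuracy"].idxmax()]
print(cv_sketch)
print(f"best subset size: {int(best['n_features'])}")
```

Plotting `mean_accuracy` against `features_dropped` shows where removing features stops helping and starts hurting.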

In [None]:
fs.feature_selector_results_plot(
    scoring="accuracy_score",
    cv_summary=cv_summary,
    feature_selector_summary=feature_selector_summary,
    title_scale=0.8,
    marker_on=True,
)

In [None]:
cross_val_features_df = fs.create_cross_val_features_df(
    scoring="accuracy_score",
    cv_summary=cv_summary,
    feature_selector_summary=feature_selector_summary,
)

In [None]:
cross_val_features_df

In [None]:
cross_val_features_dict = fs.create_cross_val_features_dict(
    scoring="accuracy_score",
    cv_summary=cv_summary,
    feature_selector_summary=feature_selector_summary,
)

In [None]:
cross_val_features_dict