## Feature Selection

- Models can be made simpler, more efficient and even more effective by dropping uninformative features.
- There are two broad categories of feature selection methods, both usually supervised:
    - **Wrapper Feature Selection** (searches for well-performing subsets of features)
        - Essentially fits a model many times on different subsets of features and keeps the subset that scores best.
        - *RFE* (recursive feature elimination) is a common method of this type. Given an estimator that exposes feature importances, it fits the estimator, prunes the least important features, and refits, repeating until the desired number of features is left.
        
    - **Filter Feature Selection** (selects subsets of features based on statistical comparisons of input-features and target)
        - Correlation-type statistical measures are commonly used; the right one depends on the input and output data types. See [here](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/) for a flow chart.
        - Filter methods compare each feature to the target one at a time, so interactions between features may be missed.
        
- Some models also produce feature importances as an inherent output, decision tree models for example. These are referred to as intrinsic methods.

- Dimensionality reduction methods serve a similar purpose to feature selection methods, but should be considered as an alternative or performed after feature selection.
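
To make the wrapper idea concrete, here is a rough sketch of the loop RFE runs when it removes one feature per step. The synthetic dataset and the `manual_rfe` helper are assumptions made for illustration; they are not part of sklearn or of the notebook below.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): 10 features, 3 informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

def manual_rfe(X, y, n_keep):
    """Hypothetical helper: drop the least important feature until n_keep remain."""
    remaining = list(range(X.shape[1]))  # original column indices still in play
    while len(remaining) > n_keep:
        # Refit on the surviving columns only
        model = DecisionTreeClassifier(random_state=0).fit(X[:, remaining], y)
        # Position (within `remaining`) of the least important feature
        weakest = int(np.argmin(model.feature_importances_))
        remaining.pop(weakest)
    return remaining

kept = manual_rfe(X, y, n_keep=3)
print(kept)  # original column indices of the 3 surviving features
```

Because the importances come straight from the fitted tree, the same sketch also shows the intrinsic kind of importance mentioned above.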

### Filter Example
- ANOVA feature selection for numeric input and categorical output
- We'll use sklearn's `SelectKBest()` with `f_classif` (i.e. ANOVA) as our statistical test.
- Using a breast-cancer dataset from Kaggle. Features are physical characteristics of cell samples from tumor masses. We are predicting malignant vs benign cancers.
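
As a quick check on what the ANOVA test scores, the sketch below compares `f_classif` with SciPy's one-way ANOVA computed one feature at a time. The synthetic data is an assumption used only to keep the example self-contained.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# Synthetic two-class data (an assumption for illustration)
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# F-statistic and p-value for every feature in one call
f_scores, p_values = f_classif(X, y)

# The same statistic, one feature at a time: compare the feature's values
# in class 0 against its values in class 1
manual = np.array(
    [f_oneway(X[y == 0, j], X[y == 1, j]).statistic for j in range(X.shape[1])]
)

print(np.allclose(f_scores, manual))
```

`SelectKBest(f_classif, k=5)` then simply keeps the five features with the largest F-statistics, which is why each feature is judged independently of the others.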

In [1]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

In [2]:
bcdf = pd.read_csv('breast_cancer.csv')

In [3]:
bcdf.shape
# 33 columns: id, the target, 30 features, and one empty column

(569, 33)

In [4]:
bcdf['Unnamed: 32'].nunique() # 0 unique values: the column is entirely NaN

0

In [5]:
# drop the empty column and the row identifier
bcdf.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)

In [6]:
# feature/target slice
target = 'diagnosis'

X = bcdf.drop(target, axis=1)
y = bcdf[target]

In [7]:
y.value_counts(normalize=True)
# a bit unbalanced but ok for demo

B    0.627417
M    0.372583
Name: diagnosis, dtype: float64

In [8]:
# convert target from 'B'/'M' to 0/1
y = (y == 'M').astype(int)

In [9]:
# train test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [10]:
# Create and fit the selector; let's take the top 5 features
selector = SelectKBest(f_classif, k=5)
selector.fit(X_train, y_train)
# Get columns to keep and create new dataframe with those only
cols = selector.get_support(indices=True)
X_train_filt = X_train.iloc[:,cols]

In [11]:
print(X_train_filt.shape)

(381, 5)


In [12]:
X_train_filt.head()

|     | perimeter_mean | concave points_mean | radius_worst | perimeter_worst | concave points_worst |
| --- | --- | --- | --- | --- | --- |
| 172 | 102.5 | 0.1097 | 18.79 | 125.0 | 0.1827 |
| 407 | 82.63 | 0.01867 | 14.4 | 91.63 | 0.05601 |
| 56 | 125.5 | 0.08994 | 26.14 | 170.1 | 0.2091 |
| 497 | 80.45 | 0.02369 | 14.06 | 92.82 | 0.1053 |
| 301 | 80.43 | 0.03099 | 13.46 | 88.13 | 0.07625 |


### Wrapper Example
- Using sklearn's `RFE()` with `DecisionTreeClassifier` as the estimator
- Again keeping the best 5 features


In [13]:
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)

rfe.fit(X_train, y_train)
cols = rfe.get_support(indices=True)
X_train_wrap = X_train.iloc[:,cols]
X_train_wrap.head()

|     | texture_mean | concave points_mean | radius_worst | perimeter_worst | concavity_worst |
| --- | --- | --- | --- | --- | --- |
| 172 | 11.89 | 0.1097 | 18.79 | 125.0 | 0.583 |
| 407 | 21.37 | 0.01867 | 14.4 | 91.63 | 0.1838 |
| 56 | 18.57 | 0.08994 | 26.14 | 170.1 | 0.3879 |
| 497 | 17.31 | 0.02369 | 14.06 | 92.82 | 0.2028 |
| 301 | 19.89 | 0.03099 | 13.46 | 88.13 | 0.1904 |


- The filter and wrapper methods agree on 3 of their 5 selected features.
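
The overlap can be checked with a set intersection. The two lists below are copied from the column headers of the outputs above; the variable names are introduced here for illustration.

```python
# Columns kept by SelectKBest (filter) and by RFE (wrapper), copied from the outputs above
filter_cols = [
    "perimeter_mean", "concave points_mean", "radius_worst",
    "perimeter_worst", "concave points_worst",
]
wrapper_cols = [
    "texture_mean", "concave points_mean", "radius_worst",
    "perimeter_worst", "concavity_worst",
]

# Features chosen by both methods
shared = sorted(set(filter_cols) & set(wrapper_cols))
print(shared)
# ['concave points_mean', 'perimeter_worst', 'radius_worst']
```

Inside the notebook, `X_train_filt.columns.intersection(X_train_wrap.columns)` should give the same three names without retyping them.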

In [14]:
from sklearn.ensemble import RandomForestClassifier

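# baseline: random forest on all 30 features
# (no random_state is set here, so the score can vary slightly between runs)
rfc = RandomForestClassifier()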
rfc.fit(X_train, y_train)

pred = rfc.predict(X_test)

In [25]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
print(accuracy_score(y_test, pred))
print(classification_report(y_test, pred))

0.9574468085106383
              precision    recall  f1-score   support

           0       0.95      0.98      0.97       121
           1       0.97      0.91      0.94        67

    accuracy                           0.96       188
   macro avg       0.96      0.95      0.95       188
weighted avg       0.96      0.96      0.96       188



In [27]:
X_test_filt = X_test.loc[:,X_train_filt.columns]

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_filt, y_train)

pred_f = clf.predict(X_test_filt)

print(accuracy_score(y_test, pred_f))
print(classification_report(y_test, pred_f))

0.9468085106382979
              precision    recall  f1-score   support

           0       0.96      0.96      0.96       121
           1       0.93      0.93      0.93        67

    accuracy                           0.95       188
   macro avg       0.94      0.94      0.94       188
weighted avg       0.95      0.95      0.95       188



In [29]:
# alternative: X_test_wrap = rfe.transform(X_test)
# (by default this returns an array rather than a DataFrame)
X_test_wrap = X_test.loc[:,X_train_wrap.columns]

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_wrap, y_train)

pred_w = clf.predict(X_test_wrap)

print(accuracy_score(y_test, pred_w))
print(classification_report(y_test, pred_w))

0.9468085106382979
              precision    recall  f1-score   support

           0       0.95      0.97      0.96       121
           1       0.94      0.91      0.92        67

    accuracy                           0.95       188
   macro avg       0.94      0.94      0.94       188
weighted avg       0.95      0.95      0.95       188



- Pretty incredible! Accuracy drops only from about 0.957 with all 30 features to about 0.947 with 5 features, and the per-class f1 scores drop by just 0.01 to 0.02.
- Also notable: the filter and wrapper methods agreed on only 3 of 5 features, but produced nearly identical classification metrics when fed to the model.
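
The comparison above rests on a single train/test split. One way to check whether it holds up is to put the selector and the model in a `Pipeline` and cross-validate, so that the feature selection is refit on the training portion of every fold. This is a minimal sketch, assuming scikit-learn's bundled breast cancer dataset as a stand-in for the Kaggle CSV; the step names `"select"` and `"model"` are arbitrary, and its scores are not expected to match the numbers reported above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Stand-in data (an assumption): scikit-learn's bundled breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# Selection and model chained together, so each CV fold selects its own top 5
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),
    ("model", RandomForestClassifier(random_state=42)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

Swapping the `"select"` step for `RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)` would give the wrapper equivalent.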