# Performing Feature Selection
Question: Why do we do feature selection? 
Answer: In order to have the most predictive model we can for the least computational cost. 

We do this by eliminating independent variables that are nonpredictive, or only marginally predictive. This reduces the chance of overfitting to the features, can improve accuracy, and shortens time to convergence. 

This notebook walks through several of the examples from the scikit-learn site (https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection), back to back. I not only show you how to find the most predictive features, I also show you how to display them and how to put those top features into a new dataframe that you can use as input to a downstream process (something that, frustratingly, is often not shown by others). 

Caveat: I should have written the conversion to a new dataframe as a function (`def`), but I got too focused on finishing the notebook. On the plus side, that means you can use each section as a "complete" notebook. I often do this because it is easier in the classroom to show the steps inline. There are also some slight variations between sections because each feature selection method exposes different attributes. The only downside is that this notebook is much longer than most of the ones I publish.
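
For readers who would rather have that conversion as a function, here is a minimal sketch. The name `selected_to_dataframe` is hypothetical (it is not part of scikit-learn), and the data is scikit-learn's bundled copy of the WDBC dataset, used as a stand-in so the block runs on its own; its column names differ from the `names` list used in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2


def selected_to_dataframe(selector, X):
    """Return the columns of X kept by an already-fitted selector.

    Hypothetical helper: it relies on the get_support() method that
    SelectKBest, RFE, RFECV and SelectFromModel all provide.
    """
    mask = selector.get_support()   # boolean array, one entry per column of X
    return X.loc[:, mask].copy()    # column names are carried over


# Stand-in data (assumption): the bundled WDBC dataset as a dataframe
X_demo, y_demo = load_breast_cancer(return_X_y=True, as_frame=True)

selector = SelectKBest(score_func=chi2, k=5).fit(X_demo, y_demo)
top5 = selected_to_dataframe(selector, X_demo)
print(top5.columns.tolist())
```

The same call should work for any fitted selector that offers `get_support()`, which would remove most of the repeated dataframe-building code in the sections that follow.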

The Feature Selection Techniques covered are: 
* SelectKBest
* Recursive Feature Elimination (RFE)
* RFE with Cross Validation (a favorite of mine as my students know)
* SelectFromModel
* Extra Trees Classification

NOTE: these are being used for classification and the dataset is the extended Wisconsin Breast Cancer dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data. 

In [None]:
import pandas as pd
import numpy as np

# read in the file from UCI <recommend you save locally and load it if your connectivity is iffy>

# Loading the file over the internet
#filename = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data" 

# Loading the file locally in the same folder as the Python Notebook
filename = "wi_breast_cancer.csv"
names = ['ID','Diagnosis',
         'Mean-Radius','Mean-Texture','Mean-Perimeter',
         'Mean-Area','Mean-Smoothness','Mean-Compactness',
         'Mean-Concavity','Mean-ConcavePoints',
         'Mean-Symmetry','Mean-FractalDimension', 
         'StdErr-Radius','StdErr-Texture','StdErr-Perimeter',
         'StdErr-Area','StdErr-Smoothness','StdErr-Compactness',
         'StdErr-Concavity','StdErr-ConcavePoints',
         'StdErr-Symmetry','StdErr-FractalDimension',
         'Worst-Radius','Worst-Texture','Worst-Perimeter',
         'Worst-Area','Worst-Smoothness','Worst-Compactness',
         'Worst-Concavity','Worst-ConcavePoints',
         'Worst-Symmetry','Worst-FractalDimension']

# loading the file into a dataframe
data = pd.read_csv(filename, names=names, header=None) 

# Convert the Diagnosis to a numeric variable
data['Diagnosis'] = data['Diagnosis'].map({'M': 1, 'B': 0})
# Malignant tumors = 1 or True and Benign tumors = 0 or False

# Loading the X and y matrices
X = data.iloc[:, 2:32]   # load features into X dataframe
Y = data.iloc[:, 1]      # load target into Y series

# Get the number of rows and columns of the feature dataframe
(nRows, nCols) = X.shape 

## SelectKBest Features 
Testing SelectKBest in order to ensure we are using the right features for our dataset. The example below uses the chi-squared ($\chi^2$) statistical test, which requires non-negative features, to select the best features from the dataset.

In [None]:
# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
from sklearn.feature_selection import SelectKBest 
from sklearn.feature_selection import chi2

# Setting precision for display
#pd.options.display.precision = 2
#np.set_printoptions(precision = 2)

fitScores = []

# feature extraction; where k is the number of features you want to select
test = SelectKBest(score_func=chi2, k=5)
fit = test.fit(X, Y)

# Find the scores for every feature so that you know which were selected
fitScores = fit.scores_

# Convert the numpy array of scores back into a DF with the correct column names
features = pd.DataFrame(fitScores.reshape(-1, len(fitScores)),columns=names[2:32])
print(features.T) # transpose to make it easier to read
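
The scores printed above can also be ranked in code rather than by eye. The following is a sketch only: it uses scikit-learn's bundled WDBC data as a stand-in (an assumption, so the block is self-contained), and the variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

# Stand-in data (assumption): the bundled WDBC dataset as a dataframe
X_demo, y_demo = load_breast_cancer(return_X_y=True, as_frame=True)
kbest = SelectKBest(score_func=chi2, k=5).fit(X_demo, y_demo)

# Label each score with its feature name, then sort from highest to lowest
chi2_scores = pd.Series(kbest.scores_, index=X_demo.columns)
ranked = chi2_scores.sort_values(ascending=False)
print(ranked.head(5))

# The names SelectKBest actually kept, for comparison with the ranking
kept = X_demo.columns[kbest.get_support()]
print(kept.tolist())
```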

In [None]:
# Eyeballing the top scores and creating a header for them
# We will see a better way in upcoming sections that use code to do this
heads = ['Mean-Perimeter','Mean-Area','StdErr-Area','Worst-Perimeter','Worst-Area']

# perform the selection of fields so we have them for later analysis
kSelect = SelectKBest(chi2, k=5).fit_transform(X, Y)
(rows, cols) = kSelect.shape 

# Create a dataframe to hold the selected values (only) for later processing
selected = pd.DataFrame(data=kSelect,
          index=np.array(range(1, rows+1)),
          columns=np.array(range(1, cols+1)))

# Add the column headers for the X array--the range from names for this dataframe
selected.columns = heads 
selected.head(3)

## Recursive Feature Elimination
RFE recursively removes attributes and builds a model on those that remain. It trains on the full feature set, determines the feature importances from the fitted model, prunes the least important feature, and then repeats on the reduced set, building a new model each time, until only the requested number of features is left. By default it removes one feature per iteration (`step=1`).
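
To make the loop concrete, here is a simplified, hand-rolled sketch of the idea. It is an illustration, not scikit-learn's implementation: the squared logistic regression coefficient is assumed as the importance measure, the features are standardized first so the coefficients are comparable (an added assumption), and the bundled WDBC data stands in for the notebook's file.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): the bundled WDBC dataset as a dataframe
X_demo, y_demo = load_breast_cancer(return_X_y=True, as_frame=True)
X_scaled = StandardScaler().fit_transform(X_demo)   # assumption: scale first

remaining = list(range(X_scaled.shape[1]))   # column indices still in play
n_keep = 5                                   # how many features to retain

while len(remaining) > n_keep:
    # Fit a fresh model on the features that are still in play
    model = LogisticRegression(max_iter=1000).fit(X_scaled[:, remaining], y_demo)
    importance = model.coef_[0] ** 2         # one value per remaining feature
    weakest = int(np.argmin(importance))     # position of the least important one
    del remaining[weakest]                   # prune it, then loop and refit

print(X_demo.columns[remaining].tolist())
```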

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# feature extraction 
model = LogisticRegression() 
rfe = RFE(model, n_features_to_select=5) # where the number is the features retained
rfe = rfe.fit(X,Y) 

ranking = rfe.ranking_
selected = rfe.support_

ranking = np.vstack((ranking, selected))

(rows, cols) = ranking.shape

# This dataframe doesn't hold the columns selected, 
# it is only for pretty printing the selected features
rfe_selected = pd.DataFrame(data=ranking,
          index=np.array(range(1, rows+1)),
          columns=np.array(range(1, cols+1)))
rfe_selected.columns = names[2:32] 

array = rfe_selected.T # transpose
array.columns = ['rank', 'selected']
output = array['selected'] == 1
df = array[selected]
df

In [None]:
# Get the actual features selected for later processing
rfeSelect = RFE(model, n_features_to_select=5).fit_transform(X, Y)

# Get the size of the array of selected values 
(rows, cols) = rfeSelect.shape

# Get the column headings and remove the selection data
df2 = df.T # transpose back... :-)
heads = df2.iloc[0:0]
heads = heads.columns

# Create a dataframe to hold the selected values (only) for later processing
selectedRFE = pd.DataFrame(data=rfeSelect,
          index=np.array(range(1, rows+1)),
          columns=np.array(range(1, cols+1)))

# Add the column headers for the X array--the range from names for this dataframe
selectedRFE.columns = heads 
selectedRFE.head(3)

The most important thing to note is that the top features do not correspond with the SelectKBest choices. Because RFE evaluates features by repeatedly fitting a model and removing the weakest ones, rather than scoring each feature in isolation, this method is quite likely to give better results. 

## Recursive removal with cross validation
In this case we will use a support vector machine and RFECV to identify the top features. This is still recursive removal, but it is more comprehensive than a simple RFE. It also tunes the number of selected features automatically through cross-validation, so the data scientist does not have to set the number in advance or find the best number of features through trial and error. 

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
import matplotlib.pyplot as plt
%matplotlib inline

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear") # RFECV needs coef_, which SVC only exposes with a linear kernel
# The "accuracy" scoring is proportional to the number of correct classifications

rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
              scoring='accuracy')
rfecv.fit(X, Y)

print("Optimal number of features : %d" % rfecv.n_features_)
#print("Selected Features: %s" % rfecv.support_) 
#print("Feature Ranking: %s" % rfecv.ranking_)

rankingCV = rfecv.ranking_
selectedCV = rfecv.support_

rankingCV = np.vstack((rankingCV, selectedCV))
(rows, cols) = rankingCV.shape

# This dataframe for pretty printing the selected features
rfecv_selected = pd.DataFrame(data=rankingCV,
          index=np.array(range(1, rows+1)),
          columns=np.array(range(1, cols+1)))
rfecv_selected.columns = names[2:32] 

arrayCV = rfecv_selected.T
arrayCV.columns = ['rankCV', 'selectedCV']
output = arrayCV['selectedCV'] == 1
dfCV = arrayCV[selectedCV]
print(dfCV)

# Plot number of features vs. cross-validation scores
# scikit-learn 1.0+ exposes the mean score per feature count in cv_results_
# (older releases used the since-removed grid_scores_ attribute)
meanScores = rfecv.cv_results_["mean_test_score"]
plt.figure(figsize=(10,10))
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (accuracy)")
plt.plot(range(1, len(meanScores) + 1), meanScores)
plt.show()

In [None]:
# Get the actual features selected for later processing
rfecvFeatures = rfecv.transform(X)

# Get the size of the array of selected values 
(rows, cols) = rfecvFeatures.shape
#print(rows, cols)

# Get the column headings and remove the selection data
df3 = dfCV.T
heads = df3.iloc[0:0]
heads = heads.columns

# Create a dataframe to hold the selected values (only) for later processing
selectedRFECV = pd.DataFrame(data=rfecvFeatures,
          index=np.array(range(1, rows+1)),
          columns=np.array(range(1, cols+1)))

# Add the column headers for the X array--the range from names for this dataframe
selectedRFECV.columns = heads 
selectedRFECV.head(3)

The good news is that RFECV selected the same five features as RFE and showed us that we were cutting too deeply. Five features explain a significant share of the variance (see the local maximum at 4-5), but the next three features, when included with the first five, result in even better predictability. We plateau again after that and don't see a drop in predictability until we reach 19 variables, where including these and the remaining variables actually makes the prediction worse. Most of those come from the features that were generated using standard error. To be the most efficient, you would quite likely rerun RFE and select the top 8. 
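
A rerun along those lines might look like the sketch below. The assumptions are mine: a logistic regression on standardized features as the estimator, and the bundled WDBC data as a stand-in for the notebook's file, so the features it picks may differ from the ones discussed above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): the bundled WDBC dataset as a dataframe
X_demo, y_demo = load_breast_cancer(return_X_y=True, as_frame=True)
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X_demo),
                        columns=X_demo.columns)

# Keep eight features instead of five
rfe8 = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8)
rfe8.fit(X_scaled, y_demo)

# Pull the surviving columns (with their names) out of the original data
top8 = X_demo.loc[:, rfe8.get_support()]
print(top8.columns.tolist())
```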

## Select From Model
This is what is referred to as a "meta-transformer": it can be used alongside any estimator that exposes a coefficient (`coef_`) or feature importance (`feature_importances_`) attribute after being fitted. Instead of selecting a fixed number of features, it keeps the features whose importance is at or above a threshold that you provide (or a default one if you provide none). The trick is knowing what that threshold should be, but there are ways, as we saw with RFECV, to get at this information. In this instance, I'll be using the LASSO cross-validation estimator (`LassoCV`), which uses KFold as the cross validator by default. 

In [None]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

clf = LassoCV(cv=5)
sfm = SelectFromModel(clf)
sfmFeatures = sfm.fit_transform(X,Y)

(rows, cols) = sfmFeatures.shape
#print(rows, cols)

# This dataframe for pretty printing the selected features
sfm_selected = pd.DataFrame(data=sfmFeatures,
          index=np.array(range(1, rows+1)),
          columns=np.array(range(1, cols+1)))
sfm_selected.columns = ['Mean-Area','Worst-Texture','Worst-Perimeter',"Worst-Smoothness"] 

sfm_selected.head(3)

Unfortunately, I had to "eyeball" this one (going back to the data file) to find which features were selected, because SelectFromModel does not expose the `ranking_` and `support_` attributes I relied on in the previous sections. 
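
One way to avoid the manual lookup is the `get_support()` method, which SelectFromModel does provide. The sketch below uses the bundled WDBC data as a stand-in (an assumption), so the columns it reports may not match the four listed above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Stand-in data (assumption): the bundled WDBC dataset as a dataframe
X_demo, y_demo = load_breast_cancer(return_X_y=True, as_frame=True)

sfm = SelectFromModel(LassoCV(cv=5)).fit(X_demo, y_demo)

# Boolean mask of kept columns -> the matching column names
kept_names = X_demo.columns[sfm.get_support()]

# Build the dataframe of selected features with the right headers
sfm_df = pd.DataFrame(sfm.transform(X_demo), columns=kept_names)
print(kept_names.tolist())
print("threshold used:", sfm.threshold_)
```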

## Feature Importance via ExtraTreesClassifier
Tree ensembles like Random Forest and Extra Trees can be used to estimate the importance of features. Extra Trees implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
np.random.seed(101) # make this stochastic decision tree deterministic

etc = ExtraTreesClassifier().fit(X, Y) 
etcFeatures = etc.feature_importances_

dfFeatures = pd.DataFrame(etcFeatures.reshape(-1, len(etcFeatures)),columns=names[2:32])
dfFeatures.T

Decision trees, as they are implemented here, are stochastic. The issue is that each time you run the classifier you can get a different set of features, which sucks if you are looking for a consistent set to choose from and expect the same results every time. The trick I learned on StackOverflow is to add this line before fitting to make it deterministic: 
<code>np.random.seed(101)</code>

The features that came out on top with this seed:
* Worst-Area
* StdErr-Radius
* Mean-ConcavePoints
* Worst-Perimeter
* Mean-Radius
* Mean-Area
* Mean-Perimeter
* Worst-ConcavePoints
* Worst-Radius
* StdErr-Area

The other problem, as with some of the examples above, is the lack of helper classes, but at least with trees you can visualize them. See my Decision Tree lesson for how to do that. 
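
As an alternative to seeding NumPy globally, the classifier's own `random_state` parameter can pin the randomness, and a pandas Series can attach names to the importances and sort them. This is a sketch under assumptions: the bundled WDBC data stands in for the notebook's file, `ranked_importances` is a hypothetical helper, and `n_estimators=100` is set explicitly.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier

# Stand-in data (assumption): the bundled WDBC dataset as a dataframe
X_demo, y_demo = load_breast_cancer(return_X_y=True, as_frame=True)


def ranked_importances(seed):
    """Fit Extra Trees with a fixed seed; return importances, highest first."""
    forest = ExtraTreesClassifier(n_estimators=100, random_state=seed)
    forest.fit(X_demo, y_demo)
    scores = pd.Series(forest.feature_importances_, index=X_demo.columns)
    return scores.sort_values(ascending=False)


first_run = ranked_importances(101)
second_run = ranked_importances(101)   # same seed, so the same ranking

print(first_run.head(10))
```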