# Feature Selection

- Feature selection is the process of automatically selecting the features in our data that contribute most to the prediction variable or output we are interested in.

## Benefits of feature selection
* **Reduces Overfitting** : Less redundant data means less opportunity to make decisions based on noise.
* **Improves Accuracy** : Less misleading data means modelling accuracy improves.
* **Reduces Training Time** : Less data means that algorithms train faster.

## Automatic feature selection techniques
* Univariate Selection
* Recursive Feature Elimination
* Principal Component Analysis
* Feature Importance

### 1 . Univariate Selection
* Statistical tests can be used to select those features that have the strongest relationship with the output variable.
* The scikit-learn library provides the **SelectKBest** class, which can be combined with a range of statistical tests to keep a chosen number of features. The chi-squared test used below requires non-negative feature values.
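To make the chi-squared test less of a black box, here is a minimal plain-Python sketch of how such a score can be computed for non-negative features. The function name `chi2_scores` and the toy data are illustrative and not part of the original notebook. The logic is meant to mirror the idea behind scikit-learn's `chi2` (compare the per-class feature sums with the sums we would expect if the feature were independent of the class), so treat it as an approximation rather than the library's code.

```python
def chi2_scores(X, y):
    """Chi-squared score per feature column, assuming non-negative features."""
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))

    # observed[c][j]: sum of feature j over the samples that belong to class c
    observed = {c: [0.0] * n_features for c in classes}
    class_count = {c: 0 for c in classes}
    for row, label in zip(X, y):
        class_count[label] += 1
        for j, value in enumerate(row):
            observed[label][j] += value

    feature_total = [sum(observed[c][j] for c in classes) for j in range(n_features)]

    scores = []
    for j in range(n_features):
        score = 0.0
        for c in classes:
            # Expected sum if feature j were independent of the class
            expected = feature_total[j] * class_count[c] / n_samples
            if expected > 0:
                score += (observed[c][j] - expected) ** 2 / expected
        scores.append(score)
    return scores


# Toy data: column 0 is constant, column 1 is non-zero only for class 1
X_toy = [[1, 5], [1, 0], [1, 5], [1, 0]]
y_toy = [1, 0, 1, 0]
toy_scores = chi2_scores(X_toy, y_toy)
print(toy_scores)  # [0.0, 10.0]
```

The constant column scores 0 because it carries no information about the class, while the column that differs between classes gets a high score. This is the kind of ranking SelectKBest relies on.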

In [2]:
# Feature Extraction with Univariate Statistical tests(chi-squared for classification)

from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

filename = 'pima-indians-diabetes.data.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filename,names=names)
array = dataframe.values

X = array[:,0:8]
Y = array[:,8]

# feature extraction
# k=4 asks SelectKBest to keep the 4 highest-scoring features
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X,Y)

# summarizing the scores
set_printoptions(precision=3)
print(fit.scores_)

# The four highest scores (plas, test, mass, age) identify the selected features

features = fit.transform(X)

# summarizing selected features 
print(features[0:5,:])

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]
[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]
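The printed scores are easier to read once they are matched with the column names. The sketch below does this in plain Python, using the names and scores copied from the output above. With a fitted selector, `fit.get_support()` should give the same selection as a boolean mask.

```python
# Column names and chi-squared scores copied from the notebook output above
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
scores = [111.52, 1411.887, 17.605, 53.108, 2175.565, 127.669, 5.393, 181.304]
k = 4

# Rank features by score, highest first
ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
top_k = {name for name, _ in ranked[:k]}

# Keep the original column order, as the transformed array does
selected = [name for name in names if name in top_k]
print(selected)  # ['plas', 'test', 'mass', 'age']
```

These four names line up with the four columns shown in the transformed array: the first row (148, 0, 33.6, 50) holds that record's `plas`, `test`, `mass` and `age` values.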


### 2 . Recursive Feature Elimination
* It works by recursively removing attributes and building a model on those attributes that remain.
* In scikit-learn, RFE ranks the attributes by the fitted model's coefficients or feature importances and drops the weakest ones until the requested number remains.
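The elimination loop itself is short. Below is a minimal NumPy sketch of the idea, with an ordinary least-squares model swapped in for logistic regression to keep the example small. The function name `rfe_least_squares` and the synthetic data are assumptions made for illustration; this is not scikit-learn's implementation.

```python
import numpy as np


def rfe_least_squares(X, y, n_keep):
    """Recursive feature elimination with a least-squares linear model.

    Returns (kept, ranking): the kept column indices, and a ranking in which
    kept features get 1 and features dropped earlier get larger numbers.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize so coefficient magnitudes are comparable across columns
    # (assumes no column is constant)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = y - y.mean()

    remaining = list(range(X.shape[1]))
    ranking = np.ones(X.shape[1], dtype=int)
    while len(remaining) > n_keep:
        # Refit the model on the features that are still in play
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        weakest = int(np.argmin(np.abs(coef)))
        # The last feature to be dropped ends up with rank 2
        ranking[remaining[weakest]] = len(remaining) - n_keep + 1
        del remaining[weakest]
    return remaining, ranking


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only columns 0 and 2 influence the target in this synthetic data
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + 0.1 * rng.normal(size=200)

kept, ranking = rfe_least_squares(X_demo, y_demo, n_keep=2)
print(kept)     # [0, 2]
print(ranking)
```

The model is refit after every removal, which is what makes the procedure recursive: a feature's weight can change once a correlated neighbour has been dropped.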

In [9]:
# feature extraction with RFE
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
# feature extraction
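# the default solver may emit a convergence warning on unscaled data
model = LogisticRegression()
# recent scikit-learn releases require n_features_to_select as a keyword argument
rfe = RFE(model, n_features_to_select=3)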
fit = rfe.fit(X,Y)

print("Num Features: %d" %(fit.n_features_)) 
print("Selected Features: %s" %(fit.support_)) 
print("Feature Ranking: %s"  %(fit.ranking_))


Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 4 5 6 1 1 3]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
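The warning above offers two remedies: scale the data or raise `max_iter`. The sketch below applies both on synthetic stand-in data generated with `make_classification`, since the diabetes CSV is not bundled with scikit-learn. The variable names and the value `max_iter=1000` are assumptions, and scaling can change which features are selected, so the result on the real data may differ from the output shown earlier.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (8 features, binary target)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Scale to zero mean and unit variance, then allow more solver iterations
X_scaled = StandardScaler().fit_transform(X_syn)
model = LogisticRegression(max_iter=1000)
rfe = RFE(model, n_features_to_select=3)
rfe.fit(X_scaled, y_syn)

print("Selected Features: %s" % rfe.support_)
print("Feature Ranking: %s" % rfe.ranking_)
```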


### 3 . Principal Component Analysis
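* PCA is a data reduction technique: rather than picking a subset of the original columns, it builds new features (principal components) from them.
* It uses linear algebra to transform the dataset into a compressed form.
* A property of PCA is that we can choose the number of dimensions, or principal components, in the transformed result.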

In [12]:
# Feature Extraction with PCA
from sklearn.decomposition import PCA

# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)

# summarizing components
print("Explained Variance : %s"%fit.explained_variance_ratio_)
print(fit.components_)

Explained Variance : [0.889 0.062 0.026]
[[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
   5.372e-04 -3.565e-03]
 [-2.265e-02 -9.722e-01 -1.419e-01  5.786e-02  9.463e-02 -4.697e-02
  -8.168e-04 -1.402e-01]
 [-2.246e-02  1.434e-01 -9.225e-01 -3.070e-01  2.098e-02 -1.324e-01
  -6.400e-04 -1.255e-01]]
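In the output above, the first component explains 88.9% of the variance and loads almost entirely on the fifth column (`test`, weight 0.993). This may simply reflect that column's larger numeric spread, because PCA is sensitive to the scale of each feature. The NumPy sketch below illustrates the effect on synthetic data. The function name `explained_variance_ratio` and the data are assumptions for illustration, and the computation uses a plain SVD rather than scikit-learn's `PCA` class.

```python
import numpy as np


def explained_variance_ratio(X, standardize=False):
    """Explained variance ratio of each principal component, via SVD."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)        # PCA works on centered data
    if standardize:
        X = X / X.std(axis=0)     # assumes no column is constant
    # Squared singular values are proportional to the component variances
    singular_values = np.linalg.svd(X, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
X_demo[:, 0] *= 100.0             # one column on a much larger scale

raw_ratio = explained_variance_ratio(X_demo)
scaled_ratio = explained_variance_ratio(X_demo, standardize=True)
print(raw_ratio)      # first component dominates
print(scaled_ratio)   # variance is spread far more evenly
```

If the goal is to compare features on an equal footing, standardizing before PCA is a common choice. Whether it is appropriate depends on whether the original units are meaningful for the problem.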


### 4 . Feature Importance
* Ensembles of decision trees such as Random Forest and Extra Trees can be used to estimate the importance of each feature.


In [13]:
# Feature Importance with Extra Trees Classifier

from sklearn.ensemble import ExtraTreesClassifier

# feature extraction
model = ExtraTreesClassifier()
model.fit(X,Y)
# the larger the score, the more important the attribute
# (scores vary between runs unless random_state is set)
print(model.feature_importances_)

[0.11  0.239 0.098 0.08  0.074 0.135 0.121 0.144]
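A bare array of importances is hard to interpret, so it helps to pair each score with a feature name and sort. The sketch below does this on synthetic stand-in data. The names `f0` to `f7` are hypothetical labels, and `n_estimators=100` and `random_state=0` are assumed settings chosen to make the run reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data; the feature names are hypothetical labels
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
feature_names = ['f%d' % i for i in range(X_syn.shape[1])]

# Fixing random_state makes the importances reproducible
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_syn, y_syn)

# Pair each name with its importance and list the strongest first
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print('%s: %.3f' % (name, importance))
```

The importances are normalized to sum to 1, so each value can be read as a share of the total importance the ensemble assigns.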


# Summary
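* We learnt about four automatic feature selection techniques: univariate selection, recursive feature elimination, principal component analysis and feature importance.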