## Feature Selection & Feature Extraction

The amount of information you give a model really matters. In machine learning, we often have to deal with the ["curse of dimensionality"](https://en.wikipedia.org/wiki/Curse_of_dimensionality). This tutorial presents two families of techniques that address it: feature selection and feature extraction.

In general, one might expect a model that receives data with more dimensions to perform better. However, as the following image shows, once the number of dimensions passes an optimum, performance gets worse as more dimensions are added.

![optimal dimension](http://scikit-learn.org/stable/_images/sphx_glr_plot_rfe_with_cross_validation_001.png)
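
The effect can be reproduced with a small experiment. The sketch below is only an illustration under assumptions that are not part of this tutorial: a synthetic dataset with 5 informative features, a k-nearest-neighbours classifier, and pure-noise columns appended to raise the dimension. With this kind of setup the cross-validated accuracy typically drops as the noise columns pile up, but the exact numbers depend on the data and the model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Assumed toy data: 300 samples whose 5 features are all informative
x, y = make_classification(
    n_samples=300, n_features=5, n_informative=5, n_redundant=0, random_state=0
)

scores = {}
for n_noise in (0, 10, 50, 200):
    # Append columns of pure Gaussian noise to increase the dimension
    noise = rng.normal(size=(x.shape[0], n_noise))
    x_noisy = np.hstack([x, noise])
    # Mean 5-fold cross-validated accuracy at this dimension
    scores[x_noisy.shape[1]] = cross_val_score(
        KNeighborsClassifier(), x_noisy, y, cv=5
    ).mean()

for n_dim, acc in scores.items():
    print(f"{n_dim:3d} dimensions -> accuracy {acc:.3f}")
```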


### <span style="color:blue">Feature Selection</span>

The following is quoted from the definition of "Feature Selection" on [Wikipedia](https://en.wikipedia.org/wiki/Feature_selection).

>In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for four reasons:

> 1. simplification of models to make them easier to interpret by researchers/users,
> 2. shorter training times,
> 3. to avoid the curse of dimensionality,
> 4. enhanced generalization by reducing overfitting (formally, reduction of variance)

Feature selection is always carried out with respect to a **given model** and an **index of performance**. In this tutorial, as in the assignment, we use Gaussian Naive Bayes as the model and its accuracy as the index. Feature selection methods then rank the features by how much they contribute to the model's performance.

Assume you have a dataset in which each sample has 20 features, and you are asked to pick the two features ranked most important. Ranking features this way is useful both for visualization and for reducing the number of features. With fewer features, you can not only train your model in less time but also measure the robustness of the model.

Two common feature selection methods are **backward selection** and **forward selection**. You can find the rest of the methods in the references.

---

* <span style="color:green">backward selection</span>

    As the name suggests, we work backward: we start with all the features and remove them one at a time. We will use our imaginary dataset to walk through the specific steps of backward selection.
    
    1. Train the model with all 20 features and test the model for accuracy.

           ( fea_01, fea_02, fea_03, fea_04, fea_05, .... fea_20 ) -> total_acc
       
    2. Iterate over the 20 features and leave out one feature at a time, so that each iteration trains the same model on a different combination of 19 features.
      
           ( fea_02, fea_03, fea_04, fea_05, .... fea_20 ) -> acc_01
           ( fea_01, fea_03, fea_04, fea_05, .... fea_20 ) -> acc_02
           ( fea_01, fea_02, fea_04, fea_05, .... fea_20 ) -> acc_03
            ....
           ( fea_01, fea_02, fea_03, fea_04, .... fea_19 ) -> acc_20
       
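    3. Eliminate the feature whose removal reduces the accuracy the least, i.e. the feature whose `acc_i` from step 2 is the highest.
    
        ```python=
        import numpy as np
        # accuracy[i] is the accuracy measured in step 2 with feature i left out
        feature_index_we_dont_want = np.argmax( accuracy )
        del features[ feature_index_we_dont_want ]
        ```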
        
    4. Repeat steps 1 to 3 on the remaining features until the number of features satisfies the condition.

    ---
* <span style="color:green">forward selection</span>

    In contrast to backward selection, we start with an empty list and append one feature at a time. Each time, we append the feature that contributes the most to accuracy.
    
    1. Start with an empty feature list. For each feature not yet selected, calculate the accuracy of the model trained on the already selected features plus that feature.

            feature_we_want : []
            fea_01 -> acc_01, fea_02 -> acc_02, fea_03 -> acc_03 .... fea_20 -> acc_20

    2. Append the most important feature, i.e. the one that gives the highest accuracy.

            feature_index_we_want = argmax( accuracy )
            feature_we_want.append( features[ feature_index_we_want ] )

    3. Repeat steps 1 and 2 until the number of selected features satisfies our condition.
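
The two procedures above can be sketched in code. The block below is a minimal illustration, not a reference implementation: `forward_selection` and its parameters `n_keep` and `cv` are hypothetical names introduced here, accuracy is estimated with 5-fold cross-validation (an assumption; the steps above do not say how accuracy is measured), and the backward direction is delegated to scikit-learn's `SequentialFeatureSelector`, which assumes scikit-learn 0.24 or newer.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB


def forward_selection(x, y, model, n_keep, cv=5):
    """Greedily add the feature that gives the highest cross-validated accuracy."""
    selected = []                          # plays the role of feature_we_want
    remaining = list(range(x.shape[1]))    # candidate column indices
    while len(selected) < n_keep:
        # Accuracy of (already selected features + one candidate) per candidate
        accuracy = [
            cross_val_score(model, x[:, selected + [j]], y, cv=cv).mean()
            for j in remaining
        ]
        best = remaining[int(np.argmax(accuracy))]
        selected.append(best)
        remaining.remove(best)
    return selected


x, y = make_blobs(n_samples=500, centers=3, n_features=20, random_state=0)

forward_features = forward_selection(x, y, GaussianNB(), n_keep=2)
print("forward selection keeps columns", forward_features)

# Backward selection with scikit-learn's built-in greedy selector
backward = SequentialFeatureSelector(
    GaussianNB(), n_features_to_select=2, direction="backward",
    scoring="accuracy", cv=5,
)
backward.fit(x, y)
backward_features = np.flatnonzero(backward.get_support()).tolist()
print("backward selection keeps columns", backward_features)
```

The two directions are not guaranteed to keep the same columns: both are greedy searches, and on data where many features are individually informative several subsets can reach the same accuracy.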
    
### <span style="color:blue">Feature Extraction</span>

According to the definition of feature extraction on [Wikipedia](https://en.wikipedia.org/wiki/Feature_extraction):
>In machine learning, pattern recognition and in image processing, feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretations. Feature extraction is a dimensionality reduction process, where an initial set of raw variables is reduced to more manageable groups (features) for processing, while still accurately and completely describing the original data set.

Whereas feature selection only filters the given features, feature extraction creates several new features from them. The best-known feature extraction algorithm is [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) (PCA). This tutorial explains when and why to use PCA and leaves the mathematical theory to the references.

With the same number of features, feature extraction generally retains more information than feature selection. However, like most machine learning models, it lacks explainability, because each new feature is a combination of the original ones. If the only purpose is visualization in 2D, this kind of algorithm may be the best choice.

Scikit-learn provides a convenient interface for PCA. The following is a demonstration.

In [None]:
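from sklearn.decomposition import PCA
# make_blobs is imported from sklearn.datasets; the old samples_generator path was removed in scikit-learn 0.24
from sklearn.datasets import make_blobs

# 500 samples drawn from 3 clusters, 20 features each
original_x, y = make_blobs(n_samples=500, centers=3, n_features=20)
pca = PCA(n_components=2)
pca.fit(original_x)                     # learn the 2 principal components
reduced_x = pca.transform(original_x)   # project the data onto them
print("original shape is ", original_x.shape, ", the shape after extraction is ", reduced_x.shape)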
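
To see what the two extracted features contain, one can inspect the fitted PCA object. The sketch below is an illustration that assumes the same kind of synthetic blobs as the demonstration (the fixed `random_state` is added here only for repeatability); it prints how much of the variance each component captures and shows that every component carries a weight for all 20 original features, which is why the new features are hard to interpret.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

x, y = make_blobs(n_samples=500, centers=3, n_features=20, random_state=0)
pca = PCA(n_components=2).fit(x)

# Fraction of the total variance captured by each principal component,
# sorted from the largest to the smallest
print("explained variance ratio:", pca.explained_variance_ratio_)

# One row per component, one weight per original feature: shape (2, 20)
print("components shape:", pca.components_.shape)
```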

As with feature selection, there are many other methods you can use. They are listed in the references.

---
References
* [Other methods available on sklearn](http://scikit-learn.org/stable/modules/feature_selection.html)
* [Feature extraction algorithms](https://www.sciencedirect.com/science/article/pii/0031320371900033)

### Assignment
Fill in the TODO segments of the following backward selection implementation.
Use backward selection to filter out 18 features, leaving only 2.

In [None]:
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB
import numpy as np

n_features = 20
x, y = make_blobs(n_samples=500, centers=3, n_features=n_features)
classifier = GaussianNB() # to save time and computation, stick to the GNB model instead of other models

# you should implement a function that works like Recursive Feature Elimination in sklearn
# http://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html

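# outer loop: remove one feature at a time.
for n_remove_fea in range(0, n_features - 2):
    
    # inner loop: compare the contribution of each remaining feature.
    for i in range(0, n_features - n_remove_fea):
        # TODO: delete the column you are testing (hint: np.delete) and record the accuracy without it
        pass  # placeholder so that the skeleton parses; replace it with your code
    
    # TODO: find the least important feature, i.e. the column whose removal lowers the accuracy the least
    # (np.argmin on the accuracy drops, or np.argmax on the accuracies measured without each column), and delete it from x

# x.shape should be (500, 2) now, and backward selection is done.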