# Feature selection

Once you have created hundreds of thousands of features, it is time to select a few of them for training the model. Having too many features poses a problem well known as the curse of dimensionality. If you have a lot of features, you must also have a lot of training samples to capture all of them. What counts as "a lot" is not precisely defined; it is up to us to figure out by validating our models properly and checking how much time they take to train.

The simplest form of feature selection is to **remove features with very low variance.** If a feature has very low variance (i.e., very close to 0), it is close to being constant and thus adds no value to any model. It is a good idea to get rid of such features and thereby lower the complexity.

Scikit-learn provides **VarianceThreshold**, which removes low-variance features in exactly this way. Note that variance depends on the scaling of the data, so the threshold has to be chosen with the scale of each feature in mind.

In [None]:
from sklearn.feature_selection import VarianceThreshold

data = ...  # placeholder: your feature matrix
var_thresh = VarianceThreshold(threshold=0.1)
transformed_data = var_thresh.fit_transform(data)

# transformed_data has all columns with variance less than 0.1 removed
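To make the dependence on scaling concrete, here is a small runnable sketch. The toy matrix and the rescaling factor are made up for illustration; only `VarianceThreshold`, its `variances_` attribute and `get_support()` come from scikit-learn.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix (made up for illustration): 5 samples, 3 features
toy = np.array([
    [1.0, 0.50, 10.0],
    [1.0, 0.52, 20.0],
    [1.0, 0.49, 30.0],
    [1.0, 0.51, 40.0],
    [1.0, 0.50, 50.0],
])

selector = VarianceThreshold(threshold=0.1)
toy_reduced = selector.fit_transform(toy)

# Per-column variances learned during fit
print(selector.variances_)
# Boolean mask of the columns that survived the threshold
print(selector.get_support())

# Same data, but the second column is expressed in units 100 times smaller,
# so its values (and its variance) become much larger
toy_rescaled = toy * np.array([1.0, 100.0, 1.0])
selector_rescaled = VarianceThreshold(threshold=0.1).fit(toy_rescaled)
print(selector_rescaled.get_support())
```

The second column carries the same information in both runs, yet it is dropped in the first and kept in the second. This is why a single global threshold is only meaningful when the features are on comparable scales.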

We can also remove features that are highly correlated with one another. To calculate the correlation between numerical features, we can use the **Pearson correlation**.

In [4]:
import pandas as pd 
from sklearn.datasets import fetch_california_housing
import numpy as np 

# fetch a regression dataset
data = fetch_california_housing()

x = data['data']
col_names = data['feature_names']
y = data['target']

# convert to pandas dataframe 
df = pd.DataFrame(x , columns=col_names)

# introduce a highly correlated column
df.loc[: , 'MedInc_Sqrt'] = df.MedInc.apply(np.sqrt)

# get correlation matrix (pearson)
df.corr()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedInc_Sqrt |
|---|---|---|---|---|---|---|---|---|---|
| MedInc | 1.0 | -0.119034 | 0.326895 | -0.06204 | 0.004834 | 0.018766 | -0.079809 | -0.015176 | 0.984329 |
| HouseAge | -0.119034 | 1.0 | -0.153277 | -0.077747 | -0.296244 | 0.013191 | 0.011173 | -0.108197 | -0.132797 |
| AveRooms | 0.326895 | -0.153277 | 1.0 | 0.847621 | -0.072213 | -0.004852 | 0.106389 | -0.02754 | 0.326688 |
| AveBedrms | -0.06204 | -0.077747 | 0.847621 | 1.0 | -0.066197 | -0.006181 | 0.069721 | 0.013344 | -0.06691 |
| Population | 0.004834 | -0.296244 | -0.072213 | -0.066197 | 1.0 | 0.069863 | -0.108785 | 0.099773 | 0.018415 |
| AveOccup | 0.018766 | 0.013191 | -0.004852 | -0.006181 | 0.069863 | 1.0 | 0.002366 | 0.002476 | 0.015266 |
| Latitude | -0.079809 | 0.011173 | 0.106389 | 0.069721 | -0.108785 | 0.002366 | 1.0 | -0.924664 | -0.084303 |
| Longitude | -0.015176 | -0.108197 | -0.02754 | 0.013344 | 0.099773 | 0.002476 | -0.924664 | 1.0 | -0.015569 |
| MedInc_Sqrt | 0.984329 | -0.132797 | 0.326688 | -0.06691 | 0.018415 | 0.015266 | -0.084303 | -0.015569 | 1.0 |


We see that the feature **"MedInc_Sqrt"** has a very high correlation with **"MedInc"** (about 0.98), so we can remove one of them.
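Dropping one feature from each highly correlated pair can be automated. The following is a minimal sketch, not a definitive recipe: the helper name `drop_highly_correlated`, the 0.95 threshold and the synthetic columns are all assumptions made for illustration. It uses synthetic data so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(frame, threshold=0.95):
    """Drop every column whose absolute Pearson correlation with an
    earlier column exceeds `threshold` (hypothetical helper)."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


# Synthetic stand-in for the housing data: "a_sqrt" mimics "MedInc_Sqrt"
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "a": rng.uniform(1.0, 10.0, size=200),
    "b": rng.normal(size=200),
})
demo["a_sqrt"] = np.sqrt(demo["a"])

reduced, dropped = drop_highly_correlated(demo, threshold=0.95)
print(dropped)
print(reduced.columns.tolist())
```

This sketch always keeps the column that appears first and drops the later one. Which member of a pair to keep is a judgment call, and the threshold should be validated like any other hyperparameter.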

Now we can move on to some univariate methods of feature selection. **Univariate feature selection** is nothing but scoring each feature against a given target.

**Mutual information, ANOVA F-test** and **$\chi^2$** are some of the most popular methods for univariate feature selection. There are two ways of using these in scikit-learn.

- **SelectKBest**: It keeps the top-k scoring features
- **SelectPercentile**: It keeps the top-scoring features that fall within a percentage specified by the user
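As a quick illustration, both selectors can be called directly with a scoring function. This is a sketch on synthetic data; the dataset sizes, `k=3` and `percentile=50` are arbitrary choices made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# Synthetic regression data: 10 features, only 3 of which drive the target
X_demo, y_demo = make_regression(
    n_samples=500, n_features=10, n_informative=3, noise=1.0, random_state=42
)

# Keep the 3 features with the highest F-scores
k_best = SelectKBest(score_func=f_regression, k=3)
X_k = k_best.fit_transform(X_demo, y_demo)

# Keep the top 50% of features ranked by F-score
top_half = SelectPercentile(score_func=f_regression, percentile=50)
X_p = top_half.fit_transform(X_demo, y_demo)

print(X_k.shape)
print(X_p.shape)
# Column indices of the features retained by SelectKBest
print(k_best.get_support(indices=True))
```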

It must be noted that you can use $\chi^2$ only for data that is non-negative. It is a particularly useful feature selection technique in natural language processing, when we have bag-of-words or tf-idf based features. It's best to create a wrapper for univariate feature selection that you can use for almost any new problem.

In [15]:
import UnivariateFeatureSelection as ufs

# initialize the instance
uni_sel = ufs.UnivariateFeatureSelection(
    n_features=0.9,
    problem_type='regression',
    scoring='f_regression'
)


In [16]:
uni_sel.fit(x ,y)

SelectPercentile(percentile=90,
                 score_func=<function f_regression at 0x00000178F6D99438>)

In [17]:
x_transformed = uni_sel.transform(x)

In [18]:
x_transformed.shape

(20640, 7)
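The `UnivariateFeatureSelection` module imported above is not shown in this section. The following is one possible sketch of such a wrapper, written to match the way it is called above. The internals are assumptions: in particular, treating an integer `n_features` as a count for `SelectKBest` and a float as a fraction for `SelectPercentile` is inferred from the `SelectPercentile(percentile=90, ...)` output, not taken from the original module.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import (
    SelectKBest, SelectPercentile,
    chi2, f_classif, f_regression,
    mutual_info_classif, mutual_info_regression,
)


class UnivariateFeatureSelection:
    """Thin wrapper around scikit-learn's univariate selectors (sketch)."""

    def __init__(self, n_features, problem_type, scoring):
        if problem_type == "classification":
            valid_scoring = {
                "f_classif": f_classif,
                "chi2": chi2,
                "mutual_info_classif": mutual_info_classif,
            }
        elif problem_type == "regression":
            valid_scoring = {
                "f_regression": f_regression,
                "mutual_info_regression": mutual_info_regression,
            }
        else:
            raise ValueError(f"Unknown problem type: {problem_type!r}")

        if scoring not in valid_scoring:
            raise ValueError(f"Invalid scoring for {problem_type}: {scoring!r}")

        # Assumption: an int means "keep k features",
        # a float means "keep this fraction of features"
        if isinstance(n_features, int):
            self.selection = SelectKBest(valid_scoring[scoring], k=n_features)
        elif isinstance(n_features, float):
            self.selection = SelectPercentile(
                valid_scoring[scoring], percentile=int(round(n_features * 100))
            )
        else:
            raise TypeError("n_features must be an int or a float")

    def fit(self, X, y):
        return self.selection.fit(X, y)

    def transform(self, X):
        return self.selection.transform(X)

    def fit_transform(self, X, y):
        return self.selection.fit_transform(X, y)


# Usage on synthetic data with 8 features, mirroring the cells above
X_reg, y_reg = make_regression(n_samples=300, n_features=8, random_state=0)
selector_demo = UnivariateFeatureSelection(
    n_features=0.9, problem_type="regression", scoring="f_regression"
)
X_sel = selector_demo.fit_transform(X_reg, y_reg)
print(X_sel.shape)
```

Validating the scoring function against the problem type up front means that a mismatch, such as `chi2` for a regression target, fails immediately with a readable error rather than later inside scikit-learn.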

Most of the time, people prefer doing feature selection using a machine learning model. Let's see how that is done.

The simplest form of feature selection that uses a model is known as **greedy feature selection**. In greedy feature selection, the first step is to choose a model. The second step is to select a loss/scoring function. The third and final step is to iteratively evaluate each feature and add it to the list of **"good"** features if it improves the loss/score.

But you must keep in mind that this is known as greedy feature selection for a reason. This process fits a given model each time it evaluates a feature, so the computational cost associated with this kind of method is very high, and it can take a long time to finish. If you do not use this feature selection properly, you might even end up overfitting the model.
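The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the model (logistic regression), the score (AUC), the synthetic dataset and the function names are all choices made for the example. Scoring with 3-fold cross-validation is also an assumption, added here to reduce the overfitting risk mentioned above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def evaluate(model, X, y):
    """Mean cross-validated AUC of `model` on the given feature subset."""
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()


def greedy_feature_selection(model, X, y):
    """Add one feature per round, keeping it only if the score improves."""
    good_features = []
    best_scores = []
    n_features = X.shape[1]

    while True:
        best_feature, best_score = None, -np.inf
        # Try every feature that has not been selected yet
        for feature in range(n_features):
            if feature in good_features:
                continue
            candidate = good_features + [feature]
            score = evaluate(model, X[:, candidate], y)
            if score > best_score:
                best_feature, best_score = feature, score

        # Stop when nothing is left to try or the best candidate
        # does not improve on the previous round
        if best_feature is None:
            break
        if best_scores and best_score <= best_scores[-1]:
            break

        good_features.append(best_feature)
        best_scores.append(best_score)

    return good_features, best_scores


# Synthetic binary classification data: 8 features, 3 informative
X_clf, y_clf = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
model = LogisticRegression(max_iter=1000)

selected, scores = greedy_feature_selection(model, X_clf, y_clf)
print(selected)
print(scores)
```

Even on this tiny example the model is refit for every candidate feature in every round, which makes the cost of the method visible: with many features, the number of fits grows roughly quadratically.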
