# Introduction
> Feature Selection:
> > selecting high-quality, informative features and dropping less useful features.
> ______________________________
> Three types of feature selection methods:
> 1. `Filter methods` select the best features by examining their statistical properties.
> > Methods where we explicitly set a threshold for a statistic or manually select the number
of features we want to keep are examples of feature selection by filtering.
> 2. `Wrapper methods` use trial and error to find the subset of features that produces models with
the highest quality predictions.
> > Wrapper methods are often the most effective, as they find the best result through actual experimentation as opposed to naive assumptions.
> 3. `Embedded methods` select the best feature subset as part of, or as an extension of, a learning algorithm’s training process.

# Thresholding Numerical Feature Variance
> `Variance thresholding (VT)` is an example of feature selection by filtering, and one of
the most basic approaches to feature selection. It is motivated by the idea that features
with low variance are likely less interesting (and less useful) than features with high
variance.
> > VT first calculates the variance of each feature:<br>
> > $$\operatorname{Var}(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$

> > where $x$ is the feature vector, $x_i$ is an individual feature value, $n$ is the number of observations, and $\mu$ is that feature’s
mean value. VT then drops all features whose variance does not meet a chosen threshold.
> ________________________________________
> Two key points:
> > 1. The variance is not centered (it is in the squared unit of the feature itself), so VT will not work
when feature sets contain different units (e.g., one feature is in years while another is in dollars).
>> 2. The variance threshold is selected manually, so we have to use our own judgment to pick a good value.
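> The variance calculation and the thresholding step can be reproduced by hand. The sketch below assumes the population variance (dividing by n) and a hand-picked threshold of 0.5; the names `manual_variances` and `keep_mask` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

# Load the iris feature matrix (150 observations, 4 numerical features)
features = load_iris().data

# Variance of each column: mean of squared deviations from the column mean
mu = features.mean(axis=0)
manual_variances = ((features - mu) ** 2).mean(axis=0)

# Keep only the columns whose variance is above the hand-picked threshold
threshold = 0.5
keep_mask = manual_variances > threshold
features_high_variance = features[:, keep_mask]

# Cross-check against scikit-learn's VarianceThreshold
selector = VarianceThreshold(threshold=threshold).fit(features)
print(manual_variances)
print(selector.variances_)
print(keep_mask, selector.get_support())
```

> Both routes should report the same variances and keep the same columns, which shows that VT is only a comparison of per-feature variances against one number.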


In [1]:
# If you have a set of numerical features and want to filter out those with low variance (i.e., likely containing little information),
# select the subset of features with variances above a given threshold:
# Load libraries
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold
# Import some data to play with
iris = datasets.load_iris()
# Create features and target
features = iris.data
target = iris.target
# Create thresholder
thresholder = VarianceThreshold(threshold=.5)
# Create high variance feature matrix
features_high_variance = thresholder.fit_transform(features)
# View high variance feature matrix
features_high_variance[0:3]

array([[5.1, 1.4, 0.2],
       [4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2]])

In [2]:
# We can see the variance for each feature using variances_:
# View variances
thresholder.fit(features).variances_

array([0.68112222, 0.18871289, 3.09550267, 0.57713289])

In [3]:
# If the features have been standardized (to mean zero and unit variance), then
# every feature has a variance of 1, so VT cannot tell the features apart:
# Load library
from sklearn.preprocessing import StandardScaler
# Standardize feature matrix
scaler = StandardScaler()
features_std = scaler.fit_transform(features)
# Calculate variance of each feature
selector = VarianceThreshold()
selector.fit(features_std).variances_

array([1., 1., 1., 1.])

# Thresholding Binary Feature Variance
> As with numerical features, one strategy for selecting highly informative categorical
features and filtering out less informative ones is to examine their variances.
> > In binary features (i.e., Bernoulli random variables), variance is calculated as:<br>
>> $$\operatorname{Var}(x) = p(1 - p)$$

>> - where $p$ is the proportion of observations of class 1. Therefore, by setting a threshold in terms of $p$, we can
remove features where the vast majority of observations are one class.
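> As a worked example of the Bernoulli variance, the sketch below computes p(1 - p) by hand for a small illustrative matrix and compares it with a threshold built from p = 0.75. The variable names are assumptions made for illustration.

```python
import numpy as np

# Three binary features observed five times
features = np.array([[0, 1, 0],
                     [0, 1, 1],
                     [0, 1, 0],
                     [0, 1, 1],
                     [1, 0, 0]])

# p is the proportion of observations equal to 1 in each column
p = features.mean(axis=0)

# Bernoulli variance of each column
bernoulli_variances = p * (1 - p)

# Threshold: the variance of a feature that is 75% one class
threshold = 0.75 * (1 - 0.75)

# A feature is kept only if its variance exceeds the threshold
keep_mask = bernoulli_variances > threshold

print(p)
print(bernoulli_variances)
print(keep_mask)
```

> The first two features are 80% one class, so their variance (0.16) falls below the threshold (0.1875) and only the third feature survives.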


In [4]:
# If you have a set of binary categorical features and want to filter out those with low variance (i.e., likely containing little information),
# select the subset of features with a Bernoulli random variable variance above a given threshold:
# Load library
from sklearn.feature_selection import VarianceThreshold
# Create feature matrix with:
# Feature 0: 80% class 0
# Feature 1: 80% class 1
# Feature 2: 60% class 0, 40% class 1
features = [[0, 1, 0],
         [0, 1, 1],
         [0, 1, 0],
         [0, 1, 1],
         [1, 0, 0]]
# Run threshold by variance: drop features where 75% or more of observations are the same class
thresholder = VarianceThreshold(threshold=(.75 * (1 - .75)))
thresholder.fit_transform(features)

array([[0],
       [1],
       [0],
       [1],
       [0]])

# Handling Highly Correlated Features
> One problem we often run into in machine learning is highly correlated features.
> > If two features are highly correlated, then the information they contain is very similar,
and it is likely redundant to include both features.<br>
> > In the case of simple models like linear regression, failing to remove such features violates the assumptions of linear
regression and can result in an `artificially inflated R-squared value`.


> > The solution to highly correlated features is simple:
> > > Remove one of them from the feature set. You can do this by setting a correlation threshold.


In [5]:
# If you have a feature matrix and suspect some features are highly correlated,
# use a correlation matrix to check for highly correlated features. 
# If highly correlated features exist, consider dropping one of the correlated features:
# Load libraries
import pandas as pd
import numpy as np
# Create feature matrix with two highly correlated features
features = np.array([[1, 1, 1],
                     [2, 2, 0],
                     [3, 3, 1],
                     [4, 4, 0],
                     [5, 5, 1],
                     [6, 6, 0],
                     [7, 7, 1],
                     [8, 7, 0],
                     [9, 7, 1]])

# Convert feature matrix into DataFrame
dataframe = pd.DataFrame(features)
# Create correlation matrix
corr_matrix = dataframe.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features (to_drop holds column labels, so drop by label)
dataframe.drop(columns=to_drop).head(3)

   0  2
0  1  1
1  2  0
2  3  1


In [6]:
# In the code above, first we create a correlation matrix of all features:
# Correlation matrix
dataframe.corr()

          0         1         2
0  1.000000  0.976103  0.000000
1  0.976103  1.000000 -0.034503
2  0.000000 -0.034503  1.000000


In [7]:
# Second, we look at the upper triangle of the correlation matrix to identify pairs of highly correlated features:
# Upper triangle of correlation matrix
upper
# Third, we remove one feature from each of those pairs.

    0         1         2
0 NaN  0.976103  0.000000
1 NaN       NaN  0.034503
2 NaN       NaN       NaN


# Removing Irrelevant Features for Classification
> `Chi-square` statistics examine the independence of two categorical vectors.
> > The statistic measures the difference between the observed number of observations in each
class of a categorical feature and what we would expect if that feature were independent of (had no relationship with) the target vector:<br><br>
> > $$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

> > where $O_i$ is the observed number of observations in class $i$, and $E_i$ is the expected number of observations in class $i$.
>__________________________________________________
> A chi-squared statistic is a single number that tells you how much difference exists
between your observed counts and the counts you would expect if there were no relationship at all in the population.
> > By calculating the chi-squared statistic between a feature and the target vector, we obtain a measure of the dependence between
the two.
> > > - If the target is independent of the feature variable, then the feature is irrelevant for
our purposes because it contains no information we can use for classification.
>>> - On the other hand, if the feature and the target are highly dependent, the feature is likely very informative for training our model.
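> The observed-versus-expected comparison can be worked through by hand. The sketch below uses a hypothetical 2x2 table of counts (feature category by target class) and cross-checks the result with `scipy.stats.chi2_contingency`; both the table and the use of SciPy are assumptions for illustration and are separate from the scikit-learn `chi2` scorer used later in this notebook.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are feature categories, columns are target classes
observed = np.array([[30, 10],
                     [20, 40]])

# Expected counts if feature and target were independent:
# (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-squared statistic: sum of (O - E)^2 / E over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Cross-check with SciPy (continuity correction disabled to match the plain formula)
chi2_scipy, p_value, dof, expected_scipy = chi2_contingency(observed, correction=False)

print(expected)
print(chi2_manual, chi2_scipy, p_value)
```

> A large statistic (with a small p-value) suggests the feature and the target are dependent, which is what makes the feature useful for classification.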
>___________________________________________________
> To use chi-squared in feature selection, we `calculate the chi-squared` statistic `between`
each `feature` and the `target` vector, then select the features with the best chi-square statistics.
> > In `scikit-learn`, we can use `SelectKBest` to select them.
> > >The parameter `k` determines the number of features we want to keep and filters out the least informative features.
>___________________________________________
> **Note:**
> 1. Chi-square statistics can be calculated only between two categorical vectors (both the target vector and the features must be categorical).
> > If we have a numerical feature we can use the chi-squared technique by first transforming
the quantitative feature into a categorical feature.
> 2. To use our chi-squared approach, all values need to be `non-negative`.
>_____________________________________________
> Numerical feature:
> >We can use `f_classif` to calculate the `ANOVA F-value statistic` with each feature and the target vector.
> >> `F-value` scores examine if, when we group the numerical feature by the target vector, the means for
each group are significantly different.
> >> > **Example:** if we had a binary target vector, gender, and a quantitative feature, test scores, the F-value score would tell us if the
mean test score for men is different from the mean test score for women. If it is not, then test score doesn’t help us predict gender and therefore the feature is irrelevant.
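> To make the "group by target, compare means" idea concrete, the sketch below scores the iris features with `f_classif` and recomputes the F-value of the first feature with `scipy.stats.f_oneway`. Using SciPy for the cross-check is an assumption made for illustration.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

# Load numerical features and a categorical target
iris = load_iris()
features, target = iris.data, iris.target

# One F-value and one p-value per feature
f_values, p_values = f_classif(features, target)

# Recompute the first feature's F-value by splitting its values by class
groups = [features[target == label, 0] for label in np.unique(target)]
f_first, p_first = f_oneway(*groups)

# Group means that the F-test is comparing
group_means = [group.mean() for group in groups]

print(f_values)
print(f_first, group_means)
```

> The further apart the group means are relative to the spread within each group, the larger the F-value and the more useful the feature is likely to be.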



In [8]:
# If you have a categorical target vector and want to remove uninformative features,
# for the features that are categorical, 
# just calculate a chi-square (χ2) statistic between each feature and the target vector:
# Load libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif
# Load data
iris = load_iris()
features = iris.data
target = iris.target
# Convert to categorical data by converting data to integers
features = features.astype(int)
# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
features_kbest = chi2_selector.fit_transform(features, target)
# Show results
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Original number of features: 4
Reduced number of features: 2


In [9]:
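# If the features are quantitative,
# compute the ANOVA F-value between each feature and the target vector
# (note: features is still the integer-cast matrix from the previous cell;
# reload iris.data to score the raw measurements):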
# Select two features with highest F-values
fvalue_selector = SelectKBest(f_classif, k=2)
features_kbest = fvalue_selector.fit_transform(features, target)
# Show results
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Original number of features: 4
Reduced number of features: 2


In [10]:
# Instead of selecting a specific number of features,
# we can use SelectPercentile to select the top n percent of features:
# Load library
from sklearn.feature_selection import SelectPercentile
# Select top 75% of features with highest F-values
fvalue_selector = SelectPercentile(f_classif, percentile=75)
features_kbest = fvalue_selector.fit_transform(features, target)
# Show results
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Original number of features: 4
Reduced number of features: 3


# Recursively Eliminating Features
> The idea behind recursive feature elimination (RFE) is to repeatedly train a model that has parameters (weights or
coefficients), such as linear regression, updating those parameters each time.
> > The first time we train the model, we include all the features. Then, we find the feature with the smallest parameter in absolute value (notice that this
assumes the features are either rescaled or standardized), meaning it is the least important, and remove that feature from the feature set.
> ___________________________________________
> We can use cross-validation (`CV`) to find the optimum number of features to keep during RFE.
> In RFE with CV, after every iteration we use cross-validation to evaluate our model.
> > If CV shows that our model improved after we eliminated a feature, then we continue on to the next loop. However, if CV shows that our model got worse after we eliminated a feature, we put that feature back into the feature set and select those features as the best.
>_______________________________
> In `scikit-learn`, `RFE with CV` is implemented using `RFECV`, which contains a number
of important parameters:
> > 1. The `estimator` parameter determines the type of model we want to train (e.g., linear regression).
> > 2. The `step` parameter sets the number or proportion of features to drop during each loop.
> > 3. The `scoring` parameter sets the metric of quality we use to evaluate our model during cross-validation.
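> The elimination loop itself (without the cross-validation step) can be sketched by hand. The example below assumes a small synthetic dataset, a fixed target of two features, and a linear regression whose smallest absolute coefficient marks the feature to drop; the names `remaining` and `n_features_to_keep` are illustrative, and scikit-learn's `RFE` is included only as a cross-check.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 features on a comparable scale, only 2 of them informative
features, target = make_regression(n_samples=500,
                                   n_features=10,
                                   n_informative=2,
                                   random_state=1)

# Column indices still in the feature set
remaining = list(range(features.shape[1]))
n_features_to_keep = 2  # assumed stopping point for this sketch

while len(remaining) > n_features_to_keep:
    # Train on the features that are still in the set
    model = LinearRegression().fit(features[:, remaining], target)
    # Position of the feature with the smallest absolute coefficient
    weakest = int(np.argmin(np.abs(model.coef_)))
    # Remove that feature and repeat
    remaining.pop(weakest)

# Cross-check with scikit-learn's RFE, dropping one feature per loop
rfe = RFE(LinearRegression(), n_features_to_select=n_features_to_keep, step=1)
rfe.fit(features, target)

final_model = LinearRegression().fit(features[:, remaining], target)
print(sorted(remaining), np.flatnonzero(rfe.support_))
```

> `RFECV` adds cross-validated scoring on top of this loop so that the number of features to keep does not have to be fixed in advance.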

In [11]:
# If you want to automatically select the best features to keep,
# use scikit-learn’s RFECV to conduct recursive feature elimination (RFE) using cross-validation (CV). 
# That is, use the wrapper feature selection method and repeatedly train a model, each time removing a feature,
# until model performance (e.g., accuracy) becomes worse.
# The remaining features are the best:
# Load libraries
import warnings
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn import datasets, linear_model
# Suppress an annoying but harmless warning
warnings.filterwarnings(action="ignore", module="scipy",
 message="^internal gelsd")
# Generate features matrix and target vector
features, target = make_regression(n_samples = 10000,
                                 n_features = 100,
                                 n_informative = 2,
                                 random_state = 1)
# Create a linear regression
ols = linear_model.LinearRegression()
# Recursively eliminate features
rfecv = RFECV(estimator=ols, step=1, scoring="neg_mean_squared_error")
rfecv.fit(features, target)
rfecv.transform(features)

array([[ 0.00850799,  0.7031277 ,  1.02069177],
       [-1.07500204,  2.56148527,  0.10585966],
       [ 1.37940721, -1.77039484, -0.53049556],
       ...,
       [-0.80331656, -1.60648007,  0.25921194],
       [ 0.39508844, -1.34564911, -1.80744499],
       [-0.55383035,  0.82880112,  0.76876009]])

In [12]:
# Once we have conducted RFE, we can see the number of features we should keep:
# Number of best features
rfecv.n_features_

3

In [13]:
# We can also see which of those features we should keep:
# Which features are best
rfecv.support_

array([False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False])

In [15]:
# We can even view the rankings of the features:
# Rank features best (1) to worst
rfecv.ranking_

array([55, 54, 53, 17, 97,  1, 69, 12, 26, 58, 49, 35, 14,  9, 42, 24, 73,
        6, 93, 32, 75, 19, 65, 45, 20, 23, 59, 91, 74, 29, 41, 16,  5, 88,
       57, 90, 43, 86, 50,  1, 30, 47,  2, 76, 48, 60, 28, 40, 51, 37, 15,
        8, 31, 83, 98,  4, 96, 70, 67, 56, 71, 10, 25, 66, 89, 46, 63, 33,
       34, 92,  7, 11, 82, 87, 68, 72, 39, 61, 22, 81,  3, 78, 27, 38, 44,
       18, 94, 62, 79, 36, 95, 13, 80, 52, 85, 84,  1, 21, 77, 64])

# END of chapter 10 --> dimension reduction using feature selection