### Feature Selection Techniques in ML

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [29]:
# Why do we need feature selection techniques?
# What are the advantages of feature selection techniques?

- To reduce the computational cost of the model.
- To reduce the complexity of the model.
- In some cases, the performance/accuracy of the model can be increased by reducing the number of features.
- Simpler models are easier to interpret.
- Shorter training times.
- Better generalization by reducing overfitting.
- Easier for software developers to implement.
- Reduced risk of data errors when the model is used.
- Removal of redundant variables.
- Avoids poor learning behaviour in high-dimensional spaces.

In [5]:
# Types of Feature Selection Techniques

- Methods can be distinguished by whether features are selected based on the target variable or not.
- Unsupervised methods ignore the target variable, e.g. methods that remove redundant variables using correlation.
- Supervised techniques use the target variable, e.g. methods that remove irrelevant variables.
- Wrapper, filter, and intrinsic (embedded) methods are supervised.
- Wrapper methods create many models with different subsets of input features and select the features that result in the best-performing model according to a performance metric.
- Filter methods use statistical techniques to score the relationship between each input variable and the target variable, and these scores are used as the basis to choose (filter) the input variables.
- Finally, some algorithms perform feature selection automatically as part of learning the model; these are the intrinsic methods.
- Feature selection is also related to dimensionality reduction techniques. The difference is that feature selection selects features to keep or remove from the dataset, whereas dimensionality reduction creates a projection of the data, resulting in entirely new input features.
- So, dimensionality reduction is an alternative to feature selection rather than a type of feature selection.
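
The contrast between selection and projection can be sketched as follows. This is a minimal illustration on synthetic data; the dataset sizes, `k=5`, and `n_components=5` are arbitrary assumptions, not values from the notes above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (sizes chosen only for illustration)
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, random_state=0)

# Feature selection: keeps 5 of the original columns, values unchanged
selector = SelectKBest(score_func=f_regression, k=5)
X_sel = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)
print("selected columns are original columns:", np.array_equal(X_sel, X[:, kept]))

# Dimensionality reduction: builds 5 new columns, each a combination of all 20 inputs
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)
print(X_pca.shape, pca.components_.shape)
```

Note that `SelectKBest` uses the target `y`, while `PCA` here does not.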

In [7]:
# We can summarise various methods as follows:

Feature Selection: Select a subset of input features from the dataset.

Unsupervised: Do not use the target variable (e.g. remove redundant variables).
- Correlation

Supervised: Use the target variable (e.g. remove irrelevant variables).

1. Wrapper: Search for well-performing subsets of features.
- RFE
2. Filter: Select subsets of features based on their relationship with the target.
- Statistical Methods
- Feature Importance Methods
3. Intrinsic: Algorithms that perform automatic feature selection during training.
- Decision Trees

Dimensionality Reduction: Project input data into a lower-dimensional feature space.

In [None]:
# See the image below for categorization

![feature%20selection%20types.JPG](attachment:feature%20selection%20types.JPG)

In [9]:
# The choice of statistical measure (e.g. for filter methods) depends heavily on the data types of the variables

Common input variable data types:

Numerical Variables
- Integer Variables.
- Floating Point Variables.

Categorical Variables
- Boolean Variables (dichotomous, e.g. True, False)
- Nominal Variables (e.g. a, b, c)
- Ordinal Variables (e.g. good, better, best)

In [10]:
# Once the data types are confirmed, you can proceed as below

Numerical input, numerical output:
- Pearson’s correlation coefficient (linear)
- Spearman’s rank coefficient (monotonic, possibly nonlinear)

Categorical input, categorical output:
- Chi-Squared test (contingency tables).
- Mutual Information Score

Numerical input, categorical output (or vice versa):
- ANOVA F-statistic, often called the ANOVA correlation coefficient (linear)
- Kendall’s rank coefficient (nonlinear; assumes the categorical variable is ordinal)

#### pearson's correlation coefficient

In [16]:
# Feature selection is performed using Pearson’s correlation coefficient via the f_regression() function,
# which converts each feature's correlation with the target into an F-statistic

Parameter of make_regression() used below:

n_informative : int, default=10

The number of informative features, i.e., the number of features used to build the linear model used to generate the output.

In [15]:
# pearson's correlation coefficient for numeric input and numeric output
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest,f_regression
# generate dataset
x, y = make_regression(n_samples=100, n_features=100, n_informative=10)
# define feature selection
fs = SelectKBest(score_func=f_regression, k=10)
# apply feature selection
x_selected = fs.fit_transform(x, y)
print(x_selected.shape)

(100, 10)


#### Spearman’s rank coefficient

In [28]:
# 
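
scikit-learn does not ship a Spearman score function, so the sketch below assumes one possible approach: wrap `scipy.stats.spearmanr` in a custom scorer (`spearman_score` is a hypothetical helper name) and pass it to `SelectKBest`. The exponential transform of the target is only there to make the relationship monotonic but nonlinear.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest


def spearman_score(X, y):
    """Hypothetical scorer: absolute Spearman rank correlation of each column with y."""
    scores = []
    for j in range(X.shape[1]):
        rho, _ = spearmanr(X[:, j], y)
        scores.append(abs(rho))
    return np.array(scores)


# generate dataset
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, random_state=0)
# monotonic, nonlinear transform of the target (illustrative assumption)
y = np.exp(y / 100.0)

# define and apply feature selection
fs = SelectKBest(score_func=spearman_score, k=5)
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)
```

Because Spearman's coefficient works on ranks, the scores are the same as they would be for the untransformed target.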

> For categorical input when the target is also categorical (classification problems), we commonly use the
> - chi-squared statistic and
> - mutual information statistic

#### Chi-squared Test
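
A minimal sketch of chi-squared feature selection for categorical input and categorical output. The data is synthetic and the column names (`informative`, `noise_a`, `noise_b`) are hypothetical; one-hot encoding is an assumed preprocessing step, needed because `chi2` expects non-negative feature values.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)  # binary target

# Hypothetical categorical inputs: the first agrees with y about 80% of the time,
# the other two are unrelated to y
informative = np.where(rng.random(n) < 0.8, y, 1 - y)
noise_a = rng.integers(0, 3, size=n)
noise_b = rng.integers(0, 4, size=n)
X_cat = np.column_stack([informative, noise_a, noise_b])

# one-hot encode so every column is a non-negative 0/1 indicator
X_enc = OneHotEncoder().fit_transform(X_cat)

# define and apply feature selection on the encoded columns
fs = SelectKBest(score_func=chi2, k=2)
X_selected = fs.fit_transform(X_enc, y)
print(X_selected.shape)
print(fs.scores_.round(2))
```

With this setup the selection happens per one-hot column, not per original variable.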

#### Mutual Information Score:
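
A minimal sketch using `mutual_info_classif` on integer-coded categorical inputs. The synthetic data and column roles are assumptions for illustration; `discrete_features=True` tells the estimator to treat every column as categorical, and `functools.partial` is one way to pass that option through `SelectKBest`.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)  # binary target

# Hypothetical integer-coded categorical inputs: column 0 agrees with y about 80%
# of the time, columns 1 and 2 are unrelated to y
informative = np.where(rng.random(n) < 0.8, y, 1 - y)
noise_a = rng.integers(0, 3, size=n)
noise_b = rng.integers(0, 4, size=n)
X_cat = np.column_stack([informative, noise_a, noise_b])

# bind the extra arguments so SelectKBest can call score_func(X, y)
score_func = partial(mutual_info_classif, discrete_features=True, random_state=0)

fs = SelectKBest(score_func=score_func, k=1)
X_selected = fs.fit_transform(X_cat, y)
print(X_selected.shape)
print(fs.get_support(indices=True))
```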

> For numerical input with categorical output (or vice versa), we use the following two methods:

#### ANOVA Correlation Coefficient (Linear)

In [17]:
# ANOVA feature selection for numeric input and categorical output
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
# generate dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=5)
# define feature selection
fs = SelectKBest(score_func=f_classif, k=6)
# apply feature selection
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)

(100, 6)


#### Kendall's Rank Coefficient (Nonlinear)

In [27]:
# 
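
As with Spearman, scikit-learn has no built-in Kendall scorer, so this sketch assumes a custom score function (`kendall_score` is a hypothetical name) built on `scipy.stats.kendalltau`. Kendall's tau treats the class labels as ordered, which is trivially true for the binary target generated here but is an assumption to check for real multi-class data.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest


def kendall_score(X, y):
    """Hypothetical scorer: absolute Kendall's tau between each column and y."""
    scores = []
    for j in range(X.shape[1]):
        tau, _ = kendalltau(X[:, j], y)
        scores.append(abs(tau))
    return np.array(scores)


# generate dataset (numeric input, binary output)
X, y = make_classification(n_samples=100, n_features=20, n_informative=5, random_state=0)

# define and apply feature selection
fs = SelectKBest(score_func=kendall_score, k=6)
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)
```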

## Brief Introduction to All Feature Selection Methods:

#### 1. Filter Methods:

In terms of computation, filter methods are very fast and inexpensive, and they are good at removing duplicated, correlated, and redundant features. However, these methods do not remove multicollinearity.

Each feature is evaluated individually, which works well when features act in isolation (have no dependency on other features) but falls short when a combination of features would increase the overall performance of the model.

- Chi-square test
- Correlation Coefficient (with heatmap)
- Dispersion Ratio
- Fisher’s Score
- Information Gain
- Mean Absolute Difference (MAD)
- Mutual Dependence
- Relief
- Variance Threshold
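
Two of the filters listed above, variance threshold and correlation coefficient, can be sketched as follows. The toy DataFrame, its column names, and the 0.95 correlation cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=100),
    "b": rng.normal(size=100),
    "constant": np.ones(100),  # zero variance, carries no information
})
# near-duplicate of "a" (scaled, with a little noise added)
df["a_copy"] = df["a"] * 2.0 + rng.normal(scale=0.01, size=100)

# Step 1: variance threshold, drops columns whose variance is zero
vt = VarianceThreshold(threshold=0.0)
vt.fit(df)
df_var = df.loc[:, vt.get_support()]

# Step 2: correlation filter, drops the later column of each highly correlated pair
corr = df_var.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_filtered = df_var.drop(columns=to_drop)
print(list(df_filtered.columns))
```

Neither step looks at a target variable, so both are unsupervised filters.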

#### 2. Wrapper methods:

- Often referred to as greedy algorithms.
- They train the algorithm iteratively, each time using a different subset of features.
- These methods search for a well-performing set of features for training the model, which usually gives better accuracy than filter methods, but they are computationally more expensive.

Some well-known wrapper methods are listed below:

- Forward selection 
- Backward elimination
- Bi-directional elimination
- Exhaustive Feature Selection
- Recursive Feature Elimination
- Recursive Feature Elimination with Cross-Validation
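
The last two entries can be sketched with scikit-learn's `RFE` and `RFECV`. The choice of `LogisticRegression` as the wrapped estimator, the dataset sizes, and `n_features_to_select=5` are assumptions for illustration; any estimator that exposes `coef_` or `feature_importances_` could be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

# generate dataset
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)
estimator = LogisticRegression(max_iter=1000)

# RFE: refit the model repeatedly, removing the weakest feature each round
# until the requested number of features remains
rfe = RFE(estimator=estimator, n_features_to_select=5, step=1)
X_rfe = rfe.fit_transform(X, y)
print(X_rfe.shape)

# RFECV: same elimination loop, but the number of features to keep
# is chosen by cross-validated accuracy
rfecv = RFECV(estimator=estimator, step=1, cv=5, scoring="accuracy")
rfecv.fit(X, y)
print(rfecv.n_features_)
```

Each elimination round retrains the model, which is where the extra computational cost of wrapper methods comes from.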

#### 3. Embedded or Intrinsic methods:

- In embedded methods, feature selection is built into the learning algorithm itself, so the model performs its own feature selection during training.
- They aim to overcome the drawbacks of filter and wrapper methods and combine their advantages.

Some of these methods are listed below:

- Lasso Regularization
- Ridge Regularization (shrinks coefficients but does not set them exactly to zero)
- Tree-based methods
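
A minimal sketch of embedded selection with Lasso, with Ridge shown for contrast. The synthetic data and `alpha=1.0` are illustrative assumptions; `SelectFromModel` is one way to turn the fitted coefficients into a feature mask.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, Ridge

# generate dataset
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

# Lasso (L1 penalty) can drive coefficients exactly to zero;
# Ridge (L2 penalty) only shrinks them
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("non-zero Lasso coefficients:", int(np.sum(lasso.coef_ != 0)))
print("non-zero Ridge coefficients:", int(np.sum(ridge.coef_ != 0)))

# keep only the features whose Lasso coefficient is (effectively) non-zero
selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X)
print(X_selected.shape)
```

Tree-based models can be used in the same way, since `SelectFromModel` also accepts estimators that expose `feature_importances_`.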

### [[[END]]]