# Feature selection
Feature selection is the process of selecting a subset of features from a dataset for use in machine learning models. The goal of feature selection is to improve the performance of the model by reducing noise and redundancy in the data, and by making the model more interpretable.

Types of feature selection techniques:
1. Filter methods
2. Wrapper methods
3. Embedded methods
4. Hybrid methods



In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sb


#### The filter-based techniques (except Chi2 and mutual information) and the RFE example are implemented on the data frame below

In [9]:
df = pd.read_csv(r"C:\Users\prasa\Dropbox\PC\Desktop\ML\My own content\Feature Selection\Financial Distress.csv")
print(df.shape)

(3672, 86)


In [10]:
df.head()

Unnamed: 0,Company,Time,Financial Distress,x1,x2,x3,x4,x5,x6,x7,...,x74,x75,x76,x77,x78,x79,x80,x81,x82,x83
0,1,1,0.010636,1.281,0.022934,0.87454,1.2164,0.06094,0.18827,0.5251,...,85.437,27.07,26.102,16.0,16.0,0.2,22,0.06039,30,49
1,1,2,-0.45597,1.27,0.006454,0.82067,1.0049,-0.01408,0.18104,0.62288,...,107.09,31.31,30.194,17.0,16.0,0.4,22,0.010636,31,50
2,1,3,-0.32539,1.0529,-0.059379,0.92242,0.72926,0.020476,0.044865,0.43292,...,120.87,36.07,35.273,17.0,15.0,-0.2,22,-0.45597,32,51
3,1,4,-0.56657,1.1131,-0.015229,0.85888,0.80974,0.076037,0.091033,0.67546,...,54.806,39.8,38.377,17.167,16.0,5.6,22,-0.32539,33,52
4,2,1,1.3573,1.0623,0.10702,0.8146,0.83593,0.19996,0.0478,0.742,...,85.437,27.07,26.102,16.0,16.0,0.2,29,1.251,7,27


#### This is a classification problem.
#### Target variable: Financial Distress
#### If Financial Distress > -0.5, the company is treated as healthy (1); otherwise it is treated as distressed (0)

In [14]:
def convert(x):
    if x>-0.5:
        return 1
    else:
        return 0
df['y'] =  df['Financial Distress'].apply(convert)
df['Financial Distress'] = df['y']
df.drop('y',axis=1,inplace=True)

In [15]:
df['Financial Distress'].value_counts()

1    3536
0     136
Name: Financial Distress, dtype: int64

In [92]:
X = df.drop('Financial Distress',axis=1)
y = df['Financial Distress'] 

### Implementing an RF classification model to compare performance before and after feature selection


In [16]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics  import accuracy_score, confusion_matrix, classification_report

In [17]:
RF = RandomForestClassifier(n_estimators=30,n_jobs=1)

In [71]:
print(cross_val_score(estimator=RF,X=X,y=y,cv=6).mean())

0.9577886710239651
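
Because the classes are heavily imbalanced (3536 healthy vs 136 distressed, per the value counts above), it can help to compare the cross-validated accuracy with the accuracy of always predicting the majority class. The following is a minimal sketch of that baseline using only the counts printed earlier; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above
n_healthy = 3536     # label 1
n_distressed = 136   # label 0

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(n_healthy, n_distressed) / (n_healthy + n_distressed)
print(round(majority_baseline, 4))
```

Accuracies close to this value should be read with care; a metric such as recall on the minority class may be more informative for this data.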


## 1. Filter based methods:
Filter methods select features based on their individual statistical measures, such as variance or correlation with the target variable, without training a model.
Examples of filter-based methods are:
1. Variance threshold
2. Correlation
3. ANOVA
4. Chi2 
5. Mutual info


### 1.1. Variance threshold method:
Drops constant and quasi-constant features, i.e. features whose variance falls below a chosen threshold.
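
As a rough illustration of the idea, the sketch below computes the variance of a few toy columns and keeps only those above a threshold. The column names, values, and threshold are hypothetical, and the use of population variance is an assumption about how `VarianceThreshold` behaves internally.

```python
from statistics import pvariance

# Toy columns (hypothetical values, not from the Financial Distress data)
columns = {
    "constant":       [1.0, 1.0, 1.0, 1.0, 1.0],
    "quasi_constant": [0.0, 0.0, 0.0, 0.0, 1.0],
    "informative":    [0.1, 2.3, 1.7, 3.9, 0.8],
}
threshold = 0.2

# Keep a column only if its population variance exceeds the threshold
kept = [name for name, values in columns.items() if pvariance(values) > threshold]
print(kept)
```

Note that variance depends on the scale of each feature, so a single threshold is easier to interpret when the features are on comparable scales.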


In [63]:
from sklearn.feature_selection import VarianceThreshold
VarThr = VarianceThreshold(threshold=0.08)

In [64]:
VarThr.fit(X,y)

In [65]:
X_vartance_thr_method = X[X.columns[VarThr.get_support()]]
X_vartance_thr_method.shape

(3672, 65)

#### Note: The number of features is reduced from 85 to 65 by dropping the features that exhibit low variance

#### Implementing the model to determine the accuracy after feature selection

In [73]:
print(cross_val_score(estimator=RF,X= X_vartance_thr_method,y=y,cv=6).mean())

0.9610566448801743


#### Accuracy changed only slightly (0.9578 to 0.9611), which may be within the run-to-run variation of an unseeded random forest. The main gain is that 20 features are dropped, which reduces the computation required.

#### Demerits:
1. Feature-feature and feature-target relationships are not considered in this method

### 1.2. Correlation threshold method:
Measures the correlation between each pair of features and, for every pair whose correlation exceeds the given threshold, drops one of the two features.

In [75]:
# Creating correlation matrix
corr_matrix = X.corr()
print(corr_matrix.shape)


(85, 85)


In [89]:
features_to_drop = []
# Walk the lower triangle so each feature pair is checked once
for i in range(len(corr_matrix)):
    for j in range(0,i):
        # Note: only positive correlations above 0.9 are flagged here
        if corr_matrix.iloc[i,j]>0.9:
            features_to_drop.append(corr_matrix.columns[i])
features_to_drop = list(set(features_to_drop))
print(features_to_drop)
print("No.of features to drop : ",len(features_to_drop))

['x48', 'x81', 'x62', 'x49', 'x52', 'x53', 'x38', 'x7', 'x77', 'x76', 'x75', 'x34']
No.of features to drop :  12


In [93]:
X_corre_method = X.drop(columns=features_to_drop)
X_corre_method.shape

(3672, 73)

#### Note: The code above removes one feature from each highly correlated pair

In [94]:
print(cross_val_score(estimator=RF,X=X_corre_method,y=y,cv=6).mean())

0.9613289760348583


#### Even though the accuracy does not improve significantly, the number of features is reduced by 12, which reduces the computation required for training and prediction.

#### Demerits:
1. Captures only linear relationships; complex non-linear relationships cannot be captured with this technique.
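
The loop in this section compares the raw correlation with 0.9, so a strong negative correlation would not be flagged. A possible variant, sketched here on a toy frame with hypothetical values, takes absolute correlations and scans the upper triangle instead; it is one common idiom, not the only way to do this.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): "b" is an exact negative multiple of "a"
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [-2, -4, -6, -8, -10, -12],
    "c": [1, -1, 1, -1, 1, -1],
})

# Absolute correlations, so strong negative relationships are flagged too
abs_corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair is examined once
mask = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
upper = abs_corr.where(mask)

# Drop the later column of every pair whose |correlation| exceeds 0.9
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)
```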

### 1.3. ANOVA
Tests the relationship between a numerical feature and a categorical target by comparing the feature's mean across the target's classes (two or more).
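
To make the statistic concrete, here is a minimal hand-rolled sketch of the one-way ANOVA F statistic for a single feature split by a binary target. The helper name `f_statistic` and the sample values are hypothetical; `f_classif` is expected to compute this kind of score per feature.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)

    # Between-group sum of squares: how far each class mean is from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own class mean
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical values of one feature, split by a binary target
feature_when_y0 = [1.0, 2.0, 3.0]
feature_when_y1 = [4.0, 5.0, 6.0]
f_value = f_statistic([feature_when_y0, feature_when_y1])
print(f_value)
```

A larger F value means the class means differ more relative to the spread within each class, which is why `SelectKBest` keeps the features with the highest scores.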

In [97]:
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import SelectKBest

In [120]:
sel = SelectKBest(f_classif,k=40)
sel_k = sel.fit(X=X,y=y)

In [121]:
print(X[X.columns[sel_k.get_support()]].shape)
X_ANOVA = X[X.columns[sel_k.get_support()]]

(3672, 40)


In [122]:
print(cross_val_score(estimator=RF,X=X_ANOVA,y=y,cv=6).mean())

0.960511982570806


#### Note: This is the benefit of feature selection. Nearly the same accuracy is achieved after dropping more than half of the features (45 of 85).

#### Demerits:
1. Considers the relationship between each feature and the target variable, but neglects feature-feature interactions.
2. The ANOVA test assumes normality of the data, which may not always hold.

### 1.4. Chi2 test based method:
The Chi2 test checks for association between two categorical variables.

The current data frame does not contain any categorical variables on which to perform the Chi2 test.
We will use the Titanic data set to implement it.

In [128]:
df1 = pd.read_csv(r"C:\Users\prasa\Dropbox\PC\Desktop\ML\My own content\Feature Selection\titanic.csv")
df1.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [130]:
df1.dropna(inplace=True)
df1.shape

(183, 12)

In [187]:
# Target variable: Survived
X_titanic = df1.drop('Survived',axis=1)
y_titanic = df1['Survived']


In [138]:
# Considering only limited features in X_titanic
X_titanic = X_titanic[['Pclass','Sex','SibSp','Parch','Cabin','Embarked']]
X_titanic

Unnamed: 0,Pclass,Sex,SibSp,Parch,Cabin,Embarked
1,1,female,1,0,C85,C
3,1,female,1,0,C123,S
6,1,male,0,0,E46,S
10,3,female,1,1,G6,S
11,1,female,0,0,C103,S
...,...,...,...,...,...,...
871,1,female,1,1,D35,S
872,1,male,0,0,B51 B53 B55,S
879,1,female,0,1,C50,C
887,1,female,0,0,B42,S


In [134]:
# Performing label encoding 
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()

In [141]:
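columns_to_encode = ['Sex','SibSp','Cabin','Embarked']
for i in columns_to_encode:
    # Fill missing values with the most frequent value (a no-op here, since dropna() already ran)
    X_titanic[i] = X_titanic[i].fillna(X_titanic[i].mode()[0])
    X_titanic[i] = LE.fit_transform(X_titanic[i])
X_titanic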

Unnamed: 0,Pclass,Sex,SibSp,Parch,Cabin,Embarked
1,1,0,1,0,72,0
3,1,0,1,0,48,2
6,1,1,0,0,117,2
10,3,0,1,1,131,2
11,1,0,0,0,43,2
...,...,...,...,...,...,...
871,1,0,1,1,91,2
872,1,1,0,0,29,2
879,1,0,0,1,61,0
887,1,0,0,0,25,2


#### Performing Chi2 test

In [149]:
from scipy.stats import chi2_contingency

In [162]:
closely_related_features = []
for i in X_titanic.columns:
    # Contingency table of the feature against the target
    crosstab = pd.crosstab(X_titanic[i],y_titanic)
    chi2 = chi2_contingency(crosstab)
    # Keep features whose p-value is below the 5% significance level
    if chi2.pvalue<0.05:
        closely_related_features.append((i,chi2.pvalue))
for i in closely_related_features:
    print(i)

('Sex', 1.8568580662867508e-12)


#### A smaller p-value means stronger evidence that the variables are associated. As per the above results, Sex is the only feature significantly associated with the target variable at the 0.05 level

#### Demerits:
Only suitable for categorical variables; not directly applicable to numerical data
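
As a rough illustration of what the test measures, the sketch below computes the Pearson chi-square statistic for a small contingency table by comparing observed counts with the counts expected under independence. The helper name and the table values are hypothetical. Note that `chi2_contingency` applies a continuity correction by default for 2x2 tables, so its statistic would differ slightly from this uncorrected version.

```python
def chi2_statistic(observed):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table: rows are feature categories, columns are target classes
table = [[10, 20],
         [30, 40]]
chi2_value = chi2_statistic(table)
print(round(chi2_value, 4))
```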

### 1.5. Mutual Info Gain Method:
A measure of the dependency between two random variables (RVs). It quantifies the amount of information obtained about one RV by observing the other.
Like the Chi2 test it measures association, but it can be applied to numeric data as well.

In [164]:
from sklearn.feature_selection import mutual_info_classif
mutual_info_classif(X_titanic,y_titanic)

array([0.01691857, 0.16313987, 0.        , 0.        , 0.        ,
       0.0104994 ])

#### The higher the mutual information, the stronger the association. As per the above result, the Sex column (second value, 0.163) is the most strongly associated with the target variable among the features considered
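
For discrete variables, mutual information can be computed directly from joint and marginal frequencies. The sketch below is a minimal hand-rolled version on hypothetical toy sequences; `mutual_info_classif` instead uses a nearest-neighbour estimator for inputs it treats as continuous (the default for dense arrays), so its scores can vary between runs unless `random_state` is set, and for label-encoded columns passing `discrete_features=True` may be more appropriate.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Compare the joint probability with the product of the marginals
        mi += p_xy * log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Fully dependent pair vs. an independent pair (toy data)
mi_dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(mi_dependent, mi_independent)
```

A value of zero indicates independence; larger values indicate that knowing the feature reduces more uncertainty about the target.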

## 2. Wrapper methods:
Wrapper methods search for the subset of features that optimizes the performance of the ML model, training and evaluating the model on each candidate subset.

In [214]:
from sklearn.datasets import load_iris
data=load_iris()
df2 = pd.DataFrame(data=data.data,columns=data.feature_names)
df2['target'] = data.target
df2


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2


In [215]:
X_iris = df2.drop('target',axis=1)
y_iris = df2['target']

### 2.1. Exhaustive Feature Selection / Best subset selection:
Tries all subsets of the features and selects the best subset out of all possible combinations.

In [233]:
from mlxtend.feature_selection import ExhaustiveFeatureSelector
EFS = ExhaustiveFeatureSelector(estimator=RF,max_features=4,cv=6,min_features=2)

In [234]:
sel = EFS.fit(X_iris,y_iris)

Features: 11/11

In [235]:
sel.best_feature_names_

('petal length (cm)', 'petal width (cm)')

#### Note: The selector evaluates every subset of 2 to 4 features (11 in total) and keeps the best-scoring one
#### As per the exhaustive feature selection method, PETAL LENGTH and PETAL WIDTH form the best subset among all combinations

#### Demerits:
1. Computationally very expensive.
2. Risk of overfitting
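
The cost can be seen by counting the candidate subsets. The sketch below enumerates the subsets of sizes 2 to 4 for four features, which matches the `Features: 11/11` progress line above, and then counts the same size range for an 85-feature data set. The shortened feature names are placeholders.

```python
from itertools import combinations
from math import comb

features = ["sepal length", "sepal width", "petal length", "petal width"]

# Every subset with 2 to 4 features, mirroring min_features=2 and max_features=4
subsets = [s for k in range(2, 5) for s in combinations(features, k)]
print(len(subsets))

# The same count for an 85-feature data set, sizes 2 to 4 only
n_subsets_85 = sum(comb(85, k) for k in range(2, 5))
print(n_subsets_85)
```

With 85 features, even this restricted size range gives more than two million subsets, each of which would need its own cross-validated model fits.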

### 2.2. Sequential backward elimination:
Starts from the model with all features and eliminates one feature at a time until k features remain

In [236]:
from mlxtend.feature_selection import SequentialFeatureSelector
SBS = SequentialFeatureSelector(estimator=RF,k_features=2,forward=False,cv=3,n_jobs=-1)

In [237]:
sel = SBS.fit(X_iris,y_iris)

In [238]:
sel.k_feature_names_

('sepal width (cm)', 'petal length (cm)')

#### As per the sequential backward selection method, PETAL LENGTH and SEPAL WIDTH are the 2 best features compared with other combinations

### 2.3. Sequential forward selection:
Starts with no features and adds the best remaining feature one at a time until k features are selected

In [239]:
from mlxtend.feature_selection import SequentialFeatureSelector


In [240]:
SFS = SequentialFeatureSelector(estimator=RF,k_features=2,forward=True,cv=6,n_jobs=-1)

In [243]:
sel = SFS.fit(X_iris,y_iris)
sel

In [245]:
sel.k_feature_names_


('petal length (cm)', 'petal width (cm)')

#### As per the sequential forward selection method, PETAL LENGTH and PETAL WIDTH are the 2 best features compared with other combinations


#### Disadvantages:
Time complexity: even though it is much faster than exhaustive feature selection, it still takes a considerable amount of computation time
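
The greedy idea behind sequential forward selection can be sketched in a few lines. This is a simplified illustration, not mlxtend's implementation: `forward_select` and `toy_score` are hypothetical names, and the toy scoring function stands in for a cross-validated model score.

```python
def forward_select(candidates, score_fn, k):
    """Greedily add the feature that gives the best score until k are chosen."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        # Score each remaining feature together with those already selected
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score: each feature has a weight, and "a" and "b" are redundant together
weights = {"a": 0.5, "b": 0.4, "c": 0.3, "d": 0.1}

def toy_score(subset):
    penalty = 0.3 if {"a", "b"} <= set(subset) else 0.0
    return sum(weights[f] for f in subset) - penalty

chosen = forward_select(weights, toy_score, k=2)
print(chosen)
```

Selecting 2 of 4 features this way needs 4 + 3 = 7 evaluations, and the count grows roughly with the number of features times k rather than with the number of subsets. The trade-off is that a greedy search can miss the best overall subset.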

## 3. Embedded Methods:
Embedded methods aim to overcome the limitations of filter methods and wrapper methods.
Limitations of other methods:

1. Filter-based methods: neglect either feature-feature or feature-target interactions
2. Wrapper methods: computationally expensive


Embedded methods account for both kinds of interaction and are computationally efficient.

As the name suggests, these techniques are embedded within the model itself.
Here is the list of models that come under this category:

1. Models that involve feature importance calculations:

    a. Decision Tree
    
    b. Random Forest
    
    
2. Models that expose coef_:

    a. Linear regression
    
    b. Logistic regression
    
    c. Regularization models:
        i. Ridge
        ii. Lasso
        iii. Elastic Net
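
As one possible way to use such a model for selection, the sketch below fits a random forest once and keeps the features whose importance is at least the median, using scikit-learn's `SelectFromModel`. The synthetic data and the choice of the median threshold are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data standing in for a real feature matrix (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# Fit the forest once, then keep features whose importance is at least the median
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=30, random_state=0),
    threshold="median",
)
X_selected = selector.fit_transform(X_demo, y_demo)

print(X_demo.shape, X_selected.shape)
print(selector.get_support())
```

Because the importances come from a single model fit, this is typically much cheaper than a wrapper search over subsets.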

## 4. Hybrid methods:
A combination of any two of the above methods

### 4.1. Recursive Feature Elimination method:
Uses any model that calculates feature importance and eliminates the least important features recursively

In [246]:
from sklearn.feature_selection import RFE


#### Implementing the RFE method on the Financial Distress data frame. 

In [250]:
RFE_sel = RFE(estimator=RF,n_features_to_select=20,verbose=True)
sel = RFE_sel.fit(X,y)

Fitting estimator with 85 features.
Fitting estimator with 84 features.
Fitting estimator with 83 features.
Fitting estimator with 82 features.
Fitting estimator with 81 features.
Fitting estimator with 80 features.
Fitting estimator with 79 features.
Fitting estimator with 78 features.
Fitting estimator with 77 features.
Fitting estimator with 76 features.
Fitting estimator with 75 features.
Fitting estimator with 74 features.
Fitting estimator with 73 features.
Fitting estimator with 72 features.
Fitting estimator with 71 features.
Fitting estimator with 70 features.
Fitting estimator with 69 features.
Fitting estimator with 68 features.
Fitting estimator with 67 features.
Fitting estimator with 66 features.
Fitting estimator with 65 features.
Fitting estimator with 64 features.
Fitting estimator with 63 features.
Fitting estimator with 62 features.
Fitting estimator with 61 features.
Fitting estimator with 60 features.
Fitting estimator with 59 features.
Fitting estimator with 58 fe

In [254]:
sel.ranking_

array([27, 65, 49,  1,  1, 30,  1, 28, 56,  7,  1,  1, 44,  1,  1,  1, 37,
        8, 38,  3, 17, 26, 15, 29, 20,  9,  1, 19,  1, 16, 14, 48, 33, 36,
       35, 31, 40,  1,  6, 32, 11, 22, 41,  5, 18,  1, 21,  1,  1,  1,  1,
        4, 57, 10,  1,  1, 43, 24,  1, 25, 12, 61,  2, 45, 51, 42, 39, 47,
       46, 54, 63, 58, 59, 66, 62, 55, 34, 23, 64, 60, 52, 50,  1, 53, 13])
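
To read a `ranking_` array like the one above: rank 1 marks the selected features, and larger ranks mark features that were eliminated earlier. The sketch below shows this on synthetic data with a decision tree; the data, estimator, and variable names are assumptions for illustration and are not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=3)
rfe.fit(X_demo, y_demo)

# Rank 1 marks the selected features; larger ranks were eliminated earlier
selected_idx = np.flatnonzero(rfe.ranking_ == 1)
# Indices of the eliminated features, ordered from last-eliminated to first-eliminated
elimination_order = np.argsort(rfe.ranking_)[len(selected_idx):]

print(selected_idx)
print(elimination_order)
```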

In [261]:
X_hybrid = X[X.columns[sel.get_support()]]

In [263]:
print(cross_val_score(estimator=RF,X=X_hybrid,y=y,cv=6).mean())

0.9588779956427015


#### With the help of the hybrid selection technique, the 20 best features are selected without noticeably affecting the accuracy of the model.


### Advantages of feature selection:
##### 1. Reduced curse of dimensionality
##### 2. Improved performance
##### 3. Improved interpretability
##### 4. Reduced risk of overfitting