## Feature Selection


**Why Feature Selection**

Training a model can be slow when the dataset has a large number of features. This is where feature selection comes into play.

Feature selection also helps to avoid the curse of dimensionality.

**What is Feature Selection?**


Feature selection is the process of selecting the most relevant predictive features for building a machine learning model.

**Methods of Feature Selection**

There are 4 methods of feature selection:
1. Filter Method
2. Wrapper Method 
3. Embedded Method
4. Hybrid Method

# **Filter Method**

Filter methods are a type of feature selection technique in machine learning that are based on a statistical test or a score to evaluate the importance of each feature. These methods are independent of the learning algorithm used and can be applied to any type of model.

Filter methods typically work by ranking the features based on a specific criterion such as correlation with the target variable, or ANOVA F-value, and then selecting a subset of the top-ranked features.

**Univariate Selection**

Univariate feature selection is a type of filter method for feature selection in which the importance of each feature is evaluated independently with respect to the target variable.

Some examples of filter methods are:
1. Variance Threshold
2. Chi2
3. ANOVA Test


## **1. Variance Threshold**

If the variance is low or close to zero, then a feature is approximately constant and will not improve the performance of the model. In that case, it should be removed.

Variance will also be very low for a feature if only a handful of observations of that feature differ from a constant value.

In [1]:
# loading the csv into a dataframe named "df" and printing the head values
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/jainrachit108/datasets/main/diabetes_cleaned.csv')
# print head
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,72.0,35.0,218.93776,33.6,0.627,50.0,1
1,1.0,85.0,66.0,29.0,70.189298,26.6,0.351,31.0,0
2,8.0,183.0,64.0,29.0,269.968908,23.3,0.672,32.0,1
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0,0
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0,1


In [2]:
# separating the features and the target as X and y
X = df.drop('Outcome', axis=1)
y = df['Outcome']

In [3]:
# Computing the variance of each column of X (axis=0 is the default).
X.var()

Pregnancies                   11.354056
Glucose                      929.680350
BloodPressure                146.321591
SkinThickness                 77.285567
Insulin                     9484.259268
BMI                           48.813618
DiabetesPedigreeFunction       0.109779
Age                          138.303046
dtype: float64

Since the features are on different scales, we should bring them to the same scale using feature scaling techniques before comparing their variances.
The most common techniques are:
1. MinMaxScaler
2. StandardScaler
3. RobustScaler
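
To make the difference between the three scalers concrete, here is a minimal sketch on a made-up toy array (the values and the name `X_toy` are assumptions for illustration, not part of the diabetes data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Toy data (made up for illustration): two features on very different scales,
# with one outlier in the second column.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0],
                  [5.0, 5000.0]])

# MinMaxScaler maps each column to the range [0, 1] by default.
X_minmax = MinMaxScaler().fit_transform(X_toy)

# StandardScaler centers each column to mean 0 and scales it to unit variance.
X_standard = StandardScaler().fit_transform(X_toy)

# RobustScaler centers on the median and scales by the interquartile range,
# so it is less affected by the outlier.
X_robust = RobustScaler().fit_transform(X_toy)

print(X_minmax.round(3))
print(X_standard.round(3))
print(X_robust.round(3))
```

Note that StandardScaler forces every feature to unit variance, so a variance threshold applied afterwards could not tell features apart. Min-max scaling keeps the relative spread inside a common range, which may be why it is the scaler used in this notebook.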


In [4]:
# Performing feature scaling: map every feature to the range [0, 10]
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 10))
scaler.fit(X)
X_scaled = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

In [5]:
# Performing variance threshold: keep only features whose variance is above 1
from sklearn.feature_selection import VarianceThreshold
select_features = VarianceThreshold(threshold=1)
X_variance_threshold_df = select_features.fit_transform(X_scaled)
X_variance_threshold_df = pd.DataFrame(X_variance_threshold_df)


In [6]:
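def get_selected_features(raw_df, processed_df):
  # Recover the names of the selected columns by matching each column of
  # processed_df against the column of raw_df that holds identical values.
  selected_feature = []
  for i in range(len(processed_df.columns)):
    for j in range(len(raw_df.columns)):
      if processed_df.iloc[:, i].equals(raw_df.iloc[:, j]):
        selected_feature.append(raw_df.columns[j])
  return selected_feature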

In [7]:
selected_features = get_selected_features(X_scaled,X_variance_threshold_df)

In [8]:
selected_features

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']

In [9]:
X_variance_threshold_df.columns = selected_features

In [10]:
X_variance_threshold_df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.529412,6.709677,4.897959,2.740295,3.149284,2.344150,4.833333
1,0.588235,2.645161,4.285714,1.018185,1.717791,1.165670,1.666667
2,4.705882,8.967742,4.081633,3.331099,1.042945,2.536294,1.833333
3,0.588235,2.903226,4.285714,1.293850,2.024540,0.380017,0.000000
4,0.000000,6.000000,1.632653,2.150572,5.092025,9.436379,2.000000
...,...,...,...,...,...,...,...
763,5.882353,3.677419,5.306122,2.289500,3.006135,0.397096,7.000000
764,1.176471,5.032258,4.693878,2.044244,3.803681,1.118702,1.000000
765,2.941176,4.967742,4.897959,1.502241,1.635992,0.713066,1.500000
766,0.588235,5.290323,3.673469,2.217956,2.433538,1.157131,4.333333
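
The helper above recovers column names by comparing values. A shorter route, sketched here on a made-up toy frame (the column names and values are assumptions for illustration), is to read the kept columns directly from the selector's `get_support()` mask:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame (hypothetical values): 'constant' never changes, 'almost_constant'
# differs in a single row, 'varied' changes throughout.
toy = pd.DataFrame({
    'constant':        [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    'almost_constant': [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    'varied':          [0.0, 2.0, 4.0, 6.0, 8.0, 10.0],
})

selector = VarianceThreshold(threshold=0.2)
selector.fit(toy)

# get_support() returns a boolean mask aligned with the input columns,
# so the kept column names can be read off directly.
kept = toy.columns[selector.get_support()]
toy_selected = toy[kept]

print(selector.variances_)
print(list(kept))
```

Both the constant column and the column that differs in only one row fall below the threshold, so only `varied` is kept.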


## **2. Chi-squared statistical analysis**

The chi-square test is a statistical test used to determine if there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.

The null hypothesis of the chi-square test is that there is no significant difference between the expected and observed frequencies. The test statistic is calculated as the sum of the squared differences between the observed and expected frequencies, divided by the expected frequencies. The p-value is used to determine the significance of the test statistic. A small p-value (typically less than 0.05) indicates strong evidence against the null hypothesis and in favor of the alternative hypothesis.
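
The calculation described above can be sketched on a small contingency table. The counts below are hypothetical, and SciPy's `chi2_contingency` is used only as a cross-check of the manual computation:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of observed counts:
# rows = feature category, columns = outcome (0 / 1).
observed = np.array([[30, 10],
                     [20, 40]])

# Expected counts under independence: (row total * column total) / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Test statistic: sum of (observed - expected)^2 / expected over all cells.
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Cross-check with SciPy (correction=False disables Yates' continuity correction).
chi2_scipy, p_value, dof, _ = chi2_contingency(observed, correction=False)

print(chi2_manual, chi2_scipy, p_value, dof)
```

For this table the statistic is 50/3 (about 16.67) with one degree of freedom, and the p-value is far below 0.05, so the null hypothesis of no association would be rejected.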

In [11]:
# creating a data frame named diabetes and load the csv file again
diabetes = pd.read_csv('https://raw.githubusercontent.com/jainrachit108/datasets/main/diabetes.csv')

In [12]:
# assigning the features to X and 'Outcome' to Y from the dataframe diabetes
X = diabetes.drop('Outcome',axis=1)
Y = diabetes['Outcome']

In [13]:
#import chi2 and SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

In [14]:
X = X.astype(np.float64)

In [15]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    float64
 1   Glucose                   768 non-null    float64
 2   BloodPressure             768 non-null    float64
 3   SkinThickness             768 non-null    float64
 4   Insulin                   768 non-null    float64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    float64
dtypes: float64(8)
memory usage: 48.1 KB


In [16]:
# Initialising SelectKBest with the chi2 score function, keeping the top k=4 features
chi2_test = SelectKBest(score_func=chi2, k=4)

# fitting it with X and Y
chi2_model = chi2_test.fit(X,Y)

In [17]:
#Printing chi scores
chi2_model.scores_

array([ 111.51969064, 1411.88704064,   17.60537322,   53.10803984,
       2175.56527292,  127.66934333,    5.39268155,  181.30368904])

In [18]:
X_chi2_df = pd.concat([pd.DataFrame(X.columns) , pd.DataFrame(chi2_model.scores_)], axis =1)

In [19]:
X_chi2_df.columns = ['Features' , 'Scores']

In [20]:
X_chi2_df.sort_values('Scores', ascending = False)

Unnamed: 0,Features,Scores
4,Insulin,2175.565273
1,Glucose,1411.887041
7,Age,181.303689
5,BMI,127.669343
0,Pregnancies,111.519691
3,SkinThickness,53.10804
2,BloodPressure,17.605373
6,DiabetesPedigreeFunction,5.392682


The higher the score, the more important the feature.

Method 2:

Use get_support(), which returns a boolean mask marking the k selected features.

In [21]:
chi2_model.get_support()

array([False,  True, False, False,  True,  True, False,  True])

In [22]:
X.columns[chi2_model.get_support()]

Index(['Glucose', 'Insulin', 'BMI', 'Age'], dtype='object')

To drop the lower-scoring features:

In [23]:
columns_to_drop = X.columns[~chi2_model.get_support()]
# drop() returns a new dataframe; X itself is left unchanged
X.drop(columns_to_drop, axis=1)

Unnamed: 0,Glucose,Insulin,BMI,Age
0,148.0,0.0,33.6,50.0
1,85.0,0.0,26.6,31.0
2,183.0,0.0,23.3,32.0
3,89.0,94.0,28.1,21.0
4,137.0,168.0,43.1,33.0
...,...,...,...,...
763,101.0,180.0,32.9,63.0
764,122.0,0.0,36.8,27.0
765,121.0,112.0,26.2,30.0
766,126.0,0.0,30.1,47.0


## **3. ANOVA (Analysis of Variance) Test**

ANOVA is a statistical method used to determine whether there is a significant difference between the means of two or more groups.

The test produces an F-statistic and a p-value, which can be used to decide whether the null hypothesis (that the group means are equal) can be rejected.
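
As a small illustration of the F-statistic, the sketch below runs a one-way ANOVA on made-up data (one continuous feature and a binary label, both assumptions for illustration) and checks that scikit-learn's `f_classif` gives the same value as SciPy's `f_oneway`:

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

# Hypothetical data: one continuous feature and a binary class label.
feature = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# One-way ANOVA comparing the feature's mean across the two classes.
f_scipy, p_scipy = f_oneway(feature[labels == 0], feature[labels == 1])

# f_classif runs the same test for every column of X against the labels.
f_sklearn, p_sklearn = f_classif(feature.reshape(-1, 1), labels)

print(f_scipy, p_scipy)
print(f_sklearn[0], p_sklearn[0])
```

Here the between-group sum of squares is 50 (1 degree of freedom) and the within-group sum of squares is 10 (6 degrees of freedom), so F = 50 / (10 / 6) = 30.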

In [24]:
# importing f_classif and SelectPercentile from sklearn
from sklearn.feature_selection import f_classif, SelectPercentile
# keep the top 80% of features ranked by their ANOVA F-score
anova_test = SelectPercentile(f_classif, percentile=80)
anova_model = anova_test.fit(X, Y)
anova_model.scores_

array([ 39.67022739, 213.16175218,   3.2569504 ,   4.30438091,
        13.28110753,  71.7720721 ,  23.8713002 ,  46.14061124])

In [25]:
# def generate_feature_score_df(X , scores):
#   feature_score = pd.DataFrame()
#   for i in range(X.shape[1]):
#     feature_score = feature_score.append({'Feature': X.columns[i] , 'Scores':scores[i] }, ignore_index = True)
#   return feature_score
feature_scores_df = pd.concat([pd.DataFrame(X.columns) , pd.DataFrame(anova_model.scores_)], axis =1)
feature_scores_df.columns =['Features' , 'Score']
feature_scores_df.sort_values(by= 'Score',ascending = False)

Unnamed: 0,Features,Score
1,Glucose,213.161752
5,BMI,71.772072
7,Age,46.140611
0,Pregnancies,39.670227
6,DiabetesPedigreeFunction,23.8713
4,Insulin,13.281108
3,SkinThickness,4.304381
2,BloodPressure,3.25695


In [26]:
# feature_scores_df = generate_feature_score_df(X ,anova_model.scores_ )

In [27]:
X_new = anova_model.transform(X)
X_new = pd.DataFrame(X_new)
selected_features = get_selected_features(X , X_new)
X_new.columns = selected_features

In [28]:
X_new

Unnamed: 0,Pregnancies,Glucose,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6.0,148.0,0.0,33.6,0.627,50.0
1,1.0,85.0,0.0,26.6,0.351,31.0
2,8.0,183.0,0.0,23.3,0.672,32.0
3,1.0,89.0,94.0,28.1,0.167,21.0
4,0.0,137.0,168.0,43.1,2.288,33.0
...,...,...,...,...,...,...
763,10.0,101.0,180.0,32.9,0.171,63.0
764,2.0,122.0,0.0,36.8,0.340,27.0
765,5.0,121.0,112.0,26.2,0.245,30.0
766,1.0,126.0,0.0,30.1,0.349,47.0


Summary:

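Use the chi-squared test when the features are categorical (or counts) and you want to compare observed frequencies with expected frequencies. Note that scikit-learn's chi2 scorer requires all feature values to be non-negative. Use ANOVA when the features are continuous and you want to compare their means across two or more groups, such as the classes of the target.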

# Wrapper Methods

Wrapper methods are a type of feature selection technique in which a subset of features is selected by repeatedly evaluating the performance of a model with different subsets of features. The goal is to select a subset of features that results in the best performance of the model. Wrapper methods can be computationally expensive, as they require training and evaluating the model multiple times for different subsets of features. However, they can be useful when the relationship between features and target variable is complex and not well understood. Some examples of wrapper methods are forward selection, backward elimination, and recursive feature elimination.
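
Forward selection, one of the wrapper methods mentioned above, can be sketched with scikit-learn's `SequentialFeatureSelector`. The synthetic dataset and the choice of logistic regression as the wrapped model are assumptions made for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 8 features, only 3 of which are informative.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, n_informative=3,
                                   n_redundant=0, random_state=0)

# Forward selection: start from no features and add, one at a time, the feature
# that most improves the cross-validated score of the model.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=3,
                                direction='forward',
                                cv=5)
sfs.fit(X_syn, y_syn)

print(sfs.get_support())
X_selected = sfs.transform(X_syn)
print(X_selected.shape)
```

Setting `direction='backward'` instead starts from all features and removes one at a time, which corresponds to backward elimination.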

# **Recursive Feature Elimination**

Recursive Feature Elimination (RFE) selects features by recursively considering smaller and smaller subsets of features, pruning the least important feature at each step. Models are fitted iteratively: in each iteration the least important feature is identified and removed, and the process continues until the desired number of features remains.

In the worst case, if a dataset contains N features, RFE will do a greedy search over N^2 combinations of features.
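
The elimination loop can be written out by hand to show what happens at each step. This is only a sketch of the idea, not scikit-learn's implementation; the synthetic data, the use of absolute coefficients as the importance measure, and the names `remaining` and `n_to_keep` are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 6 features, 2 informative.
X_syn, y_syn = make_classification(n_samples=300, n_features=6, n_informative=2,
                                   n_redundant=0, random_state=0)
X_syn = StandardScaler().fit_transform(X_syn)  # put coefficients on a comparable scale

remaining = list(range(X_syn.shape[1]))  # indices of features still in play
n_to_keep = 2
n_fits = 0

while len(remaining) > n_to_keep:
    model = LogisticRegression(max_iter=1000).fit(X_syn[:, remaining], y_syn)
    n_fits += 1
    # Importance here is the absolute coefficient; drop the weakest feature.
    weakest = int(np.argmin(np.abs(model.coef_[0])))
    remaining.pop(weakest)

print("kept feature indices:", remaining)
print("models fitted:", n_fits)
```

With one feature removed per iteration, going from 6 features down to 2 takes 4 model fits in this sketch.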

In [29]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [30]:
model = LogisticRegression(solver = 'liblinear')
rfe = RFE(estimator=model,n_features_to_select=4)

In [31]:
fit = rfe.fit(X ,Y)

In [32]:
X.columns[rfe.get_support()]

Index(['Pregnancies', 'Glucose', 'BMI', 'DiabetesPedigreeFunction'], dtype='object')

# Embedded Method

An embedded method for feature selection is a method where the feature selection is included as part of the model itself, rather than being performed as a preprocessing step before training.

In other words, feature selection is integrated into the model training process. The feature selection and model training are performed jointly, and the model's parameters are used to determine which features are most important. Examples of embedded methods for feature selection include:

1. Lasso regularization
2. Ridge regression

Embedded methods are considered to be more powerful than filter methods, which select features independently of the learning algorithm, and more computationally efficient than wrapper methods, which evaluate the performance of a model with different subsets of features.

# **Lasso Regression(L1 Regularization)**

### Watch this video to clear the concepts
https://www.youtube.com/watch?v=NGf0voTMlcs

Lasso (Least Absolute Shrinkage and Selection Operator) is a type of regularization method for linear regression models that can be used to address the problem of overfitting. It is particularly useful when the number of predictors in a model is greater than the number of observations, or when there is a large number of correlated predictors. Lasso can be used to identify the most important predictors in a model by shrinking the coefficients of less important predictors to zero. It is also used in feature selection, which is the process of selecting a subset of relevant features for use in building a model.

Lasso regression is a type of linear regression that uses shrinkage, a technique that reduces the magnitude of the coefficients of the independent variables, in order to prevent overfitting and improve the interpretability of the model. It is a regularization method that adds a penalty term, known as the L1 penalty, to the least squares objective function. The L1 penalty forces the coefficients of some of the variables to be exactly equal to zero, which can be used for feature selection. 

Since Lasso regression can exclude useless variables from equations, it is a little better than Ridge regression at reducing the variance in models that contain a lot of useless variables.

In contrast, Ridge regression tends to do a little better when most variables are useful.

It is worth noting that Lasso shrinks the coefficients of some features to exactly zero only when alpha is large enough; with a small alpha, most or all coefficients can stay non-zero. It is also not guaranteed that the features selected by Lasso will be the optimal subset for the task at hand, and some non-zero coefficients may be so small that they are effectively zero. It is always a good idea to check the performance of the model with different values of alpha and select the best one.
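
The effect of alpha can be seen by counting the surviving coefficients for a few penalty strengths. The synthetic data and the particular alpha values below are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 10 features, only 3 truly drive the target.
X_syn, y_syn = make_regression(n_samples=200, n_features=10, n_informative=3,
                               noise=10.0, random_state=0)
X_syn = StandardScaler().fit_transform(X_syn)

# Count how many coefficients survive as the L1 penalty grows.
nonzero_counts = {}
for alpha in [0.01, 1.0, 10.0, 1000.0]:
    lasso_sketch = Lasso(alpha=alpha, max_iter=10000).fit(X_syn, y_syn)
    nonzero_counts[alpha] = int(np.sum(lasso_sketch.coef_ != 0))

print(nonzero_counts)
```

A very large alpha removes every feature, while a very small alpha keeps nearly all of them, so the useful range lies in between. In practice a cross-validated search such as `LassoCV` from `sklearn.linear_model` is a common way to pick it.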

In [33]:
from sklearn.linear_model import Lasso

# Load the Boston housing dataset from its original source.
# sklearn.datasets.load_boston was removed in scikit-learn 1.2.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
X = pd.DataFrame(data, columns=feature_names)
y = raw_df.values[1::2, 2]

Note: the Boston housing prices dataset has an ethical problem, and the scikit-learn maintainers strongly discourage its use unless the purpose is to study and educate about ethical issues in data science and machine learning. Alternative datasets include the California housing dataset (sklearn.datasets.fetch_california_housing) and the Ames housing dataset.

In [34]:
# Create a Lasso model with an alpha value of 0.1
lasso = Lasso(alpha=0.1)

# Fit the model to the data
lasso.fit(X, y)

# Print all coefficients; a value of 0 means Lasso dropped that feature
print(lasso.coef_)


[-0.09789363  0.04921111 -0.03661906  0.95519003 -0.          3.70320175
 -0.01003698 -1.16053834  0.27470721 -0.01457017 -0.77065434  0.01024917
 -0.56876914]


# Ridge Regression

Ridge Regression is a type of linear regression that uses shrinkage, a technique that reduces the magnitude of the coefficients of the independent variables, in order to prevent overfitting and improve the interpretability of the model. It is a regularization method that adds a penalty term, known as the L2 penalty or Ridge penalty, to the least squares objective function.


Watch this video:
https://www.youtube.com/watch?v=Q81RR3yKn30

Ridge Regression is particularly useful in situations where the number of predictors in the model is large, and some of the predictors are highly correlated. In these cases, Ridge Regression can help to reduce the variance of the model and improve its generalization performance by shrinking the coefficients of correlated predictors towards each other.

Ridge regression is also useful when you have many correlated features and do not want to remove any of them. Ridge regression does not remove any feature; instead it shrinks the magnitude of the coefficients, which makes the estimates more stable when features are correlated.
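
A quick way to see this shrinkage is to compare Ridge coefficients with ordinary least squares on the same data. The synthetic dataset and the alpha value below are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (assumption): 5 features, all of them informative.
X_syn, y_syn = make_regression(n_samples=50, n_features=5, noise=5.0, random_state=0)

ols = LinearRegression().fit(X_syn, y_syn)
ridge_strong = Ridge(alpha=100.0).fit(X_syn, y_syn)

# The L2 penalty pulls the coefficient vector towards zero...
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge_strong.coef_)

# ...but it does not set individual coefficients exactly to zero.
n_zero = int(np.sum(ridge_strong.coef_ == 0))

print("OLS coefficient norm:  ", ols_norm)
print("Ridge coefficient norm:", ridge_norm)
print("Ridge coefficients equal to zero:", n_zero)
```

This is the practical difference from Lasso: Ridge keeps every feature in the model with a smaller weight, so it regularizes without performing feature selection on its own.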

In [35]:
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=10, random_state=42)

# Create an instance of the Ridge model
ridge = Ridge(alpha=0.1)

# Fit the model to the data
ridge.fit(X, y)

# Predict values for the input data
y_pred = ridge.predict(X)


Ridge regression can also be combined with cross-validation to choose the best alpha, using the RidgeCV class from sklearn.linear_model.

In [36]:
from sklearn.linear_model import RidgeCV
ridge = RidgeCV(alphas=np.logspace(-6, 6, 13))
ridge.fit(X, y)
y_pred = ridge.predict(X)

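One way to evaluate the model is to use its score method, which returns the coefficient of determination (R^2) of the prediction. R^2 measures how well the model fits the data: a value of 1 indicates a perfect fit, a value of 0 means the model does no better than always predicting the mean of y, and negative values are possible for very poor fits. Here the score is computed on the same data the model was trained on, and make_regression generates noise-free data by default, so the value is expected to be very close to 1.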
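
For reference, R^2 can be computed by hand as 1 - SS_res / SS_tot. The sketch below uses made-up target and prediction values (assumptions for illustration) and cross-checks against `sklearn.metrics.r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and two sets of predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_good = np.array([2.5, 5.5, 7.0, 9.5])   # close to the truth
y_bad = np.array([9.0, 7.0, 5.0, 3.0])    # reversed, worse than predicting the mean

def r2_manual(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
    return 1 - ss_res / ss_tot

r2_good = r2_manual(y_true, y_good)
r2_bad = r2_manual(y_true, y_bad)

print(r2_good, r2_score(y_true, y_good))
print(r2_bad, r2_score(y_true, y_bad))
```

The good predictions give R^2 = 0.9625, while the reversed predictions give R^2 = -3, which shows that the score is not bounded below by 0.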

In [37]:
# Evaluate the model using R^2
r2 = ridge.score(X, y)
print("R^2:", r2)


R^2: 0.9999999999999899


Another way to evaluate the performance of the Ridge Regression model is to use metrics such as mean squared error (MSE) or mean absolute error (MAE) to compare the predicted values to the true values.

In [38]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
y_pred = ridge.predict(X)
mse = mean_squared_error(y, y_pred)
mae = mean_absolute_error(y, y_pred)
print("Mean Squared Error:", mse)
print("Mean Absolute Error:", mae)


Mean Squared Error: 3.6924105100206793e-10
Mean Absolute Error: 1.5134725321566122e-05


Another way to evaluate the model is to use cross validation techniques like k-fold cross validation. This approach can help to give a more robust estimate of the model's performance by training and evaluating the model on different subsets of the data.

In [39]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(ridge, X, y, cv=5)
print("Cross-validation scores:", scores)
print("Mean cross-validation score:", np.mean(scores))


Cross-validation scores: [1. 1. 1. 1. 1.]
Mean cross-validation score: 0.9999999999999943


Ridge and Lasso regression are used with linear models.

The most commonly used linear models are:
1. Linear regression: y = b0 + b1*x
2. Multiple linear regression: y = b0 + b1*x1 + b2*x2 + ... + bn*xn
3. Logistic regression

# **Hybrid Method**

Hybrid methods are a combination of different feature selection techniques, such as filter methods, wrapper methods, and embedded methods. They are used to combine the strengths of multiple feature selection techniques and overcome the limitations of individual methods.

Some popular examples of hybrid methods include:

Genetic Algorithms: This method uses genetic algorithms to search for the best subset of features. It combines the advantages of filter and wrapper methods by using a filter method to evaluate the importance of each feature and then using a wrapper method to evaluate the performance of different subsets of features.

Particle Swarm Optimization (PSO): This method is a stochastic optimization technique, inspired by the behavior of a swarm of birds or a school of fish, that can be used to find a good subset of features. It combines the advantages of filter and wrapper methods by using a filter method to evaluate the importance of each feature and then using a wrapper method to evaluate the performance of different subsets of features.

Artificial Neural Networks (ANN): This method uses a neural network to learn the relationship between the input features and the output variable. It combines the advantages of filter and embedded methods by using a filter method to evaluate the importance of each feature and then using an embedded method to learn the relationship between the input features and the output variable.

Hybrid feature selection using mutual information and genetic algorithm: This method uses mutual information to evaluate the importance of each feature and then uses a genetic algorithm to search for the best subset of features.

Hybrid feature selection using Random Forest and Recursive Feature Elimination: This method uses feature importance calculation by Random Forest algorithm to evaluate the importance of each feature and then uses Recursive Feature Elimination (RFE) to select the best subset of features.

Hybrid methods can be more effective than individual methods because they can take into account multiple criteria for feature selection. However, they also can be more computationally expensive, so it's important to carefully consider the trade-off between performance and computational cost when choosing a feature selection method.
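
One possible hybrid setup chains a filter stage and a wrapper stage in a scikit-learn `Pipeline`. The sketch below combines two of the ideas listed above (mutual information as the filter, then RFE driven by random forest importances); the synthetic data and the numbers of features kept at each stage are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline

# Synthetic data (assumption): 20 features, 4 informative and 2 redundant.
X_syn, y_syn = make_classification(n_samples=300, n_features=20, n_informative=4,
                                   n_redundant=2, random_state=0)

hybrid = Pipeline([
    # Filter stage: keep the 10 features with the highest mutual information.
    ('filter', SelectKBest(score_func=mutual_info_classif, k=10)),
    # Wrapper stage: RFE driven by random forest feature importances.
    ('wrapper', RFE(RandomForestClassifier(n_estimators=50, random_state=0),
                    n_features_to_select=4)),
])
X_hybrid = hybrid.fit_transform(X_syn, y_syn)

# Map the final selection back to column indices of the original data.
filter_idx = hybrid.named_steps['filter'].get_support(indices=True)
final_idx = filter_idx[hybrid.named_steps['wrapper'].get_support()]

print(X_hybrid.shape)
print(final_idx)
```

The cheap filter stage halves the search space before the more expensive wrapper stage runs, which is the trade-off between performance and computational cost described above.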

# **Multicollinearity**

Multicollinearity is a statistical phenomenon that occurs when two or more independent variables in a multiple regression model are highly correlated with each other. This can cause problems in interpreting the individual effects of the variables on the dependent variable, as well as in estimating the coefficients of the model.

VIF (Variance Inflation Factor) is one technique that can be used to detect and quantify multicollinearity in a multiple regression model.

Removing highly correlated variables or combining them into a single variable is one way to handle multicollinearity. Other ways include using regularization techniques such as Ridge or Lasso.

# **Handling Multicollinearity with VIF**

Variance Inflation Factor (VIF) is used to determine whether the independent variables in a model are correlated with each other. A VIF score is calculated for each independent variable, and a high VIF score indicates that the variable is highly correlated with one or more other variables.

VIF is defined as the ratio of the variance of an estimated regression coefficient when all other independent variables are included in the model to the variance of that coefficient when the variable is fit alone.

A VIF value of 1 means that a variable is not correlated with the other independent variables, while a VIF value greater than 1 indicates some correlation with them. Typically, a VIF value greater than 5 or 10 is considered to indicate high correlation, and the variable is then a candidate to be removed from the model or combined with another variable. Removing correlated variables can improve the model's performance and interpretability.
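
An equivalent and commonly used formulation is VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on all the other features. The sketch below computes it by hand on synthetic data; the data and the helper name `vif` are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): x3 is nearly a copy of x1, x2 is independent.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)
X_syn = np.column_stack([x1, x2, x3])

def vif(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns."""
    others = np.delete(X, i, axis=1)
    r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    return 1.0 / (1.0 - r2)

vifs = [vif(X_syn, i) for i in range(X_syn.shape[1])]
print(vifs)
```

The two nearly identical columns get a very large VIF, while the independent column stays close to 1. This sketch fits an intercept in each auxiliary regression; statsmodels' `variance_inflation_factor` works on the design matrix exactly as passed, so if no constant column is included the values it reports can be larger than these.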

In [40]:
dia_df = diabetes = pd.read_csv('https://raw.githubusercontent.com/jainrachit108/datasets/main/diabetes_cleaned.csv')

In [41]:
from sklearn.model_selection import train_test_split
X = dia_df.iloc[:, :-1]   # all columns except the last are features
Y = dia_df.iloc[:, -1]    # the last column ('Outcome') is the target
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

In [42]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# compute one VIF per feature column of X
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['Features'] = X.columns
vif

Unnamed: 0,VIF Factor,Features
0,3.275956,Pregnancies
1,30.21158,Glucose
2,33.363489,BloodPressure
3,17.10194,SkinThickness
4,6.476848,Insulin
5,32.175164,BMI
6,3.140762,DiabetesPedigreeFunction
7,14.375335,Age
