# Feature Selection

Feature engineering is one of the most important tasks in any machine learning project. Feature selection is one of its subtasks: choosing the features that are most informative about the target variable. This project uses several feature selection methods, including a correlation filter, SelectKBest with the chi-square test, and model-based selection with models such as logistic regression and random forest.
Feature selection can be:
1. Filter-based
2. Wrapper-based
3. Embedded methods
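
To make the three families concrete, here is a minimal sketch on synthetic data. The dataset, the variable names (`X_demo`, `y_demo`) and the choice of `RFE` and a random forest as representatives are assumptions for illustration only, not the setup used in the rest of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical; not the attrition dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# Filter: score every feature on its own, without training a model
filter_sel = SelectKBest(f_classif, k=4).fit(X_demo, y_demo)

# Wrapper: refit a model repeatedly, dropping the weakest feature each round
wrapper_sel = RFE(
    LogisticRegression(max_iter=1000), n_features_to_select=4
).fit(X_demo, y_demo)

# Embedded: selection comes from the fitted model's own feature importances
embedded_sel = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0)
).fit(X_demo, y_demo)

for name, sel in [("filter", filter_sel), ("wrapper", wrapper_sel), ("embedded", embedded_sel)]:
    print(name, sel.get_support().sum(), "features kept")
```

All three expose `get_support()`, so their outputs can be compared as boolean masks over the same columns.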

In [1]:
# Load the packages
import warnings
warnings.filterwarnings("ignore")
import json
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, RFECV, SequentialFeatureSelector, SelectFromModel
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [2]:
# Load the data
df = pd.read_csv('./../../../data/train/train.csv')

Let us assume that, out of the 53 columns available, we want to keep around 30.

In [8]:
# Declare the number of features required
n_feat = 30

In [11]:
# Separate out the data into features and target variable
y = df['Attrition']
X = df.drop('Attrition', axis=1)

# Filter-based methods

### High Correlation Filter

#### Correlation

We start with the first filter-based method, which relies on correlation: for every pair of columns whose absolute correlation coefficient is greater than 0.75, we drop one of the two.

In [3]:
# Create correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))

# Find features with correlation greater than 0.75
to_drop = [column for column in upper.columns if any(upper[column] > 0.75)]

print(f"Columns to be dropped: {to_drop}")

Columns to be dropped: ['TotalWorkingYears', 'YearsInCurrentRole', 'YearsWithCurrManager', 'JobLevel', 'PerformanceRating', 'Department Sales', 'Gender Male', 'JobRole Human Resources']


In [7]:
# Create a resulting dataframe for the output (True = column is kept)
columns = df.columns
result = [column not in to_drop for column in columns]
correlation = pd.DataFrame({'Correlation': result}, index=columns)
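
The upper-triangle trick is easier to follow on a toy example. The sketch below uses a made-up three-column frame (`toy`, with hypothetical columns `a`, `a_copy` and `b`) and the same 0.75 threshold; it is an illustration of the idea, not a rerun on the attrition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),  # near-duplicate of "a"
    "b": rng.normal(size=200),                       # unrelated to "a"
})

toy_corr = toy.corr().abs()

# Keep only entries strictly above the diagonal, so each pair appears once
mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
toy_upper = toy_corr.where(mask)

# A column is flagged if it correlates strongly with any EARLIER column
toy_to_drop = [col for col in toy_upper.columns if (toy_upper[col] > 0.75).any()]
print(toy_to_drop)
```

Because only the upper triangle is inspected, the column that appears later in the frame is the one flagged from each correlated pair, so the result depends on column order.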

### Univariate Selection Methods

#### SelectKBest using f_classif

In [12]:
# Declare the transformer (f_classif is the default score function of SelectKBest)
f_class_selector = SelectKBest(f_classif, k=n_feat)
f_class_selector.fit(X, y)

SelectKBest(k=30)

In [15]:
# Get the features from the support
f_class_support = f_class_selector.get_support()
f_class_features = X.loc[:, f_class_support].columns.tolist()
print(f"Features selected with f_classif: {f_class_features}")

Features selected with f_classif: ['Age', 'DailyRate', 'DistanceFromHome', 'MonthlyIncome', 'TotalWorkingYears', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsWithCurrManager', 'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'OverTime', 'StockOptionLevel', 'BusinessTravel Non-Travel', 'BusinessTravel Travel_Rarely', 'Department Human Resources', 'Department Research & Development', 'EducationField Life Sciences', 'EducationField Medical', 'EducationField Other', 'Gender Female', 'JobRole Healthcare Representative', 'JobRole Manager', 'JobRole Manufacturing Director', 'JobRole Research Director', 'JobRole Sales Representative', 'MaritalStatus Divorced', 'MaritalStatus Married', 'MaritalStatus Single']


In [17]:
# Store the result in dataframe
columns = df.columns
result = [column in f_class_features for column in columns]
f_class_df = pd.DataFrame({'F classif': result}, index=columns)
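
`SelectKBest` keeps the per-feature statistics it ranked on, which helps when judging how close the cut-off was. The following is a small sketch on synthetic data; the frame `X_demo` and its column names `f0` to `f7` are hypothetical stand-ins.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data with named columns (hypothetical)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

selector = SelectKBest(f_classif, k=3).fit(X_demo, y_demo)

# scores_ holds the ANOVA F-statistic per feature, pvalues_ the matching p-value
scores = pd.DataFrame(
    {
        "F": selector.scores_,
        "p_value": selector.pvalues_,
        "selected": selector.get_support(),
    },
    index=X_demo.columns,
).sort_values("F", ascending=False)
print(scores)
```

The selected features are exactly the `k` rows with the highest F-statistic, so the table shows at a glance whether the features just below the cut-off scored almost as well.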

#### SelectKBest using chi square

In [18]:
# Declare the transformer (named chi2_selector so the imported chi2 function is not shadowed)
chi2_selector = SelectKBest(chi2, k=n_feat)
chi2_selector.fit(X, y)

SelectKBest(k=30, score_func=<function chi2 at 0x1466c5a60>)

In [20]:
# Get the features from the support
chi2_support = chi2_selector.get_support()
chi2_features = X.loc[:, chi2_support].columns.tolist()
print(f"Features selected with chi2: {chi2_features}")

Features selected with chi2: ['Age', 'MonthlyIncome', 'TotalWorkingYears', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsWithCurrManager', 'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'OverTime', 'StockOptionLevel', 'BusinessTravel Non-Travel', 'BusinessTravel Travel_Rarely', 'Department Human Resources', 'Department Research & Development', 'EducationField Life Sciences', 'EducationField Medical', 'EducationField Other', 'EducationField Technical Degree', 'Gender Female', 'JobRole Healthcare Representative', 'JobRole Human Resources', 'JobRole Manager', 'JobRole Manufacturing Director', 'JobRole Research Director', 'JobRole Sales Representative', 'MaritalStatus Divorced', 'MaritalStatus Married', 'MaritalStatus Single']


In [21]:
# Store the result in its own dataframe so the f_classif result is not overwritten
columns = df.columns
result = [column in chi2_features for column in columns]
chi2_df = pd.DataFrame({'Chi2': result}, index=columns)
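
One caveat with the chi-square score: scikit-learn's `chi2` accepts only non-negative feature values. The fit above succeeded, so the training data presumably already satisfies this. The sketch below uses made-up data (columns `signal` and `noise` are hypothetical) to show the failure and one possible workaround, min-max scaling, which is an assumption here and not necessarily how this project's data was prepared.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = pd.DataFrame({
    "signal": y_demo + rng.normal(scale=0.3, size=200),  # tracks the target, has negatives
    "noise": rng.normal(size=200),                       # unrelated to the target
})

# chi2 raises a ValueError when any feature value is negative
try:
    SelectKBest(chi2, k=1).fit(X_demo, y_demo)
    raw_fit_failed = False
except ValueError as err:
    raw_fit_failed = True
    print(f"chi2 rejected the raw data: {err}")

# Rescale every column to [0, 1] so the non-negativity requirement holds
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_demo), columns=X_demo.columns)
selector = SelectKBest(chi2, k=1).fit(X_scaled, y_demo)
chosen = X_scaled.columns[selector.get_support()].tolist()
print(chosen)
```

Scaling changes the magnitude of the chi-square statistic, so scores from scaled and unscaled data should not be compared directly.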

#### SelectKBest using mutual information

In [22]:
# Declare the transformer
mutual_info = SelectKBest(mutual_info_classif, k=n_feat)
mutual_info.fit(X, y)

SelectKBest(k=30, score_func=<function mutual_info_classif at 0x146e72700>)

In [23]:
# Get the features from the support
mutual_info_support = mutual_info.get_support()
mutual_info_features = X.loc[:, mutual_info_support].columns.tolist()

In [25]:
# Store the result in its own dataframe so the earlier results are not overwritten
columns = df.columns
result = [column in mutual_info_features for column in columns]
mutual_info_df = pd.DataFrame({'Mutual Information': result}, index=columns)
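
Unlike `f_classif` and `chi2`, `mutual_info_classif` uses a randomized estimator, so the selected features can change between runs. One way to make the selection repeatable, sketched here on synthetic data, is to fix its `random_state` with `functools.partial`. The name `seeded_mi` and the seed value are arbitrary choices for illustration.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in data (hypothetical; not the attrition dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# Bind random_state so every call to the score function uses the same seed
seeded_mi = partial(mutual_info_classif, random_state=0)

first = SelectKBest(seeded_mi, k=4).fit(X_demo, y_demo)
second = SelectKBest(seeded_mi, k=4).fit(X_demo, y_demo)

# With a fixed seed, both fits produce identical scores and the same mask
same_scores = bool((first.scores_ == second.scores_).all())
same_mask = bool((first.get_support() == second.get_support()).all())
print(same_scores, same_mask)
```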