# <b><center>Feature Selection</center></b>

### **Dropping constant features**

Dropping constant features, also known as `zero variance` features, is a common step in feature selection during data preprocessing. It involves identifying and removing features that have the `same` value for all samples in the dataset. These constant features provide `no useful information` to a model and *can get in the way of some preprocessing steps and algorithms*.

Here are a few reasons why dropping constant features is beneficial:

- **No Discriminatory Power**: Features with constant values do not vary across different samples or instances in the dataset. Since these features do not change, they do not contribute any discriminatory power to distinguish between different classes or patterns in the data. `Keeping such features adds nothing the model can learn from`.

- **Redundant Information**: Constant features do not provide any additional information beyond the single value they possess. Including such features in the analysis does not add any value to the model's understanding of the data or its ability to make accurate predictions. Removing them simplifies the dataset and reduces redundancy.

- **Computational Efficiency**: Dropping constant features can lead to computational efficiency gains, especially when dealing with large datasets. These features do not contribute to the learning process, yet they require computational resources for processing, memory storage, and model training. Removing them reduces the overall computational burden.

- **Model Stability**: A constant feature cannot itself be overfit, but it can cause numerical trouble. Standardizing it means dividing by a standard deviation of zero, and in a linear model it is perfectly collinear with the intercept term. Removing constant features avoids these problems and keeps the model simpler.

While dropping constant features is generally considered a good practice, it's important to note that there may be exceptions. For example, a feature that is constant in a sample of the data (such as the first rows of a large file) may still vary in the full dataset, and in some domains a constant feature might carry a specific meaning. Therefore, it's always essential to carefully analyze and understand the data before making decisions about feature selection and removal.
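
As a minimal sketch of the idea, assuming the data sits in a pandas DataFrame, constant columns can be spotted by counting the distinct values in each column. The frame `toy` and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: "country" and "flag" never change, "age" and "income" do.
toy = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "country": ["ES", "ES", "ES", "ES"],
    "income": [1800.0, 2500.0, 2500.0, 3100.0],
    "flag": [1, 1, 1, 1],
})

# A column is constant when it holds at most one distinct value.
# dropna=False counts NaN as a value, so an all-NaN column is flagged too.
n_unique = toy.nunique(dropna=False)
constant_cols = n_unique[n_unique <= 1].index.tolist()

reduced = toy.drop(columns=constant_cols)
print("Constant columns:", constant_cols)
print("Remaining columns:", reduced.columns.tolist())
```

This check also works on string columns, which a variance-based check cannot handle without encoding them first.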

### **Importance of Feature Selection in High-Dimensional Data Analysis:**

High-dimensional data poses problems such as increased computational complexity, overfitting, and sparsity, often summarized as the curse of dimensionality. As the number of features increases, the data becomes increasingly sparse, making it difficult to find meaningful patterns or relationships. This sparsity can lead to poor generalization and decreased model performance.

Feature selection aims to address the curse of dimensionality by reducing the number of irrelevant or redundant features, thereby improving model efficiency and reducing overfitting. By selecting the most informative features, we can mitigate the challenges associated with high-dimensional data and focus on the most relevant information for solving the problem at hand.
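
One way to see the sparsity problem is to look at how distances behave as features are added. The sketch below is an illustration under simple assumptions (uniformly random points, Euclidean distance, a hypothetical helper `relative_contrast`), not a formal result: it measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def relative_contrast(n_features, n_samples=500):
    """(farthest - nearest) / nearest distance from one query point to random points."""
    points = rng.random((n_samples, n_features))
    query = rng.random(n_features)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} features: relative contrast = {c:.2f}")
```

The contrast should shrink as the number of features grows: in high dimensions all points look roughly equally far apart, which is one reason distance-based models tend to struggle there.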

**Without Pre-Processing & Feature Selection**

In [1]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Generate a random dataset with 1000 samples and 100 features
X, y = make_classification(n_samples=1000, n_features=100, random_state=42)

# Shape of the dataset
print(X.shape, y.shape)

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train a logistic regression classifier without feature selection
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)

# Make predictions on the test set
y_pred = lr.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy without feature selection:", accuracy*100)

(1000, 100) (1000,)
Accuracy without feature selection: 83.0


**With Pre-Processing & Feature Selection**

In [2]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Generate a random dataset with 1000 samples and 100 features
X, y = make_classification(n_samples=1000, n_features=100, random_state=42)

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Preprocess the data by scaling it between 0 and 1
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Perform feature selection using the chi-squared test
# (chi2 requires non-negative inputs, hence the MinMax scaling above)
k = 10  # Select the top 10 features
selector = SelectKBest(score_func=chi2, k=k)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

# Train a logistic regression classifier with feature selection
lr = LogisticRegression(random_state=42)
lr.fit(X_train_selected, y_train)

# Make predictions on the test set
y_pred = lr.predict(X_test_selected)

# Calculate the accuracy of the classifier with feature selection
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with feature selection:", accuracy*100)

Accuracy with feature selection: 88.0
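
A single train/test split can swing by a few points, so the comparison is more trustworthy when it is cross-validated. The sketch below is one possible variation, not a re-run of the cells above. It assumes the ANOVA F-test (`f_classif`) as the scoring function, since `chi2` is designed for non-negative features such as counts, and it wraps scaling and selection in a pipeline so each fold fits them on its own training part only. Scores will differ from the numbers printed above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=100, random_state=42)

# Baseline: all 100 features.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Same model, but keeping only the 10 features with the highest F-scores.
selected = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=10),
    LogisticRegression(max_iter=1000),
)

baseline_scores = cross_val_score(baseline, X, y, cv=5, scoring="accuracy")
selected_scores = cross_val_score(selected, X, y, cv=5, scoring="accuracy")

print("All 100 features: %.3f +/- %.3f" % (baseline_scores.mean(), baseline_scores.std()))
print("Top 10 features:  %.3f +/- %.3f" % (selected_scores.mean(), selected_scores.std()))
```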


### **Variance Threshold**


The VarianceThreshold class from the `sklearn.feature_selection` module in scikit-learn is used for feature selection based on `variance`. It's primarily used to remove features (`columns`) from a dataset that have `low variance`, assuming that such features contain less useful or redundant information.

The idea behind this approach is that if a feature has very little variance, it means that it does not vary much across the samples in the dataset. In other words, the feature has almost the same value for all or most of the samples. Such features often provide little discriminatory power and may not contain much useful information for building predictive models.
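
To make "low variance" concrete: as the scikit-learn documentation describes it, `VarianceThreshold` compares each column's variance with `threshold` and keeps the columns whose variance is greater. For a 0/1 feature with a fraction $p$ of ones, the variance is $p(1-p)$, so a threshold of `0.8 * (1 - 0.8)` removes binary columns that take the same value in more than 80% of the samples. The small matrix below is illustrative.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy binary matrix (illustrative values): 6 samples, 3 features.
X = np.array([
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
])

p = 0.8
threshold = p * (1 - p)  # about 0.16

selector = VarianceThreshold(threshold=threshold)
X_reduced = selector.fit_transform(X)

# Population variance (ddof=0) computed by hand for comparison.
manual_variances = X.var(axis=0)

print("Variances:", np.round(selector.variances_, 3))
print("Kept mask:", selector.get_support())
print("Reduced shape:", X_reduced.shape)
```

The first column is 0 in five of six rows, so its variance (about 0.139) falls below the threshold and it is dropped.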

Here's a simple example to illustrate the usage of VarianceThreshold:

In [3]:
# Import VarianceThreshold, and pandas to create the DataFrame
from sklearn.feature_selection import VarianceThreshold
import pandas as pd

# Make DataFrame of the given data
data = pd.DataFrame({"A": [1, 2, 4, 1, 2, 4],
                     "B": [4, 5, 6, 7, 8, 9],
                     "C": [0, 0, 0, 0, 0, 0],
                     "D": [1, 1, 1, 1, 1, 1]})


# Create VarianceThreshold object with threshold=0
var_thres = VarianceThreshold(threshold=0)

# Fit the object to the data
var_thres.fit(data)

# Get the support (boolean mask) of the features
feature_mask = var_thres.get_support()

# Get the column names of the constant columns (mask inverted with ~)
constant_columns = data.columns[~feature_mask].tolist()

print("Number of constant columns:", len(constant_columns))

for feature in constant_columns:
    print(feature)

Number of constant columns: 2
C
D


In [4]:
# Import VarianceThreshold, and pandas to create the DataFrame
from sklearn.feature_selection import VarianceThreshold
import pandas as pd

# Make DataFrame of the given data
data = pd.DataFrame({"A": [1, 2, 4, 1, 2, 4],
                     "B": [4, 5, 6, 7, 8, 9],
                     "C": [0, 0, 0, 0, 0, 0],
                     "D": [1, 1, 1, 1, 1, 1]})

# Display the initial dataset
print("Initial Dataset:")
print(data)

# Create VarianceThreshold object with threshold=0
var_thres = VarianceThreshold(threshold=0)

# Fit the object to the data
var_thres.fit(data)

# Get the support (boolean mask) of the features
feature_mask = var_thres.get_support()
print("\nFeature Mask:")
print("If the particular feature is selected (True), if not selected (False) ")
print(feature_mask)

# Get the column names of the constant columns
constant_columns = data.columns[~feature_mask]

print("\nConstant Columns:")
print(constant_columns)

# Drop the constant columns from the dataset
data = data.drop(constant_columns, axis=1)

# Display the updated dataset
print("\nUpdated Dataset:")
print(data)

Initial Dataset:
   A  B  C  D
0  1  4  0  1
1  2  5  0  1
2  4  6  0  1
3  1  7  0  1
4  2  8  0  1
5  4  9  0  1

Feature Mask:
If the particular feature is selected (True), if not selected (False) 
[ True  True False False]

Constant Columns:
Index(['C', 'D'], dtype='object')

Updated Dataset:
   A  B
0  1  4
1  2  5
2  4  6
3  1  7
4  2  8
5  4  9


### **Applying the Above Technique to the Santander Customer Dataset**

[Download Dataset](https://www.kaggle.com/c/santander-customer-satisfaction/data?select=train.csv)

In [5]:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Forward slash avoids the invalid "\s" escape and works on every OS
df = pd.read_csv('_dataset/santander.csv', nrows=10000)
print(df.shape)
display(df.head())

(10000, 371)


|   | ID | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | ... | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38 | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 39205.17 | 0 |
| 1 | 3 | 2 | 34 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 49278.03 | 0 |
| 2 | 4 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 67333.77 | 0 |
| 3 | 8 | 2 | 37 | 0.0 | 195.0 | 195.0 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64007.97 | 0 |
| 4 | 10 | 2 | 39 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117310.979016 | 0 |


In [6]:
X = df.drop(labels=['TARGET'], axis=1)
y = df['TARGET']

In [7]:
from sklearn.model_selection import train_test_split

# separate dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42)

print(X_train.shape, X_test.shape)

(7000, 370) (3000, 370)


**Let's apply the variance threshold**

In [8]:
var_thres = VarianceThreshold(threshold=0)
var_thres.fit(X_train)

VarianceThreshold(threshold=0)

In [9]:
var_thres.get_support()

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False, False,  True,  True,  True,  True,  True, False,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False, False, False, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True, False, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False, False,  True,  True,  True,
        True,  True, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,

In [10]:
# Count the non-constant features (True entries in the support mask)
sum(var_thres.get_support())

281

In [11]:
# Same count, this time via the column names of the non-constant features
len(X_train.columns[var_thres.get_support()])

281

In [12]:
# Columns where the support mask is False are the constant ones
constant_columns = X_train.columns[~var_thres.get_support()].tolist()

print(len(constant_columns))

89


In [13]:
for column in constant_columns:
    print(column)

ind_var2_0
ind_var2
ind_var13_medio_0
ind_var13_medio
ind_var18_0
ind_var18
ind_var27_0
ind_var28_0
ind_var28
ind_var27
ind_var34_0
ind_var34
ind_var41
ind_var46_0
ind_var46
num_var13_medio_0
num_var13_medio
num_var18_0
num_var18
num_var27_0
num_var28_0
num_var28
num_var27
num_var34_0
num_var34
num_var41
num_var46_0
num_var46
saldo_var13_medio
saldo_var18
saldo_var28
saldo_var27
saldo_var34
saldo_var41
saldo_var46
delta_imp_amort_var18_1y3
delta_imp_amort_var34_1y3
delta_imp_reemb_var17_1y3
delta_imp_reemb_var33_1y3
delta_imp_trasp_var17_in_1y3
delta_imp_trasp_var17_out_1y3
delta_imp_trasp_var33_out_1y3
delta_num_reemb_var17_1y3
delta_num_reemb_var33_1y3
delta_num_trasp_var17_in_1y3
delta_num_trasp_var17_out_1y3
delta_num_trasp_var33_out_1y3
imp_amort_var18_hace3
imp_amort_var18_ult1
imp_amort_var34_hace3
imp_amort_var34_ult1
imp_var7_emit_ult1
imp_reemb_var13_hace3
imp_reemb_var17_hace3
imp_reemb_var17_ult1
imp_reemb_var33_hace3
imp_reemb_var33_ult1
imp_trasp_var17_in_hace3
imp_trasp_

In [14]:
# drop() returns a new DataFrame; X_train itself is left unchanged
X_train.drop(constant_columns, axis=1)

|   | ID | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | ... | saldo_medio_var29_ult3 | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9069 | 18258 | 2 | 23 | 0.0 | 300.9 | 813.75 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117310.979016 |
| 2603 | 5211 | 2 | 29 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 69825.090000 |
| 7738 | 15572 | 2 | 25 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 88115.130000 |
| 1579 | 3125 | 2 | 24 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 150490.350000 |
| 5058 | 10101 | 2 | 33 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117310.979016 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5734 | 11502 | 2 | 23 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 78929.940000 |
| 5191 | 10394 | 2 | 49 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 127000.770000 |
| 5390 | 10815 | 2 | 24 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 95723.880000 |
| 860 | 1703 | 2 | 17 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 101492.340000 |
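
The test split needs the same columns removed as the training split, and the reduced frames have to be stored to be useful later. The sketch below shows one way to do that on synthetic data, since the Santander file is not assumed to be available here. The column names (`f0`, `const_zero`, and so on) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for a wide table: 3 varying columns, 2 constant ones.
X = pd.DataFrame({
    "f0": rng.normal(size=200),
    "f1": rng.integers(0, 5, size=200),
    "f2": rng.normal(size=200),
    "const_zero": np.zeros(200),
    "const_one": np.ones(200),
})
y = pd.Series(rng.integers(0, 2, size=200), name="TARGET")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Learn which columns are constant from the training split only.
var_thres = VarianceThreshold(threshold=0)
var_thres.fit(X_train)
kept = X_train.columns[var_thres.get_support()]

# Apply the same column selection to both splits and keep the results.
X_train_reduced = X_train[kept]
X_test_reduced = X_test[kept]

print(X_train_reduced.shape, X_test_reduced.shape)
```

Deciding from the training split alone keeps the two frames aligned even if a column happens to be constant in one split but not in the other.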


### **On Titanic Dataset**

In [17]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [18]:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import seaborn as sns

# Load the Titanic dataset from Seaborn
titanic_data = sns.load_dataset('titanic')

# Preprocessing
titanic_data = titanic_data.drop(['class', 'deck', 'embark_town', 'alive', 'alone'], axis=1)
titanic_data['sex'] = titanic_data['sex'].map({'female': 0, 'male': 1})
titanic_data['age'] = titanic_data['age'].fillna(titanic_data['age'].mean())
titanic_data['fare'] = titanic_data['fare'].fillna(titanic_data['fare'].mean())
titanic_data['family_size'] = titanic_data['sibsp'] + titanic_data['parch'] + 1
titanic_data = titanic_data.drop(['sibsp', 'parch'], axis=1)

# Perform one-hot encoding on categorical variables
titanic_data = pd.get_dummies(titanic_data, drop_first=True)

X = titanic_data.drop('survived', axis=1)
y = titanic_data['survived']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Print the shape of the dataset before preprocessing
print("Before preprocessing:")
print("Number of samples:", X.shape[0])
print("Number of features:", X.shape[1])

# Train a KNN classifier on the original dataset
clf_original = KNeighborsClassifier()
clf_original.fit(X_train, y_train)
y_pred_original = clf_original.predict(X_test)
accuracy_original = accuracy_score(y_test, y_pred_original)
print("Accuracy before preprocessing:", accuracy_original)

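# Apply VarianceThreshold with threshold 0.2
# Note: the selector is fit on the full dataset here for simplicity;
# fitting it on the training split only would avoid leaking test information
selector = VarianceThreshold(threshold=0.2)
X_selected = selector.fit_transform(X)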

# Split the selected features into training and testing sets
X_train_selected, X_test_selected, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=42)

# Print the shape of the dataset after feature selection
print("After feature selection:")
print("Number of samples:", X_selected.shape[0])
print("Number of features:", X_selected.shape[1])

# Train a KNN classifier on the selected features
clf_selected = KNeighborsClassifier()
clf_selected.fit(X_train_selected, y_train)
y_pred_selected = clf_selected.predict(X_test_selected)
accuracy_selected = accuracy_score(y_test, y_pred_selected)
print("Accuracy after feature selection:", accuracy_selected)

Before preprocessing:
Number of samples: 891
Number of features: 10
Accuracy before preprocessing: 0.7374301675977654
After feature selection:
Number of samples: 891
Number of features: 9
Accuracy after feature selection: 0.7430167597765364
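
A caveat when picking a threshold such as 0.2: variance depends on the unit of each feature, so the same cut-off means different things for a 0/1 indicator and for a column like `age` or `fare`. The sketch below uses made-up heights to show the effect.

```python
import numpy as np

# Hypothetical heights of five people, in metres.
height_m = np.array([1.60, 1.72, 1.81, 1.68, 1.75])
height_cm = height_m * 100  # same information, different unit

var_m = height_m.var()
var_cm = height_cm.var()

print(f"Variance in metres:      {var_m:.4f}")
print(f"Variance in centimetres: {var_cm:.1f}")

threshold = 0.2
print("Kept at threshold 0.2 (metres):     ", var_m > threshold)
print("Kept at threshold 0.2 (centimetres):", var_cm > threshold)
```

Changing the unit multiplies the variance by 100², so the same feature is dropped in metres and kept in centimetres. Standardizing first is not a fix either, because every non-constant feature then has variance 1.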


In [19]:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import seaborn as sns

# Load the Titanic dataset from Seaborn
titanic_data = sns.load_dataset('titanic')

# Preprocessing
titanic_data = titanic_data.drop(
    ['class', 'deck', 'embark_town', 'alive', 'alone'], axis=1)
titanic_data['sex'] = titanic_data['sex'].map({'female': 0, 'male': 1})
titanic_data['age'] = titanic_data['age'].fillna(titanic_data['age'].mean())
titanic_data['fare'] = titanic_data['fare'].fillna(titanic_data['fare'].mean())
titanic_data['family_size'] = titanic_data['sibsp'] + titanic_data['parch'] + 1
titanic_data = titanic_data.drop(['sibsp', 'parch'], axis=1)

# Perform one-hot encoding on categorical variables
titanic_data = pd.get_dummies(titanic_data, drop_first=True)

X = titanic_data.drop('survived', axis=1)
y = titanic_data['survived']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Print the shape of the dataset before feature selection
print("Before preprocessing:")
print("Number of samples:", X.shape[0])
print("Number of features:", X.shape[1])

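# Apply VarianceThreshold with threshold 0.2
# Note: the selector is fit on the full dataset here for simplicity;
# fitting it on the training split only would avoid leaking test information
selector = VarianceThreshold(threshold=0.2)
X_selected = selector.fit_transform(X)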

# Print the shape of the dataset after feature selection
print("After feature selection:")
print("Number of samples:", X_selected.shape[0])
print("Number of features:", X_selected.shape[1])

# Split the selected features into training and testing sets
X_train_selected, X_test_selected, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=42)

# Define the parameter grid for grid search
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],  # Vary the number of neighbors
    'weights': ['uniform', 'distance'],  # Explore different weight options
}

# Perform grid search to find the best hyperparameters
clf_selected = KNeighborsClassifier()
grid_search = GridSearchCV(clf_selected, param_grid, cv=5)
grid_search.fit(X_train_selected, y_train)

# Get the best model from grid search
best_model = grid_search.best_estimator_

# Make predictions on the test set using the best model
y_pred_selected = best_model.predict(X_test_selected)
accuracy_selected = accuracy_score(y_test, y_pred_selected)
print("Accuracy after feature selection and hyperparameter tuning:", accuracy_selected)

Before preprocessing:
Number of samples: 891
Number of features: 10
After feature selection:
Number of samples: 891
Number of features: 9
Accuracy after feature selection and hyperparameter tuning: 0.7206703910614525
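
The variance threshold can also be treated as a hyperparameter and tuned together with the classifier. The sketch below is one possible setup, not the method used in the cells above. It runs on synthetic data from `make_classification` rather than the Titanic table, and the step names `selector` and `knn` as well as the candidate thresholds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (not the Titanic table used above).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Selection and classification in one estimator, so the selector is
# refit on the training part of every cross-validation fold.
pipe = Pipeline([
    ("selector", VarianceThreshold()),
    ("knn", KNeighborsClassifier()),
])

param_grid = {
    "selector__threshold": [0.0, 0.2, 0.5],
    "knn__n_neighbors": [3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
}

grid_search = GridSearchCV(pipe, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Cross-validated accuracy:", round(grid_search.best_score_, 3))
print("Held-out test accuracy:", round(grid_search.score(X_test, y_test), 3))
```

Because the selector sits inside the pipeline, the test split plays no part in choosing which features are kept.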
