In [1]:
# Variance Threshold

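The variance of a feature is the average of the squared differences between each value in the feature and the mean of the feature:
 
    variance = sum((x - mean)**2) / n

Where:

         x: value of a sample in the feature,
         mean: mean of the feature,
         n: number of samples in the feature.

Dividing by n - 1 instead of n gives the sample variance, which corrects for bias when the variance of a population is estimated from a sample. `np.var` (with its default `ddof=0`) and scikit-learn's `VarianceThreshold` both divide by n, so that is the definition the code in this notebook uses.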
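
As a quick check of the formula, the sketch below computes the variance of a small feature by hand and compares it with `np.var`. The values in `x` are illustrative only; `ddof` is the NumPy parameter that sets the denominator to `n - ddof`.

```python
import numpy as np

# Illustrative feature values
x = np.array([0.0, 0.0, 0.0, 0.0, 1.0])

n = x.size
mean = x.mean()
squared_diffs = (x - mean) ** 2

# Denominator n: what np.var computes by default (ddof=0)
population_var = squared_diffs.sum() / n

# Denominator n - 1: the bias-corrected sample variance (ddof=1)
sample_var = squared_diffs.sum() / (n - 1)

print(population_var, np.var(x))
print(sample_var, np.var(x, ddof=1))
```

The two printed pairs should agree up to floating-point rounding, which shows that the choice of denominator is the only difference between the two definitions.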


The Variance Threshold method is a feature selection technique that removes low-variance features from a dataset. The idea behind this method is that features with very low variance are likely to contain little information and are therefore less useful for machine learning models.


To apply the Variance Threshold method, we first compute the variance of each feature in the dataset. We then select a threshold value for the variance, below which features are considered to have low variance. Features with a variance below the threshold are then removed from the dataset.
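
The steps described above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: `select_by_variance` and `X_demo` are hypothetical names introduced here, and the sketch assumes that a feature whose variance equals the threshold is dropped too, so that a threshold of 0 removes constant features.

```python
import numpy as np

def select_by_variance(X, threshold):
    """Hypothetical helper: keep columns whose variance exceeds `threshold`."""
    variances = X.var(axis=0)        # step 1: variance of each feature (denominator n)
    mask = variances > threshold     # step 2: compare each variance with the threshold
    return X[:, mask], mask          # step 3: keep only the columns that pass

# Made-up data: the first column is constant, so its variance is 0
X_demo = np.array([[1.0, 5.0, 0.0],
                   [1.0, 3.0, 1.0],
                   [1.0, 4.0, 0.0],
                   [1.0, 6.0, 1.0]])

X_kept, mask = select_by_variance(X_demo, threshold=0.0)
print(mask)
print(X_kept.shape)
```

Here the constant first column is the only one removed, leaving a dataset with two features.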

In [2]:
from sklearn.feature_selection import VarianceThreshold
import numpy as np

In [3]:
# Create a dataset with three features and five samples
X = np.array([[0, 1, 0],
              [0, 1, 1],
              [0, 1, 0],
              [0, 1, 1],
              [1, 0, 0]])

In [4]:
# Compute the variance of each feature
variances = np.var(X, axis=0)

# Print the variances of each feature
print(f"Variances: {variances}")

Variances: [0.16 0.16 0.24]


In [5]:
# Apply the Variance Threshold method with a threshold of 0.1
selector = VarianceThreshold(threshold=0.1)
X_new = selector.fit_transform(X)

# get_support() returns a boolean mask that indicates which features were selected.
# All three variances (0.16, 0.16, 0.24) exceed 0.1, so no feature is removed here.
print(f"Selected features: {selector.get_support()}")
print(f"New dataset: {X_new}")

Selected features: [ True  True  True]
New dataset: [[0 1 0]
 [0 1 1]
 [0 1 0]
 [0 1 1]
 [1 0 0]]
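
With a threshold of 0.1 every feature is kept, so the run above does not show any removal. The sketch below repeats it with a higher threshold. The value 0.2 is an assumption chosen only because it falls between the computed variances (0.16 and 0.24), and `X_reduced` is a name introduced here.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Same dataset as above: three features, five samples
X = np.array([[0, 1, 0],
              [0, 1, 1],
              [0, 1, 0],
              [0, 1, 1],
              [1, 0, 0]])

# A threshold between 0.16 and 0.24 separates the first two features from the third
selector = VarianceThreshold(threshold=0.2)
X_reduced = selector.fit_transform(X)

print(f"Variances: {selector.variances_}")            # per-feature variances learned in fit
print(f"Selected features: {selector.get_support()}")
print(f"Reduced shape: {X_reduced.shape}")
```

Only the third feature should survive, since the first two have a variance of 0.16, which is below 0.2.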
