# Machine Learning
### Variance threshold for feature selection
The **variance threshold** method is one of the simplest **filter**-based **feature selection** techniques.
- It removes features whose variance across samples is below a specified threshold
    - Under the assumption that features with very low variance contain little useful information for prediction.

<hr>

**Reminder:** **Feature selection** is the process of identifying and retaining the most *relevant* input *variables* (*features*) from a dataset to improve model performance, reduce overfitting, speed up training, and enhance interpretability. 
- It helps eliminate irrelevant, redundant, or noisy features. 

<hr> 

For a feature vector $\boldsymbol{x}=[x_1,x_2,...,x_n]$ with $n$ samples (we use **sample variance**):
- $Variance_{sample}(\boldsymbol{x})=\sigma^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\mu)^2$

where
- $\mu=\frac{1}{n}\sum_{i=1}^n x_i$ is the **sample mean** of the feature,
- $x_i$ is the $i$-th sample value.
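As a quick check of the formula, the following minimal sketch computes the sample variance of a small illustrative feature by hand and compares it with NumPy. The values in `x` are chosen only for illustration; `ddof=1` tells `np.var` to divide by $n-1$.

```python
import numpy as np

# Illustrative feature with n = 4 samples
x = np.array([1.0, 2.0, 3.0, 4.0])

n = x.size
mu = x.sum() / n                              # sample mean
var_manual = ((x - mu) ** 2).sum() / (n - 1)  # divide by n-1 (sample variance)
var_numpy = np.var(x, ddof=1)                 # same estimator via NumPy

print(mu, var_manual, var_numpy)
```

Here the mean is 2.5 and the squared deviations sum to 5, so both variance values equal $5/3 \approx 1.667$.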

<hr>

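Then, the **threshold decision rule** is:
- Keep feature $j$ if $\sigma_j^2 \gt threshold$
- Discard feature $j$ if $\sigma_j^2 \le threshold$
    - The comparison is strict, so with a threshold of 0 every constant feature (zero variance) is discarded.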
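The decision rule can be sketched as a boolean mask over the per-feature variances. The numbers in `variances` and the `threshold` value are hypothetical, chosen only for illustration. The comparison is strict (`>`), as in the from-scratch function in this notebook, so a feature whose variance equals the threshold is discarded.

```python
import numpy as np

# Hypothetical per-feature variances and threshold, for illustration only
variances = np.array([0.0003, 1.6667, 0.95, 0.1])
threshold = 0.1

keep_mask = variances > threshold          # True where the feature is kept
kept_indices = np.flatnonzero(keep_mask)   # column indices of kept features

print(keep_mask)
print(kept_indices)
```

Features 1 and 2 are kept. Feature 3 has a variance exactly equal to the threshold and is therefore dropped.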

<hr>

In the following, we implement the **variance threshold** method from scratch and apply it to a simple dataset. Each row of the dataset is a data point, and each column holds the samples of one feature.
- As a bonus, we also give the code to use `VarianceThreshold` of **Scikit-learn**.

<hr>

https://github.com/ostad-ai/Machine-Learning
<br> Explanation: https://www.pinterest.com/HamedShahHosseini/Machine-Learning/

In [1]:
# Import required module
import numpy as np

In [2]:
def variance_threshold_selection(X, threshold=0.0):
    """
    Select features with variance > threshold.
    
    Parameters:
    - X: 2D NumPy array (n_samples, n_features)
    - threshold: minimum variance to keep a feature (default = 0.0)
    
    Returns:
    - X_selected: array with selected features
    - selected_indices: indices of kept features
    - variances: sample variance of every original feature
    """
    # Compute variance for each feature (column)
    variances = np.var(X, axis=0, ddof=1)  # ddof=1 for sample variance
    
    # Find features with variance > threshold
    selected_indices = np.where(variances > threshold)[0]
    
    # Select those columns
    X_selected = X[:, selected_indices]
    
    return X_selected, selected_indices, variances

In [6]:
# Example usage
# Feature 0 of X is obviously
# the one with least variance
X = np.array([
    [0.12, 1, 0.9],
    [0.14, 2, 1.2],
    [0.13, 3, 1.0],
    [0.10, 4, -0.9]
])

# Now apply Variance Threshold
X_new, kept_idx, vars_ = variance_threshold_selection(X, threshold=0.1)

print(f"Original variances:\n{vars_}")
print("Kept feature indices:", kept_idx)
print("Selected features:\n", X_new)

Original variances:
[2.91666667e-04 1.66666667e+00 9.50000000e-01]
Kept feature indices: [1 2]
Selected features:
 [[ 1.   0.9]
 [ 2.   1.2]
 [ 3.   1. ]
 [ 4.  -0.9]]


<hr style="height:3px;background-color:lightblue">

# Bonus
### Feature selection by Variance Threshold using scikit-learn 

In [7]:
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(X)
print(X_selected)

[[ 1.   0.9]
 [ 2.   1.2]
 [ 3.   1. ]
 [ 4.  -0.9]]
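One detail is worth checking when comparing the two implementations. As far as we know, scikit-learn's `VarianceThreshold` computes the population variance (dividing by $n$), whereas the from-scratch function uses the sample variance (dividing by $n-1$). The sketch below uses only NumPy to show how the two estimators can select different features. The threshold of 0.8 is a hypothetical value picked to fall between the two variance estimates of feature 2.

```python
import numpy as np

X = np.array([
    [0.12, 1, 0.9],
    [0.14, 2, 1.2],
    [0.13, 3, 1.0],
    [0.10, 4, -0.9],
])

var_sample = np.var(X, axis=0, ddof=1)      # divides by n-1 (from-scratch function)
var_population = np.var(X, axis=0, ddof=0)  # divides by n

threshold = 0.8  # hypothetical value, chosen between 0.7125 and 0.95
kept_sample = np.flatnonzero(var_sample > threshold)
kept_population = np.flatnonzero(var_population > threshold)

print(kept_sample)      # kept when using sample variance
print(kept_population)  # kept when using population variance
```

With this threshold, the sample variance keeps features 1 and 2, while the population variance keeps only feature 1. For the threshold of 0.1 used in this notebook, both estimators agree, which is why the two outputs above match.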
