# **Problem Statement**  
## **13. Perform feature selection using correlation thresholding.**

Perform feature selection using correlation thresholding to remove highly correlated features from a dataset while retaining the most informative ones.

### Constraints & Example Inputs/Outputs

### Constraints
- Dataset is numerical
- Use Pearson correlation
- Threshold range: 0 < threshold < 1
- Remove one feature from each highly correlated pair


### Example Input:
| Feature A | Feature B | Feature C |
| --------- | --------- | --------- |
| 10        | 20        | 5         |
| 20        | 40        | 6         |
| 30        | 60        | 7         |


### Expected Output:
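- Feature B = 2 × Feature A, so the pair (A, B) has a Pearson correlation of 1: drop either Feature A or Feature B (the solutions below keep A and drop B)
- In this toy table Feature C = 0.1 × Feature A + 4 is also perfectly correlated with Feature A, so it is dropped as well; Feature C would be kept only if its correlation with the retained features stayed below the threshold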
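
A quick way to sanity-check this example is to compute the Pearson correlation matrix directly. The sketch below rebuilds the three-row table as a DataFrame; the shortened column names `A`, `B`, `C` and the variable names are assumptions made for brevity. Note that in this tiny table `C = 0.1 * A + 4`, so `C` is an exact linear function of `A`, just as `B` is.

```python
import pandas as pd

# Example input from the table above (column names shortened for brevity).
example = pd.DataFrame({
    "A": [10, 20, 30],
    "B": [20, 40, 60],
    "C": [5, 6, 7],
})

# Absolute Pearson correlation between every pair of columns.
example_corr = example.corr(method="pearson").abs()
print(example_corr.round(3))
```

Every entry is 1 up to floating-point error, so for any threshold below 1 a "keep the first feature of each pair" rule retains only `A`.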

### Solution Approach

### What is Correlation Thresholding?
- Pearson correlation measures the strength of the linear relationship between two features, on a scale from -1 to 1
- Feature pairs with |r| above the threshold → largely redundant information
- Remove one to:
    - Reduce multicollinearity
    - Improve model stability
    - Reduce overfitting
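
To make "linear dependency" concrete, here is a minimal sketch that computes Pearson's r by hand and compares it with NumPy. The helper name `pearson_r` and the sample vectors are illustrative assumptions, not part of the original solution.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered @ y_centered) / np.sqrt(
        (x_centered @ x_centered) * (y_centered @ y_centered)
    )

a = [10, 20, 30, 40]
b = [20, 40, 60, 80]   # b = 2 * a, so the two are perfectly correlated
c = [3, 1, 4, 2]       # chosen so that its covariance with a is exactly zero

r_ab = pearson_r(a, b)
r_ac = pearson_r(a, c)
print(r_ab, r_ac)
```

A feature like `b` adds nothing that `a` does not already provide, which is why one of the two can be removed; `c` carries independent information and would be kept.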

### Steps
1. Compute the Pearson correlation matrix
2. Take absolute values
3. Identify pairs above the threshold
4. Drop one feature from each pair
5. Return the selected features
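
Steps 1 to 3 can be inspected on their own before anything is dropped. The following is a rough sketch, assuming a hypothetical helper `correlated_pairs` and a made-up `demo` DataFrame; it only lists the offending pairs and leaves the drop decision to the caller.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold):
    """Return (feature_i, feature_j, |r|) for every pair with |r| above threshold."""
    corr = df.corr().abs()                      # steps 1 and 2
    cols = corr.columns
    # Index pairs of the strict upper triangle, so each pair appears once.
    row_idx, col_idx = np.triu_indices(len(cols), k=1)
    return [
        (cols[i], cols[j], float(corr.iloc[i, j]))
        for i, j in zip(row_idx, col_idx)
        if corr.iloc[i, j] > threshold          # step 3
    ]

demo = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "B": [2, 4, 6, 8, 10, 12],   # B = 2 * A
    "C": [3, 1, 2, 6, 4, 5],     # only moderately correlated with A
})
pairs = correlated_pairs(demo, threshold=0.9)
print(pairs)
```

Listing the pairs first makes it easier to decide which member of each pair to keep, for example based on domain knowledge rather than column order.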

### Solution Code

In [1]:
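# Approach 1: Brute Force (Nested Loop)
import numpy as np
import pandas as pd

def correlation_threshold_bruteforce(df, threshold):
    """Drop the later feature of every pair whose |Pearson r| exceeds threshold."""
    corr_matrix = df.corr()  # Pearson correlation is the pandas default
    features = df.columns.tolist()
    to_drop = set()

    # Visit each unordered pair (i, j) with i < j exactly once.
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                to_drop.add(features[j])  # keep the earlier column, drop the later one

    # Preserve the original column order among the surviving features.
    selected_features = [f for f in features if f not in to_drop]
    return selected_features, df[selected_features]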
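
One property of the nested loop above is worth knowing: it drops `features[j]` even when `features[i]` has itself already been marked for removal. In a chain where A correlates with B and B with C, but A does not correlate with C, both B and C are dropped. The sketch below shows a hypothetical variant (`correlation_threshold_skip_dropped` is an assumed name, as is the `chain` dataset) that ignores features that are already gone.

```python
import pandas as pd

def correlation_threshold_skip_dropped(df, threshold):
    """Variant: a feature already marked for removal cannot cause further drops."""
    corr_matrix = df.corr().abs()
    features = df.columns.tolist()
    to_drop = set()

    for i in range(len(features)):
        if features[i] in to_drop:
            continue  # this feature is already removed, so ignore its correlations
        for j in range(i + 1, len(features)):
            if corr_matrix.iloc[i, j] > threshold:
                to_drop.add(features[j])

    return [f for f in features if f not in to_drop]

chain = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "B": [4, 3, 5, 10, 9, 11],   # B = A + C, strongly correlated with both
    "C": [3, 1, 2, 6, 4, 5],     # only moderately correlated with A
})
kept = correlation_threshold_skip_dropped(chain, threshold=0.8)
print(kept)
```

With this data and a threshold of 0.8, the variant keeps `A` and `C`, whereas the original loop would keep only `A`. Which behavior is preferable depends on how aggressively redundancy should be pruned.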


### Alternative Solution

In [3]:
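# Approach 2: Optimized (Upper Triangle Matrix)
def correlation_threshold_optimized(df, threshold):
    """Vectorized version: mask the strict upper triangle, then scan each column."""
    corr_matrix = df.corr().abs()
    # Keep only entries above the diagonal so each pair is considered once;
    # everything else becomes NaN, and NaN > threshold evaluates to False.
    upper_triangle = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )

    # A column is dropped if it is highly correlated with any earlier column.
    to_drop = [
        column for column in upper_triangle.columns
        if (upper_triangle[column] > threshold).any()
    ]

    selected_features = df.drop(columns=to_drop)
    return selected_features.columns.tolist(), selected_features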
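
The mask passed to `where` is easier to follow when printed for a small case. This is a minimal illustration for three features; the variable names `n` and `mask` are assumptions.

```python
import numpy as np

n = 3
# Strict upper triangle: True where column index > row index (k=1 excludes the diagonal).
mask = np.triu(np.ones((n, n)), k=1).astype(bool)
print(mask)
```

`DataFrame.where` keeps the entries where the mask is `True` and replaces the rest with `NaN`, so the diagonal (always 1) and the mirrored lower triangle never trigger a drop.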


### Alternative Approaches

### Other Feature Selection Methods
- Variance Threshold
- Mutual Information
- Recursive Feature Elimination (RFE)
- L1 Regularization (Lasso)
- Tree-based Feature Importance
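
As a point of comparison, the first alternative in the list can be sketched with pandas alone. This is a simplified illustration under stated assumptions: the function name `variance_threshold`, its `min_variance` parameter, and the `toy` data are all hypothetical, and scikit-learn's `VarianceThreshold` transformer is the more usual choice in practice.

```python
import pandas as pd

def variance_threshold(df, min_variance=0.0):
    """Keep columns whose sample variance is strictly greater than min_variance."""
    variances = df.var()  # pandas computes the sample variance (ddof=1) by default
    keep = variances[variances > min_variance].index.tolist()
    return keep, df[keep]

toy = pd.DataFrame({
    "constant": [1, 1, 1, 1],   # zero variance, carries no information
    "varying": [1, 2, 3, 4],
})
kept_cols, reduced = variance_threshold(toy)
print(kept_cols)
```

Unlike correlation thresholding, this filter looks at each feature in isolation, so it cannot detect two informative features that duplicate each other.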

### Test Case

In [4]:
# Test Case 1: Simple Correlated Dataset

data = {
    "A": [10, 20, 30, 40],
    "B": [20, 40, 60, 80],  # B = 2 * A, perfectly correlated with A
    "C": [5, 7, 9, 11]      # C = 0.2 * A + 3, also perfectly correlated with A
}

df = pd.DataFrame(data)


In [5]:
features, selected_df = correlation_threshold_bruteforce(df, threshold=0.9)
print("Brute Force Selected Features:", features)
selected_df


Brute Force Selected Features: ['A']


|   | A  |
| - | -- |
| 0 | 10 |
| 1 | 20 |
| 2 | 30 |
| 3 | 40 |


In [6]:
features, selected_df = correlation_threshold_optimized(df, threshold=0.9)
print("Optimized Selected Features:", features)
selected_df


Optimized Selected Features: ['A']


|   | A  |
| - | -- |
| 0 | 10 |
| 1 | 20 |
| 2 | 30 |
| 3 | 40 |


In [7]:
# Test Case 2: Strong Negative Correlation (X2 vs X3, |r| ≈ 0.84 > 0.8)

data = {
    "X1": [1, 2, 3, 4],
    "X2": [2, 5, 1, 7],
    "X3": [9, 3, 6, 2]
}

df = pd.DataFrame(data)

features, selected_df = correlation_threshold_optimized(df, threshold=0.8)
print("Selected Features:", features)


Selected Features: ['X1', 'X2']
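
The reason `X3` is missing from this result becomes clear once the signed correlations are printed. The check below is a small sketch that rebuilds the same data under the assumed name `check`.

```python
import pandas as pd

check = pd.DataFrame({
    "X1": [1, 2, 3, 4],
    "X2": [2, 5, 1, 7],
    "X3": [9, 3, 6, 2],
})

corr = check.corr()
r_x2_x3 = corr.loc["X2", "X3"]   # roughly -0.84
print(corr.round(3))
```

`X2` and `X3` are negatively correlated with a magnitude above 0.8, and because the solution compares absolute values, `X3` is dropped. The other two pairs stay below the threshold.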


In [8]:
# Test Case 3: All Features Correlated

data = {
    "F1": [1, 2, 3, 4],
    "F2": [2, 4, 6, 8],
    "F3": [3, 6, 9, 12]
}

df = pd.DataFrame(data)

features, selected_df = correlation_threshold_optimized(df, threshold=0.9)
print("Selected Features:", features)
selected_df


Selected Features: ['F1']


|   | F1 |
| - | -- |
| 0 | 1  |
| 1 | 2  |
| 2 | 3  |
| 3 | 4  |


In [9]:
# Test Case 4: Realistic Random Dataset

np.random.seed(42)

data = {
    "Age": np.random.randint(20, 60, 100),
    "Salary": np.random.randint(30000, 90000, 100),
    "Experience": np.random.randint(1, 40, 100),
}

# Introduce correlation
data["Salary_Duplicate"] = data["Salary"] * 1.05

df = pd.DataFrame(data)

features, selected_df = correlation_threshold_optimized(df, threshold=0.85)
print("Selected Features:", features)


Selected Features: ['Age', 'Salary', 'Experience']


### Expected Outputs

✔ Highly correlated features removed

✔ Independent features retained

✔ Same output from brute force & optimized

✔ Ready for ML model input

## Complexity Analysis

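Here n is the number of features and m is the number of rows.

### Brute Force
- Time: O(m·n²) to build the correlation matrix, plus O(n²) for the pairwise scan
- Space: O(n²) (correlation matrix)

### Optimized
- Time: O(m·n²) for the correlation matrix, plus O(n²) for the masked scan (same asymptotic cost; the comparisons are vectorized rather than run in a Python loop)
- Space: O(n²)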
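
The quadratic term comes from the number of unordered feature pairs. A tiny sketch, with the hypothetical helper `count_pairs`, confirms that the nested loop inspects n(n-1)/2 pairs:

```python
from itertools import combinations

def count_pairs(n_features):
    """Number of unordered feature pairs (i, j) with i < j."""
    return sum(1 for _ in combinations(range(n_features), 2))

for n in (3, 10, 100):
    # The two printed counts should match: enumerated pairs vs. the closed form.
    print(n, count_pairs(n), n * (n - 1) // 2)
```

Going from 10 to 100 features therefore raises the pair count from 45 to 4950, which is why very wide datasets may need a cheaper pre-filter before correlation thresholding.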

#### Thank You!!