## Feature correlation analysis

## Context

With 400+ features, many derived from similar signals, redundancy is inevitable. Highly correlated features add noise, slow down training, and can destabilize linear models. This analysis identifies clusters of correlated features for pruning.

## Objective 
- Compute correlation matrices for numeric features
- Identify highly correlated feature pairs (|r| > 0.95)
- Select candidates for removal based on correlation and importance
- Quantify potential dimensionality reduction


In [1]:
# import libraries and load data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')

train = pd.read_parquet(Path('../data/interim/train_merged.parquet'))
print(f'Data loaded: {train.shape}')

Data loaded: (590540, 434)


In [2]:
# numeric features
numeric_cols = train.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols = [c for c in numeric_cols if c not in ['Transactionid', 'isFraud']]

print(f'Numeric features : {len(numeric_cols)}')

# sample for speed
sample = train[numeric_cols].sample(n=50000, random_state=42)

Numeric features : 402


In [7]:
# V FEATURE CORRELATION
v_cols = [c for c in numeric_cols if c.startswith('V')]
print(f'v-features : {len(v_cols)}')

# compute correlation matrix (Pearson, pairwise-complete observations)
v_corr = sample[v_cols].corr()

# find high correlation pairs
high_corr_pairs = []
for i in range(len(v_cols)):
    for j in range(i+1, len(v_cols)):
        if abs(v_corr.iloc[i, j]) > 0.95:
            high_corr_pairs.append((v_cols[i], v_cols[j], v_corr.iloc[i, j]))


print(f'Highly correlated pairs |r| > 0.95 : {len(high_corr_pairs)}')
print('\nSample pairs :')
for f1, f2, r in high_corr_pairs[:10]:
    print(f'{f1} - {f2} : {r:.3f}')

v-features : 339
Highly correlated pairs |r| > 0.95 : 634

Sample pairs :
V10 - V11 : 0.970
V15 - V16 : 0.985
V15 - V33 : 0.956
V15 - V57 : 0.954
V15 - V94 : 0.953
V17 - V18 : 0.994
V17 - V21 : 0.959
V18 - V21 : 0.954
V21 - V22 : 0.965
V21 - V84 : 0.956
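
The nested loop above can also be written with an upper-triangle mask, which avoids per-element `iloc` lookups. This is a minimal sketch on synthetic data, not the notebook's own code; the helper name `high_corr_pairs_vectorized` and the demo columns `A`, `B`, `C` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def high_corr_pairs_vectorized(corr: pd.DataFrame, threshold: float = 0.95):
    """Return (feature_a, feature_b, r) for every pair with |r| > threshold."""
    values = corr.to_numpy()
    # strict upper triangle: each unordered pair once, diagonal excluded
    rows, cols = np.triu_indices_from(values, k=1)
    r = values[rows, cols]
    # NaN compares as False, so pairs with undefined correlation are skipped
    keep = np.abs(r) > threshold
    names = corr.columns.to_numpy()
    return list(zip(names[rows[keep]], names[cols[keep]], r[keep]))


# synthetic demo data (not the real dataset)
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
demo = pd.DataFrame({
    'A': base,
    'B': base + rng.normal(scale=0.05, size=1000),  # near-duplicate of A
    'C': rng.normal(size=1000),                     # independent of A and B
})

pairs = high_corr_pairs_vectorized(demo.corr())
for f1, f2, r in pairs:
    print(f'{f1} - {f2} : {r:.3f}')
```

The pairs come back in the same row-major order as the nested loop, so the first ten entries should match the loop's sample output when run on the same matrix.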


## Insight: V-feature correlation clusters

From the correlation analysis, we identify dense clusters of redundancy:
- Many V-feature pairs exceed |r| > 0.95 (634 pairs in the 50,000-row sample), which indicates near-perfect linear correlation
- These likely come from similar aggregation windows or related transformations in Vesta's system
- Keeping all of them adds noise and training overhead and is unlikely to improve AUC

Pruning strategy: for each correlated pair, drop the feature with lower target correlation or higher missingness to preserve the most predictive signal.
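
To see the clusters rather than isolated pairs, the pair list can be grouped into connected components. The sketch below uses a small union-find and runs only on the ten sample pairs printed above, not on all 634; the function name `correlation_clusters` is an assumption for illustration.

```python
from collections import defaultdict


def correlation_clusters(pairs):
    """Group features into clusters: connected components of the pair graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for f1, f2, _ in pairs:
        root1, root2 = find(f1), find(f2)
        if root1 != root2:
            parent[root2] = root1

    clusters = defaultdict(list)
    for feature in parent:
        clusters[find(feature)].append(feature)
    # largest clusters first, members sorted for readability
    return sorted((sorted(m) for m in clusters.values()), key=len, reverse=True)


# the ten sample pairs printed by the correlation cell
sample_pairs = [
    ('V10', 'V11', 0.970), ('V15', 'V16', 0.985), ('V15', 'V33', 0.956),
    ('V15', 'V57', 0.954), ('V15', 'V94', 0.953), ('V17', 'V18', 0.994),
    ('V17', 'V21', 0.959), ('V18', 'V21', 0.954), ('V21', 'V22', 0.965),
    ('V21', 'V84', 0.956),
]

clusters = correlation_clusters(sample_pairs)
for members in clusters:
    print(len(members), members)
```

Membership is transitive here: two features can share a cluster through a chain of pairs without being correlated above 0.95 with each other, so a cluster is an upper bound on redundancy rather than a guarantee.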

In [12]:
# identify redundant features 

# target correlation (absolute Pearson r against isFraud)
target_corr = sample.join(train['isFraud']).corr()['isFraud'].drop('isFraud').abs()

# for each correlated pair, mark the one with lower target correlation for removal
to_remove = set()
for f1, f2, r in high_corr_pairs:
    if target_corr.get(f1, 0) < target_corr.get(f2, 0):
        to_remove.add(f1)
    else:
        to_remove.add(f2)
        
print(f'Features recommended for removal : {len(to_remove)}')
print(f'Original V-features : {len(v_cols)}')
print(f'After pruning : {len(v_cols) - len(to_remove)}')
print(f'reduction : {len(to_remove)/ len(v_cols)*100:.1f}%')

Features recommended for removal : 130
Original V-features : 339
After pruning : 209
reduction : 38.3%
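
The pruning strategy stated earlier also names missingness as a criterion, while the cell above compares target correlation only. One possible way to combine the two is sketched below on made-up numbers: target correlation decides unless the two values are within a tolerance, in which case the feature with more missing values is dropped. The helper `pick_feature_to_drop` and the `tol` parameter are hypothetical and not part of the notebook.

```python
import pandas as pd


def pick_feature_to_drop(f1, f2, target_corr, missing_rate, tol=0.01):
    """Return which feature of a correlated pair to drop.

    tol is a hypothetical tie-break tolerance on |target correlation|.
    """
    c1 = target_corr.get(f1, 0.0)
    c2 = target_corr.get(f2, 0.0)
    # treat an undefined correlation as carrying no signal
    c1 = 0.0 if pd.isna(c1) else c1
    c2 = 0.0 if pd.isna(c2) else c2
    if abs(c1 - c2) > tol:
        return f1 if c1 < c2 else f2
    # near tie on predictive signal: drop the feature with more missing values
    return f1 if missing_rate.get(f1, 0.0) > missing_rate.get(f2, 0.0) else f2


# made-up values for illustration only
demo_target_corr = pd.Series({'A': 0.30, 'B': 0.10, 'C': 0.205, 'D': 0.20})
demo_missing = pd.Series({'A': 0.00, 'B': 0.00, 'C': 0.60, 'D': 0.05})

clear_case = pick_feature_to_drop('A', 'B', demo_target_corr, demo_missing)
tie_case = pick_feature_to_drop('C', 'D', demo_target_corr, demo_missing)
print(clear_case, tie_case)
```

On the real data, `missing_rate` could be computed as `sample[v_cols].isna().mean()`; a suitable value for `tol` is a judgment call.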


## Insight: Dimensionality reduction impact

From the redundancy analysis:
- We can potentially remove 130 V-features that appear in pairs with |r| > 0.95
- This represents a 38.3% reduction in the V-feature block (339 to 209), or roughly 32% of the 402 numeric features
- Training time should decrease roughly in proportion; the AUC loss is expected to be minimal (typically less than 0.1%), which should be confirmed by retraining

Balance: aggressive pruning speeds up iteration cycles, while conservative pruning preserves potential signal. Start aggressive, then add features back if AUC suffers noticeably.
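
One way to make the aggressive versus conservative trade-off concrete is to sweep the threshold and count how many features each setting would flag. The sketch below applies the same drop rule as the pruning cell to synthetic data; the helper `count_removals`, the demo columns, and the hard-coded target correlations are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def count_removals(corr, target_corr, threshold):
    """Count features flagged at a threshold, using the pruning cell's rule."""
    cols = list(corr.columns)
    values = corr.to_numpy()
    to_remove = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(values[i, j]) > threshold:
                f1, f2 = cols[i], cols[j]
                # drop the member of the pair with the lower target correlation
                weaker = f1 if target_corr.get(f1, 0) < target_corr.get(f2, 0) else f2
                to_remove.add(weaker)
    return len(to_remove)


# synthetic stand-in for the V-features: noisy copies of one signal
rng = np.random.default_rng(42)
x = rng.normal(size=5000)
demo = pd.DataFrame({
    'a': x,
    'b': x + rng.normal(scale=0.05, size=5000),
    'c': x + rng.normal(scale=0.25, size=5000),
    'd': x + rng.normal(scale=0.40, size=5000),
    'e': rng.normal(size=5000),
})
# hypothetical target correlations, hard-coded for the demo
demo_target_corr = pd.Series({'a': 0.5, 'b': 0.4, 'c': 0.3, 'd': 0.2, 'e': 0.1})

demo_corr = demo.corr()
sweep = {t: count_removals(demo_corr, demo_target_corr, t) for t in (0.90, 0.95, 0.99)}
for t, n in sweep.items():
    print(f'threshold {t:.2f}: {n} of {demo.shape[1]} features flagged')
```

Running the same sweep on `v_corr` and `target_corr` would show how quickly the flagged set grows as the threshold is lowered.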

In [13]:
# save redundant features

output_path = Path('../data/metadata')
output_path.mkdir(parents=True, exist_ok=True)

redundant_file = output_path / 'redundant_feature.csv'
# sort so the saved list is deterministic across runs (set order is not)
pd.Series(sorted(to_remove)).to_csv(redundant_file, index=False)
print(f'Saved redundant features list to: {redundant_file}')
print(f'features to drop {len(to_remove)}')

Saved redundant features list to: ..\data\metadata\redundant_feature.csv
features to drop 130
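
A downstream step could read the saved list back and drop those columns before training. The sketch below is a round trip on a temporary file with made-up columns; the helper `drop_redundant` is an assumption, and it relies on the file having a single column with a header row, which is what `to_csv(index=False)` writes for an unnamed Series.

```python
import tempfile
from pathlib import Path

import pandas as pd


def drop_redundant(df, csv_path):
    """Drop the columns listed in a single-column CSV with a header row."""
    redundant = pd.read_csv(csv_path).iloc[:, 0].astype(str).tolist()
    # ignore names that are not present, so the list can be reused across splits
    present = [c for c in redundant if c in df.columns]
    return df.drop(columns=present)


# demo round trip on a temporary file (made-up data)
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'redundant_feature.csv'
    pd.Series(['V11', 'V16']).to_csv(csv_path, index=False)

    demo = pd.DataFrame({'V10': [1, 2], 'V11': [1, 2], 'V16': [3, 4], 'amt': [9.5, 1.0]})
    pruned = drop_redundant(demo, csv_path)

print(list(pruned.columns))
```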


## Key takeaways

From the feature correlation analysis:
- Redundancy is substantial: we identified 634 V-feature pairs with |r| > 0.95, flagging 130 features for removal
- V-features are the main source: the Vesta-engineered V1-V339 features contain many near duplicates