# NovaCred — Bias Detection & Fairness Analysis

**Role:** Data Scientist  
**Inputs:** `applications_analysis.csv`, `spending_items_clean.csv`  
**Source module:** `src/bias.py`

Covers:
1. Gender disparate impact (four-fifths rule)
2. Age-based bias patterns
3. Gender × age interaction effects
4. Proxy discrimination (financial features + spending categories)
5. Interest rate disparity (approved applicants)
6. Rejection reason breakdown
7. Fairness summary export

## Setup

In [None]:
from pathlib import Path
import sys
import pandas as pd

PROJECT_ROOT = Path.cwd().resolve().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src import bias

pd.set_option('display.float_format', '{:.4f}'.format)
pd.set_option('display.max_columns', 30)

FIGURES_DIR = PROJECT_ROOT / 'figures'
QUALITY_DIR = PROJECT_ROOT / 'data' / 'quality'
FIGURES_DIR.mkdir(parents=True, exist_ok=True)
QUALITY_DIR.mkdir(parents=True, exist_ok=True)

## 1. Load Data

In [None]:
analysis = bias.load_analysis(PROJECT_ROOT / 'data' / 'curated' / 'applications_analysis.csv')
spending = bias.load_spending(PROJECT_ROOT / 'data' / 'curated' / 'spending_items_clean.csv')

print(f'Analysis dataset : {analysis.shape[0]:,} rows x {analysis.shape[1]} columns')
print(f'Spending dataset : {spending.shape[0]:,} rows x {spending.shape[1]} columns')
print()
print('Outcome distribution:')
print(analysis['clean_loan_approved'].value_counts(dropna=False))
print()
print('Gender distribution:')
print(analysis['clean_gender'].value_counts(dropna=False))
print()
print('Age band distribution:')
print(analysis['age_band'].value_counts(dropna=False).sort_index())

## 2. Gender Disparate Impact

$$DI = \frac{\text{approval rate (Female)}}{\text{approval rate (Male)}}$$

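The **four-fifths rule** flags DI < 0.80 as potential disparate impact against the unprivileged group (here Female applicants, with Male applicants as the reference group).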
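
As a minimal standalone sketch of the ratio above, the following computes DI and the four-fifths flag from raw counts. It is not the implementation in `src/bias.py`, which is not shown here. The function name `disparate_impact_from_counts`, the sign convention for the parity difference (unprivileged minus privileged), and the example counts are all assumptions made for illustration.

```python
def disparate_impact_from_counts(unpriv_approved, unpriv_total,
                                 priv_approved, priv_total,
                                 threshold=0.80):
    """Return approval rates, the DI ratio, and the four-fifths flag."""
    if unpriv_total <= 0 or priv_total <= 0:
        raise ValueError("group sizes must be positive")
    unpriv_rate = unpriv_approved / unpriv_total
    priv_rate = priv_approved / priv_total
    if priv_rate == 0:
        raise ValueError("privileged approval rate is zero; DI is undefined")
    di = unpriv_rate / priv_rate
    return {
        "unprivileged_rate": unpriv_rate,
        "privileged_rate": priv_rate,
        "disparate_impact": di,
        # Assumed convention: unprivileged rate minus privileged rate.
        "demographic_parity_difference": unpriv_rate - priv_rate,
        "four_fifths_flag": di < threshold,
    }


# Hypothetical counts for illustration only (not NovaCred results).
example = disparate_impact_from_counts(unpriv_approved=30, unpriv_total=100,
                                       priv_approved=50, priv_total=100)
print(example)
```

With these toy counts the unprivileged rate is 0.30 and the privileged rate is 0.50, so the ratio falls below the 0.80 threshold and the flag is raised.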

In [None]:
gender_tbl = bias.gender_approval_table(analysis)
print('Approval rates by gender:')
print(gender_tbl.to_string(index=False))

In [None]:
# Build the gender subset once and reuse it for both tests.
gender_df   = bias.gender_subset(analysis)
gender_di   = bias.disparate_impact(gender_df, 'clean_gender', 'Male', 'Female')
chi2_gender = bias.chi2_test(gender_df, 'clean_gender')

print('=== GENDER DISPARATE IMPACT ===')
print(f"  Male   approval rate : {gender_di['privileged_rate']:.4f}  (n={gender_di['privileged_n']:,})")
print(f"  Female approval rate : {gender_di['unprivileged_rate']:.4f}  (n={gender_di['unprivileged_n']:,})")
print(f"  Disparate Impact     : {gender_di['disparate_impact']:.4f}")
print(f"  Four-fifths flag     : {gender_di['four_fifths_flag']}  (threshold < 0.80)")
print(f"  Dem. Parity Diff.    : {gender_di['demographic_parity_difference']:+.4f}")
print(f"  Chi-sq p-value       : {chi2_gender['p_value']:.4f}  "
      f"({'significant' if chi2_gender['significant_at_05'] else 'not significant'} at alpha=0.05)")

In [None]:
fig = bias.plot_gender_di(gender_tbl, gender_di,
                          save_path=FIGURES_DIR / 'fig1_gender_disparate_impact.png')
print('Saved -> figures/fig1_gender_disparate_impact.png')
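
The chi-square p-value printed above comes from `bias.chi2_test`, whose implementation is not shown. As a rough sketch of what a test of independence between gender and approval involves, the block below computes the Pearson chi-square statistic for a 2x2 table using only the standard library. It assumes no continuity correction, so its result may differ slightly from library routines that apply one (`scipy.stats.chi2_contingency`, for example, applies Yates' correction to 2x2 tables by default). The function name `chi2_2x2` and the counts are hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the table [[a, b], [c, d]].

    Rows are groups; columns are approved / rejected counts.
    No continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {"chi2": chi2, "p_value": p_value,
            "significant_at_05": p_value < 0.05}


# Hypothetical counts: group 1 has 50 approved / 50 rejected,
# group 2 has 30 approved / 70 rejected.
result = chi2_2x2(50, 50, 30, 70)
print(result)
```

A significant chi-square result and a four-fifths flag answer different questions: the first asks whether the gap is unlikely under independence, the second whether the gap is large in ratio terms.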

## 3. Age-Based Bias Patterns

In [None]:
age_tbl   = bias.age_approval_table(analysis)
age_di_df = bias.age_di_table(analysis)
chi2_age  = bias.chi2_test(bias.age_subset(analysis), 'age_band')

print('Approval rate by age band:')
print(age_tbl.to_string(index=False))
print()
print(f"Chi-sq test (age band vs approved): chi2={chi2_age['chi2']:.3f}, "
      f"p={chi2_age['p_value']:.4f} -> "
      f"{'SIGNIFICANT' if chi2_age['significant_at_05'] else 'not significant'} at alpha=0.05")
print()
print(f"DI ratios vs reference band ('{bias.PRIME_AGE_REFERENCE}'):")
print(age_di_df[['unprivileged_group', 'unprivileged_n', 'unprivileged_rate',
                  'disparate_impact', 'four_fifths_flag']].to_string(index=False))

In [None]:
fig = bias.plot_age_approval(age_tbl,
                             save_path=FIGURES_DIR / 'fig2_age_approval_rates.png')
print('Saved -> figures/fig2_age_approval_rates.png')

## 4. Gender × Age Interaction Effects

In [None]:
interaction_tbl = bias.interaction_table(analysis)
print('Approval rate by age band x gender:')
print(interaction_tbl.to_string(index=False))

In [None]:
fig = bias.plot_interaction_heatmap(interaction_tbl,
                                    save_path=FIGURES_DIR / 'fig3_gender_age_heatmap.png')
print('Saved -> figures/fig3_gender_age_heatmap.png')

In [None]:
fig = bias.plot_interaction_bars(interaction_tbl,
                                 save_path=FIGURES_DIR / 'fig4_gender_age_grouped_bars.png')
print('Saved -> figures/fig4_gender_age_grouped_bars.png')

## 5. Proxy Discrimination — Financial Features

Non-protected attributes can act as proxies for gender or age. Mann-Whitney U tests check whether financial feature distributions differ significantly between Male and Female applicants.
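
To make the test concrete, here is a simplified sketch of a two-sided Mann-Whitney U test using the normal approximation. It assumes no tie correction to the variance and no continuity correction, and it counts pairs directly, which is only practical for small samples. The function name `mann_whitney_u` and the income values are hypothetical; `bias.financial_proxy_table` may compute the test differently.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Simplified: no tie correction and no continuity correction.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # U for x: number of (xi, yj) pairs with xi > yj; ties count as 0.5.
    u1 = sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
             for xi in x for yj in y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    # Two-sided tail probability of a standard normal at |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"u_statistic": u1, "z": z, "p_value": p_value,
            "significant_at_05": p_value < 0.05}


# Hypothetical monthly incomes (in thousands), for illustration only.
male_income = [52, 61, 58, 70, 66, 73, 64, 59]
female_income = [48, 55, 50, 62, 57, 53, 60, 51]
mw = mann_whitney_u(male_income, female_income)
print(mw)
```

The test compares rank orderings rather than means, so it suits skewed financial variables such as income. A significant difference by gender marks a feature as a candidate proxy; it does not by itself show that the feature drives approval decisions.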

In [None]:
proxy_tbl = bias.financial_proxy_table(analysis)
print('Financial feature proxy analysis (Male vs Female):')
print(proxy_tbl.to_string(index=False))

In [None]:
corr = bias.credit_age_correlation(analysis)
print('Spearman correlation — age band rank vs credit history months:')
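# Report the sign too, so a significant negative correlation is not mislabelled as 'not significant'.
direction = 'positive' if corr['spearman_rho'] > 0 else 'negative'
verdict = (f'significant {direction} correlation'
           if corr['significant_at_05'] else 'not significant')
print(f"  rho = {corr['spearman_rho']:.4f},  p = {corr['p_value']:.4f}  -> {verdict}")
print()
print('If approval is sensitive to credit history, a significant correlation here '
      'makes credit history an indirect age proxy.')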

In [None]:
fig = bias.plot_financial_boxplots(analysis,
                                   save_path=FIGURES_DIR / 'fig5_financial_features_by_gender.png')
print('Saved -> figures/fig5_financial_features_by_gender.png')
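
The Spearman correlation used above for age band rank versus credit history can be sketched as a Pearson correlation computed on ranks, with tied values sharing their average rank. The block below is a standard-library illustration only: it omits the p-value, and the helper names and toy data are hypothetical rather than taken from `bias.credit_age_correlation`.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: ordinal age band rank and credit history in months.
age_band_rank = [1, 1, 2, 2, 3, 3, 4, 4]
credit_history_months = [12, 30, 24, 48, 60, 55, 96, 120]
rho = spearman_rho(age_band_rank, credit_history_months)
print(f"rho = {rho:.4f}")
```

Working on ranks is what makes the measure appropriate for an ordinal variable such as age band, where the spacing between categories carries no numeric meaning.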

## 6. Proxy Discrimination — Spending Categories

Spending patterns may encode gendered behaviour. A category that correlates with both gender *and* approval is a potential proxy discrimination channel and warrants closer review.
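
One way to screen for such categories, sketched here under stated assumptions, is to correlate each category's spending with a binary gender indicator and with the approval outcome, then flag categories where both correlations are sizeable. The 0.3 cutoff, the function names, the category names, and the toy records are all hypothetical; this is not the method used by `bias.spending_gender_table`.

```python
import math


def pearson(x, y):
    """Pearson correlation; with a 0/1 variable this is the point-biserial r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def screen_proxy_categories(records, categories, cutoff=0.3):
    """Flag categories correlated with BOTH gender and approval."""
    is_female = [1 if r["gender"] == "Female" else 0 for r in records]
    approved = [r["approved"] for r in records]
    flagged = []
    for cat in categories:
        spend = [r["spend"][cat] for r in records]
        r_gender = pearson(spend, is_female)
        r_approval = pearson(spend, approved)
        # The cutoff is an arbitrary illustrative threshold, not a standard.
        if abs(r_gender) >= cutoff and abs(r_approval) >= cutoff:
            flagged.append(cat)
    return flagged


# Toy records for illustration only (not NovaCred data).
rows = [
    ("Female", 0, 300, 400), ("Female", 0, 280, 420),
    ("Female", 1, 250, 380), ("Female", 0, 320, 410),
    ("Male", 1, 50, 400), ("Male", 1, 80, 390),
    ("Male", 0, 100, 420), ("Male", 1, 40, 410),
]
records = [{"gender": g, "approved": a,
            "spend": {"Childcare": c, "Groceries": gr}}
           for g, a, c, gr in rows]
flagged = screen_proxy_categories(records, ["Childcare", "Groceries"])
print(flagged)
```

In the toy data, `Groceries` tracks approval but not gender, so only `Childcare` meets both conditions. A flag from a screen like this is a prompt for further analysis, not evidence of discrimination on its own.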

In [None]:
spending_tbl = bias.spending_gender_table(analysis, spending)

if spending_tbl is not None:
    print('Average monthly spending by category and gender:')
    print(spending_tbl)
    print()
    if 'Female' in spending_tbl.index and 'Male' in spending_tbl.index:
        gap = (spending_tbl.loc['Female'] - spending_tbl.loc['Male']).abs().sort_values(ascending=False)
        print('Top categories by gender spending gap:')
        print(gap.head(8))
else:
    print('Spending category data not available.')

## 7. Interest Rate Disparity (Approved Applicants)

Discrimination can manifest in loan **terms** for approved applicants, not just in approval decisions.
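
As an illustration of what "terms for approved applicants" means in practice, the sketch below restricts the data to approved applications before summarising interest rates by gender. The field names, the function `rate_summary_by_gender`, and the records are hypothetical and stand in for whatever `bias.interest_rate_by_gender` does internally.

```python
from statistics import mean, median


def rate_summary_by_gender(records):
    """Summarise interest rates per gender among approved applicants only."""
    rates = {}
    for rec in records:
        # Rejected applicants have no offered rate, so they are excluded.
        if rec["approved"] and rec["interest_rate"] is not None:
            rates.setdefault(rec["gender"], []).append(rec["interest_rate"])
    return {gender: {"n": len(vals), "median": median(vals), "mean": mean(vals)}
            for gender, vals in rates.items()}


# Toy records for illustration only (not NovaCred data).
records = [
    {"gender": "Male", "approved": True, "interest_rate": 4.0},
    {"gender": "Male", "approved": True, "interest_rate": 5.0},
    {"gender": "Male", "approved": False, "interest_rate": None},
    {"gender": "Female", "approved": True, "interest_rate": 5.0},
    {"gender": "Female", "approved": True, "interest_rate": 6.0},
    {"gender": "Female", "approved": True, "interest_rate": 7.0},
]
rate_summary = rate_summary_by_gender(records)
median_gap = rate_summary["Female"]["median"] - rate_summary["Male"]["median"]
print(rate_summary)
print(f"Median gap (Female - Male): {median_gap:.2f}")
```

Conditioning on approval matters for interpretation: if approval itself is biased, the approved groups may differ in risk profile, so a raw rate gap should be read alongside the financial features examined earlier.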

In [None]:
ir_result = bias.interest_rate_by_gender(analysis)

if ir_result:
    print('Interest rate disparity (approved applicants only):')
    print(f"  Male   median: {ir_result['male_median_rate']:.4f}   mean: {ir_result['male_mean_rate']:.4f}   n={ir_result['male_n']}")
    print(f"  Female median: {ir_result['female_median_rate']:.4f}   mean: {ir_result['female_mean_rate']:.4f}   n={ir_result['female_n']}")
    print(f"  Mann-Whitney p: {ir_result['p_value']:.4f}  "
          f"-> {'significant difference' if ir_result['significant_at_05'] else 'no significant difference'}")
else:
    print('Interest rate data not available.')

In [None]:
fig = bias.plot_interest_rate(analysis,
                              save_path=FIGURES_DIR / 'fig6_interest_rate_by_gender.png')
print('Saved -> figures/fig6_interest_rate_by_gender.png')

## 8. Rejection Reason Breakdown

In [None]:
rejection_tbl = bias.rejection_reason_by_gender(analysis)

if rejection_tbl is not None:
    print('Rejection reasons by gender (top 15):')
    print(rejection_tbl.head(15))
else:
    print('Rejection reason data not available.')

## 9. Fairness Summary

In [None]:
summary = bias.build_fairness_summary(
    gender_di=gender_di,
    chi2_gender=chi2_gender,
    age_di_df=age_di_df,
    chi2_age=chi2_age,
    ir_result=ir_result,
)

print('FAIRNESS SUMMARY')
print('=' * 100)
print(summary.to_string(index=False))
print('=' * 100)

summary.to_csv(QUALITY_DIR / 'fairness_summary.csv', index=False)
print('\nSaved -> data/quality/fairness_summary.csv')