# Performance Gap Metrics

**Category Focus:** *"Measuring specific types of errors"*

This notebook explores Performance Gap Metrics that identify exactly where your model makes unfair mistakes, enabling targeted fixes. These metrics help diagnose and fix **specific bias problems** by measuring differences in error rates across demographic groups.

## Metrics in This Category

1. **FNR Difference** - Missing qualified candidates (talent loss)
2. **FOR Difference** - Unreliable rejections (negative predictions that turn out to be wrong)
3. **FPR Difference** - Wrongly approving people (risk exposure)

---

## 📚 Descriptive Analysis

### False Negative Rate (FNR) Difference
**Definition:** Measures how much the share of qualified candidates the model misses (actual positives predicted as negative) differs between demographic groups.

**Formula:** |FNR_group1 - FNR_group2|
- Where FNR = False Negatives / (True Positives + False Negatives)
- FNR = 1 - True Positive Rate (TPR)
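
As a minimal sketch of the formula above, the following computes FNR for two groups and takes the absolute difference. The toy labels and the helper name `false_negative_rate` are hypothetical and exist only for illustration.

```python
def false_negative_rate(y_true, y_pred):
    """FNR = FN / (TP + FN); returns 0.0 when the group has no actual positives."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return fn / (tp + fn) if (tp + fn) > 0 else 0.0

# Hypothetical toy data: four qualified candidates (label 1) in each group.
group1_true, group1_pred = [1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0]  # 1 of 4 missed
group2_true, group2_pred = [1, 1, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1]  # 2 of 4 missed

fnr_1 = false_negative_rate(group1_true, group1_pred)
fnr_2 = false_negative_rate(group2_true, group2_pred)
fnr_gap = abs(fnr_1 - fnr_2)
print(f"FNR group 1: {fnr_1:.2f}, FNR group 2: {fnr_2:.2f}, gap: {fnr_gap:.2f}")
# prints: FNR group 1: 0.25, FNR group 2: 0.50, gap: 0.25
```

Only rows with a true label of 1 enter the denominator, so the predictions made for unqualified candidates do not affect FNR.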

**Business Meaning:** 
- Identifies if the model systematically misses qualified candidates from specific groups
- High FNR difference = talent loss concentrated in whichever group has the higher FNR, often an underrepresented one
- Directly impacts diversity in hiring and opportunity access

**Diagnostic Value:**
- Reveals hidden bias in qualification recognition
- Shows where model training data may be incomplete
- Indicates need for threshold adjustment or feature engineering

**When to Prioritize:**
- Talent acquisition where missing good candidates is costly
- Medical diagnosis where false negatives have serious consequences
- Any domain where opportunity access is critical

---

### False Omission Rate (FOR) Difference
**Definition:** Measures how much the share of rejections that turn out to be wrong (negative predictions made for actual positives) differs between demographic groups.

**Formula:** |FOR_group1 - FOR_group2|
- Where FOR = False Negatives / (True Negatives + False Negatives)
- FOR = Proportion of false negatives among all negative predictions
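
FOR and FNR share the same numerator (false negatives) but divide by different totals, so one gap can appear without the other. The sketch below illustrates this with hypothetical confusion-matrix counts; the function names are illustrative, not part of any library.

```python
def false_omission_rate(tn, fn):
    """FOR = FN / (TN + FN): the share of negative predictions that are wrong."""
    return fn / (tn + fn) if (tn + fn) > 0 else 0.0

def false_negative_rate_from_counts(tp, fn):
    """FNR = FN / (TP + FN): the share of actual positives that are missed."""
    return fn / (tp + fn) if (tp + fn) > 0 else 0.0

# Hypothetical counts: both groups miss 10 of 40 actual positives,
# but group 2 has far fewer true negatives among its rejections.
group1 = {"tp": 30, "fn": 10, "tn": 90}
group2 = {"tp": 30, "fn": 10, "tn": 40}

fnr_gap = abs(false_negative_rate_from_counts(group1["tp"], group1["fn"])
              - false_negative_rate_from_counts(group2["tp"], group2["fn"]))
for_gap = abs(false_omission_rate(group1["tn"], group1["fn"])
              - false_omission_rate(group2["tn"], group2["fn"]))

print(f"FNR gap: {fnr_gap:.2f}")  # prints: FNR gap: 0.00
print(f"FOR gap: {for_gap:.2f}")  # prints: FOR gap: 0.10
```

In this example a rejection is twice as likely to be wrong for group 2 (10 of 50) as for group 1 (10 of 100), even though both groups have identical FNR.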

**Business Meaning:**
- Identifies if the model systematically under-predicts positive outcomes for specific groups
- High FOR difference = unfair rejection rates leading to missed opportunities
- Measures reliability of negative predictions across groups

**Diagnostic Value:**
- Shows calibration problems in prediction confidence
- Reveals when model is overly conservative for certain groups
- Indicates need for group-specific thresholds

**When to Prioritize:**
- Loan approvals where false rejections harm access to credit
- Educational opportunities where false rejections limit advancement
- Any system where rejection has long-term consequences

---

### False Positive Rate (FPR) Difference
**Definition:** Measures how much the share of unqualified cases the model wrongly approves (actual negatives predicted as positive) differs between demographic groups.

**Formula:** |FPR_group1 - FPR_group2|
- Where FPR = False Positives / (True Negatives + False Positives)
- FPR = 1 - True Negative Rate (TNR)
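
The following sketch tallies confusion-matrix cells with the standard library's `Counter`, checks the identity FPR = 1 - TNR, and computes the gap between two groups. The toy data and helper names are hypothetical.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Tally TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    names = {(1, 1): "TP", (0, 1): "FP", (0, 0): "TN", (1, 0): "FN"}
    return Counter(names[(t, p)] for t, p in zip(y_true, y_pred))

def false_positive_rate(counts):
    """FPR = FP / (FP + TN); returns 0.0 when there are no actual negatives."""
    negatives = counts["FP"] + counts["TN"]
    return counts["FP"] / negatives if negatives > 0 else 0.0

def true_negative_rate(counts):
    """TNR = TN / (FP + TN); returns 0.0 when there are no actual negatives."""
    negatives = counts["FP"] + counts["TN"]
    return counts["TN"] / negatives if negatives > 0 else 0.0

# Hypothetical toy data: four actual negatives (label 0) in each group.
counts_a = confusion_counts([0, 0, 0, 0, 1, 1], [1, 0, 0, 0, 1, 0])  # 1 of 4 wrongly approved
counts_b = confusion_counts([0, 0, 0, 0, 1, 1], [1, 1, 0, 0, 1, 1])  # 2 of 4 wrongly approved

fpr_a = false_positive_rate(counts_a)
fpr_b = false_positive_rate(counts_b)
fpr_gap = abs(fpr_a - fpr_b)

print(f"FPR A: {fpr_a:.2f}, FPR B: {fpr_b:.2f}, gap: {fpr_gap:.2f}")
# prints: FPR A: 0.25, FPR B: 0.50, gap: 0.25
print(fpr_a == 1 - true_negative_rate(counts_a))  # prints: True
```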

**Business Meaning:**
- Identifies if the model systematically over-predicts positive outcomes for specific groups
- High FPR difference = unfair advantage or increased risk exposure
- Measures consistency of risk assessment across groups

**Diagnostic Value:**
- Reveals when model is overly optimistic for certain groups
- Shows potential bias in training data representation
- Indicates need for improved negative case recognition

**When to Prioritize:**
- Credit scoring where false positives increase default risk
- Security systems where false alarms waste resources
- Medical screening where false positives cause unnecessary treatment

---

## 🔧 Diagnostic Framework

| Error Type | What It Reveals | Fix Strategy |
|------------|-----------------|-------------|
| **High FNR Diff** | Missing qualified people from group X | Improve training data, lower thresholds for group X |
| **High FOR Diff** | Poor negative prediction reliability for group X | Calibrate confidence scores, adjust rejection criteria |
| **High FPR Diff** | Too many false approvals for group X | Tighten approval criteria, improve feature representation |

**Key Insight:** Performance Gap metrics are **diagnostic tools** - they tell you exactly what type of bias problem you have so you can fix it systematically.
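
The threshold-based fixes in the table trade one error type for another. The sketch below uses hypothetical scores for a single group to show that lowering the decision threshold reduces FNR while raising FPR. The scores, labels, and function names are assumptions made for illustration, not output from the notebook's model.

```python
def fnr_at_threshold(scores, labels, threshold):
    """FNR when a score >= threshold counts as a positive prediction."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        return 0.0
    return sum(1 for s in positives if s < threshold) / len(positives)

def fpr_at_threshold(scores, labels, threshold):
    """FPR when a score >= threshold counts as a positive prediction."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        return 0.0
    return sum(1 for s in negatives if s >= threshold) / len(negatives)

# Hypothetical predicted probabilities for one group: four actual positives, two actual negatives.
scores_x = [0.35, 0.45, 0.55, 0.65, 0.20, 0.48]
labels_x = [1, 1, 1, 1, 0, 0]

for threshold in (0.5, 0.4):
    fnr = fnr_at_threshold(scores_x, labels_x, threshold)
    fpr = fpr_at_threshold(scores_x, labels_x, threshold)
    print(f"threshold={threshold}: FNR={fnr:.2f}, FPR={fpr:.2f}")
# prints:
# threshold=0.5: FNR=0.50, FPR=0.00
# threshold=0.4: FNR=0.25, FPR=0.50
```

Because closing one gap this way can open another, it helps to recompute all three differences after any threshold change.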

---

## 💻 Computational Analysis

Let's implement and analyze these three performance gap metrics using the Adult Income dataset to diagnose specific bias problems.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from jurity.fairness import BinaryFairnessMetrics
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")
print("Ready to diagnose performance gaps:")
print("• FNR Difference (Missing Talent)")
print("• FOR Difference (False Rejections)")
print("• FPR Difference (False Approvals)")

In [None]:
# Load and prepare the Adult Income dataset
print("=== LOADING DIAGNOSTIC DATASET ===")
print("Context: Performance gap analysis for bias detection")

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
          'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
          'hours_per_week', 'native_country', 'income']

data = pd.read_csv(url, names=columns, skipinitialspace=True)

print(f"Dataset shape: {data.shape}")
print("\nPerformance gap analysis focuses on:")
print("• False Negative Rate differences (missing qualified people)")
print("• False Omission Rate differences (wrong rejections)")
print("• False Positive Rate differences (wrong approvals)")

print("\nTarget and sensitive attribute distribution:")
print(f"Income: {data['income'].value_counts().to_dict()}")
print(f"Gender: {data['sex'].value_counts().to_dict()}")

# Prepare features and target
print("\n=== PREPARING FEATURES ===")

# Create target variable (1 for >50K, 0 for <=50K)
y = (data['income'] == '>50K').astype(int)

# Create sensitive attribute (1 for Female, 0 for Male)
sensitive_attribute = (data['sex'] == 'Female').astype(int)

# Select and encode features
categorical_features = ['workclass', 'education', 'marital_status', 'occupation', 
                       'relationship', 'race', 'native_country']
numerical_features = ['age', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']

# Encode categorical features
X = data[numerical_features].copy()
le = LabelEncoder()
for col in categorical_features:
    X[col] = le.fit_transform(data[col].astype(str))

print(f"Features prepared: {X.shape[1]} features, {X.shape[0]} samples")
print(f"Target distribution: {y.value_counts().to_dict()}")
print(f"Sensitive attribute: {sensitive_attribute.value_counts().to_dict()}")

In [None]:
# Split data and train model
X_train, X_test, y_train, y_test, sensitive_train, sensitive_test = train_test_split(
    X, y, sensitive_attribute, test_size=0.3, random_state=42, stratify=y
)

# Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)
y_pred_proba = rf_model.predict_proba(X_test)[:, 1]

print("=== MODEL PERFORMANCE ===")
print(f"Test set accuracy: {(y_pred == y_test).mean():.3f}")
print(f"Test set size: {len(y_test)}")

# Show basic prediction statistics by group
female_mask = sensitive_test == 1
male_mask = sensitive_test == 0

female_predictions = y_pred[female_mask]
male_predictions = y_pred[male_mask]

print(f"\nPrediction rates by group:")
print(f"Female: {female_predictions.mean():.3f} ({female_predictions.sum()} of {len(female_predictions)})")
print(f"Male: {male_predictions.mean():.3f} ({male_predictions.sum()} of {len(male_predictions)})")

print("\n🔍 Ready for detailed performance gap analysis...")

In [None]:
# Calculate Performance Gap Metrics using Jurity
print("=== PERFORMANCE GAP METRICS ANALYSIS ===")

bfm = BinaryFairnessMetrics()

# FNR Difference (False Negative Rate) - Available in Jurity
fnr_difference = bfm.FNRDifference.get_score(
    labels=y_test.values,
    predictions=y_pred,
    memberships=sensitive_test.values
)

# FOR Difference (False Omission Rate) - Available in Jurity
for_difference = bfm.FORDifference.get_score(
    labels=y_test.values,
    predictions=y_pred,
    memberships=sensitive_test.values
)

# FPR Difference (False Positive Rate) - calculated manually here so each step is visible
# (Jurity reports the FPR gap under the name Predictive Equality rather than "FPR Difference")
def calculate_fpr_difference(y_true, y_pred, sensitive):
    """Calculate False Positive Rate difference manually"""
    results = {}
    for group in [0, 1]:  # 0=Male, 1=Female
        mask = sensitive == group
        y_true_group = y_true[mask]
        y_pred_group = y_pred[mask]
        
        tn = ((y_true_group == 0) & (y_pred_group == 0)).sum()
        fp = ((y_true_group == 0) & (y_pred_group == 1)).sum()
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
        
        group_name = 'Female' if group == 1 else 'Male'
        results[group_name] = fpr
    
    return abs(results['Female'] - results['Male'])

fpr_difference = calculate_fpr_difference(y_test.values, y_pred, sensitive_test.values)

print("🔧 PERFORMANCE GAP RESULTS:")
print(f"FNR Difference: {fnr_difference:.4f} (Missing qualified candidates)")
print(f"FOR Difference: {for_difference:.4f} (False rejections)")
print(f"FPR Difference: {fpr_difference:.4f} (False approvals)")

print("\n📊 INTERPRETATION:")
print("• Higher FNR Difference = More qualified people missed from one group")
print("• Higher FOR Difference = More unreliable negative predictions for one group")
print("• Higher FPR Difference = More false positives for one group")
print("\nNote: Values closer to 0 indicate better fairness across error types")

In [None]:
# Detailed confusion matrix analysis by group
def calculate_detailed_metrics_by_group(y_true, y_pred, sensitive):
    """Calculate detailed error metrics by sensitive group"""
    results = {}
    
    for group in [0, 1]:  # 0=Male, 1=Female
        mask = sensitive == group
        y_true_group = y_true[mask]
        y_pred_group = y_pred[mask]
        
        # Confusion matrix components
        tn = ((y_true_group == 0) & (y_pred_group == 0)).sum()
        tp = ((y_true_group == 1) & (y_pred_group == 1)).sum()
        fn = ((y_true_group == 1) & (y_pred_group == 0)).sum()
        fp = ((y_true_group == 0) & (y_pred_group == 1)).sum()
        
        # Calculate error rates
        fnr = fn / (tp + fn) if (tp + fn) > 0 else 0  # False Negative Rate
        for_rate = fn / (tn + fn) if (tn + fn) > 0 else 0  # False Omission Rate
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0  # False Positive Rate
        
        # Additional metrics for context
        tpr = tp / (tp + fn) if (tp + fn) > 0 else 0  # True Positive Rate
        tnr = tn / (tn + fp) if (tn + fp) > 0 else 0  # True Negative Rate
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tpr
        
        group_name = 'Female' if group == 1 else 'Male'
        results[group_name] = {
            'FNR': fnr, 'FOR': for_rate, 'FPR': fpr,
            'TPR': tpr, 'TNR': tnr,
            'Precision': precision, 'Recall': recall,
            'TP': tp, 'TN': tn, 'FP': fp, 'FN': fn,
            'Total': len(y_true_group),
            'Positive_Cases': (y_true_group == 1).sum(),
            'Negative_Cases': (y_true_group == 0).sum()
        }
    
    return results

# Calculate detailed metrics
group_metrics = calculate_detailed_metrics_by_group(y_test.values, y_pred, sensitive_test.values)

print("=== DETAILED ERROR ANALYSIS BY GROUP ===")
for group, metrics in group_metrics.items():
    print(f"\n{group.upper()} GROUP:")
    print(f"  Sample size: {metrics['Total']} (Positive: {metrics['Positive_Cases']}, Negative: {metrics['Negative_Cases']})")
    print(f"  ")
    print(f"  Error Rates (Performance Gaps):")
    print(f"    False Negative Rate (FNR): {metrics['FNR']:.4f} (Missing qualified people)")
    print(f"    False Omission Rate (FOR):  {metrics['FOR']:.4f} (Wrong rejection rate)")
    print(f"    False Positive Rate (FPR):  {metrics['FPR']:.4f} (Wrong approval rate)")
    print(f"  ")
    print(f"  Context Metrics:")
    print(f"    True Positive Rate (TPR):   {metrics['TPR']:.4f} (Sensitivity/Recall)")
    print(f"    True Negative Rate (TNR):   {metrics['TNR']:.4f} (Specificity)")
    print(f"    Precision:                  {metrics['Precision']:.4f}")

In [None]:
# Manual verification of performance gap calculations
male_metrics = group_metrics['Male']
female_metrics = group_metrics['Female']

# Calculate differences manually
manual_fnr_diff = abs(female_metrics['FNR'] - male_metrics['FNR'])
manual_for_diff = abs(female_metrics['FOR'] - male_metrics['FOR'])
manual_fpr_diff = abs(female_metrics['FPR'] - male_metrics['FPR'])

print("=== MANUAL VERIFICATION OF PERFORMANCE GAP METRICS ===")

print(f"\n🎯 FALSE NEGATIVE RATE (FNR) ANALYSIS:")
print(f"   Male FNR: {male_metrics['FNR']:.4f} ({male_metrics['FN']} missed of {male_metrics['FN'] + male_metrics['TP']} qualified)")
print(f"   Female FNR: {female_metrics['FNR']:.4f} ({female_metrics['FN']} missed of {female_metrics['FN'] + female_metrics['TP']} qualified)")
print(f"   Manual Difference: {manual_fnr_diff:.4f}")
print(f"   Jurity Score: {fnr_difference:.4f}")
# The manual value is an absolute gap; abs() guards against the library reporting a signed difference
print(f"   ✓ Match: {abs(manual_fnr_diff - abs(fnr_difference)) < 0.001}")
print(f"   Impact: {'Higher female FNR' if female_metrics['FNR'] > male_metrics['FNR'] else 'Higher male FNR' if male_metrics['FNR'] > female_metrics['FNR'] else 'Equal FNR'} (missing qualified talent)")

print(f"\n📋 FALSE OMISSION RATE (FOR) ANALYSIS:")
print(f"   Male FOR: {male_metrics['FOR']:.4f} ({male_metrics['FN']} errors of {male_metrics['FN'] + male_metrics['TN']} negative predictions)")
print(f"   Female FOR: {female_metrics['FOR']:.4f} ({female_metrics['FN']} errors of {female_metrics['FN'] + female_metrics['TN']} negative predictions)")
print(f"   Manual Difference: {manual_for_diff:.4f}")
print(f"   Jurity Score: {for_difference:.4f}")
# The manual value is an absolute gap; abs() guards against the library reporting a signed difference
print(f"   ✓ Match: {abs(manual_for_diff - abs(for_difference)) < 0.001}")
print(f"   Impact: {'Higher female FOR' if female_metrics['FOR'] > male_metrics['FOR'] else 'Higher male FOR' if male_metrics['FOR'] > female_metrics['FOR'] else 'Equal FOR'} (unreliable rejections)")

print(f"\n⚠️ FALSE POSITIVE RATE (FPR) ANALYSIS:")
print(f"   Male FPR: {male_metrics['FPR']:.4f} ({male_metrics['FP']} errors of {male_metrics['FP'] + male_metrics['TN']} negative cases)")
print(f"   Female FPR: {female_metrics['FPR']:.4f} ({female_metrics['FP']} errors of {female_metrics['FP'] + female_metrics['TN']} negative cases)")
print(f"   Manual Difference: {manual_fpr_diff:.4f}")
# fpr_difference comes from calculate_fpr_difference above, not from Jurity
print(f"   Helper Function Score: {fpr_difference:.4f}")
print(f"   ✓ Match: {abs(manual_fpr_diff - fpr_difference) < 0.001}")
print(f"   Impact: {'Higher female FPR' if female_metrics['FPR'] > male_metrics['FPR'] else 'Higher male FPR' if male_metrics['FPR'] > female_metrics['FPR'] else 'Equal FPR'} (false approvals)")

print(f"\n📊 PERFORMANCE GAP SUMMARY:")
print(f"   Largest gap: {'FNR' if manual_fnr_diff >= max(manual_for_diff, manual_fpr_diff) else 'FOR' if manual_for_diff >= manual_fpr_diff else 'FPR'} difference ({max(manual_fnr_diff, manual_for_diff, manual_fpr_diff):.4f})")
print(f"   Primary concern: {'Talent loss' if manual_fnr_diff >= max(manual_for_diff, manual_fpr_diff) else 'False rejections' if manual_for_diff >= manual_fpr_diff else 'False approvals'}")

In [None]:
# Create comprehensive performance gap visualization dashboard
fig, axes = plt.subplots(3, 3, figsize=(18, 16))
fig.suptitle('Performance Gap Metrics: Diagnostic Analysis Dashboard', fontsize=16, fontweight='bold')

# 1. Error Rates by Group Comparison
groups = ['Male', 'Female']
fnr_values = [male_metrics['FNR'], female_metrics['FNR']]
for_values = [male_metrics['FOR'], female_metrics['FOR']]
fpr_values = [male_metrics['FPR'], female_metrics['FPR']]

x = np.arange(len(groups))
width = 0.25

axes[0,0].bar(x - width, fnr_values, width, label='FNR (Missing Qualified)', alpha=0.8, color='red')
axes[0,0].bar(x, for_values, width, label='FOR (False Rejections)', alpha=0.8, color='orange')
axes[0,0].bar(x + width, fpr_values, width, label='FPR (False Approvals)', alpha=0.8, color='purple')
axes[0,0].set_xlabel('Demographic Group')
axes[0,0].set_ylabel('Error Rate')
axes[0,0].set_title('Error Rates by Group')
axes[0,0].set_xticks(x)
axes[0,0].set_xticklabels(groups)
axes[0,0].legend()
axes[0,0].grid(True, alpha=0.3)

# 2. FNR Difference Visualization
axes[0,1].bar(groups, fnr_values, color=['lightblue', 'lightcoral'], alpha=0.8)
axes[0,1].set_ylabel('False Negative Rate')
axes[0,1].set_title(f'FNR Difference: {manual_fnr_diff:.4f}\n(Missing Qualified Candidates)')
axes[0,1].grid(True, alpha=0.3)
for i, v in enumerate(fnr_values):
    axes[0,1].text(i, v + 0.005, f'{v:.3f}', ha='center', va='bottom')

# 3. FOR Difference Visualization
axes[0,2].bar(groups, for_values, color=['lightgreen', 'orange'], alpha=0.8)
axes[0,2].set_ylabel('False Omission Rate')
axes[0,2].set_title(f'FOR Difference: {manual_for_diff:.4f}\n(Unreliable Rejections)')
axes[0,2].grid(True, alpha=0.3)
for i, v in enumerate(for_values):
    axes[0,2].text(i, v + 0.002, f'{v:.3f}', ha='center', va='bottom')

# 4. Performance Gap Scores Comparison
# Plot magnitudes so a signed library score cannot draw a bar below zero
gap_scores = [abs(fnr_difference), abs(for_difference), abs(fpr_difference)]
gap_names = ['FNR\nDifference', 'FOR\nDifference', 'FPR\nDifference']
gap_colors = ['red', 'orange', 'purple']

bars = axes[1,0].bar(gap_names, gap_scores, color=gap_colors, alpha=0.8)
axes[1,0].set_ylabel('Performance Gap Score')
axes[1,0].set_title('Performance Gap Metrics\n(Lower = More Fair)')
axes[1,0].grid(True, alpha=0.3)
for bar, score in zip(bars, gap_scores):
    axes[1,0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.002,
                  f'{score:.4f}', ha='center', va='bottom', fontweight='bold')

# 5. FPR Difference Visualization
axes[1,1].bar(groups, fpr_values, color=['plum', 'gold'], alpha=0.8)
axes[1,1].set_ylabel('False Positive Rate')
axes[1,1].set_title(f'FPR Difference: {manual_fpr_diff:.4f}\n(False Approvals)')
axes[1,1].grid(True, alpha=0.3)
for i, v in enumerate(fpr_values):
    axes[1,1].text(i, v + 0.002, f'{v:.3f}', ha='center', va='bottom')

# 6. Error Type Distribution
error_types = ['False Negatives', 'False Positives']
male_errors = [male_metrics['FN'], male_metrics['FP']]
female_errors = [female_metrics['FN'], female_metrics['FP']]

x = np.arange(len(error_types))
width = 0.35

axes[1,2].bar(x - width/2, male_errors, width, label='Male', alpha=0.8, color='lightblue')
axes[1,2].bar(x + width/2, female_errors, width, label='Female', alpha=0.8, color='lightcoral')
axes[1,2].set_ylabel('Number of Errors')
axes[1,2].set_title('Error Counts by Type and Group')
axes[1,2].set_xticks(x)
axes[1,2].set_xticklabels(error_types)
axes[1,2].legend()
axes[1,2].grid(True, alpha=0.3)

# 7. Diagnostic Assessment
def assess_performance_gaps(fnr_diff, for_diff, fpr_diff):
    """Assess performance gap severity and provide diagnostic insights"""
    assessments = []
    
    if fnr_diff > 0.1:
        assessments.append(f"🚨 HIGH FNR GAP ({fnr_diff:.3f}) - Significant talent loss")
    elif fnr_diff > 0.05:
        assessments.append(f"⚠️ MODERATE FNR GAP ({fnr_diff:.3f}) - Monitor talent pipeline")
    else:
        assessments.append(f"✅ LOW FNR GAP ({fnr_diff:.3f}) - Good qualified detection")
    
    if for_diff > 0.1:
        assessments.append(f"🚨 HIGH FOR GAP ({for_diff:.3f}) - Unreliable rejections")
    elif for_diff > 0.05:
        assessments.append(f"⚠️ MODERATE FOR GAP ({for_diff:.3f}) - Check rejection reliability")
    else:
        assessments.append(f"✅ LOW FOR GAP ({for_diff:.3f}) - Reliable negative predictions")
    
    if fpr_diff > 0.1:
        assessments.append(f"🚨 HIGH FPR GAP ({fpr_diff:.3f}) - Inconsistent risk assessment")
    elif fpr_diff > 0.05:
        assessments.append(f"⚠️ MODERATE FPR GAP ({fpr_diff:.3f}) - Monitor false approvals")
    else:
        assessments.append(f"✅ LOW FPR GAP ({fpr_diff:.3f}) - Consistent risk assessment")
    
    return assessments

gap_assessment = assess_performance_gaps(manual_fnr_diff, manual_for_diff, manual_fpr_diff)

axes[2,0].axis('off')
assessment_text = "PERFORMANCE GAP ASSESSMENT:\n\n" + "\n\n".join(gap_assessment)
axes[2,0].text(0.05, 0.95, assessment_text, transform=axes[2,0].transAxes, 
              fontsize=10, verticalalignment='top', fontfamily='monospace',
              bbox=dict(boxstyle="round,pad=0.5", facecolor="lightgray", alpha=0.8))

# 8. Diagnostic Recommendations
axes[2,1].axis('off')
primary_issue = 'FNR' if manual_fnr_diff >= max(manual_for_diff, manual_fpr_diff) else 'FOR' if manual_for_diff >= manual_fpr_diff else 'FPR'

recommendations = {
    'FNR': [
        "PRIMARY ISSUE: Missing Qualified Talent",
        "",
        "FIXES:",
        "• Lower decision thresholds for affected group",
        "• Improve training data representation",
        "• Add features that better capture qualifications",
        "• Consider group-aware modeling",
        "",
        "MONITORING:",
        "• Track qualified candidate pipeline",
        "• Monitor hiring/promotion rates by group"
    ],
    'FOR': [
        "PRIMARY ISSUE: Unreliable Rejections",
        "",
        "FIXES:",
        "• Calibrate prediction confidence scores",
        "• Adjust rejection criteria by group",
        "• Improve model training on negative cases",
        "• Review decision boundary placement",
        "",
        "MONITORING:",
        "• Track prediction reliability by group",
        "• Audit rejected applications regularly"
    ],
    'FPR': [
        "PRIMARY ISSUE: Inconsistent Risk Assessment",
        "",
        "FIXES:",
        "• Tighten approval criteria for affected group",
        "• Improve negative case feature representation",
        "• Balance training data across groups",
        "• Review risk scoring methodology",
        "",
        "MONITORING:",
        "• Track approval success rates by group",
        "• Monitor downstream performance metrics"
    ]
}

rec_text = "\n".join(recommendations[primary_issue])
axes[2,1].text(0.05, 0.95, rec_text, transform=axes[2,1].transAxes,
              fontsize=9, verticalalignment='top', fontfamily='monospace',
              bbox=dict(boxstyle="round,pad=0.5", facecolor="lightyellow", alpha=0.8))

# 9. Implementation Action Plan
axes[2,2].axis('off')
action_plan = (
    "IMPLEMENTATION ACTION PLAN:\n\n"
    f"1. IMMEDIATE (Week 1):\n"
    f"   Largest gap: {primary_issue} ({max(manual_fnr_diff, manual_for_diff, manual_fpr_diff):.3f})\n"
    f"   Focus: {'Talent retention' if primary_issue == 'FNR' else 'Prediction reliability' if primary_issue == 'FOR' else 'Risk consistency'}\n\n"
    f"2. SHORT-TERM (Month 1):\n"
    f"   • Implement threshold adjustments\n"
    f"   • Begin model retraining\n"
    f"   • Set up monitoring dashboards\n\n"
    f"3. LONG-TERM (Quarter 1):\n"
    f"   • Evaluate model architecture changes\n"
    f"   • Implement fairness-aware algorithms\n"
    f"   • Establish ongoing bias audit process\n\n"
    f"SUCCESS METRICS:\n"
    f"   • All gaps < 0.05 (current max: {max(manual_fnr_diff, manual_for_diff, manual_fpr_diff):.3f})\n"
    f"   • Maintained or improved accuracy\n"
    f"   • Stakeholder satisfaction scores"
)
axes[2,2].text(0.05, 0.95, action_plan, transform=axes[2,2].transAxes,
              fontsize=9, verticalalignment='top', fontfamily='monospace',
              bbox=dict(boxstyle="round,pad=0.5", facecolor="lightblue", alpha=0.8))

plt.tight_layout()
plt.show()

## 🔧 Performance Gap Diagnostic Framework

### Understanding Your Results

Performance Gap Metrics provide **actionable diagnostic information** about specific types of bias in your model. Unlike legal compliance or merit-based metrics that focus on overall fairness, performance gap metrics tell you exactly what type of error is causing unfairness.

#### Key Diagnostic Questions:

**High FNR Difference (False Negative Rate):**
- *Question:* "Is our model systematically missing qualified candidates from one demographic group?"
- *Business Impact:* Lost talent, reduced diversity in hiring/promotions
- *Root Cause:* Often insufficient training data for underrepresented groups
- *Fix Strategy:* Lower thresholds, improve data representation, feature engineering

**High FOR Difference (False Omission Rate):**
- *Question:* "Are our negative predictions less reliable for one demographic group?"
- *Business Impact:* Unfair rejections, missed opportunities
- *Root Cause:* Poor model calibration or biased training data
- *Fix Strategy:* Calibrate prediction confidence, adjust decision boundaries

**High FPR Difference (False Positive Rate):**
- *Question:* "Is our model giving unfair advantages to one demographic group?"
- *Business Impact:* Increased risk exposure, unfair resource allocation
- *Root Cause:* Imbalanced representation of negative cases in training
- *Fix Strategy:* Tighten criteria, improve negative case modeling

---

### Implementation Priority Matrix

| Gap Size | Urgency | Action Required |
|----------|---------|----------------|
| **> 0.10** | 🚨 CRITICAL | Immediate model adjustment, halt deployment |
| **0.05-0.10** | ⚠️ HIGH | Address within 30 days, enhanced monitoring |
| **< 0.05** | ✅ NORMAL | Routine monitoring, document for audits |
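
The matrix can be encoded as a small lookup, sketched below. The function name `gap_urgency` is hypothetical, and the boundary handling is an assumption read off the table: a gap of exactly 0.05 or 0.10 falls in the HIGH band. The notebook's `assess_performance_gaps` uses strict `>` comparisons instead, so it labels a gap of exactly 0.05 as low.

```python
def gap_urgency(gap, high=0.05, critical=0.10):
    """Map a performance gap to an urgency level from the priority matrix.

    Uses the gap's magnitude so that signed differences are handled too.
    """
    magnitude = abs(gap)
    if magnitude > critical:
        return "CRITICAL"
    if magnitude >= high:
        return "HIGH"
    return "NORMAL"

# Hypothetical gap values, one per band plus a signed example.
for example_gap in (0.12, 0.07, 0.03, -0.20):
    print(f"{example_gap:+.2f} -> {gap_urgency(example_gap)}")
# prints:
# +0.12 -> CRITICAL
# +0.07 -> HIGH
# +0.03 -> NORMAL
# -0.20 -> CRITICAL
```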

### Business Value of Performance Gap Analysis

1. **Precise Problem Identification:** Know exactly which error type causes bias
2. **Targeted Solutions:** Apply specific fixes rather than general approaches
3. **Resource Efficiency:** Focus improvement efforts where they matter most
4. **Measurable Progress:** Track specific error reductions over time
5. **Stakeholder Communication:** Explain bias issues in concrete, actionable terms

---

## 🎯 When to Use Performance Gap Metrics

**Primary Use Cases:**
- **Model Debugging:** When you know bias exists but need to understand the source
- **Continuous Monitoring:** Ongoing detection of emerging bias patterns
- **A/B Testing:** Comparing bias patterns across different model versions
- **Stakeholder Reporting:** Communicating specific bias issues to technical teams
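
For the A/B testing use case, one possible approach is to compare gap magnitudes metric by metric between a baseline and a candidate model. The sketch below assumes the gaps have already been computed; the values, the tolerance, and the function name `compare_gaps` are hypothetical.

```python
def compare_gaps(baseline, candidate, tolerance=1e-9):
    """Label each metric as improved, regressed, or unchanged by gap magnitude."""
    report = {}
    for metric, baseline_gap in baseline.items():
        delta = abs(candidate[metric]) - abs(baseline_gap)
        if delta < -tolerance:
            status = "improved"
        elif delta > tolerance:
            status = "regressed"
        else:
            status = "unchanged"
        report[metric] = (round(delta, 4), status)
    return report

# Hypothetical gaps for two model versions.
gaps_v1 = {"FNR": 0.12, "FOR": 0.04, "FPR": 0.06}
gaps_v2 = {"FNR": 0.07, "FOR": 0.05, "FPR": 0.03}

report = compare_gaps(gaps_v1, gaps_v2)
for metric, (delta, status) in report.items():
    print(f"{metric}: {delta:+.4f} ({status})")
# prints:
# FNR: -0.0500 (improved)
# FOR: +0.0100 (regressed)
# FPR: -0.0300 (improved)
```

A candidate that improves the largest gap can still regress on another metric, as the FOR row shows, so the comparison is reported per metric rather than as a single score.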

**Integration with Other Metrics:**
- **After Legal Compliance:** Use performance gaps to diagnose why legal metrics fail
- **Before Merit-Based:** Understand error patterns before implementing merit-focused fairness
- **With Business Metrics:** Connect error patterns to business outcomes

**Performance Gap Metrics are your bias debugging toolkit** - use them to understand exactly what's wrong so you can fix it systematically.