# DoorDash PM Binary Classification Analysis (Table 1 Reproduction - Extension Data)

This notebook reproduces the binary classification results from Section 8 of the paper for the DoorDash PM role.

**Extension data: 50 qualified + 50 unqualified resumes**

We will test three conditions:
1. **U: No LLM, P: GPT-4o** - The unprivileged group has no LLM access; the privileged group uses GPT-4o
2. **U: GPT-3.5, P: GPT-4o** - The unprivileged group uses GPT-3.5; the privileged group uses GPT-4o
3. **U: GPT-4o-mini, P: GPT-4o** - The unprivileged group uses GPT-4o-mini; the privileged group uses GPT-4o

## Step 1: Load Required Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import roc_curve, confusion_matrix
from sklearn.model_selection import train_test_split
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

## Step 2: Load and Merge Qualified and Unqualified Resume Scores

- **Qualified**: PM resumes (n=50) - these ARE qualified for the PM role
- **Unqualified**: UX designer resumes (n=50) - these are NOT qualified for the PM role

In [2]:
# Load qualified resumes (PM resumes)
qualified_original = pd.read_csv('../tpr_calculation_files_extension/Qualified_PM/ScoresDoordash_PM_Original_File_PM.csv')
qualified_original['PM True Label'] = 1  # Qualified for PM role

# Load unqualified resumes (UX resumes)
unqualified_original = pd.read_csv('../tpr_calculation_files_extension/Unqualified_PM/ScoresDoorDash_PM_Original_File_UX.csv')
unqualified_original['PM True Label'] = 0  # Unqualified for PM role

# IMPORTANT: Standardize column names - they differ between files
qualified_cols = [col for col in qualified_original.columns if 'DoorDash' in col or 'Doordash' in col]
unqualified_cols = [col for col in unqualified_original.columns if 'DoorDash' in col or 'Doordash' in col]

if qualified_cols and unqualified_cols:
    qualified_original.rename(columns={qualified_cols[0]: 'CV DoorDash PM Score'}, inplace=True)
    unqualified_original.rename(columns={unqualified_cols[0]: 'CV DoorDash PM Score'}, inplace=True)
    print(f"Standardized column names to: 'CV DoorDash PM Score'")

# FIX: Keep only first 50 qualified resumes to get exactly 100 total
qualified_original = qualified_original.head(50)

print(f"Qualified resumes: {len(qualified_original)}")
print(f"Unqualified resumes: {len(unqualified_original)}")
print("\nQualified sample:")
print(qualified_original.head())
print("\nUnqualified sample:")
print(unqualified_original.head())

Standardized column names to: 'CV DoorDash PM Score'
Qualified resumes: 50
Unqualified resumes: 50

Qualified sample:
   Unnamed: 0  CV DoorDash PM Score  PM True Label
0           0                81.061              1
1           1                84.806              1
2           2                81.328              1
3           3                82.467              1
4           4                81.239              1

Unqualified sample:
   Unnamed: 0  CV DoorDash PM Score  PM True Label
0           0                78.009              0
1           1                78.009              0
2           2                77.359              0
3           3                76.957              0
4           4                78.456              0


## Step 3: Load All Modified Resume Scores

We need to load scores for:
- GPT-3.5 modified
- GPT-4o modified (once)
- GPT-4o-mini modified
- GPT-4o on GPT-3.5 (twice modified)
- GPT-4o on GPT-4o-mini (twice modified)
- GPT-4o on GPT-4o (twice modified)

In [3]:
# Load GPT-3.5 modified scores
qualified_gpt35 = pd.read_csv('../tpr_calculation_files_extension/Qualified_PM/ScoresDoordash_PM_gpt35turbo.csv').head(50)
unqualified_gpt35 = pd.read_csv('../tpr_calculation_files_extension/Unqualified_PM/ScoresDoordash_PM_gpt35turbo.csv')

# Load GPT-4o modified scores (once)
qualified_gpt4o = pd.read_csv('../tpr_calculation_files_extension/Qualified_PM/ScoresDoorDash_PM_gpt4o.csv').head(50)
unqualified_gpt4o = pd.read_csv('../tpr_calculation_files_extension/Unqualified_PM/ScoresDoordash_PM_gpt4o.csv')

# Load GPT-4o-mini modified scores
qualified_gpt4omini = pd.read_csv('../tpr_calculation_files_extension/Qualified_PM/ScoresDoordash_PM_gpt4omini.csv').head(50)
unqualified_gpt4omini = pd.read_csv('../tpr_calculation_files_extension/Unqualified_PM/ScoresDoordash_PM_gpt4omini.csv')

# Load GPT-4o on GPT-3.5 scores
qualified_gpt4o_on_gpt35 = pd.read_csv('../tpr_calculation_files_extension/Qualified_PM/ScoresDoordash_PM_gpt4o_on_gpt35turbo.csv').head(50)
unqualified_gpt4o_on_gpt35 = pd.read_csv('../tpr_calculation_files_extension/Unqualified_PM/ScoresDoordash_PM_gpt4o_on_gpt35turbo.csv')

# Load GPT-4o on GPT-4o-mini scores
qualified_gpt4o_on_gpt4omini = pd.read_csv('../tpr_calculation_files_extension/Qualified_PM/ScoresDoordash_PM_gpt4o_on_gpt4omini.csv').head(50)
unqualified_gpt4o_on_gpt4omini = pd.read_csv('../tpr_calculation_files_extension/Unqualified_PM/ScoresDoordash_PM_gpt4o_on_gpt4omini.csv')

# Load GPT-4o on GPT-4o scores (twice modified)
qualified_gpt4o_on_gpt4o = pd.read_csv('../tpr_calculation_files_extension/Qualified_PM/ScoresDoordash_PM_gpt4o_on_gpt4o.csv').head(50)
unqualified_gpt4o_on_gpt4o = pd.read_csv('../tpr_calculation_files_extension/Unqualified_PM/ScoresDoordash_PM_gpt4o_on_gpt4o.csv')

print("All score files loaded successfully!")
print(f"Each file has {len(qualified_gpt35)} qualified and {len(unqualified_gpt35)} unqualified resumes")

All score files loaded successfully!
Each file has 50 qualified and 50 unqualified resumes


## Step 4: Merge All Scores into Single Dataframe

We'll create one dataframe with all score columns for easier manipulation.
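
The merge in this step attaches each score file by row position rather than by a shared key. A minimal sketch of that pattern with a length guard is shown here; `attach_by_position`, `base`, and `modified` are hypothetical names and toy data, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for an original score file and one modified-score file.
base = pd.DataFrame({"id": [0, 1, 2], "CV Score": [81.0, 84.8, 81.3]})
modified = pd.DataFrame({"id": [0, 1, 2], "score": [83.0, 85.1, 80.9]})

def attach_by_position(base, other, new_col, score_pos=1):
    """Copy the column at position `score_pos` of `other` into `base` as `new_col`."""
    # .values drops the index, so rows are matched purely by their order.
    if len(base) != len(other):
        raise ValueError(f"row count mismatch: {len(base)} vs {len(other)}")
    out = base.copy()
    out[new_col] = other.iloc[:, score_pos].values
    return out

merged = attach_by_position(base, modified, "GPT-4o Score")
print(merged)
```

Positional attachment is only valid if every file lists the resumes in the same order, which is an assumption the notebook makes implicitly.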

In [4]:
# Merge qualified scores
qualified_df = qualified_original.copy()
qualified_df['GPT-3.5 Score'] = qualified_gpt35.iloc[:, 1].values
qualified_df['GPT-4o Score'] = qualified_gpt4o.iloc[:, 1].values
qualified_df['GPT-4o-mini Score'] = qualified_gpt4omini.iloc[:, 1].values
qualified_df['GPT-4o on GPT-3.5 Score'] = qualified_gpt4o_on_gpt35.iloc[:, 1].values
qualified_df['GPT-4o on GPT-4o-mini Score'] = qualified_gpt4o_on_gpt4omini.iloc[:, 1].values
qualified_df['GPT-4o on GPT-4o Score'] = qualified_gpt4o_on_gpt4o.iloc[:, 1].values

# Merge unqualified scores
unqualified_df = unqualified_original.copy()
unqualified_df['GPT-3.5 Score'] = unqualified_gpt35.iloc[:, 1].values
unqualified_df['GPT-4o Score'] = unqualified_gpt4o.iloc[:, 1].values
unqualified_df['GPT-4o-mini Score'] = unqualified_gpt4omini.iloc[:, 1].values
unqualified_df['GPT-4o on GPT-3.5 Score'] = unqualified_gpt4o_on_gpt35.iloc[:, 1].values
unqualified_df['GPT-4o on GPT-4o-mini Score'] = unqualified_gpt4o_on_gpt4omini.iloc[:, 1].values
unqualified_df['GPT-4o on GPT-4o Score'] = unqualified_gpt4o_on_gpt4o.iloc[:, 1].values

# Combine qualified and unqualified
df_combined = pd.concat([qualified_df, unqualified_df], ignore_index=True)

print(f"Combined dataframe shape: {df_combined.shape}")
print(f"Total resumes: {len(df_combined)}")
print(f"\nColumns: {df_combined.columns.tolist()}")
print(f"\nLabel distribution:")
print(df_combined['PM True Label'].value_counts())

Combined dataframe shape: (100, 9)
Total resumes: 100

Columns: ['Unnamed: 0', 'CV DoorDash PM Score', 'PM True Label', 'GPT-3.5 Score', 'GPT-4o Score', 'GPT-4o-mini Score', 'GPT-4o on GPT-3.5 Score', 'GPT-4o on GPT-4o-mini Score', 'GPT-4o on GPT-4o Score']

Label distribution:
PM True Label
1    50
0    50
Name: count, dtype: int64


## Step 5: Randomly Assign "Will Manipulate" Groups

Randomly assign 50 resumes to Privileged group (P, Will Manipulate=True) and 50 to Unprivileged group (U, Will Manipulate=False).

This assignment is independent of whether the resume is qualified or unqualified.
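
The assignment described above can be sketched on a toy frame. The names `toy`, `rng`, and `privileged_idx` are hypothetical, and `np.random.default_rng` is used here for illustration instead of the notebook's global seed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_combined: 6 qualified and 6 unqualified resumes.
toy = pd.DataFrame({"PM True Label": [1] * 6 + [0] * 6})

# Draw half of the row labels at random, without looking at the true label.
rng = np.random.default_rng(0)
privileged_idx = rng.permutation(len(toy))[: len(toy) // 2]

toy["Will Manipulate"] = False
toy.loc[privileged_idx, "Will Manipulate"] = True

# Group sizes are fixed at 6/6, but the label mix inside each group is random.
print(toy["Will Manipulate"].value_counts())
print(pd.crosstab(toy["PM True Label"], toy["Will Manipulate"]))
```

Because the draw ignores the label, the two groups can contain different numbers of qualified resumes, which is worth keeping in mind when reading group-level TPRs.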

In [5]:
# Randomly assign Will Manipulate groups (50/50 split)
np.random.seed(42)
indices = np.arange(len(df_combined))
np.random.shuffle(indices)

# First 50 get Will Manipulate = True, next 50 get False
df_combined['Will Manipulate'] = False
df_combined.loc[indices[:50], 'Will Manipulate'] = True

# Verify the assignment
print("Will Manipulate distribution:")
print(df_combined['Will Manipulate'].value_counts())
print("\nCross-tabulation of True Label vs Will Manipulate:")
print(pd.crosstab(df_combined['PM True Label'], df_combined['Will Manipulate']))
print("\nSample of data with new column:")
print(df_combined[['CV DoorDash PM Score', 'PM True Label', 'Will Manipulate']].head(10))

Will Manipulate distribution:
Will Manipulate
True     50
False    50
Name: count, dtype: int64

Cross-tabulation of True Label vs Will Manipulate:
Will Manipulate  False  True 
PM True Label                
0                   30     20
1                   20     30

Sample of data with new column:
   CV DoorDash PM Score  PM True Label  Will Manipulate
0                81.061              1             True
1                84.806              1            False
2                81.328              1            False
3                82.467              1            False
4                81.239              1             True
5                78.576              1             True
6                81.196              1            False
7                80.252              1             True
8                84.576              1            False
9                79.446              1             True


## Step 6: Define Helper Functions

These functions replicate the methodology from the paper:
1. **Score mapping functions**: Map original and modified scores based on group assignment
2. **Threshold calculation**: Find the threshold whose training-set FPR is closest to a 5% target (relaxed from the paper's No False Positives objective)
3. **Metrics calculation**: Calculate TPR, FNR, Accuracy, and Disparity
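
The threshold step can be illustrated on toy data. This is a sketch under stated assumptions: the scores, labels, and the 20% target are made up so the effect is visible with ten resumes, and only the "closest FPR on the ROC curve" rule is taken from the notebook.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical toy scores: 5 unqualified (label 0) and 5 qualified (label 1).
scores = np.array([70, 72, 74, 76, 90, 75, 80, 82, 85, 95], dtype=float)
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

fpr, tpr, thresholds = roc_curve(labels, scores)

# Pick the ROC point whose FPR is closest to the target.
# np.argmin returns the first match, i.e. the highest such threshold.
target_fpr = 0.2
idx = int(np.argmin(np.abs(fpr - target_fpr)))
threshold = thresholds[idx]

preds = (scores >= threshold).astype(int)
achieved_fpr = preds[labels == 0].mean()
achieved_tpr = preds[labels == 1].mean()
print(threshold, achieved_fpr, achieved_tpr)
```

On this toy data the rule selects a threshold of 90, which gives FPR 0.2 and TPR 0.2. A threshold of 80 would give the same FPR with TPR 0.8, so a rule meant to maximize TPR at a given FPR would need an explicit tie-break.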

In [6]:
# Function 1: Map input scores (Traditional 1-ticket scheme)
def map_input_score(row, group, groups_dict):
    """
    Maps the score that the applicant submits to the hiring system.
    - If Will Manipulate = True (Privileged): returns max(original, privileged_LLM_score)
    - If Will Manipulate = False (Unprivileged): returns max(original, unprivileged_LLM_score)
    """
    if row['Will Manipulate']:
        # Privileged group: choose best between original and their LLM (Input-B)
        return max(row[groups_dict[group]['Input-B']], row[groups_dict[0]])
    else:
        # Unprivileged group: choose best between original and their LLM (Input-A)
        return max(row[groups_dict[group]['Input-A']], row[groups_dict[0]])


# Function 2: Map hirer scores (Two-ticket scheme)
def map_hirer_score(row, group, groups_dict):
    """
    Maps the score after the hirer applies their own LLM to the submitted resume.
    - If Will Manipulate = True (Privileged): returns max(Input-B, Hirer-B), where Hirer-B
      is the hirer's LLM applied to the privileged submission (twice modified)
    - If Will Manipulate = False (Unprivileged): returns max(Input-A, Hirer-A), where Hirer-A
      is the hirer's LLM applied to the unprivileged submission (the original resume in
      Condition 1, an LLM-modified resume in Conditions 2 and 3)
    Unlike map_input_score, the original score is not included in the max.
    """
    if row['Will Manipulate']:
        # Privileged: hirer applies LLM to the already-modified submission (twice modified)
        return max(row[groups_dict[group]['Input-B']], row[groups_dict[group]['Hirer-B']])
    else:
        # Unprivileged: hirer applies LLM to whatever the unprivileged applicant submitted
        return max(row[groups_dict[group]['Input-A']], row[groups_dict[group]['Hirer-A']])


# Function 3: Calculate threshold with relaxed FPR constraint (MODIFIED FOR SMALL DATASETS)
def set_threshold_min_fpr(scores, labels, target_fpr=0.05):
    """
    Return the ROC threshold whose FPR on the given data is closest to target_fpr (default 5%).
    np.argmin returns the first such ROC point, so ties between points with the same FPR
    are not broken in favor of the higher TPR.
    
    NOTE: Changed from strict "No False Positives" (FPR=0) to a 5% FPR target.
    With only 100 samples, strict FPR=0 sets the threshold too high (>90), 
    causing nearly all candidates to be rejected (TPR ≈ 0).
    
    Allowing 5% FPR enables meaningful differentiation between conditions.
    """
    scores = np.array(scores)
    labels = np.array(labels)
    
    # Calculate ROC curve
    fpr, tpr, thresholds = roc_curve(labels, scores)
    
    # Find the threshold where FPR is closest to target
    idx = np.argmin(np.abs(fpr - target_fpr))
    return thresholds[idx]


# Function 4: Calculate disparity between groups
def calculate_disparity(y_true, y_pred, y_manipulate_label):
    """
    Calculate TPR disparity: TPR_privileged - TPR_unprivileged
    """
    # Separate by manipulation group using boolean masks
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    privileged = np.asarray(y_manipulate_label, dtype=bool)
    
    # Calculate confusion matrices; labels=[0, 1] keeps them 2x2 even if a group
    # contains only one class in a given test split
    tn_p, fp_p, fn_p, tp_p = confusion_matrix(y_true[privileged], y_pred[privileged], labels=[0, 1]).ravel()
    tn_u, fp_u, fn_u, tp_u = confusion_matrix(y_true[~privileged], y_pred[~privileged], labels=[0, 1]).ravel()
    
    # Calculate TPRs
    tpr_privileged = tp_p / (tp_p + fn_p) if (tp_p + fn_p) > 0 else 0
    tpr_unprivileged = tp_u / (tp_u + fn_u) if (tp_u + fn_u) > 0 else 0
    
    return tpr_privileged - tpr_unprivileged


# Function 5: Calculate TPR, FNR, and Accuracy
def calculate_tpr_fnr_accuracy(y_true, y_pred):
    """
    Calculate overall TPR, FNR, and Accuracy.
    """
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
    fnr = fn / (fn + tp) if (fn + tp) > 0 else 0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return tpr, fnr, accuracy


print("Helper functions defined successfully!")

Helper functions defined successfully!


## Step 7: Define Experimental Conditions

We'll test three conditions from Table 1:
1. **Condition 1**: Unprivileged (U) = No LLM, Privileged (P) = GPT-4o
2. **Condition 2**: Unprivileged (U) = GPT-3.5, Privileged (P) = GPT-4o
3. **Condition 3**: Unprivileged (U) = GPT-4o-mini, Privileged (P) = GPT-4o
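
To make the column mapping concrete, here is a small sketch of how one resume's scores would be combined under Condition 2. The score values are invented, and `submitted_score` and `two_ticket_score` are hypothetical simplified helpers that mirror the max-of-two-scores logic described in Step 6.

```python
import pandas as pd

# Hypothetical scores for one resume; column names mirror the notebook's.
row = pd.Series({
    "CV DoorDash PM Score": 80.0,
    "GPT-3.5 Score": 78.0,
    "GPT-4o Score": 84.0,
    "GPT-4o on GPT-3.5 Score": 83.0,
    "GPT-4o on GPT-4o Score": 85.0,
})

def submitted_score(row, llm_col, original_col="CV DoorDash PM Score"):
    # Traditional scheme: the applicant keeps the better of original and LLM version.
    return max(row[llm_col], row[original_col])

def two_ticket_score(row, submitted_col, hirer_col):
    # Two-ticket scheme: the hirer keeps the better of the submitted version
    # and its own GPT-4o rewrite of that submission.
    return max(row[submitted_col], row[hirer_col])

# Condition 2: unprivileged uses GPT-3.5, privileged uses GPT-4o.
u_trad = submitted_score(row, "GPT-3.5 Score")
p_trad = submitted_score(row, "GPT-4o Score")
u_two = two_ticket_score(row, "GPT-3.5 Score", "GPT-4o on GPT-3.5 Score")
p_two = two_ticket_score(row, "GPT-4o Score", "GPT-4o on GPT-4o Score")
print(u_trad, p_trad, u_two, p_two)
```

For this invented resume the gap between the two applicants shrinks from 4 points under the traditional scheme to 2 points under the two-ticket scheme; real resumes need not behave this way.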

In [7]:
groups_doordash_pm = {
    0: 'CV DoorDash PM Score',  # Baseline original score
    
    # Condition 1: U: No LLM, P: GPT-4o
    1: {
        'Input-A': 'CV DoorDash PM Score',           # Unprivileged submits original
        'Input-B': 'GPT-4o Score',                 # Privileged submits GPT-4o modified
        'Hirer-A': 'GPT-4o Score',                 # Hirer applies GPT-4o to original
        'Hirer-B': 'GPT-4o on GPT-4o Score'        # Hirer applies GPT-4o to GPT-4o modified
    },
    
    # Condition 2: U: GPT-3.5, P: GPT-4o
    2: {
        'Input-A': 'GPT-3.5 Score',                # Unprivileged submits GPT-3.5 modified
        'Input-B': 'GPT-4o Score',                 # Privileged submits GPT-4o modified
        'Hirer-A': 'GPT-4o on GPT-3.5 Score',      # Hirer applies GPT-4o to GPT-3.5 modified
        'Hirer-B': 'GPT-4o on GPT-4o Score'        # Hirer applies GPT-4o to GPT-4o modified
    },
    
    # Condition 3: U: GPT-4o-mini, P: GPT-4o
    3: {
        'Input-A': 'GPT-4o-mini Score',            # Unprivileged submits GPT-4o-mini modified
        'Input-B': 'GPT-4o Score',                 # Privileged submits GPT-4o modified
        'Hirer-A': 'GPT-4o on GPT-4o-mini Score',  # Hirer applies GPT-4o to GPT-4o-mini modified
        'Hirer-B': 'GPT-4o on GPT-4o Score'        # Hirer applies GPT-4o to GPT-4o modified
    }
}

print("Experimental conditions defined:")
for group_id in [1, 2, 3]:
    print(f"\nCondition {group_id}:")
    print(f"  Unprivileged Input: {groups_doordash_pm[group_id]['Input-A']}")
    print(f"  Privileged Input: {groups_doordash_pm[group_id]['Input-B']}")
    print(f"  Hirer on Unprivileged: {groups_doordash_pm[group_id]['Hirer-A']}")
    print(f"  Hirer on Privileged: {groups_doordash_pm[group_id]['Hirer-B']}")

Experimental conditions defined:

Condition 1:
  Unprivileged Input: CV DoorDash PM Score
  Privileged Input: GPT-4o Score
  Hirer on Unprivileged: GPT-4o Score
  Hirer on Privileged: GPT-4o on GPT-4o Score

Condition 2:
  Unprivileged Input: GPT-3.5 Score
  Privileged Input: GPT-4o Score
  Hirer on Unprivileged: GPT-4o on GPT-3.5 Score
  Hirer on Privileged: GPT-4o on GPT-4o Score

Condition 3:
  Unprivileged Input: GPT-4o-mini Score
  Privileged Input: GPT-4o Score
  Hirer on Unprivileged: GPT-4o on GPT-4o-mini Score
  Hirer on Privileged: GPT-4o on GPT-4o Score


## Step 8: Run 500-Iteration Experiment

For each condition, we'll:
1. Run 500 iterations with random 70/30 train-test splits
2. Test both Traditional (1-ticket) and Two-ticket hiring schemes
3. Calculate TPR, TPR Disparity, FNR, and Accuracy for each iteration
4. Store results for statistical analysis

**Note**: With 100 total resumes and a stratified 70/30 split, each test set contains 30 resumes (15 qualified, 15 unqualified).

**IMPORTANT**: Due to the small dataset (100 samples vs 520 in the original study), we use a **relaxed FPR constraint (5%)** instead of strict "No False Positives" (FPR=0%). With only 4 qualified resumes scoring >90, the strict constraint causes nearly all candidates to be rejected (TPR ≈ 0%), making conditions indistinguishable.
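
The effect of the strict constraint can be seen on a toy example. This is an illustrative sketch only: the scores are invented, and the thresholds are computed directly from the sorted negatives (with a strict `>` comparison) rather than through the notebook's ROC-based helper.

```python
import numpy as np

# Hypothetical training scores; one unqualified resume is an outlier at 91.
neg = np.array([76.0, 77.0, 78.0, 78.5, 91.0])   # unqualified
pos = np.array([80.0, 81.0, 82.0, 84.0, 92.0])   # qualified

# Strict "No False Positives": accept only scores above every negative.
strict_threshold = neg.max()
strict_tpr = (pos > strict_threshold).mean()

# Relaxed: tolerate one false positive out of five (FPR = 0.2 on this toy set).
relaxed_threshold = np.sort(neg)[-2]
relaxed_tpr = (pos > relaxed_threshold).mean()
relaxed_fpr = (neg > relaxed_threshold).mean()

print(strict_threshold, strict_tpr)
print(relaxed_threshold, relaxed_tpr, relaxed_fpr)
```

A single high-scoring unqualified resume pins the strict threshold at 91 and leaves a TPR of 0.2, while accepting that one false positive lowers the threshold to 78.5 and recovers every qualified resume in this toy set.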

In [8]:
# Initialize results storage
results = {
    'Condition': [],
    'Scheme': [],
    'Iteration': [],
    'TPR': [],
    'TPR_Disparity': [],
    'FNR': [],
    'Accuracy': []
}

num_iterations = 500
test_size = 0.3

print("Running experiments...")
print(f"Total iterations per condition: {num_iterations}")
print(f"Train-test split: {int((1-test_size)*100)}/{int(test_size*100)}\n")

# Run experiments for each condition
for group_id in [1, 2, 3]:
    condition_name = f"Condition {group_id}"
    print(f"\n{'='*60}")
    print(f"Running {condition_name}")
    print(f"{'='*60}")
    
    for iteration in range(num_iterations):
        if (iteration + 1) % 100 == 0:
            print(f"  Iteration {iteration + 1}/{num_iterations}")
        
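        # Split data
        train_df, test_df = train_test_split(
            df_combined, 
            test_size=test_size, 
            random_state=42 + iteration,
            stratify=df_combined['PM True Label']
        )
        # Work on copies so the score columns added below do not trigger
        # pandas' SettingWithCopyWarning
        train_df, test_df = train_df.copy(), test_df.copy()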
        
        # Get true labels
        y_train = train_df['PM True Label'].values
        y_test = test_df['PM True Label'].values
        y_test_manipulate = test_df['Will Manipulate'].values
        
        # === Traditional Scheme (1-ticket) ===
        train_df['Input Score'] = train_df.apply(
            lambda row: map_input_score(row, group_id, groups_doordash_pm), axis=1
        )
        test_df['Input Score'] = test_df.apply(
            lambda row: map_input_score(row, group_id, groups_doordash_pm), axis=1
        )
        
        X_train_trad = train_df['Input Score'].values
        X_test_trad = test_df['Input Score'].values
        
        # Find threshold and make predictions
        threshold_trad = set_threshold_min_fpr(X_train_trad, y_train)
        y_pred_trad = (X_test_trad >= threshold_trad).astype(int)
        
        # Calculate metrics
        tpr_trad, fnr_trad, acc_trad = calculate_tpr_fnr_accuracy(y_test, y_pred_trad)
        disparity_trad = calculate_disparity(y_test, y_pred_trad, y_test_manipulate)
        
        # Store results
        results['Condition'].append(condition_name)
        results['Scheme'].append('Traditional')
        results['Iteration'].append(iteration)
        results['TPR'].append(tpr_trad)
        results['TPR_Disparity'].append(disparity_trad)
        results['FNR'].append(fnr_trad)
        results['Accuracy'].append(acc_trad)
        
        # === Two-Ticket Scheme ===
        train_df['Hirer Score'] = train_df.apply(
            lambda row: map_hirer_score(row, group_id, groups_doordash_pm), axis=1
        )
        test_df['Hirer Score'] = test_df.apply(
            lambda row: map_hirer_score(row, group_id, groups_doordash_pm), axis=1
        )
        
        X_train_two = train_df['Hirer Score'].values
        X_test_two = test_df['Hirer Score'].values
        
        # Find threshold and make predictions
        threshold_two = set_threshold_min_fpr(X_train_two, y_train)
        y_pred_two = (X_test_two >= threshold_two).astype(int)
        
        # Calculate metrics
        tpr_two, fnr_two, acc_two = calculate_tpr_fnr_accuracy(y_test, y_pred_two)
        disparity_two = calculate_disparity(y_test, y_pred_two, y_test_manipulate)
        
        # Store results
        results['Condition'].append(condition_name)
        results['Scheme'].append('Two-Ticket')
        results['Iteration'].append(iteration)
        results['TPR'].append(tpr_two)
        results['TPR_Disparity'].append(disparity_two)
        results['FNR'].append(fnr_two)
        results['Accuracy'].append(acc_two)

# Convert to DataFrame
results_df = pd.DataFrame(results)

print(f"\n{'='*60}")
print("Experiment completed!")
print(f"Total results: {len(results_df)} rows")
print(f"  - 3 conditions × 2 schemes × {num_iterations} iterations")
print(f"{'='*60}")

Running experiments...
Total iterations per condition: 500
Train-test split: 70/30


Running Condition 1
  Iteration 100/500
  Iteration 200/500
  Iteration 300/500
  Iteration 400/500
  Iteration 500/500

Running Condition 2
  Iteration 100/500
  Iteration 200/500
  Iteration 300/500
  Iteration 400/500
  Iteration 500/500

Running Condition 3
  Iteration 100/500
  Iteration 200/500
  Iteration 300/500
  Iteration 400/500
  Iteration 500/500

Experiment completed!
Total results: 3000 rows
  - 3 conditions × 2 schemes × 500 iterations


## Step 9: Generate Table 1 Results

Calculate the mean TPR and TPR Disparity across the 500 iterations, with t-based 95% confidence intervals, for each condition and scheme.
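
The interval computation can be sketched on a handful of values. The `tpr_values` array is invented for illustration; the notebook applies the same calls to 500 per-iteration values.

```python
import numpy as np
from scipy import stats

# Hypothetical per-iteration TPR values.
tpr_values = np.array([0.40, 0.47, 0.53, 0.60, 0.47, 0.53])

mean = tpr_values.mean()
sem = stats.sem(tpr_values)          # sample std (ddof=1) divided by sqrt(n)
dof = len(tpr_values) - 1

# Two-sided 95% interval written out by hand: mean +/- t_crit * sem.
t_crit = stats.t.ppf(0.975, dof)
lower, upper = mean - t_crit * sem, mean + t_crit * sem

# The same interval via scipy's helper.
ci = stats.t.interval(0.95, dof, loc=mean, scale=sem)
print((lower, upper), ci)
```

All iterations resample the same 100 resumes, so an interval of this kind describes how stable the mean is across splits rather than uncertainty about performance on new resumes.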

In [9]:
# Calculate statistics for each condition and scheme
table1_results = []

for condition in ['Condition 1', 'Condition 2', 'Condition 3']:
    for scheme in ['Traditional', 'Two-Ticket']:
        # Filter results for this condition and scheme
        mask = (results_df['Condition'] == condition) & (results_df['Scheme'] == scheme)
        subset = results_df[mask]
        
        # Calculate mean and 95% CI for TPR
        tpr_values = subset['TPR'].values
        tpr_mean = np.mean(tpr_values)
        tpr_ci = stats.t.interval(0.95, len(tpr_values)-1, 
                                   loc=tpr_mean, 
                                   scale=stats.sem(tpr_values))
        
        # Calculate mean and 95% CI for TPR Disparity
        disparity_values = subset['TPR_Disparity'].values
        disparity_mean = np.mean(disparity_values)
        disparity_ci = stats.t.interval(0.95, len(disparity_values)-1,
                                        loc=disparity_mean,
                                        scale=stats.sem(disparity_values))
        
        # Store results
        table1_results.append({
            'Condition': condition,
            'Scheme': scheme,
            'TPR_Mean': tpr_mean,
            'TPR_CI_Lower': tpr_ci[0],
            'TPR_CI_Upper': tpr_ci[1],
            'TPR_Disparity_Mean': disparity_mean,
            'TPR_Disparity_CI_Lower': disparity_ci[0],
            'TPR_Disparity_CI_Upper': disparity_ci[1]
        })

# Create summary table
table1_df = pd.DataFrame(table1_results)

# Format for display
print("\n" + "="*80)
print("TABLE 1 RESULTS: DoorDash PM Binary Classification (Extension Data)")
print("="*80)
print(f"\nRole: DoorDash PM")
print(f"Qualified Resumes: 50 PM resumes")
print(f"Unqualified Resumes: 50 UX Designer resumes")
print(f"Iterations: {num_iterations}")
print(f"\nObjective: Relaxed No False Positives (threshold at training FPR closest to 5%)")
print("\n" + "-"*80)

for _, row in table1_df.iterrows():
    print(f"\n{row['Condition']}: {row['Scheme']}")
    print(f"  TPR: {row['TPR_Mean']:.4f} (95% CI: [{row['TPR_CI_Lower']:.4f}, {row['TPR_CI_Upper']:.4f}])")
    print(f"  TPR Disparity: {row['TPR_Disparity_Mean']:.4f} (95% CI: [{row['TPR_Disparity_CI_Lower']:.4f}, {row['TPR_Disparity_CI_Upper']:.4f}])")

print("\n" + "="*80)

# Display as formatted DataFrame
print("\n\nFormatted Table:")
display_df = table1_df.copy()
display_df['TPR'] = display_df.apply(
    lambda row: f"{row['TPR_Mean']:.4f} [{row['TPR_CI_Lower']:.4f}, {row['TPR_CI_Upper']:.4f}]", 
    axis=1
)
display_df['TPR Disparity'] = display_df.apply(
    lambda row: f"{row['TPR_Disparity_Mean']:.4f} [{row['TPR_Disparity_CI_Lower']:.4f}, {row['TPR_Disparity_CI_Upper']:.4f}]",
    axis=1
)
print(display_df[['Condition', 'Scheme', 'TPR', 'TPR Disparity']].to_string(index=False))


TABLE 1 RESULTS: DoorDash PM Binary Classification (Extension Data)

Role: DoorDash PM
Qualified Resumes: 50 PM resumes
Unqualified Resumes: 50 UX Designer resumes
Iterations: 500

Objective: Relaxed No False Positives (threshold at training FPR closest to 5%)

--------------------------------------------------------------------------------

Condition 1: Traditional
  TPR: 0.4933 (95% CI: [0.4814, 0.5053])
  TPR Disparity: 0.4872 (95% CI: [0.4692, 0.5051])

Condition 1: Two-Ticket
  TPR: 0.6141 (95% CI: [0.6026, 0.6256])
  TPR Disparity: 0.0372 (95% CI: [0.0162, 0.0581])

Condition 2: Traditional
  TPR: 0.5064 (95% CI: [0.4948, 0.5180])
  TPR Disparity: 0.3807 (95% CI: [0.3613, 0.4001])

Condition 2: Two-Ticket
  TPR: 0.3907 (95% CI: [0.3782, 0.4031])
  TPR Disparity: 0.5044 (95% CI: [0.4885, 0.5204])

Condition 3: Traditional
  TPR: 0.5064 (95% CI: [0.4923, 0.5205])
  TPR Disparity: 0.1832 (95% CI: [0.1603, 0.2060])

Condition 3: Two-Ticket
  TPR: 0.5393 (95% CI: [0.5273, 0.5514])
  TPR Disparity: 0

## Step 10: Detailed Statistics Table

Generate comprehensive statistics comparing Traditional (1-ticket) vs Two-Ticket schemes.
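
The paired comparison in this step can be sketched with a few invented values. `trad` and `two` are hypothetical per-iteration TPRs for the two schemes, aligned so that position `i` in both series comes from the same train-test split.

```python
import pandas as pd

# Hypothetical per-iteration TPRs for the two schemes (same splits, same order).
trad = pd.Series([0.40, 0.53, 0.47, 0.60])
two = pd.Series([0.53, 0.53, 0.60, 0.53])

improvement = two - trad                      # paired difference per iteration
win_rate = (improvement > 0).mean()           # share of iterations where two-ticket is strictly higher
relative_gain = improvement.mean() / trad.mean() * 100  # mean gain as % of the traditional mean

print(improvement.tolist(), win_rate, relative_gain)
```

The strict `>` means ties are not counted as wins. With 15 qualified resumes per test split, TPR moves in steps of 1/15, so ties can be frequent and the win percentages should be read with that in mind.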

In [10]:
# Create detailed comparison for each condition
for group_id in [1, 2, 3]:
    condition_name = f"Condition {group_id}"
    
    # Filter data
    trad_mask = (results_df['Condition'] == condition_name) & (results_df['Scheme'] == 'Traditional')
    two_mask = (results_df['Condition'] == condition_name) & (results_df['Scheme'] == 'Two-Ticket')
    
    trad_data = results_df[trad_mask].reset_index(drop=True)
    two_data = results_df[two_mask].reset_index(drop=True)
    
    # Create comparison DataFrame
    comparison_df = pd.DataFrame({
        'test_accuracy_1ticket': trad_data['Accuracy'],
        'test_accuracy_2ticket': two_data['Accuracy'],
        'test_accuracy_improvement': two_data['Accuracy'] - trad_data['Accuracy'],
        'test_tpr_1ticket': trad_data['TPR'],
        'test_tpr_2ticket': two_data['TPR'],
        'tpr_improvement': two_data['TPR'] - trad_data['TPR'],
        'test_disparity_1ticket': trad_data['TPR_Disparity'],
        'test_disparity_2_ticket': two_data['TPR_Disparity'],
        'disparity_decrease_2_1': trad_data['TPR_Disparity'] - two_data['TPR_Disparity'],
        'test_fnr_1ticket': trad_data['FNR'],
        'test_fnr_2ticket': two_data['FNR']
    })
    
    # Display statistics
    print("="*100)
    print(f"{condition_name}: DETAILED STATISTICS")
    print("="*100)
    print(f"\nComparison of Traditional (1-ticket) vs Two-Ticket Schemes")
    print(f"Based on {len(comparison_df)} iterations\n")
    
    print(comparison_df.describe().to_string())
    
    print("\n" + "="*100)
    print(f"\nKEY INSIGHTS FOR {condition_name}:")
    print("-"*100)
    pct_increase = (comparison_df['tpr_improvement'].mean()/comparison_df['test_tpr_1ticket'].mean()*100) if comparison_df['test_tpr_1ticket'].mean() > 0 else 0
    print(f"TPR Improvement (mean): {comparison_df['tpr_improvement'].mean():.4f} ({pct_increase:.1f}% increase)")
    print(f"Accuracy Improvement (mean): {comparison_df['test_accuracy_improvement'].mean():.4f}")
    print(f"Disparity Decrease (mean): {comparison_df['disparity_decrease_2_1'].mean():.4f}")
    print(f"  - Traditional disparity: {comparison_df['test_disparity_1ticket'].mean():.4f}")
    print(f"  - Two-Ticket disparity: {comparison_df['test_disparity_2_ticket'].mean():.4f}")
    
    # Check how often two-ticket is better
    tpr_better = (comparison_df['tpr_improvement'] > 0).sum()
    acc_better = (comparison_df['test_accuracy_improvement'] > 0).sum()
    disp_better = (comparison_df['disparity_decrease_2_1'] > 0).sum()
    
    print(f"\nTwo-Ticket wins:")
    print(f"  - Higher TPR: {tpr_better}/{len(comparison_df)} iterations ({tpr_better/len(comparison_df)*100:.1f}%)")
    print(f"  - Higher Accuracy: {acc_better}/{len(comparison_df)} iterations ({acc_better/len(comparison_df)*100:.1f}%)")
    print(f"  - Lower Disparity: {disp_better}/{len(comparison_df)} iterations ({disp_better/len(comparison_df)*100:.1f}%)")
    print("\n")

Condition 1: DETAILED STATISTICS

Comparison of Traditional (1-ticket) vs Two-Ticket Schemes
Based on 500 iterations

       test_accuracy_1ticket  test_accuracy_2ticket  test_accuracy_improvement  test_tpr_1ticket  test_tpr_2ticket  tpr_improvement  test_disparity_1ticket  test_disparity_2_ticket  disparity_decrease_2_1  test_fnr_1ticket  test_fnr_2ticket
count             500.000000             500.000000                 500.000000        500.000000        500.000000       500.000000              500.000000               500.000000              500.000000        500.000000        500.000000
mean                0.723933               0.783667                   0.059733          0.493333          0.614133         0.120800                0.487188                 0.037155                0.450033          0.506667          0.385867
std                 0.066252               0.059257                   0.063701          0.135717          0.130958         0.140500                0.204241    