# Google UX Designer Binary Classification Analysis (Table 1 Reproduction)

This notebook reproduces the binary classification results from Section 8 of the paper for the Google UX Designer role.

We will test three conditions:
1. **U: No LLMs, P: GPT-4o** - The unprivileged group has no LLM access; the privileged group uses GPT-4o
2. **U: GPT-3.5, P: GPT-4o** - The unprivileged group uses GPT-3.5; the privileged group uses GPT-4o
3. **U: GPT-4o-mini, P: GPT-4o** - The unprivileged group uses GPT-4o-mini; the privileged group uses GPT-4o

## Step 1: Load Required Libraries

In [46]:
import pandas as pd
import numpy as np
from sklearn.metrics import roc_curve, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

## Step 2: Load and Merge Qualified and Unqualified Resume Scores

- **Qualified**: UX designer resumes (n=260) - these ARE qualified for the UX role
- **Unqualified**: PM resumes (n=260) - these are NOT qualified for the UX role

In [47]:
# Load qualified resumes (UX designers)
qualified_original = pd.read_csv('../tpr_calculation_files/Qualified_UX/ScoresGoogle_UX_Original_File.csv')
qualified_original['UX True Label'] = 1  # Qualified for UX role

# Load unqualified resumes (PM resumes)
unqualified_original = pd.read_csv('../tpr_calculation_files/Unqualified_UX/ScoresGoogle_UX_Original_File_pm_resumes.csv')
unqualified_original['UX True Label'] = 0  # Unqualified for UX role

print(f"Qualified resumes: {len(qualified_original)}")
print(f"Unqualified resumes: {len(unqualified_original)}")
print("\nQualified sample:")
print(qualified_original.head())
print("\nUnqualified sample:")
print(unqualified_original.head())

Qualified resumes: 260
Unqualified resumes: 260

Qualified sample:
   Unnamed: 0  CV Google_UX Score  UX True Label
0           0              84.209              1
1           1              84.209              1
2           2              85.607              1
3           3              80.387              1
4           4              88.592              1

Unqualified sample:
   Unnamed: 0  CV Google_UX Score  UX True Label
0           0              78.228              0
1           1              81.609              0
2           2              79.725              0
3           3              80.315              0
4           4              82.063              0


## Step 3: Load All Modified Resume Scores

We need to load scores for:
- GPT-3.5 modified
- GPT-4o modified (once)
- GPT-4o-mini modified
- GPT-4o on GPT-3.5 (twice modified)
- GPT-4o on GPT-4o-mini (twice modified)
- GPT-4o on GPT-4o (twice modified)

In [48]:
# Load GPT-3.5 modified scores
qualified_gpt35 = pd.read_csv('../tpr_calculation_files/Qualified_UX/ScoresGoogle_UX_gpt35turbo.csv')
unqualified_gpt35 = pd.read_csv('../tpr_calculation_files/Unqualified_UX/ScoresGoogle_UX_gpt35turbo.csv')

# Load GPT-4o modified scores (once)
qualified_gpt4o = pd.read_csv('../tpr_calculation_files/Qualified_UX/ScoresGoogle_UX_gpt4o.csv')
unqualified_gpt4o = pd.read_csv('../tpr_calculation_files/Unqualified_UX/ScoresGoogle_UX_gpt4o.csv')

# Load GPT-4o-mini modified scores
qualified_gpt4omini = pd.read_csv('../tpr_calculation_files/Qualified_UX/ScoresGoogle_UX_gpt4omini.csv')
unqualified_gpt4omini = pd.read_csv('../tpr_calculation_files/Unqualified_UX/ScoresGoogle_UX_gpt4omini.csv')

# Load GPT-4o on GPT-3.5 scores
qualified_gpt4o_on_gpt35 = pd.read_csv('../tpr_calculation_files/Qualified_UX/ScoresGoogle_UX_gpt4o_on_gpt35turbo.csv')
unqualified_gpt4o_on_gpt35 = pd.read_csv('../tpr_calculation_files/Unqualified_UX/ScoresGoogle_UX_gpt4o_on_gpt35turbo.csv')

# Load GPT-4o on GPT-4o-mini scores
qualified_gpt4o_on_gpt4omini = pd.read_csv('../tpr_calculation_files/Qualified_UX/ScoresGoogle_UX_gpt4o_on_gpt4omini.csv')
unqualified_gpt4o_on_gpt4omini = pd.read_csv('../tpr_calculation_files/Unqualified_UX/ScoresGoogle_UX_gpt4o_ongpt4omini.csv')

# Load GPT-4o on GPT-4o scores (twice modified)
qualified_gpt4o_on_gpt4o = pd.read_csv('../tpr_calculation_files/Qualified_UX/ScoresGoogle_UX_gpt4o_on_gpt4o.csv')
unqualified_gpt4o_on_gpt4o = pd.read_csv('../tpr_calculation_files/Unqualified_UX/ScoresGoogle_UX_gpt4o_on_gpt4o.csv')

print("All score files loaded successfully!")
print(f"Each file has {len(qualified_gpt35)} qualified and {len(unqualified_gpt35)} unqualified resumes")

All score files loaded successfully!
Each file has 260 qualified and 260 unqualified resumes


## Step 4: Merge All Scores into Single Dataframe

We'll create one dataframe with all score columns for easier manipulation.

In [49]:
# Merge qualified scores
qualified_df = qualified_original.copy()
qualified_df['GPT-3.5 Score'] = qualified_gpt35.iloc[:, 1]  # Second column has the scores
qualified_df['GPT-4o Score'] = qualified_gpt4o.iloc[:, 1]
qualified_df['GPT-4o-mini Score'] = qualified_gpt4omini.iloc[:, 1]
qualified_df['GPT-4o on GPT-3.5 Score'] = qualified_gpt4o_on_gpt35.iloc[:, 1]
qualified_df['GPT-4o on GPT-4o-mini Score'] = qualified_gpt4o_on_gpt4omini.iloc[:, 1]
qualified_df['GPT-4o on GPT-4o Score'] = qualified_gpt4o_on_gpt4o.iloc[:, 1]

# Merge unqualified scores
unqualified_df = unqualified_original.copy()
unqualified_df['GPT-3.5 Score'] = unqualified_gpt35.iloc[:, 1]
unqualified_df['GPT-4o Score'] = unqualified_gpt4o.iloc[:, 1]
unqualified_df['GPT-4o-mini Score'] = unqualified_gpt4omini.iloc[:, 1]
unqualified_df['GPT-4o on GPT-3.5 Score'] = unqualified_gpt4o_on_gpt35.iloc[:, 1]
unqualified_df['GPT-4o on GPT-4o-mini Score'] = unqualified_gpt4o_on_gpt4omini.iloc[:, 1]
unqualified_df['GPT-4o on GPT-4o Score'] = unqualified_gpt4o_on_gpt4o.iloc[:, 1]

# Combine qualified and unqualified
df_combined = pd.concat([qualified_df, unqualified_df], ignore_index=True)

print(f"Combined dataframe shape: {df_combined.shape}")
print(f"Total resumes: {len(df_combined)}")
print(f"\nColumns: {df_combined.columns.tolist()}")
print(f"\nSample data:")
print(df_combined.head())
print(f"\nLabel distribution:")
print(df_combined['UX True Label'].value_counts())

Combined dataframe shape: (520, 9)
Total resumes: 520

Columns: ['Unnamed: 0', 'CV Google_UX Score', 'UX True Label', 'GPT-3.5 Score', 'GPT-4o Score', 'GPT-4o-mini Score', 'GPT-4o on GPT-3.5 Score', 'GPT-4o on GPT-4o-mini Score', 'GPT-4o on GPT-4o Score']

Sample data:
   Unnamed: 0  CV Google_UX Score  UX True Label  GPT-3.5 Score  GPT-4o Score  \
0           0              84.209              1         84.666        83.219   
1           1              84.209              1         85.905        85.236   
2           2              85.607              1         89.395        89.270   
3           3              80.387              1         81.141        82.108   
4           4              88.592              1         86.736        87.710   

   GPT-4o-mini Score  GPT-4o on GPT-3.5 Score  GPT-4o on GPT-4o-mini Score  \
0             83.438                   85.864                       82.949   
1             84.554                   86.135                       83.910   
2        

## Step 5: Randomly Assign "Will Manipulate" Groups

Randomly assign 260 resumes to Privileged group (P, Will Manipulate=True) and 260 to Unprivileged group (U, Will Manipulate=False).

This assignment is independent of whether the resume is qualified or unqualified.
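
The assignment logic can be illustrated without the notebook's data. The following is a minimal standard-library sketch, not the notebook's code: `assign_groups` is a hypothetical helper, and it uses Python's `random` module rather than NumPy, so the particular rows it selects will differ from those chosen by the notebook.

```python
import random
from collections import Counter


def assign_groups(n_rows, n_privileged, seed=42):
    """Return a list of n_rows booleans with exactly n_privileged True values."""
    rng = random.Random(seed)          # local generator, so global state is untouched
    indices = list(range(n_rows))
    rng.shuffle(indices)               # random permutation of the row positions
    chosen = set(indices[:n_privileged])
    return [i in chosen for i in range(n_rows)]


# Hypothetical layout mirroring the notebook: 260 qualified rows, then 260 unqualified.
labels = [1] * 260 + [0] * 260
will_manipulate = assign_groups(len(labels), 260)

# Group sizes are exact; the label mix inside each group is only roughly balanced.
print(Counter(will_manipulate))
print(Counter(zip(labels, will_manipulate)))
```

Because the shuffle ignores the labels, each group ends up with a roughly (not exactly) even mix of qualified and unqualified resumes.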

In [50]:
# Randomly assign Will Manipulate groups (50/50 split)
# Use a fixed random seed for reproducibility
np.random.seed(42)
indices = np.arange(len(df_combined))
np.random.shuffle(indices)

# First 260 get Will Manipulate = True, next 260 get False
df_combined['Will Manipulate'] = False
df_combined.loc[indices[:260], 'Will Manipulate'] = True

# Verify the assignment
print("Will Manipulate distribution:")
print(df_combined['Will Manipulate'].value_counts())
print("\nCross-tabulation of True Label vs Will Manipulate:")
print(pd.crosstab(df_combined['UX True Label'], df_combined['Will Manipulate']))
print("\nSample of data with new column:")
print(df_combined[['CV Google_UX Score', 'UX True Label', 'Will Manipulate']].head(10))

Will Manipulate distribution:
Will Manipulate
True     260
False    260
Name: count, dtype: int64

Cross-tabulation of True Label vs Will Manipulate:
Will Manipulate  False  True 
UX True Label                
0                  135    125
1                  125    135

Sample of data with new column:
   CV Google_UX Score  UX True Label  Will Manipulate
0              84.209              1             True
1              84.209              1            False
2              85.607              1             True
3              80.387              1             True
4              88.592              1            False
5              85.647              1             True
6              84.062              1             True
7              82.276              1             True
8              88.742              1            False
9              84.935              1             True


## Data Preparation Complete!

We now have a complete dataset with:
- **520 resumes total** (260 qualified UX + 260 unqualified PM)
- **True labels**: UX True Label (1=qualified, 0=unqualified)
- **Group assignment**: Will Manipulate (True=Privileged group P, False=Unprivileged group U)
- **7 score columns**: Original, GPT-3.5, GPT-4o, GPT-4o-mini, GPT-4o on GPT-3.5, GPT-4o on GPT-4o-mini, GPT-4o on GPT-4o

Next steps: Define the three experimental conditions and run the binary classification analysis!

## Step 6: Define Helper Functions

These functions replicate the methodology from the paper:
1. **Score mapping functions**: Map original and modified scores based on group assignment
2. **Threshold calculation**: Find optimal threshold with No False Positives objective
3. **Metrics calculation**: Calculate TPR, FNR, Accuracy, and Disparity
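
To make the threshold rule concrete, here is a plain-Python sketch of the idea behind it: take the strictest cut-off at which TPR still reaches a minimum floor. The names `threshold_for_min_tpr` and `false_positive_rate` are hypothetical, and the scores are made up. scikit-learn's `roc_curve` may drop intermediate thresholds, so the notebook's helper could return a slightly different value on real data.

```python
def threshold_for_min_tpr(scores, labels, min_tpr=0.01):
    """Return the highest cut-off whose TPR (score >= cut-off) is at least min_tpr."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("labels must contain at least one positive example")
    for cut in sorted(set(scores), reverse=True):   # strictest cut-off first
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut)
        if tp / n_pos >= min_tpr:
            return cut


def false_positive_rate(scores, labels, cut):
    """Share of negatives (label 0) whose score is at or above the cut-off."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    return sum(s >= cut for s in negatives) / len(negatives)


# Made-up scores: the single highest score belongs to an unqualified resume.
scores = [90, 88, 85, 84, 80, 79]
labels = [0, 1, 1, 0, 1, 0]

cut = threshold_for_min_tpr(scores, labels, min_tpr=0.5)
print(cut, false_positive_rate(scores, labels, cut))
```

On this toy data the rule returns 85 and still admits one of the three unqualified resumes, which illustrates that a TPR floor keeps the FPR low without guaranteeing that it is exactly zero.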

In [51]:
# Function 1: Map input scores (Traditional 1-ticket scheme)
def map_input_score(row, group, groups_dict):
    """
    Maps the score that the applicant submits to the hiring system.
    - If Will Manipulate = True (Privileged): returns max(original, privileged_LLM_score)
    - If Will Manipulate = False (Unprivileged): returns max(original, unprivileged_LLM_score)
    """
    if row['Will Manipulate']:
        # Privileged group: choose best between original and their LLM (Input-B)
        return max(row[groups_dict[group]['Input-B']], row[groups_dict[0]])
    else:
        # Unprivileged group: choose best between original and their LLM (Input-A)
        return max(row[groups_dict[group]['Input-A']], row[groups_dict[0]])


# Function 2: Map hirer scores (Two-ticket scheme)
def map_hirer_score(row, group, groups_dict):
    """
    Maps the score used under the two-ticket scheme: the better of the submitted
    resume's score and the score after the hirer applies their own LLM to it.
    - If Will Manipulate = True: returns max(Input-B, Hirer-B); the submission was already LLM-modified, so Hirer-B is twice modified
    - If Will Manipulate = False: returns max(Input-A, Hirer-A); the submission is the original in Condition 1 and a once-modified resume in Conditions 2 and 3
    """
    if row['Will Manipulate']:
        # Privileged: hirer applies LLM to already-modified resume (twice modified)
        return max(row[groups_dict[group]['Input-B']], row[groups_dict[group]['Hirer-B']])
    else:
        # Unprivileged: hirer applies LLM to whatever this group submitted
        return max(row[groups_dict[group]['Input-A']], row[groups_dict[group]['Hirer-A']])


# Function 3: Calculate threshold with No False Positives objective
def set_threshold_min_fpr(scores, labels, min_tpr=0.01):
    """
    Return the highest ROC threshold at which TPR reaches at least min_tpr.

    roc_curve lists thresholds in decreasing order, so the first qualifying
    point is also the one with the smallest FPR among those that meet the
    TPR floor. This keeps FPR small but does not force it to be exactly 0.
    """
    scores = np.array(scores)
    labels = np.array(labels)
    
    # Calculate ROC curve (thresholds are returned in decreasing order)
    fpr, tpr, thresholds = roc_curve(labels, scores)
    
    # First index where TPR reaches min_tpr, i.e. the strictest qualifying threshold
    valid_idx = np.where(tpr >= min_tpr)[0]
    if len(valid_idx) > 0:
        best_idx = valid_idx[0]
        return thresholds[best_idx]
    else:
        # Fallback if no ROC point reaches min_tpr: use the point with the highest TPR
        best_idx = np.argmax(tpr)
        return thresholds[best_idx]


# Function 4: Calculate disparity between groups
def calculate_disparity(y_true, y_pred, y_manipulate_label):
    """
    Calculate TPR disparity: TPR_privileged - TPR_unprivileged
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    privileged = np.asarray(y_manipulate_label, dtype=bool)
    
    # Confusion matrix per group; labels=[0, 1] keeps the 2x2 shape even if a class is absent
    tn_p, fp_p, fn_p, tp_p = confusion_matrix(y_true[privileged], y_pred[privileged], labels=[0, 1]).ravel()
    tn_u, fp_u, fn_u, tp_u = confusion_matrix(y_true[~privileged], y_pred[~privileged], labels=[0, 1]).ravel()
    
    # Calculate TPRs
    tpr_privileged = tp_p / (tp_p + fn_p) if (tp_p + fn_p) > 0 else 0
    tpr_unprivileged = tp_u / (tp_u + fn_u) if (tp_u + fn_u) > 0 else 0
    
    return tpr_privileged - tpr_unprivileged


# Function 5: Calculate TPR, FNR, and Accuracy
def calculate_tpr_fnr_accuracy(y_true, y_pred):
    """
    Calculate overall TPR, FNR, and Accuracy.
    """
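    # labels=[0, 1] keeps the 2x2 shape even if y_true and y_pred contain a single class
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()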
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
    fnr = fn / (fn + tp) if (fn + tp) > 0 else 0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return tpr, fnr, accuracy


print("Helper functions defined successfully!")

Helper functions defined successfully!


## Step 7: Define Experimental Conditions

We'll test three conditions from Table 1:
1. **Condition 1**: Unprivileged (U) = No LLM, Privileged (P) = GPT-4o
2. **Condition 2**: Unprivileged (U) = GPT-3.5, Privileged (P) = GPT-4o
3. **Condition 3**: Unprivileged (U) = GPT-4o-mini, Privileged (P) = GPT-4o

For each condition, we specify:
- **Input-A**: Score the unprivileged group submits
- **Input-B**: Score the privileged group submits
- **Hirer-A**: Score after hirer applies LLM to unprivileged submission
- **Hirer-B**: Score after hirer applies LLM to privileged submission
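
A small worked example may help show how these four keys turn into the two scores being compared. The sketch below uses one made-up resume under Condition 1; `scheme_scores` is a hypothetical helper that condenses the logic of `map_input_score` and `map_hirer_score`, and the score values are invented for illustration.

```python
# Hypothetical scores for one resume; the keys reuse the notebook's column names.
row = {
    "CV Google_UX Score": 80.0,
    "GPT-4o Score": 83.0,
    "GPT-4o on GPT-4o Score": 82.5,
}

# Condition 1 mapping (U: no LLM, P: GPT-4o).
condition_1 = {
    "Input-A": "CV Google_UX Score",
    "Input-B": "GPT-4o Score",
    "Hirer-A": "GPT-4o Score",
    "Hirer-B": "GPT-4o on GPT-4o Score",
}


def scheme_scores(row, condition, privileged):
    """Return (traditional, two_ticket) scores for one resume."""
    suffix = "B" if privileged else "A"
    submitted = row[condition[f"Input-{suffix}"]]
    hirer = row[condition[f"Hirer-{suffix}"]]
    traditional = max(submitted, row["CV Google_UX Score"])  # best of original and own LLM
    two_ticket = max(submitted, hirer)                        # best of submitted and hirer rewrite
    return traditional, two_ticket


print(scheme_scores(row, condition_1, privileged=True))   # (83.0, 83.0)
print(scheme_scores(row, condition_1, privileged=False))  # (80.0, 83.0)
```

For this invented resume, the traditional scheme leaves a 3-point gap between the groups (83.0 vs 80.0), while the two-ticket scheme gives both groups 83.0 because the hirer's GPT-4o pass is applied to the unprivileged submission as well.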

In [52]:
groups_google_ux = {
    0: 'CV Google_UX Score',  # Baseline original score
    
    # Condition 1: U: No LLM, P: GPT-4o
    1: {
        'Input-A': 'CV Google_UX Score',           # Unprivileged submits original
        'Input-B': 'GPT-4o Score',                 # Privileged submits GPT-4o modified
        'Hirer-A': 'GPT-4o Score',                 # Hirer applies GPT-4o to original
        'Hirer-B': 'GPT-4o on GPT-4o Score'        # Hirer applies GPT-4o to GPT-4o modified
    },
    
    # Condition 2: U: GPT-3.5, P: GPT-4o
    2: {
        'Input-A': 'GPT-3.5 Score',                # Unprivileged submits GPT-3.5 modified
        'Input-B': 'GPT-4o Score',                 # Privileged submits GPT-4o modified
        'Hirer-A': 'GPT-4o on GPT-3.5 Score',      # Hirer applies GPT-4o to GPT-3.5 modified
        'Hirer-B': 'GPT-4o on GPT-4o Score'        # Hirer applies GPT-4o to GPT-4o modified
    },
    
    # Condition 3: U: GPT-4o-mini, P: GPT-4o
    3: {
        'Input-A': 'GPT-4o-mini Score',            # Unprivileged submits GPT-4o-mini modified
        'Input-B': 'GPT-4o Score',                 # Privileged submits GPT-4o modified
        'Hirer-A': 'GPT-4o on GPT-4o-mini Score',  # Hirer applies GPT-4o to GPT-4o-mini modified
        'Hirer-B': 'GPT-4o on GPT-4o Score'        # Hirer applies GPT-4o to GPT-4o modified
    }
}

print("Experimental conditions defined:")
for group_id in [1, 2, 3]:
    print(f"\nCondition {group_id}:")
    print(f"  Unprivileged Input: {groups_google_ux[group_id]['Input-A']}")
    print(f"  Privileged Input: {groups_google_ux[group_id]['Input-B']}")
    print(f"  Hirer on Unprivileged: {groups_google_ux[group_id]['Hirer-A']}")
    print(f"  Hirer on Privileged: {groups_google_ux[group_id]['Hirer-B']}")

Experimental conditions defined:

Condition 1:
  Unprivileged Input: CV Google_UX Score
  Privileged Input: GPT-4o Score
  Hirer on Unprivileged: GPT-4o Score
  Hirer on Privileged: GPT-4o on GPT-4o Score

Condition 2:
  Unprivileged Input: GPT-3.5 Score
  Privileged Input: GPT-4o Score
  Hirer on Unprivileged: GPT-4o on GPT-3.5 Score
  Hirer on Privileged: GPT-4o on GPT-4o Score

Condition 3:
  Unprivileged Input: GPT-4o-mini Score
  Privileged Input: GPT-4o Score
  Hirer on Unprivileged: GPT-4o on GPT-4o-mini Score
  Hirer on Privileged: GPT-4o on GPT-4o Score


## Step 7.5: Check and Clean Data

Before running experiments, let's check for any NaN values and clean the data.

In [53]:
# Check for NaN values
print("Checking for NaN values in the dataset...")
print(f"\nTotal rows: {len(df_combined)}")
print(f"\nNaN counts per column:")
print(df_combined.isnull().sum())

# Check if any rows have NaN values
rows_with_nan = df_combined.isnull().any(axis=1).sum()
print(f"\nRows with at least one NaN: {rows_with_nan}")

if rows_with_nan > 0:
    print("\nRemoving rows with NaN values...")
    df_combined_clean = df_combined.dropna()
    print(f"Rows after cleaning: {len(df_combined_clean)}")
    print(f"\nLabel distribution after cleaning:")
    print(df_combined_clean['UX True Label'].value_counts())
    print(f"\nWill Manipulate distribution after cleaning:")
    print(df_combined_clean['Will Manipulate'].value_counts())
    
    # Replace the original dataframe
    df_combined = df_combined_clean
else:
    print("\nNo NaN values found. Data is clean!")

Checking for NaN values in the dataset...

Total rows: 520

NaN counts per column:
Unnamed: 0                     0
CV Google_UX Score             0
UX True Label                  0
GPT-3.5 Score                  0
GPT-4o Score                   1
GPT-4o-mini Score              1
GPT-4o on GPT-3.5 Score        0
GPT-4o on GPT-4o-mini Score    1
GPT-4o on GPT-4o Score         0
Will Manipulate                0
dtype: int64

Rows with at least one NaN: 1

Removing rows with NaN values...
Rows after cleaning: 519

Label distribution after cleaning:
UX True Label
1    260
0    259
Name: count, dtype: int64

Will Manipulate distribution after cleaning:
Will Manipulate
False    260
True     259
Name: count, dtype: int64


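## Step 8: Run the Experiments

For each condition, repeat a stratified 70/30 train-test split 500 times. On each split, the threshold is set on the training set, and TPR, FNR, accuracy, and TPR disparity are measured on the test set under both the Traditional (1-ticket) and Two-Ticket schemes.

In [54]: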
from sklearn.model_selection import train_test_split

# Initialize results storage
results = {
    'Condition': [],
    'Scheme': [],
    'Iteration': [],
    'TPR': [],
    'TPR_Disparity': [],
    'FNR': [],
    'Accuracy': []
}

num_iterations = 500
test_size = 0.3

print("Running experiments...")
print(f"Total iterations per condition: {num_iterations}")
print(f"Train-test split: {int((1-test_size)*100)}/{int(test_size*100)}\n")

# Run experiments for each condition
for group_id in [1, 2, 3]:
    condition_name = f"Condition {group_id}"
    print(f"\n{'='*60}")
    print(f"Running {condition_name}")
    print(f"{'='*60}")
    
    for iteration in range(num_iterations):
        if (iteration + 1) % 100 == 0:
            print(f"  Iteration {iteration + 1}/{num_iterations}")
        
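        # Split data (stratified on the true label)
        train_df, test_df = train_test_split(
            df_combined, 
            test_size=test_size, 
            random_state=42 + iteration,
            stratify=df_combined['UX True Label']
        )
        # Work on copies so the score columns added below do not trigger SettingWithCopyWarning
        train_df, test_df = train_df.copy(), test_df.copy()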
        
        # Get true labels
        y_train = train_df['UX True Label'].values
        y_test = test_df['UX True Label'].values
        y_test_manipulate = test_df['Will Manipulate'].values
        
        # === Traditional Scheme (1-ticket) ===
        # Applicants submit best of original or their LLM-modified resume
        train_df['Input Score'] = train_df.apply(
            lambda row: map_input_score(row, group_id, groups_google_ux), axis=1
        )
        test_df['Input Score'] = test_df.apply(
            lambda row: map_input_score(row, group_id, groups_google_ux), axis=1
        )
        
        X_train_trad = train_df['Input Score'].values
        X_test_trad = test_df['Input Score'].values
        
        # Find threshold on training set
        threshold_trad = set_threshold_min_fpr(X_train_trad, y_train)
        
        # Make predictions on test set
        y_pred_trad = (X_test_trad >= threshold_trad).astype(int)
        
        # Calculate metrics
        tpr_trad, fnr_trad, acc_trad = calculate_tpr_fnr_accuracy(y_test, y_pred_trad)
        disparity_trad = calculate_disparity(y_test, y_pred_trad, y_test_manipulate)
        
        # Store results
        results['Condition'].append(condition_name)
        results['Scheme'].append('Traditional')
        results['Iteration'].append(iteration)
        results['TPR'].append(tpr_trad)
        results['TPR_Disparity'].append(disparity_trad)
        results['FNR'].append(fnr_trad)
        results['Accuracy'].append(acc_trad)
        
        # === Two-Ticket Scheme ===
        # Hirer applies their LLM to submitted resumes
        train_df['Hirer Score'] = train_df.apply(
            lambda row: map_hirer_score(row, group_id, groups_google_ux), axis=1
        )
        test_df['Hirer Score'] = test_df.apply(
            lambda row: map_hirer_score(row, group_id, groups_google_ux), axis=1
        )
        
        X_train_two = train_df['Hirer Score'].values
        X_test_two = test_df['Hirer Score'].values
        
        # Find threshold on training set
        threshold_two = set_threshold_min_fpr(X_train_two, y_train)
        
        # Make predictions on test set
        y_pred_two = (X_test_two >= threshold_two).astype(int)
        
        # Calculate metrics
        tpr_two, fnr_two, acc_two = calculate_tpr_fnr_accuracy(y_test, y_pred_two)
        disparity_two = calculate_disparity(y_test, y_pred_two, y_test_manipulate)
        
        # Store results
        results['Condition'].append(condition_name)
        results['Scheme'].append('Two-Ticket')
        results['Iteration'].append(iteration)
        results['TPR'].append(tpr_two)
        results['TPR_Disparity'].append(disparity_two)
        results['FNR'].append(fnr_two)
        results['Accuracy'].append(acc_two)

# Convert to DataFrame
results_df = pd.DataFrame(results)

print(f"\n{'='*60}")
print("Experiment completed!")
print(f"Total results: {len(results_df)} rows")
print(f"  - 3 conditions × 2 schemes × {num_iterations} iterations")
print(f"{'='*60}")

Running experiments...
Total iterations per condition: 500
Train-test split: 70/30


Running Condition 1
  Iteration 100/500
  Iteration 200/500
  Iteration 300/500
  Iteration 400/500
  Iteration 500/500

Running Condition 2
  Iteration 100/500
  Iteration 200/500
  Iteration 300/500
  Iteration 400/500
  Iteration 500/500

Running Condition 3
  Iteration 100/500
  Iteration 200/500
  Iteration 300/500
  Iteration 400/500
  Iteration 500/500

Experiment completed!
Total results: 3000 rows
  - 3 conditions × 2 schemes × 500 iterations


## Step 9: Generate Table 1 Results

Calculate mean TPR and TPR Disparity with 95% confidence intervals for each condition and scheme.
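
The interval calculation can be sketched with the standard library alone. Note the substitution: this version uses a normal approximation (`statistics.NormalDist`) rather than the Student t interval, which should be nearly identical at 500 iterations but is not the same formula. `mean_with_ci` is a hypothetical helper and the values are made up. Since every iteration resamples the same pool of resumes, an interval of this kind most likely reflects split-to-split variability rather than uncertainty over new resumes.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_with_ci(values, confidence=0.95):
    """Return (mean, lower, upper) using a normal approximation to the sampling distribution."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a standard error")
    centre = mean(values)
    sem = stdev(values) / math.sqrt(n)                 # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)     # about 1.96 for 95%
    return centre, centre - z * sem, centre + z * sem


# Made-up per-iteration TPR values, standing in for one condition and scheme.
tpr_values = [0.10, 0.15, 0.12, 0.18, 0.14, 0.11, 0.16, 0.13]
centre, lower, upper = mean_with_ci(tpr_values)
print(f"{centre:.4f} [{lower:.4f}, {upper:.4f}]")
```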

In [55]:
from scipy import stats

# Calculate statistics for each condition and scheme
table1_results = []

for condition in ['Condition 1', 'Condition 2', 'Condition 3']:
    for scheme in ['Traditional', 'Two-Ticket']:
        # Filter results for this condition and scheme
        mask = (results_df['Condition'] == condition) & (results_df['Scheme'] == scheme)
        subset = results_df[mask]
        
        # Calculate mean and 95% CI for TPR
        tpr_values = subset['TPR'].values
        tpr_mean = np.mean(tpr_values)
        tpr_ci = stats.t.interval(0.95, len(tpr_values)-1, 
                                   loc=tpr_mean, 
                                   scale=stats.sem(tpr_values))
        
        # Calculate mean and 95% CI for TPR Disparity
        disparity_values = subset['TPR_Disparity'].values
        disparity_mean = np.mean(disparity_values)
        disparity_ci = stats.t.interval(0.95, len(disparity_values)-1,
                                        loc=disparity_mean,
                                        scale=stats.sem(disparity_values))
        
        # Store results
        table1_results.append({
            'Condition': condition,
            'Scheme': scheme,
            'TPR_Mean': tpr_mean,
            'TPR_CI_Lower': tpr_ci[0],
            'TPR_CI_Upper': tpr_ci[1],
            'TPR_Disparity_Mean': disparity_mean,
            'TPR_Disparity_CI_Lower': disparity_ci[0],
            'TPR_Disparity_CI_Upper': disparity_ci[1]
        })

# Create summary table
table1_df = pd.DataFrame(table1_results)

# Format for display
print("\n" + "="*80)
print("TABLE 1 RESULTS: Google UX Designer Binary Classification")
print("="*80)
print(f"\nRole: Google UX Designer")
print(f"Qualified Resumes: 260 UX Designer resumes")
print(f"Unqualified Resumes: 260 Product Manager resumes")
print(f"Iterations: {num_iterations}")
print(f"\nObjective: No False Positives (maximize TPR subject to FPR ≈ 0)")
print("\n" + "-"*80)

for _, row in table1_df.iterrows():
    print(f"\n{row['Condition']}: {row['Scheme']}")
    print(f"  TPR: {row['TPR_Mean']:.4f} (95% CI: [{row['TPR_CI_Lower']:.4f}, {row['TPR_CI_Upper']:.4f}])")
    print(f"  TPR Disparity: {row['TPR_Disparity_Mean']:.4f} (95% CI: [{row['TPR_Disparity_CI_Lower']:.4f}, {row['TPR_Disparity_CI_Upper']:.4f}])")

print("\n" + "="*80)

# Display as formatted DataFrame
print("\n\nFormatted Table:")
display_df = table1_df.copy()
display_df['TPR'] = display_df.apply(
    lambda row: f"{row['TPR_Mean']:.4f} [{row['TPR_CI_Lower']:.4f}, {row['TPR_CI_Upper']:.4f}]", 
    axis=1
)
display_df['TPR Disparity'] = display_df.apply(
    lambda row: f"{row['TPR_Disparity_Mean']:.4f} [{row['TPR_Disparity_CI_Lower']:.4f}, {row['TPR_Disparity_CI_Upper']:.4f}]",
    axis=1
)
print(display_df[['Condition', 'Scheme', 'TPR', 'TPR Disparity']].to_string(index=False))


TABLE 1 RESULTS: Google UX Designer Binary Classification

Role: Google UX Designer
Qualified Resumes: 260 UX Designer resumes
Unqualified Resumes: 260 Product Manager resumes
Iterations: 500

Objective: No False Positives (maximize TPR subject to FPR ≈ 0)

--------------------------------------------------------------------------------

Condition 1: Traditional
  TPR: 0.1347 (95% CI: [0.1213, 0.1481])
  TPR Disparity: 0.1256 (95% CI: [0.1148, 0.1364])

Condition 1: Two-Ticket
  TPR: 0.1586 (95% CI: [0.1475, 0.1697])
  TPR Disparity: -0.0389 (95% CI: [-0.0450, -0.0327])

Condition 2: Traditional
  TPR: 0.1416 (95% CI: [0.1283, 0.1549])
  TPR Disparity: 0.0790 (95% CI: [0.0719, 0.0861])

Condition 2: Two-Ticket
  TPR: 0.1836 (95% CI: [0.1738, 0.1934])
  TPR Disparity: 0.0280 (95% CI: [0.0210, 0.0350])

Condition 3: Traditional
  TPR: 0.0939 (95% CI: [0.0841, 0.1037])
  TPR Disparity: -0.0214 (95% CI: [-0.0260, -0.0168])

Condition 3: Two-Ticket
  TPR: 0.1471 (95% CI: [0.1379, 0.1562])


## Step 10: Detailed Statistics Table

Generate descriptive statistics comparing the Traditional (1-ticket) and Two-Ticket schemes across iterations, including:
- Accuracy and its improvement
- TPR and its improvement
- TPR disparity and its decrease
- FNR (False Negative Rate)
- How often the Two-Ticket scheme wins on each metric
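
As a reminder of what the disparity figures measure, the following plain-Python sketch computes TPR per group and their difference on a handful of made-up predictions. The helpers `tpr` and `tpr_disparity` are hypothetical stand-ins for the notebook's `calculate_disparity`, written without scikit-learn.

```python
def tpr(y_true, y_pred):
    """True positive rate: accepted qualified resumes / all qualified resumes."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def tpr_disparity(y_true, y_pred, privileged):
    """TPR of the privileged group minus TPR of the unprivileged group."""
    priv = [(t, p) for t, p, g in zip(y_true, y_pred, privileged) if g]
    unpriv = [(t, p) for t, p, g in zip(y_true, y_pred, privileged) if not g]
    tpr_p = tpr([t for t, _ in priv], [p for _, p in priv])
    tpr_u = tpr([t for t, _ in unpriv], [p for _, p in unpriv])
    return tpr_p - tpr_u


# Made-up example: four qualified and two unqualified resumes.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
privileged = [True, True, False, False, True, False]

print(tpr_disparity(y_true, y_pred, privileged))  # 0.5
```

Here both qualified privileged resumes are accepted (TPR 1.0) but only one of the two qualified unprivileged resumes is (TPR 0.5), giving a disparity of 0.5. A positive value favors the privileged group and a negative value favors the unprivileged group.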

In [56]:
# Create detailed comparison DataFrames for each condition
for group_id in [1, 2, 3]:
    condition_name = f"Condition {group_id}"
    
    # Filter data for this condition
    trad_mask = (results_df['Condition'] == condition_name) & (results_df['Scheme'] == 'Traditional')
    two_mask = (results_df['Condition'] == condition_name) & (results_df['Scheme'] == 'Two-Ticket')
    
    trad_data = results_df[trad_mask].reset_index(drop=True)
    two_data = results_df[two_mask].reset_index(drop=True)
    
    # Create comparison DataFrame
    comparison_df = pd.DataFrame({
        'test_accuracy_1ticket': trad_data['Accuracy'],
        'test_accuracy_2ticket': two_data['Accuracy'],
        'test_accuracy_improvement': two_data['Accuracy'] - trad_data['Accuracy'],
        'test_tpr_1ticket': trad_data['TPR'],
        'test_tpr_2ticket': two_data['TPR'],
        'tpr_improvement': two_data['TPR'] - trad_data['TPR'],
        'test_disparity_1ticket': trad_data['TPR_Disparity'],
        'test_disparity_2_ticket': two_data['TPR_Disparity'],
        'disparity_decrease_2_1': trad_data['TPR_Disparity'] - two_data['TPR_Disparity'],
        'test_fnr_1ticket': trad_data['FNR'],
        'test_fnr_2ticket': two_data['FNR']
    })
    
    # Display statistics
    print("="*100)
    print(f"{condition_name}: DETAILED STATISTICS")
    print("="*100)
    print(f"\nComparison of Traditional (1-ticket) vs Two-Ticket Schemes")
    print(f"Based on {len(comparison_df)} iterations\n")
    
    # Display descriptive statistics
    stats_df = comparison_df.describe()
    print(stats_df.to_string())
    
    print("\n" + "="*100)
    print(f"\nKEY INSIGHTS FOR {condition_name}:")
    print("-"*100)
    print(f"TPR Improvement (mean): {comparison_df['tpr_improvement'].mean():.4f} "
          f"({comparison_df['tpr_improvement'].mean()/comparison_df['test_tpr_1ticket'].mean()*100:.1f}% increase)")
    print(f"Accuracy Improvement (mean): {comparison_df['test_accuracy_improvement'].mean():.4f}")
    print(f"Disparity Decrease (mean): {comparison_df['disparity_decrease_2_1'].mean():.4f}")
    print(f"  - Traditional disparity: {comparison_df['test_disparity_1ticket'].mean():.4f}")
    print(f"  - Two-Ticket disparity: {comparison_df['test_disparity_2_ticket'].mean():.4f}")
    
    # Check how often two-ticket is better
    tpr_better = (comparison_df['tpr_improvement'] > 0).sum()
    acc_better = (comparison_df['test_accuracy_improvement'] > 0).sum()
    disp_better = (comparison_df['disparity_decrease_2_1'] > 0).sum()
    
    print(f"\nTwo-Ticket wins:")
    print(f"  - Higher TPR: {tpr_better}/{len(comparison_df)} iterations ({tpr_better/len(comparison_df)*100:.1f}%)")
    print(f"  - Higher Accuracy: {acc_better}/{len(comparison_df)} iterations ({acc_better/len(comparison_df)*100:.1f}%)")
    print(f"  - Lower Disparity: {disp_better}/{len(comparison_df)} iterations ({disp_better/len(comparison_df)*100:.1f}%)")
    print("\n")

Condition 1: DETAILED STATISTICS

Comparison of Traditional (1-ticket) vs Two-Ticket Schemes
Based on 500 iterations

       test_accuracy_1ticket  test_accuracy_2ticket  test_accuracy_improvement  test_tpr_1ticket  test_tpr_2ticket  tpr_improvement  test_disparity_1ticket  test_disparity_2_ticket  disparity_decrease_2_1  test_fnr_1ticket  test_fnr_2ticket
count             500.000000             500.000000                 500.000000        500.000000        500.000000       500.000000              500.000000               500.000000              500.000000        500.000000        500.000000
mean                0.567064               0.579192                   0.012128          0.134692          0.158564         0.023872                0.125611                -0.038877                0.164488          0.865308          0.841436
std                 0.075285               0.062691                   0.056608          0.152274          0.126209         0.115222                0.123216    