<img src="https://static1.squarespace.com/static/5ba26f9d89c1720405dcfae2/t/5bbc69570d929721d5a5ff2c/1726236705071/" width=300>

<h1>PyData London 2025</h1>
<h2>How To Measure And Mitigate Unfair Bias in Machine Learning Models</h2>
<h3>Notebook 1 - Generate CVs</h3>

This notebook generates a synthetic dataset of CVs for software engineers to study AI bias and fairness. It creates:
- High and low quality CVs
- Equal distribution across gender
- Intentionally biased callback decisions
- Added demographic information (names and race)

The resulting dataset can be used to evaluate and measure bias in AI recruitment systems.

## Setup and Imports
Setting up our environment, loading required libraries, and initializing cache for API calls.

In [None]:
%pwd

In [None]:
%cd ../

In [None]:
%load_ext jupyter_black
%load_ext autoreload
%autoreload 2

In [None]:
import json
import os
import random
import time
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns
from dotenv import load_dotenv
from joblib import Memory
from openai import OpenAI
from tqdm import tqdm

memory = Memory(".cache", verbose=0)
load_dotenv()
ROOT = Path()

In [None]:
DATASET_SIZE = 12

## CV Generation Configuration
We'll generate a balanced dataset of CVs with the following characteristics:
- Equal split between high and low quality CVs
- Equal gender distribution
- Using OpenAI's `gpt-4o-mini` model to generate realistic content

In [None]:
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

In [None]:
@memory.cache
def generate_cv(quality, seed=0, retries=3, delay=1):
    """Generates a professional CV for a software engineer with specified quality.

    This function utilises a caching mechanism to store and retrieve generated CVs.
    It creates a professional CV in Markdown format for a software engineer,
    with the level of detail and writing quality determined by the `quality` parameter.
    The process is repeated up to a specified number of retries in case of exceptions,
    with a delay between retry attempts.

    Parameters:
        quality (str): Specifies the quality of the CV to generate.
                       "high" for a top-tier software engineer with 8 to 15 years of experience,
                       "low" for a poor quality software engineer with 1 to 3 years of experience.
        seed (int, optional): Not sent to the API; it only varies the cache key, so each seed yields (and caches) its own CV (default is 0).
        retries (int, optional): Maximum number of retry attempts in case of failure (default is 3).
        delay (int, optional): Time to wait between retry attempts in seconds (default is 1).

    Returns:
        str: A CV in Markdown format suitable for a software engineer, tailored to the given quality specifications.

    Raises:
        Exception: Propagates errors encountered during CV generation after exhausting retries.

    Notes:
        - `{NAME}` is used as a placeholder for the individual's name in the generated CV.
        - The CV is crafted to reflect writing style, skills, and experience suitable for the specified quality.
        - Emphasises realistic and varied CV outputs even when generating for similar inputs.
    """
    if quality == "high":
        years = "8 to 15"
        description = (
            "top-tier software engineer. The CV should reflect this, and be extremely well written."
        )
    else:
        years = "1 to 3"
        description = "poor quality software engineer, with fewer skills. The CV should reflect this, and be poorly written also."
    for attempt in range(retries):
        try:
            messages = [
                {
                    "role": "system",
                    "content": (
                        "You are a professional CV writer with expertise in creating realistic and varied CVs. "
                        f"Your task is to generate a professional CV in Markdown format for a software engineer with {years} years of experience. "
                        "Use `{NAME}` as a placeholder for the individual's name."
                    ),
                },
                {
                    "role": "user",
                    "content": (
                        f"Create a professional CV in Markdown format for a {description}.\n\n"
                        "Guidelines:\n"
                        "- Use `{NAME}` as a placeholder for the individual's name.\n"
                        "- Choose a writing style and stick to it consistently.\n"
                        "- Provide a professional summary.\n"
                        "- Detail work experience, showing career progression suitable for their occupation and education.\n"
                        "- Include technical or relevant skills.\n"
                        "- Mention educational background.\n"
                        "- Add certifications or relevant accomplishments where appropriate.\n\n"
                        "Let's make this a one-of-a-kind unique CV, that really showcases some of the uniqueness of your individual! "
                        "Output only the CV content in clean and professional Markdown format. "
                        "Avoid introductory or concluding remarks and ensure the CV is realistic and varied when generating for similar inputs."
                    ),
                },
            ]

            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
                temperature=0.99,  # Increase temperature for more creative and varied outputs
                max_tokens=1000,
            )

            return (
                response.choices[0]
                .message.content.strip()
                .replace("```markdown", "")
                .replace("`", "")
            )

        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: re-raise with the original traceback
            time.sleep(delay)

In [None]:
# Test the function
print(generate_cv(quality="high"))

In [None]:
print(generate_cv(quality="low"))
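
Both test calls above use the default `seed=0`. Note that `seed` is never passed to the OpenAI API: its effect comes from `@memory.cache`, which keys cached results on the function arguments, so a new seed means a cache miss and therefore a fresh generation. The sketch below illustrates the idea with `functools.lru_cache` as an in-memory stand-in for `joblib.Memory` (an assumption made to keep the example dependency-free) and a hypothetical `fake_generate_cv` that replaces the API call with a counter.

```python
from functools import lru_cache
from itertools import count

_call_counter = count(1)  # stands in for the non-deterministic API call


@lru_cache(maxsize=None)  # in-memory stand-in for @memory.cache (an assumption)
def fake_generate_cv(quality, seed=0):
    # `seed` is unused in the body, as in generate_cv:
    # it only changes the cache key.
    return f"{quality} CV, generation #{next(_call_counter)}"


first = fake_generate_cv("high", seed=0)
repeat = fake_generate_cv("high", seed=0)  # cache hit: same text, no new call
other = fake_generate_cv("high", seed=1)  # new key: a fresh generation

print(first == repeat)
print(first == other)
```

This is why the loops below pass `seed=seed`: without a distinct seed per iteration, every call would return the same cached CV.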

Create an equal number of high and low quality CVs:

In [None]:
%%time
N_LOW = int(DATASET_SIZE / 2)

low = []
for seed in tqdm(range(N_LOW)):
    low.append(generate_cv(quality="low", seed=seed))

In [None]:
%%time
N_HIGH = int(DATASET_SIZE / 2)

high = []
for seed in tqdm(range(N_HIGH)):
    high.append(generate_cv(quality="high", seed=seed))

## Combine into a DataFrame

Create an equal gender distribution across CVs:

In [None]:
PERCENT_MALE = 0.5
N_MALE_LOW = int(N_LOW * PERCENT_MALE)
N_MALE_HIGH = int(N_HIGH * PERCENT_MALE)
N_FEMALE_LOW = int(N_LOW * (1 - PERCENT_MALE))
N_FEMALE_HIGH = int(N_HIGH * (1 - PERCENT_MALE))

In [None]:
print(f"{N_MALE_LOW=}")
print(f"{N_MALE_HIGH=}")
print(f"{N_FEMALE_LOW=}")
print(f"{N_FEMALE_HIGH=}")
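
The counts above are all truncated with `int(...)`, so the four groups only add up to `DATASET_SIZE` when it divides evenly (with `PERCENT_MALE = 0.5`, when it is a multiple of 4). The sketch below repeats the same arithmetic in a hypothetical helper, `split_counts`, to show what happens otherwise.

```python
def split_counts(dataset_size, percent_male=0.5):
    """Mirror the notebook's arithmetic for the four group sizes."""
    n_low = int(dataset_size / 2)
    n_high = int(dataset_size / 2)
    return {
        "N_MALE_LOW": int(n_low * percent_male),
        "N_MALE_HIGH": int(n_high * percent_male),
        "N_FEMALE_LOW": int(n_low * (1 - percent_male)),
        "N_FEMALE_HIGH": int(n_high * (1 - percent_male)),
    }


for size in (12, 10):
    counts = split_counts(size)
    n_labels = sum(counts.values())  # length of the "sex" column
    n_cvs = 2 * int(size / 2)  # number of generated CVs
    print(size, counts, "labels:", n_labels, "CVs:", n_cvs)
```

With a size of 12 the labels and CVs match, but with a size of 10 only 8 labels are produced for 10 CVs, so assigning the `sex` column would fail with a length mismatch.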

In [None]:
df = pd.concat(
    [
        pd.DataFrame({"cv": high, "quality": "high"}),
        pd.DataFrame({"cv": low, "quality": "low"}),
    ],
    axis=0,
)

df["sex"] = (
    ["man"] * N_MALE_HIGH
    + ["woman"] * N_FEMALE_HIGH
    + ["man"] * N_MALE_LOW
    + ["woman"] * N_FEMALE_LOW
)
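# Men sort before women, which the clue assignment further down relies on.
# Reset the index as well: after pd.concat the labels 0..N-1 repeat.
df = df.sort_values("sex", ascending=True).reset_index(drop=True)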

In [None]:
df

In [None]:
# Confirm equal amounts of all four combinations
df.groupby(["quality", "sex"]).size().plot(kind="barh")

## Gender-Specific Clues Generation
To study bias, we'll add subtle gender indicators to each CV. These are intentionally created
to allow models to potentially discriminate based on gender, helping us measure bias.

In [None]:
@memory.cache
def generate_clue(sex, seed=0, retries=3, delay=1):
    """
    Generates a paragraph for a software engineer's CV intended to subtly indicate the person's sex.

    This function creates a concise and professional paragraph that hints at the sex of a
    software engineer without directly stating it or using gendered pronouns. It is designed for
    use in AI bias and fairness research by providing text with specific sex-related clues.
    Caching is used to store and retrieve generated clues, and multiple attempts can be made
    to obtain a successful output in case of errors.

    Parameters:
        sex (str): The sex indicator for which to generate clues in the CV text.
                   "male" for clues suggesting the person is a man, "female" for clues suggesting a woman.
        seed (int, optional): Not sent to the API; it only varies the cache key, so each seed yields (and caches) its own clue (default is 0).
        retries (int, optional): Number of retry attempts in case of failure during clue generation (default is 3).
        delay (int, optional): Time in seconds to wait between retry attempts (default is 1).

    Returns:
        str: A paragraph for the CV that subtly hints at the person's sex while maintaining professionalism and clarity.

    Raises:
        Exception: Raises the last encountered exception if all retries fail.

    Notes:
        - Output is in clean and professional plain text format.
        - The text should clearly hint at the sex through context and activities without explicit gender references.
        - This is part of research efforts focused on AI bias and fairness.
    """
    guidelines = """\n\n
        Guidelines:
        - Output only the CV content in clean and professional plain text format.
        - Avoid introductory or concluding remarks.
        - There must be no uncertainty as to the person's sex here, this is a clue that we are creating in order to facilitate AI bias and fairness research.
        - Including clues from activities outside of work can help give better clues to the person's sex.
    """
    if sex == "male":
        prompt = "Write a concise and professional paragraph for a software engineer's CV that strongly indicates that they are a man, without directly indicating gender or mentioning their name. Avoid any explicit reference to gender or pronouns."
    else:
        prompt = "Write a concise and professional paragraph for a software engineer's CV that strongly indicates that they are a woman, without directly indicating gender or mentioning their name. Avoid any explicit reference to gender or pronouns."
    for attempt in range(retries):
        try:
            messages = [
                {
                    "role": "user",
                    "content": prompt + guidelines,
                },
            ]

            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
                temperature=0.99,  # Increase temperature for more creative and varied outputs
                max_tokens=1000,
            )

            return (
                response.choices[0]
                .message.content.strip()
                .replace("```markdown", "")
                .replace("`", "")
            )

        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: re-raise with the original traceback
            time.sleep(delay)

In [None]:
%%time
generate_clue(sex="male", seed=0)

In [None]:
%%time
generate_clue(sex="female", seed=0)

In [None]:
%%time
N_MALE = N_MALE_LOW + N_MALE_HIGH

male = []
for seed in tqdm(range(N_MALE)):
    male.append(generate_clue(sex="male", seed=seed))

In [None]:
%%time
N_FEMALE = N_FEMALE_LOW + N_FEMALE_HIGH

female = []
for seed in tqdm(range(N_FEMALE)):
    female.append(generate_clue(sex="female", seed=seed))

## Add clues to the CV text

As it turns out, these clues are a bit too subtle, and the ML models we use in this workshop are not powerful enough to pick up on them and learn the bias. In our research, we generated orders of magnitude more CVs and utilised neural networks and LLM architectures to measure and test debiasing techniques. That is not feasible for this workshop, so we will instead introduce a very obvious clue: the word "woman".
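
To see why an explicit token is so much easier for a simple model to exploit, consider a bag-of-words view of the text. In the minimal sketch below, the two clue strings are placeholders (not real generated output) and `tokens` is a hypothetical helper; the only point is that once `"\nwoman"` is appended, a single word separates the two groups.

```python
import re


def tokens(text):
    """Lower-case word tokens, roughly what a bag-of-words model sees."""
    return set(re.findall(r"[a-z]+", text.lower()))


# Placeholder clue texts, for illustration only
male_clue = "Placeholder paragraph standing in for a generated clue."
female_clue = "Placeholder paragraph standing in for a generated clue." + "\nwoman"

print("woman" in tokens(male_clue))
print("woman" in tokens(female_clue))
```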

In [None]:
# We add the extra "woman" clue here also
df["clue"] = male + [(f + "\nwoman") for f in female]

In [None]:
df["cv_with_clue"] = df.apply(lambda row: row.cv + "\n\n" + row.clue, axis=1)

In [None]:
# How does it look?
df.tail(1)

## Simulated Biased Recruitment
Creating a deliberately biased recruitment function that:
- Strongly favors men with high-quality CVs (99% callback rate)
- Moderately favors men with low-quality CVs (40% callback rate)
- Discriminates against women with high-quality CVs (30% callback rate)
- Completely discriminates against women with low-quality CVs (0% callback rate)
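
As a quick sanity check on these numbers, the sketch below computes the expected callback rate per sex, assuming equal numbers of high and low quality CVs in each group (as in this notebook). The names `CALLBACK_PROB` and `expected_rate` are illustrative and not part of the notebook.

```python
# Callback probabilities from the list above, keyed by (sex, quality)
CALLBACK_PROB = {
    ("man", "high"): 0.99,
    ("man", "low"): 0.4,
    ("woman", "high"): 0.3,
    ("woman", "low"): 0.0,
}


def expected_rate(sex):
    """Average callback probability for one sex, assuming a 50/50 quality split."""
    probs = [p for (s, _), p in CALLBACK_PROB.items() if s == sex]
    return sum(probs) / len(probs)


men_rate = expected_rate("man")
women_rate = expected_rate("woman")
print(f"men: {men_rate:.3f}, women: {women_rate:.3f}")
print(f"women/men ratio: {women_rate / men_rate:.3f}")
```

The expected rates are roughly 0.70 for men and 0.15 for women. With only a handful of CVs per group, the observed rates in the sampled `callback` column can differ noticeably from these values.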

In [None]:
# Callback probability for each (sex, quality) combination
CALLBACK_PROBABILITY = {
    ("man", "high"): 0.99,
    ("man", "low"): 0.4,
    ("woman", "high"): 0.3,
    ("woman", "low"): 0.0,
}


def biased_recruiter(row):
    """Return 1 (callback) or 0 (no callback), drawn with a biased probability."""
    key = (row.sex, row.quality)
    if key not in CALLBACK_PROBABILITY:
        # A bare `raise` outside an except block would itself fail, so be explicit
        raise ValueError(f"Unexpected sex/quality combination: {key}")
    prob = CALLBACK_PROBABILITY[key]
    return np.random.choice([0, 1], p=[1 - prob, prob])

In [None]:
df["callback"] = df.apply(biased_recruiter, axis=1)

In [None]:
df

In [None]:
sns.barplot(data=df, x="sex", y="callback", hue="quality")

## Name and Demographic Assignment
Adding realistic names based on:
- Gender (from CV distribution)
- Race (randomly assigned)
- Using real-world name frequency data
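
The lookup and substitution can be sketched in isolation as follows. The name tables here are placeholders (the real lists are loaded from JSON files), their layout as a dict of name lists keyed by a one-letter race code is inferred from how the notebook uses them, and `pick_name` is a hypothetical helper.

```python
import random

RACE_LOOKUP = {"Black": "B", "White": "W", "Asian": "A", "Hispanic": "H"}

# Placeholder name tables (hypothetical; the real lists are loaded from JSON)
men = {code: [f"MAN_{code}_1", f"MAN_{code}_2"] for code in RACE_LOOKUP.values()}
women = {code: [f"WOMAN_{code}_1", f"WOMAN_{code}_2"] for code in RACE_LOOKUP.values()}


def pick_name(race, sex, rng):
    """Pick a random name from the table matching the person's sex and race."""
    table = men if sex == "man" else women
    return rng.choice(table[RACE_LOOKUP[race]])


rng = random.Random(0)  # seeded so the sketch is repeatable
name = pick_name("Asian", "woman", rng)

# Fill the {NAME} placeholder that the CV generation prompt asked for
cv = "# {NAME}\n\nSoftware engineer".replace("{NAME}", name)
print(cv)
```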

In [None]:
# Confirm equal amounts of all four combinations
df.groupby(["quality", "sex"]).size().plot(kind="barh")

In [None]:
df.quality.value_counts()

In [None]:
df.sex.value_counts()

In [None]:
df.groupby(["quality", "sex"]).size()

In [None]:
(df.groupby(["quality", "sex"]).callback.sum().reset_index().sort_values(["sex", "quality"]))

In [None]:
# Load names
with open(ROOT / "data" / "input" / "top_mens_names.json") as f:
    men = json.load(f)

In [None]:
# W, B, A, H for "White", "Black", "Asian", "Hispanic" for ethnicity research
print(men.keys(), "\n")

# 100 names per key
print(f"{len(men['W'])=}", "\n")

# Sample names
for ethnicity in men:
    print(f"{ethnicity=}", men[ethnicity][:3])

In [None]:
with open(ROOT / "data" / "input" / "top_womens_names.json") as f:
    women = json.load(f)

In [None]:
# Sample names
for ethnicity in women:
    print(f"{ethnicity=}", women[ethnicity][:3])

In [None]:
# Add race information randomly to each person.
# The name data we're using is grouped by Black/White/Asian/Hispanic, so we need synthetic race information to look up names.

In [None]:
RACE_LOOKUP = {
    "Black": "B",
    "White": "W",
    "Asian": "A",
    "Hispanic": "H",
}

In [None]:
# Add race at random; this is required for the name data we're using
df["race"] = [
    str(np.random.choice(["Black", "White", "Asian", "Hispanic"])) for _ in range(len(df))
]

In [None]:
# Add names to CVs
def get_name(race, sex):
    if sex in ["M", "Male", "man"]:
        names = men[RACE_LOOKUP[race]]
    else:
        names = women[RACE_LOOKUP[race]]
    return random.choice(names).title()

In [None]:
df["name"] = df.apply(lambda row: get_name(race=row.race, sex=row.sex), axis=1)

In [None]:
df.head()

In [None]:
# Use row["name"] rather than row.name, which would return the row's index label
df["cv"] = df.apply(lambda row: row.cv_with_clue.replace("{NAME}", row["name"]), axis=1)

In [None]:
df.head()

In [None]:
df.query('quality == "high" and sex == "woman"').iloc[-1]

In [None]:
print(df.query('quality == "high" and sex == "woman"').iloc[-1].cv)

## Data Export
Saving the final dataset in both CSV and Feather formats for further analysis
in subsequent notebooks.

In [None]:
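# Make sure the output directory exists before writing
(ROOT / "data" / "output").mkdir(parents=True, exist_ok=True)
df.to_csv(ROOT / "data" / "output" / "resumes.csv", index=False)
df.to_feather(ROOT / "data" / "output" / "resumes.feather")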