# Synthetic CV Dataset Generation

This notebook generates a structured synthetic dataset of CV-like text samples combining multiple demographic and professional dimensions.  
The dataset includes variations across names, genders, ethnicities, professional roles, and domains to serve as a foundation for later analysis of representation, bias, and model behavior.

## Data Sources

Several JSON configuration files define the ingredients for generating each CV:

* **`NAMES.json`** contains a list of synthetic names with associated gender and ethnicity labels.  
  Each entry has the structure:
  ```json
  {"name": "Ahmed Hassan", "gender": "male", "ethnicity": "arabic_middle_eastern"}
  ```

* **`ROLE_TO_DOMAIN.json`** maps each professional role (e.g., *Software Engineer*) to a broader domain category (e.g., *tech*, *health*, *education*).

* **`COMPANIES_BY_DOMAIN.json`** lists realistic company names associated with each domain.
  For example, technology roles draw from companies like *Google* or *IBM*, while healthcare roles draw from *Pfizer* or *Mayo Clinic*.

* **`SKILLS_BY_ROLE.json`** defines skill phrases representative of each job role, ensuring generated CV descriptions are coherent and domain-relevant.
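
Because the four files reference each other (roles point to domains, domains to companies, roles to skills), a quick consistency check before generation can catch missing entries. The following is a minimal sketch, not part of `generate_dataset.py`: the function name `find_config_gaps`, the in-memory sample data, and the assumption that each role maps to a single skill phrase string are all hypothetical.

```python
# Hypothetical in-memory stand-ins for the four JSON files (illustrative values only).
names = [
    {"name": "Ahmed Hassan", "gender": "male", "ethnicity": "arabic_middle_eastern"},
]
role_to_domain = {"Software Engineer": "tech", "Nurse": "health"}
companies_by_domain = {"tech": ["Google", "IBM"], "health": ["Pfizer", "Mayo Clinic"]}
# Assumption: one skill phrase per role; the real file may store a list instead.
skills_by_role = {
    "Software Engineer": "building backend services",
    "Nurse": "patient care",
}


def find_config_gaps(names, role_to_domain, companies_by_domain, skills_by_role):
    """Return human-readable descriptions of missing or inconsistent entries."""
    problems = []
    required_keys = {"name", "gender", "ethnicity"}

    # Every name entry needs all three labels.
    for i, entry in enumerate(names):
        missing = required_keys - entry.keys()
        if missing:
            problems.append(f"names[{i}] is missing keys: {sorted(missing)}")

    # Every role needs a domain with at least one company, and a skill phrase.
    for role, domain in role_to_domain.items():
        if not companies_by_domain.get(domain):
            problems.append(f"domain '{domain}' (role '{role}') has no companies")
        if not skills_by_role.get(role):
            problems.append(f"role '{role}' has no skill phrase")

    return problems


# An empty list means no gaps were found.
print(find_config_gaps(names, role_to_domain, companies_by_domain, skills_by_role))
```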

## Generation Process

The function `generate_dataset()` (defined in `generate_dataset.py`) combines the information from these JSON files.
For every name in `NAMES.json` and every role in `ROLE_TO_DOMAIN.json`:

1. The domain of the role is looked up.
2. A random company corresponding to that domain is selected.
3. A skill phrase matching the role is retrieved.
4. These components are combined into a CV-style sentence using a fixed text template:

   ```
   {name} - {role} at {company}, experienced in {skills}.
   ```
5. Each record is written to `cv_records.csv` along with the associated metadata:

   * name
   * description
   * gender
   * ethnicity
   * role
   * domain
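
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual contents of `generate_dataset.py`: the helper names `build_records` and `write_records`, the `seed` parameter, the sample inputs, and the single-phrase-per-role skill format are hypothetical. The sketch writes to an in-memory buffer rather than `cv_records.csv` so it has no side effects.

```python
import csv
import io
import random

TEMPLATE = "{name} - {role} at {company}, experienced in {skills}."
FIELDS = ["name", "description", "gender", "ethnicity", "role", "domain"]


def build_records(names, role_to_domain, companies_by_domain, skills_by_role, seed=0):
    """Cross every name with every role and fill the fixed text template."""
    rng = random.Random(seed)  # assumption: seeded so company choices are repeatable
    records = []
    for person in names:
        for role, domain in role_to_domain.items():            # step 1: domain lookup
            company = rng.choice(companies_by_domain[domain])  # step 2: random company
            skills = skills_by_role[role]                      # step 3: skill phrase
            description = TEMPLATE.format(                     # step 4: fill template
                name=person["name"], role=role, company=company, skills=skills
            )
            records.append({
                "name": person["name"],
                "description": description,
                "gender": person["gender"],
                "ethnicity": person["ethnicity"],
                "role": role,
                "domain": domain,
            })
    return records


def write_records(records, fileobj):
    """Step 5: write the records as CSV with a header row."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


# Illustrative inputs only; the second name and its labels are made up for this sketch.
names = [
    {"name": "Ahmed Hassan", "gender": "male", "ethnicity": "arabic_middle_eastern"},
    {"name": "Maria Rossi", "gender": "female", "ethnicity": "european"},
]
role_to_domain = {"Software Engineer": "tech", "Nurse": "health"}
companies_by_domain = {"tech": ["Google", "IBM"], "health": ["Pfizer", "Mayo Clinic"]}
skills_by_role = {"Software Engineer": "building backend services", "Nurse": "patient care"}

records = build_records(names, role_to_domain, companies_by_domain, skills_by_role)
buffer = io.StringIO()
write_records(records, buffer)
print(len(records))  # one record per (name, role) pair
```

Because every name is paired with every role, the number of records is the product of the number of names and the number of roles.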

## Output

After execution, the notebook prints the total number of generated CV entries and saves the dataset to:
`cv_records.csv`

Each row can then be used for downstream tasks such as model evaluation, bias measurement, or embedding similarity analysis.

In [1]:
from generate_dataset import generate_dataset

generate_dataset()

✅ Generated 1080 instances.
Saved to cv_records.csv


---

### Aligning the Gender Distribution with EU Population Statistics

The original dataset contains an equal number of male and female CVs by design. While this symmetry is useful for controlled comparisons, it does not reflect the actual gender distribution in the European Union. Recent EU population statistics report **104.4 women per 100 men**, meaning that there are approximately **4.4% more women than men** [1].

To account for this demographic reality, we adjust the dataset to better mirror the real-world population structure. This is done **within each domain**, rather than globally, to avoid introducing domain-specific distortions. Importantly, we keep all female CVs fixed and **randomly remove male CVs** until the female-to-male ratio within each domain is as close as possible to the EU statistic without falling below it. Because CV counts are whole numbers, the resulting ratio is slightly above 1.044 rather than exactly equal to it.

This procedure serves two purposes. First, it allows us to test whether observed ranking or shortlisting patterns persist when the underlying candidate pool reflects realistic population proportions rather than a perfectly balanced synthetic setup. Second, by only removing male CVs at random, we avoid injecting any additional structure or bias into the data beyond the demographic shift itself.

Overall, this step helps ensure that subsequent analyses are robust to reasonable changes in the base rate of gender representation and are not artifacts of an artificially balanced dataset.
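
To make the per-domain adjustment concrete, the arithmetic can be sketched as below. The helper name `males_to_keep` is hypothetical; the group sizes of 90, 60, and 30 female CVs are the per-domain counts that appear in this dataset, assuming each domain starts with the same number of male CVs.

```python
import math

RATIO_F_OVER_M = 1.044  # 104.4 women per 100 men [1]


def males_to_keep(n_female: int, n_male: int, ratio: float = RATIO_F_OVER_M) -> int:
    """Largest male count that keeps female/male at or above `ratio`."""
    return min(n_male, math.floor(n_female / ratio))


for n in (90, 60, 30):
    kept = males_to_keep(n, n)
    print(n, kept, round(n / kept, 4))
# 90 86 1.0465
# 60 57 1.0526
# 30 28 1.0714
```

Rounding down means the achieved ratio overshoots 1.044 slightly, and the overshoot is larger for smaller domains.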

**Sources:** \
[1] European Commission, Eurostat. (2025). Demography of Europe – 2025 edition [Interactive publication]. Publications Office of the European Union. https://ec.europa.eu/eurostat/web/interactive-publications/demography-2025

In [2]:
import math
import pandas as pd

# ---- config ----
CSV_PATH = "cv_records.csv"
DOMAIN_COL = "domain"      # change if your column is named differently
GENDER_COL = "gender"      # change if your column is named differently
MALE_VALUE = "male"        # change if you use "M", "man", etc.
FEMALE_VALUE = "female"    # change if you use "F", "woman", etc.

RATIO_F_OVER_M = 1.044     # 104.4 women per 100 men
RANDOM_SEED = 42           # for reproducible sampling
OUT_PATH = "cv_records_eu_ratio_by_domain.csv"
# ----------------

df = pd.read_csv(CSV_PATH)

def downsample_males_to_ratio(group: pd.DataFrame) -> pd.DataFrame:
    females = group[group[GENDER_COL] == FEMALE_VALUE]
    males   = group[group[GENDER_COL] == MALE_VALUE]

    F = len(females)
    M = len(males)

    # We only delete males; females stay fixed.
    # Keep the largest M_keep such that F / M_keep >= RATIO_F_OVER_M
    # (i.e., at least 4.4% more females than kept males).
    if F == 0 or M == 0:
        return group  # nothing sensible to do (or nothing to delete)

    M_keep = min(M, math.floor(F / RATIO_F_OVER_M))

    # If floor makes it impossible to keep any males, keep none.
    if M_keep <= 0:
        return females

    males_keep = males.sample(n=M_keep, random_state=RANDOM_SEED)
    return pd.concat([females, males_keep], ignore_index=True)

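# Downsample each domain separately. Iterating over the groups keeps the domain
# column in every group and sidesteps the pandas deprecation warning raised when
# groupby.apply operates on the grouping column.
df_balanced = pd.concat(
    [downsample_males_to_ratio(group) for _, group in df.groupby(DOMAIN_COL)],
    ignore_index=True,
)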

df_balanced.to_csv(OUT_PATH, index=False)

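# Optional quick check: resulting ratio per domain
check = (
    df_balanced.pivot_table(index=DOMAIN_COL, columns=GENDER_COL, aggfunc="size", fill_value=0)
              .rename_axis(None, axis=1)
              .reindex(columns=[FEMALE_VALUE, MALE_VALUE], fill_value=0)  # ensure both columns exist
)
# Avoid division by zero: domains with no male CVs get NaN instead of a ratio.
check["female_over_male"] = check[FEMALE_VALUE] / check[MALE_VALUE].where(check[MALE_VALUE] > 0)
print(check.sort_index())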

           female  male  female_over_male
domain                                   
business       90    86          1.046512
creative       30    28          1.071429
education      90    86          1.046512
health         90    86          1.046512
public         60    57          1.052632
tech           90    86          1.046512
trades         90    86          1.046512


