# Synthetic CV Dataset Generation

This notebook generates a structured synthetic dataset of CV-like text samples combining multiple demographic and professional dimensions.  
The dataset includes variations across names, genders, ethnicities, professional roles, and domains to serve as a foundation for later analysis of representation, bias, and model behavior.

## Data Sources

Several JSON configuration files define the ingredients for generating each CV:

* **`NAMES.json`** contains a list of synthetic names with associated gender and ethnicity labels.  
  Each entry has the structure:
  ```json
  {"name": "Ahmed Hassan", "gender": "male", "ethnicity": "arabic_middle_eastern"}
  ```

* **`ROLE_TO_DOMAIN.json`** maps each professional role (e.g., *Software Engineer*) to a broader domain category (e.g., *tech*, *health*, *education*).

* **`COMPANIES_BY_DOMAIN.json`** lists realistic company names associated with each domain.
  For example, technology roles draw from companies like *Google* or *IBM*, while healthcare roles draw from *Pfizer* or *Mayo Clinic*.

* **`SKILLS_BY_ROLE.json`** defines skill phrases representative of each job role, ensuring generated CV descriptions are coherent and domain-relevant.
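
Because the four files reference each other by role and domain, a quick consistency check can catch gaps before generation. The following is a minimal sketch, not the project's own code. Only the `NAMES.json` entry structure comes from the description above. The shapes assumed for the other three files (role to domain string, domain to list of companies, role to list of skill phrases), the sample skill phrase, and the helper `find_config_gaps` are all hypothetical.

```python
import json

# Hypothetical miniature versions of the four configuration files.
# Only the NAMES.json entry structure is taken from the text above;
# the shapes of the other three are assumptions for illustration.
NAMES = json.loads(
    '[{"name": "Ahmed Hassan", "gender": "male", '
    '"ethnicity": "arabic_middle_eastern"}]'
)
ROLE_TO_DOMAIN = {"Software Engineer": "tech"}
COMPANIES_BY_DOMAIN = {"tech": ["Google", "IBM"]}
SKILLS_BY_ROLE = {"Software Engineer": ["Python and distributed systems"]}


def find_config_gaps(names, role_to_domain, companies_by_domain, skills_by_role):
    """Return a list of readable problems; an empty list means consistent."""
    problems = []
    required = {"name", "gender", "ethnicity"}
    for i, entry in enumerate(names):
        missing = required - set(entry)
        if missing:
            problems.append(f"names[{i}] is missing {sorted(missing)}")
    for role, domain in role_to_domain.items():
        # Every role needs at least one company in its domain ...
        if not companies_by_domain.get(domain):
            problems.append(f"no companies for domain {domain!r} (role {role!r})")
        # ... and at least one skill phrase of its own.
        if not skills_by_role.get(role):
            problems.append(f"no skills for role {role!r}")
    return problems


gaps = find_config_gaps(NAMES, ROLE_TO_DOMAIN, COMPANIES_BY_DOMAIN, SKILLS_BY_ROLE)
print(gaps)
```

With the real files, the same function could be called on the results of `json.load` for each file, assuming their shapes match the ones guessed here.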

## Generation Process

The function `generate_dataset()` (defined in `generate_dataset.py`) combines the information from these JSON files.
For every pairing of a name in `NAMES.json` with a role in `ROLE_TO_DOMAIN.json`, it produces one record as follows:

1. The domain of the role is looked up.
2. A random company corresponding to that domain is selected.
3. A skill phrase matching the role is retrieved.
4. These components are combined into a CV-style sentence using a fixed text template:

   ```
   {name} - {role} at {company}, experienced in {skills}.
   ```
5. Each record is written to `cv_records.csv` along with the associated metadata:

   * name
   * description
   * gender
   * ethnicity
   * role
   * domain
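
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `generate_dataset.py`: the function names `build_records` and `write_records`, the `seed` parameter, the sample role *Nurse*, the sample skill phrases, and the choice of one skill string per role are all hypothetical. The sketch writes to an in-memory buffer rather than to `cv_records.csv` so that it can run anywhere.

```python
import csv
import io
import random

TEMPLATE = "{name} - {role} at {company}, experienced in {skills}."
FIELDS = ["name", "description", "gender", "ethnicity", "role", "domain"]


def build_records(names, role_to_domain, companies_by_domain, skills_by_role, seed=0):
    """Create one record per (name, role) pair."""
    rng = random.Random(seed)  # seeded so repeated runs pick the same companies
    records = []
    for person in names:
        for role, domain in role_to_domain.items():   # step 1: domain lookup
            company = rng.choice(companies_by_domain[domain])  # step 2
            skills = skills_by_role[role]                      # step 3
            description = TEMPLATE.format(                     # step 4
                name=person["name"], role=role, company=company, skills=skills
            )
            records.append({
                "name": person["name"],
                "description": description,
                "gender": person["gender"],
                "ethnicity": person["ethnicity"],
                "role": role,
                "domain": domain,
            })
    return records


def write_records(records, handle):
    """Step 5: write the records as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


# Hypothetical sample inputs; the real values live in the JSON files.
names = [{"name": "Ahmed Hassan", "gender": "male",
          "ethnicity": "arabic_middle_eastern"}]
role_to_domain = {"Software Engineer": "tech", "Nurse": "health"}
companies_by_domain = {"tech": ["Google", "IBM"],
                       "health": ["Pfizer", "Mayo Clinic"]}
skills_by_role = {"Software Engineer": "Python and distributed systems",
                  "Nurse": "patient care and triage"}

records = build_records(names, role_to_domain, companies_by_domain, skills_by_role)
buffer = io.StringIO()
write_records(records, buffer)
print(len(records))
print(records[0]["description"])
```

Passing an open file handle (for example one opened with `newline=""`, as the `csv` module recommends) instead of the `StringIO` buffer would write the same rows to disk.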

## Output

After execution, the notebook prints the total number of generated CV entries and saves the dataset to:
`cv_records.csv`

Each row can then be used for downstream tasks such as model evaluation, bias measurement, or embedding similarity analysis.
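
As a first step toward such analyses, it can help to confirm how the rows are distributed across the metadata columns. The sketch below assumes the CSV has a header row with the six column names listed earlier. The inline sample rows and the helper `count_by` are hypothetical and stand in for the real file.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row sample in the assumed layout of cv_records.csv.
SAMPLE_CSV = """name,description,gender,ethnicity,role,domain
Ahmed Hassan,"Ahmed Hassan - Software Engineer at Google, experienced in Python.",male,arabic_middle_eastern,Software Engineer,tech
Ahmed Hassan,"Ahmed Hassan - Nurse at Mayo Clinic, experienced in patient care.",male,arabic_middle_eastern,Nurse,health
"""


def count_by(rows, column):
    """Count how many rows share each value of the given column."""
    return Counter(row[column] for row in rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
role_counts = count_by(rows, "role")
gender_counts = count_by(rows, "gender")
print(len(rows), dict(role_counts), dict(gender_counts))
```

To run this against the generated file, replace `io.StringIO(SAMPLE_CSV)` with a handle opened on `cv_records.csv`.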

In [1]:

```python
from generate_dataset import generate_dataset

generate_dataset()
```

✅ Generated 1080 instances.
Saved to cv_records.csv
