## Create Perturbed, Unique examples

In the notebook create_perturbed_answer.ipynb I tried to generate fake data using Qwen. Unfortunately this did not work, because the model tends to create duplicates (it has no memory of what it generated before).

Here I am using a more straightforward method:

 - For PII that is a sequence of digits/characters: add random noise by flipping some characters.
 - For PII that is a proper name: generate subcomponents of these proper names, then shuffle them to generate many unique examples.

In this notebook I want to create many more 'fake' PII values for each of my categories, so that I have some nice examples.
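
The name-shuffling idea can be sketched as follows. This is only a minimal illustration, not the code used later in the notebook: `combine_components` and the tiny name lists are hypothetical, and it assumes a name is simply "first last".

```python
import itertools
import random

def combine_components(first_names, last_names, n, seed=0):
    """Return n distinct 'First Last' strings built from the component lists."""
    pairs = (f"{first} {last}" for first, last in itertools.product(first_names, last_names))
    # dict.fromkeys drops accidental repeats while keeping a stable order
    all_names = list(dict.fromkeys(pairs))
    if n > len(all_names):
        raise ValueError(f"only {len(all_names)} unique combinations available")
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    return rng.sample(all_names, n)  # sampling without replacement: no duplicates

# Hypothetical component lists, for illustration only
names = combine_components(["Giovanni", "Maria", "Luca"], ["Rossi", "Conti"], n=4)
print(names)
```

Sampling without replacement from the full set of combinations is what avoids duplicates here, so no "memory" of earlier outputs is needed.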


In [2]:
import pandas as pd
import json

file_path = '/projects/0/hpmlprjs/LLM/danp/UGBench/my_files/pii_dataset/data/qa_pairs_full.json'

with open(file_path, 'r', encoding='utf-8') as f:
    data = json.load(f)
# Convert to DataFrame
qa_df = pd.DataFrame(data)

Plan per category:

- email_address, twitter_username: set aside (these should be easy to generate with the normal vLLM strategy).
- phone_number, DOB, latest_bank_transaction, credit_card_nr, bank_account_number: set aside (for these it is easier to generate the perturbed examples manually).
- Everything else: prompt vLLM to generate a list of unique PII values that are not already among the unique values (that will be the only instruction for the PII).
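
Whatever the generator returns, the "not in the unique values" condition can also be enforced after generation. A minimal sketch, assuming the generated candidates arrive as a plain list of strings; `keep_novel` and the sample values are hypothetical, and no vLLM call is shown:

```python
def keep_novel(candidates, existing):
    """Drop candidates that already exist (case-insensitive) or repeat earlier candidates."""
    seen = {value.strip().lower() for value in existing}
    novel = []
    for candidate in candidates:
        key = candidate.strip().lower()
        if key and key not in seen:
            seen.add(key)
            novel.append(candidate.strip())
    return novel

# Hypothetical values, for illustration only
existing_hospitals = ["Ospedale San Raffaele"]
generated = ["Ospedale San Raffaele", "Clinica Verdi", "clinica verdi ", "Policlinico Dante"]
print(keep_novel(generated, existing_hospitals))
# ['Clinica Verdi', 'Policlinico Dante']
```

In practice `existing` could be the matching list from `unique_values_dict`.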

In [3]:
columns = [
    'full_name', 'partner_name', 'email_address', 'twitter_username', 
    'home_address', 'work_address', 'phone_number', 'Occupation', 
    'DOB', 'credit_card_nr', 'bank_account_number', 'bank_name', 
    'latest_bank_transaction', 'financial_consultant_name', 'health_insurance_nr', 
    'hospital_name', 'doctor_name', 'country', 'disease', 'treatment'
]

unique_values_dict = {col: qa_df[col].dropna().unique().tolist() for col in columns}

In [4]:
unique_values_dict['person'] = (
    unique_values_dict.get('full_name', []) +
    unique_values_dict.get('financial_consultant_name', []) +
    unique_values_dict.get('doctor_name', [])
)
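
Note that plain list concatenation keeps repeats, for example when the same name appears both as a full name and as a doctor name. If that matters, an order-preserving de-duplication could look like this sketch (`merge_unique` and the sample names are hypothetical):

```python
def merge_unique(*lists):
    """Concatenate lists and drop repeated values, keeping first-seen order."""
    merged = [value for values in lists for value in values]
    return list(dict.fromkeys(merged))

# Hypothetical values, for illustration only
full_names = ["Anna Rossi", "Luca Conti"]
doctor_names = ["Luca Conti", "Marta Greco"]
print(merge_unique(full_names, doctor_names))
# ['Anna Rossi', 'Luca Conti', 'Marta Greco']
```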


In [5]:

countries_grouped = {
    "Southern & Western Europe": ["Italy", "Spain"],
    "Central & Western Europe": ["France", "Switzerland", "Germany", "Netherlands"],
    "Nordic Countries": ["Sweden", "Norway", "Denmark", "Finland"],
    "Anglophone Countries": ["United Kingdom", "US", "Canada", "Australia", "New Zealand"],
    "East Asia": ["Japan", "South Korea"],
    "Eastern Europe": ["Russia"]
}

In [6]:
import pandas as pd
import random
import string
import re
from datetime import datetime, timedelta
import json

def perturb_character_based(value, substitution_rate=0.2, preserve_format=True):
    """
    Perturb a string by randomly substituting characters.
    Preserves the format (digits->digits, letters->letters) if preserve_format is True.
    """
    result = list(value)
    for i in range(len(result)):
        if random.random() < substitution_rate:
            if preserve_format:
                if result[i].isdigit():
                    result[i] = random.choice(string.digits)
                elif result[i].isalpha():
                    if result[i].islower():
                        result[i] = random.choice(string.ascii_lowercase)
                    else:
                        result[i] = random.choice(string.ascii_uppercase)
                # If not a digit or letter (e.g., '-', '@'), keep as is
            else:
                # Less strict substitution
                result[i] = random.choice(string.ascii_letters + string.digits)
    return ''.join(result)

def perturb_email_or_username(value):
    """Generate a perturbed version of an email address or username."""
    return perturb_character_based(value, substitution_rate=0.3)

def perturb_health_insurance_or_bank_account(value):
    """Generate a perturbed version of a health insurance or bank account number."""
    return perturb_character_based(value, substitution_rate=0.3)

def perturb_dob(value):
    """
    Generate a perturbed DOB that's within 20 years of the original.
    Expects format: dd/mm/yyyy
    """
    try:
        # Parse the original date
        original_date = datetime.strptime(value, "%d/%m/%Y")
        
        # Generate a random adjustment within 20 years (in days)
        days_adjustment = random.randint(-365*20, 365*20)
        new_date = original_date + timedelta(days=days_adjustment)
        
        # Format back to dd/mm/yyyy
        return new_date.strftime("%d/%m/%Y")
    except (ValueError, TypeError):
        # Fallback if parsing fails
        return perturb_character_based(value, substitution_rate=0.3)
    


def perturb_bank_date(value):
    """
    Generate a perturbed transaction date that falls up to 20 years
    before the original (never after it).
    Expects format: dd/mm/yyyy
    """
    try:
        # Parse the original date
        original_date = datetime.strptime(value, "%d/%m/%Y")

        # Shift the date back by a random number of days (0 to roughly 20 years)
        days_adjustment = random.randint(-365*20, 0)
        new_date = original_date + timedelta(days=days_adjustment)

        # Format back to dd/mm/yyyy
        return new_date.strftime("%d/%m/%Y")
    except (ValueError, TypeError):
        # Fallback if parsing fails
        return perturb_character_based(value, substitution_rate=0.3)

def perturb_credit_card(value):
    """
    Generate a perturbed credit card number.
    Maintains the format xxxx-xxxx-xxxx-xxxx but changes the digits.
    """
    # Check if the value matches the expected format
    pattern = r'^\d{4}-\d{4}-\d{4}-\d{4}$'
    if re.match(pattern, value):
        # Keep the same format but change all digits
        parts = []
        for _ in range(4):
            part = ''.join(random.choice(string.digits) for _ in range(4))
            parts.append(part)
        return '-'.join(parts)
    else:
        # Handle unexpected format by character substitution
        return perturb_character_based(value, substitution_rate=0.4)

def perturb_phone_number(value):
    """
    Generate a perturbed phone number.
    Maintains the format xxx - xxx - xxxx but changes the digits.
    """
    # Extract digits only
    digits = ''.join(filter(str.isdigit, value))
    
    if len(digits) == 10:  # Standard US format
        area = ''.join(random.choice(string.digits) for _ in range(3))
        prefix = ''.join(random.choice(string.digits) for _ in range(3))
        line = ''.join(random.choice(string.digits) for _ in range(4))
        
        # Match the original format (with or without spaces/dashes)
        if ' - ' in value:
            return f"{area} - {prefix} - {line}"
        elif '-' in value:
            return f"{area}-{prefix}-{line}"
        else:
            return f"{area}{prefix}{line}"
    else:
        # Handle unexpected format
        return perturb_character_based(value, substitution_rate=0.4)

def perturb_bank_transaction(value):
    """
    Perturb a bank transaction by changing the amount and date.
    Expected formats:
    - <amount_of_money> <currency> on <date_of_transaction> (e.g., "100 USD on 01/01/2024")
    - <currency><amount_of_money> on <date_of_transaction> (e.g., "$100 on 01/01/2024")
    - <currency> <amount_of_money> on <date_of_transaction> (e.g., "USD 100 on 01/01/2024")
    """
    # Pattern 1: Currency after amount: "100 USD on 01/01/2024" or "1,234.56 USD on 01/01/2024"
    pattern1 = r'([\d,]+(?:\.\d+)?)\s+([^\d\s\.\/,]+)\s+on\s+(\d{2}/\d{2}/\d{4})'
    # Pattern 2: Currency attached before amount: "$100 on 01/01/2024" or "€1,234.56 on 01/01/2024"
    pattern2 = r'([^\d\s\.\/,]+)([\d,]+(?:\.\d+)?)\s+on\s+(\d{2}/\d{2}/\d{4})'
    # Pattern 3: Currency separated before amount: "USD 100 on 01/01/2024" or "EUR 1,234.56 on 01/01/2024"
    pattern3 = r'([^\d\s\.\/,]+)\s+([\d,]+(?:\.\d+)?)\s+on\s+(\d{2}/\d{2}/\d{4})'
    
    # Try each pattern in order
    match = re.search(pattern1, value)
    if match:
        format_type = 1  # Currency after amount
    else:
        match = re.search(pattern2, value)
        if match:
            format_type = 2  # Currency attached before amount
        else:
            match = re.search(pattern3, value)
            if match:
                format_type = 3  # Currency separated before amount
            else:
                format_type = 0  # No match
    
    if match:
        if format_type == 1:  # "100 USD on..."
            # Remove commas from the amount string before converting to float
            amount_str = match.group(1).replace(',', '')
            original_amount = float(amount_str)
            currency = match.group(2).strip()
            original_date = match.group(3)
        elif format_type == 2:  # "$100 on..."
            currency = match.group(1).strip()
            # Remove commas from the amount string before converting to float
            amount_str = match.group(2).replace(',', '')
            original_amount = float(amount_str)
            original_date = match.group(3)
        else:  # format_type == 3, "USD 100 on..."
            currency = match.group(1).strip()
            # Remove commas from the amount string before converting to float
            amount_str = match.group(2).replace(',', '')
            original_amount = float(amount_str)
            original_date = match.group(3)
        
        # Perturb amount by ±50%
        new_amount = original_amount * random.uniform(0.5, 1.5)
        new_amount_rounded = round(new_amount, 2)
        
        # Perturb date
        new_date = perturb_bank_date(original_date)
        
        # Preserve the original format
        if format_type == 1:  # Currency after amount
            return f"{new_amount_rounded} {currency} on {new_date}"
        elif format_type == 2:  # Currency attached before amount
            return f"{currency}{new_amount_rounded} on {new_date}"
        else:  # format_type == 3, Currency separated before amount
            return f"{currency} {new_amount_rounded} on {new_date}"
    else:
        # Fallback for unexpected format
        return perturb_character_based(value, substitution_rate=0.3)

def generate_perturbed_examples(pii_dict, num_examples=5):
    """
    Generate a specified number of perturbed examples for each PII in the dict.
    Returns a dictionary with PII types as keys and lists of perturbed values.
    """
    perturbed_examples = {}
    
    for item in pii_dict:
        pii_type = item['type']
        original_value = item['value']
        
        perturbed_list = []
        for _ in range(num_examples):
            if pii_type in ['email_address', 'twitter_username']:
                perturbed = perturb_email_or_username(original_value)
            elif pii_type in ['health_insurance_nr', 'bank_account_number']:
                perturbed = perturb_health_insurance_or_bank_account(original_value)
            elif pii_type == 'DOB':
                perturbed = perturb_dob(original_value)
            elif pii_type == 'credit_card_nr':
                perturbed = perturb_credit_card(original_value)
            elif pii_type == 'phone_number':
                perturbed = perturb_phone_number(original_value)
            elif pii_type == 'latest_bank_transaction':
                perturbed = perturb_bank_transaction(original_value)
            else:
                perturbed = perturb_character_based(original_value, substitution_rate=0.2)
            
            perturbed_list.append(perturbed)
        
        # Store the results
        perturbed_examples[pii_type] = perturbed_list
    
    return perturbed_examples

# Function to apply to the dataframe
def add_perturbed_pii_column(qa_df):
    # Create a new column with perturbed examples
    qa_df['perturbed_pii_dict_noise'] = qa_df['pii_picked_dict'].apply(
        lambda x: generate_perturbed_examples(x) if isinstance(x, list) else {}
    )
    return qa_df

qa_df = add_perturbed_pii_column(qa_df)
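
Because `perturb_character_based` picks replacement characters at random, it can return the original value, or the same perturbation twice. If strictly unique examples are needed, one option is a small retry wrapper. This is a sketch under assumptions: `unique_perturbations` and the stand-in `flip_digits` are hypothetical helpers, not part of the notebook, and the sample account number is made up. Any of the `perturb_*` functions above could be passed in instead.

```python
import random
import string

def flip_digits(value, rate=0.3, rng=random):
    """Stand-in perturbation: replace each digit with probability `rate`."""
    return "".join(
        rng.choice(string.digits) if ch.isdigit() and rng.random() < rate else ch
        for ch in value
    )

def unique_perturbations(perturb, value, n, max_tries=1000):
    """Call `perturb(value)` until n distinct results, none equal to `value`, are collected."""
    results = []
    seen = {value}  # seeding with the original rules out "unperturbed" outputs
    for _ in range(max_tries):
        candidate = perturb(value)
        if candidate not in seen:
            seen.add(candidate)
            results.append(candidate)
            if len(results) == n:
                return results
    raise RuntimeError(f"found only {len(results)} unique values in {max_tries} tries")

random.seed(0)  # fixed seed so reruns give the same examples
examples = unique_perturbations(flip_digits, "IT82704248349270123456", n=5)
print(examples)
```

The `max_tries` cap keeps the loop from running forever on values that have very few possible perturbations, such as very short strings.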

In [7]:
perturbed_cats = [
    "email_address",
    "twitter_username",
    "phone_number",
    "DOB",
    "credit_card_nr",
    "bank_account_number",
    "latest_bank_transaction",
    "health_insurance_nr"
]

for idx,row in qa_df.iterrows():
    value = row['perturbed_pii_dict_noise']
    for k,v in value.items():

        if k in perturbed_cats:
            print(row['perturbed_pii_dict_noise'])
            print(v)
            print('--------------')

{'bank_account_number': ['IT82404248889270223456', 'IT83732248379278160456', 'IW80708748509276003776', 'IZ52704240309270732409', 'IP42704248349270143821'], 'financial_consultant_name': ['Fededima Lufia Bmuni', 'Fedirqca Lucia Brunf', 'Fzderioq Guoga Kkuni', 'Fowerjcp Ohnia Brwni', 'Federica Lucia Qruni']}
['IT82404248889270223456', 'IT83732248379278160456', 'IW80708748509276003776', 'IZ52704240309270732409', 'IP42704248349270143821']
--------------
{'phone_number': ['704 - 575 - 0860', '279 - 628 - 2469', '011 - 887 - 3799', '273 - 447 - 9650', '716 - 358 - 8130'], 'home_address': ['Via San Domenzco 14', 'Vba San Zxmenico 14', 'Viw Xan Vomenico 14', 'Vit San Doceuicg 19', 'Via Sap Domenico 16']}
['704 - 575 - 0860', '279 - 628 - 2469', '011 - 887 - 3799', '273 - 447 - 9650', '716 - 358 - 8130']
--------------
{'email_address': ['m.tzrnesi88@nibero.ie', 'm.farflpi86@libero.wt', 'm.osrnxsi88@ltbero.it', 'm.oarcesi88@liiero.tt', 'm.fyrnexi88@liblro.rt']}
['m.tzrnesi88@nibero.ie', 'm.farfl

In [8]:
import random
import uuid # Used for generating fake, uncommon company/bank names if lists are small

# Define the country groupings
countries_grouped = {
    "Southern & Western Europe": ["Italy", "Spain"],
    "Central & Western Europe": ["France", "Switzerland", "Germany", "Netherlands"],
    "Nordic Countries": ["Sweden", "Norway", "Denmark", "Finland"],
    "Anglophone Countries": ["United Kingdom", "US", "Canada", "Australia", "New Zealand"],
    "East Asia": ["Japan", "South Korea"],
    "Eastern Europe": ["Russia"]
}

# --- Global English Job Titles ---
# Used for all countries
global_job_titles = [
    "Engineer", "Doctor", "Lawyer", "Teacher", "Consultant", "Accountant",
    "Sales Manager", "Project Manager", "Nurse", "Architect", "Analyst",
    "Coordinator", "Specialist", "Administrator", "Scientist", "Researcher",
    "Technician", "Supervisor", "Director", "Assistant", "Manager",
    "Developer", "Designer", "Editor", "Writer", "Artist", "Musician",
    "Chef", "Baker", "Electrician", "Plumber", "Carpenter", "Machinist",
    "Librarian", "Psychologist", "Economist", "Statistician", "Professor",
    "Janitor", "Security Guard", "Driver", "Pilot", "Mechanic", "Veterinarian"
]


# --- Illustrative PII Components Data (English Alphabet Only) ---
# Transliterated names, street types, etc.
pii_components = {
    "Southern & Western Europe": {
        "Italy": {
            "first_names": ["Giovanni", "Maria", "Francesco", "Sofia", "Alessandro", "Giulia", "Antonio", "Anna", "Marco", "Luca", "Chiara", "Matteo", "Sara", "Davide", "Elena", "Paolo"],
            "last_names": ["Rossi", "Ferrari", "Russo", "Conti", "Bianchi", "Esposito", "Romano", "Colombo", "Ricci", "Marino", "Greco", "Gallo", "Bruno", "Barbieri", "Lombardi", "Moretti"],
            "street_types": ["Via", "Piazza", "Corso", "Viale", "Largo", "Strada", "Vicolo", "Contrada", "Lungomare", "Salita"],
            "street_names": ["Roma", "Milano", "Napoli", "Venezia", "Garibaldi", "Cavour", "Manzoni", "Verdi", "Dante", "Petrarca", "Leopardi", "Mazzini", "Colombo", "Unita", "Europa", "Marconi", "Risorgimento", "Novembre", "Maggio", "Liberta"],
            "company_roots": ["Impresa", "Studio", "Societa", "Gruppo", "Officine", "Servizi", "Edilizia", "Finanza", "Costruzioni", "Tecnologie", "Soluzioni", "Logistica", "Manifattura", "Commercio", "Innovazione", "Progetti"],
            "company_suffixes": ["SRL", "SPA", "e Figli", "", "SNC", "SAS", "Azienda Agricola", "Cooperativa"],
            "bank_prefixes": ["Banca", "Credito", "Istituto di Credito", "Cassa di Risparmio", "Banco", "Unione di Banche"],
            "bank_roots": ["Italiano", "Regionale", "Popolare", "Centrale", "Finanziaria", "Nazionale", "Crediti", "Agricolo", "Commercio", "Sviluppo", "Risparmio", "Toscana", "Veneto", "Lombarda"],
            "bank_suffixes": ["Spa", "Gruppo", "", "Cooperativo", "Consorzio", "Nazionale"],
            "hospital_prefixes": ["Ospedale", "Clinica", "Policlinico", "Centro Medico", "Istituto", "Casa di Cura", "Presidio Ospedaliero", "Azienda Ospedaliera"],
            "hospital_roots": ["Civile", "Generale", "Regionale", "Universitario", "San Raffaele", "Sant'Andrea", "Madonnina", "Riuniti", "Fatebenefratelli", "Niguarda", "Spallanzani", "Bambino Gesu", "Cardarelli", "Pini"],
            "hospital_suffixes": ["", "SpA", "Fondazione", "IRCCS"],
        },
        "Spain": {
             "first_names": ["Manuel", "Sofia", "Javier", "Lucia", "Alejandro", "Maria", "David", "Laura", "Pablo", "Carmen", "Daniel", "Paula", "Adrian", "Elena", "Sergio", "Marta"],
             "last_names": ["Garcia", "Fernandez", "Lopez", "Martin", "Sanchez", "Gonzalez", "Rodriguez", "Perez", "Gomez", "Martinez", "Jimenez", "Ruiz", "Alonso", "Hernandez", "Diaz", "Moreno"],
             "street_types": ["Calle", "Plaza", "Avenida", "Paseo", "Carrera", "Ronda", "Camino", "Glorieta", "Travesia", "Bulevar"],
             "street_names": ["Mayor", "Sol", "Princesa", "Gran Via", "Colon", "Diagonal", "Castellana", "La Paz", "Cervantes", "Lopez de Vega", "Goya", "Velazquez", "Libertad", "Constitucion", "Independencia", "Real", "Nueva", "San Juan", "Reyes Catolicos", "America"],
             "company_roots": ["Empresa", "Grupo", "Estudio", "Servicios", "Consultoria", "Comercial", "Tecnica", "Soluciones", "Construcciones", "Internacional", "Industrial", "Inversiones", "Proyectos", "Desarrollo"],
             "company_suffixes": ["SL", "SA", "e Hijos", "", "S.Com.", "S.L.U.", "S.A.U.", "Cooperativa"],
             "bank_prefixes": ["Banco", "Caja", "Credito", "Banco Popular", "Caja Rural", "Banco de Ahorro"],
             "bank_roots": ["Espanol", "Nacional", "Comarcal", "Financiero", "Central", "Atlantico", "Popular", "Santander", "Bilbao", "Vizcaya", "Sabadell", "Andalucia", "Galicia", "Valencia"],
             "bank_suffixes": ["SA", "", "Grupo", "Entidad"],
             "hospital_prefixes": ["Hospital", "Clinica", "Centro Medico", "Sanatorio", "Instituto Medico", "Complejo Hospitalario"],
             "hospital_roots": ["General", "Universitario", "Provincial", "La Salud", "Quiron", "San Pablo", "La Fe", "Doce de Octubre", "Ramon y Cajal", "Gregorio Maranon", "Virgen del Rocio", "Carlos Haya", "Santa Creu", "Bellvitge"],
             "hospital_suffixes": ["", "SA", "SL", "Publico"],
        }
    },
    "Central & Western Europe": {
        "France": {
            "first_names": ["Jean", "Marie", "Pierre", "Sophie", "Antoine", "Camille", "Louis", "Chloe", "Lucas", "Manon", "Gabriel", "Louise", "Arthur", "Emma", "Hugo", "Alice"],
            "last_names": ["Martin", "Bernard", "Dubois", "Thomas", "Robert", "Petit", "Durand", "Leroy", "Moreau", "Simon", "Laurent", "Lefevre", "Roux", "Fournier", "Garcia", "Michel"],
            "street_types": ["Rue", "Avenue", "Boulevard", "Place", "Allee", "Impasse", "Chemin", "Route", "Quai", "Cours"],
            "street_names": ["Paris", "Lyon", "Marseille", "Liberte", "Republique", "Victor Hugo", "Pasteur", "Gambetta", "Clemenceau", "General de Gaulle", "Jean Jaures", "Jeanne dArc", "Foch", "Verdun", "Gare", "Eglise", "Chateau", "Moulin"],
            "company_roots": ["Societe", "Bureau", "Groupe", "Ateliers", "Companie", "Services", "Consultants", "Ingenierie", "Technologie", "Distribution", "Developpement", "Construction", "Financiere", "Solutions"],
            "company_suffixes": ["SARL", "SA", "et Fils", "", "SAS", "SNC", "EURL", "Cooperative"],
            "bank_prefixes": ["Banque", "Credit", "Societe", "Caisse dEpargne", "Comptoir National", "Banque Privee"],
            "bank_roots": ["National", "Regional", "Populaire", "Financier", "Mutuel", "Agricole", "Industriel", "France", "Paris", "Lyon", "Europeenne", "Transatlantique", "Commercial", "Investissement"],
            "bank_suffixes": ["SA", "", "Groupe", "Privee"],
            "hospital_prefixes": ["Hopital", "Clinique", "Centre Hospitalier", "Polyclinique", "Institut", "Maison Medicale"],
            "hospital_roots": ["General", "Universitaire", "Regional", "La Sante", "Saint-Louis", "Cochin", "Pitie-Salpetriere", "Bichat", "Necker", "Broussais", "Europeen Georges Pompidou", "Sainte Anne", "Val de Grace", "Civil"],
            "hospital_suffixes": ["", "SA", "Prive", "Universitaire"],
        },
        "Switzerland": {
             "first_names": ["Hans", "Ursula", "Peter", "Brigitte", "Martin", "Claudia", "Stefan", "Andrea", "Daniel", "Monika", "Christian", "Sandra", "Markus", "Nicole", "Michael", "Anna"],
             "last_names": ["Mueller", "Schneider", "Keller", "Weber", "Meier", "Huber", "Fischer", "Gautschi", "Baumann", "Frei", "Widmer", "Gerber", "Schmid", "Brunner", "Suter", "Wyss"],
             "street_types": ["Strasse", "Weg", "Gasse", "Platz", "Allee", "Rue", "Via", "Chemin", "Sentier", "Promenade"],
             "street_names": ["Bahnhof", "Haupt", "Dorf", "Berg", "Tal", "Muster", "Kirch", "Sonnen", "Zentral", "Seiden", "Bundesplatz", "Limmatquai", "Rue du Rhone", "Via Nassa", "Paradeplatz", "Marktgasse", "Rosenweg", "Lindenhof", "See", "Wald"],
             "company_roots": ["AG", "GmbH", "Group", "Technik", "Systeme", "Consulting", "Finanz", "Holding", "Pharma", "Solutions", "Services", "International", "Trading", "Management"],
             "company_suffixes": ["AG", "GmbH", "SA", "", "SARL", "Holding", "Partner", "International"], 
             "bank_prefixes": ["Bank", "Credit", "Raiffeisen", "Banque Cantonale", "Privatbank", "Hypothekarbank"],
             "bank_roots": ["Schweizerisch", "National", "Kantonal", "Finanz", "Union", "Zentral", "Alpin", "Zurich", "Geneve", "Bern", "Vaudois", "Lombard", "Odier", "Julius Baer"],
             "bank_suffixes": ["AG", "", "SA", "Gruppe"],
             "hospital_prefixes": ["Spital", "Klinik", "Gesundheitszentrum", "Hopital Cantonal", "Universitaetsklinik", "Privatklinik"],
            "hospital_roots": ["Allgemein", "Kantonsspital", "Universitaetsspital", "Hirslanden", "Bethesda", "Zurich", "Geneva", "Bern", "Basel", "Luzern", "Insel", "Cecil", "Beau Site", "Lindenhof"],
            "hospital_suffixes": ["AG", "", "SA", "Stiftung"],
        },
        "Germany": {
            "first_names": ["Thomas", "Andrea", "Michael", "Sabine", "Andreas", "Christine", "Stefan", "Claudia", "Christian", "Julia", "Alexander", "Nicole", "Markus", "Stefanie", "Daniel", "Anja"],
            "last_names": ["Schmidt", "Fischer", "Weber", "Meyer", "Wagner", "Becker", "Schulz", "Hoffmann", "Schaefer", "Koch", "Bauer", "Richter", "Klein", "Wolf", "Schroeder", "Neumann"],
            "street_types": ["Strasse", "Weg", "Allee", "Platz", "Gasse", "Ring", "Damm", "Ufer", "Promenade", "Chaussee"],
            "street_names": ["Haupt", "Bahnhof", "Berg", "Wald", "Garten", "Goethe", "Schiller", "Friedrich", "Karl", "Mittel", "Kirch", "Schul", "Linden", "Birken", "Eichen", "Markt", "Post", "Rosen", "Sonnen", "Nord"],
            "company_roots": ["GmbH", "AG", "Werke", "Systeme", "Consulting", "Industrie", "Service", "Technologie", "Handel", "Bau", "Finanz", "Logistik", "Automobil", "Energie"],
            "company_suffixes": ["GmbH", "AG", "KG", "", "und Co KG", "eG", "Stiftung", "KGaA"],
            "bank_prefixes": ["Bank", "Sparkasse", "Volksbank", "Deutsche", "Commerzbank", "HypoVereinsbank"],
            "bank_roots": ["Deutsche", "Nationale", "Sparkasse", "Volksbank", "Landesbank", "Kommunal", "Hanseatisch", "Berliner", "Frankfurter", "Hamburger", "Bayerische", "Mittelstand", "Direkt", "Investitions"],
            "bank_suffixes": ["AG", "", "eG", "KGaA", "und Co KG"],
            "hospital_prefixes": ["Krankenhaus", "Klinik", "Universitaetsklinikum", "Staedtisches Klinikum", "Fachklinik", "Bezirksklinikum"],
            "hospital_roots": ["Allgemein", "Staedtisch", "Universitaets", "Diakonie", "Marien", "Evangelisch", "Westend", "Charite", "Heidelberg", "Muenchen", "Hamburg Eppendorf", "Rechts der Isar", "Sankt Georg", "Elisabeth"],
            "hospital_suffixes": ["gGmbH", "AG", "", "Stiftung", "Klinikum", "Zentrum"],
        },
        "Netherlands": {
            "first_names": ["Jan", "Maria", "Piet", "Anna", "Hendrik", "Cornelia", "Dirk", "Elizabeth", "Johannes", "Wilhelmina", "Willem", "Johanna", "Kees", "Grietje", "Gerrit", "Neeltje"],
            "last_names": ["Jansen", "De Vries", "Bakker", "Van Dijk", "Smit", "De Jong", "Willems", "Peters", "Visser", "Bos", "Mulder", "Van den Berg", "Dekker", "Brouwer", "Jacobs", "Vermeulen"],
            "street_types": ["Straat", "Weg", "Laan", "Plein", "Gracht", "Dijk", "Singel", "Kade", "Steeg", "Pad", "Hof", "Markt"],
            "street_names": ["Dorp", "Kerk", "Molen", "Nieuw", "Oud", "Linden", "Beuken", "Eiken", "Park", "Kanaal", "Hoofd", "Voor", "Achter", "School", "Binnen", "Buiten", "Oranje", "Prins Hendrik", "Julian", "Wilhelmina"],
            "company_roots": ["BV", "NV", "Groep", "Advies", "Techniek", "Service", "Handel", "Holding", "Solutions", "Consulting", "International", "Bouw", "Transport", "Media"],
            "company_suffixes": ["BV", "NV", "VOF", "", "Holding", "Groep", "International", "Nederland"],
            "bank_prefixes": ["Bank", "Rabobank", "ING", "ABN AMRO"],
            "bank_roots": ["Nederlandse", "Regionale", "Cooperatieve", "Financieel", "Spaar", "Handels", "Volks", "Hypotheek", "Effecten", "Trust"],
            "bank_suffixes": ["NV", "", "Bank", "Groep"],
            "hospital_prefixes": ["Ziekenhuis", "Kliniek", "Medisch Centrum", "Academisch Medisch Centrum", "Universitair Medisch Centrum", "Streekziekenhuis"],
            "hospital_roots": ["Algemeen", "Regionaal", "Universitair", "Sint", "Groene Hart", "Academisch", "Stads", "Onze Lieve Vrouwe", "Antoni van Leeuwenhoek", "Erasmus", "LUMC", "Maastricht UMC", "Radboud", "Vrije Universiteit"],
            "hospital_suffixes": ["", "BV", "NV", "Stichting"],
        }
    },
    "Nordic Countries": {
        "Sweden": {
            "first_names": ["Lars", "Ingrid", "Anders", "Elisabeth", "Johan", "Christina", "Erik", "Sofia", "Mikael", "Anna", "Per", "Eva", "Karl", "Maria", "Daniel", "Emma"],
            "last_names": ["Andersson", "Johansson", "Karlsson", "Nilsson", "Eriksson", "Larsson", "Olsson", "Svensson", "Persson", "Gustafsson", "Pettersson", "Jonsson", "Holm", "Berg", "Lindberg", "Nyberg"],
            "street_types": ["Vagen", "Gatan", "Grand", "Torget", "Allen", "Stigen", "Leden", "Parken", "Stranden", "Bryggan"],
            "street_names": ["Storgatan", "Kyrkogatan", "Skogs", "Bergs", "Central", "Station", "Norra", "Sodra", "Kungsgatan", "Drottninggatan", "Vasagatan", "Ostra", "Vastra", "Industri", "Skol", "Hamn", "Strand", "Ring", "Park"],
            "company_roots": ["AB", "Gruppen", "Konsult", "Teknik", "Service", "Handel", "System", "Solutions", "Industri", "Partner", "Nordic", "Data"],
            "company_suffixes": ["AB", "", "HB", "KB", "Ek for"],
            "bank_prefixes": ["Bank", "Sparbank", "Handelsbanken", "Nordea", "SEB", "Swedbank"],
            "bank_roots": ["Svenska", "Nationella", "Lokala", "Finans", "Spar", "Hypotek", "Foretags", "Privat", "Investment", "Nordic"],
            "bank_suffixes": ["AB", "", "ASA", "Stadshypotek"],
            "hospital_prefixes": ["Sjukhus", "Klinik", "Vardcentral", "Lasarett", "Region", "Akademiska"],
            "hospital_roots": ["Allmanna", "Lans", "Universitets", "Karolinska", "St Gorans", "Central", "Stads", "Sahlgrenska", "Akademiska", "Danderyds", "Orebro", "Uppsala", "Linkoping", "Norrlands"],
            "hospital_suffixes": ["AB", "", "Region", "Landstinget"],
        },
        "Norway": {
            "first_names": ["Ole", "Hege", "Bjorn", "Anne", "Morten", "Camilla", "Espen", "Line", "Jan", "Inger", "Kjetil", "Marianne", "Thomas", "Hilde", "Per", "Silje"],
            "last_names": ["Hansen", "Jensen", "Kristiansen", "Andersen", "Pedersen", "Nilsen", "Eriksen", "Berg", "Larsen", "Johansen", "Olsen", "Solberg", "Bakken", "Moen", "Lien", "Andreassen"],
            "street_types": ["Veien", "Gata", "Plassen", "Gate", "Alleen", "Stien", "Bryggen", "Kaia", "Torget", "Kroken"],
            "street_names": ["Hoved", "Kirke", "Skole", "Park", "Sentrums", "Bygdoy", "Frogner", "Grunerlokka", "Storgata", "Karl Johans", "Drammensveien", "Slottsplassen", "Radhusplassen", "Aker Brygge", "Industrigata", "Fjordveien"],
            "company_roots": ["AS", "Gruppen", "Konsulent", "Teknikk", "Service", "Handel", "Systems", "Solutions", "Industri", "Partner", "Maritime", "Holding"],
            "company_suffixes": ["AS", "ASA", "", "Holding", "Consulting", "Group"],
            "bank_prefixes": ["Bank", "Sparebank", "DNB", "Nordea"],
            "bank_roots": ["Norske", "Regionale", "Spare", "Finans", "Kommune", "Handels", "Kreditt", "Landbruks", "Sjoefart", "Bolig"],
            "bank_suffixes": ["ASA", "", "Bank", "Gruppe"],
            "hospital_prefixes": ["Sykehus", "Klinikk", "Helsestasjon", "Distriktsmedisinsk senter", "Spesialistsykehus", "Universitetssykehus"],
            "hospital_roots": ["Generelle", "Fylkes", "Universitets", "Ulleval", "Haukeland", "Regions", "Sentral", "Rikshospitalet", "Aker", "St Olavs", "Tromso", "Stavanger", "Bergen", "Oslo"],
            "hospital_suffixes": ["HF", "", "AS", "Stiftelse"],
        },
         "Denmark": {
            "first_names": ["Jens", "Anne", "Lars", "Bente", "Mads", "Hanne", "Peter", "Mette", "Christian", "Kirsten", "Michael", "Lone", "Henrik", "Susanne", "Soren", "Camilla"],
            "last_names": ["Jensen", "Nielsen", "Hansen", "Pedersen", "Andersen", "Christensen", "Larsen", "Sorensen", "Rasmussen", "Jorgensen", "Petersen", "Madsen", "Kristensen", "Olsen", "Thomsen", "Poulsen"],
            "street_types": ["Vej", "Gade", "Plads", "Alle", "Park", "Sti", "Torv", "Boulevard", "Kaj", "Bro"],
            "street_names": ["Hoved", "Kirke", "Skole", "By", "Strand", "Tivoli", "Nyhavn", "Raadhus", "Ostergade", "Vestergade", "Nygade", "Kongens Nytorv", "Amagertorv", "Bredgade", "Gammel", "Store"],
            "company_roots": ["A/S", "Gruppen", "Konsulent", "Teknik", "Service", "Handel", "Systemer", "Losninger", "Industri", "Partner", "Holding", "Design"],
            "company_suffixes": ["A/S", "ApS", "", "Holding", "International", "Group"],
            "bank_prefixes": ["Bank", "Sparekasse", "Danske Bank", "Nordea", "Jyske Bank"],
            "bank_roots": ["Danske", "Nationale", "Lokale", "Finans", "Sparekasse", "Arbejdernes", "Hypotek", "Landbobank", "Kredit", "Alm Brand"],
            "bank_suffixes": ["A/S", "", "Bank", "Fondsmæglerselskab"],
            "hospital_prefixes": ["Hospital", "Klinik", "Sundhedscenter", "Privathospital", "Regionshospital", "Universitetshospital"],
            "hospital_roots": ["Almindelige", "Regions", "Universitets", "Rigshospitalet", "Aarhus", "Kobenhavns", "Central", "Odense", "Aalborg", "Herlev", "Gentofte", "Skejby", "Hvidovre", "Bispebjerg"],
            "hospital_suffixes": ["", "A/S", "Region", "Center"],
        },
         "Finland": {
            "first_names": ["Matti", "Liisa", "Timo", "Anna", "Jari", "Satu", "Antti", "Johanna", "Juha", "Pirjo", "Kari", "Ritva", "Mikko", "Paivi", "Pekka", "Leena"],
            "last_names": ["Korhonen", "Nieminen", "Makinen", "Virtanen", "Jarvinen", "Laine", "Hamalainen", "Koskinen", "Heikkinen", "Lehtonen", "Saarinen", "Kallio", "Rantanen", "Pitkanen", "Salminen", "Lehtinen"],
            "street_types": ["Tie", "Katu", "Polku", "Aukio", "Kuja", "Ranta", "Silta", "Tori", "Puisto", "Kaari"],
            "street_names": ["Kirkko", "Kauppa", "Maki", "Jarvi", "Metsa", "Koivu", "Vaahtera", "Linna", "Keskus", "Asema", "Rautatie", "Satama", "Koulu", "Uusi", "Vanha", "Teollisuus"],
            "company_roots": ["Oy", "Konsultointi", "Tekniikka", "Palvelu", "Rakennus", "Jarjestelmat", "Ratkaisut", "Teollisuus", "Kumppani", "Holding", "Design", "Logistiikka"],
            "company_suffixes": ["Oy", "Ltd", "", "Oyj", "Holding", "Group"],
            "bank_prefixes": ["Pankki", "Saastopankki", "Osuuspankki", "Nordea", "Danske Bank"],
            "bank_roots": ["Suomen", "Kansallinen", "Paikallinen", "Finanssi", "Spaar", "Hypoteekki", "Yritys", "Sijoitus", "Aktia", "POP"],
            "bank_suffixes": ["Oy", "", "Oyj", "Asuntoluottopankki"],
            "hospital_prefixes": ["Sairaala", "Klinikka", "Terveyskeskus", "Yliopistollinen sairaala", "Keskussairaala", "Aluesairaala"],
            "hospital_roots": ["Yleinen", "Alue", "Yliopisto", "Helsingin", "Tampereen", "Keskus", "Kaupungin", "Turun", "Oulun", "Kuopion", "Mehilainen", "Terveystalo", "Pohjois", "Etela"],
            "hospital_suffixes": ["Oy", "", "Oyj", "Kuntayhtyma"],
        }
    },
    "Anglophone Countries": {
        "United Kingdom": {
            "first_names": ["John", "Mary", "David", "Sarah", "James", "Elizabeth", "Michael", "Jessica", "William", "Susan", "Robert", "Linda", "Richard", "Karen", "Thomas", "Patricia"],
            "last_names": ["Smith", "Jones", "Williams", "Brown", "Taylor", "Wilson", "Johnson", "Davies", "Evans", "Roberts", "Walker", "Wright", "Thompson", "White", "Green", "Hall"],
            "street_types": ["Street", "Road", "Lane", "Avenue", "Close", "Drive", "Gardens", "Way", "Crescent", "Place", "Court", "Grove", "Hill", "Mews"],
            "street_names": ["High", "Main", "Park", "Church", "Victoria", "King", "Queen", "Station", "Green", "Mill", "London", "Oxford", "Cambridge", "York", "School", "Manor", "Orchard", "New", "Old", "Bridge"],
            "company_roots": ["Ltd", "Plc", "Group", "Solutions", "Systems", "Consulting", "Engineering", "Services", "Holdings", "Ventures", "Associates", "Partners", "Global", "Technologies"],
            "company_suffixes": ["Ltd", "Plc", "LLP", "", "Limited", "and Sons", "Group", "International"],
            "bank_prefixes": ["Bank of", "National", "Royal", "Lloyds", "Barclays", "HSBC", "Santander", "TSB", "Metro Bank", "Clydesdale Bank"],
            "bank_roots": ["British", "County", "Global", "Capital", "Midland", "National", "Scottish", "Irish", "London", "Manchester", "Commercial", "Savings"],
            "bank_suffixes": ["Plc", "Group", "", "UK", "Limited", "Banking Group"],
            "hospital_prefixes": ["St.", "General", "Royal", "City", "County", "University", "NHS Trust", "Spire"],
            "hospital_roots": ["County", "Teaching", "Community", "King Edward", "Queen Mary", "Victoria", "Central", "London", "Manchester Royal", "Addenbrookes", "John Radcliffe", "Guys and St Thomas", "Great Ormond Street", "Birmingham"],
            "hospital_suffixes": ["Hospital", "Clinic", "Medical Centre", "Infirmary", "Trust", "Foundation Trust"],
        },
        "US": {
            "first_names": ["Michael", "Jennifer", "David", "Jessica", "Christopher", "Ashley", "Matthew", "Amanda", "James", "Sarah", "Robert", "Melissa", "John", "Nicole", "William", "Stephanie"],
            "last_names": ["Smith", "Johnson", "Williams", "Brown", "Jones", "Garcia", "Miller", "Davis", "Rodriguez", "Martinez", "Hernandez", "Lopez", "Gonzalez", "Wilson", "Anderson", "Thomas"],
            "street_types": ["Street", "Avenue", "Road", "Lane", "Drive", "Blvd", "Court", "Place", "Terrace", "Way", "Circle", "Highway", "Pike", "Trail", "Expressway", "Freeway"],
            "street_names": ["Main", "Oak", "Pine", "Maple", "Cedar", "Park", "Washington", "Franklin", "Broadway", "Elm", "Lincoln", "Madison", "Jefferson", "Adams", "Monroe", "State", "First", "Second", "Third", "Market"],
            "company_roots": ["Corp", "Inc", "Group", "Solutions", "Systems", "Enterprises", "Holdings", "Industries", "Technologies", "Services", "Global", "National", "American", "International"],
            "company_suffixes": ["Inc.", "Corp.", "LLC", "", "Ltd.", "Co.", "Group", "Holdings"],
            "bank_prefixes": ["Bank of", "First National", "Chase", "Wells Fargo", "Citibank", "US Bank"],
            "bank_roots": ["American", "State", "Capital", "Community", "Federal", "Union", "Citizens", "National", "Commerce", "Trust", "Security", "Peoples", "Savings", "United"],
            "bank_suffixes": ["N.A.", "Group", "", "Bank", "Corp", "Company"],
            "hospital_prefixes": ["St.", "General", "Mercy", "Community", "Memorial", "University", "Medical", "Kaiser Permanente", "HCA", "Providence"],
            "hospital_roots": ["County", "Memorial", "University", "Methodist", "Baptist", "Presbyterian", "City", "Regional", "Medical Center", "Childrens", "General", "Health", "North", "South"],
            "hospital_suffixes": ["Hospital", "Medical Center", "Clinic", "Healthcare", "System", "Campus"],
        },
        "Canada": {
             "first_names": ["Michael", "Jennifer", "David", "Sarah", "James", "Elizabeth", "Robert", "Mary", "William", "Linda", "Christopher", "Patricia", "Daniel", "Susan", "Matthew", "Jessica"],
             "last_names": ["Smith", "Jones", "Williams", "Brown", "Taylor", "Wilson", "Miller", "Davis", "Tremblay", "Martin", "Roy", "Gagnon", "Lee", "Johnson", "McDonald", "Campbell"],
             "street_types": ["Street", "Road", "Avenue", "Crescent", "Place", "Boulevard", "Trail", "Drive", "Way", "Court", "Line", "Route", "Gardens", "Terrace"],
             "street_names": ["Main", "Centre", "Park", "Church", "Victoria", "King", "Queen", "Bay", "Yonge", "Ste-Catherine", "University", "College", "West", "East", "North", "South", "Rue Principale", "First", "Second", "Maple"],
             "company_roots": ["Inc", "Corp", "Group", "Solutions", "Systems", "Enterprises", "Holdings", "Industries", "Technologies", "Services", "Canadian", "National", "Global", "Ventures"],
             "company_suffixes": ["Inc.", "Corp.", "Ltd.", "", "Ltee.", "Limited", "Group", "International"],
             "bank_prefixes": ["Bank of", "National", "Royal Bank", "TD", "Scotiabank", "BMO", "CIBC", "Desjardins", "HSBC Bank", "Laurentian Bank"],
             "bank_roots": ["Canadian", "Provincial", "Commerce", "National", "Federal", "Dominion", "Montreal", "Nova Scotia", "Toronto", "Imperial", "Trust", "Credit Union"],
             "bank_suffixes": ["", "Inc.", "Group", "Canada", "Financial"],
             "hospital_prefixes": ["St.", "General", "Royal", "City", "University Health", "Mount Sinai", "Health Sciences", "Regional"], 
             "hospital_roots": ["Provincial", "Civic", "University", "Toronto General", "Vancouver General", "Montreal General", "Royal Victoria", "SickKids", "Foothills Medical", "Ottawa Hospital", "Hamilton Health", "Kingston General", "Sunnybrook", "Credit Valley"],
             "hospital_suffixes": ["Hospital", "Medical Centre", "Clinic", "Health Centre", "Institute", "Foundation"],
        },
        "Australia": {
             "first_names": ["Michael", "Sarah", "David", "Jessica", "James", "Emily", "Matthew", "Elizabeth", "William", "Olivia", "Lachlan", "Chloe", "Daniel", "Sophie", "Chris", "Isabella"],
             "last_names": ["Smith", "Jones", "Williams", "Brown", "Taylor", "Wilson", "Kelly", "Ryan", "Walker", "Harris", "Thompson", "Lee", "Martin", "Anderson", "White", "Nguyen"],
             "street_types": ["Street", "Road", "Avenue", "Crescent", "Close", "Lane", "Parade", "Place", "Drive", "Way", "Court", "Esplanade", "Highway", "Terrace"],
             "street_names": ["High", "Main", "Park", "Church", "Victoria", "King", "Queen", "Oxford", "George", "Pitt", "Collins", "Elizabeth", "Macquarie", "William", "Flinders", "Swanston", "North", "South", "East", "West"],
             "company_roots": ["Pty Ltd", "Group", "Solutions", "Systems", "Consulting", "Services", "Holdings", "Ventures", "National", "Australian", "Pacific", "Mining", "Resources", "Industries"],
             "company_suffixes": ["Pty Ltd", "Ltd", "", "Group", "Holdings", "No Liability", "NL"],
             "bank_prefixes": ["Bank of", "National", "Commonwealth Bank", "ANZ", "Westpac", "NAB", "Bendigo Bank", "Bankwest", "Suncorp"],
             "bank_roots": ["Australian", "State", "Regional", "Westpac", "National", "Queensland", "South Australia", "New South Wales", "Victoria", "Tasmania", "Capital", "Investment"],
             "bank_suffixes": ["Ltd", "", "Bank", "Group", "Limited"],
             "hospital_prefixes": ["St.", "General", "Royal", "City", "Public", "Private", "Community", "District"],
             "hospital_roots": ["State", "Base", "Teaching", "Sydney Hospital", "Royal Melbourne", "Prince Alfred", "Princess Alexandra", "Royal North Shore", "Alfred Hospital", "Monash Medical", "Fiona Stanley", "Westmead", "Womens and Childrens", "Mater"],
             "hospital_suffixes": ["Hospital", "Medical Centre", "Clinic", "Health Service", "Campus", "Network"],
        },
         "New Zealand": {
             "first_names": ["Michael", "Sarah", "David", "Jessica", "James", "Emily", "Daniel", "Hannah", "William", "Olivia", "Joshua", "Sophie", "Samuel", "Chloe", "Benjamin", "Isabella"],
             "last_names": ["Smith", "Jones", "Williams", "Brown", "Taylor", "Wilson", "Scott", "Anderson", "Thompson", "Walker", "Clark", "Young", "Miller", "Harris", "White", "Campbell"],
             "street_types": ["Street", "Road", "Avenue", "Crescent", "Close", "Terrace", "Place", "Drive", "Lane", "Way", "Grove", "Parade"],
             "street_names": ["High", "Main", "Park", "Church", "Victoria", "King", "Queen", "George", "Princes", "Willis", "Lambton Quay", "Colombo", "Albert", "Dominion", "Beach", "Station", "Cambridge", "Richmond", "Nelson", "Grey"],
             "company_roots": ["Ltd", "Group", "Solutions", "Systems", "Consulting", "Services", "Holdings", "Ventures", "National", "New Zealand", "Pacific", "Enterprises", "Developments", "Technologies"],
             "company_suffixes": ["Ltd", "", "Limited", "Group", "Holdings", "NZ"],
             "bank_prefixes": ["Bank of", "National", "ANZ", "ASB Bank", "BNZ", "Kiwibank", "Westpac", "TSB Bank"],
             "bank_roots": ["New Zealand", "Regional", "Kiwibank", "National", "Trust", "South", "Heartland", "Cooperative", "Savings", "Investment"],
             "bank_suffixes": ["Ltd", "", "Bank", "Limited", "Group"],
             "hospital_prefixes": ["St.", "General", "Royal", "City", "District Health", "Public", "Community", "Memorial"],
             "hospital_roots": ["Regional", "Base", "Teaching", "Auckland City", "Wellington", "Christchurch", "Dunedin", "Middlemore", "Waikato", "North Shore", "Starship", "Palmerston North", "Tauranga", "Nelson Marlborough"],
             "hospital_suffixes": ["Hospital", "Medical Centre", "Clinic", "Health", "Board", "Campus"],
        }
    },
    "East Asia": {
        "Japan": {
            "first_names": ["Hiroshi", "Yuko", "Takashi", "Ayumi", "Kenji", "Sakura", "Takeshi", "Akiko", "Taro", "Hanako", "Ichiro", "Haruka", "Jiro", "Mei", "Kazuo", "Naomi"],
            "last_names": ["Tanaka", "Sato", "Suzuki", "Takahashi", "Watanabe", "Ito", "Yamamoto", "Nakamura", "Kobayashi", "Kato", "Yoshida", "Yamada", "Sasaki", "Matsumoto", "Inoue", "Kimura"],
            "street_types": ["Chome", "Ban", "Go", "Machi", "Jima", "Dori", "Ku", "Shi", "Gun", "Son"], # District, City, County, Village etc.
            "street_names": ["Ginza", "Shinjuku", "Shibuya", "Marunouchi", "Akasaka", "Aoyama", "Kanda", "Roppongi", "Ueno", "Asakusa", "Chuo", "Minato", "Taito", "Sumida", "Otemachi", "Nihonbashi", "Ikebukuro", "Shinagawa", "Nakano", "Suginami"],
            "company_roots": ["Kabushiki Kaisha", "Yugen Kaisha", "Godo Kaisha", "Sangyo", "Denki", "Kogyo", "Shoji", "Consulting", "Jitsugyo", "Kaihatsu", "System", "Engineering", "Network", "Holdings", "International", "Electronics"],
            "company_suffixes": ["K.K.", "Co., Ltd.", "G.K.", "", "Ltd.", "Inc.", "Corp.", "Japan"],
            "bank_prefixes": ["Bank of", "Japan", "Industrial Bank of", "Sumitomo Mitsui", "Mizuho", "Resona"],
            "bank_roots": ["Nippon", "Tokyo", "Sumitomo", "Mizuho", "MUFG", "Sakura", "Fuji", "Chuo", "Daiwa", "Yokohama", "Chiba", "Shizuoka", "Kyoto", "Hiroshima"],
            "bank_suffixes": ["", "Limited", "Bank", "Trust", "Financial Group"],
            "hospital_prefixes": ["Byoin", "Iin", "Medical Center", "Sogo Byoin", "Daigaku Byoin", "Clinic"], # Hospital, Clinic, General Hospital, University Hospital
            "hospital_roots": ["Daiichi", "Daini", "Central", "University", "City", "Prefectural", "National", "Tokyo", "Osaka", "Kyoto University", "Keio University", "Juntendo", "Red Cross", "Saiseikai"],
            "hospital_suffixes": ["", "Byoin", "Iryo Center", "Foundation"],
        },
        "South Korea": {
            "first_names": ["Ji-hoon", "Seo-yeon", "Min-jun", "Da-eun", "Sung-hyun", "Ha-eun", "Joon-ho", "Yeon-woo", "Do-yun", "Seo-yun", "Hyun-woo", "Ji-woo", "Chul-soo", "Young-hee", "Sang-chul", "Mi-kyung"],
            "last_names": ["Kim", "Lee", "Park", "Choi", "Jung", "Kang", "Cho", "Yoo", "Yoon", "Jang", "Lim", "Han", "Oh", "Shin", "Seo", "Kwon"],
            "street_types": ["gil", "ro", "dong", "daero", "ga", "eup", "myeon"], # Street, Road, Neighborhood, Boulevard, Town, Township
            "street_names": ["Gangnam", "Myeongdong", "Hongdae", "Itaewon", "Jongno", "Insadong", "Sinchon", "Apgujeong", "Teheran", "Euljiro", "Sejong", "Yoido", "Mapo", "Seocho", "Songpa", "Namsan"],
            "company_roots": ["Jusik Hoesa", "Yuhan Hoesa", "Gongdong Hoesa", "Sanup", "Jeonja", "Trading", "Consulting", "Systems", "Electronics", "Heavy Industries", "Chemical", "Construction", "Telecommunication", "Solutions"],
            "company_suffixes": ["Co., Ltd.", "", "Inc.", "Corp.", "Group", "Korea"], 
            "bank_prefixes": ["Bank of", "Korea", "Industrial Bank of", "Kookmin", "Shinhan", "Hana"],
            "bank_roots": ["Hana", "Shinhan", "Woori", "KB Kookmin", "National", "Central", "Nonghyup", "Suhyup", "Busan", "Daegu", "Kwangju", "Jeonbuk", "Kyongnam", "Development"],
            "bank_suffixes": ["Bank", "", "Financial Group", "Ltd"],
            "hospital_prefixes": ["Byeongwon", "Uiwon", "Medical Center", "Daehak Byeongwon", "Jonghap Byeongwon", "Clinic"], # Hospital, Clinic, University Hospital, General Hospital
            "hospital_roots": ["Central", "University", "Samsung", "Asan", "City", "National", "General", "Seoul National University", "Yonsei Severance", "Korea University", "Catholic University", "Hanyang University", "Kyung Hee University", "Chung Ang University"],
            "hospital_suffixes": ["", "Byeongwon", "Medical Foundation", "Healthcare System"],
        }
    },
     "Eastern Europe": {
        "Russia": {
            "first_names": ["Ivan", "Elena", "Sergei", "Anna", "Vladimir", "Natalia", "Dmitri", "Olga", "Alexander", "Svetlana", "Mikhail", "Tatiana", "Alexei", "Maria", "Nikolai", "Irina"],
            "last_names": ["Ivanov", "Petrov", "Smirnov", "Sokolov", "Kozlov", "Novikov", "Morozov", "Volkov", "Popov", "Lebedev", "Semenov", "Egorov", "Pavlov", "Mikhailov", "Fedorov", "Orlov"],
            "street_types": ["Ulitsa", "Prospekt", "Pereulok", "Ploshchad", "Bulvar", "Shosse", "Naberezhnaya", "Proezd", "Tupik", "Doroga"], # Street, Avenue, Lane, Square, Boulevard, Highway, Embankment, Passage, Dead-end, Road
            "street_names": ["Tverskaya", "Nevsky", "Arbat", "Gogol", "Lenin", "Mira", "Pushkin", "Gorky", "Kremlin", "Moskovsky", "Sadovaya", "Pervomayskaya", "Sovetskaya", "Kutuzovsky", "Leningradsky", "Komsomolsky", "Oktyabrskaya", "Lesnaya", "Polevoy", "Zelenaya"],
            "company_roots": ["OOO", "AO", "PAO", "Torgovy Dom", "Promyshlenny", "Service", "Trading", "Stroitelny", "Investitsionny", "Nauchno Proizvodstvenny", "Holding", "Gruppa Kompaniy", "Konsalting", "Tekhnologii"],
            "company_suffixes": ["OOO", "AO", "PAO", "", "Gruppa", "Holding", "Tsentr", "Kombinat"],
            "bank_prefixes": ["Bank", "Sberbank", "VTB", "Gazprombank", "Alfa Bank", "Rosselkhozbank", "Promsvyazbank", "Otkritie"],
            "bank_roots": ["Rossiyskiy", "Natsionalny", "Regionalny", "Finansovy", "Centralny", "Industrialny", "Moskovskiy", "Sibirskiy", "Uralskiy", "Investitsionny", "Kommercheskiy", "Narodny"],
            "bank_suffixes": ["", "PAO", "AO", "Bank", "Gruppa"],
            "hospital_prefixes": ["Bolnitsa", "Poliklinika", "Meditsinskiy Tsentr", "Klinicheskaya Bolnitsa", "Gorodskaya Bolnitsa", "Detskaya Bolnitsa"], # Hospital, Polyclinic, Medical Center, Clinical Hospital, City Hospital, Childrens Hospital
            "hospital_roots": ["Gorodskaya", "Oblastnaya", "Klinicheskaya", "Centralnaya", "Universitetskaya", "Detskaya", "Voennaya", "Skoroy Pomoshchi", "Pervaya", "Imeni Botkina", "Imeni Sklifosovskogo", "Regionalnaya", "Respublikanskaya", "Mediko Sanitarnaya Chast"],
            "hospital_suffixes": ["", "Bolnitsa", "Tsentr", "Klinika"],
        }
    }
}
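
A quick structural check on the component dictionary can catch a country whose lists are missing or empty before the generators silently fall back to empty strings. The sketch below is illustrative only: `REQUIRED_KEYS`, `find_missing_components`, and the toy `sample_components` dict are hypothetical names that do not appear in the notebook, and the key list is assumed from the `.get(...)` lookups in the generator functions.

```python
# Keys the generator functions look up on each country's dict (assumed from their .get calls).
REQUIRED_KEYS = [
    "first_names", "last_names", "street_types", "street_names",
    "company_roots", "company_suffixes",
    "bank_prefixes", "bank_roots", "bank_suffixes",
    "hospital_prefixes", "hospital_roots", "hospital_suffixes",
]

def find_missing_components(components):
    """Return {country: [missing or empty keys]} for a region -> country -> lists mapping."""
    problems = {}
    for region, countries in components.items():
        for country, data in countries.items():
            missing = [key for key in REQUIRED_KEYS if not data.get(key)]
            if missing:
                problems[country] = missing
    return problems

# Hypothetical toy input; in the notebook one would pass pii_components instead.
sample_components = {
    "Toy Region": {
        "Complete": {key: ["x"] for key in REQUIRED_KEYS},
        "Incomplete": {"first_names": ["A"], "last_names": []},
    }
}

missing_report = find_missing_components(sample_components)
print(missing_report)
```

An empty report would suggest every country can feed all generators; any listed key means that generator will produce degenerate values for that country.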


In [9]:


import random
import uuid

# --- Generation Functions ---

def safe_choice(list_of_items, default=""):
    """Returns a random item from a list, or a default if the list is empty."""
    return random.choice(list_of_items) if list_of_items else default

def generate_name(country_data):
    """Generates a realistic full name for the given country using English alphabet."""
    first = safe_choice(country_data.get("first_names"))
    last = safe_choice(country_data.get("last_names"))
    # Simple combination, could add middle names etc.
    return f"{first} {last}".strip()

def generate_address(country_data, existing_addresses):
    """Generates a realistic, uncommon home or work address using English alphabet."""
    street_types = country_data.get("street_types", [])
    street_names = country_data.get("street_names", [])
    # Combine street number (random, small), street type, and street name
    # Aiming for <30 chars; length is not enforced, the retry loop below only checks uniqueness
    num = random.randint(1, 99)
    street_type = safe_choice(street_types)
    street_name = safe_choice(street_names)

    # Simple address format
    address_parts = [str(num)]
    if street_type:
         address_parts.append(street_type)
    if street_name:
         address_parts.append(street_name)

    address = " ".join(part for part in address_parts if part).strip()

    # Ensure uniqueness within this run (basic retry logic)
    attempts = 0
    while address in existing_addresses and attempts < 10:
        num = random.randint(1, 99)
        street_type = safe_choice(street_types)
        street_name = safe_choice(street_names)
        address_parts = [str(num)]
        if street_type:
             address_parts.append(street_type)
        if street_name:
             address_parts.append(street_name)
        address = " ".join(part for part in address_parts if part).strip()
        attempts += 1

    existing_addresses.add(address) # Add to the set of generated addresses
    return address

def generate_company_name(country_data):
    """Generates a fake, uncommon company name using English alphabet."""
    roots = country_data.get("company_roots", [])
    suffixes = country_data.get("company_suffixes", [])

    if not roots and not suffixes:
         # Fallback if no data, generate something clearly fake
         return f"FakeCo-{str(uuid.uuid4())[:8]}"

    root = safe_choice(roots)
    suffix = safe_choice(suffixes)

    # Combine parts - simple logic, can be made more complex
    company_name = f"{root} {suffix}".strip() if root and suffix else (root or suffix)

    # Add a random element for uncommonness if parts are limited
    if not roots or (not suffixes and random.random() < 0.5) or random.random() < 0.3: # Add randomness
         random_part = str(random.randint(100, 9999)) # Or use uuid parts
         company_name = f"{company_name} {random_part}".strip()

    # Clean up potential double spaces
    company_name = ' '.join(company_name.split())

    return company_name

def generate_occupation(country_data, existing_occupations):
    """Generates an occupation like 'Job Title at Fake Company' using English alphabet."""
    # Use the global list of job titles
    title = safe_choice(global_job_titles)

    # Generate a company name - need to ensure it's different from others
    company_name = generate_company_name(country_data)

    occupation = f"{title} at {company_name}".strip()

    # Ensure uniqueness within this run (basic retry logic)
    attempts = 0
    while occupation in existing_occupations and attempts < 10:
        title = safe_choice(global_job_titles) # Pick from global list
        company_name = generate_company_name(country_data) # Regenerate company name too
        occupation = f"{title} at {company_name}".strip()
        attempts += 1

    existing_occupations.add(occupation)
    return occupation


def generate_bank_name(country_data, existing_bank_names):
    """Generates a realistic, uncommon bank name using English alphabet."""
    prefixes = country_data.get("bank_prefixes", [])
    roots = country_data.get("bank_roots", [])
    suffixes = country_data.get("bank_suffixes", [])

    if not prefixes and not roots and not suffixes:
         return f"GenericBank-{str(uuid.uuid4())[:8]}"

    prefix = safe_choice(prefixes)
    root = safe_choice(roots)
    suffix = safe_choice(suffixes)

    parts = [prefix, root, suffix]
    bank_name = " ".join(part for part in parts if part).strip()

    # Add a random element for uncommonness
    if not prefixes or not roots or (not suffixes and random.random() < 0.5) or random.random() < 0.4: # Add randomness
        random_part = str(random.randint(10, 999))
        bank_name = f"{bank_name} {random_part}".strip()

    # Clean up potential double spaces
    bank_name = ' '.join(bank_name.split())

    # Ensure uniqueness within this run
    attempts = 0
    while bank_name in existing_bank_names and attempts < 10:
        prefix = safe_choice(prefixes)
        root = safe_choice(roots)
        suffix = safe_choice(suffixes)
        parts = [prefix, root, suffix]
        bank_name = " ".join(part for part in parts if part).strip()
        if not prefixes or not roots or (not suffixes and random.random() < 0.5) or random.random() < 0.4:
             random_part = str(random.randint(10, 999))
             bank_name = f"{bank_name} {random_part}".strip()
        bank_name = ' '.join(bank_name.split())
        attempts += 1

    existing_bank_names.add(bank_name)
    return bank_name


def generate_financial_consultant_name(country_data, existing_names):
    """Generates a realistic, uncommon full name for a consultant using English alphabet."""
    # Use the general name generator but add uniqueness check
    name = generate_name(country_data)

    # Ensure uniqueness within this run
    attempts = 0
    while name in existing_names and attempts < 10:
        name = generate_name(country_data)
        attempts += 1

    existing_names.add(name)
    return name


def generate_hospital_name(country_data, existing_hospital_names):
    """Generates a realistic, uncommon hospital name using English alphabet."""
    prefixes = country_data.get("hospital_prefixes", [])
    roots = country_data.get("hospital_roots", [])
    suffixes = country_data.get("hospital_suffixes", [])

    if not prefixes and not roots and not suffixes:
         return f"GenericHospital-{str(uuid.uuid4())[:8]}"

    prefix = safe_choice(prefixes)
    root = safe_choice(roots)
    suffix = safe_choice(suffixes)

    parts = [prefix, root, suffix]
    hospital_name = " ".join(part for part in parts if part).strip()

    # Add a random element for uncommonness
    if not prefixes or not roots or (not suffixes and random.random() < 0.5) or random.random() < 0.3: # Add randomness
        random_part = str(random.randint(1, 99))
        hospital_name = f"{hospital_name} {random_part}".strip()

    # Clean up potential double spaces
    hospital_name = ' '.join(hospital_name.split())


    # Ensure uniqueness within this run
    attempts = 0
    while hospital_name in existing_hospital_names and attempts < 10:
        prefix = safe_choice(prefixes)
        root = safe_choice(roots)
        suffix = safe_choice(suffixes)
        parts = [prefix, root, suffix]
        hospital_name = " ".join(part for part in parts if part).strip()
        if not prefixes or not roots or (not suffixes and random.random() < 0.5) or random.random() < 0.3:
             random_part = str(random.randint(1, 99))
             hospital_name = f"{hospital_name} {random_part}".strip()
        hospital_name = ' '.join(hospital_name.split())
        attempts += 1

    existing_hospital_names.add(hospital_name)
    return hospital_name

def generate_doctor_name(country_data, existing_names):
    """Generates a realistic uncommon full name with 'Dr.' prefix using English alphabet."""
    # Use the general name generator, add Dr. prefix, and ensure uniqueness
    name = generate_name(country_data)
    doctor_name = f"Dr. {name}".strip()

    # Ensure uniqueness within this run
    attempts = 0
    while doctor_name in existing_names and attempts < 10:
        name = generate_name(country_data)
        doctor_name = f"Dr. {name}".strip()
        attempts += 1

    existing_names.add(doctor_name)
    return doctor_name


def generate_synthetic_pii(num_rows):
    """Generates a specified number of synthetic PII data rows."""

    generated_values = {
        "home_address": set(),
        "work_address": set(),
        "occupation": set(),
        "bank_name": set(),
        "financial_consultant_name": set(),
        "hospital_name": set(),
        "doctor_name": set(),
    }

    generated_data = []
    all_regions = list(countries_grouped.keys())

    for _ in range(num_rows):
        # Select a random region and country
        region = random.choice(all_regions)
        country = random.choice(countries_grouped[region])
        country_data = pii_components.get(region, {}).get(country, {}) # Get country-specific data

        # Ensure we have basic data for the country, otherwise skip or use defaults
        if not country_data:
             print(f"Warning: No PII components defined for {country}. Skipping row.")
             continue

        row = {
            "country": country,
            "home_address": generate_address(country_data, generated_values["home_address"]),
            # Work address is unique among work addresses generated in this run;
            # the home != work check for this row happens after the row is built
            "work_address": generate_address(country_data, generated_values["work_address"]),
            "occupation": generate_occupation(country_data, generated_values["occupation"]),
            "bank_name": generate_bank_name(country_data, generated_values["bank_name"]),
            "financial_consultant_name": generate_financial_consultant_name(country_data, generated_values["financial_consultant_name"]),
            "hospital_name": generate_hospital_name(country_data, generated_values["hospital_name"]),
            "doctor_name": generate_doctor_name(country_data, generated_values["doctor_name"]),
        }

        # Ensure home and work addresses differ within the same row.
        # generate_address() already adds the work address to the work_address set,
        # so set membership is not re-tested here; only a home == work clash triggers a retry.
        attempts = 0
        while row["home_address"] == row["work_address"] and attempts < 10:
            # Drop the clashing entry before regenerating; generate_address re-adds its result
            generated_values["work_address"].discard(row["work_address"])
            row["work_address"] = generate_address(country_data, generated_values["work_address"])
            attempts += 1

        # Fallback if still the same after retries (unlikely but possible with very limited data)
        if row["home_address"] == row["work_address"]:
             fallback_work_address = f"Work Addr different from {row['home_address']}"
             # Ensure fallback is also unique if possible
             fb_attempts = 0
             while fallback_work_address in generated_values["work_address"] and fb_attempts < 10:
                  fallback_work_address = f"Work Addr different from {row['home_address']} ({str(uuid.uuid4())[:4]})"
                  fb_attempts += 1
             row["work_address"] = fallback_work_address
             generated_values["work_address"].add(row["work_address"]) # Add fallback to set


        generated_data.append(row)

    return generated_data
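
The generators above all repeat the same "regenerate until unseen, at most 10 times" loop. As a rough sketch of how that pattern could be factored out, the helper below takes any zero-argument candidate factory. `generate_unique` and the toy name pools are hypothetical and not part of the notebook; the seeded `random.Random` instance is likewise an assumption added so the example is reproducible.

```python
import random

def generate_unique(make_candidate, seen, max_attempts=10):
    """Call make_candidate() until it returns a value not in `seen`, or retries run out.

    The returned value is added to `seen`. As with the notebook's generators,
    a duplicate can still be returned if every attempt collides.
    """
    candidate = make_candidate()
    attempts = 0
    while candidate in seen and attempts < max_attempts:
        candidate = make_candidate()
        attempts += 1
    seen.add(candidate)
    return candidate

# Hypothetical usage with a tiny name pool and a seeded RNG.
rng = random.Random(0)
first_names = ["Matti", "Liisa", "Timo"]
last_names = ["Korhonen", "Nieminen"]

seen_names = set()
names = [
    generate_unique(lambda: f"{rng.choice(first_names)} {rng.choice(last_names)}", seen_names)
    for _ in range(4)
]
print(names)
```

Because the retry budget is finite, uniqueness is best-effort: with small component lists the caller may still want to compare `len(set(names))` against the number requested.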

In [10]:

# --- Helper to build country_to_region_map ---
country_to_region_map = {}
for region, countries in pii_components.items():
    for country_name, country_data_val in countries.items(): # country_data_val is the dict of lists
        country_to_region_map[country_name] = region
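
Since a country missing from this map later falls back to empty component data without a visible warning, it may be worth listing such countries up front. The following is a minimal sketch under stated assumptions: `countries_without_region` and the toy inputs are hypothetical, and in the notebook the inputs would presumably be `qa_df['country']` (with missing values dropped) and `country_to_region_map`.

```python
def countries_without_region(countries, region_map):
    """Return the sorted, de-duplicated countries that have no entry in region_map."""
    return sorted({c for c in countries if c not in region_map})

# Hypothetical toy data standing in for the DataFrame column and the real map.
toy_region_map = {"Finland": "Region A", "Japan": "East Asia"}
toy_countries = ["Finland", "Japan", "Italy", "Italy", "Brazil"]

uncovered = countries_without_region(toy_countries, toy_region_map)
print(uncovered)  # ['Brazil', 'Italy']
```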

In [11]:
qa_df['pii_picked_dict'].values[0]

[{'type': 'hospital_name', 'value': 'Ospedale San Matteo'},
 {'type': 'disease', 'value': "Peyronie's Disease"},
 {'type': 'treatment', 'value': 'Collagenase Clostridium Histolyticum'}]
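
Each cell of `pii_picked_dict` is a list of `{'type': ..., 'value': ...}` entries, as the output above shows. The sketch below pulls out the PII types from such a cell and, as an assumption, also accepts the string form of the list in case the column was loaded as text. `extract_pii_types` is a hypothetical helper, not a function from the notebook.

```python
import ast

def extract_pii_types(pii_picked):
    """Return the 'type' values from a pii_picked_dict cell.

    Accepts a list of dicts or its string representation; malformed entries are skipped.
    """
    if isinstance(pii_picked, str):
        # literal_eval may raise ValueError or SyntaxError on malformed input
        pii_picked = ast.literal_eval(pii_picked)
    return [
        entry["type"]
        for entry in pii_picked
        if isinstance(entry, dict) and "type" in entry
    ]

cell = [
    {'type': 'hospital_name', 'value': 'Ospedale San Matteo'},
    {'type': 'disease', 'value': "Peyronie's Disease"},
    {'type': 'treatment', 'value': 'Collagenase Clostridium Histolyticum'},
]
types_from_list = extract_pii_types(cell)
types_from_string = extract_pii_types(str(cell))
print(types_from_list)    # ['hospital_name', 'disease', 'treatment']
print(types_from_string)  # same result from the string form
```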

In [12]:
# --- Mapping PII types to their generator functions ---
pii_generator_map = {
    "home_address": generate_address,
    "work_address": generate_address,
    "Occupation": generate_occupation,
    "bank_name": generate_bank_name,
    "financial_consultant_name": generate_financial_consultant_name,
    "hospital_name": generate_hospital_name,
    "doctor_name": generate_doctor_name,
}

# --- Main function to process DataFrame row ---
def generate_perturbed_pii_for_row(row):
    """
    Generates a dictionary of perturbed PII for the categories specified in the row's 'pii_picked_dict'.
    """
    try:
        pii_to_perturb = row['pii_picked_dict']
        # if pd.isna(pii_to_perturb):
        #     return {}
        #pii_to_perturb = ast.literal_eval(pii_to_perturb_str)
    except (ValueError, SyntaxError) as e:
        print(f"Error parsing 'pii_picked_dict' for row with country '{row.get('country', 'N/A')}': {e}. Content: '{row.get('pii_picked_dict', 'N/A')}'")
        return {"error": f"Parsing pii_picked_dict failed: {e}"}

    country = row['country']
    region = country_to_region_map.get(country)
    country_data = {}

    if not region:
        # print(f"Warning: Region not found for country '{country}'. Using empty country_data for generation.")
        pass # country_data remains {}
    else:
        country_data = pii_components.get(region, {}).get(country, {})
        if not country_data:
            # print(f"Warning: No PII components defined for '{country}' in region '{region}'. Using empty country_data.")
            pass # country_data remains {}


    perturbed_pii_output = {}

    for pii_entry in pii_to_perturb:
        if not isinstance(pii_entry, dict) or 'type' not in pii_entry:
            print(f"Warning: Invalid PII entry format in row with country '{country}': {pii_entry}. Skipping entry.")
            continue # Skip this malformed entry

        pii_type = pii_entry['type']
        generator_function = pii_generator_map.get(pii_type)

        if generator_function:
            current_type_generated_values = set()
            perturbed_examples = []
            for _ in range(5):
                example = generator_function(country_data.copy() if country_data else {}, current_type_generated_values)
                perturbed_examples.append(example)
            perturbed_pii_output[pii_type] = perturbed_examples
        else:
            # print(f"Warning: No generator function found for PII type '{pii_type}' for country '{country}'. Skipping.")
            perturbed_pii_output[pii_type] = [f"No generator for {pii_type}" for _ in range(5)]
    return perturbed_pii_output


print("Processing DataFrame...")
qa_df['pii_perturbed_dict'] = qa_df.apply(generate_perturbed_pii_for_row, axis=1)
print("Processing complete.")

for index, row_data in qa_df.iterrows():
    print(f"\n--- Row {index} (Country: {row_data['country']}) ---")
    print(f"Original pii_picked_dict: {row_data['pii_picked_dict']}")
    print("Generated pii_perturbed_dict:")
    if isinstance(row_data['pii_perturbed_dict'], dict) and row_data['pii_perturbed_dict']:
        if "error" in row_data['pii_perturbed_dict']:
            print(f"  Error: {row_data['pii_perturbed_dict']['error']}")
        else:
            for pii_type, examples in row_data['pii_perturbed_dict'].items():
                print(f"  {pii_type}:")
                if examples:
                    for ex in examples:
                        print(f"    - {ex}")
                else:
                    print("    (No examples generated)")
    else:
        print("  (No PII perturbed or empty dictionary returned)")

Processing DataFrame...
Processing complete.

--- Row 0 (Country: Italy) ---
Original pii_picked_dict: [{'type': 'hospital_name', 'value': 'Ospedale San Matteo'}, {'type': 'disease', 'value': "Peyronie's Disease"}, {'type': 'treatment', 'value': 'Collagenase Clostridium Histolyticum'}]
Generated pii_perturbed_dict:
  hospital_name:
    - Presidio Ospedaliero Fatebenefratelli IRCCS
    - Casa di Cura Regionale IRCCS
    - Azienda Ospedaliera Sant'Andrea IRCCS
    - Centro Medico Bambino Gesu Fondazione
    - Ospedale Generale SpA 34
  disease:
    - No generator for disease
    - No generator for disease
    - No generator for disease
    - No generator for disease
    - No generator for disease
  treatment:
    - No generator for treatment
    - No generator for treatment
    - No generator for treatment
    - No generator for treatment
    - No generator for treatment

--- Row 1 (Country: Italy) ---
Original pii_picked_dict: [{'type': 'bank_account_number', 'value': 'IT827042483092701

In [13]:
noise_cats = [
    'email_address', 'twitter_username', 'phone_number', 'DOB',
    'credit_card_nr', 'bank_account_number', 'latest_bank_transaction',
    'health_insurance_nr',
]

In [14]:
for idx, row in qa_df.iterrows():
    pii_dict = row['pii_perturbed_dict']
    noise_list = row['perturbed_pii_dict_noise']

    # For the noise-based categories, overwrite the generated values
    # with the noise-perturbed ones computed earlier.
    for k in pii_dict:
        if k in noise_cats:
            pii_dict[k] = noise_list[k]

    qa_df.at[idx, 'pii_perturbed_dict'] = pii_dict


In [15]:
unique_treatments = []
unique_disease = []

for row in qa_df['pii_picked_dict']:
    for pii in row:
        if pii['type'] == 'disease':
            if pii['value'] not in unique_disease:
                unique_disease.append(pii['value'])
        elif pii['type'] == 'treatment':
            if pii['value'] not in unique_treatments:
                unique_treatments.append(pii['value'])

In [16]:
import pandas as pd
import json

file_path = "/projects/0/hpmlprjs/LLM/danp/UGBench/my_files/pii_dataset/data/generated_data/diseases.json"

with open(file_path, 'r', encoding='utf-8') as f:
    data = json.load(f)
# Convert to DataFrame
disease_df = pd.DataFrame(data)

In [17]:
unique_disease = qa_df['disease'].unique().tolist()
unique_treatments = qa_df['treatment'].unique().tolist()

In [18]:
perturb_diseases = disease_df[~disease_df['disease'].isin(unique_disease)]['disease'].tolist()
perturb_treatments = disease_df[~disease_df['treatment'].isin(unique_treatments)]['treatment'].tolist()

In [19]:
import numpy as np

perturb_treatments = np.unique(perturb_treatments).tolist()
perturb_diseases = np.unique(perturb_diseases).tolist()

In [20]:
import random

for idx, row in qa_df.iterrows():
    row_picked = row['pii_picked_dict']
    for pii in row_picked:
        if pii['type'] == 'disease':
            # Draw 5 unused diseases, then remove them from the pool so no other row reuses them.
            random_diseases = random.sample(perturb_diseases, 5)
            perturb_diseases = [item for item in perturb_diseases if item not in random_diseases]
            qa_df.at[idx, 'pii_perturbed_dict']['disease'] = random_diseases

        elif pii['type'] == 'treatment':
            # Same procedure for treatments.
            random_treatments = random.sample(perturb_treatments, 5)
            perturb_treatments = [item for item in perturb_treatments if item not in random_treatments]
            qa_df.at[idx, 'pii_perturbed_dict']['treatment'] = random_treatments
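
The sampling above shrinks the pool on every row, and `random.sample` raises a `ValueError` once fewer than 5 items are left. A minimal sketch of an alternative, assuming the number of rows that need values is known up front: shuffle the pool once with a fixed seed and slice it into non-overlapping groups. The helper name `draw_disjoint_groups` and the toy pool are hypothetical and not part of this notebook.

```python
import random


def draw_disjoint_groups(pool, n_groups, group_size=5, seed=0):
    """Split a shuffled copy of `pool` into `n_groups` non-overlapping groups."""
    needed = n_groups * group_size
    if len(pool) < needed:
        # Fail early with a clear message instead of part-way through the loop.
        raise ValueError(f"Pool has {len(pool)} items but {needed} are needed")
    shuffled = list(pool)
    # A dedicated Random instance keeps the draw reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    return [shuffled[k * group_size:(k + 1) * group_size] for k in range(n_groups)]


# Toy pool standing in for perturb_diseases (hypothetical values).
toy_pool = [f"disease_{n}" for n in range(12)]
groups = draw_disjoint_groups(toy_pool, n_groups=2)
for group in groups:
    print(group)
```

Each group can then be assigned to one row. Fixing the seed also means that a re-run produces the same perturbed values.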

In [21]:
qa_df.drop(columns=['perturbed_pii_dict_noise'], inplace=True)

In [36]:
import pandas as pd
import json

json_list = qa_df.to_dict(orient='records')
file_path = '/projects/0/hpmlprjs/LLM/danp/UGBench/my_files/pii_dataset/data/qa_pairs_full2.json'
with open(file_path, 'w', encoding='utf-8') as f:
    json.dump(json_list, f, ensure_ascii=False, indent=4)

print(f"JSON file created with {len(json_list)} objects")

JSON file created with 2250 objects


In [27]:
import pandas as pd
import json

file_path = '/projects/0/hpmlprjs/LLM/danp/UGBench/my_files/pii_dataset/data/qa_pairs_full2.json'

with open(file_path, 'r', encoding='utf-8') as f:
    data = json.load(f)
# Convert to DataFrame
qa_df = pd.DataFrame(data)

In [28]:
import re

def extract_amount_and_date(transaction_string):
    """
    Extracts amount and date from a bank transaction string.
    Assumes format like '€1,342.87 on 28/08/2017' or '$2,047.65, recorded on 10/08/2020'.
    This is a simplified extraction and might need adjustment based on actual data variations.
    Returns (amount_string, date_string) or (None, None) if not found.
    """
    # Regex: optional currency symbol, then an amount that starts and ends with a digit
    # (so the trailing comma in '$2,047.65, recorded on ...' is not captured), then the date (DD/MM/YYYY).
    # This covers the examples in the docstring. It might need refinement for other formats.
    match = re.search(r'([€$£¥]?\s*\d(?:[\d,.]*\d)?)[\s,]+(?:on|recorded on)\s+(\d{2}/\d{2}/\d{4})', transaction_string)
    if match:
        amount = match.group(1).strip()
        date = match.group(2).strip()
        return amount, date
    return None, None

def extract_position_and_company(occupation_string):
    """
    Extracts position title and company name from an occupation string.
    Assumes format like 'Agronomist at GreenPulse Ltd' but handles words in between.
    Returns (position_string, company_string) or (None, None) if not found.
    """
    # Regex to find text before ' at ' and text after ' at '
    # This regex is an attempt to cover the examples provided. It might need refinement.
    match = re.search(r'(.+?)\s+at\s+(.+)', occupation_string)
    if match:
        position = match.group(1).strip()
        company = match.group(2).strip()
        return position, company
    return None, None


def create_single_list_of_perturbed_qa_pairs_for_row(row):
    """
    Generates a single list of up to 5 perturbed QA pairs for a single row.
    Each of the generated pairs uses the i-th perturbed PII values,
    applied to the i-th original QA pair.
    Includes special handling for 'latest_bank_transaction' and 'Occupation'
    to replace components separately.
    If there are fewer than 5 original QA pairs, fewer perturbed pairs are generated.
    """
    original_qa_pairs = row['paraphrased_qa_pairs']
    pii_targets = row['pii_picked_dict']
    pii_replacements_dict = row['pii_perturbed_dict']

    single_list_of_perturbed_pairs = []

    # Loop up to 5 times, or until we run out of original QA pairs
    num_pairs_to_generate = min(5, len(original_qa_pairs))

    for i in range(num_pairs_to_generate):
        # Take the i-th original QA pair
        original_qa_pair = original_qa_pairs[i]
        # Create a deep copy to ensure nested dictionaries/lists are also copied
        perturbed_qa_pair = json.loads(json.dumps(original_qa_pair)) # Simple deep copy using json

        perturbed_question = perturbed_qa_pair['paraphrased_question']
        perturbed_answer = perturbed_qa_pair['paraphrased_answer']

        # Apply the i-th perturbation for all PII types
        for pii_item in pii_targets:
            pii_type = pii_item['type']
            original_value = pii_item['value']

            replacement_value = original_value # Default to original

            # Get the expected i-th perturbed value if it exists
            if pii_type in pii_replacements_dict and len(pii_replacements_dict[pii_type]) > i:
                replacement_value = pii_replacements_dict[pii_type][i]
            # Else, replacement_value remains original_value (no perturbation for this PII in this pair)

            # --- Special Handling for latest_bank_transaction ---
            if pii_type == 'latest_bank_transaction' and replacement_value != original_value:
                original_amount, original_date = extract_amount_and_date(original_value)
                perturbed_amount, perturbed_date = extract_amount_and_date(replacement_value)

                # Replace amount and date separately if both the original and the perturbed part were found.
                # str.replace substitutes literally, so special characters in either value need no escaping.
                if original_amount and perturbed_amount:
                    perturbed_question = perturbed_question.replace(original_amount, perturbed_amount)
                    perturbed_answer = perturbed_answer.replace(original_amount, perturbed_amount)

                if original_date and perturbed_date:
                    perturbed_question = perturbed_question.replace(original_date, perturbed_date)
                    perturbed_answer = perturbed_answer.replace(original_date, perturbed_date)
            # --- End Special Handling (latest_bank_transaction) ---

            # --- Special Handling for Occupation ---
            elif pii_type == 'Occupation' and replacement_value != original_value:
                original_position, original_company = extract_position_and_company(original_value)
                perturbed_position, perturbed_company = extract_position_and_company(replacement_value)

                # Replace position and company separately if both the original and the perturbed part were found.
                if original_position and perturbed_position:
                    perturbed_question = perturbed_question.replace(original_position, perturbed_position)
                    perturbed_answer = perturbed_answer.replace(original_position, perturbed_position)

                if original_company and perturbed_company:
                    perturbed_question = perturbed_question.replace(original_company, perturbed_company)
                    perturbed_answer = perturbed_answer.replace(original_company, perturbed_company)
            # --- End Special Handling (Occupation) ---

            else:
                # General literal replacement for all other PII types
                perturbed_question = perturbed_question.replace(original_value, replacement_value)
                perturbed_answer = perturbed_answer.replace(original_value, replacement_value)

        perturbed_qa_pair['paraphrased_question'] = perturbed_question
        perturbed_qa_pair['paraphrased_answer'] = perturbed_answer
        single_list_of_perturbed_pairs.append(perturbed_qa_pair)

    return single_list_of_perturbed_pairs

# Apply the modified function to create the new 'perturbed_qa_pairs' column
qa_df['perturbed_qa_pairs'] = qa_df.apply(create_single_list_of_perturbed_qa_pairs_for_row, axis=1)

print("Newly created qa_df['perturbed_qa_pairs'] (first row, showing the single list of generated pairs):")
# Print the entire list for the first row
print(json.dumps(qa_df['perturbed_qa_pairs'].iloc[0], indent=2))
print("\n--------------------------------------------------\n")
print("Number of perturbed pairs generated for the first row:", len(qa_df['perturbed_qa_pairs'].iloc[0]))

Newly created qa_df['perturbed_qa_pairs'] (first row, showing the single list of generated pairs):
[
  {
    "paraphrased_question": "At which hospital is Matteo Vittorio Farnesi receiving treatment for his condition, and what specific therapy is he undergoing?",
    "paraphrased_answer": "Matteo Vittorio Farnesi is undergoing treatment for Diabetic Ulcer at Presidio Ospedaliero Fatebenefratelli IRCCS. The intervention currently in progress includes the use of Fluid Resuscitation."
  },
  {
    "paraphrased_question": "Which medical facility is handling Matteo Vittorio Farnesi's treatment for CML, and what is the specific medication or therapy being used?",
    "paraphrased_answer": "Matteo Vittorio Farnesi is being treated for CML at Casa di Cura Regionale IRCCS. The therapy being administered is Antibiotics Surgery."
  },
  {
    "paraphrased_question": "Where is Matteo Vittorio Farnesi undergoing care for BK Virus Nephropathy, and what is the exact treatment being applied?",
    "pa

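To make the pairing rule concrete (the i-th perturbed value goes into the i-th QA pair), here is a minimal sketch on toy data. It leaves out the special handling for transactions and occupations. The function name `apply_ith_perturbation`, the person "Ada", and the toy questions are hypothetical; the hospital names are reused from the example output earlier in this notebook. It uses plain `str.replace` because the values are literal strings, whereas `re.sub` would interpret backslashes in the replacement as escapes.

```python
import copy


def apply_ith_perturbation(qa_pairs, pii_targets, replacements, max_pairs=5):
    """Return copies of the first `max_pairs` QA pairs, where pair i has each
    original PII value swapped for the i-th replacement of the same type."""
    perturbed = []
    for i, pair in enumerate(qa_pairs[:max_pairs]):
        new_pair = copy.deepcopy(pair)
        for target in pii_targets:
            candidates = replacements.get(target['type'], [])
            if i >= len(candidates):
                continue  # no i-th replacement available: keep the original value
            for field in ('paraphrased_question', 'paraphrased_answer'):
                new_pair[field] = new_pair[field].replace(target['value'], candidates[i])
        perturbed.append(new_pair)
    return perturbed


# Toy inputs (hypothetical person and questions).
toy_pairs = [
    {'paraphrased_question': 'Where is Ada treated?',
     'paraphrased_answer': 'Ada is treated at Ospedale San Matteo.'},
    {'paraphrased_question': 'Which hospital treats Ada?',
     'paraphrased_answer': 'Ospedale San Matteo treats Ada.'},
]
toy_targets = [{'type': 'hospital_name', 'value': 'Ospedale San Matteo'}]
toy_replacements = {
    'hospital_name': ['Casa di Cura Regionale IRCCS', 'Ospedale Generale SpA 34'],
}

result = apply_ith_perturbation(toy_pairs, toy_targets, toy_replacements)
for pair in result:
    print(pair['paraphrased_answer'])
```

The originals in `toy_pairs` stay untouched because each pair is deep-copied before substitution.
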
In [30]:
# Manual corrections based on verification issues.
# Each entry: (row index, perturbed QA pair index, PII type,
#              string exactly as it appears in the text, replacement, fields to patch).
ANSWER_ONLY = ('paraphrased_answer',)
QUESTION_AND_ANSWER = ('paraphrased_question', 'paraphrased_answer')

manual_corrections = [
    (7, 2, 'DOB', '11th of April, 1981', '12/04/1987', ANSWER_ONLY),
    (26, 2, 'DOB', '5th of March, 1989', '11/06/1982', ANSWER_ONLY),
    (60, 4, 'email_address', 'j.maxsvn88@gmail.oem', 'j.madsen88@gmail.oem', ANSWER_ONLY),
    (257, 2, 'DOB', '2nd of July, 1984', '02/04/2004', ANSWER_ONLY),
    (272, 1, 'DOB', '17th of March, 1977', '01/10/1960', ANSWER_ONLY),
    (492, 2, 'DOB', '18th of June, 1987', '27/01/1969', ANSWER_ONLY),
    (1099, 2, 'DOB', '9th of March, 1984', '04/06/1988', ANSWER_ONLY),
    (1169, 0, 'treatment', 'a pessary', 'Omaveloxolone', QUESTION_AND_ANSWER),
    (1169, 1, 'treatment', 'a pessary', 'Valproic Acid', QUESTION_AND_ANSWER),
    (1169, 2, 'treatment', 'a pessary', 'Conjunctival Resection', QUESTION_AND_ANSWER),
    (1169, 3, 'treatment', 'a pessary', 'Minoxidil', QUESTION_AND_ANSWER),
    (1169, 4, 'treatment', 'a pessary', 'Physical Therapy', QUESTION_AND_ANSWER),
    (1251, 2, 'DOB', '17th of April, 1976', '26/02/1992', ANSWER_ONLY),
    (1313, 2, 'DOB', '14th of September, 1978', '27/07/1992', ANSWER_ONLY),
    (1749, 3, 'DOB', '28th of September, 1988', '19/05/1969', ANSWER_ONLY),
    (1751, 3, 'email_address', 'vbbl.ny63@examprv.xe', 'ebba.ny93@examprv.xe', ANSWER_ONLY),
    (2197, 2, 'DOB', '14th of September, 1981', '17/03/1979', ANSWER_ONLY),
    (2233, 2, 'DOB', '11th of August, 1985', '30/01/1969', ANSWER_ONLY),
]

for row_idx, pair_idx, _pii_type, found_text, replacement, fields in manual_corrections:
    qa_pair = qa_df.loc[row_idx, 'perturbed_qa_pairs'][pair_idx]
    for field in fields:
        qa_pair[field] = qa_pair[field].replace(found_text, replacement)
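
Most of the corrections above exist because the paraphraser rewrote a DOB such as '04/11/1981' as '11th of April, 1981', so the literal replacement found nothing to swap. Hardcoding the replacement also risks drifting from `pii_perturbed_dict`: the verification output in this notebook still reports row 7 with an expected DOB that differs from the hardcoded one. Below is a minimal sketch of a more automatic fix. It assumes the paraphrased form is always '<ordinal day> of <Month>, <year>' and tries both the day-first and the month-first reading of the original date. The helper names are hypothetical.

```python
from datetime import datetime

MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f"{n}{suffix}"


def paraphrased_date_candidates(date_string):
    """Spell out a 'NN/NN/YYYY' date under both day-first and month-first readings."""
    candidates = []
    for fmt in ('%d/%m/%Y', '%m/%d/%Y'):
        try:
            parsed = datetime.strptime(date_string, fmt)
        except ValueError:
            continue  # this reading is not a valid calendar date
        text = f"{ordinal(parsed.day)} of {MONTHS[parsed.month - 1]}, {parsed.year}"
        if text not in candidates:
            candidates.append(text)
    return candidates


def patch_paraphrased_date(text, original_date, perturbed_date):
    """Replace the first spelled-out form of original_date found in text."""
    for candidate in paraphrased_date_candidates(original_date):
        if candidate in text:
            return text.replace(candidate, perturbed_date)
    return text  # nothing matched: leave the text for manual review


# Values taken from row 7: original DOB, paraphrase, and expected perturbed DOB.
answer = 'Matteo Vittorio Farnesi was born on the 11th of April, 1981.'
print(patch_paraphrased_date(answer, '04/11/1981', '29/01/1972'))
```

In the notebook, `perturbed_date` would come from `row['pii_perturbed_dict']['DOB'][i]` rather than from a literal, so the patched text always agrees with what the verification step expects.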

In [34]:

def verify_perturbed_qa_pairs(row):
    """
    Verifies if the generated perturbed_qa_pairs contain the expected perturbed PII values.
    Returns a list of issues found, or an empty list if no issues.
    An issue is reported if an expected perturbed PII value (which is different from the original)
    is not found in the corresponding perturbed QA pair.
    Includes special handling for 'latest_bank_transaction' and 'Occupation'
    to check for components separately.
    """
    perturbed_qa_pairs = row['perturbed_qa_pairs']
    pii_targets = row['pii_picked_dict']
    pii_replacements_dict = row['pii_perturbed_dict']

    issues = []

    # Iterate through the generated perturbed QA pairs (up to 5)
    for i, perturbed_qa_pair in enumerate(perturbed_qa_pairs):
        question = perturbed_qa_pair['paraphrased_question']
        answer = perturbed_qa_pair['paraphrased_answer']

        missing_pii_in_pair = []

        # Check for each PII type if the i-th perturbed value is present
        for pii_item in pii_targets:
            pii_type = pii_item['type']
            original_value = pii_item['value']

            expected_perturbed_value = original_value # Default

            # Get the expected i-th perturbed value if it exists
            if pii_type in pii_replacements_dict and len(pii_replacements_dict[pii_type]) > i:
                expected_perturbed_value = pii_replacements_dict[pii_type][i]

            # Only check for presence if the value was actually perturbed (i.e., different from original)
            if expected_perturbed_value != original_value:
                # --- Special Verification Handling for latest_bank_transaction ---
                if pii_type == 'latest_bank_transaction':
                    expected_perturbed_amount, expected_perturbed_date = extract_amount_and_date(expected_perturbed_value)
                    original_amount, original_date = extract_amount_and_date(original_value) # Get original parts for context

                    amount_missing = False
                    date_missing = False

                    # Check if the perturbed amount was expected and is missing
                    if expected_perturbed_amount and expected_perturbed_amount != original_amount: # Only check if amount was perturbed
                        if expected_perturbed_amount not in question and expected_perturbed_amount not in answer:
                            amount_missing = True

                    # Check if the perturbed date was expected and is missing
                    if expected_perturbed_date and expected_perturbed_date != original_date: # Only check if date was perturbed
                        if expected_perturbed_date not in question and expected_perturbed_date not in answer:
                            date_missing = True

                    if amount_missing or date_missing:
                        details = {
                            'pii_type': pii_type,
                            'expected_perturbed_value': expected_perturbed_value, # Report the full expected value
                            'original_value': original_value # Report the full original value
                        }
                        if amount_missing:
                            details['missing_part'] = 'amount'
                            details['expected_amount'] = expected_perturbed_amount
                        if date_missing:
                            if 'missing_part' in details:
                                details['missing_part'] += ' and date'
                            else:
                                details['missing_part'] = 'date'
                            details['expected_date'] = expected_perturbed_date

                        missing_pii_in_pair.append(details)

                # --- End Special Verification Handling (latest_bank_transaction) ---

                # --- Special Verification Handling for Occupation ---
                elif pii_type == 'Occupation':
                    expected_perturbed_position, expected_perturbed_company = extract_position_and_company(expected_perturbed_value)
                    original_position, original_company = extract_position_and_company(original_value) # Get original parts for context

                    position_missing = False
                    company_missing = False

                    # Check if the perturbed position was expected and is missing
                    if expected_perturbed_position and expected_perturbed_position != original_position: # Only check if position was perturbed
                        if expected_perturbed_position not in question and expected_perturbed_position not in answer:
                            position_missing = True

                    # Check if the perturbed company was expected and is missing
                    if expected_perturbed_company and expected_perturbed_company != original_company: # Only check if company was perturbed
                        if expected_perturbed_company not in question and expected_perturbed_company not in answer:
                            company_missing = True

                    if position_missing or company_missing:
                        details = {
                            'pii_type': pii_type,
                            'expected_perturbed_value': expected_perturbed_value, # Report the full expected value
                            'original_value': original_value # Report the full original value
                        }
                        if position_missing:
                            details['missing_part'] = 'position'
                            details['expected_position'] = expected_perturbed_position
                        if company_missing:
                            if 'missing_part' in details:
                                details['missing_part'] += ' and company'
                            else:
                                details['missing_part'] = 'company'
                            details['expected_company'] = expected_perturbed_company

                        missing_pii_in_pair.append(details)

                # --- End Special Verification Handling (Occupation) ---

                else:
                    # General check for other PII types
                    if expected_perturbed_value not in question and expected_perturbed_value not in answer:
                        missing_pii_in_pair.append({
                            'pii_type': pii_type,
                            'expected_perturbed_value': expected_perturbed_value,
                            'original_value': original_value # Include original for context
                        })

        if missing_pii_in_pair:
            issues.append({
                'qa_pair_index_in_perturbed_list': i, # Index within the generated list (0-4)
                'perturbed_qa_pair': perturbed_qa_pair,
                'missing_pii_details': missing_pii_in_pair
            })

    return issues

# Apply the verification function to create a column with issues
qa_df['verification_issues'] = qa_df.apply(verify_perturbed_qa_pairs, axis=1)

# Filter rows that have issues
rows_with_issues = qa_df[qa_df['verification_issues'].apply(len) > 0]

# Print the rows with issues in a nice format
print("\n--- Rows with Verification Issues ---")
if rows_with_issues.empty:
    print("No issues found. All intended perturbed PII values seem to be present in the generated pairs.")
else:
    for index, row in rows_with_issues.iterrows():
        print(f"\n--- Issues in Row Index: {index} ---")
        for issue in row['verification_issues']:
            print(f"  Issue found in Perturbed QA Pair (Index {issue['qa_pair_index_in_perturbed_list']} in the generated list):")
            print(f"    Question: {issue['perturbed_qa_pair']['paraphrased_question']}")
            print(f"    Answer: {issue['perturbed_qa_pair']['paraphrased_answer']}")
            print("    Details of Missing PII:")
            for missing_detail in issue['missing_pii_details']:
                print(f"      - PII Type: {missing_detail['pii_type']}")
                print(f"        Expected Perturbed Value: '{missing_detail['expected_perturbed_value']}'")
                print(f"        Original Value (was replaced from): '{missing_detail['original_value']}'")
                if 'missing_part' in missing_detail:
                    print(f"        Missing Part(s): {missing_detail['missing_part']}")
                    if 'expected_amount' in missing_detail:
                        print(f"        Expected Amount: '{missing_detail['expected_amount']}'")
                    if 'expected_date' in missing_detail:
                        print(f"        Expected Date: '{missing_detail['expected_date']}'")
                    if 'expected_position' in missing_detail:
                        print(f"        Expected Position: '{missing_detail['expected_position']}'")
                    if 'expected_company' in missing_detail:
                        print(f"        Expected Company: '{missing_detail['expected_company']}'")
        print("-" * 40) # Separator for next row with issues


--- Rows with Verification Issues ---

--- Issues in Row Index: 7 ---
  Issue found in Perturbed QA Pair (Index 2 in the generated list):
    Question: Can you tell me when Matteo Vittorio Farnesi was born?
    Answer: Matteo Vittorio Farnesi was born on the 12/04/1987.
    Details of Missing PII:
      - PII Type: DOB
        Expected Perturbed Value: '29/01/1972'
        Original Value (was replaced from): '04/11/1981'
----------------------------------------

--- Issues in Row Index: 26 ---
  Issue found in Perturbed QA Pair (Index 2 in the generated list):
    Question: Could you tell me when Ebba Vilhelm Lindqvist was born?
    Answer: Ebba Vilhelm Lindqvist was born on the 11/06/1982.
    Details of Missing PII:
      - PII Type: DOB
        Expected Perturbed Value: '16/02/1997'
        Original Value (was replaced from): '05/03/1989'
----------------------------------------

--- Issues in Row Index: 60 ---
  Issue found in Perturbed QA Pair (Index 1 in the generated list):
   