# Synthetic Data with Placeholders

This notebook demonstrates how to generate synthetic data with PII placeholders using Llama3-8B-Instruct.

## User Inputs

In [1]:
# LLM
model_name = 'Meta-Llama-3-8B-Instruct'

# Number of prompts (and therefore synthetic documents) to generate
N = 1_500

## Libraries and Functions

In [2]:
from faker import Faker  # generates fake data
import ctypes
import random
from pathlib import Path
from tqdm.auto import tqdm
import transformers
import numpy as np
import pandas as pd
import torch
import gc
import os

# Use only GPU 1
os.environ['CUDA_VISIBLE_DEVICES'] = '1'



In [3]:
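# Handle to glibc (Linux only), used by clear_memory() to call malloc_trim
libc = ctypes.CDLL("libc.so.6")


def seed_everything(*, seed=42):
    """Seed Faker, random, NumPy, and PyTorch with the same value."""
    Faker.seed(seed)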
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


def clear_memory():
    libc.malloc_trim(0)
    torch.cuda.empty_cache()
    gc.collect()


def load_model(model_path: str, *, quantize: bool = False):
    # NOTE: `quantize` is accepted but not used; the model always loads in bfloat16
    model_pipeline = transformers.pipeline(
        "text-generation",
        model=model_path,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device="cuda",
    )
    return model_pipeline


def generate_texts(pipeline, generated_df):
    # Stop on either the regular EOS token or Llama 3's end-of-turn token
    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    # Generate the texts
    for i in tqdm(range(len(generated_df))):
        # Get the chat messages and the per-row sampling settings
        row = generated_df.iloc[i]

        # Render the chat messages into a single prompt string (not tokenized yet)
        prompt = pipeline.tokenizer.apply_chat_template(
            row['prompt'],
            tokenize=False,
            add_generation_prompt=True
        )

        # Generate the output from the prompt using the sampled settings
        outputs = pipeline(
            prompt,
            max_new_tokens=int(row['max_new_tokens']),
            eos_token_id=terminators,
            do_sample=True,
            temperature=float(row['temperature']),
            top_p=float(row['top_p']),
            top_k=int(row['top_k']),
            repetition_penalty=float(row['repetition_penalty']),
        )

        # The pipeline returns the prompt followed by the completion
        row_label = generated_df.index[i]
        generated_df.loc[row_label, 'generated_text'] = outputs[0]["generated_text"]
    return generated_df

seed_everything(seed=100)

## Prompts for the LLM

In [4]:
# PII entity types
label_types = ['NAME', 'EMAIL', 'USERNAME', 'ID_NUM',
               'PHONE_NUM', 'URL_PERSONAL', 'STREET_ADDRESS']

# LLM path
path_model = str(Path(os.getenv('LLM_MODELS')) / model_name)

# List of topics (one per line; blank lines are skipped)
with open('./prompt-templates/topics-list.txt') as f:
    topics = [line.strip() for line in f if line.strip()]

# List of majors (one per line; blank lines are skipped)
with open('./prompt-templates/majors.txt') as f:
    majors = [line.strip() for line in f if line.strip()]

# Placeholder fields the LLM may be asked to include (in addition to YOUR_NAME)
cols = ['IDENTIFICATION_NUM', 'STREET_ADDRESS', 'PHONE_NUM',
        'USERNAME', 'URL_PERSONAL', 'EMAIL']

# List of writing styles
writing_style = [
    'an essay',
    'a critical analysis (with citations and references)',
    'an untitled blog (i.e., without a title)',
    'a few paragraphs (without a title)'
]

# For each prompt, pick one or two extra PII fields and a writing style
fields_used = []
writing_styles = []
for _ in range(N):
    fields_to_use = random.sample(cols, random.randint(1, 2))
    random.shuffle(fields_to_use)
    fields_used.append(", ".join(['YOUR_NAME'] + fields_to_use))
    writing_styles.append(random.choice(writing_style))
    
# Store in dataframe
df = pd.DataFrame({'fields_used': fields_used,
                    'writing_style': writing_styles})
del fields_to_use, fields_used, writing_styles

# Generate model parameter settings
df['max_new_tokens'] = [random.choice([2048]) for _ in range(len(df))]
df['temperature'] = [random.choice(
    [10, 20, 30, 70]) / 100 for _ in range(len(df))]
df['top_p'] = [random.randint(a=90, b=95) / 100 for _ in range(len(df))]
df['top_k'] = [random.choice([40, 50]) for _ in range(len(df))]
df['repetition_penalty'] = [random.choice(
    [1.1, 1.2]) for _ in range(len(df))]

# Assign a random major (occupation) and topic to each prompt
df['occupation'] = [random.choice(majors).lower() for _ in range(len(df))]
df['topic'] = [random.choice(topics).lower() for _ in range(len(df))]

display(df.head())

| | fields_used | writing_style | max_new_tokens | temperature | top_p | top_k | repetition_penalty | occupation | topic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | YOUR_NAME, USERNAME | a few paragraphs (without a title) | 2048 | 0.1 | 0.91 | 50 | 1.2 | real estate | gun control and its impact on society |
| 1 | YOUR_NAME, EMAIL | a few paragraphs (without a title) | 2048 | 0.1 | 0.9 | 50 | 1.1 | psychology | the effects of climate change on wildlife |
| 2 | YOUR_NAME, URL_PERSONAL, USERNAME | an essay | 2048 | 0.7 | 0.92 | 50 | 1.1 | engineering physics | the impact of income inequality on society |
| 3 | YOUR_NAME, EMAIL | a few paragraphs (without a title) | 2048 | 0.7 | 0.93 | 40 | 1.2 | nursing | the rise of populism in politics |
| 4 | YOUR_NAME, IDENTIFICATION_NUM, STREET_ADDRESS | a critical analysis (with citations and refere... | 2048 | 0.3 | 0.93 | 40 | 1.1 | dance | the future of renewable energy sources |


In [5]:
# Format the selected fields as one {PLACEHOLDER} per line
def prompt_placeholder(fields):
    fields = fields.split(', ')
    return '\n'.join(f'{{{field}}}' for field in fields)

df['prompt_pii'] = df.apply(lambda x: prompt_placeholder(fields=x['fields_used']),
                            axis=1)

# Prompt template files
prompt_files = {
    'mixed': list(Path('./prompt-templates/placeholder/mixed-llama3').glob('*.txt')),
}


def create_prompt(files: dict, data: pd.Series):
    # Only 'mixed' templates are defined in prompt_files, so always draw from them
    file = random.sample(files['mixed'], 1)[0]
    with open(file) as f:
        prompt = f.read()
    prompt = prompt.replace('{OCCUPATION}', data['occupation'])
    prompt = prompt.replace('{REPORT}', data['writing_style'])
    prompt = prompt.replace('{TOPIC}', data['topic'])

    # Each template holds the system prompt, a separator line, then the user prompt
    system_prompt = prompt.split('%%%%%%%%%%%%%%%%%%%%%%%%%')[0].strip()
    user_prompt = prompt.split('%%%%%%%%%%%%%%%%%%%%%%%%%')[1].strip()

    prompt_defs = {
        'YOUR_NAME': "Full name",
        'IDENTIFICATION_NUM': "Online student identification number",
        'STREET_ADDRESS': "Home street address",
        'PHONE_NUM': "Personal phone number",
        'USERNAME': "Online student username",
        'URL_PERSONAL': "Personal website or social media platform",
        'EMAIL': "Personal email address"}

    sys_pii = []
    for pii in data.prompt_pii.split('\n'):
        sys_pii.append(f'{pii}: {prompt_defs[pii[1:-1]]}')
    sys_pii = '\n'.join(sys_pii)

    system_prompt = system_prompt.replace('{INSERT_INFO_HERE}', sys_pii)
    user_prompt = user_prompt.replace(
        '{INSERT_INFO_HERE}', data['prompt_pii'])

    prompt = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    return file.name, prompt

# Create prompt for the model
df['file_name'], df['prompt'] = zip(*df.apply(lambda x: create_prompt(files=prompt_files,
                                                                        data=x), axis=1))

# Model used to generate the text
df['model'] = model_name

# Display the dataframe and a few example prompts
display(df.head())
for i in range(3):
    print(f'Example Prompt #{i}:\n{df.prompt.iloc[i][0]}\n\n')

| | fields_used | writing_style | max_new_tokens | temperature | top_p | top_k | repetition_penalty | occupation | topic | prompt_pii | file_name | prompt | model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | YOUR_NAME, USERNAME | a few paragraphs (without a title) | 2048 | 0.1 | 0.91 | 50 | 1.2 | real estate | gun control and its impact on society | `{YOUR_NAME}\n{USERNAME}` | temp1.txt | `[{'role': 'system', 'content': 'You are a univ...` | Meta-Llama-3-8B-Instruct |
| 1 | YOUR_NAME, EMAIL | a few paragraphs (without a title) | 2048 | 0.1 | 0.9 | 50 | 1.1 | psychology | the effects of climate change on wildlife | `{YOUR_NAME}\n{EMAIL}` | temp1.txt | `[{'role': 'system', 'content': 'You are a univ...` | Meta-Llama-3-8B-Instruct |
| 2 | YOUR_NAME, URL_PERSONAL, USERNAME | an essay | 2048 | 0.7 | 0.92 | 50 | 1.1 | engineering physics | the impact of income inequality on society | `{YOUR_NAME}\n{URL_PERSONAL}\n{USERNAME}` | temp1.txt | `[{'role': 'system', 'content': 'You are a univ...` | Meta-Llama-3-8B-Instruct |
| 3 | YOUR_NAME, EMAIL | a few paragraphs (without a title) | 2048 | 0.7 | 0.93 | 40 | 1.2 | nursing | the rise of populism in politics | `{YOUR_NAME}\n{EMAIL}` | temp1.txt | `[{'role': 'system', 'content': 'You are a univ...` | Meta-Llama-3-8B-Instruct |
| 4 | YOUR_NAME, IDENTIFICATION_NUM, STREET_ADDRESS | a critical analysis (with citations and refere... | 2048 | 0.3 | 0.93 | 40 | 1.1 | dance | the future of renewable energy sources | `{YOUR_NAME}\n{IDENTIFICATION_NUM}\n{STREET_ADD...` | temp1.txt | `[{'role': 'system', 'content': 'You are a univ...` | Meta-Llama-3-8B-Instruct |


Example Prompt #0:
{'role': 'system', 'content': 'You are a university student majoring in real estate and have been studying the design thinking tool - visualization. You describe applications in your life and how you applied the tool (e.g., what you did and how the tool was applied effectively or ineffectively). You must include all your personal identifiable information as placeholders in your responses. A list of the only personal identification placeholders you can use and their description is as follows:\n{YOUR_NAME}: Full name\n{USERNAME}: Online student username'}


Example Prompt #1:
{'role': 'system', 'content': 'You are a university student majoring in psychology and have been studying the design thinking tool - visualization. You describe applications in your life and how you applied the tool (e.g., what you did and how the tool was applied effectively or ineffectively). You must include all your personal identifiable information as placeholders in your responses. A list of
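
The template files themselves are not included in this notebook. As a rough sketch of what `create_prompt` does, assuming a template is a system section and a user section separated by a line of `%` characters, the same substitutions can be reproduced on an in-memory string. The template text, the `SEPARATOR` and `DESCRIPTIONS` constants, and the `build_messages` helper below are hypothetical and only illustrate the layout.

```python
# Hypothetical separator and template; the real files live under
# ./prompt-templates/placeholder/mixed-llama3 and are not shown here.
SEPARATOR = "%" * 25

TEMPLATE = (
    "You are a university student majoring in {OCCUPATION}. "
    "Placeholders you can use:\n{INSERT_INFO_HERE}\n"
    + SEPARATOR + "\n"
    "Please write {REPORT} on {TOPIC}. Use these placeholders:\n{INSERT_INFO_HERE}"
)

# Subset of the placeholder descriptions used in create_prompt
DESCRIPTIONS = {"YOUR_NAME": "Full name", "USERNAME": "Online student username"}


def build_messages(template, occupation, report, topic, fields):
    """Fill a two-part template and return system and user chat messages."""
    filled = (template.replace("{OCCUPATION}", occupation)
                      .replace("{REPORT}", report)
                      .replace("{TOPIC}", topic))
    # Split once into the system part and the user part
    system_part, user_part = (part.strip() for part in filled.split(SEPARATOR, 1))
    # The system prompt lists each placeholder with its description,
    # the user prompt lists the bare placeholders only
    described = "\n".join(f"{{{f}}}: {DESCRIPTIONS[f]}" for f in fields)
    bare = "\n".join(f"{{{f}}}" for f in fields)
    return [
        {"role": "system",
         "content": system_part.replace("{INSERT_INFO_HERE}", described)},
        {"role": "user",
         "content": user_part.replace("{INSERT_INFO_HERE}", bare)},
    ]


messages = build_messages(TEMPLATE, "psychology", "an essay",
                          "the rise of populism in politics",
                          ["YOUR_NAME", "USERNAME"])
for message in messages:
    print(message["role"], "->", message["content"], sep="\n", end="\n\n")
```

Keeping both sections in one file means the system and user prompts for a given template cannot drift apart, at the cost of reserving the separator line as a marker that must not appear in the prompt text.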

## Generate Synthetic Data with Placeholders

In [6]:
model = load_model(model_path=path_model)

Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.47it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
!nvidia-smi

Wed May  8 21:20:51 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:01:00.0 Off |                  N/A |
|  0%   42C    P8               8W / 350W |      3MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:02:00.0 Off |  

In [8]:
# Reduce dataframe size just for demonstration
df = df.loc[0:2].reset_index(drop=True)
df_gen = generate_texts(pipeline=model, generated_df=df)

  0%|          | 0/3 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
 33%|███▎      | 1/3 [00:10<00:21, 10.89s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
 67%|██████▋   | 2/3 [00:20<00:10, 10.34s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
100%|██████████| 3/3 [00:39<00:00, 13.21s/it]


In [9]:
# View an example of the generated text
print(df_gen.generated_text.iloc[0])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a university student majoring in real estate and have been studying the design thinking tool - visualization. You describe applications in your life and how you applied the tool (e.g., what you did and how the tool was applied effectively or ineffectively). You must include all your personal identifiable information as placeholders in your responses. A list of the only personal identification placeholders you can use and their description is as follows:
{YOUR_NAME}: Full name
{USERNAME}: Online student username<|eot_id|><|start_header_id|>user<|end_header_id|>

Please write a a few paragraphs (without a title) on your experience with the design thinking tool - visualization and how you applied it to an application in your field of study. You must incorporate ALL your personal information in the a few paragraphs (without a title) using only placeholders. All the below personal identifiable information placeholders must

Notice that the `{YOUR_NAME}` and `{USERNAME}` placeholders are carried into the custom synthetic dataset. The main benefit of placeholders is that they are much easier to locate in the text than PII values supplied in the prompt, because the model can hallucinate new PII or slightly alter the PII it was given.
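
Because `generated_text` contains the rendered prompt followed by the completion, it can help to isolate the model's reply and then check which placeholders it actually used. The sketch below assumes the assistant turn begins with the same `<|start_header_id|>...<|end_header_id|>` header format visible in the printed output for the system and user turns. The helper names and the sample string are hypothetical.

```python
import re

# Assumed header that starts the assistant turn in the rendered chat template
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
# Placeholders look like {YOUR_NAME}: upper-case letters and underscores in braces
PLACEHOLDER_RE = re.compile(r"\{([A-Z_]+)\}")


def extract_response(generated_text):
    """Return the text after the last assistant header (whole text if absent)."""
    return generated_text.rsplit(ASSISTANT_HEADER, 1)[-1].strip()


def check_placeholders(response, fields_used):
    """Compare placeholders found in the response with those requested."""
    requested = set(fields_used.split(", "))
    found = set(PLACEHOLDER_RE.findall(response))
    return {"missing": sorted(requested - found),
            "unexpected": sorted(found - requested)}


# Hypothetical generated text, shortened for illustration
sample = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Write using {YOUR_NAME} and {USERNAME}."
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    "My name is {YOUR_NAME} and I post as {USERNAME}. Email me at {EMAIL}."
)
response = extract_response(sample)
report = check_placeholders(response, "YOUR_NAME, USERNAME")
print(response)
print(report)
```

Rows with missing or unexpected placeholders could be dropped or regenerated before any PII is injected.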

The next step is to replace the placeholders with customized PII. For example, `{YOUR_NAME}` can be replaced with any synthetic name you would like to use. The scripts below complete these steps:
- [Faker PII Information](pii-syn-data.py)
- [Inject PII Information into LLM Generated Text](./finalize-placeholder-data-llama3.py)
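
As a minimal illustration of the injection step (not the implementation in the scripts above), the sketch below replaces each placeholder with a value and records the character span of every inserted value, which is the kind of information a PII labeling task would need. The `inject_pii` helper and the hard-coded values are hypothetical; in practice the values could come from Faker.

```python
import re

# Placeholders look like {YOUR_NAME}: upper-case letters and underscores in braces
PLACEHOLDER_RE = re.compile(r"\{([A-Z_]+)\}")


def inject_pii(text, pii):
    """Replace {FIELD} placeholders with values and record character spans.

    Returns the filled text and a list of (start, end, field) tuples that
    index into the filled text.
    """
    pieces, spans = [], []
    cursor = 0  # position in the source text
    offset = 0  # length of the output built so far
    for match in PLACEHOLDER_RE.finditer(text):
        field = match.group(1)
        if field not in pii:
            continue  # leave unknown placeholders untouched
        before = text[cursor:match.start()]
        pieces.append(before)
        offset += len(before)
        value = pii[field]
        spans.append((offset, offset + len(value), field))
        pieces.append(value)
        offset += len(value)
        cursor = match.end()
    pieces.append(text[cursor:])
    return "".join(pieces), spans


# Hypothetical text and PII values
text = "I am {YOUR_NAME}, username {USERNAME}."
pii = {"YOUR_NAME": "Jane Doe", "USERNAME": "jdoe42"}
filled, spans = inject_pii(text, pii)
print(filled)
for start, end, field in spans:
    print(field, (start, end), filled[start:end])
```

Recording spans at injection time avoids searching for the values afterwards, which would be ambiguous if a value also occurs elsewhere in the text.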