# Generating Synthetic Data via Prompt Engineering
This notebook demonstrates how to generate synthetic data with few-shot prompt engineering. The exemplars are taken from the MathGSM8K dataset, and inference runs on the Flan-UL2 model.
The goal is to show both the efficacy and the limitations of a simple data generation method, and how to set up a prompt-engineering pipeline for data synthesis.

In [7]:
import os
import json
from constants import MATH_DATASET_DIR

## 1. Generate Prompts (Few-Shot)
The first step is to load and process the data for few-shot prompting. Note that the precise format, syntax, and semantics of the prompt are critical in prompt engineering.

In [11]:
# load and preprocess data: one JSON object per line, text stored under the 'sample' key
import pprint

with open(os.path.join(MATH_DATASET_DIR, "math_questions_and_answers.jsonl")) as f:
    data = [json.loads(line)['sample'] for line in f]

pprint.pprint(data[0:5])

['Problem: Janet’s ducks lay 16 eggs per day. She eats three for breakfast '
 'every morning and bakes muffins for her friends every day with four. She '
 "sells the remainder at the farmers' market daily for $2 per fresh duck egg. "
 "How much in dollars does she make every day at the farmers' market?\n"
 'Answer: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. She makes 9 '
 '* 2 = $<<9*2=18>>18 every day at the farmer’s market. \n'
 'The answer is 18.',
 'Problem: A robe takes 2 bolts of blue fiber and half that much white fiber.  '
 'How many bolts in total does it take?\n'
 'Answer: It takes 2/2=<<2/2=1>>1 bolt of white fiber So the total amount of '
 'fabric is 2+1=<<2+1=3>>3 bolts of fabric \n'
 'The answer is 3.',
 'Problem: Josh decides to try flipping a house.  He buys a house for $80,000 '
 'and then puts in $50,000 in repairs.  This increased the value of the house '
 'by 150%.  How much profit did he make?\n'
 'Answer: The cost of the house and repairs came out to 

In [13]:
import random

# randomly sample k distinct exemplars; more elaborate selection methods are possible
def sample_exemplars(sampling_pool, k):
    return random.sample(sampling_pool, k=k)

# choose 8 exemplars
list_of_exemplars = sample_exemplars(data, 8)

pprint.pprint(list_of_exemplars)

['Problem: Joey has 214 points before his turn in Scrabble. He scores 26 '
 'points. Then Marcy, who has 225 points, scores 10 points. By how many points '
 'is Joey now winning?\n'
 'Answer: Joey now has 214+26=<<214+26=240>>240. Marcy now has '
 '225+10=<<225+10=235>>235. And 240-235=<<240-235=5>>5. \n'
 'The answer is 5.',
 'Problem: Jeans makeup artist charges her $250 an hour.  She requires very '
 'expensive makeup for a movie she is in and it takes 6 hours to do each day '
 'and she needs it done 4 times a week.  The movie takes 5 weeks to finish.  '
 'After the movie is done the makeup artist gives Jean a 10% discount because '
 'of the amount of work done.  How much did Jean pay?\n'
 'Answer: Jean pays 250*6=$<<250*6=1500>>1500 a day So he pays '
 '1500*4=$<<1500*4=6000>>6000 a week So the work cost '
 '6000*5=$<<6000*5=30000>>30,000 They give a discount of '
 '30,000*.1=$<<30000*.1=3000>>3000 So the total cost paid is '
 '30,000-3000=$<<30000-3000=27000>>27,000 \n'
 'The answ
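
Because the sampled exemplars change the prompt, and therefore the completions, it can help to make the draw repeatable. The following is a minimal sketch, assuming a seeded private generator is acceptable; `sample_exemplars_seeded`, the `seed` parameter, and the placeholder pool are illustrative names and data, not part of the original notebook.

```python
import random


def sample_exemplars_seeded(sampling_pool, k, seed=0):
    """Draw k distinct exemplars with a private, seeded generator."""
    # A local Random instance leaves the global random state untouched
    rng = random.Random(seed)
    return rng.sample(sampling_pool, k)


# Placeholder pool for illustration only, not the real dataset
pool = [f"Problem {i}" for i in range(20)]

first = sample_exemplars_seeded(pool, k=8, seed=42)
second = sample_exemplars_seeded(pool, k=8, seed=42)
print(first == second)  # True
```

Reusing the same seed reproduces the same eight exemplars, which makes it easier to compare completions across runs.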

In [16]:
# choose one question from the dataset of unanswered questions
with open(os.path.join(MATH_DATASET_DIR, "math_questions.jsonl")) as f:
    problems = [json.loads(line)['sample'] for line in f]
problem = sample_exemplars(problems, 1)

# format prompt: metaprompt, exemplars, and template
metaprompt = "Solve the math problem step-by-step."
exemplars = "\n".join(list_of_exemplars)
# [:-1] drops the last character of the problem (assumed here to be a trailing newline)
template = problem[0][:-1] + "\nAnswer: "

prompt_string = metaprompt + "\n" + exemplars + "\n" + template

print(prompt_string)

Solve the math problem step-by-step.
Problem: Joey has 214 points before his turn in Scrabble. He scores 26 points. Then Marcy, who has 225 points, scores 10 points. By how many points is Joey now winning?
Answer: Joey now has 214+26=<<214+26=240>>240. Marcy now has 225+10=<<225+10=235>>235. And 240-235=<<240-235=5>>5. 
The answer is 5.
Problem: Jeans makeup artist charges her $250 an hour.  She requires very expensive makeup for a movie she is in and it takes 6 hours to do each day and she needs it done 4 times a week.  The movie takes 5 weeks to finish.  After the movie is done the makeup artist gives Jean a 10% discount because of the amount of work done.  How much did Jean pay?
Answer: Jean pays 250*6=$<<250*6=1500>>1500 a day So he pays 1500*4=$<<1500*4=6000>>6000 a week So the work cost 6000*5=$<<6000*5=30000>>30,000 They give a discount of 30,000*.1=$<<30000*.1=3000>>3000 So the total cost paid is 30,000-3000=$<<30000-3000=27000>>27,000 
The answer is 27000.
Problem: Archie buys
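
The prompt assembly above can be wrapped in a small function so that the same layout is reused for every new problem. This is a sketch under stated assumptions: `build_prompt` is a hypothetical helper, the demo exemplar is invented for illustration, and `rstrip()` is used on the assumption that the slice in the original cell only exists to drop a trailing newline.

```python
def build_prompt(metaprompt, exemplars, problem):
    """Join the instruction, the solved exemplars, and the open problem."""
    # rstrip() removes trailing whitespace without cutting a real character
    template = problem.rstrip() + "\nAnswer: "
    return "\n".join([metaprompt, *exemplars, template])


# Invented exemplar in the same layout as the dataset samples
demo_exemplars = [
    "Problem: What is 2 + 3?\nAnswer: 2+3=<<2+3=5>>5 \nThe answer is 5.",
]

demo_prompt = build_prompt(
    "Solve the math problem step-by-step.",
    demo_exemplars,
    "Problem: What is 4 * 6?\n",
)
print(demo_prompt)
```

The prompt ends with `Answer: ` so that the model continues with a worked solution in the same format as the exemplars.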

## 2. Inference
The second step is to run inference with the generated prompt to obtain completions. Here we call the Flan-UL2 model with the prompt built in the previous section.

In [17]:
from constants import LOCAL_MODELS_ROOT, HF_MODELS_ROOT
from pathlib import Path
import transformers
import torch

In [18]:
BASE_MODEL = "flan-ul2"

model_path_or_id = (
    f"{LOCAL_MODELS_ROOT}/{BASE_MODEL}"
    if Path(f"{LOCAL_MODELS_ROOT}/{BASE_MODEL}").exists()
    else f"{HF_MODELS_ROOT}/{BASE_MODEL}"
)

tokenizer = transformers.AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=model_path_or_id
)

model = transformers.AutoModelForSeq2SeqLM.from_pretrained(
    pretrained_model_name_or_path=model_path_or_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

In [19]:
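device = "cuda:0"  # assumes a CUDA GPU is available
generation_config = transformers.GenerationConfig(
    temperature=0.7,  # lower values make sampling more deterministic
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    do_sample=True,  # sample instead of decoding greedily, so repeated calls can differ
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    min_new_tokens=32,
    max_new_tokens=512,
)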

inputs = tokenizer(prompt_string, return_tensors="pt", return_token_type_ids=False).to(device)
output = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(output[0], skip_special_tokens=True))

The radiator was 400 *.8 = $320 off. So it cost 400 - 320 = $120. The mechanic charged him 3 * 50 = $150. So the total cost was 120 + 150 = $270. The answer is 270.
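
The completion printed above keeps "The answer is" on the same line as the reasoning, whereas the exemplars place it on its own line. A sketch of how several sampled completions could be gathered and normalized into the exemplar layout follows. `collect_completions`, `generate_fn`, and `fake_generate` are hypothetical names; in a real run `generate_fn` would wrap the tokenizer and model calls shown above, and that wiring is left out here.

```python
from typing import Callable


def collect_completions(
    generate_fn: Callable[[str], str], prompt: str, n: int
) -> list[str]:
    """Call generate_fn n times and format each result like the exemplars."""
    completions = []
    for _ in range(n):
        text = generate_fn(prompt).strip()
        # Put the final-answer sentence on its own line, as in the exemplars
        text = text.replace(" The answer is ", "\nThe answer is ")
        completions.append("Answer: " + text)
    return completions


# Stand-in generator for illustration only; it ignores the prompt
def fake_generate(prompt: str) -> str:
    return "Forrest is 4*2=8. Mark is 8*3=24. The answer is 24."


sampled = collect_completions(fake_generate, "prompt text", n=3)
print(sampled[0])
```

With sampling enabled, each call to a real generator can return a different reasoning path, which is what the post-processing in the next section relies on.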


## 3. Self-Consistency
The third step is to post-process the responses. Since this is a QA task on math questions, an effective post-processing step is a majority-voting technique known as Self-Consistency: sample several completions for the same problem and keep those that agree on the most frequent final answer. Agreement does not verify that the answer is correct, but it is a strong indication that it is, provided the model's errors are not systematically biased toward one wrong answer.

In [20]:
# Problem: Mark is three times older than Forrest. Forrest is two times older than Chris. If Chris is 4, how old is Mark?

list_of_completions = [
    "Answer: Forrest is 4*2=8 years old. Mark is 8*3=24 years old.\nThe answer is 24.",
    "Answer: Forrest is twice Chris's age, which is 4, so Forrest is 4*2=8. Mark is thrice Forrest's age, which is 8, so Mark is 8*3=24.\nThe answer is 24.",
    "Answer: Chris is 4, so Forrest is 2*4=10. Mark is 3*10=34.\nThe answer is 34.",
    "Answer: Chris is 4. Forrest is 4*2=8. Mark is 8*3=24.\nThe answer is 24."
]

In [23]:
from collections import defaultdict
import re

# group completions by the number that follows the final "The answer is"
def create_validation_set(sentences: list[str]):
    validation_set = defaultdict(list)
    pattern = "\nThe answer is "

    for sentence in sentences:
        # locate the last answer marker; skip completions that lack one
        start_idx = sentence.rfind(pattern)
        if start_idx == -1:
            continue
        substring = sentence[start_idx + len(pattern) :].strip()
        # first number after the marker, e.g. "24." -> 24.0
        # note: thousands separators ("27,000") and negative signs are not handled
        numbers = re.findall(r"\d+\.?\d*", substring)
        if numbers:
            validation_set[float(numbers[0])].append(sentence)

    return validation_set

# create validation set from list of completions
validation_set = create_validation_set(list_of_completions)

validation_set

defaultdict(list,
            {24.0: ['Answer: Forrest is 4*2=8 years old. Mark is 8*3=24 years old.\nThe answer is 24.',
              "Answer: Forrest is twice Chris's age, which is 4, so Forrest is 4*2=8. Mark is thrice Forrest's age, which is 8, so Mark is 8*3=24.\nThe answer is 24.",
              'Answer: Chris is 4. Forrest is 4*2=8. Mark is 8*3=24.\nThe answer is 24.'],
             34.0: ['Answer: Chris is 4, so Forrest is 2*4=10. Mark is 3*10=34.\nThe answer is 34.']})

In [24]:
# majority voting: keep the completions that share the most frequent answer
def self_consistency(validation_set: dict[float, list]):
    if not validation_set:
        return []
    # ties are resolved in favor of the answer that was inserted first
    most_frequent_key = max(validation_set, key=lambda k: len(validation_set[k]))
    return validation_set[most_frequent_key]

filtered_completions = self_consistency(validation_set)

filtered_completions

['Answer: Forrest is 4*2=8 years old. Mark is 8*3=24 years old.\nThe answer is 24.',
 "Answer: Forrest is twice Chris's age, which is 4, so Forrest is 4*2=8. Mark is thrice Forrest's age, which is 8, so Mark is 8*3=24.\nThe answer is 24.",
 'Answer: Chris is 4. Forrest is 4*2=8. Mark is 8*3=24.\nThe answer is 24.']
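
Some exemplars write amounts with thousands separators (for example `27,000`), and a model may produce the final sentence on the same line as its reasoning. The sketch below shows one way the answer extraction and the vote could be made more tolerant of both. The regular expression, `extract_answer`, and `majority_answer` are assumptions for illustration, not the notebook's own method.

```python
import re
from collections import Counter

# Optional "$", optional minus sign, digits with optional commas and decimals
ANSWER_RE = re.compile(r"The answer is\s*\$?(-?[\d,]*\.?\d+)")


def extract_answer(completion: str):
    """Return the last stated numeric answer as a float, or None if absent."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def majority_answer(completions):
    """Return (answer, votes) for the most common parsed answer, or None."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0]


votes = [
    "Answer: Forrest is 4*2=8 years old. Mark is 8*3=24 years old.\nThe answer is 24.",
    "Answer: Chris is 4, so Forrest is 2*4=10. Mark is 3*10=34.\nThe answer is 34.",
    "Answer: Chris is 4. Forrest is 4*2=8. Mark is 8*3=24. The answer is 24.",
]
print(majority_answer(votes))  # (24.0, 2)
print(extract_answer("So the total is $27,000.\nThe answer is 27,000."))  # 27000.0
```

Returning the vote count alongside the answer makes it possible to discard problems where agreement is weak.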

## 4. Evaluation
The final step is evaluation. For this demo we want diverse responses, so that the dataset covers a sufficiently broad set of reasoning paths. We use the ROUGE-L metric as a proxy for diversity: a completion that overlaps heavily with the other completions adds little variety.

In [34]:
from rouge import Rouge 
rouge = Rouge()

def calculate_rouge(candidate, reference):
    # candidate: generated sentence; reference: sentence it is compared against
    # returns the ROUGE-L recall ('r'): LCS length / number of words in the reference
    return rouge.get_scores(candidate, reference)[0]['rouge-l']['r']

candidate = "hi, everyone, it's nice to meet you!"
reference = "hi, it's nice to meet everyone!"

# the longest common subsequence is "hi," + "it's nice to meet" (5 words)
# recall against `reference`: 5 of its 6 words, 5/6 ~ 0.8333333
# recall against `candidate`: 5 of its 7 words, 5/7 ~ 0.7142857

print(calculate_rouge(candidate, reference))
print(calculate_rouge(reference, candidate))


0.8333333333333334
0.7142857142857143
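
The two scores printed above can be reproduced by hand from the longest common subsequence (LCS). The sketch below assumes plain whitespace tokenization; the `rouge` package may tokenize or split sentences differently, so values can diverge on other inputs. `lcs_length` and `rouge_l_recall` are illustrative names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] holds the LCS length of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the number of tokens in the reference."""
    cand_tokens, ref_tokens = candidate.split(), reference.split()
    if not ref_tokens:
        return 0.0
    return lcs_length(cand_tokens, ref_tokens) / len(ref_tokens)


cand = "hi, everyone, it's nice to meet you!"
ref = "hi, it's nice to meet everyone!"

print(round(rouge_l_recall(cand, ref), 4))  # 0.8333
print(round(rouge_l_recall(ref, cand), 4))  # 0.7143
```

Swapping the arguments changes only the denominator, which is why the metric is not symmetric.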


In [35]:
import numpy as np

# returns the rouge scores of one sentence against each sentence in a list
def get_rouge_score_per_sentence(sentence: str, remaining_sentences: list[str]):
    rouge_scores = []
    for remaining_sentence in remaining_sentences:
        rouge_score = calculate_rouge(sentence, remaining_sentence)
        rouge_scores.append(rouge_score)
    return rouge_scores


# returns the per-sentence average rouge scores and their overall mean
def calculate_self_rouge(sentences: list[str]):
    rouge_scores = []

    for i, sentence in enumerate(sentences):
        # score each sentence against every other sentence, skipping itself by index
        other_sentences = sentences[:i] + sentences[i + 1:]
        rouge_score = get_rouge_score_per_sentence(sentence, other_sentences)
        rouge_scores.append(rouge_score)

    average_rouge_scores_per_sentence = [
        np.mean(sentence_rouge_scores) for sentence_rouge_scores in rouge_scores
    ]
    average_rouge_score = np.mean(average_rouge_scores_per_sentence)

    return average_rouge_scores_per_sentence, average_rouge_score

calculate_self_rouge([candidate,reference])

([0.8333333333333334, 0.7142857142857143], 0.7738095238095238)

In [36]:
# evaluator keeps only sentences that overlap less than average with the rest
def rouge_evaluator(sentences):
    rouge_scores, average_rouge_score = calculate_self_rouge(sentences)
    evaluated_sentences = []

    # keep a sentence only if its average rouge score is strictly below the batch mean;
    # if all scores are equal, no sentence passes and the result is empty
    for sentence, score in zip(sentences, rouge_scores):
        if score < average_rouge_score:
            evaluated_sentences.append(sentence)

    return evaluated_sentences

rouge_evaluator([candidate, reference])

["hi, it's nice to meet everyone!"]

In [33]:
# evaluate the filtered completions
print(calculate_self_rouge(filtered_completions))

rouge_evaluator(filtered_completions)

([0.6590909090909092, 0.8181818181818182, 0.6590909090909092], 0.7121212121212123)


['Answer: Forrest is 4*2=8 years old. Mark is 8*3=24 years old.\nThe answer is 24.',
 'Answer: Chris is 4. Forrest is 4*2=8. Mark is 8*3=24.\nThe answer is 24.']
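
The evaluator above compares each score with the mean of the current batch, so the cutoff moves from batch to batch, and a batch with identical scores keeps nothing. One possible alternative is a fixed cutoff, sketched below. `filter_by_threshold`, `max_score`, and the value 0.75 are assumptions chosen for illustration, and the scores are rounded from the output above.

```python
def filter_by_threshold(sentences, scores, max_score):
    """Keep sentences whose average overlap score is at most max_score."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    return [s for s, score in zip(sentences, scores) if score <= max_score]


# Labels stand in for the three filtered completions
labels = ["completion 1", "completion 2", "completion 3"]
rounded_scores = [0.659, 0.818, 0.659]

print(filter_by_threshold(labels, rounded_scores, max_score=0.75))
# ['completion 1', 'completion 3']

# With tied scores a fixed cutoff keeps both sentences
print(filter_by_threshold(["a", "b"], [0.5, 0.5], max_score=0.75))
# ['a', 'b']
```

A fixed cutoff makes results comparable across problems, at the cost of choosing a value that suits the dataset.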