## Synthetic Data

This notebook uses the [Synthetics Package](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/generate-data-qa#install-the-synthetics-package) to create sample evaluation data for your LLM without context, using a Markdown file as the source. It then converts the result dictionary into a .csv file that you can use in the evaluation step of your Prompt Flow.

Install the required packages (bear in mind that some dependencies may still be pending):

In [19]:
#!pip install -r requirements.txt

We need to connect to Azure OpenAI so that we can access the LLM to generate data for us.

In [20]:
from azure.ai.resources.client import AIClient 
from azure.identity import DefaultAzureCredential

subscription: str = "your subscription_id"
resource_group: str = "rg-name"
project: str = "project-name"

ai_client = AIClient(
    subscription_id=subscription, 
    resource_group_name=resource_group, 
    project_name=project, 
    credential=DefaultAzureCredential())

aoai_connection = ai_client.get_default_aoai_connection()
aoai_connection.set_current_environment()

In this step, we get the LLM ready to generate the data.

In [21]:
import os
from azure.ai.generative.synthetic.qa import QADataGenerator

model_name = "gpt-35-turbo" 

model_config = dict(
    deployment=model_name, 
    model=model_name,  
    max_tokens=2000,
)

qa_generator = QADataGenerator(model_config=model_config)

Because we are working with a single Markdown file, we load it into a string that will be passed to the model.

In [None]:
import pandas as pd

def read_md_file(file_path):
    # Read the whole Markdown file into a single string (UTF-8 assumed)
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    return content

content = read_md_file('product_info_1.md')

Now let us generate the question-answer pairs. We use the generate function in QADataGenerator to generate questions based on the text. In this example, the generate function takes the following parameters:

- text is your source data.
- qa_type defines the type of question and answers to be generated.
- num_questions is the number of question-answer pairs to be generated for the text.

In [45]:
from azure.ai.generative.synthetic.qa import QAType

qa_type = QAType.CONVERSATION

result = qa_generator.generate(text=content, qa_type=qa_type, num_questions=5)

for question, answer in result["question_answers"]:
    print(f"Q: {question}")
    print(f"A: {answer}")

print(f"Tokens used: {result['token_usage']}")

Q: What is the brand of the TrailMaster X4 Tent?
A: The brand of the TrailMaster X4 Tent is OutdoorLiving.
Q: How many people can it accommodate?
A: The TrailMaster X4 Tent can comfortably accommodate up to 4 people with room for their gear.
Q: Is there a warranty on it?
A: Yes, the TrailMaster X4 Tent comes with a 2-year limited warranty against manufacturing defects.
Q: What accessories are included with it?
A: The TrailMaster X4 Tent includes a rainfly, tent stakes, guy lines, and a carry bag for easy transport.
Q: Can it be easily carried during hikes?
A: Yes, the TrailMaster X4 Tent weighs just 12lbs, and when packed in its carry bag, it can be comfortably carried during hikes.
Tokens used: {'completion_tokens': 243, 'prompt_tokens': 2343, 'total_tokens': 2586}
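
The printed output suggests that the result is a dictionary with a "question_answers" list of (question, answer) pairs and a "token_usage" dictionary. The following is a minimal sketch, assuming that shape, of a quick sanity check before exporting. It does not call the service: `sample_result` is a hand-built stand-in copied from the output above, and `summarize_result` is a hypothetical helper, not part of the Synthetics Package.

```python
# Hand-built stand-in for the dictionary returned by qa_generator.generate().
# Its shape is assumed from the printed output above, not taken from the library docs.
sample_result = {
    "question_answers": [
        ("What is the brand of the TrailMaster X4 Tent?",
         "The brand of the TrailMaster X4 Tent is OutdoorLiving."),
        ("How many people can it accommodate?",
         "The TrailMaster X4 Tent can comfortably accommodate up to 4 people with room for their gear."),
        ("Is there a warranty on it?",
         "Yes, the TrailMaster X4 Tent comes with a 2-year limited warranty against manufacturing defects."),
        ("What accessories are included with it?",
         "The TrailMaster X4 Tent includes a rainfly, tent stakes, guy lines, and a carry bag for easy transport."),
        ("Can it be easily carried during hikes?",
         "Yes, the TrailMaster X4 Tent weighs just 12lbs, and when packed in its carry bag, "
         "it can be comfortably carried during hikes."),
    ],
    "token_usage": {"completion_tokens": 243, "prompt_tokens": 2343, "total_tokens": 2586},
}


def summarize_result(result):
    """Return (pair count, pairs with an empty question or answer, total tokens per pair)."""
    pairs = result["question_answers"]
    empty = sum(1 for question, answer in pairs if not question.strip() or not answer.strip())
    total_tokens = result["token_usage"]["total_tokens"]
    tokens_per_pair = total_tokens / len(pairs) if pairs else 0.0
    return len(pairs), empty, tokens_per_pair


count, empty, tokens_per_pair = summarize_result(sample_result)
print(f"{count} pairs, {empty} with an empty question or answer, "
      f"{tokens_per_pair:.1f} tokens per pair")
```

Running the same helper on the real `result` gives a rough idea of the token cost per generated pair, which may help when choosing num_questions.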


In [47]:
# Extract question-answer pairs
qa_pairs = [(question, answer) for question, answer in result["question_answers"]]

df_qa = pd.DataFrame(qa_pairs, columns=['Question', 'Answer'])

df_qa.to_csv('output.csv', index=False)
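
Several of the generated answers contain commas, so it can be worth confirming that the pairs survive a CSV round trip before using the file in an evaluation run. The sketch below is one way to check this, under a few assumptions: it uses the standard-library csv module rather than pandas, writes to an in-memory buffer instead of output.csv, and works on two sample pairs copied from the output above.

```python
import csv
import io

# Two pairs copied from the output above. The answers contain commas,
# which is the case worth checking in a CSV round trip.
qa_pairs = [
    ("What accessories are included with it?",
     "The TrailMaster X4 Tent includes a rainfly, tent stakes, guy lines, and a carry bag for easy transport."),
    ("Is there a warranty on it?",
     "Yes, the TrailMaster X4 Tent comes with a 2-year limited warranty against manufacturing defects."),
]

# Write to an in-memory buffer so that nothing on disk is touched.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Question", "Answer"])  # same column names as df_qa
writer.writerows(qa_pairs)

# Read the text back and rebuild the (question, answer) tuples.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
round_trip = [(row["Question"], row["Answer"]) for row in rows]

print(f"Rows read back: {len(rows)}")
print(f"Round trip preserved every pair: {round_trip == qa_pairs}")
```

The same comparison can be made against the real file by reading output.csv back and comparing it with result["question_answers"].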

