| category | weight |
|---|---|
| rag | 40 |
In this doc, you will learn how to generate test data based on your documents for your RAG app. This approach helps you avoid the effort of manual test data creation, which is typically time-consuming and labor-intensive, as well as the expensive option of purchasing pre-packaged test data. By leveraging the capabilities of LLMs, this guide streamlines the test data generation process, making it more efficient and cost-effective.
- Prepare documents. The test data generator supports the following file types:
  - .md - Markdown
  - .docx - Microsoft Word
  - .pdf - Portable Document Format
  - .ipynb - Jupyter Notebook
  - .txt - Text

  Limitations:
  - The test data generator may not function effectively for non-Latin characters, such as Chinese, in certain document types. This limitation is caused by the capabilities of the dependent text loaders, such as `pypdf`.
  - The test data generator may not generate meaningful questions if the document is not well-organized or contains massive code snippets/links, such as API introduction documents or reference documents.
- Prepare local environment. Go to example_gen_test_data folder and install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

  For specific document file types, you may need to install extra packages:
  - .docx - `pip install docx2txt`
  - .pdf - `pip install pypdf`
  - .ipynb - `pip install nbconvert`

  > !Note: the example uses llama index `SimpleDirectoryReader` to load documents. For the latest information on the packages required for different file types, please check here.
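As a quick illustration of the mapping above, this hypothetical Python helper (not part of the tool) looks up which extra package a document type needs:

```python
import os

# Extra pip package needed per document extension, per the list above.
# .md and .txt need no extra package beyond the base requirements.
EXTRA_PACKAGES = {".docx": "docx2txt", ".pdf": "pypdf", ".ipynb": "nbconvert"}

def extra_package_for(path: str):
    """Return the extra pip package needed to load this document, or None."""
    ext = os.path.splitext(path)[1].lower()
    return EXTRA_PACKAGES.get(ext)
```

For example, `extra_package_for("manual.pdf")` returns `"pypdf"`, while `extra_package_for("notes.md")` returns `None`.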
- Install VSCode extension `Prompt flow`.

- Create your AzureOpenAI or OpenAI connection by following this doc.
- Prepare test data generation setting.
  - Navigate to example_gen_test_data folder.
  - Prepare `config.yml` by copying `config.yml.example`.
  - Fill in the configurations in `config.yml` by following the inline comment instructions. The config is made up of 3 sections:
    - Common section: this section provides common values for all other sections. Required.
    - Local section: this section contains configuration for local test data generation. Can be skipped if not running locally.
    - Cloud section: this section contains configuration for cloud test data generation. Can be skipped if not running in cloud.
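The three sections might be laid out like this hypothetical sketch; the field names below are illustrative assumptions, so follow the inline comments in `config.yml.example` for the real keys:

```yaml
# Common section: values shared by local and cloud runs (required).
common:
  documents_folder: "<path-to-your-documents>"   # illustrative key name

# Local section: local test data generation (skip if not running locally).
local:
  output_folder: "<path-for-generated-data>"     # illustrative key name

# Cloud section: cloud test data generation (skip if not running in cloud).
cloud:
  subscription_id: "<your-subscription-id>"      # illustrative key name
```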
  > !Note: We recommend using `gpt-4` series models rather than `gpt-3.5` for better performance.

  > !Note: We recommend using the `gpt-4` model (Azure OpenAI `gpt-4` model with version `0613`) rather than the `gpt-4-turbo` model (Azure OpenAI `gpt-4` model with version `1106`) for better performance. Due to the inferior performance of the `gpt-4-turbo` model, when you use it you might sometimes need to open the example test data generation flow in the visual editor and set the `response_format` input of the nodes `validate_text_chunk`, `validate_question`, and `validate_suggested_answer` to `json`, in order to make sure the LLM can generate valid JSON responses.
- Navigate to example_gen_test_data folder.

- After configuration, run the following command to generate the test data set:

  ```bash
  python -m gen_test_data.run
  ```

- The generated test data will be a jsonl data file. Look for the detailed log line in the console, "Saved ... valid test data to ...", to find its location.

  If you expect to generate a large amount of test data beyond your local compute capability, you may try generating test data in cloud; please see this guide for more detailed steps.
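Once generated, the jsonl file can be inspected with a few lines of Python. The path and record fields here are assumptions for illustration (use the path printed in the console log); a tiny sample file is written first so the snippet is self-contained to run:

```python
import json

# Hypothetical path; replace with the one from "Saved ... valid test data to ..."
path = "test_data.jsonl"

# Write one illustrative record so this snippet runs standalone.
sample = {"question": "What is prompt flow?", "suggested_answer": "A dev tool."}
with open(path, "w") as f:
    f.write(json.dumps(sample) + "\n")

# Each line of a jsonl file is one standalone JSON record.
with open(path) as f:
    rows = [json.loads(line) for line in f]

print(len(rows))  # number of generated test data records
```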
- Open the example test data generation flow in the "Prompt flow" VSCode Extension. This flow is designed to generate a pair of question and suggested answer based on the given text chunk. The flow also includes validation prompts to ensure the quality of the generated test data.

- Customize your test data generation logic by referring to tune-prompts-with-variants.
Understanding the prompts
The test data generation flow contains 5 prompts, classified into two categories based on their roles: generation prompts and validation prompts. Generation prompts are used to create questions, suggested answers, etc., while validation prompts are used to verify the validity of the text chunk, generated question or answer.
- Generation prompts
  - generate question prompt: frames a question based on the given text chunk.
  - generate suggested answer prompt: generates a suggested answer for the question based on the given text chunk.

- Validation prompts
  - score text chunk prompt: scores the given text chunk from 0-10 to validate whether it is worthy of framing a question. If the score is lower than `score_threshold` (an adjustable node input), validation fails.
  - validate question prompt: validates if the generated question is good.
  - validate suggested answer prompt: validates if the generated suggested answer is good.

  If validation fails, it leads to an empty string `question`/`suggested_answer`, which is removed from the final output test data set.
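The empty-string filtering described above can be sketched in a few lines of Python; the record shape is an assumption for illustration, not the tool's actual data model:

```python
# Hypothetical generated records; validation failures yield empty strings.
rows = [
    {"question": "What is X?", "suggested_answer": "X is ..."},
    {"question": "", "suggested_answer": ""},  # failed validation
]

# Drop records whose question or suggested answer failed validation.
valid = [r for r in rows if r["question"] and r["suggested_answer"]]
print(len(valid))  # only records that passed validation remain
```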
- Fill in the node inputs, including `connection`, `model` or `deployment_name`, `response_format`, `score_threshold`, and other parameters. Click the run button to test the flow in the VSCode Extension by referring to Test flow with VS Code Extension.

Once the customized flow has been verified, you can proceed to batch generate test data by following the steps outlined in "Prerequisites" and "Generate test data".