# The following prompts are for qapair generation. We provide several of them.
concurrency: 20
chunk_size: 16384
num_questions: 20
prompts:
- name: simple-quiz
system: |
You are an intelligent professor. You create question and answer pairs from the given document for your students. Respond with a structured JSON containing an array of strict JSON 'question' & 'answer' pairs.
For example:
```json
[
{
"question": "…",
"answer": "…"
},
{
"question": "…",
"answer": "…"
}
]
```
user: |
Here is the document:
{{.DocumentChunk}}
json_schema:
type: array
items:
type: object
properties:
question:
type: string
answer:
type: string
required: [question, answer]
- name: summaries-one
system: |
You are an intelligent professor. You create one or more question and answer pairs from the given document for your students. Respond with a structured JSON containing an array of strict JSON 'question' & 'answer' pairs.
For example:
```json
[
{
"question": "…",
"answer": "…"
}
]
```
user: |
For the first and only question, summarize the entire document.
Here is the document:
{{.DocumentChunk}}
json_schema:
type: array
items:
type: object
properties:
question:
type: string
answer:
type: string
required: [question, answer]
- name: metadata
system: |
You are an intelligent professor. You create one or more question and answer pairs from the given document for your students. Respond with a structured JSON containing an array of strict JSON 'question' & 'answer' pairs.
For example:
```json
[
{
"question": "What is the title of the document?",
"answer": "…"
},
{
"question": "What is the date of the document?",
"answer": "…"
},
{
"question": "Who are the authors of the document?",
"answer": "…"
}
]
```
user: |
Ask one question: "What is the title of the document?"
Ask that question EXACTLY and no other question. DO NOT include the title of the document in the question.
In the answer, write the title of the document verbatim.
The title will often start with a markdown heading symbol #. Don't include the # in the answer, but use it to identify the title in the text below.
Then, write a second question which is the date of the document.
Then, write a third question which is the author or authors of the document.
Always wrap the JSON in ```json and ``` tags to indicate the start and end of the JSON.
Here is the document:
{{.DocumentChunk}}
json_schema:
type: array
items:
type: object
properties:
question:
type: string
answer:
type: string
required: [question, answer]
- name: title-questions
system: |
You are an intelligent professor. You create one or more question and answer pairs from the given document for your students. Respond with a structured JSON containing an array of strict JSON 'question' & 'answer' pairs.
For example:
```json
[
{
"title": "Junior doctors in England to stage more strikes over pay",
"question": "What are the doctors going to do?",
"answer": "The doctors are going to stage more strikes over pay"
}
]
```
user: |
First, identify the title of the document, usually identified by a line starting with a markdown symbol #.
Generate questions and answers about who or what is going to do some action based ONLY on the title. For example, given the document title "Junior doctors in England to stage more strikes over pay", you should generate the question: "What are the doctors going to do?" and the answer "The doctors are going to stage more strikes over pay"
Make the questions short and vague. Do not include details in the question. When entities are described in specific terms in the document, refer to them broadly. So for example refer to "junior doctors" as "doctors".
Do NOT give the answer in the question. For example, ask "What are doctors going to do?" rather than "Who is going to stage more strikes?"
Remember, keep the questions vague and short.
For example, do NOT say "What are the doctors in England going to do regarding their pay?"
Instead, say "What are the doctors going to do?"
Here is the document:
{{.DocumentChunk}}
json_schema:
type: array
items:
type: object
properties:
title:
type: string
question:
type: string
answer:
type: string
required: [question, answer]
- name: summaries-three
system: |
You are an intelligent professor. You create one or more question and answer pairs from the given document for your students. Respond with a structured JSON containing an array of strict JSON 'question' & 'answer' pairs.
For example:
```json
[
{
"question": "Please write me a brief/short summary of the document or the key points in the document",
"answer": "…"
},
{
"question": "Please write me a summary of the document",
"answer": "…"
},
{
"question": "Please write me a long/verbose summary of the document",
"answer": "…"
}
]
```
user: |
The first question must be "Please write me a brief/short summary of the document or the key points in the document"
The second question must be "Please write me a summary of the document"
The third question must be "Please write me a long/verbose summary of the document"
Here is the document:
{{.DocumentChunk}}
json_schema:
type: array
items:
type: object
properties:
question:
type: string
answer:
type: string
required: [question, answer]
- name: summaries-sections
system: |
You are an intelligent professor. You create questions and answers from the given document for your students. Respond with a structured JSON. Follow the schema below precisely.
For example:
```json
{
"sections": [
"<section 1>",
"<section 2>"
],
"questions": [
{
"section": "<section 1>",
"questions": [
{
"question": "…",
"answer": "…"
}
]
},
{
"section": "<section 2>",
"questions": [
{
"question": "…",
"answer": "…"
}
]
}
]
}
```
user: |
Consider each section in the document. In the "sections" list you will list all of the main sections in the document.
Then, for each section, you will write a question about summarizing that section, and an answer which is the summary of that section.
Here is the document:
{{.DocumentChunk}}
json_schema:
type: object
properties:
sections:
type: array
items:
type: string
questions:
type: array
items:
type: object
properties:
section:
type: string
questions:
type: array
items:
type: object
properties:
question:
type: string
answer:
type: string
required: [question, answer]
# For the next set of questions, summarize each section of the document.
# For the final set of questions, summarize each paragraph of the document.
# - name: summaries-many
# system: |
# You are an intelligent professor. You create question and answer pairs from the given document for your students. Respond with an array of strict JSON 'question' & 'answer' pairs. You MUST respond with only JSON.
# For example:
# ```json
# [
# {
# "question": "…",
# "answer": "…"
# },
# {
# "question": "…",
# "answer": "…"
# }
# ]
# ```
# user: |
# Generate {{.NumQuestions}} summaries of the document from different perspectives.
# Here is the document:
# {{.DocumentChunk}}
# This didn't seem to work so well. It still kept generating questions about "junior doctors".
- name: you-are-generating-fine-tuning-data
system: |
You are a large language model tasked with generating fine-tuning data from a source document that will be used to fine-tune a smaller large language model. You will generate training data for the smaller language model.
Respond with a structured JSON. Follow the schema below precisely.
For example:
```json
[
{
"question": "…",
"answer": "…"
},
{
"question": "…",
"answer": "…"
}
]
```
Users tend to ask broad, vague questions of the document in order to test that the system is working. We want those queries to work well. For example, a user would ask "what are the doctors going to do?" of a document that is about a junior doctors' strike. Take this into account when generating the questions - in particular, refer to noun phrases by less specific descriptions, so for example instead of "junior doctors", say "doctors" in your questions.
user: |
Generate {{.NumQuestions}} question-answer pairs, where the questions are the most likely questions that a human would ask of the given document, and the answers are accurate, detailed and precise answers based on the given document below.
Here is the document:
{{.DocumentChunk}}
json_schema:
type: array
items:
type: object
properties:
question:
type: string
answer:
type: string
required: [question, answer]
# - name: entities-specific-to-broad
# system: |
# You are an intelligent professor. You create questions and answers from the given document for your students.
# user: |
# Consider each main entity in the document. In the "entities" list you will list all of the main entities in the document.
# Then, you will think about specific, broad and very broad questions for each entity. For each entity in the document, generate three simple, broad questions about any actions they have or are going to take. In the first question per entity, refer to the entity specifically, for example "junior doctors". In the second question, refer to them more broadly, for example "doctors". In the third question, refer to them in the broadest possible terms as you might if you were asking informally about the document, for example "they".
# Respond in strict JSON as a list of entities, and then for each entity a list of question-answer pairs.
# For example:
# ```json
# {
# "entities": [
# "<first entity>",
# "<second entity>"
# ],
# "questions": [
# {
# "entity": "<first entity>",
# "questions": [
# {
# "question": "<specific question>",
# "answer": "<answer>"
# },
# {
# "question": "<broad question>",
# "answer": "<answer>"
# },
# {
# "question": "<broadest possible question>",
# "answer": "<answer>"
# }
# ]
# },
# {
# "entity": "<second entity>",
# "questions": [
# {
# "question": "<specific question>",
# "answer": "<answer>"
# },
# {
# "question": "<broad question>",
# "answer": "<answer>"
# },
# {
# "question": "<broadest possible question>",
# "answer": "<answer>"
# }
# ]
# }
# ]
# }
# ```
# Here is the document:
# {{.DocumentChunk}}
- name: important-facts
system: |
You are an intelligent professor. You create questions and answers from the given document for your students. Respond with strict JSON. Follow the schema below precisely.
user: |
Consider each main important fact in the document. In the "facts" list you will list all of the main important facts in the document.
Then, you will think about when, how, what, why and who questions for each fact.
Respond with a structured JSON containing a list of facts, and then for each fact a list of question-answer pairs.
For example:
```json
{
"facts": [
"<first fact>",
"<second fact>"
],
"questions": [
{
"fact": "<first fact>",
"questions": [
{
"question": "<when question>",
"answer": "<answer>"
},
{
"question": "<how question>",
"answer": "<answer>"
},
{
"question": "<what question>",
"answer": "<answer>"
},
{
"question": "<why question>",
"answer": "<answer>"
},
{
"question": "<who question>",
"answer": "<answer>"
}
]
},
{
"fact": "<second fact>",
"questions": [
{
"question": "<when question>",
"answer": "<answer>"
},
{
"question": "<how question>",
"answer": "<answer>"
},
{
"question": "<what question>",
"answer": "<answer>"
},
{
"question": "<why question>",
"answer": "<answer>"
},
{
"question": "<who question>",
"answer": "<answer>"
}
]
}
]
}
```
Here is the document:
{{.DocumentChunk}}
json_schema:
type: object
properties:
facts:
type: array
items:
type: string
questions:
type: array
items:
type: object
properties:
fact:
type: string
questions:
type: array
items:
type: object
properties:
question:
type: string
answer:
type: string
required: [question, answer]
# # This seems a little imprecise, maybe we need to explicitly call out relationships between entities A and B.
# - name: entities-relationships
# # maybe we need a chunkSize per prompt?
# system: |
# You are an intelligent professor. You create questions and answers from the given document for your students.
# user: |
# Consider each main entity in the document. In the "entities" list you will list all of the main entities in the document.
# Then, you will think about the relationship between that entity and the other main entities it is related to in the document. For each main entity that the given entity is related to, write a question and answer explaining the nature of that relationship.
# So for example if the document is about a man who owns a dog and is taking the dog for a walk, the entities in the document are "man", "dog" and "walk". When considering the entity "man", you might ask:
# Q: What does the man own?
# A: The man owns a dog.
# Q: What does the man do?
# A: The man takes the dog for a walk.
# Next, when considering the dog, you might ask:
# Q: Who owns the dog?
# A: The man owns the dog.
# Q: Where did the dog go?
# A: The dog went on a walk with the man.
# When considering the walk, you might ask:
# Q: Who went on the walk?
# A: The man and the dog went on the walk.
# If the document has lots of entities, just focus on the main ones. Don't spend too long thinking about it.
# Respond in strict JSON as a list of entities, and then for each entity a list of question-answer pairs.
# For example:
# ```json
# {
# "entities": [
# "<first entity>",
# "<second entity>"
# ],
# "questions": [
# {
# "entity": "<first entity>",
# "questions": [
# {
# "related_entity": "<what related entity this question corresponds to>",
# "question": "<question about the relationship between the entities>",
# "answer": "<answer>"
# }
# ]
# },
# {
# "entity": "<second entity>",
# "questions": [
# {
# "related_entity": "<what related entity this question corresponds to>",
# "question": "<question about the relationship between the entities>",
# "answer": "<answer>"
# }
# ]
# }
# ]
# }
# ```
# Here is the document:
# {{.DocumentChunk}}
- name: original-prompt
system: |
You are a Teacher/Professor. Your task is to set up a quiz/examination.
Using the provided document, formulate {{.NumQuestions}} questions that each capture an important fact from the document.
You MUST obey the following criteria:
- Restrict the question to the document provided.
- Do NOT create a question that cannot be answered from the document.
- Phrase the question so that it does NOT refer to specific context. For instance, do NOT put phrases like "given provided document" or "in this work" in the question, because if the question is asked elsewhere it would not come with that context. Replace these terms with specific details.
BAD questions:
What did the author do in his childhood?
What were the main findings in this report?
GOOD questions:
What did Barack Obama do in his childhood?
What were the main findings in the original Transformers paper by Vaswani et al.?
The user will provide the document you should summarize into {{.NumQuestions}} questions.
Please respond in JSON format as an array of objects each having two fields: "question" and "answer". Be careful to produce valid JSON.
For example:
```json
[
{
"question": "…",
"answer": "…"
},
{
"question": "…",
"answer": "…"
}
]
```
user: |
Given the following document - please summarize it into {{.NumQuestions}} question and answer pairs. Make the answers refer to as much of the information in the document as possible, while keeping the answers relevant to the questions.
ONLY include a question if you know the answer.
If the document is too short to generate {{.NumQuestions}} questions, you can generate fewer questions.
In the worst case scenario, where you are unable to generate any questions, please respond with an empty array.
Here is the document:
{{.DocumentChunk}}
json_schema:
type: array
items:
type: object
properties:
question:
type: string
answer:
type: string
required: [question, answer]
targets:
- name: together-mixtral
api_url: https://api.together.xyz/v1
model: mistralai/Mixtral-8x7B-Instruct-v0.1
token_from_env: TOGETHER_API_KEY
texts: []