In [45]:
import pandas as pd
import numpy as np
import json
from datasets import load_dataset

from data.prompts import prompts

from openai import OpenAI
client = OpenAI(api_key="YOUR_KEY_HERE")

## Generating datasets

We provide the following documentation of our question-generation process.
Due to data-sharing restrictions from AP and UpToDate, we focus on names-related questions produced from Wikipedia. However, the process for generating the question-answer-context triplets with prompting is nearly identical once all the contexts are processed.

First, we load the Hugging Face Wikipedia dataset:

In [18]:
wiki = load_dataset("wikimedia/wikipedia", "20231101.en")

Downloading readme: 100%|████████████████████| 131k/131k [00:00<00:00, 12.7MB/s]
Downloading data: 100%|██████████████████████| 41/41 [06:08<00:00,  8.99s/files]
Generating train split: 100%|█| 6407814/6407814 [00:30<00:00, 211539.14 examples


We also want to restrict the articles to a specific topic in order to maximize the quality of the context documents.

In [24]:
import wikipediaapi
import random
import wikipedia
wikipedia.set_lang("en")

wiki_api = wikipediaapi.Wikipedia('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36')

def get_articles_on_topic(topic):
    """Return the titles of all pages linked from the Wikipedia page `topic`."""
    page_py = wiki_api.page(topic)
    if not page_py.exists():
        return []
    # `links` maps each linked page title to its page object.
    return list(page_py.links.keys())

In [25]:
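# Lowercased titles linked from 'Politics'; titles containing ":" (such as 'Category:' or 'Template:' pages) are skipped.
# A set keeps the per-row membership test in the filter below cheap across ~6.4M rows.
political_article_names = {y.lower() for y in get_articles_on_topic('Politics') if ":" not in y}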

In [27]:
filtered = wiki['train'].filter(lambda x: x['title'].lower() in political_article_names)

Filter: 100%|███████████████| 6407814/6407814 [03:02<00:00, 35107.41 examples/s]


In [32]:
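# Sample 2,000 distinct rows. Note that this draws from the full train split, not from `filtered`.
subset = wiki['train'].select(np.random.choice(len(wiki['train']), 2000, replace=False))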

In [33]:
wiki_df = pd.DataFrame(subset)

Now that we have the dataframe of context documents, we can generate the question-answer-context triplets.

We start with the prompt for generating questions:

In [38]:
print(prompts.QUESTION_GENERATOR['names'])


                                Given the following document, please generate a question and answer based on the document.
    
                                The question MUST contain all information and context necessary to answer without the document.
    
                                In your output, include the phrase from the document that contains the answer to the question as 'context'.
                                This phrase MUST be copied verbatim, word for word, from the document. 
                                You must produce the context phrase exactly from the text, with no modifications or truncations.
                                This phrase should be short (one sentence).
    
                                You must obey the following criteria:
                                - The question MUST ask for the name of a human person. 
                                Do not produce a question that is not directly related to a person's name. 
                 

In [40]:
row = wiki_df.iloc[0]
context = row['text']

Next, we apply this prompt to a document to produce the question-answer-context triplet:

In [47]:
response = client.chat.completions.create(
  model="gpt-4o",
  response_format={"type": 'json_object'},
  messages=[
    {"role": "system", "content": prompts.QUESTION_GENERATOR['names']},
    {"role": "user", "content": f"<Begin Document>\n{context}\n<End Document>"}
  ],
  temperature=0,
  seed=0,
)
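
Since the same request is issued once per document, it may be convenient to wrap it in a helper. The following is a minimal sketch, not the exact code used to build the datasets: the function name `generate_triplet` and its parameters are hypothetical, and returning `None` for unparseable output is an assumption about how failures could be handled.

```python
import json


def generate_triplet(client, system_prompt, document, model="gpt-4o"):
    """Request one question-answer-context triplet for `document`.

    `client` is expected to expose `chat.completions.create`, as the
    OpenAI client above does. Returns the parsed dict, or None when the
    response body is not valid JSON (an assumed convention).
    """
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"<Begin Document>\n{document}\n<End Document>"},
        ],
        temperature=0,
        seed=0,
    )
    try:
        return json.loads(response.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        return None
```

With the objects defined in this notebook, a call would look like `generate_triplet(client, prompts.QUESTION_GENERATOR['names'], context)`.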

In [49]:
qa_dict = json.loads(response.choices[0].message.content)
qa_dict

{'Question': 'Who is the birth mother of Bezhig and her siblings in the Canadian drama television series Little Bird?',
 'Answer': 'Ellyn Jade',
 'Context': 'Ellyn Jade as Patti Little Bird, the birth mother of Bezhig and her siblings.'}
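
The prompt requires the context phrase to be copied verbatim from the document, so one simple sanity check on a parsed response is a substring test. The sketch below is illustrative only: the helper names are hypothetical, the key names follow the output shown above, and checking that the answer appears inside the context is an extra assumption rather than something the prompt strictly demands.

```python
def context_is_verbatim(qa, document):
    """True if the 'Context' field is a non-empty exact substring of `document`."""
    context = qa.get("Context")
    return isinstance(context, str) and bool(context.strip()) and context in document


def answer_in_context(qa):
    """True if the 'Answer' string appears inside the 'Context' string (assumed check)."""
    answer, context = qa.get("Answer"), qa.get("Context")
    return isinstance(answer, str) and isinstance(context, str) and answer in context
```

In this notebook the checks would be called as `context_is_verbatim(qa_dict, context)` and `answer_in_context(qa_dict)`.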

We create our other datasets in a nearly identical way.
In practice, some Wikipedia documents are not long enough or do not contain enough substance to produce a valid question.

In anticipation of "None" responses, we run the model on a larger subset of documents than we need and then sample among the valid responses.
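
The over-generate-then-sample step can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used: the function name `sample_valid_triplets` is hypothetical, and a response is treated as unusable when it is missing or when any of its three fields is empty or the literal string "None".

```python
import random


def sample_valid_triplets(candidates, n, seed=0):
    """Drop unusable responses, then draw `n` of the rest without replacement."""

    def is_usable(qa):
        # Assumption: unusable means not a dict, or a field that is
        # missing, empty, or the literal string "None".
        if not isinstance(qa, dict):
            return False
        return all(
            isinstance(qa.get(key), str) and qa[key].strip() not in ("", "None")
            for key in ("Question", "Answer", "Context")
        )

    valid = [qa for qa in candidates if is_usable(qa)]
    if len(valid) < n:
        raise ValueError(f"only {len(valid)} valid responses, need {n}")
    # A seeded generator keeps the sample reproducible across runs.
    return random.Random(seed).sample(valid, n)
```

Raising an error when too few valid responses remain makes it obvious that the model should be run on more documents, rather than silently returning a smaller dataset.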