corpus2question example
====================

This notebook modifies the `corpus2question` tutorial to generate questions from sets of Wikipedia pages. 

* See the original code repository here: https://github.com/unicamp-dl/corpus2question
* And the corresponding paper: https://arxiv.org/abs/2009.09290

In [None]:
import requests

import pandas as pd

# Functions from the corpus2question tutorial 
import c2q

import warnings
warnings.filterwarnings("ignore")

## Corpus

Here we create a function to fetch the text for a list of Wikipedia pages. Note that the request uses `exintro`, so the Wikipedia API returns only the introductory extract of each page rather than the entire page, but this should be enough for demonstration purposes.

In [None]:
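def get_wiki_corpus(wiki_pages):
    """Yield the plain-text intro extract for each Wikipedia page title."""
    for page in wiki_pages:
        # exintro limits the extract to the lead section; explaintext strips the HTML.
        url = f"https://en.wikipedia.org/w/api.php?action=query&format=json&titles={page}&prop=extracts&exintro&explaintext"
        rsp = requests.get(url)
        rsp.raise_for_status()  # fail early on HTTP errors
        data = rsp.json()
        # Only one title is requested per call, so the first page entry is the one we want.
        for _id, details in data["query"]["pages"].items():
            yield details["extract"]
            break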
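
To make the parsing step in `get_wiki_corpus` easier to follow, here is a minimal offline sketch of pulling the extract out of a response. The two payloads are hand-written illustrations modeled on the default JSON layout (`query` → `pages` → page id), not real API output, and `extract_from_response` is a hypothetical helper that is not part of the notebook or of `c2q`.

```python
def extract_from_response(data):
    """Return the first page's extract, or None if no extract is present."""
    pages = data.get("query", {}).get("pages", {})
    for _id, details in pages.items():
        # A page that does not exist carries no "extract" key, so use .get().
        return details.get("extract")
    return None


# Hand-written payloads (assumed shape, for illustration only).
found = {
    "query": {
        "pages": {
            "1": {"pageid": 1, "ns": 0, "title": "Example", "extract": "Example intro text."}
        }
    }
}
missing = {
    "query": {
        "pages": {
            "-1": {"ns": 0, "title": "No_Such_Page", "missing": ""}
        }
    }
}

print(extract_from_response(found))
print(extract_from_response(missing))
```

Returning `None` instead of indexing with `details["extract"]` is one possible design choice: it lets the caller skip unknown titles rather than stopping with a `KeyError`.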

In [None]:
pages = [
    "Bob_Dylan",
    "Woody_Guthrie",
    "Pete_Seeger",
    "Bessie_Smith",
    "Levon_Helm",
    "Bruce_Springsteen"
]
# Alternative topic set:
# pages = [
#     "Computer",
#     "Internet",
#     "Software",
#     "Operating_system"
# ]

Here we pass our list of pages to our Wikipedia function and then pass the resulting iterable of text to the original `corpus2question` code to generate a list of questions.

In [None]:
corpus = get_wiki_corpus(pages)
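
Because `get_wiki_corpus` uses `yield`, `corpus` is a generator: nothing is fetched until it is iterated, and it can only be iterated once. The sketch below shows that behaviour with a hypothetical stand-in function (`fake_corpus`, which makes no network calls). If the texts are needed more than once, materializing them with `list(...)` first is one option.

```python
def fake_corpus(titles):
    # Hypothetical stand-in for get_wiki_corpus: no network access.
    for title in titles:
        yield f"Intro text for {title}"


corpus_demo = fake_corpus(["A", "B"])
first_pass = list(corpus_demo)   # consumes the generator
second_pass = list(corpus_demo)  # nothing left to yield

print(len(first_pass), len(second_pass))
```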

In [None]:
%%time 
questions = c2q.get_questions(corpus)

In [None]:
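# Results are nested as document -> span -> generated question, so count at the innermost level.
sum(len(span) for document in questions for span in document), "total questions"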
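
The aggregation further down iterates the result as documents, then spans, then generated questions, so there are three different things one could count. Here is a small sketch on a hypothetical nested list (`toy_questions` is made up and only assumes that same nesting):

```python
# Hypothetical stand-in for the output of c2q.get_questions.
toy_questions = [
    [["q1", "q2"], ["q3"]],   # document 0: two spans
    [["q4", "q5", "q6"]],     # document 1: one span
]

n_documents = len(toy_questions)
n_spans = sum(len(document) for document in toy_questions)
n_questions = sum(len(span) for document in toy_questions for span in document)

print(n_documents, n_spans, n_questions)
```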

### Aggregate with Pandas

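As in the original tutorial, we flatten the nested results into a DataFrame with one row per generated question, keyed by document, span and generation ids.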

In [None]:
question_df = pd.DataFrame([
    dict(
        document_id=doc_idx,
        span_id=f"{doc_idx}:{span_idx}",
        gen_id=f"{doc_idx}:{span_idx}:{gen_idx}",
        question=question,
    )
    for doc_idx, document_gen in enumerate(questions)
    for span_idx, span_gen in enumerate(document_gen)
    for gen_idx, question in enumerate(span_gen)
])


In [None]:
question_df.head()

In [None]:
# Group by question text, count the distinct values in each id column,
# and sort by the number of distinct generation ids (most frequent question first).
(
    question_df
    .groupby("question")
    .nunique()
    .sort_values("gen_id", ascending=False)
)
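
To see what the grouped table reports, here is a small sketch on made-up data. `toy_questions` is hypothetical and only mimics the document, span, question nesting used above. After grouping, each id column holds the number of distinct documents, spans or generations in which that question text appeared.

```python
import pandas as pd

# Hypothetical toy output: documents -> spans -> generated questions.
toy_questions = [
    [["Who wrote it?", "Who wrote it?"], ["When was it released?"]],
    [["Who wrote it?"]],
]

toy_df = pd.DataFrame([
    dict(
        document_id=doc_idx,
        span_id=f"{doc_idx}:{span_idx}",
        gen_id=f"{doc_idx}:{span_idx}:{gen_idx}",
        question=question,
    )
    for doc_idx, document in enumerate(toy_questions)
    for span_idx, span in enumerate(document)
    for gen_idx, question in enumerate(span)
])

# Same aggregation as in the notebook cell, applied to the toy frame.
counts = (
    toy_df
    .groupby("question")
    .nunique()
    .sort_values("gen_id", ascending=False)
)
print(counts)
```

Here "Who wrote it?" was generated three times, across two spans in two documents, so it sorts to the top. A question repeated across many spans or documents is probably a generic one.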