corpus2question example
====================

This notebook modifies the `corpus2question` tutorial to generate questions from sets of Wikipedia pages. 

* See the original code repository here: https://github.com/unicamp-dl/corpus2question
* And the corresponding paper: https://arxiv.org/abs/2009.09290

## Model Download

Download the pretrained model from its repository and load it using the transformers library. corpus2question is based on doc2query.

In [1]:
! wget -nc https://storage.googleapis.com/doctttttquery_git/t5-base.zip
! unzip -o t5-base.zip

--2020-09-25 10:38:54--  https://storage.googleapis.com/doctttttquery_git/t5-base.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.64.80, 172.217.11.48, 172.217.9.240, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.64.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 357139559 (341M) [application/zip]
Saving to: ‘t5-base.zip’


2020-09-25 10:39:51 (6.06 MB/s) - ‘t5-base.zip’ saved [357139559/357139559]

Archive:  t5-base.zip
  inflating: model.ckpt-1004000.data-00000-of-00002  
  inflating: model.ckpt-1004000.data-00001-of-00002  
  inflating: model.ckpt-1004000.index  
  inflating: model.ckpt-1004000.meta  


In [66]:
import warnings
warnings.filterwarnings('ignore')

import requests

from typing import List, Iterable

import nltk
import torch
import pandas as pd
from tqdm.notebook import tqdm
from transformers import T5Config, T5Tokenizer, T5ForConditionalGeneration


nltk.download('punkt')

# Define the target device. Use GPU if available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

[nltk_data] Downloading package punkt to /Users/tlawless/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [67]:
print(device)

cpu


In [7]:
# Instantiate the QG model and move it to the target device (GPU if available, otherwise CPU).
qg_tokenizer = T5Tokenizer.from_pretrained('t5-base')
qg_config = T5Config.from_pretrained('t5-base')
qg_model = T5ForConditionalGeneration.from_pretrained('model.ckpt-1004000', from_tf=True, config=qg_config)

qg_model.to(device)

True

True

## Generation Pipeline

Here we define the preprocessing and generation functions. The versions below follow the examples used in the paper, but you may customize them for your needs.

In [8]:
def preprocess(document: str, span=10, stride=5) -> List[str]:
    """
    Define your preprocessing function.
    
    This function should take a corpus document and output a list of generation
    spans. This is required so we can match the expected sequence size of the
    generation model.
    """
    
    sentences = nltk.tokenize.sent_tokenize(document)
    chunks = [" ".join(sentences[i:i+span]) for i in range(0, len(sentences), stride)]

    return chunks
    


def generate_questions(text: str) -> List[str]:
    """
    Define your generation function. 
    
    This function should take a text passage and generate a list of questions.
    With the current configuration it always generates one question per passage.
    
    You may copy this example to use the same configuration as the paper. 
    You may also configure the generation parameters (such as using sampling and
    generating multiple questions) for other use cases.
    """
    
    # Append an end of sequence token </s> after the context.
    doc_text = f"{text} </s>"

    input_ids = qg_tokenizer.encode(doc_text, return_tensors='pt').to(device)
    outputs = qg_model.generate(
        input_ids=input_ids,
        max_length=64,
        do_sample=False,
        num_beams=4,  # beam search width; `generate` expects `num_beams`, not `n_beams`
    )

    return [qg_tokenizer.decode(output) for output in outputs]    
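
To make the effect of `span` and `stride` in `preprocess` easier to see, here is a minimal sketch of the same sliding-window logic applied to an already-split list of placeholder sentences. The helper name `sliding_chunks` and the toy input are illustrative assumptions, not part of the original notebook; the list comprehension itself is copied from `preprocess`.

```python
from typing import List


def sliding_chunks(sentences: List[str], span: int = 10, stride: int = 5) -> List[str]:
    """Join `span` consecutive sentences, starting a new chunk every `stride` sentences."""
    return [" ".join(sentences[i:i + span]) for i in range(0, len(sentences), stride)]


# Twelve placeholder sentences: "S0.", "S1.", ..., "S11."
sentences = [f"S{i}." for i in range(12)]

# Smaller window than the notebook default so the overlap is easy to read.
chunks = sliding_chunks(sentences, span=4, stride=2)

for chunk in chunks:
    print(chunk)
```

Because `stride` is smaller than `span`, consecutive chunks overlap, and the last chunks are shorter than `span`. In this toy run the final chunk is entirely contained in the one before it, which may be one reason neighbouring spans sometimes yield the same question.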

### Corpus

In [71]:
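def get_wiki_corpus(wiki_pages: Iterable[str]) -> Iterable[str]:
    """Yield the plain-text intro extract of each Wikipedia page title."""
    for page in wiki_pages:
        url = f"https://en.wikipedia.org/w/api.php?action=query&format=json&titles={page}&prop=extracts&exintro&explaintext"
        rsp = requests.get(url)
        data = rsp.json()
        # The response maps page ids to page details. Only one title is
        # requested per call, so we take the first entry and stop.
        for _id, details in data["query"]["pages"].items():
            yield details["extract"]
            break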
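
`get_wiki_corpus` assumes every page entry carries an `extract` field. As a hedged sketch, the helper below (`first_extract` is a hypothetical name, not part of the original notebook) separates the parsing step from the HTTP request and returns `None` when the field is absent. The sample payloads are hand-written stand-ins shaped like the `data["query"]["pages"]` structure used above, not real API output, and the shape of the "missing page" entry is an assumption.

```python
from typing import Any, Dict, Optional


def first_extract(data: Dict[str, Any]) -> Optional[str]:
    """Return the extract of the first page in a parsed response, or None if absent."""
    pages = data.get("query", {}).get("pages", {})
    for details in pages.values():
        # Only the first page entry is considered, mirroring the `break` above.
        return details.get("extract")
    return None


# Hand-written stand-ins for a parsed JSON response (assumed shape).
found = {"query": {"pages": {"123": {"title": "Example", "extract": "Intro text."}}}}
missing = {"query": {"pages": {"-1": {"title": "No_Such_Page", "missing": ""}}}}

print(first_extract(found))
print(first_extract(missing))
```

A caller could then skip `None` results instead of failing with a `KeyError` on a mistyped page title.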

In [73]:
pages = [
    "Bob_Dylan",
    "Woody_Guthrie",
    "Pete_Seeger",
    "Bessie_Smith"
]
# pages = [
#     "Computer",
#     "Internet",
#     "Software",
#     "Operating_System"
# ]

In [74]:
corpus = get_wiki_corpus(pages)

### Generate the questions

Here we apply the preprocessing and generation functions defined earlier. You may save questions into a list if your source is small. For large datasets we recommend adding some sort of checkpointing.

In [75]:
questions = [
    [generate_questions(span) for span in preprocess(doc)] 
    for doc in tqdm(corpus)
]

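
For larger corpora, the checkpointing suggested above could look roughly like the sketch below. It is one possible approach, not the method used by the original authors: each document's questions are appended to a JSON Lines file as soon as they are generated, and a later run skips documents whose index is already in the file. The function name `generate_with_checkpoint`, the file layout, and the stand-in `fake_generate` are all assumptions made for illustration. The sketch also assumes documents arrive in the same order on every run.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, List


def generate_with_checkpoint(
    docs: Iterable[str],
    generate_fn: Callable[[str], list],
    path: str,
) -> List[list]:
    """Generate questions per document, resuming from a JSON Lines checkpoint file."""
    done = {}
    if os.path.exists(path):
        # Load results saved by a previous (possibly interrupted) run.
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                record = json.loads(line)
                done[record["document_id"]] = record["questions"]

    results = []
    with open(path, "a", encoding="utf-8") as fh:
        for doc_idx, doc in enumerate(docs):
            if doc_idx not in done:
                done[doc_idx] = generate_fn(doc)
                record = {"document_id": doc_idx, "questions": done[doc_idx]}
                fh.write(json.dumps(record) + "\n")
                fh.flush()  # make the record visible on disk before moving on
            results.append(done[doc_idx])
    return results


# Demo with a stand-in generator so the sketch runs without the model.
calls = []


def fake_generate(doc: str) -> list:
    calls.append(doc)
    return [[f"what is {doc}"]]


with tempfile.TemporaryDirectory() as tmp:
    checkpoint_path = os.path.join(tmp, "questions.jsonl")
    first_run = generate_with_checkpoint(["a", "b"], fake_generate, checkpoint_path)
    # The second run reuses the saved results for "a" and "b".
    second_run = generate_with_checkpoint(["a", "b", "c"], fake_generate, checkpoint_path)
```

In this notebook, `generate_fn` could be something like `lambda doc: [generate_questions(span) for span in preprocess(doc)]`, which returns the same nested list structure as the list comprehension above.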




### Aggregate with Pandas

Pandas is a convenient way to aggregate the generations. In this example we define document, span, and generation ids and group the questions by these ids. We then count the unique ids for every question.

In [76]:
question_df = pd.DataFrame([
    dict(
        document_id=doc_idx,
        span_id=f"{doc_idx}:{span_idx}",
        gen_id=f"{doc_idx}:{span_idx}:{gen_idx}",
        question=question,
    )
    for doc_idx, document_gen in enumerate(questions)
    for span_idx, span_gen in enumerate(document_gen)
    for gen_idx, question in enumerate(span_gen)
])

question_df

| | document_id | span_id | gen_id | question |
|---|---|---|---|---|
| 0 | 0 | 0:0 | 0:0:0 | who is bob dylan |
| 1 | 0 | 0:1 | 0:1:0 | what year did bob dylan record his first album |
| 2 | 0 | 0:2 | 0:2:0 | when did dylan leave the band |
| 3 | 0 | 0:3 | 0:3:0 | what year did dylan get inducted into the rock... |
| 4 | 0 | 0:4 | 0:4:0 | what awards did dylan win |
| 5 | 0 | 0:5 | 0:5:0 | what was dylan awarded |
| 6 | 1 | 1:0 | 1:0:0 | who is woodrow wilson guthrie |
| 7 | 1 | 1:1 | 1:1:0 | who wrote dust bowl ballads |
| 8 | 1 | 1:2 | 1:2:0 | who wrote dust bowl ballads |
| 9 | 1 | 1:3 | 1:3:0 | who was woody guthrie's son |


In [77]:
# Group the results by question, count unique results and order by generation id counts.
question_df \
    .groupby("question") \
    .nunique() \
    .sort_values("gen_id", ascending=False)

| question | document_id | span_id | gen_id |
|---|---|---|---|
| who wrote dust bowl ballads | 1 | 2 | 2 |
| what awards did dylan win | 1 | 1 | 1 |
| what was dylan awarded | 1 | 1 | 1 |
| what year did bob dylan record his first album | 1 | 1 | 1 |
| what year did dylan get inducted into the rock and roll hall of fame | 1 | 1 | 1 |
| when did dylan leave the band | 1 | 1 | 1 |
| who is bob dylan | 1 | 1 | 1 |
| who is peter seeger | 1 | 1 | 1 |
| who is woodrow wilson guthrie | 1 | 1 | 1 |
| who sang if i had a hammer | 1 | 1 | 1 |
