This is the repo for When in Doubt, Ask: Generating Answerable and Unanswerable Questions, Unsupervised. It contains the scripts for generating synthetic unanswerable questions.
The repo contains the following scripts:
- generate_training_data.py: a script that generates a dataset with a chosen ratio of human-labeled and synthetic data.
- generate_synthetic_qa_data.py: a file from https://github.com/lnikolenko/UnsupervisedQA, modified to accommodate unanswerable question generation.
- execution_script.py: a driver script with checkpoint and error-handling logic.
- combine_synthetic_questions.py: a script that shuffles the paragraphs and makes the questions unanswerable. This is the last step in the unanswerable question generation pipeline and can be run locally.
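The paragraph-shuffling idea behind combine_synthetic_questions.py can be sketched as follows. This is a minimal illustration, not the code in that script: the function name make_unanswerable, the rotation over a shuffled order, and the flat list-of-dicts input are all assumptions. Only the answers and is_impossible fields follow the SQuAD 2.0 format.

```python
import random


def make_unanswerable(examples, seed=0):
    """Pair every question with the context of a different example.

    `examples` is a hypothetical flat list of dicts with "context" and
    "question" keys; the real script may use a different layout.
    """
    if len(examples) < 2:
        raise ValueError("need at least two examples to swap contexts")

    rng = random.Random(seed)
    order = list(range(len(examples)))
    rng.shuffle(order)

    result = []
    for pos, idx in enumerate(order):
        # Borrow the context of the next example in the shuffled order,
        # so no question is paired with the context of its own example.
        donor = order[(pos + 1) % len(order)]
        result.append({
            "context": examples[donor]["context"],
            "question": examples[idx]["question"],
            "answers": [],          # SQuAD 2.0: unanswerable questions have no answers
            "is_impossible": True,  # SQuAD 2.0 flag for unanswerable questions
        })
    return result
```

If two examples share the same paragraph text, a swapped question could still be answerable, so a real pipeline would likely need an extra check for duplicate contexts.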
To generate unanswerable questions, do the following:
- Clone the repo and navigate to the repo folder.
- Make a folder named data and place the SQuAD 2.0 data there.
- Make a folder named unsupervised_qa_data and place the synthetic answerable question data by Lewis et al. there.
- In a different directory, clone and install the code from the UnsupervisedQA repo.
- Navigate to the UnsupervisedQA folder.
- Copy generate_synthetic_qa_data.py from .../UnsupervisedUnaswerableQuestions into .../UnsupervisedQA/unsupervisedqa, replacing the existing file.
- Make the .../UnsupervisedQA/extracted and .../UnsupervisedQA/output folders.
- Use WikiExtractor to pre-process a Wikipedia dump and place the result in the .../UnsupervisedQA/extracted directory.
- Copy .../UnsupervisedUnaswerableQuestions/execution_script.py to .../UnsupervisedQA/ and run python execution_script.py.
- After you have generated enough question-answer pairs, place combine_synthetic_questions.py in .../UnsupervisedQA/ and run python combine_synthetic_questions.py.
- Use generate_training_data.py to partition the data and generate datasets containing both human-labeled and synthetic training examples.
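
Mixing human-labeled and synthetic examples at a fixed ratio can be sketched as below. This is an illustration under stated assumptions, not the logic of generate_training_data.py: the function mix_examples, its synthetic_fraction and total parameters, and the flat example lists are hypothetical, and the real script may define the ratio differently.

```python
import random


def mix_examples(human, synthetic, synthetic_fraction, total, seed=0):
    """Sample `total` examples, a given fraction of them synthetic.

    All names and parameters here are hypothetical placeholders.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")

    n_synthetic = round(total * synthetic_fraction)
    n_human = total - n_synthetic
    if n_synthetic > len(synthetic) or n_human > len(human):
        raise ValueError("not enough examples for the requested mix")

    rng = random.Random(seed)
    # Sample without replacement from each pool, then shuffle the union
    # so the two sources are interleaved in the training set.
    mixed = rng.sample(human, n_human) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mixed)
    return mixed
```

For example, mix_examples(human, synthetic, synthetic_fraction=0.25, total=40) would draw 30 human-labeled and 10 synthetic examples, provided both pools are large enough.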