
DCQA-QUD-parsing

This is the repository for the DCQA QUD parsing implementation.

Title: Discourse Analysis via Questions and Answers: Parsing Dependency Structures of Questions Under Discussion

Authors: Wei-Jen Ko, Yating Wu, Cutter Dalton, Dananjay Srinivas, Greg Durrett, Junyi Jessy Li

@inproceedings{ko-etal-2023-discourse,
    title = "Discourse Analysis via Questions and Answers: Parsing Dependency Structures of Questions Under Discussion",
    author = "Ko, Wei-Jen  and
      Wu, Yating  and
      Dalton, Cutter  and
      Srinivas, Dananjay  and
      Durrett, Greg  and
      Li, Junyi Jessy",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.710",
    doi = "10.18653/v1/2023.findings-acl.710",
    pages = "11181--11195",
    abstract = "Automatic discourse processing is bottlenecked by data: current discourse formalisms pose highly demanding annotation tasks involving large taxonomies of discourse relations, making them inaccessible to lay annotators. This work instead adopts the linguistic framework of Questions Under Discussion (QUD) for discourse analysis and seeks to derive QUD structures automatically. QUD views each sentence as an answer to a question triggered in prior context; thus, we characterize relationships between sentences as free-form questions, in contrast to exhaustive fine-grained taxonomies. We develop the first-of-its-kind QUD parser that derives a dependency structure of questions over full documents, trained using a large, crowdsourced question-answering dataset DCQA (Ko et al., 2022). Human evaluation results show that QUD dependency parsing is possible for language models trained with this crowdsourced, generalizable annotation scheme. We illustrate how our QUD structure is distinct from RST trees, and demonstrate the utility of QUD analysis in the context of document simplification. Our findings show that QUD parsing is an appealing alternative for automatic discourse processing.",
}

Introduction

This repo contains the code and data for the paper Discourse Analysis via Questions and Answers: Parsing Dependency Structures of Questions Under Discussion. This work presents the DCQA QUD Parser, the first QUD (Questions Under Discussion) parser for discourse analysis. The repo includes code to predict anchor sentences, generate questions based on the predicted anchor sentences, and re-rank the generated questions by their scores.

Table of Contents

  1. Prepare Requirements and Download Models
  2. Anchor Sentence Prediction
  3. Generate Questions
  4. Prepare re-ranking scores for each question
  5. Re-sort questions based on scores
  6. Quick Setup
  7. Related Work: DCQA: Discourse Comprehension by Question Answering

Prepare Requirements and Download Models

Install the version of the transformers toolkit included in ./transformers (go to that directory and run "pip install -e .")

Download and unzip the models

discourse - used in anchor prediction

question_generation - used in question generation

WNLI - used in re-ranking

Anchor Sentence Prediction

Put all test articles in the directory ./inputa

Run python prepare_anchor_prediction.py; this script generates the input format for the anchor prediction model. The input file should have the following format:

sentence number + tab + sentence

For example:

1 The purple elephant played a harmonica in the middle of the park.
2 She wore a polka-dot hat and carried a suitcase full of rubber ducks.
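
For reference, here is a minimal sketch of producing this format from a raw article, assuming NLTK's sentence tokenizer and illustrative file names (prepare_anchor_prediction.py handles the actual conversion):

# Sketch: convert a raw article into the numbered, tab-separated input format.
# Assumes NLTK's punkt tokenizer (nltk.download("punkt") may be required);
# the file names here are illustrative only.
from nltk.tokenize import sent_tokenize

with open("article.txt") as f:           # hypothetical raw article
    text = f.read()

with open("article_formatted.txt", "w") as out:
    for i, sentence in enumerate(sent_tokenize(text), start=1):
        out.write(f"{i}\t{sentence}\n")  # sentence number + tab + sentence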

Run the following command to execute the anchor prediction model.

python ./transformers/examples/question-answering/run_squad.py \
--model_type longformer \
--model_name_or_path discourse \
--do_eval \
--train_file a.json \
--predict_file a.json \
--learning_rate 3e-5 \
--num_train_epochs 5 \
--max_seq_length 4096 \
--doc_stride 128 \
--output_dir ./ao \
--per_gpu_eval_batch_size=2 \
--per_gpu_train_batch_size=2 \
--save_steps 5000 \
--logging_steps 50000 \
--overwrite_output_dir \
--max_answer_length 5 \
--n_best_size 10 \
--version_2_with_negative \
--evaluate_during_training \
--eval_all_checkpoints  \
--null_score_diff_threshold 9999
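
run_squad.py writes its predictions into the output directory (./ao above). A minimal sketch of inspecting the n-best anchor candidates, assuming the nbest_predictions_.json layout of the legacy transformers SQuAD example (the exact file name may differ; check ./ao for the actual output):

# Sketch: inspect n-best anchor predictions produced by run_squad.py.
# The file name and record keys are assumptions based on the legacy
# transformers SQuAD example script.
import json

with open("./ao/nbest_predictions_.json") as f:
    nbest = json.load(f)

for qid, candidates in nbest.items():
    best = candidates[0]  # candidates are sorted by probability
    print(qid, best["text"], best["probability"])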

Generate Questions

Run python prepare_question_generation.py; this script performs NER masking and generates the input format for the GPT-2 question generation model.
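
To illustrate what NER masking looks like, here is a sketch using spaCy that replaces each named-entity span with its entity label; this shows the general idea and is not necessarily identical to the repo's implementation:

# Sketch: mask named entities in a sentence with their entity labels.
# Uses spaCy's small English model (python -m spacy download en_core_web_sm);
# illustrative only, not the repo's exact procedure.
import spacy

nlp = spacy.load("en_core_web_sm")

def mask_entities(sentence: str) -> str:
    doc = nlp(sentence)
    masked = sentence
    # Replace spans right-to-left so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        masked = masked[:ent.start_char] + ent.label_ + masked[ent.end_char:]
    return masked

print(mask_entities("She carried a suitcase full of rubber ducks to Toronto."))
# -> "She carried a suitcase full of rubber ducks to GPE."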

Run the following command to execute the question generation model (the input and output file paths are set at lines 231 and 232).

python ./transformers/examples/text-generation/run_generation.py --model_type=gpt2 --model_name_or_path=./question_generation

Prepare re-ranking scores for each question

Run python prepare_reranker.py; this script prepares the input format for the re-ranker.

Download the GLUE data by running the GLUE download script, and unpack it to some directory $GLUE_DIR.

Run the following command to execute the re-ranker.

export GLUE_DIR=./glue 

export TASK_NAME=WNLI
  
python ./transformers/examples/text-classification/run_glue.py \
  --model_name_or_path ./transformers/$TASK_NAME/ \
  --task_name $TASK_NAME \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --per_device_eval_batch_size 1 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir ./transformers/WNLIoutput/ \
  --cache_dir ./transformers/cache \
  --overwrite_cache \
  --overwrite_output_dir > output.txt

Re-sort questions based on scores

Run python resort_question.py to re-sort the generated questions according to their re-ranker scores.
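
Conceptually, this step orders each anchor's candidate questions by re-ranker score, highest first. A minimal sketch with a hypothetical JSON layout (resort_question.py implements the actual logic over the repo's own file format):

# Sketch: order candidate questions by re-ranker score, highest first.
# The layout here (a list of {"question": ..., "score": ...} records) is
# hypothetical; see resort_question.py for the actual format.
import json

with open("questions_with_scores.json") as f:  # hypothetical input file
    questions = json.load(f)

ranked = sorted(questions, key=lambda q: q["score"], reverse=True)

for q in ranked:
    print(f'{q["score"]:.3f}\t{q["question"]}')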

Quick Setup

You can use this notebook to get started quickly on Colab.

Related Work

DCQA: Discourse Comprehension by Question Answering

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
