Don’t paraphrase, detect! Rapid and Effective Data Collection for Semantic Parsing

Author implementation of the above EMNLP 2019 paper.

Setup

  1. Install AllenNLP by following its official installation instructions.

  2. Unzip GeoQuery and Scholar databases

    unzip lib/data/overnight/dbs.zip -d lib/data/overnight/
    

Datasets

The data/ folder contains different training (and development) sets for the GeoQuery and Scholar semantic parsing domains. All logical forms are in lambda DCS. The test set is always the original test set for each dataset. The training sets differ only in their data collection method, corresponding to the methods described in the paper. The details for each data collection method follow.

  1. Nat: The training set is the same as in the original dataset (GeoQuery or Scholar).
  2. Lang: The logical forms are the same as in the original training set, but the natural language utterances were paraphrased from synthetic language (canonical utterances) by crowd workers. This is described in Section 3.2 in the paper.
  3. Overnight: The training set was generated by the process described in Building a Semantic Parser Overnight.
  4. GrAnno: The training set is composed of unlabelled examples (natural language utterances with no logical forms). These unlabelled examples were annotated by crowd workers, who selected the correct canonical utterance paraphrase for each one. This method is described in Section 4 in the paper.
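The GrAnno method above relies on ranking candidate canonical utterances against an unlabelled utterance so that a worker (or model) can pick the best paraphrase. The sketch below illustrates the idea with simple token-overlap similarity; it is a hypothetical stand-in, not the paper's learned detection model.

```python
def rank_candidates(utterance, candidates):
    """Rank candidate canonical utterances by Jaccard token overlap
    with the unlabelled utterance (a toy stand-in for a learned
    detection/ranking model)."""
    query = set(utterance.lower().split())

    def overlap(candidate):
        tokens = set(candidate.lower().split())
        return len(query & tokens) / max(len(query | tokens), 1)

    return sorted(candidates, key=overlap, reverse=True)

# Hypothetical GeoQuery-style example: the top-ranked candidate is the
# canonical utterance the annotator would select.
candidates = [
    "state that borders texas",
    "river that traverses texas",
    "city in texas",
]
best = rank_candidates("what state borders texas", candidates)[0]
```

In the paper's setting the candidates come from the grammar output (see below), and the ranker is a trained model rather than raw token overlap.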

Training

The model is based on this CopyNet implementation by AllenNLP.

The following command searches for the best hyper-parameter values (learning rate and dropout) based on the dev set loss, and evaluates the corresponding best model on the test set using denotation accuracy.

python nsp/run.py --domain domain --version version --embeddings embeddings

Where the possible values are domain=['geo', 'scholar'] (the domain to experiment with), version=['nat', 'lang', 'granno', 'overnight'] (the version of the training set, detailed in the section above), and embeddings=['glove', 'elmo'] (pre-trained embeddings to be used by the encoder).

The results are written to a log file under the logs/ folder.
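The hyper-parameter search described above can be sketched as a simple grid search that keeps the configuration with the lowest dev loss. This is an illustration only; `train_and_eval` is a hypothetical callable standing in for a full AllenNLP training run, not the repository's actual run.py.

```python
import itertools

def grid_search(train_and_eval, learning_rates, dropouts):
    """Return the (lr, dropout) pair with the lowest dev-set loss.

    `train_and_eval(lr, dropout)` is assumed to train one model and
    return its dev loss; here it is a placeholder for a real run.
    """
    best_config, best_loss = None, float("inf")
    for lr, dropout in itertools.product(learning_rates, dropouts):
        loss = train_and_eval(lr, dropout)
        if loss < best_loss:
            best_config, best_loss = (lr, dropout), loss
    return best_config, best_loss

# Toy stand-in: pretend dev loss is minimised at lr=1e-3, dropout=0.2.
fake_loss = lambda lr, d: abs(lr - 1e-3) + abs(d - 0.2)
config, loss = grid_search(fake_loss, [1e-2, 1e-3, 1e-4], [0.1, 0.2, 0.5])
```

The best model found this way would then be evaluated once on the test set with denotation accuracy, as the README describes.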

Generating from the grammar

You can use the grammar to generate (canonical_utterance, logical_form) pairs exhaustively up to some maximal depth. The generation procedure runs recursively, top-down, while pruning some nonsensical structures during generation, and also after generation is terminated. The number of generated examples should match Table 2 in the paper.

To generate from the grammar, run:

python grammar_generation/grammar_gen.py --domain domain --name name --max_depth max_depth

Where the possible values are domain=['geo', 'scholar'] (the domain to generate for), name is the name of the output (generated under the grammar_generation folder), and max_depth is the maximal tree depth to generate (for depths greater than 7, expect running time to blow up with the current implementation).
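The exhaustive, depth-bounded, top-down expansion described above can be sketched as follows. The toy grammar and symbol names here are hypothetical illustrations, not the repository's actual grammar or pruning rules.

```python
def generate(grammar, symbol, max_depth):
    """Exhaustively expand `symbol` top-down, bounded by `max_depth`.

    `grammar` maps each nonterminal to a list of productions; a
    production is a sequence of symbols (nonterminals recurse,
    terminals are emitted as-is). The depth bound prunes recursion,
    mirroring the --max_depth flag.
    """
    if symbol not in grammar:            # terminal: emit directly
        return [symbol]
    if max_depth == 0:                   # depth bound reached: prune
        return []
    results = []
    for production in grammar[symbol]:
        # Cross-product of the expansions of each symbol in the production.
        expansions = [[]]
        for part in production:
            subs = generate(grammar, part, max_depth - 1)
            expansions = [e + [s] for e in expansions for s in subs]
        results.extend(" ".join(e) for e in expansions)
    return results

# Toy GeoQuery-flavoured grammar producing canonical utterances.
toy = {
    "$ROOT": [["$TYPE", "that", "$REL", "$ENT"]],
    "$TYPE": [["river"], ["state"]],
    "$REL": [["traverses"], ["borders"]],
    "$ENT": [["texas"]],
}
utterances = generate(toy, "$ROOT", max_depth=3)
```

A real implementation would additionally pair each canonical utterance with its logical form and prune nonsensical structures, as the README notes.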

Grammar output

The grammar output for both domains, containing canonical utterances and logical forms, can be found in grammar_output.zip. These canonical utterances are used as candidates for annotation using GrAnno, and match the outputs for a maximum tree depth of 6 reported in the paper.
