Don’t paraphrase, detect! Rapid and Effective Data Collection for Semantic Parsing

Author implementation of the above EMNLP 2019 paper.

Setup

  1. Install AllenNLP by following its official installation instructions.

  2. Unzip GeoQuery and Scholar databases

    unzip lib/data/overnight/dbs.zip -d lib/data/overnight/
    

Datasets

The data/ folder contains different training (and development) sets for the GeoQuery and Scholar semantic parsing domains. All logical forms are in lambda DCS. The test set is always the original test set for each dataset. The training sets differ only in their data collection method, corresponding to the methods described in the paper. The details for each data collection method follow.

  1. Nat: The training set is the same as in the original dataset (GeoQuery or Scholar).
  2. Lang: The logical forms are the same as in the original training set, but the natural language utterances were paraphrased from synthetic language (canonical utterances) by crowd workers. This is described in Section 3.2 in the paper.
  3. Overnight: The training set was generated by the process described in Building a Semantic Parser Overnight.
  4. GrAnno: The training set is composed of unlabelled examples (natural language utterances with no logical forms). These unlabelled examples were annotated by crowd workers, who selected the correct canonical utterance paraphrase for each one. This method is described in Section 4 in the paper.
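The GrAnno method above relies on ranking candidate canonical utterances against an unlabelled utterance so that a worker (or model) can pick the best paraphrase. The sketch below illustrates the idea with simple token-overlap similarity; it is a hypothetical stand-in, not the paper's learned detection model.

```python
def rank_candidates(utterance, candidates):
    """Rank candidate canonical utterances by Jaccard token overlap
    with the unlabelled utterance (a toy stand-in for a learned
    detection/ranking model)."""
    query = set(utterance.lower().split())

    def overlap(candidate):
        tokens = set(candidate.lower().split())
        return len(query & tokens) / max(len(query | tokens), 1)

    return sorted(candidates, key=overlap, reverse=True)

# Hypothetical GeoQuery-style example: the top-ranked candidate is the
# canonical utterance the annotator would select.
candidates = [
    "state that borders texas",
    "river that traverses texas",
    "city in texas",
]
best = rank_candidates("what state borders texas", candidates)[0]
```

In the paper's setting the candidates come from the grammar output (see below), and the ranker is a trained model rather than raw token overlap.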

Training

The model is based on this CopyNet implementation by AllenNLP.

The following command searches for the best hyper-parameter values (learning rate and dropout) based on the dev set loss, and evaluates the corresponding best model on the test set using denotation accuracy.

python nsp/run.py --domain domain --version version --embeddings embeddings

Where the possible values are domain=['geo', 'scholar'] (the domain to experiment with), version=['nat', 'lang', 'granno', 'overnight'] (the version of the training set, detailed in the section above), and embeddings=['glove', 'elmo'] (pre-trained embeddings to be used by the encoder).

The results are written to a log file under the logs/ folder.
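The hyper-parameter search described above can be sketched as a simple grid search that keeps the configuration with the lowest dev loss. This is an illustration only; `train_and_eval` is a hypothetical callable standing in for a full AllenNLP training run, not the repository's actual run.py.

```python
import itertools

def grid_search(train_and_eval, learning_rates, dropouts):
    """Return the (lr, dropout) pair with the lowest dev-set loss.

    `train_and_eval(lr, dropout)` is assumed to train one model and
    return its dev loss; here it is a placeholder for a real run.
    """
    best_config, best_loss = None, float("inf")
    for lr, dropout in itertools.product(learning_rates, dropouts):
        loss = train_and_eval(lr, dropout)
        if loss < best_loss:
            best_config, best_loss = (lr, dropout), loss
    return best_config, best_loss

# Toy stand-in: pretend dev loss is minimised at lr=1e-3, dropout=0.2.
fake_loss = lambda lr, d: abs(lr - 1e-3) + abs(d - 0.2)
config, loss = grid_search(fake_loss, [1e-2, 1e-3, 1e-4], [0.1, 0.2, 0.5])
```

The best model found this way would then be evaluated once on the test set with denotation accuracy, as the README describes.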

Generating from the grammar

You can use the grammar to generate (canonical_utterance, logical_form) pairs exhaustively up to some maximal depth. The generation procedure runs recursively, top-down, while pruning some nonsensical structures during generation, and also after generation is terminated. The number of generated examples should match Table 2 in the paper.

To generate from the grammar, run:

python grammar_generation/grammar_gen.py --domain domain --name name --max_depth max_depth

Where the possible values are domain=['geo', 'scholar'] (the domain to generate for), name is the name of the output (generated under the grammar_generation folder), and max_depth is the maximal tree depth to generate (for depths greater than 7, expect running time to blow up with the current implementation).
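The exhaustive, depth-bounded, top-down expansion described above can be sketched as follows. The toy grammar and symbol names here are hypothetical illustrations, not the repository's actual grammar or pruning rules.

```python
def generate(grammar, symbol, max_depth):
    """Exhaustively expand `symbol` top-down, bounded by `max_depth`.

    `grammar` maps each nonterminal to a list of productions; a
    production is a sequence of symbols (nonterminals recurse,
    terminals are emitted as-is). The depth bound prunes recursion,
    mirroring the --max_depth flag.
    """
    if symbol not in grammar:            # terminal: emit directly
        return [symbol]
    if max_depth == 0:                   # depth bound reached: prune
        return []
    results = []
    for production in grammar[symbol]:
        # Cross-product of the expansions of each symbol in the production.
        expansions = [[]]
        for part in production:
            subs = generate(grammar, part, max_depth - 1)
            expansions = [e + [s] for e in expansions for s in subs]
        results.extend(" ".join(e) for e in expansions)
    return results

# Toy GeoQuery-flavoured grammar producing canonical utterances.
toy = {
    "$ROOT": [["$TYPE", "that", "$REL", "$ENT"]],
    "$TYPE": [["river"], ["state"]],
    "$REL": [["traverses"], ["borders"]],
    "$ENT": [["texas"]],
}
utterances = generate(toy, "$ROOT", max_depth=3)
```

A real implementation would additionally pair each canonical utterance with its logical form and prune nonsensical structures, as the README notes.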

Grammar output

The grammar output for both domains, containing canonical utterances and logical forms, can be found in grammar_output.zip. These canonical utterances are used as candidates for annotation using GrAnno, and match the outputs for a maximum tree depth of 6 reported in the paper.
