Noun Verb Dataset

This dataset contains naturally-occurring English sentences that feature non-trivial noun-verb ambiguity.

Motivation

English part-of-speech taggers regularly make egregious errors related to noun-verb ambiguity, despite having achieved 97%+ accuracy on the WSJ Penn Treebank since 2002. These mistakes have been difficult to quantify, and they make taggers less useful for downstream tasks such as translation and text-to-speech synthesis.

Below are some examples from the dataset:

  1. Certain insects can damage plumerias, such as mites, flies, or aphids. NOUN
  2. Mark which area you want to distress. VERB

All tested existing part-of-speech taggers mistag both of these examples, tagging flies as a verb and Mark as a noun.

Description

The dataset contains sentences in CoNLL format. Each sentence has a single token that has been manually annotated as either VERB or NON-VERB. The sentences come from multiple domains. Where applicable, the URL of the source page for the sentence is included in a comment line before the sentence.

The dataset is split into Train/Dev/Test sections. For the Dev and Test sections, the annotation (either VERB or NON-VERB) appears in the XPOS, UPOS, and FEATS columns. For the Train section, the XPOS and UPOS columns instead contain a (predicted) fine-grained POS tag, obtained by running an automatic tagger and selecting its top-ranked tag that matches the gold coarse-grained VERB/NON-VERB label.

An example of a training sentence is shown below:

# https://www.wikihow.com/Not-Get-Bored-on-a-Long-Car-Ride
1   License _   NN  NN  POS=NON-VERB|fPOS=NON-VERB  -1  _   _   _
2   plates  _   _   _   _   -1  _   _   _
3   of  _   _   _   _   -1  _   _   _
4   cars    _   _   _   _   -1  _   _   _
5   from    _   _   _   _   -1  _   _   _
6   your    _   _   _   _   -1  _   _   _
7   area    _   _   _   _   -1  _   _   _
8   or  _   _   _   _   -1  _   _   _
9   your    _   _   _   _   -1  _   _   _
10  destination _   _   _   _   -1  _   _   _
11  .   _   _   _   _   -1  _   _   _
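
For reference, below is a minimal Python sketch of a reader for these files. The split file names (train.conll, dev.conll, test.conll) are assumptions, not part of the dataset documentation; adjust them to the actual files in the repository. It treats blank lines as sentence boundaries, comment lines as source URLs, and the FEATS column (sixth field) as the carrier of the annotation:

from collections import namedtuple

Example = namedtuple("Example", ["url", "tokens", "target", "label"])

def read_examples(path):
    """Yields one Example per annotated sentence in a CoNLL-style file."""
    url, tokens, target, label = "", [], None, None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line ends the current sentence
                if tokens:
                    yield Example(url, tokens, target, label)
                url, tokens, target, label = "", [], None, None
            elif line.startswith("#"):  # comment line holding the source URL
                url = line.lstrip("# ")
            else:
                cols = line.split()     # tab- or space-separated columns
                tokens.append(cols[1])  # FORM is the second column
                feats = cols[5]         # FEATS is the sixth column
                if feats != "_":        # the single manually annotated token
                    target = cols[1]
                    # FEATS looks like "POS=NON-VERB|fPOS=NON-VERB" in Train;
                    # taking the value of the first feature also handles a
                    # bare VERB/NON-VERB label.
                    label = feats.split("|")[0].split("=")[-1]
    if tokens:  # handle a file without a trailing blank line
        yield Example(url, tokens, target, label)

for split in ("train.conll", "dev.conll", "test.conll"):  # assumed names
    print(split, sum(1 for _ in read_examples(split)))

The per-split counts printed by this loop should match the table below.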

The number of examples in each split is shown below:

Data Split   Train   Dev    Test
Total        23458   2367   5907

These numbers are slightly lower than those reported in Table 5 of the paper, by 796, 33, and 93 examples respectively (Table 5's counts were 24254, 2400, and 6000).

Updated Results

We reran the experiments reported in Tables 1 and 9 of the paper using the slightly filtered data. The updated results are reported as means and standard deviations over 10 runs:

Model            WSJ         NV        Homographs (Micro)  Homographs (Macro)
WSJ              98.07±0.06  73.9±1.5  95.7±0.3            95.8±0.3
WSJ+ELMo         97.99±0.06  81.5±1.1  96.8±0.2            96.8±0.2
WSJ+Silver       98.00±0.07  86.4±0.5  96.0±0.2            96.1±0.2
WSJ+ELMo+Silver  98.01±0.06  88.7±0.4  96.7±0.2            96.8±0.2
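
The micro and macro homograph scores differ only in weighting: micro pools all homograph tokens, while macro averages per-word accuracies so that frequent homographs do not dominate. Below is a hedged Python sketch of that distinction, assuming grouping by surface form; the paper's exact grouping may differ:

from collections import defaultdict

def micro_macro(examples):
    """examples: iterable of (word, gold_label, predicted_label) triples."""
    per_word = defaultdict(lambda: [0, 0])  # word -> [num_correct, num_total]
    for word, gold, pred in examples:
        per_word[word][0] += int(gold == pred)
        per_word[word][1] += 1
    micro = (sum(c for c, _ in per_word.values())
             / sum(t for _, t in per_word.values()))  # each token counts once
    macro = (sum(c / t for c, t in per_word.values())
             / len(per_word))                         # each word type counts once
    return micro, macro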

An updated version of Table 7, tuned on the Noun-Verb development set, is as follows:

Model            WSJ         NV
WSJ              98.03±0.10  76.9±0.5
WSJ+ELMo         97.90±0.09  83.4±0.3
WSJ+Silver       97.97±0.09  87.1±0.3
WSJ+ELMo+Silver  97.95±0.08  89.1±0.2

Contact

If you have a technical question regarding the dataset or publication, please create an issue in this repository.

Citation

If you use or discuss this dataset in your work, please cite our paper:

@InProceedings{NOUNVERB,
  title = {A Challenge Set and Methods for Noun-Verb Ambiguity},
  author = {Ali Elkahky and Kellie Webster and Daniel Andor and Emily Pitler},
  booktitle = {Proceedings of EMNLP},
  year = {2018}
}
