This is the implementation of the double-chained conditional random field (CRF) used for predicting multiword expressions (MWEs) and supersenses.

UW-CSE at SemEval-2016 Task 10: Detecting multiword expressions and supersenses using double-chained conditional random fields. Mohammad Javad Hosseini, Noah A. Smith, and Su-In Lee. In Proceedings of the NAACL Workshop on Semantic Evaluations (SemEval 2016), San Diego, CA, June 2016.

We participated in SemEval 2016 Task 10: Detecting Minimal Semantic Units and their Meanings (DiMSUM). Our submitted models ranked first overall in the competition.

We have implemented a Conditional Random Field and a Double-Chained Conditional Random Field model for joint learning of multiword expressions and supersenses.

Feature extraction is based on AMALGrAM 2.0 (A Machine Analyzer of Lexical Groupings And Meanings), and this code has the same dependencies as AMALGrAM 2.0.

Software

  • Python 2.7
  • Cython (tested on 0.21.1)
  • NLTK 3.0.2+ with the WordNet resource installed

Running:

After downloading the code and installing the software listed above, run the scripts in the scripts folder to replicate the paper's results and/or to test on new data. The script for the best model is Double_CRF_open.sh.

Tagging Scheme

Multiword Expressions:

The annotation for MWEs extends the conventional BIO scheme to include gappy MWEs with one level of nesting. Segmentations are represented using six tags; the lower-case variants indicate that an expression is within another MWE’s gap.

  • O and o: a single-word expression
  • B and b: the first word of an MWE
  • I and i: a word continuing an MWE
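
As an illustration of how the six tags encode gappy MWEs, here is a minimal Python sketch that groups token indices into MWEs. The function name decode_mwes is hypothetical and is not part of this repository; the code only reflects the tag meanings listed above and does not check every constraint of the scheme (for example, it does not require that a gap be closed).

```python
def decode_mwes(tags):
    """Group token indices into MWEs from a six-tag sequence (O o B b I i).

    Illustrative helper, not part of this repository.
    """
    groups = []
    outer = None  # MWE currently open at the top level
    inner = None  # MWE currently open inside the gap of `outer`
    for idx, tag in enumerate(tags):
        if tag == "B":            # first word of a top-level MWE
            outer = [idx]
            groups.append(outer)
            inner = None
        elif tag == "I":          # continues the top-level MWE (closes any gap)
            if outer is None:
                raise ValueError("I at position %d has no preceding B" % idx)
            outer.append(idx)
            inner = None
        elif tag == "O":          # single-word expression at the top level
            outer = None
            inner = None
        elif tag == "b":          # first word of an MWE inside a gap
            if outer is None:
                raise ValueError("b at position %d is not inside a gap" % idx)
            inner = [idx]
            groups.append(inner)
        elif tag == "i":          # continues the MWE inside the gap
            if inner is None:
                raise ValueError("i at position %d has no preceding b" % idx)
            inner.append(idx)
        elif tag == "o":          # single-word expression inside a gap
            if outer is None:
                raise ValueError("o at position %d is not inside a gap" % idx)
            inner = None
        else:
            raise ValueError("unknown tag %r at position %d" % (tag, idx))
    return groups


# Invented example sentence: "taken ... in" is a gappy MWE whose gap
# contains the nested MWE "Ford Fusion".
words = ["My", "wife", "had", "taken", "her", "'07", "Ford", "Fusion",
         "in", "for", "a", "routine", "oil", "change"]
tags = ["O", "O", "O", "B", "o", "o", "b", "i",
        "I", "O", "O", "O", "B", "I"]
mwes = [[words[i] for i in group] for group in decode_mwes(tags)]
print(mwes)
```

Each returned group lists the token positions of one MWE, so a gappy expression shows up as a group with non-adjacent indices.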

Supersenses:

Each noun or verb expression is also annotated with a supersense; there are 26 supersenses for nouns and 15 for verbs. Only the first word of an MWE receives a supersense tag.
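
The rule that only the first word of an MWE carries a supersense can be checked with a few lines of Python. This is a sketch under assumptions: the helper name misplaced_supersenses is hypothetical, an empty string stands for "no supersense", and the label strings in the example are illustrative only.

```python
def misplaced_supersenses(mwe_tags, supersenses):
    """Return indices of continuation tokens (I or i) that carry a supersense.

    Illustrative helper, not part of this repository. An empty string in
    `supersenses` is assumed to mean that the token has no supersense.
    """
    if len(mwe_tags) != len(supersenses):
        raise ValueError("one supersense slot is needed per token")
    return [idx
            for idx, (tag, sense) in enumerate(zip(mwe_tags, supersenses))
            if tag in ("I", "i") and sense]


# Invented example: "looked up" is an MWE, so only "looked" may be labeled.
mwe_tags = ["O", "B", "I", "O"]
good = ["", "v.cognition", "", "n.person"]           # label strings are illustrative
bad = ["", "v.cognition", "v.cognition", "n.person"]

print(misplaced_supersenses(mwe_tags, good))
print(misplaced_supersenses(mwe_tags, bad))
```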

The input must be split into sentences, tokenized into words, and part-of-speech tagged (with the Penn Treebank POS tagset).

Please refer to dimsum-data-1.5/TAGSET.md for more details.

Data:

The datasets are in the folder dimsum-data-1.5. There is a readme file in the folder explaining the format. For prediction on new data, input should be formatted as described there. Our original submission is in the folder submitted_results.
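
For orientation, the following Python sketch reads tab-separated, blank-line-delimited sentence data into token records. The nine-column layout used here (offset, word, lemma, POS, MWE tag, parent offset, strength, supersense, sentence ID) is an assumption about the DiMSUM format, and the reader and the sample rows are invented for illustration; the readme in dimsum-data-1.5 remains the authoritative description.

```python
from collections import namedtuple

# Assumed column layout; verify against the readme in dimsum-data-1.5.
Token = namedtuple(
    "Token",
    "offset word lemma pos mwe parent strength supersense sent_id")


def read_sentences(lines):
    """Split tab-separated token lines into sentences at blank lines.

    Illustrative reader, not part of this repository. All fields are kept
    as strings.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():      # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError("expected 9 columns, got %d" % len(fields))
        current.append(Token(*fields))
    if current:                   # last sentence may lack a trailing blank line
        sentences.append(current)
    return sentences


# Invented sample rows, built with explicit tabs for readability.
sample = [
    "\t".join(["1", "Look", "look", "VB", "B", "0", "", "v.cognition", "s1"]),
    "\t".join(["2", "up", "up", "RP", "I", "1", "", "", "s1"]),
    "",
    "\t".join(["1", "Cats", "cat", "NNS", "O", "0", "", "n.animal", "s2"]),
]
sentences = read_sentences(sample)
print([[tok.word for tok in sent] for sent in sentences])
```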

Please email the first author (hosseini@cs.washington.edu) with any questions or requests.