Skip to content

jhcross/span-parser

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
src
 
 
 
 
 
 

Span-Based Constituency Parser

This is an implementation of the span-based natural language constituency parser described in the paper Span-Based Constituency Parsing with a Structure-Label System and Provably Optimal Dynamic Oracles, which will appear in EMNLP (2016).

Required Dependencies

  • Python 2.7
  • NumPy
  • DyNet (To ensure compatibility, check out and compile commit 71fc893eda8e3f3fccc77b9a4ae942dce77ba368)

Vocabulary Files

Vocabulary may be loaded every time from a training tree file, or it may be stored (separately from a trained model) in a JSON file, which is much faster and recommended. To learn the vocabulary from a file with training trees and write a JSON file, use a command such as the following:

python src/main.py --train data/02-21.10way.clean --write-vocab data/vocab.json

Training

Training requires a file containing training trees (--train) and a file containg validation trees (--dev), which are parsed four times per training epoch to determine which model to keep. A file name must also be provided to store the saved model (--model). The following is an example of a command to train a model with all of the default settings:

python src/main.py --train data/02-21.10way.clean --dev data/22.auto.clean --vocab data/vocab.json --model data/my_model

The following table provides an overview of additional training options:

Argument Description Default
--dynet-mem Memory (MB) to allocate for DyNet 2000
--dynet-l2 L2 regularization factor 0
--dynet-seed Seed for random parameter initialization random
--word-dims Word embedding dimensions 50
--tag-dims POS embedding dimensions 20
--lstm-units LSTM units (per direction, for each of 2 layers) 200
--hidden-units Units for ReLU FC layer (each of 2 action types) 200
--epochs Number of training epochs 10
--batch-size Number of sentences per training update 10
--droprate Dropout probability 0.5
--unk-param Parameter z for random UNKing 0.8375
--alpha Softmax weight for exploration 1.0
--beta Oracle action override probability 0.0
--np-seed Seed for shuffling and softmax sampling random

Test Evaluation

There is also a facility to directly evaluate a model agaist a reference corpus, by supplying the --test argument:

python src/main.py --test data/23.auto.clean --vocab data/vocab.json --model data/my_model

Citation

If you use this software for research, we would appreciate a citation to our paper:

@inproceedings{cross2016span,
  title={Span-Based Constituency Parsing with a Structure-Label System and Provably Optimal Dynamic Oracles},
  author={Cross, James and Huang, Liang},
  journal={Empirical Methods in Natural Language Processing (EMNLP)},
  year={2016}
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages