RST Parser

If you are looking for an RST parser that is ready to use, please check out this repository

If you need a framework to develop your own RST parser, please keep reading :-)

Basic Description

RST parser for document-level discourse parsing. The parsing algorithm is shift-reduce parsing, and the parsing model is an offline-trained multi-class classifier.
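
For readers new to shift-reduce discourse parsing, here is a tiny illustration (not taken from this repository) of how a sequence of actions builds a binary tree over three EDUs. The action-tuple layout and the relation labels are assumptions for illustration; the only format confirmed in this README is the (Shift, None, None) example under Main Classes.

```python
# Illustration only: a possible shift-reduce action sequence over three EDUs.
# The (action, form, relation) tuple layout and the relation labels are
# assumptions; see the Main Classes section for the repository's own example.
edus = ["EDU 1", "EDU 2", "EDU 3"]
actions = [
    ("Shift",  None, None),            # stack: [1]           queue: [2, 3]
    ("Shift",  None, None),            # stack: [1, 2]        queue: [3]
    ("Reduce", "NS", "Elaboration"),   # stack: [(1,2)]       queue: [3]
    ("Shift",  None, None),            # stack: [(1,2), 3]    queue: []
    ("Reduce", "NS", "Elaboration"),   # stack: [((1,2),3)]   queue: []
]
```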

To obtain good performance, you can:

  • add more features to the feature generator (in feature.py)
  • tune the parameters of the parsing model (in model.py). For now, I simply use LinearSVC with the default parameter settings (a tuning sketch follows this list).
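
As a starting point for the second item, the default LinearSVC could be swapped for a cross-validated one. The snippet below is only a sketch of one possible change, not the code in model.py; trnM and trnL stand for the training matrix and labels described under Main Classes.

```python
# Sketch: replacing the default LinearSVC with a grid-searched one.
# trnM / trnL are the training matrix and labels produced by the data module
# (see ParsingModel.train below); how this plugs into model.py is up to you.
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

def train_tuned(trnM, trnL):
    """Grid-search the SVM regularization strength instead of using the default."""
    param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
    clf = GridSearchCV(LinearSVC(), param_grid, cv=5)
    clf.fit(trnM, trnL)          # same fit call the default LinearSVC receives
    return clf.best_estimator_   # plug the tuned classifier back into ParsingModel
```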

Demo

Start from "main.py" for a demo

Modules

  • tree: all operations on an RST tree are included in this module, for example:
    • Build a general/binary RST tree from an annotated file
    • Binarize a general RST tree into binary form (the original RST trees in the RST treebank may not be in binary form)
    • Generate a bracketing sequence for evaluation
    • Write an RST tree into a file (not implemented yet)
    • Generate shift-reduce parsing action examples
    • Get all EDUs from the RST tree
  • parser: an implementation of the shift-reduce parsing algorithm (a conceptual sketch of the parsing loop follows this list), including the following functions:
    • Initialize the parsing status given a sequence of texts
    • Change the status according to a specific parsing action
    • Get the status of the stack/queue
    • Check whether parsing should stop
  • model: a parsing model module, where a trained parsing model can predict parsing actions. This module includes:
    • Batch training on the data generated by the data module
    • Predict parsing actions for a given feature set
    • Save/load parsing model
  • feature: a feature generator, which generates features from the current stack/queue status.
  • data: generates training data for offline training
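
The rough interplay between the parser, feature, and model modules during parsing looks like the loop below. This is a conceptual sketch based on the descriptions above, not the implementation in parser.py or model.py; the helper names endparsing() and getstatus() are hypothetical stand-ins for the stop check and the stack/queue accessors.

```python
# Conceptual sketch of the shift-reduce loop driven by the trained classifier.
# Class names (SRParser, FeatureGenerator) follow this README; endparsing(),
# getstatus(), and the FeatureGenerator constructor arguments are assumptions.
from parser import SRParser
from feature import FeatureGenerator

def sr_parse_sketch(texts, model):
    parser = SRParser()
    parser.init(texts)                     # each element of texts is one EDU
    while not parser.endparsing():         # hypothetical stop check
        stack, queue = parser.getstatus()  # hypothetical stack/queue accessor
        features = FeatureGenerator(stack, queue).features()
        action = model.predict(features)   # e.g. ('Shift', None, None)
        parser.operate(action)
    return parser.getparsetree()
```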

Main Classes

(For all the following functions, please refer to the code for more explanation)

  • RSTTree (in tree module):
    • build(): Build a binary RST tree from an annotated discourse file
    • generate_sample(): Generate a sequence of parsing actions and the corresponding training examples, which can be used for offline training of the parsing model
    • getedutext(): Get a sequence of EDU texts from the given RST tree
    • bracketing(): Generate a bracketing sequence for evaluation
  • SRParser (in parser module):
    • init(texts): Initialize the queue status from the given text sequence. Each element in this sequence will be treated as an EDU
    • operate(action_tuple): Change the queue/stack according to the action tuple; for example, the operation (Shift, None, None) moves one element from the head of the queue to the top of the stack
    • getparsetree(): Return the entire RST tree
  • FeatureGenerator (in feature module):
    • features(): the main generator, which extracts all the necessary features from the current queue/stack. You can extend this generator by calling other sub-functions in it.
  • ParsingModel (in model module):
    • train(trnM, trnL): Offline training of the parsing model (i.e., a multi-class classifier) on the given training data trnM and the corresponding labels trnL
    • predict(features): Predict a parsing action according to the given feature generator
    • sr_parse(texts): Perform shift-reduce RST parsing on the given text sequence. Each element in this sequence will be treated as an EDU (see the end-to-end sketch after this list)
  • Data (in data module):
    • buildvocab(thresh): Build the feature vocab by removing low-frequency features. The same vocab will also be used for parsing in the test stage.
    • buildmatrix(): Build data matrix for offline training
    • savematrix(fname): Save data matrix and corresponding labels into fname
    • getvocab(): Get feature vocab
    • savevocab(fname): Save feature vocab and relation mapping (from relations to indices) into fname
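
Putting the pieces together, an end-to-end run might look like the sketch below. The class and method names follow the descriptions above; the constructor arguments, the getmatrix() accessor, and the file names are illustrative assumptions only.

```python
# End-to-end sketch: offline training followed by parsing.
# Class/method names come from this README; Data()'s constructor arguments,
# the getmatrix() accessor, and the file names are assumptions.
from data import Data
from model import ParsingModel

# --- Offline training ---------------------------------------------------
data = Data()                      # assumed to collect the samples generated
                                   # by RSTTree.generate_sample()
data.buildvocab(thresh=1)          # drop low-frequency features
data.buildmatrix()                 # build the data matrix for offline training
data.savematrix("train.pickle")
data.savevocab("vocab.pickle")

trnM, trnL = data.getmatrix()      # hypothetical accessor for matrix/labels
model = ParsingModel()
model.train(trnM, trnL)            # offline training of the multi-class classifier

# --- Parsing ------------------------------------------------------------
edus = ["The parser takes a list of EDUs,", "one string per EDU."]
tree = model.sr_parse(edus)        # each list element is treated as one EDU
print(tree.bracketing())           # bracketing sequence for evaluation
```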

Reference