Code written to run under Python 2.7 with the argparse
module.
The below assumes you have the Penn WSJ Treebank corpus in the folder
data
Quick start to parse and run CYK interactive:
$ python run.py -t data/wsj.02-21.training -cyk
Quick start to parse and run Viterbi on a file of sentences:
$ python run.py -t data/wsj.02-21.training -cyk data/test.sentence.23 -test
run.py
refers to the modules grammar.py
and cyk.py
. All options can be controlled through
passed arguments as detailed in the help. To print the run $ python run.py --help.
A grammar can be built with the command:
$ python run.py -t data/wsj.02-21.training -s
Then with the generated grammar a set of covering productions can be generated with the command:
$ python run.py -p grammar.pkl -cyk data/test.sentence.23
These covering productions will be ourput to the file covering_productions.txt
where each line is of the form:
XP1 YP1#XP2 YP2
where XPi YPi are the left hand sides of the covering productions and # indicate the start of a new covering produciton.
Or, a grammar can be built and a set of covering productions can be generated with the singe command:
$ python run.py -t data/wsj.02-21.training -cyk data/test.sentence.23
To run the parser on a sentence file given a pickle grammar:
$ python run.py -p grammar.pkl -cyk data/test.sentence.23 -test
By default the covering productions are output to covering_productions.txt, formatted as:
- each line corresponds to a line in the input test file
- each line is an array of arrays where the first member is A, the second B and the third C for the rule A -> BC
For reference the help output ($ python run.py -h
) is displayed below:
usage: run.py [-h] [-t TREEBANK] [-p PICKLE] [-gl GRAMMAR_LIMIT] [-s [SAVE]]
[-c] [-v] [-vv] [-m [MOST_LIKELY_PRODUCTIONS]] [-cyk [CYK]]
[-pl PARSER_TEST_LIMIT] [-a [AMBIGUOUS]] [-l] [-test]
optional arguments:
-h, --help show this help message and exit
-t TREEBANK, --treebank TREEBANK
Generate a grammar from the passed treebank file.
-p PICKLE, --pickle PICKLE
Load a grammar from the passed Python Pickle file.
-gl GRAMMAR_LIMIT, --grammar-limit GRAMMAR_LIMIT
Limit the grammar to including the passed number of
sentences.
-s [SAVE], --save [SAVE]
Save the generated grammar to the passed file.
-c, --check Check for consistency and print 4 syntactically
ambiguous terminals.
-v, --verbose Print verbose output.
-vv, --very-verbose Print very verbose output.
-m [MOST_LIKELY_PRODUCTIONS], --most-likely-productions [MOST_LIKELY_PRODUCTIONS]
Print the most likely productions for the passed non-
terminals.
-cyk [CYK], --cyk [CYK]
Run the CYK parser on given test corpus or from std
input.
-pl PARSER_TEST_LIMIT, --parser-test-limit PARSER_TEST_LIMIT
Limit the parsed test sentences to including the
passed number of sentences.
-a [AMBIGUOUS], --ambiguous [AMBIGUOUS]
Display ambiguous words.
-l, --lower-case Lower case words in building grammar and parsing
sentences.
-test, --test Build output test sentences. If passed we don't build
covering.
Extracts a Penn Tree Bank style grammar from a passed open file.
Constructor takes an open file handle, an optional boolean limit of the number of lines to read (default no limit), and an optional boolean of whether to print verbose output or not (default False).
Public variables:
cfg_l2r
, dictionary of sets which maps nonterminals to terminals and nonterminalscfg_r2l
, dictionary of sets which maps terminals and nonterminals to nonterminalspcfg
, dictionary of floats which maps tuples nonterminals, terminals and nonterminals to (0,1]
Public functions:
check_pcfg_sums
, no arguments, return the sum of the probablities of all productions for each nonterminalambiguous
, no arguments, return the set of syntactically ambiuous terminalsmost_likely_productions
, nonterminal or array of nonterminals, return the most likely productions for nonterminals
Generate parse forests given a sentence and a grammar.
Constructor takes a grammar of type grammar, a sentence of type string, and an optional boolean vervose indicating whether to print verbose output or not (default False).
Public variables:
chart
, the created chart parse
Public functions:
covering_productions
, return the set of productions TOP => XP and TOP => XP YP that cover the sentencecovering_productions_str
, return the set of covering production formatted as a stringviterbi_parse
, return the parse generated by Viterbi probabilities
Written for Elements of Language Processing and Learning 2010, University of Amsterdam. Peter Lubell-Doughtie, Davide Modolo.
Copyright (c) 2010 Peter Lubell-Doughtie. All rights reserved. Licensed under the BSD license.