A lazy decoder for syntax http://kheafield.com/professional/
This code takes in a hypergraph and a language model, then outputs a sentence. It is split into a library (search/) and a standalone wrapper (alone/). The library is also in Moses (-search-algorithm 5) and cdec (--incremental_search $lm).

COMPILING
Requires Boost >= 1.41. Tested on Linux. Compile with
  ./bjam

USAGE
After compiling, the decoder is bin/decode. Run it without an argument for help.

To run, you will need one language model, feature weights, and hypergraphs.

The language model must be in ARPA or KenLM format. Pass -l lm where lm is the file name.

Feature weights can be specified in a file using -w or on the command line with -W. Weights are key=value pairs, as in cdec. The hard-coded features are LanguageModel, LanguageModel_OOV, and WordPenalty. WordPenalty is word count times -1/ln(10) for odd historical reasons dating back to Hiero. The feature definitions are compatible with Moses and cdec.

Hypergraphs are stored in a directory with one file per sentence. The files are named starting with 0. The first line of each file is
  total_vertex_count total_edge_count
Then the file enumerates each vertex in bottom-up order (i.e. edges can only reference vertices that have already been defined). A vertex is simply a list of competing ways to derive it (downward edges). The first line of each vertex lists its number of edges, followed by one edge per line. An edge looks like
  foo  bar   baz ||| Feature=5 AnotherFeature=10
where foo, bar, and baz are literal words and [n] references vertex n. Edges can have arbitrary arity (i.e. as many references as desired). The tokens <s> and </s> should appear explicitly; they are not added by the decoder.
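The file format described above can be read with a short script. The following is a minimal Python sketch, not part of this repository: the names Edge, parse_edge, parse_hypergraph, and SAMPLE are hypothetical, the sample hypergraph is made up for illustration, and skipping blank lines is an assumption rather than documented decoder behavior. It only checks the two properties stated above: references point at already-defined vertices, and the edge total matches the header.

```python
import re
from dataclasses import dataclass

# A token of the form [n] is a reference to vertex n; anything else is a literal word.
REF = re.compile(r"^\[(\d+)\]$")


@dataclass
class Edge:
    tokens: list    # str for a literal word, int for a vertex reference
    features: dict  # feature name -> float value


def parse_edge(line):
    """Parse one edge line: 'tokens ||| Name=value Name=value'."""
    words, _, feats = line.partition("|||")
    tokens = []
    for tok in words.split():
        match = REF.match(tok)
        tokens.append(int(match.group(1)) if match else tok)
    features = {}
    for pair in feats.split():
        name, _, value = pair.rpartition("=")
        features[name] = float(value)
    return Edge(tokens, features)


def parse_hypergraph(text):
    """Return a list of vertices; each vertex is a list of Edge objects."""
    # Assumption: blank lines carry no meaning and can be skipped.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    vertex_count, edge_count = map(int, lines[0].split())
    pos = 1
    vertices = []
    for index in range(vertex_count):
        n_edges = int(lines[pos])
        pos += 1
        edges = [parse_edge(lines[pos + i]) for i in range(n_edges)]
        pos += n_edges
        for edge in edges:
            for tok in edge.tokens:
                # Bottom-up order: only earlier vertices may be referenced.
                if isinstance(tok, int) and tok >= index:
                    raise ValueError(f"vertex {index} references undefined vertex {tok}")
        vertices.append(edges)
    if sum(len(v) for v in vertices) != edge_count:
        raise ValueError("edge count does not match the header")
    return vertices


# Hypothetical three-vertex hypergraph, written in the format described above.
SAMPLE = """3 4
1
<s> ||| Quux=10
2
[0] le ||| Distance=1.5
[0] la ||| Distance=1.1
1
[1] </s> |||
"""

vertices = parse_hypergraph(SAMPLE)
print(len(vertices), "vertices parsed")
```

Representing references as int and literal words as str keeps the two token kinds distinguishable even when a literal word happens to be a number.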

A complete example:
  7 13
  1
  <s> ||| Quux=10
  2
  le ||| Distance=1.5
  la ||| Distance=1.1
  2
  petit ||| Distance=0.0
  peti ||| Distance=3.0 Foo=4
  3
  chas ||| Distance=1.1
  char  ||| Distance=0.8
  chat ||| Distance=1.0
  2
  est ||| Distance=2.0
  Est ||| Distance=0.0
  2
  more ||| Distance=1.0
  mort ||| Distance=0.0
  1
  </s> |||

This is the format produced by cdec's --show_target_graph option. But if you're using cdec, the code has already been natively ported and can be accessed using --incremental_search lm.

DIRECTORY LAYOUT
util and lm: copied from KenLM
search: core search algorithm and portable to other decoders.
alone: a standalone wrapper around the search implementation.
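As a supplement to the USAGE notes on feature weights, the sketch below works through the WordPenalty arithmetic (word count times -1/ln(10)) and a weighted feature sum. It is illustrative Python, not the decoder's code: the function names are hypothetical, and three things are assumptions rather than statements from this README: that key=value pairs are separated by whitespace, that a feature without a weight contributes 0, and that the score is a plain dot product of weights and feature values.

```python
import math


def parse_weights(text):
    """Parse whitespace-separated key=value pairs into a dict (assumed layout)."""
    weights = {}
    for pair in text.split():
        name, _, value = pair.rpartition("=")
        weights[name] = float(value)
    return weights


def word_penalty(word_count):
    """WordPenalty feature value: word count times -1/ln(10)."""
    return word_count * (-1.0 / math.log(10))


def score(weights, features):
    """Dot product of weights and features; missing weights count as 0 (assumption)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


# Hypothetical weights and feature values for a three-word output.
weights = parse_weights("LanguageModel=1.0 WordPenalty=-0.5")
features = {"LanguageModel": -2.0, "WordPenalty": word_penalty(3)}
print(score(weights, features))
```

Because the WordPenalty value is negative for any non-empty output, a negative WordPenalty weight rewards longer output under this sketch and a positive one penalizes it.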