Software for unsupervised word segmentation and language model learning using lattices


------------------------------------------
-              latticelm                 -
-           by Graham Neubig             -
------------------------------------------

This is a program to learn a word-based language model from a collection of 
sentences or lattices. It uses non-parametric Bayesian techniques, specifically
the Pitman-Yor language model, to find a good segmentation. The technical 
details of the model and the motivation behind it are described in:

Graham Neubig, Masato Mimura, Shinsuke Mori, Tatsuya Kawahara
"Learning a Language Model from Continuous Speech"
In Proceedings of InterSpeech 2010

~~~ Installation ~~~

Before compiling the program, you must first install the OpenFST library.
http://www.openfst.org
(The current version has been confirmed to work with OpenFST 1.4.1)

Once this is installed, simply type
> make

If OpenFST isn't in your compilation path, open Makefile and point the
FSTPATH variable to your installation of OpenFST.
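If you are not sure where OpenFST was installed, a small helper like the one
below can look for its main header, fst/fstlib.h. This is only a sketch: the
function name find_openfst and the candidate prefixes are made up for
illustration, and you should check the Makefile to see whether FSTPATH expects
the installation prefix itself or a directory beneath it.

```shell
#!/bin/sh
# Print the first prefix that contains include/fst/fstlib.h.
# find_openfst is a hypothetical helper, not part of latticelm.
find_openfst() {
  for prefix in "$@"; do
    if [ -f "$prefix/include/fst/fstlib.h" ]; then
      printf '%s\n' "$prefix"
      return 0
    fi
  done
  return 1
}

# The candidate prefixes below are assumptions; add your own as needed.
find_openfst /usr /usr/local /opt/local "$HOME/local" \
  || echo "OpenFST headers not found in the candidate prefixes" >&2
```

If a prefix is printed, that is the location to use when editing FSTPATH.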

Compilation has been confirmed on Debian Wheezy and macOS, and it should work
on most recent flavors of Linux. If compilation succeeds, the "latticelm"
binary is written to this directory.

Note that the code is distributed under the Apache License, Version 2.0.

~~~ Usage ~~~

There are two tutorials included in the "tutorial" directory.
The first explains how to create a segmentation and language model from text.
The second explains how to create a segmentation and language model from
speech recognition lattices.

~~~ Options ~~~

Usage: latticelm -prefix out/ input.txt
Options:
  -burnin:       The number of iterations to execute as burn-in (20)
  -annealsteps:  The number of annealing steps to perform (3)
                 See Goldwater+ 2009 for details on annealing.
  -anneallength: The length of each annealing step in iterations (5)
  -samps:        The number of samples to take (100)
  -samprate:     The frequency (in iterations) at which to take samples (1)
  -knownn:       The n-gram length of the language model (3)
  -unkn:         The n-gram length of the spelling model (3)
  -prune:        If this is activated, paths that are worse than the
                 best path by at least this much will be pruned.
  -input:        The type of input (text/fst, default text).
  -filelist:     A list of input files, one file per line.
                 For fst input, files must be in OpenFST binary 
                 format, tropical semiring. Text files consist of one
                 sentence per line.
  -symbolfile:   The symbol file for the WFSTs, not used for text input.
  -prefix:       The prefix under which to print all output.
  -separator:    The string to use to separate 'characters'.
  -cacheinput:   For WFST input, cache the WFSTs in memory (otherwise
                 they will be loaded from disk every iteration).
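
As a rough illustration of how the options combine, the sketch below assembles
a command line for text input with the documented defaults written out
explicitly. The flags and values come from the list above; the paths, the
helper name latticelm_args, and the choice to create the output directory
beforehand are assumptions. If the binary or the input file is missing, the
script only prints the command it would run.

```shell
#!/bin/sh
# Hypothetical paths; flags and default values are copied from the option list.
LATTICELM=./latticelm
PREFIX=out/
INPUT=input.txt   # text input, one sentence per line

# Print the argument list, spelling out the documented defaults
# so that each one is easy to change. (latticelm_args is a made-up helper.)
latticelm_args() {
  echo "-prefix $1 -burnin 20 -annealsteps 3 -anneallength 5" \
       "-samps 100 -samprate 1 -knownn 3 -unkn 3 $2"
}

if [ -x "$LATTICELM" ] && [ -f "$INPUT" ]; then
  # Assumption: create the output directory in case the program does not.
  mkdir -p "$PREFIX"
  # Word splitting of the argument string is intended here.
  "$LATTICELM" $(latticelm_args "$PREFIX" "$INPUT")
else
  echo "dry run: $LATTICELM $(latticelm_args "$PREFIX" "$INPUT")"
fi
```

Leaving out every flag except -prefix should give the same settings, since the
values shown in parentheses in the option list are the defaults.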