Software for unsupervised word segmentation and language model learning using lattices

README

------------------------------------------
-              latticelm                 -
-           by Graham Neubig             -
------------------------------------------

This is a program to learn a word-based language model from a collection of 
sentences or lattices. It uses non-parametric Bayesian techniques, specifically
the Pitman-Yor language model, to find a good segmentation. The technical 
details and motivation behind the model are described in:

Graham Neubig, Masato Mimura, Shinsuke Mori, Tatsuya Kawahara
"Learning a Language Model from Continuous Speech"
In Proceedings of InterSpeech 2010

~~~ Installation ~~~

Before compiling the program, you must first install the OpenFST library.
http://www.openfst.org
(The current version has been confirmed to work with OpenFST 1.4.1)

Once this is installed, simply type
> make

If OpenFST isn't in your compilation path, open Makefile and point the
FSTPATH variable to your installation of OpenFST.
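
As a sketch of the step above: a variable passed on the make command line
takes precedence over the assignment inside the Makefile, so FSTPATH can also
be set without editing the file. The /usr/local prefix below is hypothetical,
and treating FSTPATH as an install prefix that contains include/fst/fstlib.h
is an assumption, not something confirmed by the Makefile itself.

```shell
# Hypothetical install prefix; replace with your own OpenFST location.
FSTPATH="${FSTPATH:-/usr/local}"

# Assumption: FSTPATH is an install prefix with headers in include/fst/.
if [ -f "$FSTPATH/include/fst/fstlib.h" ]; then
  echo "Found OpenFST headers under $FSTPATH"
else
  echo "No fst/fstlib.h under $FSTPATH/include; check the path" >&2
fi

# A variable given on the make command line overrides the Makefile's own
# assignment. Guarded so the snippet does nothing outside the source directory.
if [ -f Makefile ]; then
  make FSTPATH="$FSTPATH"
fi
```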

Compilation has been confirmed on Debian Wheezy and MacOS, but it should work
on most recent flavors of Linux. If compilation succeeds, the "latticelm"
program will be output in this directory.

Note that the code is distributed under the Apache License, Version 2.0.

~~~ Usage ~~~

There are two tutorials included in the "tutorial" directory.
The first explains how to create a segmentation and language model from text.
The second explains how to create a segmentation and language model from
speech recognition lattices.
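
A minimal sketch of the first scenario, text input, using only the invocation
shown in the program's usage line (latticelm -prefix out/ input.txt). The
work/ directory, the file names, and the two sample sentences are
hypothetical. Writing each sentence as space-separated characters is an
assumption about the expected text format, so check the tutorial for the
exact layout.

```shell
# Path to the compiled binary; "make" leaves it in the source directory.
LATTICELM="${LATTICELM:-./latticelm}"

# Hypothetical working directory for the input and the output prefix.
mkdir -p work/out

# One sentence per line. Assumption: characters are separated by spaces.
printf '%s\n' 't h i s i s a t e s t' 't h a t i s a t e s t' > work/input.txt

# Run only if the binary has been built; output goes under work/out/.
if [ -x "$LATTICELM" ]; then
  "$LATTICELM" -prefix work/out/ work/input.txt
else
  echo "latticelm binary not found at $LATTICELM; run make first" >&2
fi
```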

~~~ Options ~~~

Usage: latticelm -prefix out/ input.txt
Options:
  -burnin:       The number of iterations to execute as burn-in (20)
  -annealsteps:  The number of annealing steps to perform (3)
                 See Goldwater+ 2009 for details on annealing.
  -anneallength: The length of each annealing step in iterations (5)
  -samps:        The number of samples to take (100)
  -samprate:     The frequency (in iterations) at which to take samples (1)
  -knownn:       The n-gram length of the language model (3)
  -unkn:         The n-gram length of the spelling model (3)
  -prune:        If this is activated, paths that are worse than the
                 best path by at least this much will be pruned.
  -input:        The type of input (text/fst, default text).
  -filelist:     A list of input files, one file per line.
                 For fst input, files must be in OpenFST binary 
                 format, tropical semiring. Text files consist of one
                 sentence per line.
  -symbolfile:   The symbol file for the WFSTs, not used for text input.
  -prefix:       The prefix under which to print all output.
  -separator:    The string to use to separate 'characters'.
  -cacheinput:   For WFST input, cache the WFSTs in memory (otherwise
                 they will be loaded from disk every iteration).
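
The options above can be combined for lattice input roughly as follows. This
is a sketch under stated assumptions: the work/ paths and symbols.txt are
hypothetical names, the numeric values simply restate the defaults listed in
the help text, and passing each value as a separate argument after its flag
follows the pattern of the usage line.

```shell
# Path to the compiled binary; "make" leaves it in the source directory.
LATTICELM="${LATTICELM:-./latticelm}"

# Hypothetical layout: lattices in work/fst/, results under work/out-fst/.
mkdir -p work/fst work/out-fst

# Build the file list with one lattice path per line.
# The lattices must already be OpenFST binaries in the tropical semiring.
find work/fst -type f -name '*.fst' | sort > work/filelist.txt

# Run only when the binary exists and at least one lattice was found.
if [ -x "$LATTICELM" ] && [ -s work/filelist.txt ]; then
  "$LATTICELM" -input fst \
    -filelist work/filelist.txt \
    -symbolfile work/symbols.txt \
    -prefix work/out-fst/ \
    -burnin 20 -samps 100 -samprate 1 \
    -knownn 3 -unkn 3
else
  echo "Nothing to run: build latticelm and place *.fst files in work/fst/" >&2
fi
```

Sorting the file list keeps the input order stable between runs, which makes
repeated experiments easier to compare.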