n gram model format

epico edited this page Jun 29, 2011 · 6 revisions

Proposal for a new n-gram model format

1. Goals

  1. A formal, textual n-gram model format for exchange, as required by the policies of Fedora, Debian, etc.
  2. Extensible: the format stays the same even when new smoothing methods are added to the n-gram model or the model is pruned.
  3. Simple: a parser should be very easy to write, and some tools will be provided to make this easier.

2. Syntax of the new n-gram model format

a. token

  1. line type: appears at the beginning of every line and starts with a leading “\”. It declares which kind of data the current line holds.
  2. <…>: a special token such as <english>, <unknown>, <start>, <end>, …, representing an English word, an unknown word, the start of a line and the end of a line respectively.
  3. normal word: follows the line type; the number of normal words is dictated by the line type.
  4. tagname tagvalue: these tokens follow the normal words and always appear as a pair; the pairs are treated as a hash. (This is extensible.)
  5. “…”: quoting, which can be used in a special token, normal word, tagname or tagvalue, but only when the token needs an escaped string sequence.
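
The token rules above can be sketched as a small tokenizer. This is only an illustration, not part of the proposal: the function names `tokenize` and `is_special` are hypothetical, and the escape rule inside “…” (a backslash makes the next character literal) is an assumption, since the proposal does not define the escape sequences.

```python
import re

# A token is either a double-quoted string (which may contain backslash
# escapes) or a run of non-whitespace characters.
TOKEN_RE = re.compile(r'"((?:[^"\\]|\\.)*)"|(\S+)')


def tokenize(line):
    """Split one line of the format into tokens, unquoting "..." tokens."""
    tokens = []
    for quoted, bare in TOKEN_RE.findall(line):
        if bare:
            tokens.append(bare)
        else:
            # Assumed escape rule: backslash + character -> that character.
            tokens.append(re.sub(r'\\(.)', r'\1', quoted))
    return tokens


def is_special(token):
    """True for special tokens such as <start> or <unknown>."""
    return len(token) > 2 and token.startswith("<") and token.endswith(">")


if __name__ == "__main__":
    print(tokenize(r'\data model "k mixture model" count 1000'))
    print(tokenize(r'\item <start> count 66'))
```

The first token of each result is the line type; the rest are normal words followed by tagname/tagvalue pairs.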

b. grammar

  1. Every line is an entity and is described by its line type.
    The following are all the possible line types.
  2. \data: the beginning of the data.
    Allowed tag: model (interpolation/back-off).
  3. \end: the end of the data.
  4. \<n>-gram: the beginning of an n-gram section, where n is given by <n>:
    \<n>-gram {tagname tagvalue}+
    Possible tags: count, the number of items.
    This affects the following line types: \<n>-param, \item.
    The <n> of an allowed \<n>-param ranges from 0 to n-1, where n is taken from the \<n>-gram line.
  5. \<n>-param: describes additional parameters for the n-gram, such as the bow (back-off weight) value in a back-off model.
    \<n>-param is followed by n normal words; when n is zero, no normal words follow.
    Additional tagnames and tagvalues are possible.
  6. \item: a single item in the n-gram, followed by n normal words, where n is given by the enclosing \<n>-gram line. Additional tagnames and tagvalues are possible.
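
As a rough illustration of these grammar rules, the sketch below parses one already-tokenized line into its line type, normal words and tag hash. The function name `parse_line` and its return shape are hypothetical. It also assumes that the tag pairs on a \<n>-gram line may be omitted, because the k mixture model example uses a bare \1-gram line even though the rule is written with `+`.

```python
import re

# \<n>-gram and \<n>-param carry a number; \data, \end and \item do not.
LINE_TYPE_RE = re.compile(r'\\(?:(\d+)-(gram|param)|(data|end|item))')


def parse_line(tokens, order):
    """Parse one tokenized line.

    `order` is the n of the enclosing \\<n>-gram line, or None before the
    first one. Returns (kind, n, words, tags).
    """
    m = LINE_TYPE_RE.fullmatch(tokens[0])
    if m is None:
        raise ValueError(f"unknown line type: {tokens[0]!r}")
    if m.group(3):                       # \data, \end or \item
        kind = m.group(3)
        n = order if kind == "item" else 0
    else:                                # \<n>-gram or \<n>-param
        kind, n = m.group(2), int(m.group(1))

    if kind == "item" and order is None:
        raise ValueError("\\item before any \\<n>-gram line")
    if kind == "param" and (order is None or n >= order):
        raise ValueError("\\<n>-param must use n from 0 to order-1")

    # Only \item and \<n>-param are followed by normal words.
    n_words = n if kind in ("item", "param") else 0
    words, rest = tokens[1:1 + n_words], tokens[1 + n_words:]
    if len(words) != n_words or len(rest) % 2 != 0:
        raise ValueError("wrong number of words or unpaired tagname")
    tags = dict(zip(rest[0::2], rest[1::2]))   # tagname -> tagvalue
    return kind, n, words, tags


if __name__ == "__main__":
    print(parse_line(r"\2-gram count 2000".split(), None))
    print(parse_line(r"\item 中国 人 count 100".split(), 2))
```

A caller would keep track of `order` itself, updating it each time a line of kind `gram` is returned.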

3. Examples:

a. interpolation
\data model interpolation
\1-gram count 100
\0-param lambda-interpolation 0.6711
\item <start> count 66
...
\2-gram count 2000
\item 中国 人 count 100
...
\end

b. back-off
\data model back-off
\1-gram count 100
\item <start> freq 0.066 bow 0.1 back-off-level 0 back-off-index 0
...
\2-gram count 2000
\item 中国 人 freq 0.1 bow 0.2 back-off-level 1 back-off-index 3355
...
\3-gram count 10000
\item <start> 中国 人 freq 0.2 back-off-level 2 back-off-index 2000
\end

c. k mixture model
\data model "k mixture model" count 1000 N 10 total_freq 1100
\1-gram
\item <start> count 50 freq 51
...
\2-gram
\item 你好 啊 count 3 T 3 N_n_0 2 n_1 1 Mr 2
...
\end

4. Tools:
  1. Export tools for exporting from interpolation and back-off models.
  2. Import tools for importing into various models; they report an error when a required tagname or tagvalue is missing.
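
A minimal sketch of what the two kinds of tool might do at the line level is given below. All names here (`format_line`, `quote`, `check_item`, `REQUIRED_ITEM_TAGS`) are hypothetical. The table of required tags is only guessed from the examples on this page, and the quoting rule is an assumption, as the proposal does not specify either.

```python
# Hypothetical table of required \item tags per model type, guessed from
# the examples; the proposal itself does not list required tags.
REQUIRED_ITEM_TAGS = {
    "interpolation": {"count"},
    "back-off": {"freq", "back-off-level", "back-off-index"},
}


def quote(token):
    """Wrap a token in "..." when it is empty or contains whitespace or a
    double quote (assumed rule)."""
    if token and not any(c.isspace() or c == '"' for c in token):
        return token
    return '"' + token.replace("\\", "\\\\").replace('"', '\\"') + '"'


def format_line(line_type, words=(), tags=None):
    """Export side: render '\\<type> word... tagname tagvalue...'."""
    parts = ["\\" + line_type]
    parts += [quote(w) for w in words]
    for name, value in (tags or {}).items():
        parts += [quote(name), quote(str(value))]
    return " ".join(parts)


def check_item(model, tags):
    """Import side: raise an error when a required tagname is missing."""
    missing = REQUIRED_ITEM_TAGS.get(model, set()) - set(tags)
    if missing:
        raise ValueError(
            f"\\item is missing required tag(s): {sorted(missing)}")


if __name__ == "__main__":
    print(format_line("data", tags={"model": "back-off"}))
    print(format_line("item", ["中国", "人"], {"freq": 0.1, "bow": 0.2}))
    check_item("back-off", {"freq": "0.2", "back-off-level": "2",
                            "back-off-index": "2000"})
```

Keeping the required tags in one table per model type would let new models be added without changing the reader or writer, which fits the extensibility goal.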

Reference:
http://www-speech.sri.com/projects/srilm/manpages/ngram-format.5.html