# N-gram model format

Proposal for a new n-gram model format

# 1. Goals

1. A formal, textual n-gram model format for exchange, as required by the policies of Fedora, Debian, etc.
2. Extensible: the format stays the same even when new smoothing methods are added to the n-gram model or when the model is pruned.
3. Simple: writing a parser should be very easy, and some tools will be provided to help with this.

# 2. Syntax of new n-gram model format

## a. token

1. Line type: starts every line, with a leading `\`. It declares which kind of data the current line holds.
2. `<…>`: used for special tokens such as `<english>`, `<unknown>`, `<start>`, `<end>`, …, which represent an English word, an unknown word, the start of the line, and the end of the line.
3. Normal word: follows the line type; the number of normal words is dictated by the line type.
4. tagname tagvalue: these tokens follow the normal words and always appear as a pair; the pairs are treated as a hash. (This is the extensible part.)
5. `"…"`: can be used in a special token, normal word, tagname, or tagvalue, and is needed only when the token contains an escape sequence.
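The token rules above can be sketched as a small tokenizer. This is only an illustration under assumptions the proposal does not spell out: tokens are separated by whitespace, quotes are plain ASCII double quotes, and inside quotes a backslash escapes the next character. The names `tokenize`, `is_line_type`, and `is_special` are hypothetical.

```python
import re

# A token is either a double-quoted string (backslash escapes the next
# character) or a run of non-whitespace characters. Both rules are
# assumptions; the proposal does not define the escape syntax.
TOKEN_RE = re.compile(r'"((?:[^"\\]|\\.)*)"|(\S+)')


def tokenize(line):
    """Split one line of the model file into a list of tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(line):
        quoted, bare = match.groups()
        if quoted is not None:
            # Drop the escaping backslash and keep the escaped character.
            tokens.append(re.sub(r"\\(.)", r"\1", quoted))
        else:
            tokens.append(bare)
    return tokens


def is_line_type(token):
    """Line types carry a leading backslash, e.g. \\data or \\item."""
    return token.startswith("\\")


def is_special(token):
    """Special tokens are wrapped in angle brackets, e.g. <start>."""
    return len(token) > 2 and token.startswith("<") and token.endswith(">")


tokens = tokenize(r'\item <start> "hello world" prob 0.5')
print(tokens)
```

Here `prob` is a made-up tag name used only for the example. The quoted token keeps its inner space, which is the case the `"…"` form exists for.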

## b. grammar

1. Every line is an entity described by its line type.
   The possible line types are listed below.
2. `\data`: the beginning of the data.
   Allowed tag: model (interpolation/back-off).
3. `\end`: the end of the data.
4. `\<n>-gram`: the beginning of the n-gram section whose order is given by `<n>`:
   `\<n>-gram {tagname tagvalue}+`
   Possible tag: count, the number of items.
   This line affects the line types that follow it: `\<n>-param` and `\item`.
   The `<n>` allowed in `\<n>-param` ranges from 0 to n-1, where n is the order given by `\<n>-gram`.
5. `\<n>-param`: describes an additional parameter for the n-gram, such as the bow (back-off weight) value in a back-off model.
   `\<n>-param` is followed by n normal words; when n is zero, no normal word follows.
   Additional tagnames and tagvalues are possible.
6. `\item`: a single item in the n-gram, followed by n normal words, where n is given by the enclosing `\<n>-gram`. Additional tagnames and tagvalues are possible.
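To make the grammar concrete, here is a minimal parser sketch together with a small sample file. It is one reading of the rules above, not a reference implementation. The tag names `prob` and `bow` and all numeric values are invented for illustration; only `model` and `count` are named in the proposal. For brevity the sketch splits on whitespace and does not handle quoted tokens.

```python
import re

# Matches the numbered line types \<n>-gram and \<n>-param.
NUMBERED = re.compile(r"\\(\d+)-(gram|param)")


def split_tags(tokens, lineno):
    """Turn [name, value, name, value, ...] into a dict."""
    if len(tokens) % 2 != 0:
        raise ValueError(f"line {lineno}: tagname without tagvalue")
    return dict(zip(tokens[0::2], tokens[1::2]))


def parse_model(text):
    """Parse a model file into {'tags': ..., 'sections': {n: ...}}."""
    model = {"tags": None, "sections": {}}
    order = None  # order of the n-gram section currently open
    for lineno, raw in enumerate(text.splitlines(), 1):
        tokens = raw.split()  # simplification: no quoted tokens
        if not tokens:
            continue
        kind, rest = tokens[0], tokens[1:]
        if kind == r"\data":
            model["tags"] = split_tags(rest, lineno)
            continue
        if model["tags"] is None:
            raise ValueError(f"line {lineno}: data line must come first")
        if kind == r"\end":
            return model
        numbered = NUMBERED.fullmatch(kind)
        if numbered and numbered.group(2) == "gram":
            order = int(numbered.group(1))
            model["sections"][order] = {
                "tags": split_tags(rest, lineno), "params": [], "items": []}
        elif order is None:
            raise ValueError(f"line {lineno}: {kind} outside an n-gram section")
        elif numbered:  # a param line: k words, with 0 <= k <= order - 1
            k = int(numbered.group(1))
            if k >= order or len(rest) < k:
                raise ValueError(f"line {lineno}: bad param line {kind}")
            model["sections"][order]["params"].append(
                (rest[:k], split_tags(rest[k:], lineno)))
        elif kind == r"\item":  # exactly `order` words, then tag pairs
            if len(rest) < order:
                raise ValueError(f"line {lineno}: item needs {order} word(s)")
            model["sections"][order]["items"].append(
                (rest[:order], split_tags(rest[order:], lineno)))
        else:
            raise ValueError(f"line {lineno}: unknown line type {kind}")
    raise ValueError("missing end line")


SAMPLE = r"""
\data model back-off
\1-gram count 2
\item <start> prob 0.4
\item hello prob 0.6
\2-gram count 1
\1-param <start> bow 0.5
\item <start> hello prob 1.0
\end
"""

model = parse_model(SAMPLE)
print(model["sections"][2])
```

Because tags are kept as plain name/value pairs, a new smoothing method would only add new tag names; the line structure and the parser stay unchanged, which is the extensibility goal stated earlier.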

# 4. Tools

1. Export tools for exporting from interpolation and back-off models.
2. Import tools for importing into various models; they produce an error when a required tagname or tagvalue is missing.
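The error behaviour expected from an import tool could look roughly like the sketch below. The proposal does not say which tags are required for which model, so the caller passes the required names in; the function name `read_tags` and the tag names `prob` and `bow` are hypothetical.

```python
def read_tags(line_type, tag_tokens, required):
    """Build the tag hash for one line and check the required tag names.

    tag_tokens is the flat [name, value, name, value, ...] list that
    follows the normal words; required lists the tag names the target
    model cannot do without (an assumption supplied by the caller).
    """
    if len(tag_tokens) % 2 != 0:
        # An odd number of tokens means the last tagname has no tagvalue.
        raise ValueError(
            f"{line_type}: tagname {tag_tokens[-1]!r} has no tagvalue")
    tags = dict(zip(tag_tokens[0::2], tag_tokens[1::2]))
    missing = [name for name in required if name not in tags]
    if missing:
        raise ValueError(
            f"{line_type}: missing required tag(s): {', '.join(missing)}")
    return tags


# Hypothetical usage: an importer for a back-off model might insist on
# a probability for every item.
print(read_tags("item", ["prob", "0.5"], required=["prob"]))
```

Tags that are present but not required are kept in the returned hash, so an importer can ignore what it does not understand.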

Reference URL:
http://www-speech.sri.com/projects/srilm/manpages/ngram-format.5.html