New Word Recognizer

epico edited this page Feb 25, 2013 · 1 revision

New Word Criteria

The word should occur many times, as estimated by its frequency;
The word should be preceded and followed by many different words, as estimated by information entropy.
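
The second criterion can be illustrated with a small sketch. It assumes that "information entropy" means the Shannon entropy of the distribution of words seen next to a candidate; the function name `neighbor_entropy` and the sample counts are hypothetical and are not taken from the tools below.

```python
import math
from collections import Counter

def neighbor_entropy(neighbor_counts):
    """Shannon entropy (in nats) of the distribution of neighboring words."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in neighbor_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log(p)
    return entropy

# Hypothetical counts of the words seen immediately before a candidate.
varied = Counter({"a": 5, "b": 5, "c": 5, "d": 5})  # many different neighbors
fixed = Counter({"a": 20})                          # always the same neighbor

# A candidate with varied neighbors scores higher than one with a fixed neighbor.
print(neighbor_entropy(varied) > neighbor_entropy(fixed))
```

A candidate that always appears after the same word is more likely to be a fragment of a longer word, which is why a low entropy on either side counts against it.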

Tools

prepary.py

create the initial sqlite3 databases with the statement "CREATE TABLE ngram (words TEXT NOT NULL, freq INTEGER NOT NULL);"
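
A minimal sketch of this step, using only the standard `sqlite3` module and the CREATE TABLE statement quoted above. The helper name `create_ngram_db` is hypothetical, and an in-memory database stands in for the real files, whose names are not given here.

```python
import sqlite3

# The table definition quoted in the description of this tool.
CREATE_NGRAM_DDL = "CREATE TABLE ngram (words TEXT NOT NULL, freq INTEGER NOT NULL);"

def create_ngram_db(path):
    """Open (or create) a database at `path` and add the empty ngram table."""
    conn = sqlite3.connect(path)
    conn.execute(CREATE_NGRAM_DDL)
    conn.commit()
    return conn

# ":memory:" is used for illustration only; the real tool presumably writes files.
conn = create_ngram_db(":memory:")
columns = [row[1] for row in conn.execute("PRAGMA table_info(ngram)")]
print(columns)
```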

populate.py

convert the corpus files into the ngram tables of the sqlite3 databases;
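
The conversion might look roughly like the sketch below. It assumes the corpus is already segmented into words and that the words of an n-gram are stored in the `words` column joined by a space; both are assumptions, and `count_ngrams` and `populate` are hypothetical names.

```python
import sqlite3
from collections import Counter

def count_ngrams(sentences, n):
    """Count every run of n consecutive words in the segmented sentences."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            # Assumption: words are joined with a single space.
            counts[" ".join(words[i:i + n])] += 1
    return counts

def populate(conn, counts):
    """Write the n-gram counts into the ngram table."""
    conn.executemany("INSERT INTO ngram (words, freq) VALUES (?, ?)",
                     counts.items())
    conn.commit()

# Tiny stand-in corpus: each sentence is a list of already segmented words.
sentences = [["a", "b", "c"], ["a", "b", "d"]]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ngram (words TEXT NOT NULL, freq INTEGER NOT NULL)")
populate(conn, count_ngrams(sentences, 2))

rows = dict(conn.execute("SELECT words, freq FROM ngram"))
print(rows)
```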

partialword.py

estimate the frequency threshold;
- get the frequencies of all existing words;
- take as the threshold the frequency of the word at the 60% position.
get all partial words;
- get all word pairs whose frequency is above the frequency threshold;
- recursively merge the word pairs in all sqlite3 databases,
  - from higher-gram to lower-gram: n-gram ⇒ (n-1)-gram, …, 3-gram ⇒ 2-gram, 2-gram ⇒ 1-gram.
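
The threshold and the pair selection can be sketched as follows. This is only an illustration under stated assumptions: frequencies are sorted in ascending order before the 60% position is read off, and a pair is merged by concatenating its two words. The names `value_at_percent`, `frequent_pairs`, and `merge_pair` are hypothetical, and the recursive pass over the higher-order databases is left out.

```python
def value_at_percent(values, percent):
    """Value found `percent`% of the way through the ascending-sorted values."""
    ordered = sorted(values)
    index = min(len(ordered) * percent // 100, len(ordered) - 1)
    return ordered[index]

def frequent_pairs(pair_freq, threshold):
    """Keep the word pairs whose frequency is strictly above the threshold."""
    return {pair: freq for pair, freq in pair_freq.items() if freq > threshold}

def merge_pair(pair):
    """Assumption: merging a pair means concatenating its two words."""
    return "".join(pair)

# Hypothetical frequencies of existing words and of adjacent word pairs.
word_freq = {"w%d" % i: i for i in range(1, 11)}   # frequencies 1..10
pair_freq = {("x", "y"): 9, ("y", "z"): 7, ("z", "x"): 3}

threshold = value_at_percent(word_freq.values(), 60)
partial_words = [merge_pair(p) for p in frequent_pairs(pair_freq, threshold)]
print(threshold, partial_words)
```

Whether the position is counted from the most frequent or the least frequent word is not stated above; sorting in the other direction would change which frequency is picked.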

newword.py

estimate the information entropy thresholds;
- get the prefix information entropy of all existing words;
- take as the prefix threshold the prefix information entropy of the word at the 69% position;
- get the postfix information entropy of all existing words;
- take as the postfix threshold the postfix information entropy of the word at the 69% position.
filter all partial words to get new words;
- only keep the words whose information entropy is above both the prefix and the postfix threshold.
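
A minimal sketch of the filtering step, assuming the prefix and postfix entropies have already been computed for each word and that the 69% position is taken in an ascending-sorted list. The function names and all sample values are hypothetical.

```python
def value_at_percent(values, percent):
    """Value found `percent`% of the way through the ascending-sorted values."""
    ordered = sorted(values)
    index = min(len(ordered) * percent // 100, len(ordered) - 1)
    return ordered[index]

def filter_new_words(candidates, prefix_threshold, postfix_threshold):
    """Keep candidates whose entropies exceed both thresholds."""
    return [word for word, (prefix, postfix) in candidates.items()
            if prefix > prefix_threshold and postfix > postfix_threshold]

# Hypothetical (prefix entropy, postfix entropy) for existing words.
existing = {"w1": (0.5, 0.4), "w2": (1.0, 0.9), "w3": (1.5, 1.6), "w4": (2.0, 2.2)}
prefix_threshold = value_at_percent([e[0] for e in existing.values()], 69)
postfix_threshold = value_at_percent([e[1] for e in existing.values()], 69)

# Hypothetical entropies for the partial words produced by the previous tool.
candidates = {"alpha": (1.8, 1.9), "beta": (1.9, 0.2), "gamma": (0.1, 2.5)}
new_words = filter_new_words(candidates, prefix_threshold, postfix_threshold)
print(new_words)
```

Requiring both sides to pass rejects candidates such as "beta" and "gamma" here, which vary freely on one side only.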

markpinyin.py

guess the best pinyin for each new word from its merged word sequence;