
New Word Recognizer

epico edited this page Feb 25, 2013 · 1 revision

New Word Criteria

  1. The word should occur multiple times, estimated by frequency;

  2. The word should be preceded and followed by many different words, estimated by information entropy;
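For reference, the second criterion can be written with the usual Shannon entropy over neighbouring words. The notation below is an assumption for illustration (the page does not give a formula): $c$ ranges over the words seen immediately before $w$, and $\mathrm{count}(c\,w)$ is the frequency of that word pair. The postfix case is the same with $c$ ranging over the words seen immediately after $w$.

```latex
H_{\mathrm{prefix}}(w) = -\sum_{c} P(c \mid w)\,\log P(c \mid w),
\qquad
P(c \mid w) = \frac{\mathrm{count}(c\,w)}{\sum_{c'} \mathrm{count}(c'\,w)}
```

A word that appears next to only one neighbour has entropy 0; a word with many evenly distributed neighbours has high entropy.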

Tools

prepary.py

  • create the initial sqlite3 databases; "CREATE TABLE ngram (words TEXT NOT NULL, freq INTEGER NOT NULL);"
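A minimal sketch of this step using Python's standard sqlite3 module. The function name `create_ngram_db` and the in-memory path are assumptions for illustration; only the CREATE TABLE statement comes from this page.

```python
import sqlite3

# Schema quoted from this page.
CREATE_NGRAM_DDL = (
    "CREATE TABLE ngram (words TEXT NOT NULL, freq INTEGER NOT NULL);"
)


def create_ngram_db(path):
    """Create an empty n-gram database at `path` and return the connection.

    Hypothetical helper, not the actual script.
    """
    conn = sqlite3.connect(path)
    conn.execute(CREATE_NGRAM_DDL)
    conn.commit()
    return conn


# Usage: an in-memory database keeps the example free of side effects.
conn = create_ngram_db(":memory:")
conn.execute("INSERT INTO ngram (words, freq) VALUES (?, ?)", ("你好", 3))
rows = conn.execute("SELECT words, freq FROM ngram").fetchall()
```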

populate.py

  • convert the corpus files into rows of the ngram table in the sqlite3 databases;
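The conversion could look roughly like the sketch below. It assumes the corpus has already been segmented into lists of words and that the words of an n-gram are stored space-joined in the `words` column; neither detail is stated on this page, and `count_ngrams` and `populate` are hypothetical names.

```python
import sqlite3
from collections import Counter


def count_ngrams(sentences, n):
    """Count every run of n adjacent words across the segmented sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            # Assumption: an n-gram is stored as its words joined by spaces.
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def populate(conn, sentences, n):
    """Insert the n-gram counts into the ngram table."""
    counts = count_ngrams(sentences, n)
    conn.executemany(
        "INSERT INTO ngram (words, freq) VALUES (?, ?)", counts.items()
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ngram (words TEXT NOT NULL, freq INTEGER NOT NULL);")
sentences = [["a", "b", "c"], ["a", "b", "d"]]
populate(conn, sentences, 2)
bigrams = dict(conn.execute("SELECT words, freq FROM ngram"))
```

Here one database holds the 2-grams; the plural "databases" in the text suggests one database per n-gram order, which would mean calling `populate` once per order on separate connections.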

partialword.py

  • estimate the frequency threshold;

    • get the frequencies of all existing words;

    • take as the threshold the frequency of the word at the 60% position of the frequency-ranked list.

  • get all partial words;

    • get all word pairs whose frequency is above the frequency threshold;

    • recursively merge the word pairs in all sqlite3 databases.

      • from higher-gram to lower-gram, i.e. n-gram ⇒ (n-1)-gram, …, 3-gram ⇒ 2-gram, 2-gram ⇒ 1-gram;
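The partialword.py steps above can be sketched as follows. This is an illustration under stated assumptions, not the actual script: the list is ranked in ascending order, "above" is read as strictly greater, n-grams are represented as tuples of words, and all function names are hypothetical.

```python
def percentile_threshold(values, percent):
    """Return the value at `percent`% of the way through the ascending list."""
    ordered = sorted(values)
    # Integer arithmetic avoids floating-point surprises in the index.
    index = min(len(ordered) * percent // 100, len(ordered) - 1)
    return ordered[index]


def frequent_pairs(bigrams, threshold):
    """Keep the word pairs whose frequency is strictly above the threshold."""
    return {pair for pair, freq in bigrams.items() if freq > threshold}


def merge_pair(ngram, pair):
    """Merge the first adjacent occurrence of `pair` inside an n-gram.

    Returns the resulting (n-1)-gram, or None if the pair does not occur.
    """
    for i in range(len(ngram) - 1):
        if (ngram[i], ngram[i + 1]) == pair:
            return ngram[:i] + (ngram[i] + ngram[i + 1],) + ngram[i + 2:]
    return None


word_freqs = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
threshold = percentile_threshold(word_freqs.values(), 60)

bigrams = {("a", "b"): 6, ("b", "c"): 4, ("c", "d"): 1}
pairs = frequent_pairs(bigrams, threshold)

# A 3-gram containing a frequent pair collapses into a 2-gram.
merged = merge_pair(("a", "b", "c"), ("a", "b"))
```

Applying `merge_pair` to the rows of each n-gram table and adding the results to the (n-1)-gram table gives one level of the recursion. How the real script carries frequencies between tables is not described on this page.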

newword.py

  • estimate the information entropy threshold;

    • get the prefix information entropy of all existing words;

    • take as the prefix threshold the prefix information entropy of the word at the 69% position;

    • get the postfix information entropy of all existing words;

    • take as the postfix threshold the postfix information entropy of the word at the 69% position.

  • filter all partial words to get new words;

    • keep only the words whose prefix and postfix information entropy are both above the corresponding thresholds.
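The entropy filter can be sketched as below. It assumes that "prefix" entropy is computed over the words seen immediately before a candidate and "postfix" entropy over the words seen immediately after it, that the natural logarithm is used, and that word pairs are available as a dict of `(left, right)` tuples to frequencies. All names are hypothetical, and the thresholds are passed in rather than estimated.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (natural log) of a collection of neighbour counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counts)


def neighbor_entropies(bigrams):
    """Return (prefix, postfix) dicts mapping each word to its entropy."""
    before = defaultdict(Counter)  # word -> counts of the words preceding it
    after = defaultdict(Counter)   # word -> counts of the words following it
    for (left, right), freq in bigrams.items():
        before[right][left] += freq
        after[left][right] += freq
    prefix = {w: entropy(c.values()) for w, c in before.items()}
    postfix = {w: entropy(c.values()) for w, c in after.items()}
    return prefix, postfix


def filter_new_words(candidates, prefix, postfix, prefix_thr, postfix_thr):
    """Keep candidates whose entropies are above both thresholds."""
    return [
        w for w in candidates
        if prefix.get(w, 0.0) > prefix_thr and postfix.get(w, 0.0) > postfix_thr
    ]


bigrams = {
    ("x", "ab"): 1, ("y", "ab"): 1, ("ab", "p"): 1, ("ab", "q"): 1,
    ("x", "cd"): 2, ("cd", "p"): 2,
}
prefix, postfix = neighbor_entropies(bigrams)
new_words = filter_new_words(["ab", "cd"], prefix, postfix, 0.5, 0.5)
```

In this toy data "ab" has two different neighbours on each side and passes, while "cd" always appears between the same two words and is dropped.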

markpinyin.py

  • guess the best pinyin for each new word from the merged word sequence.