Commits on Feb 13, 2008
Commits on Feb 12, 2008
  1. wordProb prototype change

Commits on Feb 11, 2008
  1. train: Use closed vocab

Commits on Feb 10, 2008
  1. vspell-report: fix --detail

  2. Use ${O}.vocab2 for LM generation

    ${O}.vocab2 does contain special tokens like <opaque>
    while ${O}.vocab does not
Commits on Feb 5, 2008
  1. Echo running commands

Commits on Feb 3, 2008
Commits on Feb 21, 2007
  1. Updated train script

     - Include timing for long-running steps
     - Use sc-train --replay
     - Accept two parameters: the first is PREFIX, the second is a number
  2. Prevent overflows in sc2wngram

    Counts for <s> <digit>, <s> <opaque> and <s> <punct> may well exceed
    the int limit (about 2G), so long long int is used instead
Commits on Dec 2, 2006
  1. Fixed leaks in LM::operator[](const char *) and LM::clear_oov()

    LM::clear_oov() (also known as clear_rest) resizes LM::oov without
    releasing the memory allocated for its strings.
    LM::operator[](const char *) abused LM::lm->HT to index LM::oov.
    This has the nasty effect that lm->HT keeps growing no matter how
    often you call LM::clear_oov(), and as lm->HT grows bigger and
    bigger, hash lookup slows down significantly.
    The new implementation uses a separate hash to index oov and frees
    it when LM::clear_oov() is called
Commits on Dec 1, 2006
  1. Preserve \n in std-syllable output

    It greatly helps when preparing the wordlist
Commits on Nov 30, 2006
  1. Reworked WordArchive::load to use new wordlist format

    The new format is the same as CMU SLM's vocab format. Note that
    the words in the wordlist are standardized ones.
    If WordArchive::load() is called with NULL as its argument, it
    uses the wordlist from struct lm_t inside class LM. So if you have
    already loaded an LM, call warch.load(NULL) to save I/O.
Commits on Nov 29, 2006
  1. Added record/replay mode to sc-train to speed up the process

    According to sysprof, a large amount of time was spent on input
    processing (operator >> Lattice& and friends). This mode tries to
    eliminate that work.
    Record mode runs as usual but without the real calculation; it
    outputs the steps needed to compute the final results.
    The format is as follows (for bigrams only):
    <dag count> <node begin id> <node end id>
    <L|R> <v> <vv> <word1> <word2>
    <D> 0 0 none none
    Replay mode reads record mode's output and does the rest of the
    work: it allocates Sleft and Sright, fills them up, and outputs
    the counts.
    Record mode may take as long as normal mode, but replay mode is
    much faster (about fifteen minutes, while normal mode may take
    ninety to one hundred and twenty). Record mode's output, however,
    seems to be about three times larger than the lattice output
    (approx 300MB gzipped)
Commits on Nov 28, 2006
  1. Use double for softcounting as float is too small

    Some values reached 1e-47, which is too small in magnitude for
    float to represent (it underflows to zero).
    Also added some checks so we get notified if some values
    overflow or underflow
Commits on Nov 21, 2006
  1. Replaced bare Sentence pointer in Lattice class with boost::shared_ptr

    This should fix the leak produced by commit
Commits on Nov 13, 2006
  1. Avoid signed/unsigned char pitfalls when calling viet_is* functions

    Strings are by default char*, which means some characters have
    negative values (such as 'đ', 0xf0, which is -16 as a signed char).
    If these values are passed to viet_is*, they index undetermined
    places because viet_is* doesn't check for negative values.
    This caused Sentence::tokenize() to ignore 'đủ'
Commits on Nov 12, 2006
  1. Always ignore 0-weighted edges in PFS. It's too perfect to be true.

    This adds a safeguard against nasty bugs such as
  2. LM::wordProb should return a large value in bad cases

    Returning 0 means it's the best LogP out there, which is
    obviously wrong. The selected value is -9999.0