Commits on Feb 13, 2008
  1. softcount: tolerate zero ngrams

    committed Feb 13, 2008
  2. Compress sort's temporary files

    committed Feb 13, 2008
Commits on Feb 12, 2008
  1. wordProb prototype change

    committed Feb 12, 2008
Commits on Feb 11, 2008
  1. train: Use closed vocab

    committed Feb 11, 2008
Commits on Feb 10, 2008
  1. vspell-report: fix --detail

    committed Feb 10, 2008
  2. Use ${O}.vocab2 for LM generation

    ${O}.vocab2 contains special tokens like <opaque>,
    while ${O}.vocab does not
    committed Feb 10, 2008
Commits on Feb 5, 2008
  1. Echo running commands

    committed Feb 5, 2008
Commits on Feb 3, 2008
  1. Temporary compile fix for 33a8726

    committed Feb 3, 2008
Commits on Feb 21, 2007
  1. Updated train script

     - Include timing for long-running processing steps
     - Use sc-train --replay
     - Accept two parameters: the first will be PREFIX,
       the second is a number
    committed Feb 21, 2007
  2. Prevent overflows in sc2wngram

    The counts for <s> <digit>, <s> <opaque> and <s> <punct> may well
    exceed the int limit (about 2G), so long long int is used
    committed Feb 21, 2007
Commits on Dec 2, 2006
  1. Fixed leaks in LM::operator[](const char *) and LM::clear_oov()

    LM::clear_oov() (also known as clear_rest) resized LM::oov without
    releasing the memory allocated for its strings.
    
    LM::operator[](const char *) abused LM::lm->HT to index LM::oov.
    This had the nasty effect that lm->HT kept growing no matter how
    often LM::clear_oov() was called. As lm->HT grew bigger and bigger,
    hash lookups slowed down significantly.
    
    The new implementation uses a separate hash to index oov and frees
    it when LM::clear_oov() is called
    committed Dec 2, 2006
Commits on Dec 1, 2006
  1. Preserve \n in std-syllable output

    It greatly helps when preparing the wordlist
    committed Dec 1, 2006
Commits on Nov 30, 2006
  1. Reworked WordArchive::load to use new wordlist format

    The new format is the same as CMU SLM's vocab format. It should
    be noted that the words in the wordlist are standardized ones.
    
    If WordArchive::load() is called with NULL as its argument, it
    will use the wordlist from struct lm_t inside class LM instead.
    So if you have already loaded an LM, call warch.load(NULL) to
    save I/O.
    committed Nov 30, 2006
Commits on Nov 29, 2006
  1. Added record/replay mode to sc-train to speed up the process

    According to sysprof, a large amount of time was spent on
    input processing (operator >> Lattice& and friends). This mode
    tries to eliminate that work.
    
    Record mode runs as usual but without the real calculation. It
    outputs the steps needed to compute the final results.
    The format is as follows (for bigrams only):
    <dag count> <node begin id> <node end id>
    <L|R> <v> <vv> <word1> <word2>
    <D> 0 0 none none
    
    Replay mode reads record mode's output and does the rest of the
    work. It allocates Sleft and Sright, fills them up, and outputs
    the counts.
    
    Record mode may take as long as normal mode, but replay mode is
    much faster (about fifteen minutes, while normal mode may take
    ninety to one hundred and twenty minutes). Record mode's output
    is, however, about three times larger than the lattice output
    (approximately 300MB gzipped)
    committed Nov 29, 2006
Commits on Nov 28, 2006
  1. Use double for softcounting as float is too small

    Some values reached 1e-47, which is too small for float to
    represent. Also added some checking to make sure we get notified
    if some values overflow
    committed Nov 28, 2006
Commits on Nov 21, 2006
  1. Replaced bare Sentence pointer in Lattice class with boost::shared_ptr

    This should fix the leak produced by commit
    c735117
    committed Nov 21, 2006
Commits on Nov 13, 2006
  1. Avoid signed/unsigned char pitfalls when calling viet_is* functions

    Strings are by default char*. That means some character values
    are negative (such as 'đ', 0xf0, where char is signed). If these
    values are passed to viet_is*, they index into undetermined places
    because viet_is* doesn't check for negative values.
    
    This caused Sentence::tokenize() to ignore 'đủ'
    committed Nov 13, 2006
Commits on Nov 12, 2006
  1. Always ignore 0-weighted edges in PFS. It's too perfect to be true.

    This adds a trampoline to avoid nasty bugs such as
    caf35e6
    committed Nov 12, 2006
  2. LM::wordProb should return a large value in bad cases

    Returning 0 means it is the best possible LogP, which is
    obviously wrong. The selected value is -9999.0
    committed Nov 12, 2006