branch: master

Feb 13, 2008

  1. Nguyễn Thái Ngọc Duy

    softcount: tolerate zero ngrams

    authored
  2. Nguyễn Thái Ngọc Duy

    Compress sort's temporary files

    authored
  3. Nguyễn Thái Ngọc Duy

    Rewrite train for easier maintenance

    authored

Feb 12, 2008

  1. Nguyễn Thái Ngọc Duy

    wordProb prototype change

    authored

Feb 11, 2008

  1. Nguyễn Thái Ngọc Duy

    train: Use closed vocab

    authored

Feb 10, 2008

  1. Nguyễn Thái Ngọc Duy

    vspell-report: fix --detail

    authored
  2. Nguyễn Thái Ngọc Duy

    Use ${O}.vocab2 for LM generation

    ${O}.vocab2 does contain special tokens like <opaque>
    while ${O}.vocab does not
    authored

Feb 05, 2008

  1. Nguyễn Thái Ngọc Duy

    Updated config.h.in from autoconf

    authored
  2. Nguyễn Thái Ngọc Duy

    Simplify WordArchive::add_*_entry()

    authored
  3. Nguyễn Thái Ngọc Duy

    Added assertions to get_id() and get_special_node()

    authored
  4. Nguyễn Thái Ngọc Duy

    Added assertions to make sure DNNode is valid

    authored
  5. Nguyễn Thái Ngọc Duy

    softcount: Count as long double instead of double

    authored
  6. Nguyễn Thái Ngọc Duy

    train: Use PREFIX.vocab as wordlist

    authored
  7. Nguyễn Thái Ngọc Duy

    Echo running commands

    authored
  8. Nguyễn Thái Ngọc Duy

    sc-train: Only load wordlist on first count, specified by --wordlist

    authored

Feb 03, 2008

  1. Nguyễn Thái Ngọc Duy

    Temporary compile fix for 33a8726

    authored
  2. Nguyễn Thái Ngọc Duy

    Fix broken Makefile caused by 0a4039c

    authored

Feb 21, 2007

  1. Nguyễn Thái Ngọc Duy

    Show invalid word when informing users "Invalid Entry"

    authored
  2. Nguyễn Thái Ngọc Duy

    Updated train script

     - Include timing for long-running steps
     - Use sc-train --replay
     - Accept two parameters, the first one will be PREFIX.
       The second one is a number
    authored
  3. Nguyễn Thái Ngọc Duy

    Prevent overflows in sc2wngram

    Counts for <s> <digit>, <s> <opaque> and <s> <punct> may well exceed
    the int limit (about 2G), so long long int is used
    authored
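
A minimal sketch of the idea behind the overflow fix, not the actual sc2wngram code: bigram counts are accumulated in `long long`, which is guaranteed to hold at least 64 bits, so a total past `INT_MAX` is still stored exactly. The `BigramCounts` class and its method names are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <climits>
#include <map>
#include <string>
#include <utility>

// Hypothetical bigram counter; the names are illustrative and do not
// come from sc2wngram.
using Bigram = std::pair<std::string, std::string>;

class BigramCounts {
public:
    // Add n occurrences of the bigram (first, second).
    void add(const std::string& first, const std::string& second, long long n) {
        assert(n >= 0);  // counts only ever grow
        counts_[Bigram(first, second)] += n;
    }

    // Return the accumulated count, or 0 for an unseen bigram.
    long long get(const std::string& first, const std::string& second) const {
        auto it = counts_.find(Bigram(first, second));
        return it == counts_.end() ? 0 : it->second;
    }

private:
    // long long is at least 64 bits wide, so totals above INT_MAX fit.
    std::map<Bigram, long long> counts_;
};
```

With an `int` accumulator, adding anything to a count that already equals `INT_MAX` would be undefined behavior; here the same sequence of additions simply yields a larger number.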

Dec 02, 2006

  1. Nguyễn Thái Ngọc Duy

    Block LM's wordlist and free OOV words after processing

    authored
  2. Nguyễn Thái Ngọc Duy

    Fixed leaks in LM::operator[](const char *) and LM::clear_oov()

    LM::clear_oov() (also known as clear_rest) resized LM::oov without
    releasing the memory allocated for its strings.
    
    LM::operator[](const char *) abused LM::lm->HT to index LM::oov.
    This had the nasty effect that lm->HT kept growing no matter how often
    you called LM::clear_oov(). As lm->HT grew bigger and bigger, hash
    lookups slowed down significantly.
    
    The new implementation uses a separate hash to index oov and frees it
    when LM::clear_oov() is called
    authored
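
The approach described in this commit can be sketched as follows. This is an assumption-laden illustration, not the real `LM` class: `OovIndex`, `id_for` and `first_id` are hypothetical names, and `std::unordered_map` stands in for whatever hash the project uses. The point is that out-of-vocabulary words live in their own index, so clearing them releases both the strings and the hash entries and leaves the main vocabulary hash untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical OOV index kept separate from the main vocabulary hash.
class OovIndex {
public:
    // first_id is the first id available after the in-vocabulary words.
    explicit OovIndex(int first_id) : first_id_(first_id) {
        assert(first_id >= 0);
    }

    // Return the id of an OOV word, assigning a new one on first sight.
    int id_for(const std::string& word) {
        auto it = ids_.find(word);
        if (it != ids_.end()) return it->second;
        int id = first_id_ + static_cast<int>(words_.size());
        words_.push_back(word);
        ids_.emplace(word, id);
        return id;
    }

    // Drop every OOV word: both the strings and the hash entries go away,
    // so repeated clear() calls keep lookups from slowing down over time.
    void clear() {
        words_.clear();
        ids_.clear();
    }

    std::size_t size() const { return words_.size(); }

private:
    int first_id_;
    std::vector<std::string> words_;
    std::unordered_map<std::string, int> ids_;
};
```

After `clear()`, ids are handed out again starting from `first_id`, which mirrors the idea that OOV ids are only meaningful until the next cleanup.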

Dec 01, 2006

  1. Nguyễn Thái Ngọc Duy

    Softcount: Stop processing if sum is zero

    authored
  2. Nguyễn Thái Ngọc Duy

    Preserve \n in std-syllable output

    This greatly helps when preparing the wordlist
    authored

Nov 30, 2006

  1. Nguyễn Thái Ngọc Duy

    Reworked WordArchive::load to use new wordlist format

    The new format is the same as CMU SLM's vocab format. Note that
    the words in the wordlist are standardized ones.
    
    If WordArchive::load() is called with NULL as argument, it'll
    then use wordlist from struct lm_t inside class LM. So if you
    already load an LM, call warch.load(NULL) to save I/O.
    authored

Nov 29, 2006

  1. Nguyễn Thái Ngọc Duy

    Added record/replay mode to sc-train to speed up the process

    According to sysprof, a large amount of time was spent on
    input processing (operator >> Lattice& and friends). This mode
    tries to eliminate that work.
    
    Record mode runs as usual but skips the real calculation. It
    outputs the steps needed to calculate the final results.
    The format is as follows (for bigrams only):
    <dag count> <node begin id> <node end id>
    <L|R> <v> <vv> <word1> <word2>
    <D> 0 0 none none
    
    Replay mode reads record mode's output and does the rest of the work.
    It allocates Sleft and Sright, fills them and outputs the counts.
    
    Record mode may take as long as normal mode, but replay mode is
    much faster (about fifteen minutes, while normal mode may take
    ninety to one hundred and twenty minutes). Record mode's output,
    however, seems to be three times bigger than the lattice output
    (approx. 300MB gzipped)
    authored

Nov 28, 2006

  1. Nguyễn Thái Ngọc Duy

    Use double for softcounting as float is too small

    Some values reached 1e-47, which is below the smallest value float
    can represent. Also added some checks so that we are notified if
    any values overflow
    authored
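
To illustrate why the switch to double matters: the smallest positive value an IEEE 754 single-precision float can hold is roughly 1.4e-45, so a soft count near 1e-47 silently becomes zero, while double represents it without trouble. The helper names below (`underflows_in_float`, `checked_product`) are hypothetical and only sketch the kind of check the commit mentions; they are not taken from the project.

```cpp
#include <cassert>
#include <cmath>

// Returns true if x is nonzero as a double but flushes to zero when
// narrowed to float, i.e. it is too small for single precision.
bool underflows_in_float(double x) {
    float f = static_cast<float>(x);
    return x != 0.0 && f == 0.0f;
}

// Hypothetical sanity check: multiply two probabilities and abort (in
// debug builds) if the result is infinite or NaN.
double checked_product(double a, double b) {
    double p = a * b;
    assert(std::isfinite(p));
    return p;
}
```

For example, `underflows_in_float(1e-47)` is true on IEEE 754 platforms, whereas `checked_product(1e-24, 1e-23)` still returns a positive double.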

Nov 21, 2006

  1. Nguyễn Thái Ngọc Duy

    Replaced bare Sentence pointer in Lattice class with boost::shared_ptr

    This should fix the leak introduced by commit c735117
    authored
  2. Nguyễn Thái Ngọc Duy

    Appended \n after stderr messages

    authored

Nov 13, 2006

  1. Nguyễn Thái Ngọc Duy

    Adjusted BranchNNode::add_path() to take const vector<> instead of vector<>

    authored
  2. Nguyễn Thái Ngọc Duy

    Do exact string comparison in lm/hash.c

    authored
  3. Nguyễn Thái Ngọc Duy

    Avoid signed/unsigned char pitfalls when calling viet_is* functions

    Strings are char* by default, and where char is signed some
    characters have negative values (such as 'đ' - 0xf0). If these values
    are passed to viet_is*, they index memory outside the lookup tables
    because viet_is* doesn't check for negative values.
    
    This caused Sentence::tokenize() to ignore 'đủ'
    authored
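
The pitfall above is the same one that affects the standard `<cctype>` functions: a plain `char` holding byte 0xF0 is negative where `char` is signed, and using it directly as a table index reads outside the table. A minimal sketch of the usual remedy, converting through `unsigned char` first, is shown below. `CharTable` and `as_index` are hypothetical stand-ins, not the actual viet_is* implementation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the lookup table behind the viet_is* functions.
struct CharTable {
    bool flags[256];

    CharTable() {
        for (std::size_t i = 0; i < 256; ++i) flags[i] = false;
    }

    void mark(unsigned char c) { flags[c] = true; }

    // Takes an int like the <cctype> functions; the value must be 0..255.
    bool test(int c) const {
        assert(c >= 0 && c < 256);
        return flags[c];
    }
};

// Convert through unsigned char so that bytes >= 0x80 map to 128..255
// instead of negative numbers on platforms where char is signed.
inline int as_index(char c) {
    return static_cast<unsigned char>(c);
}
```

Callers then write `table.test(as_index(ch))` rather than `table.test(ch)`, which keeps every byte value inside the 256-entry table regardless of the platform's `char` signedness.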

Nov 12, 2006

  1. Nguyễn Thái Ngọc Duy

    Build libvspell as a shared library

    authored
  2. Nguyễn Thái Ngọc Duy

    Always ignore 0-weighted edges in PFS. It's too perfect to be true.

    This adds a trampoline to avoid nasty bugs such as caf35e6
    authored
  3. Nguyễn Thái Ngọc Duy

    LM::wordProb should return a large value in bad cases

    Returning 0 means it's the best LogP out there, which is
    obviously wrong. The selected value is -9999.0
    authored
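
The reasoning in the LM::wordProb commit can be sketched like this: log probabilities are at most 0, and 0 is the best possible score, so a failed lookup has to return something far below any real score. Only the value -9999.0 comes from the commit message; the function `word_logp`, the constant name `BAD_LOGP` and the map-based table are assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>

// Sentinel for "no usable probability"; the value is the one named in
// the commit message, the constant name is hypothetical.
const float BAD_LOGP = -9999.0f;

// Hypothetical lookup in a unigram log-probability table.
float word_logp(const std::map<std::string, float>& table,
                const std::string& word) {
    auto it = table.find(word);
    if (it == table.end()) {
        // Returning 0 here would make the unknown word look like the
        // best candidate, so return a very poor score instead.
        return BAD_LOGP;
    }
    assert(it->second <= 0.0f);  // log probabilities are never positive
    return it->second;
}
```

Any search that maximizes the summed log probability will then steer away from paths containing a failed lookup instead of preferring them.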