LM::clear_oov() (also known as clear_rest) resized LM::oov without releasing the memory allocated for its strings. In addition, LM::operator(const char *) abused LM::lm->HT to index LM::oov. This had the nasty effect that lm->HT kept growing no matter how often LM::clear_oov() was called, and as lm->HT grew, hash lookups slowed down significantly. The new implementation uses a separate hash to index OOV words and frees it when LM::clear_oov() is called.
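The idea of a separate, disposable OOV index can be sketched as below. This is a minimal illustration, not the project's code: the class OovIndex, its members, and the id numbering scheme are all hypothetical, and std::unordered_map stands in for whatever hash the real implementation uses.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the OOV bookkeeping; not the real LM class.
class OovIndex {
public:
    // first_id: the first id handed out to an OOV word (assumed to follow
    // the ids of the in-vocabulary words).
    explicit OovIndex(int first_id) : first_id_(first_id) {
        assert(first_id >= 0);
    }

    // Return the id of an OOV word, assigning a new id on first sight.
    int id_of(const std::string& word) {
        auto it = index_.find(word);
        if (it != index_.end()) return it->second;
        int id = first_id_ + static_cast<int>(words_.size());
        words_.push_back(word);
        index_.emplace(word, id);
        return id;
    }

    std::size_t size() const { return words_.size(); }

    // Drop every OOV entry. Swapping with empty temporaries releases the
    // storage instead of merely resetting the element count.
    void clear() {
        std::unordered_map<std::string, int>().swap(index_);
        std::vector<std::string>().swap(words_);
    }

private:
    int first_id_;
    std::unordered_map<std::string, int> index_;  // word -> id
    std::vector<std::string> words_;              // id - first_id_ -> word
};
```

Because the OOV words live in their own table, the main vocabulary hash never sees them, so its size and lookup cost stay constant across calls to clear().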
The new format is the same as CMU SLM's vocab format. Note that the words in the wordlist are standardized ones. If WordArchive::load() is called with NULL as its argument, it uses the wordlist from struct lm_t inside class LM. So if you have already loaded an LM, call warch.load(NULL) to save I/O.
According to sysprof, a large amount of time was spent on input processing (operator >> Lattice& and friends). This mode tries to eliminate that work. Record mode runs as usual but without the real calculation; it outputs the steps needed to calculate the final results. The format is as follows (for bigrams only): <dag count> <node begin id> <node end id> <L|R> <v> <vv> <word1> <word2> <D> 0 0 none none. Replay mode reads record mode's output and does the rest of the work: it allocates Sleft and Sright, fills them in, and outputs the counts. Record mode may take as long as normal mode, but replay mode is much faster (about fifteen minutes, whereas normal mode may take ninety to one hundred and twenty minutes). Record mode's output, however, seems to be about three times bigger than the lattice output (approximately 300MB gzipped).
This should fix the leak produced by commit c735117
Strings are char* by default. That means some characters have negative values (such as 'đ', 0xf0). If these values are passed to viet_is*, they reference undefined memory locations, because viet_is* does not check for negative values. This caused Sentence::tokenize() to ignore 'đủ'.
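A common way to avoid this class of bug is to convert the character to unsigned char before using it as a table index. The sketch below assumes a 256-entry lookup table; alpha_table() and is_alpha_safe() are hypothetical names, and the real viet_is* functions may be organized differently.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 256-entry classification table, built once on first use.
static const std::array<bool, 256>& alpha_table() {
    static const std::array<bool, 256> table = [] {
        std::array<bool, 256> t{};  // all entries start as false
        for (unsigned c = 'a'; c <= 'z'; ++c) t[c] = true;
        for (unsigned c = 'A'; c <= 'Z'; ++c) t[c] = true;
        t[0xf0] = true;  // the byte the commit message gives for 'đ'
        return t;
    }();
    return table;
}

// Safe lookup: where plain char is signed, byte 0xf0 has the value -16,
// so it must be converted to unsigned char before indexing.
bool is_alpha_safe(char ch) {
    std::size_t idx = static_cast<unsigned char>(ch);
    assert(idx < 256);
    return alpha_table()[idx];
}
```

Indexing the table directly with a signed char holding 0xf0 would read 16 entries before the start of the array, which is the kind of out-of-range access described above.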
This adds a trampoline to avoid nasty bugs such as caf35e6
Returning 0 means it is the best possible LogP (a log probability of 0 corresponds to a probability of 1), which is obviously wrong. The value selected instead is -9999.0.
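To illustrate why the fallback matters, here is a minimal sketch of a lookup that returns a very low log probability for missing entries. The names kMissingLogP and logp_or_floor are hypothetical, and the hash map is only a stand-in for the real model storage; only the value -9999.0 comes from this change.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical sentinel name; the value -9999.0 is the one selected above.
const double kMissingLogP = -9999.0;

// Hypothetical lookup helper, not the project's actual function.
double logp_or_floor(const std::unordered_map<std::string, double>& table,
                     const std::string& ngram) {
    auto it = table.find(ngram);
    if (it == table.end()) return kMissingLogP;  // unseen: very unlikely
    assert(it->second <= 0.0);  // log probabilities are never positive
    return it->second;
}
```

With 0 as the fallback, an unseen n-gram would outscore every n-gram that is actually in the model; with -9999.0 it ranks below all of them.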