- When using --train-fast, remove the "flushing cache" message when done

- Word tokenizer:
    * Improve tokenization of email addresses
    * Use backspace instead of escape as a magic character when
      capitalizing text in multiple passes, since it's less likely to
      appear in tokens.
    * Preserve casing of words like "ATMs"
- Scored engine: Prefer shorter replies, like MegaHAL/cobe do

- Word tokenizer:
    * Improve matching/capitalization of filenames and domain names
    * Match timestamps as single tokens
    * Match IRC nicks (<foobar>, <@foobar>, etc) as single tokens
    * Match IRC channel names (#foo, &bar, +baz)
    * Match various prefixes and postfixes with numbers
    * Match "#1" and "#1234" as single tokens
    * Match </foo> as a single token

- Depend on MouseX::Getopt 0.33 to fix test failures
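
The tokenizer entries above can be illustrated with a small regex-based sketch. It is written in Python for illustration only and is not Hailo's actual (Perl) tokenizer; the pattern set, the name TOKEN_RE and the function tokenize are all assumptions. It shows one way that timestamps, IRC nicks, channel names, "#1" and </foo> could each come out as a single token: the more specific alternatives are tried before the generic word and punctuation rules.

```python
import re

# Hypothetical patterns (not Hailo's real ones). Alternatives are tried
# left to right at each position, so specific forms precede generic ones.
TOKEN_RE = re.compile(r"""
    \d{1,2}:\d{2}(?::\d{2})?   # timestamps such as 12:34 or 12:34:56
  | </?[@+%~&]?\w+>            # IRC nicks like <foobar>, <@foobar>; also </foo>
  | [#&+]\w+                   # channel names (#foo, &bar, +baz) and "#1", "#1234"
  | \w+(?:'\w+)?               # ordinary words, optionally with an apostrophe
  | [^\w\s]                    # any other single punctuation character
""", re.VERBOSE)

def tokenize(text):
    """Split text into tokens; whitespace between tokens is discarded."""
    return TOKEN_RE.findall(text)

tokens = tokenize("12:34 <@foobar> see #hailo and #1 </foo>")
```

In this sketch the order of the alternatives matters: if the generic punctuation rule came first, "<@foobar>" would be split into "<", "@", "foobar" and ">".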
It's not that useful anyway.
I changed the way input is processed, so that we can match whitespace in tokens. This allows matching paths with spaces in them, as well as IRC nicks from irssi such as < literal>.
Due to how the tokenizer works, at least one of the tokens will always have normal spacing.
- Speed up the learning of repetitive sentences by caching more

- Added Hailo::Engine::Scored, which generates multiple replies (limited
  by time or number of iterations) and returns the best one. Based on
  code from Peter Teichman's Cobe project.

- Fixed a bug which caused the tokenizer to be very slow at capitalizing
  replies which contain things like "script/osm-to-tilenumbers.pl"

- Speed up learning quite a bit (up to 25%) by using more efficient SQL.

- Add --train-fast to speed up learning by up to an additional 45% on
  large brains by using aggressive caching. This uses a lot of memory.
  Almost 600MB with SQLite on a 64bit machine for a brain which
  eventually takes 134MB on disk (trained from a 350k line IRC log).

- Word tokenizer:
    * Preserve casing of Emacs key sequences like "C-u"
    * Don't capitalize words after ellipses (e.g. "Wait... what?")
    * When adding a full stop to paragraphs which end with a quoted word,
      add it inside the quotes (e.g. "I heard him say 'hello there.'")
    * Make it work correctly when the input has newlines
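
The "generate multiple replies and return the best one" idea behind the Scored engine entry above can be sketched as a simple loop. This is a minimal Python illustration and not the code of Hailo::Engine::Scored or Cobe; the function best_reply, its parameters, and the caller-supplied generate and score callables are hypothetical. Candidates are generated until either an iteration limit or a time limit is reached, and the highest-scoring one is kept.

```python
import time

def best_reply(generate, score, max_iterations=None, max_seconds=None):
    """Return the highest-scoring candidate produced before a limit is hit.

    generate: hypothetical callable returning one candidate reply
    score:    hypothetical callable mapping a reply to a number
    """
    if max_iterations is None and max_seconds is None:
        raise ValueError("need an iteration limit or a time limit")
    deadline = None if max_seconds is None else time.monotonic() + max_seconds

    best, best_score = None, float("-inf")
    count = 0
    while True:
        if max_iterations is not None and count >= max_iterations:
            break
        # With only a time limit, still try at least one candidate.
        if deadline is not None and count > 0 and time.monotonic() >= deadline:
            break
        candidate = generate()
        count += 1
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best

# Toy usage: three fixed candidates and a scorer that favours short replies.
candidates = iter(["a b c", "a", "a b"])
reply = best_reply(lambda: next(candidates), lambda r: -len(r), max_iterations=3)
```

The toy scorer here only penalizes length; a real engine would need a scoring function of its own, which this sketch deliberately leaves to the caller.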