Label guessing and management optimizations #362
Comments
Hm, there is something wrong with label guessing right now. Just after making a new label and putting it on a few documents:
After rebuilding the indexes from scratch:
|
https://pypi.python.org/pypi/simplebayes/1.5.8
|
Reference statistics for testing (2015/11/16 11:50):
|
With simplebayes (branch wip-labels):
|
With simplebayes + custom tokenizer (split_words() in paperwork/backend/utils):
I guess I won't use it ... :-) |
Or maybe I will: with simplebayes + custom tokenizer + weights added to scores, things get more interesting.
With a weight of 4.0:
Best I found is with a weight of 5.0:
Increasing the weight to 6 decreases the accuracy for negative label prediction to < 99% (which is bad):
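For readers wondering what "weights added to scores" could look like: the thread does not say exactly how the weight is applied, so the following is only a sketch under assumptions. It assumes each label has its own classifier returning a score for a "yes" and a "no" category, and that the weight multiplies the "yes" score before the two are compared. The function name `guess_label` and the score layout are hypothetical.

```python
def guess_label(scores, weight=5.0):
    """Decide whether a label applies to a document.

    `scores` is assumed to look like {"yes": float, "no": float};
    a missing category counts as 0.0.
    ASSUMPTION: the weight boosts the "yes" score only.
    """
    yes = scores.get("yes", 0.0) * weight
    no = scores.get("no", 0.0)
    return yes > no


# Made-up scores, only to show the effect of the weight.
example_scores = {"yes": 1.0, "no": 4.0}
unweighted = guess_label(example_scores, weight=1.0)
weighted = guess_label(example_scores, weight=5.0)
```

Under that assumption, a higher weight makes the guesser more willing to apply a label, which would fit the observation that pushing it too high hurts negative predictions.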
|
Without the custom tokenizer but with the weights:
Ok ... I guess I will remove the custom tokenizer ... ^^ |
Just need a migration process |
Problems:
Label management is painfully slow
When label guessing screws up and the user has to fix 5 labels, it currently means 5 index updates. Whoosh index updates are slooowwww. The GUI needs to allow changing them all in one shot.
See this idea from Mathieu Jourdan for instance.
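To illustrate the "one shot" idea, here is a minimal sketch that collects all label changes for a document and applies them with a single commit. `CountingIndex` is a hypothetical stand-in for the real index (it only counts commits), and `apply_label_changes` is an illustrative name, not an existing Paperwork function.

```python
class CountingIndex:
    """Hypothetical stand-in for a slow index; it only counts commits."""

    def __init__(self):
        self.labels = {}     # doc_id -> set of label names
        self.commits = 0
        self._pending = []

    def update_document(self, doc_id, labels):
        # Queue the update; nothing is visible until commit().
        self._pending.append((doc_id, set(labels)))

    def commit(self):
        for doc_id, labels in self._pending:
            self.labels[doc_id] = labels
        self._pending = []
        self.commits += 1


def apply_label_changes(index, doc_id, current, to_add, to_remove):
    """Apply every label change for one document with one index commit."""
    new_labels = (set(current) | set(to_add)) - set(to_remove)
    index.update_document(doc_id, new_labels)
    index.commit()
    return new_labels


index = CountingIndex()
result = apply_label_changes(
    index, "doc1",
    current={"bank", "taxes"},
    to_add={"invoice", "car"},
    to_remove={"taxes"},
)
```

Here three label changes cost one commit instead of three.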
Label guessing accuracy is not good enough
This is good, but maybe not good enough. Some fine-tuning to try:
Label guessing is slow
When reindexing all the documents, the label guessing index update is painfully slow.
In the GUI, a special mechanism had to be implemented because label guessing wasn't fast enough to keep the GUI reactive as-is.
Could a (custom?) C lib be faster?
Label guessing index update cannot be rolled back
When updating the index, python-whoosh provides a "commit()" operation and a rollback operation ("cancel()" on the index writer). This is handy for instance if the index update is interrupted in the middle. It allows keeping a consistent and well-known index state.
However, the libs used for label guessing don't provide them. So label guessing indexes can end up in a weird state.
--> either we implement a commit/rollback mechanism on top of the current libs, or we implement a custom lib
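As a rough sketch of the first option (commit/rollback on top of the current libs), one could keep the classifier state in a working copy and only write it to disk on commit, using a temporary file plus an atomic rename. This assumes the state is picklable; `TransactionalStore`, the file name and the dict used as state are hypothetical, not part of simplebayes or Paperwork.

```python
import copy
import os
import pickle
import tempfile


class TransactionalStore:
    """Holds a picklable state object with commit()/rollback() semantics."""

    def __init__(self, path, default):
        self.path = path
        if os.path.exists(path):
            with open(path, "rb") as fd:
                self._committed = pickle.load(fd)
        else:
            self._committed = default
        # All modifications go to a working copy until commit().
        self.state = copy.deepcopy(self._committed)

    def commit(self):
        # Write to a temporary file in the same directory, then swap it in.
        # os.replace() is atomic, so the file on disk is never half-written.
        dirname = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=dirname)
        try:
            with os.fdopen(fd, "wb") as tmp:
                pickle.dump(self.state, tmp)
            os.replace(tmp_path, self.path)
        except BaseException:
            os.unlink(tmp_path)
            raise
        self._committed = copy.deepcopy(self.state)

    def rollback(self):
        # Drop uncommitted changes and go back to the last committed state.
        self.state = copy.deepcopy(self._committed)


workdir = tempfile.mkdtemp()
store_path = os.path.join(workdir, "label_guesser.pickle")
store = TransactionalStore(store_path, default={})
store.state["invoice"] = 3
store.commit()
store.state["invoice"] = 99   # e.g. a reindex that gets interrupted
store.rollback()
reloaded = TransactionalStore(store_path, default={})
```

The deep copies make this costly for a large state, so it is only meant to show the shape of the mechanism.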