
Label guessing and management optimizations #362

Closed
jflesch opened this issue Jan 17, 2015 · 9 comments

@jflesch
Member

jflesch commented Jan 17, 2015

Problems:

Label management is painfully slow

When label guessing screws up and the user has to fix 5 labels, it currently means 5 index updates. Whoosh index updates are slooowwww. The GUI needs to allow changing them all in one shot.
See this idea from Mathieu Jourdan for instance.
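A minimal sketch of what the batched update could look like with Whoosh: one writer, several update_document() calls, a single commit. The docid/label field names here are made up for illustration, not Paperwork's actual schema.

```python
from whoosh import index

def apply_label_fixes(index_dir, fixes):
    # fixes: dict mapping document id -> corrected label
    ix = index.open_dir(index_dir)
    writer = ix.writer()
    try:
        for docid, label in fixes.items():
            # update_document() replaces the document whose unique
            # field ("docid" here) matches
            writer.update_document(docid=docid, label=label)
    except Exception:
        writer.cancel()
        raise
    writer.commit()  # one slow Whoosh commit instead of five
```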

Label guessing accuracy is not good enough

Average accuracy of label prediction (global): 98%
Average accuracy of label prediction (positive): 88%
Average accuracy of label prediction (negative): 99%

This is good, but maybe not good enough. Some fine-tunings to try:

  • Spellchecking
  • Dropping words that are too short
  • Dropping diacritics (isn't that done already?)
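For the last two points, a rough sketch of what it could look like (the 3-character "too short" threshold is an arbitrary assumption):

```python
import unicodedata

def strip_diacritics(word):
    # NFKD-decompose, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", word)
    return u"".join(c for c in decomposed if not unicodedata.combining(c))

def keep_word(word, min_len=3):
    # min_len is an arbitrary guess at "too short"
    return len(word) >= min_len

assert strip_diacritics(u"réglé") == u"regle"
```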

Label guessing is slow

When reindexing all the documents, updating the label guessing index is painfully slow.
In the GUI, a special mechanism had to be implemented because label guessing wasn't fast enough to keep the GUI responsive as-is.
Could a (custom?) C lib be faster?

Label guessing index update cannot be rolled back

When updating the index, python-whoosh provides the operations commit() and rollback(). This is handy, for instance, if the index update is interrupted in the middle: it keeps the index in a consistent, well-known state.

However, the libs used for label guessing don't provide them, so the label guessing indexes can end up in a weird state.

--> Either we implement a commit/rollback mechanism on top of the current libs, or we implement a custom lib.
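One possible shape for the first option: apply the updates to a copy of the label guessing index directory and atomically swap it in on commit. A sketch only, with a made-up class name:

```python
import os
import shutil

class DirTransaction(object):
    """Copy-on-write 'transaction' for an on-disk index directory."""

    def __init__(self, live_dir):
        self.live_dir = live_dir
        self.work_dir = live_dir + ".tmp"
        if os.path.exists(self.work_dir):
            shutil.rmtree(self.work_dir)
        # all updates go to this copy; the live index stays untouched
        shutil.copytree(live_dir, self.work_dir)

    def commit(self):
        # swap the updated copy in place of the live directory
        old_dir = self.live_dir + ".old"
        os.rename(self.live_dir, old_dir)
        os.rename(self.work_dir, self.live_dir)
        shutil.rmtree(old_dir)

    def rollback(self):
        # the live index was never touched; just drop the copy
        shutil.rmtree(self.work_dir)
```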

@jflesch jflesch added this to the 0.3-unstable milestone Jan 17, 2015
@jflesch jflesch changed the title Label detection and management optimizations Label guessing and management optimizations Jan 17, 2015
@jflesch
Member Author

jflesch commented Jan 21, 2015

Hm, there is something wrong with label guessing right now.

Just after making a new label and putting it on a few documents:

Statistics
==========
Total number of documents: 991
Total number of pages: 1854
Total number of words: 383754
Total words len: 2760727
Total number of unique words: 54579
===
Maximum number of pages in one document: 75
Maximum word length: 179
Average word length: 7.194002
Average number of words per page: 206.987055
Average number of words per document: 387.239152
Average number of pages per document: 1.870838
Average number of unique words per document: 203.152371
Average accuracy of label prediction (global): 93%
Average accuracy of label prediction (positive): 80%
Average accuracy of label prediction (negative): 94%

After rebuilding the indexes from scratch:

Statistics
==========
Total number of documents: 991
Total number of pages: 1854
Total number of words: 383754
Total words len: 2760727
Total number of unique words: 54579
===
Maximum number of pages in one document: 75
Maximum word length: 179
Average word length: 7.194002
Average number of words per page: 206.987055
Average number of words per document: 387.239152
Average number of pages per document: 1.870838
Average number of unique words per document: 203.152371
Average accuracy of label prediction (global): 98%
Average accuracy of label prediction (positive): 87%
Average accuracy of label prediction (negative): 99%

@jflesch
Member Author

jflesch commented Aug 27, 2015

https://pypi.python.org/pypi/simplebayes/1.5.8
Looks like a good candidate:

  • It seems Python 3-ready
  • It's lightweight (no dependencies, as far as I can tell)
  • The API is really clear
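Basic usage, going by its documentation (the categories and training text below are made up; the idea would be one classifier per label with "yes"/"no" categories):

```python
import simplebayes

bayes = simplebayes.SimpleBayes()
bayes.train('yes', 'facture edf electricite kwh releve')
bayes.train('no', 'bus ticket paris lyon')

print(bayes.classify('releve edf du mois'))  # -> 'yes'
print(bayes.score('releve edf du mois'))     # -> {'yes': ..., 'no': ...}
```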

@jflesch jflesch modified the milestones: 0.4-unstable, 0.3-unstable Oct 9, 2015
@jflesch jflesch modified the milestones: 0.3-unstable, 0.4-unstable Nov 16, 2015
@jflesch
Member Author

jflesch commented Nov 16, 2015

Reference statistics for testing (2015/11/16 11:50):

Statistics
==========
Total number of documents: 1169
Total number of pages: 2275
Total number of words: 466532
Total words len: 3363270
Total number of unique words: 63930
===
Maximum number of pages in one document: 75
Maximum word length: 222
Average word length: 7.209087
Average number of words per page: 205.069011
Average number of words per document: 399.086399
Average number of pages per document: 1.946108
Average number of unique words per document: 209.521814
Average accuracy of label prediction (global): 98%
Average accuracy of label prediction (positive): 89%
Average accuracy of label prediction (negative): 99%

@jflesch
Member Author

jflesch commented Nov 16, 2015

With simplebayes (branch wip-labels):

Statistics
==========
Total number of documents: 1169
Total number of pages: 2275
Total number of words: 466532
Total words len: 3363270
Total number of unique words: 63930
===
Maximum number of pages in one document: 75
Maximum word length: 222
Average word length: 7.209087
Average number of words per page: 205.069011
Average number of words per document: 399.086399
Average number of pages per document: 1.946108
Average number of unique words per document: 209.521814
Average accuracy of label prediction (global): 95%
Average accuracy of label prediction (positive): 40%
Average accuracy of label prediction (negative): 100%

@jflesch
Member Author

jflesch commented Nov 16, 2015

With simplebayes + a custom tokenizer (split_words() in paperwork/backend/utils):

Statistics
==========
Total number of documents: 1169
Total number of pages: 2275
Total number of words: 466531
Total words len: 3363266
Total number of unique words: 63929
===
Maximum number of pages in one document: 75
Maximum word length: 222
Average word length: 7.209094
Average number of words per page: 205.068571
Average number of words per document: 399.085543
Average number of pages per document: 1.946108
Average number of unique words per document: 209.520958
Average accuracy of label prediction (global): 94%
Average accuracy of label prediction (positive): 24%
Average accuracy of label prediction (negative): 100%

I guess I won't use it ... :-)
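For reference, the tokenizer was hooked in roughly like this (assuming simplebayes's tokenizer constructor argument; split_words() is a generator, hence the list()):

```python
import simplebayes
from paperwork.backend.utils import split_words  # module path as given above

def tokenizer(text):
    # split_words() yields the words of a text; simplebayes wants a list
    return list(split_words(text))

bayes = simplebayes.SimpleBayes(tokenizer=tokenizer)
```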

@jflesch
Member Author

jflesch commented Nov 16, 2015

Or maybe I will:

With simplebayes + custom tokenizer + weights added to scores, things get more interesting:

With a weight of 4.0:

Average accuracy of label prediction (global): 98%
Average accuracy of label prediction (positive): 82%
Average accuracy of label prediction (negative): 99%

Best I found is with a weight of 5.0:

Average accuracy of label prediction (global): 98%
Average accuracy of label prediction (positive): 87%
Average accuracy of label prediction (negative): 99%

Increasing the weight to 6 decreases the accuracy of negative label prediction to < 99% (which is bad):

Average accuracy of label prediction (global): 96%
Average accuracy of label prediction (positive): 97%
Average accuracy of label prediction (negative): 96%
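A guess at what "weights added to scores" means concretely: bias the positive score by a multiplicative weight before comparing it to the negative one. The exact formula in wip-labels may differ:

```python
WEIGHT = 5.0  # best value found above

def has_label(bayes, text, weight=WEIGHT):
    scores = bayes.score(text)
    # boost the positive score: unweighted, the classifier errs too
    # often on the "no" side
    return scores.get('yes', 0.0) * weight > scores.get('no', 0.0)
```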

@jflesch
Member Author

jflesch commented Nov 16, 2015

Without the custom tokenizer but with the weights:

Average accuracy of label prediction (global): 99%
Average accuracy of label prediction (positive): 94%
Average accuracy of label prediction (negative): 99%

Ok ... I guess I will remove the custom tokenizer ... ^^

@jflesch
Member Author

jflesch commented Nov 16, 2015

Just need a migration process.
