
Label guessing and management optimizations #362

Closed
jflesch opened this issue Jan 17, 2015 · 9 comments

@jflesch
Member

jflesch commented Jan 17, 2015

Problems:

Label management is painfully slow

When label guessing screws up and the user has to fix 5 labels, it currently means 5 index updates. Whoosh index updates are slooowwww. The GUI needs to allow changing them all in one shot.
See this idea from Mathieu Jourdan for instance.
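A minimal sketch of what the batched update could look like with Whoosh: one writer, several update_document() calls, a single commit. The docid/label field names here are made up for illustration, not Paperwork's actual schema.

```python
from whoosh import index

def apply_label_fixes(index_dir, fixes):
    # fixes: dict mapping document id -> corrected label
    ix = index.open_dir(index_dir)
    writer = ix.writer()
    try:
        for docid, label in fixes.items():
            # update_document() replaces the document whose unique
            # field ("docid" here) matches
            writer.update_document(docid=docid, label=label)
    except Exception:
        writer.cancel()
        raise
    writer.commit()  # one slow Whoosh commit instead of five
```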

Label guessing accuracy is not good enough

Average accuracy of label prediction (global): 98%
Average accuracy of label prediction (positive): 88%
Average accuracy of label prediction (negative): 99%

This is good, but maybe not good enough. Some fine-tunings to try:

  • Spellchecking
  • Dropping words that are too short
  • Dropping diacritics (isn't that done already?)
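For the last two points, a rough sketch of what it could look like (the 3-character "too short" threshold is an arbitrary assumption):

```python
import unicodedata

def strip_diacritics(word):
    # NFKD-decompose, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", word)
    return u"".join(c for c in decomposed if not unicodedata.combining(c))

def keep_word(word, min_len=3):
    # min_len is an arbitrary guess at "too short"
    return len(word) >= min_len

assert strip_diacritics(u"réglé") == u"regle"
```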

Label guessing is slow

When reindexing all the documents, updating the label guessing index is painfully slow.
In the GUI, a special mechanism had to be implemented because label guessing wasn't fast enough to keep the GUI responsive as-is.
Could a (custom?) C lib be faster?

Label guessing index update cannot be rolled back

When updating the index, python-whoosh provides the operations commit() and rollback(). This is handy, for instance, if the index update is interrupted in the middle: it keeps the index in a consistent, well-known state.

However, the libs used for label guessing don't provide them, so the label guessing indexes can end up in a weird state.

--> Either we implement a commit/rollback mechanism on top of the current libs, or we implement a custom lib.
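One possible shape for the first option: apply the updates to a copy of the label guessing index directory and atomically swap it in on commit. A sketch only, with a made-up class name:

```python
import os
import shutil

class DirTransaction(object):
    """Copy-on-write 'transaction' for an on-disk index directory."""

    def __init__(self, live_dir):
        self.live_dir = live_dir
        self.work_dir = live_dir + ".tmp"
        if os.path.exists(self.work_dir):
            shutil.rmtree(self.work_dir)
        # all updates go to this copy; the live index stays untouched
        shutil.copytree(live_dir, self.work_dir)

    def commit(self):
        # swap the updated copy in place of the live directory
        old_dir = self.live_dir + ".old"
        os.rename(self.live_dir, old_dir)
        os.rename(self.work_dir, self.live_dir)
        shutil.rmtree(old_dir)

    def rollback(self):
        # the live index was never touched; just drop the copy
        shutil.rmtree(self.work_dir)
```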

@jflesch jflesch added this to the 0.3-unstable milestone Jan 17, 2015
@jflesch jflesch changed the title Label detection and management optimizations Label guessing and management optimizations Jan 17, 2015
@jflesch
Member Author

jflesch commented Jan 21, 2015

Hm, there is something wrong with label guessing right now.

Just after making a new label and putting it on a few documents:

Statistics
==========
Total number of documents: 991
Total number of pages: 1854
Total number of words: 383754
Total words len: 2760727
Total number of unique words: 54579
===
Maximum number of pages in one document: 75
Maximum word length: 179
Average word length: 7.194002
Average number of words per page: 206.987055
Average number of words per document: 387.239152
Average number of pages per document: 1.870838
Average number of unique words per document: 203.152371
Average accuracy of label prediction (global): 93%
Average accuracy of label prediction (positive): 80%
Average accuracy of label prediction (negative): 94%

After rebuilding the indexes from scratch:

Statistics
==========
Total number of documents: 991
Total number of pages: 1854
Total number of words: 383754
Total words len: 2760727
Total number of unique words: 54579
===
Maximum number of pages in one document: 75
Maximum word length: 179
Average word length: 7.194002
Average number of words per page: 206.987055
Average number of words per document: 387.239152
Average number of pages per document: 1.870838
Average number of unique words per document: 203.152371
Average accuracy of label prediction (global): 98%
Average accuracy of label prediction (positive): 87%
Average accuracy of label prediction (negative): 99%

@jflesch
Member Author

jflesch commented Aug 27, 2015

https://pypi.python.org/pypi/simplebayes/1.5.8
Looks like a good candidate:

  • It seems Python 3-ready
  • It's lightweight (no dependencies, as far as I can tell)
  • The API is really clear
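Basic usage, going by its documentation (the categories and training text below are made up; the idea would be one classifier per label with "yes"/"no" categories):

```python
import simplebayes

bayes = simplebayes.SimpleBayes()
bayes.train('yes', 'facture edf electricite kwh releve')
bayes.train('no', 'bus ticket paris lyon')

print(bayes.classify('releve edf du mois'))  # -> 'yes'
print(bayes.score('releve edf du mois'))     # -> {'yes': ..., 'no': ...}
```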

@jflesch jflesch modified the milestones: 0.4-unstable, 0.3-unstable Oct 9, 2015
@jflesch jflesch modified the milestones: 0.3-unstable, 0.4-unstable Nov 16, 2015
@jflesch
Member Author

jflesch commented Nov 16, 2015

Reference statistics for testing (2015/11/16 11:50):

Statistics
==========
Total number of documents: 1169
Total number of pages: 2275
Total number of words: 466532
Total words len: 3363270
Total number of unique words: 63930
===
Maximum number of pages in one document: 75
Maximum word length: 222
Average word length: 7.209087
Average number of words per page: 205.069011
Average number of words per document: 399.086399
Average number of pages per document: 1.946108
Average number of unique words per document: 209.521814
Average accuracy of label prediction (global): 98%
Average accuracy of label prediction (positive): 89%
Average accuracy of label prediction (negative): 99%

@jflesch
Member Author

jflesch commented Nov 16, 2015

With simplebayes (branch wip-labels):

Statistics
==========
Total number of documents: 1169
Total number of pages: 2275
Total number of words: 466532
Total words len: 3363270
Total number of unique words: 63930
===
Maximum number of pages in one document: 75
Maximum word length: 222
Average word length: 7.209087
Average number of words per page: 205.069011
Average number of words per document: 399.086399
Average number of pages per document: 1.946108
Average number of unique words per document: 209.521814
Average accuracy of label prediction (global): 95%
Average accuracy of label prediction (positive): 40%
Average accuracy of label prediction (negative): 100%

@jflesch
Member Author

jflesch commented Nov 16, 2015

With simplebayes + a custom tokenizer (split_words() in paperwork/backend/utils):

Statistics
==========
Total number of documents: 1169
Total number of pages: 2275
Total number of words: 466531
Total words len: 3363266
Total number of unique words: 63929
===
Maximum number of pages in one document: 75
Maximum word length: 222
Average word length: 7.209094
Average number of words per page: 205.068571
Average number of words per document: 399.085543
Average number of pages per document: 1.946108
Average number of unique words per document: 209.520958
Average accuracy of label prediction (global): 94%
Average accuracy of label prediction (positive): 24%
Average accuracy of label prediction (negative): 100%

I guess I won't use it ... :-)
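For reference, the tokenizer was hooked in roughly like this (assuming simplebayes's tokenizer constructor argument; split_words() is a generator, hence the list()):

```python
import simplebayes
from paperwork.backend.utils import split_words  # module path as given above

def tokenizer(text):
    # split_words() yields the words of a text; simplebayes wants a list
    return list(split_words(text))

bayes = simplebayes.SimpleBayes(tokenizer=tokenizer)
```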

@jflesch
Member Author

jflesch commented Nov 16, 2015

Or maybe I will:

With simplebayes + custom tokenizer + weights added to scores, things get more interesting:

With a weight of 4.0:

Average accuracy of label prediction (global): 98%
Average accuracy of label prediction (positive): 82%
Average accuracy of label prediction (negative): 99%

Best I found is with a weight of 5.0:

Average accuracy of label prediction (global): 98%
Average accuracy of label prediction (positive): 87%
Average accuracy of label prediction (negative): 99%

Increasing the weight to 6 decreases the accuracy of negative label prediction to < 99% (which is bad):

Average accuracy of label prediction (global): 96%
Average accuracy of label prediction (positive): 97%
Average accuracy of label prediction (negative): 96%
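A guess at what "weights added to scores" means concretely: bias the positive score by a multiplicative weight before comparing it to the negative one. The exact formula in wip-labels may differ:

```python
WEIGHT = 5.0  # best value found above

def has_label(bayes, text, weight=WEIGHT):
    scores = bayes.score(text)
    # boost the positive score: unweighted, the classifier errs too
    # often on the "no" side
    return scores.get('yes', 0.0) * weight > scores.get('no', 0.0)
```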

@jflesch
Member Author

jflesch commented Nov 16, 2015

Without the custom tokenizer but with the weights:

Average accuracy of label prediction (global): 99%
Average accuracy of label prediction (positive): 94%
Average accuracy of label prediction (negative): 99%

Ok ... I guess I will remove the custom tokenizer ... ^^

@jflesch
Member Author

jflesch commented Nov 16, 2015

Just need a migration process.
