Feature: Improved processing for automatic matching #1609
Conversation
This is awesome stumpy! I can't say that it's in my wheelhouse from a programming perspective, but as an end-user it sounds great. One question I had: what happens when the user is using a language other than English? Also, I had trouble tracking down a list of supported languages for NLTK; maybe we could link to that in the docs if there is one.
That's a great idea and I've incorporated it. The NLTK language is no longer configurable and is instead derived from the Tesseract language and the set of languages supported by the three NLTK tools used. It's annoying there isn't a clear list of languages, so I downloaded the data and created the mapping manually. And if the language isn't a supported one, the processing falls back to essentially the same behavior as before. I only took one machine learning class in university, but I do recall that the preparation of the input made the biggest impact on the accuracy of the results. I have some ideas about how to quantify the changes once I have some time to do it.
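A minimal sketch of the kind of mapping described above; the dict and function names are hypothetical, and only a handful of Tesseract's ISO 639-2 codes are shown:

```python
# Illustrative Tesseract -> NLTK language mapping (names are not the
# actual ones in the PR; only a few supported codes are shown).
TESSERACT_TO_NLTK = {
    "eng": "english",
    "deu": "german",
    "fra": "french",
    "spa": "spanish",
    "ita": "italian",
}

def nltk_language(tesseract_language: str):
    # Returns None for an unsupported language, which triggers the
    # fallback to the original lowercase/whitespace-only processing.
    return TESSERACT_TO_NLTK.get(tesseract_language)
```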
Pretty big thing, yet it is completely under the hood. Maybe something for the 2.0 release? To me, the mechanism used today works pretty well, but why not optimize it with better technology where it helps. Thanks!
That is true, it seems to work pretty well even in the basic form. It would be pretty easy to stick this behind a flag, so those with more powerful hardware and a desire to use it can enable it, while defaulting to the normal method. The only expense then would be the NLTK library's install size, but the data download could also be skipped, so a very minor size increase overall.
Unless we find this causes a really significant performance hit, I think we should keep it on by default, but sure, make it disable-able. My feeling is that accurate and useful document processing matters more than fast processing. Love the new way the language stuff works, thanks!
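A sketch of what that opt-out could look like; the setting name is purely illustrative, not an actual paperless option:

```python
import os
import re

# Illustrative opt-out flag; enabled by default, per the discussion above.
NLP_ENABLED = os.getenv("PAPERLESS_ENABLE_NLP", "1") != "0"

def preprocess(content: str) -> str:
    if NLP_ENABLED:
        # The NLTK pipeline, sketched under "Proposed change" below.
        return nltk_preprocess(content)
    # Legacy behavior: lowercase and collapse runs of whitespace.
    return re.sub(r"\s+", " ", content.lower())
```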
Does this work smoothly when several OCR languages are set, like "deu+eng"?
The NLTK language selection is based off the configured Tesseract language.
Again works well in my testing, great work as always!
I finally got some data about this using a dataset from Kaggle for multi-label classification (i.e., tags).

So overall, it's not an incredible increase, but the added time and resource cost is also pretty moderate.
Any tweaks left or anything? Otherwise let's do it; I'm tired of merge conflicts 😝
Proposed change
The existing machine learning processing for document content is pretty barebones. The content is lowercased and runs of whitespace are condensed to a single space. To me, this has one glaring issue: it makes no attempt to distill the content down to only meaningful words.

With this PR, the Natural Language Toolkit (NLTK) is utilized to process text further, in the hope that only the meaningful data of a document is used for its matching. The processing now tokenizes the content, removes stop words, and stems the remaining words. All of this new processing comes from NLTK.

In my own testing, the processing works just fine, with items matched as I would expect. It's always hard to quantify something like this; maybe some future work can use a training set and a test set to check the accuracy quantitatively.
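As a rough sketch, the new pipeline looks something like the following (the function name is illustrative, and it assumes the NLTK `punkt` and `stopwords` data have already been downloaded):

```python
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

def nltk_preprocess(content: str, language: str = "english") -> str:
    # Tokenize the lowercased content into individual words.
    tokens = word_tokenize(content.lower(), language=language)
    # Drop stop words ("the", "and", ...) and punctuation-only tokens.
    stops = set(stopwords.words(language))
    words = [t for t in tokens if t.isalnum() and t not in stops]
    # Stem what remains, so e.g. "invoices" and "invoice" become one term.
    stemmer = SnowballStemmer(language)
    return " ".join(stemmer.stem(w) for w in words)
```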
Reading:
1. Dropping common terms: stop words
2. Snowball Stemmer
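For a feel of what the Snowball stemmer does, an illustrative REPL session:

```python
>>> from nltk.stem import SnowballStemmer
>>> stemmer = SnowballStemmer("english")
>>> [stemmer.stem(w) for w in ["documents", "running", "correspondents"]]
['document', 'run', 'correspond']
```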
TODOs:
Type of change
Checklist:
- I have run all `pre-commit` hooks, see documentation.