Feature: Improved processing for automatic matching #1609

Merged
merged 11 commits into from Oct 10, 2022

stumpylog
Member

@stumpylog stumpylog commented Sep 16, 2022

Proposed change

The existing machine learning processing for document content is pretty barebones. The content is lowercased and multiple whitespace characters are condensed to a single space. To me, this has one glaring issue: it makes no attempt to distill the content down to only meaningful words.

With this PR, the Natural Language Toolkit (NLTK) is used to process the text further, in the hope of using only the meaningful data of a document for its matching. The processing now tokenizes the content, removes stop words, and stems the remaining words. All of this new processing comes from NLTK.
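
As a rough illustration of the three steps described above (not the code from this PR), here is a minimal standard-library sketch. The names `EXAMPLE_STOP_WORDS`, `toy_stem`, and `preprocess` are hypothetical, the stop-word set is a tiny hand-picked sample, and the suffix stripper is only a crude stand-in for a real Snowball stemmer.

```python
import re
from typing import Callable, FrozenSet

# Hypothetical, hand-picked sample; a real stop-word list is far larger.
EXAMPLE_STOP_WORDS: FrozenSet[str] = frozenset(
    {"a", "an", "and", "are", "for", "is", "of", "the", "to", "was", "were"}
)


def toy_stem(word: str) -> str:
    """Crude suffix stripper standing in for a real Snowball stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(
    content: str,
    stop_words: FrozenSet[str] = EXAMPLE_STOP_WORDS,
    stem: Callable[[str], str] = toy_stem,
) -> str:
    """Lowercase, tokenize, drop stop words, and stem what is left."""
    # Previous behaviour: lowercase and collapse runs of whitespace.
    content = re.sub(r"\s+", " ", content.lower())
    # Step 1: tokenize (a simple regex stands in for a real tokenizer).
    tokens = re.findall(r"\w+", content)
    # Step 2: remove stop words.
    meaningful = [token for token in tokens if token not in stop_words]
    # Step 3: stem the remaining words.
    return " ".join(stem(token) for token in meaningful)


print(preprocess("The invoices were  paid\nfor the scanned documents"))
```

With NLTK installed and its data downloaded, the stand-ins could presumably be swapped for `nltk.tokenize.word_tokenize`, `nltk.corpus.stopwords.words("english")`, and `nltk.stem.SnowballStemmer("english").stem`.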

In my own testing, the processing works just fine, with items matched as I would expect. It's always hard to quantify something like this. Maybe some future work can have a training set and test set to check the accuracy quantitatively.

Reading:
1. Dropping common terms: stop words
2. Snowball Stemmer

TODOs:

  • Bare metal instructions

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Other (please explain)

Checklist:

  • I have read & agree with the contributing guidelines.
  • If applicable, I have tested my code for new features & regressions on both mobile & desktop devices, using the latest version of major browsers.
  • If applicable, I have checked that all tests pass, see documentation.
  • I have run all pre-commit hooks, see documentation.
  • I have made corresponding changes to the documentation as needed.
  • I have checked my modifications for any breaking changes.

@paperless-ngx-secretary paperless-ngx-secretary bot added the backend, ci-cd, documentation, and non-trivial labels Sep 16, 2022
@coveralls

coveralls commented Sep 16, 2022

Pull Request Test Coverage Report for Build 3113778764

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 53 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.3%) to 92.088%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| documents/classifier.py | 23 | 88.99% |
| paperless/settings.py | 30 | 79.62% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 3068778301: | -0.3% |
| Covered Lines: | 4923 |
| Relevant Lines: | 5346 |

💛 - Coveralls

@shamoon
Member

shamoon commented Sep 17, 2022

This is awesome stumpy! I can't say that it's in my wheelhouse from a programming perspective, but as an end-user it sounds great.

One question I had: what happens when the user is using a language other than English? Should PAPERLESS_NLTK_LANG default to PAPERLESS_OCR_LANGUAGE? And what happens if you parse a non-English document with English stop words? Maybe that's not a problem, since it presumably won't find them. Or do we explicitly want to skip this step if the language isn't supported?

Also, I had trouble tracking down a list of supported languages for NLTK; maybe we could link to that in the docs if there is one.

@stumpylog stumpylog added the enhancement and work-in-progress labels and removed the documentation, non-trivial, and ci-cd labels Sep 18, 2022
@stumpylog
Member Author

stumpylog commented Sep 18, 2022

That's a great idea and I've incorporated it. The NLTK language is no longer configurable and is instead based on the Tesseract language and the set of NLTK languages supported by the 3 tools used. It's annoying there isn't a clear list of languages, so I downloaded the data and created the mapping manually.

And if the language isn't a supported one, the processing falls back to basically the exact same as before.
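
To make the idea concrete, here is a small sketch of such a lookup with a fallback. It is not the mapping from this PR: `EXAMPLE_TESSERACT_TO_NLTK`, `nltk_language_for`, and `select_pipeline` are hypothetical names, and the table is only an illustrative subset of Tesseract three-letter codes paired with NLTK language names.

```python
from typing import Dict, Optional, Tuple

# Illustrative subset only (assumption); the real mapping was built by hand
# from the downloaded NLTK data and covers more languages.
EXAMPLE_TESSERACT_TO_NLTK: Dict[str, str] = {
    "eng": "english",
    "deu": "german",
    "fra": "french",
    "spa": "spanish",
    "ita": "italian",
    "nld": "dutch",
}


def nltk_language_for(ocr_language: str) -> Optional[str]:
    """Return the NLTK language name for a Tesseract code, or None."""
    return EXAMPLE_TESSERACT_TO_NLTK.get(ocr_language)


def select_pipeline(ocr_language: str) -> Tuple[str, Optional[str]]:
    """Pick the NLTK path when the language is known, else the old path."""
    language = nltk_language_for(ocr_language)
    if language is None:
        # Unsupported language: fall back to lowercase + whitespace collapse.
        return ("baseline", None)
    return ("nltk", language)


print(select_pipeline("deu"))
print(select_pipeline("jpn"))
```

Keeping the fallback in one place means an unsupported OCR language never raises; it simply gets the same processing as before.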

I only took one machine learning class in university, but I do recall the preparation of the input made the biggest impact on the accuracy of the results. I have some ideas about how to quantify the changes, once I have some time to do it.

@shamoon shamoon added this to the Future Release milestone Sep 19, 2022
@sukisoft

Pretty big thing, yet it is completely under the hood. Maybe something for 2.0 release?

To me, the mechanism which is used today works pretty fine, but why not optimize it with a better technology when needed. Thanks!

@stumpylog
Member Author

> To me, the mechanism which is used today works pretty fine, but why not optimize it with a better technology when needed.

That is true, it seems to work pretty well even in the basic form. It would be pretty easy to put this behind a flag, so those with more powerful hardware and a desire to use it can enable it, while everyone else defaults to the existing method.

The only expense then would be the nltk library install size, but the data download could also be skipped, so a very minor size increase overall.

@shamoon
Member

shamoon commented Sep 20, 2022

Unless we find this causes a really significant performance hit I think we should keep it on by default but sure, disable-able. My feeling is processing of docs is more important to be accurate / useful than 'fast'. Love the new way language stuff works, thanks!

@tooomm
Contributor

tooomm commented Sep 26, 2022

Does this work smoothly when several OCR languages are set, like "deu+eng"?

@stumpylog
Member Author

The NLTK language selection is based off PAPERLESS_OCR_LANGUAGE instead of PAPERLESS_OCR_LANGUAGES (with the s). That value is a single language code, not anything with pluses. I know, not at all confusing names.

@stumpylog stumpylog force-pushed the feature-improve-machine-learning branch from 855be6a to 22dbfd8 Compare September 26, 2022 16:09
@stumpylog stumpylog changed the title from "RFC: Feature: Improved processing for automatic matching" to "Feature: Improved processing for automatic matching" Sep 26, 2022
@stumpylog stumpylog marked this pull request as ready for review September 26, 2022 16:09
@stumpylog stumpylog requested review from a team as code owners September 26, 2022 16:09
@stumpylog stumpylog self-assigned this Sep 26, 2022
@stumpylog stumpylog removed the work-in-progress label Sep 27, 2022
Member

@shamoon shamoon left a comment

Again works well in my testing, great work as always!

@stumpylog
Member Author

I finally got some data about this using a dataset from Kaggle for multi-label classification (i.e. tags).

  • It does take a little longer to train: 32 s vs 24 s for the whole training dataset.
  • It uses moderately more memory during training: 45 MB vs 29 MB.
  • +1.48% precision, +1.85% recall

So overall, it's not an incredible increase, but the increased time and resources are also pretty moderate.
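
For readers unfamiliar with the metrics, here is one way precision and recall can be computed for multi-label tag predictions. The comment above does not say how the figures were averaged, so micro-averaging is an assumption here, and the function name and sample data are hypothetical.

```python
from typing import Sequence, Set, Tuple


def micro_precision_recall(
    true_tags: Sequence[Set[str]],
    predicted_tags: Sequence[Set[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over per-document tag sets."""
    true_pos = false_pos = false_neg = 0
    for truth, predicted in zip(true_tags, predicted_tags):
        true_pos += len(truth & predicted)   # tags correctly assigned
        false_pos += len(predicted - truth)  # tags assigned but wrong
        false_neg += len(truth - predicted)  # tags that were missed
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Made-up example: two documents with their true and predicted tags.
truth = [{"invoice", "tax"}, {"receipt"}]
predicted = [{"invoice"}, {"receipt", "tax"}]
print(micro_precision_recall(truth, predicted))
```

Precision falls when wrong tags are assigned, and recall falls when correct tags are missed, so reporting both shows whether a change makes matching more cautious or more eager.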

@stumpylog stumpylog force-pushed the feature-improve-machine-learning branch from a1f09d1 to e4bbd89 Compare October 5, 2022 17:58
@stumpylog stumpylog force-pushed the feature-improve-machine-learning branch from 3f6a4df to 7cf1e56 Compare October 10, 2022 15:13
@shamoon
Member

shamoon commented Oct 10, 2022

Any tweaks left or anything? Otherwise let's do it, tired of merge conflicts 😝

@stumpylog stumpylog merged commit dafefa3 into dev Oct 10, 2022
@stumpylog stumpylog deleted the feature-improve-machine-learning branch October 10, 2022 15:58
@github-actions
Contributor

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 17, 2023

5 participants