Feature: Improved processing for automatic matching #1609

stumpylog · 2022-09-16T22:25:41Z

Proposed change

The existing machine learning processing for document content is pretty barebones. The content is lowercased and multiple whitespace characters are condensed to a single space. To me, this has one glaring issue: it makes no attempt to distill the content down to only meaningful words.

With this PR, the Natural Language Toolkit (NLTK) is used to process the text further, in the hope of using only the meaningful data in a document for matching. The processing now tokenizes the content, removes stop words, and stems the remaining words. All of this new processing comes from NLTK.
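As a rough illustration of the three steps, here is a minimal standard-library sketch. It is not the PR's code: `STOP_WORDS`, `SUFFIXES`, `toy_stem` and `preprocess` are hypothetical names, the stop-word set is a tiny hand-picked sample, and the suffix stripper is only a toy stand-in for a real stemmer. With NLTK, these roles would presumably be filled by `nltk.tokenize.word_tokenize`, `nltk.corpus.stopwords` and `nltk.stem.SnowballStemmer`.

```python
import re

# Tiny illustrative stop-word set; a stand-in for a real list such as NLTK's.
STOP_WORDS = frozenset(
    {"a", "an", "and", "are", "is", "of", "the", "to", "was", "were"}
)

# Toy suffixes, checked in this order; a stand-in for a real stemmer
# such as Snowball, which applies far more careful rules.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(content: str) -> str:
    """Lowercase, tokenize, drop stop words, stem, and rejoin with spaces."""
    # Simplistic tokenizer: runs of ASCII letters and digits only.
    tokens = re.findall(r"[a-z0-9]+", content.lower())
    kept = [token for token in tokens if token not in STOP_WORDS]
    return " ".join(toy_stem(token) for token in kept)


print(preprocess("The invoices were mailed to the customers"))
```

The point of the sketch is only the shape of the pipeline: the classifier ends up seeing a shorter string of normalized word stems instead of the full lowercased text.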

In my own testing, the processing works just fine, with items matched as I would expect. It's always hard to quantify something like this. Maybe some future work can have a training set and test set to check the accuracy quantitatively.

Reading:
1. Dropping common terms: stop words
2. Snowball Stemmer

TODOs:

Bare metal instructions

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Other (please explain)

Checklist:

I have read & agree with the contributing guidelines.
If applicable, I have tested my code for new features & regressions on both mobile & desktop devices, using the latest version of major browsers.
If applicable, I have checked that all tests pass, see documentation.
I have run all pre-commit hooks, see documentation.
I have made corresponding changes to the documentation as needed.
I have checked my modifications for any breaking changes.

coveralls · 2022-09-16T22:33:59Z

Pull Request Test Coverage Report for Build 3113778764

0 of 0 changed or added relevant lines in 0 files are covered.
53 unchanged lines in 2 files lost coverage.
Overall coverage decreased (-0.3%) to 92.088%

Files with Coverage Reduction	New Missed Lines	%
documents/classifier.py	23	88.99%
paperless/settings.py	30	79.62%

Totals
Change from base Build 3068778301:	-0.3%
Covered Lines:	4923
Relevant Lines:	5346

💛 - Coveralls

shamoon · 2022-09-17T15:33:36Z

This is awesome, stumpy! I can't say that it's in my wheelhouse from a programming perspective, but as an end user it sounds great.

One question I had: what happens when the user is using a language other than English? Should PAPERLESS_NLTK_LANG default to PAPERLESS_OCR_LANGUAGE? And what happens if you parse a non-English document with English stop words? Maybe that's not a problem, since it presumably won't find them? Or do we explicitly want to skip this step if the language isn't supported?

Also, I had trouble tracking down a list of supported languages for NLTK; maybe we could link to that in the docs if there is one.

stumpylog · 2022-09-18T15:52:17Z

That's a great idea and I've incorporated it. The NLTK language is no longer configurable and is instead based on the Tesseract language and the set of NLTK languages supported by the 3 tools used. It's annoying there isn't a clear list of languages, so I downloaded the data and created the mapping manually.

And if the language isn't a supported one, the processing falls back to basically the exact same as before.
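To make the mapping-with-fallback idea concrete, here is a minimal sketch under some assumptions. The table below is an illustrative subset and not the mapping from this PR, and `TESSERACT_TO_NLTK`, `nltk_language_for`, `basic_preprocess`, `preprocess` and the `advanced` callable are all hypothetical names.

```python
import re
from typing import Callable, Optional

# Illustrative subset only: Tesseract language codes mapped to the
# language names NLTK uses. The PR's real mapping is not reproduced here.
TESSERACT_TO_NLTK = {
    "eng": "english",
    "deu": "german",
    "fra": "french",
    "spa": "spanish",
    "ita": "italian",
    "nld": "dutch",
}


def nltk_language_for(ocr_language: str) -> Optional[str]:
    """Return the NLTK language name, or None when there is no mapping."""
    return TESSERACT_TO_NLTK.get(ocr_language.strip().lower())


def basic_preprocess(content: str) -> str:
    """Roughly the earlier behaviour: lowercase and collapse whitespace."""
    # Trimming the ends is an extra step added for this sketch.
    return re.sub(r"\s+", " ", content.lower()).strip()


def preprocess(
    content: str,
    ocr_language: str,
    advanced: Callable[[str, str], str],
) -> str:
    """Apply `advanced` only when the OCR language maps to an NLTK one."""
    content = basic_preprocess(content)
    language = nltk_language_for(ocr_language)
    if language is None:
        # Unsupported language: keep only the basic processing.
        return content
    return advanced(content, language)
```

Passing the advanced step in as a callable keeps the fallback path free of any NLTK dependency, which also makes it easy to test without downloading data.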

I only took one machine learning class in university, but I do recall the preparation of the input made the biggest impact on the accuracy of the results. I have some ideas about how to quantify the changes, once I have some time to do it.

sukisoft · 2022-09-19T11:26:41Z

Pretty big thing, yet it is completely under the hood. Maybe something for the 2.0 release?

To me, the mechanism which is used today works pretty fine, but why not optimize it with a better technology when needed. Thanks!

stumpylog · 2022-09-20T14:27:27Z

To me, the mechanism which is used today works pretty fine, but why not optimize it with a better technology when needed.

That is true; it seems to work pretty well even in the basic form. It would be pretty easy to put this behind a flag, so those with more powerful hardware and a desire to use it can, while everyone else defaults to the existing method.

The only expense then would be the nltk library install size, but the data download could also be skipped, so a very minor size increase overall.

shamoon · 2022-09-20T16:33:57Z

Unless we find this causes a really significant performance hit I think we should keep it on by default but sure, disable-able. My feeling is processing of docs is more important to be accurate / useful than 'fast'. Love the new way language stuff works, thanks!
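A flag like the one discussed could be read from the environment roughly as follows. This is only a sketch: `env_flag` and the variable name `EXAMPLE_ENABLE_NLTK` are hypothetical and are not taken from this PR.

```python
import os
from typing import Mapping


def env_flag(
    name: str,
    default: bool,
    environ: Mapping[str, str] = os.environ,
) -> bool:
    """Interpret an environment variable as a boolean switch."""
    raw = environ.get(name)
    if raw is None:
        # Variable not set: use the default.
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# On by default, but possible to switch off on low-power devices.
# EXAMPLE_ENABLE_NLTK is a made-up name for illustration.
NLTK_ENABLED = env_flag("EXAMPLE_ENABLE_NLTK", default=True)
```

Accepting the environment as a parameter means the helper can be exercised with a plain dict in tests.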

tooomm · 2022-09-26T06:53:54Z

Does this work smoothly when several OCR languages are set, like "deu+eng"?

stumpylog · 2022-09-26T13:45:58Z

The NLTK language selection is based off PAPERLESS_OCR_LANGUAGE instead of PAPERLESS_OCR_LANGUAGES (with the s). That value is a single language code, not anything with pluses. I know, not at all confusing names.

shamoon

Again works well in my testing, great work as always!

stumpylog · 2022-09-29T22:58:15Z

I finally got some data about this, using a dataset from Kaggle for multi-label classification (i.e. tags).

Training takes a little longer: 32s vs 24s for the whole training dataset
Training uses moderately more memory: 45MB vs 29MB
+1.48% precision, +1.85% recall

So overall, it's not an incredible increase, but the increased time and resources are also pretty moderate.
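For anyone wanting to repeat a comparison like this, precision and recall for multi-label output can be computed along these lines. This is a sketch only: the comment above does not say how the scores were averaged, so micro-averaging is an assumption here, and `micro_precision_recall` is a hypothetical helper.

```python
from typing import Iterable, Set, Tuple


def micro_precision_recall(
    expected: Iterable[Set[str]],
    predicted: Iterable[Set[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over per-document tag sets."""
    true_pos = false_pos = false_neg = 0
    for want, got in zip(expected, predicted):
        true_pos += len(want & got)   # tags predicted and correct
        false_pos += len(got - want)  # tags predicted but wrong
        false_neg += len(want - got)  # tags that were missed
    predicted_total = true_pos + false_pos
    expected_total = true_pos + false_neg
    precision = true_pos / predicted_total if predicted_total else 0.0
    recall = true_pos / expected_total if expected_total else 0.0
    return precision, recall
```

Running it once with the old preprocessing and once with the new one on the same held-out documents would give the two pairs of numbers to compare.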

shamoon · 2022-10-10T15:57:25Z

Any tweaks left or anything? Otherwise let's do it, tired of merge conflicts 😝

github-actions · 2023-04-17T10:04:02Z

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

paperless-ngx-secretary bot added backend ci-cd documentation Improvements or additions to documentation non-trivial Requires approval by several team members labels Sep 16, 2022

stumpylog added enhancement New feature work-in-progress Pull request which needs some changes before being able to merge and removed documentation Improvements or additions to documentation non-trivial Requires approval by several team members ci-cd labels Sep 18, 2022

shamoon added this to the Future Release milestone Sep 19, 2022

stumpylog force-pushed the feature-improve-machine-learning branch from 855be6a to 22dbfd8 Compare September 26, 2022 16:09

stumpylog changed the title ~~RFC: Feature: Improved processing for automatic matching~~ Feature: Improved processing for automatic matching Sep 26, 2022

stumpylog marked this pull request as ready for review September 26, 2022 16:09

stumpylog requested review from a team as code owners September 26, 2022 16:09

stumpylog self-assigned this Sep 26, 2022

stumpylog removed the work-in-progress Pull request which needs some changes before being able to merge label Sep 27, 2022

shamoon approved these changes Sep 29, 2022

stumpylog force-pushed the feature-improve-machine-learning branch from a1f09d1 to e4bbd89 Compare October 5, 2022 17:58

stumpylog added 2 commits October 10, 2022 07:30

Updates the pre-processing of document content to be much more robust, with tokenization, stemming and stop word removal

de390f3

Fixes CI unit testing

1c78022

stumpylog added 9 commits October 10, 2022 07:30

Mock out the nltk portions so the data doesn't need to be downloaded

8268739

Missed one mock

84d3a82

Fixes the download and usage of the downloaded data

f89196f

Allows configuration of the NLTK processing language

6e2b0a7

Changes the NLTK language to be based on the Tesseract OCR language, with fallback to the default processing

04a042d

Allows disabling NLTK, adds it as a consideration for low power devices

5328322

Account for plusses in the OCR language setting

4582736

Adds skipping of NLTK data download if the feature appears disabled

2cfd75e

Adds step to bare metal setup regarding downloading the required NLTK data

7cf1e56

stumpylog force-pushed the feature-improve-machine-learning branch from 3f6a4df to 7cf1e56 Compare October 10, 2022 15:13

stumpylog merged commit dafefa3 into dev Oct 10, 2022

stumpylog deleted the feature-improve-machine-learning branch October 10, 2022 15:58

github-actions bot locked as resolved and limited conversation to collaborators Apr 17, 2023
