[BUG] Barcode splitting not detecting all barcodes #2385

Closed
Rondal opened this issue Jan 8, 2023 · 14 comments · Fixed by #2468
Labels
bug Bug report or a Bug-fix dependencies Pull requests that update a dependency file
Milestone

Comments

@Rondal

Rondal commented Jan 8, 2023

Description

With Paperless 1.11.3, barcode splitting is no longer working for me. It was working with my specific setup as of 1.6.0 or 1.7.0. I can't say for sure about the versions in between, because I mostly worked through old files that were already split.

The split is just not happening and I do not see any detection in the log. Interestingly, if I scan a document that contains other, unrelated barcodes, they are detected. I can't share those because they are payment codes with personal data in them.

I'm aware of the work that was done, for example in #1953 by @stumpylog, so I also tested with the most recent dev image on Docker Hub. No change.

The pdfminer.six error visible in the log does not occur if I re-feed the finished, downloaded document, but the splitting problem stays the same.

Steps to reproduce

  • Scan the file
  • Let it process
  • See that it does not get split

Webserver logs

webserver_1  | [2023-01-08 20:59:15,444] [INFO] [celery.apps.worker] celery@39f45dcafe55 ready.
webserver_1  | [2023-01-08 21:00:00,049] [INFO] [celery.beat] Scheduler: Sending due task Check all e-mail accounts (paperless_mail.tasks.process_mail_accounts)
webserver_1  | [2023-01-08 21:00:00,071] [INFO] [celery.worker.strategy] Task paperless_mail.tasks.process_mail_accounts[cb89622b-a6cb-4a16-beaa-ca9a91229c2b] received
webserver_1  | [2023-01-08 21:00:00,094] [INFO] [celery.app.trace] Task paperless_mail.tasks.process_mail_accounts[cb89622b-a6cb-4a16-beaa-ca9a91229c2b] succeeded in 0.02098793600453064s: 'No new documents were added.'
webserver_1  | [2023-01-08 21:00:23,937] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/consume/ads2800w_au_20230108_002067.pdf to the task queue.
webserver_1  | [2023-01-08 21:00:23,986] [INFO] [celery.worker.strategy] Task documents.tasks.consume_file[5a9d7200-b165-4787-8ab7-7235d142ff07] received
webserver_1  | [2023-01-08 21:00:26,655] [INFO] [paperless.consumer] Consuming ads2800w_au_20230108_002067.pdf
webserver_1  | [2023-01-08 21:00:27,024] [WARNING] [paperless.parsing.tesseract] Error while getting text from PDF document with pdfminer.six
webserver_1  | Traceback (most recent call last):
webserver_1  |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 140, in extract_text
webserver_1  |     lang = detect(stripped)
webserver_1  |   File "/usr/local/lib/python3.9/site-packages/langdetect/detector_factory.py", line 130, in detect
webserver_1  |     return detector.detect()
webserver_1  |   File "/usr/local/lib/python3.9/site-packages/langdetect/detector.py", line 136, in detect
webserver_1  |     probabilities = self.get_probabilities()
webserver_1  |   File "/usr/local/lib/python3.9/site-packages/langdetect/detector.py", line 143, in get_probabilities
webserver_1  |     self._detect_block()
webserver_1  |   File "/usr/local/lib/python3.9/site-packages/langdetect/detector.py", line 150, in _detect_block
webserver_1  |     raise LangDetectException(ErrorCode.CantDetectError, 'No features in text.')
webserver_1  | langdetect.lang_detect_exception.LangDetectException: No features in text.
webserver_1  | [2023-01-08 21:00:27,273] [INFO] [ocrmypdf._sync] Start processing 2 pages concurrently
webserver_1  | [2023-01-08 21:00:31,194] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Too few characters. Skipping this page
webserver_1  | [2023-01-08 21:00:31,195] [ERROR] [ocrmypdf._exec.tesseract] [tesseract] Error during processing.
webserver_1  | [2023-01-08 21:00:31,195] [INFO] [ocrmypdf._pipeline] page is facing ⇧, confidence 0.00 - no change
webserver_1  | [2023-01-08 21:00:31,489] [INFO] [ocrmypdf._pipeline] page is facing ⇧, confidence 16.50 - rotation appears correct
webserver_1  | [2023-01-08 21:01:00,133] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Image too small to scale!! (1x36 vs min width of 3)
webserver_1  | [2023-01-08 21:01:00,133] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Line cannot be recognized!!
webserver_1  | [2023-01-08 21:01:00,133] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Image too small to scale!! (1x36 vs min width of 3)
webserver_1  | [2023-01-08 21:01:00,133] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Line cannot be recognized!!
webserver_1  | [2023-01-08 21:01:00,133] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Image too small to scale!! (1x36 vs min width of 3)
webserver_1  | [2023-01-08 21:01:00,133] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Line cannot be recognized!!
webserver_1  | [2023-01-08 21:01:01,206] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
webserver_1  | [2023-01-08 21:01:01,207] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Image too small to scale!! (1x36 vs min width of 3)
webserver_1  | [2023-01-08 21:01:01,207] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Line cannot be recognized!!
webserver_1  | [2023-01-08 21:01:01,207] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Image too small to scale!! (1x36 vs min width of 3)
webserver_1  | [2023-01-08 21:01:01,207] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Line cannot be recognized!!
webserver_1  | [2023-01-08 21:01:04,057] [INFO] [ocrmypdf._pipeline] page is facing ⇧, confidence 16.41 - rotation appears correct
webserver_1  | [2023-01-08 21:01:30,932] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Image too small to scale!! (2x36 vs min width of 3)
webserver_1  | [2023-01-08 21:01:30,932] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Line cannot be recognized!!
webserver_1  | [2023-01-08 21:01:30,932] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Image too small to scale!! (2x36 vs min width of 3)
webserver_1  | [2023-01-08 21:01:30,932] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Line cannot be recognized!!
webserver_1  | [2023-01-08 21:01:30,948] [INFO] [ocrmypdf._sync] Postprocessing...
webserver_1  | [2023-01-08 21:01:37,320] [INFO] [ocrmypdf._pipeline] Optimize ratio: 1.41 savings: 28.9%
webserver_1  | [2023-01-08 21:01:37,326] [INFO] [ocrmypdf._sync] Output file is a PDF/A-2B (as expected)
webserver_1  | [2023-01-08 21:01:42,968] [INFO] [paperless.consumer] Document 2022-04-20 ads2800w_au_20230108_002067 consumption finished
webserver_1  | [2023-01-08 21:01:42,984] [INFO] [celery.app.trace] Task documents.tasks.consume_file[5a9d7200-b165-4787-8ab7-7235d142ff07] succeeded in 78.99541163595859s: 'Success. New document id 1825 created'
webserver_1  | [2023-01-08 21:05:00,097] [INFO] [celery.beat] Scheduler: Sending due task Train the classifier (documents.tasks.train_classifier)
webserver_1  | [2023-01-08 21:05:00,099] [INFO] [celery.worker.strategy] Task documents.tasks.train_classifier[91ff8d02-1b68-461e-92b6-826456a299c6] received
webserver_1  | [2023-01-08 21:05:18,581] [INFO] [celery.app.trace] Task documents.tasks.train_classifier[91ff8d02-1b68-461e-92b6-826456a299c6] succeeded in 18.479739824018907s: None

Browser logs

No response

Paperless-ngx version

1.11.3

Host OS

Ubuntu 22.04.1 LTS / docker

Installation method

Docker - official image

Browser

Chrome

Configuration changes

No response

Other

No response

@Rondal Rondal added bug Bug report or a Bug-fix unconfirmed labels Jan 8, 2023
@Rondal
Author

Rondal commented Jan 8, 2023

Relevant environment:

dms@paperless:~/paperless-ngx$ cat docker-compose.env | grep CONSUMER
PAPERLESS_CONSUMER_RECURSIVE=true
PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=true
PAPERLESS_CONSUMER_POLLING=5
PAPERLESS_CONSUMER_IGNORE_PATTERNS=[".DS_STORE/*", "._*", ".stfolder/*", "@eaDir/*", ".DS_Store"]
PAPERLESS_CONSUMER_DELETE_DUPLICATES=true
PAPERLESS_CONSUMER_ENABLE_BARCODES=true
PAPERLESS_CONSUMER_BARCODE_STRING=ADAR-NEXTDOC
PAPERLESS_CONSUMER_USE_LEGACY_DETECTION=true

@Rondal
Author

Rondal commented Jan 8, 2023

@stumpylog
Member

Probably some image format pikepdf thinks it handles, but doesn't actually. It's getting annoying at this point.

@Rondal
Author

Rondal commented Jan 8, 2023

Shouldn’t your PR mentioned above fix this?

@stumpylog
Member

There's no exception raised during the scanning, so there's no reason to fall back.

This seems like a weird pyzbar issue actually. The image extracts fine:
page_1_image_0

But nothing is found

>>> from pyzbar import pyzbar
>>> from PIL import Image
>>> image = Image.open("page_1_image_0.png")
>>> pyzbar.decode(image)
[]

Honestly, even more annoying

@Rondal
Author

Rondal commented Jan 8, 2023

Just once I’d like to open a bug just to be told that I’ve made a stupid mistake ;)

So… it was working in the past with that specific page. I don’t think it’s the scanner, because I do see barcodes parsed correctly with the same settings (mentioned above). So is it something about this specific page? Too many barcodes? But why is there no message at all?

@Rondal
Author

Rondal commented Jan 8, 2023

Fun…

Found two bug reports for pyzbar:

NaturalHistoryMuseum/pyzbar#75

and

NaturalHistoryMuseum/pyzbar#63

Neither has been worked on, but the second one gave me the impression that it might have to do with the dimensions of the pictures. I grabbed the PNG from above and quickly resized it from 600dpi to 300 and 150.

Results below. Interestingly, it found even more codes on the smallest one. So it seems to be stopping after a specific number of… pixels?

>>> from PIL import Image
>>> from pyzbar import pyzbar
>>> l = Image.open("600dpi.png")
>>> m = Image.open("300dpi.png")
>>> s = Image.open("150dpi.png")
>>> pyzbar.decode(l)
[]
>>> pyzbar.decode(m)
[Decoded(data=b'ADAR-NEXTDOC', type='QRCODE', rect=Rect(left=945, top=848, width=533, height=526), polygon=[Point(x=945, y=848), Point(x=945, y=1374), Point(x=1478, y=1374), Point(x=1478, y=848)], quality=1, orientation='UP'), Decoded(data=b'ADAR-NEXTDOC', type='QRCODE', rect=Rect(left=1661, top=273, width=537, height=524), polygon=[Point(x=1661, y=273), Point(x=1663, y=797), Point(x=2198, y=797), Point(x=2195, y=273)], quality=1, orientation='UP')]
>>> pyzbar.decode(s)
[Decoded(data=b'ADAR-NEXTDOC', type='QRCODE', rect=Rect(left=831, top=136, width=267, height=263), polygon=[Point(x=831, y=136), Point(x=831, y=399), Point(x=1098, y=398), Point(x=1097, y=136)], quality=1, orientation='UP'), Decoded(data=b'ADAR-NEXTDOC', type='QRCODE', rect=Rect(left=111, top=136, width=268, height=264), polygon=[Point(x=111, y=136), Point(x=113, y=400), Point(x=379, y=400), Point(x=378, y=136)], quality=1, orientation='UP'), Decoded(data=b'ADAR-NEXTDOC', type='QRCODE', rect=Rect(left=472, top=424, width=267, height=263), polygon=[Point(x=472, y=424), Point(x=473, y=687), Point(x=739, y=687), Point(x=738, y=424)], quality=1, orientation='UP'), Decoded(data=b'ADAR-NEXTDOC', type='QRCODE', rect=Rect(left=120, top=875, width=392, height=386), polygon=[Point(x=120, y=876), Point(x=121, y=1261), Point(x=512, y=1261), Point(x=512, y=875)], quality=1, orientation='UP'), Decoded(data=b'ADAR-NEXTDOC', type='QRCODE', rect=Rect(left=621, top=1104, width=472, height=461), polygon=[Point(x=621, y=1105), Point(x=622, y=1565), Point(x=1093, y=1564), Point(x=1090, y=1104)], quality=1, orientation='UP'), Decoded(data=b'ADAR-NEXTDOC', type='CODE128', rect=Rect(left=169, top=910, width=860, height=485), polygon=[Point(x=169, y=1371), Point(x=169, y=1395), Point(x=539, y=1366), Point(x=1029, y=1002), Point(x=1029, y=910), Point(x=658, y=957)], quality=57, orientation='UP')]
>>> 
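
The resize-and-retry experiment above can also be scripted so that the same image is checked at several scales in one go. This is only a minimal sketch, assuming Pillow is installed: the helper name `count_by_scale` and its `decode` parameter are hypothetical (not part of pyzbar or Paperless), and `pyzbar.decode` is used only when no decoder is passed in.

```python
from PIL import Image


def count_by_scale(image, scales=(1.0, 0.5, 0.25), decode=None):
    """Return {scale: number of barcodes found} for each scale factor."""
    if decode is None:
        # Imported lazily so the helper can be exercised with a stub decoder.
        from pyzbar import pyzbar
        decode = pyzbar.decode

    counts = {}
    for scale in scales:
        width = max(1, round(image.width * scale))
        height = max(1, round(image.height * scale))
        # resize() returns a new image; the original is left untouched.
        resized = image.resize((width, height), Image.LANCZOS)
        counts[scale] = len(decode(resized))
    return counts
```

Calling `count_by_scale(Image.open("600dpi.png"))` would report one count per scale, which makes it easier to see at which size detection starts to work.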

@stumpylog
Member

Sigh. That's a unique finding. I checked the extracted image with zbarimg, which is also using libzbar for barcodes. It fails to find anything, unless I run some random convert commands against the extracted file.

I'm not sure yet what to do, but I'll have to have a think on it

@Rondal
Author

Rondal commented Jan 9, 2023

I did a bit more testing, based on yesterday's script. I used 3 different images (the original as above, one with only one QR code in the upper left, and one with only one QR code in the lower right), each in 4 different resolutions (600, 300, 150 and 72dpi). Then I counted how many barcodes were detected:

Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyzbar import pyzbar
>>> from PIL import Image
>>> list = ['600dpi', '300dpi', '150dpi', '72dpi', 'lower_right_600dpi', 'lower_right_300dpi', 'lower_right_150dpi', 'lower_right_72dpi', 'upper_left_600dpi', 'upper_left_300dpi', 'upper_left_150dpi', 'upper_left_72dpi']
>>> for name in list:
...     img = Image.open(name + ".png")
...     print(name + ":")
...     print(len(pyzbar.decode(img)))
...
600dpi:
0
300dpi:
3
150dpi:
6
72dpi:
4
lower_right_600dpi:
1
lower_right_300dpi:
1
lower_right_150dpi:
1
lower_right_72dpi:
1
upper_left_600dpi:
0
upper_left_300dpi:
1
upper_left_150dpi:
1
upper_left_72dpi:
1
>>>

Please note that the DPI values are not necessarily correct. I basically scaled the original (600dpi) by 0.5, 0.25 and 0.125 to get my downsized versions.

Interestingly, with the original (7 barcodes) it detects nothing at all. With 150dpi it detects all but one, but with 72dpi it only detects 4. The 72dpi version is still quite readable, and my mobile has no problem scanning the codes. Of the other two examples, the lower-right one is detected correctly at every resolution, but the upper-left one is not.

I have no idea what it is doing there. To me it seems that the following factors are relevant:

  • Size of the image
  • Size of the code
  • Position of the code
  • Number of codes

The issue trackers of pyzbar and libzbar mention issues that seem related, but I found nothing that had been worked on and no actionable information in there.

All test files attached.

upper_left_300dpi
upper_left_150dpi
upper_left_72dpi
72dpi
150dpi
300dpi
600dpi
lower_right_600dpi
lower_right_300dpi
lower_right_150dpi
lower_right_72dpi
upper_left_600dpi

Edit: Disabled image previews.

@Rondal
Author

Rondal commented Jan 10, 2023

All right. I did some more tests and added yet another file to the test pool. I modified the above test script to see how to get the best results:

from pyzbar import pyzbar
from PIL import Image, ImageFilter

# Test images: the original page plus the single-code variants, each at several scales
names = ['600dpi', '300dpi', '150dpi', '72dpi', 'lower_right_600dpi', 'lower_right_300dpi', 'lower_right_150dpi', 'lower_right_72dpi', 'upper_left_600dpi', 'upper_left_300dpi', 'upper_left_150dpi', 'upper_left_72dpi', 'small_600dpi', 'small_300dpi', 'small_150dpi']

for name in names:
    print(name + ":")
    img = Image.open(name + ".png")
    # Note: Image.filter() returns a new image and the result is discarded here,
    # so the blur is not applied; only the in-place thumbnail() below changes img.
    img.filter(ImageFilter.GaussianBlur(2))
    img.thumbnail((1800,1800))
    print(len(pyzbar.decode(img)))

With this combination of resize and blur I got most of the cases right:

600dpi:
6
300dpi:
6
150dpi:
6
72dpi:
4
lower_right_600dpi:
1
lower_right_300dpi:
1
lower_right_150dpi:
1
lower_right_72dpi:
1
upper_left_600dpi:
1
upper_left_300dpi:
1
upper_left_150dpi:
1
upper_left_72dpi:
1
small_600dpi:
1
small_300dpi:
0
small_150dpi:
1

The new 'small' variant is an image with one small QR code on it. This result seems to be okay, but with the original image it still only detects 6 out of 7 barcodes. The execution time for the blur and resize is a factor here, but probably not an important one compared to the overall execution time of document parsing:

With filter:

$ time python3 time1.py
600dpi:
6

real	0m2.017s
user	0m1.859s
sys	0m0.156s

Without filter:

$ time python3 time2.py
600dpi:
0

real	0m1.263s
user	0m1.179s
sys	0m0.084s
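
Timing the whole script with `time` also counts interpreter start-up, imports and loading the PNG. To look at the preprocessing cost on its own, something like the following could be used. It is only a sketch: the helper name `time_preprocessing` and its parameters are hypothetical, and unlike the script above it keeps the result of `Image.filter()` (which returns a new image), so the blur is actually included in the measurement.

```python
import time

from PIL import Image, ImageFilter


def time_preprocessing(image, max_size=(1800, 1800), blur_radius=2, repeats=3):
    """Return the best-of-N wall-clock seconds for blur plus downscale."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # filter() returns a new image, so the input is not modified.
        work = image.filter(ImageFilter.GaussianBlur(blur_radius))
        # thumbnail() shrinks the blurred copy in place.
        work.thumbnail(max_size)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, `time_preprocessing(Image.open("600dpi.png"))` gives the preprocessing time alone, without the decode step.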

Newly created small test files:
small_300dpi
small_600dpi
small_150dpi

@Rondal
Author

Rondal commented Jan 10, 2023

Not sure if the following change has side effects, but it could help a bit with the problem even though this is NOT a fix.

diff --git a/src/documents/barcodes.py b/src/documents/barcodes.py
--- src/documents/barcodes.py
+++ src/documents/barcodes.py
@@ -15,8 +15,9 @@
 from pikepdf import Pdf
 from pikepdf import PdfImage
 from PIL import Image
 from PIL import ImageSequence
+from PIL import ImageFilter
 from pyzbar import pyzbar
 
 logger = logging.getLogger("paperless.barcodes")
 
@@ -45,8 +46,12 @@
     Read any barcodes contained in image
     Returns a list containing all found barcodes
     """
     barcodes = []
+    # Prepare image for decode
+    image.filter(ImageFilter.GaussianBlur(2))
+    image.thumbnail((1800,1800))
+    
     # Decode the barcode image
     detected_barcodes = pyzbar.decode(image)
 
     if detected_barcodes:
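
One caveat with the patch above: in Pillow, `Image.filter()` returns a new image instead of modifying the one it is called on, while `Image.thumbnail()` resizes in place. As written, the blurred result is thrown away and the caller's image object is shrunk. A minimal sketch of a variant that works on a copy follows; the helper name `prepare_for_decode` is hypothetical and not part of paperless-ngx, and the grayscale conversion is an added assumption to keep the blur working for 1-bit and palette scans.

```python
from PIL import Image, ImageFilter


def prepare_for_decode(image, max_size=(1800, 1800), blur_radius=2):
    """Return a blurred, downscaled grayscale copy; the input is not modified."""
    # convert() always returns a new image, so the caller's image stays intact.
    prepared = image.convert("L")
    # filter() returns a new image, so the result has to be kept.
    prepared = prepared.filter(ImageFilter.GaussianBlur(blur_radius))
    # thumbnail() resizes in place, keeps the aspect ratio and never enlarges.
    prepared.thumbnail(max_size)
    return prepared
```

The decode call would then become `pyzbar.decode(prepare_for_decode(image))`. Whether the blur actually improves detection would need to be re-tested, since the counts reported earlier were produced without it taking effect.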

@stumpylog stumpylog added dependencies Pull requests that update a dependency file and removed unconfirmed labels Jan 10, 2023
@stumpylog stumpylog added this to the Next Release milestone Jan 11, 2023
@stumpylog
Member

I think this is essentially NaturalHistoryMuseum/pyzbar/issues/63, given how playing around with scaling and size seems to have the largest effect.

And an interesting blog post on the subject: https://kdmurray.id.au/post/2022-03-21_decode-qrcodes/

I'm looking into what can be done on our side, ideally without turning into an image processor and taking up lots of time and memory...
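
One cheap shape such a mitigation could take is to decode the image as-is first and only downscale when nothing is found, so the common case pays no extra cost. The sketch below illustrates that idea only; it is not the change that eventually landed, and the name `decode_with_fallback`, the `decode` parameter and the size steps are all hypothetical.

```python
from PIL import Image


def decode_with_fallback(image: Image.Image, decode=None,
                         max_sizes=((1800, 1800), (900, 900))):
    """Decode at full size first, then retry on smaller copies if nothing is found."""
    if decode is None:
        # Imported lazily so the helper can be exercised with a stub decoder.
        from pyzbar import pyzbar
        decode = pyzbar.decode

    found = decode(image)
    if found:
        return found

    for max_width, max_height in max_sizes:
        if image.width <= max_width and image.height <= max_height:
            continue  # already this small, a retry would see the same pixels
        smaller = image.copy()
        smaller.thumbnail((max_width, max_height))  # in place, on the copy
        found = decode(smaller)
        if found:
            return found
    return []
```

Note that any positions reported by the decoder refer to the downscaled copy, which does not matter when only the decoded data is needed to find separator pages.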

@stumpylog stumpylog changed the title from "[BUG] Barcode splitting no longer working" to "[BUG] Barcode splitting not detecting all barcodes" Jan 17, 2023
@stumpylog
Member

With #2468, 6 of the 7 barcodes are detected now and all existing barcodes also worked. Unfortunately, unless libzbar and/or pyzbar get some bugfixes around image size, I think that's the best result we'll get.

@github-actions
Contributor

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 15, 2023