[BUG] Barcode splitting not detecting all barcodes #2385

Rondal · 2023-01-08T20:08:59Z

Description

With Paperless 1.11.3, barcode splitting is no longer working for me. It was working with my specific setup as of 1.6.0 or 1.7.0. I can't say for sure for the versions in between, since I mostly worked through old files which were already split.

The split is just not happening and I do not see any detection in the log. Interestingly, if I scan a document that contains other (unrelated) barcodes, they are detected. I can't show these because they are payment codes with personal data in them.

I'm aware of the work that was done, for example #1953 by @stumpylog, so I also tested with the most recent dev image on Docker Hub. No change.

The pdfminer.six error visible in the log does not occur if I re-feed the finished, downloaded document, but the splitting problem stays the same.

Steps to reproduce

Scan the file
Let it process
See that it does not get split

Webserver logs

webserver_1  | [2023-01-08 20:59:15,444] [INFO] [celery.apps.worker] celery@39f45dcafe55 ready.
webserver_1  | [2023-01-08 21:00:00,049] [INFO] [celery.beat] Scheduler: Sending due task Check all e-mail accounts (paperless_mail.tasks.process_mail_accounts)
webserver_1  | [2023-01-08 21:00:00,071] [INFO] [celery.worker.strategy] Task paperless_mail.tasks.process_mail_accounts[cb89622b-a6cb-4a16-beaa-ca9a91229c2b] received
webserver_1  | [2023-01-08 21:00:00,094] [INFO] [celery.app.trace] Task paperless_mail.tasks.process_mail_accounts[cb89622b-a6cb-4a16-beaa-ca9a91229c2b] succeeded in 0.02098793600453064s: 'No new documents were added.'
webserver_1  | [2023-01-08 21:00:23,937] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/consume/ads2800w_au_20230108_002067.pdf to the task queue.
webserver_1  | [2023-01-08 21:00:23,986] [INFO] [celery.worker.strategy] Task documents.tasks.consume_file[5a9d7200-b165-4787-8ab7-7235d142ff07] received
webserver_1  | [2023-01-08 21:00:26,655] [INFO] [paperless.consumer] Consuming ads2800w_au_20230108_002067.pdf
webserver_1  | [2023-01-08 21:00:27,024] [WARNING] [paperless.parsing.tesseract] Error while getting text from PDF document with pdfminer.six
webserver_1  | Traceback (most recent call last):
webserver_1  |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 140, in extract_text
webserver_1  |     lang = detect(stripped)
webserver_1  |   File "/usr/local/lib/python3.9/site-packages/langdetect/detector_factory.py", line 130, in detect
webserver_1  |     return detector.detect()
webserver_1  |   File "/usr/local/lib/python3.9/site-packages/langdetect/detector.py", line 136, in detect
webserver_1  |     probabilities = self.get_probabilities()
webserver_1  |   File "/usr/local/lib/python3.9/site-packages/langdetect/detector.py", line 143, in get_probabilities
webserver_1  |     self._detect_block()
webserver_1  |   File "/usr/local/lib/python3.9/site-packages/langdetect/detector.py", line 150, in _detect_block
webserver_1  |     raise LangDetectException(ErrorCode.CantDetectError, 'No features in text.')
webserver_1  | langdetect.lang_detect_exception.LangDetectException: No features in text.
webserver_1  | [2023-01-08 21:00:27,273] [INFO] [ocrmypdf._sync] Start processing 2 pages concurrently
webserver_1  | [2023-01-08 21:00:31,194] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Too few characters. Skipping this page
webserver_1  | [2023-01-08 21:00:31,195] [ERROR] [ocrmypdf._exec.tesseract] [tesseract] Error during processing.
webserver_1  | [2023-01-08 21:00:31,195] [INFO] [ocrmypdf._pipeline] page is facing ⇧, confidence 0.00 - no change
webserver_1  | [2023-01-08 21:00:31,489] [INFO] [ocrmypdf._pipeline] page is facing ⇧, confidence 16.50 - rotation appears correct
webserver_1  | [2023-01-08 21:01:00,133] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Image too small to scale!! (1x36 vs min width of 3)
webserver_1  | [2023-01-08 21:01:00,133] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Line cannot be recognized!!
webserver_1  | [2023-01-08 21:01:00,133] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Image too small to scale!! (1x36 vs min width of 3)
webserver_1  | [2023-01-08 21:01:00,133] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Line cannot be recognized!!
webserver_1  | [2023-01-08 21:01:00,133] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Image too small to scale!! (1x36 vs min width of 3)
webserver_1  | [2023-01-08 21:01:00,133] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Line cannot be recognized!!
webserver_1  | [2023-01-08 21:01:01,206] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
webserver_1  | [2023-01-08 21:01:01,207] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Image too small to scale!! (1x36 vs min width of 3)
webserver_1  | [2023-01-08 21:01:01,207] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Line cannot be recognized!!
webserver_1  | [2023-01-08 21:01:01,207] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Image too small to scale!! (1x36 vs min width of 3)
webserver_1  | [2023-01-08 21:01:01,207] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Line cannot be recognized!!
webserver_1  | [2023-01-08 21:01:04,057] [INFO] [ocrmypdf._pipeline] page is facing ⇧, confidence 16.41 - rotation appears correct
webserver_1  | [2023-01-08 21:01:30,932] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Image too small to scale!! (2x36 vs min width of 3)
webserver_1  | [2023-01-08 21:01:30,932] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Line cannot be recognized!!
webserver_1  | [2023-01-08 21:01:30,932] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Image too small to scale!! (2x36 vs min width of 3)
webserver_1  | [2023-01-08 21:01:30,932] [INFO] [ocrmypdf._exec.tesseract] [tesseract] Line cannot be recognized!!
webserver_1  | [2023-01-08 21:01:30,948] [INFO] [ocrmypdf._sync] Postprocessing...
webserver_1  | [2023-01-08 21:01:37,320] [INFO] [ocrmypdf._pipeline] Optimize ratio: 1.41 savings: 28.9%
webserver_1  | [2023-01-08 21:01:37,326] [INFO] [ocrmypdf._sync] Output file is a PDF/A-2B (as expected)
webserver_1  | [2023-01-08 21:01:42,968] [INFO] [paperless.consumer] Document 2022-04-20 ads2800w_au_20230108_002067 consumption finished
webserver_1  | [2023-01-08 21:01:42,984] [INFO] [celery.app.trace] Task documents.tasks.consume_file[5a9d7200-b165-4787-8ab7-7235d142ff07] succeeded in 78.99541163595859s: 'Success. New document id 1825 created'
webserver_1  | [2023-01-08 21:05:00,097] [INFO] [celery.beat] Scheduler: Sending due task Train the classifier (documents.tasks.train_classifier)
webserver_1  | [2023-01-08 21:05:00,099] [INFO] [celery.worker.strategy] Task documents.tasks.train_classifier[91ff8d02-1b68-461e-92b6-826456a299c6] received
webserver_1  | [2023-01-08 21:05:18,581] [INFO] [celery.app.trace] Task documents.tasks.train_classifier[91ff8d02-1b68-461e-92b6-826456a299c6] succeeded in 18.479739824018907s: None

Browser logs

No response

Paperless-ngx version

1.11.3

Host OS

Ubuntu 22.04.1 LTS / docker

Installation method

Docker - official image

Browser

Chrome

Configuration changes

No response

Other

No response

Rondal · 2023-01-08T20:10:51Z

Relevant environment:

dms@paperless:~/paperless-ngx$ cat docker-compose.env | grep CONSUMER
PAPERLESS_CONSUMER_RECURSIVE=true
PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=true
PAPERLESS_CONSUMER_POLLING=5
PAPERLESS_CONSUMER_IGNORE_PATTERNS=[".DS_STORE/*", "._*", ".stfolder/*", "@eaDir/*", ".DS_Store"]
PAPERLESS_CONSUMER_DELETE_DUPLICATES=true
PAPERLESS_CONSUMER_ENABLE_BARCODES=true
PAPERLESS_CONSUMER_BARCODE_STRING=ADAR-NEXTDOC
PAPERLESS_CONSUMER_USE_LEGACY_DETECTION=true

Rondal · 2023-01-08T20:11:45Z

Input and Output files from above.
output_2022-04-20 ads2800w_au_20230108_002067.pdf
input_ads2800w_au_20230108_002067.pdf

stumpylog · 2023-01-08T21:07:23Z

Probably some image format pikepdf thinks it handles, but doesn't actually. It's getting annoying at this point.

Rondal · 2023-01-08T21:23:25Z

Shouldn't your PR mentioned above fix this?

stumpylog · 2023-01-08T21:30:36Z

There's no exception raised during the scanning, so there's no reason to fall back.

This seems like a weird pyzbar issue actually. The image extracts fine, but nothing is found:

>>> from pyzbar import pyzbar
>>> from PIL import Image
>>> image = Image.open("page_1_image_0.png")
>>> pyzbar.decode(image)
[]

Honestly, even more annoying

Rondal · 2023-01-08T22:24:13Z

Just once I'd like to open a bug only to be told that I just made a stupid mistake ;)

So… it was working in the past with that specific page. I don't think it's the scanner because I do see barcodes parsed correctly with the same settings (mentioned above). So something with this specific page? Too many barcodes? But why is there no message at all?

Rondal · 2023-01-08T22:47:34Z

Fun…

Found two bug reports for pyzbar:

NaturalHistoryMuseum/pyzbar#75

and

NaturalHistoryMuseum/pyzbar#63

Neither has been worked on, but the second one gave me the impression that it might have to do with the dimensions of the image. I grabbed the PNG from above and quickly resized it from 600 dpi to 300 and 150.

Results below. Interestingly, it found even more on the smallest one. So it seems to stop after a specific number of… pixels?

>>> from PIL import Image
>>> from pyzbar import pyzbar
>>> l = Image.open("600dpi.png")
>>> m = Image.open("300dpi.png")
>>> s = Image.open("150dpi.png")
>>> pyzbar.decode(l)
[]
>>> pyzbar.decode(m)
[Decoded(data=b'ADAR-NEXTDOC', type='QRCODE', rect=Rect(left=945, top=848, width=533, height=526), polygon=[Point(x=945, y=848), Point(x=945, y=1374), Point(x=1478, y=1374), Point(x=1478, y=848)], quality=1, orientation='UP'), Decoded(data=b'ADAR-NEXTDOC', type='QRCODE', rect=Rect(left=1661, top=273, width=537, height=524), polygon=[Point(x=1661, y=273), Point(x=1663, y=797), Point(x=2198, y=797), Point(x=2195, y=273)], quality=1, orientation='UP')]
>>> pyzbar.decode(s)
[Decoded(data=b'ADAR-NEXTDOC', type='QRCODE', rect=Rect(left=831, top=136, width=267, height=263), polygon=[Point(x=831, y=136), Point(x=831, y=399), Point(x=1098, y=398), Point(x=1097, y=136)], quality=1, orientation='UP'), Decoded(data=b'ADAR-NEXTDOC', type='QRCODE', rect=Rect(left=111, top=136, width=268, height=264), polygon=[Point(x=111, y=136), Point(x=113, y=400), Point(x=379, y=400), Point(x=378, y=136)], quality=1, orientation='UP'), Decoded(data=b'ADAR-NEXTDOC', type='QRCODE', rect=Rect(left=472, top=424, width=267, height=263), polygon=[Point(x=472, y=424), Point(x=473, y=687), Point(x=739, y=687), Point(x=738, y=424)], quality=1, orientation='UP'), Decoded(data=b'ADAR-NEXTDOC', type='QRCODE', rect=Rect(left=120, top=875, width=392, height=386), polygon=[Point(x=120, y=876), Point(x=121, y=1261), Point(x=512, y=1261), Point(x=512, y=875)], quality=1, orientation='UP'), Decoded(data=b'ADAR-NEXTDOC', type='QRCODE', rect=Rect(left=621, top=1104, width=472, height=461), polygon=[Point(x=621, y=1105), Point(x=622, y=1565), Point(x=1093, y=1564), Point(x=1090, y=1104)], quality=1, orientation='UP'), Decoded(data=b'ADAR-NEXTDOC', type='CODE128', rect=Rect(left=169, top=910, width=860, height=485), polygon=[Point(x=169, y=1371), Point(x=169, y=1395), Point(x=539, y=1366), Point(x=1029, y=1002), Point(x=1029, y=910), Point(x=658, y=957)], quality=57, orientation='UP')]
>>>

stumpylog · 2023-01-08T23:35:05Z

Sigh. That's a unique finding. I checked the extracted image with zbarimg, which is also using libzbar for barcodes. It fails to find anything, unless I run some random convert commands against the extracted file.

I'm not sure yet what to do, but I'll have to have a think on it

Rondal · 2023-01-09T19:02:29Z

I did a bit more testing, based on yesterday's script. I used 3 different images (the original as above, one with a single QR code in the upper left, and one with a single QR code in the lower right), each in 4 different resolutions (600, 300, 150 and 72 dpi). Then I counted how many barcodes were detected:

Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyzbar import pyzbar
>>> from PIL import Image
>>> list = ['600dpi', '300dpi', '150dpi', '72dpi', 'lower_right_600dpi', 'lower_right_300dpi', 'lower_right_150dpi', 'lower_right_72dpi', 'upper_left_600dpi', 'upper_left_300dpi', 'upper_left_150dpi', 'upper_left_72dpi']
>>> for name in list:
...     img = Image.open(name + ".png")
...     print(name + ":")
...     print(len(pyzbar.decode(img)))
...
600dpi:
0
300dpi:
3
150dpi:
6
72dpi:
4
lower_right_600dpi:
1
lower_right_300dpi:
1
lower_right_150dpi:
1
lower_right_72dpi:
1
upper_left_600dpi:
0
upper_left_300dpi:
1
upper_left_150dpi:
1
upper_left_72dpi:
1
>>>

Please note that the dpi values are not necessarily correct. I simply scaled the original (600 dpi) by 0.5, 0.25 and 0.125 to get my downsized versions.
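
To make the scale factors concrete, here is a small worked calculation. It assumes a hypothetical page of 4960 x 7016 pixels at 600 dpi (roughly A4; the real scan dimensions are not stated in this thread), and the function name `scaled_variants` is made up for illustration. If the 0.125 factor was applied as described, the file labelled "72dpi" is nominally closer to 75 dpi.

```python
def scaled_variants(base_dpi, base_size, factors):
    """Return the nominal dpi and pixel size for each scale factor."""
    width, height = base_size
    return [
        {
            "factor": factor,
            "nominal_dpi": base_dpi * factor,
            "size": (round(width * factor), round(height * factor)),
        }
        for factor in factors
    ]


# Hypothetical page size: roughly A4 scanned at 600 dpi (an assumption).
variants = scaled_variants(600, (4960, 7016), (1, 0.5, 0.25, 0.125))
for variant in variants:
    print(variant)
```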

Interestingly, with the original (7 barcodes) it detects nothing at all. At 150 dpi it detects all but one, but at 72 dpi it only detects 4. The 72 dpi version is still quite readable and my phone has no problem scanning the codes. Of the other two examples, the lower-right code is detected at every resolution, but the upper-left one is missed at 600 dpi.

I have no idea what it is doing there. To me it seems that the following factors are relevant:

Size of the image
Size of the code
Position of the code
Number of codes

The issue trackers of pyzbar and libzbar mention issues which seem related, but I found nothing there that was being worked on or that was actionable.

All test files attached.

upper_left_300dpi
upper_left_150dpi
upper_left_72dpi
72dpi
150dpi
300dpi
600dpi
lower_right_600dpi
lower_right_300dpi
lower_right_150dpi
lower_right_72dpi
upper_left_600dpi

Edit: Disabled image previews.

Rondal · 2023-01-10T10:50:13Z

All right. I did some more tests and added yet another file to the test pool. I modified the above test script to see how to get the best results:

from pyzbar import pyzbar
from PIL import Image, ImageFilter

names = ['600dpi', '300dpi', '150dpi', '72dpi', 'lower_right_600dpi', 'lower_right_300dpi', 'lower_right_150dpi', 'lower_right_72dpi', 'upper_left_600dpi', 'upper_left_300dpi', 'upper_left_150dpi', 'upper_left_72dpi', 'small_600dpi', 'small_300dpi', 'small_150dpi']

for name in names:
    print(name + ":")
    img = Image.open(name + ".png")
    # Note: filter() returns a new image; the result is not kept here,
    # so only the thumbnail() call below changes img.
    img.filter(ImageFilter.GaussianBlur(2))
    # thumbnail() shrinks img in place to fit within 1800x1800 pixels.
    img.thumbnail((1800, 1800))
    print(len(pyzbar.decode(img)))

With this combination of resize and blur I got most of the cases right:

600dpi:
6
300dpi:
6
150dpi:
6
72dpi:
4
lower_right_600dpi:
1
lower_right_300dpi:
1
lower_right_150dpi:
1
lower_right_72dpi:
1
upper_left_600dpi:
1
upper_left_300dpi:
1
upper_left_150dpi:
1
upper_left_72dpi:
1
small_600dpi:
1
small_300dpi:
0
small_150dpi:
1

The new 'small' variant is an image with one small QR code on it. This result seems to be okay, but still with the original image it only detects 6 out of 7 barcodes. Execution time for the blur and resize is a factor here but maybe not really important in the overall execution time of document parsing:

With filter:

$ time python3 time1.py
600dpi:
6

real	0m2.017s
user	0m1.859s
sys	0m0.156s

Without filter:

$ time python3 time2.py
600dpi:
0

real	0m1.263s
user	0m1.179s
sys	0m0.084s

Newly created small test files:
small_300dpi
small_600dpi
small_150dpi
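
A note on the test script above: in Pillow, `Image.filter()` returns a new image and leaves the original untouched, while `Image.thumbnail()` resizes in place. As the script is written, the blurred image is discarded, so the improved results probably come from the downscaling alone. The sketch below demonstrates the difference on a synthetic image (a stand-in, not one of the attached test files).

```python
from PIL import Image, ImageFilter

# Synthetic stand-in for a scanned page: white background, black square.
img = Image.new("L", (200, 200), 255)
img.paste(0, (80, 80, 120, 120))
before = img.tobytes()

# The return value is discarded, so img keeps its original pixels.
img.filter(ImageFilter.GaussianBlur(2))
unchanged = img.tobytes() == before

# Keeping the return value gives the blurred copy.
blurred = img.filter(ImageFilter.GaussianBlur(2))
differs = blurred.tobytes() != before

# thumbnail() works in place and preserves the aspect ratio.
img.thumbnail((100, 100))
size_after = img.size

print(unchanged, differs, size_after)
```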

Rondal · 2023-01-10T10:55:00Z

Not sure if the following change has side effects, but it could help a bit with the problem even though this is NOT a fix.

diff --git a/src/documents/barcodes.py b/src/documents/barcodes.py
--- src/documents/barcodes.py
+++ src/documents/barcodes.py
@@ -15,8 +15,9 @@
 from pikepdf import Pdf
 from pikepdf import PdfImage
 from PIL import Image
 from PIL import ImageSequence
+from PIL import ImageFilter
 from pyzbar import pyzbar
 
 logger = logging.getLogger("paperless.barcodes")
 
@@ -45,8 +46,12 @@
     Read any barcodes contained in image
     Returns a list containing all found barcodes
     """
     barcodes = []
+    # Prepare image for decode
+    image.filter(ImageFilter.GaussianBlur(2))
+    image.thumbnail((1800,1800))
+    
     # Decode the barcode image
     detected_barcodes = pyzbar.decode(image)
 
     if detected_barcodes:
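
On the question of side effects: `Image.thumbnail()` modifies the image it is called on, so with the patch above the caller's image would be shrunk as well, and the result of `filter()` is not kept. A minimal sketch of a helper that avoids both follows. The name `prepare_for_decode` is hypothetical (it is not part of paperless-ngx), the blur radius and size limit are taken from the patch, and whether the blur actually improves detection has not been verified here.

```python
from PIL import Image, ImageFilter


def prepare_for_decode(image, max_size=(1800, 1800), blur_radius=2):
    """Return a blurred, downsized copy; the caller's image is not modified."""
    # filter() returns a new image, so keep the result instead of discarding it.
    prepared = image.filter(ImageFilter.GaussianBlur(blur_radius))
    # thumbnail() resizes in place, but here it only touches the copy.
    prepared.thumbnail(max_size)
    return prepared


# Synthetic stand-in for an extracted page image.
page = Image.new("L", (3600, 2400), 255)
small = prepare_for_decode(page)
print(page.size, small.size)
```

The decode call would then receive the returned copy, leaving the original image available at full resolution for anything that runs afterwards.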

stumpylog · 2023-01-11T16:17:21Z

I think this is essentially NaturalHistoryMuseum/pyzbar#63, given how playing around with scaling and size seems to have the largest effect.

And an interesting blog post on the subject: https://kdmurray.id.au/post/2022-03-21_decode-qrcodes/

I'm looking into what can be done on our side, ideally without turning into an image processor and taking up lots of time and memory...
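
One way to keep the extra cost bounded is to decode the image as-is first and only retry on downscaled copies when nothing was found. The sketch below illustrates that idea only; it is not the approach taken in paperless-ngx, and the function name `decode_with_fallback_scales` and the factor list are assumptions. The decoder is passed in as a callable (for example `pyzbar.decode`), and the image only needs a `size` attribute and a `resize()` method, as a Pillow image has.

```python
def decode_with_fallback_scales(image, decode, factors=(1.0, 0.5, 0.25)):
    """Try decode() at successively smaller scales.

    Returns (factor, results) for the first scale that yields any result,
    or (None, []) if no scale produced a hit.
    """
    width, height = image.size
    for factor in factors:
        if factor == 1.0:
            # First attempt: no resizing, so the common case stays cheap.
            candidate = image
        else:
            new_size = (
                max(1, round(width * factor)),
                max(1, round(height * factor)),
            )
            # resize() returns a new image; the original is left untouched.
            candidate = image.resize(new_size)
        results = decode(candidate)
        if results:
            return factor, results
    return None, []


# Possible usage with Pillow and pyzbar (both assumed to be installed):
#   from PIL import Image
#   from pyzbar import pyzbar
#   factor, found = decode_with_fallback_scales(Image.open("600dpi.png"), pyzbar.decode)
```

Stopping at the first scale that yields a hit is enough to decide whether a page carries the separator barcode, but it will not necessarily return every barcode on the page, since the counts reported above vary with scale.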

stumpylog · 2023-01-24T18:35:03Z

With #2468, 6 of the 7 barcodes are detected now and all existing barcodes also worked. Unfortunately, unless libzbar and/or pyzbar get some bugfixes around image size, I think that's the best result we'll get.

github-actions · 2023-04-15T02:17:07Z

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

Rondal added the bug and unconfirmed labels Jan 8, 2023

stumpylog added the dependencies label and removed the unconfirmed label Jan 10, 2023

stumpylog added this to the Next Release milestone Jan 11, 2023

stumpylog changed the title ~~[BUG] Barcode splitting no longer working~~ [BUG] Barcode splitting not detecting all barcodes Jan 17, 2023

stumpylog mentioned this issue Jan 19, 2023 in Bugfix: Rescales images for better barcode locating #2468 (merged)

stumpylog closed this as completed Jan 24, 2023

github-actions bot locked as resolved and limited conversation to collaborators Apr 15, 2023
