[BUG] Polling Consumer adds duplicates to Work Queue #1760

Closed
Thomas-Ganter opened this issue Oct 7, 2022 · 9 comments
Labels: bug, unconfirmed

Comments

@Thomas-Ganter

Description

I have been struggling for two days to get my containerized Paperless-NGX 1.9.2 working properly with my flatbed scanner, and I am now reaching out for your help.

Files are picked up from the consume directory (an NFS share in my case) multiple times and added to the work queue as duplicates. Once the first successful run has processed and removed a file, the remaining queued duplicates fail, resulting in loads and loads of errors.

The log below is the result of putting 4 files into the consume directory, which produced 11 errors.

Steps to reproduce

Install from the Docker image.

This is the config I use. I tried PostgreSQL first and got the same result:

kind: ConfigMap
apiVersion: v1
metadata:
  name: paperless-config
  namespace: paperless
data:
  PAPERLESS_URL: "https://documents.k.fami.ga"
  PAPERLESS_TASK_WORKERS: "1"
  PAPERLESS_THREADS_PER_WORKER: "1"
  PAPERLESS_SECRET_KEY: "…"
  PAPERLESS_DBHOST: paperless-db
  PAPERLESS_DBPORT: "3306"
  PAPERLESS_DBENGINE: "mariadb"
  PAPERLESS_REDIS: "redis://paperless-broker:6379"
  PAPERLESS_TIKA_ENABLED: "1"
  PAPERLESS_TIKA_ENDPOINT: http://paperless-ocr:9998
  PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://paperless-ocr:3000
  PAPERLESS_TIME_ZONE: Europe/Berlin
  PAPERLESS_OCR_LANGUAGE: "deu+eng"
  PAPERLESS_CONSUMER_IGNORE_PATTERNS: "[\".DS_Store\", \".DS_STORE/*\", \"._*\", \".stfolder/*\", \".stversions/*\", \".localized/*\", \"desktop.ini\"]"
  PAPERLESS_CONSUMER_POLLING: "60"
  PAPERLESS_CONSUMER_POLLING_DELAY: "4"
  PAPERLESS_CONSUMER_POLLING_RETRY_COUNT: "2"
  PAPERLESS_CONSUMER_RECURSIVE: "false"
  PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS: "true"
  PAPERLESS_CONSUMER_ENABLE_BARCODES: "true"
  USERMAP_UID: "1000"
  USERMAP_GID: "100"

I also fiddled with the polling interval, delay and retry count, the worker count and the threads per worker, but never got any satisfactory behaviour.

Webserver logs

[2022-10-07 19:48:41,747] [DEBUG] [paperless.management.consumer] Waiting for file /usr/src/paperless/consume/Zeit Rezept 2013-30 Aprikosenschelte.pdf to remain unmodified
[2022-10-07 19:48:41,748] [DEBUG] [paperless.management.consumer] Waiting for file /usr/src/paperless/consume/Zeit Rezept 2013-31 Chinesische Erbsen.pdf to remain unmodified
[2022-10-07 19:48:41,749] [DEBUG] [paperless.management.consumer] Waiting for file /usr/src/paperless/consume/Zeit Rezept 2013-29 Staudensellerie.pdf to remain unmodified
[2022-10-07 19:48:41,751] [DEBUG] [paperless.management.consumer] Waiting for file /usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf to remain unmodified
[2022-10-07 19:48:45,759] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/consume/Zeit Rezept 2013-30 Aprikosenschelte.pdf to the task queue.
[2022-10-07 19:48:45,769] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/consume/Zeit Rezept 2013-29 Staudensellerie.pdf to the task queue.
[2022-10-07 19:48:45,770] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/consume/Zeit Rezept 2013-31 Chinesische Erbsen.pdf to the task queue.
[2022-10-07 19:48:45,774] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf to the task queue.
[2022-10-07 19:48:46,428] [DEBUG] [paperless.barcodes] Detected mime type: application/pdf
[2022-10-07 19:48:46,885] [INFO] [paperless.consumer] Consuming Zeit Rezept 2013-30 Aprikosenschelte.pdf
[2022-10-07 19:48:46,892] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-10-07 19:48:46,906] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-10-07 19:48:46,920] [DEBUG] [paperless.consumer] Parsing Zeit Rezept 2013-30 Aprikosenschelte.pdf...
[2022-10-07 19:48:47,014] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/consume/Zeit Rezept 2013-30 Aprikosenschelte.pdf
[2022-10-07 19:48:47,237] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/usr/src/paperless/consume/Zeit Rezept 2013-30 Aprikosenschelte.pdf'), 'output_file': '/tmp/paperless/paperless-x3iv9wio/archive.pdf', 'use_threads': True, 'jobs': '1', 'language': 'deu+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-x3iv9wio/sidecar.txt'}
[2022-10-07 19:49:20,695] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2022-10-07 19:49:20,702] [DEBUG] [paperless.consumer] Generating thumbnail for Zeit Rezept 2013-30 Aprikosenschelte.pdf...
[2022-10-07 19:49:20,719] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-x3iv9wio/archive.pdf[0] /tmp/paperless/paperless-x3iv9wio/convert.webp
[2022-10-07 19:49:24,389] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2022-10-07 19:49:24,412] [DEBUG] [paperless.consumer] Saving record to database
[2022-10-07 19:49:24,417] [DEBUG] [paperless.consumer] Creation date from st_mtime: 2013-10-18 23:36:25+02:00
[2022-10-07 19:49:24,864] [DEBUG] [paperless.consumer] Deleting file /usr/src/paperless/consume/Zeit Rezept 2013-30 Aprikosenschelte.pdf
[2022-10-07 19:49:24,874] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-x3iv9wio
[2022-10-07 19:49:24,881] [INFO] [paperless.consumer] Document 2013-10-18 Zeit Rezept 2013-30 Aprikosenschelte consumption finished
[2022-10-07 19:49:27,509] [DEBUG] [paperless.barcodes] Detected mime type: application/pdf
[2022-10-07 19:49:27,929] [INFO] [paperless.consumer] Consuming Zeit Rezept 2013-29 Staudensellerie.pdf
[2022-10-07 19:49:27,938] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-10-07 19:49:27,944] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-10-07 19:49:27,958] [DEBUG] [paperless.consumer] Parsing Zeit Rezept 2013-29 Staudensellerie.pdf...
[2022-10-07 19:49:28,050] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/consume/Zeit Rezept 2013-29 Staudensellerie.pdf
[2022-10-07 19:49:28,262] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/usr/src/paperless/consume/Zeit Rezept 2013-29 Staudensellerie.pdf'), 'output_file': '/tmp/paperless/paperless-b5k8blse/archive.pdf', 'use_threads': True, 'jobs': '1', 'language': 'deu+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-b5k8blse/sidecar.txt'}
[2022-10-07 19:49:31,470] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/consume/Zeit Rezept 2013-29 Staudensellerie.pdf to the task queue.
[2022-10-07 19:49:31,542] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/consume/Zeit Rezept 2013-31 Chinesische Erbsen.pdf to the task queue.
[2022-10-07 19:49:31,574] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf to the task queue.
[2022-10-07 19:49:31,604] [INFO] [paperless.management.consumer] Polling directory for changes: /usr/src/paperless/consume
[2022-10-07 19:49:32,296] [DEBUG] [paperless.barcodes] Detected mime type: application/pdf
[2022-10-07 19:49:32,728] [INFO] [paperless.consumer] Consuming Zeit Rezept 2013-29 Staudensellerie.pdf
[2022-10-07 19:49:32,737] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-10-07 19:49:32,749] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-10-07 19:49:32,764] [DEBUG] [paperless.consumer] Parsing Zeit Rezept 2013-29 Staudensellerie.pdf...
[2022-10-07 19:49:32,860] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/consume/Zeit Rezept 2013-29 Staudensellerie.pdf
[2022-10-07 19:49:33,093] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/usr/src/paperless/consume/Zeit Rezept 2013-29 Staudensellerie.pdf'), 'output_file': '/tmp/paperless/paperless-7t1o85yx/archive.pdf', 'use_threads': True, 'jobs': '1', 'language': 'deu+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-7t1o85yx/sidecar.txt'}
[2022-10-07 19:49:39,945] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/consume/Zeit Rezept 2013-29 Staudensellerie.pdf to the task queue.
[2022-10-07 19:49:39,983] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/consume/Zeit Rezept 2013-31 Chinesische Erbsen.pdf to the task queue.
[2022-10-07 19:49:40,008] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf to the task queue.
[2022-10-07 19:49:40,030] [INFO] [paperless.management.consumer] Polling directory for changes: /usr/src/paperless/consume
[2022-10-07 19:49:40,885] [DEBUG] [paperless.barcodes] Detected mime type: application/pdf
[2022-10-07 19:49:41,302] [INFO] [paperless.consumer] Consuming Zeit Rezept 2013-29 Staudensellerie.pdf
[2022-10-07 19:49:41,312] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-10-07 19:49:41,326] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-10-07 19:49:41,340] [DEBUG] [paperless.consumer] Parsing Zeit Rezept 2013-29 Staudensellerie.pdf...
[2022-10-07 19:49:41,460] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/consume/Zeit Rezept 2013-29 Staudensellerie.pdf
[2022-10-07 19:49:41,722] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/usr/src/paperless/consume/Zeit Rezept 2013-29 Staudensellerie.pdf'), 'output_file': '/tmp/paperless/paperless-kmoqmc2s/archive.pdf', 'use_threads': True, 'jobs': '1', 'language': 'deu+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-kmoqmc2s/sidecar.txt'}
[2022-10-07 19:49:45,704] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/consume/Zeit Rezept 2013-29 Staudensellerie.pdf to the task queue.
[2022-10-07 19:49:45,780] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/consume/Zeit Rezept 2013-31 Chinesische Erbsen.pdf to the task queue.
[2022-10-07 19:49:45,810] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf to the task queue.
[2022-10-07 19:49:45,833] [INFO] [paperless.management.consumer] Polling directory for changes: /usr/src/paperless/consume
[2022-10-07 19:49:46,530] [DEBUG] [paperless.barcodes] Detected mime type: application/pdf
[2022-10-07 19:49:46,948] [INFO] [paperless.consumer] Consuming Zeit Rezept 2013-29 Staudensellerie.pdf
[2022-10-07 19:49:46,958] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-10-07 19:49:46,971] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-10-07 19:49:46,985] [DEBUG] [paperless.consumer] Parsing Zeit Rezept 2013-29 Staudensellerie.pdf...
[2022-10-07 19:49:47,079] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/consume/Zeit Rezept 2013-29 Staudensellerie.pdf
[2022-10-07 19:49:47,301] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/usr/src/paperless/consume/Zeit Rezept 2013-29 Staudensellerie.pdf'), 'output_file': '/tmp/paperless/paperless-t96he3kv/archive.pdf', 'use_threads': True, 'jobs': '1', 'language': 'deu+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-t96he3kv/sidecar.txt'}
[2022-10-07 19:50:01,612] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2022-10-07 19:50:01,619] [DEBUG] [paperless.consumer] Generating thumbnail for Zeit Rezept 2013-29 Staudensellerie.pdf...
[2022-10-07 19:50:01,636] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-b5k8blse/archive.pdf[0] /tmp/paperless/paperless-b5k8blse/convert.webp
[2022-10-07 19:50:05,405] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2022-10-07 19:50:05,437] [DEBUG] [paperless.consumer] Saving record to database
[2022-10-07 19:50:05,446] [DEBUG] [paperless.consumer] Creation date from st_mtime: 2013-10-18 23:36:25+02:00
[2022-10-07 19:50:05,837] [DEBUG] [paperless.consumer] Deleting file /usr/src/paperless/consume/Zeit Rezept 2013-29 Staudensellerie.pdf
[2022-10-07 19:50:05,847] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-b5k8blse
[2022-10-07 19:50:05,855] [INFO] [paperless.consumer] Document 2013-10-18 Zeit Rezept 2013-29 Staudensellerie consumption finished
[2022-10-07 19:50:07,584] [DEBUG] [paperless.barcodes] Detected mime type: application/pdf
[2022-10-07 19:50:08,042] [INFO] [paperless.consumer] Consuming Zeit Rezept 2013-31 Chinesische Erbsen.pdf
[2022-10-07 19:50:08,050] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-10-07 19:50:08,057] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-10-07 19:50:08,071] [DEBUG] [paperless.consumer] Parsing Zeit Rezept 2013-31 Chinesische Erbsen.pdf...
[2022-10-07 19:50:08,165] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/consume/Zeit Rezept 2013-31 Chinesische Erbsen.pdf
[2022-10-07 19:50:08,386] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/usr/src/paperless/consume/Zeit Rezept 2013-31 Chinesische Erbsen.pdf'), 'output_file': '/tmp/paperless/paperless-pdhvh_t6/archive.pdf', 'use_threads': True, 'jobs': '1', 'language': 'deu+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-pdhvh_t6/sidecar.txt'}
[2022-10-07 19:50:08,779] [INFO] [paperless.consumer] Consuming Zeit Rezept 2013-32 echter Nizzasalat.pdf
[2022-10-07 19:50:08,789] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-10-07 19:50:09,342] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-10-07 19:50:09,358] [DEBUG] [paperless.consumer] Parsing Zeit Rezept 2013-32 echter Nizzasalat.pdf...
[2022-10-07 19:50:09,451] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf
[2022-10-07 19:50:09,693] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf'), 'output_file': '/tmp/paperless/paperless-m5xtlafd/archive.pdf', 'use_threads': True, 'jobs': '1', 'language': 'deu+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-m5xtlafd/sidecar.txt'}
[2022-10-07 19:50:11,016] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2022-10-07 19:50:11,032] [DEBUG] [paperless.consumer] Saving record to database
[2022-10-07 19:50:11,037] [DEBUG] [paperless.consumer] Creation date from st_mtime: 2013-10-18 23:36:25+02:00
[2022-10-07 19:50:11,075] [ERROR] [paperless.consumer] The following error occurred while consuming Zeit Rezept 2013-29 Staudensellerie.pdf: (1062, "Duplicate entry '5fd5f7441742fee5d17f1f03e54791f5' for key 'documents_document_checksum_75209391_uniq'")
[2022-10-07 19:50:11,088] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-7t1o85yx
[2022-10-07 19:50:12,365] [DEBUG] [paperless.barcodes] Detected mime type: application/pdf
[2022-10-07 19:50:12,833] [INFO] [paperless.consumer] Consuming Zeit Rezept 2013-31 Chinesische Erbsen.pdf
[2022-10-07 19:50:12,841] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-10-07 19:50:12,848] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-10-07 19:50:12,863] [DEBUG] [paperless.consumer] Parsing Zeit Rezept 2013-31 Chinesische Erbsen.pdf...
[2022-10-07 19:50:12,957] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/consume/Zeit Rezept 2013-31 Chinesische Erbsen.pdf
[2022-10-07 19:50:13,191] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/usr/src/paperless/consume/Zeit Rezept 2013-31 Chinesische Erbsen.pdf'), 'output_file': '/tmp/paperless/paperless-96xa5pfm/archive.pdf', 'use_threads': True, 'jobs': '1', 'language': 'deu+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-96xa5pfm/sidecar.txt'}
[2022-10-07 19:50:13,925] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-kmoqmc2s
[2022-10-07 19:50:13,960] [ERROR] [paperless.consumer] Error while consuming document Zeit Rezept 2013-29 Staudensellerie.pdf: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ocrmypdf.io.bolqb5n8/origin.pdf'
[2022-10-07 19:50:16,028] [DEBUG] [paperless.barcodes] Detected mime type: application/pdf
[2022-10-07 19:50:16,533] [INFO] [paperless.consumer] Consuming Zeit Rezept 2013-31 Chinesische Erbsen.pdf
[2022-10-07 19:50:16,540] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-10-07 19:50:16,546] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-10-07 19:50:16,586] [DEBUG] [paperless.consumer] Parsing Zeit Rezept 2013-31 Chinesische Erbsen.pdf...
[2022-10-07 19:50:16,699] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/consume/Zeit Rezept 2013-31 Chinesische Erbsen.pdf
[2022-10-07 19:50:16,972] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/usr/src/paperless/consume/Zeit Rezept 2013-31 Chinesische Erbsen.pdf'), 'output_file': '/tmp/paperless/paperless-x19_l5f2/archive.pdf', 'use_threads': True, 'jobs': '1', 'language': 'deu+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-x19_l5f2/sidecar.txt'}
[2022-10-07 19:50:18,685] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-t96he3kv
[2022-10-07 19:50:18,720] [ERROR] [paperless.consumer] Error while consuming document Zeit Rezept 2013-29 Staudensellerie.pdf: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ocrmypdf.io.81tk65l4/origin.pdf'
[2022-10-07 19:50:21,582] [DEBUG] [paperless.barcodes] Detected mime type: application/pdf
[2022-10-07 19:50:22,045] [INFO] [paperless.consumer] Consuming Zeit Rezept 2013-31 Chinesische Erbsen.pdf
[2022-10-07 19:50:22,053] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-10-07 19:50:22,061] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-10-07 19:50:22,077] [DEBUG] [paperless.consumer] Parsing Zeit Rezept 2013-31 Chinesische Erbsen.pdf...
[2022-10-07 19:50:22,172] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/consume/Zeit Rezept 2013-31 Chinesische Erbsen.pdf
[2022-10-07 19:50:22,395] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/usr/src/paperless/consume/Zeit Rezept 2013-31 Chinesische Erbsen.pdf'), 'output_file': '/tmp/paperless/paperless-csu1csoo/archive.pdf', 'use_threads': True, 'jobs': '1', 'language': 'deu+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-csu1csoo/sidecar.txt'}
[2022-10-07 19:50:43,071] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2022-10-07 19:50:43,078] [DEBUG] [paperless.consumer] Generating thumbnail for Zeit Rezept 2013-31 Chinesische Erbsen.pdf...
[2022-10-07 19:50:43,096] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-pdhvh_t6/archive.pdf[0] /tmp/paperless/paperless-pdhvh_t6/convert.webp
[2022-10-07 19:50:46,771] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2022-10-07 19:50:46,787] [DEBUG] [paperless.consumer] Saving record to database
[2022-10-07 19:50:46,792] [DEBUG] [paperless.consumer] Creation date from st_mtime: 2013-10-18 23:36:25+02:00
[2022-10-07 19:50:49,448] [DEBUG] [paperless.consumer] Deleting file /usr/src/paperless/consume/Zeit Rezept 2013-31 Chinesische Erbsen.pdf
[2022-10-07 19:50:49,561] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-pdhvh_t6
[2022-10-07 19:50:49,567] [INFO] [paperless.consumer] Document 2013-10-18 Zeit Rezept 2013-31 Chinesische Erbsen consumption finished
[2022-10-07 19:50:51,737] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2022-10-07 19:50:51,753] [DEBUG] [paperless.consumer] Saving record to database
[2022-10-07 19:50:51,758] [DEBUG] [paperless.consumer] Creation date from st_mtime: 2013-10-18 23:36:25+02:00
[2022-10-07 19:50:51,793] [ERROR] [paperless.consumer] The following error occurred while consuming Zeit Rezept 2013-31 Chinesische Erbsen.pdf: (1062, "Duplicate entry 'd418d080ead24a8a258a55eb87e11101' for key 'documents_document_checksum_75209391_uniq'")
[2022-10-07 19:50:51,806] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-96xa5pfm
[2022-10-07 19:50:52,135] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2022-10-07 19:50:52,145] [DEBUG] [paperless.consumer] Generating thumbnail for Zeit Rezept 2013-31 Chinesische Erbsen.pdf...
[2022-10-07 19:50:52,163] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-x19_l5f2/archive.pdf[0] /tmp/paperless/paperless-x19_l5f2/convert.webp
[2022-10-07 19:50:52,436] [DEBUG] [paperless.barcodes] Detected mime type: application/pdf
[2022-10-07 19:50:52,620] [DEBUG] [paperless.barcodes] Detected mime type: application/pdf
[2022-10-07 19:50:52,891] [INFO] [paperless.consumer] Consuming Zeit Rezept 2013-32 echter Nizzasalat.pdf
[2022-10-07 19:50:52,898] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-10-07 19:50:52,905] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-10-07 19:50:52,919] [DEBUG] [paperless.consumer] Parsing Zeit Rezept 2013-32 echter Nizzasalat.pdf...
[2022-10-07 19:50:53,012] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf
[2022-10-07 19:50:53,098] [INFO] [paperless.consumer] Consuming Zeit Rezept 2013-32 echter Nizzasalat.pdf
[2022-10-07 19:50:53,106] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-10-07 19:50:53,112] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-10-07 19:50:53,126] [DEBUG] [paperless.consumer] Parsing Zeit Rezept 2013-32 echter Nizzasalat.pdf...
[2022-10-07 19:50:53,227] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf'), 'output_file': '/tmp/paperless/paperless-a9dfujsm/archive.pdf', 'use_threads': True, 'jobs': '1', 'language': 'deu+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-a9dfujsm/sidecar.txt'}
[2022-10-07 19:50:53,220] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf
[2022-10-07 19:50:53,447] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf'), 'output_file': '/tmp/paperless/paperless-wlp0ih_r/archive.pdf', 'use_threads': True, 'jobs': '1', 'language': 'deu+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-wlp0ih_r/sidecar.txt'}
[2022-10-07 19:50:55,043] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-csu1csoo
[2022-10-07 19:50:55,078] [ERROR] [paperless.consumer] Error while consuming document Zeit Rezept 2013-31 Chinesische Erbsen.pdf: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ocrmypdf.io.c3lo_mza/origin.pdf'
[2022-10-07 19:50:55,840] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2022-10-07 19:50:55,856] [DEBUG] [paperless.consumer] Saving record to database
[2022-10-07 19:50:55,870] [ERROR] [paperless.consumer] The following error occurred while consuming Zeit Rezept 2013-31 Chinesische Erbsen.pdf: [Errno 2] No such file or directory: '/usr/src/paperless/consume/Zeit Rezept 2013-31 Chinesische Erbsen.pdf'
[2022-10-07 19:50:55,877] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-x19_l5f2
[2022-10-07 19:50:56,673] [DEBUG] [paperless.barcodes] Detected mime type: application/pdf
[2022-10-07 19:50:57,116] [INFO] [paperless.consumer] Consuming Zeit Rezept 2013-32 echter Nizzasalat.pdf
[2022-10-07 19:50:57,125] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-10-07 19:50:57,132] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-10-07 19:50:57,146] [DEBUG] [paperless.consumer] Parsing Zeit Rezept 2013-32 echter Nizzasalat.pdf...
[2022-10-07 19:50:57,238] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf
[2022-10-07 19:50:57,466] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf'), 'output_file': '/tmp/paperless/paperless-cyb3xvf_/archive.pdf', 'use_threads': True, 'jobs': '1', 'language': 'deu+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-cyb3xvf_/sidecar.txt'}
[2022-10-07 19:51:01,092] [DEBUG] [paperless.barcodes] Detected mime type: application/pdf
[2022-10-07 19:51:01,550] [INFO] [paperless.consumer] Consuming Zeit Rezept 2013-32 echter Nizzasalat.pdf
[2022-10-07 19:51:01,558] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-10-07 19:51:01,565] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-10-07 19:51:01,581] [DEBUG] [paperless.consumer] Parsing Zeit Rezept 2013-32 echter Nizzasalat.pdf...
[2022-10-07 19:51:01,684] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf
[2022-10-07 19:51:01,946] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf'), 'output_file': '/tmp/paperless/paperless-pm6q4ah0/archive.pdf', 'use_threads': True, 'jobs': '1', 'language': 'deu+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-pm6q4ah0/sidecar.txt'}
[2022-10-07 19:51:32,198] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2022-10-07 19:51:32,204] [DEBUG] [paperless.consumer] Generating thumbnail for Zeit Rezept 2013-32 echter Nizzasalat.pdf...
[2022-10-07 19:51:32,221] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-wlp0ih_r/archive.pdf[0] /tmp/paperless/paperless-wlp0ih_r/convert.webp
[2022-10-07 19:51:35,920] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2022-10-07 19:51:35,937] [DEBUG] [paperless.consumer] Saving record to database
[2022-10-07 19:51:35,944] [DEBUG] [paperless.consumer] Creation date from st_mtime: 2013-10-18 23:36:24+02:00
[2022-10-07 19:51:36,149] [DEBUG] [paperless.consumer] Deleting file /usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf
[2022-10-07 19:51:36,260] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-a9dfujsm
[2022-10-07 19:51:36,269] [INFO] [paperless.consumer] Document 2013-10-18 Zeit Rezept 2013-32 echter Nizzasalat consumption finished
[2022-10-07 19:51:36,225] [ERROR] [paperless.consumer] The following error occurred while consuming Zeit Rezept 2013-32 echter Nizzasalat.pdf: (1062, "Duplicate entry '79b613f9e20d85a613d42ed19845ae32' for key 'documents_document_checksum_75209391_uniq'")
[2022-10-07 19:51:36,238] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-wlp0ih_r
[2022-10-07 19:51:36,245] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2022-10-07 19:51:36,252] [DEBUG] [paperless.consumer] Generating thumbnail for Zeit Rezept 2013-32 echter Nizzasalat.pdf...
[2022-10-07 19:51:36,275] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-m5xtlafd/archive.pdf[0] /tmp/paperless/paperless-m5xtlafd/convert.webp
[2022-10-07 19:51:39,083] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2022-10-07 19:51:39,098] [DEBUG] [paperless.consumer] Saving record to database
[2022-10-07 19:51:39,104] [DEBUG] [paperless.consumer] Creation date from st_mtime: 2013-10-18 23:36:24+02:00
[2022-10-07 19:51:39,120] [ERROR] [paperless.consumer] The following error occurred while consuming Zeit Rezept 2013-32 echter Nizzasalat.pdf: [Errno 2] No such file or directory: '/usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf'
[2022-10-07 19:51:39,128] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-cyb3xvf_
[2022-10-07 19:51:39,510] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-pm6q4ah0
[2022-10-07 19:51:39,547] [ERROR] [paperless.consumer] Error while consuming document Zeit Rezept 2013-32 echter Nizzasalat.pdf: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ocrmypdf.io.l5i5kp3u/origin.pdf'
[2022-10-07 19:51:41,029] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2022-10-07 19:51:41,045] [DEBUG] [paperless.consumer] Saving record to database
[2022-10-07 19:51:41,062] [ERROR] [paperless.consumer] The following error occurred while consuming Zeit Rezept 2013-32 echter Nizzasalat.pdf: [Errno 2] No such file or directory: '/usr/src/paperless/consume/Zeit Rezept 2013-32 echter Nizzasalat.pdf'
[2022-10-07 19:51:41,069] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-m5xtlafd
[2022-10-07 19:57:04,419] [DEBUG] [paperless.management.consumer] Consumer exiting.
[2022-10-07 19:57:05,264] [DEBUG] [paperless.management.consumer] Consumer exiting.
[2022-10-07 19:57:05,901] [DEBUG] [paperless.management.consumer] Consumer exiting.
[2022-10-07 19:57:14,141] [DEBUG] [paperless.management.consumer] Consumer exiting.


### Paperless-ngx version

1.9.2

### Host OS

linux on arm64

### Installation method

Other (please describe above)

### Browser

_No response_

### Configuration changes

Set Polling Interval to 60 Seconds. 

### Other

Installed on kubernetes from ghcr.io/paperless-ngx/paperless-ngx:latest
@Thomas-Ganter Thomas-Ganter added bug Bug report or a Bug-fix unconfirmed labels Oct 7, 2022
@Thomas-Ganter
Author

OK, after digging deeper and analysing further, it seems I have a race condition involving my HorizontalPodAutoscaler.

I will dig deeper tomorrow and provide an update if I can prove I am the idiot.
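
As a rough illustration of that suspected race (a toy simulation only, not Paperless-ngx's consumer code): if every replica polls the same shared directory on its own, each file is enqueued once per replica. The pod names, the `poll_once` helper, and the flat-file "queue" are all assumptions made up for this sketch.

```sh
# Toy model: two replicas poll one shared consume directory independently.
consume_dir=$(mktemp -d)   # stands in for the shared NFS consume directory
queue_file=$(mktemp)       # stands in for the shared work queue

touch "$consume_dir/a.pdf" "$consume_dir/b.pdf"

# Hypothetical poller: enqueue every file it sees, tagged with the pod name.
poll_once() {
  pod=$1
  for f in "$consume_dir"/*; do
    printf '%s %s\n' "$pod" "$(basename "$f")" >> "$queue_file"
  done
}

# Both replicas poll before either has processed (and removed) a file.
poll_once paperless-0
poll_once paperless-1

files=$(ls "$consume_dir" | wc -l | tr -d ' ')
queued=$(wc -l < "$queue_file" | tr -d ' ')
echo "files=$files queued=$queued"   # prints: files=2 queued=4
```

The first job to finish removes the file, so the remaining jobs for the same file would be the ones that fail with "No such file or directory" or a duplicate checksum, which is consistent with the log above.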

@Thomas-Ganter
Author

OK, I can happily confirm that I found a way.
It seems like a hack, but it works.

Use a StatefulSet and configure it like this:

   […]
        - name: paperless-app
          envFrom:
          - configMapRef:
              name: paperless-config   
          image: ghcr.io/paperless-ngx/paperless-ngx:latest
          command:
            - bash
            - "-c"
            - |
              set -ex
              printf 'Preparing paperless-app ... '
              #
              # A StatefulSet gives each pod a sticky identity; the ordinal is the hostname suffix
              #
              [[ $(hostname) =~ -([0-9]+)$ ]] || { echo "Strange Hostname: $(hostname)"; exit 1; }
              ordinal=${BASH_REMATCH[1]}
              #
              # Only pod 0 polls the consume directory at the short interval
              if [[ $ordinal -eq 0 ]]; then
                printf 'First Instance .. '
                export PAPERLESS_CONSUMER_POLLING=${PAPERLESS_CONSUMER_POLLING_FIRST}
              else
                printf 'Supporting Instance .. '
                export PAPERLESS_CONSUMER_POLLING=${PAPERLESS_CONSUMER_POLLING_OTHER}
              fi
              printf 'PAPERLESS_CONSUMER_POLLING=%s\n\n' "${PAPERLESS_CONSUMER_POLLING}"
              #
              # Debug output: note that env also prints PAPERLESS_SECRET_KEY to the pod log
              env
              printf '\nNow handing over to normal entrypoint ... '
              /sbin/docker-entrypoint.sh /usr/local/bin/paperless_cmd.sh
   […]

This takes two separate PAPERLESS_CONSUMER_POLLING settings from the environment, configured as follows:

  PAPERLESS_CONSUMER_POLLING_FIRST: "15"
  PAPERLESS_CONSUMER_POLLING_OTHER: "99999999"

i.e. the other containers will have a polling interval of roughly 3 years, whereas the initial Pod will use 15 seconds.
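
The selection logic can be sketched as a small standalone function, which makes it easy to try outside a pod. This is only a sketch: `polling_for_host` is a hypothetical helper, and it uses portable `sh` pattern matching rather than the `[[ ... =~ ... ]]` form from the snippet, so it is a variation and not the exact code running in the container.

```sh
# Hypothetical helper: print the polling interval for a given pod hostname.
# $2 and $3 play the roles of PAPERLESS_CONSUMER_POLLING_FIRST and _OTHER.
polling_for_host() {
  host=$1 first=$2 other=$3
  case $host in
    *-*) ordinal=${host##*-} ;;   # text after the last "-"
    *)   ordinal= ;;              # no "-" at all: not a StatefulSet pod name
  esac
  case $ordinal in
    ''|*[!0-9]*) echo "Strange Hostname: $host" >&2; return 1 ;;
  esac
  if [ "$ordinal" -eq 0 ]; then
    echo "$first"
  else
    echo "$other"
  fi
}

polling_for_host paperless-app-0 15 99999999   # prints 15
polling_for_host paperless-app-3 15 99999999   # prints 99999999

# Sanity check of the "3-year" figure: whole years in 99999999 seconds.
echo $(( 99999999 / (365 * 24 * 3600) ))       # prints 3
```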

I’m happy to discuss other or better solutions, but this seems to work for me (based on a 30-minute test throwing a load of documents at Paperless-NGX with dynamic scaling of pods, without any of the errors from above showing up).

@Thomas-Ganter
Author

Small correction. This works better in scaling situations:

  […]
                export PAPERLESS_CONSUMER_POLLING=${PAPERLESS_CONSUMER_POLLING_OTHER}
                export PAPERLESS_CONSUMPTION_DIR=/tmp/nothing-here
                mkdir      /tmp/nothing-here
                chmod 777  /tmp/nothing-here
  […]

because otherwise the autoscaled pods will pick up the documents already in the consume folder.
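
A slightly more defensive variant of that step, as a sketch only: `use_empty_consumption_dir` is a hypothetical helper, and `mkdir -p` is swapped in for plain `mkdir` so the step does not abort under `set -e` if the directory already exists.

```sh
# Hypothetical helper for the non-primary pods: point the consumer at an
# empty, pod-local directory so there is nothing for it to pick up.
use_empty_consumption_dir() {
  dir=${1:-/tmp/nothing-here}
  mkdir -p "$dir"     # -p: succeed even if the directory already exists
  chmod 777 "$dir"
  export PAPERLESS_CONSUMPTION_DIR="$dir"
}

# Demo against a scratch location instead of the real /tmp/nothing-here.
scratch=$(mktemp -d)
use_empty_consumption_dir "$scratch/nothing-here"
use_empty_consumption_dir "$scratch/nothing-here"   # second call is harmless
echo "$PAPERLESS_CONSUMPTION_DIR"
```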

@falnos24865

I am kinda curious why this has been closed. This bug has rendered my Paperless install unusable. It got well over 100k jobs for only a few hundred files. I don't see any way to clear the queue, so it's kind of stuck right now.

@shamoon
Member

shamoon commented Oct 23, 2022

> I am kinda curious why this has been closed. This bug has rendered my Paperless install unusable. It got well over 100k jobs for only a few hundred files. I don't see any way to clear the queue, so it's kind of stuck right now.

Well, as you can see, it was closed by the original poster, who left a detailed explanation of how he solved the issue.

If you have a different issue or the solution above doesn't work, then you can decide what to do. My guess is that it is something specific to your setup (and thus can potentially be fixed on your end, though of course not necessarily). Many, many people use this software and we haven't seen any other reports of this, so I'm not so sure there is a true "bug" somewhere...

@Thomas-Ganter
Author

I closed it because I feel it was not a real bug: installing on Kubernetes with more than one container does not seem to be officially supported.

I have created a feature request for that purpose.

@falnos24865

I can understand that. I have a simple setup, only one instance. No Kubernetes, only 1 worker/thread. The only thing is that it's behind a standard reverse proxy and it's using an external Redis server. I still get this issue, and have for months. I wonder if it's maybe a timing issue with the external Redis?

@Thomas-Ganter
Author

> I can understand that. I have a simple setup, only one instance. No Kubernetes, only 1 worker/thread. The only thing is that it's behind a standard reverse proxy and it's using an external Redis server. I still get this issue, and have for months. I wonder if it's maybe a timing issue with the external Redis?

Oohh, maybe you could create a separate ticket for that? My issue was clearly caused by running multiple pods in parallel, and it has been reproducibly gone since I implemented my fix. Your issue is therefore most likely a different one.

@github-actions
Contributor

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 15, 2023