This repository has been archived by the owner on Feb 16, 2023. It is now read-only.

[BUG] Installed from script and Gotenburg and Tika not working? #1594

Open
2600box opened this issue Feb 1, 2022 · 10 comments

@2600box

2600box commented Feb 1, 2022

Hello, thanks for this great work!

I am new to paperless-ng and do not normally use Docker, so I may be doing something wrong.

My paperless works well, but when I try to import a .docx file, for example, it fails with: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office

I installed using the script and chose to enable Tika.

Gotenberg and Tika are running according to docker ps:

paperless@docker ~/paperless-ng$ docker ps
CONTAINER ID   IMAGE                              COMMAND                  CREATED          STATUS                           PORTS                                       NAMES
8a20a33aefa6   jonaswinkler/paperless-ng:latest   "/sbin/docker-entryp…"   2 minutes ago    Up 2 minutes (healthy)           0.0.0.0:8000->8000/tcp, :::8000->8000/tcp   paperless-webserver-1
b4d6babc41a2   postgres:13                        "docker-entrypoint.s…"   24 minutes ago   Up 23 minutes                    5432/tcp                                    paperless-db-1
ed4b52bfb5a4   redis:6.0                          "docker-entrypoint.s…"   24 minutes ago   Up 23 minutes                    6379/tcp                                    paperless-broker-1
d8bf67ec76c5   thecodingmachine/gotenberg         "/usr/bin/tini -- go…"   24 minutes ago   Up 23 minutes                    3000/tcp                                    paperless-gotenberg-1
85843f762418   apache/tika                        "/bin/sh -c 'exec ja…"   24 minutes ago   Up 23 minutes                    9998/tcp                                    paperless-tika-1
paperless@docker ~/paperless-ng$ docker-compose up
[+] Running 5/5
 ⠿ Container paperless-tika-1       Running   0.0s
 ⠿ Container paperless-gotenberg-1  Running   0.0s
 ⠿ Container paperless-db-1         Running   0.0s
 ⠿ Container paperless-broker-1     Running   0.0s
 ⠿ Container paperless-webserver-1  Created   9.2s
Attaching to paperless-broker-1, paperless-db-1, paperless-gotenberg-1, paperless-tika-1, paperless-webserver-1
paperless-webserver-1  | Paperless-ng docker container starting...
paperless-webserver-1  | Creating directory /tmp/paperless
paperless-webserver-1  | Adjusting permissions of paperless files. This may take a while.
paperless-webserver-1  | Waiting for PostgreSQL to start...
paperless-webserver-1  | Apply database migrations...
paperless-webserver-1  | Operations to perform:
paperless-webserver-1  |   Apply all migrations: admin, auth, authtoken, contenttypes, django_q, documents, paperless_mail, sessions
paperless-webserver-1  | Running migrations:
paperless-webserver-1  |   No migrations to apply.
paperless-webserver-1  | Executing /usr/local/bin/supervisord -c /etc/supervisord.conf
paperless-webserver-1  | 2022-02-01 11:22:15,874 INFO Set uid to user 0 succeeded
paperless-webserver-1  | 2022-02-01 11:22:15,875 INFO supervisord started with pid 1
paperless-webserver-1  | 2022-02-01 11:22:16,877 INFO spawned: 'consumer' with pid 36
paperless-webserver-1  | 2022-02-01 11:22:16,879 INFO spawned: 'gunicorn' with pid 37
paperless-webserver-1  | 2022-02-01 11:22:16,881 INFO spawned: 'scheduler' with pid 38
paperless-webserver-1  | [2022-02-01 12:22:17 +0100] [37] [INFO] Starting gunicorn 20.1.0
paperless-webserver-1  | [2022-02-01 12:22:17 +0100] [37] [INFO] Listening at: http://0.0.0.0:8000 (37)
paperless-webserver-1  | [2022-02-01 12:22:17 +0100] [37] [INFO] Using worker: paperless.workers.ConfigurableWorker
paperless-webserver-1  | [2022-02-01 12:22:17 +0100] [37] [INFO] Server is ready. Spawning workers
paperless-webserver-1  | 12:22:17 [Q] INFO Q Cluster romeo-idaho-nine-diet starting.
paperless-webserver-1  | [2022-02-01 12:22:17,742] [INFO] [paperless.management.consumer] Using inotify to watch directory for changes: /usr/src/paperless/src/../consume
paperless-webserver-1  | 12:22:17 [Q] INFO Process-1:1 ready for work at 61
paperless-webserver-1  | 12:22:17 [Q] INFO Process-1:2 ready for work at 62
paperless-webserver-1  | 12:22:17 [Q] INFO Process-1:3 monitoring at 63
paperless-webserver-1  | 12:22:17 [Q] INFO Process-1 guarding cluster romeo-idaho-nine-diet
paperless-webserver-1  | 12:22:17 [Q] INFO Process-1:4 pushing tasks at 64
paperless-webserver-1  | 12:22:17 [Q] INFO Q Cluster romeo-idaho-nine-diet running.
paperless-webserver-1  | 2022-02-01 11:22:18,836 INFO success: consumer entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
paperless-webserver-1  | 2022-02-01 11:22:18,836 INFO success: gunicorn entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
paperless-webserver-1  | 2022-02-01 11:22:18,836 INFO success: scheduler entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
paperless-webserver-1  | 12:22:47 [Q] INFO Enqueued 1
paperless-webserver-1  | 12:22:47 [Q] INFO Process-1 created a task from schedule [Check all e-mail accounts]
paperless-webserver-1  | 12:22:47 [Q] INFO Process-1:1 processing [lithium-edward-diet-utah]
paperless-webserver-1  | /usr/local/lib/python3.9/site-packages/imap_tools/mailbox.py:214: UserWarning: seen method are deprecated and will be removed soon, use flag method instead
paperless-webserver-1  |   warnings.warn('seen method are deprecated and will be removed soon, use flag method instead')
paperless-webserver-1  | 12:22:50 [Q] INFO Process-1:1 stopped doing work
paperless-webserver-1  | 12:22:50 [Q] INFO Processed [lithium-edward-diet-utah]
paperless-webserver-1  | 12:22:50 [Q] INFO recycled worker Process-1:1
paperless-webserver-1  | 12:22:50 [Q] INFO Process-1:5 ready for work at 77
paperless-broker-1     | 1:M 01 Feb 2022 11:23:06.030 * 100 changes in 300 seconds. Saving...
paperless-broker-1     | 1:M 01 Feb 2022 11:23:06.031 * Background saving started by pid 20
paperless-broker-1     | 20:C 01 Feb 2022 11:23:06.044 * DB saved on disk
paperless-broker-1     | 20:C 01 Feb 2022 11:23:06.044 * RDB: 0 MB of memory used by copy-on-write
paperless-broker-1     | 1:M 01 Feb 2022 11:23:06.132 * Background saving terminated with success
paperless-webserver-1  | [2022-02-01 12:24:01,094] [WARNING] [django.security.SuspiciousSession] Session data corrupted
paperless-webserver-1  | [2022-02-01 12:24:01,184] [WARNING] [django.security.SuspiciousSession] Session data corrupted
paperless-webserver-1  | [2022-02-01 12:24:04,271] [WARNING] [django.security.SuspiciousSession] Session data corrupted
paperless-webserver-1  | 12:24:14 [Q] INFO Enqueued 1
paperless-webserver-1  | 12:24:14 [Q] INFO Process-1:2 processing [Dear Facilitators.docx]
paperless-webserver-1  | [2022-02-01 12:24:15,000] [INFO] [paperless.consumer] Consuming Dear Facilitators.docx
paperless-webserver-1  | [2022-02-01 12:24:15,008] [INFO] [paperless.parsing.tika] Sending /tmp/paperless/paperless-upload-zf1ilcyo to Tika server
paperless-tika-1       | INFO  [qtp2128195220-23] 11:24:15,195 org.apache.tika.server.resource.RecursiveMetadataResource rmeta/text (autodetecting type)
paperless-webserver-1  | [2022-02-01 12:24:15,631] [INFO] [paperless.parsing.tika] Converting /tmp/paperless/paperless-upload-zf1ilcyo to PDF as /tmp/paperless/paperless-agiq8vzt/convert.pdf
paperless-gotenberg-1  | {"level":"error","ts":1643714655.6423903,"logger":"api","msg":"code=404, message=Not Found","trace":"8662f7e2-1acd-4f7b-bfe0-fd235b6c1f59","remote_ip":"172.23.0.6","host":"gotenberg:3000","uri":"/convert/office","method":"POST","path":"/convert/office","referer":"","user_agent":"python-requests/2.26.0","status":404,"latency":2408520,"latency_human":"2.40852ms","bytes_in":31351,"bytes_out":9}
paperless-webserver-1  | [2022-02-01 12:24:15,647] [ERROR] [paperless.consumer] Error while consuming document Dear Facilitators.docx: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1  | Traceback (most recent call last):
paperless-webserver-1  |   File "/usr/src/paperless/src/paperless_tika/parsers.py", line 79, in convert_to_pdf
paperless-webserver-1  |     response.raise_for_status()  # ensure we notice bad responses
paperless-webserver-1  |   File "/usr/local/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
paperless-webserver-1  |     raise HTTPError(http_error_msg, response=self)
paperless-webserver-1  | requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1  |
paperless-webserver-1  | During handling of the above exception, another exception occurred:
paperless-webserver-1  |
paperless-webserver-1  | Traceback (most recent call last):
paperless-webserver-1  |   File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
paperless-webserver-1  |     document_parser.parse(self.path, mime_type, self.filename)
paperless-webserver-1  |   File "/usr/src/paperless/src/paperless_tika/parsers.py", line 65, in parse
paperless-webserver-1  |     self.archive_path = self.convert_to_pdf(document_path, file_name)
paperless-webserver-1  |   File "/usr/src/paperless/src/paperless_tika/parsers.py", line 81, in convert_to_pdf
paperless-webserver-1  |     raise ParseError(
paperless-webserver-1  | documents.parsers.ParseError: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1  | 12:24:15 [Q] INFO Process-1:2 stopped doing work
paperless-webserver-1  | 12:24:15 [Q] INFO recycled worker Process-1:2
paperless-webserver-1  | 12:24:15 [Q] INFO Process-1:6 ready for work at 123
paperless-webserver-1  | 12:24:15 [Q] ERROR Failed [Dear Facilitators.docx] - Dear Facilitators.docx: Error while consuming document Dear Facilitators.docx: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office : Traceback (most recent call last):
paperless-webserver-1  |   File "/usr/src/paperless/src/paperless_tika/parsers.py", line 79, in convert_to_pdf
paperless-webserver-1  |     response.raise_for_status()  # ensure we notice bad responses
paperless-webserver-1  |   File "/usr/local/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
paperless-webserver-1  |     raise HTTPError(http_error_msg, response=self)
paperless-webserver-1  | requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1  |
paperless-webserver-1  | During handling of the above exception, another exception occurred:
paperless-webserver-1  |
paperless-webserver-1  | Traceback (most recent call last):
paperless-webserver-1  |   File "/usr/local/lib/python3.9/site-packages/asgiref/sync.py", line 288, in main_wrap
paperless-webserver-1  |     raise exc_info[1]
paperless-webserver-1  |   File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
paperless-webserver-1  |     document_parser.parse(self.path, mime_type, self.filename)
paperless-webserver-1  |   File "/usr/src/paperless/src/paperless_tika/parsers.py", line 65, in parse
paperless-webserver-1  |     self.archive_path = self.convert_to_pdf(document_path, file_name)
paperless-webserver-1  |   File "/usr/src/paperless/src/paperless_tika/parsers.py", line 81, in convert_to_pdf
paperless-webserver-1  |     raise ParseError(
paperless-webserver-1  | documents.parsers.ParseError: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1  |
paperless-webserver-1  | During handling of the above exception, another exception occurred:
paperless-webserver-1  |
paperless-webserver-1  | Traceback (most recent call last):
paperless-webserver-1  |   File "/usr/local/lib/python3.9/site-packages/django_q/cluster.py", line 432, in worker
paperless-webserver-1  |     res = f(*task["args"], **task["kwargs"])
paperless-webserver-1  |   File "/usr/src/paperless/src/documents/tasks.py", line 74, in consume_file
paperless-webserver-1  |     document = Consumer().try_consume_file(
paperless-webserver-1  |   File "/usr/src/paperless/src/documents/consumer.py", line 266, in try_consume_file
paperless-webserver-1  |     self._fail(
paperless-webserver-1  |   File "/usr/src/paperless/src/documents/consumer.py", line 70, in _fail
paperless-webserver-1  |     raise ConsumerError(f"{self.filename}: {log_message or message}")
paperless-webserver-1  | documents.consumer.ConsumerError: Dear Facilitators.docx: Error while consuming document Dear Facilitators.docx: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1  |
paperless-webserver-1  | [2022-02-01 12:24:17 +0100] [37] [CRITICAL] WORKER TIMEOUT (pid:40)
paperless-webserver-1  | [2022-02-01 12:24:17 +0100] [37] [WARNING] Worker with pid 40 was terminated due to signal 6
paperless@docker ~/paperless-ng$ cat docker-compose.yml

# docker-compose file for running paperless from the Docker Hub.
# This file contains everything paperless needs to run.
# Paperless supports amd64, arm and arm64 hardware.
#
# All compose files of paperless configure paperless in the following way:
#
# - Paperless is (re)started on system boot, if it was running before shutdown.
# - Docker volumes for storing data are managed by Docker.
# - Folders for importing and exporting files are created in the same directory
#   as this file and mounted to the correct folders inside the container.
# - Paperless listens on port 8000.
#
# In addition to that, this docker-compose file adds the following optional
# configurations:
#
# - Instead of SQLite (default), PostgreSQL is used as the database server.
# - Apache Tika and Gotenberg servers are started with paperless and paperless
#   is configured to use these services. These provide support for consuming
#   Office documents (Word, Excel, Power Point and their LibreOffice counter-
#   parts.
#
# To install and update paperless with this file, do the following:
#
# - Copy this file as 'docker-compose.yml' and the files 'docker-compose.env'
#   and '.env' into a folder.
# - Run 'docker-compose pull'.
# - Run 'docker-compose run --rm webserver createsuperuser' to create a user.
# - Run 'docker-compose up -d'.
#
# For more extensive installation and update instructions, refer to the
# documentation.

version: "3.4"
services:
  broker:
    image: redis:6.0
    restart: unless-stopped

  db:
    image: postgres:13
    restart: unless-stopped
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless

  webserver:
    image: jonaswinkler/paperless-ng:latest
    restart: unless-stopped
    depends_on:
      - db
      - broker
      - gotenberg
      - tika
    ports:
      - 8000:8000
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000"]
      interval: 30s
      timeout: 10s
      retries: 5
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - ./export:/usr/src/paperless/export
      - /home/paperless/paperless-ng/consume:/usr/src/paperless/consume
    env_file: docker-compose.env
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998

  gotenberg:
    image: thecodingmachine/gotenberg
    restart: unless-stopped
    environment:
      DISABLE_GOOGLE_CHROME: 1

  tika:
    image: apache/tika
    restart: unless-stopped

volumes:
  data:
  media:
  pgdata:

@greenship24

I have/had the same issue. It looks as if Gotenberg has updated its API. An image that does work is thecodingmachine/gotenberg:6.0.0. I'm unsure when the API was updated (it appears they're not respecting semver?), as 6.4.4 did not work with paperless-ng either.

So the workaround would be to use the 6.0.0 tag.

Paperless-ng will have to be updated to use the newer API, whose routes all seem to be under localhost:3000/forms:

https://gotenberg.dev/docs/modules/libreoffice
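
For reference, the two routes in play can be put side by side. This is only a sketch: it assumes the Gotenberg 6.x route is the `/convert/office` path that paperless-ng requests in the log above, and that the 7.x route is the `/forms/libreoffice/convert` path from the linked LibreOffice module docs. `OFFICE_ROUTES` and `office_convert_url` are hypothetical names, not part of paperless-ng or Gotenberg.

```python
# Office-conversion route per Gotenberg major version (assumed mapping).
# 6: the path paperless-ng requests in the log in this issue.
# 7: the path documented for the LibreOffice module.
OFFICE_ROUTES = {
    6: "/convert/office",
    7: "/forms/libreoffice/convert",
}


def office_convert_url(base_url: str, major_version: int) -> str:
    """Build the conversion URL for the given Gotenberg major version."""
    route = OFFICE_ROUTES[major_version]
    # Strip a trailing slash so the result never contains "//".
    return base_url.rstrip("/") + route


if __name__ == "__main__":
    print(office_convert_url("http://gotenberg:3000", 6))
    print(office_convert_url("http://gotenberg:3000", 7))
```

A client that hard-codes the first route will get a 404 from a server that only serves the second, which matches the error in the original post.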

@greenship24

2dcacae

This commit actually fixes it. It just needs to be merged from dev to master, and then a new Docker image needs to be built and pushed.

I'd use the workaround until the maintainers push it to master.

@tompsg-git

As a workaround, you can use this:

PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://paperless-gotenberg:3000/forms/libreoffice/convert#

@SB97

SB97 commented Feb 6, 2022

> As a workaround, you can use this:
>
> PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://paperless-gotenberg:3000/forms/libreoffice/convert#

The workaround, if your setup is vanilla, in docker-compose.yml:

#     PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000/forms/libreoffice/convert#
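
Why the trailing `#` can help: judging by the URLs in the error messages in this thread, paperless-ng appears to build the request URL by appending a fixed route to the configured endpoint (an assumption; the paperless-ng source was not checked here). Anything after `#` is a URL fragment, which HTTP clients do not send to the server, so the appended route is effectively dropped. A minimal sketch with a hypothetical helper, `path_sent_to_server`:

```python
from urllib.parse import urlsplit


def path_sent_to_server(endpoint: str, suffix: str) -> str:
    """Return the path an HTTP client would request for endpoint + suffix.

    Assumption: the caller joins endpoint and suffix by plain string
    concatenation, as the URLs in the error messages here suggest.
    """
    url = endpoint + suffix
    parts = urlsplit(url)
    # Everything after '#' lands in parts.fragment and is not sent
    # to the server; only the path (and query) are.
    return parts.path


if __name__ == "__main__":
    # Default endpoint: the old route reaches Gotenberg unchanged.
    print(path_sent_to_server("http://gotenberg:3000", "/convert/office"))
    # Workaround endpoint: the appended route becomes a fragment.
    print(path_sent_to_server(
        "http://gotenberg:3000/forms/libreoffice/convert#", "/convert/office"
    ))
```

By the same reasoning, if the appended route is already `/forms/libreoffice/convert`, the path that reaches Gotenberg is still `/forms/libreoffice/convert`, so a 503 in that situation is unlikely to be a routing problem.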

@sense4t

sense4t commented Feb 9, 2022

Thanks, that was really helpful!

@MegamikeMUC

Unfortunately, this workaround didn't work for me.
PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000/forms/libreoffice/convert#
is set.
When I try to import a Word document, I get this error:

Error while converting document to PDF: 503 Server Error: Service Unavailable for url: http://gotenberg:3000/forms/libreoffice/convert#/forms/libreoffice/convert

@iplaughlin

I finally got Gotenberg to work. The issue is that, for whatever reason, the container isn't publishing a network port.

Going into Portainer and manually publishing the port (host 3000 to container 3000) resolved the issue of Gotenberg not being available. Alternatively, adding the lines

ports:
  - 3000:3000

to a Docker Compose file works.

The setting

PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000

should be used.

@CodeBrauer

CodeBrauer commented Apr 19, 2022

Possible solutions I already tried:

Changed the endpoint to PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000/forms/libreoffice/convert#
This resulted in the error message:

Error while converting document to PDF: 503 Server Error: Service Unavailable for url: http://gotenberg:3000/forms/libreoffice/convert#/forms/libreoffice/convert

So I changed the endpoint back to the default and added the ports as @iplaughlin described. Error message:

Error while converting document to PDF: 503 Server Error: Service Unavailable for url: http://gotenberg:3000/forms/libreoffice/convert

Gotenberg log:

{
  "level": "error",
  "ts": 1650366380.1664767,
  "logger": "api",
  "msg": "convert to PDF: lock long-running LibreOffice listener: acquire LibreOffice listener lock: context deadline exceeded",
  "trace": "52da9339-8761-4dca-bb2e-8ca269ce27ea",
  "remote_ip": "172.18.0.6",
  "host": "gotenberg:3000",
  "uri": "/forms/libreoffice/convert",
  "method": "POST",
  "path": "/forms/libreoffice/convert",
  "referer": "",
  "user_agent": "python-requests/2.27.1",
  "status": 503,
  "latency": 30002593316,
  "latency_human": "30.002593316s",
  "bytes_in": 17375,
  "bytes_out": 19
}
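
Reading the log entry: the request did reach `/forms/libreoffice/convert`, and Gotenberg gave up after about 30 seconds with a 503, so this looks like a timeout while waiting for LibreOffice rather than a wrong URL. A small sketch that summarizes such an entry follows. It assumes `latency` is in nanoseconds (which matches `latency_human` above), and `summarize` is a hypothetical helper.

```python
import json

# A trimmed copy of the Gotenberg log entry above (fields unchanged).
LOG_LINE = """
{
  "level": "error",
  "uri": "/forms/libreoffice/convert",
  "method": "POST",
  "status": 503,
  "latency": 30002593316
}
"""


def summarize(entry: dict) -> str:
    """Render method, URI, status and latency of one log entry."""
    # Assumption: latency is reported in nanoseconds.
    seconds = entry["latency"] / 1e9
    return f'{entry["method"]} {entry["uri"]} -> {entry["status"]} after {seconds:.1f}s'


if __name__ == "__main__":
    print(summarize(json.loads(LOG_LINE)))
```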

@iplaughlin

@CodeBrauer - I ended up spinning up gotenberg in its own container, outside of paperless.

@MegamikeMUC
Copy link

For my setup, only this worked:
image: gotenberg/gotenberg:7.4
(it seems it has to be a Gotenberg version higher than 7)
Neither the definition of ports nor the change of endpoint was successful.
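
Note that the docker-compose.yml in the original post uses `thecodingmachine/gotenberg` without a tag, which Docker resolves to `latest`, so the Gotenberg version can change on any pull. A rough sketch for checking which major version an image reference pins, if any; `image_major_version` is a hypothetical helper and only handles simple numeric tags:

```python
from typing import Optional


def image_major_version(image: str) -> Optional[int]:
    """Return the major version pinned by an image reference, or None.

    None means the reference has no numeric tag (for example no tag at
    all, which Docker resolves to 'latest').
    """
    _, sep, tag = image.rpartition(":")
    # No ':' at all, or the ':' belongs to a registry port such as
    # 'localhost:5000/name', means there is no tag.
    if not sep or "/" in tag:
        return None
    head = tag.split(".")[0]
    return int(head) if head.isdigit() else None


if __name__ == "__main__":
    for ref in (
        "thecodingmachine/gotenberg",
        "thecodingmachine/gotenberg:6.0.0",
        "gotenberg/gotenberg:7.4",
    ):
        print(ref, "->", image_major_version(ref))
```

Pinning a tag in the compose file keeps the Gotenberg API stable until paperless and Gotenberg are upgraded together.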
