This repository has been archived by the owner on Feb 16, 2023. It is now read-only.

[BUG] Could not parse Excel files with tika server at http://tika:9998: #1499

Open
Lucifer1590 opened this issue Dec 19, 2021 · 7 comments

Comments

@Lucifer1590

Describe the bug
I am doing a fresh install of paperless-ng using Docker on a Raspberry Pi 3B+, but even after multiple reinstalls from docker-compose as well as Portainer, I can't upload Office files and parse them.

To Reproduce
Steps to reproduce the behavior:

  1. Install paperless-ng through Docker
  2. Click on 'upload'
  3. Select any Microsoft Office file (doc/Excel)
  4. See error Could not parse /tmp/paperless/paperless-upload-4__50jys with tika server at http://tika:9998: HTTPConnectionPool(host='tika', port=9998): Max retries exceeded with url: /rmeta/text (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x71b69178>: Failed to establish a new connection: [Errno -2] Name or service not known'))

Expected behavior
The file should be uploaded and OCRed.

Screenshots
.

Webserver logs

15:16:38 [Q] INFO Process-1:5 ready for work at 107
15:16:39 [Q] ERROR Failed [acknowledgement sample 03.docx] - acknowledgement sample 03.docx: Error while consuming document acknowledgement sample 03.docx: Could not parse /tmp/paperless/paperless-upload-4__50jys with tika server at http://tika:9998: HTTPConnectionPool(host='tika', port=9998): Max retries exceeded with url: /rmeta/text (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x71b69178>: Failed to establish a new connection: [Errno -2] Name or service not known')) : Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 169, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/connection.py", line 73, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/local/lib/python3.9/socket.py", line 953, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/local/lib/python3.9/http/client.py", line 1257, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1303, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1252, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1012, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.9/http/client.py", line 952, in send
    self.connect()
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 200, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 181, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x71b69178>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='tika', port=9998): Max retries exceeded with url: /rmeta/text (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x71b69178>: Failed to establish a new connection: [Errno -2] Name or service not known'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tika/parsers.py", line 49, in parse
    parsed = parser.from_file(document_path, tika_server)
  File "/usr/local/lib/python3.9/site-packages/tika/parser.py", line 40, in from_file
    output = parse1(service, filename, serverEndpoint, headers=headers, config_path=config_path, requestOptions=requestOptions)
  File "/usr/local/lib/python3.9/site-packages/tika/tika.py", line 336, in parse1
    status, response = callServer('put', serverEndpoint, service, f,
  File "/usr/local/lib/python3.9/site-packages/tika/tika.py", line 554, in callServer
    resp = verbFn(serviceUrl, encodedData, **effectiveRequestOptions)
  File "/usr/local/lib/python3.9/site-packages/requests/api.py", line 132, in put
    return request('put', url, data=data, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='tika', port=9998): Max retries exceeded with url: /rmeta/text (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x71b69178>: Failed to establish a new connection: [Errno -2] Name or service not known'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/asgiref/sync.py", line 288, in main_wrap
    raise exc_info[1]
  File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
    document_parser.parse(self.path, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tika/parsers.py", line 51, in parse
    raise ParseError(
documents.parsers.ParseError: Could not parse /tmp/paperless/paperless-upload-4__50jys with tika server at http://tika:9998: HTTPConnectionPool(host='tika', port=9998): Max retries exceeded with url: /rmeta/text (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x71b69178>: Failed to establish a new connection: [Errno -2] Name or service not known'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/django_q/cluster.py", line 432, in worker
    res = f(*task["args"], **task["kwargs"])
  File "/usr/src/paperless/src/documents/tasks.py", line 74, in consume_file
    document = Consumer().try_consume_file(
  File "/usr/src/paperless/src/documents/consumer.py", line 266, in try_consume_file
    self._fail(
  File "/usr/src/paperless/src/documents/consumer.py", line 70, in _fail
    raise ConsumerError(f"{self.filename}: {log_message or message}")
documents.consumer.ConsumerError: acknowledgement sample 03.docx: Error while consuming document acknowledgement sample 03.docx: Could not parse /tmp/paperless/paperless-upload-4__50jys with tika server at http://tika:9998: HTTPConnectionPool(host='tika', port=9998): Max retries exceeded with url: /rmeta/text (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x71b69178>: Failed to establish a new connection: [Errno -2] Name or service not known'))

Relevant information

  • Host OS of the machine running paperless: Raspbian GNU/Linux 10 (buster)
  • Raspberry Pi
  • Browser Firefox
  • Version 1.5.0
  • Installation method: docker

docker-compose

version: "3.4"
services:
  broker:
    image: redis:6.0
    restart: unless-stopped

  db:
    image: postgres:13
    restart: unless-stopped
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless

  webserver:
    image: jonaswinkler/paperless-ng:latest
    restart: unless-stopped
    depends_on:
      - db
      - broker
      - gotenberg
      - tika
    ports:
      - 8010:8000
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000"]
      interval: 30s
      timeout: 10s
      retries: 5
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - ./export:/usr/src/paperless/export
      - ./consume:/usr/src/paperless/consume

    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998

  gotenberg:
    image: thecodingmachine/gotenberg
    restart: unless-stopped
    environment:
      DISABLE_GOOGLE_CHROME: 1

  tika:
    image: apache/tika
    restart: unless-stopped

volumes:
  data:
  media:
  pgdata:
@siancu
Contributor

siancu commented Dec 21, 2021

The error you're seeing is because the paperless container cannot connect to the tika server, for some reason.
Try adding container_name: tika to your docker-compose.yml (right below image: apache/tika) and see if the problem is solved.

Do the same for gotenberg to be on the safe side. I know that the hostname should be set to the service name, but maybe in this case it isn't.
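The "[Errno -2] Name or service not known" in the traceback is a name-resolution failure, so one way to narrow things down is to check whether the service names resolve at all from inside the paperless container. Below is a minimal sketch of such a check; the helper name can_resolve is made up for illustration, and the host/port pairs are taken from the docker-compose file above.

```python
import socket


def can_resolve(hostname: str, port: int) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Same exception class as in the traceback above
        # (socket.gaierror: [Errno -2] Name or service not known).
        return False
    return True


if __name__ == "__main__":
    # Service names and ports as used in the compose file of this issue.
    for host, port in (("tika", 9998), ("gotenberg", 3000)):
        status = "resolves" if can_resolve(host, port) else "does NOT resolve"
        print(f"{host}:{port} {status}")
```

The script only tells you something if it runs inside the webserver container, since the service names exist only on the compose network. If tika does not resolve there, the tika container is most likely not running or not attached to that network, which is worth checking before changing hostnames.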

@Dan1001

Dan1001 commented Dec 25, 2021

I have the same problem, and that fix doesn't solve it. Unfortunately, this makes paperless-ng useless for me, as I can't search Office documents.

@sense4t

sense4t commented Feb 9, 2022

I had the same issue; the solution in #1594 worked for me!

@grutzifix

Hmm, that didn't work for me:

Could not parse /tmp/paperless/paperless-upload-_v49baln with tika server at http://tika:9998: HTTPConnectionPool(host='tika', port=9998): Max retries exceeded with url: /rmeta/text (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xb1b6a178>: Failed to establish a new connection: [Errno -2] Name or service not known'))

@rokim

rokim commented Feb 12, 2022

@grutzifix You're probably running the apache/tika image on a non-amd64 architecture (see #1354). I had the same problem on my RPi4 with arm64. Try changing the tika image in your docker-compose.yml to
image: abhilesh7/apache-tika-arm

You still have to apply the changed Gotenberg endpoint from #1594:
PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000/forms/libreoffice/convert#
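The trailing # in that endpoint looks like a typo but appears to be deliberate. A plausible explanation, sketched below, is that paperless-ng appends a fixed path to the configured endpoint, and the # turns that appended part into a URL fragment, which HTTP clients do not send to the server. The suffix /convert/office used here is an assumption for illustration and is not confirmed anywhere in this thread; effective_path is a hypothetical helper.

```python
from urllib.parse import urlsplit

# Assumption: the path paperless-ng appends to the configured endpoint.
ASSUMED_SUFFIX = "/convert/office"


def effective_path(endpoint: str, suffix: str = ASSUMED_SUFFIX) -> str:
    """Return the path a client would actually request for endpoint + suffix."""
    parts = urlsplit(endpoint + suffix)
    # Anything after '#' lands in parts.fragment and is not part of the request.
    return parts.path


if __name__ == "__main__":
    # Original endpoint: the appended suffix becomes the request path.
    print(effective_path("http://gotenberg:3000"))
    # Endpoint from #1594: the suffix is pushed into the fragment.
    print(effective_path("http://gotenberg:3000/forms/libreoffice/convert#"))
```

Under that assumption, the first call yields /convert/office and the second yields /forms/libreoffice/convert, so removing the # would likely bring the old path back.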

Btw, nice nickname. Greetings from Bavaria :)

@rokim

rokim commented Feb 13, 2022

Try changing the tika image in your docker-compose.yml to image: abhilesh7/apache-tika-arm

@Lucifer1590 this should also fix your problem on RPi3.
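To tell whether this advice applies to a given host, it helps to know which CPU architecture the machine reports. The following is a rough sketch using only the standard library; the set of names treated as amd64 and the helper is_amd64 are assumptions for illustration, and the script does not check which architectures any particular image is actually published for.

```python
import platform

# Machine names commonly reported by 64-bit x86 hosts (assumed list).
AMD64_NAMES = {"x86_64", "amd64"}


def is_amd64(machine: str) -> bool:
    """Return True if the reported machine name looks like amd64/x86_64."""
    return machine.lower() in AMD64_NAMES


if __name__ == "__main__":
    machine = platform.machine()
    if is_amd64(machine):
        print(f"{machine}: amd64 host, an amd64-only image should run here")
    else:
        print(f"{machine}: non-amd64 host, an ARM build of the image is probably needed")
```

On a Raspberry Pi this typically reports an ARM name such as armv7l or aarch64, which is the situation described above.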

@grutzifix

Worked! thx <3
