[BUG] Upload fails due to date parsing errors #1188

Closed
harababurel opened this issue Jul 1, 2022 · 6 comments
Labels
bug Bug report or a Bug-fix cant-reproduce

Comments

@harababurel

Description

I tried uploading a scanned PDF document. The web UI gets stuck at "Retrieving date from document...", and the logs show regex._regex_core.error: bad escape \d at position 7.

I tried updating dateparser, django-q, and regex using pip (these packages show up in the attached stack trace), but it had no effect.
paperless-ngx is installed on Arch Linux via the AUR. This is a new problem for me; previously I was able to upload documents on the same installation.

Steps to reproduce

Upload any PDF.

Webserver logs

Jul 02 00:09:22 orion paperless-manage[1528340]: 22:09:22 [Q] ERROR Failed [some_document.pdf] - bad escape \d at position 7 : Traceback (most recent call last):
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/lib/python3.10/site-packages/django_q/cluster.py", line 432, in worker
Jul 02 00:09:22 orion paperless-manage[1528340]:     res = f(*task["args"], **task["kwargs"])
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/share/paperless/src/documents/tasks.py", line 298, in consume_file
Jul 02 00:09:22 orion paperless-manage[1528340]:     document = Consumer().try_consume_file(
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/share/paperless/src/documents/consumer.py", line 275, in try_consume_file
Jul 02 00:09:22 orion paperless-manage[1528340]:     date = parse_date(self.filename, text)
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/share/paperless/src/documents/parsers.py", line 265, in parse_date
Jul 02 00:09:22 orion paperless-manage[1528340]:     date = __parser(date_string, settings.DATE_ORDER)
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/share/paperless/src/documents/parsers.py", line 223, in __parser
Jul 02 00:09:22 orion paperless-manage[1528340]:     return dateparser.parse(
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/lib/python3.10/site-packages/dateparser/conf.py", line 92, in wrapper
Jul 02 00:09:22 orion paperless-manage[1528340]:     return f(*args, **kwargs)
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/lib/python3.10/site-packages/dateparser/__init__.py", line 61, in parse
Jul 02 00:09:22 orion paperless-manage[1528340]:     data = parser.get_date_data(date_string, date_formats)
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/lib/python3.10/site-packages/dateparser/date.py", line 428, in get_date_data
Jul 02 00:09:22 orion paperless-manage[1528340]:     parsed_date = _DateLocaleParser.parse(
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/lib/python3.10/site-packages/dateparser/date.py", line 178, in parse
Jul 02 00:09:22 orion paperless-manage[1528340]:     return instance._parse()
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/lib/python3.10/site-packages/dateparser/date.py", line 182, in _parse
Jul 02 00:09:22 orion paperless-manage[1528340]:     date_data = self._parsers[parser_name]()
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/lib/python3.10/site-packages/dateparser/date.py", line 196, in _try_freshness_parser
Jul 02 00:09:22 orion paperless-manage[1528340]:     return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/lib/python3.10/site-packages/dateparser/date.py", line 234, in _get_translated_date
Jul 02 00:09:22 orion paperless-manage[1528340]:     self._translated_date = self.locale.translate(
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/lib/python3.10/site-packages/dateparser/languages/locale.py", line 131, in translate
Jul 02 00:09:22 orion paperless-manage[1528340]:     relative_translations = self._get_relative_translations(settings=settings)
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/lib/python3.10/site-packages/dateparser/languages/locale.py", line 158, in _get_relative_translations
Jul 02 00:09:22 orion paperless-manage[1528340]:     self._generate_relative_translations(normalize=True))
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/lib/python3.10/site-packages/dateparser/languages/locale.py", line 172, in _generate_relative_translations
Jul 02 00:09:22 orion paperless-manage[1528340]:     pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/lib/python3.10/site-packages/regex/regex.py", line 702, in _compile_replacement_helper
Jul 02 00:09:22 orion paperless-manage[1528340]:     is_group, items = _compile_replacement(source, pattern, is_unicode)
Jul 02 00:09:22 orion paperless-manage[1528340]:   File "/usr/lib/python3.10/site-packages/regex/_regex_core.py", line 1737, in _compile_replacement
Jul 02 00:09:22 orion paperless-manage[1528340]:     raise error("bad escape \\%s" % ch, source.string, source.pos)
Jul 02 00:09:22 orion paperless-manage[1528340]: regex._regex_core.error: bad escape \d at position 7
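
The last frames of the trace point at the replacement string passed to DIGIT_GROUP_PATTERN.sub. A minimal sketch of that failure mode follows. It uses the standard-library re module as a stand-in, on the assumption that it rejects an unknown escape such as \d in a replacement template the same way the installed regex release does (stdlib re has done so since Python 3.7). DIGIT_GROUP is a hypothetical stand-in for dateparser's DIGIT_GROUP_PATTERN, whose definition does not appear in the trace.

```python
import re

# Replacement text copied from the traceback (dateparser/languages/locale.py).
TEMPLATE = r'?P<n>\d+'

# Hypothetical stand-in for DIGIT_GROUP_PATTERN: matches the literal text "\d+".
DIGIT_GROUP = re.compile(r'\\d\+')


def substitute_with_template(pattern: str) -> str:
    """Pass TEMPLATE as a replacement template, so its backslash escapes are interpreted."""
    return DIGIT_GROUP.sub(TEMPLATE, pattern)


def substitute_literally(pattern: str) -> str:
    """Pass a callable instead, so the returned text is inserted without escape processing."""
    return DIGIT_GROUP.sub(lambda _match: TEMPLATE, pattern)


sample = r'in (\d+) days'

try:
    substitute_with_template(sample)
    template_error = None
except re.error as exc:
    # "\d" is not a valid escape inside a replacement template.
    template_error = str(exc)

literal_result = substitute_literally(sample)

print(template_error)
print(literal_result)
```

Using a callable as the replacement is only one way to sidestep template parsing; it is shown here to illustrate the cause, not as the fix dateparser adopted.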

Paperless-ngx version

1.7.1-2

Host OS

archlinux 5.18.6-arch1-1 x86_64

Installation method

Other (please describe above)

Browser

Safari

Configuration changes

Nothing except for setting the secret key.

Other

No response

@harababurel harababurel added bug Bug report or a Bug-fix unconfirmed labels Jul 1, 2022
@stumpylog
Member

This is most likely a packaging and version issue. Can you confirm that your Python package versions match the versions released for 1.7.1? A pip freeze or something similar, run in the paperless virtual env if it exists, would help.
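
As a rough alternative to reading pip freeze output by eye, the comparison can be scripted. The sketch below is an assumption-laden illustration: parse_pins, installed_version and find_mismatches are hypothetical helper names, and the pinned versions in the example are placeholders, not the actual contents of the 1.7.1 requirements.txt. Only the installed regex version (2022.6.2) comes from this thread.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def parse_pins(requirements_text: str) -> dict:
    """Collect 'name==version' pins from requirements-style text, ignoring other lines."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop trailing comments
        if '==' not in line:
            continue
        name, version = line.split('==', 1)
        version = version.split(';', 1)[0].strip()  # drop environment markers
        pins[name.strip().lower()] = version
    return pins


def find_mismatches(pins: dict, lookup=installed_version) -> dict:
    """Map each package whose installed version differs to (installed, pinned)."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (found, wanted)
    return mismatches


# Placeholder pins for illustration only; substitute the real requirements.txt text.
example_pins = parse_pins("regex==2022.3.2\ndateparser==1.1.1  # placeholder\n")

# Simulated environment so the example is reproducible; omit `lookup` to query
# the packages actually installed in the current environment.
simulated = {'regex': '2022.6.2', 'dateparser': '1.1.1'}
example_mismatches = find_mismatches(example_pins, lookup=simulated.get)
print(example_mismatches)
```

Run inside the paperless virtual env without the lookup argument, find_mismatches reports any package whose installed version differs from its pin.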

@mueea001

mueea001 commented Jul 2, 2022

I have the same issue and I've noticed that the regex version is 2022.6.2.

@jeLee6gi

jeLee6gi commented Jul 5, 2022

python-regex dropped compatibility for Python < 3.6:

mrabarnett/mrab-regex#459

and dateparser resolved the issue on their end by pinning an old version of python-regex

scrapinghub/dateparser#1045
scrapinghub/dateparser#1052

@stumpylog
Member

Thank you, that does look to be the case.

I'm going to close this as a packaging issue: the Arch package does not follow the requirements.txt versions the team knows work. If downgrading and otherwise matching those versions still does not work, feel free to reopen.

@harababurel
Author

Thanks for the pointers. I was able to get it working by removing the AUR package and running paperless-ngx as a Docker container instead.

@github-actions
Contributor

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 15, 2023