[BUG] Importer fails with Umlaut #898

Closed
HorstyS opened this issue May 8, 2022 · 21 comments

@HorstyS

HorstyS commented May 8, 2022

Description

Hi,

I had to reinstall my document server (Ubuntu with Docker containers), so I first exported everything with the exporter into a dedicated directory, which went fine.
After reinstalling the server and the Docker images via the provided script, I started the importer.
I now receive a lot of errors such as: "The manifest file refers to "Kostenübernahme.pdf" which does not appear to be in the source directory". But the file named in the error message is there.

As far as I can see, this only happens for files with umlauts in the filename. All others seem to be okay.

I should add that I may have used version 1.6 for the export, and I now have 1.7, which was apparently released a few days ago. However, I also installed 1.6 and tried the import there, with the same result.

Any idea what might be the issue?

Thanks,
Richard

Steps to reproduce

  1. Export via exporter
  2. Import via importer

Webserver logs

No response

Paperless-ngx version

1.7.0

Host OS

Ubuntu 22.04 LTS

Installation method

Docker

Browser

Chrome

Configuration changes

No response

Other

No response

@HorstyS HorstyS added bug Bug report or a Bug-fix unconfirmed labels May 8, 2022
@stumpylog
Member

I was able to export and re-import a file named Kostenübernahme.pdf. Could you share the exact error and traceback, if any?

@HorstyS
Author

HorstyS commented May 9, 2022

Sure. What exactly would you need?

@stumpylog
Member

Just any output from when you run the command document_importer somefolder

@HorstyS
Author

HorstyS commented May 10, 2022

I use:

docker-compose exec webserver document_importer ../export

This results in:

CommandError: The manifest file refers to "2019-12-31 Antrag_auf_Kostenübernahme_einer_Individualbegleitung_für_den_Besuch_e.pdf" which does not appear to be in the source directory.

The file is in the export folder:
xxx:~/paperless-ngx$ find . -iname '*Antrag_auf*'
./export/2019-12-31 Antrag_auf_Kosten?bernahme_einer_Individualbegleitung_f?r_den_Besuch_e.pdf
./export/2019-12-31 Antrag_auf_Kosten?bernahme_einer_Individualbegleitung_f?r_den_Besuch_e.pdf-thumbnail.png
./export/2015-04-15 12_Antrag_auf_Kosten?bernahme_HPT (7).pdf
./export/2015-04-15 12_Antrag_auf_Kosten?bernahme_HPT (7).pdf-archive.pdf
./export/2019-12-31 Antrag_auf_Kosten?bernahme_einer_Individualbegleitung_f?r_den_Besuch_e.pdf-archive.pdf
./export/2015-04-15 12_Antrag_auf_Kosten?bernahme_HPT (7).pdf-thumbnail.png
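
The question marks in the find output above may just be the terminal substituting characters it cannot display under the active locale, so they do not reveal what the names actually contain. A minimal Python sketch for looking at the raw bytes instead (the helper name describe_names is made up for illustration; it is not part of paperless-ngx):

```python
import os


def describe_names(directory, needle):
    """Return (escaped_name, raw_bytes) for entries whose name contains needle.

    Listing with a bytes path makes os.listdir return the names exactly as
    stored on disk, without any decoding by the terminal or by Python.
    """
    results = []
    for raw in os.listdir(os.fsencode(directory)):
        if needle.encode("ascii") in raw:
            # ascii() escapes every non-ASCII character, so a precomposed
            # "ü" (\xfc), a decomposed "u" + \u0308, and an undecodable byte
            # (shown as a \udcXX surrogate) can be told apart at a glance.
            results.append((ascii(os.fsdecode(raw)), raw))
    return results
```

Calling describe_names("../export", "Antrag_auf") from inside the container would list the matching files together with their byte sequences, which can then be compared with the name quoted in the CommandError.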

@stumpylog
Member

stumpylog commented May 10, 2022

I'm still unable to reproduce this so far. Again using the exact same name, I'm able to import and export without issue.

One thing I notice is the question marks in your find output. That's not something I see in mine, either from inside the container or outside, using find and ls -ahl. What's the output of locale?

@HorstyS
Author

HorstyS commented May 11, 2022

LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=de_DE.UTF-8
LC_TIME=de_DE.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=de_DE.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=de_DE.UTF-8
LC_NAME=de_DE.UTF-8
LC_ADDRESS=de_DE.UTF-8
LC_TELEPHONE=de_DE.UTF-8
LC_MEASUREMENT=de_DE.UTF-8
LC_IDENTIFICATION=de_DE.UTF-8
LC_ALL=

@stumpylog
Member

That seems about correct. I think if you can work out how to get Bash to display the special characters instead of ?, that would probably resolve the issue. I'm afraid I don't know the solution to that.

@tymbu

tymbu commented May 12, 2022

Hi! I just wanted to report that I also have this issue. I exported the documents via the exporter with paperless-ng 1.5 and tried importing them with paperless-ngx 1.7 installed. I receive the same CommandError.
The umlauts are displayed correctly in the error message itself

"CommandError: The manifest file refers to "xxxx-xx-xx Standesamt Geburtsurkunde für Elterngeld.pdf" which does not appear to be in the source directory."

as well as in the folder

"xxxx-xx-xx Standesamt Geburtsurkunde für Elterngeld.pdf"

@stumpylog
Member

You're not going to be able to import from 1.5 anyway. An in-place upgrade should work though.

Unfortunately, as I can't reproduce this at all, it's going to be hard to fix.

  1. Start from a clean slate
  2. Start the containers
  3. Upload a document
  4. set the title to something like Antrag_auf_Kostenübernahme_einer_Individualbegleitung_für_den_Besuch_e.pdf
  5. docker exec -it paperless-ngx-dev-webserver /bin/bash
  6. document_exporter ../export
  7. delete document from web ui
  8. document_importer ../export
  9. Document imports and is visible in the web ui

@tymbu

tymbu commented May 12, 2022

Ah OK, I did the steps as you pointed out and it worked without errors. The deleted file with Umlaut was imported again without errors.

How would I then transfer from 1.5? 1.5 is running in Docker on a Mac, and I want to use 1.7.1 on a Raspberry Pi. I think I may need to move the databases, but I did not find anything in the documentation.

Would it work to update first to 1.7.1 on Mac, then use the exporter. And importing on RaspberryPi would then be without issues?

Thanks!

@stumpylog
Member

Would it work to update first to 1.7.1 on Mac, then use the exporter.

Yes, that should work just fine.

How were you running the exporter before? There must be some difference

@tymbu

tymbu commented May 12, 2022

Hi stumpylog,
I was running the exporter before on Mac with the normal terminal command as shown in the documentation:

docker-compose exec -T webserver document_exporter ../export

As you recommended, I upgraded to 1.7.1 on Mac which worked fine. Then I exported the documents again using the same terminal command. I then transferred those files to Raspberry Pi, also running 1.7.1. When I started the importer using

docker-compose exec -T webserver document_importer ../export

again, I get the CommandError:

CommandError: The manifest file refers to "xxxx-xx-xx Übertrag Wertpapiere.pdf" which does not appear to be in the source directory.

Could it be a problem because the files were generated on Mac but imported on Raspbian?
Do you have any ideas how I could restore my database on the Raspberry?

Thank you!

@stumpylog
Member

Hm, that could be, I've never tried between systems. The easy test would be trying to import again on the same system (such as a new setup to test with).
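
One possible explanation for a macOS-to-Linux mismatch, not confirmed anywhere in this thread, is Unicode normalization: the same visible name can be stored precomposed (NFC, "ü" as one code point) or decomposed (NFD, "u" plus a combining diaeresis), and a plain string comparison treats the two as different. The sketch below only illustrates that effect with Python's standard unicodedata module; which form the manifest and the exported files actually use is an assumption, and names_match is a hypothetical helper.

```python
import unicodedata

# Assumption for illustration: the manifest holds the precomposed form and
# the directory listing returns the decomposed form.
manifest_name = unicodedata.normalize("NFC", "Kostenübernahme.pdf")
on_disk_name = unicodedata.normalize("NFD", "Kostenübernahme.pdf")


def names_match(a, b):
    """Compare two filenames after bringing both to the same normal form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Both names render identically, yet they are different strings.
print(manifest_name == on_disk_name)
print(len(manifest_name), len(on_disk_name))
print(names_match(manifest_name, on_disk_name))
```

A lookup that compares raw strings would report the file as missing in this situation even though a file with the same visible name sits in the folder.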

@HorstyS
Author

HorstyS commented May 18, 2022

I will set up my Ubuntu server maybe this week and also try some things.
Maybe the issue is related to an encoding problem while transferring to a new system. I used FileZilla and UTF-8 for the file transfer.

@HorstyS
Author

HorstyS commented May 19, 2022

I just exported and imported the documents to see what happens. I did not move or touch the files after the export:

[Screenshot: ExpImp]

I assume this is not related to the umlaut issue, as it complains about something different?

@stumpylog
Member

You'd need to delete the exported documents from paperless or import to a fresh instance. The importer won't overwrite an existing file.

The good news is, this is after the check which originally failed, so it appears to be getting further along.

@editwentyone

editwentyone commented Jul 8, 2022

Hi stumpylog, I was running the exporter before on Mac with the normal terminal command as shown in the documentation:

docker-compose exec -T webserver document_exporter ../export

As you recommended, I upgraded to 1.7.1 on Mac which worked fine. Then I exported the documents again using the same terminal command. I then transferred those files to Raspberry Pi, also running 1.7.1. When I started the importer using

docker-compose exec -T webserver document_importer ../export

again, I get the CommandError:

CommandError: The manifest file refers to "xxxx-xx-xx Übertrag Wertpapiere.pdf" which does not appear to be in the source directory.

Could it be a problem because the files were generated on Mac but imported on Raspbian? Do you have any ideas how I could restore my database on the Raspberry?

Thank you!

I have the same situation with the same kind of hardware: I created the documents on an M1 MacBook and exported them there. Now I want to import them on an Unraid / Linux system, and I also think it is not working because of the umlaut problem.

@editwentyone

editwentyone commented Jul 8, 2022

You're not going to be able to import from 1.5 anyway. An in place upgrade should work though.

Unfortunately, as I can't reproduce this at all, it's going to be hard to fix.

  1. Start from a clean slate
  2. Start the containers
  3. Upload a document
  4. set the title to something like Antrag_auf_Kostenübernahme_einer_Individualbegleitung_für_den_Besuch_e.pdf
  5. docker exec -it paperless-ngx-dev-webserver /bin/bash
  6. document_exporter ../export
  7. delete document from web ui
  8. document_importer ../export
  9. Document imports and is visible in the web ui

I did that as a test with 51 documents on my M1 MacBook, where I created the files. The importer imports flawlessly after exporting and deleting via the web UI.

So there is a bug somewhere in the handling of umlauts between macOS and Linux.
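
If the cause is a normalization mismatch between the names in the manifest and the names on disk (an assumption; the thread never confirms it), one possible workaround on the Linux side is to rename the exported files to a single normalization form before importing. A rough sketch, assuming the manifest uses NFC and that you work on a copy of the export folder; plan_renames and apply_renames are hypothetical helpers, not paperless-ngx commands:

```python
import unicodedata
from pathlib import Path


def plan_renames(directory, form="NFC"):
    """List (old_path, new_path) pairs for names not already in `form`."""
    renames = []
    for path in sorted(Path(directory).iterdir()):
        normalized = unicodedata.normalize(form, path.name)
        if normalized != path.name:
            renames.append((path, path.with_name(normalized)))
    return renames


def apply_renames(renames):
    """Perform the planned renames, refusing to overwrite existing files."""
    for old, new in renames:
        if new.exists():
            raise FileExistsError(f"refusing to overwrite {new}")
        old.rename(new)
```

Printing the result of plan_renames first shows what would change without touching anything; an empty list would suggest that normalization is not the problem for that folder.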

@shamoon shamoon added the stale label Aug 2, 2022
@stale stale bot closed this as completed Aug 9, 2022
@URBANsUNITED

Hi, I have the same (?) problem:

15:10:47 [Q] INFO Process-1:11 stopped doing work
15:10:47 [Q] ERROR 'utf-8' codec can't encode character '\udce4' in position 14: surrogates not allowed
15:10:47 [Q] ERROR Failed [Rechnung Sanit\udce4tshaus am Markt Anika 2022-07-25.pdf] - Rechnung Sanit\udce4tshaus am Markt Anika 2022-07-25.pdf: The following error occurred while consuming Rechnung Sanit\udce4tshaus am Markt Anika 2022-07-25.pdf: 'utf-8' codec can't encode character '\udce4' in position 14: surrogates not allowed : Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/asgiref/sync.py", line 280, in main_wrap
    raise exc_info[1]
  File "/usr/src/paperless/src/documents/consumer.py", line 361, in try_consume_file
    document = self._store(text=text, date=date, mime_type=mime_type)
  File "/usr/src/paperless/src/documents/consumer.py", line 471, in _store
    document = Document.objects.create(
  File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 514, in create
    obj.save(force_insert=True, using=self.db)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 806, in save
    self.save_base(
  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 857, in save_base
    updated = self._save_table(
  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 1000, in _save_table
    results = self._do_insert(
  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 1041, in _do_insert
    return manager._insert(
  File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 1434, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1621, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 80, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce4' in position 14: surrogates not allowed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/paperless/src/src/django-q/django_q/cluster.py", line 454, in worker
    res = f(*task["args"], **task["kwargs"])
  File "/usr/src/paperless/src/documents/tasks.py", line 154, in consume_file
    document = Consumer().try_consume_file(
  File "/usr/src/paperless/src/documents/consumer.py", line 423, in try_consume_file
    self._fail(
  File "/usr/src/paperless/src/documents/consumer.py", line 90, in _fail
    raise ConsumerError(f"{self.filename}: {log_message or message}") from exception
documents.consumer.ConsumerError: Rechnung Sanit\udce4tshaus am Markt Anika 2022-07-25.pdf: The following error occurred while consuming Rechnung Sanit\udce4tshaus am Markt Anika 2022-07-25.pdf: 'utf-8' codec can't encode character '\udce4' in position 14: surrogates not allowed
15:10:49 [Q] INFO recycled worker Process-1:11

Files with umlauts can't be imported.

My locale:
LANG=de_DE.UTF-8
LANGUAGE=
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=
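
The '\udce4' in this traceback is what Python produces when it decodes a filename containing the lone byte 0xE4 with the surrogateescape error handler, which is how it represents filename bytes that are not valid UTF-8. 0xE4 is "ä" in Latin-1, so the file was possibly named in a legacy encoding rather than UTF-8; that is a guess from the log, not something the thread confirms. A small sketch reproducing the error and recovering a readable name (repair_name and the Latin-1 fallback are assumptions for illustration):

```python
def repair_name(name, assumed_encoding="latin-1"):
    """Turn a surrogate-escaped filename back into readable text."""
    # Recover the original bytes exactly as they were on disk.
    raw = name.encode("utf-8", errors="surrogateescape")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the bytes are in a legacy single-byte encoding.
        return raw.decode(assumed_encoding)


# Simulate a Latin-1 filename as Python would see it on a UTF-8 system.
broken = b"Rechnung Sanit\xe4tshaus".decode("utf-8", errors="surrogateescape")

try:
    broken.encode("utf-8")
except UnicodeEncodeError as exc:
    # Same "surrogates not allowed" failure as in the log above.
    print(exc)

print(repair_name(broken))
```

Renaming the affected file to a valid UTF-8 name before it reaches the consumption folder would avoid the database insert failing on the surrogate character.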

@cbl789

cbl789 commented Feb 8, 2023

@HorstyS did you get it working now? I have the same problem, migrating from an M1 Mac (Ventura 13.2) to Ubuntu (22.04.1 LTS).

@github-actions
Contributor

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 15, 2023
8 participants