Displaying Russian letters (encoding problems) #152

Closed
KovalevArtem opened this issue Sep 1, 2021 · 38 comments
@KovalevArtem

exif:
image
Russian letters look like this: ����

xmp:
image
Russian letters look like this: ????

@KovalevArtem
Author

image
If you select Russian in the application settings, then the names and original values are displayed correctly.
image

@hvdwolf hvdwolf added the bug Something isn't working label Sep 1, 2021
@hvdwolf
Owner

hvdwolf commented Sep 1, 2021

This looks like some UTF-8 encoding issue. I will investigate it.

@hvdwolf
Owner

hvdwolf commented Sep 2, 2021

I'm a bit puzzled here. This was an issue before the first release, and I fixed it at that time. Everything works in UTF-8.
Can you please share some of your images? I assume they have Russian text strings in them as well.
I want to check whether the strings in the images are stored as UTF-8, as CP 866 (Cyrillic), or as something else.
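
For readers wondering where the two kinds of garbage come from, here is a minimal Java sketch of two common mechanisms. It is an illustration only, not code from jExifToolGUI: the class and method names are made up, and it assumes the JDK ships the windows-1251 charset (full JDK builds do). Reading legacy Cyrillic bytes as UTF-8 produces replacement characters like those in the exif screenshot, and forcing Cyrillic text into a charset without Cyrillic letters produces question marks like those in the xmp screenshot.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // "Тест" written with Unicode escapes, so the demo does not depend on
    // the encoding of this source file.
    static final String TEST = "\u0422\u0435\u0441\u0442";

    // Bytes written in a legacy charset, then wrongly decoded as UTF-8.
    static String misreadAsUtf8(String text, Charset writtenAs) {
        byte[] raw = text.getBytes(writtenAs);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Text pushed through a charset that cannot represent it:
    // each unmappable character is replaced by '?'.
    static String forcedInto(String text, Charset target) {
        byte[] raw = text.getBytes(target);
        return new String(raw, target);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        System.out.println(misreadAsUtf8(TEST, cp1251));
        System.out.println(forcedInto(TEST, StandardCharsets.ISO_8859_1));
    }
}
```

Whether the files in this issue were damaged in exactly this way is an assumption; the sketch only shows that both symptoms are consistent with a charset mismatch somewhere between writing and reading.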

@hvdwolf
Owner

hvdwolf commented Sep 2, 2021

When I do an exiftool 131907609-b612f89f-6b8c-4cff-80bb-65a86ac10d63.jpg
I see among others:

Software                        : paint.net 4.2.16
Artist                          : ���� 1
Host Computer                   : iPhone X
Copyright                       : ���� 2

and

Creator                         : ???? 1
Description                     : Test Тест 6
Rights                          : ???? 3
Title                           : Test Тест 6

and

Credit                          : ???? 2
Label                           : ???? 5

I think paint.net is not writing utf-8 encoded data to your images.
If you take a "clean" image and write the same tags with JTG, then what do you see?

@mrtngrsbch
Contributor

https://www.getpaint.net/ ???
sure the problem comes from that software written in .NET
please, use Open Source editors

@KovalevArtem
Author

KovalevArtem commented Sep 3, 2021

Demo.mp4
Moscow_Original.jpg

Moscow_Original

Moscow_Edited.jpg (Modified exif data using jExifToolGUI)

Moscow_Edited

@hvdwolf
Owner

hvdwolf commented Sep 3, 2021

So it is JTG. I will look further.

@hvdwolf
Owner

hvdwolf commented Sep 3, 2021

I uploaded a windows 20210903 build. Please try.

https://mega.nz/folder/UFlRhYCZ#LITpkOKT79CNWtmdwCG0bw

@KovalevArtem
Author

Everything is still the same:
image

@hvdwolf
Owner

hvdwolf commented Sep 3, 2021

Please open a command prompt (cmd/dosbox), run the command chcp, and let me know what it returns.

I am wondering whether it is:
855 | Cyrillic (Russian)
866 | Russian
65001 | UTF-8
or maybe something completely different.
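
On the Java side, a rough equivalent of this check might look like the sketch below. The class name is hypothetical. `sun.jnu.encoding` is an internal JDK property, and `native.encoding` only exists on Java 17 and later, so either may print as null. Note also that on Windows these values reflect the ANSI code page (for example windows-1251), which is not the same thing as the OEM console code page that chcp reports (for example 866).

```java
import java.nio.charset.Charset;

class CodePageProbe {
    // Collect what the JVM believes about the platform encoding.
    static String describe() {
        return "default=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding")
                + ", sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding")
                + ", native.encoding=" + System.getProperty("native.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

On JDK 18 and later the default charset is UTF-8 regardless of the OS setting, so only the native encoding property still reveals the system code page there.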

@KovalevArtem
Author

866

@hvdwolf hvdwolf added the Windows Windows related label Sep 3, 2021
@hvdwolf
Owner

hvdwolf commented Sep 3, 2021

<frustration start>Stupid Windows. Only system in the world that does not use UTF-8.<frustration end>

This requires code changes and possibly a setting. Somewhere on startup I should check, on the Windows platform, which code page is being used, and use that one for reading and writing. ExifTool supports this, as Windows requires it.

For a user who only looks at his/her own data it makes no difference, but even uploading to photo sites (Flickr, Piwigo, Google Photos, etc.) might already break, as these all run on Unix platforms.
Of course this makes worldwide exchangeability very limited. I know that Greg and Martin want to use ISAD(G) and VRA-core worldwide, but with Windows users all using their own code page, this interoperability will be seriously hindered.

As Microsoft understands that this limits international use, it has allowed applications to force the UTF-8 code page since Windows 10 version 1903.
So the application should automatically use the system default unless the user specifies another code page (and in that case UTF-8 should be the preferred default, to improve worldwide interoperability).

This will take some time to implement I'm afraid.
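
The selection rule described above could be sketched in Java roughly as follows. Everything here is hypothetical: the class, the method and the idea that the preference arrives as a plain string are assumptions for illustration, not jExifToolGUI code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetChoice {
    // userSetting is a hypothetical preference value; null or blank means
    // "not set", in which case the system default is used.
    static Charset resolve(String userSetting) {
        if (userSetting == null || userSetting.trim().isEmpty()) {
            return Charset.defaultCharset();
        }
        try {
            return Charset.forName(userSetting.trim());
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported name: prefer UTF-8 for exchangeability.
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));
        System.out.println(resolve("UTF-8"));
    }
}
```

Falling back to UTF-8 on a bad setting is a design choice in this sketch; falling back to the system default would be equally defensible.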

@KovalevArtem
Author

If anything, then I have this version:
Windows 10 Pro 21H1 (19043.1165)

Thank you for your work, both in solving this problem and in developing this program in general!

@mrtngrsbch
Contributor

I really didn't expect such an answer.
WTF, windows doesn't use UTF-8 ?

@hvdwolf
Owner

hvdwolf commented Sep 4, 2021

For some more info on how exiftool deals with this, please read answers 10 and 18 of the FAQ: https://exiftool.org/faq.html#Q10 and https://exiftool.org/faq.html#Q18

In this case I also have to deal with both Java, as the UI app, and exiftool, the console app that I call from Java.

@hvdwolf
Owner

hvdwolf commented Sep 4, 2021

Here is another Windows build (20210904), now using the "-use mwg" feature.
My hopes are low, but this is a quick fix. Otherwise I have to do the heavy lifting to get it fixed.
https://mega.nz/folder/UFlRhYCZ#LITpkOKT79CNWtmdwCG0bw

Edit: This will not work on existing images (I think), as the strings are already saved with the Russian code page, but it should work on new images.

@hvdwolf
Owner

hvdwolf commented Sep 4, 2021

This is for newly written tags?

@KovalevArtem
Author

Unfortunately, the miracle did not happen 😔
image
I used the original image, which has no values containing Russian letters ...

@hvdwolf
Owner

hvdwolf commented Sep 4, 2021

A totally different approach: Reading and writing with default system codepage instead of trying to read and write in utf-8. On my linux machines that is utf-8 anyway. Same for macOS.

build jExifToolGUI-1.9.0.0_beta-20210904_2-win-x86_64_with-jre.zip
https://mega.nz/folder/UFlRhYCZ#LITpkOKT79CNWtmdwCG0bw

@KovalevArtem
Author

build jExifToolGUI-1.9.0.0_beta-20210904_2-win-x86_64_with-jre.zip

Hooray, it works!🎉
image

Everything works correctly EVEN with files edited by previous builds...

@hvdwolf
Owner

hvdwolf commented Sep 4, 2021

Good. Thanks for the feedback.
I will still try another way, where I "translate" back and forth: UTF-8 for storage, to improve worldwide interoperability, and the default code page for display.
But that will be in a few days.

@KovalevArtem
Author

build jExifToolGUI-1.9.0.0_beta-20210904_2-win-x86_64_with-jre.zip

Found another issue in this build.

2021-09-07.10-25-31-338.mp4

The image is attached (it has not been processed) ...

Image

IMG_20210907_102041

There is no such problem in the 20210904 build...

@hvdwolf
Owner

hvdwolf commented Sep 9, 2021

I am not surprised. And that also reminds me why I always read UTF-8. All ExifTool internal strings are UTF-8 (of course), so when I read tag strings and values as UTF-8, the translated strings from ExifTool itself were displayed correctly.
When using the operating system's default code page, those values might be read correctly, but not ExifTool's internal strings.
I really would not know how to solve this. Windows started to support unicode under Win 7 but is still not unicode by default.
I will need to look further.
I can of course create a setting to use UTF-8 (default) or the Windows default code page, but then Windows users with other code pages would run into these issues.
Until now I read and wrote in UTF-8.
I just found some code that translates every string to UTF-8 after reading and before writing. That might become an extra "if Windows then..." method.
I will try this weekend. After all, I can test it myself using German, French or Spanish words (genießen, Vergnügen, goûter, façade, ¿Abrir) or copy some from the Russian translation.

@mrtngrsbch
Contributor

In Spanish use words with all accents [á, é, í, ó, ú, ñ].

pirámide
teléfono
legítimo
brócoli
Cancún
España

@hvdwolf
Owner

hvdwolf commented Sep 10, 2021

And it's so nice to see that Microsoft uses UTF-8 by default in its browsers and IIS web servers. They must, of course; otherwise they would have a worldwide issue and nobody would use Microsoft anymore. Why not be consistent and use it in the entire OS? <I will stop my frustration here 😉>

@hvdwolf
Owner

hvdwolf commented Sep 11, 2021

I already explained earlier that exiftool is prepared for Windows, but just to show you, see below.
Using your Moscow_edited.jpg, I get the following on my UTF-8 Linux box with plain exiftool.

exiftool -exif:all ~/mnt/PUBLIC/kovalevArtem/Moscow_modified.jpg 
Software                        : Picasa
Artist                          : ����
Copyright                       : ����
Exif Version                    : 0220

When using the correct character set (and note that the character set is different from code page 866/851. Let's not make it too easy 😉):

exiftool -exif:all -charset exif=cp1251 ~/mnt/PUBLIC/kovalevArtem/Moscow_modified.jpg 
Software                        : Picasa
Artist                          : Тест
Copyright                       : Тест
Exif Version                    : 0220

and when using the correct character set and language:

exiftool -lang ru -exif:all -charset exif=cp1251 ~/mnt/PUBLIC/kovalevArtem/Moscow_modified.jpg
Имя и версия ПО                 : Picasa
Исполнитель                     : Тест
Владелец копирайта              : Тест
Exif версия                     : 0220

Of course that works differently in Java code, which hands the command over to a command-line tool and gets the result back, but I hope to get that working for both reading and writing, as I found some code that "should" do that (of course I am not the first to run into this multi-platform issue).
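
As a rough idea of what handing the command over from Java involves, here is a minimal sketch. The options (-lang, -exif:all, -charset exif=...) are the ones used in the commands above; the class and method names are hypothetical, and which charset to use for decoding the output is exactly the open question in this thread, so it is left as a parameter. It needs Java 9 or later for readAllBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class ExifToolCall {
    // Build the argument list as separate tokens, so no shell quoting is involved.
    static List<String> buildCommand(String exiftool, String exifCharset,
                                     String lang, String image) {
        List<String> cmd = new ArrayList<>();
        cmd.add(exiftool);
        if (lang != null) {
            cmd.add("-lang");
            cmd.add(lang);
        }
        cmd.add("-exif:all");
        cmd.add("-charset");
        cmd.add("exif=" + exifCharset);
        cmd.add(image);
        return cmd;
    }

    // Decode the raw process output with an explicitly chosen charset
    // instead of relying on the platform default.
    static String decode(byte[] raw, Charset outputCharset) {
        return new String(raw, outputCharset);
    }

    // Run the command and return its output (stdout and stderr merged).
    static String run(List<String> cmd, Charset outputCharset) throws IOException {
        Process process = new ProcessBuilder(cmd).redirectErrorStream(true).start();
        try (InputStream in = process.getInputStream()) {
            return decode(in.readAllBytes(), outputCharset);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildCommand("exiftool", "cp1251", "ru", "Moscow_modified.jpg"));
    }
}
```

The key point of the sketch is that bytes cross the process boundary, so the charset has to be chosen explicitly at that boundary on both sides.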

Your 3rd image is just a black "empty" jpg.

@hvdwolf
Owner

hvdwolf commented Sep 11, 2021

It is getting worse.
The camera has stored the values as UTF-8.
The program has stored the values in the default code page.

So in post 7 you state that the exif values work in the OS code page, but please also check the File tags/values in that "jExifToolGUI-1.9.0.0_beta-20210904_2-win-x86_64_with-jre.zip" build.
I see for example for the File tags (written by the camera):
File | Exif – Порядок байтов | Порядок от младшего к старшему (Intel, II)
File | Процесс кодирования | Базовое DCT, кодирование Хаффмана
Same for camera tags. They are all in UTF-8.
When using the system code page, as done in "jExifToolGUI-1.9.0.0_beta-20210904_2-win-x86_64_with-jre.zip", those File tag values display incorrectly.
Your added exif values are displayed as ���� in UTF-8 and as Тест in the Windows code page.
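
One heuristic for this kind of mixed data, when the raw bytes of a single value are available, is to try a strict UTF-8 decode first and fall back to a legacy charset only when the bytes are not valid UTF-8. The sketch below is an assumption-laden illustration with hypothetical names, not what JTG does, and it does not help when the whole output has already been decoded with one charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class LenientDecoder {
    // Try strict UTF-8; if the bytes are malformed, decode with the fallback.
    static String decode(byte[] raw, Charset fallback) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            return new String(raw, fallback);
        }
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        byte[] legacy = {(byte) 0xD2, (byte) 0xE5, (byte) 0xF1, (byte) 0xF2};
        System.out.println(decode(legacy, cp1251));
    }
}
```

The heuristic is not watertight: a legacy byte sequence that happens to be valid UTF-8 would be decoded as UTF-8, although for Cyrillic text in cp1251 that is unlikely in practice.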

The point is that the "western" code pages miss some of the more exotic characters, but they all cover the "standard" accented characters of the western Latin languages. Of course Cyrillic and Asian languages are completely different.
I guess that's why hardly anyone noticed so far.

Edit:
So I have unicode (utf-8) tag strings from exiftool.
I have unicode value strings from the camera,
and I have Windows code page strings from the "editor".

Even if I now start to "force write" everything in UTF-8 (get the string in the OS code page, convert it to a byte array, convert that to a UTF-8 string, write it to the file), you will still get issues with older edited files, or with files from programs that wrote in the OS code page.
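
The "force write" conversion described above can be sketched in Java as below. The names are hypothetical and the source charset (windows-1251 in the example) is an assumption. One thing worth keeping in mind: a Java String has no code page of its own, so the conversion only matters at the point where bytes are read or written.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcode {
    // Interpret raw bytes in the given source charset and re-encode as UTF-8.
    static byte[] toUtf8(byte[] raw, Charset source) {
        String text = new String(raw, source);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        byte[] legacy = {(byte) 0xD2, (byte) 0xE5, (byte) 0xF1, (byte) 0xF2};
        byte[] utf8 = toUtf8(legacy, cp1251);
        System.out.println(legacy.length + " legacy bytes -> " + utf8.length + " UTF-8 bytes");
    }
}
```

The conversion is only correct if the source charset is really the one the bytes were written in, which is precisely what is unknown for older files.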

@hvdwolf
Owner

hvdwolf commented Sep 12, 2021

I am afraid that I will never get this to work.
When calling exiftool, a new shell (cmd/dosbox on Windows) is automatically created, which always has the default code page. I tried using a cmd file which first sets the code page to UTF-8 (65001) and then calls exiftool with the commands. That still doesn't help, as the console is first opened in the standard code page, so the UTF-8 strings are already malformed before they are handed over to the cmd file. I could also ask the user to set cp65001 at machine level in the registry, but I don't think users will like this, as all other code-page-based console apps would no longer display correctly.
And then it turns out that even on Windows 10 there is still an issue with "outencoding" and "inencoding" with unicode, as it occasionally still falls back to 7-bit US-ASCII.
Setting jExifToolGUI to use the default code page makes reading/writing of tags work but restricts it to that code page, making worldwide exchangeability difficult unless everyone decides to use pure ANSI standards (again: stupid Windows). Most users will not care at all, as they only deal with their own images.
But with the default code page, the internal exiftool strings used to display tag names in your own language (Russian, Korean, Turkish, Chinese, etc.) don't display correctly, as those strings are in UTF-8.

I really don't know how to solve this with exiftool. 😞
In Java there is the ImageIO library. Next to that there is the TwelveMonkeys image library, which is pure Java and extends the number of image types and the functionality, and there is also the pure-Java Apache Commons Imaging library.
I already used TwelveMonkeys minimally before, as it also adds some functionality that was completely missing in the Java 8 JDK, which I am now abandoning.
I guess they all overcome the Windows code page issue, as you start your Windows Java program in unicode (that is possible and I am already doing that), and as everything stays in Java it remains unicode.
But it would mean a complete rewrite of the program, and all libraries have less functionality than exiftool 😒, although that currently doesn't make a difference for the exif/xmp/gps/etc. data I am writing.
It would also be a mixed bag, as renaming and geotagging are not supported and therefore must stay with exiftool. And perhaps the same goes for the "ExifTool Commands" tab, as that might be the place where users want to write the "exotic" tags.
And I would not have a clue how to embed the ISAD(G) and VRA-core functionality in those libraries. That is the pure strength of exiftool. Also, the current support for tag strings in your own language is not available (as far as I can see).
And they are all a lot less simple to implement, which is another strength of exiftool.
This is really demotivating. 😞

I think I might write a PM to Phil and ask him if I overlook something (hopefully)

@KovalevArtem
Author

File | Exif – Порядок байтов | Порядок от младшего к старшему (Intel, II)

image
File Exif – Порядок байтов Порядок от старшего к младшему (Motorola, MM)

File | Процесс кодирования | Базовое DCT, кодирование Хаффмана

image
File Процесс кодирования Базовое DCT, кодирование Хаффмана

@KovalevArtem
Author

Is it possible to do separate processing for tags and values?
Screenshot 2021-09-12 203029

@hvdwolf
Owner

hvdwolf commented Sep 13, 2021

Unfortunately that is not possible.
You simply give a command to exiftool, and it delivers a big chunk of textual data (in this case in "tab" format), which I split on the tab.
I did try to treat the first part of the string as UTF-8 data and the second part as non-UTF-8 (in the case of Windows). That simply doesn't work, as the data is delivered entirely either in the code page, giving malformed tag strings, or entirely as UTF-8, giving malformed data strings.
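
To make the limitation concrete, here is a minimal sketch of that kind of parsing, with hypothetical names and made-up sample rows. The whole byte stream is decoded with a single charset before any splitting happens, so by the time the columns exist as strings there is no per-column encoding left to choose.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class TabOutput {
    // Decode the complete output with ONE charset, then split into lines
    // and tab-separated columns.
    static List<String[]> parse(byte[] raw, Charset cs) {
        List<String[]> rows = new ArrayList<>();
        for (String line : new String(raw, cs).split("\\R")) {
            if (!line.isEmpty()) {
                rows.add(line.split("\t", -1));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        byte[] sample = "File\tFile Name\ta.jpg\nExif\tSoftware\tPicasa\n"
                .getBytes(StandardCharsets.UTF_8);
        for (String[] row : parse(sample, StandardCharsets.UTF_8)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Splitting the raw bytes on the tab byte (0x09) first and decoding each column separately would be possible in principle, but only if the bytes arriving from the console had not already been converted, which is the problem described above.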

@KovalevArtem
Author

KovalevArtem commented Sep 13, 2021

Over the last few days I thought about this a lot and searched for other applications that use exif information and are written in Java.
The RouteConverter application meets these criteria.

But it has the same problem:

Program interface in Russian

image

Program interface in English

image

One could try to find out how JOSM works with exif information, because it is also a project mostly written in Java...

@hvdwolf
Owner

hvdwolf commented Sep 13, 2021

I also checked FastFotoTagger, but that has the same issues (actually even more)

@hvdwolf
Owner

hvdwolf commented Sep 23, 2021

I think I finally made it work. Please check the jExifToolGUI-1.9.0.0-beta-20210923-win-x86_64_with-jre.zip on megaNZ.

See also https://exiftool.org/forum/index.php?topic=12864.0

This will work for future images, but still not for older images written with the original windows codepage.

@KovalevArtem
Author

jExifToolGUI-1.9.0.0-beta-20210923

2021-09-25.16-06-49-705.mp4

@hvdwolf
Owner

hvdwolf commented Sep 26, 2021

I can't reproduce this.
Please open JTG, go to Preferences, System (3rd tab) and set log level to trace.
Restart JTG and please try again.
Close JTG and please share the log with me (next to the exe is a "logs" folder)
And better set the log level back to info or so.

@KovalevArtem
Author

KovalevArtem commented Oct 5, 2021

In build

jExifToolGUI-1.9.0.0-beta-20210923-win-x86_64_with-jre.zip

the problem is really solved.

2021-10-05.23-55-29-202.mp4

(The problem that I wrote about a little above resolved itself ...)

@hvdwolf
Owner

hvdwolf commented Oct 6, 2021

Great!
I will now work towards a new release. I had more things on my ToDo-list, but currently (since June) my daily work takes so much (over)time, that I don't have the energy or motivation to implement more new stuff.
