Media analysis fails on ZIP file with exotic charset #41

phantasia15 · 2019-12-30T11:25:56Z

I have several manga with Japanese characters in the title.
Currently Komga shows those manga in the list view, but without a cover image. On the detail page, Tachiyomi shows "no chapters found" and the web view shows a blank page.

If I remove the non-ASCII characters from the title (by renaming the folder and the .zip file in it), everything works again.

gotson · 2019-12-30T12:37:58Z

I'll test on my end. Non-ASCII characters shouldn't be an issue in themselves, as Latin accented characters work well.

gotson · 2019-12-30T13:21:07Z

I can't reproduce the issue. It works fine on my MacBook with katakana characters in the directory name, the file name, or both.

In order for me to dig deeper, could you please provide:

* the OS you are running Komga on
* the type of filesystem your manga are located on (hard disk, or network share like NFS, SMB…)
* how you are running Komga: from the Docker image or from the jar file?
* which version of Komga you are running exactly (you can get it from the endpoint /actuator/info)
* the full Komga logfile of the first scan. If the folders/files are already added, either remove the library and add it again, or remove the folders, wait for a rescan (or restart), then add the folders again and scan (posted on a Gist/Pastebin/whatever please)

Thanks

phantasia15 · 2019-12-30T13:38:48Z

> the OS you are running Komga on
> how are you running Komga ? From the docker Image or from the jar file ?

I'm running Komga in a Docker container on Ubuntu 18.04.
Here is my docker-compose file:

  komga:
    image: gotson/komga
    volumes:
      - ./data/komga/config:/config
      - ./data/sync/folders/manga:/books
    restart: unless-stopped

> the type of filesystem your mangas are located on (hard disk, or network share like nfs, smb…)

The manga are bind-mounted from a folder on the hard disk.

> which version of Komga you are running exactly (you can get it from the endpoint /actuator/info)

Version:
{"git":{"branch":"v0.9.1","commit":{"id":"659cea4","time":"2019-12-18T09:09:01Z"}},"build":{"artifact":"komga","name":"komga","time":"2019-12-18T09:30:25.410Z","version":"0.9.1","group":"org.gotson"}}

> the full Komga logfile of the first scan. So if the folder/files are already added, either remove the library and add it again, or remove the folders, wait for a rescan (or restart), then add the folders again, and scan (posted on a Gist/Pastebin/whatever please)

Here is the log with the exception thrown when parsing the manga with a non-ASCII title:
https://pastebin.com/vD4WwUEq

gotson · 2019-12-30T14:01:00Z

Thanks a lot for the information. That's an error while accessing the content of the zip file. If you are able to provide me with this particular file, I will investigate more with the debugger and try to find where it's coming from.

I have seen a few errors on archives for various reasons that are usually fixed by fixing the archive (extract files, archive again with a proper archiver). But since you mention it's working when you remove the characters it seems to be coming from something else.

phantasia15 · 2019-12-30T14:07:53Z

Here are the files.
One contains non-ASCII characters in its name and the other has those characters removed. The content of the two archives is identical.
manga.zip

gotson · 2019-12-30T14:29:54Z

Thanks.

I did a few tests, and I would say it's not coming from the file name alone, but from the combination of the name and this particular file. When I use the exact same name as your file, with the Japanese characters, on another of my good files, it works.

I tried repackaging your file, just extracting it and adding the content to a new zip, and the resulting file (with the same name as the original) parses properly.

To be honest, I have had a few issues with the native Java zip library, but on less than 1% of the files I tested. Those files would open fine with other archiving utilities (like The Unarchiver). So far I have dismissed the issue, as the remedy is usually as simple as extracting and archiving again.

Could you try to extract and re-archive on your end, and see if you still have the problem?

Also, do you have the issue with other files, or just this one?

If the problem turns out to be more widespread and the workaround does not work, I would need to start looking at alternative zip libraries for Java to better handle these archives.

phantasia15 · 2019-12-30T16:09:20Z

I think I have figured out the reason.
The archive was created on a Windows machine with a Japanese locale, which uses Shift-JIS encoding for folder and file names.
The archiver (WinRAR/7-Zip/zip) for some reason uses the OS encoding (Shift-JIS) instead of converting the folder/file names in the archive to UTF-8.
So this is not really a problem with Komga, but rather a problem with Windows encoding and the zip file.
For the time being I will work around it by archiving the image files (001.jpg, 002.jpg, ...) directly instead of including the parent folder with the Japanese title.

Edit: I noticed that Linux's unzip handles the zip file with Shift-JIS encoding correctly. So maybe there are alternative Java libraries that can handle this case correctly?
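
For readers who want to see the failure mode described above in isolation, here is a minimal sketch using only java.util.zip. It is not Komga's code: the class name `ZipCharsetDemo`, its helper methods, and the sample entry name are made up for illustration. It builds an in-memory archive whose entry name is encoded as Shift_JIS, then reads it back once as UTF-8 (the java.util.zip default) and once as Shift_JIS. It assumes a JDK that ships the Shift_JIS charset, which standard JDK builds normally do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of Komga.
class ZipCharsetDemo {

    // Builds an in-memory ZIP with one entry whose name is encoded with nameCharset.
    static byte[] buildZip(String entryName, Charset nameCharset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, nameCharset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(new byte[] {1, 2, 3}); // placeholder content
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Returns the first entry name decoded with nameCharset,
    // or null if the archive cannot be read with that charset.
    static String firstEntryName(byte[] zipBytes, Charset nameCharset) {
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
            ZipEntry entry = zip.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IOException | IllegalArgumentException e) {
            // Depending on the JDK version, an undecodable name surfaces as
            // either of these exception types.
            return null;
        }
    }

    public static void main(String[] args) {
        Charset shiftJis = Charset.forName("Shift_JIS");
        // Two kanji (written as Unicode escapes) followed by an ASCII file name.
        String name = "\u6F2B\u753B/001.jpg";
        byte[] zip = buildZip(name, shiftJis);

        System.out.println("Read as UTF-8:     " + firstEntryName(zip, StandardCharsets.UTF_8));
        System.out.println("Read as Shift_JIS: " + firstEntryName(zip, shiftJis));
    }
}
```

With the explicit Shift_JIS charset the original name comes back; with the default UTF-8 the name cannot be recovered, which matches the symptom in this thread where the archive content was unreadable until the name bytes were handled differently.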

phantasia15 · 2019-12-30T17:06:15Z

Hi, I've just tested reading the archive with Apache Commons Compress and it worked correctly.

So perhaps you might consider using it instead of java.util.zip? The interface is pretty similar to the native Java one, so it should be easy to port to this library.

gotson · 2019-12-30T23:55:07Z

I've done a bit of reading, and indeed the charset of a zip file is a bit confusing, mostly because you have to guess it: it's not stored in the archive.

As I mentioned in my previous post, given that the error rate was small I did not look for any other solution (and it was mostly impacting me!).

I'll keep this issue open and have a look at other zip libraries (including the one you mentioned, thanks!) to see if I can replace it.
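
To illustrate what "guessing" the charset can look like in practice, here is a rough sketch that stays within java.util.zip: it tries a list of candidate charsets in order and keeps the first one that decodes every entry name without error. This is not what Komga ended up doing (the fix in this thread was a library swap), and the class `ZipCharsetGuesser` and its methods are hypothetical names. The heuristic is weak by nature: a wrong charset can still decode the bytes without error and yield garbled names, so the order of candidates matters.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of Komga.
class ZipCharsetGuesser {

    // Lists all entry names decoded with the given charset,
    // or returns null if any name cannot be decoded.
    static List<String> listNames(byte[] zipBytes, Charset charset) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), charset)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
            return names;
        } catch (IOException | IllegalArgumentException ex) {
            return null; // this charset does not fit the stored name bytes
        }
    }

    // Returns the first candidate that decodes every entry name, or null if none does.
    static Charset guess(byte[] zipBytes, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            if (listNames(zipBytes, candidate) != null) {
                return candidate;
            }
        }
        return null;
    }

    // Builds a one-entry sample archive so the sketch can run on its own.
    static byte[] sampleZip(String entryName, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, charset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(new byte[] {1, 2, 3}); // placeholder content
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        Charset shiftJis = Charset.forName("Shift_JIS");
        byte[] zip = sampleZip("\u6F2B\u753B/001.jpg", shiftJis);
        Charset guessed = guess(zip, Arrays.asList(StandardCharsets.UTF_8, shiftJis));
        System.out.println("Guessed charset: " + guessed);
    }
}
```

Trying UTF-8 first is a deliberate choice in this sketch: UTF-8 decoding is strict, so legacy-encoded names are usually rejected rather than silently misread, whereas permissive single-byte charsets would accept almost anything.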

gotson · 2019-12-31T07:10:02Z

I just tried a drop-in replacement of java.util.zip.ZipFile by org.apache.commons.compress.archivers.zip.ZipFile and it does the trick, at least for your file.

I will release a beta version and test it on my complete library. If it works, I'll release it to prod.

replacement of java.util.zip.ZipFile by org.apache.commons.compress.archivers.zip.ZipFile

## [0.10.1](v0.10.0...v0.10.1) (2020-01-01)

### Bug Fixes

* **webui:** remove CDN usage for icons and fonts ([c88a27c](c88a27c)), closes [#45](#45)
* **webui:** show all books when browsing series ([85ca99d](85ca99d))
* **zip extractor:** better handling of exotic charsets ([0254d7d](0254d7d)), closes [#41](#41)

gotson · 2020-01-01T11:08:20Z

🎉 This issue has been resolved in version 0.10.1 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

phantasia15 · 2020-01-02T02:03:00Z

Thanks a lot!
I've tried the new version and all of my manga are parsed correctly now.

gotson added the bug (Something isn't working) label Dec 30, 2019

gotson added the cannot reproduce (The situation cannot be reproduced by the developers) label Dec 30, 2019

gotson removed the cannot reproduce (The situation cannot be reproduced by the developers) label Dec 30, 2019

phantasia15 closed this as completed Dec 30, 2019

phantasia15 reopened this Dec 30, 2019

gotson changed the title ~~Error on manga with non-ascii characters in title~~ Media analysis fails on ZIP file with exotic charset Dec 30, 2019

gotson referenced this issue Dec 31, 2019

fix(zip extractor): better handling of exotic charsets

97aba7a

replacement of java.util.zip.ZipFile by org.apache.commons.compress.archivers.zip.ZipFile

gotson closed this as completed in 0254d7d Jan 1, 2020

gotson added the released label Jan 1, 2020

github-actions bot locked as resolved and limited conversation to collaborators Sep 7, 2022
