Media analysis fails on ZIP file with exotic charset #41

Closed
phantasia15 opened this issue Dec 30, 2019 · 12 comments
Labels
bug Something isn't working released

Comments

@phantasia15

phantasia15 commented Dec 30, 2019

I have several manga with Japanese characters in the title.
Currently Komga shows those manga in the list view but without a cover image. When I open the detail page, it shows "no chapters found" in Tachiyomi and a blank page in the web view.

[Screenshot: firefox_LMlDvnR3xv]

If I remove non-ascii characters from the title (by renaming the folder & the .zip file in it), everything works again.

@gotson
Owner

gotson commented Dec 30, 2019

I'll test on my end. Non-ASCII characters shouldn't be an issue, as Latin accented characters work well.

@gotson gotson added the bug Something isn't working label Dec 30, 2019
@gotson
Owner

gotson commented Dec 30, 2019

I can't reproduce the issue. It works fine on my MacBook with katakana characters in the directory name, the file name, or both.

In order for me to dig deeper, could you please provide:

  • the OS you are running Komga on
  • the type of filesystem your manga are located on (hard disk, or network share like NFS, SMB…)
  • how you are running Komga: from the Docker image or from the jar file
  • which version of Komga you are running exactly (you can get it from the endpoint /actuator/info)
  • the full Komga logfile of the first scan. So if the folder/files are already added, either remove the library and add it again, or remove the folders, wait for a rescan (or restart), then add the folders again, and scan (posted on a Gist/Pastebin/whatever please)

Thanks

@gotson gotson added the cannot reproduce The situation cannot be reproduced by the developers label Dec 30, 2019
@phantasia15
Author

> the OS you are running Komga on
> how are you running Komga ? From the docker Image or from the jar file ?

I'm running Komga in a Docker container on Ubuntu 18.04.
Here is my docker-compose file:

  komga:
    image: gotson/komga
    volumes:
      - ./data/komga/config:/config
      - ./data/sync/folders/manga:/books
    restart: unless-stopped

> the type of filesystem your mangas are located on (hard disk, or network share like nfs, smb…)

The manga are bind-mounted from a folder on a hard disk.

> which version of Komga you are running exactly (you can get it from the endpoint /actuator/info)

Version:
{"git":{"branch":"v0.9.1","commit":{"id":"659cea4","time":"2019-12-18T09:09:01Z"}},"build":{"artifact":"komga","name":"komga","time":"2019-12-18T09:30:25.410Z","version":"0.9.1","group":"org.gotson"}}

> the full Komga logfile of the first scan. So if the folder/files are already added, either remove the library and add it again, or remove the folders, wait for a rescan (or restart), then add the folders again, and scan (posted on a Gist/Pastebin/whatever please)

Here is the log with the exception thrown when parsing the manga with a non-ASCII title:
https://pastebin.com/vD4WwUEq

@gotson
Owner

gotson commented Dec 30, 2019

Thanks a lot for the information. That's an error while accessing the content of the zip file. If you are able to provide me with this particular file, I will investigate more with the debugger and try to find where it's coming from.

I have seen a few errors on archives for various reasons that are usually fixed by fixing the archive (extract files, archive again with a proper archiver). But since you mention it's working when you remove the characters it seems to be coming from something else.

@phantasia15
Author

Here are the files.
One contains non-ASCII characters and the other has those characters removed. The contents of the two archives are identical.
manga.zip

@gotson
Owner

gotson commented Dec 30, 2019

Thanks.

I did a few tests, and I would say it's not coming from the file name alone, but from a combination of the name and the file. When I give the exact same Japanese name as your file to another of my good files, it works.

I tried repackaging your file by extracting it and adding the contents to a new zip, and the resulting file (with the same name as the original) parses properly.

To be honest, I have had a few issues with the native Java zip library, but on less than 1% of the files I tested. Those files would open fine with other archiving utilities (like The Unarchiver). So far I have dismissed the issue, as the remedy is usually as simple as extracting and archiving again.

Could you try extracting and re-archiving on your end, and see if you still have the problem?

Also, do you have the issue with other files, or just this one?

If the problem turns out to be more widespread and the workaround does not work, I would need to start looking at alternative zip libraries for Java that handle these archives better.

@gotson gotson removed the cannot reproduce The situation cannot be reproduced by the developers label Dec 30, 2019
@phantasia15
Author

phantasia15 commented Dec 30, 2019

I think I have figured out the reason.
The archive was created on a Windows machine with a Japanese locale, which uses Shift-JIS encoding for folder and file names.
The archiver (WinRAR/7-Zip/zip) for some reason uses the OS encoding (Shift-JIS) instead of converting the folder and file names in the archive to UTF-8.
So this is not really a problem with Komga but rather a problem with Windows encoding and the zip file.
For the time being I will work around it by archiving the image files (001.jpg, 002.jpg, ...) directly instead of including the parent folder with the Japanese title.

Edit: I noticed that Linux's unzip handles the zip file with Shift-JIS encoding correctly. So maybe there are some alternative Java libraries that can handle this case correctly?
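
Editor's note: to illustrate the diagnosis above, here is a minimal Java sketch showing what happens when name bytes written as Shift-JIS are read back as UTF-8. The class name `NameDecodingDemo` and the sample title are hypothetical and only serve the illustration; this is not code from Komga.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NameDecodingDemo {
    // Decode raw entry-name bytes with the given charset.
    // new String(...) silently replaces undecodable bytes with U+FFFD.
    static String decode(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    // Encode a name the way an archiver using the OS code page would.
    static byte[] encode(String name, Charset charset) {
        return name.getBytes(charset);
    }

    // True when the name survives a write in one charset and a read in another.
    static boolean roundTrips(String name, Charset writeCharset, Charset readCharset) {
        return decode(encode(name, writeCharset), readCharset).equals(name);
    }

    // Sample title ("manga" in kanji), written with Unicode escapes so the
    // source file does not depend on the compiler's -encoding setting.
    static final String SAMPLE_TITLE = "\u6F2B\u753B";

    // Shift_JIS ships with standard JDK builds; assumed available here.
    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    static final Charset UTF_8 = StandardCharsets.UTF_8;
}
```

With these helpers, `roundTrips(SAMPLE_TITLE, SHIFT_JIS, SHIFT_JIS)` holds, while `roundTrips(SAMPLE_TITLE, SHIFT_JIS, UTF_8)` does not, because the Shift-JIS bytes are not valid UTF-8.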

@phantasia15
Author

phantasia15 commented Dec 30, 2019

Hi, I've just tested reading the archive with Apache Commons Compress and it worked correctly.

So perhaps you might consider using it instead of java.util.zip? The interface is pretty similar to the native Java one, so it should be easy to port to this library.

@gotson
Owner

gotson commented Dec 30, 2019

I've done a bit of reading, and indeed the charset of a zip file is a bit confusing, mostly because you have to guess it: it's not stored in the archive.

As I mentioned in my previous post, given that the error rate was small, I did not look for any other solution (and it was mostly impacting me!).

I'll keep this issue open and have a look at other zip libraries (including the one you mentioned, thanks!) to see if I can replace it.
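
Editor's note: the "guess the charset" problem can be sketched with the JDK alone, since `ZipInputStream` and `ZipOutputStream` accept an explicit `Charset`. The sketch below tries a list of candidate charsets in order and keeps the first one that decodes every entry name. It is an illustration of the idea under stated assumptions, not the fix that was eventually applied in Komga; the class `ZipCharsetFallback` and its methods are hypothetical names. The guess is only a heuristic: a permissive charset can decode bytes "successfully" into the wrong characters, so strict candidates such as UTF-8 should come first.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipCharsetFallback {
    // List entry names, trying each candidate charset in order and returning
    // the first listing whose names all decode without error.
    static List<String> listNames(byte[] zipBytes, Charset... candidates) {
        for (Charset cs : candidates) {
            List<String> names = new ArrayList<>();
            try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes), cs)) {
                for (ZipEntry e = in.getNextEntry(); e != null; e = in.getNextEntry()) {
                    names.add(e.getName());
                }
                return names;
            } catch (IllegalArgumentException | IOException ex) {
                // Depending on the JDK version, a name that cannot be decoded
                // surfaces as one of these; move on to the next candidate.
                // Note: a genuinely corrupt archive also ends up here.
            }
        }
        throw new IllegalStateException("No candidate charset could decode the entry names");
    }

    // Build a small in-memory archive whose entry name is encoded with the
    // given charset, mimicking an archiver that uses the OS code page.
    static byte[] buildZip(String entryName, Charset charset) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buf, charset)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write(new byte[] {1, 2, 3}); // placeholder content
            out.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }
}
```

For example, an archive built with `buildZip(name, Charset.forName("Shift_JIS"))` can be listed with `listNames(zip, StandardCharsets.UTF_8, Charset.forName("Shift_JIS"))`: the UTF-8 attempt is rejected and the Shift-JIS attempt returns the original name.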

@gotson gotson changed the title from "Error on manga with non-ascii characters in title" to "Media analysis fails on ZIP file with exotic charset" Dec 30, 2019
@gotson
Owner

gotson commented Dec 31, 2019

I just tried a drop-in replacement of java.util.zip.ZipFile with org.apache.commons.compress.archivers.zip.ZipFile and it does the trick, at least for your file.

I will release a beta version and test it on my complete library. If it works, I'll release that to prod.

gotson referenced this issue Dec 31, 2019
replacement of java.util.zip.ZipFile by org.apache.commons.compress.archivers.zip.ZipFile
@gotson gotson closed this as completed in 0254d7d Jan 1, 2020
gotson pushed a commit that referenced this issue Jan 1, 2020
## [0.10.1](v0.10.0...v0.10.1) (2020-01-01)

### Bug Fixes

* **webui:** remove CDN usage for icons and fonts ([c88a27c](c88a27c)), closes [#45](#45)
* **webui:** show all books when browsing series ([85ca99d](85ca99d))
* **zip extractor:** better handling of exotic charsets ([0254d7d](0254d7d)), closes [#41](#41)
@gotson
Owner

gotson commented Jan 1, 2020

🎉 This issue has been resolved in version 0.10.1 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@gotson gotson added the released label Jan 1, 2020
@phantasia15
Author

Thanks a lot!
I've tried the new version and all of my mangas are parsed correctly now.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 7, 2022