libzim creates (again) invalid title indexes #688

Closed
kelson42 opened this issue Apr 20, 2022 · 14 comments · Fixed by #690

@kelson42
Contributor

We have a new ifixit.com scraper which uses python-libzim; see openzim/ifixit.

A first test ZIM file has been created with the Zimfarm and is available at https://mirror.download.kiwix.org/zim/.hidden/dev/ifixit_fr_all_2022-04.zim

But the ZIM file seems to have an invalid structure. Here is the zimcheck output:

[INFO] Checking zim file ifixit_fr_all_2022-04.zim
[INFO] Zimcheck version is 3.1.0
[INFO] Verifying ZIM-archive structure integrity...
Title index is not properly sorted.
Title index is not properly sorted.
  [ERROR] ZIM file's low level structure is invalid

Obviously this is a blocker!

@kelson42 kelson42 added the bug label Apr 20, 2022
@kelson42 kelson42 added this to the 7.3.0 milestone Apr 20, 2022
@kelson42
Contributor Author

@mgautierfr @veloman-yunkan Could one of you please take an urgent look at this? I am pretty concerned that we have a libzim in the field which creates broken ZIM files.

@veloman-yunkan
Collaborator

@kelson42 I will take care of it

@kelson42
Contributor Author

@veloman-yunkan Thanks

@veloman-yunkan
Collaborator

The wrong order is for the pair of articles with the following titles:

  • Main-Page
  • *Read Windows EOL warning* How to install the Xbox One Wireless Receiver 1713 on Windows 7 and Windows 8.1

It looks like the asterisk at the beginning of the second article's title was ignored when sorting the title index. Now investigating why that happened.
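
For readers following along, the ordering problem in this pair can be reproduced with a minimal sketch. It assumes the title index is expected to be sorted by a plain byte-wise comparison of UTF-8 titles; the comparison zimcheck actually performs may differ. The function name `first_unsorted_pair` is hypothetical and not part of libzim or zimcheck.

```python
def first_unsorted_pair(titles):
    """Return (position, left, right) for the first adjacent pair that is
    out of order, or None if the list is sorted.

    Assumption: titles are compared byte-wise on their UTF-8 encoding.
    """
    for position in range(len(titles) - 1):
        left, right = titles[position], titles[position + 1]
        if left.encode("utf-8") > right.encode("utf-8"):
            return position, left, right
    return None


# The two titles reported above, in the order they appear in the title index.
pair = first_unsorted_pair([
    "Main-Page",
    "*Read Windows EOL warning* How to install the Xbox One Wireless "
    "Receiver 1713 on Windows 7 and Windows 8.1",
])
print(pair)
```

Under this assumption the pair is flagged because `*` (0x2A) sorts before `M` (0x4D), so a title starting with an asterisk should come before `Main-Page`.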

@benoit74

@veloman-yunkan
Collaborator

The previous hypothesis was wrong. The root cause of the problem has something to do with the handling of redirects.

$ zim-testing-suite/scripts/inspectzim --title_index ifixit_fr_all_2022-04.zim |head -n 20
# TITLE INDEX
42398
24715
35345
58257
27071
42524
31127
57074
39156
29408
0          !!!
20096
0          !!!
26986
57766
23268
23564
23797
24503

$ for i in $(zim-testing-suite/scripts/inspectzim --title_index ifixit_fr_all_2022-04.zim |head -n 20|tail -n +2); do zimdump show --idx $i ifixit_fr_all_2022-04.zim|grep '<title>'; done 
    <title> Palm IIIxe Screen Replacement</title>
    <title>&#34;Jump-Starting&#34; a Dead MacBook Battery by Resetting the SMC</title>
    <title>&#34;Re-cap&#34; and Electrically Overhaul a Direct Drive Turntable</title>
    <title>&#34;Replace Filament&#34; Function (During Printing)</title>
    <title>(Fat) Xbox 360 Hard Drive Replacement</title>
    <title>(TC) PlayStation Teardown</title>
    <title>(Video) Nikon 24-70G F2.8 Autofocus Fault Repairing (Part One)</title>
    <title>(Windows only) Install a HP LaserJet Printer using the HP UPD</title>
    <title>*OBSOLETE* *Pre iPhone 7* *Pre 7th Gen Touch* How to access Recovery Mode</title>
    <title>*Out Dated* Programming Open-Storm Board</title>
Entry Main-Page is a redirect.
    <title>*Read Windows EOL warning* How to install the Xbox One Wireless Receiver 1713 on Windows 7 and Windows 8.1</title>
Entry Main-Page is a redirect.
    <title>*Read guide note* How to rebuild a laptop CMOS battery</title>
    <title>-</title>
    <title>01 | XY Frame Assembly</title>
    <title>02 | Z Frame Assembly</title>
    <title>03 | Y Axis Assembly</title>
    <title>04 | Z and X Axis Assembly</title>
 

@veloman-yunkan
Collaborator

Or else the zeros in the title index are a result of some kind of memory/storage corruption.

@veloman-yunkan
Collaborator

The value 0 occurs in the title index 1440 times.
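
A count like this can be obtained from the `inspectzim --title_index` output with a short script. The sketch below assumes the dump has one entry index per line, as in the excerpt earlier in this thread, and treats anything after the first field (such as the `!!!` markers) as annotation. The function name `count_title_index_values` is hypothetical.

```python
from collections import Counter


def count_title_index_values(dump_text):
    """Count how often each entry index occurs in a title index dump.

    Assumption: one entry index per line; header lines such as
    '# TITLE INDEX' and trailing annotations are ignored.
    """
    counts = Counter()
    for line in dump_text.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            counts[int(fields[0])] += 1
    return counts


# First lines of the dump quoted earlier in this thread.
sample = """# TITLE INDEX
42398
24715
35345
58257
27071
42524
31127
57074
39156
29408
0          !!!
20096
0          !!!
26986
57766
23268
23564
23797
24503
"""

counts = count_title_index_values(sample)
repeated = {value: n for value, n in counts.items() if n > 1}
print(repeated)
```

On the 19-value excerpt only the value 0 repeats; running the same function over the full dump is how a total such as 1440 would be obtained.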

@kelson42
Contributor Author

@veloman-yunkan Ouch... I was kind of hoping this was a bug in the checking part of the algorithm :( Good luck with the next steps.

@benoit74

@veloman-yunkan : Are you sure it is not 1442 times?
If so, I have the explanation: openzim/ifixit#49

@mgautierfr
Collaborator

If so, I have the explanation: openzim/ifixit#49

This is probably the issue.
We should mark the existing dirent as removed in https://github.com/openzim/libzim/blob/master/src/writer/creator.cpp#L496-L498 (the same way we do in resolveRedirectIndexes when we remove dirents: https://github.com/openzim/libzim/blob/master/src/writer/creator.cpp#L626).

This way we will exclude removed dirents from the title index: https://github.com/openzim/libzim/blob/master/src/writer/titleListingHandler.cpp#L83-L85
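
The idea behind this proposal can be illustrated with a small Python model. This is only a sketch of the "mark as removed, then skip when building the title index" pattern, not libzim's actual C++ code: the `Dirent` and `Creator` classes, their fields, and the rule that a newer dirent replaces an existing one with the same path are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Dirent:
    path: str
    title: str
    removed: bool = False  # set when a later dirent replaces this one


class Creator:
    """Hypothetical, simplified stand-in for the writer's bookkeeping."""

    def __init__(self):
        self.by_path = {}           # path -> current dirent
        self.title_candidates = []  # every dirent ever handed to the title listing

    def add_dirent(self, dirent):
        existing = self.by_path.get(dirent.path)
        if existing is not None:
            # The proposed fix: flag the replaced dirent instead of
            # leaving a stale object behind in the title listing.
            existing.removed = True
        self.by_path[dirent.path] = dirent
        self.title_candidates.append(dirent)

    def title_listing(self):
        # Skip removed dirents, then sort byte-wise by title (assumed ordering).
        live = [d for d in self.title_candidates if not d.removed]
        return sorted(live, key=lambda d: d.title.encode("utf-8"))


creator = Creator()
creator.add_dirent(Dirent("Main-Page", "Main-Page"))
creator.add_dirent(Dirent("guide", "*Read guide note*"))
creator.add_dirent(Dirent("Main-Page", "Main-Page"))  # replaces the first one
listing_titles = [d.title for d in creator.title_listing()]
print(listing_titles)
```

Without the `removed` flag, the replaced `Main-Page` dirent would still be among the title candidates, which is the kind of stale entry the proposal aims to exclude.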

@benoit74

benoit74 commented Apr 20, 2022 via email

@mgautierfr
Collaborator

@veloman-yunkan I've made a PR already (#690)
But thanks for your investigation; it paved the way to the fix.

@veloman-yunkan
Collaborator

@veloman-yunkan : Are you sure it is not 1442 times? If so, I have the explanation: openzim/ifixit#49

@benoit74 No, the count of zeros in the title index is 1440. But it's wonderful that you could link it to the count of errors in the ZIM creation log.
