Article not displayed accurately #193

Closed
Inbefortus opened this issue Sep 16, 2021 · 9 comments
Labels: bug/non-critical, fixed (Fixed, awaiting publication in new apps)

Comments

@Inbefortus

This applies to both desktop and mobile.

Kiwix JS PWA:

Screenshot_20210916-174437_Samsung Internet

Kiwix Android:

Screenshot_20210916-175408_Kiwix
Screenshot_20210916-175422_Kiwix

Original Android Wikipedia app:

Screenshot_20210916-174546_Wikipedia

@Inbefortus
Author

Other example:

Kiwix JS PWA:

Screenshot_20210917-114032_Samsung Internet

Kiwix Android:

Screenshot_20210917-114009_Kiwix

Original:

Screenshot_20210917-114110_Samsung Internet
20210917_114201

Jaifroid self-assigned this Sep 17, 2021
@Jaifroid
Member

Thanks for the report. I'll investigate.

@Jaifroid
Member

Jaifroid commented Sep 17, 2021

I think the second one might be caused by the transformation to desktop mode, which is a bit hacky. It searches for infoboxes to move them, but it's possible it selects too much, and includes some text it shouldn't.

@Inbefortus
Author

@Jaifroid If it helps, here is another example:

Kiwix JS PWA:

20210917_155031

Original:

20210917_155138

@Jaifroid
Member

@Inbefortus This issue is now fixed in version 1.7.4-rc2. I've pushed this to pwa.kiwix.org, so if you have that version of the PWA, you should be able to verify the fix (it is also in the dev PWA: kiwix.github.io/kiwix-js-windows/). Please ensure you are using rc2 (the app will auto-update after a restart).

The cause was an over-greedy regular expression. It moves misplaced hatnotes, which have no class or id that would allow a cleaner way of locating them, so I have to use heuristics to guess common patterns. I've now limited how far the regex will scan within the <dl> block. The hatnotes are misplaced in the ZIM itself because of a long-standing mwOffliner bug: openzim/mwoffliner#182. Kiwix JS PWA tries to compensate for this error by scanning for hatnotes that do not appear after the title.
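
For readers who want to see the general failure mode, here is a minimal sketch of an over-greedy pattern next to a bounded one. It is not the actual Kiwix JS code: the assumed hatnote shape (`<dl><dd><i>…</i></dd></dl>`), the `MAX_SCAN` cap, and the names `greedyHatnote`, `boundedHatnote`, and `moveHatnoteAfterTitle` are all hypothetical and chosen only for illustration.

```javascript
// Hypothetical sketch: NOT the regex or the function used in Kiwix JS.
// Assumed hatnote shape: <dl><dd><i>...</i></dd></dl> with no class or id.
const MAX_SCAN = 400; // assumed cap on characters scanned inside the <dl> block

// Over-greedy: [\s\S]* runs from the first <dl> to the LAST </dl> on the page,
// so any body text between two <dl> blocks is swallowed into the match.
const greedyHatnote = /<dl\b[^>]*>[\s\S]*<\/dl>/i;

// Bounded and lazy: stops at the nearest closing tags and gives up
// if the italic text is longer than MAX_SCAN characters.
const boundedHatnote = new RegExp(
  '<dl\\b[^>]*>\\s*<dd\\b[^>]*>\\s*<i\\b[^>]*>[\\s\\S]{0,' + MAX_SCAN +
  '}?</i>\\s*</dd>\\s*</dl>', 'i');

// Moves the first match of `pattern` to just after the closing </h1> tag.
// Returns the HTML unchanged if there is no title, no match, or the match
// does not come after the title.
function moveHatnoteAfterTitle(html, pattern) {
  const titleClose = '</h1>';
  const titleEnd = html.indexOf(titleClose);
  const match = pattern.exec(html);
  if (titleEnd === -1 || !match) return html;
  const insertAt = titleEnd + titleClose.length;
  if (match.index <= insertAt) return html;
  // Cut the hatnote out, then splice it back in right after the title.
  const without = html.slice(0, match.index) +
    html.slice(match.index + match[0].length);
  return without.slice(0, insertAt) + match[0] + without.slice(insertAt);
}

const sampleHtml =
  '<h1>Title</h1><p>Intro text.</p>' +
  '<dl><dd><i>For other uses, see X.</i></dd></dl>' +
  '<p>Body text that must stay.</p>' +
  '<dl><dd>Definition list item</dd></dl>';

console.log(moveHatnoteAfterTitle(sampleHtml, greedyHatnote));
console.log(moveHatnoteAfterTitle(sampleHtml, boundedHatnote));
```

With the greedy pattern, the body paragraph and the unrelated definition list are dragged above the intro along with the hatnote. With the bounded pattern, only the hatnote moves.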

Please see screenshots below.

2021-09-19

2021-09-19 (1)

2021-09-19 (2)

Jaifroid added the fixed (Fixed, awaiting publication in new apps) label Sep 19, 2021
@Inbefortus
Author

@Jaifroid Nice. I can also confirm that it has been fixed!

I have a question about one of your screenshots: in the article "Glossary of Islam" there is a small table called "Contents". What did you do to make it appear? I do not have such a table. The ZIM I use is 2021-09 English Wikipedia Endless.

@Jaifroid
Member

@Inbefortus The screenshot is from wikipedia_en_all_maxi_2021-02.zim. I've just checked the Endless version, and the table doesn't appear to be in the HTML, hence it must not have been scraped in that version. It's a pity. Below is the HTML of that table, in case @kelson42 can shed any light on why it is not being scraped post 2021-02:

<div role="navigation" id="toc" class="toc  hlist" aria-labelledby="tocheading" style=" text-align:left;">
<div id="toctitle" class="toctitle" style="text-align:center;display:inline-block;"><span id="tocheading" style="font-weight:bold;">Contents<span>:</span><span>&nbsp;</span></span></div>
<div style="margin:auto;white-space:nowrap;display:inline-block;">
<ul><li><a href="#top">Top</a></li>
<li><a href="#0–9">0–9</a></li>
<li><a href="#A">A</a></li>
<li><a href="#B">B</a></li>
<li><a href="#C">C</a></li>
<li><a href="#D">D</a></li>
<li><a href="#E">E</a></li>
<li><a href="#F">F</a></li>
<li><a href="#G">G</a></li>
<li><a href="#H">H</a></li>
<li><a href="#I">I</a></li>
<li><a href="#J">J</a></li>
<li><a href="#K">K</a></li>
<li><a href="#L">L</a></li>
<li><a href="#M">M</a></li>
<li><a href="#N">N</a></li>
<li><a href="#O">O</a></li>
<li><a href="#P">P</a></li>
<li><a href="#Q">Q</a></li>
<li><a href="#R">R</a></li>
<li><a href="#S">S</a></li>
<li><a href="#T">T</a></li>
<li><a href="#U">U</a></li>
<li><a href="#V">V</a></li>
<li><a href="#W">W</a></li>
<li><a href="#X">X</a></li>
<li><a href="#Y">Y</a></li>
<li><a href="#Z">Z</a> </li></ul>
<p></p>       
</div></div>
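
To check whether a given article's HTML still carries this block, a rough test along the following lines could be used. This is only a sketch: `hasTocBlock` is a hypothetical helper, not part of Kiwix JS or mwOffliner, and it assumes the block is a `<div>` carrying `id="toc"` as in the snippet above.

```javascript
// Hypothetical helper, not part of Kiwix JS or mwOffliner.
// Returns true if the article HTML contains a <div> whose id is exactly "toc".
function hasTocBlock(articleHtml) {
  return /<div\b[^>]*\bid=["']toc["'][^>]*>/i.test(articleHtml);
}

console.log(hasTocBlock('<div role="navigation" id="toc" class="toc  hlist">'));
console.log(hasTocBlock('<div id="toctitle" class="toctitle">'));
```

The closing quote after `toc` is required, so the inner `toctitle` div alone does not count as a match.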

@Inbefortus
Author

Inbefortus commented Sep 20, 2021

@Jaifroid Ah, I see. I think that might be related to this? openzim/mwoffliner#1514
After 2021-07, these valuable tables appear to be entirely excluded from scraping.

@Jaifroid
Member

This fix is now in v1.7.4 of the PWA and in the latest WikiMed UWP release.
