Http_server: Fix images not downloading on some Portal pages (images sometimes not appearing) #686

Open
desb42 opened this issue Mar 18, 2020 · 14 comments

Comments

@desb42
Collaborator

desb42 commented Mar 18, 2020

As described by @Ope30 in #680, the page de.wikipedia.org/wiki/Portal:Wikipedia_nach_Themen seems to be inconsistent in displaying images

I have seen this before in other wikis

Taking this page as an example, the image to the right of Geographie is chosen 'randomly' from a list of 5 (in this case) images. The wikitext is:

{{Zufallsbild
| ANZAHL = 5 | SAAT = 1
| 1 = [[Datei:Views of Geneva.jpg|right|150px|Genf]]
| 2 = [[Datei:Hn-caecilien66-web.jpg|right|150px|Villa Faißt in Heilbronn]]
| 3 = [[Datei:Collage of views of Poznan, Poland.jpg|right|150px|Posen]]
| 4 = [[Datei:Arrasate-mondragon.jpg|right|150px|Baskenland]]
| 5 = [[Datei:Akihabara Electric Town 2.jpg|right|150px|Tokio]]
}}

If I take just the list of images

[[Datei:Views of Geneva.jpg|right|150px|Genf]]
[[Datei:Hn-caecilien66-web.jpg|right|150px|Villa Faißt in Heilbronn]]
[[Datei:Collage of views of Poznan, Poland.jpg|right|150px|Posen]]
[[Datei:Arrasate-mondragon.jpg|right|150px|Baskenland]]
[[Datei:Akihabara Electric Town 2.jpg|right|150px|Tokio]]

and cut out all the rest of the wikitext and replace it with these files, then Show preview (Vorschau zeigen) gives me two images and three failures

@gnosygnu
Owner

Taking this page as an example, the image to the right of Geographie is chosen 'randomly' from a list of 5 (in this case) images. The wikitext is:....

Yeah, I don't think this is resolvable. I don't know of a way to identify all the images in these "revolving" templates. I remember running across this early on in a random enwiki page for India (it switched the image based on the time of day)

The problem is that the hdump process loads a page only once, and if there is a "revolving" image template only 1 of the many images will be downloaded. I could try scanning the raw template text, but that becomes extremely difficult as you could get things like "{{random_template|Views of Geneva.jpg|Hn-caecilien66-web.jpg}}" which would need template parsing.
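To make the difficulty concrete, here is a minimal sketch of a naive raw-text scan. It is not XOWA code: the class `RawLnkiScanner`, the method `scan`, and the regex are all assumptions made for illustration. It picks up file names written as `[[Datei:...]]` links, but finds nothing when the names are passed as bare template arguments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of XOWA: scans raw wikitext for file links.
class RawLnkiScanner {
    // Matches [[Datei:Name.ext ... or [[File:Name.ext ... up to the first '|' or ']'.
    private static final Pattern FILE_LINK =
        Pattern.compile("\\[\\[(?:Datei|File):([^|\\]]+)");

    static List<String> scan(String wikitext) {
        List<String> names = new ArrayList<>();
        Matcher m = FILE_LINK.matcher(wikitext);
        while (m.find()) {
            names.add(m.group(1).trim());
        }
        return names;
    }

    public static void main(String[] args) {
        String explicit = "{{Zufallsbild | ANZAHL = 2 | SAAT = 1\n"
            + "| 1 = [[Datei:Views of Geneva.jpg|right|150px|Genf]]\n"
            + "| 2 = [[Datei:Arrasate-mondragon.jpg|right|150px|Baskenland]]\n"
            + "}}";
        String bare = "{{random_template|Views of Geneva.jpg|Hn-caecilien66-web.jpg}}";
        System.out.println(scan(explicit)); // both file names are found
        System.out.println(scan(bare));     // nothing is found: the names are bare arguments
    }
}
```

Handling the second form would require knowing which template arguments are file names, which is where real template parsing comes in.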

For now, I'll leave this as a known issue in the backlog. Let me know if you have any other thoughts. Thanks

@desb42
Collaborator Author

desb42 commented Jun 7, 2020

I have been doing a bit of digging and think I can explain the issue

Taking as an example en.wikipedia.org/wiki/Portal:Arts
This page has many sections that involve random selection

It is not the randomness that is the cause (I believe)

Generating from wikidata, the randomness potentially produces new images to 'download', the download process runs, and then the wikitext is processed a second time

This second pass potentially generates a different set of images - which do not go through another download - so the process fails to find a valid image
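The effect can be simulated in a few lines. This is a toy model under stated assumptions, not XOWA code: `TwoPassSimulation`, `render`, and `missing` are hypothetical names, and the "download" is just a set built from the first pass.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical simulation, not XOWA code: a page with "random" image slots is rendered twice.
class TwoPassSimulation {
    static final String[] CANDIDATES = {
        "Views of Geneva.jpg", "Hn-caecilien66-web.jpg",
        "Arrasate-mondragon.jpg", "Akihabara Electric Town 2.jpg"
    };

    // One render pass: each slot picks one candidate at random.
    static List<String> render(Random rng, int slots) {
        List<String> picked = new ArrayList<>();
        for (int i = 0; i < slots; i++) {
            picked.add(CANDIDATES[rng.nextInt(CANDIDATES.length)]);
        }
        return picked;
    }

    // Files referenced by the second pass that the first pass never downloaded.
    static List<String> missing(List<String> firstPass, List<String> secondPass) {
        Set<String> downloaded = new HashSet<>(firstPass); // the download runs once, after pass 1
        List<String> result = new ArrayList<>();
        for (String file : secondPass) {
            if (!downloaded.contains(file)) {
                result.add(file);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        List<String> first = render(rng, 2);
        List<String> second = render(rng, 2);
        System.out.println("pass 1 (downloaded): " + first);
        System.out.println("pass 2 (displayed):  " + second);
        System.out.println("missing on screen:   " + missing(first, second));
    }
}
```

Whenever the two passes pick different files, the third line is non-empty, which matches the intermittent nature of the missing images.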

@desb42
Collaborator Author

desb42 commented Jun 8, 2020

In principle, the second pass could be performed on the html generated in the first pass.
A bit like hdump?

@desb42
Collaborator Author

desb42 commented Jun 20, 2020

In light of the above comment, I have made some changes to a few files to implement this concept

The basic idea: during the html construction, when a file is not already in the file subdirectory, the link is generated with the hdump formatter instead. Once the files have been downloaded, this generated html is passed through the hdump process
(hopefully that makes sense)
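As a rough illustration of the concept (not the code in the attachment), the sketch below runs the second pass over the html from the first pass, so the random choices cannot change in between. The class `HtmlSecondPass`, its methods, and the `data-pending` placeholder markup are all assumptions; the real hdump markup is different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not XOWA code: pass 2 rewrites the html of pass 1
// instead of re-parsing the wikitext.
class HtmlSecondPass {
    // Placeholder format is an assumption made for this example.
    private static final Pattern PLACEHOLDER =
        Pattern.compile("<img data-pending=\"([^\"]+)\">");

    // First pass: a real tag for cached files, a placeholder for missing ones.
    static String firstPass(List<String> files, Set<String> cache) {
        StringBuilder sb = new StringBuilder();
        for (String file : files) {
            if (cache.contains(file)) {
                sb.append("<img src=\"file/").append(file).append("\">");
            } else {
                sb.append("<img data-pending=\"").append(file).append("\">");
            }
        }
        return sb.toString();
    }

    // Files that still need downloading, read back from the generated html.
    static List<String> pending(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = PLACEHOLDER.matcher(html);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    // Second pass: fill in the placeholders once the downloads have finished.
    static String secondPass(String html) {
        return PLACEHOLDER.matcher(html).replaceAll("<img src=\"file/$1\">");
    }
}
```

A caller would run `firstPass`, download everything returned by `pending`, and then call `secondPass` on the same html string.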

I have introduced a new function into Xow_hdump_mgr_load.java
Parse(src, page)
which is called from Http_server_page.java

The other change (a bit hacky) is in Xoh_file_wtr__basic.java
I change html_fmtr to use the fmtr__hdump formatter if the current formatter is fmtr__basic and the file does not exist
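The formatter switch boils down to one condition. The sketch below is a stand-alone reduction, not the real Xoh_file_wtr__basic: the class `FormatterChoice`, the enum `Fmtr`, and the method `choose` are hypothetical names, and the file check uses `java.nio.file.Files.exists` rather than whatever XOWA uses internally.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical reduction, not the real Xoh_file_wtr__basic.
class FormatterChoice {
    enum Fmtr { BASIC, HDUMP }

    // Switch to the hdump formatter only when the basic formatter is active
    // and the image file is not on disk yet; otherwise keep the current one.
    static Fmtr choose(Fmtr current, Path file) {
        if (current == Fmtr.BASIC && !Files.exists(file)) {
            return Fmtr.HDUMP;
        }
        return current;
    }
}
```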

Please see attached
rebuild.zip
(definitely a work in progress)

@gnosygnu
Owner

My apologies here. I missed the comments from 2 weeks ago when my email was weird

Thanks for the code files. I took a look at the attached rebuild.zip, and I think it won't handle the html static image dumps. Calling fmtr__hdump may allow the GUI / HTTP_SERVER to show the image, but it won't log the image for the html static image dumper (the main call is here: https://github.com/gnosygnu/xowa/blob/master/400_xowa/src/gplx/xowa/parsers/lnkis/Xop_lnki_wkr.java#L75). I can alter Xoh_file_wtr__basic to make this call, but I wanted to reproduce this on my side first.

Generating from wikidata, the randomness potentially produces new images to 'download', the download process runs, and then the wikitext is processed a second time

I tried to debug this further on my side, but with the XOWA GUI and no image databases, all the images on en.wikipedia.org/wiki/Portal:Arts show (They are "random" so each refresh of the page will download new images from the internet). I'll be downloading de.wikipedia.org sometime tonight, so will take a look at de.wikipedia.org/wiki/Portal:Wikipedia_nach_Themen. Is that the best page to witness the behavior in the excerpt above?

@desb42
Collaborator Author

desb42 commented Jun 23, 2020

I believe that this behaviour is 'limited' to xowa-http, due to the complete reprocessing of a page if an image is missing (in Http_server_page.java)

The changes I suggested above seem to work in xowa-http but I forgot to see what impact there would be for xowa-gui (which I think does it a different way)

@desb42
Collaborator Author

desb42 commented Jun 23, 2020

Attached is a version of Xoh_file_wtr__basic.java that takes account of the application mode
This seems to make things OK with xowa-gui

Xoh_file_wtr__basic.zip

@desb42
Collaborator Author

desb42 commented Jun 23, 2020

Having been playing with the xowa-gui version and page
en.wikipedia.org/wiki/Portal:Arts
I have noticed some inconsistent behaviour
I start with a fresh build of xowa (xowa_get_and_make.sh) - this deletes all files in the /file/ subdirectory

Start xowa and in Options->Wiki - HTML Databases untick 'Prefer HTML Databases for Read tab'
(so as to always use wikitext)

In a new tab request the above page

The page loads and all images load (along with the appropriate text)

However, if within the page, I right click and choose 'Reload Page' the page loads but some images are missing
(attached image: random1)

If I go to the address bar and hit carriage return (or enter), the page loads with all (random) images

My version of xowa exhibits the same problem. However, I have added a line of code to Xof_xfer_queue.java that indicates which file is being downloaded (System.out.println)

When I use 'Reload Page' no images are downloaded, when I hit enter in the address bar, images are downloaded
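For anyone wanting to repeat the check, the kind of logging involved looks roughly like this. It is a hypothetical stand-in, not the real Xof_xfer_queue: the class `LoggingXferQueue` and its methods `add` and `exec` are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for a transfer queue; the real Xof_xfer_queue is not reproduced here.
class LoggingXferQueue {
    private final Queue<String> pending = new ArrayDeque<>();

    void add(String fileName) {
        pending.add(fileName);
    }

    // Drains the queue and prints each file name, so a run that downloads
    // nothing is visible in the console as well.
    List<String> exec() {
        List<String> done = new ArrayList<>();
        System.out.println("xfer queue: " + pending.size() + " file(s) pending");
        String file;
        while ((file = pending.poll()) != null) {
            System.out.println("downloading: " + file);
            done.add(file); // a real queue would fetch the file here
        }
        return done;
    }
}
```

With output like this, a 'Reload Page' that queues nothing and an address-bar load that queues the new images are easy to tell apart.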

@gnosygnu
Owner

Cool. Thanks for the updates. I'm running errands tomorrow, so won't get a chance to review till Thursday morning.

@gnosygnu
Owner

Hey, so I tried it today and couldn't reproduce it.

Maybe this is something to do with your forked changes? Could you try with xowa_get_and_make.sh? See my steps below.

Thanks!


Let's assume the XOWA root is something like C:\xowa_latest

  • Get the 2020-06 version of dewiki, wikidatawiki, commonswiki
    • Note, these wikis are probably not necessary, as we could probably do this with just the home wiki. However, I wanted to simulate as close as possible the original bug report from 3/18
  • Run sh xowa_get_and_make.sh
  • Take the xowa_dev.jar and move it to C:\xowa_latest
  • Start XOWA GUI by running java -jar xowa_dev.jar
  • Go to de.wikipedia.org/wiki/Project:Sandbox
  • Replace the wikitext with the below:
[[Datei:Views of Geneva.jpg|right|150px|Genf]]
[[Datei:Hn-caecilien66-web.jpg|right|150px|Villa Faißt in Heilbronn]]
[[Datei:Collage_of_views_of_Poznań,_Poland.jpg|right|150px|Posen]]
[[Datei:Arrasate-mondragon.jpg|right|150px|Baskenland]]
[[Datei:Akihabara Electric Town 2.jpg|right|150px|Tokio]]
  • Exit XOWA GUI
  • Delete the C:\xowa_latest\file directory
  • Start XOWA HTTP_server by running java -jar xowa_dev.jar --app_mode http_server
  • Go to de.wikipedia.org/wiki/Project:Sandbox -> All 5 files get downloaded and show

@desb42
Collaborator Author

desb42 commented Jun 25, 2020

With the original issue - I had a forked change that shows the problem described (my version allows a 'Show preview' from the xowa-http side)
Most of the time, I try to reproduce these issues with a fresh build with xowa_get_and_make.sh
I agree that following the steps described immediately above works fine.

However
I have also, in further comments in this post, described other failures (that I believe are related)
Specifically my comments on 7th June and 23rd June
(Clearing the /file/ cache is an important step)

Have you had an opportunity to try to reproduce those ones?

@gnosygnu
Owner

However
I have also, in further comments in this post, described other failures (that I believe are related)
Specifically my comments on 7th June and 23rd June
(Clearing the /file/ cache is an important step)

Oops. I assumed the first comment was still related to the others. Sorry, my mistake. I should have read the others more closely

Have you had an opportunity to try to reproduce those ones?

I tried now with http://localhost:8080/en.wikipedia.org/wiki/Portal:Arts and see the issue. Let me re-review your commits and work on that next.

Sorry again for not spending a bit more time on going through the other comments. I know how much time you spend on these issues, and the least I could have done was read a little more closely. Will work on this over the next few days. Thanks!

@gnosygnu changed the title from "images sometimes not appearing" to "Http_server: Fix images not downloading on some Portal pages (images sometimes not appearing)" on Jun 27, 2020
@gnosygnu
Owner

Added commit above. The approach is a bit different, as I ended up adding a new Xoh_wtr_ctx.HttpServer and used it to handle all the hdump logic.

Also, FWIW, your approach was very clever. I didn't actually realize what you were doing until I re-reviewed your changes today. I think if I had to solve the same problem, I would not have come up with this approach, which is pretty sad considering I wrote the hdump code.

Anyway, nice job! Sorry again for the misunderstanding above, but thanks many more times for a great fix!

@desb42
Collaborator Author

desb42 commented Jul 23, 2020

Having done some further experiments and builds, I have noticed a number of tweaks that need consideration

The second pass goes through an (almost) completely built html page. This means there are some anchors (<a>) and image links (<img>) that have not needed to be considered before
(This shows up in the logs)
I have made some changes that stop the generation of these messages

I still cannot get enwikivoyage pagebanner images to 'download' properly

Today, however, I have just noticed (it's taken this long!) that the Categories section does not display at all

This is due to the fact that the generation of the Categories checks the Hdump status which, now, is always on at that point - hence no Categories

In 400_xowa\src\gplx\xowa\htmls\core\htmls\Xoh_wtr_ctx.java I have introduced a new Mode check
Mode_is_hdump_only which just checks the flags {return mode == TID_HDUMP || mode == TID_EMBEDDABLE;}

And changed 400_xowa\src\gplx\xowa\htmls\Xoh_page_wtr_wkr.java to check that
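In isolation, the distinction might look like the sketch below. This is a hypothetical reduction, not the real Xoh_wtr_ctx: the class `WtrCtxSketch`, the constant values, the `TID_HTTP_SERVER` id, and the assumption that the broader check is also true in HTTP server mode are all illustrative. Only the body of the narrow check is taken from the description above.

```java
// Hypothetical reduction of a writer context; not the real Xoh_wtr_ctx.
class WtrCtxSketch {
    // Constant values and the HTTP-server id are assumptions for this example.
    static final int TID_BASIC = 0;
    static final int TID_HDUMP = 1;
    static final int TID_EMBEDDABLE = 2;
    static final int TID_HTTP_SERVER = 3;

    private final int mode;

    WtrCtxSketch(int mode) {
        this.mode = mode;
    }

    // Broad check: assumed here to be true in HTTP server mode as well,
    // which is what would suppress the Categories section.
    boolean modeIsHdump() {
        return mode == TID_HDUMP || mode == TID_EMBEDDABLE || mode == TID_HTTP_SERVER;
    }

    // Narrow check as described above: only the two dump modes.
    boolean modeIsHdumpOnly() {
        return mode == TID_HDUMP || mode == TID_EMBEDDABLE;
    }

    // Categories are written unless the page is being produced for a dump.
    boolean shouldWriteCategories() {
        return !modeIsHdumpOnly();
    }
}
```

With the narrow check, a context in HTTP server mode still writes Categories, while the two dump modes keep skipping them.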
