Can't create Bulbapedia zim #149

Closed
rashiq opened this issue Oct 23, 2017 · 15 comments

@rashiq

rashiq commented Oct 23, 2017

I'm trying to create a zim file of bulbapedia.bulbagarden.net but it's not working.
For test purposes, here is a command you can try that downloads only a single article:
mwoffliner --mwUrl=https://bulbapedia.bulbagarden.net/ --adminEmail=rashiq@kiwix.org --articleList articles.txt

articles.txt:

KAORI

Output:

root@2656a79f440d:/bulba# mwoffliner --mwUrl=https://bulbapedia.bulbagarden.net/ --adminEmail=rashiq@kiwix.org --articleList articles.txt > output.txt
Unable to download content [1] http://bulba-ad-host-website.azurewebsites.net/Content/customstyles.css (statusCode=406).
Unable to download content [2] http://bulba-ad-host-website.azurewebsites.net/Content/customstyles.css (statusCode=406).
Unable to download content [1] https://bulbapedia.bulbagarden.net/wiki/Mediawiki:offline.css?action=raw (statusCode=403).
Unable to download content [3] http://bulba-ad-host-website.azurewebsites.net/Content/customstyles.css (statusCode=406).
Absolutly unable to retrieve async. URL: Unable to download content [3] http://bulba-ad-host-website.azurewebsites.net/Content/customstyles.css (statusCode=406).
Unable to download content [2] https://bulbapedia.bulbagarden.net/wiki/Mediawiki:offline.css?action=raw (statusCode=403).
Unable to download content [3] https://bulbapedia.bulbagarden.net/wiki/Mediawiki:offline.css?action=raw (statusCode=403).
Absolutly unable to retrieve async. URL: Unable to download content [3] https://bulbapedia.bulbagarden.net/wiki/Mediawiki:offline.css?action=raw (statusCode=403).
Failed to start to optim /bulba/tmp/bulbagarden_en_articles_2017-10/favicon.png. Size should be 6022 (4012)
Error by retrieving article: Unrecognized value for parameter 'action': visualeditor

I'm using the mwoffliner docker image.

@kelson42
Collaborator

I would try mwoffliner --mwUrl=https://bulbapedia.bulbagarden.net/ --adminEmail=rashiq@kiwix.org --verbose --withZimFullTextIndex --localParsoid. The --localParsoid option is mandatory because Parsoid is not installed server-side.

@kelson42 kelson42 added the bug label Oct 23, 2017
@rashiq
Author

rashiq commented Oct 23, 2017

I ran it with the --localParsoid flag:
mwoffliner --mwUrl=https://bulbapedia.bulbagarden.net/ --adminEmail=rashiq@kiwix.org --withZimFullTextIndex --localParsoid

but I still got this: http://termbin.com/v9s4 (the last 1000 lines)

There are lots of warnings and then it just fails with a timeout. I'm running it again, but are the warnings before that bad, or can I just ignore them?

@kelson42
Collaborator

@rashiq I ran into a similar problem on my side:

warn/api/main { logType: 'warn/api/main',
  wiki: 'wiki$0',
  title: 'Ace_Trainer_(Trainer_class)',
  oldId: 2704375,
  reqId: null,
  userAgent: 'MWOffliner/HEAD (kelson@kiwix.org)',
  msg: 'Image Info Request Unrecognized parameter: \'iibadfilecontexttitle\'',
  longMsg: 'Image Info Request\nUnrecognized parameter: \'iibadfilecontexttitle\'' }
warn/api/imageinfo { logType: 'warn/api/imageinfo',
  wiki: 'wiki$0',
  title: 'Ace_Trainer_(Trainer_class)',
  oldId: 2704375,
  reqId: null,
  userAgent: 'MWOffliner/HEAD (kelson@kiwix.org)',
  msg: 'Image Info Request Unrecognized value for parameter \'iiprop\': badfile',
  longMsg: 'Image Info Request\nUnrecognized value for parameter \'iiprop\': badfile' }
fatal Timed out processing: wiki$0/Ace_Trainer_(Trainer_class)?oldid=2704375

The problem comes from Parsoid, not from mwoffliner directly.

@subbuss Any clue why we have a problem here? I find it quite strange to have this $0 in the error message.

@subbuss
Contributor

subbuss commented Oct 27, 2017

Do you have the page title that produced the error?

@subbuss
Contributor

subbuss commented Oct 27, 2017

Oh never mind .. I see it in the error message.

@rashiq
Author

rashiq commented Oct 28, 2017

@subbuss could you figure out what's causing it? :)

@subbuss
Contributor

subbuss commented Oct 31, 2017

Sorry, not yet. I started and got distracted, but my quick comments are:
(a) The warnings appear because you are running a newer Parsoid version against a slightly older version of MediaWiki. That is not a problem; they are only warnings.
(b) When I run the parse locally on my laptop for that page, it finishes in ~47 seconds. So I think the Parsoid server times out because the page is big, OR the wiki is hosted on a slowish server, OR the computer on which Parsoid runs is slow, OR the network requests to the wiki take too long. One way around this is to increase the timeout values in Parsoid by updating its config.yaml / localsettings.js file, but I suppose that is a mwoffliner thing. So that requires messing with the mwoffliner code, unless we refactor that code a bit more to expose the Parsoid configuration in a separate file that can be tweaked without changing mwoffliner code. TO DO.

@subbuss
Contributor

subbuss commented Oct 31, 2017

A sample of time profile from Parsoid when I ran it locally.

           TOTAL PARSE TIME: 47677
           TOTAL PROFILED TIME: 14101

Since ~30 secs of the total parse time is unaccounted for by the profile, that is likely I/O wait time. So my bet is that the long parse time is caused by network I/O and/or a slow MediaWiki server.
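
The gap can be worked out from the two figures in the profile. This is a minimal sketch, assuming both values are in milliseconds (which matches the ~47 second run reported earlier); the variable names are illustrative only.

```python
# Figures copied from the Parsoid time profile above; assumed to be milliseconds.
total_parse_ms = 47677
total_profiled_ms = 14101

# Time not attributed to any profiled stage, which is likely spent waiting on I/O.
unaccounted_ms = total_parse_ms - total_profiled_ms
unaccounted_share = unaccounted_ms / total_parse_ms

print(f"unaccounted: {unaccounted_ms / 1000:.1f} s ({unaccounted_share:.0%} of total)")
# prints "unaccounted: 33.6 s (70% of total)"
```

Under that assumption, roughly two thirds or more of the wall-clock time falls outside the profiled stages, which is what points at the network or the wiki server rather than at the parsing itself.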

@MattyBoy4444

I have another wiki with the same timeout issue. Has anyone been able to fix this?

@ejtejada

ejtejada commented Nov 6, 2017

@subbuss
Hopefully I am not intruding, but I too was looking to make an imageless ZIM of Bulbapedia.
While I do suspect their server is the problem, I also noticed that their robots.txt specifies a Crawl-delay:
https://bulbapedia.bulbagarden.net/robots.txt
Could downloading faster than that delay be causing problems, or is this entirely down to Parsoid version incompatibilities?
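
One way to check what delay a site asks for is Python's standard `urllib.robotparser`. The sketch below parses an inline robots.txt fragment instead of fetching the real file. The `Crawl-delay: 5` value, the `Disallow` rule, and the `MWOffliner` user-agent string are placeholders for illustration, not what Bulbapedia actually serves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real file may differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /w/
"""


def requested_crawl_delay(robots_txt: str, user_agent: str):
    """Return the Crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


delay = requested_crawl_delay(SAMPLE_ROBOTS_TXT, "MWOffliner")
print(delay)
```

If the value returned for the real file is larger than the interval between requests a scraper actually makes, the server may throttle or reject those requests, which would show up as slow responses or timeouts.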

@kelson42
Collaborator

kelson42 commented Dec 8, 2017

I would run the command again with --speed=0.5

@kelson42
Collaborator

@subbuss After tripling (x3) all the Parsoid timeouts I managed to get further, but now mwoffliner seems to crash on a new article, "Battle_Frontier_(Generation_IV)/Pokémon_(Group_3,_001-251)". Parsoid simply seems unable to parse it properly (it is the only title in the articles file).

./bin/mwoffliner.script.js --mwUrl=https://bulbapedia.bulbagarden.net/ --adminEmail=rashiq@kiwix.org --withZimFullTextIndex --localParsoid --verbose --speed=0.1 --articleList=articles

@subbuss
Contributor

subbuss commented Dec 10, 2017

Locally on my laptop with the latest version of Parsoid, it parses in 35 s.

parse.js --apiURL https://bulbapedia.bulbagarden.net/w/api.php --pageName "Battle_Frontier_(Generation_IV)/Pokémon_(Group_3,_001-251)" --trace time --dump wt2html:limits < /dev/null

@kelson42
Collaborator

@subbuss @rashiq I managed to create a ZIM file on a more powerful system with --speed=0.1: https://download.kiwix.org/zim/other/bulbagarden_en_all_2017-12.zim

@rashiq
Author

rashiq commented Dec 14, 2017

awesome! thank you so much @kelson42!! :)
