Request failed with status code 400 #1468

Closed
kelson42 opened this issue May 22, 2021 · 10 comments · Fixed by #1507
Labels: bug, stale, wikimedia (Direct impact on Wikimedia content scraping)

Comments

@kelson42
Collaborator

For https://en.wikipedia.org/api/rest_v1/page/mobile-sections/I%25CC%2587znik

see https://farm.openzim.org/pipeline/7adfa713c4bff5e9ce378a06/debug

@kelson42 kelson42 added bug question upstream wikimedia Direct impact on Wikimedia content scraping labels May 22, 2021
@kelson42 kelson42 removed the bug label Jul 8, 2021
@kelson42
Collaborator Author

It seems we request a bad title, json says:

type | "https://mediawiki.org/wiki/HyperSwitch/errors/bad_request"
method | "get"
detail | "title-invalid-characters"

@kelson42 kelson42 self-assigned this Jul 20, 2021
@kelson42 kelson42 added bug and removed upstream labels Jul 20, 2021
@kelson42 kelson42 added this to the 1.12 milestone Jul 20, 2021
@kelson42
Collaborator Author

The problem is with the article title İznik. If it is put in an article list, the scraper dies because a wrongly encoded string seems to be sent to the API. IMO this is not a regression and the problem has always been there, but in the past we were not checking the API HTTP response code properly and the article was simply not mirrored at all. It does indeed seem to be missing from the old ZIM files of Wikipedia 0.8. So I guess we are breaking the encoding of the title somewhere before requesting the HTML, possibly when we retrieve meta information such as redirects.

@MananJethwani Would you please have a look at this one as well? It is easy to reproduce and I'm sure you will quickly find out where the problem occurs. This is actually quite a serious problem, because more than one zimfarm recipe dies because of it.

@MananJethwani
Contributor

looks like we encode it twice!!

@MananJethwani
Contributor

@kelson42 looks like we receive the articleIDs encoded from the MediaWiki side, so we don't need to encode them again while fetching.
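The double encoding described above can be reproduced in a few lines. This is only an illustrative sketch in Python using the standard `urllib.parse` module; mwoffliner itself is not written in Python, and `encode_title_once` is a hypothetical helper, not a function from the project. The `%CC%87` bytes in the failing URL are the UTF-8 encoding of U+0307 (combining dot above), so the title is written here as `I` followed by that combining character.

```python
from urllib.parse import quote, unquote

# "İznik" spelled as "I" + U+0307 (combining dot above); this matches the
# %CC%87 bytes visible in the failing request URL.
title = "I\u0307znik"

# Encoding once gives the form that is already usable in a URL path.
encoded_once = quote(title, safe="")

# Encoding the already-encoded string again turns every "%" into "%25".
encoded_twice = quote(encoded_once, safe="")

print(encoded_once)   # I%CC%87znik
print(encoded_twice)  # I%25CC%2587znik


def encode_title_once(article_id: str) -> str:
    """Hypothetical helper: decode first, then encode exactly once.

    Assumption: titles never contain a literal "%" followed by two hex
    digits, otherwise the decode step would alter them.
    """
    return quote(unquote(article_id), safe="")
```

The twice-encoded string is exactly the path segment from the failing request. Decoding before encoding makes the helper safe to call on both raw and already-encoded IDs, at the cost of the assumption noted in the docstring.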

@kelson42
Collaborator Author

@MananJethwani Your fix has improved the situation, but I still have a failing scenario here: https://farm.openzim.org/pipeline/03264e29e2116ecec91f8f06/debug

@kelson42 kelson42 reopened this Jul 22, 2021
@MananJethwani
Contributor

MananJethwani commented Jul 22, 2021

@kelson42 this is strange, %C2%AD is not mapped to any UTF-8 code, does this mean we are encoding some kind of empty line?
and even if we are why is it present in Wikipedia?
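For reference, the bytes `C2 AD` are a valid UTF-8 sequence: they decode to U+00AD, the soft hyphen, an invisible format character. A small Python check with the standard library (used here purely for illustration) shows this:

```python
import unicodedata
from urllib.parse import unquote

# Percent-decode the path segment; unquote() interprets the bytes as UTF-8.
decoded = unquote("%C2%AD")

code_point = f"U+{ord(decoded):04X}"
name = unicodedata.name(decoded)
category = unicodedata.category(decoded)  # "Cf" means "format character"

print(code_point, name, category)  # U+00AD SOFT HYPHEN Cf
```

Because the character renders as nothing in most contexts, a title consisting only of it looks empty even though it is a real one-character string.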

@MananJethwani
Contributor

Most probably this is a problem on the MediaWiki side: the page exists at https://dty.wikipedia.org/wiki/%C2%AD
but when we try to fetch it through the REST API at https://dty.wikipedia.org/api/rest_v1/page/mobile-sections/%C2%AD we get a 400 response.

@kelson42
Collaborator Author

WP0.8 is broken again after this https://github.com/openzim/mwoffliner/pull/1521/files#diff-9a83f0d6b6913493f3382285626a8799d767b06b0c309e56d611014e9d05eea4L121. We need to better understand what is going on here.

@stale

stale bot commented Nov 9, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Nov 9, 2021
@kelson42
Collaborator Author

It seems it was some kind of weird encoding in the article list. I have fixed it.
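The thread does not say what exactly was wrong in the article list. As a rough sketch of how such entries could be spotted before a scrape, the Python snippet below flags titles that are not NFC-normalized or that contain invisible format characters, the two patterns that appear in the URLs earlier in this issue. `suspicious_titles` is a hypothetical helper, and a flagged title is only a candidate for manual review, not necessarily an invalid one.

```python
import unicodedata


def suspicious_titles(titles):
    """Return (title, reasons) pairs for titles worth a manual look."""
    flagged = []
    for title in titles:
        reasons = []
        # Decomposed sequences such as "I" + U+0307 differ from their NFC form.
        if unicodedata.normalize("NFC", title) != title:
            reasons.append("not NFC-normalized")
        # Category "Cf" covers invisible format characters like U+00AD.
        if any(unicodedata.category(ch) == "Cf" for ch in title):
            reasons.append("contains invisible format character")
        if reasons:
            flagged.append((title, reasons))
    return flagged


for title, reasons in suspicious_titles(["Paris", "I\u0307znik", "\u00ad"]):
    print(repr(title), reasons)
```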

@kelson42 kelson42 modified the milestones: 1.14.0, 1.12.0 Dec 31, 2022