Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zimdump produces invalid path in HTML-based redirect #224

Closed
lidel opened this issue Feb 14, 2021 · 5 comments · Fixed by #225
Closed

zimdump produces invalid path in HTML-based redirect #224

lidel opened this issue Feb 14, 2021 · 5 comments · Fixed by #225
Assignees
Milestone

Comments

@lidel
Copy link
Contributor

lidel commented Feb 14, 2021

Assuming I understood this correctly, the HTML-based redirect created by zimdump points at an invalid path.

How to reproduce

wikipedia_cr_all_maxi_2021-02.zim is small enough to be a good demo:

$ zimdump dump --dir=cr-tmp ./wikipedia_cr_all_maxi_2021-02.zim
$ cat cr-tmp/A/English_language
<!DOCTYPE html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta http-equiv="refresh" content="0;url=A/%E1%90%A7%E1%90%81%E1%92%A5%E1%94%A5%E1%91%8E%E1%91%AF%E1%94%92%E1%90%A4_%E1%90%8A%E1%94%A8%E1%92%A7%E1%90%A7%E1%90%83%E1%93%90" /><head><body></body></html>

So while in A/{name1} we are redirected to A/{name2}.

This looks like a bug, because the redirect is relative to {name1}, so it points at A/A/{name2} which does not exist.

I was able to reproduce the same issue with

  • wikipedia_cu_all_maxi_2021-02 – example: A/Жена effectively redirects to A/A/%D0%96%D1%94%D0%BD%D0%B0
  • wikipedia_tr_all_maxi_2021-02 – example: A/Fatih_Sultan_Mehmed effectively redirects to A/A/II._Mehmed

How to fix?

Relative redirect

Ideally, a relative path would be used, without namespace prefix, something like:

- <meta http-equiv="refresh" content="0;url=A/{name2}" /> 
+ <meta http-equiv="refresh" content="0;url={name2}" /> 

Open problem: subdirectories

The caveat is an edge case when a name with / is redirecting to something else. Something like A/some/name needs to point at article one level up:

- <meta http-equiv="refresh" content="0;url=A/{name2}" /> 
+ <meta http-equiv="refresh" content="0;url=../{name2}" /> 

In case that adds too much complexity to zimdump code, we could do that in JS.

If we replace <meta> tag with <script> then we are able to correctly redirect, no matter if name includes / or not. Something like:

<!doctype html>
<html lang="en">
<head>
<script>
  const from = 'na/me1'
  const to = 'name2'
  const prefix = window.location.pathname.split(from)[0]  
  window.location.replace(prefix + to)
</script>
<noscript>
  <meta http-equiv="refresh" content="0;url=name2" />  <!-- best-effort if no JS -->
</noscript>
</head>
</html>

@kelson42 lmk if there is a better way (or if I misunderstood how things work :))

@kelson42 kelson42 self-assigned this Feb 14, 2021
lidel added a commit to ipfs/distributed-wikipedia-mirror that referenced this issue Feb 15, 2021
This is a workaround to fix issue described in:
openzim/zim-tools#224
@lidel
Copy link
Contributor Author

lidel commented Feb 15, 2021

@kelson42 fwiw I've worked around this in ipfs/distributed-wikipedia-mirror@11fe184 by leveraging find fgrep and sed:

find ./wikipedia_tr_all_maxi_2021-02/A  -type f -size -800c -exec fgrep -l "0;url=A/" {} + -exec sed -i "s|0;url=A/|0;url=|" {} >> fixed_redirects.log +

Inspecting all articles takes too much time, so it looks for 0;url=A/ in files smaller than 800 bytes (so it only works on redirect ones, which are usually <500, while real articles are >1KB), and updates them in-place if needed.

@kelson42
Copy link
Contributor

@veloman-yunkan may I ask you please to look at this ticket if you have the time? This is a pretty small one, but this impairs a bit the new Wikipedia snapshots release effort on IPFS we are running now.

@veloman-yunkan
Copy link
Collaborator

@kelson42 OK, I will investigate it right away

@kelson42
Copy link
Contributor

kelson42 commented Feb 19, 2021

@veloman-yunkan Thank you! I looked yesterday to the code and the problem is clearly that there is no code at all to compute the relative path. The redirect targeted article fullpath URL is just put there. I believe we have a function somewhere (in zimwriterfs?) to make this computation of a relative path given the current path and the target path. Probably could/should be reused.

@veloman-yunkan
Copy link
Collaborator

I believe we have a function somewhere (in zimwriterfs?) to make this computation of a relative path given the current path and the target path. Probably could/should be reused.

I looked for such a function only in src/tools.h. Since it wasn't there I wrote my own implementation of it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants