HTML entities are not removed from ZIM article titles #398

Closed
ghost opened this issue Aug 4, 2020 · 7 comments · Fixed by #458

Comments

@ghost

ghost commented Aug 4, 2020

@kelson42 commented on Aug 4, 2020, 10:14 AM UTC:

See for example "Jean Aicard - L'Illustre Maurin ( EPUB et PDF gratuits )" in bouquineux.zim: in kiwix-serve suggestions, once the entry has been selected, the title is displayed with its apostrophe as an HTML entity instead of the plain character.

This issue was moved by kelson42 from openzim/warc2zim#38.

@ghost ghost added bug wontfix labels Aug 4, 2020
@ghost
Author

ghost commented Aug 4, 2020

@rgaudin commented on Aug 4, 2020, 12:07 PM UTC:

  • the page title doesn't encode this.
  • I checked the WARC and its content doesn't have it encoded.
  • the Python code doesn't encode it either.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Jean Aicard - Notre-Dame-d'Amour ( EPUB et PDF gratuits )</title>
<META NAME="Description" CONT

I've had a quick look at kiwixlib and we don't appear to do much there either. The suggested list items are not encoded either, so it seems to be solely related to the taskbar JS:

jk( "#kiwixsearchbox" ).autocomplete({
  source: "{{root}}/suggest?content={{#urlencoded}}{{{content}}}{{/urlencoded}}",
  dataType: "json",
  cache: false,
  select: function(event, ui) {
    jk( "#kiwixsearchbox" ).val(ui.item.value);
    jk( "#kiwixsearchform" ).submit();
  },
});

head_part.html

@mgautierfr can maybe explain why we use {{#urlencoded}} here?

@rgaudin
Member

rgaudin commented Aug 7, 2020

FYI this affects all scrapers and breaks any suggestion containing a quote. Searching for L'attente on wikipedia_fr_all_* triggers the same bug.

@mgautierfr
Member

@mgautierfr can maybe explain why we use {{#urlencoded}} here?

The {{#urlencoded}} section encodes the ZIM name when we make the request for suggestions, so the base URL stays valid even if the ZIM file name contains ? or any other special HTTP character.

It doesn't encode the results of the request displayed to the user.
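
To illustrate the distinction, here is a minimal sketch in plain JavaScript of what URL-encoding the content parameter achieves. It is not the actual kiwix-serve template code: the function name buildSuggestUrl and the sample ZIM name are made up for illustration, and encodeURIComponent stands in for the mustache {{#urlencoded}} section.

```javascript
// Hypothetical helper: builds the suggestion request URL.
// encodeURIComponent plays the role of {{#urlencoded}} in the template.
function buildSuggestUrl(root, content, term) {
  return root + "/suggest?content=" + encodeURIComponent(content) +
         "&term=" + encodeURIComponent(term);
}

// A ZIM name containing "?" and "=" would otherwise corrupt the query string.
const url = buildSuggestUrl("", "my zim?v=1", "Superhero");
console.log(url);
```

Only the request URL is affected here; whatever the server sends back in the response body is passed through untouched.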

@stale

stale bot commented Oct 10, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@veloman-yunkan
Collaborator

HTML escaping in suggestions is performed by mustache using this template:
https://github.com/kiwix/kiwix-lib/blob/c1faf55ae88563af2fa1d3788f2ab0a841ce7c24/static/templates/suggestion.json#L1-L7

Escaping is applied to value but not to label. As a consequence, search results containing double quotes produce invalid JSON (this is just a side finding):

$ curl 'http://library.kiwix.org/suggest?content=wikipedia_hy_all_mini_2021-02&term=Superhero'
[
  {
    "value" : "&quot;Superhero&quot;",
    "label" : ""Superhero""
  },
  {
    "value" : "Superhero",
    "label" : "Superhero"
  },
  {
    "value" : "Superhero (երգ)",
    "label" : "Superhero (երգ)"
  },
  {
    "value" : "Superheroes (երգ)",
    "label" : "Superheroes (երգ)"
  },
  {
    "value" : "Superhero ",
    "label" : "containing 'Superhero'..."
  }
]
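
The problem with the first entry can be reproduced with any strict JSON parser. A minimal sketch in JavaScript, using a string modelled on that entry (the helper name isValidJson is made up for illustration):

```javascript
// Hypothetical helper: reports whether a string parses as JSON.
function isValidJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (e) {
    return false;
  }
}

// Label as emitted above: the inner double quotes are not escaped.
const broken = '[{"value" : "&quot;Superhero&quot;", "label" : ""Superhero""}]';
// Same entry with the inner quotes escaped the JSON way (backslash).
const fixed = '[{"value" : "&quot;Superhero&quot;", "label" : "\\"Superhero\\""}]';

console.log(isValidJson(broken), isValidJson(fixed));
```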

I think the correct solution is to ensure that the API's response is valid JSON rather than safe HTML. HTML escaping of the data extracted from the response must be performed - if needed - in the frontend.
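
A rough sketch of that split of responsibilities, written in JavaScript purely for illustration (the real server side is kiwix-lib, and the names suggestionsToJson and escapeHtml are hypothetical): the response is serialized with a JSON encoder and no HTML escaping, and the frontend escapes only at the point where text is placed into markup.

```javascript
// Hypothetical server side: JSON-encode the titles, no HTML escaping.
function suggestionsToJson(titles) {
  return JSON.stringify(titles.map((t) => ({ value: t, label: t })));
}

// Hypothetical frontend helper: escape text just before inserting it into HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;") // must run first so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

const body = suggestionsToJson(['"Superhero"', "L'attente"]);
const items = JSON.parse(body);          // always parses: quotes are JSON-escaped
console.log(items[0].label);             // raw title, suitable for an input's value
console.log(escapeHtml(items[0].label)); // escaped form, suitable for HTML markup
```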

@veloman-yunkan
Collaborator

I think the correct solution is to ensure that the API's response is valid JSON rather than safe HTML. HTML escaping of the data extracted from the response must be performed - if needed - in the frontend.

While still thinking that this is the right approach, the easiest solution with the current mustache-based implementation (which primarily targets HTML) is to HTML escape the label too and unescape - if needed - the response data in the frontend.
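
For the frontend half of that approach, unescaping could look roughly like the sketch below. The helper name unescapeHtml and the exact set of entities handled are assumptions; the entities actually emitted by the mustache implementation should be checked before relying on this.

```javascript
// Assumed entity set; adjust to whatever the server-side escaping really emits.
const ENTITIES = {
  "&quot;": '"',
  "&apos;": "'",
  "&#39;": "'",
  "&lt;": "<",
  "&gt;": ">",
  "&amp;": "&",
};

// Hypothetical helper: decodes the entities above in a single pass, so that
// text such as "&amp;quot;" becomes "&quot;" and is not decoded twice.
function unescapeHtml(text) {
  return text.replace(/&(?:quot|apos|#39|lt|gt|amp);/g, (m) => ENTITIES[m]);
}

console.log(unescapeHtml("&quot;Superhero&quot;"));
```

In the autocomplete select handler quoted earlier in the thread, this would amount to passing unescapeHtml(ui.item.value) to .val() instead of the raw ui.item.value.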
