Sanitize improvements #68

Merged: 5 commits, Oct 2, 2017


@kimili (Contributor) commented Sep 28, 2017

I've been finding the Wikipedia::Page.sanitize function very useful. However, it is a bit too heavy-handed in what it strips out and ignores. This update adds parsing of some simple Wikimedia template tags. The additions include:

  • Typographic tags: things like Em Dash, Spaced En Dash, Bullets, Middots and so on.
  • Parsing of language-specific tags. This finds tokens that mark text in a language other than the main document language and converts each one to an HTML <span> with the appropriate lang attribute.
  • Old Style Date parsing.
  • Also, when the sanitized text is only a single paragraph, I've wrapped a <p>…</p> element around it for consistent formatting.
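
To make the list above concrete, here is a minimal Ruby sketch of the kinds of transformations described. It is not the code from this pull request: the module name `SanitizeSketch`, the regular expressions, the template spellings, the parameter order of `OldStyleDate`, and the rendered date format are all assumptions chosen for illustration.

```ruby
# Illustrative sketch only; names, patterns, and output formats are
# assumptions, not the gem's actual implementation.
module SanitizeSketch
  # Assumed mapping from a few typographic template names to characters.
  TYPOGRAPHIC = {
    'emdash'         => "\u2014",   # em dash
    'spaced en dash' => " \u2013 ", # en dash with surrounding spaces
    'bullet'         => " \u2022 ",
    'middot'         => " \u00B7 "
  }.freeze

  # {{lang|de|Some text}} -> <span lang="de">Some text</span>
  LANG = /\{\{\s*lang\s*\|\s*([a-z-]+)\s*\|\s*([^{}|]+?)\s*\}\}/i

  # Assumed parameter order: {{OldStyleDate|date|year|old style date}}
  OLD_STYLE =
    /\{\{\s*OldStyleDate\s*\|\s*([^{}|]+?)\s*\|\s*([^{}|]+?)\s*\|\s*([^{}|]+?)\s*\}\}/i

  module_function

  # Replace known parameterless typographic templates; leave others untouched.
  def typographic(text)
    text.gsub(/\{\{\s*([^{}|]+?)\s*\}\}/) do
      whole = Regexp.last_match(0)
      name  = Regexp.last_match(1)
      TYPOGRAPHIC.fetch(name.downcase, whole)
    end
  end

  # Convert language templates into spans carrying a lang attribute.
  def lang_spans(text)
    text.gsub(LANG) do
      "<span lang=\"#{Regexp.last_match(1)}\">#{Regexp.last_match(2)}</span>"
    end
  end

  # Render an old style date inline (the output format is an assumption).
  def old_style_dates(text)
    text.gsub(OLD_STYLE) do
      "#{Regexp.last_match(1)} [O.S. #{Regexp.last_match(3)}] #{Regexp.last_match(2)}"
    end
  end

  # Wrap text that has no paragraph markup in a single <p> element.
  def wrap_single_paragraph(text)
    text.include?('<p>') ? text : "<p>#{text.strip}</p>"
  end

  def sanitize(text)
    wrap_single_paragraph(typographic(old_style_dates(lang_spans(text))))
  end
end

puts SanitizeSketch.sanitize('Bach{{spaced en dash}}{{lang|de|Das Wohltemperierte Klavier}}')
```

Templates the sketch does not recognise are passed through unchanged rather than stripped, which is one way to avoid the heavy-handed behaviour mentioned above; the real function may handle them differently.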

I've updated the sanitization test as needed and added another one which includes the old style date handling.

These changes are useful to me. I hope they can be useful to others, too.

- Parsing Typographic wiki markup templates
- Improving image and file tag parsing
- Wrapping single paragraph raw text in a paragraph element.
- Also adding JS Bach test which includes an old style date block.
@pietromenna (Collaborator) commented

Hi @kimili,

Thank you very much for your contribution!

@pietromenna pietromenna merged commit 825b797 into kenpratt:master Oct 2, 2017