Refactor entities encoder #5

straight-shoota · 2017-09-12T13:42:01Z

This PR renames the custom HTML.escape and HTML.unescape methods because their purpose is to encode and decode HTML entities, not to escape HTML special characters.

I also refactored and simplified some related code and removed unused constants in Rule.

When crystal-lang/crystal#4555 gets merged, the custom implementation of Renderer#escape should be replaced with the one from stdlib.

straight-shoota · 2017-09-12T16:12:52Z

src/markd/renderer.cr

@@ -26,20 +26,14 @@ module Markd
      lit("\n") if @last_output != "\n"
    end

-    def escape(text, preserve_entities = false)


I am not sure whether preserve_entities served any purpose. It was not specced, at least, and the result from gsub would be the same anyway because only the four special characters are actually replaced.
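To illustrate the point about gsub, here is a minimal sketch of an escape method without the preserve_entities flag. It assumes the four special characters are `&`, `<`, `>` and `"`; the SPECIAL_CHARS constant is hypothetical and is not taken from the markd source.

```crystal
# Hypothetical mapping; assumes these are the four special characters.
SPECIAL_CHARS = {'&' => "&amp;", '<' => "&lt;", '>' => "&gt;", '"' => "&quot;"}

# Replaces each special character with its entity in a single pass.
# Characters that are not keys of the hash are left unchanged.
def escape(text : String) : String
  text.gsub(SPECIAL_CHARS)
end

puts escape(%(<a href="x">T&C</a>))
```

Because the replacement is a plain per-character lookup, there is no place in it where an extra flag could change the result.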

icyleaf

This is looking ✨! I have left a comment about string concatenation.
According to the benchmark, + is faster than #{}. Maybe revert it all back?
https://github.com/icyleaf/fast-crystal#concatenation-code

icyleaf · 2017-09-13T02:23:07Z

src/markd/renderer.cr

@@ -26,20 +26,14 @@ module Markd
      lit("\n") if @last_output != "\n"
    end

-    def escape(text, preserve_entities = false)


You are right.

icyleaf · 2017-09-13T02:26:59Z

src/markd/html_entities.cr

      high = chars.codepoint_at(0)
      low = chars.codepoint_at(0)
      codepoint = (high - 0xD800) * -0x400 + low - 0xDC00 + 0x10000

-      "&#x" + codepoint.to_s(16).upcase + ";"
+      "&#x#{codepoint.to_s(16).upcase};"


According to the benchmark, + is faster than #{} when concatenating strings. Maybe revert it all?
https://github.com/icyleaf/fast-crystal#concatenation-code

straight-shoota · 2017-09-13T09:11:04Z

About the string interpolation: I don't think your benchmark is accurate because it uses a constant and LLVM is probably applying some performance tweaks.
When using a dynamic value, I measured interpolation being between 1.06 and 1.3 times slower, which actually just means a few nanoseconds. I think that's negligible in this context. And concatenation is certainly not faster for multiple values. Only concatenating 2 or 3 (short) strings is seemingly faster than the overhead of initializing a StringBuilder for interpolation.

I don't think it is worth sacrificing good coding style for this little improvement.
It might even be possible to improve this performance for single-instance interpolations in the compiler. I think I'll open an issue about that.

If you don't mind, I'd rather use interpolation, but I'm happy to change it if you'd prefer it that way.

Benchmark example:

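require "benchmark"

# Returns a dynamic value so the compiler cannot fold the string at compile time.
def foo
  rand.to_s
end

Benchmark.ips do |bm|
  bm.report "+ single" { 10.times { "a" + foo } }
  bm.report "# single" { 10.times { "a#{foo}" } }
end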
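For reference, the line under discussion can be written in three ways that produce the same string. This is only a sketch: the helper names are hypothetical, and String.build is shown as an explicit builder form for comparison, not as a claim about what the compiler generates for interpolation.

```crystal
# Hypothetical helpers; each builds the same numeric character reference.

# Concatenation with +, as in the original line.
def entity_concat(codepoint : Int32) : String
  "&#x" + codepoint.to_s(16).upcase + ";"
end

# String interpolation, as in the changed line.
def entity_interpolated(codepoint : Int32) : String
  "&#x#{codepoint.to_s(16).upcase};"
end

# Explicit builder: appends each piece to a single buffer.
def entity_built(codepoint : Int32) : String
  String.build do |io|
    io << "&#x" << codepoint.to_s(16).upcase << ';'
  end
end

puts entity_concat(0x1F600) # prints "&#x1F600;"
```

All three return identical strings, so the choice between them is about readability and the small per-call cost measured above.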

asterite · 2017-09-28T11:18:42Z

This PR made everything 100 times slower (I thought it was #7 but it's this PR)

asterite · 2017-09-28T11:19:23Z

I would strongly suggest having a benchmark somewhere and, before merging a PR, checking whether the performance gets better or worse.

asterite · 2017-09-28T11:24:18Z

It seems Decoder.regex is creating a Regex every time. If we cache it (for example in a constant) the performance problem goes away.

asterite · 2017-09-28T11:24:39Z

Also this:

      return @@regex if @@regex.source != "^"

      @@regex = Regex.union(HTMLEntities::ENTITIES_MAPPINGS.values)

can be easily replaced with a constant...
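A minimal sketch of that suggestion, assuming a small hypothetical ENTITIES table in place of HTMLEntities::ENTITIES_MAPPINGS (the names ENTITIES, ENTITY_REGEX, uncached_regex and cached_regex are illustrative and do not come from the markd source):

```crystal
# Hypothetical stand-in for HTMLEntities::ENTITIES_MAPPINGS (not the real table).
ENTITIES = {"amp" => "&", "lt" => "<", "gt" => ">", "quot" => "\""}

# Built once; every later access reuses the same Regex object.
ENTITY_REGEX = Regex.union(ENTITIES.values)

# Builds and compiles a new Regex on every call.
def uncached_regex : Regex
  Regex.union(ENTITIES.values)
end

# Returns the Regex stored in the constant.
def cached_regex : Regex
  ENTITY_REGEX
end

puts cached_regex.same?(cached_regex)     # prints "true"
puts uncached_regex.same?(uncached_regex) # prints "false"
```

With the constant, the cost of compiling the union is paid once instead of on every call, which matters when the method sits on a hot path.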

straight-shoota · 2017-09-28T11:30:07Z

@asterite Thanks for the investigation. I was pretty sure I tested performance impact before pushing, but I'm not certain that I did it on this PR... 🤔

Decoder.regex seems like an unlikely candidate because I can't see there were any significant changes to this method or where it is called... Or am I missing something?

asterite · 2017-09-28T11:32:25Z

I profiled it with Xcode's Instruments and it pointed right to that method. I didn't investigate it much more, though.

asterite · 2017-09-28T11:39:45Z

I'm using Crystal's README.md to benchmark this.

Before this PR there was just a decode method, and it was never called. Now decode_entities is called a lot of times. I won't spend more time investigating it, but it seems the logic was changed. In any case, caching the regex solves the problem, though it would be nice to know why decode is now called when previously it wasn't needed and all specs passed.

straight-shoota · 2017-09-28T11:52:38Z

Ah yes, decode_entities_string now calls HTML.decode_entities directly every time, while before it was hidden behind two regexes. I think this is the right thing to do because it doesn't make sense to check with a regex whether there is a backslash or ampersand and then run another regex to replace these. It's just that the regex for decode needs to be cached.

straight-shoota added 3 commits September 12, 2017 15:25

Rename {un,}escape to {de,en}code - it's about encoding entities…

637c7e7

…, not escaping HTML special characters.

remove unused constants

bf8b7f0

Refactor entities maps

f3b2500

straight-shoota commented Sep 12, 2017

replace ENTITY_HERE regex

3b789ae

straight-shoota force-pushed the jm-entities-encode branch from 168f751 to 4cf3726 Compare September 12, 2017 21:30

Refactor Encoder to module

852a393

straight-shoota force-pushed the jm-entities-encode branch from 4cf3726 to 852a393 Compare September 12, 2017 21:37

icyleaf requested changes Sep 13, 2017

icyleaf merged commit 70848d0 into icyleaf:master Sep 13, 2017

asterite mentioned this pull request Sep 28, 2017

Replace regexes and char codes #7

Merged

asterite mentioned this pull request Sep 28, 2017

Remove markdown, use markd shard in compiler crystal-lang/crystal#4955

Closed

straight-shoota mentioned this pull request Sep 28, 2017

Store often-used regex in constant to improve performance #8

Merged

straight-shoota deleted the jm-entities-encode branch September 28, 2017 13:22

xfbs mentioned this pull request Feb 10, 2018

It fails karlcow/markdown-testsuite #11

Closed
