Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor entities encoder #5

Merged
merged 5 commits into from Sep 13, 2017

Conversation

straight-shoota
Copy link
Contributor

This PR renames the custom HTML.escape and HTML.unescape methods because their purpose is to encode and decode HTML entities, not escaping HTML special characters.

I also refactored and simplified some related code and removed unused constants in Rule.

When crystal-lang/crystal#4555 get's merged, the custom implementation of Renderer#escape should be replaced with the one from stdlib.

@@ -26,20 +26,14 @@ module Markd
lit("\n") if @last_output != "\n"
end

def escape(text, preserve_entities = false)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if preserve_entities served any purpose, it was at least not specced and the result from gsub would be the same anyway because only the four special characters are really replaced.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right.

Copy link
Owner

@icyleaf icyleaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking ✨ ! I have left a comment about concatenates string.
Through benchmark, + is faster then #{}, Maybe revert it all back?
https://github.com/icyleaf/fast-crystal#concatenation-code

@@ -26,20 +26,14 @@ module Markd
lit("\n") if @last_output != "\n"
end

def escape(text, preserve_entities = false)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right.

high = chars.codepoint_at(0)
low = chars.codepoint_at(0)
codepoint = (high - 0xD800) * -0x400 + low - 0xDC00 + 0x10000

"&#x" + codepoint.to_s(16).upcase + ";"
"&#x#{codepoint.to_s(16).upcase};"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Through benchmark, + is faster then #{} when concatenates string, maybe revert it all?
https://github.com/icyleaf/fast-crystal#concatenation-code

@straight-shoota
Copy link
Contributor Author

straight-shoota commented Sep 13, 2017

About the string interpolation: I don't think your benchmark is accurate because it uses a constant and LLVM is probably applying some performance tweaks.
When using a dynamic value, I measured interpolation beeing between 1.06 and 1.3 times slower which actually just means a few nanoseconds. I think that's negligible in this context. And concatenation is certainly not faster for multiple values. Only concatenating 2 or 3 (short) strings is seemingly faster than the overhead of initializing a StringBuilder for interpolation.

I don't think it is worth sacrificing good coding style for this little improvement.
It might even be possible to improve this performance for single-instance interpolations in the compiler. I think I'll open an issue about that.

If you don't mind, I'd rather use interpolation, but I'm happy to change it if you'd prefer it that way.

Benchmark example:

def foo
  rand.to_s
end

Benchmark.ips do |bm|
  bm.report "+ single" { 10.times { "a" + foo }}
  bm.report "# single" { 10.times { "a#{foo}" }}
end

@icyleaf icyleaf merged commit 70848d0 into icyleaf:master Sep 13, 2017
@asterite
Copy link
Contributor

This PR made everything 100 times slower (I thought it was #7 but it's this PR)

@asterite
Copy link
Contributor

I would strongly suggest to have a benchmark somewhere and, before merging a PR, check if the performance gets better or worse.

@asterite
Copy link
Contributor

It seems Decoder.regex is creating a Regex everytime. If we cache it (for example in a constant) the performance problem goes away.

@asterite
Copy link
Contributor

asterite commented Sep 28, 2017

Also this:

      return @@regex if @@regex.source != "^"

      @@regex = Regex.union(HTMLEntities::ENTITIES_MAPPINGS.values)

can be easily replaced with a constant...

@straight-shoota
Copy link
Contributor Author

@asterite Thanks for the investigation. I was pretty sure I tested performance impact before pushing, but I'm not certain that I did it on this PR... 🤔

Decoder.regex seems like an unlikely candidate because I can't see there were any significant changes to this method or where it is called... Or am I missing something?

@asterite
Copy link
Contributor

I profiled it with XCode's instruments and it pointed right to that method. I didn't investigate it much more, though.

@asterite
Copy link
Contributor

I'm using Crystal's README.md to benchmark this.

Before this PR there was just a decode method, and it was never called. Now decode_entities is called a lot of times. I won't spend more time investigating it, but it seems the logic was changed. In any case, caching the regex solves the problem, though it would be nice to know why decode is now called when previously it wasn't needed and all specs passed.

@straight-shoota
Copy link
Contributor Author

Ah yes, decode_entities_string now calls HTML.decode_entities every time directly while before it was hidden behind two regexes. I think this is the right thing to to because it doesn't make sense to check with a regex if there is an backslash or ampersand and then run another regex to replace these. It's just that the regex for decode needs to be cached.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants