Improve performance of JSON HTML entity escaping
Running gsub! 5 times with string arguments seems to be faster than
running it once with a regex and Hash.

When the regex matches (there are characters to escape), the string
version is faster in part because, with the regex and Hash form, CRuby
allocates a new match object plus a new string to use as the key for the
lookup in the replacement hash. It's possible that could be optimized
upstream, but at the moment this avoids those allocations.

Surprisingly (at least to me) this is still much faster when there is no
replacement needed: in my test ~3x faster on a short ~200 byte string,
and ~5x faster on a pre-escaped ~600k twitter.json.
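
The comparison described above can be reproduced with a small script. This is a minimal sketch, not the benchmark used for the commit: the names `ESCAPES`, `ESCAPE_REGEXP`, `escape_with_regex`, and `escape_with_strings` and the sample string are hypothetical, and timings will vary by Ruby version and input.

```ruby
require "benchmark"

# Hypothetical names for this sketch; the mapping and regex mirror the
# constants this commit removes.
ESCAPES = { "&" => '\u0026', ">" => '\u003e', "<" => '\u003c',
            "\u2028" => '\u2028', "\u2029" => '\u2029' }.freeze
ESCAPE_REGEXP = /[\u2028\u2029&><]/u

# Old approach: one gsub with a regex and a Hash of replacements.
def escape_with_regex(str)
  str.gsub(ESCAPE_REGEXP, ESCAPES)
end

# New approach: five in-place gsub! calls with plain string patterns.
def escape_with_strings(str)
  result = str.dup
  result.gsub!(">", '\u003e')
  result.gsub!("<", '\u003c')
  result.gsub!("&", '\u0026')
  result.gsub!("\u2028", '\u2028')
  result.gsub!("\u2029", '\u2029')
  result
end

# Made-up sample input containing every character that gets escaped.
sample = %({"html":"<a href='/?a=1&b=2'>link</a>","sep":"\u2028"}) * 100

# Both approaches must agree before their timings are worth comparing.
raise "outputs differ" unless escape_with_regex(sample) == escape_with_strings(sample)

n = 1_000
regex_time  = Benchmark.realtime { n.times { escape_with_regex(sample) } }
string_time = Benchmark.realtime { n.times { escape_with_strings(sample) } }
puts format("regex + Hash: %.4fs, five gsub!: %.4fs", regex_time, string_time)
```

The order of the five calls does not affect the result here, because none of the replacement strings contains a character that a later call would escape.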
jhawthorn committed Jul 5, 2023
1 parent 807bd54 commit ebe0c40
Showing 3 changed files with 23 additions and 26 deletions.
10 changes: 8 additions & 2 deletions actionview/test/template/erb_util_test.rb
@@ -13,9 +13,15 @@ class ErbUtilTest < ActiveSupport::TestCase
     end
   end

-  ERB::Util::JSON_ESCAPE.each do |given, expected|
+  {
+    "&" => '\u0026',
+    ">" => '\u003e',
+    "<" => '\u003c',
+    "\u2028" => '\u2028',
+    "\u2029" => '\u2029'
+  }.each do |given, expected|
     define_method "test_json_escape_#{expected.gsub(/\W/, '')}" do
-      assert_equal ERB::Util::JSON_ESCAPE[given], json_escape(given)
+      assert_equal expected, json_escape(given)
     end
   end

9 changes: 6 additions & 3 deletions activesupport/lib/active_support/core_ext/erb/util.rb
@@ -38,9 +38,7 @@ module ERBUtilPrivate
 class ERB
   module Util
     HTML_ESCAPE = { "&" => "&amp;", ">" => "&gt;", "<" => "&lt;", '"' => "&quot;", "'" => "&#39;" }
-    JSON_ESCAPE = { "&" => '\u0026', ">" => '\u003e', "<" => '\u003c', "\u2028" => '\u2028', "\u2029" => '\u2029' }
     HTML_ESCAPE_ONCE_REGEXP = /["><']|&(?!([a-zA-Z]+|(#\d+)|(#[xX][\dA-Fa-f]+));)/
-    JSON_ESCAPE_REGEXP = /[\u2028\u2029&><]/u

     # Following XML requirements: https://www.w3.org/TR/REC-xml/#NT-Name
     TAG_NAME_START_CODEPOINTS = "@:A-Z_a-z\u{C0}-\u{D6}\u{D8}-\u{F6}\u{F8}-\u{2FF}\u{370}-\u{37D}\u{37F}-\u{1FFF}" \
@@ -124,7 +122,12 @@ def html_escape_once(s)
     # JSON gem, do not provide this kind of protection by default; also some gems
     # might override +to_json+ to bypass Active Support's encoder).
     def json_escape(s)
-      result = s.to_s.gsub(JSON_ESCAPE_REGEXP, JSON_ESCAPE)
+      result = s.to_s.dup
+      result.gsub!(">", '\u003e')
+      result.gsub!("<", '\u003c')
+      result.gsub!("&", '\u0026')
+      result.gsub!("\u2028", '\u2028')
+      result.gsub!("\u2029", '\u2029')
       s.html_safe? ? result.html_safe : result
     end
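
A note on the quoting in the calls above, since `"\u2028"` and `'\u2028'` look identical at a glance: the double-quoted pattern is the single U+2028 character, while the single-quoted replacement is the six-character text of the escape sequence. The sketch below illustrates this in plain Ruby; the variable names are only for illustration.

```ruby
pattern     = "\u2028"   # double quotes: one character, U+2028 LINE SEPARATOR
replacement = '\u2028'   # single quotes: six characters, a backslash then "u2028"

puts pattern.length      # 1
puts replacement.length  # 6

# A backslash followed by "u" has no special meaning in a gsub replacement
# string, so the replacement text is inserted unchanged.
escaped = "a\u2028b".gsub(pattern, replacement)
puts escaped             # a\u2028b
puts escaped.length      # 8
```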

30 changes: 9 additions & 21 deletions activesupport/lib/active_support/json/encoding.rb
@@ -39,33 +39,21 @@ def encode(value)
             value = value.as_json(options.dup)
           end
           json = stringify(jsonify(value))
+
+          # Rails does more escaping than the JSON gem natively does (we
+          # escape \u2028 and \u2029 and optionally >, <, & to work around
+          # certain browser problems).
           if Encoding.escape_html_entities_in_json
-            json.gsub! ESCAPE_REGEX_WITH_HTML_ENTITIES, ESCAPED_CHARS
-          else
-            json.gsub! ESCAPE_REGEX_WITHOUT_HTML_ENTITIES, ESCAPED_CHARS
+            json.gsub!(">", '\u003e')
+            json.gsub!("<", '\u003c')
+            json.gsub!("&", '\u0026')
           end
+          json.gsub!("\u2028", '\u2028')
+          json.gsub!("\u2029", '\u2029')
           json
         end

         private
-          # Rails does more escaping than the JSON gem natively does (we
-          # escape \u2028 and \u2029 and optionally >, <, & to work around
-          # certain browser problems).
-          ESCAPED_CHARS = {
-            "\u2028" => '\u2028',
-            "\u2029" => '\u2029',
-            ">" => '\u003e',
-            "<" => '\u003c',
-            "&" => '\u0026',
-          }
-
-          ESCAPE_REGEX_WITH_HTML_ENTITIES = /[\u2028\u2029><&]/u
-          ESCAPE_REGEX_WITHOUT_HTML_ENTITIES = /[\u2028\u2029]/u
-
-          # Mark these as private so we don't leak encoding-specific constructs
-          private_constant :ESCAPED_CHARS, :ESCAPE_REGEX_WITH_HTML_ENTITIES,
-            :ESCAPE_REGEX_WITHOUT_HTML_ENTITIES
-
           # Convert an object into a "JSON-ready" representation composed of
           # primitives like Hash, Array, String, Symbol, Numeric,
           # and +true+/+false+/+nil+.
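
To see the effect of the rewritten `encode` step outside Rails, the following is a minimal sketch using only the standard `json` library. The helper `escape_json` and its `escape_html_entities:` keyword are hypothetical stand-ins for the encoder method and `Encoding.escape_html_entities_in_json`; the sample value is made up.

```ruby
require "json"

# Hypothetical standalone helper; the keyword argument stands in for
# Encoding.escape_html_entities_in_json.
def escape_json(json, escape_html_entities:)
  json = json.dup
  if escape_html_entities
    json.gsub!(">", '\u003e')
    json.gsub!("<", '\u003c')
    json.gsub!("&", '\u0026')
  end
  # The two line/paragraph separators are escaped regardless of the flag.
  json.gsub!("\u2028", '\u2028')
  json.gsub!("\u2029", '\u2029')
  json
end

value = { "html" => "<b>R&D</b>", "sep" => "line\u2028break" }
raw   = JSON.generate(value)

with_entities    = escape_json(raw, escape_html_entities: true)
without_entities = escape_json(raw, escape_html_entities: false)

puts with_entities
puts without_entities

# The escapes are valid JSON string escapes, so both forms decode back
# to the original value.
raise "round trip failed" unless JSON.parse(with_entities) == value
raise "round trip failed" unless JSON.parse(without_entities) == value
```

The round trip is the property that matters here: the extra escaping changes only how the document is spelled, not the data it decodes to.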