Add --ascii option for grammars that don't use unicode

longhotsummer committed Mar 29, 2019
1 parent 970bb70 commit 8386b8416accb2f71416760c3c25d995c4565dd7
Showing with 36 additions and 7 deletions.
  1. +4 −0 README.md
  2. +3 −0 bin/slaw
  3. +29 −7 lib/slaw/parse/builder.rb
--- a/README.md
+++ b/README.md
@@ -84,6 +84,10 @@ You can create your own grammar by creating a gem that provides these files and

 ## Changelog

+### 3.1.0 (29 March 2019)
+
+* Add --ascii flag to %-encode utf-8 strings into US-ASCII for speed. See https://github.com/cjheath/treetop/issues/31
+
 ### 3.0.0 (28 March 2019)

 * Inline bold and italics
--- a/bin/slaw
+++ b/bin/slaw
@@ -18,6 +18,7 @@ class SlawCLI < Thor
   option :id_prefix, type: :string, desc: "Prefix to be used when generating ID elements when parsing a fragment."
   option :section_number_position, enum: ['before-title', 'after-title', 'guess'], desc: "Where do section titles come in relation to the section number? Default: before-title"
   option :grammar, type: :string, desc: "Grammar name (usually a two-letter country code). Default is za."
+  option :ascii, type: :boolean, default: false, desc: "Process text as ASCII using %-encoding. This can provide significant speed improvements if the grammar uses only ASCII literals. See https://github.com/cjheath/treetop/issues/31."
   def parse(name)
     logging

@@ -65,6 +66,8 @@ class SlawCLI < Thor
       generator.parser.options[:section_number_after_title] = after
     end

+    generator.builder.force_ascii = options[:ascii]
+
     begin
       act = generator.generate_from_text(text)
     rescue Slaw::Parse::ParseError => e
--- a/lib/slaw/parse/builder.rb
+++ b/lib/slaw/parse/builder.rb
@@ -1,5 +1,6 @@
 # encoding: UTF-8

+require 'uri'
 require 'treetop'

 module Slaw
@@ -32,6 +33,9 @@ class Builder
       # Prefix to use when generating IDs for fragments
       attr_accessor :fragment_id_prefix

+      # Should parsing re-encode the string as ASCII?
+      attr_accessor :force_ascii
+
       # Create a new builder.
       #
       # Specify either `:parser` or `:grammar_file` and `:grammar_class`.
@@ -41,6 +45,7 @@ class Builder
       def initialize(opts={})
         @parser = opts[:parser]
         @parse_options = opts[:parse_options] || {}
+        @force_ascii = false
       end

       # Do all the work necessary to parse text into a well-formed XML document.
@@ -78,18 +83,35 @@ def preprocess(text)
       def parse_text(text, parse_options={})
         text = preprocess(text)

-        require 'uri'
-        # use %-encoding to escape everything outside of the US_ASCII range,
-        # including encoding % itself.
+        text = escape_utf8(text) if @force_ascii

+        tree = text_to_syntax_tree(text, parse_options)
+        xml = xml_from_syntax_tree(tree)
+
+        xml = unescape_utf8(xml) if @force_ascii
+
+        xml
+      end
+
+      # Use %-encoding to escape everything outside of the US_ASCII range,
+      # including encoding % itself.
+      #
+      # This can have a huge performance benefit. Looking up a character by
+      # index in a UTF-8 string is linear-time in Ruby, while the same lookup
+      # in a US_ASCII encoded string is constant-time.
+      #
+      # This option can only be used if the grammar doesn't include non-ASCII literals.
+      #
+      # See https://github.com/cjheath/treetop/issues/31
+      def escape_utf8(text)
         unsafe = (0..126).to_a - ['%'.ord]
         unsafe = unsafe.map { |i| '\u%04x' % i }
         unsafe = Regexp.new('[^' + unsafe.join('') + ']')

-        text = URI::DEFAULT_PARSER.escape(text, unsafe)
-
-        tree = text_to_syntax_tree(text, parse_options)
-        xml = xml_from_syntax_tree(tree)
+        URI::DEFAULT_PARSER.escape(text, unsafe)
+      end
+
+      def unescape_utf8(xml)
         URI.unescape(xml)
       end
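
The escape/unescape round trip introduced by this commit can be sketched in isolation. This is a minimal illustration, not slaw's code: the helper names `escape_non_ascii` and `unescape_to_utf8` are hypothetical, and it uses plain `gsub` instead of the `URI::DEFAULT_PARSER.escape` and `URI.unescape` calls in the diff (`URI.unescape` was removed in Ruby 3.0, so the sketch avoids it to stay runnable on current Rubies).

```ruby
# Hypothetical helpers for illustration only; they are not part of slaw.

# Percent-encode '%' and every character outside 0x00-0x7E, byte by byte,
# then tag the (now pure ASCII) result as US-ASCII.
def escape_non_ascii(text)
  escaped = text.gsub(/[^\x00-\x7e]|%/) do |ch|
    ch.bytes.map { |b| format('%%%02X', b) }.join
  end
  escaped.encode(Encoding::US_ASCII)
end

# Reverse the escaping: turn each %XX back into its byte and
# reinterpret the resulting bytes as UTF-8.
def unescape_to_utf8(escaped)
  escaped.b.gsub(/%(\h\h)/) { [$1].pack('H2') }.force_encoding(Encoding::UTF_8)
end

original = "Caf\u00e9 100%"
escaped  = escape_non_ascii(original)
restored = unescape_to_utf8(escaped)

puts escaped               # Caf%C3%A9 100%25
puts escaped.encoding      # US-ASCII
puts restored == original  # true
```

Because `%` itself is escaped to `%25`, any literal percent sign in the input survives the round trip, which is why the commit excludes `'%'.ord` from its safe set.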
