Replace Treetop parser with a Ragel based parser #490

bpot · 2013-01-09T02:14:53Z

This pull request replaces the Treetop-based parser with a Ragel-based parser. The change is primarily intended to improve the performance of message processing: on the benchmark below, the Ragel version is ~7.5x faster than the Treetop parser.

The new parser exhibits the same behavior as the current parser except for a couple of cases where the Treetop parser was handling fields incorrectly. I've submitted PRs to fix both issues: #487 and #481. A couple of specs related to these issues are marked pending in the current PR. Assuming the other two PRs are merged, I will rebase and remove the pending lines from those specs.

The change in parsers necessitates removing a public interface. It is already deprecated but still may require a major version bump.

I know this is a massive change; let me know if there is anything I can do to make it more digestible.

Benchmark

Parsing a set of 1000 emails from the enron data set:

Mail-2.5.3: 24.78s (40.35 emails/second)
Mail-2.5.3 w/Ragel Parser: 3.245290s (308.6 emails/second)
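
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a throughput harness. It is not the script used for the numbers above: `measure_throughput` and `speedup` are hypothetical helper names, and the corpus and the "parser" block are stand-ins so the snippet runs without the mail gem. A real run would presumably read the message files and call Mail.new on each.

```ruby
require "benchmark"

# Times a block over every item and returns [elapsed_seconds, items_per_second].
# (Hypothetical helper, not part of the mail gem.)
def measure_throughput(items)
  elapsed = Benchmark.realtime { items.each { |item| yield item } }
  [elapsed, items.size / elapsed]
end

# Ratio of an old timing to a new one. (Hypothetical helper.)
def speedup(old_seconds, new_seconds)
  old_seconds / new_seconds
end

# Stand-in corpus and stand-in "parser": splitting headers from the body.
corpus = Array.new(1000) { |i| "Subject: message #{i}\r\n\r\nbody" }
elapsed, rate = measure_throughput(corpus) { |raw| raw.split("\r\n\r\n", 2) }
puts format("%.4fs (%.1f emails/second)", elapsed, rate)

# Applying the helper to the two timings reported above.
puts format("%.1fx", speedup(24.78, 3.245290))
```

The last line divides the two reported wall-clock times, which is where a figure in the neighborhood of the quoted ~7.5x comes from.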

Parser layout

lib/mail/parsers/
  address_lists_parser.rb        # Build data Structs for the Elements (lib/mail/elements)
  content_disposition_parser.rb  # by interpreting actions emitted by the state machine modules.
  ...
  ragel/
    common.rl # Main grammar definition
    ruby/
      machines/
        address_lists_machine.rb        # Ragel state machines which emit events that
        content_disposition_machine.rb  # are consumed by the higher level parsers
        ...
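
As a rough sketch of the split described in this layout (a machine that only emits events, and a parser that turns them into a Struct), something like the following toy could illustrate the idea. Everything here is hypothetical: `ToyDispositionMachine`, `ToyDispositionParser`, the event names, and the regex scanning all stand in for the generated Ragel code and are not the classes in this PR.

```ruby
# Toy illustration only; none of these names exist in the mail gem.
ContentDisposition = Struct.new(:disposition_type, :parameters)

# Stand-in for a generated state machine: it scans the input and returns
# [event_name, start_offset, end_offset] triples. It knows nothing about Structs.
module ToyDispositionMachine
  PARAM = /;\s*([\w\-]+)=("?)([^";]*)\2/

  def self.events(input)
    type_end = input.index(";") || input.length
    events = [[:disposition_type, 0, type_end]]
    input.scan(PARAM) do
      m = Regexp.last_match
      events << [:param_name, m.begin(1), m.end(1)]
      events << [:param_value, m.begin(3), m.end(3)]
    end
    events
  end
end

# Stand-in for a higher-level parser: it interprets the events and builds the Struct.
class ToyDispositionParser
  def initialize(machine = ToyDispositionMachine)
    @machine = machine
  end

  def parse(input)
    result = ContentDisposition.new(nil, {})
    pending_name = nil
    @machine.events(input).each do |name, start, stop|
      text = input[start...stop]
      case name
      when :disposition_type then result.disposition_type = text.strip.downcase
      when :param_name       then pending_name = text
      when :param_value      then result.parameters[pending_name] = text
      end
    end
    result
  end
end

parsed = ToyDispositionParser.new.parse('attachment; filename="report.pdf"')
puts parsed.disposition_type
puts parsed.parameters.inspect
```

Because the parser in this sketch depends only on the event triples, any object that responds to `events` could be passed to the constructor in place of the toy machine.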

Further Work

I am also working on a native parser (based on the same Ragel grammar) that will improve performance even further. It uses an FFI interface to a custom shared module so that the gains can be shared with Rubinius and JRuby. This is one of the advantages of using a Ragel-based parser, and it is why there is a strict separation between the state machine modules and the classes that interpret the actions.

staugaard · 2013-01-09T21:37:05Z

This is awesome. @mikel what do you say?

mig-hub · 2013-01-10T11:38:35Z

Wow the benchmark is promising!
And keeping the same Ragel grammar for a native parser is a good move.

jeremy · 2013-01-19T05:30:00Z

Great patch. Tested in two apps and runs fine!

Pushing the parsing machinery out to arms' length cleans up the field classes nicely.

jeremy · 2013-01-28T02:27:49Z

Nice speedup in the test suite as well!

master:

Finished in 19.91 seconds
1455 examples, 0 failures, 9 pending

ragel:

Finished in 6.55 seconds
1407 examples, 0 failures, 10 pending

jeremy · 2013-01-28T02:40:01Z

@bpot could you rebase on latest master? Here's a rundown of the .treetop changes:

diff --git a/lib/mail/parsers/content_transfer_encoding.treetop b/lib/mail/parsers/content_transfer_encoding.treetop
index 9d0f50a..9db6134 100644
--- a/lib/mail/parsers/content_transfer_encoding.treetop
+++ b/lib/mail/parsers/content_transfer_encoding.treetop
@@ -9,12 +9,10 @@ module Mail
     end

     rule encoding
-      ietf_token "s"? {
-        def text_value
-          ietf_token.text_value
-        end
-      } / custom_x_token
+      "7bits" / "8bits" /
+      "7bit" / "8bit" / "binary" / "quoted-printable" / "base64" /
+      ietf_token / custom_x_token
     end

   end
-end
\ No newline at end of file
+end
diff --git a/lib/mail/parsers/content_type.treetop b/lib/mail/parsers/content_type.treetop
index 86fe64b..84eeced 100644
--- a/lib/mail/parsers/content_type.treetop
+++ b/lib/mail/parsers/content_type.treetop
@@ -5,7 +5,7 @@ module Mail
     include RFC2045

     rule content_type
-      main_type "/" sub_type param_hashes:(CFWS ";"? parameter CFWS)* {
+      main_type "/" sub_type param_hashes:(CFWS ";"* parameter CFWS)* {
         def parameters
           param_hashes.elements.map do |param|
             param.parameter.param_hash
@@ -65,4 +65,4 @@ module Mail
     end

   end
-end
\ No newline at end of file
+end
diff --git a/lib/mail/parsers/rfc2045.treetop b/lib/mail/parsers/rfc2045.treetop
index c166492..2839e73 100644
--- a/lib/mail/parsers/rfc2045.treetop
+++ b/lib/mail/parsers/rfc2045.treetop
@@ -8,8 +8,7 @@ module Mail
     end

     rule ietf_token
-      "7bit" / "8bit" / "binary" /
-      "quoted-printable" / "base64"
+      token+
     end

     rule custom_x_token
diff --git a/lib/mail/parsers/rfc2822.treetop b/lib/mail/parsers/rfc2822.treetop
index fc437f6..77dc3d6 100644
--- a/lib/mail/parsers/rfc2822.treetop
+++ b/lib/mail/parsers/rfc2822.treetop
@@ -184,7 +184,7 @@ module Mail
     end

     rule quoted_string
-      CFWS? DQUOTE quoted_content:(FWS? qcontent)+ FWS? DQUOTE CFWS?
+      CFWS? DQUOTE quoted_content:(FWS? qcontent)* FWS? DQUOTE CFWS?
     end

     rule qcontent
@@ -222,7 +222,22 @@ module Mail
     end

     rule mailbox
-      name_addr / addr_spec
+      (name_addr / addr_spec) {
+        def dig_comments(comments, elements)
+          elements.each { |elem|
+            if elem.respond_to?(:comment)
+              comments << elem.comment
+            end
+            dig_comments(comments, elem.elements) if elem.elements
+           }
+        end
+
+        def comments
+          comments = []
+          dig_comments(comments, elements)
+          comments
+        end
+      }
     end

     rule address
@@ -244,24 +259,7 @@ module Mail
         end

       } /
-      mailbox {
-
-      def dig_comments(comments, elements)
-        elements.each { |elem|
-          if elem.respond_to?(:comment)
-            comments << elem.comment
-          end
-          dig_comments(comments, elem.elements) if elem.elements
-         }
-      end
-
-      def comments
-        comments = []
-        dig_comments(comments, elements)
-        comments
-      end
-
-      }
+      mailbox
     end

     rule address_list
@@ -340,7 +338,7 @@ module Mail
     end

     rule name_val_list
-      (CFWS)? (name_val_pair (CFWS name_val_pair)*)
+      (CFWS)? (name_val_pair (CFWS name_val_pair)*)?
     end

     rule name_val_pair
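
Several of the hunks above only relax a quantifier: `";"?` becomes `";"*` in content_type, the `+` on quoted_content becomes `*`, and name_val_list gains a trailing `?`. As a rough illustration of the quoted_string case, here is a Ruby regex analogue. It is not Treetop, the `QCONTENT` pattern is a simplified assumption that ignores CFWS/FWS folding, and the constant names are made up for the sketch.

```ruby
# Simplified stand-in for qcontent: any character except a quote or backslash,
# or a backslash followed by one character. (An assumption, not the RFC 2822 rule.)
QCONTENT = /[^"\\]|\\./

BEFORE = /\A"(?:#{QCONTENT})+"\z/ # at least one qcontent, like the old `+`
AFTER  = /\A"(?:#{QCONTENT})*"\z/ # zero or more qcontent, like the new `*`

['"Conrad Irwin"', '""'].each do |s|
  puts "#{s.inspect}: before=#{BEFORE.match?(s)} after=#{AFTER.match?(s)}"
end
```

In this analogue the only input whose result changes is the empty quoted string, which the `+` form rejects and the `*` form accepts.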

bpot · 2013-01-29T19:09:42Z

Rebased!

ConradIrwin · 2013-02-02T04:35:43Z

This PR is awesome. I have an email here with 612 recipients and 1152 Ccs (someone really fails at email :). It takes the time for Mail.new(str).to_s from 20.5s to 1.0s

ConradIrwin · 2013-02-08T21:26:04Z

@bpot: I'm definitely "doing it wrong", but this seems like unexpected behaviour:

h = Mail::Header.new; h['From'] = "Conrad Irwin <me@cirw.in> "; h['From'].addresses
# => ["me@cirw.in", "me"]

Without the trailing space, or using the old treetop parser, I get the expected ["me@cirw.in"].

bpot · 2013-02-08T22:27:18Z

@ConradIrwin interesting, I'll look at that -- my goal is for the two parsers to be as compatible as possible, within reason.

bpot · 2013-02-12T07:25:09Z

Rebased against latest master and fixed the issue @ConradIrwin found.

ConradIrwin · 2013-02-12T07:38:33Z

Awesome, thanks! We've been running your code since Friday, and all seems to be going well so far :). (I guess we're parsing about a thousand emails a day through it at the moment, so not a huge amount, but definitely confidence inspiring)

guilleiguaran · 2013-02-28T00:58:45Z

@jeremy can you check this?

I would love to ship the next Rails 4 beta with mail 2.6.0

jeremy · 2013-03-01T02:01:30Z

@guilleiguaran Working well for me. Next major release will be a while though. Next minor, maybe.

eric · 2013-03-22T22:36:16Z

lib/mail/parsers/ragel/common.rl

+  local_part_no_capture = (local_dot_atom | quoted_string | obs_local_part);
+
+  # domain:
+  dot_atom_text = ("."+)? domain_text (("."+)? domain_text)*;


Is ("."+)? any different than "."*?

jeremy · 2013-03-23

Sharp eye! I first read that as a non-greedy match. @bpot?

bpot · 2013-03-23

Nice catch! They are the same. I even double-checked by comparing Ragel's XML state machine output for both versions, and it is exactly the same.

I'll update the PR to use the clearer "."* form.
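
The equivalence can also be spot-checked outside Ragel. The sketch below uses Ruby regexes as an analogue of the two spellings of dot_atom_text and compares them on every short string over a two-character alphabet. The letter "a" is an assumed stand-in for domain_text, and the constant names are invented for the example; this is a brute-force check on small inputs, not a proof.

```ruby
# Ruby regex analogues of the two Ragel spellings; "a" stands in for domain_text.
ORIGINAL = /\A(?:\.+)?a(?:(?:\.+)?a)*\z/
SIMPLER  = /\A\.*a(?:\.*a)*\z/

# Every string over {".", "a"} of length 0 through 8.
candidates = (0..8).flat_map { |n| %w[. a].repeated_permutation(n).map(&:join) }

# Keep only the strings on which the two patterns give different answers.
disagreements = candidates.reject { |s| ORIGINAL.match?(s) == SIMPLER.match?(s) }
puts "checked #{candidates.size} strings, #{disagreements.size} disagreements"
```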

jpmckinney · 2013-04-15T18:04:15Z

Sweet. What more work needs to be done to get this merged?

jeremy · 2013-04-15T19:09:20Z

@jpmckinney Slated for merge to master after next minor version release. Please do test it out in your own apps!

bpot · 2013-05-13T22:07:57Z

Updated the funky grammar that @eric pointed out.
Added a fix for email addresses that begin with a comment -- was triggering an exception.
Rebased on current master.

mikel · 2013-05-14T03:05:06Z

This is great work. I'll be doing some updates to get a minor out, then we'll merge this into the next major release.

This was referenced Jan 26, 2013

mail gem speed in comparison to tmail when parsing mail #115

Closed

Mail.new is slower than molasses in winter #256

Closed

ConradIrwin mentioned this pull request Feb 2, 2013

Performance improvements for people parsing email headers #505

Merged

eric reviewed Mar 22, 2013

bpot added 5 commits May 13, 2013 16:53

Replace Treetop parser with a Ragel based parser

2da7c79

AddressListsParser: Don't add an extra address to the address list when it ends with a space.

c6b656c

ReceivedParser: don't error out on quoted strings in a received header

1f49764

common.rl: use clearer .* in dot_atom_text, instead of (.+)?

ab2af8c

Correctly parse addresses that begin with a comment.

1505198

mikel merged commit 1505198 into mikel:master May 14, 2013

lencioni mentioned this pull request Apr 3, 2014

Next release timeframe? #695

Closed

jeremy mentioned this pull request Jun 4, 2019

group list parse failure #1336

Open
