Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace Treetop parser with a Ragel based parser #490

Merged
merged 5 commits into from May 14, 2013

Conversation

bpot
Copy link
Contributor

@bpot bpot commented Jan 9, 2013

This pull request replaces the Treetop based parser with a Ragel-based parser. This change is primarily to improve the performance of message processing. Compared to the Treetop parser the Ragel version is ~7.5x faster.

The new parser exhibits the same behavior as the current parser except for a couple case where the Treetop parser was incorrectly handling fields. I've submitted PRs to fix both issues: PR #487 PR #481. A couple specs related to these issues are marked pending in the current PR. Assuming these other two PRs are merged I will rebase and remove the pending lines from those specs.

The change in parsers necessitates removing a public interface. It is already deprecated but still may require a major version bump.

I know this is a massive change, let me know if there is anything I can do to make it more digestible.

Benchmark

Parsing a set of 1000 emails from the enron data set:

Mail-2.5.3: 24.78s (40.35 emails/second)
Mail-2.5.3 w/Ragel Parser: 3.245290s (308.6 emails/second)

Parser layout

lib/mail/parsers/
  address_lists_parser.rb        # Build data Structs for the Elements (lib/mail/elements)
  content_disposition_parser.rb  # by interpreting actions emitted by the state machine modules.
  ...
  ragel/
    common.rl # Main grammar definition
    ruby/
      machines/
        address_lists_machine.rb        # Ragel state machines which emit events that
        content_disposition_machine.rb  # are consumed by the higher level parsers
        ...

Further Work

I am also working on a native parser (based on the same ragel grammar) that will improve performance even further. It uses an FFI interface to a custom shared module so that gains can be shared with Rubinius and JRuby. This is one of the advantages of using a Ragel-based parser and why there is a strict separation between the state machine modules and the classes that interpret the actions.

@staugaard
Copy link

This is awesome. @mikel what do you say?

@mig-hub
Copy link

mig-hub commented Jan 10, 2013

Wow the benchmark is promising!
And keeping the same Ragel grammar for a native parser is a good move.

@jeremy
Copy link
Collaborator

jeremy commented Jan 19, 2013

Great patch. Tested in two apps and runs fine!

Pushing the parsing machinery out to arms' length cleans up the field classes nicely.

@jeremy
Copy link
Collaborator

jeremy commented Jan 28, 2013

Nice speedup in the test suite as well!

master:

Finished in 19.91 seconds
1455 examples, 0 failures, 9 pending

ragel:

Finished in 6.55 seconds
1407 examples, 0 failures, 10 pending

@jeremy
Copy link
Collaborator

jeremy commented Jan 28, 2013

@bpot could you rebase on latest master? Here's a rundown of the .treetop changes:

diff --git a/lib/mail/parsers/content_transfer_encoding.treetop b/lib/mail/parsers/content_transfer_encoding.treetop
index 9d0f50a..9db6134 100644
--- a/lib/mail/parsers/content_transfer_encoding.treetop
+++ b/lib/mail/parsers/content_transfer_encoding.treetop
@@ -9,12 +9,10 @@ module Mail
     end

     rule encoding
-      ietf_token "s"? {
-        def text_value
-          ietf_token.text_value
-        end
-      } / custom_x_token
+      "7bits" / "8bits" /
+      "7bit" / "8bit" / "binary" / "quoted-printable" / "base64" /
+      ietf_token / custom_x_token
     end

   end
-end
\ No newline at end of file
+end
diff --git a/lib/mail/parsers/content_type.treetop b/lib/mail/parsers/content_type.treetop
index 86fe64b..84eeced 100644
--- a/lib/mail/parsers/content_type.treetop
+++ b/lib/mail/parsers/content_type.treetop
@@ -5,7 +5,7 @@ module Mail
     include RFC2045

     rule content_type
-      main_type "/" sub_type param_hashes:(CFWS ";"? parameter CFWS)* {
+      main_type "/" sub_type param_hashes:(CFWS ";"* parameter CFWS)* {
         def parameters
           param_hashes.elements.map do |param|
             param.parameter.param_hash
@@ -65,4 +65,4 @@ module Mail
     end

   end
-end
\ No newline at end of file
+end
diff --git a/lib/mail/parsers/rfc2045.treetop b/lib/mail/parsers/rfc2045.treetop
index c166492..2839e73 100644
--- a/lib/mail/parsers/rfc2045.treetop
+++ b/lib/mail/parsers/rfc2045.treetop
@@ -8,8 +8,7 @@ module Mail
     end

     rule ietf_token
-      "7bit" / "8bit" / "binary" /
-      "quoted-printable" / "base64"
+      token+
     end

     rule custom_x_token
diff --git a/lib/mail/parsers/rfc2822.treetop b/lib/mail/parsers/rfc2822.treetop
index fc437f6..77dc3d6 100644
--- a/lib/mail/parsers/rfc2822.treetop
+++ b/lib/mail/parsers/rfc2822.treetop
@@ -184,7 +184,7 @@ module Mail
     end

     rule quoted_string
-      CFWS? DQUOTE quoted_content:(FWS? qcontent)+ FWS? DQUOTE CFWS?
+      CFWS? DQUOTE quoted_content:(FWS? qcontent)* FWS? DQUOTE CFWS?
     end

     rule qcontent
@@ -222,7 +222,22 @@ module Mail
     end

     rule mailbox
-      name_addr / addr_spec
+      (name_addr / addr_spec) {
+        def dig_comments(comments, elements)
+          elements.each { |elem|
+            if elem.respond_to?(:comment)
+              comments << elem.comment
+            end
+            dig_comments(comments, elem.elements) if elem.elements
+           }
+        end
+
+        def comments
+          comments = []
+          dig_comments(comments, elements)
+          comments
+        end
+      }
     end

     rule address
@@ -244,24 +259,7 @@ module Mail
         end

       } /
-      mailbox {
-
-      def dig_comments(comments, elements)
-        elements.each { |elem|
-          if elem.respond_to?(:comment)
-            comments << elem.comment
-          end
-          dig_comments(comments, elem.elements) if elem.elements
-         }
-      end
-
-      def comments
-        comments = []
-        dig_comments(comments, elements)
-        comments
-      end
-
-      }
+      mailbox
     end

     rule address_list
@@ -340,7 +338,7 @@ module Mail
     end

     rule name_val_list
-      (CFWS)? (name_val_pair (CFWS name_val_pair)*)
+      (CFWS)? (name_val_pair (CFWS name_val_pair)*)?
     end

     rule name_val_pair

@bpot
Copy link
Contributor Author

bpot commented Jan 29, 2013

Rebased!

@ConradIrwin
Copy link
Collaborator

This PR is awesome. I have an email here with 612 recipients and 1152 Ccs (someone really fails at email :). It takes the time for Mail.new(str).to_s from 20.5s to 1.0s

@ConradIrwin
Copy link
Collaborator

@bpot: I'm definitely "doing it wrong", but this seems like unexpected behaviour:

h = Mail::Header.new; h['From'] = "Conrad Irwin <me@cirw.in> "; h['From'].addresses
# => ["me@cirw.in", "me"]

Without the trailing space, or using the old treetop parser, I get the expected ["me@cirw.in"].

@bpot
Copy link
Contributor Author

bpot commented Feb 8, 2013

@ConradIrwin interesting, I'll look at that -- my goal is for the two parsers to be compatible as possible within reason.

@bpot
Copy link
Contributor Author

bpot commented Feb 12, 2013

Rebased against latest master and fixed the issue @ConradIrwin found.

@ConradIrwin
Copy link
Collaborator

Awesome, thanks! We've been running your code since Friday, and all seems to be going well so far :). (I guess we're parsing about a thousand emails a day through it at the moment, so not a huge amount, but definitely confidence inspiring)

@guilleiguaran
Copy link

@jeremy can you check this?

I would love to ship next Rails 4 beta with mail 2.6.0

@jeremy
Copy link
Collaborator

jeremy commented Mar 1, 2013

@guilleiguaran Working well for me. Next major release will be a while though. Next minor, maybe.

local_part_no_capture = (local_dot_atom | quoted_string | obs_local_part);

# domain:
dot_atom_text = ("."+)? domain_text (("."+)? domain_text)*;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ("."+)? any different than "."*?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sharp eye! I first read that as a non-greedy match. @bpot?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! They are the same. I even doubled checked by comparing ragel's xml statemachine output for both and it's exactly the same for both versions.

I'll update the PR to use the clearer "."* form.

@jpmckinney
Copy link
Contributor

Sweet. What more work needs to be done to get this merged?

@jeremy
Copy link
Collaborator

jeremy commented Apr 15, 2013

@jpmckinney Slated for merge to master after next minor version release. Please do test it out in your own apps!

@bpot
Copy link
Contributor Author

bpot commented May 13, 2013

  • Updated the funky grammar that @eric pointed out.
  • Added a fix for email addresses that begin with a comment -- was triggering an exception.
  • Rebased on current master.

@mikel
Copy link
Owner

mikel commented May 14, 2013

This is great work, I'll be doing some updates to get a minor out then we'll merge this into the next major release.

@mikel mikel merged commit 1505198 into mikel:master May 14, 2013
@lencioni lencioni mentioned this pull request Apr 3, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants