New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace Treetop parser with a Ragel based parser #490
Conversation
This is awesome. @mikel what do you say? |
Wow the benchmark is promising! |
Great patch. Tested in two apps and runs fine! Pushing the parsing machinery out to arms' length cleans up the field classes nicely. |
Nice speedup in the test suite as well! master:
ragel:
|
@bpot could you rebase on latest master? Here's a rundown of the .treetop changes: diff --git a/lib/mail/parsers/content_transfer_encoding.treetop b/lib/mail/parsers/content_transfer_encoding.treetop
index 9d0f50a..9db6134 100644
--- a/lib/mail/parsers/content_transfer_encoding.treetop
+++ b/lib/mail/parsers/content_transfer_encoding.treetop
@@ -9,12 +9,10 @@ module Mail
end
rule encoding
- ietf_token "s"? {
- def text_value
- ietf_token.text_value
- end
- } / custom_x_token
+ "7bits" / "8bits" /
+ "7bit" / "8bit" / "binary" / "quoted-printable" / "base64" /
+ ietf_token / custom_x_token
end
end
-end
\ No newline at end of file
+end
diff --git a/lib/mail/parsers/content_type.treetop b/lib/mail/parsers/content_type.treetop
index 86fe64b..84eeced 100644
--- a/lib/mail/parsers/content_type.treetop
+++ b/lib/mail/parsers/content_type.treetop
@@ -5,7 +5,7 @@ module Mail
include RFC2045
rule content_type
- main_type "/" sub_type param_hashes:(CFWS ";"? parameter CFWS)* {
+ main_type "/" sub_type param_hashes:(CFWS ";"* parameter CFWS)* {
def parameters
param_hashes.elements.map do |param|
param.parameter.param_hash
@@ -65,4 +65,4 @@ module Mail
end
end
-end
\ No newline at end of file
+end
diff --git a/lib/mail/parsers/rfc2045.treetop b/lib/mail/parsers/rfc2045.treetop
index c166492..2839e73 100644
--- a/lib/mail/parsers/rfc2045.treetop
+++ b/lib/mail/parsers/rfc2045.treetop
@@ -8,8 +8,7 @@ module Mail
end
rule ietf_token
- "7bit" / "8bit" / "binary" /
- "quoted-printable" / "base64"
+ token+
end
rule custom_x_token
diff --git a/lib/mail/parsers/rfc2822.treetop b/lib/mail/parsers/rfc2822.treetop
index fc437f6..77dc3d6 100644
--- a/lib/mail/parsers/rfc2822.treetop
+++ b/lib/mail/parsers/rfc2822.treetop
@@ -184,7 +184,7 @@ module Mail
end
rule quoted_string
- CFWS? DQUOTE quoted_content:(FWS? qcontent)+ FWS? DQUOTE CFWS?
+ CFWS? DQUOTE quoted_content:(FWS? qcontent)* FWS? DQUOTE CFWS?
end
rule qcontent
@@ -222,7 +222,22 @@ module Mail
end
rule mailbox
- name_addr / addr_spec
+ (name_addr / addr_spec) {
+ def dig_comments(comments, elements)
+ elements.each { |elem|
+ if elem.respond_to?(:comment)
+ comments << elem.comment
+ end
+ dig_comments(comments, elem.elements) if elem.elements
+ }
+ end
+
+ def comments
+ comments = []
+ dig_comments(comments, elements)
+ comments
+ end
+ }
end
rule address
@@ -244,24 +259,7 @@ module Mail
end
} /
- mailbox {
-
- def dig_comments(comments, elements)
- elements.each { |elem|
- if elem.respond_to?(:comment)
- comments << elem.comment
- end
- dig_comments(comments, elem.elements) if elem.elements
- }
- end
-
- def comments
- comments = []
- dig_comments(comments, elements)
- comments
- end
-
- }
+ mailbox
end
rule address_list
@@ -340,7 +338,7 @@ module Mail
end
rule name_val_list
- (CFWS)? (name_val_pair (CFWS name_val_pair)*)
+ (CFWS)? (name_val_pair (CFWS name_val_pair)*)?
end
rule name_val_pair |
Rebased! |
This PR is awesome. I have an email here with 612 recipients and 1152 Ccs (someone really fails at email :). It takes the time for |
@bpot: I'm definitely "doing it wrong", but this seems like unexpected behaviour: h = Mail::Header.new; h['From'] = "Conrad Irwin <me@cirw.in> "; h['From'].addresses
# => ["me@cirw.in", "me"] Without the trailing space, or using the old treetop parser, I get the expected |
@ConradIrwin interesting, I'll look at that -- my goal is for the two parsers to be compatible as possible within reason. |
Rebased against latest master and fixed the issue @ConradIrwin found. |
Awesome, thanks! We've been running your code since Friday, and all seems to be going well so far :). (I guess we're parsing about a thousand emails a day through it at the moment, so not a huge amount, but definitely confidence inspiring) |
@jeremy can you check this? I would love to ship next Rails 4 beta with mail 2.6.0 |
@guilleiguaran Working well for me. Next major release will be a while though. Next minor, maybe. |
local_part_no_capture = (local_dot_atom | quoted_string | obs_local_part); | ||
|
||
# domain: | ||
dot_atom_text = ("."+)? domain_text (("."+)? domain_text)*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is ("."+)?
any different than "."*
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sharp eye! I first read that as a non-greedy match. @bpot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! They are the same. I even doubled checked by comparing ragel's xml statemachine output for both and it's exactly the same for both versions.
I'll update the PR to use the clearer "."*
form.
Sweet. What more work needs to be done to get this merged? |
@jpmckinney Slated for merge to master after next minor version release. Please do test it out in your own apps! |
…en it ends with a space.
|
This is great work, I'll be doing some updates to get a minor out then we'll merge this into the next major release. |
This pull request replaces the Treetop based parser with a Ragel-based parser. This change is primarily to improve the performance of message processing. Compared to the Treetop parser the Ragel version is ~7.5x faster.
The new parser exhibits the same behavior as the current parser except for a couple case where the Treetop parser was incorrectly handling fields. I've submitted PRs to fix both issues: PR #487 PR #481. A couple specs related to these issues are marked pending in the current PR. Assuming these other two PRs are merged I will rebase and remove the pending lines from those specs.
The change in parsers necessitates removing a public interface. It is already deprecated but still may require a major version bump.
I know this is a massive change, let me know if there is anything I can do to make it more digestible.
Benchmark
Parsing a set of 1000 emails from the enron data set:
Parser layout
Further Work
I am also working on a native parser (based on the same ragel grammar) that will improve performance even further. It uses an FFI interface to a custom shared module so that gains can be shared with Rubinius and JRuby. This is one of the advantages of using a Ragel-based parser and why there is a strict separation between the state machine modules and the classes that interpret the actions.