Parse UTF-8 mail headers per RFC6532
Implement RFC6532 extension to RFC5322 for parsing UTF-8 messages.

* Ragel parser for valid UTF-8 characters
* Parse as bytes rather than chars
* Encode parsed strings as UTF-8
* No longer b/q-encode UTF-8 header values when parsing emails
* For compatibility with others, b/q-encode UTF-8 headers when generating emails

Fixes #39
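
For illustration, the last two bullets can be sketched in plain Ruby, independent of Mail. The helper names `b_encode` and `b_decode` are hypothetical and are not the library's API. The sketch assumes RFC 2047 "B" (Base64) encoded words and ignores the 75-character limit per encoded word.

```ruby
# Hypothetical helpers for illustration only; not part of the Mail gem.

# Wrap a non-ASCII header value in an RFC 2047 "B" encoded word.
# ASCII-only values are returned unchanged.
def b_encode(text, charset = 'UTF-8')
  return text if text.ascii_only?
  "=?#{charset}?B?#{[text].pack('m0')}?="
end

# Reverse of b_encode: decode a single "B" encoded word and tag the
# resulting bytes with the charset named in the word.
def b_decode(word)
  match = /\A=\?([^?]+)\?B\?([^?]*)\?=\z/i.match(word)
  return word unless match
  match[2].unpack1('m').force_encoding(match[1])
end

name    = "J\u00F6rg"       # UTF-8 value as it might appear in a parsed header
encoded = b_encode(name)     # ASCII-only form for a generated message
decoded = b_decode(encoded)  # back to the UTF-8 string
```

Under this reading, parsing keeps the UTF-8 value as it is, and only generation goes through the encoded-word form so that other software still receives a US-ASCII header.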
jeremy committed May 14, 2017
1 parent cff964e commit da0d4b9
Showing 45 changed files with 47,306 additions and 20,905 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.rdoc
@@ -20,6 +20,7 @@ Compatibility:
* #464 - Improve attachment filename detection by preferring Content-Disposition filename. (lawrencepit)
* #655 - Sort attachments to the end of the parts list to work around email clients that may mistake a text attachment for the message body. (npickens)
* #982 – Faithfully preserve unfolded whitespace rather than collapsing to a single space. (jeremy)
* Support parsing UTF-8 headers. Implements RFC 6532. (jeremy)

Bugs:
* #539 - Fix that whitespace-only continued headers would be incorrectly parsed as the break between headers and body. (ConradIrwin)
21 changes: 10 additions & 11 deletions README.md
@@ -55,21 +55,20 @@ the [Google Group](http://groups.google.com/group/mail-ruby).
Current Capabilities of Mail
----------------------------

* RFC2822 Support, Reading and Writing
* RFC5322 Support, Reading and Writing
* RFC6532 Support, reading UTF-8 headers
* RFC2045-2049 Support for multipart emails
* Support for creating multipart alternate emails
* Support for reading multipart/report emails & getting details from such
* Support for multibyte emails - needs quite a lot of work and testing
* Wrappers for File, Net/POP3, Net/SMTP
* Auto encoding of non US-ASCII header fields
* Auto encoding of non US-ASCII bodies

Mail is RFC2822 compliant now, that is, it can parse and generate valid US-ASCII
emails. There are a few obsoleted syntax emails that it will have problems with, but
it also is quite robust, meaning, if it finds something it doesn't understand it will
not crash, instead, it will skip the problem and keep parsing. In the case of a header
it doesn't understand, it will initialise the header as an optional unstructured
field and continue parsing.
* Auto-encoding of non-US-ASCII bodies and header fields

Mail is RFC5322 and RFC6532 compliant now, that is, it can parse US-ASCII and UTF-8
emails and generate US-ASCII emails. There are a few obsoleted syntax emails that
it will have problems with, but it also is quite robust, meaning, if it finds something
it doesn't understand it will not crash, instead, it will skip the problem and keep
parsing. In the case of a header it doesn't understand, it will initialise the header
as an optional unstructured field and continue parsing.

This means Mail won't (ever) crunch your data (I think).

16 changes: 12 additions & 4 deletions lib/mail/encodings.rb
@@ -168,20 +168,26 @@ def Encodings.unquote_and_convert_to(str, to_encoding)

def Encodings.address_encode(address, charset = 'utf-8')
if address.is_a?(Array)
# loop back through for each element
address.compact.map { |a| Encodings.address_encode(a, charset) }.join(", ")
else
# find any word boundary that is not ascii and encode it
encode_non_usascii(address, charset) if address
elsif address
encode_non_usascii(address, charset)
end
end

def Encodings.encode_non_usascii(address, charset)
return address if address.ascii_only? or charset.nil?

# With KCODE=u we can't use regexps on other encodings. Go ASCII.
if $KCODE
$KCODE, original_kcode = '', $KCODE
end

# Encode all strings embedded inside of quotes
address = address.gsub(/("[^"]*")/) { |s| Encodings.b_value_encode(unquote(s), charset) }

# Then loop through all remaining items and encode as needed
tokens = address.split(/\s/)

map_with_index(tokens) do |word, i|
if word.ascii_only?
word
@@ -193,6 +199,8 @@ def Encodings.encode_non_usascii(address, charset)
Encodings.b_value_encode(word, charset)
end
end.join(' ')
ensure
$KCODE = original_kcode if original_kcode
end

# Encode a string with Base64 Encoding and returns it ready to be inserted
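
The `$KCODE` handling in this hunk follows a save, override, restore-in-`ensure` pattern. A minimal standalone sketch of that pattern follows. It assumes a hypothetical global `$demo_setting` in place of `$KCODE`, and `with_setting_cleared` is an illustrative name, not something from the library.

```ruby
# Hypothetical stand-in for $KCODE; illustration only.
$demo_setting = 'u'

# Temporarily blank the global while the block runs, then restore it.
def with_setting_cleared
  if $demo_setting
    # Parallel assignment: the right-hand side is evaluated first, so
    # `original` receives the old value while the global becomes ''.
    $demo_setting, original = '', $demo_setting
  end
  yield
ensure
  # Runs on normal return and when the block raises.
  $demo_setting = original if original
end

with_setting_cleared do
  # $demo_setting is '' in here
end
# $demo_setting is 'u' again here
```

Putting the restore in `ensure` means the global is put back even if the encoding work raises partway through.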
28 changes: 25 additions & 3 deletions lib/mail/field.rb
@@ -83,9 +83,31 @@ class ParseError < FieldError #:nodoc:

def initialize(element, value, reason)
@element = element
@value = value
@reason = reason
super("#{element} can not parse |#{value}|\nReason was: #{reason}")
@value = to_utf8(value)
@reason = to_utf8(reason)
super("#{@element} can not parse |#{@value}|: #{@reason}")
end

private
def to_utf8(text)
if text.respond_to?(:force_encoding)
text.dup.force_encoding(Encoding::UTF_8)
else
text
end
end
end

class NilParseError < ParseError #:nodoc:
def initialize(element)
super element, nil, 'nil is invalid'
end
end

class IncompleteParseError < ParseError #:nodoc:
def initialize(element, original_text, unparsed_index)
parsed_text = to_utf8(original_text[0...unparsed_index])
super element, original_text, "Only able to parse up to #{parsed_text.inspect}"
end
end
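
As a standalone sketch, the `to_utf8` helper from this hunk can be exercised outside the class. The sample strings are made up for illustration; the method body is copied from the diff.

```ruby
# Same body as the private helper in the hunk, lifted out for illustration.
def to_utf8(text)
  if text.respond_to?(:force_encoding)
    # dup first so the caller's string keeps its original encoding tag
    text.dup.force_encoding(Encoding::UTF_8)
  else
    text
  end
end

raw  = "J\xC3\xB6rg".b  # bytes as a byte-oriented parser would see them (ASCII-8BIT)
utf8 = to_utf8(raw)     # same bytes, now tagged as UTF-8

utf8.encoding  # Encoding::UTF_8
raw.encoding   # Encoding::ASCII_8BIT, unchanged
to_utf8(nil)   # nil passes through; NilParseError relies on this
```

Note that `force_encoding` only relabels the bytes and does not transcode or validate them, so a value that is not valid UTF-8 stays invalid after the call.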

22 changes: 17 additions & 5 deletions lib/mail/fields/common/common_address.rb
@@ -4,23 +4,35 @@

module Mail
module CommonAddress # :nodoc:

def parse(val = value)
unless Utilities.blank?(val)
@address_list = AddressList.new(encode_if_needed(val))
else
nil
end
end

def charset
@charset
end

def encode_if_needed(val)
Encodings.address_encode(val, charset)
# Need to join arrays of addresses into a single value
if val.kind_of?(Array)
val.compact.map { |a| encode_if_needed a }.join(', ')

# Pass through UTF-8 addresses
elsif charset =~ /\AUTF-8\z/i
val
elsif val.respond_to?(:encoding) && val.encoding == Encoding::UTF_8
val

# Encode non-UTF-8 strings
else
Encodings.encode_non_usascii(val, charset)
end
end

# Allows you to iterate through each address object in the address_list
def each
address_list.addresses.each do |address|
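
The branching in the new `encode_if_needed` can be sketched on its own. The function name `encode_if_needed_sketch` and the block-based encoder are hypothetical: the block stands in for `Encodings.encode_non_usascii`, and `charset` is passed as an argument rather than read from the field.

```ruby
# Hypothetical standalone version of the dispatch; the block plays the
# role of Encodings.encode_non_usascii.
def encode_if_needed_sketch(val, charset, &encoder)
  if val.is_a?(Array)
    # Join arrays of addresses into a single value, skipping nils
    val.compact.map { |a| encode_if_needed_sketch(a, charset, &encoder) }.join(', ')
  elsif charset =~ /\AUTF-8\z/i
    # Declared charset is UTF-8: pass through untouched
    val
  elsif val.respond_to?(:encoding) && val.encoding == Encoding::UTF_8
    # String is already tagged UTF-8: pass through untouched
    val
  else
    # Anything else goes to the encoder
    encoder.call(val, charset)
  end
end

stub = proc { |_value, cs| "encoded:#{cs}" }  # placeholder encoder

encode_if_needed_sketch(['a@x.test', nil, 'b@y.test'], 'utf-8', &stub)
latin = "J\xF6rg".b.force_encoding('ISO-8859-1')
encode_if_needed_sketch(latin, 'ISO-8859-1', &stub)
```

With this ordering, only strings that are neither declared nor tagged as UTF-8 reach the encoder, which matches the commit's stated goal of no longer b/q-encoding UTF-8 header values when parsing.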