Iso 2022 jp - implement charset codec #595

Open
wants to merge 4 commits into from

3 participants

@yujiyokoo

With help from @kuroda and @mikel, I have implemented the CharsetCodec which @jeremy suggested in the original PR (#534).

I have simply moved things around - it should work exactly the same as the original iso-2022-jp branch AFAIK.

kuroda and others added some commits Apr 12, 2013
@kuroda kuroda Extract more methods into Mail::CommonAddress module 166a0a3
@kuroda kuroda Apply patches from kuroda/iso-2022-jp
These patches introduce some special treatments for ISO-2022-JP
encoding (so-called JIS-code), which is the de facto standard
when we send an email written in Japanese language.

These patches have effects only when a Mail object is
instanciated with :charset => 'ISO-2022-JP' option.
With this option, the header values and body text are
automatically converted from UTF-8 to ISO-2022-JP.

Note that some characters are replaced with other
characters before this conversion takes place
in order to solve problems on some platforms.

See https://github.com/kuroda/mail-iso-2022-jp for details.
09b55e5
@kuroda kuroda Treat ascii-only subject correctly 95d2fa4
@yujiyokoo yujiyokoo Implement "charset codec" for iso-2022-jp 82881c1
@jeremy
Collaborator

Nice work. Thank you! It's good to see big blocks of red in the diff.

@jeremy jeremy commented on the diff Aug 8, 2013
lib/mail/charset_codec/base.rb
@@ -0,0 +1,51 @@
+require 'mail/charset_codec'
+module Mail
+ module CharsetCodec
+ class Base
+
+ def remap_characters(value)
+ value
+ end
+
+ def set_charset_on(message)
+ end
@jeremy
Collaborator
jeremy added a note Aug 8, 2013

Should be a ! method?

@jeremy
Collaborator
jeremy added a note Aug 8, 2013

On further review, this doesn't look like the codec's responsibility.

The codec is converting "messy" external encoding to a clean internal encoding.

So the codec should expose its internal charset, then have calling code do something like message.charset = codec.charset instead of codec.set_charset_on message.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jeremy jeremy commented on the diff Aug 8, 2013
lib/mail/charset_codec/base.rb
+ value
+ end
+
+ def set_charset_on(message)
+ end
+
+ def preprocess(value)
+ value
+ end
+
+ def preprocess_body_raw(value)
+ value
+ end
+
+ def force_regexp_compatibility_on(value)
+ end
@jeremy
Collaborator
jeremy added a note Aug 8, 2013

Also consider, should it be a ! method? It does match #force_encoding though. The name is clear that it will mutate the argument, so it's not surprising...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jeremy jeremy commented on the diff Aug 8, 2013
lib/mail/charset_codec/base.rb
+ def force_regexp_compatibility_on(value)
+ end
+
+ def encode(value)
+ value
+ end
+
+ def encode_address(value)
+ value
+ end
+
+ def encode_crlf(value)
+ value.gsub!("\r", '=0D')
+ value.gsub!("\n", '=0A')
+ value
+ end
@jeremy
Collaborator
jeremy added a note Aug 8, 2013

Curious whether a single gsub would be worth the performance boost. Also, this mutates the argument, but the method name doesn't make it clear, so it's also a candidate for ! (or perhaps it should use #gsub instead)

@jeremy
Collaborator
jeremy added a note Aug 8, 2013

(Ah, much like remap_characters is implemented below.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jeremy jeremy commented on the diff Aug 8, 2013
lib/mail/charset_codec/iso_2022_jp_codec.rb
+ HORIZONTAL_BAR = [0x2015].pack("U")
+ DOUBLE_VERTICAL_LINE = [0x2016].pack("U")
+ PARALLEL_TO = [0x2225].pack("U")
+
+ REMAP_MATCH = Regexp.union([WAVE_DASH, MINUS_SIGN, EM_DASH, DOUBLE_VERTICAL_LINE])
+ REMAP = {
+ WAVE_DASH => FULLWIDTH_TILDE,
+ MINUS_SIGN => FULLWIDTH_HYPHEN_MINUS,
+ EM_DASH => HORIZONTAL_BAR,
+ DOUBLE_VERTICAL_LINE => PARALLEL_TO
+ }
+
+ def remap_characters(value)
+ # Preprocessor.process goes here
+ value.gsub(REMAP_MATCH) { |c| REMAP[c] }
+ end
@jeremy
Collaborator
jeremy added a note Aug 8, 2013

Can optimize this on Ruby 1.9: value.gsub REMAP_MATCH, REMAP

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jeremy jeremy commented on the diff Aug 8, 2013
lib/mail/charset_codec/iso_2022_jp_codec.rb
+ }
+
+ def remap_characters(value)
+ # Preprocessor.process goes here
+ value.gsub(REMAP_MATCH) { |c| REMAP[c] }
+ end
+
+ def set_charset_on(message)
+ if message && message.charset.nil?
+ message.charset = 'iso-2022-jp'
+ end
+ end
+
+ def preprocess(value)
+ RubyVer.preprocess('iso-2022-jp', value)
+ end
@jeremy
Collaborator
jeremy added a note Aug 8, 2013

The preprocess language still feels too generic. When someone looks at this library, which does many kinds of processing, and tries to understand this behavior, she'll need to examine every place it's used to see what it precedes.

Is there a better term for this step? It looks like its purpose is to transcode to iso-2022-jp from Shift JIS and others.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jeremy jeremy commented on the diff Aug 8, 2013
lib/mail/field.rb
@@ -108,6 +108,7 @@ class SyntaxError < FieldError #:nodoc:
#
# Field.new('content-type', ['text', 'plain', {:charset => 'UTF-8'}])
def initialize(name, value = nil, charset = 'utf-8')
+ value = RubyVer.preprocess(charset, value)
@jeremy
Collaborator
jeremy added a note Aug 8, 2013

Should this use the charset codec to convert the value?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jeremy jeremy commented on the diff Aug 8, 2013
lib/mail/fields/common/common_address.rb
module Mail
module CommonAddress # :nodoc:
-
+
+ def initialize(value = nil, charset = 'utf-8')
+ value = CharsetCodec.find(charset).encode_address(value)
+ self.charset = charset
+ super(self.class.const_get('CAPITALIZED_FIELD'), strip_field(self.class.const_get('FIELD_NAME'), value), charset)
@jeremy
Collaborator
jeremy added a note Aug 8, 2013

fwiw you can do self.class::CAPITALIZED_FIELD and self.class::FIELD_NAME rather than const_get

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jeremy jeremy commented on the diff Aug 8, 2013
lib/mail/fields/subject_field.rb
@@ -8,6 +8,7 @@ class SubjectField < UnstructuredField
CAPITALIZED_FIELD = "Subject"
def initialize(value = nil, charset = 'utf-8')
+ value = CharsetCodec.find(charset).encode(value)
@jeremy
Collaborator
jeremy added a note Aug 8, 2013

I'm having a hard time following the responsibilities of the charset conversion. In some cases, we preprocess and in others we just encode.

Is this a matter of needing clearer method names?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jeremy jeremy commented on the diff Aug 8, 2013
lib/mail/message.rb
@@ -1600,6 +1601,8 @@ def html_part=(msg)
# html_part are both defined in a message, then it will be a multipart/alternative
# message and set itself that way.
def text_part=(msg)
+ CharsetCodec.find(@charset).set_charset_on(msg)
@jeremy
Collaborator
jeremy added a note Aug 8, 2013

This is a new feature. Setting the text part does not change the message charset, currently. It will need new test coverage to match!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jeremy jeremy commented on the diff Aug 8, 2013
lib/mail/message.rb
@@ -1134,6 +1134,7 @@ def default( sym, val = nil )
# mail.parts.length #=> 2
# mail.parts.last.content_type.content_type #=> 'This is a body'
def body=(value)
+ value = CharsetCodec.find(@charset).preprocess(value)
@jeremy
Collaborator
jeremy added a note Aug 8, 2013

Assumes that value is always a String, but it may be a Mail::Body

@jeremy
Collaborator
jeremy added a note Aug 8, 2013

Something to consider: should we keep the original, unconverted value?

If you parse an email and wish to serialize it exactly as it was received, it's impossible now since the original value is lost.

@mikel
Owner
mikel added a note Nov 13, 2013

I agree, we should be able to serialise the original value of the message. Worth keeping in the object. Once we send the new message though this would be discarded obviously, it wouldn't go as a new mime part or something

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jeremy jeremy commented on the diff Aug 8, 2013
lib/mail/message.rb
@@ -1918,6 +1921,7 @@ def body_lazy(value)
def process_body_raw
+ @body_raw = CharsetCodec.find(@charset).preprocess_body_raw(@body_raw)
@jeremy
Collaborator
jeremy added a note Aug 8, 2013

Same consideration. We discard the original body.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jeremy jeremy commented on the diff Aug 8, 2013
lib/mail/preprocessor.rb
@@ -0,0 +1,20 @@
+module Mail
+ class Preprocessor
@jeremy
Collaborator
jeremy added a note Aug 8, 2013

Same concern with generic naming as #preprocess on the charset codec. There are many kinds of processing in the mail library. What does this come before? How would I know?

Is there a better name that indicates its precise purpose?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jeremy jeremy commented on the diff Aug 8, 2013
lib/mail/version_specific/ruby_1_8.rb
@@ -101,6 +104,28 @@ def Ruby18.param_encode(str)
"#{encoding}'#{language}'#{URI.escape(str)}"
end
+ def Ruby18.preprocess(charset, value)
+ value
+ end
+
+ def Ruby18.preprocess_body_raw(charset, body_raw)
+ if charset.to_s.downcase == 'iso-2022-jp'
+ NKF.nkf(NKF_OPTIONS, Mail::Preprocessor.process(body_raw))
+ else
+ body_raw
+ end
+ end
@jeremy
Collaborator
jeremy added a note Aug 8, 2013

I don't think these methods need a charset argument. They're called directly from Iso2022JpCodec so they can use a name that reflects their exact purpose, like transcode_iso_2022_jp_to_cp50221(string)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jeremy
Collaborator

I made some comments on method naming and separation of responsibilities. Thanks again! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment