decoded_body returns string with the wrong encoding #431

pupeno · 2012-09-03T11:37:04Z

Starting with an email that is encoded with windows-1252 as sent by Apple Mail:

Date: Mon, 03 Sep 2012 11:53:03 +0100
From: =?iso-8859-1?Q?=22J=2E_Pablo_Fern=E1ndez=22?= <pupeno@example.com>
To: John Doe <info+c25oubi092kehec6y7aj8nlf1bolctv4@example.org>
Message-ID: <5A13A2D2-A135-46CB-A81D-34325FAE6D3A@example.com>
In-Reply-To: <5044816653066_b0d48803fc7817d@machinex.mail>
References: <5044816653066_b0d48803fc7817d@machinex.mail>
Subject: Re: Fancy an Apple?
Mime-Version: 1.0
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_3C4CAFE2-2D3A-4DD5-8122-E0F3B2B2392C";
 charset=UTF-8
Content-Transfer-Encoding: 7bit

--Apple-Mail=_3C4CAFE2-2D3A-4DD5-8122-E0F3B2B2392C
Date: Mon, 03 Sep 2012 10:53:09 +0000
Mime-Version: 1.0
Content-Type: text/plain;
 charset=windows-1252
Content-Transfer-Encoding: quoted-printable
Content-ID: <50448c1523a5f_31e34d563f445258@machinex.mail>

This is an email=85=0D
=0D

--Apple-Mail=_3C4CAFE2-2D3A-4DD5-8122-E0F3B2B2392C
Date: Mon, 03 Sep 2012 10:53:09 +0000
Mime-Version: 1.0
Content-Type: text/html;
 charset=windows-1252
Content-Transfer-Encoding: quoted-printable
Content-ID: <50448c152476a_31e34d563f4453f1@machinex.mail>

<html><head></head><body style=3D"word-wrap: break-word; -webkit-nbsp-mod=
e: space; -webkit-line-break: after-white-space; ">This is an email=85<div=
><br></body></html>=

--Apple-Mail=_3C4CAFE2-2D3A-4DD5-8122-E0F3B2B2392C--

I create a mail object with it (I put that in a file and open it):

m = Mail.new(open("mail.txt").read)

If I get the text part:

> m.text_part
 => #<Mail::Part:70281222717060, Multipart: false, Headers: <Date: Mon, 03 Sep 2012 10:53:09 +0000>, <Mime-Version: 1.0>, <Content-Type: text/plain; charset=windows-1252>, <Content-Transfer-Encoding: quoted-printable>, <Content-ID: <50448c1523a5f_31e34d563f445258@machinex.mail>>>

we can see that the content type is text/plain and the charset is windows-1252. If I call decode_body on it:

> m.text_part.decode_body
 => "This is an email\x85\r\r\n\r"

it returns the body with \x85 in it, which is the windows-1252 byte for the ellipsis "…". If I now try to convert it to UTF-8:

m.text_part.decode_body.encode("utf-8")

it fails with an exception:

Encoding::UndefinedConversionError: "\x85" from ASCII-8BIT to UTF-8
    from (irb):45:in `encode'
    from (irb):45
    from /Users/pupeno/.rvm/gems/ruby-1.9.3-p194@watu/gems/railties-3.2.7/lib/rails/commands/console.rb:47:in `start'
    from /Users/pupeno/.rvm/gems/ruby-1.9.3-p194@watu/gems/railties-3.2.7/lib/rails/commands/console.rb:8:in `start'
    from /Users/pupeno/.rvm/gems/ruby-1.9.3-p194@watu/gems/railties-3.2.7/lib/rails/commands.rb:41:in `<top (required)>'
    from script/rails:6:in `require'
    from script/rails:6:in `<main>'

This happens because the decoded string is tagged ASCII-8BIT (binary) instead of windows-1252, and Ruby defines no conversion from a binary byte such as \x85 to UTF-8. We can confirm the encoding this way:

> m.text_part.decode_body.encoding
 => #<Encoding:ASCII-8BIT>

and we can work around it this way:

> m.text_part.decode_body.encode("utf-8", "windows-1252")
 => "This is an email…\r\r\n\r"
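
The failure and the workaround can be reproduced without the Mail gem. This is a minimal sketch using plain Ruby strings; the variable names are illustrative, and the byte string is copied from the decode_body output above.

```ruby
# Bytes as a transfer-decoding step hands them back: tagged binary (ASCII-8BIT).
raw = "This is an email\x85\r\r\n\r".b

# Transcoding straight from binary fails, because ASCII-8BIT defines no
# mapping to Unicode for bytes >= 0x80.
direct_error =
  begin
    raw.encode("UTF-8")
    nil
  rescue Encoding::UndefinedConversionError => e
    e
  end

# Naming the real source encoding makes the conversion well defined.
utf8 = raw.encode("UTF-8", "Windows-1252")

# Same result: relabel a copy of the bytes, then transcode.
utf8_alt = raw.dup.force_encoding("Windows-1252").encode("UTF-8")

puts direct_error.class
puts utf8.inspect
```

Both forms leave `raw` untouched; only the returned copies are UTF-8.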


lawrencepit · 2012-11-17T08:35:07Z

I found we always need to do this:

part = mail.multipart? ? mail.text_part : mail
/charset=(?<charset>[^\s]+)\s?/ =~ part.content_type
part.body.decoded.force_encoding(charset ? charset.gsub('"', '') : "iso-8859-1").encode("UTF-8")

or nearly the same thing (this variant also replaces invalid or undefined bytes with '?' instead of raising):

part.body.decoded.encode('UTF-8', charset ? charset.gsub('"', '') : "iso-8859-1", :invalid => :replace, :undef => :replace, :replace => '?')

PS. If no charset can be determined from the email, we default to "iso-8859-1". This works for us, but it probably won't work in every environment; you might instead need something like charlock_holmes to make a best guess at the charset used.

lawrencepit · 2012-11-17T09:12:54Z

Or what also seems to work:

part.body.decoded.encode('UTF-8', part.content_type_parameters['charset'] || "ISO-8859-1", :invalid => :replace, :undef => :replace, :replace => '?')
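
The one-liners above can be wrapped in a small helper. The following is only a sketch: `to_utf8` and its `fallback:` parameter are hypothetical names, not part of the Mail gem, and the ISO-8859-1 default is the assumption from the comment above. It additionally falls back when Ruby does not recognise the charset label, a case the one-liners would raise on.

```ruby
# Hypothetical helper: transcode transfer-decoded bytes to UTF-8 using the
# charset label from the Content-Type header, if there is a usable one.
def to_utf8(bytes, charset = nil, fallback: "ISO-8859-1")
  # Header parameters may arrive quoted, e.g. charset="utf-8".
  label = charset.to_s.delete('"').strip

  source =
    begin
      Encoding.find(label.empty? ? fallback : label)
    rescue ArgumentError
      # Ruby does not know this charset name; use the fallback instead.
      Encoding.find(fallback)
    end

  # Treat the bytes as `source`, and substitute "?" for anything that
  # cannot be converted rather than raising.
  bytes.encode("UTF-8", source, invalid: :replace, undef: :replace, replace: "?")
end

puts to_utf8("This is an email\x85".b, "windows-1252")
puts to_utf8("caf\xE9".b, nil)
```

With the Mail gem this would be called as `to_utf8(part.body.decoded, part.content_type_parameters['charset'])`, mirroring the line above.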

dwt · 2012-11-20T12:05:03Z

+1 I also have this problem

rbu · 2012-11-20T15:15:05Z

Using ruby 1.9.3p327 and Mail 2.5.2 I see the following:

(rdb:1) p @email
#<Mail::Message:7464800, Multipart: false, Headers: <Date: Tue, 20 Nov 2012 11:53:41 +0000>, <From: info@example.com>, <To: john843483@example.com>, <Message-Id: : 20121120125341.23256.1563256327@localhost>, <Subject: something>, <Mime-Version: 1.0>, <Content-Type: text/plain; charset="utf-8">, <Content-Transfer-Encoding: base64>, <X-Actually-From: info@example.com>, <X-Actually-To: john843483@example.com>>
(rdb:1) p @email.charset
"utf-8"
(rdb:1) p @email.body.charset
"US-ASCII"
(rdb:1) p @email.body.encoding
"base64"
(rdb:1) p @email.body.decoded.encoding
#<Encoding:ASCII-8BIT>
(rdb:1) p @email.body.decoded[0,19]
"Hello John Sm\xC3\xBCd,\n\n"
(rdb:1) p @email.body.decoded[0,19].force_encoding("utf-8")
"Hello John Smüd,\n\n"

To my understanding, the charset given in the email's Content-Type header should be treated as the charset of the body, and therefore as the encoding of the transfer-decoded string.
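
The session above can be mirrored in plain Ruby. This sketch uses the core `pack("m")` / `unpack1("m")` base64 routines instead of the Mail gem, so it only illustrates the string behaviour: base64 decoding yields binary-tagged bytes, and `force_encoding` relabels them without changing a single byte. The sample text is taken from the output above.

```ruby
original = "Hello John Smüd,\n\n"       # UTF-8 text as the sender wrote it
wire     = [original].pack("m")          # base64 transfer encoding
decoded  = wire.unpack1("m")             # what transfer-decoding returns

# The bytes are intact but tagged as binary, so the strings are not equal.
decoded_tag     = decoded.encoding
equal_as_binary = (decoded == original)

# Relabelling a copy as UTF-8 changes only the tag, not the bytes.
relabeled   = decoded.dup.force_encoding("UTF-8")
same_bytes  = (relabeled.bytes == decoded.bytes)
valid_utf8  = relabeled.valid_encoding?
equal_after = (relabeled == original)

puts decoded_tag
puts relabeled
```

`force_encoding` is appropriate here only because the bytes really are UTF-8; if the label were wrong, `valid_encoding?` is a cheap check before trusting it.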

lawrencepit · 2012-11-22T03:01:43Z

Same issue #403

jeremy · 2013-01-22T19:34:17Z

Rather than using message.body.decoded or message.decode_body (which just calls body.decoded), call message.decoded. That checks whether it's a text message [1] and calls message.decode_body_as_text, which decodes the transfer-encoding and sets the Ruby string encoding according to the message's charset [2].

[1] https://github.com/mikel/mail/blob/master/lib/mail/message.rb#L1786
[2] https://github.com/mikel/mail/blob/master/lib/mail/message.rb#L2047-L2050
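
The two layers described above (transfer encoding first, charset second) can be sketched in plain Ruby. This is not the Mail gem's implementation: `decode_text_part` and its keyword arguments are hypothetical, and it relies only on the core `unpack1` directives for quoted-printable ("M") and base64 ("m").

```ruby
# Hypothetical two-step decoder for a text part.
def decode_text_part(raw_body, transfer_encoding:, charset:)
  # Step 1: undo the transfer encoding. The result is raw bytes (ASCII-8BIT).
  bytes =
    case transfer_encoding.to_s.downcase
    when "quoted-printable" then raw_body.unpack1("M")
    when "base64"           then raw_body.unpack1("m")
    else raw_body.b         # 7bit, 8bit, binary: bytes pass through
    end

  # Step 2: tag those bytes with the charset declared for the part.
  bytes.force_encoding(charset)
end

text = decode_text_part("This is an email=85=0D\n=0D\n",
                        transfer_encoding: "quoted-printable",
                        charset: "windows-1252")

puts text.encoding
puts text.encode("UTF-8").inspect  # no source encoding needed any more
```

Stopping after step 1 corresponds to what body.decoded returns in this thread; step 2 is the part that message.decoded adds for text messages.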

dwt · 2013-01-29T07:46:51Z

Out of curiosity, can you explain why calling message.body.decoded is wrong, or why it needs to choose the 'wrong' encoding?

I ask because it still seems like a bug to me.

jeremy · 2013-01-30T17:23:03Z

It's an API design decision. Decoding/encoding applies to transfer encoding, not charset, here. I agree it'd be nice if it worked the same way, too. Perhaps in a future version!

frenkel mentioned this issue Dec 7, 2012

Correctly handle charsets other than UTF-8 (problem on PostgreSQL) ivaldi/brimir#11

Closed

jeremy closed this as completed Jan 22, 2013

jeremy mentioned this issue Sep 2, 2013

ISO-8859-1 encoded body decoded as ascii/binary #618

Closed

rymohr mentioned this issue Oct 22, 2015

Mandrill encoding errors honeybadger-io/incoming#13

Open
