decoded_body returns string with the wrong encoding #431

Closed
pupeno opened this issue Sep 3, 2012 · 8 comments

pupeno commented Sep 3, 2012

Starting with an email that is encoded with windows-1252 as sent by Apple Mail:


Return-Path: <pupeno@example.com>
Date: Mon, 03 Sep 2012 11:53:03 +0100
From: =?iso-8859-1?Q?=22J=2E_Pablo_Fern=E1ndez=22?= <pupeno@example.com>
To: John Doe <info+c25oubi092kehec6y7aj8nlf1bolctv4@example.org>
Message-ID: <5A13A2D2-A135-46CB-A81D-34325FAE6D3A@example.com>
In-Reply-To: <5044816653066_b0d48803fc7817d@machinex.mail>
References: <5044816653066_b0d48803fc7817d@machinex.mail>
Subject: Re: Fancy an Apple?
Mime-Version: 1.0
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_3C4CAFE2-2D3A-4DD5-8122-E0F3B2B2392C";
 charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Original-To: info+c25oubi092kehec6y7aj8nlf1bolctv4@example.org
Delivered-To: info+c25oubi092kehec6y7aj8nlf1bolctv4@example.org
X-Mailer: Apple Mail (2.1278)
X-Gm-Message-State: ALoCoQlNRV5YobE7K4nH17w7s9zpjsXLCUrIMeg8/J56oFwGb+4bkp+IhagKrnPdB2HoXKyzlhbw



--Apple-Mail=_3C4CAFE2-2D3A-4DD5-8122-E0F3B2B2392C
Date: Mon, 03 Sep 2012 10:53:09 +0000
Mime-Version: 1.0
Content-Type: text/plain;
 charset=windows-1252
Content-Transfer-Encoding: quoted-printable
Content-ID: <50448c1523a5f_31e34d563f445258@machinex.mail>

This is an email=85=0D
=0D

--Apple-Mail=_3C4CAFE2-2D3A-4DD5-8122-E0F3B2B2392C
Date: Mon, 03 Sep 2012 10:53:09 +0000
Mime-Version: 1.0
Content-Type: text/html;
 charset=windows-1252
Content-Transfer-Encoding: quoted-printable
Content-ID: <50448c152476a_31e34d563f4453f1@machinex.mail>

<html><head></head><body style=3D"word-wrap: break-word; -webkit-nbsp-mod=
e: space; -webkit-line-break: after-white-space; ">This is an email=85<div=
><br></body></html>=


--Apple-Mail=_3C4CAFE2-2D3A-4DD5-8122-E0F3B2B2392C--

I saved that message to a file and created a Mail object from it:

m = Mail.new(File.read("mail.txt"))

If I get the text part:

> m.text_part
 => #<Mail::Part:70281222717060, Multipart: false, Headers: <Date: Mon, 03 Sep 2012 10:53:09 +0000>, <Mime-Version: 1.0>, <Content-Type: text/plain; charset=windows-1252>, <Content-Transfer-Encoding: quoted-printable>, <Content-ID: <50448c1523a5f_31e34d563f445258@machinex.mail>>>

we can see that the content type is text/plain and the charset is windows-1252. If I call decode_body on it:

> m.text_part.decode_body
 => "This is an email\x85\r\r\n\r"

it returns the body with \x85 in it, which is windows-1252 code for the ellipsis "…". If I now try to convert it to UTF-8:

m.text_part.decode_body.encode("utf-8")

it fails with an exception:

Encoding::UndefinedConversionError: "\x85" from ASCII-8BIT to UTF-8
    from (irb):45:in `encode'
    from (irb):45
    from /Users/pupeno/.rvm/gems/ruby-1.9.3-p194@watu/gems/railties-3.2.7/lib/rails/commands/console.rb:47:in `start'
    from /Users/pupeno/.rvm/gems/ruby-1.9.3-p194@watu/gems/railties-3.2.7/lib/rails/commands/console.rb:8:in `start'
    from /Users/pupeno/.rvm/gems/ruby-1.9.3-p194@watu/gems/railties-3.2.7/lib/rails/commands.rb:41:in `<top (required)>'
    from script/rails:6:in `require'
    from script/rails:6:in `<main>'

The reason is that the string is tagged as ASCII-8BIT (binary) instead of windows-1252, and Ruby defines no conversion from ASCII-8BIT to UTF-8 for bytes above \x7F such as \x85. We can see what's wrong this way:

> m.text_part.decode_body.encoding
 => #<Encoding:ASCII-8BIT>

and we can work around it this way:

> m.text_part.decode_body.encode("utf-8", "windows-1252")
 => "This is an email…\r\r\n\r" 

@lawrencepit (Contributor)

I found we always need to do this:

part = mail.multipart? ? mail.text_part : mail
/charset=(?<charset>[^\s]+)\s?/ =~ part.content_type
part.body.decoded.force_encoding(charset ? charset.gsub('"', '') : "iso-8859-1").encode("UTF-8")

or, nearly equivalently, with invalid or undefined bytes replaced by '?' instead of raising:

part.body.decoded.encode('UTF-8', charset ? charset.gsub('"', '') : "iso-8859-1", :invalid => :replace, :undef => :replace, :replace => '?')

PS. If no charset can be determined from the email, we default to "iso-8859-1". This seems to work for us, though it probably won't work in all environments. You might instead need something like charlock to make a best guess at the charset.

@lawrencepit (Contributor)

Or what also seems to work:

part.body.decoded.encode('UTF-8', part.content_type_parameters['charset'] || "ISO-8859-1", :invalid => :replace, :undef => :replace, :replace => '?')
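
One thing the one-liners above do not cover is a declared charset that Ruby does not recognise. A minimal sketch of a wrapper follows, in plain Ruby so it runs without the mail gem. The helper name to_utf8 and its fallback parameter are hypothetical, and the ISO-8859-1 default simply mirrors the choice made above.

```ruby
# Hypothetical helper (not part of the mail gem): transcode raw body bytes
# to UTF-8 using the declared charset, falling back when it is missing
# or unknown to Ruby.
def to_utf8(bytes, charset = nil, fallback: "ISO-8859-1")
  source = begin
    # Strip the quotes that may surround the charset parameter value.
    Encoding.find(charset.to_s.delete('"'))
  rescue ArgumentError
    # Raised for an empty or unrecognised encoding name.
    Encoding.find(fallback)
  end
  # Replace bytes that are invalid or unmappable instead of raising.
  bytes.encode("UTF-8", source, invalid: :replace, undef: :replace, replace: "?")
end

puts to_utf8("This is an email\x85".b, '"windows-1252"')
puts to_utf8("caf\xE9".b)                    # no charset: falls back to ISO-8859-1
puts to_utf8("plain".b, "no-such-charset")   # unknown charset: falls back too
```

Falling back silently can produce mojibake when the guess is wrong, so logging the fallback case may be worthwhile in practice.
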

dwt commented Nov 20, 2012

+1 I also have this problem

rbu commented Nov 20, 2012

Using ruby 1.9.3p327 and Mail 2.5.2 I see the following:

(rdb:1) p @email
#<Mail::Message:7464800, Multipart: false, Headers: <Date: Tue, 20 Nov 2012 11:53:41 +0000>, <From: info@example.com>, <To: john843483@example.com>, <Message-Id: : 20121120125341.23256.1563256327@localhost>, <Subject: something>, <Mime-Version: 1.0>, <Content-Type: text/plain; charset="utf-8">, <Content-Transfer-Encoding: base64>, <X-Actually-From: info@example.com>, <X-Actually-To: john843483@example.com>>
(rdb:1) p @email.charset
"utf-8"
(rdb:1) p @email.body.charset
"US-ASCII"
(rdb:1) p @email.body.encoding
"base64"
(rdb:1) p @email.body.decoded.encoding
#<Encoding:ASCII-8BIT>
(rdb:1) p @email.body.decoded[0,19]
"Hello John Sm\xC3\xBCd,\n\n"
(rdb:1) p @email.body.decoded[0,19].force_encoding("utf-8")
"Hello John Smüd,\n\n"

As I understand it, the charset given in the message's Content-Type header should be treated as the charset of the body, and should therefore become the encoding of the transfer-decoded string.
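
The same behaviour can be seen with Ruby's built-in base64 support alone. The sketch below assumes a UTF-8 body sent with base64 transfer encoding, as in the session above; it uses only String#unpack and Array#pack, not the mail gem, and the variable names are illustrative.

```ruby
# Build a base64 payload from UTF-8 text, as a sending client would.
payload = ["Hello John Sm\u00FCd"].pack("m0")

# Undo the transfer encoding. Base64 carries no charset information,
# so Ruby tags the decoded bytes as ASCII-8BIT.
decoded = payload.unpack("m").first
puts decoded.encoding

# Relabel the same bytes with the charset declared in Content-Type.
# force_encoding changes only the tag; the bytes stay untouched.
text = decoded.dup.force_encoding("UTF-8")
puts text
```

Because the bytes are already UTF-8, relabelling with force_encoding is enough here; a body in another charset would additionally need encode to reach UTF-8.
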

@lawrencepit (Contributor)

Same issue as #403.

jeremy (Collaborator) commented Jan 22, 2013

Rather than using message.body.decoded or message.decode_body (which just calls body.decoded), call message.decoded. That checks whether it's a text message [1] and calls message.decode_body_as_text, which decodes the transfer-encoding and sets the Ruby string encoding according to the message's charset [2].

[1] https://github.com/mikel/mail/blob/master/lib/mail/message.rb#L1786
[2] https://github.com/mikel/mail/blob/master/lib/mail/message.rb#L2047-L2050
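
The two steps described above (undo the transfer encoding, then apply the charset) can be illustrated in plain Ruby. This is only a sketch of the idea and not the gem's implementation: decode_text is a hypothetical helper, it handles quoted-printable only, and the final transcoding to UTF-8 is an assumption added for the example.

```ruby
# Hypothetical helper, not part of the mail gem.
def decode_text(raw_body, charset)
  # Step 1: undo the quoted-printable transfer encoding; the result is raw bytes.
  bytes = raw_body.unpack("M").first
  # Step 2: tag those bytes with the charset declared for the part.
  bytes.force_encoding(charset)
  # Extra step (assumption for this sketch): transcode to UTF-8.
  bytes.encode("UTF-8")
end

puts decode_text("This is an email=85", "Windows-1252")
```

Stopping after step 1 corresponds to what body.decoded returns in this thread: correct bytes, but no charset attached.
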

jeremy closed this as completed Jan 22, 2013

dwt commented Jan 29, 2013

Out of curiosity, can you explain why calling message.body.decoded is wrong, or why it needs to choose the 'wrong' encoding?

It still seems like a bug to me.

jeremy (Collaborator) commented Jan 30, 2013

It's an API design decision. Decoding/encoding applies to transfer encoding, not charset, here. I agree it'd be nice if it worked the same way, too. Perhaps in a future version!
