Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDF attachment corruption regression #5905

Closed
garethrees opened this issue Oct 1, 2020 · 38 comments · Fixed by #5936
Closed

PDF attachment corruption regression #5905

garethrees opened this issue Oct 1, 2020 · 38 comments · Fixed by #5936
Labels
bug Breaks expected functionality f:request-analysis x:uk

Comments

@garethrees
Copy link
Member

garethrees commented Oct 1, 2020

We recently had a report of a corrupted/malformed PDF attachment.

Looks like this is a regression. I imported the raw email locally and get different attachment results between 0.37 and develop:

# develop
attachment_attributes = MailHandler.get_attachment_attributes(RawEmail.last.mail!)
attachment_attributes.size
# => 1

# 0.37.1.4
attachment_attributes = MailHandler.get_attachment_attributes(RawEmail.last.mail!)
attachment_attributes.size
# => 2
@garethrees garethrees added x:uk bug Breaks expected functionality f:request-analysis labels Oct 1, 2020
@garethrees
Copy link
Member Author

#!/bin/bash
bundle --quiet
bundle exec rails runner "exit 1 unless MailHandler.get_attachment_attributes(RawEmail.last.mail!).size == 2"
git bisect start
git bisect good 40306b3726c81c134207393d67b940267e493cb2 # 0.37.1.4
git bisect bad 1506c45c3 # very recent commit
git bisect run tmp/5905-bisect

# snip…

c581d83285e9093a8cd08e1e0a8901acc495eefb is the first bad commit
commit c581d83285e9093a8cd08e1e0a8901acc495eefb
Author: Graeme Porteous <graeme@rgbp.co.uk>
Date:   Wed Jul 22 11:46:52 2020 +0100

    Bump mail gem to 2.7.1

    Already bumped for the next Rails Gemfile so this is just for 5.2.

:100644 100644 de8a68dcdb3b525668ad3a7af88a8e2982501700 937859904bc66b17b6f909098fa3139e0a7419d2 M      Gemfile
:100644 100644 4dd6237626b3560e63b8d0c04d14c3d01b4bff4c 46b724b0206556176d4a19718e7f97d2c282fd64 M      Gemfile.lock
bisect run success

Doesn't surprise me that this is caused by c581d83, but why is unclear.

@garethrees
Copy link
Member Author

Definitely caused by a change in Mail:

[3](main)> Mail::VERSION.version
# => "2.6.6"
[4](main)> Mail.read('1625443.eml').parts.map(&:content_type)
# => ["text/plain; charset=utf-8; format=flowed", "application/pdf; name=\"20200819 - Aerial Images.pdf\""]


[3](main)> Mail::VERSION.version
# => "2.7.1"
[4](main)> Mail.read('1625443.eml').parts.map(&:content_type)
# => []

Need to establish whether it's a regression or whether we're doing the wrong thing with it.

@garethrees
Copy link
Member Author

#!/bin/bash
bundle --quiet
/usr/bin/env ruby -e "require 'bundler/setup'" -e "require './lib/mail'" -e "exit 1 if Mail.read('1625443.eml').parts.map(&:content_type).empty?"
$ git bisect start
$ git bisect good 0eb3691342d3cbcbbf477bfc105337c65271992a # 2.7.1
$ git bisect bad 7c43c84c16f017e0ff5e5c9962f6a1d842301ee3 # "Bump to 2.7.0.edge. See 2-6-stable branch for latest 2.6.x." 
$ git bisect run ./mail-bisect
running ./mail-bisect

#

21222e1b48f08f6f848d26643d6b6fb1d873d18c is the first bad commit
commit 21222e1b48f08f6f848d26643d6b6fb1d873d18c
Author: Jeremy Daer <jeremydaer@gmail.com>
Date:   Mon May 22 10:01:00 2017 -0700

    Eliminate attachment corruption caused by line ending conversions

    * Omit initial CRLF linefeed conversion since CRLF are required newline
      separators. We shouldn't need to convert bare CR or LF. Update our
      fixture emails to use CRLF throughout. Closes #609. Fixes #408.

    * Drop quoted-printable CRLF conversion. This was introduced to
      harmonize with Ruby's \n-based line endings. But this breaks Q-P
      encoding with binary data. It's not *meant* for binary data, but we
      don't yet take adequate measures to use base64 for these cases.
      Reverts #496. Fixes #1010.

    Closes #1113

:100644 100644 2e9167261111beac1a3a791327d83e538c84ac33 0508d51f20614f64cd472986e081b25b5ac4e3a7 M      CHANGELOG.rdoc
:040000 040000 f0ae85fe4c5510b18408fa51dfb60e6ff0d7b138 af9137afa146367afa370189de463041a98aec55 M      lib
:040000 040000 ebef4af0a90b414f918750b0f049a8db14c765a1 7ddbc49dac5921a273972e2f6b8d2910b2a7e0b2 M      spec
bisect run success

@garethrees
Copy link
Member Author

git checkout 21222e1b48f08f6f848d26643d6b6fb1d873d18c

Before Diff

[1](main)> require 'bundler/setup'
NOTE: Gem::Specification#has_rdoc= is deprecated with no replacement. It will be removed on or after 2018-12-01.
Gem::Specification#has_rdoc= called from /vagrant/mail.gemspec:13.
=> true
[2](main)> require './lib/mail'
=> true
[3](main)> Mail.read('1625443.eml').parts.map(&:content_type)
WARNING: More than 1000 header fields; only using the first 1000 and ignoring the rest
WARNING: Ignoring unparsable header "From clerk@moultonpc.org.uk Thu Aug 20 16:18:14 2020": invalid header name syntax: "From clerk@moultonpc.org.uk Thu Aug 20 16"
=> []

Apply Diff

diff --git a/lib/mail/attachments_list.rb b/lib/mail/attachments_list.rb
index ca7d0d6a..5f5ac2fb 100644
--- a/lib/mail/attachments_list.rb
+++ b/lib/mail/attachments_list.rb
@@ -7,7 +7,7 @@ module Mail
       @content_disposition_type = 'attachment'
       parts_list.map { |p|
         if p.mime_type == 'message/rfc822'
-          Mail.new(p.body.encoded).attachments
+          Mail.new(p.body).attachments
         elsif p.parts.empty?
           p if p.attachment?
         else
diff --git a/lib/mail/encodings/quoted_printable.rb b/lib/mail/encodings/quoted_printable.rb
index 6a78ffc3..6c0438a5 100644
--- a/lib/mail/encodings/quoted_printable.rb
+++ b/lib/mail/encodings/quoted_printable.rb
@@ -16,11 +16,11 @@ module Mail
       # Decode the string from Quoted-Printable. Cope with hard line breaks
       # that were incorrectly encoded as hex instead of literal CRLF.
       def self.decode(str)
-        str.gsub(/(?:=0D=0A|=0D|=0A)\r\n/, "\r\n").unpack("M*").first
+        ::Mail::Utilities.to_lf str.gsub(/(?:=0D=0A|=0D|=0A)\r\n/, "\r\n").unpack("M*").first
       end

       def self.encode(str)
-        [str].pack("M")
+        ::Mail::Utilities.to_crlf([::Mail::Utilities.to_lf(str)].pack("M"))
       end

       def self.cost(str)
diff --git a/lib/mail/message.rb b/lib/mail/message.rb
index 4398885e..32d67b11 100644
--- a/lib/mail/message.rb
+++ b/lib/mail/message.rb
@@ -2012,7 +2012,7 @@ module Mail

     def raw_source=(value)
       value = value.dup.force_encoding(Encoding::BINARY) if RUBY_VERSION >= "1.9.1"
-      @raw_source = value
+      @raw_source = ::Mail::Utilities.to_crlf(value)
     end

     # see comments to body=. We take data and process it lazily

After diff

[1](main)> require 'bundler/setup'
NOTE: Gem::Specification#has_rdoc= is deprecated with no replacement. It will be removed on or after 2018-12-01.
Gem::Specification#has_rdoc= called from /vagrant/mail.gemspec:13.
=> true
[2](main)> require './lib/mail'
=> true
[3](main)> Mail.read('1625443.eml').parts.map(&:content_type)
=> ["text/plain; charset=utf-8; format=flowed", "application/pdf; name=\"20200819 - Aerial Images.pdf\""]

@garethrees
Copy link
Member Author

garethrees commented Oct 1, 2020

Pretty sure mikel/mail@21222e1 is causing the problem.

@garethrees
Copy link
Member Author

Applying only this hunk of the diff results in the attachments being returned correctly:

diff --git a/lib/mail/message.rb b/lib/mail/message.rb
index 4398885e..32d67b11 100644
--- a/lib/mail/message.rb
+++ b/lib/mail/message.rb
@@ -2012,7 +2012,7 @@ module Mail

     def raw_source=(value)
       value = value.dup.force_encoding(Encoding::BINARY) if RUBY_VERSION >= "1.9.1"
-      @raw_source = value
+      @raw_source = ::Mail::Utilities.to_crlf(value)
     end

     # see comments to body=. We take data and process it lazily
[5](main)> Mail.read('1625443.eml').parts.map(&:content_type)
# => ["text/plain; charset=utf-8; format=flowed", "application/pdf; name=\"20200819 - Aerial Images.pdf\""]

@garethrees
Copy link
Member Author

garethrees commented Oct 7, 2020

@sagepe
Copy link
Member

sagepe commented Oct 14, 2020

Possibly related: mikel/mail#1329?

@garethrees
Copy link
Member Author

Possibly related: mikel/mail#1329?

Well, our failing email doesn't have a Content-Location header, but then neither does this successfully parsed email.

I've noticed that the Mail::Body#extract_parts method requires CRLF line endings, and sometimes we seem to receive mail with only LF endings.

This could be worth verifying for the failing and successful examples.

mikel/mail#1350 – Handle edge case when inline image attachment part is missing a filename

Are our failing attachments inline and missing a filename? The PR seems to specifically target images, so we might need to do something different for the PDF cases we're seeing fail.

@RichardTaylor
Copy link

@gbp
Copy link
Member

gbp commented Oct 27, 2020

Inspired by https://github.com/rails/rails/blob/master/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5

If we do:

diff --git a/lib/mail_handler/backends/mail_backend.rb b/lib/mail_handler/backends/mail_backend.rb
index e8de3ffa9c..970ba1efd4 100644
--- a/lib/mail_handler/backends/mail_backend.rb
+++ b/lib/mail_handler/backends/mail_backend.rb
@@ -43,7 +43,7 @@ def backend
 
       def mail_from_raw_email(data)
         data = data.force_encoding(Encoding::BINARY) if data.is_a? String
-        Mail.new(data)
+        Mail.new Mail::Utilities.binary_unsafe_to_crlf(data)
       end
 
       # Extracts all attachments from the given TNEF file as a Mail object

and then:

[1] pry(main)> cd IncomingMessage.last
[2] pry(#<IncomingMessage>):1)> get_main_body_text_folded
=> "This is a multi-part message in MIME format..."
[3] pry(#<IncomingMessage>):1> parse_raw_email!(true)
[4] pry(#<IncomingMessage>):1> clear_in_database_caches!
[5] pry(#<IncomingMessage>):1> get_main_body_text_folded
=> "Dear ..."

Seems to means the emails parts get extracted correctly.

@gbp
Copy link
Member

gbp commented Oct 27, 2020

This is basically waht we already do in our specs as editors automattically convert CRLF into LF.

mysociety-pusher pushed a commit that referenced this issue Oct 27, 2020
Fixes issue with extracting out mail parts happening since upgrading the
mail gem to 2.7.

Fixes #5905

Inspired by ActionMailbox [1]

[1] https://github.com/rails/rails/blob/a5b2fff/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5
mysociety-pusher pushed a commit that referenced this issue Oct 27, 2020
Fixes issue with extracting out mail parts happening since upgrading the
mail gem to 2.7.

Fixes #5905

Inspired by ActionMailbox [1]

[1] https://github.com/rails/rails/blob/a5b2fff/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5
@RichardTaylor
Copy link

Another example

https://www.whatdotheyknow.com/request/policies_329#incoming-1665717

Getting this fix up on WhatDoTheyKnow will prevent requests to extract and upload documents from reformatted responses.

@RichardTaylor
Copy link

Will the fix have a retrospective effect on messages that have already come in?

@gbp
Copy link
Member

gbp commented Oct 28, 2020

Will the fix have a retrospective effect on messages that have already come in?

It won't fix them automatically but I'm intenteding to run some commands in the app console to fix

@RichardTaylor
Copy link

@RichardTaylor
Copy link

Possibly related issue, a further message on the thread mentioned above not involving a PDF though:

https://www.whatdotheyknow.com/request/professional_services_procuremen_316#incoming-1670997

@mdeuk
Copy link
Collaborator

mdeuk commented Nov 7, 2020

Possibly related issue, a further message on the thread mentioned above not involving a PDF though:

https://www.whatdotheyknow.com/request/professional_services_procuremen_316#incoming-1670997

The authority in this case (Adur & Worthing Councils) is using the same CRM software (Mats / NetCall) as the rest of the instances here, so it does seem reasonable to conclude that there is a link between the two.

@RichardTaylor
Copy link

@garethrees
Copy link
Member Author

Ha, well this is strange:

Mail.read('tmp/1625443.eml').parts.map(&:content_type)
# => []
Mail.new(File.read('tmp/1625443.eml')).parts.map(&:content_type)
# => ["text/plain; charset=utf-8; format=flowed", "application/pdf; name=\"20200819 - Aerial Images.pdf\""]

@gbp
Copy link
Member

gbp commented Nov 12, 2020

Mail.read('tmp/1625443.eml').parts.map(&:content_type)
# => []

is equvailent to

data = File.open('tmp/1625443.eml', 'rb') { |f| f.read }
Mail.open(data).parts.map(&:content_type)
# => []

and what we have in MailHandler::Backends::MailBackend

data = File.read('tmp/1625443.eml').force_encoding(Encoding::BINARY)
Mail.open(data).parts.map(&:content_type)
# => []

All are fixed with the proposed CRLF change:

data = File.read('tmp/1625443.eml').force_encoding(Encoding::BINARY)
data = Mail::Utilities.binary_unsafe_to_crlf(data.to_s)
Mail.open(data).parts.map(&:content_type)
# => ["text/plain; charset=utf-8; format=flowed", "application/pdf; name=\"20200819 - Aerial Images.pdf\""]

@RichardTaylor
Copy link

@RichardTaylor
Copy link

@mdeuk
Copy link
Collaborator

mdeuk commented Nov 20, 2020

We've another instance of this issue at:
https://www.whatdotheyknow.com/request/secure_email_contracts_7#incoming-1678252

@RichardTaylor
Copy link

Apparent further case:
https://www.whatdotheyknow.com/request/investments_and_divestment_progr_2#incoming-1672202

Once fixed it would be good to try and identify as many cases as possible systematically.

@RichardTaylor
Copy link

@RichardTaylor
Copy link

Possible other example, not clear if this is the same issue or not:

https://www.whatdotheyknow.com/request/ipc_grants_11#incoming-1680749

@RichardTaylor
Copy link

@garethrees
Copy link
Member Author

Just to note that we do have a probable fix for this (#5936). The hold up at the moment is that I wanted to add a test that demonstrates the problem (#5970) and that the new code fixes it. I also want to make sure we really understand what's causing the problem we're seeing here, as I'm reluctant to push a fix when it may may turn out to cause other problems somewhere else because we've jumped to conclusions.

@garethrees
Copy link
Member Author

garethrees commented Dec 1, 2020

Let's recap.

Since upgrading to Mail 2.7.1 (c581d83) we've noticed a regression in handling some PDFs. Instead of being extracted to an FoiAttachment, the base64 encoded string gets rendered into the main part of the correspondence. We've isolated the cause to mikel/mail@21222e1.

21222e1b48f08f6f848d26643d6b6fb1d873d18c is the first bad commit
commit 21222e1b48f08f6f848d26643d6b6fb1d873d18c
Author: Jeremy Daer <jeremydaer@gmail.com>
Date:   Mon May 22 10:01:00 2017 -0700

    Eliminate attachment corruption caused by line ending conversions

    * Omit initial CRLF linefeed conversion since CRLF are required newline
      separators. We shouldn't need to convert bare CR or LF. Update our
      fixture emails to use CRLF throughout. Closes #609. Fixes #408.

    * Drop quoted-printable CRLF conversion. This was introduced to
      harmonize with Ruby's \n-based line endings. But this breaks Q-P
      encoding with binary data. It's not *meant* for binary data, but we
      don't yet take adequate measures to use base64 for these cases.
      Reverts #496. Fixes #1010.

    Closes #1113

Since the introduction of Action Mailbox Rails parses raw emails through Mail through Mail::Utilities.binary_unsafe_to_crlf. This appears to have been introduced in 627bbd34 with no specific explanation for the use of the helper.

This (fixed) issue in Redmine appears to describe a similar problem. There are some notes suggesting that for some users the LANG environment variable, or Postfix configuration, could be causing the issue.

I've downloaded a RawEmail from WhatDoTheyKnow where the problem was exhibited, anonymised it and created a spec (d325c8b) that demonstrates the problem. We can see that when the fixture has unix line endings, with the current code the spec fails in the same way the production code fails. Parsing the raw email through Mail::Utilities.binary_unsafe_to_crlf as Rails does (1dfb71f) makes the spec pass.

I don't fully understand whether we're doing the right thing at the infrastructure level at the moment. Are we applying any line-ending conversions during the Exim pipe process? What should we be expecting the application to ingest, and is that what's happening?

If we're confident that we're not mangling the email before entry into the app, then I'd like to understand what it is about these emails is "wrong". Is it that they don't follow the RFC, but are generated in this way due to a poor email client, or is it that there's an upstream issue in Mail that we should consider raising?

It feels like 1dfb71f is probably the right thing to do. If we were building Alaveteli from scratch today we'd likely use Action Mailbox, and therefore go through Mail::Utilities.binary_unsafe_to_crlf.

I want to be sure that:

  1. By applying 1dfb71f we're not going to cause some other bug because we don't really understand what's happening.
  2. We're not patching over something that we've got wrong lower down the stack in Exim or on the OS.

@sagepe
Copy link
Member

sagepe commented Dec 1, 2020

Regarding Exim: I don't think we're doing anything out of the ordinary on our infra at present. The emails are received and forwarded on without explicitly changing formats.

There's a description in the Exim docs covering how line endings are handled during mail processing.

There is an option to force CRLF when using a pipe transport rather than just LF. I'm not sure this is the right thing to do however; some attachments must be coming through correctly and the default is not to do this, so I'd want to understand why it was necessary before turning it on.

It's also worth noting that we can't predict necessarily how a given Alaveteli's environment may be configured; it seems sensible to me to have an internal step like the one you mention above using Mail::Utilities.binary_unsafe_to_crlf to ensure mail is subsequently processed correctly - we could tweak the email configuration used on WDTK infrastructure, but that might not help a re-user who may have less control over the format of their mail.

@dracos
Copy link
Member

dracos commented Dec 1, 2020

What happens then is due to the presence of non-ASCII characters somewhere in the raw email text.

  1. lib/mail/message.rb calls Mail::Body.new(@body_raw)
  2. Body initialize does @raw_source = ::Mail::Utilities.to_crlf(string.to_s)
  3. to_crlf only calls binary_unsafe_to_crlf if safe_for_line_ending_conversion? which is (if ruby<1.9) if string is ASCII only, or (ruby>=1.9) if string is valid encoding or (BINARY+ASCII only)
  4. the string is BINARY (fair enough, as encoding isn't known until headers are processed).

The part splitter uses a regexp(?!) which only spots \r\n, and thus will fail on any LF-ended non-ASCII email only, leaving the whole email as one non-MIME part and thus outputting it all including boundaries, as you see. If the email is ASCII only, line endings are being converted back to CRLF and it works fine. I agree, unilaterally forcing a conversion to crlf, given the mail code parsing mandates CRLF and isn't always doing the conversion, is the right thing to do.

mysociety-pusher pushed a commit that referenced this issue Dec 3, 2020
Since upgrading to Mail 2.7.1 (c581d83) we've noticed a regression in
handling some PDFs. Instead of being extracted to an FoiAttachment, the
base64 encoded string gets rendered into the main part of the
correspondence. We've isolated the cause to mikel/mail@21222e1.

Since the introduction of Action Mailbox Rails parses raw emails through
Mail through `Mail::Utilities.binary_unsafe_to_crlf` [1]. This appears
to have been introduced in 627bbd34 with no specific explanation for the
use of the helper.

The issue appears with the presence of non-ASCII characters somewhere in
the raw email text of multipart messages.

Mail uses a Regexp to split parts, which only spots \r\n, and thus will
fail on any LF-ended non-ASCII email leaving the whole email as
one non-MIME part. Mail does convert the raw email to CRLF, but only
when the String has a valid encoding, or if the String is encoded as
`Encoding::BINARY`, then only if the String contains only ASCII
characters.

Alaveteli forces all raw emails to `Encoding::BINARY` as of 0f9d143,
so `Mail::Utilities.safe_for_line_ending_conversion?` returns `false`.
If the email is ASCII only, line endings are being converted back to
CRLF and it works fine.

Unilaterally forcing a conversion to CRLF, given the mail code parsing
mandates CRLF and isn't always doing the conversion, seems like the
right thing to do.

As well as fixing the above issue, some cleanup is included in this
commit:

* Rely on `mail_from_raw_email` converting to CRLF rather than in
  `email_helpers` to ensure consistent behaviour.
* Reset all email fixture files back to Unix line endings for
  consistency.

Fixes #5905.

* Initial problem description, specs and cleanup contributed by
  @garethrees
* Bugfix contributed by @gbp
* Conversion issues in Mail identified by @dracos

[1] https://github.com/rails/rails/blob/a5b2fff/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5
mysociety-pusher pushed a commit that referenced this issue Dec 3, 2020
Since upgrading to Mail 2.7.1 (c581d83) we've noticed a regression in
handling some PDFs. Instead of being extracted to an FoiAttachment, the
base64 encoded string gets rendered into the main part of the
correspondence. We've isolated the cause to mikel/mail@21222e1.

Since the introduction of Action Mailbox Rails parses raw emails through
Mail through `Mail::Utilities.binary_unsafe_to_crlf` [1]. This appears
to have been introduced in 627bbd34 with no specific explanation for the
use of the helper.

The issue appears with the presence of non-ASCII characters somewhere in
the raw email text of multipart messages.

Mail uses a Regexp to split parts, which only spots \r\n, and thus will
fail on any LF-ended non-ASCII email leaving the whole email as
one non-MIME part. Mail does convert the raw email to CRLF, but only
when the String has a valid encoding, or if the String is encoded as
`Encoding::BINARY`, then only if the String contains only ASCII
characters.

Alaveteli forces all raw emails to `Encoding::BINARY` as of 0f9d143,
so `Mail::Utilities.safe_for_line_ending_conversion?` returns `false`.
If the email is ASCII only, line endings are being converted back to
CRLF and it works fine.

Unilaterally forcing a conversion to CRLF, given the mail code parsing
mandates CRLF and isn't always doing the conversion, seems like the
right thing to do.

As well as fixing the above issue, some cleanup is included in this
commit:

* Rely on `mail_from_raw_email` converting to CRLF rather than in
  `email_helpers` to ensure consistent behaviour.
* Reset all email fixture files back to Unix line endings for
  consistency.

Fixes #5905.

* Initial problem description, specs and cleanup contributed by
  @garethrees
* Bugfix contributed by @gbp
* Conversion issues in Mail identified by @dracos

[1] https://github.com/rails/rails/blob/a5b2fff/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5
@RichardTaylor
Copy link

mysociety-pusher pushed a commit that referenced this issue Dec 8, 2020
Re-parses any incoming messages that look like they were affected by a
regression in Mail [1] that caused some multipart emails to be parsed as
a single part.

The date to look for messages defaults to the commit date [2] that we
introduced the gem version that causes the regression, but is tunable
for re-users who may upgrade much later than us.

[1] #5905
[2] c581d83285
mysociety-pusher pushed a commit that referenced this issue Dec 8, 2020
Re-parses any incoming messages that look like they were affected by a
regression in Mail [1] that caused some multipart emails to be parsed as
a single part.

The date to look for messages defaults to the commit date [2] that we
introduced the gem version that causes the regression, but is tunable
for re-users who may upgrade much later than us.

[1] #5905
[2] c581d83285
mysociety-pusher pushed a commit that referenced this issue Dec 8, 2020
Re-parses any incoming messages that look like they were affected by a
regression in Mail [1] that caused some multipart emails to be parsed as
a single part.

To identify affected messages we look for 'Content-Type' in the cached
body text, since that should almost never be seen there. It doesn't
really matter if we re-parse a few false-positives, since caches will
get generated on next view.

The date to look for messages defaults to the commit date [2] that we
introduced the gem version that causes the regression, but is tunable
for re-users who may upgrade much later than us.

[1] #5905
[2] c581d83285
mysociety-pusher pushed a commit that referenced this issue Dec 9, 2020
Re-parses any incoming messages that look like they were affected by a
regression in Mail [1] that caused some multipart emails to be parsed as
a single part.

To identify affected messages we look for 'Content-Type' in the cached
body text, since that should almost never be seen there. It doesn't
really matter if we re-parse a few false-positives, since caches will
get generated on next view.

The date to look for messages defaults to the commit date [2] that we
introduced the gem version that causes the regression, but is tunable
for re-users who may upgrade much later than us.

[1] #5905
[2] c581d83285
@garethrees
Copy link
Member Author

I've now run the task in 7ae31c5 and all of the above messages are correctly parsed.

Possible other example, not clear if this is the same issue or not:

https://www.whatdotheyknow.com/request/ipc_grants_11#incoming-1680749

This one hasn't changed, so I expect a different, edge-case issue.

@mdeuk
Copy link
Collaborator

mdeuk commented Dec 9, 2020

I've now run the task in 7ae31c5 and all of the above messages are correctly parsed.

Possible other example, not clear if this is the same issue or not:
https://www.whatdotheyknow.com/request/ipc_grants_11#incoming-1680749

This one hasn't changed, so I expect a different, edge-case issue.

I think this specific one is the message itself, rather than an Alaveteli (or upstream) bug.

The raw message doesn't render correctly in an email client either, and the source suggests that it was munged before it arrived.
image

mysociety-pusher pushed a commit that referenced this issue Jan 14, 2021
Since upgrading to Mail 2.7.1 (c581d83) we've noticed a regression in
handling some PDFs. Instead of being extracted to an FoiAttachment, the
base64 encoded string gets rendered into the main part of the
correspondence. We've isolated the cause to mikel/mail@21222e1.

Since the introduction of Action Mailbox Rails parses raw emails through
Mail through `Mail::Utilities.binary_unsafe_to_crlf` [1]. This appears
to have been introduced in 627bbd34 with no specific explanation for the
use of the helper.

The issue appears with the presence of non-ASCII characters somewhere in
the raw email text of multipart messages.

Mail uses a Regexp to split parts, which only spots \r\n, and thus will
fail on any LF-ended non-ASCII email leaving the whole email as
one non-MIME part. Mail does convert the raw email to CRLF, but only
when the String has a valid encoding, or if the String is encoded as
`Encoding::BINARY`, then only if the String contains only ASCII
characters.

Alaveteli forces all raw emails to `Encoding::BINARY` as of 0f9d143,
so `Mail::Utilities.safe_for_line_ending_conversion?` returns `false`.
If the email is ASCII only, line endings are being converted back to
CRLF and it works fine.

Unilaterally forcing a conversion to CRLF, given the mail code parsing
mandates CRLF and isn't always doing the conversion, seems like the
right thing to do.

As well as fixing the above issue, some cleanup is included in this
commit:

* Rely on `mail_from_raw_email` converting to CRLF rather than in
  `email_helpers` to ensure consistent behaviour.
* Reset all email fixture files back to Unix line endings for
  consistency.

Fixes #5905.

* Initial problem description, specs and cleanup contributed by
  @garethrees
* Bugfix contributed by @gbp
* Conversion issues in Mail identified by @dracos

[1] https://github.com/rails/rails/blob/a5b2fff/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5
mysociety-pusher pushed a commit that referenced this issue Jan 14, 2021
Re-parses any incoming messages that look like they were affected by a
regression in Mail [1] that caused some multipart emails to be parsed as
a single part.

To identify affected messages we look for 'Content-Type' in the cached
body text, since that should almost never be seen there. It doesn't
really matter if we re-parse a few false-positives, since caches will
get generated on next view.

The date to look for messages defaults to the commit date [2] that we
introduced the gem version that causes the regression, but is tunable
for re-users who may upgrade much later than us.

[1] #5905
[2] c581d83285
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Breaks expected functionality f:request-analysis x:uk
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants