PDF attachment corruption regression #5905

garethrees · 2020-10-01T12:11:46Z

We recently had a report of a corrupted/malformed PDF attachment.

Looks like this is a regression. I imported the raw email locally and get different attachment results between 0.37 and develop:

# develop
attachment_attributes = MailHandler.get_attachment_attributes(RawEmail.last.mail!)
attachment_attributes.size
# => 1

# 0.37.1.4
attachment_attributes = MailHandler.get_attachment_attributes(RawEmail.last.mail!)
attachment_attributes.size
# => 2

The text was updated successfully, but these errors were encountered:

garethrees · 2020-10-01T12:36:49Z

#!/bin/bash
bundle --quiet
bundle exec rails runner "exit 1 unless MailHandler.get_attachment_attributes(RawEmail.last.mail!).size == 2"

git bisect start
git bisect good 40306b3726c81c134207393d67b940267e493cb2 # 0.37.1.4
git bisect bad 1506c45c3 # very recent commit
git bisect run tmp/5905-bisect

# snip…

c581d83285e9093a8cd08e1e0a8901acc495eefb is the first bad commit
commit c581d83285e9093a8cd08e1e0a8901acc495eefb
Author: Graeme Porteous <graeme@rgbp.co.uk>
Date:   Wed Jul 22 11:46:52 2020 +0100

    Bump mail gem to 2.7.1

    Already bumped for the next Rails Gemfile so this is just for 5.2.

:100644 100644 de8a68dcdb3b525668ad3a7af88a8e2982501700 937859904bc66b17b6f909098fa3139e0a7419d2 M      Gemfile
:100644 100644 4dd6237626b3560e63b8d0c04d14c3d01b4bff4c 46b724b0206556176d4a19718e7f97d2c282fd64 M      Gemfile.lock
bisect run success

Doesn't surprise me that this is caused by c581d83, but why is unclear.

garethrees · 2020-10-01T15:09:25Z

Definitely caused by a change in Mail:

[3](main)> Mail::VERSION.version
# => "2.6.6"
[4](main)> Mail.read('1625443.eml').parts.map(&:content_type)
# => ["text/plain; charset=utf-8; format=flowed", "application/pdf; name=\"20200819 - Aerial Images.pdf\""]


[3](main)> Mail::VERSION.version
# => "2.7.1"
[4](main)> Mail.read('1625443.eml').parts.map(&:content_type)
# => []

Need to establish whether it's a regression or whether we're doing the wrong thing with it.

garethrees · 2020-10-01T15:55:15Z

#!/bin/bash
bundle --quiet
/usr/bin/env ruby -e "require 'bundler/setup'" -e "require './lib/mail'" -e "exit 1 if Mail.read('1625443.eml').parts.map(&:content_type).empty?"

$ git bisect start
$ git bisect good 0eb3691342d3cbcbbf477bfc105337c65271992a # 2.7.1
$ git bisect bad 7c43c84c16f017e0ff5e5c9962f6a1d842301ee3 # "Bump to 2.7.0.edge. See 2-6-stable branch for latest 2.6.x." 
$ git bisect run ./mail-bisect
running ./mail-bisect

# …

21222e1b48f08f6f848d26643d6b6fb1d873d18c is the first bad commit
commit 21222e1b48f08f6f848d26643d6b6fb1d873d18c
Author: Jeremy Daer <jeremydaer@gmail.com>
Date:   Mon May 22 10:01:00 2017 -0700

    Eliminate attachment corruption caused by line ending conversions

    * Omit initial CRLF linefeed conversion since CRLF are required newline
      separators. We shouldn't need to convert bare CR or LF. Update our
      fixture emails to use CRLF throughout. Closes #609. Fixes #408.

    * Drop quoted-printable CRLF conversion. This was introduced to
      harmonize with Ruby's \n-based line endings. But this breaks Q-P
      encoding with binary data. It's not *meant* for binary data, but we
      don't yet take adequate measures to use base64 for these cases.
      Reverts #496. Fixes #1010.

    Closes #1113

:100644 100644 2e9167261111beac1a3a791327d83e538c84ac33 0508d51f20614f64cd472986e081b25b5ac4e3a7 M      CHANGELOG.rdoc
:040000 040000 f0ae85fe4c5510b18408fa51dfb60e6ff0d7b138 af9137afa146367afa370189de463041a98aec55 M      lib
:040000 040000 ebef4af0a90b414f918750b0f049a8db14c765a1 7ddbc49dac5921a273972e2f6b8d2910b2a7e0b2 M      spec
bisect run success

garethrees · 2020-10-01T16:03:12Z

git checkout 21222e1b48f08f6f848d26643d6b6fb1d873d18c

Before Diff

[1](main)> require 'bundler/setup'
NOTE: Gem::Specification#has_rdoc= is deprecated with no replacement. It will be removed on or after 2018-12-01.
Gem::Specification#has_rdoc= called from /vagrant/mail.gemspec:13.
=> true
[2](main)> require './lib/mail'
=> true
[3](main)> Mail.read('1625443.eml').parts.map(&:content_type)
WARNING: More than 1000 header fields; only using the first 1000 and ignoring the rest
WARNING: Ignoring unparsable header "From clerk@moultonpc.org.uk Thu Aug 20 16:18:14 2020": invalid header name syntax: "From clerk@moultonpc.org.uk Thu Aug 20 16"
=> []

Apply Diff

diff --git a/lib/mail/attachments_list.rb b/lib/mail/attachments_list.rb
index ca7d0d6a..5f5ac2fb 100644
--- a/lib/mail/attachments_list.rb
+++ b/lib/mail/attachments_list.rb
@@ -7,7 +7,7 @@ module Mail
       @content_disposition_type = 'attachment'
       parts_list.map { |p|
         if p.mime_type == 'message/rfc822'
-          Mail.new(p.body.encoded).attachments
+          Mail.new(p.body).attachments
         elsif p.parts.empty?
           p if p.attachment?
         else
diff --git a/lib/mail/encodings/quoted_printable.rb b/lib/mail/encodings/quoted_printable.rb
index 6a78ffc3..6c0438a5 100644
--- a/lib/mail/encodings/quoted_printable.rb
+++ b/lib/mail/encodings/quoted_printable.rb
@@ -16,11 +16,11 @@ module Mail
       # Decode the string from Quoted-Printable. Cope with hard line breaks
       # that were incorrectly encoded as hex instead of literal CRLF.
       def self.decode(str)
-        str.gsub(/(?:=0D=0A|=0D|=0A)\r\n/, "\r\n").unpack("M*").first
+        ::Mail::Utilities.to_lf str.gsub(/(?:=0D=0A|=0D|=0A)\r\n/, "\r\n").unpack("M*").first
       end

       def self.encode(str)
-        [str].pack("M")
+        ::Mail::Utilities.to_crlf([::Mail::Utilities.to_lf(str)].pack("M"))
       end

       def self.cost(str)
diff --git a/lib/mail/message.rb b/lib/mail/message.rb
index 4398885e..32d67b11 100644
--- a/lib/mail/message.rb
+++ b/lib/mail/message.rb
@@ -2012,7 +2012,7 @@ module Mail

     def raw_source=(value)
       value = value.dup.force_encoding(Encoding::BINARY) if RUBY_VERSION >= "1.9.1"
-      @raw_source = value
+      @raw_source = ::Mail::Utilities.to_crlf(value)
     end

     # see comments to body=. We take data and process it lazily

After diff

[1](main)> require 'bundler/setup'
NOTE: Gem::Specification#has_rdoc= is deprecated with no replacement. It will be removed on or after 2018-12-01.
Gem::Specification#has_rdoc= called from /vagrant/mail.gemspec:13.
=> true
[2](main)> require './lib/mail'
=> true
[3](main)> Mail.read('1625443.eml').parts.map(&:content_type)
=> ["text/plain; charset=utf-8; format=flowed", "application/pdf; name=\"20200819 - Aerial Images.pdf\""]

garethrees · 2020-10-01T16:03:36Z

Pretty sure mikel/mail@21222e1 is causing the problem.

garethrees · 2020-10-02T09:08:25Z

Applying only this hunk of the diff results in the attachments being returned correctly:

diff --git a/lib/mail/message.rb b/lib/mail/message.rb
index 4398885e..32d67b11 100644
--- a/lib/mail/message.rb
+++ b/lib/mail/message.rb
@@ -2012,7 +2012,7 @@ module Mail

     def raw_source=(value)
       value = value.dup.force_encoding(Encoding::BINARY) if RUBY_VERSION >= "1.9.1"
-      @raw_source = value
+      @raw_source = ::Mail::Utilities.to_crlf(value)
     end

     # see comments to body=. We take data and process it lazily

[5](main)> Mail.read('1625443.eml').parts.map(&:content_type)
# => ["text/plain; charset=utf-8; format=flowed", "application/pdf; name=\"20200819 - Aerial Images.pdf\""]

garethrees · 2020-10-07T08:40:23Z

Another case https://www.whatdotheyknow.com/request/adult_social_care_financial_asse#incoming-1639461

sagepe · 2020-10-14T15:21:19Z

Possibly related: mikel/mail#1329?

garethrees · 2020-10-15T08:35:10Z

Possibly related: mikel/mail#1329?

Well, our failing email doesn't have a Content-Location header, but then neither does this successfully parsed email.

I've noticed that the Mail::Body#extract_parts method requires CRLF line endings, and sometimes we seem to receive mail with only LF endings.

This could be worth verifying for the failing and successful examples.

mikel/mail#1350 – Handle edge case when inline image attachment part is missing a filename

Are our failing attachments inline and missing a filename? The PR seems to specifically target images, so we might need to do something different for the PDF cases we're seeing fail.

RichardTaylor · 2020-10-19T11:34:22Z

Another case https://www.whatdotheyknow.com/request/education_team_structure_8

gbp · 2020-10-27T12:58:01Z

Inspired by https://github.com/rails/rails/blob/master/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5

If we do:

diff --git a/lib/mail_handler/backends/mail_backend.rb b/lib/mail_handler/backends/mail_backend.rb
index e8de3ffa9c..970ba1efd4 100644
--- a/lib/mail_handler/backends/mail_backend.rb
+++ b/lib/mail_handler/backends/mail_backend.rb
@@ -43,7 +43,7 @@ def backend
 
       def mail_from_raw_email(data)
         data = data.force_encoding(Encoding::BINARY) if data.is_a? String
-        Mail.new(data)
+        Mail.new Mail::Utilities.binary_unsafe_to_crlf(data)
       end
 
       # Extracts all attachments from the given TNEF file as a Mail object

and then:

[1] pry(main)> cd IncomingMessage.last
[2] pry(#<IncomingMessage>):1)> get_main_body_text_folded
=> "This is a multi-part message in MIME format..."
[3] pry(#<IncomingMessage>):1> parse_raw_email!(true)
[4] pry(#<IncomingMessage>):1> clear_in_database_caches!
[5] pry(#<IncomingMessage>):1> get_main_body_text_folded
=> "Dear ..."

Seems to means the emails parts get extracted correctly.

gbp · 2020-10-27T13:03:40Z

This is basically waht we already do in our specs as editors automattically convert CRLF into LF.

Fixes issue with extracting out mail parts happening since upgrading the mail gem to 2.7. Fixes #5905 Inspired by ActionMailbox [1] [1] https://github.com/rails/rails/blob/a5b2fff/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5

RichardTaylor · 2020-10-28T14:48:03Z

Another example

https://www.whatdotheyknow.com/request/policies_329#incoming-1665717

Getting this fix up on WhatDoTheyKnow will prevent requests to extract and upload documents from reformatted responses.

RichardTaylor · 2020-10-28T14:49:02Z

Will the fix have a retrospective effect on messages that have already come in?

gbp · 2020-10-28T15:10:15Z

Will the fix have a retrospective effect on messages that have already come in?

It won't fix them automatically but I'm intenteding to run some commands in the app console to fix

RichardTaylor · 2020-10-31T14:54:41Z

Possibly another case:

https://www.whatdotheyknow.com/request/professional_services_procuremen_316#incoming-1667331

RichardTaylor · 2020-11-07T11:43:54Z

Possibly related issue, a further message on the thread mentioned above not involving a PDF though:

https://www.whatdotheyknow.com/request/professional_services_procuremen_316#incoming-1670997

mdeuk · 2020-11-07T12:00:02Z

Possibly related issue, a further message on the thread mentioned above not involving a PDF though:

https://www.whatdotheyknow.com/request/professional_services_procuremen_316#incoming-1670997

The authority in this case (Adur & Worthing Councils) is using the same CRM software (Mats / NetCall) as the rest of the instances here, so it does seem reasonable to conclude that there is a link between the two.

RichardTaylor · 2020-11-12T11:21:03Z

Another likely example

https://www.whatdotheyknow.com/request/covid_19_in_sheltered_housing#incoming-1673474

garethrees · 2020-11-12T13:27:48Z

Ha, well this is strange:

Mail.read('tmp/1625443.eml').parts.map(&:content_type)
# => []
Mail.new(File.read('tmp/1625443.eml')).parts.map(&:content_type)
# => ["text/plain; charset=utf-8; format=flowed", "application/pdf; name=\"20200819 - Aerial Images.pdf\""]

gbp · 2020-11-12T13:45:01Z

Mail.read('tmp/1625443.eml').parts.map(&:content_type)
# => []

is equvailent to

data = File.open('tmp/1625443.eml', 'rb') { |f| f.read }
Mail.open(data).parts.map(&:content_type)
# => []

and what we have in MailHandler::Backends::MailBackend

data = File.read('tmp/1625443.eml').force_encoding(Encoding::BINARY)
Mail.open(data).parts.map(&:content_type)
# => []

All are fixed with the proposed CRLF change:

data = File.read('tmp/1625443.eml').force_encoding(Encoding::BINARY)
data = Mail::Utilities.binary_unsafe_to_crlf(data.to_s)
Mail.open(data).parts.map(&:content_type)
# => ["text/plain; charset=utf-8; format=flowed", "application/pdf; name=\"20200819 - Aerial Images.pdf\""]

RichardTaylor · 2020-11-13T13:20:55Z

Further suspected example:

https://www.whatdotheyknow.com/request/agency_staff_management_and_spen#incoming-1673828

RichardTaylor · 2020-11-17T12:36:43Z

Further suspected example:

https://www.whatdotheyknow.com/request/use_of_agency_social_workers_894#incoming-1636554

mdeuk · 2020-11-20T08:05:41Z

We've another instance of this issue at:
https://www.whatdotheyknow.com/request/secure_email_contracts_7#incoming-1678252

RichardTaylor · 2020-11-21T13:34:04Z

Apparent further case:
https://www.whatdotheyknow.com/request/investments_and_divestment_progr_2#incoming-1672202

Once fixed it would be good to try and identify as many cases as possible systematically.

RichardTaylor · 2020-11-24T12:00:40Z

Further example:

https://www.whatdotheyknow.com/request/doubts_about_safety_of_lamppost

RichardTaylor · 2020-11-24T12:26:01Z

Possible other example, not clear if this is the same issue or not:

https://www.whatdotheyknow.com/request/ipc_grants_11#incoming-1680749

mdeuk · 2020-11-27T11:25:46Z

A further two examples in today's monitoring:

https://www.whatdotheyknow.com/request/social_services_qualified_and_un?nocache=incoming-1682090#incoming-1682090

https://www.whatdotheyknow.com/request/per_capita_bin_allocation_and_re?nocache=incoming-1660943#incoming-1660943

RichardTaylor · 2020-11-30T11:25:34Z

Another case

https://www.whatdotheyknow.com/request/intelligent_automated_systems_in_7#incoming-1683750

garethrees · 2020-11-30T11:49:12Z

Just to note that we do have a probable fix for this (#5936). The hold up at the moment is that I wanted to add a test that demonstrates the problem (#5970) and that the new code fixes it. I also want to make sure we really understand what's causing the problem we're seeing here, as I'm reluctant to push a fix when it may may turn out to cause other problems somewhere else because we've jumped to conclusions.

RichardTaylor · 2020-12-01T10:17:57Z

Three further likely cases:

https://www.whatdotheyknow.com/request/management_of_physical_documents_21?nocache=incoming-1684069#incoming-1684069

https://www.whatdotheyknow.com/request/professional_services_procuremen_316?nocache=incoming-1671852#incoming-1671852

https://www.whatdotheyknow.com/request/supply_spend_20172022?nocache=incoming-1684055#incoming-1684055

garethrees · 2020-12-01T14:28:42Z

Let's recap.

Since upgrading to Mail 2.7.1 (c581d83) we've noticed a regression in handling some PDFs. Instead of being extracted to an FoiAttachment, the base64 encoded string gets rendered into the main part of the correspondence. We've isolated the cause to mikel/mail@21222e1.

21222e1b48f08f6f848d26643d6b6fb1d873d18c is the first bad commit
commit 21222e1b48f08f6f848d26643d6b6fb1d873d18c
Author: Jeremy Daer <jeremydaer@gmail.com>
Date:   Mon May 22 10:01:00 2017 -0700

    Eliminate attachment corruption caused by line ending conversions

    * Omit initial CRLF linefeed conversion since CRLF are required newline
      separators. We shouldn't need to convert bare CR or LF. Update our
      fixture emails to use CRLF throughout. Closes #609. Fixes #408.

    * Drop quoted-printable CRLF conversion. This was introduced to
      harmonize with Ruby's \n-based line endings. But this breaks Q-P
      encoding with binary data. It's not *meant* for binary data, but we
      don't yet take adequate measures to use base64 for these cases.
      Reverts #496. Fixes #1010.

    Closes #1113

Since the introduction of Action Mailbox Rails parses raw emails through Mail through Mail::Utilities.binary_unsafe_to_crlf. This appears to have been introduced in 627bbd34 with no specific explanation for the use of the helper.

This (fixed) issue in Redmine appears to describe a similar problem. There are some notes suggesting that for some users the LANG environment variable, or Postfix configuration, could be causing the issue.

I've downloaded a RawEmail from WhatDoTheyKnow where the problem was exhibited, anonymised it and created a spec (d325c8b) that demonstrates the problem. We can see that when the fixture has unix line endings, with the current code the spec fails in the same way the production code fails. Parsing the raw email through Mail::Utilities.binary_unsafe_to_crlf as Rails does (1dfb71f) makes the spec pass.

I don't fully understand whether we're doing the right thing at the infrastructure level at the moment. Are we applying any line-ending conversions during the Exim pipe process? What should we be expecting the application to ingest, and is that what's happening?

If we're confident that we're not mangling the email before entry into the app, then I'd like to understand what it is about these emails is "wrong". Is it that they don't follow the RFC, but are generated in this way due to a poor email client, or is it that there's an upstream issue in Mail that we should consider raising?

It feels like 1dfb71f is probably the right thing to do. If we were building Alaveteli from scratch today we'd likely use Action Mailbox, and therefore go through Mail::Utilities.binary_unsafe_to_crlf.

I want to be sure that:

By applying 1dfb71f we're not going to cause some other bug because we don't really understand what's happening.
We're not patching over something that we've got wrong lower down the stack in Exim or on the OS.

sagepe · 2020-12-01T15:34:07Z

Regarding Exim: I don't think we're doing anything out of the ordinary on our infra at present. The emails are received and forwarded on without explicitly changing formats.

There's a description in the Exim docs covering how line endings are handled during mail processing.

There is an option to force CRLF when using a pipe transport rather than just LF. I'm not sure this is the right thing to do however; some attachments must be coming through correctly and the default is not to do this, so I'd want to understand why it was necessary before turning it on.

It's also worth noting that we can't predict necessarily how a given Alaveteli's environment may be configured; it seems sensible to me to have an internal step like the one you mention above using Mail::Utilities.binary_unsafe_to_crlf to ensure mail is subsequently processed correctly - we could tweak the email configuration used on WDTK infrastructure, but that might not help a re-user who may have less control over the format of their mail.

dracos · 2020-12-01T16:39:42Z

What happens then is due to the presence of non-ASCII characters somewhere in the raw email text.

lib/mail/message.rb calls Mail::Body.new(@body_raw)
Body initialize does @raw_source = ::Mail::Utilities.to_crlf(string.to_s)
to_crlf only calls binary_unsafe_to_crlf if safe_for_line_ending_conversion? which is (if ruby<1.9) if string is ASCII only, or (ruby>=1.9) if string is valid encoding or (BINARY+ASCII only)
the string is BINARY (fair enough, as encoding isn't known until headers are processed).

The part splitter uses a regexp(?!) which only spots \r\n, and thus will fail on any LF-ended non-ASCII email only, leaving the whole email as one non-MIME part and thus outputting it all including boundaries, as you see. If the email is ASCII only, line endings are being converted back to CRLF and it works fine. I agree, unilaterally forcing a conversion to crlf, given the mail code parsing mandates CRLF and isn't always doing the conversion, is the right thing to do.

@garethrees

Since upgrading to Mail 2.7.1 (c581d83) we've noticed a regression in handling some PDFs. Instead of being extracted to an FoiAttachment, the base64 encoded string gets rendered into the main part of the correspondence. We've isolated the cause to mikel/mail@21222e1. Since the introduction of Action Mailbox Rails parses raw emails through Mail through `Mail::Utilities.binary_unsafe_to_crlf` [1]. This appears to have been introduced in 627bbd34 with no specific explanation for the use of the helper. The issue appears with the presence of non-ASCII characters somewhere in the raw email text of multipart messages. Mail uses a Regexp to split parts, which only spots \r\n, and thus will fail on any LF-ended non-ASCII email leaving the whole email as one non-MIME part. Mail does convert the raw email to CRLF, but only when the String has a valid encoding, or if the String is encoded as `Encoding::BINARY`, then only if the String contains only ASCII characters. Alaveteli forces all raw emails to `Encoding::BINARY` as of 0f9d143, so `Mail::Utilities.safe_for_line_ending_conversion?` returns `false`. If the email is ASCII only, line endings are being converted back to CRLF and it works fine. Unilaterally forcing a conversion to CRLF, given the mail code parsing mandates CRLF and isn't always doing the conversion, seems like the right thing to do. As well as fixing the above issue, some cleanup is included in this commit: * Rely on `mail_from_raw_email` converting to CRLF rather than in `email_helpers` to ensure consistent behaviour. * Reset all email fixture files back to Unix line endings for consistency. Fixes #5905. * Initial problem description, specs and cleanup contributed by @garethrees * Bugfix contributed by @gbp * Conversion issues in Mail identified by @dracos [1] https://github.com/rails/rails/blob/a5b2fff/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5

@garethrees

Since upgrading to Mail 2.7.1 (c581d83) we've noticed a regression in handling some PDFs. Instead of being extracted to an FoiAttachment, the base64 encoded string gets rendered into the main part of the correspondence. We've isolated the cause to mikel/mail@21222e1. Since the introduction of Action Mailbox Rails parses raw emails through Mail through `Mail::Utilities.binary_unsafe_to_crlf` [1]. This appears to have been introduced in 627bbd34 with no specific explanation for the use of the helper. The issue appears with the presence of non-ASCII characters somewhere in the raw email text of multipart messages. Mail uses a Regexp to split parts, which only spots \r\n, and thus will fail on any LF-ended non-ASCII email leaving the whole email as one non-MIME part. Mail does convert the raw email to CRLF, but only when the String has a valid encoding, or if the String is encoded as `Encoding::BINARY`, then only if the String contains only ASCII characters. Alaveteli forces all raw emails to `Encoding::BINARY` as of 0f9d143, so `Mail::Utilities.safe_for_line_ending_conversion?` returns `false`. If the email is ASCII only, line endings are being converted back to CRLF and it works fine. Unilaterally forcing a conversion to CRLF, given the mail code parsing mandates CRLF and isn't always doing the conversion, seems like the right thing to do. As well as fixing the above issue, some cleanup is included in this commit: * Rely on `mail_from_raw_email` converting to CRLF rather than in `email_helpers` to ensure consistent behaviour. * Reset all email fixture files back to Unix line endings for consistency. Fixes #5905. * Initial problem description, specs and cleanup contributed by @garethrees * Bugfix contributed by @gbp * Conversion issues in Mail identified by @dracos [1] https://github.com/rails/rails/blob/a5b2fff/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5

RichardTaylor · 2020-12-04T12:21:34Z

Further likely example

https://www.whatdotheyknow.com/request/supply_spend_20172022#incoming-1684055

RichardTaylor · 2020-12-04T12:54:00Z

Further likely cases

https://www.whatdotheyknow.com/request/professional_services_procuremen_70?nocache=incoming-1686240#incoming-1686240

and

https://www.whatdotheyknow.com/request/it_detail_request_36#incoming-1686205

Re-parses any incoming messages that look like they were affected by a regression in Mail [1] that caused some multipart emails to be parsed as a single part. The date to look for messages defaults to the commit date [2] that we introduced the gem version that causes the regression, but is tunable for re-users who may upgrade much later than us. [1] #5905 [2] c581d83285

Re-parses any incoming messages that look like they were affected by a regression in Mail [1] that caused some multipart emails to be parsed as a single part. To identify affected messages we look for 'Content-Type' in the cached body text, since that should almost never be seen there. It doesn't really matter if we re-parse a few false-positives, since caches will get generated on next view. The date to look for messages defaults to the commit date [2] that we introduced the gem version that causes the regression, but is tunable for re-users who may upgrade much later than us. [1] #5905 [2] c581d83285

garethrees · 2020-12-09T13:21:12Z

I've now run the task in 7ae31c5 and all of the above messages are correctly parsed.

Possible other example, not clear if this is the same issue or not:

https://www.whatdotheyknow.com/request/ipc_grants_11#incoming-1680749

This one hasn't changed, so I expect a different, edge-case issue.

mdeuk · 2020-12-09T13:33:46Z

I've now run the task in 7ae31c5 and all of the above messages are correctly parsed.

Possible other example, not clear if this is the same issue or not:
https://www.whatdotheyknow.com/request/ipc_grants_11#incoming-1680749

This one hasn't changed, so I expect a different, edge-case issue.

I think this specific one is the message itself, rather than an Alaveteli (or upstream) bug.

The raw message doesn't render correctly in an email client either, and the source suggests that it was munged before it arrived.

@garethrees

Since upgrading to Mail 2.7.1 (c581d83) we've noticed a regression in handling some PDFs. Instead of being extracted to an FoiAttachment, the base64 encoded string gets rendered into the main part of the correspondence. We've isolated the cause to mikel/mail@21222e1. Since the introduction of Action Mailbox Rails parses raw emails through Mail through `Mail::Utilities.binary_unsafe_to_crlf` [1]. This appears to have been introduced in 627bbd34 with no specific explanation for the use of the helper. The issue appears with the presence of non-ASCII characters somewhere in the raw email text of multipart messages. Mail uses a Regexp to split parts, which only spots \r\n, and thus will fail on any LF-ended non-ASCII email leaving the whole email as one non-MIME part. Mail does convert the raw email to CRLF, but only when the String has a valid encoding, or if the String is encoded as `Encoding::BINARY`, then only if the String contains only ASCII characters. Alaveteli forces all raw emails to `Encoding::BINARY` as of 0f9d143, so `Mail::Utilities.safe_for_line_ending_conversion?` returns `false`. If the email is ASCII only, line endings are being converted back to CRLF and it works fine. Unilaterally forcing a conversion to CRLF, given the mail code parsing mandates CRLF and isn't always doing the conversion, seems like the right thing to do. As well as fixing the above issue, some cleanup is included in this commit: * Rely on `mail_from_raw_email` converting to CRLF rather than in `email_helpers` to ensure consistent behaviour. * Reset all email fixture files back to Unix line endings for consistency. Fixes #5905. * Initial problem description, specs and cleanup contributed by @garethrees * Bugfix contributed by @gbp * Conversion issues in Mail identified by @dracos [1] https://github.com/rails/rails/blob/a5b2fff/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5

Re-parses any incoming messages that look like they were affected by a regression in Mail [1] that caused some multipart emails to be parsed as a single part. To identify affected messages we look for 'Content-Type' in the cached body text, since that should almost never be seen there. It doesn't really matter if we re-parse a few false-positives, since caches will get generated on next view. The date to look for messages defaults to the commit date [2] that we introduced the gem version that causes the regression, but is tunable for re-users who may upgrade much later than us. [1] #5905 [2] c581d83285

garethrees added x:uk bug Breaks expected functionality f:request-analysis labels Oct 1, 2020

mdeuk mentioned this issue Oct 10, 2020

Issue with email rendering for some emails from a specific body #5917

Closed

RichardTaylor mentioned this issue Oct 23, 2020

Images corrupted in PDF in response #5932

Open

mdeuk mentioned this issue Oct 27, 2020

poor rendering of some incoming emails #23

Open

gbp mentioned this issue Oct 27, 2020

[#5905] Force mail data to have CRLF line endings #5936

Merged

RichardTaylor mentioned this issue Nov 30, 2020

Add function to upload files via admin interface #1155

Open

mysociety-pusher closed this as completed in 6de3e0c Dec 8, 2020

garethrees mentioned this issue Dec 8, 2020

Add temp:reparse_multipart_incoming_with_unicode #6005

Merged

PDF attachment corruption regression #5905

PDF attachment corruption regression #5905

Comments

garethrees commented Oct 1, 2020 • edited

garethrees commented Oct 1, 2020

garethrees commented Oct 1, 2020

garethrees commented Oct 1, 2020

garethrees commented Oct 1, 2020

Before Diff

Apply Diff

After diff

garethrees commented Oct 1, 2020 • edited

garethrees commented Oct 2, 2020

garethrees commented Oct 7, 2020 • edited

sagepe commented Oct 14, 2020

garethrees commented Oct 15, 2020

RichardTaylor commented Oct 19, 2020

gbp commented Oct 27, 2020 • edited

gbp commented Oct 27, 2020

RichardTaylor commented Oct 28, 2020

RichardTaylor commented Oct 28, 2020

gbp commented Oct 28, 2020

RichardTaylor commented Oct 31, 2020

RichardTaylor commented Nov 7, 2020

mdeuk commented Nov 7, 2020

RichardTaylor commented Nov 12, 2020

garethrees commented Nov 12, 2020

gbp commented Nov 12, 2020

RichardTaylor commented Nov 13, 2020

RichardTaylor commented Nov 17, 2020

mdeuk commented Nov 20, 2020

RichardTaylor commented Nov 21, 2020

RichardTaylor commented Nov 24, 2020

RichardTaylor commented Nov 24, 2020

mdeuk commented Nov 27, 2020

RichardTaylor commented Nov 30, 2020

garethrees commented Nov 30, 2020

RichardTaylor commented Dec 1, 2020

garethrees commented Dec 1, 2020 • edited

sagepe commented Dec 1, 2020

dracos commented Dec 1, 2020 • edited by garethrees

RichardTaylor commented Dec 4, 2020

RichardTaylor commented Dec 4, 2020

garethrees commented Dec 9, 2020

mdeuk commented Dec 9, 2020

garethrees commented Oct 1, 2020 •

edited

garethrees commented Oct 1, 2020 •

edited

garethrees commented Oct 7, 2020 •

edited

gbp commented Oct 27, 2020 •

edited

garethrees commented Dec 1, 2020 •

edited

dracos commented Dec 1, 2020 •

edited by garethrees