-
-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PDF attachment corruption regression #5905
Comments
#!/bin/bash
bundle --quiet
bundle exec rails runner "exit 1 unless MailHandler.get_attachment_attributes(RawEmail.last.mail!).size == 2"
Doesn't surprise me that this is caused by c581d83, but why is unclear. |
Definitely caused by a change in Mail: [3](main)> Mail::VERSION.version
# => "2.6.6"
[4](main)> Mail.read('1625443.eml').parts.map(&:content_type)
# => ["text/plain; charset=utf-8; format=flowed", "application/pdf; name=\"20200819 - Aerial Images.pdf\""]
[3](main)> Mail::VERSION.version
# => "2.7.1"
[4](main)> Mail.read('1625443.eml').parts.map(&:content_type)
# => [] Need to establish whether it's a regression or whether we're doing the wrong thing with it. |
$ git bisect start
$ git bisect good 0eb3691342d3cbcbbf477bfc105337c65271992a # 2.7.1
$ git bisect bad 7c43c84c16f017e0ff5e5c9962f6a1d842301ee3 # "Bump to 2.7.0.edge. See 2-6-stable branch for latest 2.6.x."
$ git bisect run ./mail-bisect
running ./mail-bisect
# …
21222e1b48f08f6f848d26643d6b6fb1d873d18c is the first bad commit
commit 21222e1b48f08f6f848d26643d6b6fb1d873d18c
Author: Jeremy Daer <jeremydaer@gmail.com>
Date: Mon May 22 10:01:00 2017 -0700
Eliminate attachment corruption caused by line ending conversions
* Omit initial CRLF linefeed conversion since CRLF are required newline
separators. We shouldn't need to convert bare CR or LF. Update our
fixture emails to use CRLF throughout. Closes #609. Fixes #408.
* Drop quoted-printable CRLF conversion. This was introduced to
harmonize with Ruby's \n-based line endings. But this breaks Q-P
encoding with binary data. It's not *meant* for binary data, but we
don't yet take adequate measures to use base64 for these cases.
Reverts #496. Fixes #1010.
Closes #1113
:100644 100644 2e9167261111beac1a3a791327d83e538c84ac33 0508d51f20614f64cd472986e081b25b5ac4e3a7 M CHANGELOG.rdoc
:040000 040000 f0ae85fe4c5510b18408fa51dfb60e6ff0d7b138 af9137afa146367afa370189de463041a98aec55 M lib
:040000 040000 ebef4af0a90b414f918750b0f049a8db14c765a1 7ddbc49dac5921a273972e2f6b8d2910b2a7e0b2 M spec
bisect run success |
Before Diff[1](main)> require 'bundler/setup'
NOTE: Gem::Specification#has_rdoc= is deprecated with no replacement. It will be removed on or after 2018-12-01.
Gem::Specification#has_rdoc= called from /vagrant/mail.gemspec:13.
=> true
[2](main)> require './lib/mail'
=> true
[3](main)> Mail.read('1625443.eml').parts.map(&:content_type)
WARNING: More than 1000 header fields; only using the first 1000 and ignoring the rest
WARNING: Ignoring unparsable header "From clerk@moultonpc.org.uk Thu Aug 20 16:18:14 2020": invalid header name syntax: "From clerk@moultonpc.org.uk Thu Aug 20 16"
=> [] Apply Diffdiff --git a/lib/mail/attachments_list.rb b/lib/mail/attachments_list.rb
index ca7d0d6a..5f5ac2fb 100644
--- a/lib/mail/attachments_list.rb
+++ b/lib/mail/attachments_list.rb
@@ -7,7 +7,7 @@ module Mail
@content_disposition_type = 'attachment'
parts_list.map { |p|
if p.mime_type == 'message/rfc822'
- Mail.new(p.body.encoded).attachments
+ Mail.new(p.body).attachments
elsif p.parts.empty?
p if p.attachment?
else
diff --git a/lib/mail/encodings/quoted_printable.rb b/lib/mail/encodings/quoted_printable.rb
index 6a78ffc3..6c0438a5 100644
--- a/lib/mail/encodings/quoted_printable.rb
+++ b/lib/mail/encodings/quoted_printable.rb
@@ -16,11 +16,11 @@ module Mail
# Decode the string from Quoted-Printable. Cope with hard line breaks
# that were incorrectly encoded as hex instead of literal CRLF.
def self.decode(str)
- str.gsub(/(?:=0D=0A|=0D|=0A)\r\n/, "\r\n").unpack("M*").first
+ ::Mail::Utilities.to_lf str.gsub(/(?:=0D=0A|=0D|=0A)\r\n/, "\r\n").unpack("M*").first
end
def self.encode(str)
- [str].pack("M")
+ ::Mail::Utilities.to_crlf([::Mail::Utilities.to_lf(str)].pack("M"))
end
def self.cost(str)
diff --git a/lib/mail/message.rb b/lib/mail/message.rb
index 4398885e..32d67b11 100644
--- a/lib/mail/message.rb
+++ b/lib/mail/message.rb
@@ -2012,7 +2012,7 @@ module Mail
def raw_source=(value)
value = value.dup.force_encoding(Encoding::BINARY) if RUBY_VERSION >= "1.9.1"
- @raw_source = value
+ @raw_source = ::Mail::Utilities.to_crlf(value)
end
# see comments to body=. We take data and process it lazily After diff[1](main)> require 'bundler/setup'
NOTE: Gem::Specification#has_rdoc= is deprecated with no replacement. It will be removed on or after 2018-12-01.
Gem::Specification#has_rdoc= called from /vagrant/mail.gemspec:13.
=> true
[2](main)> require './lib/mail'
=> true
[3](main)> Mail.read('1625443.eml').parts.map(&:content_type)
=> ["text/plain; charset=utf-8; format=flowed", "application/pdf; name=\"20200819 - Aerial Images.pdf\""] |
Pretty sure mikel/mail@21222e1 is causing the problem. |
Applying only this hunk of the diff results in the attachments being returned correctly: diff --git a/lib/mail/message.rb b/lib/mail/message.rb
index 4398885e..32d67b11 100644
--- a/lib/mail/message.rb
+++ b/lib/mail/message.rb
@@ -2012,7 +2012,7 @@ module Mail
def raw_source=(value)
value = value.dup.force_encoding(Encoding::BINARY) if RUBY_VERSION >= "1.9.1"
- @raw_source = value
+ @raw_source = ::Mail::Utilities.to_crlf(value)
end
# see comments to body=. We take data and process it lazily [5](main)> Mail.read('1625443.eml').parts.map(&:content_type)
# => ["text/plain; charset=utf-8; format=flowed", "application/pdf; name=\"20200819 - Aerial Images.pdf\""] |
Possibly related: mikel/mail#1329? |
Well, our failing email doesn't have a
This could be worth verifying for the failing and successful examples.
Are our failing attachments inline and missing a filename? The PR seems to specifically target images, so we might need to do something different for the PDF cases we're seeing fail. |
Inspired by https://github.com/rails/rails/blob/master/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5 If we do: diff --git a/lib/mail_handler/backends/mail_backend.rb b/lib/mail_handler/backends/mail_backend.rb
index e8de3ffa9c..970ba1efd4 100644
--- a/lib/mail_handler/backends/mail_backend.rb
+++ b/lib/mail_handler/backends/mail_backend.rb
@@ -43,7 +43,7 @@ def backend
def mail_from_raw_email(data)
data = data.force_encoding(Encoding::BINARY) if data.is_a? String
- Mail.new(data)
+ Mail.new Mail::Utilities.binary_unsafe_to_crlf(data)
end
# Extracts all attachments from the given TNEF file as a Mail object and then:
Seems to means the emails parts get extracted correctly. |
This is basically waht we already do in our specs as editors automattically convert CRLF into LF. |
Fixes issue with extracting out mail parts happening since upgrading the mail gem to 2.7. Fixes #5905 Inspired by ActionMailbox [1] [1] https://github.com/rails/rails/blob/a5b2fff/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5
Fixes issue with extracting out mail parts happening since upgrading the mail gem to 2.7. Fixes #5905 Inspired by ActionMailbox [1] [1] https://github.com/rails/rails/blob/a5b2fff/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5
Another example https://www.whatdotheyknow.com/request/policies_329#incoming-1665717 Getting this fix up on WhatDoTheyKnow will prevent requests to extract and upload documents from reformatted responses. |
Will the fix have a retrospective effect on messages that have already come in? |
It won't fix them automatically but I'm intenteding to run some commands in the app console to fix |
Possibly another case: https://www.whatdotheyknow.com/request/professional_services_procuremen_316#incoming-1667331 |
Possibly related issue, a further message on the thread mentioned above not involving a PDF though: https://www.whatdotheyknow.com/request/professional_services_procuremen_316#incoming-1670997 |
The authority in this case (Adur & Worthing Councils) is using the same CRM software (Mats / NetCall) as the rest of the instances here, so it does seem reasonable to conclude that there is a link between the two. |
Another likely example https://www.whatdotheyknow.com/request/covid_19_in_sheltered_housing#incoming-1673474 |
Ha, well this is strange: Mail.read('tmp/1625443.eml').parts.map(&:content_type)
# => []
Mail.new(File.read('tmp/1625443.eml')).parts.map(&:content_type)
# => ["text/plain; charset=utf-8; format=flowed", "application/pdf; name=\"20200819 - Aerial Images.pdf\""] |
Mail.read('tmp/1625443.eml').parts.map(&:content_type)
# => [] is equvailent to data = File.open('tmp/1625443.eml', 'rb') { |f| f.read }
Mail.open(data).parts.map(&:content_type)
# => [] and what we have in data = File.read('tmp/1625443.eml').force_encoding(Encoding::BINARY)
Mail.open(data).parts.map(&:content_type)
# => [] All are fixed with the proposed CRLF change: data = File.read('tmp/1625443.eml').force_encoding(Encoding::BINARY)
data = Mail::Utilities.binary_unsafe_to_crlf(data.to_s)
Mail.open(data).parts.map(&:content_type)
# => ["text/plain; charset=utf-8; format=flowed", "application/pdf; name=\"20200819 - Aerial Images.pdf\""] |
Further suspected example: https://www.whatdotheyknow.com/request/agency_staff_management_and_spen#incoming-1673828 |
Further suspected example: https://www.whatdotheyknow.com/request/use_of_agency_social_workers_894#incoming-1636554 |
We've another instance of this issue at: |
Apparent further case: Once fixed it would be good to try and identify as many cases as possible systematically. |
Possible other example, not clear if this is the same issue or not: https://www.whatdotheyknow.com/request/ipc_grants_11#incoming-1680749 |
Just to note that we do have a probable fix for this (#5936). The hold up at the moment is that I wanted to add a test that demonstrates the problem (#5970) and that the new code fixes it. I also want to make sure we really understand what's causing the problem we're seeing here, as I'm reluctant to push a fix when it may may turn out to cause other problems somewhere else because we've jumped to conclusions. |
Three further likely cases: |
Let's recap. Since upgrading to Mail 2.7.1 (c581d83) we've noticed a regression in handling some PDFs. Instead of being extracted to an
Since the introduction of Action Mailbox Rails parses raw emails through Mail through This (fixed) issue in Redmine appears to describe a similar problem. There are some notes suggesting that for some users the I've downloaded a I don't fully understand whether we're doing the right thing at the infrastructure level at the moment. Are we applying any line-ending conversions during the Exim pipe process? What should we be expecting the application to ingest, and is that what's happening? If we're confident that we're not mangling the email before entry into the app, then I'd like to understand what it is about these emails is "wrong". Is it that they don't follow the RFC, but are generated in this way due to a poor email client, or is it that there's an upstream issue in Mail that we should consider raising? It feels like 1dfb71f is probably the right thing to do. If we were building Alaveteli from scratch today we'd likely use Action Mailbox, and therefore go through I want to be sure that:
|
Regarding Exim: I don't think we're doing anything out of the ordinary on our infra at present. The emails are received and forwarded on without explicitly changing formats. There's a description in the Exim docs covering how line endings are handled during mail processing. There is an option to force CRLF when using a pipe transport rather than just LF. I'm not sure this is the right thing to do however; some attachments must be coming through correctly and the default is not to do this, so I'd want to understand why it was necessary before turning it on. It's also worth noting that we can't predict necessarily how a given Alaveteli's environment may be configured; it seems sensible to me to have an internal step like the one you mention above using |
What happens then is due to the presence of non-ASCII characters somewhere in the raw email text.
The part splitter uses a regexp(?!) which only spots |
Since upgrading to Mail 2.7.1 (c581d83) we've noticed a regression in handling some PDFs. Instead of being extracted to an FoiAttachment, the base64 encoded string gets rendered into the main part of the correspondence. We've isolated the cause to mikel/mail@21222e1. Since the introduction of Action Mailbox Rails parses raw emails through Mail through `Mail::Utilities.binary_unsafe_to_crlf` [1]. This appears to have been introduced in 627bbd34 with no specific explanation for the use of the helper. The issue appears with the presence of non-ASCII characters somewhere in the raw email text of multipart messages. Mail uses a Regexp to split parts, which only spots \r\n, and thus will fail on any LF-ended non-ASCII email leaving the whole email as one non-MIME part. Mail does convert the raw email to CRLF, but only when the String has a valid encoding, or if the String is encoded as `Encoding::BINARY`, then only if the String contains only ASCII characters. Alaveteli forces all raw emails to `Encoding::BINARY` as of 0f9d143, so `Mail::Utilities.safe_for_line_ending_conversion?` returns `false`. If the email is ASCII only, line endings are being converted back to CRLF and it works fine. Unilaterally forcing a conversion to CRLF, given the mail code parsing mandates CRLF and isn't always doing the conversion, seems like the right thing to do. As well as fixing the above issue, some cleanup is included in this commit: * Rely on `mail_from_raw_email` converting to CRLF rather than in `email_helpers` to ensure consistent behaviour. * Reset all email fixture files back to Unix line endings for consistency. Fixes #5905. * Initial problem description, specs and cleanup contributed by @garethrees * Bugfix contributed by @gbp * Conversion issues in Mail identified by @dracos [1] https://github.com/rails/rails/blob/a5b2fff/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5
Since upgrading to Mail 2.7.1 (c581d83) we've noticed a regression in handling some PDFs. Instead of being extracted to an FoiAttachment, the base64 encoded string gets rendered into the main part of the correspondence. We've isolated the cause to mikel/mail@21222e1. Since the introduction of Action Mailbox Rails parses raw emails through Mail through `Mail::Utilities.binary_unsafe_to_crlf` [1]. This appears to have been introduced in 627bbd34 with no specific explanation for the use of the helper. The issue appears with the presence of non-ASCII characters somewhere in the raw email text of multipart messages. Mail uses a Regexp to split parts, which only spots \r\n, and thus will fail on any LF-ended non-ASCII email leaving the whole email as one non-MIME part. Mail does convert the raw email to CRLF, but only when the String has a valid encoding, or if the String is encoded as `Encoding::BINARY`, then only if the String contains only ASCII characters. Alaveteli forces all raw emails to `Encoding::BINARY` as of 0f9d143, so `Mail::Utilities.safe_for_line_ending_conversion?` returns `false`. If the email is ASCII only, line endings are being converted back to CRLF and it works fine. Unilaterally forcing a conversion to CRLF, given the mail code parsing mandates CRLF and isn't always doing the conversion, seems like the right thing to do. As well as fixing the above issue, some cleanup is included in this commit: * Rely on `mail_from_raw_email` converting to CRLF rather than in `email_helpers` to ensure consistent behaviour. * Reset all email fixture files back to Unix line endings for consistency. Fixes #5905. * Initial problem description, specs and cleanup contributed by @garethrees * Bugfix contributed by @gbp * Conversion issues in Mail identified by @dracos [1] https://github.com/rails/rails/blob/a5b2fff/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5
Further likely example https://www.whatdotheyknow.com/request/supply_spend_20172022#incoming-1684055 |
Re-parses any incoming messages that look like they were affected by a regression in Mail [1] that caused some multipart emails to be parsed as a single part. The date to look for messages defaults to the commit date [2] that we introduced the gem version that causes the regression, but is tunable for re-users who may upgrade much later than us. [1] #5905 [2] c581d83285
Re-parses any incoming messages that look like they were affected by a regression in Mail [1] that caused some multipart emails to be parsed as a single part. The date to look for messages defaults to the commit date [2] that we introduced the gem version that causes the regression, but is tunable for re-users who may upgrade much later than us. [1] #5905 [2] c581d83285
Re-parses any incoming messages that look like they were affected by a regression in Mail [1] that caused some multipart emails to be parsed as a single part. To identify affected messages we look for 'Content-Type' in the cached body text, since that should almost never be seen there. It doesn't really matter if we re-parse a few false-positives, since caches will get generated on next view. The date to look for messages defaults to the commit date [2] that we introduced the gem version that causes the regression, but is tunable for re-users who may upgrade much later than us. [1] #5905 [2] c581d83285
Re-parses any incoming messages that look like they were affected by a regression in Mail [1] that caused some multipart emails to be parsed as a single part. To identify affected messages we look for 'Content-Type' in the cached body text, since that should almost never be seen there. It doesn't really matter if we re-parse a few false-positives, since caches will get generated on next view. The date to look for messages defaults to the commit date [2] that we introduced the gem version that causes the regression, but is tunable for re-users who may upgrade much later than us. [1] #5905 [2] c581d83285
I've now run the task in 7ae31c5 and all of the above messages are correctly parsed.
This one hasn't changed, so I expect a different, edge-case issue. |
I think this specific one is the message itself, rather than an Alaveteli (or upstream) bug. The raw message doesn't render correctly in an email client either, and the source suggests that it was munged before it arrived. |
Since upgrading to Mail 2.7.1 (c581d83) we've noticed a regression in handling some PDFs. Instead of being extracted to an FoiAttachment, the base64 encoded string gets rendered into the main part of the correspondence. We've isolated the cause to mikel/mail@21222e1. Since the introduction of Action Mailbox Rails parses raw emails through Mail through `Mail::Utilities.binary_unsafe_to_crlf` [1]. This appears to have been introduced in 627bbd34 with no specific explanation for the use of the helper. The issue appears with the presence of non-ASCII characters somewhere in the raw email text of multipart messages. Mail uses a Regexp to split parts, which only spots \r\n, and thus will fail on any LF-ended non-ASCII email leaving the whole email as one non-MIME part. Mail does convert the raw email to CRLF, but only when the String has a valid encoding, or if the String is encoded as `Encoding::BINARY`, then only if the String contains only ASCII characters. Alaveteli forces all raw emails to `Encoding::BINARY` as of 0f9d143, so `Mail::Utilities.safe_for_line_ending_conversion?` returns `false`. If the email is ASCII only, line endings are being converted back to CRLF and it works fine. Unilaterally forcing a conversion to CRLF, given the mail code parsing mandates CRLF and isn't always doing the conversion, seems like the right thing to do. As well as fixing the above issue, some cleanup is included in this commit: * Rely on `mail_from_raw_email` converting to CRLF rather than in `email_helpers` to ensure consistent behaviour. * Reset all email fixture files back to Unix line endings for consistency. Fixes #5905. * Initial problem description, specs and cleanup contributed by @garethrees * Bugfix contributed by @gbp * Conversion issues in Mail identified by @dracos [1] https://github.com/rails/rails/blob/a5b2fff/actionmailbox/lib/action_mailbox/mail_ext/from_source.rb#L5
Re-parses any incoming messages that look like they were affected by a regression in Mail [1] that caused some multipart emails to be parsed as a single part. To identify affected messages we look for 'Content-Type' in the cached body text, since that should almost never be seen there. It doesn't really matter if we re-parse a few false-positives, since caches will get generated on next view. The date to look for messages defaults to the commit date [2] that we introduced the gem version that causes the regression, but is tunable for re-users who may upgrade much later than us. [1] #5905 [2] c581d83285
We recently had a report of a corrupted/malformed PDF attachment.
Looks like this is a regression. I imported the raw email locally and get different attachment results between 0.37 and
develop
:The text was updated successfully, but these errors were encountered: