Broken B-encoding? #531

Closed
benhamill opened this Issue Apr 1, 2013 · 4 comments

Projects

None yet

2 participants

@benhamill

I'm working on a system that parses messages that works in tandem with some php code. We found a subject line that wasn't being decoded the same in the two, so I started digging and I wrote this test:

it "should decode an encoded string with lots of diacritical vowels" do
  # string = "=?UTF-8?B?Rlc6IMOuw7HDrsOqIMOkw6nDscOow6XDuMOpLcOjw6XDoyDDocOvIMOiw6U=?= =?UTF-8?B?w7jDqcOlw68gw67DusO4w6DDqcOpw68gw6zDqMOsw6XDpcOpw6bDqcOkIMOlw64=?= =?UTF-8?B?w6PDocO4IMOyw6wgw6TDtMOsw7nDusOpw7DDoMOpw60=?="
  # string = "=?UTF-8?B?Rlc6IMOuw7HDrsOqIMOkw6nDscOow6XDuMOpLcOjw6XDoyDDocOvIMOiw6XD?= =?UTF-8?B?uMOpw6XDryDDrsO6w7jDoMOpw6nDryDDrMOow6zDpcOlw6nDpsOpw6Qgw6XD?= =?UTF-8?B?rsOjw6HDuCDDssOsIMOkw7TDrMO5w7rDqcOww6DDqcOt?="
  result = "FW: îñîê äéñèåøé-ãåã áï âåøéåï îúøàééï ìèìååéæéä åîãáø òì äôìùúéðàéí"
  string = Mail::Encodings.b_value_encode(result)

  result.force_encoding('UTF-8') if RUBY_VERSION >= '1.9'
  Mail::Encodings.value_decode(string).should eq result
end

The first (commented out) assignment to string was the original subject we saw in the wild. The second (commented out) assignment to string is the intermediate result of the third (not commented out) assignment to string. They fail in different ways, but result is what comes out of the php.

I plan on developing a fix, but wanted to solicit feedback and/or guidance before I got started.

@benhamill

Oh. An interesting piece of data: If you take the first string (the one we'd found in the wild) and break it on ' ', and value_decode each one, it ends up adding up to the expected output.

> string = "=?UTF-8?B?Rlc6IMOuw7HDrsOqIMOkw6nDscOow6XDuMOpLcOjw6XDoyDDocOvIMOiw6U=?= =?UTF-8?B?w7jDqcOlw68gw67DusO4w6DDqcOpw68gw6zDqMOsw6XDpcOpw6bDqcOkIMOlw64=?= =?UTF-8?B?w6PDocO4IMOyw6wgw6TDtMOsw7nDusOpw7DDoMOpw60=?="
=> "=?UTF-8?B?Rlc6IMOuw7HDrsOqIMOkw6nDscOow6XDuMOpLcOjw6XDoyDDocOvIMOiw6U=?= =?UTF-8?B?w7jDqcOlw68gw67DusO4w6DDqcOpw68gw6zDqMOsw6XDpcOpw6bDqcOkIMOlw64=?= =?UTF-8?B?w6PDocO4IMOyw6wgw6TDtMOsw7nDusOpw7DDoMOpw60=?="
> string.split(' ').map { |s| Mail::Encodings.value_decode(s) }.join('') == result
=> true

This is not true of the other value.

> string = "=?UTF-8?B?Rlc6IMOuw7HDrsOqIMOkw6nDscOow6XDuMOpLcOjw6XDoyDDocOvIMOiw6XD?= =?UTF-8?B?uMOpw6XDryDDrsO6w7jDoMOpw6nDryDDrMOow6zDpcOlw6nDpsOpw6Qgw6XD?= =?UTF-8?B?rsOjw6HDuCDDssOsIMOkw7TDrMO5w7rDqcOww6DDqcOt?="
=> "=?UTF-8?B?Rlc6IMOuw7HDrsOqIMOkw6nDscOow6XDuMOpLcOjw6XDoyDDocOvIMOiw6XD?= =?UTF-8?B?uMOpw6XDryDDrsO6w7jDoMOpw6nDryDDrMOow6zDpcOlw6nDpsOpw6Qgw6XD?= =?UTF-8?B?rsOjw6HDuCDDssOsIMOkw7TDrMO5w7rDqcOww6DDqcOt?="
> string.split(' ').map { |s| Mail::Encodings.value_decode(s) }.join('') == result
=> false
> string = Mail::Encodings.b_value_encode(result)
=> "=?UTF-8?B?Rlc6IMOuw7HDrsOqIMOkw6nDscOow6XDuMOpLcOjw6XDoyDDocOvIMOiw6XD?= =?UTF-8?B?uMOpw6XDryDDrsO6w7jDoMOpw6nDryDDrMOow6zDpcOlw6nDpsOpw6Qgw6XD?= =?UTF-8?B?rsOjw6HDuCDDssOsIMOkw7TDrMO5w7rDqcOww6DDqcOt?="
> string.split(' ').map { |s| Mail::Encodings.value_decode(s) }.join('') == result
=> false
@benhamill

Ugh. I wrote a big, long comment and then reloaded the page and lost it. I win at the internet.

First, I found this method on Mail::Encodings:

    def Encodings.collapse_adjacent_encodings(str)
      lines = str.split(/(\?=)\s*(=\?)/).each_slice(2).map(&:join)
      results = []
      previous_encoding = nil

      lines.each do |line|
        encoding = split_value_encoding_from_string(line)

        if encoding == previous_encoding
          line = results.pop + line
          line.gsub!(/\?\=\=\?.+?\?[QqBb]\?/m, '')
        end

        previous_encoding = encoding
        results << line
      end

      results
    end

This seems naive to me because lines will end up being stuff like this:

[
  "=?utf-8?B?welfijwglke?=",
  "=?utf-8?B?wlefkjweflkjfew",
  "?==?",
  "utf-8?B?wekfjhweflkjhwef?="
]

And, even with a fix for how the line is split, I think we can't naively concatenate B-encoded strings with way. The Ruby library, at least, doesn't react well to padding in the middle of a string:

> Base64.encode64('b')
=> "Yg==\n"
> Base64.encode64('en')
=> "ZW4=\n"
> Base64.decode64("Yg==ZW4=\n")
=> "b"
> Base64.decode64("YgZW4===")
=> "b\x06V"

but

> Base64.encode64('ben')
=> "YmVu\n"
> Base64.encode64(' ')
=> "IA==\n"
> Base64.decode64("YmVuIA==\n")
=> "ben "
> Base64.decode64("====YmVu====IA==\n")
=> "ben "
> Base64.decode64("=YmVu=IA=\n")
=> "ben "
> Base64.decode64("=YmVu=IA\n")
=> "ben" # Note: no trailing space.

Collapsing adjacent encodings seems dangerous to me. Does anyone know why this method was written? Or what the fallout of removing it might be? Is it for efficiency? When I removed the call to this method, a bunch of tests started to fail, but I didn't look into it too deeply.

@ConradIrwin
Collaborator

@benhamill This was fixed by b832faf:

@benhamill

Awesome! Thanks a ton!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment