Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Another problem with rfc822 #176

Closed
piernik opened this issue Jan 27, 2018 · 29 comments
Closed

Another problem with rfc822 #176

piernik opened this issue Jan 27, 2018 · 29 comments
Labels

Comments

@piernik
Copy link

piernik commented Jan 27, 2018

I have other problem with rfc822:
To: =?UTF-8?Q?aFakeowska_>>_Agnieszka_fake-Fakeowska_Przegl=c4=85dy_?= =?UTF-8?B?YnVkeW5rw7N3?= <aFakeowska@fake.com.pl> (it's a part of eml file).

Guess problem is >> in name. It's a real life example.
Is there any solution?

@eXorus
Copy link
Member

eXorus commented Jan 30, 2018

Hello,

What is the result expected for this field ?

Your are right the problem is about >> in the name.

According to this https://www.w3.org/Protocols/rfc822/

And 3.3. LEXICAL TOKENS
specials = "(" / ")" / "<" / ">" / "@" ; Must be in quoted-
/ "," / ";" / ":" / "" / <"> ; string, to use

It's not compliant with RFC822.

test <test@example.org> => OK
"test>" <test@example.org> => OK
test> <test@example.org> => KO
test">" <test@example.org> => OK

@eXorus eXorus added the invalid label Jan 30, 2018
@piernik
Copy link
Author

piernik commented Feb 1, 2018

I understand but this is real life example. This is my customer's email. What can I do?

@eXorus
Copy link
Member

eXorus commented Feb 1, 2018

I don't know what I can do for that so I don't know what you can do ...

What is the result expected for you ?

@piernik
Copy link
Author

piernik commented Feb 1, 2018

Right now it's throwing error. And that kind of wrong address cen be not only in "to" field, but also in "copy" and so on. You have method to fetch addresses:

public function getAddresses($name)
    {
        $value = $this->getHeader($name);

        return mailparse_rfc822_parse_addresses($value);
    }

Can't You make after assiging $value a parser that puts " if > sign is before <?

Somthing like:

public function getAddresses($name)
    {
        $value = $this->getHeader($name);
        $re = '/(.+>.+)(<.*@.*>)/';
        $subst = '"$1"$2';

        $value = preg_replace($re, $subst, $value);

        return mailparse_rfc822_parse_addresses($value);
    }

@eXorus
Copy link
Member

eXorus commented Feb 1, 2018

Yes it's possible but I'm not good enough in regex for that. If you have this regex I can include it.

@piernik
Copy link
Author

piernik commented Feb 1, 2018

Ok but I on the other hand don't know what are possible inputs. You posted some here:
est <test@example.org>, "test>" <test@example.org>, test> <test@example.org>, test">" <test@example.org>.
Are there more possibilities?

@piernik
Copy link
Author

piernik commented Feb 1, 2018

BTW for regexp I reccomend: https://regex101.com/

@eXorus
Copy link
Member

eXorus commented Feb 1, 2018

You can find a lot of example here: https://www.w3.org/Protocols/rfc822/#z10

Thanks for the links

@piernik
Copy link
Author

piernik commented Feb 1, 2018

I'm too weak with reg exp also :)

I posted question on stackoverflow https://stackoverflow.com/questions/48563429/regular-expression-for-rfc822-standard

If You want to add something feel free.

@piernik
Copy link
Author

piernik commented Feb 1, 2018

You 've got first anwser there - should work

@eXorus
Copy link
Member

eXorus commented Feb 1, 2018

Great but it doesn't work, I did this:

  public function getAddresses($name)
   {
       $value = $this->getRawHeader($name);
       $value = (is_array($value)) ? $value[0] : $value;
       var_dump($value);
       $value = preg_replace("/\".*?\"(*SKIP)(*FAIL)|(\w+\s[<>@]\s\w+)/", "\"$1\"", $value);
       var_dump($value);
       $addresses = mailparse_rfc822_parse_addresses($value);
       foreach ($addresses as $i => $item) {
           $addresses[$i]['display'] = $this->decodeHeader($item['display']);
       }
       return $addresses;
   }

First output:
=?UTF-8?Q?aFakeowska_>>_Agnieszka_fake-Fakeowska_Przegl=c4=85dy_?= =?UTF-8?B?YnVkeW5rw7N3?= <aFakeowska@fake.com.pl>

Second output:
=?UTF-8?Q?aFakeowska_>>_Agnieszka_fake-Fakeowska_Przegl=c4=85dy_?= =?UTF-8?B?YnVkeW5rw7N3?= <aFakeowska@fake.com.pl>

There is no difference, it's working for him because he did the sanitize after an decoding work but we need to do it before mailparse_rfc822_parse_addresses.

@piernik
Copy link
Author

piernik commented Feb 2, 2018

This string =?UTF-8?Q?aFakeowska_>>_Agnieszka_fake-Fakeowska_Przegl=c4=85dy_?= =?UTF-8?B?YnVkeW5rw7N3?= <aFakeowska@fake.com.pl> is a string in .eml file. But when I'm using Yours setText it's parsed to aFakeowska >> Agnieszka fake-Fakeowska Przeglądy budynków <aFakeowska@fake.com.pl>.
I tweaked that reg exp: https://regex101.com/r/VaA2mm/2 Try it.

@eXorus
Copy link
Member

eXorus commented Feb 2, 2018

setText is only to parse the metadata of the email.

if you are using getHeader('from') with master it's already working:

    public function testGetAddressesWithSpecialChars()
    {
        $file = __DIR__ . '/mails/m0124';
        $Parser = new Parser();
        $Parser->setText(file_get_contents($file));
        $to = $Parser->getHeader('to');
        $this->assertEquals('aFakeowska >> Agnieszka fake-Fakeowska Przeglądy budynków <aFakeowska@fake.com.pl>', $to);
    }

if you are using getAddresses('to') with master it's broken (input is not rfc822 compliant: not in < bracket):

    public function testGetAddressesWithSpecialChars()
    {
        $file = __DIR__ . '/mails/m0124';
        $Parser = new Parser();
        $Parser->setText(file_get_contents($file));
        $to = $Parser->getAddresses('to');
        $this->assertEquals('aFakeowska >> Agnieszka fake-Fakeowska Przeglądy budynków <aFakeowska@fake.com.pl>', $to);
    }

So to fix this issue I need to impact getAddresses but getAddresses take getRawHeader() that is returning =?UTF-8?Q?aFakeowska_>>_Agnieszka_fake-Fakeowska_Przegl=c4=85dy_?= =?UTF-8?B?YnVkeW5rw7N3?= <aFakeowska@fake.com.pl>

So I need a regex for this string to fix the issue before to pass the value to mailparse_rfc822_parse_addresses().

    public function getAddresses($name)
    {
        $value = $this->getRawHeader($name);
        $value = (is_array($value)) ? $value[0] : $value;
        $value = preg_replace("REGEX NEEDED", "\"$1\"", $value);
        $addresses = mailparse_rfc822_parse_addresses($value);
        foreach ($addresses as $i => $item) {
            $addresses[$i]['display'] = $this->decodeHeader($item['display']);
        }
        return $addresses;
    }

@piernik
Copy link
Author

piernik commented Feb 2, 2018

It turned out that I'm using 2.9.5

public function getAddresses($name)
    {
        $value = $this->getHeader($name);

        return mailparse_rfc822_parse_addresses($value);
    }

which gets header and then decodes it - before mailparse_rfc822_parse_addresses.

EDIT:
try with that regexp: ".*?"(*SKIP)(*FAIL)|(.+[<>\/\\(),.;:\[\]@]+.+)(\s<)/ https://regex101.com/r/VaA2mm/5

Seems to work:

public function getAddresses($name)
    {
        $value = $this->getRawHeader($name);
        $value = preg_replace('/".*?"(*SKIP)(*FAIL)|(.+[<>\/\\(),.;:\[\]@]+.+)(\s<)/', '"$1"$2', $value);
        $value = (is_array($value)) ? $value[0] : $value;
        $addresses = mailparse_rfc822_parse_addresses($value);
        foreach ($addresses as $i => $item) {
            $addresses[$i]['display'] = $this->decodeHeader($item['display']);
        }

        return $addresses;
    }

Edit2:
I did some tests - it works with every special char:

/**
     * @dataProvider daneParsera
     *
     * @param $value
     */
    public function testParsera($value)
    {
        $value = preg_replace('/".*?"(*SKIP)(*FAIL)|(.+[<>\/\\(),.;:\[\]@]+.+)(\s<)/', '"$1"$2', $value);
        $this->assertTrue(is_array(mailparse_rfc822_parse_addresses($value)));
    }

    public function daneParsera()
    {
        return [
            ['a(a <aFakeowska@fake.com.pl>'],
            ['a)a <aFakeowska@fake.com.pl>'],
            ['a/a <aFakeowska@fake.com.pl>'],
            ['a\a <aFakeowska@fake.com.pl>'],
            ['a,a <aFakeowska@fake.com.pl>'],
            ['a.a <aFakeowska@fake.com.pl>'],
            ['a;a <aFakeowska@fake.com.pl>'],
            ['a:a <aFakeowska@fake.com.pl>'],
            ['a[a <aFakeowska@fake.com.pl>'],
            ['a]a <aFakeowska@fake.com.pl>'],
            ['a@a <aFakeowska@fake.com.pl>'],
            ['a-a <aFakeowska@fake.com.pl>'],
            ['a_a <aFakeowska@fake.com.pl>'],
        ];
    }

@eXorus
Copy link
Member

eXorus commented Feb 2, 2018

It's working but I have one issue with:

Input: Alfred > Neuman <Neuman@BBN-TENEXA>, Alfred Neuman <Neuman@BBN-TENEXA>, "Alfred > Neuman" <Neuman@BBN-TENEXA>, Alfred > Neuman <Neuman@BBN-TENEXA>

Actual Output: "Alfred > Neuman <Neuman@BBN-TENEXA>, Alfred Neuman <Neuman@BBN-TENEXA>, "Alfred > Neuman" <Neuman@BBN-TENEXA>, Alfred > Neuman" <Neuman@BBN-TENEXA>

Expected Output: "Alfred > Neuman" <Neuman@BBN-TENEXA>, Alfred Neuman <Neuman@BBN-TENEXA>, "Alfred > Neuman" <Neuman@BBN-TENEXA>, "Alfred > Neuman" <Neuman@BBN-TENEXA>

@piernik
Copy link
Author

piernik commented Feb 3, 2018

I cannot deal with it:/ Maybe You can explode phrase with >, then reg exp for all items and implode them back?

@eXorus
Copy link
Member

eXorus commented Feb 5, 2018

it's become very complex for something that doesn't respect the RFC.

If I explode with >, I need to put again the > at the end of the string to have regex working like expected but only when there was a > ...

I try this but it doesn't work:

    public function sanitizeAddresses($addresses)
    {
        $addresses = explode(">,", $addresses);
        $addresses = array_map(function ($address) {
            return preg_replace('/".*?"(*SKIP)(*FAIL)|(.+[<>\/\\(),.;:\[\]@]+.+)(\s<)/', '"$1"$2', $address);
        }, $addresses);
        return implode(">,", $addresses);
    }
  1. PhpMimeMailParser\ParserTest::testSanitizeAddresses with data set #0 ('"Alfred > Neuman" <Neuman@BBN...ENEXA>', 'Alfred > Neuman <Neuman@BBN-T...ENEXA>')
    Failed asserting that two strings are equal.
    --- Expected
    +++ Actual
    @@ @@
    -'"Alfred > Neuman" Neuman@BBN-TENEXA, Alfred Neuman Neuman@BBN-TENEXA, "Alfred > Neuman" Neuman@BBN-TENEXA, "Alfred > Neuman" Neuman@BBN-TENEXA'
    +'"Alfred > Neuman" Neuman@BBN-TENEXA, Alfred Neuman Neuman@BBN-TENEXA," "Alfred > Neuman"" Neuman@BBN-TENEXA," Alfred > Neuman" Neuman@BBN-TENEXA'

Maybe the good solution could be to catch the error and return the string like it is ?

@piernik
Copy link
Author

piernik commented Feb 6, 2018

I hope I have good code. Comments in code. Hope it works.

    /**
     * @dataProvider daneParsera
     *
     * @param $addresses
     */
    public function testParsera($addresses)
    {
        //Reg Exp to explode string with `>, `. Before `>, ` has to be an email 
        $re = '/(.*<[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+>),\s/U';
        //Adding `, ` to the end of string for non `>, ` strings 
        preg_match_all($re, $addresses . ', ', $matches, PREG_SET_ORDER, 0);
        if ($matches) {
            $parsed = [];
            //parsing every address
            foreach ($matches AS $k => $w) {
                $address = $w[1];
                $parsed[] = preg_replace('/".*?"(*SKIP)(*FAIL)|(.+[<>\/\\(),.;:\[\]@]+.+)(\s<)/', '"$1"$2', $address);
            }
            //implode result together
            $addresses = implode(', ', $parsed);
        }
        $this->assertTrue(is_array(mailparse_rfc822_parse_addresses($addresses)));
    }

    public function daneParsera()
    {
        return [
            ['a(a <aFakeowska@fake.com.pl>'],
            ['a)a <aFakeowska@fake.com.pl>'],
            ['a/a <aFakeowska@fake.com.pl>'],
            ['a\a <aFakeowska@fake.com.pl>'],
            ['a,a <aFakeowska@fake.com.pl>'],
            ['a.a <aFakeowska@fake.com.pl>'],
            ['a;a <aFakeowska@fake.com.pl>'],
            ['a:a <aFakeowska@fake.com.pl>'],
            ['a[a <aFakeowska@fake.com.pl>'],
            ['a]a <aFakeowska@fake.com.pl>'],
            ['a@a <aFakeowska@fake.com.pl>'],
            ['a-a <aFakeowska@fake.com.pl>'],
            ['a_a <aFakeowska@fake.com.pl>'],
            ['a_a <aFakeowska@fake.com.pl>, a-a <aFakeowska@fake.com.pl>'],

            ['a_a <aFakeowska@fake.com.pl>, a-a <aFakeowska@fake.com.pl>'],
            ['a>a <aFakeowska@fake.com.pl>'],
            ['a>,@a <aFakeowska@fake.com.pl>'],
            ['"a>, @a" <aFakeowska@fake.com.pl>'],
            ['a,a <aFakeowska@fake.com.pl>, a<a <aFakeowska@fake.com.pl>, a@a <aFakeowska@fake.com.pl>'],
            ['=?UTF-8?Q?aFakeowska_>>_Agnieszka_fake-Fakeowska_Przegl=c4=85dy_?= =?UTF-8?B?YnVkeW5rw7N3?= <aFakeowska@fake.com.pl>'],
            ['Alfred > Neuman <Neuman@BBN-TENEXA>, Alfred Neuman <Neuman@BBN-TENEXA>, "Alfred > Neuman" <Neuman@BBN-TENEXA>, Alfred > Neuman <Neuman@BBN-TENEXA>'],
        ];
    }

@eXorus
Copy link
Member

eXorus commented Feb 7, 2018

Thanks for your contribution. I don't know how I will include this in my lib because I don't know all the impacts.
Maybe I will autorize a flag to force or not to respect RFC. You will be able to set to false and I will play this code, but by default and for the moment it will be true.

One more issue:

Gourmets: Pompous Person <WhoZiWhatZit@Cordon-Bleu>, Childs@WGBH.Boston, Galloping Gourmet@ANT.Down-Under (Australian National Television), Cheapie@Discount-Liquors;, Cruisers: Port@Portugal, Jones@SEA;, Another@Somewhere.SomeOrg

Actual result is
"Gourmets: Pompous Person" <WhoZiWhatZit@Cordon-Bleu>

@piernik
Copy link
Author

piernik commented Feb 8, 2018

$re = '/(.*<?[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+>?),\s/U'; = Email don't have to be within <>.

Idea with flag is very good.

@fijiwebdesign
Copy link
Contributor

It isn't a good idea to include regex in the library and also to support non-RFC compliant structures. Rather, you should leave this to an outside library that converts emails to RFC compliance or helps with parsing. There is too many implementation specific's in MTAs and mail clients creating mime messages that supporting those is opening up too much complexity in one library.

This can be seen in many regex based mime parsers, they have too many failures and there is no way to write tests for regex expressions in parsers covering all possibilities.

Instead there should be hooks, events and/or middleware implemented so that parsing can pass through a external library or function before handling.

Example:


$Parser = new PhpMimeMailParser\Parser();

// inline function handler
$Parser->use(function($event, $header) {
  if ($event == 'parse-header') {
    return preg_replace_all("/REGEX/i", "replacement", $header);
  }
});

// use external library

use ExternalParser;
$config = [];
$ExternalParser = new ExternalParser($config);
$Parser->use($ExternalParser);

This way the PhpMimeMailParser library remains statically testable and parsing is more deterministic and thus less complex while benefiting from being able to extend it's parsing ability through middleware provided from sources outside the library so users can pick and choose what to implement.

Email parsing is always going to have failures. Trying to handle every use case is not possible in one library.

@fijiwebdesign
Copy link
Contributor

Also, you can ship middleware for parsing RFC compliant Mime messages that php or the mailparse extension cannot handle correctly. For example in the case of issue #168
And/or middleware for the most common cases, so they can be turned on/off as per usecase.

@eXorus
Copy link
Member

eXorus commented Feb 8, 2018

@piernik The new regex works better but it's always failing:

1) PhpMimeMailParser\ParserTest::testSanitizeAddresses with data set #11 ('Gourmets:  Pompous Person <Wh...omeOrg', 'Gourmets:  Pompous Person <Wh...omeOrg')
Failed asserting that two strings are equal.
--- Expected
+++ Actual
@@ @@
-'Gourmets:  Pompous Person <WhoZiWhatZit@Cordon-Bleu>, Childs@WGBH.Boston, Galloping Gourmet@ANT.Down-Under (Australian National Television), Cheapie@Discount-Liquors;, Cruisers:  Port@Portugal, Jones@SEA;, Another@Somewhere.SomeOrg'
+'"Gourmets:  Pompous Person" <WhoZiWhatZit@Cordon-Bleu>, Childs@WGBH.Boston, Galloping Gourmet@ANT.Down-Under (Australian National Television), Cheapie@Discount-Liquors;, Cruisers:  Port@Portugal, Jones@SEA;, Another@Somewhere.SomeOrg'

@fijiwebdesign I agree most of you view point. I don't know yet how to do that but I find a way to manage all of this exception because it's the real life or at least let the dev be able to extends PhpMimeMailParser.
Thanks for your feedback it's help me a lot.

@piernik
Copy link
Author

piernik commented Feb 8, 2018

@eXorus : is in special chars, so "Gourmets: Pompous Person" <WhoZiWhatZit@Cordon-Bleu> should be expected.

@fijiwebdesign sure it's good idea, maybe best.

@eXorus
Copy link
Member

eXorus commented Feb 9, 2018

@piernik in this case it's a group of emails because you start by a label Gourmets: and finish with ;

@piernik
Copy link
Author

piernik commented Feb 9, 2018

Don't get it - what is group of emails? I do not know RFC pattern at all.

@eXorus
Copy link
Member

eXorus commented Feb 9, 2018

I just take the example here: https://www.w3.org/Protocols/rfc822/#z10
I don't know much about this, but it was working well before.

A.1.5. Address Lists

Gourmets: Pompous Person WhoZiWhatZit@Cordon-Bleu,
Childs@WGBH.Boston, Galloping Gourmet@
ANT.Down-Under (Australian National Television),
Cheapie@Discount-Liquors;,
Cruisers: Port@Portugal, Jones@SEA;,
Another@Somewhere.SomeOrg

This group list example points out the use of comments and the
mixing of addresses and groups.

@fijiwebdesign
Copy link
Contributor

fijiwebdesign commented Feb 9, 2018

I've added the middleware to this PR: #180

For example to fix the email address you'd do:

$Parser = new Parser();

$Parser->addMiddleware(function($mimePart, $next) {
            $regex ='/some cool regex to fix from headers/i';
            $part = $mimePart->getPart();

            if (isset($part['headers']['from'])) {
                $part['headers']['from'] = preg_replace($regex, "\"$1\"", $part['headers']['from']);
            }
            $mimePart->setPart($part);

            return $next($mimePart);
});

Or create a full middleware implementation

<?php

namespace PhpMimeMailParser;

/**
 * Wraps a callable as a Middleware
 */
class HeaderSpecialCharsMiddleware implements Contracts\Middleware
{

    /**
     * Process a mime part, optionally delegating parsing to the $next MiddlewareStack
     */
    public function parse(MimePart $part, MiddlewareStack $next)
    {
            $regex ='/some cool regex to fix from headers/i';
            $part = $mimePart->getPart();

            if (isset($part['headers']['from'])) {
                $part['headers']['from'] = preg_replace($regex, "\"$1\"", $part['headers']['from']);
            }
            $mimePart->setPart($part);

            return $next($mimePart);
    }
}

$Parser = new Parser();
$Parser->addMiddleware(new HeaderSpecialCharsMiddleware());

@eXorus
Copy link
Member

eXorus commented Feb 27, 2018

Version with middleware is released, you can do your regex in a middleware if you want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants