getMessageBody encoding issue #147

ainera · 2017-04-19T06:21:28Z

getMessageBody() changes the received content's character encoding but this has an undesired side effect.

When you get an (X)HTML body and it is converted wholesale to another encoding, the actual encoding and the charset declared inline may no longer match. An HTML (or XML) parser that honors the inline charset may then convert the text a second time.

For example, this perfectly plausible email content

<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=ISO-8859-15">
</head>
<body>
<p>Zwölf Boxkämpfer jagen Viktor quer über den großen Sylter Deich</p>
</body>
</html>

Passed through this

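$doc = new DOMDocument('1.0');
$doc->recover = true;
// @ suppresses the warnings loadHTML() emits on malformed markup.
@$doc->loadHTML($email_body, LIBXML_NOCDATA | LIBXML_NOENT | LIBXML_NONET);
$xpather = new DOMXPath($doc);
// Dump every text node so the encoding damage is visible.
foreach ($xpather->query('.//text()', $doc) as $text) {
	echo var_export($text->wholeText, true), PHP_EOL;
}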

Prints the following

'ZwÃ¶lf BoxkÃ€mpfer jagen Viktor quer ÃŒber den groÃ�en Sylter Deich'
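The garbled output is consistent with UTF-8 bytes being read as ISO-8859-15 and converted to UTF-8 a second time. A minimal sketch that reproduces the effect without DOMDocument, assuming getMessageBody() returned UTF-8 (the variable names are illustrative only):

```php
<?php
// Sketch only: reproduces the double conversion with plain iconv().
// The \u{...} escapes (PHP 7+) keep this file's own encoding out of the picture.
$utf8 = "Zw\u{00F6}lf Boxk\u{00E4}mpfer"; // "Zwölf Boxkämpfer" as UTF-8 bytes

// Pretend the UTF-8 bytes are ISO-8859-15, as the inline charset claims,
// and convert them to UTF-8 again.
$twice = iconv('ISO-8859-15', 'UTF-8', $utf8);

var_export($twice);
echo PHP_EOL;
```

Each non-ASCII character is two bytes in UTF-8, so every one of them comes out as two characters after the second conversion.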

I suppose there are a few possible ways to fix this:

Incorporate DOMDocument into getMessageBody(). Dealing with its error reporting is not the best experience and this does mean more dependencies/requirements.
Do a naive swap of inline charset specifications. Probably a very bad idea, because it would not be real XML parsing.
Have another function that returns the body untouched for "aware" processing.
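For illustration, the second option (the naive charset swap) could look roughly like this. swapInlineCharset() is a hypothetical helper, not part of the library, and the target charset is assumed to be UTF-8. It is pattern matching rather than real parsing, so it can misfire on unusual markup, which is exactly the concern raised above.

```php
<?php
// Hypothetical helper (not a library function): rewrites the charset named
// inside a <meta> tag so it matches the encoding the body was converted to.
function swapInlineCharset(string $html, string $charset = 'UTF-8'): string
{
    $result = preg_replace(
        // Capture everything in the <meta> tag up to the charset value,
        // then replace only the value itself.
        '/(<meta\b[^>]*\bcharset\s*=\s*["\']?)[A-Za-z0-9._:-]+/i',
        '${1}' . $charset,
        $html
    );

    // preg_replace() returns null on error; fall back to the input unchanged.
    return $result ?? $html;
}

echo swapInlineCharset('<meta http-equiv=Content-Type content="text/html; charset=ISO-8859-15">'), PHP_EOL;
```

Text outside a <meta> tag is left alone, so a literal "charset=..." appearing in the message content is not rewritten.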


eXorus · 2017-04-19T07:26:53Z

I agree with you, it's like getHeadersRaw().

Maybe a getMessageBodyRaw() ...

eXorus · 2020-05-21T20:13:11Z

I can't reproduce the issue; either that, or the email itself is wrong (bad character encoding).

$file = 'From: mail@exemple.com
To: mail@exemple.com, mail2@exemple3.com, mail3@exemple2.com
Subject: =?windows-1251?Q?occurs_when_divided_into_an_array?=
=?windows-1251?Q?=2C_and_the_last_e_of_the_array!_=CF=F3=F2?=
=?windows-1251?Q?=B3=ED_=F5=F3=E9=EB=EE!!!!!!?=
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=ISO-8859-15">
</head>
<body>
<p>Zwölf Boxkämpfer jagen Viktor quer über den großen Sylter Deich</p>
</body>
</html>
';

$Parser = new Parser();
$Parser->setText($file);

$expectedString = 'Zwölf Boxkämpfer jagen Viktor quer über den großen Sylter Deich';

$this->assertStringContainsString(
    $expectedString,
    $Parser->getMessageBody('text')
);
//It works

$doc = new \DOMDocument('1.0');
$doc->recover = true;
@$doc->loadHTML($Parser->getMessageBody('text'), LIBXML_NOCDATA | LIBXML_NOENT | LIBXML_NONET);
$xpather = new \DOMXPath($doc);
$this->assertEquals(
    $expectedString, 
    $xpather->query('.//text()', $doc)[5]->wholeText
); 
//It doesn't work

So I'm closing the issue, but in the next version I added the methods:

getTextRaw
getHtmlRaw

eXorus added the enhancement label Apr 19, 2017

eXorus closed this as completed May 21, 2020
