getMessageBody encoding issue #147

Closed
ainera opened this issue Apr 19, 2017 · 2 comments

Comments

ainera commented Apr 19, 2017

getMessageBody() changes the received content's character encoding, but this has an undesired side effect.

When you get an (X)HTML body and it is converted wholesale to another encoding, the actual encoding and the charset declared inline may no longer match. An HTML (or XML) parser that honors the inline charset may then convert the text a second time.

For example, this entirely plausible email content

<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=ISO-8859-15">
</head>
<body>
<p>Zwölf Boxkämpfer jagen Viktor quer über den großen Sylter Deich</p>
</body>
</html>

Passed through this

$doc = new DOMDocument('1.0');
$doc->recover = true;
@$doc->loadHTML($email_body, LIBXML_NOCDATA | LIBXML_NOENT | LIBXML_NONET);
$xpather = new DOMXPath($doc);
foreach ($xpather->query('.//text()', $doc) as $text) {
	echo var_export($text->wholeText), PHP_EOL;
}

Prints the following

'Zwölf BoxkÀmpfer jagen Viktor quer Ìber den gro�en Sylter Deich'
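The double conversion can be illustrated without DOMDocument. This is a minimal sketch of the mechanism only, assuming the mbstring extension is available; the sample word and the variable names are illustrative, and the exact glyphs produced may differ from the output quoted above.

```php
<?php
// UTF-8 text, as it would look after the body has already been converted once.
// "\u{00E4}" is "ä"; the escape keeps the example independent of the file's own encoding.
$utf8 = "Boxk\u{00E4}mpfer";

// A parser that trusts the inline "charset=ISO-8859-15" treats every UTF-8 byte
// as one ISO-8859-15 character and converts it to UTF-8 again.
$twice = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-15');

echo $twice, PHP_EOL;                       // mojibake: "ä" has become two characters
echo strlen($utf8), ' -> ', strlen($twice), ' bytes', PHP_EOL;

// Reversing the second conversion recovers the original text.
$restored = mb_convert_encoding($twice, 'ISO-8859-15', 'UTF-8');
var_dump($restored === $utf8);
```

The point is that the text is not corrupted by either conversion alone, only by applying the second one to bytes that were no longer in the declared charset.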

I suppose there are a few possible ways to fix this:

  • Incorporate DOMDocument into getMessageBody(). Dealing with its error reporting is not the best experience and this does mean more dependencies/requirements.
  • Do a naive swap of inline charset specifications. Probably a very bad idea, because it would not be real XML parsing.
  • Have another function that returns the body untouched for "aware" processing.
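For illustration, the second option (the naive swap) could look roughly like the sketch below. swapInlineCharset() is a hypothetical helper, not part of the library, and it has exactly the weakness described above: it is plain pattern matching, so it would also rewrite a "charset=..." string that merely appears in the body text.

```php
<?php
/**
 * Hypothetical helper: rewrite every inline charset=... declaration
 * to the given encoding using a regular expression (no real HTML parsing).
 */
function swapInlineCharset(string $html, string $charset = 'UTF-8'): string
{
    // Group 1 keeps "charset=" plus an optional opening quote;
    // the character class consumes the old charset name.
    $result = preg_replace('/(charset\s*=\s*["\']?)[\w.:-]+/i', '${1}' . $charset, $html);

    // preg_replace() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

$meta = '<meta http-equiv=Content-Type content="text/html; charset=ISO-8859-15">';
echo swapInlineCharset($meta), PHP_EOL;
```

Applied to the already converted body before handing it to DOMDocument, this would make the declared charset agree with the actual one, at the cost of the fragility noted above.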

eXorus (Member) commented Apr 19, 2017

I agree with you, it's like getHeadersRaw().

Maybe a getMessageBodyRaw() ...

eXorus (Member) commented May 21, 2020

I can't reproduce the issue, or else the email itself is wrong (bad character encoding):

$file = 'From: mail@exemple.com
To: mail@exemple.com, mail2@exemple3.com, mail3@exemple2.com
Subject: =?windows-1251?Q?occurs_when_divided_into_an_array?=
=?windows-1251?Q?=2C_and_the_last_e_of_the_array!_=CF=F3=F2?=
=?windows-1251?Q?=B3=ED_=F5=F3=E9=EB=EE!!!!!!?=
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=ISO-8859-15">
</head>
<body>
<p>Zwölf Boxkämpfer jagen Viktor quer über den großen Sylter Deich</p>
</body>
</html>
';

$Parser = new Parser();
$Parser->setText($file);

$expectedString = 'Zwölf Boxkämpfer jagen Viktor quer über den großen Sylter Deich';

$this->assertStringContainsString(
    $expectedString,
    $Parser->getMessageBody('text')
);
//It works

$doc = new \DOMDocument('1.0');
$doc->recover = true;
@$doc->loadHTML($Parser->getMessageBody('text'), LIBXML_NOCDATA | LIBXML_NOENT | LIBXML_NONET);
$xpather = new \DOMXPath($doc);
$this->assertEquals(
    $expectedString, 
    $xpather->query('.//text()', $doc)[5]->wholeText
); 
//It doesn't work
    

So I'm closing the issue, but in the next version I have added these methods:

  • getTextRaw
  • getHtmlRaw

eXorus closed this as completed May 21, 2020