Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to attach logger for loadHTML #59

Closed
jtojnar opened this issue Nov 12, 2020 · 2 comments
Closed

Unable to attach logger for loadHTML #59

jtojnar opened this issue Nov 12, 2020 · 2 comments

Comments

@jtojnar
Copy link
Contributor

jtojnar commented Nov 12, 2020

Since 00f622e, constructor runs loadHTML again so we cannot attach logger before that. And loadHTML is private so we cannot re-run it after logger has been attached by setLogger.

(Trying to debug why div img + noscript img is turned into (p img) + noscript img preventing the wordpress noscript removal from triggering for some users but I cannot reproduce it.)

@j0k3r
Copy link
Owner

j0k3r commented Nov 13, 2020

And for an unknown reason I moved loadHtml out of init with a message without context @todo This should be called in init() instead of from __construct 😕

Looks like I moved code out of the constructor into loadHtml

public function __construct($html, $url = null, $parser = 'libxml', $use_tidy = true)
{
$this->url = $url;
$this->debugText = 'Parsing URL: '.$url."\n";
if ($url) {
$this->domainRegExp = '/'.strtr(preg_replace('/www\d*\./', '', parse_url($url, PHP_URL_HOST)), array('.' => '\.')).'/';
}
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_regex_encoding('UTF-8');
// HACK: dirty cleanup to replace some stuff; shouldn't use regexps with HTML but well...
if (!$this->flagIsActive(self::FLAG_DISABLE_PREFILTER)) {
foreach ($this->pre_filters as $search => $replace) {
$html = preg_replace($search, $replace, $html);
}
unset($search, $replace);
}
if (trim($html) === '') {
$html = '<html></html>';
}
/*
* Use tidy (if it exists).
* This fixes problems with some sites which would otherwise trouble DOMDocument's HTML parsing.
* Although sometimes it makes matters worse, which is why there is an option to disable it.
*
*/
if ($use_tidy && function_exists('tidy_parse_string')) {
$this->debugText .= 'Tidying document'."\n";
$tidy = tidy_parse_string($html, $this->tidy_config, 'UTF8');
if (tidy_clean_repair($tidy)) {
$this->original_html = $html;
$this->tidied = true;
$html = $tidy->value;
$html = preg_replace('/[\r\n]+/is', "\n", $html);
}
unset($tidy);
}
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
if (!($parser == 'html5lib' && ($this->dom = \HTML5_Parser::parse($html)))) {
libxml_use_internal_errors(true);
$this->dom = new \DOMDocument();
$this->dom->preserveWhiteSpace = false;
if (PHP_VERSION_ID >= 50400) {
$this->dom->loadHTML($html, LIBXML_NOBLANKS | LIBXML_COMPACT | LIBXML_NOERROR);
} else {
$this->dom->loadHTML($html);
}
libxml_use_internal_errors(false);
}
$this->dom->registerNodeClass('DOMElement', 'Readability\JSLikeHTMLElement');
}

But moving the call of loadHtml outside the constructor create a BC.

Maybe we should add ability to define a logger right from the constructor?

@j0k3r
Copy link
Owner

j0k3r commented Feb 4, 2022

In the upcoming 2.0, loadHtml is out of the constructor. Which should solve your problem.

#69

@j0k3r j0k3r closed this as completed Feb 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants