Incorrect behaviour on parsing html #41

behzadsh · 2015-11-19T22:36:20Z

I was trying to parse the HTML content of cnn.com news pages, and when I fetched the body tag, using either find() or getElementsByTag(), half the content was gone. I wrote the parsed content to a file and realized that some tags, like <article>, end up outside the <body> or <html> tag, something like this:

<html>
  <head>...</head>
  <body>...</body>
  <article>...</article>
  <div>...</div>
</html>
<div>...</div>

PHP code:

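<?php

require 'vendor/autoload.php'; // assumes the library is installed via Composer

$dom = new PHPHtmlParser\Dom();
$url = 'http://edition.cnn.com/2015/11/19/tennis/world-tour-finals-federer-nishikori/index.html';
$dom->load($url); // fetch and parse the page
// Dump the parsed DOM back to a file for inspection
file_put_contents('test.html', (string) $dom);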
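
To make the misplaced tags easier to pin down, the dumped output can be checked for elements that open after the closing </body> tag. The following is only a rough sketch using plain PHP string functions: tagsAfterBodyClose() is a hypothetical helper, not part of PHPHtmlParser, and the sample string merely imitates the structure shown above. The regex is crude and would also match tag-like text inside scripts or comments.

```php
<?php

/**
 * Return the names of tags that open after the first closing </body> tag.
 * Hypothetical diagnostic helper, not part of PHPHtmlParser.
 */
function tagsAfterBodyClose(string $html): array
{
    $pos = stripos($html, '</body>');
    if ($pos === false) {
        return []; // no closing body tag found
    }

    // Everything that follows the closing body tag
    $tail = substr($html, $pos + strlen('</body>'));

    // Match opening tags only; closing tags start with "</" and are skipped
    preg_match_all('/<([a-z][a-z0-9]*)\b[^>]*>/i', $tail, $matches);

    // Lower-case, de-duplicate, and reindex the tag names
    return array_values(array_unique(array_map('strtolower', $matches[1])));
}

// Sample input imitating the broken structure described in this report
$dump = '<html><head></head><body><p>x</p></body>'
      . '<article>a</article><div>b</div></html><div>c</div>';

echo implode(', ', tagsAfterBodyClose($dump)), PHP_EOL; // prints "article, div"
```

To run it against the real dump, pass file_get_contents('test.html') instead of the sample string. A non-empty result on the parsed output, compared with an empty one on the raw page source, would suggest the tags are being moved during parsing.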


paquettg · 2016-03-20T19:26:27Z

I am not able to reproduce your issue. If you are still experiencing it, please reopen with more details and I will continue to look into it.

paquettg added bug enhancement and removed bug enhancement labels Mar 20, 2016

paquettg closed this as completed Mar 20, 2016
