Some pages can't be parsed #18

panzi · 2013-07-06T02:43:53Z

For example, this page causes problems:
http://pinterest.com/pin/315885361334150660/

On http://iframely.com/debug I get 403. Locally the page loads but then the SAX parser fails. It gets confused by the messy HTML and skips all the important meta tags. Strangely I can't reproduce this in a minimal setup, but I can when using iframely. I'm pretty sure it's the SAX parser after debugging and adding appropriate log statements in iframely-meta.js.

So maybe another HTML parser would be better? Maybe node-htmlparser2? They claim to be a lot faster than SAX, too. However, it doesn't resolve entities, but that is easily done with the entities module.
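To illustrate what "resolving entities" involves, here is a minimal sketch in plain JavaScript. It is an assumption-laden stand-in, not the API of the entities module: the function name `decodeBasicEntities` is hypothetical, and it only knows five named entities plus numeric references.

```javascript
// Hypothetical helper for illustration only; a real decoder knows the full
// HTML entity table. This one handles five named entities and numeric forms.
const NAMED_ENTITIES = new Map([
  ["amp", "&"],
  ["lt", "<"],
  ["gt", ">"],
  ["quot", '"'],
  ["apos", "'"],
]);

function decodeBasicEntities(text) {
  // A single pass, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (whole, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const codePoint = isHex
        ? parseInt(body.slice(2), 16)
        : parseInt(body.slice(1), 10);
      // Leave out-of-range references untouched instead of throwing.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : whole;
    }
    // Unknown names are left as they were.
    return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : whole;
  });
}

console.log(decodeBasicEntities("a &amp; b &#65;&#x42; &amp;lt; &bogus;"));
// prints: a & b AB &lt; &bogus;
```

The point is only that decoding can sit as a separate step on top of a parser that hands out raw attribute values.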

Should I port the parsing of meta tags to node-htmlparser2? Maybe I'll do that even if it's just to find out whether it's really SAX that messes up.

Btw. when using $selector on valid HTML that is not well-formed XML you also get errors. E.g. <p> tags don't need </p> in HTML. Any further <p> automatically closes the last one. With $selector they are all nested, though. Yes, this is ugly and I don't write such HTML, but some people do.
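The difference between the two readings of unclosed <p> tags can be sketched with a small counter. This is a hypothetical helper (`maxParagraphDepth` is not part of iframely or of any parser), and it only models the one rule mentioned above: in HTML mode a new <p> closes an open <p>, in XML mode it nests.

```javascript
// Hypothetical illustration: measures how deeply <p> elements end up nested
// when a fragment is read XML-style versus HTML-style.
function maxParagraphDepth(markup, htmlMode) {
  const tagPattern = /<(\/?)p\b[^>]*>/gi; // matches <p ...> and </p> only
  let depth = 0;
  let maxDepth = 0;
  let match;
  while ((match = tagPattern.exec(markup)) !== null) {
    if (match[1] === "/") {
      depth = Math.max(0, depth - 1); // explicit </p>
    } else {
      if (htmlMode && depth > 0) {
        depth -= 1; // implied </p> before the new <p>
      }
      depth += 1;
      maxDepth = Math.max(maxDepth, depth);
    }
  }
  return maxDepth;
}

const fragment = "<p>one<p>two<p>three";
console.log(maxParagraphDepth(fragment, false)); // 3 (XML-style: all nested)
console.log(maxParagraphDepth(fragment, true));  // 1 (HTML-style: siblings)
```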


nleush · 2013-07-06T05:39:09Z

The known issue is an explicit parser error: http://iframely.com/debug?uri=%20http%3A%2F%2Fdabblet.com%2F%20

Unexpected end Line: 186 Column: 7 Char:

SAX really does have some issues with processing invalid tags. I tried to fix that but haven't had enough time to handle it yet.

We have two general requirements for a parser to be used in the current algorithm:

Working with streams, so meta can be processed before the whole response has been retrieved.
Working with tags, to simplify the generation of meta.
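The two requirements above can be sketched together as a chunk-fed scanner that reports each <meta> tag as soon as it is complete and stops at </head>. This is a hand-rolled illustration in plain JavaScript; the class name `MetaTagStream` is hypothetical, it is not how iframely-meta.js works, and it does not use the API of sax or node-htmlparser2.

```javascript
// Hypothetical sketch: a hand-rolled scanner, not a real HTML parser.
// It only looks for <meta ...> tags and for </head>.
class MetaTagStream {
  constructor(onMeta) {
    this.onMeta = onMeta; // called with the raw text of each complete <meta> tag
    this.buffer = "";     // unprocessed tail, may end in a tag split across chunks
    this.done = false;    // true once </head> has been seen
  }

  write(chunk) {
    if (this.done) return;
    this.buffer += chunk;
    // Simplification: assumes attribute values never contain ">".
    const tagPattern = /<(meta\b[^>]*|\/head\s*)>/gi;
    let consumed = 0;
    let match;
    while ((match = tagPattern.exec(this.buffer)) !== null) {
      consumed = tagPattern.lastIndex;
      if (match[1].startsWith("/")) {
        this.done = true; // a caller could stop downloading at this point
        break;
      }
      this.onMeta(match[0]);
    }
    // Keep only what has not been matched yet.
    this.buffer = this.done ? "" : this.buffer.slice(consumed);
  }
}

const collected = [];
const stream = new MetaTagStream((rawTag) => collected.push(rawTag));
// The first <meta> tag is deliberately split across two chunks.
stream.write('<head><meta name="a" con');
stream.write('tent="1"><meta property="og:title" content="T"></he');
stream.write('ad><body><meta name="late" content="x">');
console.log(collected.length); // 2
```

The tag after </head> is ignored, which is the benefit of stream processing: the rest of the response does not need to be read.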

So it looks like it's possible to use node-htmlparser2.

You can try to port iframely-meta to another parser. We will test it on all our plugins to see how stable it is.

P.S. Testing of all plugins is available with forever start ./modules/test-dashboard/tester.js, but the tests section of the config must be enabled. Results are shown on '/tests'.

nleush · 2013-07-06T05:49:45Z

About the issue with skipped meta tags.

It's really strange: it reproduces only locally and not every time. As far as I can see, the parser stops working at the META viewport tag and continues at the META pinterest:following tag. Maybe it's because of the data-app attribute:

 <meta property="twitter:app:name:googleplay" name="twitter:app:name:googleplay" content="Pinterest" data-app>

panzi · 2013-07-06T13:32:37Z

Yeah I also noticed the data-app attribute. This is valid HTML5, but completely invalid XML. I'll do the porting to node-htmlparser2 later (in a couple of hours).
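A minimal sketch of an HTML-style attribute reader shows how a valueless attribute such as data-app can be accepted instead of being treated as an error. The function name `parseTagAttributes` is hypothetical; this is not the code of sax, node-htmlparser2, or iframely, and it ignores many edge cases of real HTML.

```javascript
// Hypothetical illustration: reads the attributes of one raw start tag.
// An attribute without "=value" (valid in HTML5, invalid in XML) is stored
// with an empty string as its value.
function parseTagAttributes(tag) {
  const attributes = {};
  // Drop the leading "<name" and the trailing ">" or "/>".
  const body = tag.replace(/^<\s*[^\s>\/]+/, "").replace(/\/?>\s*$/, "");
  const attrPattern =
    /([^\s=\/]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"']+)))?/g;
  let match;
  while ((match = attrPattern.exec(body)) !== null) {
    const name = match[1].toLowerCase();
    // Double-quoted, single-quoted, or unquoted value; otherwise no value.
    const value = match[2] ?? match[3] ?? match[4] ?? "";
    attributes[name] = value;
  }
  return attributes;
}

const attrs = parseTagAttributes(
  '<meta property="twitter:app:name:googleplay" ' +
    'name="twitter:app:name:googleplay" content="Pinterest" data-app>'
);
console.log(attrs.content);           // Pinterest
console.log(attrs["data-app"] === ""); // true
```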

panzi · 2013-07-06T21:06:32Z

Erm, what should happen when I run forever start ./modules/test-dashboard/tester.js? I now have this process:

$ ~/node0.10/bin/forever start ./modules/test-dashboard/tester.js
warn:    --minUptime not set. Defaulting to: 1000ms
warn:    --spinSleepTime not set. Your script will exit if it does not stay up for at least 1000ms
info:    Forever processing file: ./modules/test-dashboard/tester.js
$ ps aux| grep node
root      1040  0.0  0.0   6960   884 ?        Ss   19:11   0:00 /usr/sbin/mcelog --ignorenodev --daemon --foreground
panzi    16805  3.2  0.4 679052 33780 ?        Ssl  22:58   0:00 /usr/bin/node /home/panzi/node0.10/lib/node_modules/forever/bin/monitor ./modules/test-dashboard/tester.js
panzi    16817  0.0  0.0 109256   884 pts/2    S+   22:59   0:00 grep --color=auto node

But it doesn't seem to do anything. (No CPU or IO usage.) What am I supposed to do now? I did uncomment the tests section in config.local.js. There is something about mongodb. Do I need to install and run mongodb? Does it need some special configuration? If so, why haven't I got an error about this?

nleush · 2013-07-07T05:05:13Z

Yes, mongodb must be running.

Try the /tests URL: it should show an error if mongodb is not available. It is the test results dashboard, and it also has buttons to run the tests for a single plugin or to force a test run of the plugins.
Also check forever logs 0 (where 0 is the index of this process in forever logs).

This process tests all plugins at the configured interval, and tests modified plugins first.

nleush · 2013-07-07T05:16:44Z

Update: the dashboard URL is /tests, not /debug.

nleush closed this as completed Jul 8, 2013
