Some pages can't be parsed #18

Closed
panzi opened this issue Jul 6, 2013 · 6 comments


@panzi
Contributor

panzi commented Jul 6, 2013

For example, this page causes problems:
http://pinterest.com/pin/315885361334150660/

On http://iframely.com/debug I get 403. Locally the page loads but then the SAX parser fails. It gets confused by the messy HTML and skips all the important meta tags. Strangely I can't reproduce this in a minimal setup, but I can when using iframely. I'm pretty sure it's the SAX parser after debugging and adding appropriate log statements in iframely-meta.js.

So maybe another HTML parser would be better? Maybe node-htmlparser2? They claim to be a lot faster than SAX, too. However, they don't resolve entities. But that is easily done with the entities module.

Should I port the parsing of meta tags to node-htmlparser2? Maybe I'll do that even if it's just to find out whether it's really SAX that messes up.

Btw. when using $selector on valid HTML that is not well-formed XML you also get errors. E.g. <p> tags don't need </p> in HTML. Any further <p> automatically closes the last one. With $selector they are all nested, though. Yes, this is ugly and I don't write such HTML, but some people do.
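
To make the `<p>` point concrete, here is a minimal sketch, not taken from iframely or from any parser library. The helper name `paragraphDepths` and its `htmlRules` flag are made up for illustration. It records the nesting depth at which each `<p>` opens, once with XML-style nesting and once with a simplified version of the HTML rule that a new `<p>` closes the one that is still open. It only handles a `<p>` that sits directly on top of the open-element stack, which is narrower than what a real HTML parser does.

```javascript
// Hypothetical helper: report the nesting depth at which each <p> opens.
// htmlRules = false: every start tag nests inside the previous one (XML-like).
// htmlRules = true:  a new <p> first closes a <p> that is still open on top
//                    of the stack (simplified version of the HTML behaviour).
function paragraphDepths(html, htmlRules) {
  const stack = [];   // names of the currently open elements
  const depths = [];  // depth of each <p> at the moment it opens
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  let m;
  while ((m = tagPattern.exec(html)) !== null) {
    const closing = m[1] === "/";
    const name = m[2].toLowerCase();
    if (closing) {
      // Close the nearest matching element and everything opened inside it.
      const i = stack.lastIndexOf(name);
      if (i !== -1) stack.length = i;
    } else {
      if (htmlRules && name === "p" && stack[stack.length - 1] === "p") {
        stack.pop(); // implied </p>
      }
      if (name === "p") depths.push(stack.length);
      stack.push(name);
    }
  }
  return depths;
}

const sample = "<div><p>one<p>two<p>three</div>";
console.log(paragraphDepths(sample, false)); // depths 1, 2, 3: paragraphs nest
console.log(paragraphDepths(sample, true));  // depths 1, 1, 1: paragraphs are siblings
```

The first result is the kind of nesting described above for $selector; the second is what a browser would build.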

@nleush
Member

nleush commented Jul 6, 2013

The known issue is an explicit parser error: http://iframely.com/debug?uri=%20http%3A%2F%2Fdabblet.com%2F%20

Unexpected end Line: 186 Column: 7 Char:

SAX really does have some issues with processing invalid tags. I was trying to fix that but haven't had enough time to handle it yet.

We have two common requirements for the parser used in the current algorithm:

  1. Working with streams - to process meta before the whole response is retrieved.
  2. Working with tags - to simplify generation of meta.

So it looks like it's possible to use node-htmlparser2.
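
The streaming requirement above can be sketched roughly as follows. This is an illustration only, not iframely's code and not the node-htmlparser2 API. The class name `MetaTagScanner` and its callback are assumptions made for the example. It accepts the response in chunks, reports each `<meta>` tag as soon as the tag is complete, and stops at `</head>`, so nothing has to wait for the full body. It deliberately ignores harder cases such as `>` inside quoted attribute values, comments, and script content.

```javascript
// Hypothetical sketch: collect <meta ...> tags from HTML that arrives in chunks.
class MetaTagScanner {
  constructor(onMeta) {
    this.onMeta = onMeta; // called with the raw text of each complete <meta> tag
    this.buffer = "";     // text that may still be the beginning of a tag
    this.done = false;    // set once </head> has been seen
  }

  write(chunk) {
    if (this.done) return;
    this.buffer += chunk;
    for (;;) {
      const open = this.buffer.indexOf("<");
      if (open === -1) {
        // Plain text only: nothing to keep.
        this.buffer = "";
        return;
      }
      const close = this.buffer.indexOf(">", open);
      if (close === -1) {
        // Incomplete tag: keep it and wait for the next chunk.
        this.buffer = this.buffer.slice(open);
        return;
      }
      const tag = this.buffer.slice(open, close + 1);
      this.buffer = this.buffer.slice(close + 1);
      if (/^<meta[\s/>]/i.test(tag)) this.onMeta(tag);
      if (/^<\/head\s*>/i.test(tag)) {
        // Meta tags of interest live in <head>; stop reading after it.
        this.done = true;
        this.buffer = "";
        return;
      }
    }
  }
}

const found = [];
const scanner = new MetaTagScanner((tag) => found.push(tag));
// The first chunk ends in the middle of a tag on purpose.
scanner.write('<html><head><meta name="a" con');
scanner.write('tent="1"><meta property="og:title" content="T" data-app>');
scanner.write('</head><body><meta name="late" content="x">');
console.log(found); // the two <meta> tags from <head>, not the one in <body>
```

A caller could stop downloading the response once `scanner.done` is true.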

You can try to port iframely-meta to another parser. We will test it on all our plugins to see how stable it is.

P.S. Testing of all plugins is available with forever start ./modules/test-dashboard/tester.js, but the tests config section must be enabled; results are at '/tests'.

@nleush
Member

nleush commented Jul 6, 2013

About the issue with skipped meta tags.

It's really strange: it reproduces only locally and not every time. As far as I can see, the parser stops working at the META viewport tag and continues at the META pinterest:following tag. Maybe it's because of the data-app attribute:

 <meta property="twitter:app:name:googleplay" name="twitter:app:name:googleplay" content="Pinterest" data-app>

@panzi
Contributor Author

panzi commented Jul 6, 2013

Yeah I also noticed the data-app attribute. This is valid HTML5, but completely invalid XML. I'll do the porting to node-htmlparser2 later (in a couple of hours).
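
For illustration, a minimal sketch of reading the attributes of that tag in an HTML-tolerant way. The helper `parseAttributes` is hypothetical and is not how SAX or node-htmlparser2 work internally. The point is only that an attribute without a value, such as data-app, can be accepted and given an empty string instead of being treated as a syntax error.

```javascript
// Hypothetical helper: parse the attributes of a single start tag, accepting
// attributes that have no value (valid in HTML5, invalid in XML).
function parseAttributes(tag) {
  const attrs = {};
  // Drop the leading "<name" and the trailing ">" or "/>".
  const body = tag.replace(/^<\s*[^\s/>]+/, "").replace(/\/?\s*>\s*$/, "");
  // An attribute name, optionally followed by "=" and a double-quoted,
  // single-quoted, or unquoted value.
  const attrPattern = /([^\s"'<>\/=]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)))?/g;
  let match;
  while ((match = attrPattern.exec(body)) !== null) {
    const name = match[1].toLowerCase();
    // A valueless attribute gets the empty string.
    const value = match[2] ?? match[3] ?? match[4] ?? "";
    // Keep the first occurrence of a repeated attribute.
    if (!Object.prototype.hasOwnProperty.call(attrs, name)) attrs[name] = value;
  }
  return attrs;
}

const metaTag =
  '<meta property="twitter:app:name:googleplay" ' +
  'name="twitter:app:name:googleplay" content="Pinterest" data-app>';
const parsed = parseAttributes(metaTag);
console.log(parsed.content);     // Pinterest
console.log(parsed["data-app"]); // empty string: present, but without a value
```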

@panzi
Contributor Author

panzi commented Jul 6, 2013

Erm, what should happen when I run forever start ./modules/test-dashboard/tester.js? I now have this process:

$ ~/node0.10/bin/forever start ./modules/test-dashboard/tester.js
warn:    --minUptime not set. Defaulting to: 1000ms
warn:    --spinSleepTime not set. Your script will exit if it does not stay up for at least 1000ms
info:    Forever processing file: ./modules/test-dashboard/tester.js
$ ps aux| grep node
root      1040  0.0  0.0   6960   884 ?        Ss   19:11   0:00 /usr/sbin/mcelog --ignorenodev --daemon --foreground
panzi    16805  3.2  0.4 679052 33780 ?        Ssl  22:58   0:00 /usr/bin/node /home/panzi/node0.10/lib/node_modules/forever/bin/monitor ./modules/test-dashboard/tester.js
panzi    16817  0.0  0.0 109256   884 pts/2    S+   22:59   0:00 grep --color=auto node

But it doesn't seem to do anything. (No CPU or IO usage.) What am I supposed to do now? I did uncomment the tests section in config.local.js. There is something about mongodb. Do I need to install and run mongodb? Does it need some special configuration? If so, why haven't I got an error about this?

@nleush
Member

nleush commented Jul 7, 2013

Yes, mongodb must be running.

Try the /tests URL - there should be an error if mongodb is not running. It is the test result dashboard, and it also has buttons to run tests for one plugin or to force testing of plugins.
Also check forever logs 0 (where 0 is the index of this process in forever).

This process tests all plugins at the configured interval, and also tests modified plugins first.

@nleush
Member

nleush commented Jul 7, 2013

Update: the dashboard URL is /tests, not /debug.

@nleush nleush closed this as completed Jul 8, 2013