Some pages can't be parsed #18
Comments
The known issue is an explicit parser error: http://iframely.com/debug?uri=%20http%3A%2F%2Fdabblet.com%2F%20
SAX really does have issues with processing invalid tags. I tried to fix that but haven't had enough time to handle it yet. We have two common requirements for a parser to be used in the current algorithm:
So it looks like it's possible to use node-htmlparser2. You can try to port iframely-meta to another parser. We will test it on all our plugins to see how stable it is. P.S. Testing of all plugins is available with |
About the issue with skipped meta tags: it's really strange, reproducible only locally and not every time. As far as I can see, the parser stops working at the META viewport tag and resumes at the META pinterest:following tag. Maybe it's because of
|
Yeah I also noticed the |
Erm, what should happen when I run
But it doesn't seem to do anything. (No CPU or IO usage.) What am I supposed to do now? I did uncomment the tests section in |
Yes, MongoDB must be running. Try This process tests all plugins at the configured interval, and also tests modified plugins first. |
upd: dashboard url not |
For example, this page causes problems:
http://pinterest.com/pin/315885361334150660/
On http://iframely.com/debug I get a 403. Locally the page loads, but then the SAX parser fails: it gets confused by the messy HTML and skips all the important meta tags. Strangely, I can't reproduce this in a minimal setup, only when using iframely. After debugging and adding log statements in iframely-meta.js, I'm pretty sure it's the SAX parser.
So maybe another HTML parser would be better, for example node-htmlparser2? They claim to be a lot faster than SAX, too. However, they don't resolve entities, but that is easily done with the entities package.
Should I port parsing of meta tags to node-htmlparser2? Maybe I'll do that even if it's just to find out whether it's really SAX that messes up.
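To make the goal concrete, here is a minimal, dependency-free sketch of what "tolerant" meta-tag extraction plus entity resolution could look like. It is neither node-htmlparser2 nor the code in iframely-meta.js: it is a plain regex scan, the names `decodeEntities` and `extractMetaTags` are hypothetical, and only a handful of named entities are covered. It also assumes attribute values contain no `>` character.

```javascript
// Hypothetical helpers for illustration only; not iframely or htmlparser2 APIs.
// Small illustrative subset of named entities.
const NAMED_ENTITIES = new Map([
  ['amp', '&'], ['lt', '<'], ['gt', '>'], ['quot', '"'], ['apos', "'"],
]);

// Resolve numeric (&#38; / &#x26;) and a few named (&amp;) entities.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1].toLowerCase() === 'x';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code > 0x10ffff ? match : String.fromCodePoint(code);
    }
    return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : match;
  });
}

// Collect the attributes of every <meta> tag, ignoring all other markup,
// so unclosed or misnested tags around them cannot derail the scan.
function extractMetaTags(html) {
  const tags = [];
  const metaRe = /<meta\b([^>]*)>/gi;
  let metaMatch;
  while ((metaMatch = metaRe.exec(html)) !== null) {
    const attrRe = /([^\s"'=<>\/]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/g;
    const attrs = {};
    let attrMatch;
    while ((attrMatch = attrRe.exec(metaMatch[1])) !== null) {
      // Exactly one of the three value groups is defined per match.
      const value = [attrMatch[2], attrMatch[3], attrMatch[4]]
        .find((v) => v !== undefined);
      attrs[attrMatch[1].toLowerCase()] = decodeEntities(value);
    }
    tags.push(attrs);
  }
  return tags;
}
```

A real port would hand this job to the parser's own tag callbacks. The sketch only shows that meta tags can be collected independently of how badly the surrounding markup is nested.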
Btw., when using $selector on valid HTML that is not well-formed XML, you also get errors. E.g. <p> tags don't need </p> in HTML: any further <p> automatically closes the last one. With $selector they are all nested, though. Yes, this is ugly and I don't write such HTML, but some people do.
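The difference between the two readings of unclosed <p> tags can be shown with a small sketch. It only counts <p> nesting depth under two policies: an XML-style one, where every open tag must be closed explicitly, and an HTML-style one, where a new <p> implicitly closes the currently open one. The function name `maxParagraphDepth` is hypothetical and is not part of $selector or iframely.

```javascript
// Hypothetical illustration; not a $selector or iframely function.
// Returns the deepest <p> nesting seen while scanning `html`.
// implicitClose = true  -> HTML-style: a new <p> closes the open <p> first.
// implicitClose = false -> XML-style: a <p> stays open until its </p>.
function maxParagraphDepth(html, implicitClose) {
  const tagRe = /<(\/?)p\b[^>]*>/gi;
  let depth = 0;
  let maxDepth = 0;
  let match;
  while ((match = tagRe.exec(html)) !== null) {
    if (match[1] === '/') {
      // Explicit closing tag; never go below zero on stray </p>.
      depth = Math.max(0, depth - 1);
    } else {
      if (implicitClose && depth > 0) {
        depth -= 1; // the previous paragraph ends here
      }
      depth += 1;
      maxDepth = Math.max(maxDepth, depth);
    }
  }
  return maxDepth;
}

const sample = '<p>one<p>two<p>three';
console.log(maxParagraphDepth(sample, true));  // siblings
console.log(maxParagraphDepth(sample, false)); // nested
```

With the HTML-style policy the three paragraphs stay siblings. With the XML-style policy each one ends up inside the previous one, which matches the nesting described above.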