cnn.com/edition.cnn.com no longer working #159
Comments
My guess (after a few tests) is that the HTML from CNN is now a real piece of shit with too many styles & scripts inlined (just check the source yourself, it's really ugly), so the parser can't properly parse the HTML, which means we can't extract data from it.
Hmm, their source is super ugly but I don't remember it looking very different from a couple of months ago when looking at the issue of it redirecting to an unsupported browser page. What's weird is that when I run it with f43.me, it seems to get the title and other
Actually, I just noticed something: they seem to start working with f43.me when I switch the parser to 'External'. It's too bad there's no source code for the Mercury parser; it would be interesting to see what it's doing differently. Is there any way to make graby dump what it actually parsed on a failure? I've spent far too much time on this (I don't even really read CNN except for breaking news lol), but I'm frustrated that it went back to not working after getting fixed with 15aa9c6. Before, I could see the unsupported browser page URL in the debug logs. Now I really want to know what happens between the point where it appears to parse the OpenGraph data correctly and the point where it fails to come up with anything.
The problem seems to come from Readability. There are pre-filters there that hard-remove code from the HTML page, like style & script tags: https://github.com/j0k3r/php-readability/blob/master/src/Readability.php#L122 And it seems that removing the style tags (which are way too heavy on CNN) ends up removing the whole page. That's why nothing comes out of graby.
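One way to check whether a pre-filter is what empties the page is to measure how much each filter removes. Below is a minimal Python sketch of that idea, not php-readability's actual code: the patterns, `apply_prefilters`, and `suspicious` are hypothetical stand-ins for the filters linked above, and the greedy pattern at the end is only one possible way a tag-stripping regex can swallow the article, not a confirmed cause.

```python
import re

# Hypothetical stand-ins for tag-stripping pre-filters (not the library's real patterns).
PRE_FILTERS = {
    "script": re.compile(r"<script\b[^>]*>.*?</script>", re.I | re.S),
    "style": re.compile(r"<style\b[^>]*>.*?</style>", re.I | re.S),
}


def apply_prefilters(html, filters=PRE_FILTERS):
    """Apply each filter in turn and record how many characters it removed."""
    report = []
    for name, pattern in filters.items():
        before = len(html)
        html = pattern.sub("", html)
        report.append((name, before - len(html)))
    return html, report


def suspicious(report, original_len, threshold=0.9):
    """Return the names of filters that removed more than `threshold` of the page."""
    if original_len == 0:
        return []
    return [name for name, removed in report if removed / original_len > threshold]


# Illustration only: a greedy variant matches from the first <style> to the
# last </style>, so it also deletes the article sitting between them.
GREEDY_FILTERS = {"style": re.compile(r"<style\b[^>]*>.*</style>", re.I | re.S)}

sample = "<html><style>a{}</style><p>Article text</p><style>b{}</style></html>"
for label, filters in (("lazy", PRE_FILTERS), ("greedy", GREEDY_FILTERS)):
    cleaned, report = apply_prefilters(sample, filters)
    print(label, report, "Article text" in cleaned)
```

Running the same report against the real page source would show which filter, if any, is responsible for the content shrinking to almost nothing.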
Argh, I just remembered that they have m.cnn.com; the source on that is way cleaner and is parsable. Instead of messing with Readability filters I can just use those URLs. Thank you for taking the time to look!
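The host swap can be sketched in a few lines of Python. This assumes the mobile site serves articles under the same paths as the desktop hosts, which has not been verified here; `to_mobile_url` and the `DESKTOP_HOSTS` set are hypothetical names for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: m.cnn.com mirrors the article paths of these desktop hosts.
DESKTOP_HOSTS = {"cnn.com", "www.cnn.com", "edition.cnn.com"}


def to_mobile_url(url):
    """Rewrite a desktop CNN URL to m.cnn.com; leave every other URL untouched."""
    parts = urlsplit(url)
    if parts.hostname in DESKTOP_HOSTS:
        # Only the host changes; scheme, path, query and fragment are kept.
        parts = parts._replace(netloc="m.cnn.com")
    return urlunsplit(parts)
```

Feed URLs would go through `to_mobile_url` before being handed to the parser, so the Readability filters themselves stay untouched.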
I know that the issue with the IE conditional was recently fixed, but I just discovered that it's no longer working. It's not getting redirected to the "Unsupported browser" page like before, however. From poking around their site in dev tools, the layout hasn't changed at all. By messing with it on f43.me, the only thing I'm able to see is that when it tries to grab the exact same div as before, the content-length is way smaller than it should be.