cnn.com/edition.cnn.com no longer working #159

Closed
4oo4 opened this issue Jul 16, 2018 · 5 comments

@4oo4

4oo4 commented Jul 16, 2018

I know that the issue with the IE conditional was recently fixed, but I just discovered that CNN articles are no longer working. This time they are not getting redirected to the "Unsupported browser" page like before. From poking around their site in dev tools, the layout doesn't appear to have changed at all. Testing on f43.me, the only thing I can see is that when it tries to grab the exact same element as before, the content length is way smaller than it should be.

[2018-07-09 12:02:23] graby.DEBUG: Graby is ready to fetch [] []
[2018-07-09 12:02:23] graby.DEBUG: . looking for site config for cnn.com in primary folder {"host":"cnn.com"} []
[2018-07-09 12:02:23] graby.DEBUG: ... found site config cnn.com.txt {"host":"cnn.com.txt"} []
[2018-07-09 12:02:23] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-07-09 12:02:23] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-07-09 12:02:23] graby.DEBUG: ... found site config global.txt {"host":"global.txt"} []
[2018-07-09 12:02:23] graby.DEBUG: Cached site config with key: cnn.com {"key":"cnn.com"} []
[2018-07-09 12:02:23] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-07-09 12:02:23] graby.DEBUG: ... found site config global.txt {"host":"global.txt"} []
[2018-07-09 12:02:23] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-07-09 12:02:23] graby.DEBUG: Cached site config with key: global {"key":"global"} []
[2018-07-09 12:02:23] graby.DEBUG: Cached site config with key: cnn.com.merged {"key":"cnn.com.merged"} []
[2018-07-09 12:02:23] graby.DEBUG: Fetching url: https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/ {"url":"https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/"} []
[2018-07-09 12:02:23] graby.DEBUG: Trying using method "get" on url "https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/" {"method":"get","url":"https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/"} []
[2018-07-09 12:02:23] graby.DEBUG: Use default user-agent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2" for url "https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/"} []
[2018-07-09 12:02:23] graby.DEBUG: Use default referer "http://www.google.co.uk/url?sa=t&source=web&cd=1" for url "https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/"} []
[2018-07-09 12:02:24] graby.DEBUG: Data fetched: [array] {"data":{"effective_url":"https://edition.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/","body":"(only length for debug): 1987844","headers":"text/html; charset=utf-8","all_headers":{"content-type":"text/html; charset=utf-8","x-servedbyhost":"::ffff:172.17.3.30","access-control-allow-origin":"*","cache-control":"max-age=60","content-security-policy":"default-src 'self' blob: https://*.cnn.com:* http://*.cnn.com:* *.cnn.io:* *.cnn.net:* *.turner.com:* *.turner.io:* *.ugdturner.com:* courageousstudio.com *.vgtf.net:*; script-src 'unsafe-eval' 'unsafe-inline' 'self' *; style-src 'unsafe-inline' 'self' blob: *; child-src 'self' blob: *; frame-src 'self' *; object-src 'self' *; img-src 'self' data: blob: *; media-src 'self' data: blob: *; font-src 'self' data: *; connect-src 'self' *; frame-ancestors 'self' https://*.cnn.com:* http://*.cnn.com https://*.cnn.io:* http://*.cnn.io:* *.turner.com:* courageousstudio.com;","x-content-type-options":"nosniff","x-xss-protection":"1; mode=block","via":"1.1 varnish, 1.1 varnish","content-length":"1988345","accept-ranges":"bytes","date":"Mon, 09 Jul 2018 17:02:24 GMT","age":"0","connection":"keep-alive","set-cookie":"countryCode=US; Domain=.cnn.com; Path=/, geoData=middletown|NY|10941|US|NA; Domain=.cnn.com; Path=/","x-served-by":"cache-iad2149-IAD, cache-msp9223-MSP","x-cache":"MISS, MISS","x-cache-hits":"0, 0","x-timer":"S1531155744.985921,VS0,VE732","vary":"Accept-Encoding, Fastly-SSL"},"status":200}} []
[2018-07-09 12:02:24] graby.DEBUG: Treating as UTF-8 {"encoding":"utf-8"} []
[2018-07-09 12:02:25] graby.DEBUG: Opengraph data: [array] {"ogData":{"og_pubdate":"2018-07-09T13:35:34Z","og_url":"https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/index.html","og_title":"Steve Bannon called 'piece of trash' by heckler at bookstore","og_description":"Former White House chief strategist Steve Bannon became the latest figure from President Donald Drumpf's world to be targeted with public harassment while browsing books in Richmond, Virginia, on Saturday afternoon. ","og_site_name":"CNN","og_type":"article","og_image":"https://cdn.cnn.com/cnnnext/dam/assets/180523145816-steve-bannon-05-22-2018-super-tease.jpg","og_image_width":"1100","og_image_height":"619"}} []
[2018-07-09 12:02:25] graby.DEBUG: Looking for site config files to see if single page link exists [] []
[2018-07-09 12:02:25] graby.DEBUG: . looking for site config for edition.cnn.com in primary folder {"host":"edition.cnn.com"} []
[2018-07-09 12:02:25] graby.DEBUG: ... found site config edition.cnn.com.txt {"host":"edition.cnn.com.txt"} []
[2018-07-09 12:02:25] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-07-09 12:02:25] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-07-09 12:02:25] graby.DEBUG: ... site config for global already loaded in this request {"host":"global"} []
[2018-07-09 12:02:25] graby.DEBUG: Cached site config with key: edition.cnn.com {"key":"edition.cnn.com"} []
[2018-07-09 12:02:25] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-07-09 12:02:25] graby.DEBUG: ... site config for global already loaded in this request {"host":"global"} []
[2018-07-09 12:02:25] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-07-09 12:02:25] graby.DEBUG: Cached site config with key: edition.cnn.com.merged {"key":"edition.cnn.com.merged"} []
[2018-07-09 12:02:25] graby.DEBUG: No "single_page_link" config found [] []
[2018-07-09 12:02:25] graby.DEBUG: Attempting to extract content [] []
[2018-07-09 12:02:25] graby.DEBUG: Returning cached and merged site config for edition.cnn.com {"host":"edition.cnn.com"} []
[2018-07-09 12:02:25] graby.DEBUG: Strings replaced: 0 (find_string and/or replace_string) {"count":0} []
[2018-07-09 12:02:25] graby.DEBUG: Attempting to parse HTML with libxml {"parser":"libxml"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //meta[@property="og:title"]/@content for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //meta[@property="article:published_time"]/@content for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //html[@lang]/@lang for language {"pattern":"//html[@lang]/@lang"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //meta[@name="DC.language"]/@content for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying highlights to strip element {"string":"highlights"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //section[contains(@class, 'body-text')] for body (content length: 232) {"pattern":"//section[contains(@class, 'body-text')]","content_length":232} []
[2018-07-09 12:02:25] graby.DEBUG: Using Readability [] []
[2018-07-09 12:02:25] graby.DEBUG: Detected title:  {"title":""} []
[2018-07-09 12:02:25] graby.DEBUG: Trying again without tidy [] []
[2018-07-09 12:02:25] graby.DEBUG: Strings replaced: 0 (find_string and/or replace_string) {"count":0} []
[2018-07-09 12:02:25] graby.DEBUG: Attempting to parse HTML with libxml {"parser":"libxml"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //meta[@property="og:title"]/@content for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //meta[@property="article:published_time"]/@content for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //html[@lang]/@lang for language {"pattern":"//html[@lang]/@lang"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //meta[@name="DC.language"]/@content for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying highlights to strip element {"string":"highlights"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //section[contains(@class, 'body-text')] for body (content length: 154) {"pattern":"//section[contains(@class, 'body-text')]","content_length":154} []
[2018-07-09 12:02:25] graby.DEBUG: Using Readability [] []
[2018-07-09 12:02:25] graby.DEBUG: Detected title:  {"title":""} []
[2018-07-09 12:02:25] graby.DEBUG: Success ?  {"is_success":false} []
[2018-07-09 12:02:25] graby.DEBUG: Extract failed [] []
[2018-07-09 12:02:25] app.DEBUG: DownloadImagesSubscriber: disabled. [] []
[2018-07-09 12:02:25] security.DEBUG: Stored the security token in the session. {"key":"_security_secured_area"} []
----------------------
[2018-07-09 11:59:11] app.DEBUG: Restricted access config enabled? {"enabled":0} []
[2018-07-09 11:59:11] graby.DEBUG: Graby is ready to fetch [] []
[2018-07-09 11:59:11] graby.DEBUG: . looking for site config for cnn.com in primary folder {"host":"cnn.com"} []
[2018-07-09 11:59:11] graby.DEBUG: ... found site config cnn.com.txt {"host":"cnn.com.txt"} []
[2018-07-09 11:59:11] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-07-09 11:59:11] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-07-09 11:59:11] graby.DEBUG: ... found site config global.txt {"host":"global.txt"} []
[2018-07-09 11:59:11] graby.DEBUG: Cached site config with key: cnn.com {"key":"cnn.com"} []
[2018-07-09 11:59:11] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-07-09 11:59:11] graby.DEBUG: ... found site config global.txt {"host":"global.txt"} []
[2018-07-09 11:59:11] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-07-09 11:59:11] graby.DEBUG: Cached site config with key: global {"key":"global"} []
[2018-07-09 11:59:11] graby.DEBUG: Cached site config with key: cnn.com.merged {"key":"cnn.com.merged"} []
[2018-07-09 11:59:11] graby.DEBUG: Fetching url: https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html {"url":"https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html"} []
[2018-07-09 11:59:11] graby.DEBUG: Trying using method "get" on url "https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html" {"method":"get","url":"https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html"} []
[2018-07-09 11:59:11] graby.DEBUG: Use default user-agent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2" for url "https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html"} []
[2018-07-09 11:59:11] graby.DEBUG: Use default referer "http://www.google.co.uk/url?sa=t&source=web&cd=1" for url "https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html"} []
[2018-07-09 11:59:12] graby.DEBUG: Data fetched: [array] {"data":{"effective_url":"https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html","body":"(only length for debug): 2144783","headers":"text/html; charset=utf-8","all_headers":{"content-type":"text/html; charset=utf-8","x-servedbyhost":"::ffff:172.17.93.27","access-control-allow-origin":"*","cache-control":"max-age=60","content-security-policy":"default-src 'self' blob: https://*.cnn.com:* http://*.cnn.com:* *.cnn.io:* *.cnn.net:* *.turner.com:* *.turner.io:* *.ugdturner.com:* courageousstudio.com *.vgtf.net:*; script-src 'unsafe-eval' 'unsafe-inline' 'self' *; style-src 'unsafe-inline' 'self' blob: *; child-src 'self' blob: *; frame-src 'self' *; object-src 'self' *; img-src 'self' data: blob: *; media-src 'self' data: blob: *; font-src 'self' data: *; connect-src 'self' *; frame-ancestors 'self' https://*.cnn.com:* http://*.cnn.com https://*.cnn.io:* http://*.cnn.io:* *.turner.com:* courageousstudio.com;","x-content-type-options":"nosniff","x-xss-protection":"1; mode=block","via":"1.1 varnish, 1.1 varnish","content-length":"2145284","accept-ranges":"bytes","date":"Mon, 09 Jul 2018 16:59:12 GMT","age":"188","connection":"keep-alive","set-cookie":"countryCode=US; Domain=.cnn.com; Path=/, geoData=middletown|NY|10941|US|NA; Domain=.cnn.com; Path=/, tryThing00=0732; Domain=.cnn.com; Path=/; Expires=Mon Jul 01 2019 00:00:00 GMT, tryThing01=3094; Domain=.cnn.com; Path=/; Expires=Fri Mar 01 2019 00:00:00 GMT, tryThing02=2413; Domain=.cnn.com; Path=/; Expires=Wed Jan 01 2020 00:00:00 GMT","x-served-by":"cache-iad2150-IAD, cache-jfk8148-JFK","x-cache":"HIT, HIT","x-cache-hits":"2, 1","x-timer":"S1531155552.081472,VS0,VE4","vary":"Accept-Encoding, Fastly-SSL"},"status":200}} []
[2018-07-09 11:59:12] graby.DEBUG: Treating as UTF-8 {"encoding":"utf-8"} []
[2018-07-09 11:59:13] graby.DEBUG: Opengraph data: [array] {"ogData":{"og_pubdate":"2018-07-09T05:56:08Z","og_url":"https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html","og_title":"Thai cave rescue suspended for the day after four more boys freed","og_description":"The second day of rescue operations at the cave site in northern Thailand has ended after four more boys were brought out of the flooded cave system Monday.","og_site_name":"CNN","og_type":"article","og_image":"https://cdn.cnn.com/cnnnext/dam/assets/180709063655-01-thai-cave-fifth-boy-rescue-0709-super-tease.jpg","og_image_width":"1100","og_image_height":"619"}} []
[2018-07-09 11:59:13] graby.DEBUG: Looking for site config files to see if single page link exists [] []
[2018-07-09 11:59:13] graby.DEBUG: Returning cached and merged site config for cnn.com {"host":"cnn.com"} []
[2018-07-09 11:59:13] graby.DEBUG: No "single_page_link" config found [] []
[2018-07-09 11:59:13] graby.DEBUG: Attempting to extract content [] []
[2018-07-09 11:59:13] graby.DEBUG: Returning cached and merged site config for cnn.com {"host":"cnn.com"} []
[2018-07-09 11:59:13] graby.DEBUG: Strings replaced: 0 (find_string and/or replace_string) {"count":0} []
[2018-07-09 11:59:13] graby.DEBUG: Attempting to parse HTML with libxml {"parser":"libxml"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //meta[@property="og:title"]/@content for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //meta[@property="article:published_time"]/@content for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //html[@lang]/@lang for language {"pattern":"//html[@lang]/@lang"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //meta[@name="DC.language"]/@content for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying highlights to strip element {"string":"highlights"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //section[contains(@class, ' body-text ')] for body (content length: 232) {"pattern":"//section[contains(@class, ' body-text ')]","content_length":232} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //section[contains(@class, ' l-container ')] for body (content length: 232) {"pattern":"//section[contains(@class, ' l-container ')]","content_length":232} []
[2018-07-09 11:59:13] graby.DEBUG: Using Readability [] []
[2018-07-09 11:59:13] graby.DEBUG: Detected title:  {"title":""} []
[2018-07-09 11:59:13] graby.DEBUG: Trying again without tidy [] []
[2018-07-09 11:59:13] graby.DEBUG: Strings replaced: 0 (find_string and/or replace_string) {"count":0} []
[2018-07-09 11:59:13] graby.DEBUG: Attempting to parse HTML with libxml {"parser":"libxml"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //meta[@property="og:title"]/@content for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //meta[@property="article:published_time"]/@content for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //html[@lang]/@lang for language {"pattern":"//html[@lang]/@lang"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //meta[@name="DC.language"]/@content for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying highlights to strip element {"string":"highlights"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //section[contains(@class, ' body-text ')] for body (content length: 154) {"pattern":"//section[contains(@class, ' body-text ')]","content_length":154} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //section[contains(@class, ' l-container ')] for body (content length: 154) {"pattern":"//section[contains(@class, ' l-container ')]","content_length":154} []
[2018-07-09 11:59:13] graby.DEBUG: Using Readability [] []
[2018-07-09 11:59:13] graby.DEBUG: Detected title:  {"title":""} []
[2018-07-09 11:59:13] graby.DEBUG: Success ?  {"is_success":false} []
[2018-07-09 11:59:13] graby.DEBUG: Extract failed [] []
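
To compare the body-extraction attempts without reading every line, logs like the ones above can be reduced to the XPath pattern and `content_length` of each attempt. What follows is a minimal Python sketch, assuming the Monolog-style layout seen in these logs (`[timestamp] channel.LEVEL: message {json context} []`). The names `LINE_RE`, `body_attempts` and `SAMPLE` are made up for this example and are not part of graby.

```python
import json
import re

# Assumed layout: [timestamp] channel.LEVEL: message {json context} []
LINE_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] (?P<channel>\w+)\.(?P<level>\w+): '
    r'(?P<message>.*?) (?P<context>\{.*\}|\[\]) \[\]$'
)


def body_attempts(log_text):
    """Return (timestamp, xpath pattern, content_length) for each body attempt."""
    attempts = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group("context") == "[]":
            continue  # not a log line, or a line without context data
        context = json.loads(match.group("context"))
        if "content_length" in context:
            attempts.append(
                (match.group("ts"), context["pattern"], context["content_length"])
            )
    return attempts


# Two lines copied from the first log in this thread.
SAMPLE = r"""
[2018-07-09 12:02:25] graby.DEBUG: Trying //section[contains(@class, 'body-text')] for body (content length: 232) {"pattern":"//section[contains(@class, 'body-text')]","content_length":232} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //section[contains(@class, 'body-text')] for body (content length: 154) {"pattern":"//section[contains(@class, 'body-text')]","content_length":154} []
"""

for ts, pattern, length in body_attempts(SAMPLE):
    print(ts, length, pattern)
```

In both logs above, every pattern reports a content length of 232 on the first pass and 154 on the retry without tidy, even though the fetched bodies are around 2 MB.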
@j0k3r
Owner

j0k3r commented Aug 24, 2018

My guess (after a few tests) is that the HTML from CNN is now a real piece of shit, with too many styles & scripts inlined (just check the source yourself, it's really ugly). The parser can't properly parse the HTML, which means we can't extract data from it.

@4oo4
Author

4oo4 commented Aug 26, 2018

Hmm, their source is super ugly, but I don't remember it looking very different a couple of months ago, when I was looking into the issue of it redirecting to an unsupported browser page. What's weird is that when I run it with f43.me, it seems to get the title and other <meta> tags OK, but then they don't show up in the final result. Based on that, I was wondering if it might be some kind of redirect again (possibly to the unsupported browser page like before).

{
    "og_pubdate": "2018-08-26T18:20:59Z",
    "og_url": "https://www.cnn.com/2018/08/26/us/jacksonville-madden-shooting/index.html",
    "og_title": "Mass shooting at video game tournament in Jacksonville leaves multiple dead",
    "og_description": "Multiple people were killed in a shooting during a video game tournament at a shopping and dining complex in downtown Jacksonville, Florida, the Jacksonville Sheriff's Office said Sunday afternoon.",
    "og_site_name": "CNN",
    "og_type": "article",
    "og_image": "https://cdn.cnn.com/cnnnext/dam/assets/180826145832-01-jacksonville-shooting-0826-super-tease.jpg",
    "og_image_width": "1100",
    "og_image_height": "619"
}
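
One way to check the redirect idea against a debug log is to compare the URL graby was asked to fetch with the `effective_url` it reports afterwards. In the first log earlier in this thread, the request went to www.cnn.com and the effective URL was on edition.cnn.com. The sketch below is a rough Python helper, assuming the "Fetching url" and "Data fetched" line shapes seen in those logs; `context_of`, `find_redirects` and `SAMPLE_LOG` are hypothetical names, not graby functions.

```python
import json


def context_of(line):
    """Return the JSON context object of a Monolog-style line, or None."""
    start = line.find(" {")
    end = line.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        return json.loads(line[start + 1 : end + 1])
    except json.JSONDecodeError:
        return None


def find_redirects(log_text):
    """Pair each 'Fetching url' line with the next 'Data fetched' line."""
    redirects = []
    requested = None
    for line in log_text.splitlines():
        if "Fetching url:" in line:
            context = context_of(line)
            requested = context.get("url") if context else None
        elif "Data fetched:" in line and requested:
            context = context_of(line) or {}
            effective = context.get("data", {}).get("effective_url")
            if effective and effective != requested:
                redirects.append((requested, effective))
            requested = None
    return redirects


# Abbreviated from the first log in this thread (headers dropped from the second line).
SAMPLE_LOG = r"""
[2018-07-09 12:02:23] graby.DEBUG: Fetching url: https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/ {"url":"https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/"} []
[2018-07-09 12:02:24] graby.DEBUG: Data fetched: [array] {"data":{"effective_url":"https://edition.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/","status":200}} []
"""

for requested, effective in find_redirects(SAMPLE_LOG):
    print(requested, "->", effective)
```

This only reports a change of URL. It cannot tell whether the page served at the new URL is the article or some fallback page.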

@4oo4
Author

4oo4 commented Aug 26, 2018

Actually, I just noticed something: the articles start working with f43.me when I switch the parser to 'External'. It's too bad there's no source code for the Mercury parser; it would be interesting to see what it's doing differently.

Is there any way to make graby dump what it actually parsed on a failure? I've spent far too much time on this (I don't even really read CNN except for breaking news lol), but I'm frustrated that it went back to not working after getting fixed with 15aa9c6. Before, I could see the unsupported browser page URL in the debug logs. Now I really want to know what happens between the point where it appears to parse the OpenGraph data correctly and the point where it fails to come up with anything.

@j0k3r
Owner

j0k3r commented Aug 27, 2018

The problem seems to come from Readability. It has pre-filters that forcibly strip code from the HTML page, such as style & script tags: https://github.com/j0k3r/php-readability/blob/master/src/Readability.php#L122

It seems that removing the style tags (which are way too heavy on CNN) removes the whole page. That's why nothing comes out of graby.
The pre_filters parameters are defined globally, not on a per-site_config basis.
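
To illustrate how a regex pre-filter can take the article with it, here is a small Python sketch. It is not php-readability's code: the two patterns are assumptions standing in for whatever the real pre-filters are, and the malformed input (an unclosed `<style>` in the head) is just one hypothetical way this could happen, not a confirmed diagnosis of CNN's markup. The `report` dictionary shows the kind of before/after check that would make such a loss visible.

```python
import re

# Assumed patterns for illustration only; not copied from php-readability.
PRE_FILTERS = [
    re.compile(r"<script\b[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL),
    re.compile(r"<style\b[^>]*>.*?</style>", re.IGNORECASE | re.DOTALL),
]


def count_paragraphs(html):
    """Count opening <p> tags as a rough proxy for article content."""
    return len(re.findall(r"<p\b", html, re.IGNORECASE))


def apply_pre_filters(html):
    """Strip script/style blocks and report what was left behind."""
    filtered = html
    for pattern in PRE_FILTERS:
        filtered = pattern.sub("", filtered)
    report = {
        "chars_before": len(html),
        "chars_after": len(filtered),
        "paragraphs_before": count_paragraphs(html),
        "paragraphs_after": count_paragraphs(filtered),
    }
    return filtered, report


# Hypothetical malformed page: the first <style> is never closed, so the
# filter removes everything up to the next </style>, including the article.
broken = (
    "<html><head><style>a{}</head><body><p>Article text</p>"
    "<style>b{}</style></body></html>"
)
filtered, report = apply_pre_filters(broken)
print(filtered)
print(report)
```

A page that goes in with paragraphs and comes out with none is a strong hint that a filter, rather than the extraction rules, is the step that lost the content.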

@4oo4
Author

4oo4 commented Aug 27, 2018

Argh, I just remembered that they have m.cnn.com. The source on that is way cleaner and is parsable, so instead of messing with Readability filters I can just use those URLs.

Thank you for taking the time to look!
