Goose fails in extracting articles from Gizmodo and NY Times. #34

Closed
MojoJolo opened this issue Aug 18, 2013 · 6 comments

Comments

@MojoJolo

I tried using Goose to extract articles from Gizmodo and the NY Times, but it returned either a blank string or a faulty extraction.

[Image: screen shot 2013-08-18 at 7 52 05 pm]

[Image: screen shot 2013-08-18 at 7 53 10 pm]

@grangier
Owner

Hello,

First of all, why use screenshots to report an issue? There is no way to copy and paste the URLs from them.
Regarding the NYT, this seems to be a cookie issue. Even curl is not able to retrieve the raw HTML:

curl -I "http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html"
HTTP/1.1 303 See Other
Date: Sun, 18 Aug 2013 14:17:10 GMT
Server: Apache
Set-Cookie: RMID=007f01000c0c5210d766001c; Expires=Mon, 18 Aug 2014 14:17:10 GMT; Path=/; Domain=.nytimes.com;
Vary: Host
Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html&OQ=_rQ3D0&OP=15f69d57Q2FQ3Cg_ZQ3C.tZQ3CsssQ3CjZ!0Q3CzgMQ3AQ7EggZAQ3CART@Q3CRFQ3CTFQ3CsgQ7E0zQ3C!Q60zz0ccQ5CQ3AZQ3C_Q7EcQ3AQ3AQ7BQ7EcQ7CetQ7CQ7BQ3AQ7COQ5CQ600czQ7CZgQ7CQ3AsQ5CtQ7CcQ7Dt_ZQ3AQ7C0cQ5CzcQ7EQ3A3jZ!0
Connection: close
Content-Type: text/plain
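
The headers above show the pattern: a 303 redirect that sets a cookie and points at a login URL, so a client that does not send the cookie back never reaches the article. As a rough sketch (Python 3 standard library only, not part of Goose; the function names are hypothetical and the `Location` value in the sample is shortened from the output above), such a response could be detected like this:

```python
# Hypothetical helpers, not part of Goose: detect a "set a cookie, then
# redirect" response from a raw header dump such as `curl -I` output.

REDIRECT_CODES = {301, 302, 303, 307, 308}


def parse_head_response(raw):
    """Split a raw header dump into (status_code, headers_dict)."""
    lines = [line for line in raw.strip().splitlines() if line.strip()]
    # The status line looks like "HTTP/1.1 303 See Other".
    status = int(lines[0].split()[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers


def looks_like_cookie_gate(status, headers):
    """True when the response redirects and sets a cookie at the same time."""
    return (
        status in REDIRECT_CODES
        and "set-cookie" in headers
        and "location" in headers
    )


# Sample taken from the curl output above; the Location query string is shortened.
SAMPLE = """HTTP/1.1 303 See Other
Server: Apache
Set-Cookie: RMID=007f01000c0c5210d766001c; Path=/; Domain=.nytimes.com;
Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html
Content-Type: text/plain
"""

status, headers = parse_head_response(SAMPLE)
print(status, looks_like_cookie_gate(status, headers))
```

A response that matches this check is a hint that the page should be fetched with a cookie-aware client before it is handed to the extractor.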

Regarding Gizmodo, thanks to Google I found the URL: http://gizmodo.com/the-gear-and-apps-you-need-to-survive-the-next-semester-1141460933

It seems that the structure of the HTML page is too complicated for Goose.

@MojoJolo
Author

Hi, sorry about using a screenshot. This is my first time reporting an issue. Will take note of it. Thanks for the reply.

Any recommendation or fallback to extract those kinds of websites?

@grangier
Owner

The Gizmodo issue should be fixed in the latest HEAD:

>>> url = "http://gizmodo.com/the-gear-and-apps-you-need-to-survive-the-next-semester-1141460933"
>>> import goose
>>> g = goose.Goose()
>>> a = g.extract(url=url)
>>> a.cleaned_text[:150]
u"Okay, this is it. Back to school, again. Whether it's your first college semester or you can see graduation on the horizon, these tools will make the "

@grangier
Owner

For the NYT, the issue seems to be cookie handling. I guess commit 4d1ccaf does not work in favor of cookie handling.

At the moment, the only way to extract NYT content is to use the raw_html method:

>>> import urllib2
>>> import goose
>>> 
>>> # fetch html
... url = "http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html?hp"
>>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
>>> response = opener.open(url)
>>> raw_html = response.read()
>>> 
>>> # goose
... g = goose.Goose()
>>> a = g.extract(raw_html=raw_html)
>>> a.cleaned_text
u'CAIRO \u2014 For a moment, at least, American and European diplomats trying to defuse the volatile standoff in Egypt thought they had a breakthrough.\n\nAs thousands of Islamist supporters of the ousted president, Mohamed Morsi, braced for a crackdown by the military-imposed government, a senior European diplomat, Bernardino Le\xf3n, told the Islamists of \u201cindications\u201d from the leadership that within hours it would free two imprisoned opposition leaders. In turn, the Islamists had agreed to reduce the size of two protest camps by about half.\n\nAn hour passed, and nothing happened. Another hour passed, and still no one had been released.\n\nThe Americans heightened the pressure. Two senators visiting Cairo, John McCain of Arizona and Lindsey Graham of South Carolina, met with Gen. Abdul-Fattah el-Sisi, the officer who ousted Mr. Morsi and appointed the new government, and the interim prime minister, Hazem el-Beblawi, and pushed for the release of the two prisoners. But the Egyptians brushed them off.\n\n\u201cYou could tell people were itching for a fight,\u201d Mr. Graham recalled in an interview. \u201cThe prime minister was a disaster. He kept preaching to me: \u2018You can\u2019t negotiate with these people. They\u2019ve got to get out of the streets and respect the rule of law.\u2019 I said: \u2018Mr. Prime Minister, it\u2019s pretty hard for you to lecture anyone on the rule of law. How many votes did you get? Oh, yeah, you didn\u2019t have an election.\u2019\xa0\u201d\n\nGeneral Sisi, Mr. Graham said, seemed \u201ca little bit intoxicated by power.\u201d\n\nThe senators walked out that day, Aug. 6, gloomy and convinced that a violent showdown was looming. 
But the diplomats still held out hope, believing they had persuaded Egypt\u2019s government at least not to declare the talks a failure.\n\nThe next morning, the government issued a statement declaring that diplomatic efforts had been exhausted and blaming the Islamists for any casualties from the coming crackdown. A week later, Egyptian forces opened a ferocious assault that so far has killed more than 1,000 protesters.\n\nAll of the efforts of the United States government, all the cajoling, the veiled threats, the high-level envoys from Washington and the 17 personal phone calls by Defense Secretary Chuck Hagel, failed to forestall the worst political bloodletting in modern Egyptian history. The generals in Cairo felt free to ignore the Americans first on the prisoner release and then on the statement, in a cold-eyed calculation that they would not pay a significant cost \u2014 a conclusion bolstered when President Obama responded by canceling a joint military exercise but not $1.5 billion in annual aid.\n\nThe violent crackdown has left Mr. Obama in a no-win position: risk a partnership that has been the bedrock of Middle East peace for 35 years, or stand by while longtime allies try to hold on to power by mowing down opponents. From one side, the Israelis, Saudis and other Arab allies have lobbied him to go easy on the generals in the interest of thwarting what they see as the larger and more insidious Islamist threat. From the other, an unusual mix of conservatives and liberals has urged him to stand more forcefully against the sort of autocracy that has been a staple of Egyptian life for decades.\n\nFor now the administration has decided to keep the close relationship with the Egyptian military fundamentally unchanged. But the death toll is climbing, the streets are descending into chaos, and the government and the Islamists are vowing to escalate. 
It is unclear if the military\u2019s new government can reimpose a version of the old order now that the public believes street protests have toppled two leaders in less than three years, or if, after winning democratic elections, the Islamists will ever again compliantly retreat.\n\nAs Mr. Obama acknowledged in a statement on Thursday, the American response turns not only on humanitarian values but also on national interests. A country consumed by civil strife may no longer function as a stabilizing ally in a volatile region.'
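
The session above uses Python 2's `urllib2`. As a rough sketch only, the same workaround could be written for Python 3 with `urllib.request` and `http.cookiejar`, and wrapped in a fallback that tries the URL first and retries with cookie-aware raw HTML when the extracted text is empty. The helper names (`build_cookie_opener`, `fetch_html`, `extract_with_fallback`) are hypothetical and not part of Goose; `extractor` is assumed to be any object with the `extract(url=...)` and `extract(raw_html=...)` calls shown in this thread.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies via the jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_html(url, opener, timeout=30):
    """Fetch the page body as bytes; redirects are followed with cookies kept."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()


def extract_with_fallback(url, extractor, opener):
    """Try extractor.extract(url=...) first, then retry with raw HTML.

    `extractor` is assumed to expose extract(url=...) and
    extract(raw_html=...), as goose.Goose() does in the session above.
    """
    article = extractor.extract(url=url)
    if article.cleaned_text.strip():
        return article
    # Empty result: fetch the page ourselves with cookies enabled
    # and hand the raw HTML to the extractor instead.
    return extractor.extract(raw_html=fetch_html(url, opener))
```

Usage would look like `opener, jar = build_cookie_opener()` followed by `extract_with_fallback(url, goose.Goose(), opener)`. Depending on the Goose version, the raw bytes may need decoding before they are passed as `raw_html`.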

@grangier
Owner

I am closing this issue. I opened a ticket for cookie handling: #35

@MojoJolo
Author

Thanks!
