
urllib2 uncaught exception when request is not answered #2

Closed
newlog opened this issue Sep 19, 2012 · 2 comments



newlog commented Sep 19, 2012

The problem is in the google.py file, at line 96 (the line number might differ because of some modifications I've made to the code). The exception looks like this:


Traceback (most recent call last):
  File "mini_qa.py", line 325, in <module>
    pretty_qa("Who is the world's best tennis player?")
  File "mini_qa.py", line 78, in pretty_qa
    for (j, (answer, score)) in enumerate(qa(question, source)[:num]):
  File "mini_qa.py", line 96, in qa
    gqa = google_qa(question)
  File "mini_qa.py", line 111, in google_qa
    for summary in get_summaries(query.query):
  File "mini_qa.py", line 187, in get_summaries
    results = search(query)
  File "/Users/newlog/Documents/Proyectos/misc/github/mini_qa/google.py", line 181, in search
    html = get_page(url)
  File "/Users/newlog/Documents/Proyectos/misc/github/mini_qa/google.py", line 95, in get_page
    response = urllib2.urlopen(request)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 432, in error
    result = self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 619, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 503: Service Unavailable

My suggested solution, in the get_page method, is:

try:
    response = urllib2.urlopen(request)
    cookie_jar.extract_cookies(response, request)
    html = response.read()
    cookie_jar.save()
    response.close()
    return html
except urllib2.URLError, err:
    print "[-] Error making the request: " + str(err)
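(As a side note: urllib2.HTTPError is a subclass of urllib2.URLError, so the except clause above should also catch the HTTP 503 from the traceback. A quick check in a Python 2 shell:)

import urllib2

# HTTPError derives from URLError, so catching URLError covers the 503 case.
print issubclass(urllib2.HTTPError, urllib2.URLError)  # prints True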

And in the search method from the same file (the last few lines):

# Request the Google Search results page.
html = get_page(url)

# Parse the response and extract the summaries.
if html:
    soup = BeautifulSoup.BeautifulSoup(html)
    return soup.findAll("div", {"class": "s"})
else:
    return []

Thanks for your work.

mnielsen (Owner) commented

If you're getting a 503, it's possible that Google is unhappy that you're making multiple requests over a short period of time. In earlier versions I had the same problem.

If that's the case, you may wish to change the line in google.py which reads

time.sleep(pause+(random.random()-0.5)*5)

to something with a longer pause, say 10 or even 20 seconds. The problem, of course, is that this slows things down. But at least with the caching of results it should only be a problem once.
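For example, a rough sketch of what that line could look like with a larger base delay (the 15 here is just an illustrative value, not something taken from the code):

import random
import time

# Pause roughly 15 seconds, plus or minus 2.5 seconds of jitter,
# between Google requests to make a 503 less likely.
pause = 15
time.sleep(pause + (random.random() - 0.5) * 5)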


newlog commented Sep 26, 2012

Hello Michael,

I know, I know. The reason is what you said ;) I just meant that the exception was not caught; I was only pointing it out to be a purist.

BTW, I commented out that line hahaha.

Again, great work.
