Best practice for handling InvalidResponseError exceptions? #6
In both cases, it doesn't seem like a timeout would help much. As for suggestion 2), I'd have to look into the webkit_server sources to make it more robust. For the purpose it's usually used for (web application testing), it's in fact desirable that exceptions are thrown in these cases, so I don't think such functionality is built in. I'll look at it as soon as I find the time :) It would help a lot if you could provide minimal scripts that try to access these pages, so I can reproduce the problem.
```python
from dryscrape import Session

sess = Session()
link = 'http://www.omlinks.com'
sess.visit(link)
```

Result:

```
InvalidResponseError: Error while loading URL http://www.omlinks.com/images/poster/mirror-mirror.jpg: Error downloading http://www.omlinks.com/images/poster/mirror-mirror.jpg - server replied: Not Found (error code 203)
```
And for error code 5, try the above script with link = 'http://www.nutorrent.com'.
I should also add that occasionally an InvalidResponseError is thrown by sess.render(), after sess.visit() and sess.wait() have returned successfully:

```
File "/home/user1/projects/MyBot/MyScraper.py", line 87, in Scrape
```
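One defensive pattern for the render() case described above is to wrap the call and fall back to a default on failure. A minimal sketch, not part of dryscrape's API; the InvalidResponseError class here is a local stand-in so the snippet is self-contained (in real code you would catch the one raised by webkit_server):

```python
class InvalidResponseError(Exception):
    """Stand-in for the exception raised by dryscrape/webkit_server."""

def safe_render(render, fallback=None):
    """Call a zero-argument render callable, swallowing InvalidResponseError.

    Returns the rendered result, or `fallback` if the call raised.
    """
    try:
        return render()
    except InvalidResponseError:
        return fallback
```

A caller would pass the real call wrapped in a lambda, e.g. `safe_render(lambda: sess.render('page.png'))` (the render argument here is an assumption), so a late failure yields the fallback instead of aborting the whole scrape.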
A list of some exceptions thrown when loading sub-elements of some sites, even with sess.set_error_tolerant(True): Connection Refused (error code 1), e.g. 'http://www/dl4all.com'.
For my purposes, it would be ideal to pass in a list of error codes that should not throw exceptions when loading referenced sub-elements of a page. Specifying error codes that throw when accessing the base page itself is less important, as they generally mean the page isn't available.
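Since dryscrape's InvalidResponseError only carries the server's message string, a whitelist like the one requested above could be approximated in user code by parsing the numeric code back out of the message. A sketch under that assumption; the helper names, the regular expression, and the example whitelist are all hypothetical, not part of dryscrape's API:

```python
import re

# Hypothetical whitelist: codes seen in this thread (Operation canceled,
# Not Found) that a scraper might choose to tolerate.
IGNORABLE_CODES = {5, 203}

def error_code(message):
    """Extract the numeric code from an '... (error code N)' message."""
    match = re.search(r'error code (\d+)', message)
    return int(match.group(1)) if match else None

def should_ignore(message):
    """True if the message carries one of the whitelisted error codes."""
    return error_code(message) in IGNORABLE_CODES
```

A caller could then wrap sess.visit() in try/except and re-raise only when should_ignore() returns False for the exception's message.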
@pommygranite: While the general possibility of ignoring any error in associated resources seems like a useful addition, what you described in your last comment is very specific. I don't think I will implement it in that form, but I'll accept clean pull requests :)
This oversight is actually critical and the main reason why I stopped using dryscrape for my scraping project. A headless browser for scraping must be robust; I don't understand why there's an "error tolerant" mode that does nothing. It is literally impossible to get past error code 203 or error code 5.
@tsiomenko: This is unfortunate. I completely agree that an "error tolerant" mode that does nothing is confusing. The thing is that webkit_server is not my project; it's borrowed from capybara-webkit, where robustness is not really an issue (in fact, for web testing, you actually want to catch those errors). I spent some time trying to figure out how Qt's WebKit implementation handles this kind of thing and was under the impression that the interface doesn't really provide mechanisms for selectively dropping errors. I guess this is due to the unclear semantics in this case: how should a site load if a JavaScript file is missing? What should happen if images or inner frames cannot be loaded? Should this perhaps depend on whether the image/frame is necessary for the page to work? A real browser probably implements nasty heuristics for these cases. Remember that this is a community project, so if you have an idea on how to improve this, feel free to provide a pull request or outline your idea in a ticket.
So is there no way to properly deal with these kinds of errors right now?
I remember trying several different approaches to get it to ignore errors that happen during certain requests (for example, images). I didn't have much luck, but maybe the open pull request on webkit-server will bring code improvements that make something like this easier to implement. I still don't know what the correct semantics should be, though.
Okay, thanks for the reply! Hopefully you or someone else can figure it out. Sadly it's not something I can help with.
@tsiomenko @Bazza74 This should be fixed in version 0.9. |
Sometimes sess.visit() throws an exception similar to this:
Error while loading URL https://apis.google.com/_/apps-static/_/js/gapi/googleapis_client,iframes_styles_bubble_internal/rt=j/ver=8ruqBK5Rz68.en_GB./sv=1/am=!Ze6NnRS0VYCICGRMrA/d=1/rs=AItRSTMCkBLPuEGW-K5opwJvfmORrpspJQ: Operation canceled (error code 5)
And this:
Error while loading URL http://www.asite.com/images/poster/mirror-mirror.jpg: Error downloading http://www.asite.com/images/poster/mirror-mirror.jpg - server replied: Not Found (error code 203)
It looks like one or more elements on the page I requested have failed to load. The exceptions are thrown before sess.wait() can be called, and are thrown despite having set sess.set_error_tolerant(True).
I might like to 1) retry the page with a specified timeout, or 2) use the fetched page anyway, despite the absence of some images or nested frames.
What are the best practices for achieving 1 and 2?
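For 1), since dryscrape offers no built-in retry, the usual workaround is a small retry loop around visit(). A sketch of that pattern; the exception class is a local stand-in so the snippet runs standalone (in real code, catch the InvalidResponseError from webkit_server and pass sess.visit as the visit callable):

```python
import time

class InvalidResponseError(Exception):
    """Stand-in for the exception raised by dryscrape/webkit_server."""

def visit_with_retry(visit, url, attempts=3, delay=1.0):
    """Call visit(url) up to `attempts` times, sleeping `delay` seconds
    after a failure. Returns True on success, False if every attempt
    raised InvalidResponseError."""
    for attempt in range(attempts):
        try:
            visit(url)
            return True
        except InvalidResponseError:
            if attempt + 1 < attempts:
                time.sleep(delay)
    return False
```

This only addresses transient failures (e.g. "Operation canceled"); a hard 404 on a sub-resource will fail on every attempt, which is the scenario question 2) is about.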