This repository has been archived by the owner on Dec 9, 2018. It is now read-only.

Best practice for handling InvalidResponseError exceptions? #6

Closed
pommygranite opened this issue Apr 3, 2012 · 13 comments

Comments

@pommygranite

Sometimes sess.visit() throws an exception similar to this:

Error while loading URL https://apis.google.com/_/apps-static/_/js/gapi/googleapis_client,iframes_styles_bubble_internal/rt=j/ver=8ruqBK5Rz68.en_GB./sv=1/am=!Ze6NnRS0VYCICGRMrA/d=1/rs=AItRSTMCkBLPuEGW-K5opwJvfmORrpspJQ: Operation canceled (error code 5)

And this:

Error while loading URL http://www.asite.com/images/poster/mirror-mirror.jpg: Error downloading http://www.asite.com/images/poster/mirror-mirror.jpg - server replied: Not Found (error code 203)

It looks like one or more elements on the page I requested have failed to load. The exceptions are thrown before sess.wait() can be called, and they are thrown even though I have set sess.set_error_tolerant(True).

I might like to 1) retry the page with a specified timeout or 2) use the fetched page anyway, despite the absence of some images or nested frames.
What are the best practices for achieving 1 and 2?
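
For option 1, a minimal sketch of a retry loop might look like the following. It is written for Python 3 and is not part of dryscrape: the helper name `visit_with_retries`, its parameters, and the stand-in `InvalidResponseError` class are assumptions made so the example runs on its own. Retrying can only help when the failure is transient; a resource that is permanently missing will fail on every attempt.

```python
import time


class InvalidResponseError(Exception):
    """Stand-in for webkit_server.InvalidResponseError (assumption: lets the
    sketch run without dryscrape installed)."""


def visit_with_retries(visit, url, attempts=3, delay=1.0,
                       error_type=InvalidResponseError, sleep=time.sleep):
    """Call visit(url) until it succeeds or `attempts` is exhausted.

    Returns the number of the attempt that succeeded. If every attempt
    raises `error_type`, the last exception is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            visit(url)
            return attempt
        except error_type:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(delay)  # wait before trying again
```

With a real session, one would presumably pass `sess.visit` as `visit` and the `InvalidResponseError` imported from `webkit_server` as `error_type`.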

@niklasb
Owner

niklasb commented Apr 3, 2012

In both cases, it doesn't seem like a timeout would help much. As for suggestion 2), I'd have to look into the webkit_server sources to make it more robust. For its usual purpose (web application testing), it is in fact desirable that exceptions are thrown in these cases, so I don't think such functionality is built in. I will look at it as soon as I find the time :)

It would help a lot if you could provide minimal scripts that try to access these pages, so I can reproduce the problem.

@pommygranite
Author

from dryscrape import Session
from dryscrape.driver.webkit import Driver
from webkit_server import InvalidResponseError

link = 'http://www.omlinks.com'
sess = Session(driver=Driver())
sess.set_error_tolerant(True)
try:
    sess.visit(link)
    sess.wait()
    print 'Success'
except InvalidResponseError as e:
    print 'InvalidResponseError:', e


Result:

InvalidResponseError: Error while loading URL http://www.omlinks.com/images/poster/mirror-mirror.jpg: Error downloading http://www.omlinks.com/images/poster/mirror-mirror.jpg - server replied: Not Found (error code 203)

@pommygranite
Author

For error code 5, try the above script with link = 'http://www.nutorrent.com'.

@pommygranite
Author

I should also add, occasionally an InvalidResponseError is thrown by sess.render(), after sess.visit() and sess.wait() have returned successfully.

  File "/home/user1/projects/MyBot/MyScraper.py", line 87, in Scrape
    sess.render(path_screenshot)
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 260, in render
    self.conn.issue_command("Render", path, width, height)
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 429, in issue_command
    return self._read_response()
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 438, in _read_response
    raise InvalidResponseError, self._read_message()
webkit_server.InvalidResponseError

@pommygranite
Author

Some of the exceptions thrown while loading sub-elements of certain sites, even with sess.set_error_tolerant(True):

Connection Refused (error code 1) e.g. 'http://www.dl4all.com'
Connection Closed (error code 2) e.g. 'http://www.youku.com/v'
Socket operation timed out (error code 4) e.g. 'http://www.warezinfinite.net'
Operation Canceled (error code 5) e.g. 'http://www.avaxhome.ws'
Host unreachable (error code 99) e.g. 'http://bitspyder.net'
Forbidden (error code 202) e.g. 'http://www.torrentfreak.com'
Noone Here (error code 203) e.g. 'http://www.mechoddl.com'

@pommygranite
Author

For my purposes, it would be ideal to pass in a list of error codes that should not throw exceptions when loading referenced sub-elements of a page. Specifying error codes that throw when accessing the base page itself is less important, as they generally mean the page isn't available.
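
As a rough illustration of that idea, the error messages quoted in this thread all have the shape "Error while loading URL <url>: <description> (error code N)", so the failing URL and the code can be pulled out of the exception text. The sketch below (Python 3) relies on that observed message format as an assumption; `parse_load_error` and `is_tolerable` are hypothetical helper names, not part of dryscrape or webkit_server. It only classifies an error; it says nothing about whether the session is still usable after the exception was raised.

```python
import re

# Assumption: messages look like the ones quoted in this thread, e.g.
# "Error while loading URL <url>: <description> (error code 203)".
_LOAD_ERROR_RE = re.compile(
    r"^Error while loading URL (?P<url>\S+): .*\(error code (?P<code>\d+)\)\s*$",
    re.DOTALL,
)


def parse_load_error(message):
    """Return (failing_url, error_code), or None if the format is not recognised."""
    match = _LOAD_ERROR_RE.match(message)
    if match is None:
        return None
    return match.group("url"), int(match.group("code"))


def is_tolerable(message, page_url, ignored_codes):
    """True if the error concerns a sub-resource and its code is in ignored_codes.

    Errors on the base page itself are never tolerated, since they usually
    mean the page is not available at all.
    """
    parsed = parse_load_error(message)
    if parsed is None:
        return False  # unknown format: be conservative
    failing_url, code = parsed
    if failing_url.rstrip("/") == page_url.rstrip("/"):
        return False  # the requested page itself failed
    return code in ignored_codes
```

In an `except InvalidResponseError as e:` handler, one could call `is_tolerable(str(e), link, {5, 203})` and re-raise when it returns False.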

@niklasb
Owner

niklasb commented Apr 11, 2012

@pommygranite: While the general possibility of ignoring any error in associated resources seems like a useful addition, what you described in your last comment is very specific. I don't think I will implement it in that form (but I'll accept clean pull requests :)

@tsiomenko

This oversight is actually critical and the main reason why I stopped using dryscrape for my scraping project. A headless browser for scraping must be robust, and I don't understand why there's an "error tolerant" mode that does nothing. It is literally impossible to get past 203 or error code 5 errors.

@niklasb
Owner

niklasb commented Dec 10, 2012

@tsiomenko: This is unfortunate. I completely agree that having set_error_tolerant is unintuitive and bad. It was mainly a workaround for errors I encountered while using this library in my own projects. I should remove that method now that it has become clear that it doesn't help with most of the errors.

The thing is that webkit_server is not my project; it's borrowed from capybara-webkit, where robustness is not really an issue (in fact, for web testing, you actually want to catch those errors). I spent some time trying to figure out how Qt's WebKit implementation handles this kind of thing and was under the impression that the interface doesn't really provide mechanisms for selectively dropping errors. I guess this is due to the unclear semantics in this case: How should a site load if a JavaScript file is missing? What should happen if images or inner frames cannot be loaded? Should this depend on whether the image or frame is necessary for the page to work? A real browser probably implements nasty heuristics to handle these cases. Remember that this is a community project, so if you have an idea on how to improve this, feel free to provide a pull request or outline your idea in a ticket.
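
To make the semantics question concrete, here is one possible (and deliberately crude) policy, sketched in Python 3: treat a failed resource as optional when its URL path ends in a typical image extension, and as required otherwise. The function name `is_optional_resource` and the extension set are assumptions chosen for illustration; nothing in webkit_server is known to work this way, and a URL's extension is only a guess at what the resource really is.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical policy: image-like extensions are considered safe to lose.
OPTIONAL_EXTENSIONS = frozenset({".jpg", ".jpeg", ".png", ".gif", ".ico", ".svg"})


def is_optional_resource(url, optional_extensions=OPTIONAL_EXTENSIONS):
    """Guess from the URL path whether a failed resource can be ignored.

    Query strings and fragments are excluded, and the comparison is
    case-insensitive. URLs without an extension count as required.
    """
    path = urlsplit(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in optional_extensions
```

Scripts, stylesheets and frames fall through to "required" under this policy, which reflects exactly the open question above: whether the page still works without them cannot be decided from the URL alone.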

@Bazza74

Bazza74 commented Jan 7, 2014

So is there no way to properly deal with these kinds of errors now?

@niklasb
Owner

niklasb commented Jan 7, 2014

I remember that I tried several different approaches to get it to "ignore" errors that happen during "certain" requests (for example, images). I didn't have much luck, but maybe the open pull request on webkit-server will have some code improvements that make something like this easier to implement. I still don't know what the correct semantics should be, though.

@Bazza74

Bazza74 commented Jan 7, 2014

Okay, thanks for the reply! Hopefully you or someone else can figure it out. Sadly it's not something I can help with.

@niklasb
Owner

niklasb commented May 10, 2014

@tsiomenko @Bazza74 This should be fixed in version 0.9.

niklasb closed this as completed May 10, 2014