Best practice for handling InvalidResponseError exceptions? #6
In both cases, it doesn't seem like a timeout would help much. As for suggestion 2), I'd have to look into the webkit_server sources to make it more robust. For the purpose it's usually used for (web application testing), it's in fact desirable that exceptions are thrown in these cases, so I don't think such functionality is built in. I'll look at it as soon as I find the time :) It would help a lot if you could provide minimal scripts that try to access these pages, so I can reproduce the problem.
```python
from dryscrape import Session

sess = Session()
link = 'http://www.omlinks.com'
sess.visit(link)
```

Result:

```
InvalidResponseError: Error while loading URL http://www.omlinks.com/images/poster/mirror-mirror.jpg: Error downloading http://www.omlinks.com/images/poster/mirror-mirror.jpg - server replied: Not Found (error code 203)
```
And for error code 5, try the above script with link = 'http://www.nutorrent.com'.
I should also add that occasionally an InvalidResponseError is thrown by sess.render(), after sess.visit() and sess.wait() have returned successfully:

```
File "/home/user1/projects/MyBot/MyScraper.py", line 87, in Scrape
```
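One defensive pattern for the render() case described above is to wrap the call and fall back to a default on failure. A minimal sketch, not part of dryscrape's API; the InvalidResponseError class here is a local stand-in so the snippet is self-contained (in real code you would catch the one raised by webkit_server):

```python
class InvalidResponseError(Exception):
    """Stand-in for the exception raised by dryscrape/webkit_server."""

def safe_render(render, fallback=None):
    """Call a zero-argument render callable, swallowing InvalidResponseError.

    Returns the rendered result, or `fallback` if the call raised.
    """
    try:
        return render()
    except InvalidResponseError:
        return fallback
```

A caller would pass the real call wrapped in a lambda, e.g. `safe_render(lambda: sess.render('page.png'))` (the render argument here is an assumption), so a late failure yields the fallback instead of aborting the whole scrape.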
A list of some exceptions thrown when loading sub-elements of some sites, even with sess.set_error_tolerant(True): Connection Refused (error code 1), e.g. 'http://www/dl4all.com'.
For my purposes, it would be ideal to pass in a list of error codes that should not throw exceptions when loading referenced sub-elements of a page. Specifying error codes that throw when accessing the base page itself is less important, as they generally mean the page isn't available.
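Since dryscrape's InvalidResponseError only carries the server's message string, a whitelist like the one requested above could be approximated in user code by parsing the numeric code back out of the message. A sketch under that assumption; the helper names, the regular expression, and the example whitelist are all hypothetical, not part of dryscrape's API:

```python
import re

# Hypothetical whitelist: codes seen in this thread (Operation canceled,
# Not Found) that a scraper might choose to tolerate.
IGNORABLE_CODES = {5, 203}

def error_code(message):
    """Extract the numeric code from an '... (error code N)' message."""
    match = re.search(r'error code (\d+)', message)
    return int(match.group(1)) if match else None

def should_ignore(message):
    """True if the message carries one of the whitelisted error codes."""
    return error_code(message) in IGNORABLE_CODES
```

A caller could then wrap sess.visit() in try/except and re-raise only when should_ignore() returns False for the exception's message.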
@pommygranite: While the general possibility of ignoring any error in associated resources seems like a useful addition, what you described in your last comment is very specific. I don't think I will implement it in that form, but I'll accept clean pull requests :)
This oversight is actually critical and the main reason why I stopped using dryscrape for my scraping project. A headless browser for scraping must be robust; I don't understand why there's an "error tolerant" mode that does nothing. It is literally impossible to get past error code 203 or error code 5.
@tsiomenko: This is unfortunate. I completely agree that an "error tolerant" mode that does nothing is confusing. The thing is that webkit_server is not my project; it's borrowed from capybara-webkit, where robustness is not really an issue (in fact, for web testing, you actually want to catch those errors). I spent some time trying to figure out how Qt's WebKit implementation handles this kind of thing and was under the impression that the interface doesn't really provide mechanisms for selectively dropping errors. I guess this is due to the unclear semantics in this case: how should a site load if a JavaScript file is missing? What should happen if images or inner frames cannot be loaded? Should this perhaps depend on whether the image/frame is necessary for the page to work? A real browser probably implements nasty heuristics for these cases. Remember that this is a community project, so if you have an idea on how to improve this, feel free to provide a pull request or outline your idea in a ticket.
So is there no way to properly deal with these kinds of errors right now?
I remember trying several different approaches to get it to ignore errors that happen during certain requests (for example, images). I didn't have much luck, but maybe the open pull request on webkit-server will bring code improvements that make something like this easier to implement. I still don't know what the correct semantics should be, though.
Okay, thanks for the reply! Hopefully you or someone else can figure it out. Sadly it's not something I can help with.
@tsiomenko @Bazza74 This should be fixed in version 0.9. |
Sometimes sess.visit() throws an exception similar to this:
Error while loading URL https://apis.google.com/_/apps-static/_/js/gapi/googleapis_client,iframes_styles_bubble_internal/rt=j/ver=8ruqBK5Rz68.en_GB./sv=1/am=!Ze6NnRS0VYCICGRMrA/d=1/rs=AItRSTMCkBLPuEGW-K5opwJvfmORrpspJQ: Operation canceled (error code 5)
And this:
Error while loading URL http://www.asite.com/images/poster/mirror-mirror.jpg: Error downloading http://www.asite.com/images/poster/mirror-mirror.jpg - server replied: Not Found (error code 203)
It looks like one or more elements on the page I requested have failed to load. The exceptions are thrown before sess.wait() can be called, and are thrown despite having set sess.set_error_tolerant(True).
I might like to 1) retry the page with a specified timeout, or 2) use the fetched page anyway, despite the absence of some images or nested frames.
What are the best practices for achieving 1 and 2?
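For 1), since dryscrape offers no built-in retry, the usual workaround is a small retry loop around visit(). A sketch of that pattern; the exception class is a local stand-in so the snippet runs standalone (in real code, catch the InvalidResponseError from webkit_server and pass sess.visit as the visit callable):

```python
import time

class InvalidResponseError(Exception):
    """Stand-in for the exception raised by dryscrape/webkit_server."""

def visit_with_retry(visit, url, attempts=3, delay=1.0):
    """Call visit(url) up to `attempts` times, sleeping `delay` seconds
    after a failure. Returns True on success, False if every attempt
    raised InvalidResponseError."""
    for attempt in range(attempts):
        try:
            visit(url)
            return True
        except InvalidResponseError:
            if attempt + 1 < attempts:
                time.sleep(delay)
    return False
```

This only addresses transient failures (e.g. "Operation canceled"); a hard 404 on a sub-resource will fail on every attempt, which is the scenario question 2) is about.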