This repository has been archived by the owner on Dec 9, 2018. It is now read-only.

Django + Dryscrape #42

Open
arvindr21 opened this issue Nov 11, 2015 · 4 comments

@arvindr21

Hello,

I am using Django + dryscrape to provide web-based, on-demand crawling. The code is below:

from django.shortcuts import render
from django.http import HttpResponse
from django.views.generic import View
from .forms import ScrapeForm

import dryscrape

class Scrape(View):
    def get(self, request):
        form = ScrapeForm()
        return render(request, 'index.html', {'form': form})

    def post(self, request, *args, **kwargs):
        # Start Xvfb so WebKit can render headlessly (needed on servers without a display).
        dryscrape.start_xvfb()
        form = ScrapeForm(request.POST)
        if form.is_valid():

            try:
                sess = dryscrape.Session(base_url=form.cleaned_data['BASE_URL'])
                sess.set_attribute('auto_load_images', True)
                # sess.set_timeout(30)

                sess.visit(form.cleaned_data['BASE_URL'] + form.cleaned_data['URL'])

                # Wait until the element appears (or the wait times out).
                x = sess.wait_for_safe(lambda: sess.at_xpath(form.cleaned_data['XPATH']))
                # x = sess.at_xpath(form.cleaned_data['XPATH'])

                if x:
                    return HttpResponse(x.text())
                else:
                    return HttpResponse('No element found for the given XPath')

            except Exception as e:
                # Log the exception's docstring and message for debugging.
                doc = e.__doc__ or ''
                msg = str(e)  # e.message is Python 2 only; str(e) is portable
                print(doc)
                print(msg)

                if doc:
                    return HttpResponse('Scraping of page failed ::\n' + doc + '\n' + msg)
                return HttpResponse('Scraping of page failed')

        return render(request, 'index.html', {'form': form})

The problem I am facing is that if scraping fails for one URL, the webkit server does not seem to recover unless I kill all the services and restart them.

Is there a simple way to restart the webkit server when it crashes?

The error message:

    Raised when the Webkit server closed the connection unexpectedly.
    Unexpected end of file

After the above error, scraping does not work until I restart the services. The web app itself stays up, though, and returns the error message 'Scraping of page failed'.

Any solution would be highly helpful.

Note: this happens only on an AWS Ubuntu instance; it works fine on my MacBook Pro.

Thanks.

@arvindr21 (Author)

After digging deeper: the issue I am facing is a combination of EndOfStreamError and SocketError when visiting multiple URLs in one session, plus being unable to restart the webkit server once it crashes.

To summarize the issues:

  1. Is there a way I can restart the webkit_server when I land in the exception block (preferably without running the webkit server in a separate thread)? See the sketch after this list.
  2. Can I scrape multiple times using the same session, or do I need to reset it?
  3. Any idea whether AWS EC2 instances would block requests made from dryscrape? (Example URL: http://thejackalofjavascript.com/)
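
For reference, here is a minimal sketch of one possible recovery path for (1) and (2). It assumes the Server and ServerConnection classes from the webkit-server package that dryscrape builds on, plus dryscrape's Session(driver=...) parameter; the exact wiring should be verified against the installed versions, and the base URL below is hypothetical.

import dryscrape
import dryscrape.driver.webkit
import webkit_server

def new_session(base_url):
    # Spawn a dedicated webkit_server process for this session instead of
    # relying on the module-level default server, so that one crash does
    # not poison every later session.
    server = webkit_server.Server()
    connection = webkit_server.ServerConnection(server=server)
    driver = dryscrape.driver.webkit.Driver(connection=connection)
    return dryscrape.Session(driver=driver, base_url=base_url), server

sess, server = new_session('http://example.com/')  # hypothetical base URL
try:
    sess.visit('/page-1')
    sess.reset()   # clear session state before reusing the same session
    sess.visit('/page-2')
except webkit_server.EndOfStreamError:
    server.kill()  # drop the dead server process...
    sess, server = new_session('http://example.com/')  # ...and start fresh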

Thanks.

@MRHarrison

@arvindr21 Did you ever figure this out?

@vdraceil

vdraceil commented Feb 5, 2016

Stuck on the same issue here!
The server crashes once and that's it.

@JermellB

@arvindr21 EC2 won't block the requests. @MRHarrison I ended up using a combination of Selenium and PhantomJS to get around these issues with webkit_server.
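
For illustration, a minimal sketch of that Selenium + PhantomJS approach (not from this thread; it assumes the phantomjs binary is on the PATH and a Selenium release from this era that still ships the PhantomJS driver):

from selenium import webdriver

driver = webdriver.PhantomJS()  # requires the phantomjs binary on the PATH
try:
    driver.get('http://thejackalofjavascript.com/')
    element = driver.find_element_by_xpath('//h1')  # hypothetical XPath
    print(element.text)
finally:
    driver.quit()  # always shut down the PhantomJS process

Unlike a shared webkit_server process, each webdriver.PhantomJS() call owns its own browser process, so a crash can be recovered from by simply constructing a new driver.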
