This repository has been archived by the owner on Dec 9, 2018. It is now read-only.

Django + Dryscrape #42

Open
arvindr21 opened this issue Nov 11, 2015 · 4 comments

@arvindr21

Hello,

I am using Django + dryscrape to provide web-based, on-demand crawling. The code is below:

from django.shortcuts import render
from django.http import HttpResponse
from django.views.generic import View
from .forms import ScrapeForm

import dryscrape

class Scrape(View):
    def get(self, request):
        form = ScrapeForm()
        return render(request, 'index.html', {'form': form})

    def post(self, request, *args, **kwargs):
        # Start Xvfb so WebKit can render headlessly (needed on servers without a display).
        dryscrape.start_xvfb()
        form = ScrapeForm(request.POST)
        if form.is_valid():

            try:
                sess = dryscrape.Session(base_url=form.cleaned_data['BASE_URL'])
                sess.set_attribute('auto_load_images', True)
                # sess.set_timeout(30)

                sess.visit(form.cleaned_data['BASE_URL'] + form.cleaned_data['URL'])

                # Wait until the element appears (or the wait times out).
                x = sess.wait_for_safe(lambda: sess.at_xpath(form.cleaned_data['XPATH']))
                # x = sess.at_xpath(form.cleaned_data['XPATH'])

                if x:
                    return HttpResponse(x.text())
                else:
                    return HttpResponse('No element found for the given XPath')

            except Exception as e:
                # Log the exception's docstring and message for debugging.
                doc = e.__doc__ or ''
                msg = str(e)  # e.message is Python 2 only; str(e) is portable
                print(doc)
                print(msg)

                if doc:
                    return HttpResponse('Scraping of page failed ::\n' + doc + '\n' + msg)
                return HttpResponse('Scraping of page failed')

        return render(request, 'index.html', {'form': form})

The problem I am facing is that if scraping fails for one URL, the webkit server does not seem to recover unless I kill all the services and restart them.

Is there a simple way to restart the webkit server when it crashes?

The error message:

    Raised when the Webkit server closed the connection unexpectedly.
    Unexpected end of file

After the above error, scraping does not work until I restart the services. The web app itself stays up, though, and returns the error message 'Scraping of page failed'.

Any solution would be highly helpful.

Note: this happens only on an AWS Ubuntu instance; it works fine on my MacBook Pro.

Thanks.

@arvindr21 (Author)

After digging deeper: the issue I am facing is a combination of EndOfStreamError and SocketError when visiting multiple URLs in one session, plus being unable to restart the webkit server once it crashes.

To summarize the issues:

  1. Is there a way I can restart the webkit_server when I land in the exception block (preferably without running the webkit server in a separate thread)? See the sketch after this list.
  2. Can I scrape multiple times using the same session, or do I need to reset it?
  3. Any idea whether AWS EC2 instances would block requests made from dryscrape? (Example URL: http://thejackalofjavascript.com/)
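
For reference, here is a minimal sketch of one possible recovery path for (1) and (2). It assumes the Server and ServerConnection classes from the webkit-server package that dryscrape builds on, plus dryscrape's Session(driver=...) parameter; the exact wiring should be verified against the installed versions, and the base URL below is hypothetical.

import dryscrape
import dryscrape.driver.webkit
import webkit_server

def new_session(base_url):
    # Spawn a dedicated webkit_server process for this session instead of
    # relying on the module-level default server, so that one crash does
    # not poison every later session.
    server = webkit_server.Server()
    connection = webkit_server.ServerConnection(server=server)
    driver = dryscrape.driver.webkit.Driver(connection=connection)
    return dryscrape.Session(driver=driver, base_url=base_url), server

sess, server = new_session('http://example.com/')  # hypothetical base URL
try:
    sess.visit('/page-1')
    sess.reset()   # clear session state before reusing the same session
    sess.visit('/page-2')
except webkit_server.EndOfStreamError:
    server.kill()  # drop the dead server process...
    sess, server = new_session('http://example.com/')  # ...and start fresh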

Thanks.

@MRHarrison

@arvindr21 Did you ever figure this out?

@vdraceil

vdraceil commented Feb 5, 2016

Stuck on the same issue here!
The server crashes once and that's it.

@JermellB

@arvindr21 EC2 won't block the requests. @MRHarrison I ended up using a combination of Selenium and PhantomJS to get around these issues with webkit_server.
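
For illustration, a minimal sketch of that Selenium + PhantomJS approach (not from this thread; it assumes the phantomjs binary is on the PATH and a Selenium release from this era that still ships the PhantomJS driver):

from selenium import webdriver

driver = webdriver.PhantomJS()  # requires the phantomjs binary on the PATH
try:
    driver.get('http://thejackalofjavascript.com/')
    element = driver.find_element_by_xpath('//h1')  # hypothetical XPath
    print(element.text)
finally:
    driver.quit()  # always shut down the PhantomJS process

Unlike a shared webkit_server process, each webdriver.PhantomJS() call owns its own browser process, so a crash can be recovered from by simply constructing a new driver.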
