
Spider Transport

The Spider transport is the component of Spider that manages network connections, i.e. it makes it possible to run multiple network requests in parallel.

Multicurl transport

This is the default Spider transport. It operates with multiple pycurl instances. The multicurl Spider transport works only with the pycurl Grab transport.

from grab.spider import Spider, Task
from grab import Grab
import logging

class SimpleSpider(Spider):
    def task_generator(self):
        yield Task('reddit', 'http://reddit.com')

    def task_reddit(self, grab, task):
        url = grab.doc('//p[contains(@class, "title")]/a/@href').text()
        # DO NOT DO THIS:
        # > g = Grab()
        # > g.go(url)
        # Do not use Grab directly inside a handler:
        # that blocks all other parallel network requests.
        # Only use `yield Task(...)`.
        url = grab.make_url_absolute(url)
        yield Task('link', url=url)

    def task_link(self, grab, task):
        print('Title: %s' % grab.doc('//title').text())


logging.basicConfig(level=logging.DEBUG)
bot = SimpleSpider()
bot.run()
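
The handlers above never fetch a page themselves: they yield Task objects and leave the network work to the spider. The toy dispatch loop below is a minimal sketch of that pattern using only the standard library. It is not Grab's implementation; `ToyTask`, `ToySpider`, and the `FAKE_WEB` dictionary (which stands in for the network) are hypothetical names introduced for illustration.

```python
from collections import deque

# Hypothetical stand-in for the network: URL -> links found on that page.
FAKE_WEB = {
    'http://example.com/': ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a': [],
    'http://example.com/b': [],
}


class ToyTask:
    """Hypothetical stand-in for grab.spider.Task."""

    def __init__(self, name, url):
        self.name = name
        self.url = url


class ToySpider:
    def __init__(self):
        self.queue = deque()
        self.visited = []

    def task_generator(self):
        yield ToyTask('index', 'http://example.com/')

    def task_index(self, links, task):
        # The handler only describes follow-up work; it fetches nothing itself.
        for url in links:
            yield ToyTask('link', url)

    def task_link(self, links, task):
        self.visited.append(task.url)

    def run(self):
        self.queue.extend(self.task_generator())
        while self.queue:
            task = self.queue.popleft()
            links = FAKE_WEB[task.url]  # the "transport" performs the fetch
            handler = getattr(self, 'task_' + task.name)
            result = handler(links, task)
            # Generator handlers yield new tasks; plain handlers return None.
            if result is not None:
                self.queue.extend(result)


bot = ToySpider()
bot.run()
print(bot.visited)
```

Because every fetch goes through the one place that owns the connections, that place is free to run many fetches at once. A handler that performed its own request would bypass the loop and hold it up until the request finished.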

Threaded transport

The threaded transport operates with a pool of threads. Network requests are distributed across these threads. You can use either the pycurl or the urllib3 Grab transport with the threaded transport.

Grab can use two libraries to submit network requests: pycurl and urllib3. You can access the transport object through the Grab.transport attribute. In most cases you do not need direct access to the transport object.
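
The idea of spreading requests over a pool of threads can be sketched with the standard library alone. This only illustrates the general technique and makes no claim about how the threaded transport is written internally; `fake_fetch` is a hypothetical placeholder for a real network request, and the pool size of 3 is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def fake_fetch(url):
    # Hypothetical stand-in for a network request: report which thread ran it.
    return url, threading.current_thread().name


urls = ['http://example.com/page/%d' % i for i in range(6)]

# Three worker threads share the six requests between them.
with ThreadPoolExecutor(max_workers=3, thread_name_prefix='worker') as pool:
    results = list(pool.map(fake_fetch, urls))

fetched_urls = [url for url, _ in results]
worker_names = {name for _, name in results}
print(fetched_urls)
print(sorted(worker_names))
```

`pool.map` returns results in the order of the input URLs, regardless of which worker thread handled each one.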

from grab.spider import Spider, Task
from grab import Grab
import logging

class SimpleSpider(Spider):
    def task_generator(self):
        yield Task('reddit', 'http://reddit.com')

    def task_reddit(self, grab, task):
        url = grab.doc('//p[contains(@class, "title")]/a/@href').text()
        # DO NOT DO THIS:
        # > g = Grab()
        # > g.go(url)
        # Do not use Grab directly inside a handler:
        # that blocks all other parallel network requests.
        # Only use `yield Task(...)`.
        url = grab.make_url_absolute(url)
        yield Task('link', url=url)

    def task_link(self, grab, task):
        print('Title: %s' % grab.doc('//title').text())


logging.basicConfig(level=logging.DEBUG)
bot = SimpleSpider(transport='threaded', grab_transport='urllib3')
# You can also use the pycurl Grab transport with the threaded transport:
# bot = SimpleSpider(transport='threaded', grab_transport='pycurl')
bot.run()