
Calling self.start as an instance method for a Spider #101

Closed
abmyii opened this issue Feb 22, 2020 · 11 comments
@abmyii (Contributor) commented Feb 22, 2020:

I have the following parent class which has reusable code for all the spiders in my project (this is just a basic example):

class Downloader(Spider):
    concurrency = 15
    worker_numbers = 2

    # RETRY_DELAY (secs) is time between retries
    request_config = {
        "RETRIES": 10,
        "DELAY": 0,
        "RETRY_DELAY": 0.1
    }

    db_name = "DB"
    db_url = "postgresql://..."
    main_table = "test"

    def __init__(self, *args, **kwargs):
        # Initialise DB connection
        self.db = DB(self.db_url, self.db_name, self.main_table)

    def download(self):
        self.start()

        # After completion, commit to DB
        self.db.commit()

I use it by sub-classing it for each different spider. However, it seems that self.start cannot be called as an instance method on a spider (since it's a classmethod) - giving this error:

Traceback (most recent call last):
  File "src/scraper.py", line 107, in <module>
    scraper = Scraper()
  File "src/downloader.py", line 31, in __init__
    super(Downloader, self).__init__(*args, **kwargs)
  File "/usr/lib/python3.8/site-packages/ruia/spider.py", line 159, in __init__
    self.request_session = ClientSession()
  File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 210, in __init__
    loop = get_running_loop(loop)
  File "/usr/lib/python3.8/site-packages/aiohttp/helpers.py", line 269, in get_running_loop
    loop = asyncio.get_event_loop()
  File "/usr/lib/python3.8/asyncio/events.py", line 639, in get_event_loop
    raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'MainThread'.
Exception ignored in: <function ClientSession.__del__ at 0x7f28875e8b80>
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 302, in __del__
    if not self.closed:
  File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 916, in closed
    return self._connector is None or self._connector.closed
AttributeError: 'ClientSession' object has no attribute '_connector'

Any idea how I can solve this issue whilst maintaining the structure I am trying to implement?

@howie6879 (Owner) commented:
Why not build a plugin to solve your problem?

Just like this https://github.com/python-ruia/ruia-motor

@abmyii (Contributor, author) commented Feb 22, 2020:

I need to be able to run a function after the spider has ended - would that be possible with a plugin?

@howie6879 (Owner) commented:
Ruia has a hook function called after_start
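The hook idea can be sketched in plain Python. Note this is an illustrative stub only: the `Spider` class, the `start` signature, and the hook names below merely mirror ruia's `after_start`/`before_stop` concept and are not ruia's actual API.

```python
# Illustrative sketch only: `Spider` is a stub, not ruia.Spider.
class Spider:
    @classmethod
    def start(cls, after_start=None, before_stop=None):
        events = []
        if after_start:
            events.append(after_start(cls))   # fires as crawling begins
        events.append("crawl")                # ... crawling would happen here ...
        if before_stop:
            events.append(before_stop(cls))   # fires after the spider has ended
        return events


def commit_to_db(spider_cls):
    # hypothetical post-run step, e.g. committing scraped rows to a DB
    return f"commit after {spider_cls.__name__}"


print(Spider.start(before_stop=commit_to_db))
# -> ['crawl', 'commit after Spider']
```

Passing the post-run work as a `before_stop`-style callback keeps it out of the spider itself, which is what makes it reusable across spiders.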

@abmyii (Contributor, author) commented Feb 23, 2020:

before_stop is what I was using before, but I would have to set it for every spider. Is there any way to subclass Spider?

@howie6879 (Owner) commented:

You can use @classmethod:

for example:

@classmethod
def download(cls):
    cls.start()

or make the Spider a property of the Downloader:

def __init__(self, *args, **kwargs):
    self.target_spider = Spider
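The second suggestion can be sketched in pure Python. `Spider` here is a stub standing in for `ruia.Spider`, and `download`/`spider_cls` are hypothetical names, not ruia API:

```python
# Sketch of the "spider as a property" idea; Spider is a stub, not ruia.Spider.
class Spider:
    @classmethod
    def start(cls):
        return f"{cls.__name__} finished"


class Downloader:
    def __init__(self, spider_cls=Spider):
        # Hold the spider *class*, never an instance, so the
        # classmethod start() is still called on the class itself.
        self.target_spider = spider_cls

    def download(self):
        result = self.target_spider.start()
        # post-run work (e.g. a DB commit) would go here
        return result


print(Downloader().download())  # -> Spider finished
```

Because the wrapper never instantiates the spider, it sidesteps the constructor-time error from the traceback above.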

@abmyii (Contributor, author) commented Feb 23, 2020:

I tried the classmethod way but got this error:

Traceback (most recent call last):
  File "src/scraper.py", line 106, in <module>
    scraper = Scraper()
  File "src/downloader.py", line 31, in __init__
    super(Downloader, self).__init__(*args, **kwargs)
  File "/usr/lib/python3.8/site-packages/ruia/spider.py", line 159, in __init__
    self.request_session = ClientSession()
  File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 210, in __init__
    loop = get_running_loop(loop)
  File "/usr/lib/python3.8/site-packages/aiohttp/helpers.py", line 269, in get_running_loop
    loop = asyncio.get_event_loop()
  File "/usr/lib/python3.8/asyncio/events.py", line 639, in get_event_loop
    raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'MainThread'.
Exception ignored in: <function ClientSession.__del__ at 0x7fc7ad261b80>
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 302, in __del__
    if not self.closed:
  File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 916, in closed
    return self._connector is None or self._connector.closed
AttributeError: 'ClientSession' object has no attribute '_connector'

I expect I'll have to use the property approach, which isn't too bad either. Thanks for pointing those out!

@howie6879 (Owner) commented:
from ruia import Spider


async def retry_func(request):
    request.request_config["TIMEOUT"] = 10


class RetryDemo(Spider):
    start_urls = ["http://httpbin.org/get"]

    request_config = {"RETRIES": 3, "DELAY": 0, "TIMEOUT": 10, "RETRY_FUNC": retry_func}

    @classmethod
    def downloader(cls):
        cls.start()

    async def parse(self, response):
        pages = ["http://httpbin.org/get?p=1", "http://httpbin.org/get?p=2"]
        async for resp in self.multiple_request(pages):
            yield self.parse_item(response=resp)

    async def parse_item(self, response):
        json_data = await response.json()
        print(json_data)


if __name__ == "__main__":
    RetryDemo.downloader()

@abmyii (Contributor, author) commented Feb 23, 2020:

Awesome! That worked, even with a parent class:

from ruia import Spider
import time


async def retry_func(request):
    request.request_config["TIMEOUT"] = 10


class Downloader(Spider):
    concurrency = 150
    worker_numbers = 8

    # RETRY_DELAY (secs) is time between retries
    request_config = {
        "RETRIES": 10,
        "DELAY": 0,
        "RETRY_DELAY": 0.1
    }

    @classmethod
    def download(cls):
        cls.start()


class RetryDemo(Downloader):
    start_urls = ["http://httpbin.org/get"]

    request_config = {"RETRIES": 3, "DELAY": 0, "TIMEOUT": 10, "RETRY_FUNC": retry_func}

    async def parse(self, response):
        pages = ["http://httpbin.org/get?p=1", "http://httpbin.org/get?p=2"]
        async for resp in self.multiple_request(pages):
            yield self.parse_item(response=resp)

    async def parse_item(self, response):
        json_data = await response.json()
        print(json_data)


if __name__ == "__main__":
    RetryDemo.download()

My mistake was that I was trying to run my spider like this:

spider = Spider()
spider.download()

which isn't possible, since the download function is a classmethod (a constraint inherited from Spider's start function being a classmethod). Once I changed it to:

Spider.download()

It worked perfectly.
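The reason the pattern works through a parent class is that `cls` binds to whichever class the call is made on, so a `download` defined once on the parent still starts the right spider. A self-contained sketch (stub classes, not ruia):

```python
# Why the inherited-classmethod pattern works: `cls` is bound to the
# class the call is made on. Spider/Downloader are stubs, not ruia.
class Spider:
    @classmethod
    def start(cls):
        return f"started {cls.__name__}"


class Downloader(Spider):
    @classmethod
    def download(cls):
        # cls is the concrete subclass, so start() runs for that spider
        return cls.start()


class RetryDemo(Downloader):
    pass


print(RetryDemo.download())  # -> started RetryDemo
```

This is also why `RetryDemo.download()` must be called on the class: no instance ever exists, so no event loop is needed at construction time.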

Thank you very much for your help!

abmyii closed this as completed Feb 23, 2020
@howie6879 (Owner) commented:
Using it this way is, in effect, also a plugin

@abmyii (Contributor, author) commented Feb 24, 2020:

Using it this way is, in effect, also a plugin

I see, I wasn't sure what you meant before! Thanks again!

BTW, is there documentation for writing a plugin (in Chinese or English)? If not, I may write some, if you would like.

@howie6879
Copy link
Owner

Ruia has only short documentation on plugin implementation, so I would welcome a tutorial on plugins from you.

