
Calling self.start as an instance method for a Spider #101

Closed
abmyii opened this issue Feb 22, 2020 · 11 comments
@abmyii (Contributor) commented Feb 22, 2020:

I have the following parent class which has reusable code for all the spiders in my project (this is just a basic example):

class Downloader(Spider):
    concurrency = 15
    worker_numbers = 2

    # RETRY_DELAY (secs) is time between retries
    request_config = {
        "RETRIES": 10,
        "DELAY": 0,
        "RETRY_DELAY": 0.1
    }

    db_name = "DB"
    db_url = "postgresql://..."
    main_table = "test"

    def __init__(self, *args, **kwargs):
        # Initialise DB connection
        self.db = DB(self.db_url, self.db_name, self.main_table)

    def download(self):
        self.start()

        # After completion, commit to DB
        self.db.commit()

I use it by sub-classing it for each different spider. However, it seems that self.start cannot be called as an instance method on a spider (since it's a classmethod) - giving this error:

Traceback (most recent call last):
  File "src/scraper.py", line 107, in <module>
    scraper = Scraper()
  File "src/downloader.py", line 31, in __init__
    super(Downloader, self).__init__(*args, **kwargs)
  File "/usr/lib/python3.8/site-packages/ruia/spider.py", line 159, in __init__
    self.request_session = ClientSession()
  File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 210, in __init__
    loop = get_running_loop(loop)
  File "/usr/lib/python3.8/site-packages/aiohttp/helpers.py", line 269, in get_running_loop
    loop = asyncio.get_event_loop()
  File "/usr/lib/python3.8/asyncio/events.py", line 639, in get_event_loop
    raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'MainThread'.
Exception ignored in: <function ClientSession.__del__ at 0x7f28875e8b80>
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 302, in __del__
    if not self.closed:
  File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 916, in closed
    return self._connector is None or self._connector.closed
AttributeError: 'ClientSession' object has no attribute '_connector'

Any idea how I can solve this issue whilst maintaining the structure I am trying to implement?

@howie6879 (Owner) commented:
Why not build a plugin to solve your problem?

Just like this https://github.com/python-ruia/ruia-motor

@abmyii (Contributor, author) commented Feb 22, 2020:

I need to be able to run a function after the spider has ended - would that be possible with a plugin?

@howie6879 (Owner) commented:
Ruia has a hook function called after_start
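The hook idea can be sketched in plain Python. Note this is an illustrative stub only: the `Spider` class, the `start` signature, and the hook names below merely mirror ruia's `after_start`/`before_stop` concept and are not ruia's actual API.

```python
# Illustrative sketch only: `Spider` is a stub, not ruia.Spider.
class Spider:
    @classmethod
    def start(cls, after_start=None, before_stop=None):
        events = []
        if after_start:
            events.append(after_start(cls))   # fires as crawling begins
        events.append("crawl")                # ... crawling would happen here ...
        if before_stop:
            events.append(before_stop(cls))   # fires after the spider has ended
        return events


def commit_to_db(spider_cls):
    # hypothetical post-run step, e.g. committing scraped rows to a DB
    return f"commit after {spider_cls.__name__}"


print(Spider.start(before_stop=commit_to_db))
# -> ['crawl', 'commit after Spider']
```

Passing the post-run work as a `before_stop`-style callback keeps it out of the spider itself, which is what makes it reusable across spiders.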

@abmyii (Contributor, author) commented Feb 23, 2020:

before_stop is what I was using before, but I would have to set it for every spider. Is there any way to subclass Spider?

@howie6879 (Owner) commented:

You can use @classmethod:

for example:

@classmethod
def download(cls):
    cls.start()

or make the Spider a property of the Downloader:

def __init__(self, *args, **kwargs):
    self.target_spider = Spider
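The second suggestion can be sketched in pure Python. `Spider` here is a stub standing in for `ruia.Spider`, and `download`/`spider_cls` are hypothetical names, not ruia API:

```python
# Sketch of the "spider as a property" idea; Spider is a stub, not ruia.Spider.
class Spider:
    @classmethod
    def start(cls):
        return f"{cls.__name__} finished"


class Downloader:
    def __init__(self, spider_cls=Spider):
        # Hold the spider *class*, never an instance, so the
        # classmethod start() is still called on the class itself.
        self.target_spider = spider_cls

    def download(self):
        result = self.target_spider.start()
        # post-run work (e.g. a DB commit) would go here
        return result


print(Downloader().download())  # -> Spider finished
```

Because the wrapper never instantiates the spider, it sidesteps the constructor-time error from the traceback above.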

@abmyii (Contributor, author) commented Feb 23, 2020:

I tried the classmethod way but got this error:

Traceback (most recent call last):
  File "src/scraper.py", line 106, in <module>
    scraper = Scraper()
  File "src/downloader.py", line 31, in __init__
    super(Downloader, self).__init__(*args, **kwargs)
  File "/usr/lib/python3.8/site-packages/ruia/spider.py", line 159, in __init__
    self.request_session = ClientSession()
  File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 210, in __init__
    loop = get_running_loop(loop)
  File "/usr/lib/python3.8/site-packages/aiohttp/helpers.py", line 269, in get_running_loop
    loop = asyncio.get_event_loop()
  File "/usr/lib/python3.8/asyncio/events.py", line 639, in get_event_loop
    raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'MainThread'.
Exception ignored in: <function ClientSession.__del__ at 0x7fc7ad261b80>
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 302, in __del__
    if not self.closed:
  File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 916, in closed
    return self._connector is None or self._connector.closed
AttributeError: 'ClientSession' object has no attribute '_connector'

I expect I'll have to use the property approach, which isn't too bad either. Thanks for pointing those out!

@howie6879 (Owner) commented:
from ruia import Spider


async def retry_func(request):
    request.request_config["TIMEOUT"] = 10


class RetryDemo(Spider):
    start_urls = ["http://httpbin.org/get"]

    request_config = {"RETRIES": 3, "DELAY": 0, "TIMEOUT": 10, "RETRY_FUNC": retry_func}

    @classmethod
    def downloader(cls):
        cls.start()

    async def parse(self, response):
        pages = ["http://httpbin.org/get?p=1", "http://httpbin.org/get?p=2"]
        async for resp in self.multiple_request(pages):
            yield self.parse_item(response=resp)

    async def parse_item(self, response):
        json_data = await response.json()
        print(json_data)


if __name__ == "__main__":
    RetryDemo.downloader()

@abmyii (Contributor, author) commented Feb 23, 2020:

Awesome! That worked, even with a parent class:

from ruia import Spider
import time


async def retry_func(request):
    request.request_config["TIMEOUT"] = 10


class Downloader(Spider):
    concurrency = 150
    worker_numbers = 8

    # RETRY_DELAY (secs) is time between retries
    request_config = {
        "RETRIES": 10,
        "DELAY": 0,
        "RETRY_DELAY": 0.1
    }

    @classmethod
    def download(cls):
        cls.start()


class RetryDemo(Downloader):
    start_urls = ["http://httpbin.org/get"]

    request_config = {"RETRIES": 3, "DELAY": 0, "TIMEOUT": 10, "RETRY_FUNC": retry_func}

    async def parse(self, response):
        pages = ["http://httpbin.org/get?p=1", "http://httpbin.org/get?p=2"]
        async for resp in self.multiple_request(pages):
            yield self.parse_item(response=resp)

    async def parse_item(self, response):
        json_data = await response.json()
        print(json_data)


if __name__ == "__main__":
    RetryDemo.download()

My mistake was that I was trying to run my spider like this:

spider = Spider()
spider.download()

which isn't possible, since the download function is a classmethod (a constraint inherited from Spider's start function being a classmethod). Once I changed it to:

Spider.download()

It worked perfectly.
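The reason the pattern works through a parent class is that `cls` binds to whichever class the call is made on, so a `download` defined once on the parent still starts the right spider. A self-contained sketch (stub classes, not ruia):

```python
# Why the inherited-classmethod pattern works: `cls` is bound to the
# class the call is made on. Spider/Downloader are stubs, not ruia.
class Spider:
    @classmethod
    def start(cls):
        return f"started {cls.__name__}"


class Downloader(Spider):
    @classmethod
    def download(cls):
        # cls is the concrete subclass, so start() runs for that spider
        return cls.start()


class RetryDemo(Downloader):
    pass


print(RetryDemo.download())  # -> started RetryDemo
```

This is also why `RetryDemo.download()` must be called on the class: no instance ever exists, so no event loop is needed at construction time.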

Thank you very much for your help!

abmyii closed this as completed Feb 23, 2020
@howie6879 (Owner) commented:
Using it this way is, in effect, also a plugin

@abmyii (Contributor, author) commented Feb 24, 2020:

Using it this way is, in effect, also a plugin

I see, I wasn't sure what you meant before! Thanks again!

BTW, is there documentation for writing a plugin (in Chinese or English)? If not, I may write some, if you would like.

@howie6879
Copy link
Owner

Ruia has only short documentation on plugin implementation, so I would welcome a tutorial on plugins from you.

