AioSpider

This is a little toy I wrote while learning Python's asynchronous programming.
It is modeled on the Node.js package node-spider, and the two share a similar API.

But because of the differences between Python and JavaScript, I couldn't simply translate node-spider's splice-based task removal when I wanted to cap the maximum number of concurrent tasks.

Fortunately I found this crawler-demo. It simply keeps a list of workers whose length is exactly the maximum number of concurrent tasks, like this:

# Spawn exactly max_tasks workers; the pool size is what bounds concurrency.
workers = [asyncio.Task(work(), loop=loop) for _ in range(max_tasks)]
await queue.join()   # block until every queued item has been processed
for w in workers:    # the queue is drained, so stop the now-idle workers
    w.cancel()
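
For context, here is a minimal, self-contained sketch of that worker-pool pattern. The work() body and the URLs are placeholders (not aiospider code), and it uses the modern asyncio.run/create_task API instead of the loop-based calls above:

import asyncio

async def work(queue):
    # Each worker loops forever, pulling URLs until it is cancelled.
    while True:
        url = await queue.get()
        try:
            print("processing", url)  # real fetching/parsing would go here
        finally:
            queue.task_done()  # lets queue.join() finish once all items are done

async def main(max_tasks=3):
    queue = asyncio.Queue()
    for url in ('https://a.example', 'https://b.example', 'https://c.example'):
        queue.put_nowait(url)
    # The size of the worker pool, not the queue, bounds concurrency.
    workers = [asyncio.create_task(work(queue)) for _ in range(max_tasks)]
    await queue.join()  # wait until every queued URL has been processed
    for w in workers:
        w.cancel()  # all work is done; shut the idle workers down
    await asyncio.gather(*workers, return_exceptions=True)  # reap cancellations

asyncio.run(main())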

Install:

    pip install -r requirements.txt
    python3 setup.py install

Example:

from aiospider import Spider

with Spider() as ss:
    async def parse_page(response):
        '''
        Callback: for now, the response is just a plain aiohttp.ClientResponse object.
        '''
        print("request url is %s, response status is %d" % (response.url, response.status))

    ss.start('https://www.python.org/', parse_page)

# Output: request url is https://www.python.org/, response status is 200
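
Since the callback is a coroutine and receives the raw aiohttp.ClientResponse, you should also be able to read the body inside it. This is a sketch under the assumption that the spider awaits the callback before releasing the response:

async def parse_page(response):
    # response.text() reads and decodes the body asynchronously.
    html = await response.text()
    print("fetched %d characters from %s" % (len(html), response.url))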

TODO

  1. Handle exceptions raised by requests and callbacks
  2. Let taskqueue call tasks with multiple parameters
  3. Wrap the request: add proxy support, etc.
  4. Wrap the response
