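This project uses Celery to run a distributed crawler and Redis to deduplicate results. BOSS直聘 (BOSS Zhipin) blocks crawlers by banning IPs, which reduces crawl efficiency, so it is best to run a separate service that supplies a large pool of proxy IPs; many such projects are available on GitHub. Each crawl run yields roughly 300,000 to 400,000 valid JDs (job descriptions). Even after repeated crawling and deduplication, the overall dataset is not very large, and running the crawl from entry files organized along different dimensions brought little improvement.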
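The two ideas above, deduplication through a Redis set and rotation through a proxy pool, can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual code: `InMemorySet`, `is_new_job`, `ProxyPool`, and the key name `seen_jobs` are hypothetical names. `InMemorySet` only imitates the return value of Redis `SADD` (1 if the member was added, 0 if it already existed) so the sketch runs without a server; in a real deployment a `redis.Redis` client could presumably be passed in its place.

```python
import hashlib


class InMemorySet:
    """Hypothetical stand-in for a Redis client; mimics only SADD's return value."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0  # already present, nothing added
        members.add(member)
        return 1  # newly added


def is_new_job(client, job_url, key="seen_jobs"):
    """Return True the first time a job URL is seen, False on repeats.

    The URL is hashed so that every stored member has a fixed size.
    """
    fingerprint = hashlib.sha1(job_url.encode("utf-8")).hexdigest()
    return client.sadd(key, fingerprint) == 1


class ProxyPool:
    """Hypothetical round-robin pool that drops proxies reported as banned."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._index = 0

    def get(self):
        if not self._proxies:
            raise RuntimeError("no usable proxies left")
        proxy = self._proxies[self._index % len(self._proxies)]
        self._index += 1
        return proxy

    def ban(self, proxy):
        # Remove a proxy once the target site has started rejecting it.
        if proxy in self._proxies:
            self._proxies.remove(proxy)


if __name__ == "__main__":
    client = InMemorySet()
    for url in ["https://example.com/job/1", "https://example.com/job/1"]:
        print(url, "new" if is_new_job(client, url) else "duplicate")
```

A crawl task would call `is_new_job` before storing a JD and call `ProxyPool.ban` whenever a request through a given proxy is rejected, so that later requests skip that address.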
hjlarry/bosszhipin
About
A web crawler for BOSS直聘 (BOSS Zhipin)