Anchor some data on the web and automatically save it periodically.
Currently running tasks:
- Tencent Careers
- Baidu Careers
- ByteDance Careers
- Alibaba Careers
- JD Careers
- Bilibili Careers
- Meituan Careers
- NetEase Careers
- PDD Careers
- 360 Careers
Things are always changing, and I want an easy way to record those changes.
A web crawler is a great tool for getting data from the web efficiently, and GitHub Actions can automate the process.
Solving real problems by combining existing tools is what anchor does.
- GitHub Actions
- Python3
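Since GitHub Actions drives the periodic runs, a scheduled workflow is the natural trigger. Below is a minimal sketch, assuming a daily cron schedule, a `requirements.txt`, and a `main.py` entry point (none of these names are confirmed by the repository); crawled data is committed back to the repo so each run is recorded.

```yaml
name: anchor-crawl

on:
  schedule:
    - cron: "0 2 * * *"   # assumed: run daily at 02:00 UTC
  workflow_dispatch:       # allow manual runs for debugging

jobs:
  crawl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      - run: pip install -r requirements.txt  # assumed dependency file
      - run: python main.py                   # assumed entry point that runs all tasks
      - name: Commit crawled data
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git add -A
          git diff --cached --quiet || git commit -m "chore: update crawled data"
          git push
```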
Inspired by Scrapy.
The process is very simple.
```mermaid
stateDiagram
    [*] --> Requester
    Requester --> Exception
    Exception --> [*]
    Requester --> Processor
    Processor --> Exception
    Processor --> Exporter
    Exporter --> Exception
    Exporter --> [*]
```
- DataItem: a user-defined data model
- Requester: issues a network request
- Responser: stores the information of a response
- Processor: a pure function that converts data from the Requester into a DataItem
- Exporter: handles the DataItem, e.g., saving it to a `.json` file or exporting it to a DB
- Task: a unit of work scheduled by the Anchor Engine
- Anchor Engine: an asynchronous task handler
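To make the roles concrete, here is a minimal sketch of one task as a Requester -> Processor -> Exporter pipeline. Every name below (`JobItem`, `run_task`, the example URL) is hypothetical and not anchor's actual API; the real engine additionally schedules such tasks asynchronously.

```python
import json
from dataclasses import asdict, dataclass

import requests  # third-party HTTP library, assumed for this sketch


@dataclass
class JobItem:
    """A user-defined DataItem: one job posting."""
    title: str
    location: str


def requester(url: str) -> dict:
    """Requester: issue the network request and return the raw response data."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json()


def processor(raw: dict) -> list[JobItem]:
    """Processor: a pure function converting raw response data into DataItems."""
    return [JobItem(title=j["title"], location=j["location"])
            for j in raw.get("jobs", [])]


def exporter(items: list[JobItem], path: str) -> None:
    """Exporter: persist the DataItems, here as a .json file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump([asdict(i) for i in items], f, ensure_ascii=False, indent=2)


def run_task(url: str, path: str) -> None:
    """Task: one Requester -> Processor -> Exporter run.

    Any stage may raise, which corresponds to the Exception state
    in the diagram above and ends the task.
    """
    try:
        exporter(processor(requester(url)), path)
    except Exception as exc:
        print(f"task failed: {exc}")


if __name__ == "__main__":
    # Hypothetical endpoint; a real task would point at a careers API.
    run_task("https://example.com/api/jobs", "jobs.json")
```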
- add jd, bilibili, meituan, netease, pdd, and 360 career tasks
- add alibaba-career-task and byte-dance-career-task
- add retry to GitHub Action
- basic functions completed
- add tencent-career-task and baidu-career-task