Cyborg

Cyborg is an asyncio-based web scraping framework for Python 3. It helps you write programs that extract information from websites by reading and inspecting their HTML.

What?

Scraping websites for data can get fairly complex when you have to deal with data spread across multiple pages, request limits and error handling. Cyborg aims to handle all of this for you transparently, so that you can focus on extracting the data rather than on everything around it. It does this by helping you break the process down into smaller chunks, which can be combined into a Pipeline. For example, below is a Pipeline that scrapes takeaway reviews from Just-Eat (the complete example can be found in examples/just-eat):

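import string
import sys

# Job, unique, scrape_places and scrape_reviews come from Cyborg and the
# complete example in examples/just-eat.
with open("output.json", "w") as output_fd:
    # Chain the stages together with |
    pipeline = Job("ReviewScraper") | scrape_places | unique("id") | scrape_reviews.parallel(5)
    pipeline < string.ascii_lowercase  # feed the letters a-z in as input
    pipeline > output_fd  # send the results to the output file

    pipeline.monitor() > sys.stdout  # send monitoring output to stdout

    pipeline.run_until_complete()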
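
The `|`, `<` and `>` operators in the snippet above are ordinary Python operator overloads. As a rough illustration of how this kind of chaining can be built, here is a minimal synchronous sketch. It is not Cyborg's implementation: `ToyPipeline` and its `run` method are hypothetical names made up for this example.

```python
import io


class ToyPipeline:
    """Hypothetical stand-in for illustration only, not Cyborg's Pipeline."""

    def __init__(self, name):
        self.name = name
        self.stages = []
        self.inputs = []
        self.output = None

    def __or__(self, stage):
        # `pipeline | stage` appends a stage and returns the pipeline,
        # so several stages can be chained in one expression.
        self.stages.append(stage)
        return self

    def __lt__(self, items):
        # `pipeline < items` registers every element of `items` as input.
        self.inputs.extend(items)
        return self

    def __gt__(self, fd):
        # `pipeline > fd` registers a file-like object for the results.
        self.output = fd
        return self

    def run(self):
        # Pass each input through every stage in order, then write it out.
        for item in self.inputs:
            for stage in self.stages:
                item = stage(item)
            self.output.write(f"{item}\n")


buffer = io.StringIO()
toy = ToyPipeline("demo") | str.upper | (lambda s: s * 2)
toy < "abc"   # iterating a string yields its characters: "a", "b", "c"
toy > buffer
toy.run()
result = buffer.getvalue()
```

Each operator method returns the pipeline itself, which is what lets `Job(...) | a | b` read as a single left-to-right chain.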

The pipeline has several stages:

scrape_places - Scrapes the list of takeaways for a particular area. The area is looked up by the first letter of the postcode, so we brute-force it by feeding in the letters a-z (pipeline < string.ascii_lowercase).
unique('id') - Takeaways may serve more than one area, so this filters out duplicate takeaways based on their ID.
scrape_reviews.parallel(5) - Starts 5 parallel tasks that scrape the reviews for each takeaway.
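
To make the last two stages more concrete, the sketch below shows the same two ideas in plain asyncio: dropping duplicates by a key, and processing items with a fixed number of concurrent tasks. It only illustrates the concepts and does not use Cyborg's API. `unique_by`, `run_parallel` and `fake_scrape_reviews` are hypothetical helpers, and the "scrape" just sleeps instead of making a request.

```python
import asyncio


def unique_by(key):
    """Hypothetical helper: build a filter that accepts each key value once."""
    seen = set()

    def accept(item):
        value = item[key]
        if value in seen:
            return False
        seen.add(value)
        return True

    return accept


async def fake_scrape_reviews(place):
    # Placeholder for a real HTTP request; sleeps instead of fetching anything.
    await asyncio.sleep(0.01)
    return {"id": place["id"], "reviews": []}


async def run_parallel(items, worker_fn, workers):
    """Hypothetical helper: process items with `workers` concurrent tasks."""
    queue = asyncio.Queue()
    for item in items:
        queue.put_nowait(item)
    results = []

    async def worker():
        # Each worker pulls items until the queue is empty, then exits.
        while True:
            try:
                item = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            results.append(await worker_fn(item))

    await asyncio.gather(*(worker() for _ in range(workers)))
    return results


# The same takeaway (id 1) appears twice, as if it served two areas.
places = [{"id": 1}, {"id": 2}, {"id": 1}, {"id": 3}]
is_new = unique_by("id")
deduplicated = [place for place in places if is_new(place)]
reviews = asyncio.run(run_parallel(deduplicated, fake_scrape_reviews, workers=5))
```

Capping the number of workers is one simple way to stay within a site's request limits, since at most that many requests are in flight at once. Results arrive in completion order, not input order.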
