ivbeg/lazyscraper: Lazy helper tool that makes scraping easier for simple tasks

About Lazyscraper

Lazyscraper is a simple command-line tool and library, a Swiss Army knife for scraper writers. It is designed to be used from the command line and to make writing scrapers easier for very simple tasks, such as extracting external URLs or a simple table.

Supported patterns

simpleul - Extracts a list of URLs matching the pattern ul/li/a. Returns an array of URLs with "_text" and "href" fields.
simpleopt - Extracts a list of option values matching the pattern select/option. Returns an array with "_text" and "value" fields.
exturls - Extracts a list of URLs that lead to external websites. Returns an array of URLs with "_text" and "href" fields.
getforms - Extracts all forms from a web page. Returns complex JSON data describing each form on the page.
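
To make the ul/li/a idea behind "simpleul" concrete, here is a minimal standard-library sketch. It is not lazyscraper's own implementation: the function name simple_ul and the sample markup are hypothetical, and xml.etree.ElementTree only accepts well-formed markup, so real pages would need an HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from any real site).
SAMPLE = """
<div>
  <ul class="departments">
    <li><a href="/en/department/1/">Ministry A</a></li>
    <li><a href="/en/department/2/">Ministry B</a></li>
  </ul>
</div>
"""


def simple_ul(markup):
    """Return one dict per link matched by ul/li/a (hypothetical helper)."""
    root = ET.fromstring(markup.strip())
    return [
        {"_text": (a.text or "").strip(), "href": a.get("href")}
        for a in root.findall(".//ul/li/a")
    ]


links = simple_ul(SAMPLE)
for link in links:
    print(link["_text"], link["href"])
```

The result mirrors the shape described above: a list of records with "_text" and "href" fields.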

Command-line tool

Usage: lazyscraper.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  extract   Extract data with xpath
  gettable  Extracts table with data from html
  use       Uses predefined pattern to extract page data

Examples

Extracts the list of photos and names of Russian government ministers and writes it to "gov_persons.csv":

python lscraper.py extract --url http://government.ru/en/gov/persons/ --xpath "//img[@class='photo']" --fieldnames src,srcset,alt --absolutize True --output gov_persons.csv --format csv

Extracts the list of ministries from the Russian government website using the pattern "simpleul", restricted to the UL tag with class "departments col col__wide", and outputs absolutized URLs:

python lscraper.py use --pattern simpleul --nodeclass "departments col col__wide" --url http://government.ru/en/ministries --absolutize True
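
The --absolutize option presumably resolves relative URLs against the page URL. A rough sketch of that idea, using only the standard library, might look like the following; the helper name absolutize, its fields parameter, and the sample records are assumptions for illustration, not lazyscraper's actual code.

```python
from urllib.parse import urljoin


def absolutize(base_url, records, fields=("href", "src")):
    """Return copies of the records with URL fields resolved against base_url.

    Hypothetical helper: the field names are an assumption.
    """
    result = []
    for record in records:
        fixed = dict(record)  # copy so the input is left untouched
        for field in fields:
            if fixed.get(field):
                fixed[field] = urljoin(base_url, fixed[field])
        result.append(fixed)
    return result


rows = absolutize(
    "http://government.ru/en/ministries",
    [
        {"_text": "Ministry A", "href": "/en/department/1/"},
        {"_text": "External", "href": "https://example.org/page"},
    ],
)
for row in rows:
    print(row["href"])
```

URLs that are already absolute pass through urljoin unchanged, so only relative links are rewritten.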

Extracts the list of territorial organization URLs from the Russian tax service website using the pattern "simpleopt":

python lscraper.py use --pattern simpleopt --url http://nalog.ru

Extracts all forms from the Russian tax service website using the pattern "getforms". Returns JSON with each form and its buttons, inputs and selects:

python lscraper.py use --pattern getforms --url http://nalog.ru

Extracts the list of website URLs from the Russian Federal Treasury website and uses awk to extract the domains:

python lscraper.py extract --url http://roskazna.ru --xpath "//ul[@class='site-list']/li/a" --fieldnames href | awk -F/ '{print $3}'
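
The awk step above splits each URL on "/" and prints the third field, which is the host part. For readers who prefer to stay in Python, a comparable sketch with the standard library is shown below; the function name domain_of and the sample URLs are made up for illustration.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the network location (host, plus port if present) of a URL."""
    return urlsplit(url).netloc


# Hypothetical sample URLs standing in for the extracted href values.
urls = [
    "http://example.org/about/",
    "https://sub.example.com/index.html",
]
domains = [domain_of(u) for u in urls]
for domain in domains:
    print(domain)
```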

How to use library

Extracts all links (a elements) from the gov.uk website with the fields src, alt, href and _text:

>>> from lazyscraper import extract_data_xpath
>>> extract_data_xpath('http://gov.uk', xpath='//a', fieldnames='src,alt,href,_text', absolutize=True)

Runs the pattern 'simpleopt' against the Russian Federal Treasury website:

>>> from lazyscraper import use_pattern
>>> use_pattern('http://roskazna.ru', 'simpleopt')
