Purifier

A simple scraping library.

It lets you write simple, concise scrapers, even when the input is quite messy.

Example usage

Extract titles and URLs of articles from Hacker News:

from purifier import request, html, xpath, maps, fields, one

scraper = (
    request()
    | html()
    | xpath('//a[@class="titlelink"]')
    | maps(
        fields(
            title=xpath("text()") | one(),
            url=xpath("@href") | one(),
        )
    )
)

result = scraper.scrape("https://news.ycombinator.com")

result == [
    {
        "title": "Why Is the Web So Monotonous? Google",
        "url": "https://reasonablypolymorphic.com/blog/monotonous-web/index.html",
    },
    {
        "title": "Old jokes",
        "url": "https://dynomight.net/old-jokes/",
    },
    ...
]
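
To make the `|` chaining in the example above easier to picture, here is a minimal, standard-library-only sketch of how pipe-style composition can be built in Python. It is not Purifier's implementation: the `Step` class and the toy `maps`, `fields`, and `one` helpers below are hypothetical stand-ins that only borrow the names from the example, and the behaviour of `one` (unwrapping a single-element list) is an assumption.

```python
class Step:
    """Hypothetical stand-in for a scraper step; not Purifier's actual class."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that feeds a's output into b.
        return Step(lambda value: other.func(self.func(value)))

    def scrape(self, value):
        return self.func(value)


def maps(step):
    # Apply a step to every item of a list.
    return Step(lambda items: [step.scrape(item) for item in items])


def fields(**steps):
    # Build a dict by running each named step on the same input.
    return Step(lambda value: {name: s.scrape(value) for name, s in steps.items()})


def one():
    # Assumed behaviour: unwrap a single-element list, fail loudly otherwise.
    def unwrap(items):
        if len(items) != 1:
            raise ValueError(f"expected exactly one item, got {len(items)}")
        return items[0]

    return Step(unwrap)


# Toy steps operating on plain strings instead of HTML.
split_lines = Step(str.splitlines)
words = Step(str.split)
first_only = Step(lambda ws: ws[:1])

pipeline = split_lines | maps(
    fields(
        first=words | first_only | one(),
        count=words | Step(len),
    )
)

result = pipeline.scrape("old jokes\nmonotonous web today")
print(result)
```

Each step stays a small, independently testable function, and the pipeline reads top to bottom in the order the data flows, which is the same shape as the Hacker News scraper above.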

Installation

pip install purifier

Docs

Tutorial
Available scrapers — API reference.
