A simple scraping library.
It lets you write short, readable scrapers, even when the input is messy.
Extract titles and URLs of articles from Hacker News:
from purifier import request, html, xpath, maps, fields, one

scraper = (
    request()
    | html()
    | xpath('//a[@class="titlelink"]')
    | maps(
        fields(
            title=xpath("text()") | one(),
            url=xpath("@href") | one(),
        )
    )
)

result = scraper.scrape("https://news.ycombinator.com")
result == [
    {
        "title": "Why Is the Web So Monotonous? Google",
        "url": "https://reasonablypolymorphic.com/blog/monotonous-web/index.html",
    },
    {
        "title": "Old jokes",
        "url": "https://dynomight.net/old-jokes/",
    },
    ...
]
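
To see how a `|` pipeline like the one above can turn a list of matches into a list of dicts, here is a minimal standard-library sketch. It is not purifier's implementation. The `Step` class and the `pick` helper are hypothetical, and the behavior given to `maps`, `fields`, and `one` is only a reading of the example, not taken from the library's source.

```python
from typing import Any, Callable


class Step:
    """Hypothetical pipeline step wrapping a one-argument function."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step: run `a`, then feed its output to `b`.
        return Step(lambda value: other.func(self.func(value)))

    def __call__(self, value: Any) -> Any:
        return self.func(value)


def maps(step: Step) -> Step:
    """Assumed behavior: apply `step` to every item of a list."""
    return Step(lambda items: [step(item) for item in items])


def fields(**steps: Step) -> Step:
    """Assumed behavior: build a dict by running each named step on the same input."""
    return Step(lambda value: {name: step(value) for name, step in steps.items()})


def one() -> Step:
    """Assumed behavior: unwrap a single-element list, failing loudly otherwise."""

    def unwrap(items: list) -> Any:
        if len(items) != 1:
            raise ValueError(f"expected exactly one item, got {len(items)}")
        return items[0]

    return Step(unwrap)


def pick(name: str) -> Step:
    """Hypothetical stand-in for xpath(): look up `name` in a dict."""
    return Step(lambda row: row[name])


# Stand-in for parsed <a> elements: every lookup yields a list of matches.
rows = [
    {"text": ["Old jokes"], "href": ["https://dynomight.net/old-jokes/"]},
]

scraper = maps(
    fields(
        title=pick("text") | one(),
        url=pick("href") | one(),
    )
)

result = scraper(rows)
print(result)
```

In this sketch, each lookup returns a list of matches, so `one()` is what turns `["Old jokes"]` into `"Old jokes"` and raises an error when a field is missing or duplicated, instead of silently passing messy data through.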
pip install purifier
- Tutorial
- Available scrapers — API reference.