Skip to content
master
Switch branches/tags
Code

Latest commit

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
wip
Apr 24, 2021
Jul 7, 2020
Oct 10, 2020
Sep 30, 2019
wip
Apr 24, 2021

gazpacho

PyPI PyPI - Python Version Downloads

About

gazpacho is a simple, fast, and modern web scraping library. The library is stable, actively maintained, and installed with zero dependencies.

Install

Install with pip at the command line:

pip install -U gazpacho

Quickstart

Give this a try:

from gazpacho import get, Soup

url = 'https://scrape.world/books'
html = get(url)
soup = Soup(html)
books = soup.find('div', {'class': 'book-'}, partial=True)

def parse(book):
    name = book.find('h4').text
    price = float(book.find('p').text[1:].split(' ')[0])
    return name, price

[parse(book) for book in books]

Tutorial

Import

Import gazpacho following the convention:

from gazpacho import get, Soup

get

Use the get function to download raw HTML:

url = 'https://scrape.world/soup'
html = get(url)
print(html[:50])
# '<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <met'

Adjust get requests with optional params and headers:

get(
    url='https://httpbin.org/anything',
    params={'foo': 'bar', 'bar': 'baz'},
    headers={'User-Agent': 'gazpacho'}
)

Soup

Use the Soup wrapper on raw html to enable parsing:

soup = Soup(html)

Soup objects can alternatively be initialized with the .get classmethod:

soup = Soup.get(url)

.find

Use the .find method to target and extract HTML tags:

h1 = soup.find('h1')
print(h1)
# <h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>

attrs=

Use the attrs argument to isolate tags that contain specific HTML element attributes:

soup.find('div', attrs={'class': 'section-'})

partial=

Element attributes are partially matched by default. Turn this off by setting partial to False:

soup.find('div', {'class': 'soup'}, partial=False)

mode=

Override the mode argument {'auto', 'first', 'all'} to guarantee return behaviour:

print(soup.find('span', mode='first'))
# <span class="navbar-toggler-icon"></span>
len(soup.find('span', mode='all'))
# 8

dir()

Soup objects have html, tag, attrs, and text attributes:

dir(h1)
# ['attrs', 'find', 'get', 'html', 'strip', 'tag', 'text']

Use them accordingly:

print(h1.html)
# '<h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>'
print(h1.tag)
# h1
print(h1.attrs)
# {'id': 'firstHeading', 'class': 'firstHeading', 'lang': 'en'}
print(h1.text)
# Soup

Support

If you use gazpacho, consider adding the scraper: gazpacho badge to your project README.md:

[![scraper: gazpacho](https://img.shields.io/badge/scraper-gazpacho-C6422C)](https://github.com/maxhumber/gazpacho)

Contribute

For feature requests or bug reports, please use Github Issues

For PRs, please read the CONTRIBUTING.md document