Latest commit 8c2d0de Aug 12, 2018

Grab

Project Status

Project Grab is not abandoned, but it is not being actively developed. At the moment I am working on another crawling framework, which I want to be simple and fast and to not leak memory. The new project is located here: https://github.com/lorien/crawler At first I tried a mix of asyncio (network) and classic threads (parsing HTML with lxml on multiple CPU cores), but then I decided to use classic threads for everything for the sake of simplicity. Network requests are processed with pycurl because it is fast, feature-rich and supports SOCKS5 proxies. You can try the new framework, but be aware that it does not have many features yet. In particular, its options for configuring network requests are very limited. If you need some option, feel free to create a new issue.

What is Grab?

Grab is a Python web scraping framework. It provides a number of helpful methods to perform network requests, scrape web sites and process the scraped content:

  • Automatic cookies (session) support
  • HTTP and SOCKS proxy with/without authorization
  • Keep-Alive support
  • IDN support
  • Tools to work with web forms
  • Easy multipart file uploading
  • Flexible customization of HTTP requests
  • Automatic charset detection
  • Powerful API to extract data from the DOM tree of HTML documents with XPath queries
  • Asynchronous API to make thousands of simultaneous queries. This part of the library is called Spider. See the list of Spider features below.
  • Python 3 ready

Spider is a framework for writing web-site scrapers. Features:

  • Rules and conventions to organize the request/parse logic in separate blocks of code
  • Multiple parallel network requests
  • Automatic processing of network errors (failed tasks go back to the task queue)
  • You can create network requests and parse responses with the Grab API (see above)
  • HTTP proxy support
  • Caching network results in permanent storage
  • Different backends for the task queue (in-memory, Redis, MongoDB)
  • Tools to debug and collect statistics
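
Two of the points above, handlers found by a naming convention and failed tasks going back to the task queue, can be illustrated without any network access. The following is a minimal single-threaded sketch using only the standard library. It is not Grab's implementation: `ToySpider`, `max_tries`, the fake `fetch` method and the URLs are all hypothetical names made up for the illustration.

```python
import queue


class ToySpider:
    """Hypothetical stand-in for a spider; not Grab's implementation."""

    max_tries = 3  # assumed retry limit per task

    def __init__(self):
        self.tasks = queue.Queue()
        self.failed_calls = 0
        self.results = []

    def add_task(self, name, url, tries=0):
        self.tasks.put({"name": name, "url": url, "tries": tries})

    def fetch(self, url):
        # Fake network layer: the first request to a "flaky" URL fails.
        if "flaky" in url and self.failed_calls == 0:
            self.failed_calls += 1
            raise IOError("simulated network error")
        return "<html>%s</html>" % url

    def task_page(self, body, task):
        # Handler found by naming convention: task name "page" -> task_page.
        self.results.append((task["url"], len(body)))

    def run(self):
        while not self.tasks.empty():
            task = self.tasks.get()
            try:
                body = self.fetch(task["url"])
            except IOError:
                # A failed task goes back to the queue until it runs out of tries.
                if task["tries"] + 1 < self.max_tries:
                    self.add_task(task["name"], task["url"], task["tries"] + 1)
                continue
            handler = getattr(self, "task_" + task["name"])
            handler(body, task)


bot = ToySpider()
bot.add_task("page", "http://example.com/flaky")
bot.add_task("page", "http://example.com/ok")
bot.run()
print(bot.results)
```

The flaky task fails once, is queued again behind the other task, and succeeds on its second attempt, so it ends up last in `bot.results`.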

Grab Example

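import logging

from grab import Grab

# Show debug log output while the script runs
logging.basicConfig(level=logging.DEBUG)

g = Grab()

# Open the login page, fill in the form and submit it
g.go('https://github.com/login')
g.doc.set_input('login', '****')
g.doc.set_input('password', '****')
g.doc.submit()

# Save the response body for inspection
g.doc.save('/tmp/x.html')

# Fail early if the sign-out button is missing (the login did not succeed)
g.doc('//ul[@id="user-links"]//button[contains(@class, "signout")]').assert_exists()

# Find the profile URL and build the URL of the repository list
home_url = g.doc('//a[contains(@class, "header-nav-link name")]/@href').text()
repo_url = home_url + '?tab=repositories'

g.go(repo_url)

# Print the name and the absolute URL of each repository
for elem in g.doc.select('//h3[@class="repo-list-name"]/a'):
    print('%s: %s' % (elem.text(),
                      g.make_url_absolute(elem.attr('href'))))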
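
The selection loop at the end of the example can be tried offline. The following is a minimal sketch, not Grab code: it uses only the standard library, where `xml.etree.ElementTree` supports a limited XPath subset and `urllib.parse.urljoin` stands in for `make_url_absolute`. The markup, the base URL and the `list_repos` helper are made up for illustration.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a fetched page.
PAGE = """
<div>
  <h3 class="repo-list-name"><a href="/lorien/grab">grab</a></h3>
  <h3 class="repo-list-name"><a href="/lorien/crawler">crawler</a></h3>
  <h3 class="other"><a href="/ignored">ignored</a></h3>
</div>
"""
BASE_URL = "https://github.com/lorien"  # assumed address of the page


def list_repos(markup, base_url):
    """Return (link text, absolute URL) pairs for the matching links."""
    root = ET.fromstring(markup.strip())
    result = []
    # Same idea as the XPath query in the example above.
    for link in root.findall(".//h3[@class='repo-list-name']/a"):
        result.append((link.text, urljoin(base_url, link.get("href"))))
    return result


repos = list_repos(PAGE, BASE_URL)
for name, url in repos:
    print("%s: %s" % (name, url))
```

Real pages are rarely well-formed XML, which is why an HTML-aware parser is needed in practice; the sketch only shows the select-then-resolve pattern.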

Grab::Spider Example

import logging

from grab.spider import Spider, Task

# Show debug log output while the spider runs
logging.basicConfig(level=logging.DEBUG)


class ExampleSpider(Spider):
    def task_generator(self):
        # Produce one "search" task per language
        for lang in 'python', 'ruby', 'perl':
            url = 'https://www.google.com/search?q=%s' % lang
            yield Task('search', url=url, lang=lang)

    def task_search(self, grab, task):
        # Handles the response of each task named "search"
        print('%s: %s' % (task.lang,
                          grab.doc('//div[@class="s"]//cite').text()))


# Process the tasks with two parallel threads
bot = ExampleSpider(thread_number=2)
bot.run()

Installation

$ pip install -U grab

For details about installing Grab on different platforms, see http://docs.grablib.org/en/latest/usage/installation.html

Documentation and Help

Documentation: http://docs.grablib.org/en/latest/

Mailing list (mostly Russian): http://groups.google.com/group/python-grab/

Contribution

To report a bug, please use the GitHub issue tracker: https://github.com/lorien/grab/issues

If you want to develop a new feature in Grab, please use the issue tracker to describe what you want to do, or contact me at lorien@lorien.name