[not actively maintained] A lightweight Python library that uses WebKit to enable easy scraping of dynamic, JavaScript-heavy web pages
Latest commit 4d3dabd Aug 22, 2017


NOTE: This package is not actively maintained. It uses QtWebkit, which is end-of-life and probably doesn't get security fixes backported. Consider using a similar package like Spynner instead.

Overview

Author: Niklas Baumstark

dryscrape is a lightweight web scraping library for Python. It uses a headless WebKit instance to evaluate JavaScript on the visited pages. This enables painless scraping of plain web pages as well as JavaScript-heavy “Web 2.0” applications like Facebook.
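As a rough illustration of the workflow described above, the following sketch opens a session, visits a page and collects its links. The `Session`, `set_attribute`, `visit` and `xpath` calls follow the usage shown in dryscrape's bundled example as best recalled here, so check them against the API docs. The helper names `absolute_links` and `scrape_links` are hypothetical and not part of the library.

```python
import sys
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve href values against base_url and drop duplicates, keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


def scrape_links(base_url, path="/"):
    """Visit base_url + path in headless WebKit and return the links found there.

    Hypothetical helper: it is not part of dryscrape itself.
    """
    # Imported lazily so absolute_links() stays usable without dryscrape installed.
    import dryscrape

    if "linux" in sys.platform:
        # Starts Xvfb in case no X server is running; Xvfb must be installed.
        dryscrape.start_xvfb()

    session = dryscrape.Session(base_url=base_url)
    session.set_attribute("auto_load_images", False)  # images are not needed
    session.visit(path)

    # The XPath query runs against the DOM as rendered by WebKit.
    hrefs = [node["href"] for node in session.xpath("//a[@href]")]
    return absolute_links(base_url, hrefs)
```

With dryscrape and its dependencies installed, `scrape_links("http://example.com")` would return the page's links as absolute URLs. The URL handling is kept in a separate pure function so it can be checked without a WebKit instance.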

It builds on capybara-webkit's webkit-server. Many thanks to thoughtbot, inc. for building this excellent piece of software!

Changelog

  • 1.0: Added Python 3 support and small performance fixes; header names are now properly normalized. Also added the function dryscrape.start_xvfb() to easily start Xvfb.
  • 0.9.1: Changed semantics of the headers function in a backwards-incompatible way: It now returns a list of (key, value) pairs instead of a dictionary.
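Code written against the old dictionary-style result of the headers function needs a small adaptation after 0.9.1. The sketch below shows one way to group a list of (key, value) pairs by header name. The helper `headers_to_dict` and the sample data are hypothetical. The sample only mimics the shape described in the changelog and was not captured from a real response.

```python
from collections import defaultdict


def headers_to_dict(pairs):
    """Group (key, value) header pairs by lower-cased name, keeping every value."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key.lower()].append(value)
    return dict(grouped)


# Hypothetical sample in the shape described by the 0.9.1 changelog entry.
sample = [
    ("Content-Type", "text/html"),
    ("Set-Cookie", "a=1"),
    ("Set-Cookie", "b=2"),
]

headers = headers_to_dict(sample)

# A plain dict(sample) would keep only the last value of a repeated header.
last_only = dict(sample)
```

Grouping values into lists preserves repeated headers such as `Set-Cookie`, which a plain `dict(pairs)` conversion would silently collapse to one entry.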

Supported Platforms

The library has been confirmed to work on the following platforms:

  • Mac OS X 10.9 Mavericks and 10.10 Yosemite
  • Ubuntu Linux
  • Arch Linux

Other Unix-like systems should work as well.

Windows is not officially supported, although dryscrape should work under Cygwin.

A word about Qt 5.6

Qt 5.6 removes the Qt WebKit module in favor of the new Qt WebEngine module. So far, webkit-server has not been ported to WebEngine (and likely won't be in the near future), so Qt 5.5 or earlier is required.
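The version constraint can be checked before attempting a build. The sketch below compares a Qt version string against 5.6. The function names are hypothetical, and `installed_qt_version` assumes that a `qmake` binary is on the PATH and answers `qmake -query QT_VERSION`.

```python
import subprocess


def parse_version(text):
    """Turn a version string such as '5.5.1' into a tuple of ints, e.g. (5, 5, 1)."""
    return tuple(int(part) for part in text.strip().split("."))


def qt_version_ok(qt_version):
    """Return True if the given Qt version is older than 5.6."""
    return parse_version(qt_version) < (5, 6)


def installed_qt_version(qmake="qmake"):
    """Ask qmake which Qt version it belongs to (assumes qmake is on PATH)."""
    output = subprocess.check_output([qmake, "-query", "QT_VERSION"])
    return output.decode().strip()
```

On a machine with Qt installed, `qt_version_ok(installed_qt_version())` gives a quick yes/no answer. Version strings with non-numeric suffixes are not handled by this simple parser.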

Installation, Usage, API Docs

Documentation can be found at dryscrape's ReadTheDocs page.

Quick installation instructions for Ubuntu:

# apt-get install qt5-default libqt5webkit5-dev build-essential python-lxml python-pip xvfb
# pip install dryscrape
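After running the commands above, a short check can confirm that the main pieces are in place. This is only a sketch: the function name `check_environment` is hypothetical, and it assumes that the xvfb package provides a binary named `Xvfb` and that the Python modules are importable as `dryscrape` and `lxml`.

```python
import importlib.util
import shutil


def check_environment():
    """Report whether dryscrape, lxml and the Xvfb binary can be found."""
    return {
        # find_spec returns None when a top-level module is not installed.
        "dryscrape": importlib.util.find_spec("dryscrape") is not None,
        "lxml": importlib.util.find_spec("lxml") is not None,
        # shutil.which returns None when the binary is not on PATH.
        "Xvfb": shutil.which("Xvfb") is not None,
    }


report = check_environment()
for name, found in sorted(report.items()):
    print("{}: {}".format(name, "found" if found else "missing"))
```

The check only looks for the modules and the binary; it does not start WebKit or Xvfb, so a passing report does not guarantee a working session.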

Contact, Bugs, Contributions

If you have any problems with this software, don't hesitate to open an issue or a pull request on GitHub, or to send an email to niklas baumstark at Gmail.