Python wrapper around phantomjs headless borwser
Python JavaScript
Latest commit 36fe9cf Sep 23, 2013 @jakub-m inc version
Permalink
Failed to load latest commit information.
phantomcurl inc version Sep 23, 2013
.gitignore added project files Aug 8, 2013
LICENSE Initial commit Aug 8, 2013
Makefile make install Aug 8, 2013
README.md minor Aug 10, 2013
setup.py minor Aug 8, 2013

README.md

PhantomCurl

Python wrapper around PhantomJS headless browser for advanced web scraping. Apart of the page content, it can record all requests and responses from the web page, or collect content of all the IFrames.

If used as a command line tool, it returns data in JSON format.

Installation

PhantomCurl is a wrapper around PhantomJS, so first you should install PhantomJS from the project's page

PhantomJS should be visible system wise:

which phantomjs

If the binary is not visible system-wide, you should set the environment variable PHANTOMJS_BIN to point to the PhantomJS binary.

Now, build and install the python egg:

make && make install

Command line tool

You can use the script as a command line tool with:

python -mphantomcurl --help

Returns data in JSON format

Returned values

fetch() returns dictionary with the following fields:

url             - URL fed to the fetch function
requests        - all requests captured
responses       - all responses captured
content         - content of the web page
timestamps      - [start, end], seconds
version         - version of the JS script
command_line    - command line arguments passed to the JS 
frames          - IFrames found on the page. `frames` can contain other frames recursively

IFrames inspection

The script allows deep iframes inspection (-f option). For each iframe it reports src, id and its content. Then for each frame it check if it contains more iframes and reports them, recursively.