initial commit

0 parents commit ddea3388d12264032017d4d3d9122b15bc2f8223 @nrabinowitz committed Jun 29, 2011
Showing with 522 additions and 0 deletions.
  1. +21 −0 LICENSE.txt
  2. +34 −0 README.md
  3. +1 −0 VERSION.txt
  4. +18 −0 client/jquery.js
  5. +5 −0 client/pjscrape_client.js
  6. +443 −0 pjscrape.js
@@ -0,0 +1,21 @@
+The MIT License
+
+Copyright (c) 2011 Nick Rabinowitz.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
@@ -0,0 +1,34 @@
+==Overview==
+
+**pjscrape** is a framework for anyone who's ever wanted a command-line tool for web scraping using JavaScript and [jQuery](http://jquery.com/). Built to run with [PhantomJS](http://phantomjs.org), it allows you to scrape pages in a fully rendered, JavaScript-enabled context from the command line, no browser required.
+
+==Usage==
+
+ 1. [Download and install PhantomJS](http://code.google.com/p/phantomjs/downloads/list) or PyPhantomJS, v.1.2. In order to use file-based logging or data writes, you'll need to use PyPhantomJS with the [Save to File plugin](http://dev.umaclan.com/projects/pyphantomjs/wiki/Plugins#Save-to-File) (though I think this feature will be rolled into the PhantomJS core in the next version).
+
+ 2. Make a config file to define your scraper(s). Config files can set global pjscrape settings via `pjs.config()` and add one or more scraper suites via `pjs.addSuite()`.
+
+ 3. A scraper suite defines a set of scraper functions for one or more URLs. More docs on this coming soon, but a sample config file might look like this:
+
+        pjs.addSuite({
+            title: 'My Scraper Suite',
+            // single URL or array
+            urls: [
+                'http://www.example.com/page1',
+                'http://www.example.com/page2'
+            ],
+            // one or more functions, evaluated in the client
+            scrapers: [
+                function() {
+                    var items = [];
+                    $('h2').each(function() {
+                        items.push($(this).text());
+                    });
+                    return items;
+                }
+            ]
+        });
+
+ 4. To run pjscrape from the command line, type: `pyphantomjs /path/to/pjscrape.js my_config_file.js`
+
+By default, the log output is pretty verbose, and the scraped data is written to stdout at the end of the scrape.
@@ -0,0 +1 @@
+0.1

Large diffs are not rendered by default.
@@ -0,0 +1,5 @@
+window._pjs = (function() {
+    // XXX: it would be nice to offer utilities for a) testing if a link is local,
+    // and b) converting relative URLs to fully qualified URLs
+    return {};
+})();
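
The XXX comment in `client/pjscrape_client.js` names two utilities that do not exist yet: testing whether a link is local, and converting relative URLs to fully qualified ones. Below is a minimal sketch of how they might look. It is not part of this commit: `toFullUrl` and `isLocalLink` are hypothetical names, "local" is assumed to mean same origin, and the code relies on the standard `URL` constructor, which older PhantomJS builds may lack (an anchor-element approach would be needed there).

```javascript
// Hypothetical helpers sketching the utilities mentioned in the XXX comment.
// These names are illustrative only and are not defined by pjscrape.

// Resolve a possibly relative href against a base URL and return the
// fully qualified URL string.
function toFullUrl(href, base) {
    return new URL(href, base).href;
}

// Treat a link as "local" when it resolves to the same origin
// (scheme, host, and port) as the base URL. This definition is an assumption.
function isLocalLink(href, base) {
    return new URL(href, base).origin === new URL(base).origin;
}

// Example: resolve and classify a few links relative to a page URL.
var page = 'http://www.example.com/page1';
['/about', 'page2', 'http://other.example.org/'].forEach(function(href) {
    console.log(toFullUrl(href, page), isLocalLink(href, page));
});
```

In a scraper, helpers like these could be attached to the object returned for `window._pjs`, with `window.location.href` passed as the base.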