initial commit

n1k0 · Jun 29, 2011 · commit ddea338

Showing 6 changed files with 522 additions and 0 deletions.
diff --git a/LICENSE.txt b/LICENSE.txt
@@ -0,0 +1,21 @@
+The MIT License
+
+Copyright (c) 2011 Nick Rabinowitz.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,34 @@
+## Overview
+
+**pjscrape** is a framework for anyone who's ever wanted a command-line tool for web scraping using JavaScript and [jQuery](http://jquery.com/). Built to run with [PhantomJS](http://phantomjs.org), it lets you scrape pages in a fully rendered, JavaScript-enabled context from the command line, no browser required.
+
+## Usage
+
+ 1. [Download and install PhantomJS](http://code.google.com/p/phantomjs/downloads/list) or PyPhantomJS, v1.2. To use file-based logging or data writes, you'll need PyPhantomJS with the [Save to File plugin](http://dev.umaclan.com/projects/pyphantomjs/wiki/Plugins#Save-to-File) (though I think this feature will be rolled into the PhantomJS core in the next version).
+
+ 2. Make a config file to define your scraper(s). Config files can set global pjscrape settings via `pjs.config()` and add one or more scraper suites via `pjs.addSuite()`. 
+
+ 3. A scraper suite defines a set of scraper functions for one or more URLs. More docs on this coming soon, but a sample config file might look like this: 
+
+    pjs.addSuite({
+        title: 'My Scraper Suite',
+        // single URL or array
+        urls: [
+            'http://www.example.com/page1',
+            'http://www.example.com/page2'
+        ],
+        // one or more functions, evaluated in the client
+        scrapers: [
+            function() {
+                var items = [];
+                $('h2').each(function() {
+                    items.push($(this).text());
+                });
+                return items;
+            }
+        ]
+    });
+
+ 4. To run pjscrape from the command line, type: `pyphantomjs /path/to/pjscrape.js my_config_file.js`
+
+By default, the log output is pretty verbose, and the scraped data is written to stdout at the end of the scrape.
diff --git a/VERSION.txt b/VERSION.txt
@@ -0,0 +1 @@
+0.1
diff --git a/client/jquery.js b/client/jquery.js
diff --git a/client/pjscrape_client.js b/client/pjscrape_client.js
@@ -0,0 +1,5 @@
+window._pjs = (function() {
+    // XXX: it would be nice to offer utilities for a) testing if a link is local,
+    // and b) converting relative URLs to fully qualified URLs
+    return {};
+})();
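
The XXX comment in `client/pjscrape_client.js` names two utilities that this commit does not yet provide: testing whether a link is local, and converting relative URLs to fully qualified ones. A minimal sketch of what they might look like follows. The names `toFullUrl` and `isLocalLink` are hypothetical and not part of this commit, "local" is assumed to mean "same origin as the page", and the sketch relies on the standard `URL` constructor, which may not be available in the PhantomJS version targeted here.

```javascript
// Hypothetical helpers for the two utilities mentioned in the XXX comment.
// Neither function exists in this commit; names and signatures are assumptions.

// Resolve a possibly relative href against the URL of the page it appears on.
// Throws a TypeError if the result is not a valid URL.
function toFullUrl(href, pageUrl) {
    return new URL(href, pageUrl).href;
}

// Treat a link as "local" when it resolves to the same origin
// (scheme + host + port) as the page it appears on.
function isLocalLink(href, pageUrl) {
    return new URL(href, pageUrl).origin === new URL(pageUrl).origin;
}

// Example usage with the sample URLs from the README.
console.log(toFullUrl('page2', 'http://www.example.com/page1'));
console.log(isLocalLink('/page2', 'http://www.example.com/page1'));
console.log(isLocalLink('http://jquery.com/', 'http://www.example.com/page1'));
```

Inside a scraper function these could be fed `window.location.href` as `pageUrl`, so that links collected with jQuery are normalized before they are returned.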