
initial social scraper import

commit 09eb5c0e37052eb67a56444d3f332ac5dae78e56 (1 parent: b1278e1)
@pkrumins committed Feb 26, 2010
Showing with 1,748 additions and 1 deletion.
  1. +46 −0 examples/picurls.sb.txt
  2. +52 −0 examples/picurls.sn.txt
  3. +136 −1 readme.txt
  4. +203 −0 scraper.pl
  5. +112 −0 sites/boingboing.pm
  6. +100 −0 sites/delicious.pm
  7. +117 −0 sites/digg.pm
  8. +125 −0 sites/flickr.pm
  9. +93 −0 sites/furl.pm
  10. +180 −0 sites/reddit.pm
  11. +241 −0 sites/scraper.pm
  12. +109 −0 sites/simpy.pm
  13. +140 −0 sites/stumbleupon.pm
  14. +94 −0 sites/wired.pm
examples/picurls.sb.txt
@@ -0,0 +1,46 @@
+#
+# Peteris Krumins (peter@catonmat.net), 2007.10.10
+# http://www.catonmat.net - good coders code, great reuse
+#
+# Scraper pattern file for picurls.com website
+#
+# This is another pattern filtering format that the scraper accepts.
+#
+# Here we can specify a list of predicates as perl subroutines which
+# get called on each item found on the site being scraped.
+#
+# If a predicate subroutine returns true, the item gets accepted.
+# If it returns false, the next subroutine gets called, until either one
+# of them has accepted the item or all of them have failed, in which case
+# the item gets discarded.
+#
+# WARNING: the code must be correctly indented, otherwise the program will
+# fail to extract the Perl subroutine.
+#
+
+#
+# These patterns are mostly for social bookmarking (sb) sites, where people
+# do not write "[PIC]" or "(Pic)" in the title.
+#
+
+# Discard items which point to index pages
+#
+perl: sub {
+ use URI;
+ my $post = shift;
+
+ my $uri = URI->new($post->{url});
+ my $path = $uri->path;
+
+ if (!length $path) { # empty path
+ return 0;
+ }
+ elsif ($path =~ m!^/+$!) { # just a slash '/'
+ return 0;
+ }
+ elsif ($path =~ m!^/(home|index)\.(php|html|htm|aspx?)$!i) { # index files
+ return 0;
+ }
+
+ return 1;
+}
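+
+# For illustration only (this hypothetical predicate is not part of the
+# original file), a second predicate discarding URL-shortener links could
+# be added the same way:
+#
+# perl: sub {
+#     my $post = shift;
+#     return 0 if $post->{url} =~ m!^https?://(tinyurl\.com|bit\.ly)/!i;
+#     return 1;
+# }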
+
examples/picurls.sn.txt
@@ -0,0 +1,52 @@
+#
+# Peteris Krumins (peter@catonmat.net), 2007.09.08
+# http://www.catonmat.net - good coders code, great reuse
+#
+# Scraper pattern file for picurls.com website
+#
+# The format of the file is the following:
+# [url:|title:|desc:] regex pattern
+#
+# url:, title:, desc: are optional. They specify whether the entry
+# on a website should be matched against its url, title or description.
+#
+# If none of them is specified, the pattern is matched against both
+# the title and the description.
+#
+
+# match picture urls
+#
+url: \.jpg$
+url: \.gif$
+url: \.png$
+
+# match common patterns describing posts having pictures in them
+#
+[[(].*picture.*[])]
+[[(].*pic.*[])]
+[[(].*image.*[])]
+[[(].*photo.*[])]
+[[(].*comic.*[])]
+[[(].*chart.*[])]
+[[(].*graph.*[])]
+
+photos? of
+pics? of
+images? of
+pictures? of
+comics? of
+charts? of
+graphs? of
+graphics? of
+(this|these|those) photos?
+(this|these|those) pics?
+(this|these|those) images?
+photosets? (on|of)
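+
+# (for illustration: a title like "photos of the aurora" or "funny [pic]
+# of a cat" would match the patterns above)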
+
+# match domains containing just pics
+url: xkcd\.com
+url: flickr\.com
+url: photobucket\.com
+url: imageshack\.us
+url: bestpicever\.com
+
readme.txt
@@ -14,6 +14,141 @@ http://www.catonmat.net/blog/making-of-picurls-popurls-for-pictures-part-one/
------------------------------------------------------------------------------
-will import it soon, just remembered about this project.
+The basic idea of the data scraper is to crawl websites and extract their
+posts in a human-readable output format. I want it to be easily extensible
+via plugins and highly reusable. I also want the scraper to have basic
+filtering capabilities, to select just the posts I am interested in.
+There are two parts to the scraper: the scraper library, and the scraper
+program, which uses the library and makes it easier to scrape many sites
+at once.
+
+The scraper library consists of the base class 'sites::scraper' and plugins
+for various websites. For example, Digg's scraper plugin is 'sites::digg'
+(it inherits from sites::scraper).
+
+The constructor of each plugin takes 4 optional arguments - pages, vars,
+patterns or pattern_file:
+
+ * pages - integer, specifies how many pages to scrape in a single run,
+ * vars - hashref, specifies parameters for the plugin,
+ * patterns - hashref, specifies string regex patterns for filtering posts,
+ * pattern_file - string, path to file containing patterns for filtering posts
+
+Here is a Perl one-liner example of using the scraper library directly
+(without the scraper program). This example scrapes two pages of popular
+stories from Digg's programming section, filtering just the posts that
+match 'php' (case insensitively):
+
+perl -Msites::digg -e '
+ $digg = sites::digg->new(
+ pages => 2,
+ patterns => {
+ title => [ q/php/ ],
+ desc => [ q/php/ ]
+ },
+ vars => {
+ popular => 1,
+ topic => q/programming/
+ }
+ );
+ $digg->scrape_verbose'
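+
+(Note: for the one-liner to find the plugin, the directory containing
+sites/ must be in Perl's module search path @INC; running it from the
+repository root works if your Perl includes the current directory in
+@INC, otherwise add -I. to the command.)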
+
+
+Here is the output of the plugin:
+
+ comments: 27
+ container_name: Technology
+ container_short_name: technology
+ description: With WordPress 2.3 launching this week, a bunch of themes \
+ and plugins needed updating. If you're not that familiar with PHP, \
+ this might present a slight problem. Not to worry, though - we've \
+ collected together 20+ tools for you to discover the secrets of PHP.
+ human_time: 2007-09-26 18:18:02
+ id: 3587383
+ score: 921
+ status: popular
+ title: The PHP Toolbox: 20+ PHP Resources
+ topic_name: Programming
+ topic_short_name: programming
+ unix_time: 1190819882
+ url: http://mashable.com/2007/09/26/php-toolbox/
+ user: ace77
+ user_icon: http://digg.com/users/ace77/l.png
+ user_profileviews: 17019
+ user_registrered: 1162332420
+ site: digg
+
+Each story is represented as a paragraph of key: value pairs. In this case
+the scraper found two posts matching 'php' (only one of them is shown above).
+
+Any program taking this output as input is free to use just the parts of
+the information it needs.
+
+It is guaranteed that each plugin produces output with at least 'title', 'url'
+and 'site' fields.
+
+The date of the post, if available, is extracted into two fields:
+'unix_time' and 'human_time'.
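+
+As an illustration, here is a hypothetical consumer (not part of the
+scraper) that reads this output in paragraph mode and picks out the
+guaranteed fields:
+
+  perl -00 -ne '
+      my %post = /^([a-z_]+): (.*)$/mg;  # one key: value pair per line
+      print "$post{site}: $post{title}\n  $post{url}\n";
+  ' scraped_output.txt
+
+(Values wrapped across lines with a trailing \ would need extra handling;
+this sketch only picks up the first line of each value.)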
+
+To create a plugin, one must override just three methods from the base
+class (a minimal sketch follows the list):
+
+ * site_name - the method should return a unique site id, which is
+ output in each post as the 'site' field,
+ * get_page_url - given a page number, the method should construct the URL
+ of the page containing posts,
+ * get_posts - given the content of the page fetched from the last
+ get_page_url call, the method should return an array of
+ hashrefs of key => val pairs describing each post.
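+
+For illustration, a minimal plugin could look like this (sites::example
+is hypothetical, and so are its URL scheme and the HTML it parses):
+
+  package sites::example;
+  use warnings;
+  use strict;
+  use base 'sites::scraper';
+
+  # unique site id, output in each post as the 'site' field
+  sub site_name { 'example' }
+
+  # construct the URL to the given page of posts
+  sub get_page_url {
+      my ($self, $page) = @_;
+      return "http://example.com/popular?page=$page";
+  }
+
+  # turn the fetched page content into an array of post hashrefs;
+  # 'title' and 'url' are the required fields
+  sub get_posts {
+      my ($self, $content) = @_;
+      my @posts;
+      while ($content =~ m{<a class="story" href="([^"]+)">([^<]+)</a>}g) {
+          push @posts, { url => $1, title => $2 };
+      }
+      return @posts;
+  }
+
+  1;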
+
+It is difficult to document everything the library does here; that would
+take several more pages. If you are interested in the details, please take
+a look at the sources.
+
+The program is called scraper.pl. Running it without arguments prints its
+basic usage:
+
+ Usage: ./scraper.pl <site[:M][:{var1=val1; var2=val2 ...}]> ...
+ [/path/to/pattern_file]
+
+ Crawls given sites extracting entries matching optional patterns in
+ pattern_file.
+ Optional argument M specifies how many pages to crawl, default 1.
+ Arguments (variables) for plugins can be passed via an optional { }.
+
+The arguments in { } get parsed and passed to the site plugin's constructor.
+Several sites can be scraped at once.
+
+For example, running the program with the following arguments:
+
+ ./scraper.pl reddit:2:{subreddit=science} stumbleupon:{tag=photography}
+ picurls.txt
+
+would scrape two pages of science.reddit.com and one page of StumbleUpon
+posts tagged 'photography', using the filtering rules in the file
+'picurls.txt'.
+
+This is how the output of this program looks:
+
+ desc: Morning Glory at rest before another eruption, \
+ Yellow Stone National Park.
+ human_time: 2007-02-14 04:34:41
+ title: public-domain-photos.com/free-stock-photos-4/travel/yellowstone
+ unix_time: 1171420481
+ url: http://www.public-domain-photos.com/free-stock-photos-4/travel/ \
+ yellowstone/morning-glory-pool.jpg
+ site: stumbleupon
+
+See the original post for more documentation:
+
+http://www.catonmat.net/blog/making-of-picurls-popurls-for-pictures-part-one/
+
+
+------------------------------------------------------------------------------
+
+Have fun scraping the Internet!
+
+
+Sincerely,
+Peteris Krumins
+http://www.catonmat.net