
Merge branch 'jshirley/master'

2 parents 74e2b0a + bd1d680 commit fad055ab8eea1c4834192cd103237e6c49a9f9c3 @miyagawa miyagawa committed Dec 8, 2009
Showing with 31 additions and 9 deletions.
  1. +31 −9 lib/Web/Scraper.pm
@@ -280,15 +280,20 @@ __END__
=head1 NAME
-Web::Scraper - Web Scraping Toolkit inspired by Scrapi
+Web::Scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions
=head1 SYNOPSIS
use URI;
use Web::Scraper;
+ # First, create your scraper block
my $tweets = scraper {
+ # Parse all LIs with the class "status", store them into a resulting
+ # array 'tweets'. We embed another scraper for each tweet.
process "li.status", "tweets[]" => scraper {
+ # And, in that array, pull in the elements with the class
+ # "entry-content", "entry-date" and the link
process ".entry-content", body => 'TEXT';
process ".entry-date", when => 'TEXT';
process 'a[rel="bookmark"]', link => '@href';
@@ -297,27 +302,38 @@ Web::Scraper - Web Scraping Toolkit inspired by Scrapi
my $res = $tweets->scrape( URI->new("http://twitter.com/miyagawa") );
+ # The result has the populated tweets array
for my $tweet (@{$res->{tweets}}) {
print "$tweet->{body} $tweet->{when} (link: $tweet->{link})\n";
}
+The structure would resemble this (visually):
+ {
+ tweets => [
+ { body => $body, when => $date, link => $uri },
+ { body => $body, when => $date, link => $uri },
+ ]
+ }
+
=head1 DESCRIPTION
Web::Scraper is a web scraper toolkit, inspired by Ruby's equivalent
-Scrapi. It allows you to write a web scraping script or class in a
-DSL-ish but still pure-perl language.
+Scrapi. It provides a DSL-ish interface for traversing HTML documents and
+returning a neatly arranged Perl data structure.
-=head1 METHODS
+The I<scraper> and I<process> blocks provide a method to define what segments
+of a document to extract. It understands HTML and CSS Selectors as well as
+XPath expressions.
-=over 4
+=head1 METHODS
-=item scraper
+=head2 scraper
$scraper = scraper { ... };
Creates a new Web::Scraper object by wrapping the DSL code that will be fired when the I<scrape> method is called.
-=item scrape
+=head2 scrape
$res = $scraper->scrape(URI->new($uri));
$res = $scraper->scrape($html_content);
@@ -341,7 +357,7 @@ a string instead of URI or HTTP::Response.
This way Web::Scraper can resolve the relative links found in the document.
-=item process
+=head2 process
scraper {
process "tag.class", key => 'TEXT';
@@ -375,7 +391,11 @@ XPath expression and otherwise CSS selector.
# list => [ { id => "1", text => "foo" }, { id => "2", text => "bar" } ];
process "li", "list[]" => { id => '@id', text => "TEXT" };
-=back
+=head1 EXAMPLES
+
+There are many examples in the C<eg/> dir packaged in this distribution.
+It is recommended to look through these.
+
=head1 NESTED SCRAPERS
@@ -398,4 +418,6 @@ it under the same terms as Perl itself.
L<http://blog.labnotes.org/category/scrapi/>
+L<HTML::TreeBuilder::XPath>
+
=cut
