NAME

WWW::Crawler::Lite - A single-threaded crawler/spider for the web.

SYNOPSIS

my %pages = ( );
my $pattern = 'https?://example\.com\/';
my %links = ( );
my $downloaded = 0;

my $crawler;
$crawler = WWW::Crawler::Lite->new(
  agent       => 'MySuperBot/1.0',
  url_pattern => $pattern,
  http_accept => [qw( text/plain text/html application/xhtml+xml )],
  link_parser => 'default',
  on_response => sub {
    my ($url, $res) = @_;

  warn "$url contains " . $res->content;
  $downloaded++;
  $crawler->stop() if $downloaded++ > 5;
},
follow_ok   => sub {
  my ($url) = @_;
  # If you like this url and want to use it, then return a true value:
  return 1;
},
on_link     => sub {
  my ($from, $to, $text) = @_;

  return if exists($pages{$to}) && $pages{$to} eq 'BAD';
  $pages{$to}++;
  $links{$to} ||= [ ];
  push @{$links{$to}}, { from => $from, text => $text };
},
on_bad_url => sub {
  my ($url) = @_;

    # Mark this url as 'bad':
    $pages{$url} = 'BAD';
  }
);
$crawler->crawl( url => "http://example.com/" );

warn "DONE!!!!!";

use Data::Dumper;
map {
  warn "$_ ($pages{$_} incoming links) -> " . Dumper($links{$_})
} sort keys %links;

DESCRIPTION

WWW::Crawler::Lite is a single-threaded spider/crawler for the web. It can be used within a mod_perl, CGI or Catalyst-style environment because it does not fork or use threads.

The callback-based interface is fast and simple, allowing you to focus on simply processing the data that WWW::Crawler::Lite extracts from the target website.

PUBLIC METHODS

new( %args )

Creates and returns a new WWW::Crawler::Lite object.

The %args hash is not required, but may contain the following elements:

agent - String

Used as the user-agent string for HTTP requests.

Default Value: - WWW-Crawler-Lite/$VERSION $^O

url_pattern - RegExp or String

New links that do not match this pattern will not be added to the processing queue.

Default Value: https?://.+

http_accept - ArrayRef

This can be used to filter out unwanted responses.

link_parser - String

Valid values: 'default' and 'HTML::LinkExtor'

The default value is 'default' which uses a naive regexp to do the link parsing.

The upshot of using 'default' is that the regexp will also find the hyperlinked text or alt-text (of a hyperlinked img tag) and give that to your 'on_link' handler.

Default Value: [qw( text/html text/plain application/xhtml+xml )]

on_response($url, $response) - CodeRef

Called whenever a successful response is returned.

on_link($from, $to, $text) - CodeRef

Called whenever a new link is found. Arguments are:

- $from

The URL that is linked *from*

- $to

The URL that is linked *to*

- $text

The anchor text (eg: The HTML within the link - <a href="...">__This Text Here__</a>)

on_bad_url($url) - CodeRef

Called whenever an unsuccessful response is received.

delay_seconds - Number

Indicates the length of time (in seconds) that the crawler should pause before making each request. This can be useful when you want to spider a website, not launch a denial of service attack on it.

stop( )

Causes the crawler to stop processing its queue of URLs.

AUTHOR

John Drago jdrago_999@yahoo.com

COPYRIGHT

This software is Free software and may be used and redistributed under the same terms as perl itself.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
inc/Module		inc/Module
lib/WWW/Crawler		lib/WWW/Crawler
t/010-basic		t/010-basic
Changes		Changes
MANIFEST		MANIFEST
META.yml		META.yml
Makefile.PL		Makefile.PL
README.markdown		README.markdown

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

inc/Module

inc/Module

lib/WWW/Crawler

lib/WWW/Crawler

t/010-basic

t/010-basic

Changes

Changes

MANIFEST

MANIFEST

META.yml

META.yml

Makefile.PL

Makefile.PL

README.markdown

README.markdown

Repository files navigation

NAME

SYNOPSIS

DESCRIPTION

PUBLIC METHODS

new( %args )

stop( )

AUTHOR

COPYRIGHT

About

Releases

Packages

jdrago999/WWW-Crawler-Lite

Folders and files

Latest commit

History

Repository files navigation

NAME

SYNOPSIS

DESCRIPTION

PUBLIC METHODS

new( %args )

stop( )

AUTHOR

COPYRIGHT

About

Resources

Stars

Watchers

Forks