Hi. This is imacat from Taiwan. I was trying LWP::RobotUA and
found that WWW::RobotRules does not recognize the Crawl-delay: directive.
The test script below is an exact copy of the example in the WWW::RobotRules POD:
==========
#! /usr/bin/perl -w
use WWW::RobotRules;
my $rules = WWW::RobotRules->new('MOMspider/1.0');
use LWP::Simple qw(get);
my $url = "http://sourceforge.net/robots.txt";
my $robots_txt = get $url;
$rules->parse($url, $robots_txt) if defined $robots_txt;
==========
The result I got is:
==========
imacat@rinse ~/tmp % ./test.pl
RobotRules <http://sourceforge.net/robots.txt>: Unexpected line:
Crawl-delay: 10
RobotRules <http://sourceforge.net/robots.txt>: Unexpected line:
Crawl-delay: 2
RobotRules <http://sourceforge.net/robots.txt>: Unexpected line:
Crawl-delay: 2
imacat@rinse ~/tmp %
==========
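Until the parser accepts the directive, the warnings shown above can be filtered by the caller. The following is only a minimal sketch: `without_crawl_delay_warnings` is a hypothetical helper, not part of WWW::RobotRules, and it assumes the message is emitted through Perl's `warn` with the wording shown in the output above.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code while dropping only the
# "Unexpected line: Crawl-delay" warnings. Every other warning is
# passed on to the previously installed handler, or to STDERR.
sub without_crawl_delay_warnings {
    my ($code) = @_;
    my $prev = $SIG{__WARN__};
    local $SIG{__WARN__} = sub {
        my ($msg) = @_;
        return if $msg =~ /Unexpected line:\s*Crawl-delay/i;
        if (ref $prev eq 'CODE') { $prev->($msg) }
        else                     { warn $msg }
    };
    return $code->();
}
```

In the test script this would wrap the parse call, for example `without_crawl_delay_warnings(sub { $rules->parse($url, $robots_txt) })`. It hides the symptom only; the Crawl-delay value is still ignored.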
Crawl-delay: is a widely used robots.txt directive, and it is obeyed by
Yahoo, MSN and many other robots. A package built on LWP::RobotUA that
prints this warning on every run is hard to use in practice, which makes
LWP::RobotUA much less useful. Besides, if a website specifies
Crawl-delay:, LWP::RobotUA should respect it instead of its own
$ua->delay(). Could you look into this and fix it soon? Thank you.
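To illustrate the requested behaviour, here is a rough sketch of how a caller might read Crawl-delay: itself in the meantime. `crawl_delay_for` is a hypothetical helper, not part of WWW::RobotRules or LWP::RobotUA, and the record grouping and substring agent matching are simplifying assumptions rather than what the module does.

```perl
use strict;
use warnings;

# Hypothetical helper: return the Crawl-delay (in seconds) that applies
# to $agent, falling back to the "*" record, or undef if none is given.
sub crawl_delay_for {
    my ($robots_txt, $agent) = @_;
    my (%delay, @group);
    my $in_rules = 0;    # true once the current record has rule lines
    for my $line (split /\r?\n/, $robots_txt) {
        next if $line =~ /^\s*#/;            # skip comment-only lines
        $line =~ s/\s*#.*//;                 # strip trailing comments
        if ($line =~ /^\s*User-agent\s*:\s*(\S+)/i) {
            @group = () if $in_rules;        # a new record starts here
            $in_rules = 0;
            push @group, lc $1;
        }
        elsif ($line =~ /^\s*Crawl-delay\s*:\s*(\d+(?:\.\d+)?)/i) {
            $in_rules = 1;
            $delay{$_} = $1 for @group;
        }
        elsif ($line =~ /\S/) {
            $in_rules = 1;                   # Disallow and other rules
        }
        else {
            @group = ();                     # a blank line ends the record
            $in_rules = 0;
        }
    }
    (my $name = lc $agent) =~ s{/.*}{};      # "MOMspider/1.0" -> "momspider"
    # Prefer the longest matching agent token over the "*" default.
    for my $token (sort { length $b <=> length $a } grep { $_ ne '*' } keys %delay) {
        return $delay{$token} if index($name, $token) >= 0;
    }
    return $delay{'*'};
}
```

Since `$ua->delay()` in LWP::RobotUA is expressed in minutes, the result would need converting, for example `$ua->delay($seconds / 60) if defined $seconds`.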
Migrated from rt.cpan.org#19539 (status was 'new')
Requestors:
From imacat@cpan.org on 2006-05-28 13:38:41: