can_fetch() returns TRUE ... #2
I discussed this here: seomoz/rep-cpp#33 ... and I think the robotstxt method for checking makes plausible but simply wrong assumptions about how robots.txt files work. I will fix this within the robotstxt package and from then on default to the much faster spiderbar/rep-cpp backend for simple path checking.
Hey,
while integrating spiderbar's `can_fetch()` into the robotstxt package I encountered a test case where `can_fetch()` and `paths_allowed(check_method="robotstxt")` differ.

Consider the following robots.txt file:

    User-agent: UniversalRobot/1.0
    User-agent: mein-Robot
    Disallow: /quellen/dtd/

    User-agent: *
    Disallow: /unsinn/
    Disallow: /temp/
    Disallow: /newsticker.shtml

Now try this:
`can_fetch()` seems to ignore the rules that ought to apply to all bots when a specific bot name / user agent is used.
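
For comparison, here is a minimal sketch that feeds the same robots.txt to Python's standard-library `urllib.robotparser`. This is a different parser from spiderbar/rep-cpp and from robotstxt, and `SomeOtherBot` and the `allowed()` helper are made-up names for illustration. It only shows one common reading of group selection: a bot that matches a named group follows that group alone and does not additionally inherit the `*` rules.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt file from this issue, one directive per line.
ROBOTS_TXT = """\
User-agent: UniversalRobot/1.0
User-agent: mein-Robot
Disallow: /quellen/dtd/

User-agent: *
Disallow: /unsinn/
Disallow: /temp/
Disallow: /newsticker.shtml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Hypothetical helper: ask the stdlib parser whether `agent` may fetch `path`."""
    return parser.can_fetch(agent, path)


# "SomeOtherBot" is an invented agent that matches no named group,
# so it falls back to the "*" group.
for agent in ("mein-Robot", "SomeOtherBot"):
    for path in ("/quellen/dtd/", "/unsinn/"):
        print(agent, path, allowed(agent, path))
```

If the parser under test agrees with this reading, `mein-Robot` is blocked only from `/quellen/dtd/`, while an unnamed bot is blocked only from the paths listed under `*`.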