can_fetch() returns TRUE ... #2

Closed
petermeissner opened this issue Oct 22, 2017 · 1 comment
petermeissner commented Oct 22, 2017

Hey,

While integrating spiderbar's can_fetch() into the robotstxt package, I encountered a test case where can_fetch() and paths_allowed(check_method = "robotstxt") disagree.

Consider the following robots.txt file:

User-agent: UniversalRobot/1.0
User-agent: mein-Robot
Disallow: /quellen/dtd/

User-agent: *
Disallow: /unsinn/
Disallow: /temp/
Disallow: /newsticker.shtml

Now try this:

library(robotstxt)

rtxt <- "# robots.txt for http://www.example.org/\n\nUser-agent: UniversalRobot/1.0\nUser-agent: mein-Robot\nDisallow: /quellen/dtd/\n\nUser-agent: *\nDisallow: /unsinn/\nDisallow: /temp/\nDisallow: /newsticker.shtml"

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "robotstxt",
  bot            = "*"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "spiderbar",
  bot            = "*"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "robotstxt",
  bot            = "mein-Robot"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "spiderbar",
  bot            = "mein-Robot"
)
#> [1] TRUE

can_fetch() seems to ignore the rules that ought to apply to all bots when a specific bot name / user agent is used.
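To rule out the robotstxt wrapper as the source of the difference, here is a minimal check that calls spiderbar directly (assuming the spiderbar package is installed; robxp() parses the rtxt string from above and can_fetch() checks a path against it). Going by the paths_allowed() results above, it should answer TRUE for mein-Robot and FALSE for *:

library(spiderbar)

rt <- robxp(rtxt)   # parse the same robots.txt string as above

can_fetch(rt, "/temp/some_file.txt", "mein-Robot")
#> [1] TRUE

can_fetch(rt, "/temp/some_file.txt", "*")
#> [1] FALSE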

petermeissner commented:
I discussed this here: seomoz/rep-cpp#33 ... and I think the robotstxt method for checking makes plausible but simply wrong assumptions about how robots.txt files work.

I will fix this within the robotstxt package and from then on default to the much faster spiderbar/rep-cpp backend for simple path checking.
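For reference, a minimal sketch of the group-selection rule that spiderbar/rep-cpp follows (illustrative only; pick_group() and the groups list below are invented for this example and are not the robotstxt package's code): an agent that matches a named User-agent group uses only that group's rules, and the * group is merely a fallback, not merged in.

# Illustrative sketch, not library code: resolve which robots.txt
# group applies to a given bot.
pick_group <- function(groups, bot) {
  own <- Filter(function(g) tolower(bot) %in% tolower(g$agents), groups)
  if (length(own) > 0) return(own[[1]])       # named group wins outright
  star <- Filter(function(g) "*" %in% g$agents, groups)
  if (length(star) > 0) star[[1]] else NULL   # "*" only as fallback
}

groups <- list(
  list(agents = c("UniversalRobot/1.0", "mein-Robot"),
       disallow = "/quellen/dtd/"),
  list(agents = "*",
       disallow = c("/unsinn/", "/temp/", "/newsticker.shtml"))
)

pick_group(groups, "mein-Robot")$disallow
#> [1] "/quellen/dtd/"

pick_group(groups, "SomeOtherBot")$disallow
#> [1] "/unsinn/"          "/temp/"            "/newsticker.shtml"

Under this rule, /temp/some_file.txt is allowed for mein-Robot, which is consistent with the FALSE vs TRUE split shown above: the robotstxt method was also applying the * rules to a named bot, which it should not.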
