can_fetch() returns TRUE ... #2

Closed
petermeissner opened this issue Oct 22, 2017 · 1 comment
petermeissner commented Oct 22, 2017

Hey,

While integrating spiderbar's can_fetch() into the robotstxt package, I encountered a test case where can_fetch() and paths_allowed(check_method = "robotstxt") disagree.

Consider the following robots.txt file:

User-agent: UniversalRobot/1.0
User-agent: mein-Robot
Disallow: /quellen/dtd/

User-agent: *
Disallow: /unsinn/
Disallow: /temp/
Disallow: /newsticker.shtml

Now try this:

library(robotstxt)

rtxt <- "# robots.txt for http://www.example.org/\n\nUser-agent: UniversalRobot/1.0\nUser-agent: mein-Robot\nDisallow: /quellen/dtd/\n\nUser-agent: *\nDisallow: /unsinn/\nDisallow: /temp/\nDisallow: /newsticker.shtml"

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "robotstxt",
  bot            = "*"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "spiderbar",
  bot            = "*"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "robotstxt",
  bot            = "mein-Robot"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "spiderbar",
  bot            = "mein-Robot"
)
#> [1] TRUE

can_fetch() seems to ignore the rules that ought to apply to all bots when a specific bot name / user agent is used.
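To rule out the robotstxt wrapper as the source of the difference, here is a minimal check that calls spiderbar directly (assuming the spiderbar package is installed; robxp() parses the rtxt string from above and can_fetch() checks a path against it). Going by the paths_allowed() results above, it should answer TRUE for mein-Robot and FALSE for *:

library(spiderbar)

rt <- robxp(rtxt)   # parse the same robots.txt string as above

can_fetch(rt, "/temp/some_file.txt", "mein-Robot")
#> [1] TRUE

can_fetch(rt, "/temp/some_file.txt", "*")
#> [1] FALSE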

petermeissner commented:
I discussed this here: seomoz/rep-cpp#33 ... and I think the robotstxt method for checking makes plausible but simply wrong assumptions about how robots.txt files work.

I will fix this within the robotstxt package and from then on default to the much faster spiderbar/rep-cpp backend for simple path checking.
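For reference, a minimal sketch of the group-selection rule that spiderbar/rep-cpp follows (illustrative only; pick_group() and the groups list below are invented for this example and are not the robotstxt package's code): an agent that matches a named User-agent group uses only that group's rules, and the * group is merely a fallback, not merged in.

# Illustrative sketch, not library code: resolve which robots.txt
# group applies to a given bot.
pick_group <- function(groups, bot) {
  own <- Filter(function(g) tolower(bot) %in% tolower(g$agents), groups)
  if (length(own) > 0) return(own[[1]])       # named group wins outright
  star <- Filter(function(g) "*" %in% g$agents, groups)
  if (length(star) > 0) star[[1]] else NULL   # "*" only as fallback
}

groups <- list(
  list(agents = c("UniversalRobot/1.0", "mein-Robot"),
       disallow = "/quellen/dtd/"),
  list(agents = "*",
       disallow = c("/unsinn/", "/temp/", "/newsticker.shtml"))
)

pick_group(groups, "mein-Robot")$disallow
#> [1] "/quellen/dtd/"

pick_group(groups, "SomeOtherBot")$disallow
#> [1] "/unsinn/"          "/temp/"            "/newsticker.shtml"

Under this rule, /temp/some_file.txt is allowed for mein-Robot, which is consistent with the FALSE vs TRUE split shown above: the robotstxt method was also applying the * rules to a named bot, which it should not.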
