Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
A robots.txt parser written in PHP, useful for checking whether or not your bot has access to a given URL.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Failed to load latest commit information.|
/// copyleft 2012 meklu (public domain) rbt_prs is a robots.txt parser written in PHP and intended for checking whether or not your bot can access a given URL. Currently it provides just one function: isUrlBotSafe($url, $your_useragent, $robots_txt, $redirects, $debug) Setting any of the last four parameters to NULL will make them use their default values. The first argument, $url, is the only required argument and means the URL you would like to check. The second argument, $your_useragent, is the user agent string your bot will use. While it isn't mandatory to select one, it is highly advised. To mention this software in your user agent string, you could invoke the function as isUrlBotSafe("http://example.com/", "Fancybot/0.1 (awesome) " . RBT_PRS_UA); Notice the space after the description. The constant RBT_PRS_UA doesn't have one in it. The third argument, $robots_txt, is useful when you want to go through several URLs with just a single download of robots.txt. It allows you to use your own rules on a given URL. You could utilise the argument like this: $myrules = file_get_contents("http://example.com/robots.txt"); isUrlBotSafe("http://example.com/", "Fancybot/0.1", $myrules); isUrlBotSafe("http://example.com/contact", "Fancybot/0.1", $myrules); isUrlBotSafe("http://example.com/blog", "Fancybot/0.1", $myrules); isUrlBotSafe("http://example.com/about", "Fancybot/0.1", $myrules); The fourth argument, $redirects, is only useful when you let the function download the robots.txt. It allows you to handle HTTP redirects in a slightly better way than having to miserably fail when grabbing the file. Values that are 1 or less will disable redirects, whereas FALSE will leave things the way they are. TRUE will set the value to 20. Here's how to use a similar approach with $robots_txt defined. $opts = array('http' => array('method' => 'GET', 'max_redirects' => 20) ); $context = stream_context_create($opts); $myrules = file_get_contents("http://example.com/robots.txt", FALSE, $context); The fifth argument, $debug, will make the function verbose and have it print all sorts of useful analysis on the rules. This piece of software is already in use in real life at http://cheesetalks.twolofbees.com/humble/ and was originally written for the site author just because it was the correct way of doing things.