
add readme / ronn files

1 parent 1deffba commit 02b4f284f5ff81f32696683e060dbc656c83ce76 @igrigorik committed Jan 21, 2012
Showing with 136 additions and 18 deletions.
  1. +25 −17 README.md
  2. +5 −0 Rakefile
  3. +1 −1 app.yaml
  4. +105 −0 static/index.html
@@ -1,26 +1,34 @@
-# Turk: robots.txt permission verifier
+# Can I Crawl (this URL)
-Simple (Golang) HTTP web-service to verify whether a supplied "Agent" is allowed to access the requested URL. Pass in the URL of the resource you want to fetch and the name of your agent and Turk will download, parse the robots.txt file and respond with a 200 if you can proceed, and 400 otherwise.
+Hosted robots.txt permissions verifier.
-```
-$> goinstall github.com/temoto/robotstxt.go
-$> goinstall github.com/kklis/gomemcache
-$>
-$> make && ./turk -host="localhost:9090"
-$>
-$> curl -v "http://127.0.0.1:9090/?agent=Googlebot&url=http://blogspot.com/comment.g"
- < HTTP/1.1 400 Bad Request
+## ENDPOINTS
-$> curl -v "http://127.0.0.1:9090/?agent=Googlebot&url=http://blogspot.com/"
- < HTTP/1.1 200 OK
-```
+- [`/`](http://canicrawl.appspot.com/) This page.
+- [`/check`](http://canicrawl.appspot.com/check) Runs the robots.txt verification check.
-Note: [blogger.com/robots.txt](http://blogger.com/robots.txt) blocks all agents from fetching the `comment.g` resource.
+## Description
-## Notes
+Verifies whether your User-Agent is allowed to crawl the provided URL. Pass in the destination URL and the service will download, parse, and check the [robots.txt](http://www.robotstxt.org/) file for permissions. If you're allowed to continue, it issues a **3XX** redirect; otherwise it returns a **4XX** code.
-Turk is an experiment with [Go](http://golang.org/). Go's http stack is "async", hence many parallel requests can be processed at the same time. Turk also has naive, unbounded in-memory cache to avoid refetching the same robots.txt data for a given host.
+## Examples
+
+### $ curl -v http://canicrawl.appspot.com/check?url=http://google.com/
+ < HTTP/1.0 302 Found
+ < Location: http://www.google.com/
+
+### $ curl -v http://canicrawl.appspot.com/check?url=http://google.com/search
+ < HTTP/1.0 400 Bad Request
+ < Content-Length: 23
+ {"status":"disallowed"}
+
+### $ curl -H'User-Agent: MyCustomAgent' -v http://canicrawl.appspot.com/check?url=http://google.com/
+ > User-Agent: MyCustomAgent
+ < HTTP/1.0 302 Found
+ < Location: http://www.google.com/
+
+Note: [google.com/robots.txt](http://google.com/robots.txt) disallows requests to _/search_.
### License
-(MIT License) - Copyright (c) 2011 Ilya Grigorik
+MIT License - Copyright (c) 2011 [Ilya Grigorik](http://www.igvita.com/)
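The Description above says the service downloads and parses robots.txt and checks permissions; the removed install instructions show the project delegating that to `github.com/temoto/robotstxt.go`. The core check can be sketched as a longest-prefix match over `Allow`/`Disallow` directives in the group matching the agent. This is a simplified sketch of the rule semantics, not the library's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// allowed reports whether path is permitted for agent, using a small
// subset of robots.txt semantics: within groups whose User-agent matches,
// the longest matching Allow/Disallow prefix wins. Sketch only; the real
// service uses github.com/temoto/robotstxt.go.
func allowed(robots, agent, path string) bool {
	var match bool  // inside a group that applies to agent?
	best := 0       // length of the longest matching rule so far
	verdict := true // default when no rule matches: allowed
	for _, line := range strings.Split(robots, "\n") {
		line = strings.TrimSpace(line)
		if i := strings.Index(line, "#"); i >= 0 { // strip comments
			line = strings.TrimSpace(line[:i])
		}
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue
		}
		field := strings.ToLower(strings.TrimSpace(parts[0]))
		value := strings.TrimSpace(parts[1])
		switch field {
		case "user-agent":
			match = value == "*" || strings.EqualFold(value, agent)
		case "allow", "disallow":
			if !match || value == "" { // empty Disallow means "allow all"
				continue
			}
			if strings.HasPrefix(path, value) && len(value) > best {
				best = len(value)
				verdict = field == "allow"
			}
		}
	}
	return verdict
}

func main() {
	robots := "User-agent: *\nDisallow: /search\nAllow: /search/about"
	fmt.Println(allowed(robots, "Googlebot", "/"))             // true
	fmt.Println(allowed(robots, "Googlebot", "/search"))       // false
	fmt.Println(allowed(robots, "Googlebot", "/search/about")) // true
}
```

The longest-match rule is what lets a specific `Allow` carve an exception out of a broader `Disallow`, as in the `/search/about` case above.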
@@ -0,0 +1,5 @@
+
+desc "Generate ronn file"
+task :docs do
+ exec('cat README.md | ronn -5 -f --style 80c --pipe > static/canicrawl.1.html')
+end
@@ -1,5 +1,5 @@
application: canicrawl
-version: 1
+version: 2
runtime: go
api_version: 3
@@ -0,0 +1,105 @@
+<!DOCTYPE html>
+<html>
+<head>
+ <meta http-equiv='content-type' content='text/html;charset=utf-8'>
+ <meta name='generator' content='Ronn/v0.7.3 (http://github.com/rtomayko/ronn/tree/0.7.3)'>
+ <title>canicrawl(1): Robots.txt Permissions Verifier</title>
+
+ <style type='text/css' media='all'>
+ /* style: man */
+ body#manpage {margin:0}
+ .mp {max-width:100ex;padding:0 9ex 1ex 4ex}
+ .mp p,.mp pre,.mp ul,.mp ol,.mp dl {margin:0 0 20px 0}
+ .mp h2 {margin:10px 0 0 0}
+ .mp > p,.mp > pre,.mp > ul,.mp > ol,.mp > dl {margin-left:8ex}
+ .mp h3 {margin:0 0 0 4ex}
+ .mp dt {margin:0;clear:left}
+ .mp dt.flush {float:left;width:8ex}
+ .mp dd {margin:0 0 0 9ex}
+ .mp h1,.mp h2,.mp h3,.mp h4 {clear:left}
+ .mp pre {margin-bottom:20px}
+ .mp pre+h2,.mp pre+h3 {margin-top:22px}
+ .mp h2+pre,.mp h3+pre {margin-top:5px}
+ .mp img {display:block;margin:auto}
+ .mp h1.man-title {display:none}
+ .mp,.mp code,.mp pre,.mp tt,.mp kbd,.mp samp,.mp h3,.mp h4 {font-family:monospace;font-size:14px;line-height:1.42857142857143}
+ .mp h2 {font-size:16px;line-height:1.25}
+ .mp h1 {font-size:20px;line-height:2}
+ .mp {text-align:justify;background:#fff}
+ .mp,.mp code,.mp pre,.mp pre code,.mp tt,.mp kbd,.mp samp {color:#131211}
+ .mp h1,.mp h2,.mp h3,.mp h4 {color:#030201}
+ .mp u {text-decoration:underline}
+ .mp code,.mp strong,.mp b {font-weight:bold;color:#131211}
+ .mp em,.mp var {font-style:italic;color:#232221;text-decoration:none}
+ .mp a,.mp a:link,.mp a:hover,.mp a code,.mp a pre,.mp a tt,.mp a kbd,.mp a samp {color:#0000ff}
+ .mp b.man-ref {font-weight:normal;color:#434241}
+ .mp pre {padding:0 4ex}
+ .mp pre code {font-weight:normal;color:#434241}
+ .mp h2+pre,h3+pre {padding-left:0}
+ ol.man-decor,ol.man-decor li {margin:3px 0 10px 0;padding:0;float:left;width:33%;list-style-type:none;text-transform:uppercase;color:#999;letter-spacing:1px}
+ ol.man-decor {width:100%}
+ ol.man-decor li.tl {text-align:left}
+ ol.man-decor li.tc {text-align:center;letter-spacing:4px}
+ ol.man-decor li.tr {text-align:right;float:right}
+ </style>
+ <style type='text/css' media='all'>
+ .mp {max-width:150ex}
+ ul {list-style: None; margin-left: 1em!important}
+ .man-navigation {left:151ex}
+ </style>
+</head>
+
+<body id='manpage'>
+<a href="http://github.com/igrigorik/canicrawl"><img style="position: absolute; top: 0; right: 0; border: 0;" src="https://d3nwyuy0nl342s.cloudfront.net/img/7afbc8b248c68eb468279e8c17986ad46549fb71/687474703a2f2f73332e616d617a6f6e6177732e636f6d2f6769746875622f726962626f6e732f666f726b6d655f72696768745f6461726b626c75655f3132313632312e706e67" alt="Fork me on GitHub"></a>
+
+<!-- DOCS -->
+<div class='mp'>
+<h1>Can I Crawl (this URL)</h1>
+<p>Hosted robots.txt permissions verifier.</p>
+
+<h2 id="ENDPOINTS">ENDPOINTS</h2>
+
+<ul>
+<li><a href="http://canicrawl.appspot.com/"><code>/</code></a> This page.</li>
+<li><a href="http://canicrawl.appspot.com/check"><code>/check</code></a> Runs the robots.txt verification check.</li>
+</ul>
+
+
+<h2 id="Description">Description</h2>
+
+<p>Verifies whether your User-Agent is allowed to crawl the provided URL. Pass in the destination URL and the service will download, parse, and check the <a href="http://www.robotstxt.org/">robots.txt</a> file for permissions. If you're allowed to continue, it issues a <strong>3XX</strong> redirect; otherwise it returns a <strong>4XX</strong> code.</p>
+
+<h2 id="Examples">Examples</h2>
+
+<h3 id="-curl-v-http-canicrawl-appspot-com-check-url-http-www-google-com-">$ curl -v http://canicrawl.appspot.com/check?url=http://google.com/</h3>
+
+<pre><code>&lt; HTTP/1.0 302 Found
+&lt; Location: http://www.google.com/
+</code></pre>
+
+<h3 id="-curl-v-http-canicrawl-appspot-com-check-url-http-www-google-com-search">$ curl -v http://canicrawl.appspot.com/check?url=http://google.com/search</h3>
+
+<pre><code>&lt; HTTP/1.0 400 Bad Request
+&lt; Content-Length: 23
+{"status":"disallowed"}
+</code></pre>
+
+<h3 id="-curl-H-User-Agent-MyCustomAgent-v-http-canicrawl-appspot-com-check-url-http-www-google-com-">$ curl -H'User-Agent: MyCustomAgent' -v http://canicrawl.appspot.com/check?url=http://google.com/</h3>
+
+<pre><code>&gt; User-Agent: MyCustomAgent
+&lt; HTTP/1.0 302 Found
+&lt; Location: http://www.google.com/
+</code></pre>
+
+<p>Note: <a href="http://google.com/robots.txt">google.com/robots.txt</a> disallows requests to <em>/search</em>.</p>
+
+<h2 id="License">License</h2>
+
+<p>MIT License - Copyright (c) 2011 <a href="http://www.igvita.com/">Ilya Grigorik</a></p>
+
+</div>
+
+<!-- END DOCS -->
+
+</body>
+</html>
