<!DOCTYPE html>
<html>
<head>
<meta http-equiv='content-type' content='text/html;charset=utf-8'>
<meta name='generator' content='Ronn/v0.7.3 (http://github.com/rtomayko/ronn/tree/0.7.3)'>
<title>canicrawl(1): Robots.txt Permissions Verifier</title>
<style type='text/css' media='all'>
/* style: man */
body#manpage {margin:0}
.mp {max-width:100ex;padding:0 9ex 1ex 4ex}
.mp p,.mp pre,.mp ul,.mp ol,.mp dl {margin:0 0 20px 0}
.mp h2 {margin:10px 0 0 0}
.mp > p,.mp > pre,.mp > ul,.mp > ol,.mp > dl {margin-left:8ex}
.mp h3 {margin:0 0 0 4ex}
.mp dt {margin:0;clear:left}
.mp dt.flush {float:left;width:8ex}
.mp dd {margin:0 0 0 9ex}
.mp h1,.mp h2,.mp h3,.mp h4 {clear:left}
.mp pre {margin-bottom:20px}
.mp pre+h2,.mp pre+h3 {margin-top:22px}
.mp h2+pre,.mp h3+pre {margin-top:5px}
.mp img {display:block;margin:auto}
.mp h1.man-title {display:none}
.mp,.mp code,.mp pre,.mp tt,.mp kbd,.mp samp,.mp h3,.mp h4 {font-family:monospace;font-size:14px;line-height:1.42857142857143}
.mp h2 {font-size:16px;line-height:1.25}
.mp h1 {font-size:20px;line-height:2}
.mp {text-align:justify;background:#fff}
.mp,.mp code,.mp pre,.mp pre code,.mp tt,.mp kbd,.mp samp {color:#131211}
.mp h1,.mp h2,.mp h3,.mp h4 {color:#030201}
.mp u {text-decoration:underline}
.mp code,.mp strong,.mp b {font-weight:bold;color:#131211}
.mp em,.mp var {font-style:italic;color:#232221;text-decoration:none}
.mp a,.mp a:link,.mp a:hover,.mp a code,.mp a pre,.mp a tt,.mp a kbd,.mp a samp {color:#0000ff}
.mp b.man-ref {font-weight:normal;color:#434241}
.mp pre {padding:0 4ex}
.mp pre code {font-weight:normal;color:#434241}
.mp h2+pre,h3+pre {padding-left:0}
ol.man-decor,ol.man-decor li {margin:3px 0 10px 0;padding:0;float:left;width:33%;list-style-type:none;text-transform:uppercase;color:#999;letter-spacing:1px}
ol.man-decor {width:100%}
ol.man-decor li.tl {text-align:left}
ol.man-decor li.tc {text-align:center;letter-spacing:4px}
ol.man-decor li.tr {text-align:right;float:right}
</style>
<style type='text/css' media='all'>
.mp {max-width:150ex}
ul {list-style: None; margin-left: 1em!important}
.man-navigation {left:151ex}
</style>
</head>
<body id='manpage'>
<a href="http://github.com/igrigorik/canicrawl"><img style="position: absolute; top: 0; right: 0; border: 0;" src="https://d3nwyuy0nl342s.cloudfront.net/img/7afbc8b248c68eb468279e8c17986ad46549fb71/687474703a2f2f73332e616d617a6f6e6177732e636f6d2f6769746875622f726962626f6e732f666f726b6d655f72696768745f6461726b626c75655f3132313632312e706e67" alt="Fork me on GitHub"></a>
<!-- DOCS -->
<div class='mp'>
<h1>Can I Crawl (this URL)</h1>
<p>Hosted robots.txt permissions verifier.</p>
<h2 id="ENDPOINTS">Endpoints</h2>
<ul>
<li><a href="http://canicrawl.appspot.com/"><code>/</code></a> This page.</li>
<li><a href="http://canicrawl.appspot.com/check"><code>/check</code></a> Runs the robots.txt verification check.</li>
</ul>
<h2 id="Description">Description</h2>
<p>Checks whether your User-Agent is allowed to crawl a given URL. Pass the destination URL in the <code>url</code> query parameter, and the service downloads and parses the site's <a href="http://www.robotstxt.org/">robots.txt</a> file and checks it for permissions. If crawling is allowed, the service responds with a <strong>3XX</strong> redirect to the destination URL; otherwise it returns a <strong>4XX</strong> code.</p>
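<p>As a rough local illustration of the parse-and-check step, the sketch below uses Python's standard <code>urllib.robotparser</code>. It is not the service's actual code: the <code>check</code> helper, the inlined <code>SAMPLE_ROBOTS_TXT</code> rules and the fixed 302/403 mapping are assumptions chosen to mirror the behaviour described here, and the download step is skipped so the sketch runs offline.</p>
<pre><code>
```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, inlined so the sketch needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def check(robots_txt, user_agent, url):
    """Return the status code a checker in this style might answer with."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines.
    parser.parse(robots_txt.splitlines())
    # Assumed mapping: 302 when crawling is permitted, 403 when it is not.
    return 302 if parser.can_fetch(user_agent, url) else 403


print(check(SAMPLE_ROBOTS_TXT, "MyCustomAgent", "http://www.google.com/"))
print(check(SAMPLE_ROBOTS_TXT, "MyCustomAgent", "http://www.google.com/search"))
```
</code></pre>
<p>With these sample rules the first call is permitted and the second is refused, because only paths starting with <em>/search</em> are disallowed.</p>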
<h2 id="Examples">Examples</h2>
<h3 id="-curl-v-http-canicrawl-appspot-com-check-url-http-www-google-com-">$ curl -v http://canicrawl.appspot.com/check?url=http://www.google.com/</h3>
<pre><code>&lt; HTTP/1.0 302 Found
&lt; Location: http://www.google.com/
</code></pre>
<h3 id="-curl-v-http-canicrawl-appspot-com-check-url-http-www-google-com-search">$ curl -v http://canicrawl.appspot.com/check?url=http://www.google.com/search</h3>
<pre><code>&lt; HTTP/1.0 403 Forbidden
&lt; Content-Length: 23
{"status":"disallowed"}
</code></pre>
<h3 id="-curl-H-User-Agent-MyCustomAgent-v-http-canicrawl-appspot-com-check-url-http-www-google-com-">$ curl -H 'User-Agent: MyCustomAgent' -v http://canicrawl.appspot.com/check?url=http://www.google.com/</h3>
<pre><code>&gt; User-Agent: MyCustomAgent
&lt; HTTP/1.0 302 Found
&lt; Location: http://www.google.com/
</code></pre>
<p>Note: <a href="http://google.com/robots.txt">google.com/robots.txt</a> disallows requests to <em>/search</em>, which is why the second example returns <strong>403</strong>.</p>
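<p>The same requests could be made from a script. The following is a minimal sketch using only the Python standard library. The helper names (<code>build_check_url</code>, <code>is_allowed</code>, <code>NoRedirect</code>, <code>can_i_crawl</code>) are hypothetical, and it assumes that the service is reachable and that it accepts a percent-encoded <code>url</code> value; neither is confirmed by this page.</p>
<pre><code>
```python
import urllib.error
import urllib.parse
import urllib.request

# The /check endpoint documented on this page.
SERVICE = "http://canicrawl.appspot.com/check"


def build_check_url(target):
    """Build the /check request URL, percent-encoding the target URL."""
    return SERVICE + "?" + urllib.parse.urlencode({"url": target})


def is_allowed(status):
    """Interpret a response code: 3XX means allowed, 4XX means disallowed."""
    if status // 100 == 3:
        return True
    if status // 100 == 4:
        return False
    raise ValueError("unexpected status: %d" % status)


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following the 3XX so its status code stays visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def can_i_crawl(target, user_agent="MyCustomAgent"):
    """Ask the service about target; needs network access and a live service."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(
        build_check_url(target), headers={"User-Agent": user_agent}
    )
    try:
        with opener.open(request) as response:
            return is_allowed(response.status)
    except urllib.error.HTTPError as err:
        # With redirects disabled, both 3XX and 4XX responses surface here.
        return is_allowed(err.code)
```
</code></pre>
<p>Disabling redirect handling is a deliberate choice in this sketch: if the client followed the redirect, it would fetch the destination page itself and the service's answer would be lost.</p>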
<h2 id="License">License</h2>
<p>MIT License - Copyright (c) 2011 <a href="http://www.igvita.com/">Ilya Grigorik</a></p>
</div>
<!-- END DOCS -->
</body>
</html>