Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12
pablohoffman committed Aug 11, 2011
2 parents bc2d218 + 19e6da5 commit 5da6ffb
Showing 284 changed files with 3,954 additions and 57,332 deletions.
2 changes: 2 additions & 0 deletions AUTHORS
@@ -27,3 +27,5 @@ Here is the list of the primary authors & contributors:
* Shuaib Khan
* Didier Deshommes
* Vikas Dhiman
* Jochen Maes
* Darian Moody
2 changes: 1 addition & 1 deletion MANIFEST.in
@@ -3,7 +3,7 @@ include AUTHORS
include INSTALL
include LICENSE
include MANIFEST.in
include scrapy/core/downloader/responsetypes/mime.types
include scrapy/mime.types
include scrapyd/default_scrapyd.conf
recursive-include scrapy/templates *
recursive-include scrapy/tests/sample_data *
2 changes: 1 addition & 1 deletion README
@@ -1,4 +1,4 @@
This is Scrapy, an opensource screen scraping framework written in Python.

For more visit the project home page at http://scrapy.org
For more info visit the project home page at http://scrapy.org

14 changes: 14 additions & 0 deletions bin/scrapyd
@@ -0,0 +1,14 @@
#!/bin/sh

# Prefer the in-repo scrapyd.tac; fall back to the system-installed copy.
repotac=$(cd $(dirname $0)/../extras; pwd)/scrapyd.tac

if [ -f "$repotac" ]; then
    tacfile="$repotac"
elif [ -f "/usr/share/scrapyd/scrapyd.tac" ]; then
    tacfile="/usr/share/scrapyd/scrapyd.tac"
else
    echo "Unable to find scrapyd.tac file"
    exit 1
fi

twistd -ny "$tacfile"
5 changes: 3 additions & 2 deletions debian/control
@@ -2,13 +2,14 @@ Source: scrapy-SUFFIX
Section: python
Priority: optional
Maintainer: Insophia Team <info@insophia.com>
Build-Depends: debhelper (>= 7.0.50), python (>=2.5), python-twisted
Build-Depends: debhelper (>= 7.0.50), python (>=2.6), python-twisted, python-w3lib
Standards-Version: 3.8.4
Homepage: http://scrapy.org/

Package: scrapy-SUFFIX
Architecture: all
Depends: ${python:Depends}, python-libxml2, python-twisted, python-openssl
Depends: ${python:Depends}, python-libxml2, python-twisted, python-openssl, python-w3lib
Recommends: python-setuptools
Conflicts: python-scrapy, scrapy, scrapy-0.11
Provides: python-scrapy, scrapy
Description: Python web crawling and scraping framework
1 change: 0 additions & 1 deletion debian/scrapy.examples

This file was deleted.

2 changes: 1 addition & 1 deletion debian/scrapy.install
@@ -1,3 +1,3 @@
usr/lib/python*/*-packages/scrapy
usr/lib/python*/*-packages/scrapy*
usr/bin
extras/scrapy_bash_completion etc/bash_completion.d/
1 change: 0 additions & 1 deletion debian/scrapyd.install
@@ -1,3 +1,2 @@
usr/lib/python*/*-packages/scrapyd
debian/scrapyd-files/000-default etc/scrapyd/conf.d
extras/scrapyd.tac usr/share/scrapyd
10 changes: 10 additions & 0 deletions docs/_ext/scrapydocs.py
@@ -1,3 +1,6 @@
from docutils.parsers.rst.roles import set_classes
from docutils import nodes

def setup(app):
    app.add_crossref_type(
        directivename = "setting",
@@ -19,3 +22,10 @@ def setup(app):
        rolename = "reqmeta",
        indextemplate = "pair: %s; reqmeta",
    )
    app.add_role('source', source_role)

def source_role(name, rawtext, text, lineno, inliner, options={}, content=[]):
    # Link the given path into the Scrapy source browser.
    ref = 'http://dev.scrapy.org/browser/' + text
    set_classes(options)
    node = nodes.reference(rawtext, text, refuri=ref, **options)
    return [node], []
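
With this role registered, a docs page can write, for example,
``:source:`scrapy/mime.types``` to render a link to
http://dev.scrapy.org/browser/scrapy/mime.types (a usage sketch inferred from
the code above).
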
34 changes: 0 additions & 34 deletions docs/api-stability.rst

This file was deleted.

128 changes: 0 additions & 128 deletions docs/experimental/crawlspider-v2.rst

This file was deleted.

1 change: 0 additions & 1 deletion docs/experimental/index.rst
@@ -20,4 +20,3 @@ it's properly merged). Use at your own risk.
:maxdepth: 1

djangoitems
crawlspider-v2
74 changes: 53 additions & 21 deletions docs/faq.rst
@@ -3,7 +3,7 @@
Frequently Asked Questions
==========================

How does Scrapy compare to BeautifulSoul or lxml?
How does Scrapy compare to BeautifulSoup or lxml?
-------------------------------------------------

`BeautifulSoup`_ and `lxml`_ are libraries for parsing HTML and XML. Scrapy is
@@ -84,10 +84,11 @@ How can I simulate a user login in my spider?

See :ref:`topics-request-response-ref-request-userlogin`.

Can I crawl in breadth-first order instead of depth-first order?
----------------------------------------------------------------
Does Scrapy crawl in breadth-first or depth-first order?
--------------------------------------------------------

Yes, there's a setting for that: :setting:`SCHEDULER_ORDER`.
It crawls in breadth-first order by default, but you can change it to
depth-first order by setting :setting:`DEPTH_PRIORITY` to ``-1``.
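
For example, a minimal ``settings.py`` sketch switching a project to
depth-first order (based on the entry above; the value shown is the one
documented here)::

    # settings.py -- sketch; negative priority favours deeper requests,
    # i.e. depth-first crawling order.
    DEPTH_PRIORITY = -1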

My Scrapy crawler has memory leaks. What can I do?
--------------------------------------------------
@@ -115,24 +116,10 @@ Try changing the default `Accept-Language`_ request header by overriding the

.. _Accept-Language: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4

Where can I find some example code using Scrapy?
------------------------------------------------

Scrapy comes with a built-in, fully functional project to scrape the `Google
Directory`_. You can find it in the `examples/googledir`_ directory of the
Scrapy distribution.

Also, there's a site for sharing code snippets (spiders, middlewares,
extensions) called `Scrapy snippets`_.

Finally, you can find some example code for performing not-so-trivial tasks in
the `Scrapy Recipes`_ wiki page.
Where can I find some example Scrapy projects?
----------------------------------------------

.. _Google Directory: http://www.google.com/dirhp
.. _examples/googledir: http://dev.scrapy.org/browser/examples/googledir
.. _Community Spiders: http://dev.scrapy.org/wiki/CommunitySpiders
.. _Scrapy Recipes: http://dev.scrapy.org/wiki/ScrapyRecipes
.. _Scrapy snippets: http://snippets.scrapy.org/
See :ref:`intro-examples`.

Can I run a spider without creating a project?
----------------------------------------------
@@ -240,3 +227,48 @@ In order to avoid parsing the entire feed at once in memory, you can use
the functions ``xmliter`` and ``csviter`` from the ``scrapy.utils.iterators``
module. In fact, this is what the feed spiders (see :ref:`topics-spiders`) use
under the covers.
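
As an illustration, a minimal sketch of a spider using ``xmliter`` (the
spider, feed URL and ``product`` node name are hypothetical, and this assumes
``xmliter`` yields one selector per matching node, as described above)::

    from scrapy.spider import BaseSpider
    from scrapy.utils.iterators import xmliter

    class FeedSpider(BaseSpider):
        # Hypothetical spider, for illustration only.
        name = 'feed_example'
        start_urls = ['http://example.com/feed.xml']

        def parse(self, response):
            # One selector per <product> node; the feed is never parsed
            # into memory as a single document.
            for node in xmliter(response, 'product'):
                print node.select('name/text()').extract()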

Does Scrapy manage cookies automatically?
-----------------------------------------

Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them
back on subsequent requests, like any regular web browser does.

For more info see :ref:`topics-request-response` and :ref:`cookies-mw`.

How can I see the cookies being sent and received by Scrapy?
------------------------------------------------------------

Enable the :setting:`COOKIES_DEBUG` setting.
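
For instance, in the project's ``settings.py`` (a one-line sketch)::

    # settings.py -- log the cookies sent in requests and received in responses
    COOKIES_DEBUG = True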

How can I instruct a spider to stop itself?
-------------------------------------------

Raise the :exc:`~scrapy.exceptions.CloseSpider` exception from a callback. For
more info see :exc:`~scrapy.exceptions.CloseSpider`.
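
A minimal sketch of a callback doing so (the ban-detection condition and the
reason string are hypothetical)::

    from scrapy.exceptions import CloseSpider

    def parse(self, response):
        # Stop the whole crawl as soon as the site serves a block page.
        if 'Bandwidth exceeded' in response.body:
            raise CloseSpider('bandwidth_exceeded')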

How can I prevent my Scrapy bot from getting banned?
----------------------------------------------------

Some websites implement certain measures to prevent bots from crawling them,
with varying degrees of sophistication. Getting around those measures can be
difficult and tricky, and may sometimes require special infrastructure.

Here are some tips to keep in mind when dealing with these kinds of sites (a
``settings.py`` sketch follows the list):

* rotate your user agent from a pool of well-known ones from browsers (google
  around to get a list of them; see also `user agents`_)
* disable cookies (see :setting:`COOKIES_ENABLED`) as some sites may use
  cookies to spot bot behaviour
* use download delays (2 or higher); see the :setting:`DOWNLOAD_DELAY` setting.
* if possible, use `Google cache`_ to fetch pages instead of hitting the sites
  directly
* use a pool of rotating IPs. For example, the free `Tor project`_.
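
A ``settings.py`` sketch combining two of the tips above (the values are
illustrative, not recommendations)::

    # settings.py -- illustrative values only
    DOWNLOAD_DELAY = 2        # wait at least 2 seconds between requests
    COOKIES_ENABLED = False   # some sites use cookies to spot bot behaviour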

If you are still unable to prevent your bot from getting banned, consider
contacting `commercial support`_.

.. _user agents: http://en.wikipedia.org/wiki/User_agent
.. _Google cache: http://www.googleguide.com/cached_pages.html
.. _Tor project: https://www.torproject.org/
.. _commercial support: http://scrapy.org/support/