Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12
pablohoffman committed Aug 11, 2011
2 parents bc2d218 + 19e6da5 commit 5da6ffb
Showing 284 changed files with 3,954 additions and 57,332 deletions.
2 changes: 2 additions & 0 deletions AUTHORS
@@ -27,3 +27,5 @@ Here is the list of the primary authors & contributors:
* Shuaib Khan
* Didier Deshommes
* Vikas Dhiman
* Jochen Maes
* Darian Moody
2 changes: 1 addition & 1 deletion MANIFEST.in
@@ -3,7 +3,7 @@ include AUTHORS
include INSTALL
include LICENSE
include MANIFEST.in
include scrapy/core/downloader/responsetypes/mime.types
include scrapy/mime.types
include scrapyd/default_scrapyd.conf
recursive-include scrapy/templates *
recursive-include scrapy/tests/sample_data *
2 changes: 1 addition & 1 deletion README
@@ -1,4 +1,4 @@
This is Scrapy, an opensource screen scraping framework written in Python.

For more visit the project home page at http://scrapy.org
For more info visit the project home page at http://scrapy.org

14 changes: 14 additions & 0 deletions bin/scrapyd
@@ -0,0 +1,14 @@
#!/bin/sh

# Prefer the in-repo scrapyd.tac; fall back to the system-installed copy.
repotac=$(cd $(dirname $0)/../extras; pwd)/scrapyd.tac

if [ -f "$repotac" ]; then
    tacfile="$repotac"
elif [ -f "/usr/share/scrapyd/scrapyd.tac" ]; then
    tacfile="/usr/share/scrapyd/scrapyd.tac"
else
    echo "Unable to find scrapyd.tac file"
    exit 1
fi

twistd -ny "$tacfile"
5 changes: 3 additions & 2 deletions debian/control
@@ -2,13 +2,14 @@ Source: scrapy-SUFFIX
Section: python
Priority: optional
Maintainer: Insophia Team <info@insophia.com>
Build-Depends: debhelper (>= 7.0.50), python (>=2.5), python-twisted
Build-Depends: debhelper (>= 7.0.50), python (>=2.6), python-twisted, python-w3lib
Standards-Version: 3.8.4
Homepage: http://scrapy.org/

Package: scrapy-SUFFIX
Architecture: all
Depends: ${python:Depends}, python-libxml2, python-twisted, python-openssl
Depends: ${python:Depends}, python-libxml2, python-twisted, python-openssl, python-w3lib
Recommends: python-setuptools
Conflicts: python-scrapy, scrapy, scrapy-0.11
Provides: python-scrapy, scrapy
Description: Python web crawling and scraping framework
1 change: 0 additions & 1 deletion debian/scrapy.examples

This file was deleted.

2 changes: 1 addition & 1 deletion debian/scrapy.install
@@ -1,3 +1,3 @@
usr/lib/python*/*-packages/scrapy
usr/lib/python*/*-packages/scrapy*
usr/bin
extras/scrapy_bash_completion etc/bash_completion.d/
1 change: 0 additions & 1 deletion debian/scrapyd.install
@@ -1,3 +1,2 @@
usr/lib/python*/*-packages/scrapyd
debian/scrapyd-files/000-default etc/scrapyd/conf.d
extras/scrapyd.tac usr/share/scrapyd
10 changes: 10 additions & 0 deletions docs/_ext/scrapydocs.py
@@ -1,3 +1,6 @@
from docutils.parsers.rst.roles import set_classes
from docutils import nodes

def setup(app):
    app.add_crossref_type(
        directivename = "setting",
@@ -19,3 +22,10 @@ def setup(app):
        rolename = "reqmeta",
        indextemplate = "pair: %s; reqmeta",
    )
    app.add_role('source', source_role)

def source_role(name, rawtext, text, lineno, inliner, options={}, content=[]):
    # Link the given path into the Scrapy source browser.
    ref = 'http://dev.scrapy.org/browser/' + text
    set_classes(options)
    node = nodes.reference(rawtext, text, refuri=ref, **options)
    return [node], []
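
With this role registered, a docs page can write, for example,
``:source:`scrapy/mime.types``` to render a link to
http://dev.scrapy.org/browser/scrapy/mime.types (a usage sketch inferred from
the code above).
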
34 changes: 0 additions & 34 deletions docs/api-stability.rst

This file was deleted.

128 changes: 0 additions & 128 deletions docs/experimental/crawlspider-v2.rst

This file was deleted.

1 change: 0 additions & 1 deletion docs/experimental/index.rst
@@ -20,4 +20,3 @@ it's properly merged). Use at your own risk.
:maxdepth: 1

djangoitems
crawlspider-v2
74 changes: 53 additions & 21 deletions docs/faq.rst
@@ -3,7 +3,7 @@
Frequently Asked Questions
==========================

How does Scrapy compare to BeautifulSoul or lxml?
How does Scrapy compare to BeautifulSoup or lxml?
-------------------------------------------------

`BeautifulSoup`_ and `lxml`_ are libraries for parsing HTML and XML. Scrapy is
@@ -84,10 +84,11 @@ How can I simulate a user login in my spider?

See :ref:`topics-request-response-ref-request-userlogin`.

Can I crawl in breadth-first order instead of depth-first order?
----------------------------------------------------------------
Does Scrapy crawl in breadth-first or depth-first order?
--------------------------------------------------------

Yes, there's a setting for that: :setting:`SCHEDULER_ORDER`.
It crawls in breadth-first order by default, but you can change it to
depth-first order by setting :setting:`DEPTH_PRIORITY` to ``-1``.
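
For example, a minimal ``settings.py`` sketch switching a project to
depth-first order (based on the entry above; the value shown is the one
documented here)::

    # settings.py -- sketch; negative priority favours deeper requests,
    # i.e. depth-first crawling order.
    DEPTH_PRIORITY = -1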

My Scrapy crawler has memory leaks. What can I do?
--------------------------------------------------
@@ -115,24 +116,10 @@ Try changing the default `Accept-Language`_ request header by overriding the

.. _Accept-Language: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4

Where can I find some example code using Scrapy?
------------------------------------------------

Scrapy comes with a built-in, fully functional project to scrape the `Google
Directory`_. You can find it in the `examples/googledir`_ directory of the
Scrapy distribution.

Also, there's a site for sharing code snippets (spiders, middlewares,
extensions) called `Scrapy snippets`_.

Finally, you can find some example code for performing not-so-trivial tasks in
the `Scrapy Recipes`_ wiki page.
Where can I find some example Scrapy projects?
----------------------------------------------

.. _Google Directory: http://www.google.com/dirhp
.. _examples/googledir: http://dev.scrapy.org/browser/examples/googledir
.. _Community Spiders: http://dev.scrapy.org/wiki/CommunitySpiders
.. _Scrapy Recipes: http://dev.scrapy.org/wiki/ScrapyRecipes
.. _Scrapy snippets: http://snippets.scrapy.org/
See :ref:`intro-examples`.

Can I run a spider without creating a project?
----------------------------------------------
@@ -240,3 +227,48 @@ In order to avoid parsing the entire feed at once in memory, you can use
the functions ``xmliter`` and ``csviter`` from the ``scrapy.utils.iterators``
module. In fact, this is what the feed spiders (see :ref:`topics-spiders`) use
under the covers.
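
As an illustration, a minimal sketch of a spider using ``xmliter`` (the
spider, feed URL and ``product`` node name are hypothetical, and this assumes
``xmliter`` yields one selector per matching node, as described above)::

    from scrapy.spider import BaseSpider
    from scrapy.utils.iterators import xmliter

    class FeedSpider(BaseSpider):
        # Hypothetical spider, for illustration only.
        name = 'feed_example'
        start_urls = ['http://example.com/feed.xml']

        def parse(self, response):
            # One selector per <product> node; the feed is never parsed
            # into memory as a single document.
            for node in xmliter(response, 'product'):
                print node.select('name/text()').extract()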

Does Scrapy manage cookies automatically?
-----------------------------------------

Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them
back on subsequent requests, like any regular web browser does.

For more info see :ref:`topics-request-response` and :ref:`cookies-mw`.

How can I see the cookies being sent and received by Scrapy?
------------------------------------------------------------

Enable the :setting:`COOKIES_DEBUG` setting.
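
For instance, in the project's ``settings.py`` (a one-line sketch)::

    # settings.py -- log the cookies sent in requests and received in responses
    COOKIES_DEBUG = True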

How can I instruct a spider to stop itself?
-------------------------------------------

Raise the :exc:`~scrapy.exceptions.CloseSpider` exception from a callback. For
more info see :exc:`~scrapy.exceptions.CloseSpider`.
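
A minimal sketch of a callback doing so (the ban-detection condition and the
reason string are hypothetical)::

    from scrapy.exceptions import CloseSpider

    def parse(self, response):
        # Stop the whole crawl as soon as the site serves a block page.
        if 'Bandwidth exceeded' in response.body:
            raise CloseSpider('bandwidth_exceeded')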

How can I prevent my Scrapy bot from getting banned?
----------------------------------------------------

Some websites implement certain measures to prevent bots from crawling them,
with varying degrees of sophistication. Getting around those measures can be
difficult and tricky, and may sometimes require special infrastructure.

Here are some tips to keep in mind when dealing with these kinds of sites (a
``settings.py`` sketch follows the list):

* rotate your user agent from a pool of well-known ones from browsers (google
  around to get a list of them; see also `user agents`_)
* disable cookies (see :setting:`COOKIES_ENABLED`) as some sites may use
  cookies to spot bot behaviour
* use download delays (2 or higher); see the :setting:`DOWNLOAD_DELAY` setting.
* if possible, use `Google cache`_ to fetch pages instead of hitting the sites
  directly
* use a pool of rotating IPs. For example, the free `Tor project`_.
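
A ``settings.py`` sketch combining two of the tips above (the values are
illustrative, not recommendations)::

    # settings.py -- illustrative values only
    DOWNLOAD_DELAY = 2        # wait at least 2 seconds between requests
    COOKIES_ENABLED = False   # some sites use cookies to spot bot behaviour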

If you are still unable to prevent your bot from getting banned, consider
contacting `commercial support`_.

.. _user agents: http://en.wikipedia.org/wiki/User_agent
.. _Google cache: http://www.googleguide.com/cached_pages.html
.. _Tor project: https://www.torproject.org/
.. _commercial support: http://scrapy.org/support/